dataset1001

From Wikitech
Jump to: navigation, search
dataset1001
Location: eqiad
Server group: Miscellaneous eqiad
Node name: dataset1001.wikimedia.org (fingerprint)
Usage:
Mirror rsync server
Status
Overall: Symbol support vote.png Online
Ganglia?: Ganglia status
Icinga?: host status services status
Hardware
Memory: 16GB
HDD: 24 x 2TB
Software
OS: Ubuntu 10.04

See Dumps/Dump servers for the overview of all dump server hardware.

Notes on Dataset1001 deployment:

During rsync testing with Kevin at your.org

Processes running:

  • cron job with bwlimit 10000 to get latest data written to 2012 dumps from dataset2 via rsync
  • background job with bwlimit 40000 to get 2011 dumps from dataset2 via rsync
  • rsync daemon with bwlimit 50000 which your.org connects to

Note that some files transfered are quite large, several GB.

Feb 7 2012

Saw stack traces in /var/log/messages using kernel 2.6.32-38-server (Ubuntu 10.04):

Feb  7 07:46:49 dataset1001 kernel: [384047.468170] swapper: page allocation failure. order:0, mode:0x4020
Feb  7 07:46:49 dataset1001 kernel: [384047.468176] Pid: 0, comm: swapper Not tainted 2.6.32-38-server #83-Ubuntu
Feb  7 07:46:49 dataset1001 kernel: [384047.468179] Call Trace:
Feb  7 07:46:49 dataset1001 kernel: [384047.468181]  <IRQ>  [<ffffffff810fca37>] __alloc_pages_slowpath+0x4a7/0x590
Feb  7 07:46:49 dataset1001 kernel: [384047.468195]  [<ffffffff810fcc99>] __alloc_pages_nodemask+0x179/0x180
Feb  7 07:46:49 dataset1001 kernel: [384047.468202]  [<ffffffff8112fe57>] alloc_pages_current+0x87/0xd0
Feb  7 07:46:49 dataset1001 kernel: [384047.468206]  [<ffffffff81135e97>] new_slab+0x2f7/0x310
Feb  7 07:46:49 dataset1001 kernel: [384047.468210]  [<ffffffff81138771>] __slab_alloc+0x201/0x2d0
Feb  7 07:46:49 dataset1001 kernel: [384047.468216]  [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468221]  [<ffffffff8113956f>] __kmalloc_node_track_caller+0xaf/0x160
Feb  7 07:46:49 dataset1001 kernel: [384047.468225]  [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468229]  [<ffffffff8146db90>] __alloc_skb+0x80/0x190
Feb  7 07:46:49 dataset1001 kernel: [384047.468233]  [<ffffffff8146e376>] __netdev_alloc_skb+0x36/0x60
Feb  7 07:46:49 dataset1001 kernel: [384047.468243]  [<ffffffffa00a6303>] bnx2_rx_skb+0x63/0x6e0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468249]  [<ffffffffa00a6e1c>] bnx2_rx_int+0x49c/0x760 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468255]  [<ffffffff810200c6>] ? show_shared_cpu_map_func+0x66/0x80
Feb  7 07:46:49 dataset1001 kernel: [384047.468262]  [<ffffffff81093de4>] ? clockevents_program_event+0x54/0xa0
Feb  7 07:46:49 dataset1001 kernel: [384047.468266]  [<ffffffff81095488>] ? tick_dev_program_event+0x68/0xd0
Feb  7 07:46:49 dataset1001 kernel: [384047.468272]  [<ffffffffa00a7150>] bnx2_poll_work+0x70/0xa0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468277]  [<ffffffffa00a71bd>] bnx2_poll_msix+0x3d/0xc0 [bnx2]
Feb  7 07:46:49 dataset1001 kernel: [384047.468284]  [<ffffffff81039889>] ? default_spin_lock_flags+0x9/0x10
Feb  7 07:46:49 dataset1001 kernel: [384047.468289]  [<ffffffff8147844f>] net_rx_action+0x10f/0x250
Feb  7 07:46:49 dataset1001 kernel: [384047.468294]  [<ffffffff8106f287>] __do_softirq+0xb7/0x1f0
Feb  7 07:46:49 dataset1001 kernel: [384047.468299]  [<ffffffff810c6610>] ? handle_IRQ_event+0x60/0x170
Feb  7 07:46:49 dataset1001 kernel: [384047.468303]  [<ffffffff810142ac>] call_softirq+0x1c/0x30
Feb  7 07:46:49 dataset1001 kernel: [384047.468306]  [<ffffffff81015c75>] do_softirq+0x65/0xa0
Feb  7 07:46:49 dataset1001 kernel: [384047.468309]  [<ffffffff8106f085>] irq_exit+0x85/0x90
Feb  7 07:46:49 dataset1001 kernel: [384047.468314]  [<ffffffff815654e5>] do_IRQ+0x75/0xf0
Feb  7 07:46:49 dataset1001 kernel: [384047.468318]  [<ffffffff81013ad3>] ret_from_intr+0x0/0x11
Feb  7 07:46:49 dataset1001 kernel: [384047.468320]  <EOI>  [<ffffffff81312c25>] ? acpi_idle_enter_bm+0x296/0x2ca
Feb  7 07:46:49 dataset1001 kernel: [384047.468329]  [<ffffffff81312c1e>] ? acpi_idle_enter_bm+0x28f/0x2ca
Feb  7 07:46:49 dataset1001 kernel: [384047.468335]  [<ffffffff81452297>] ? cpuidle_idle_call+0xa7/0x140
Feb  7 07:46:49 dataset1001 kernel: [384047.468341]  [<ffffffff81011e43>] ? cpu_idle+0xb3/0x110
Feb  7 07:46:49 dataset1001 kernel: [384047.468347]  [<ffffffff81558105>] ? start_secondary+0xa8/0xaa
Feb  7 07:46:49 dataset1001 kernel: [384047.468350] Mem-Info:
Feb  7 07:46:49 dataset1001 kernel: [384047.468352] Node 0 DMA per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468355] CPU    0: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468357] CPU    1: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468360] CPU    2: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468362] CPU    3: hi:    0, btch:   1 usd:   0
Feb  7 07:46:49 dataset1001 kernel: [384047.468364] Node 0 DMA32 per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468367] CPU    0: hi:  186, btch:  31 usd: 174
Feb  7 07:46:49 dataset1001 kernel: [384047.468369] CPU    1: hi:  186, btch:  31 usd:  24
Feb  7 07:46:49 dataset1001 kernel: [384047.468371] CPU    2: hi:  186, btch:  31 usd: 183
Feb  7 07:46:49 dataset1001 kernel: [384047.468374] CPU    3: hi:  186, btch:  31 usd: 183
Feb  7 07:46:49 dataset1001 kernel: [384047.468375] Node 0 Normal per-cpu:
Feb  7 07:46:49 dataset1001 kernel: [384047.468378] CPU    0: hi:  186, btch:  31 usd: 179
Feb  7 07:46:49 dataset1001 kernel: [384047.468380] CPU    1: hi:  186, btch:  31 usd:  54
Feb  7 07:46:49 dataset1001 kernel: [384047.468383] CPU    2: hi:  186, btch:  31 usd: 157
Feb  7 07:46:49 dataset1001 kernel: [384047.468385] CPU    3: hi:  186, btch:  31 usd:  59
Feb  7 07:46:49 dataset1001 kernel: [384047.468390] active_anon:26517 inactive_anon:5096 isolated_anon:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468392]  active_file:76812 inactive_file:3796080 isolated_file:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468393]  unevictable:850 dirty:388919 writeback:233464 unstable:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468394]  free:18338 slab_reclaimable:84627 slab_unreclaimable:21641
Feb  7 07:46:49 dataset1001 kernel: [384047.468395]  mapped:2677 shmem:74 pagetables:755 bounce:0
Feb  7 07:46:49 dataset1001 kernel: [384047.468398] Node 0 DMA free:15892kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB
 inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15340kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
claimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  7 07:46:49 dataset1001 kernel: [384047.468409] lowmem_reserve[]: 0 3243 16121 16121
Feb  7 07:46:49 dataset1001 kernel: [384047.468413] Node 0 DMA32 free:52708kB min:3264kB low:4080kB high:4896kB active_anon:11084kB inactive_anon:5500kB 
active_file:46444kB inactive_file:2763984kB unevictable:568kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:568kB dirty:299064kB writeb
ack:181156kB mapped:136kB shmem:0kB slab_reclaimable:160712kB slab_unreclaimable:20836kB kernel_stack:64kB pagetables:164kB unstable:0kB bounce:0kB write
back_tmp:0kB pages_scanned:0 all_unreclaimable? no
Feb  7 07:46:49 dataset1001 kernel: [384047.468426] lowmem_reserve[]: 0 0 12877 12877
Feb  7 07:46:49 dataset1001 kernel: [384047.468430] Node 0 Normal free:4752kB min:12972kB low:16212kB high:19456kB active_anon:94984kB inactive_anon:1488
4kB active_file:260804kB inactive_file:12420336kB unevictable:2832kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:2832kB dirty:125661
2kB writeback:752700kB mapped:10572kB shmem:296kB slab_reclaimable:177796kB slab_unreclaimable:65728kB kernel_stack:1184kB pagetables:2856kB unstable:0kB
 bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Feb  7 07:46:49 dataset1001 kernel: [384047.468443] lowmem_reserve[]: 0 0 0 0
Feb  7 07:46:49 dataset1001 kernel: [384047.468447] Node 0 DMA: 3*4kB 3*8kB 3*16kB 2*32kB 2*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 158
92kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468458] Node 0 DMA32: 11541*4kB 176*8kB 174*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*40
96kB = 52628kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468469] Node 0 Normal: 994*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4808kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468480] 3873689 total pagecache pages
Feb  7 07:46:49 dataset1001 kernel: [384047.468482] 0 pages in swap cache
Feb  7 07:46:49 dataset1001 kernel: [384047.468484] Swap cache stats: add 0, delete 0, find 0/0
Feb  7 07:46:49 dataset1001 kernel: [384047.468486] Free swap  = 976888kB
Feb  7 07:46:49 dataset1001 kernel: [384047.468487] Total swap = 976888kB
Feb  7 07:46:49 dataset1001 kernel: [384047.529621] 4194304 pages RAM
Feb  7 07:46:49 dataset1001 kernel: [384047.529624] 81220 pages reserved
Feb  7 07:46:49 dataset1001 kernel: [384047.529625] 2627094 pages shared
Feb  7 07:46:49 dataset1001 kernel: [384047.529627] 1474311 pages non-shared
Feb  7 07:46:49 dataset1001 kernel: [384047.529631] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Feb  7 07:46:49 dataset1001 kernel: [384047.529634]   cache: kmalloc-4096, object size: 4096, buffer size: 4096, default order: 3, min order: 0
Feb  7 07:46:49 dataset1001 kernel: [384047.529637]   node 0: slabs: 1541, objs: 4880, free: 0

I upgraded to 2.6.38-13-generic from the natty backports ppa.

Feb 8 2012

Saw similar stack traces in /var/log/messages from the network driver (bnx2):

Feb  8 08:03:26 dataset1001 kernel: [38432.288760] swapper: page allocation failure. order:0, mode:0x4020
Feb  8 08:03:26 dataset1001 kernel: [38432.288765] Pid: 0, comm: swapper Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu
Feb  8 08:03:26 dataset1001 kernel: [38432.288768] Call Trace:
Feb  8 08:03:26 dataset1001 kernel: [38432.288770]  <IRQ>  [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710
Feb  8 08:03:26 dataset1001 kernel: [38432.288784]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288788]  [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.288792]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288796]  [<ffffffff81152db5>] ? new_slab+0x295/0x2a0
Feb  8 08:03:26 dataset1001 kernel: [38432.288799]  [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320
Feb  8 08:03:26 dataset1001 kernel: [38432.288806]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288810]  [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0
Feb  8 08:03:26 dataset1001 kernel: [38432.288814]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288818]  [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.288822]  [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50
Feb  8 08:03:26 dataset1001 kernel: [38432.288831]  [<ffffffffa00dccf3>] ? bnx2_rx_skb+0x63/0x6f0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288837]  [<ffffffffa00dd7f8>] ? bnx2_rx_int+0x478/0x720 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288843]  [<ffffffffa00ddb10>] ? bnx2_poll_work+0x70/0xa0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288848]  [<ffffffffa00ddb7d>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.288853]  [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10
Feb  8 08:03:26 dataset1001 kernel: [38432.288859]  [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0
Feb  8 08:03:26 dataset1001 kernel: [38432.288864]  [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.288870]  [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.288875]  [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30
Feb  8 08:03:26 dataset1001 kernel: [38432.288878]  [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.288882]  [<ffffffff8106bf15>] ? irq_exit+0x85/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.288886]  [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.288892]  [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15
Feb  8 08:03:26 dataset1001 kernel: [38432.288894]  <EOI>  [<ffffffff81334861>] ? intel_idle+0xc1/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288902]  [<ffffffff81334844>] ? intel_idle+0xa4/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.288907]  [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0
Feb  8 08:03:26 dataset1001 kernel: [38432.288912]  [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110
Feb  8 08:03:26 dataset1001 kernel: [38432.288916]  [<ffffffff815ac442>] ? rest_init+0x72/0x80
Feb  8 08:03:26 dataset1001 kernel: [38432.288923]  [<ffffffff81accc9a>] ? start_kernel+0x374/0x37b
Feb  8 08:03:26 dataset1001 kernel: [38432.288927]  [<ffffffff81acc346>] ? x86_64_start_reservations+0x131/0x135
Feb  8 08:03:26 dataset1001 kernel: [38432.288931]  [<ffffffff81acc452>] ? x86_64_start_kernel+0x108/0x117
Feb  8 08:03:26 dataset1001 kernel: [38432.288933] Mem-Info:
Feb  8 08:03:26 dataset1001 kernel: [38432.288935] Node 0 DMA per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288938] CPU    0: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288941] CPU    1: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288943] CPU    2: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288945] CPU    3: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288947] Node 0 DMA32 per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288950] CPU    0: hi:  186, btch:  31 usd: 179
Feb  8 08:03:26 dataset1001 kernel: [38432.288952] CPU    1: hi:  186, btch:  31 usd:  15
Feb  8 08:03:26 dataset1001 kernel: [38432.288954] CPU    2: hi:  186, btch:  31 usd: 156
Feb  8 08:03:26 dataset1001 kernel: [38432.288956] CPU    3: hi:  186, btch:  31 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.288958] Node 0 Normal per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.288961] CPU    0: hi:  186, btch:  31 usd:  88
Feb  8 08:03:26 dataset1001 kernel: [38432.288963] CPU    1: hi:  186, btch:  31 usd:  21
Feb  8 08:03:26 dataset1001 kernel: [38432.288965] CPU    2: hi:  186, btch:  31 usd: 161
Feb  8 08:03:26 dataset1001 kernel: [38432.288968] CPU    3: hi:  186, btch:  31 usd:  28
Feb  8 08:03:26 dataset1001 kernel: [38432.288973] active_anon:31711 inactive_anon:6004 isolated_anon:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288974]  active_file:26426 inactive_file:3857074 isolated_file:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288976]  unevictable:1019 dirty:729121 writeback:11217 unstable:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288977]  free:18317 slab_reclaimable:79292 slab_unreclaimable:9887
Feb  8 08:03:26 dataset1001 kernel: [38432.288978]  mapped:2051 shmem:71 pagetables:867 bounce:0
Feb  8 08:03:26 dataset1001 kernel: [38432.288981] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec
laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  8 08:03:26 dataset1001 kernel: [38432.288992] lowmem_reserve[]: 0 3243 16121 16121
Feb  8 08:03:26 dataset1001 kernel: [38432.288996] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a
ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74
12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp:
0kB pages_scanned:0 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.289008] lowmem_reserve[]: 0 0 12877 12877
Feb  8 08:03:26 dataset1001 kernel: [38432.289012] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602
4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765
2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo
unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.289025] lowmem_reserve[]: 0 0 0 0
Feb  8 08:03:26 dataset1001 kernel: [38432.289028] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586
8kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289039] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40
96kB = 52840kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289050] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4560kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289060] 3884258 total pagecache pages
Feb  8 08:03:26 dataset1001 kernel: [38432.289062] 0 pages in swap cache
Feb  8 08:03:26 dataset1001 kernel: [38432.289064] Swap cache stats: add 0, delete 0, find 0/0
Feb  8 08:03:26 dataset1001 kernel: [38432.289066] Free swap  = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.289068] Total swap = 976892kB



Feb  8 08:03:26 dataset1001 kernel: [38432.326609] kworker/0:1: page allocation failure. order:0, mode:0x4020
Feb  8 08:03:26 dataset1001 kernel: [38432.326615] Pid: 0, comm: kworker/0:1 Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu
Feb  8 08:03:26 dataset1001 kernel: [38432.326618] Call Trace:
Feb  8 08:03:26 dataset1001 kernel: [38432.326620]  <IRQ>  [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710
Feb  8 08:03:26 dataset1001 kernel: [38432.326633]  [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.326638]  [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326642]  [<ffffffff81152db5>] ? new_slab+0x295/0x2a0
Feb  8 08:03:26 dataset1001 kernel: [38432.326645]  [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320
Feb  8 08:03:26 dataset1001 kernel: [38432.326650]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326656]  [<ffffffff81573fb0>] ? ip6_pol_route_input+0x0/0x20
Feb  8 08:03:26 dataset1001 kernel: [38432.326660]  [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0
Feb  8 08:03:26 dataset1001 kernel: [38432.326663]  [<ffffffff81521a1a>] ? tcp_rcv_established+0x26a/0x6e0
Feb  8 08:03:26 dataset1001 kernel: [38432.326667]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326672]  [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.326676]  [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130
Feb  8 08:03:26 dataset1001 kernel: [38432.326680]  [<ffffffff8151bafe>] ? __tcp_ack_snd_check+0x5e/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.326684]  [<ffffffff815219f9>] ? tcp_rcv_established+0x249/0x6e0
Feb  8 08:03:26 dataset1001 kernel: [38432.326688]  [<ffffffff81529dc5>] ? tcp_v4_do_rcv+0x125/0x1c0
Feb  8 08:03:26 dataset1001 kernel: [38432.326692]  [<ffffffff8152b5c9>] ? tcp_v4_rcv+0x5a9/0x840
Feb  8 08:03:26 dataset1001 kernel: [38432.326697]  [<ffffffff814fffe4>] ? nf_iterate+0x84/0xb0
Feb  8 08:03:26 dataset1001 kernel: [38432.326701]  [<ffffffff815080fd>] ? ip_local_deliver_finish+0xdd/0x290
Feb  8 08:03:26 dataset1001 kernel: [38432.326704]  [<ffffffff81508330>] ? ip_local_deliver+0x80/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326708]  [<ffffffff81507901>] ? ip_rcv_finish+0x121/0x3f0
Feb  8 08:03:26 dataset1001 kernel: [38432.326711]  [<ffffffff81507f4d>] ? ip_rcv+0x23d/0x310
Feb  8 08:03:26 dataset1001 kernel: [38432.326715]  [<ffffffff814d1a0a>] ? __netif_receive_skb+0x40a/0x690
Feb  8 08:03:26 dataset1001 kernel: [38432.326720]  [<ffffffff814d6ac0>] ? netif_receive_skb+0x80/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326724]  [<ffffffff814d6cd5>] ? napi_gro_complete+0x95/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.326728]  [<ffffffff814d715a>] ? napi_complete+0x3a/0x80
Feb  8 08:03:26 dataset1001 kernel: [38432.326736]  [<ffffffffa00ddbc5>] ? bnx2_poll_msix+0x85/0xc0 [bnx2]
Feb  8 08:03:26 dataset1001 kernel: [38432.326741]  [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0
Feb  8 08:03:26 dataset1001 kernel: [38432.326746]  [<ffffffff8108af73>] ? hrtimer_get_next_event+0xb3/0xf0
Feb  8 08:03:26 dataset1001 kernel: [38432.326751]  [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200
Feb  8 08:03:26 dataset1001 kernel: [38432.326757]  [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170
Feb  8 08:03:26 dataset1001 kernel: [38432.326762]  [<ffffffff810132e9>] ? sched_clock+0x9/0x10
Feb  8 08:03:26 dataset1001 kernel: [38432.326766]  [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30
Feb  8 08:03:26 dataset1001 kernel: [38432.326769]  [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0
Feb  8 08:03:26 dataset1001 kernel: [38432.326773]  [<ffffffff8106bf15>] ? irq_exit+0x85/0x90
Feb  8 08:03:26 dataset1001 kernel: [38432.326777]  [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0
Feb  8 08:03:26 dataset1001 kernel: [38432.326782]  [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15
Feb  8 08:03:26 dataset1001 kernel: [38432.326784]  <EOI>  [<ffffffff81334861>] ? intel_idle+0xc1/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326793]  [<ffffffff81334844>] ? intel_idle+0xa4/0x120
Feb  8 08:03:26 dataset1001 kernel: [38432.326799]  [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0
Feb  8 08:03:26 dataset1001 kernel: [38432.326805]  [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110
Feb  8 08:03:26 dataset1001 kernel: [38432.326809]  [<ffffffff815c085d>] ? start_secondary+0xc2/0xc4
Feb  8 08:03:26 dataset1001 kernel: [38432.326812] Mem-Info:
Feb  8 08:03:26 dataset1001 kernel: [38432.326814] Node 0 DMA per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326817] CPU    0: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326819] CPU    1: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326822] CPU    2: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326824] CPU    3: hi:    0, btch:   1 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326826] Node 0 DMA32 per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326829] CPU    0: hi:  186, btch:  31 usd: 179
Feb  8 08:03:26 dataset1001 kernel: [38432.326831] CPU    1: hi:  186, btch:  31 usd:  15
Feb  8 08:03:26 dataset1001 kernel: [38432.326834] CPU    2: hi:  186, btch:  31 usd: 156
Feb  8 08:03:26 dataset1001 kernel: [38432.326836] CPU    3: hi:  186, btch:  31 usd:   0
Feb  8 08:03:26 dataset1001 kernel: [38432.326838] Node 0 Normal per-cpu:
Feb  8 08:03:26 dataset1001 kernel: [38432.326841] CPU    0: hi:  186, btch:  31 usd:  88
Feb  8 08:03:26 dataset1001 kernel: [38432.326843] CPU    1: hi:  186, btch:  31 usd:  21
Feb  8 08:03:26 dataset1001 kernel: [38432.326845] CPU    2: hi:  186, btch:  31 usd: 161
Feb  8 08:03:26 dataset1001 kernel: [38432.326848] CPU    3: hi:  186, btch:  31 usd:  28
Feb  8 08:03:26 dataset1001 kernel: [38432.326853] active_anon:31711 inactive_anon:6004 isolated_anon:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326854]  active_file:26426 inactive_file:3857074 isolated_file:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326856]  unevictable:1019 dirty:729121 writeback:11217 unstable:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326857]  free:18317 slab_reclaimable:79292 slab_unreclaimable:9887
Feb  8 08:03:26 dataset1001 kernel: [38432.326858]  mapped:2051 shmem:71 pagetables:867 bounce:0
Feb  8 08:03:26 dataset1001 kernel: [38432.326861] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec
laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Feb  8 08:03:26 dataset1001 kernel: [38432.326872] lowmem_reserve[]: 0 3243 16121 16121
Feb  8 08:03:26 dataset1001 kernel: [38432.326876] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a
ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74
12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp:
0kB pages_scanned:0 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.326889] lowmem_reserve[]: 0 0 12877 12877
Feb  8 08:03:26 dataset1001 kernel: [38432.326893] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602
4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765
2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo
unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no
Feb  8 08:03:26 dataset1001 kernel: [38432.326905] lowmem_reserve[]: 0 0 0 0
Feb  8 08:03:26 dataset1001 kernel: [38432.326909] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586
8kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326920] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40
96kB = 52840kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326930] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 4560kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326941] 3884258 total pagecache pages
Feb  8 08:03:26 dataset1001 kernel: [38432.326943] 0 pages in swap cache
Feb  8 08:03:26 dataset1001 kernel: [38432.326945] Swap cache stats: add 0, delete 0, find 0/0
Feb  8 08:03:26 dataset1001 kernel: [38432.326947] Free swap  = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.326948] Total swap = 976892kB
Feb  8 08:03:26 dataset1001 kernel: [38432.350585] 4194288 pages RAM
Feb  8 08:03:26 dataset1001 kernel: [38432.350588] 81311 pages reserved
Feb  8 08:03:26 dataset1001 kernel: [38432.350589] 2544422 pages shared
Feb  8 08:03:26 dataset1001 kernel: [38432.350591] 1558178 pages non-shared
Feb  8 08:03:26 dataset1001 kernel: [38432.350595] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Feb  8 08:03:26 dataset1001 kernel: [38432.350598]   cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0
Feb  8 08:03:26 dataset1001 kernel: [38432.350602]   node 0: slabs: 2939, objs: 10974, free: 0

The first is a failure of the driver to atomically allocate memory for a socket buffer associated with the incoming packet. This is at a point where it's not possible to wait for a page to be swapped in. This would probably mean a lost packet. The second stack trace is in the code path for sending a tcp ACK, which means that the remote end would have to retransmit. We saw two groups of 5-6 stack traces, one at 08:03:26 and one at 11:23:17. AFAICT these two incidents didn't actually affect network traffic except for short hiccups. More recent versions of the driver use kmalloc instead of __netdev_alloc and handle it differently; if the issue turns out to be serious we could look into backporting those changes.

We did see more of those later on Feb 8 and 9th, a total of about 5 groups of 5 stack traces within 3 hours. Not sure of the effect on the remote end.

One possibility that has been reported on various mailing lists is to increase the setting for /proc/sys/vm/min_free_kbytes but this needs more checking.

Okay, after a bunch of reading and crunching the numbers, it looks lik we need to buy more time for kswapd to reclaim memory. Increased (live hack) min_free_kbytes from 16259 to 262144. This is still well below the max recommended 6% for a host with 16GB ram.

Feb 9 2012

Saw a single stack trace from xfs:

Feb  9 02:54:20 dataset1001 kernel: [34353.002050] INFO: task rsync:6830 blocked for more than 120 seconds.
Feb  9 02:54:20 dataset1001 kernel: [34353.008374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  9 02:54:20 dataset1001 kernel: [34353.016174] rsync           D ffff880407bb3170     0  6830   4054 0x00000000
Feb  9 02:54:20 dataset1001 kernel: [34353.016180]  ffff88003749ba28 0000000000000082 ffff88003749b9f8 ffff880000000000
Feb  9 02:54:20 dataset1001 kernel: [34353.016184]  0000000000013cc0 ffff880407bb2dc0 ffff880407bb3170 ffff88003749bfd8
Feb  9 02:54:20 dataset1001 kernel: [34353.016188]  ffff880407bb3178 0000000000013cc0 ffff88003749a010 0000000000013cc0
Feb  9 02:54:20 dataset1001 kernel: [34353.016193] Call Trace:
Feb  9 02:54:20 dataset1001 kernel: [34353.016203]  [<ffffffff815c558d>] schedule_timeout+0x21d/0x300
Feb  9 02:54:20 dataset1001 kernel: [34353.016209]  [<ffffffff812bbf84>] ? blk_unplug+0x34/0x70
Feb  9 02:54:20 dataset1001 kernel: [34353.016214]  [<ffffffff8148dfec>] ? dm_table_unplug_all+0x5c/0x110
Feb  9 02:54:20 dataset1001 kernel: [34353.016218]  [<ffffffff815c651e>] __down+0x7e/0xc0
Feb  9 02:54:20 dataset1001 kernel: [34353.016257]  [<ffffffffa0162d55>] ? _xfs_buf_find+0xf5/0x280 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016263]  [<ffffffff8108c991>] down+0x41/0x50
Feb  9 02:54:20 dataset1001 kernel: [34353.016291]  [<ffffffffa0161a73>] xfs_buf_lock+0x53/0x120 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016318]  [<ffffffffa0162d55>] _xfs_buf_find+0xf5/0x280 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016346]  [<ffffffffa0162f40>] xfs_buf_get+0x60/0x1c0 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016374]  [<ffffffffa01638dc>] xfs_buf_read+0x2c/0x110 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016402]  [<ffffffffa0158ae0>] xfs_trans_read_buf+0x190/0x420 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016426]  [<ffffffffa012ba09>] xfs_da_do_buf+0x289/0x790 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016431]  [<ffffffff8116dde3>] ? generic_permission+0x23/0xb0
Feb  9 02:54:20 dataset1001 kernel: [34353.016436]  [<ffffffff815c705e>] ? _raw_spin_lock+0xe/0x20
Feb  9 02:54:20 dataset1001 kernel: [34353.016443]  [<ffffffff81181e30>] ? mntput_no_expire+0x60/0x190
Feb  9 02:54:20 dataset1001 kernel: [34353.016448]  [<ffffffff8116efc4>] ? finish_open+0xe4/0x1c0
Feb  9 02:54:20 dataset1001 kernel: [34353.016471]  [<ffffffffa012bf8a>] xfs_da_read_buf+0x2a/0x30 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016496]  [<ffffffffa01303d8>] ? xfs_dir2_block_getdents+0x98/0x220 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016520]  [<ffffffffa01303d8>] xfs_dir2_block_getdents+0x98/0x220 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016542]  [<ffffffffa0118a11>] ? xfs_bmap_last_offset+0x111/0x140 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016547]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016571]  [<ffffffffa012ebf7>] xfs_readdir+0xa7/0x130 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016575]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016580]  [<ffffffff811753b0>] ? filldir+0x0/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016607]  [<ffffffffa0164ac9>] xfs_file_readdir+0x39/0x50 [xfs]
Feb  9 02:54:20 dataset1001 kernel: [34353.016611]  [<ffffffff81175548>] vfs_readdir+0xb8/0xe0
Feb  9 02:54:20 dataset1001 kernel: [34353.016616]  [<ffffffff811756d9>] sys_getdents+0x89/0xf0
Feb  9 02:54:20 dataset1001 kernel: [34353.016621]  [<ffffffff8100c082>] system_call_fastpath+0x16/0x1b

Investigating.

Later saw three more in short succession (not in time to look at the state of things as it was going on):

Feb  9 13:42:20 dataset1001 kernel: [73132.884063] INFO: task rsync:6830 blocked for more than 120 seconds.
Feb  9 13:44:20 dataset1001 kernel: [73252.585031] INFO: task rsync:6830 blocked for more than 120 seconds.
Feb  9 13:46:20 dataset1001 kernel: [73372.285989] INFO: task rsync:6830 blocked for more than 120 seconds.

iostats show 100% util for sdb and dm-0 periodically along with await much greater than svctm. megaraid_cli shows all disks on line with no errors.

Sample iostat from during such an episode:

root@dataset1001:~/atg/iftop# iostat -d -x  2 6
Linux 2.6.38-13-generic (dataset1001) 	02/09/2012 	_x86_64_	(4 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.19  216.59  151.90 55162.19 68725.61   336.21    39.25  106.51   0.97  35.82
sda               0.14    12.37  484.19  167.36 123591.50 75202.41   305.11     9.22   14.15   0.36  23.24
dm-0              0.00     0.00  700.05  317.22 178720.80 143815.45   317.06    48.80   47.97   0.54  55.15

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  278.50     0.00 126860.00   455.51   144.56  519.32   3.59 100.00
sda               0.00     0.00  181.00    0.00 46336.00     0.00   256.00     0.07    0.47   0.19   3.50
dm-0              0.00     0.00  179.00  276.00 45824.00 125660.00   376.89   144.85  316.81   2.19  99.50

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  276.00     0.00 125880.00   456.09   145.02  520.00   3.62 100.00
sda               0.00     0.00  193.00    0.00 49408.00     0.00   256.00     0.01    0.05   0.05   1.00
dm-0              0.00     0.00  193.00  276.00 49408.00 125660.00   373.28   146.01  308.12   2.13 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  274.50     0.00 124940.00   455.15   144.13  525.15   3.64 100.00
sda               0.00     0.00  187.00    1.00 47872.00     8.00   254.68     0.03    0.16   0.11   2.00
dm-0              0.00     0.00  187.00  276.50 47872.00 126120.00   375.39   145.13  313.20   2.16 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  278.00     0.00 126620.00   455.47   144.91  519.41   3.60 100.00
sda               0.00     0.00  199.00    0.00 50944.00     0.00   256.00     0.02    0.10   0.08   1.50
dm-0              0.00     0.00  199.00  274.50 50944.00 124940.00   371.46   145.91  307.04   2.11 100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  276.50     0.00 126120.00   456.13   144.34  519.24   3.62 100.00
sda               0.00     0.00  192.00    0.00 49152.00     0.00   256.00     0.08    0.42   0.29   5.50
dm-0              0.00     0.00  192.00  278.50 49152.00 126860.00   374.10   145.41  307.45   2.13 100.00

Feb 15 2012

Some testing was done without ionice and nice values for the rsync daemon, to see if that could be related to the rsync issues, on the theory that reads for the remove rsyncer were simply queued after writes from local rsyncs which were running at a much higher priority.

No xfs issues since restarting the daemon with those values, but then we started seeing rsync timeouts. On my end,

Feb 10 17:37:50 dataset1001 rsyncd[4200]: rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
Feb 10 17:37:50 dataset1001 rsyncd[4200]: rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [sender=3.0.7]

and on their end,

rsync: read error: Operation timed out (60)
rsync error: error in rsync protocol data stream (code 12) at io.c(764) [receiver=3.0.9]

After a bunch of inconclusive tests and beating our heads against the wall, we gave up on it for the weekend. On Sunday the remote side started up another rsync and it's been running since then with no timeouts or other issues. I have run manually the cronjob updater that pulls from dataset2 2 3 or 4 times a day with no ill effects. So who knows.

March 1

It looks like the min free setting change took care of the bnx2 errors, the ionice and nice setting removals took care of the xfs complaints, and we aren't seeing any other rsync timeouts unless someone actually interrupts the connection.

Next up: speed.

To one of our external mirrors we get 30 to 40 MB/sec (we cap so this is a good value) via rsync, over one route, but with a different peer we get very slow speeds. A second mirror site is coming to us via the one peer from the slow route, and we also see slow speeds with them.

After working with a network engineer at the peer, some things (we don't know what) were changed at their end, affecting traffic to/from their switch that's connected to us. We now see iperf getting transfer speeds of 475 Mbits/sec between the second mirror and dataset1001 which is great, comparable to earlier iperf speeds from dataset1001 to the peer switch (572 MBits/sec). But rsync speed still sucks majorly: 1.5 to 3.6 MB/sec.

Other issues

Initial issues included a firmware bug on the receiving host's array causing a slowdown in writes; the array was swapped out for different hardware, fixing the issue.

The same rsync via a different route (not he.net) was much faster. Somone at he.net made an unspecified configuration change which helped for a short time and then things went downhill again. Using a different route we are still doing well. Speeds bounce around some due to the way the rsync bwlimit works (burst, then throttle).

Reading material