Obsolete:dataset1001
dataset1001

Location: eqiad
Cluster: Miscellaneous eqiad
Node name: dataset1001.wikimedia.org (fingerprint)
Usage: Mirror rsync server

Status
Overall: This device has been decommissioned.
Icinga: host status | services status

Hardware
Memory: 16GB
HDD: 24 x 2TB

Software
OS: Ubuntu 10.04
This host used to provide web and rsync service of the XML/SQL dumps and other datasets to the world.
See Dumps/Dump servers for an overview of all dump server hardware.
Notes on Dataset1001 deployment:
During rsync testing with Kevin at your.org, the following processes were running:
- a cron job with bwlimit 10000, pulling the latest data written to the 2012 dumps from dataset2 via rsync
- a background job with bwlimit 40000, pulling the 2011 dumps from dataset2 via rsync
- the rsync daemon with bwlimit 50000, which your.org connects to
Note that some of the files transferred are quite large, several GB each; a sketch of the bwlimit usage follows below.
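For reference, these caps correspond to rsync's --bwlimit option, which takes a rate in KB/s (so 40000 is roughly 40 MB/sec). A hypothetical sketch of one of the pulls from dataset2; the module name and paths are illustrative, not the actual configuration:

 # pull the 2011 dumps from dataset2, capped at ~40 MB/sec (paths/module hypothetical)
 rsync -av --bwlimit=40000 dataset2.wikimedia.org::dumps/2011/ /data/xmldatadumps/public/2011/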
Feb 7 2012
Saw stack traces in /var/log/messages using kernel 2.6.32-38-server (Ubuntu 10.04):
Feb 7 07:46:49 dataset1001 kernel: [384047.468170] swapper: page allocation failure. order:0, mode:0x4020 Feb 7 07:46:49 dataset1001 kernel: [384047.468176] Pid: 0, comm: swapper Not tainted 2.6.32-38-server #83-Ubuntu Feb 7 07:46:49 dataset1001 kernel: [384047.468179] Call Trace: Feb 7 07:46:49 dataset1001 kernel: [384047.468181] <IRQ> [<ffffffff810fca37>] __alloc_pages_slowpath+0x4a7/0x590 Feb 7 07:46:49 dataset1001 kernel: [384047.468195] [<ffffffff810fcc99>] __alloc_pages_nodemask+0x179/0x180 Feb 7 07:46:49 dataset1001 kernel: [384047.468202] [<ffffffff8112fe57>] alloc_pages_current+0x87/0xd0 Feb 7 07:46:49 dataset1001 kernel: [384047.468206] [<ffffffff81135e97>] new_slab+0x2f7/0x310 Feb 7 07:46:49 dataset1001 kernel: [384047.468210] [<ffffffff81138771>] __slab_alloc+0x201/0x2d0 Feb 7 07:46:49 dataset1001 kernel: [384047.468216] [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468221] [<ffffffff8113956f>] __kmalloc_node_track_caller+0xaf/0x160 Feb 7 07:46:49 dataset1001 kernel: [384047.468225] [<ffffffff8146e376>] ? __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468229] [<ffffffff8146db90>] __alloc_skb+0x80/0x190 Feb 7 07:46:49 dataset1001 kernel: [384047.468233] [<ffffffff8146e376>] __netdev_alloc_skb+0x36/0x60 Feb 7 07:46:49 dataset1001 kernel: [384047.468243] [<ffffffffa00a6303>] bnx2_rx_skb+0x63/0x6e0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468249] [<ffffffffa00a6e1c>] bnx2_rx_int+0x49c/0x760 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468255] [<ffffffff810200c6>] ? show_shared_cpu_map_func+0x66/0x80 Feb 7 07:46:49 dataset1001 kernel: [384047.468262] [<ffffffff81093de4>] ? clockevents_program_event+0x54/0xa0 Feb 7 07:46:49 dataset1001 kernel: [384047.468266] [<ffffffff81095488>] ? tick_dev_program_event+0x68/0xd0 Feb 7 07:46:49 dataset1001 kernel: [384047.468272] [<ffffffffa00a7150>] bnx2_poll_work+0x70/0xa0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468277] [<ffffffffa00a71bd>] bnx2_poll_msix+0x3d/0xc0 [bnx2] Feb 7 07:46:49 dataset1001 kernel: [384047.468284] [<ffffffff81039889>] ? default_spin_lock_flags+0x9/0x10 Feb 7 07:46:49 dataset1001 kernel: [384047.468289] [<ffffffff8147844f>] net_rx_action+0x10f/0x250 Feb 7 07:46:49 dataset1001 kernel: [384047.468294] [<ffffffff8106f287>] __do_softirq+0xb7/0x1f0 Feb 7 07:46:49 dataset1001 kernel: [384047.468299] [<ffffffff810c6610>] ? handle_IRQ_event+0x60/0x170 Feb 7 07:46:49 dataset1001 kernel: [384047.468303] [<ffffffff810142ac>] call_softirq+0x1c/0x30 Feb 7 07:46:49 dataset1001 kernel: [384047.468306] [<ffffffff81015c75>] do_softirq+0x65/0xa0 Feb 7 07:46:49 dataset1001 kernel: [384047.468309] [<ffffffff8106f085>] irq_exit+0x85/0x90 Feb 7 07:46:49 dataset1001 kernel: [384047.468314] [<ffffffff815654e5>] do_IRQ+0x75/0xf0 Feb 7 07:46:49 dataset1001 kernel: [384047.468318] [<ffffffff81013ad3>] ret_from_intr+0x0/0x11 Feb 7 07:46:49 dataset1001 kernel: [384047.468320] <EOI> [<ffffffff81312c25>] ? acpi_idle_enter_bm+0x296/0x2ca Feb 7 07:46:49 dataset1001 kernel: [384047.468329] [<ffffffff81312c1e>] ? acpi_idle_enter_bm+0x28f/0x2ca Feb 7 07:46:49 dataset1001 kernel: [384047.468335] [<ffffffff81452297>] ? cpuidle_idle_call+0xa7/0x140 Feb 7 07:46:49 dataset1001 kernel: [384047.468341] [<ffffffff81011e43>] ? cpu_idle+0xb3/0x110 Feb 7 07:46:49 dataset1001 kernel: [384047.468347] [<ffffffff81558105>] ? 
start_secondary+0xa8/0xaa Feb 7 07:46:49 dataset1001 kernel: [384047.468350] Mem-Info: Feb 7 07:46:49 dataset1001 kernel: [384047.468352] Node 0 DMA per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468355] CPU 0: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468357] CPU 1: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468360] CPU 2: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468362] CPU 3: hi: 0, btch: 1 usd: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468364] Node 0 DMA32 per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468367] CPU 0: hi: 186, btch: 31 usd: 174 Feb 7 07:46:49 dataset1001 kernel: [384047.468369] CPU 1: hi: 186, btch: 31 usd: 24 Feb 7 07:46:49 dataset1001 kernel: [384047.468371] CPU 2: hi: 186, btch: 31 usd: 183 Feb 7 07:46:49 dataset1001 kernel: [384047.468374] CPU 3: hi: 186, btch: 31 usd: 183 Feb 7 07:46:49 dataset1001 kernel: [384047.468375] Node 0 Normal per-cpu: Feb 7 07:46:49 dataset1001 kernel: [384047.468378] CPU 0: hi: 186, btch: 31 usd: 179 Feb 7 07:46:49 dataset1001 kernel: [384047.468380] CPU 1: hi: 186, btch: 31 usd: 54 Feb 7 07:46:49 dataset1001 kernel: [384047.468383] CPU 2: hi: 186, btch: 31 usd: 157 Feb 7 07:46:49 dataset1001 kernel: [384047.468385] CPU 3: hi: 186, btch: 31 usd: 59 Feb 7 07:46:49 dataset1001 kernel: [384047.468390] active_anon:26517 inactive_anon:5096 isolated_anon:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468392] active_file:76812 inactive_file:3796080 isolated_file:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468393] unevictable:850 dirty:388919 writeback:233464 unstable:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468394] free:18338 slab_reclaimable:84627 slab_unreclaimable:21641 Feb 7 07:46:49 dataset1001 kernel: [384047.468395] mapped:2677 shmem:74 pagetables:755 bounce:0 Feb 7 07:46:49 dataset1001 kernel: [384047.468398] Node 0 DMA free:15892kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15340kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re claimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 7 07:46:49 dataset1001 kernel: [384047.468409] lowmem_reserve[]: 0 3243 16121 16121 Feb 7 07:46:49 dataset1001 kernel: [384047.468413] Node 0 DMA32 free:52708kB min:3264kB low:4080kB high:4896kB active_anon:11084kB inactive_anon:5500kB active_file:46444kB inactive_file:2763984kB unevictable:568kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:568kB dirty:299064kB writeb ack:181156kB mapped:136kB shmem:0kB slab_reclaimable:160712kB slab_unreclaimable:20836kB kernel_stack:64kB pagetables:164kB unstable:0kB bounce:0kB write back_tmp:0kB pages_scanned:0 all_unreclaimable? no Feb 7 07:46:49 dataset1001 kernel: [384047.468426] lowmem_reserve[]: 0 0 12877 12877 Feb 7 07:46:49 dataset1001 kernel: [384047.468430] Node 0 Normal free:4752kB min:12972kB low:16212kB high:19456kB active_anon:94984kB inactive_anon:1488 4kB active_file:260804kB inactive_file:12420336kB unevictable:2832kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:2832kB dirty:125661 2kB writeback:752700kB mapped:10572kB shmem:296kB slab_reclaimable:177796kB slab_unreclaimable:65728kB kernel_stack:1184kB pagetables:2856kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no Feb 7 07:46:49 dataset1001 kernel: [384047.468443] lowmem_reserve[]: 0 0 0 0 Feb 7 07:46:49 dataset1001 kernel: [384047.468447] Node 0 DMA: 3*4kB 3*8kB 3*16kB 2*32kB 2*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 158 92kB Feb 7 07:46:49 dataset1001 kernel: [384047.468458] Node 0 DMA32: 11541*4kB 176*8kB 174*16kB 1*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*40 96kB = 52628kB Feb 7 07:46:49 dataset1001 kernel: [384047.468469] Node 0 Normal: 994*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 4808kB Feb 7 07:46:49 dataset1001 kernel: [384047.468480] 3873689 total pagecache pages Feb 7 07:46:49 dataset1001 kernel: [384047.468482] 0 pages in swap cache Feb 7 07:46:49 dataset1001 kernel: [384047.468484] Swap cache stats: add 0, delete 0, find 0/0 Feb 7 07:46:49 dataset1001 kernel: [384047.468486] Free swap = 976888kB Feb 7 07:46:49 dataset1001 kernel: [384047.468487] Total swap = 976888kB Feb 7 07:46:49 dataset1001 kernel: [384047.529621] 4194304 pages RAM Feb 7 07:46:49 dataset1001 kernel: [384047.529624] 81220 pages reserved Feb 7 07:46:49 dataset1001 kernel: [384047.529625] 2627094 pages shared Feb 7 07:46:49 dataset1001 kernel: [384047.529627] 1474311 pages non-shared Feb 7 07:46:49 dataset1001 kernel: [384047.529631] SLUB: Unable to allocate memory on node -1 (gfp=0x20) Feb 7 07:46:49 dataset1001 kernel: [384047.529634] cache: kmalloc-4096, object size: 4096, buffer size: 4096, default order: 3, min order: 0 Feb 7 07:46:49 dataset1001 kernel: [384047.529637] node 0: slabs: 1541, objs: 4880, free: 0
I upgraded to 2.6.38-13-generic from the natty backports ppa.
Feb 8 2012
Saw similar stack traces in /var/log/messages from the network driver (bnx2):
Feb 8 08:03:26 dataset1001 kernel: [38432.288760] swapper: page allocation failure. order:0, mode:0x4020 Feb 8 08:03:26 dataset1001 kernel: [38432.288765] Pid: 0, comm: swapper Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu Feb 8 08:03:26 dataset1001 kernel: [38432.288768] Call Trace: Feb 8 08:03:26 dataset1001 kernel: [38432.288770] <IRQ> [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710 Feb 8 08:03:26 dataset1001 kernel: [38432.288784] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288788] [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.288792] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288796] [<ffffffff81152db5>] ? new_slab+0x295/0x2a0 Feb 8 08:03:26 dataset1001 kernel: [38432.288799] [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320 Feb 8 08:03:26 dataset1001 kernel: [38432.288806] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288810] [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0 Feb 8 08:03:26 dataset1001 kernel: [38432.288814] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288818] [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.288822] [<ffffffff814c8be4>] ? __netdev_alloc_skb+0x24/0x50 Feb 8 08:03:26 dataset1001 kernel: [38432.288831] [<ffffffffa00dccf3>] ? bnx2_rx_skb+0x63/0x6f0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288837] [<ffffffffa00dd7f8>] ? bnx2_rx_int+0x478/0x720 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288843] [<ffffffffa00ddb10>] ? bnx2_poll_work+0x70/0xa0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288848] [<ffffffffa00ddb7d>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.288853] [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10 Feb 8 08:03:26 dataset1001 kernel: [38432.288859] [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0 Feb 8 08:03:26 dataset1001 kernel: [38432.288864] [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.288870] [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.288875] [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30 Feb 8 08:03:26 dataset1001 kernel: [38432.288878] [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.288882] [<ffffffff8106bf15>] ? irq_exit+0x85/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.288886] [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.288892] [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15 Feb 8 08:03:26 dataset1001 kernel: [38432.288894] <EOI> [<ffffffff81334861>] ? intel_idle+0xc1/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288902] [<ffffffff81334844>] ? intel_idle+0xa4/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.288907] [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0 Feb 8 08:03:26 dataset1001 kernel: [38432.288912] [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110 Feb 8 08:03:26 dataset1001 kernel: [38432.288916] [<ffffffff815ac442>] ? rest_init+0x72/0x80 Feb 8 08:03:26 dataset1001 kernel: [38432.288923] [<ffffffff81accc9a>] ? start_kernel+0x374/0x37b Feb 8 08:03:26 dataset1001 kernel: [38432.288927] [<ffffffff81acc346>] ? x86_64_start_reservations+0x131/0x135 Feb 8 08:03:26 dataset1001 kernel: [38432.288931] [<ffffffff81acc452>] ? 
x86_64_start_kernel+0x108/0x117 Feb 8 08:03:26 dataset1001 kernel: [38432.288933] Mem-Info: Feb 8 08:03:26 dataset1001 kernel: [38432.288935] Node 0 DMA per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288938] CPU 0: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288941] CPU 1: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288943] CPU 2: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288945] CPU 3: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288947] Node 0 DMA32 per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288950] CPU 0: hi: 186, btch: 31 usd: 179 Feb 8 08:03:26 dataset1001 kernel: [38432.288952] CPU 1: hi: 186, btch: 31 usd: 15 Feb 8 08:03:26 dataset1001 kernel: [38432.288954] CPU 2: hi: 186, btch: 31 usd: 156 Feb 8 08:03:26 dataset1001 kernel: [38432.288956] CPU 3: hi: 186, btch: 31 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.288958] Node 0 Normal per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.288961] CPU 0: hi: 186, btch: 31 usd: 88 Feb 8 08:03:26 dataset1001 kernel: [38432.288963] CPU 1: hi: 186, btch: 31 usd: 21 Feb 8 08:03:26 dataset1001 kernel: [38432.288965] CPU 2: hi: 186, btch: 31 usd: 161 Feb 8 08:03:26 dataset1001 kernel: [38432.288968] CPU 3: hi: 186, btch: 31 usd: 28 Feb 8 08:03:26 dataset1001 kernel: [38432.288973] active_anon:31711 inactive_anon:6004 isolated_anon:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288974] active_file:26426 inactive_file:3857074 isolated_file:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288976] unevictable:1019 dirty:729121 writeback:11217 unstable:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288977] free:18317 slab_reclaimable:79292 slab_unreclaimable:9887 Feb 8 08:03:26 dataset1001 kernel: [38432.288978] mapped:2051 shmem:71 pagetables:867 bounce:0 Feb 8 08:03:26 dataset1001 kernel: [38432.288981] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 8 08:03:26 dataset1001 kernel: [38432.288992] lowmem_reserve[]: 0 3243 16121 16121 Feb 8 08:03:26 dataset1001 kernel: [38432.288996] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74 12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp: 0kB pages_scanned:0 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.289008] lowmem_reserve[]: 0 0 12877 12877 Feb 8 08:03:26 dataset1001 kernel: [38432.289012] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602 4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765 2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? 
no Feb 8 08:03:26 dataset1001 kernel: [38432.289025] lowmem_reserve[]: 0 0 0 0 Feb 8 08:03:26 dataset1001 kernel: [38432.289028] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586 8kB Feb 8 08:03:26 dataset1001 kernel: [38432.289039] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40 96kB = 52840kB Feb 8 08:03:26 dataset1001 kernel: [38432.289050] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4560kB Feb 8 08:03:26 dataset1001 kernel: [38432.289060] 3884258 total pagecache pages Feb 8 08:03:26 dataset1001 kernel: [38432.289062] 0 pages in swap cache Feb 8 08:03:26 dataset1001 kernel: [38432.289064] Swap cache stats: add 0, delete 0, find 0/0 Feb 8 08:03:26 dataset1001 kernel: [38432.289066] Free swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.289068] Total swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.326609] kworker/0:1: page allocation failure. order:0, mode:0x4020 Feb 8 08:03:26 dataset1001 kernel: [38432.326615] Pid: 0, comm: kworker/0:1 Not tainted 2.6.38-13-generic #54~lucid1-Ubuntu Feb 8 08:03:26 dataset1001 kernel: [38432.326618] Call Trace: Feb 8 08:03:26 dataset1001 kernel: [38432.326620] <IRQ> [<ffffffff81113e50>] ? __alloc_pages_slowpath+0x6d0/0x710 Feb 8 08:03:26 dataset1001 kernel: [38432.326633] [<ffffffff81114039>] ? __alloc_pages_nodemask+0x1a9/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.326638] [<ffffffff8114af4a>] ? alloc_pages_current+0xaa/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326642] [<ffffffff81152db5>] ? new_slab+0x295/0x2a0 Feb 8 08:03:26 dataset1001 kernel: [38432.326645] [<ffffffff811544bd>] ? __slab_alloc+0x17d/0x320 Feb 8 08:03:26 dataset1001 kernel: [38432.326650] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326656] [<ffffffff81573fb0>] ? ip6_pol_route_input+0x0/0x20 Feb 8 08:03:26 dataset1001 kernel: [38432.326660] [<ffffffff81157134>] ? __kmalloc_node_track_caller+0xa4/0x1d0 Feb 8 08:03:26 dataset1001 kernel: [38432.326663] [<ffffffff81521a1a>] ? tcp_rcv_established+0x26a/0x6e0 Feb 8 08:03:26 dataset1001 kernel: [38432.326667] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326672] [<ffffffff814c8393>] ? __alloc_skb+0x83/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.326676] [<ffffffff81524ac3>] ? tcp_send_ack+0x33/0x130 Feb 8 08:03:26 dataset1001 kernel: [38432.326680] [<ffffffff8151bafe>] ? __tcp_ack_snd_check+0x5e/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.326684] [<ffffffff815219f9>] ? tcp_rcv_established+0x249/0x6e0 Feb 8 08:03:26 dataset1001 kernel: [38432.326688] [<ffffffff81529dc5>] ? tcp_v4_do_rcv+0x125/0x1c0 Feb 8 08:03:26 dataset1001 kernel: [38432.326692] [<ffffffff8152b5c9>] ? tcp_v4_rcv+0x5a9/0x840 Feb 8 08:03:26 dataset1001 kernel: [38432.326697] [<ffffffff814fffe4>] ? nf_iterate+0x84/0xb0 Feb 8 08:03:26 dataset1001 kernel: [38432.326701] [<ffffffff815080fd>] ? ip_local_deliver_finish+0xdd/0x290 Feb 8 08:03:26 dataset1001 kernel: [38432.326704] [<ffffffff81508330>] ? ip_local_deliver+0x80/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326708] [<ffffffff81507901>] ? ip_rcv_finish+0x121/0x3f0 Feb 8 08:03:26 dataset1001 kernel: [38432.326711] [<ffffffff81507f4d>] ? ip_rcv+0x23d/0x310 Feb 8 08:03:26 dataset1001 kernel: [38432.326715] [<ffffffff814d1a0a>] ? __netif_receive_skb+0x40a/0x690 Feb 8 08:03:26 dataset1001 kernel: [38432.326720] [<ffffffff814d6ac0>] ? 
netif_receive_skb+0x80/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326724] [<ffffffff814d6cd5>] ? napi_gro_complete+0x95/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.326728] [<ffffffff814d715a>] ? napi_complete+0x3a/0x80 Feb 8 08:03:26 dataset1001 kernel: [38432.326736] [<ffffffffa00ddbc5>] ? bnx2_poll_msix+0x85/0xc0 [bnx2] Feb 8 08:03:26 dataset1001 kernel: [38432.326741] [<ffffffff814d72a8>] ? net_rx_action+0x108/0x2d0 Feb 8 08:03:26 dataset1001 kernel: [38432.326746] [<ffffffff8108af73>] ? hrtimer_get_next_event+0xb3/0xf0 Feb 8 08:03:26 dataset1001 kernel: [38432.326751] [<ffffffff8106c04b>] ? __do_softirq+0xab/0x200 Feb 8 08:03:26 dataset1001 kernel: [38432.326757] [<ffffffff810d2330>] ? handle_IRQ_event+0x50/0x170 Feb 8 08:03:26 dataset1001 kernel: [38432.326762] [<ffffffff810132e9>] ? sched_clock+0x9/0x10 Feb 8 08:03:26 dataset1001 kernel: [38432.326766] [<ffffffff8100cf9c>] ? call_softirq+0x1c/0x30 Feb 8 08:03:26 dataset1001 kernel: [38432.326769] [<ffffffff8100ea05>] ? do_softirq+0x65/0xa0 Feb 8 08:03:26 dataset1001 kernel: [38432.326773] [<ffffffff8106bf15>] ? irq_exit+0x85/0x90 Feb 8 08:03:26 dataset1001 kernel: [38432.326777] [<ffffffff815cf326>] ? do_IRQ+0x66/0xe0 Feb 8 08:03:26 dataset1001 kernel: [38432.326782] [<ffffffff815c7753>] ? ret_from_intr+0x0/0x15 Feb 8 08:03:26 dataset1001 kernel: [38432.326784] <EOI> [<ffffffff81334861>] ? intel_idle+0xc1/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326793] [<ffffffff81334844>] ? intel_idle+0xa4/0x120 Feb 8 08:03:26 dataset1001 kernel: [38432.326799] [<ffffffff814a7182>] ? cpuidle_idle_call+0xb2/0x1b0 Feb 8 08:03:26 dataset1001 kernel: [38432.326805] [<ffffffff8100b077>] ? cpu_idle+0xb7/0x110 Feb 8 08:03:26 dataset1001 kernel: [38432.326809] [<ffffffff815c085d>] ? start_secondary+0xc2/0xc4 Feb 8 08:03:26 dataset1001 kernel: [38432.326812] Mem-Info: Feb 8 08:03:26 dataset1001 kernel: [38432.326814] Node 0 DMA per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326817] CPU 0: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326819] CPU 1: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326822] CPU 2: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326824] CPU 3: hi: 0, btch: 1 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326826] Node 0 DMA32 per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326829] CPU 0: hi: 186, btch: 31 usd: 179 Feb 8 08:03:26 dataset1001 kernel: [38432.326831] CPU 1: hi: 186, btch: 31 usd: 15 Feb 8 08:03:26 dataset1001 kernel: [38432.326834] CPU 2: hi: 186, btch: 31 usd: 156 Feb 8 08:03:26 dataset1001 kernel: [38432.326836] CPU 3: hi: 186, btch: 31 usd: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326838] Node 0 Normal per-cpu: Feb 8 08:03:26 dataset1001 kernel: [38432.326841] CPU 0: hi: 186, btch: 31 usd: 88 Feb 8 08:03:26 dataset1001 kernel: [38432.326843] CPU 1: hi: 186, btch: 31 usd: 21 Feb 8 08:03:26 dataset1001 kernel: [38432.326845] CPU 2: hi: 186, btch: 31 usd: 161 Feb 8 08:03:26 dataset1001 kernel: [38432.326848] CPU 3: hi: 186, btch: 31 usd: 28 Feb 8 08:03:26 dataset1001 kernel: [38432.326853] active_anon:31711 inactive_anon:6004 isolated_anon:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326854] active_file:26426 inactive_file:3857074 isolated_file:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326856] unevictable:1019 dirty:729121 writeback:11217 unstable:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326857] free:18317 slab_reclaimable:79292 slab_unreclaimable:9887 Feb 8 08:03:26 dataset1001 kernel: [38432.326858] mapped:2051 shmem:71 pagetables:867 
bounce:0 Feb 8 08:03:26 dataset1001 kernel: [38432.326861] Node 0 DMA free:15868kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_rec laimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Feb 8 08:03:26 dataset1001 kernel: [38432.326872] lowmem_reserve[]: 0 3243 16121 16121 Feb 8 08:03:26 dataset1001 kernel: [38432.326876] Node 0 DMA32 free:52716kB min:3264kB low:4080kB high:4896kB active_anon:18048kB inactive_anon:7992kB a ctive_file:4864kB inactive_file:3085260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3321540kB mlocked:0kB dirty:578832kB writeback:74 12kB mapped:148kB shmem:0kB slab_reclaimable:117856kB slab_unreclaimable:6472kB kernel_stack:56kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp: 0kB pages_scanned:0 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.326889] lowmem_reserve[]: 0 0 12877 12877 Feb 8 08:03:26 dataset1001 kernel: [38432.326893] Node 0 Normal free:4684kB min:12972kB low:16212kB high:19456kB active_anon:108796kB inactive_anon:1602 4kB active_file:100840kB inactive_file:12343036kB unevictable:4076kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:4028kB dirty:233765 2kB writeback:37456kB mapped:8056kB shmem:284kB slab_reclaimable:199312kB slab_unreclaimable:33076kB kernel_stack:952kB pagetables:3176kB unstable:0kB bo unce:0kB writeback_tmp:0kB pages_scanned:76 all_unreclaimable? no Feb 8 08:03:26 dataset1001 kernel: [38432.326905] lowmem_reserve[]: 0 0 0 0 Feb 8 08:03:26 dataset1001 kernel: [38432.326909] Node 0 DMA: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 1586 8kB Feb 8 08:03:26 dataset1001 kernel: [38432.326920] Node 0 DMA32: 368*4kB 249*8kB 634*16kB 1146*32kB 4*64kB 0*128kB 3*256kB 3*512kB 0*1024kB 0*2048kB 0*40 96kB = 52840kB Feb 8 08:03:26 dataset1001 kernel: [38432.326930] Node 0 Normal: 1060*4kB 0*8kB 0*16kB 0*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4560kB Feb 8 08:03:26 dataset1001 kernel: [38432.326941] 3884258 total pagecache pages Feb 8 08:03:26 dataset1001 kernel: [38432.326943] 0 pages in swap cache Feb 8 08:03:26 dataset1001 kernel: [38432.326945] Swap cache stats: add 0, delete 0, find 0/0 Feb 8 08:03:26 dataset1001 kernel: [38432.326947] Free swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.326948] Total swap = 976892kB Feb 8 08:03:26 dataset1001 kernel: [38432.350585] 4194288 pages RAM Feb 8 08:03:26 dataset1001 kernel: [38432.350588] 81311 pages reserved Feb 8 08:03:26 dataset1001 kernel: [38432.350589] 2544422 pages shared Feb 8 08:03:26 dataset1001 kernel: [38432.350591] 1558178 pages non-shared Feb 8 08:03:26 dataset1001 kernel: [38432.350595] SLUB: Unable to allocate memory on node -1 (gfp=0x20) Feb 8 08:03:26 dataset1001 kernel: [38432.350598] cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0 Feb 8 08:03:26 dataset1001 kernel: [38432.350602] node 0: slabs: 2939, objs: 10974, free: 0
The first is a failure of the driver to atomically allocate memory for a socket buffer for an incoming packet, at a point where it is not possible to wait for a page to be swapped in. This probably means a lost packet. The second stack trace is in the code path for sending a TCP ACK, which means the remote end would have to retransmit. We saw two groups of 5-6 stack traces, one at 08:03:26 and one at 11:23:17. AFAICT these two incidents didn't actually affect network traffic except for short hiccups. More recent versions of the driver use kmalloc instead of __netdev_alloc_skb and handle allocation failures differently; if the issue turns out to be serious we could look into backporting those changes.
We did see more of these later on Feb 8 and Feb 9, a total of about 5 groups of 5 stack traces within 3 hours. Not sure of the effect on the remote end.
One possibility, reported on various mailing lists, is to increase the setting in /proc/sys/vm/min_free_kbytes, but this needs more checking.
Okay, after a bunch of reading and crunching the numbers, it looks like we need to buy more time for kswapd to reclaim memory. Increased (live hack) min_free_kbytes from 16259 to 262144. This is still well below the maximum recommended 6% for a host with 16GB of RAM.
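For the record, a live (non-persistent) change like this is just a write to the procfs entry or the equivalent sysctl call; a minimal sketch, with the persistence step noted only as an assumption about how it would normally be done:

 # live hack, lost on reboot
 echo 262144 > /proc/sys/vm/min_free_kbytes
 # equivalent:
 sysctl -w vm.min_free_kbytes=262144
 # to persist (assumption), add "vm.min_free_kbytes = 262144" to /etc/sysctl.conf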
Feb 9 2012
Saw a single stack trace from xfs:
Feb 9 02:54:20 dataset1001 kernel: [34353.002050] INFO: task rsync:6830 blocked for more than 120 seconds. Feb 9 02:54:20 dataset1001 kernel: [34353.008374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 9 02:54:20 dataset1001 kernel: [34353.016174] rsync D ffff880407bb3170 0 6830 4054 0x00000000 Feb 9 02:54:20 dataset1001 kernel: [34353.016180] ffff88003749ba28 0000000000000082 ffff88003749b9f8 ffff880000000000 Feb 9 02:54:20 dataset1001 kernel: [34353.016184] 0000000000013cc0 ffff880407bb2dc0 ffff880407bb3170 ffff88003749bfd8 Feb 9 02:54:20 dataset1001 kernel: [34353.016188] ffff880407bb3178 0000000000013cc0 ffff88003749a010 0000000000013cc0 Feb 9 02:54:20 dataset1001 kernel: [34353.016193] Call Trace: Feb 9 02:54:20 dataset1001 kernel: [34353.016203] [<ffffffff815c558d>] schedule_timeout+0x21d/0x300 Feb 9 02:54:20 dataset1001 kernel: [34353.016209] [<ffffffff812bbf84>] ? blk_unplug+0x34/0x70 Feb 9 02:54:20 dataset1001 kernel: [34353.016214] [<ffffffff8148dfec>] ? dm_table_unplug_all+0x5c/0x110 Feb 9 02:54:20 dataset1001 kernel: [34353.016218] [<ffffffff815c651e>] __down+0x7e/0xc0 Feb 9 02:54:20 dataset1001 kernel: [34353.016257] [<ffffffffa0162d55>] ? _xfs_buf_find+0xf5/0x280 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016263] [<ffffffff8108c991>] down+0x41/0x50 Feb 9 02:54:20 dataset1001 kernel: [34353.016291] [<ffffffffa0161a73>] xfs_buf_lock+0x53/0x120 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016318] [<ffffffffa0162d55>] _xfs_buf_find+0xf5/0x280 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016346] [<ffffffffa0162f40>] xfs_buf_get+0x60/0x1c0 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016374] [<ffffffffa01638dc>] xfs_buf_read+0x2c/0x110 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016402] [<ffffffffa0158ae0>] xfs_trans_read_buf+0x190/0x420 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016426] [<ffffffffa012ba09>] xfs_da_do_buf+0x289/0x790 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016431] [<ffffffff8116dde3>] ? generic_permission+0x23/0xb0 Feb 9 02:54:20 dataset1001 kernel: [34353.016436] [<ffffffff815c705e>] ? _raw_spin_lock+0xe/0x20 Feb 9 02:54:20 dataset1001 kernel: [34353.016443] [<ffffffff81181e30>] ? mntput_no_expire+0x60/0x190 Feb 9 02:54:20 dataset1001 kernel: [34353.016448] [<ffffffff8116efc4>] ? finish_open+0xe4/0x1c0 Feb 9 02:54:20 dataset1001 kernel: [34353.016471] [<ffffffffa012bf8a>] xfs_da_read_buf+0x2a/0x30 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016496] [<ffffffffa01303d8>] ? xfs_dir2_block_getdents+0x98/0x220 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016520] [<ffffffffa01303d8>] xfs_dir2_block_getdents+0x98/0x220 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016542] [<ffffffffa0118a11>] ? xfs_bmap_last_offset+0x111/0x140 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016547] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016571] [<ffffffffa012ebf7>] xfs_readdir+0xa7/0x130 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016575] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016580] [<ffffffff811753b0>] ? filldir+0x0/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016607] [<ffffffffa0164ac9>] xfs_file_readdir+0x39/0x50 [xfs] Feb 9 02:54:20 dataset1001 kernel: [34353.016611] [<ffffffff81175548>] vfs_readdir+0xb8/0xe0 Feb 9 02:54:20 dataset1001 kernel: [34353.016616] [<ffffffff811756d9>] sys_getdents+0x89/0xf0 Feb 9 02:54:20 dataset1001 kernel: [34353.016621] [<ffffffff8100c082>] system_call_fastpath+0x16/0x1b
Investigating.
Later we saw three more in short succession (not in time to look at the state of things while it was happening):
Feb 9 13:42:20 dataset1001 kernel: [73132.884063] INFO: task rsync:6830 blocked for more than 120 seconds. Feb 9 13:44:20 dataset1001 kernel: [73252.585031] INFO: task rsync:6830 blocked for more than 120 seconds. Feb 9 13:46:20 dataset1001 kernel: [73372.285989] INFO: task rsync:6830 blocked for more than 120 seconds.
iostat periodically shows 100% utilization for sdb and dm-0, along with await much greater than svctm. megaraid_cli shows all disks online with no errors.
Sample iostat from during such an episode:
root@dataset1001:~/atg/iftop# iostat -d -x 2 6
Linux 2.6.38-13-generic (dataset1001)  02/09/2012  _x86_64_  (4 CPU)

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.19  216.59  151.90   55162.19    68725.61    336.21     39.25  106.51   0.97  35.82
sda        0.14   12.37  484.19  167.36  123591.50    75202.41    305.11      9.22   14.15   0.36  23.24
dm-0       0.00    0.00  700.05  317.22  178720.80   143815.45    317.06     48.80   47.97   0.54  55.15

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.00    0.00  278.50       0.00   126860.00    455.51    144.56  519.32   3.59 100.00
sda        0.00    0.00  181.00    0.00   46336.00        0.00    256.00      0.07    0.47   0.19   3.50
dm-0       0.00    0.00  179.00  276.00   45824.00   125660.00    376.89    144.85  316.81   2.19  99.50

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.00    0.00  276.00       0.00   125880.00    456.09    145.02  520.00   3.62 100.00
sda        0.00    0.00  193.00    0.00   49408.00        0.00    256.00      0.01    0.05   0.05   1.00
dm-0       0.00    0.00  193.00  276.00   49408.00   125660.00    373.28    146.01  308.12   2.13 100.00

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.00    0.00  274.50       0.00   124940.00    455.15    144.13  525.15   3.64 100.00
sda        0.00    0.00  187.00    1.00   47872.00        8.00    254.68      0.03    0.16   0.11   2.00
dm-0       0.00    0.00  187.00  276.50   47872.00   126120.00    375.39    145.13  313.20   2.16 100.00

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.00    0.00  278.00       0.00   126620.00    455.47    144.91  519.41   3.60 100.00
sda        0.00    0.00  199.00    0.00   50944.00        0.00    256.00      0.02    0.10   0.08   1.50
dm-0       0.00    0.00  199.00  274.50   50944.00   124940.00    371.46    145.91  307.04   2.11 100.00

Device:  rrqm/s  wrqm/s     r/s     w/s     rsec/s      wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdb        0.00    0.00    0.00  276.50       0.00   126120.00    456.13    144.34  519.24   3.62 100.00
sda        0.00    0.00  192.00    0.00   49152.00        0.00    256.00      0.08    0.42   0.29   5.50
dm-0       0.00    0.00  192.00  278.50   49152.00   126860.00    374.10    145.41  307.45   2.13 100.00
Feb 15 2012
Some testing was done without the ionice and nice values for the rsync daemon, to see if those could be related to the rsync issues, on the theory that reads for the remote rsyncer were simply being queued behind writes from the local rsyncs, which were running at a much higher priority.
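For context, the daemon had been running at lowered CPU and I/O priority; a hypothetical sketch of how such a daemon might be started (priority values and config path are illustrative, not the exact setup on this host):

 # run the rsync daemon at low CPU priority and best-effort lowest I/O priority
 nice -n 19 ionice -c 2 -n 7 rsync --daemon --config=/etc/rsyncd.conf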
No xfs issues since restarting the daemon without those settings, but then we started seeing rsync timeouts. On my end,
Feb 10 17:37:50 dataset1001 rsyncd[4200]: rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
Feb 10 17:37:50 dataset1001 rsyncd[4200]: rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [sender=3.0.7]
and on their end,
rsync: read error: Operation timed out (60)
rsync error: error in rsync protocol data stream (code 12) at io.c(764) [receiver=3.0.9]
After a bunch of inconclusive tests and beating our heads against the wall, we gave up on it for the weekend. On Sunday the remote side started another rsync, and it has been running since then with no timeouts or other issues. I have manually run the cron job updater that pulls from dataset2 two, three, or four times a day with no ill effects. So who knows.
March 1 2012
It looks like the min_free_kbytes change took care of the bnx2 errors, the removal of the ionice and nice settings took care of the xfs complaints, and we aren't seeing any other rsync timeouts unless someone actually interrupts the connection.
Next up: speed.
To one of our external mirrors we get 30 to 40 MB/sec via rsync over one route (we cap the bandwidth, so this is a good value), but with a different peer we get very slow speeds. A second mirror site reaches us via that peer over the slow route, and we see slow speeds with them as well.
After working with a network engineer at the peer, some things (we don't know what) were changed at their end, affecting traffic to/from the switch of theirs that connects to us. We now see iperf transfer speeds of 475 Mbits/sec between the second mirror and dataset1001, which is great and comparable to earlier iperf speeds from dataset1001 to the peer switch (572 Mbits/sec). But rsync speed is still dismal: 1.5 to 3.6 MB/sec.
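For reference, these throughput figures come from a standard iperf client/server pair; a sketch of the sort of invocation used (test duration and options are assumptions, not the exact test that was run):

 # on dataset1001 (server side)
 iperf -s
 # on the remote mirror (client side), a 30-second TCP test
 iperf -c dataset1001.wikimedia.org -t 30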
Other issues
Initial issues included a firmware bug in the receiving host's disk array that slowed down writes; the array was swapped out for different hardware, which fixed the issue.
The same rsync via a different route (not he.net) was much faster. Someone at he.net made an unspecified configuration change which helped for a short time, and then things went downhill again. Using a different route we are still doing well. Speeds bounce around somewhat due to the way the rsync bwlimit works (burst, then throttle).