Swift/Server issues Aug-Sept 2012
We've seen some issues with the ms-be boxes (all C2100s) in Tampa.
Current status as of this writing:
Hardware Issues and Troubleshooting.
- Per a conversation with Dell, we checked the jumpers on the backplane to ensure that j15 was indeed empty. After examining the backplanes on ms-be6, ms-be7 and ms-be8, we confirmed that each has only the 3 jumpers and that j15 is empty.
- ms-be6 had all 12 HDDs replaced with drives from a different manufacturer.
- ms-be6 has a replaced main board, a replaced backplane and a replaced sas2008 controller card.
- ms-be6 had the LSI sas2008 controller, which supports jbod, replaced with an LSI 9680, which does not support jbod. The card and configuration change did not fix the issue (several disks had issues mounting).
These boxes are powered on, have had puppet disabled in /etc/default/puppet, and have had the swift processes shut off via swift-init stop all.
- ms-be6 has had ssds uncabled and power pulled, a reinstall, the latest LSI driver (mpt2sas0) and the latest controller firmware. After recreating all xfs filesystems on /dev/sdc and up by hand and remounting manually, it boots and continues to see all drives and mount them. It reported a degraded raid array on /dev/md0 (= /dev/sda and /dev/sdb; /dev/sdb is the one that fell out of the raid). I repaired the raid, and a couple of hours later it reported degraded again. So we can use this box for testing the other drives but not put it back into the rings.
- ms-be7 has ssds which are cabled up. I followed the same procedure of recreating all xfs filesystems except for the OS, remounting manually and rebooting, and it sees all drives. We can put this back into the swift rings next Monday (after re-enabling puppet and restarting swift processes on the box).
- ms-be10 has ssds which are cabled up. After recreating all xfs filesystems except for the OS and remounting manually, on reboot it reports a few disks (not always the same ones) as not ready/not present, but if you wait it out they eventually mount. Obviously this is a problem. I don't think this should go back into production.
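The post-reboot checks described above (are all data disks mounted, and is the raid still healthy) could be scripted along the following lines. This is only a rough sketch, not the procedure that was actually used: the function names missing_mounts and degraded_arrays are made up for illustration, the device names passed in are placeholders, and it assumes the usual Linux layouts of /proc/mounts (device in the first field) and /proc/mdstat (a "[total/active]" counter on the line after each md device).

```python
import re


def missing_mounts(expected_devices, mounts_text):
    """Return the expected devices that are absent from /proc/mounts-style text."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            mounted.add(fields[0])  # first field of each line is the device
    return [dev for dev in expected_devices if dev not in mounted]


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than configured."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line looks like "... blocks [2/1] [U_]": 2 configured, 1 active.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current is not None:
            total, active = int(counts.group(1)), int(counts.group(2))
            if active < total:
                degraded.append(current)
            current = None  # ignore any later counters until the next array
    return degraded
```

On a box, the inputs would come from open("/proc/mounts").read() and open("/proc/mdstat").read(), with the list of expected devices supplied by hand. Since ms-be10 eventually mounts its late disks, a single run right after boot can report disks as missing that show up a little later, so the check would need to be repeated before drawing conclusions.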
These boxes are powered off.
- ms-be8 had its sas disks replaced with NL SATA disks. Once data had been written to the disks and the box had been rebooted several times, the disks failed, reporting "firmware status: failed".
These boxes have one problem disk.
- ms-be5 has one disk reporting errors; it needs to be replaced once the other servers are stable.
- ms-be11 has had one disk replaced, but it shows the wrong logical id, so it needs a reboot once the other servers are stable.