Incidents/20150617-LabsNFSOutage

Summary

At some point near the end of June 17, 2015, the filesystem backing the NFS storage used by Labs suffered a catastrophic failure, preventing most of Labs from working. Because efforts to recover the filesystem did not succeed, the decision was made to restore from a June 8th backup to a fresh set of volumes.

Service returned gradually as NFS files were restored to the different projects; Tools was back in operation early on the 19th, with most other Labs projects returning later that day. A prose summary with updates was maintained below.

Timeline

2015-06-18

  • [00:30] Labs NFS switches to read-only; Andrew notices and consults Gage for an opinion. The only symptom in the logs is ext4 on labstore1001 hitting a single bad inode and switching to read-only for protection
  • [00:36] Andrew calls Marc
  • [00:37] Gage believes an fsck and remount are likely to fix the issue
  • [00:40] Marc arrives online and begins diagnosing
  • [00:45] Marc arrives at the same diagnosis; with the logs reporting a single (comparatively minor) issue, he decides to halt NFS and fsck before remounting
  • [02:21] fsck progresses, but the first signs of serious trouble appear as it begins reporting multiply-allocated blocks in the filesystem, pointing to severe damage
  • [05:13] fsck still working, but shows signs of distress. Giuseppe joins the effort.
  • [05:20] Filesystem estimated to be damaged beyond returning to service; Mark paged
  • [06:37] Mark arrives.
  • [06:50] Evaluating the backup options. The labstore1002 backup is found to be good; labstore2001 has hardware issues
  • [07:30] Plan formulated to build a new volume for restoring the labstore1002 backup, after moving some extents to the older raid6 drives
  • [08:49] labstore1001 rebooted without assembled raid arrays to allow 1002 to take over
  • [09:07] While attempting to reboot 1002 to start the recovery process, labstore1002 H800 controller fails to pass POST and server does not boot
  • [09:09] Chris contacted. Repeated attempts to power-cycle 1002 continue, in the hope that residual ("flea") power is the issue.
  • [09:39] 1002 powers on, but some Jessie/Precise differences rear their heads and the boot does not complete
  • [12:20] Issue found (Jessie ignores the mdadm.conf AUTO stanza); fixed and rebooted (see the configuration sketch after this list)
  • [12:41] Moving old data to make room for new volume group begins
  • [13:35] New volume group (not thin, on raid 10) created
  • [13:50] Attempt to restore from the backup using dump(1).
  • [15:08] dump found not to scale well; it would take too long to complete
  • [15:15] Plan B: resize the backed up filesystem and do a block-level copy
  • [18:27] resize found to be unusable as it would take many days to complete
  • [18:30] Plan C: create new filesystems, selectively rsync data to them, and bring NFS service back gradually
  • [18:52] Rsync of tools started, for all but four tools that are outliers in size
  • [19:40] Rsync working well, Marc goes for a nap while it completes
  • [23:52] Rsync done; Marc called back for the second rsync pass
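
The mdadm issue at [12:20] came down to the auto-assembly policy in mdadm.conf being honoured differently between the Precise and Jessie hosts (Jessie ignored the stanza). As a minimal, illustrative sketch only, with a placeholder device name and UUID rather than the actual labstore configuration:

  # /etc/mdadm/mdadm.conf (illustrative sketch, not the labstore config)
  # The AUTO line is the policy knob for which arrays mdadm may assemble
  # automatically at boot; "-all" at the end disallows anything not
  # explicitly allowed earlier on the line.
  AUTO +1.x -all
  ARRAY /dev/md/backup metadata=1.2 UUID=00000000:00000000:00000000:00000000

When boot-time policy cannot be relied upon, arrays can also be assembled or stopped by hand with mdadm --assemble and mdadm --stop.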

2015-06-19

  • [02:54] Restoring NFS service for tools
  • [03:40] Issue with the Jessie NFS configuration found (ports switched); see the configuration sketch after this list
  • [03:59] Issue fixed at the firewall level
  • [04:05] Beginning to restart tools project with the new NFS
  • [04:18] gridengine returns; restart of all grid nodes begins
  • [04:30] Rsync for originally excluded tools begins
  • [04:48] Rsync for most other projects begins (excluded are maps, osmit, deployment-prep and mwoffliner)
  • [06:30] Tools is back online
  • [14:22] Issue found with NFS speed
  • [14:54] Issue determined to be confusion about the scratch space mount; restarting NFS and redoing the exports fixed it.
  • [15:40] fsck of the old filesystem started
  • [17:23] maps project delayed until the weekend (too big); osmit rsync'ed and available
  • [17:31] deployment-prep restore started
  • [18:08] deployment-prep available to NFS
  • [18:30] rsync of maps started on a new volume
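
On the [03:40] entry: NFSv3 helper daemons such as rpc.mountd and rpc.statd pick arbitrary RPC ports unless they are pinned, which a static firewall rule set will not expect. A hedged sketch of how those ports are commonly pinned on a Debian NFS server follows; the port numbers are illustrative, not the values actually used on labstore:

  # /etc/default/nfs-kernel-server (illustrative port number)
  RPCMOUNTDOPTS="--manage-gids --port 38466"

  # /etc/default/nfs-common (illustrative port numbers)
  STATDOPTS="--port 32765 --outgoing-port 32766"

  # After changing exports or daemon options (cf. the [14:54] fix):
  exportfs -ra                      # re-read and re-apply /etc/exports
  service nfs-kernel-server restart

In this incident the mismatch was resolved on the firewall side instead, as noted at [03:59].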

Conclusions

It's not clear what caused the corruption of the filesystem; the logs contain no error or indication of issues before the single (relatively minor) hit on a broken inode, at which point the filesystem automatically switched to read-only mode. There are a number of plausible hypotheses about the underlying cause[1], but the net result was extensive damage to the block allocation structures of the filesystem, mostly around the files being actively written at the time (log files were the hardest hit).

Recovery time was long because of the sheer amount of data to restore and the requirement to keep the previous filesystem around for recovery, which restricted the space available for manipulating and restoring data. The volume to restore was further inflated by a number of projects storing large numbers of files that could have been safely discarded had they been properly identified.

Troubleshooting, recovery and team coordination were made more difficult, and the recovery time further lengthened, by the comparative complexity of the system as a whole and by its poorly maintained state: nonexistent or inconsistent configuration management, inconsistent environments (Precise/Jessie, configuration files) between primary and backup systems, ongoing hardware issues in both of the backup systems, and multiple prolonged migrations in flight.

  1. The most likely are (a) the secondary server having accidentally assembled and written to the RAID arrays despite the volumes not having been active; or (b) an issue or incompatibility with the then-ongoing pvmove over the thin volume holding the filesystem
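
For context on hypothesis (b): pvmove relocates a logical volume's extents between physical volumes while the volume stays online. A minimal sketch of such an operation, with placeholder device names rather than the actual labstore layout:

  # Move every allocated extent off /dev/sdb1 onto /dev/sdc1 while the
  # logical volumes on top remain in use; LVM mirrors each segment and
  # then switches over to the new location.
  pvmove /dev/sdb1 /dev/sdc1

  # Afterwards, confirm where each logical volume's extents now live:
  lvs -a -o +devices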

Actionables

(All should be tracked in https://phabricator.wikimedia.org/tag/incident-20150617-labsnfsoutage/ )

  • Maintain labstore systems better, by employing standard operations team practices such as configuration management
  • Reduce NFS server SPOFs (e.g. by employing sharding)
  • Reduce the size of the filesystem(s) underlying NFS to speed backups and recovery
  • Make certain that all hardware issues (labstore1002 & labstore2001 in particular) are fixed (task T102626)
  • Formulate a rigid, well-known backup plan across servers and locations and apply it (task T103691)
  • Simplify the NFS server setup: no added complexity unless absolutely needed (task T102520, task T103265, task T94609, task T95559)
  • Reduce reliance on NFS for projects that do not strictly require a networked filesystem for their operation (task T102240)

Updates

Update 2015-06-19 21:00 UTC: All other projects should be up now (including tools) - restored from a backup taken on June 9. Some have had NFS disabled - but those mostly have had no significant NFS usage or have had members of the project confirm NFS is unused. This increases their reliability significantly. If your project has something missing, please file a bug or respond on list.

Update 2015-06-19 15:25 UTC: NFS for other projects made available, they are being brought back one at a time. A fsck is in progress on the old filesystem, and on completion it will tell us if we can recover data newer than 10 days old.

Update 2015-06-19 14:55 UTC: NFS stall issue identified and fixed, tool labs back.

Update 2015-06-19 14:50 UTC: NFS stalled again, investigation under way.

Update 2015-06-19 12:30 UTC: The 6 excluded tools are back; maintainers, please start them back up. Webservices have been started up automatically. See https://lists.wikimedia.org/pipermail/labs-l/2015-June/003831.html for more information.

Update 2015-06-19 06:20 UTC: Tools are back; see https://lists.wikimedia.org/pipermail/labs-l/2015-June/003814.html for an update.

Update 2015-06-18 17:20 UTC: We are prioritizing bringing tools.wmflabs.org back up, and in the interest of time have excluded the following tools from initial copy: cluebot, zoomviewer, oar, templatetiger, bub, fawiki-tools. They'll be copied over in subsequent iterations.

Labs (including tool labs) is down, and it's not clear when it will be back up again

Yesterday, the file system used by many Labs tools suffered a catastrophic failure, causing most tools to break. This was noticed quickly but recovery is taking a long time because of the size of the filesystem.

There has been file system corruption on the filesystem backing the NFS setup that all of Labs uses, causing a prolonged outage. The Operations team is at work attempting to restore a backup made on June 9 at 16:00.

More information:

If you are an editor on one of our projects

Sorry; you will not be able to use the tool you want to use at this moment. We are working hard to get everything back up and running, but it's going to take some time. Please be patient.

If you are a tool developer

It's not clear yet what the impact of the file system corruption is. The backup is more than a week old, so it is possible recent changes will be lost.

If you manage your own project on Labs

If you have a non-Tools project on Labs that does not depend on NFS and is currently down, you can recover it by getting rid of NFS, and we will help with that. Recover instance from NFS shows how to do it, and we would prefer you show up in #wikimedia-labs on IRC so we can help you do it faster and more easily.
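
As a rough sketch of what the manual side of that involves (the mount points and fstab pattern below are the usual Labs ones, shown purely as an illustration; the linked page and the channel remain the authoritative source):

  # Detach the shared filesystems; -l (lazy) unmounts them even if
  # processes still hold files open on the stale mounts.
  sudo umount -l /data/project /data/scratch /public/dumps
  # Keep them from being mounted again at boot (pattern is illustrative):
  sudo sed -i.bak '/labstore/d' /etc/fstab

Because these mounts are normally puppet-managed on Labs instances, a lasting change also needs the matching puppet configuration updated, which is part of what we can help with in the channel.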

Details

wikibugs and grrrit-wm

are now temporarily running in a screen (under user valhallasw) on tools-webproxy-01

tool labs web server

Returns 503s on every request
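
A quick way to observe that symptom from the outside (the URL is the public Tools entry point mentioned above):

  # Print just the HTTP status code of a request to the Tools front page:
  curl -s -o /dev/null -w '%{http_code}\n' https://tools.wmflabs.org/
  # During the outage this printed 503 for every URL behind the proxy.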