At about 22:30 UTC on 2016-02-14, labsdb1002 suffered a disk failure and unmounted the volume that hosted several tools databases. We were unable to remount the system, so the db server was depooled pending a disk replacement.
22:28: incinga reports
mysqld processes on labsdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld MariaDB disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
Giuseppe, Andrew, Chase, Jaime, Ariel and Alex Monk responded to the alerts. Logs include the following:
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected. Shutting down filesystem Feb 14 22:24:55 labsdb1002 kernel: [21554975.965752] XFS (dm-1): Please umount the filesystem and rectify the problem(s) Feb 14 22:24:55 labsdb1002 kernel: [21554975.965754] XFS (dm-1): metadata I/O error: block 0xc00f2fb0 ("xlog_iodone") error 5 numblks 64 Feb 14 22:24:55 labsdb1002 kernel: [21554975.965757] XFS (dm-1): xfs_do_force_shutdown(0x2) called from line 1170 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_log.c. Return address = 0xffffffffa02db801
22:50: Giuseppe attempts to unmount and remount the failed volume
22:59: After discussion it's agreed that the other db servers can handle the load from 1002, so Jaime submits https://gerrit.wikimedia.org/r/#/c/270650/ which directs access for affected DBs to other servers. Andrew merges the patch and applies on labservices1001.
23:05: Andrew restarts the 'replag' tool by hand, at which point it resumes normal operation. Many tools recover spontaneously, and a few others are restarted by hand by operations staff.
23:22: Andrew sends an email to labs-announce encouraging tool maintainers to restart their services.
Affected tools replica databases:
Tools with sensible reconnect logic recovered immediately after labsdb1002 was depooled. Those without will require a manual restart, which is largely left up to tool maintainers.
- Replace the broken disk and repool labsdb1002: https://phabricator.wikimedia.org/T126946
- Consider implementing an automatic failover system for labsdb shards