Incidents/20160214-labsdb1002
Summary
At about 22:30 UTC on 2016-02-14, labsdb1002 suffered a disk failure; XFS detected I/O errors and shut down the filesystem on the volume hosting several of the replica databases used by tools. Attempts to unmount and remount the volume failed, so the database server was depooled pending a disk replacement.
Timeline
22:28: Icinga reports:
mysqld processes on labsdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
MariaDB disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
Giuseppe, Andrew, Chase, Jaime, Ariel and Alex Monk responded to the alerts. Kernel logs on labsdb1002 include the following:
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected. Shutting down filesystem
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965752] XFS (dm-1): Please umount the filesystem and rectify the problem(s)
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965754] XFS (dm-1): metadata I/O error: block 0xc00f2fb0 ("xlog_iodone") error 5 numblks 64
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965757] XFS (dm-1): xfs_do_force_shutdown(0x2) called from line 1170 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_log.c. Return address = 0xffffffffa02db801
22:50: Giuseppe attempts to unmount and remount the failed volume, without success.
22:59: After discussion, it is agreed that the other database servers can handle the load from labsdb1002, so Jaime submits https://gerrit.wikimedia.org/r/#/c/270650/, which redirects access for the affected databases to other servers. Andrew merges the patch and applies it on labservices1001.
23:05: Andrew restarts the 'replag' tool by hand, at which point it resumes normal operation. Many tools recover spontaneously, and a few others are restarted by hand by operations staff.
23:22: Andrew sends an email to labs-announce encouraging tool maintainers to restart their services.
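Tools reach the replicas through per-database service aliases rather than host names, so the depool at 22:59 took effect once those aliases were re-pointed. Below is a minimal sketch, in Python, of how a maintainer could check where an alias currently resolves; the '<dbname>.labsdb' naming convention and the specific aliases listed are illustrative assumptions, not details taken from the patch.

import socket

# Replica service aliases for a few of the affected databases; the
# "<dbname>.labsdb" naming convention is assumed here for illustration.
ALIASES = ['dewiki.labsdb', 'commonswiki.labsdb', 'wikidatawiki.labsdb']

for alias in ALIASES:
    try:
        # gethostbyname_ex returns (canonical name, alias list, IP addresses),
        # which shows which backend host the alias currently points at.
        canonical, _, addresses = socket.gethostbyname_ex(alias)
        print('{0} -> {1} ({2})'.format(alias, canonical, ', '.join(addresses)))
    except socket.gaierror as err:
        print('{0}: lookup failed: {1}'.format(alias, err))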
Conclusions
Affected replica databases:
- 'bgwiki'
- 'bgwiktionary'
- 'commonswiki'
- 'cswiki'
- 'dewiki'
- 'enwikiquote'
- 'enwiktionary'
- 'eowiki'
- 'fiwiki'
- 'idwiki'
- 'itwiki'
- 'nlwiki'
- 'nowiki'
- 'plwiki'
- 'ptwiki'
- 'svwiki'
- 'thwiki'
- 'trwiki'
- 'wikidatawiki'
- 'zhwiki'
Tools with sensible reconnect logic recovered immediately after labsdb1002 was depooled. Those without such logic require a manual restart, which is largely left to the tool maintainers.
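For maintainers adding reconnect logic, the following is a minimal sketch in Python using pymysql; the host alias, credentials path, and database name are placeholders rather than settings from any particular tool.

import os
import time

import pymysql


def connect():
    # The host alias, credentials path and database name are illustrative placeholders.
    return pymysql.connect(
        host='dewiki.labsdb',
        read_default_file=os.path.expanduser('~/replica.my.cnf'),
        db='dewiki_p',
    )


def run_query(sql, params=None, retries=3):
    """Run a read-only query, reconnecting if the server has gone away."""
    for attempt in range(retries):
        try:
            conn = connect()
            try:
                with conn.cursor() as cursor:
                    cursor.execute(sql, params)
                    return cursor.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError:
            # Connection refused or dropped: back off briefly, then reconnect,
            # which picks up the new backend once the service alias is re-pointed.
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)

Because a fresh connection is opened per attempt, the tool recovers without a restart as soon as the alias resolves to a healthy server.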
Actionables
- Replace the broken disk and repool labsdb1002: https://phabricator.wikimedia.org/T126946
- Consider implementing an automatic failover system for labsdb shards