Incident documentation/20160214-labsdb1002

From Wikitech

Summary

At about 22:30 UTC on 2016-02-14, labsdb1002 suffered a disk failure and unmounted the volume that hosted several tools databases. We were unable to remount the volume, so the db server was depooled pending a disk replacement.

Timeline

22:28: Icinga reports:

mysqld processes on labsdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
MariaDB disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error

Giuseppe, Andrew, Chase, Jaime, Ariel and Alex Monk responded to the alerts. Kernel logs on labsdb1002 show the XFS filesystem shutting down after an I/O error:

Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected.  Shutting down filesystem
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965752] XFS (dm-1): Please umount the filesystem and rectify the problem(s)
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965754] XFS (dm-1): metadata I/O error: block 0xc00f2fb0 ("xlog_iodone") error 5 numblks 64
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965757] XFS (dm-1): xfs_do_force_shutdown(0x2) called from line 1170 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffa02db801

22:50: Giuseppe attempts to unmount and remount the failed volume, without success.

22:59: After discussion it's agreed that the other db servers can handle the load from labsdb1002, so Jaime submits https://gerrit.wikimedia.org/r/#/c/270650/ which directs access for the affected DBs to other servers. Andrew merges the patch and applies it on labservices1001.

23:05: Andrew restarts the 'replag' tool by hand, at which point it resumes normal operation. Many tools recover spontaneously, and a few others are restarted by hand by operations staff.

23:22: Andrew sends an email to labs-announce encouraging tool maintainers to restart their services.

Conclusions

Affected tools replica databases:

  • 'bgwiki'
  • 'bgwiktionary'
  • 'commonswiki'
  • 'cswiki'
  • 'dewiki'
  • 'enwikiquote'
  • 'enwiktionary'
  • 'eowiki'
  • 'fiwiki'
  • 'idwiki'
  • 'itwiki'
  • 'nlwiki'
  • 'nowiki'
  • 'plwiki'
  • 'ptwiki'
  • 'svwiki'
  • 'thwiki'
  • 'trwiki'
  • 'wikidatawiki'
  • 'zhwiki'

Tools with sensible reconnect logic recovered immediately after labsdb1002 was depooled. Tools without it require a manual restart, which is largely left to their maintainers.
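
For tool maintainers, here is a minimal sketch of what such reconnect logic might look like. It is an illustration under stated assumptions, not a description of how any particular tool works: `TransientDBError`, `run_with_reconnect`, and the `connect` and `query` callables are hypothetical names standing in for whatever the tool's database driver provides. The idea is to drop a broken connection and open a fresh one, which gives the client a chance to pick up a redirected host, instead of reusing a handle to a server that has gone away.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_with_reconnect(connect, query, retries=5, base_delay=1.0, sleep=time.sleep):
    """Run query(conn), reconnecting with exponential backoff on failure.

    connect: callable returning a new connection (assumed to resolve the
             database host afresh each time it is called).
    query:   callable taking a connection and returning a result.
    """
    conn = None
    for attempt in range(retries):
        try:
            if conn is None:
                conn = connect()
            return query(conn)
        except TransientDBError:
            conn = None  # discard the broken connection
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

A tool would wrap each database call in `run_with_reconnect`, translating its driver's disconnect exception into the error type caught here. The retry count and delays are arbitrary and should be tuned per tool.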

Actionables

  • Consider implementing an automatic failover system for labsdb shards