Incidents/2018-07-31 Phabricator

Summary

On 31 July 2018, Wikimedia's Phabricator instance ( https://phabricator.wikimedia.org ) was unstable: for about 10 minutes it showed non-canonical data, was unavailable, or was in read-only mode, and within that window it also lost some ticket information for a period of about 6 minutes.

Timeline

15:02: Network maintenance started to move servers in EQIAD row B to a different switch (Task T183585)

15:18: dbproxy1008 and dbproxy1003 detect db1072 as down (this host was part of the network maintenance and was expected to have a brief downtime)

  • At that moment some connections (both reads and writes) started going to db1117:3323 (which is supposed to be read only), while others continued writing to db1072
  • DBAs start investigating

15:18: <icinga-wm> PROBLEM - MariaDB Slave SQL: m3 on db1117 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 24361579 for key PRIMARY on query. Default database: phabricator_file.

  • Icinga also alerts about duplicate key errors, revealing a split-brain scenario

15:22: DBAs realize db1117:3323 wasn't in read-only mode, which was the cause of the writes on the replica and of the broken replication.

15:24: db1117:3323 is set to read-only while the investigation continues

  • At this point Phabricator becomes unavailable, because it does not support a read-only database
  • A decision has to be made whether to keep db1117:3323 as the master or to revert to db1072. Both options imply a small but undetermined amount of data loss. db1072 is chosen on the grounds that it provides higher availability (as it allows the codfw replicas to remain exact replicas)

15:26: DBAs reload haproxy to make sure db1072 is back as the master.

15:27: Due to Phabricator's connection pooling, connections already established to db1117:3323 would need to be killed to be safe; stopping MySQL on that instance is chosen as the safest option (a sketch of these database-side steps follows the timeline)

15:28: Phabricator is confirmed to be fully back up
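
For reference, the following is a minimal sketch of the database-side steps described above, assuming a standard MariaDB setup; the exact commands and tooling used on the WMF hosts may differ.

    -- On db1117:3323: put the instance back into read-only mode
    SET GLOBAL read_only = 1;
    -- Verify the setting took effect
    SELECT @@global.read_only;
    -- Inspect connections still established from Phabricator's pool,
    -- which could otherwise keep writing to this instance
    SELECT id, user, host, db FROM information_schema.processlist;
    -- Individual connections could be terminated with KILL <id>;
    -- in this incident, stopping the whole MySQL instance was chosen
    -- as the safer option.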

Conclusions

A series of events combined to trigger this incident.

  • dbproxy1003 and dbproxy1008 were not handled or prepared appropriately for the scheduled network maintenance, as they were not directly part of the maintenance (though the hosts they monitor were!). In the future, a special reminder has to be set to check for haproxy hosts that could potentially be involved.
  • The downtime was somewhat longer than expected (at the time, haproxy considered a host failed after 9 seconds: 3000 ms * 3 retries) and longer than in other recent network maintenances. Previous network maintenance had not caused an automatic failover.
  • Phabricator uses heavy connection pooling, which negatively affects the time a failover takes to complete (resulting in only some writes being diverted, and hence in the actual split brain)
  • It was thought that the host failed over to was in read_only mode, preventing an accidental split brain. This wasn't true: an incorrect read_only configuration on the misc MySQL replicas made the split brain possible, as the replica was writable. This was due to a mistake in the relatively new puppet role manifests (misc_multiinstance), caused by there being, at the time, two methods to configure read_only: in the template and as a config parameter (see the illustrative snippet below).
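
As an illustration of the read_only point above (this is not the actual WMF puppet code), a replica would normally be protected in the server configuration and verified at runtime; the file fragment and check below are a hedged sketch only.

    # my.cnf fragment on a replica (illustrative; on the WMF hosts this is
    # managed by the misc_multiinstance puppet role, not edited by hand)
    [mysqld]
    read_only = 1

    -- Runtime check of the kind the monitoring in phab:T172489 can alert on:
    SELECT @@global.read_only;   -- expected to return 1 on every replica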

Actionables

  • Status:    Done - Fix puppet code so the appropriate hosts are in read_only mode gerrit:450205
  • Status:    Done - Monitor read only variables on replicas and alert (on IRC) if, in the future, replicas are writable phab:T172489
  • Status:    Done - Increase the timeout, from 9 seconds to 60 seconds, for haproxy to consider a host hard down gerrit:450542 (illustrated in the sketch below)
  • Status:    Pending - (Not strictly related to this incident) haproxy lacks proper logging on the dbproxy roles (or at least on dbproxy1002) phab:T201021
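
As an illustration of the haproxy check timing mentioned in the conclusions and in the timeout actionable above, a hedged sketch of a backend stanza follows; the backend name, server address and exact option values are assumptions, not the real dbproxy configuration.

    # Illustrative haproxy backend, not the actual dbproxy config
    backend mariadb-m3
        option mysql-check user haproxy
        # A ~9 second hard-down window corresponds to e.g. a 3000 ms check
        # interval with 3 consecutive failures required:
        #   check inter 3000 fall 3
        # gerrit:450542 widened this to ~60 seconds; one way to express that
        # (illustrative values only) is a longer interval and fall count:
        server db1072 db1072.eqiad.wmnet:3306 check inter 5000 fall 12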