x1 DB outage, caused by too many connections, after the data centre switchover back to eqiad from codfw
- Too many connections to the database being held open
- Connections being held open waiting for table level locks
- Table level locks were apparently caused by the query killer killing long-running queries
- What was the cause of these long running queries?
It should be noted that during the eqiad->codfw switch Cognate was not deployed and the Cognate databases were not populated.
We observed a lot of secondary effects, and a lot of things that could possibly be connected, but couldn't figure out the root cause that led to this chain of events. The changes made to the Cognate code should not be seen as resolving the underlying issue.
- During the initial incident it appeared that googlebot may have been making large numbers of requests to wiktionaries. We don't know exactly how many of these requests triggered db queries, but the list of interwiki links provided by Cognate is cached in the parser cache and unless these requests all caused re-parses it is unlikely to be the cause of the issue.
- One idea is that there were a lot of write operations backed up in the job queue which all hit the DB at once when the JQ was restarted. However, the Cognate write rate is so low that we couldn't think of how there could be more than a few hundred jobs in there. It is probably worth checking for similar JobQueue related issues from other extensions hosted on x1, as the timeline fits rather well with the JQ being switched back on.
- Increased disk utilization and disk write latency is apparently normal on the DB servers after the switchover. This increased write latency is probably the cause of the extended write query times; queries taking 60+ seconds still seems far-fetched, but this is definitely a contributing factor. The servers in each datacentre appear to have different setups. Below are some differences on the x1 master for each switch regarding disk / query performance and latency:
- Eqiad -> Codfw
- No real contention issue spikes; the highest was 0.117 for row lock waits.
- Disk latency: spike on read to 5.7s for a tiny amount of time; generally no more than 1.3s for reads and 14ms for writes. Read latency descended to under 200ms after around 2 hours.
- Disk utilization: spike to 37%; 2 hours later at 11%, after 4 hours at 6%, and continuing to descend.
- Codfw -> Eqiad
- No immediate contention issues; these appeared 12 hours after the switchover.
- Disk latency: write latency kept bouncing between 0.5s and 1.4s for the period after the switch. Elevated read latency appeared 24 hours after the switch.
- Disk utilization: spike to 61% straight after the switch; 24 hours later (after a small dip) at 51%; at 48 hours still at 80%, then dropping to 2% within the next 6-12 hours.
- Codfw -> Eqiad
This is a step by step outline of what happened to cause the incident and how it was remedied.
- 13:58 - Switch back actually starts
- ... <More SAL logs relating to switchover>
- 14:15 END: (PASS) - Wipe and warmup caches in eqiad - t04_cache_wipe (switchdc/oblivian@neodymium)
- 14:15 START: - Switch MediaWiki master datacenter and read-write discovery records from codfw to eqiad - t05_switch_datacenter (switchdc/oblivian@neodymium)
- ... <More SAL logs relating to switchover>
- 14:21 START: - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/oblivian@neodymium)
- 14:21 END: (PASS) - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/oblivian@neodymium)
- 14:21:41 - First table level lock appears grafana
- 14:21:59 - First SlowTimer log logstash (one log), logstash (whole period).
- 14:22 START: - Start MediaWiki jobrunners, videoscalers and maintenance in eqiad - t09_start_maintenance (switchdc/oblivian@neodymium)
- 14:24 END: (PASS) - Start MediaWiki jobrunners, videoscalers and maintenance in eqiad - t09_start_maintenance (switchdc/oblivian@neodymium)
- 16:30 - jynus@naos: Synchronized wmf-config/db-eqiad.php: Fine-tune per-server load to reduce db connection errors (duration: 01m 27s)
- 16:51:51 - Spike in DB connections starts grafana
- 16:53:42 - Peak DB connections
- 17:05 - Last table level lock
- 17:14 - jynus@naos: Synchronized wmf-config/InitialiseSettings.php: Disable cognate- it is causing an outage on x1 (duration: 01m 06s)
- Adding monitoring within extensions, even when you don't feel that it is needed, is often helpful further down the line. It is probably worth adding such monitoring to any new extensions being deployed on the cluster so that in situations like this we have more information.
- Read-only modes in code are also a good idea; they avoid having to take all functionality offline.
- The Cognate extension & x1 in general should be watched carefully during the next switchover
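The read-only mode recommended above can be as simple as a flag checked on every write path, while read paths stay untouched. A minimal illustrative sketch (in Python rather than the extension's actual PHP; the class and method names are hypothetical, not the real Cognate implementation):

```python
class CognateStore:
    """Hypothetical repository with a read-only switch.

    When read_only is set, writes are refused instead of the whole
    extension having to be disabled; reads keep working.
    """

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.pages = set()

    def save_page(self, title):
        # Refuse the write, but leave all read paths untouched.
        if self.read_only:
            return False
        self.pages.add(title)
        return True

    def select_sites_for_page(self, title):
        # Reads are unaffected by read-only mode.
        return sorted(self.pages) if title in self.pages else []


store = CognateStore()
store.save_page("dictionary")
store.read_only = True                    # flip the switch during an incident
assert store.save_page("word") is False   # write refused
assert store.select_sites_for_page("dictionary") == ["dictionary"]
```

In an incident like this one, flipping such a flag via configuration would have degraded Cognate to read-only instead of disabling it entirely.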
- Fix selectSitesForPage to use getReadConnectionRef.
- Release / reuse connections as early as possible.
- Add a read-only mode to avoid the whole extension having to be disabled, as was done in this incident.
- Add stats tracking for db access methods counting calls to the production db methods as well as execution times.
- Dashboard: https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate
- Code: CognateRepo.php
- Created (and running) a maintenance script to delete pages that should not be in the database (due to use of read-only mode)
- Run the populate cognate pages maintenance script to add any entries that were missed during the read-only mode period.
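The per-method stats tracking described above can be done with a thin wrapper that counts calls and records execution times. A hedged sketch using an in-memory recorder (in production these would go to a statsd/Graphite client feeding the dashboard; the function names here are illustrative, not the actual Cognate code):

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a statsd client.
call_counts = defaultdict(int)
call_times = defaultdict(list)

def track_db_access(func):
    """Count calls to a DB access method and record execution time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            call_counts[func.__name__] += 1
            call_times[func.__name__].append(time.monotonic() - start)
    return wrapper

@track_db_access
def select_sites_for_page(title):
    # Placeholder for the real DB read.
    return [title]

select_sites_for_page("dictionary")
select_sites_for_page("word")
assert call_counts["select_sites_for_page"] == 2
assert len(call_times["select_sites_for_page"]) == 2
```

With per-method counts and timings in place, a spike like the one in this incident would show up immediately as an anomaly on the extension's own dashboard rather than only on the shared DB graphs.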
- The query-killing script can cause table-level locks, causing things to back up. Are these locks needed? Are there alternatives? They contributed to the traffic jam here.
- Do the database setups in each datacentre differ? If so why? Could this have contributed to the issue?
- Cognate Extension
- PHAB - Cognate does some updates synchronously, and others via the JQ. That may lead to inconsistencies in the DB. Simply writing synchronously all of the time would reduce odd JQ interactions for Cognate.
- Mention in extension writing docs that read only modes might be a good idea.
- Investigate the JobQueue immediately after it was turned back on as part of the data centre switch: were there lots of Echo jobs, or jobs that would touch the x1 cluster?
- Investigate the cause of the high write latency on db masters during / after the switch over. Is this something we need to expect during switches?