Incidents/2020-03-19 parsercache
document status: in-review
Summary
Parsercache databases got overloaded due to a malfunctioning host (pc1008), which resulted in spikes of connections on the other two active hosts and increased latency on our MediaWiki application servers.
Impact
- Query latency increased: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&from=1584378662200&to=1584387599259&fullscreen&panelId=31
- MediaWiki app servers had their workers saturated: https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&fullscreen&panelId=92&from=1584358493597&to=1584421429638
- Higher than usual response time: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&from=1584378662200&to=1584387599259&fullscreen&panelId=9
Detection
Icinga paged for pc1008, which was experiencing performance degradation:
18:43:14 <+icinga-wm> PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
Timeline
All times in UTC.
- 18:00 Degradation begins
- 18:00 pc1008 starts having performance issues; its disk latency increases and connections start to pile up on it
- 18:00 The other hosts (pc1007 and pc1009) also start accumulating idle connections as a result of pc1008 failing to handle connections as fast as usual
- 18:00 Average response time increases
- 18:43 <+icinga-wm> PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect
- 18:43-19:44 A number of SREs and 2 DBAs respond and troubleshooting starts
- 19:11 DBAs replace pc1008 with pc1010, a spare for a different pc group that only holds 1/3 of the keys (see the sharding sketch after this timeline). It was worth trying, as there were no more ideas and pc1008 had been checked for HW errors, misconfigurations and the like, with everything looking fine.
- 19:12 Response time, idle connections on the other hosts and latency all start to improve
- 19:24 Values are almost back to the same levels as before the incident (considering that 1/3 of the pc keys were gone)
- 19:24 Degradation stops
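For context on the "1/3 of the keys" note at 19:11: parsercache keys are sharded across the three pc sections, so each host serves roughly a third of the keyspace, and swapping one host for a spare from another group leaves that third effectively cold. A minimal illustrative sketch in Python (the hash function and key names here are assumptions for illustration, not the actual SqlBagOStuff sharding code):

# Illustrative only: models parsercache keys being sharded across the three pc sections.
import hashlib

SECTIONS = ["pc1007", "pc1008", "pc1009"]  # one active host per pc section

def section_for(key: str) -> str:
    """Map a cache key to a section by hashing it (assumed scheme, for illustration)."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return SECTIONS[digest % len(SECTIONS)]

counts = {host: 0 for host in SECTIONS}
for page_id in range(30000):
    counts[section_for(f"pcache:idhash:{page_id}")] += 1
print(counts)  # each host ends up with roughly a third of the keys

Under this model, replacing pc1008 with a spare that was not serving its section means roughly a third of lookups miss until the new host warms up, which matches the "1/3 of the pc keys were gone" caveat above.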
Conclusions
The hardware performance degradation was hard to detect via the usual checks: broken BBU, degraded RAID, disks with errors that hadn't been removed from the RAID, memory issues... As nothing appeared to be broken, DBAs didn't consider pc1008 as the core of the issue. The fact that all the parsercache hosts showed a similar connection spike pattern made us think that the problem was on the other side of the stack (MediaWiki).
We later learned, thanks to Brad, that parsercache has a "double write" behaviour we didn't know of: if one of the writes hangs, the connections opened for the other stay open until the request is processed or shut down.
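To make the mechanism concrete, here is a minimal sketch in Python with hypothetical names (this is not MediaWiki's SqlBagOStuff code): each ParserCache save writes two keys, which usually shard to two different servers, and every connection opened during a request is held until request shutdown, so a stall on pc1008 also leaves idle connections piling up on the healthy hosts.

import random

class FakeParserCacheServer:
    def __init__(self, name, hung=False):
        self.name = name
        self.hung = hung
        self.idle_connections = 0  # connections left hanging by stalled requests

    def write(self, key):
        """Pretend to write; a hung server never finishes within the request."""
        return not self.hung

def save_parser_output(servers, keys):
    """Write each key to a sharded server, keeping every connection opened so far
    until the request either finishes or stalls (modelled as returning early)."""
    opened = []
    for key in keys:
        server = random.choice(servers)  # stand-in for key-based sharding
        opened.append(server)
        if not server.write(key):
            # The request stalls on the hung host; every connection opened so
            # far (possibly to a healthy host) stays open and idle.
            for s in opened:
                s.idle_connections += 1
            return

pc1007 = FakeParserCacheServer("pc1007")
pc1008 = FakeParserCacheServer("pc1008", hung=True)
pc1009 = FakeParserCacheServer("pc1009")
servers = [pc1007, pc1008, pc1009]

for _ in range(9000):
    save_parser_output(servers, keys=["pcache:idhash:<page>", "pcache:idoptions:<page>"])

for s in servers:
    print(s.name, s.idle_connections)
# pc1008 accumulates the most idle connections, but pc1007 and pc1009 also grow
# for requests whose first write hit a healthy host and second write hit pc1008.

This toy model is consistent with Brad's explanation quoted under "What went poorly?" below.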
What went well?
- When we planned the parsercache refresh a year ago, we decided to buy an extra host as a spare, precisely for this kind of situation.
What went poorly?
- DBAs were not aware of this parsercache behaviour, so they didn't consider pc1008 affecting the other hosts as a possibility (later explained by Brad in https://phabricator.wikimedia.org/T247788#5975667):
Each write to ParserCache sets two keys into the backend, which will probably get sharded to two different servers. Once SqlBagOStuff opens a connection to one of the servers, it keeps that connection open until request shutdown. So if we assume that pc1008 is somehow failing in a way that has connections hang open for a while, we'd also see a smaller increase in idle open connections on pc1007 and pc1009 for the cases where ParserCache's first write goes to pc1007/pc1009 and the second one goes to pc1008. That seems consistent with what the three graphs show.
- Trying to get ahold of CPT via IRC wasn't possible.
- The hardware degradation pc1008 had was hard to detect and was only found a day later, after lots of testing (https://phabricator.wikimedia.org/T247787#5975506)
Where did we get lucky?
- Just to try something, we decided to replace pc1008 with pc1010 without much expectation, and it worked
How many people were involved in the remediation?
- 2 DBAs
- 3 SREs
- 2 WMDE Devs
Links to relevant documentation
This explanation by Brad summarizes what was happening on the MediaWiki side: https://phabricator.wikimedia.org/T247788#5975667 and https://phabricator.wikimedia.org/T247788#5976651
Actionables
- [RFC] improve parsercache replication, sharding and HA: https://phabricator.wikimedia.org/T133523
- Investigate pc1008 for possible hardware issues / performance under high load: https://phabricator.wikimedia.org/T247787
- Once pc1008 is back to full capacity, repool it to make sure it is fully fixed after re-creating the RAID
- Purge pc1010's old rows once it is out of rotation
- Parsercache sudden increase of connections: https://phabricator.wikimedia.org/T247788#5976651