Incidents/2019-10-31 wikidata

document status: final

Summary

Wikidata API calls were not getting responses (getting timeouts) due to DB read load due to backported changes reducing deadlocks around writing to the new terms store for wikibase.

Impact

Wikidata editors received timeouts to API requests, API response time for writes went through the roof. It seems like most edits from API calls were actually made, but the clients didn't get a response confirming that.

Detection

Humans in Telegram chat. Confirmed by Addshore.

Timeline

All times in UTC.

16:00 - Maintenance script migrating wb_terms data restarted
16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T236466) (duration: 01m 05s)
16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T234948) (duration: 01m 04s)
16:12 - Read rows on 1 db slave shot up https://phabricator.wikimedia.org/T236928#5620434
~16:18 - Edit rate on wikidata really started dropping
16:30 - Maintenance script migrating wb_terms data restarted (picking up code changes)?
16:38 - Reported timeout in UI editing on wikidata.org in Telegram chat
16:54 - Phabricator task created after Addshore spotted this message - https://phabricator.wikimedia.org/T236928
16:59 <jynus> killed rebuildItemTerms on mwmaint1002
~17:00 Edit rate on wikidata recovering, but drops again - https://phabricator.wikimedia.org/T236928#5620364
17:26 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)
~17:26 Edit rate on wikidata recovering again
17:29 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)

Conclusions

We could do with more alarms on things that often indicate a problem

What went well?

Was not a total outage, just severe slowness

What went poorly?

No alarms went off
Only a message in Telegram alerted us to an issue (not even a phab task)

Where did we get lucky?

People were on hand that knew what the problem was (as the issue did not coincide with deployment time)

How many people were involved in the remediation?

2 SREs?
2 Wikidata Devs

Actionables

Done More wikidata alerting https://gerrit.wikimedia.org/r/#/c/547404/ https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts
Done Add alerting to API response times for wikidata
Done Add alerting for wikidata edit rate (if below 100 per minute something somewhere is wrong)
Done Add alerting for MASSIVE database read rate on s8