document status: final
Wikidata API calls were not getting responses (getting timeouts) due to DB read load due to backported changes reducing deadlocks around writing to the new terms store for wikibase.
Wikidata editors received timeouts to API requests, API response time for writes went through the roof. It seems like most edits from API calls were actually made, but the clients didn't get a response confirming that.
Humans in Telegram chat. Confirmed by Addshore.
All times in UTC.
- 16:00 - Maintenance script migrating wb_terms data restarted
- 16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T236466) (duration: 01m 05s)
- 16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T234948) (duration: 01m 04s)
- 16:12 - Read rows on 1 db slave shot up https://phabricator.wikimedia.org/T236928#5620434
- ~16:18 - Edit rate on wikidata really started dropping
- 16:30 - Maintenance script migrating wb_terms data restarted (picking up code changes)?
- 16:38 - Reported timeout in UI editing on wikidata.org in Telegram chat
- 16:54 - Phabricator task created after Addshore spotted this message - https://phabricator.wikimedia.org/T236928
- 16:59 <jynus> killed rebuildItemTerms on mwmaint1002
- ~17:00 Edit rate on wikidata recovering, but drops again - https://phabricator.wikimedia.org/T236928#5620364
- 17:26 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)
- ~17:26 Edit rate on wikidata recovering again
- 17:29 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)
- We could do with more alarms on things that often indicate a problem
What went well?
- Was not a total outage, just severe slowness
What went poorly?
- No alarms went off
- Only a message in Telegram alerted us to an issue (not even a phab task)
Where did we get lucky?
- People were on hand that knew what the problem was (as the issue did not coincide with deployment time)
How many people were involved in the remediation?
- 2 SREs?
- 2 Wikidata Devs
- Done More wikidata alerting https://gerrit.wikimedia.org/r/#/c/547404/ https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts
- Done Add alerting to API response times for wikidata
- Done Add alerting for wikidata edit rate (if below 100 per minute something somewhere is wrong)
- Done Add alerting for MASSIVE database read rate on s8