Incident documentation/20191031-wikidata

From Wikitech
Jump to navigation Jump to search

document status: final

Summary

Wikidata API calls were not getting responses (getting timeouts) due to DB read load due to backported changes reducing deadlocks around writing to the new terms store for wikibase.

Impact

Wikidata editors received timeouts to API requests, API response time for writes went through the roof. It seems like most edits from API calls were actually made, but the clients didn't get a response confirming that.

Detection

Humans in Telegram chat. Confirmed by Addshore.

Timeline

All times in UTC.

  • 16:00 - Maintenance script migrating wb_terms data restarted
  • 16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T236466) (duration: 01m 05s)
  • 16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T234948) (duration: 01m 04s)
  • 16:12 - Read rows on 1 db slave shot up https://phabricator.wikimedia.org/T236928#5620434
  • ~16:18 - Edit rate on wikidata really started dropping
  • 16:30 - Maintenance script migrating wb_terms data restarted (picking up code changes)?
  • 16:38 - Reported timeout in UI editing on wikidata.org in Telegram chat
  • 16:54 - Phabricator task created after Addshore spotted this message - https://phabricator.wikimedia.org/T236928
  • 16:59 <jynus> killed rebuildItemTerms on mwmaint1002
  • ~17:00 Edit rate on wikidata recovering, but drops again - https://phabricator.wikimedia.org/T236928#5620364
  • 17:26 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)
  • ~17:26 Edit rate on wikidata recovering again
  • 17:29 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)

Conclusions

  • We could do with more alarms on things that often indicate a problem

What went well?

  • Was not a total outage, just severe slowness

What went poorly?

  • No alarms went off
  • Only a message in Telegram alerted us to an issue (not even a phab task)

Where did we get lucky?

  • People were on hand that knew what the problem was (as the issue did not coincide with deployment time)

How many people were involved in the remediation?

  • 2 SREs?
  • 2 Wikidata Devs

Actionables