Incident documentation/20190807-s8-cawiki-errors

From Wikitech
Jump to navigation Jump to search

document status: in-review

Summary

Table wb_terms in wikidatawiki that provides labels and description lookup of entities (but also getting labels and description of entities; works both ways) is being replaced by a set normalized tables we call TermStore. Part of TermStore, is a class called DatabaseTermIdsResolver, picked up the right host for retrieving wikidata data in client wikis (including cawiki) but didn't use the right db name -wikidatawiki- thus code was trying to get information from wrong database -cawiki-. The patch that fixes the issue gives more in-depth information on it. It was deployed five hours earlier on client wikis (The patch that enabled it on client wiki is 527087) then after caches got invalidated and started to use the new store.

Impact

Some pages on cawiki were returning a 500 error due to failing to connect to the database. It potentially impacted all non-wikidata wikis but cawiki and ruwiki use wikidata extensively enough for this to cause issues (retrieving property term data is not very commons in client wikis)

Detection

We got the alert on IRC:

PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]

Timeline

All times in UTC.

  • 11:31 the config variable gets deployed
  • 17:26 OUTAGE BEGINS
  • 17:33 Some mw* servers complain about having communication issues to s8 replicas
  • 17:42 Otto runs scap-file again as we were suspecting that a previous revert didn't make it to the intended servers; it didn't work:)
  • 17:45 We find out that cawiki (s7) is trying to access the cawiki db on s8 (wikidata) replicas
  • 17:46 We notice that this affects maintly cawiki (catalan wikipedia)
  • 17:53 We verify that some pages on cawiki return 500 errors due to database issues
  • 18:00 Train gets blocked until this is resolved
  • 18:15 Effie restarts hhvm and php-fpm on app canary servers to rule out any possible cache corruptions (s7 (cawiki) and s8 are off by one); it didn't work:)
  • 18:36 Reached out for more help
  • <snip mediwiki internals discussion>
  • 18:57 Amir says he knows what it up
  • 18:59 Reedy revert's said patch
  • 19:15 Jenkins finally merges the patch and Reedy deploys to prod
  • 19:17 OUTAGE ENDS (Voila)

Conclusions

We don't have proper knowledge of how wikidata works and interacts with wikis.

What went well?

  • People with Mediawiki knowledge were online and available to help.

What went poorly?

  • We didn't think to look for commits to mediawiki-config further back from before the issue started.

Where did we get lucky?

  • Amir was online and knew exactly to what this was related to.

Other information

Fatals/Exceptions on Kibana
  • Logs from mwlog: https://phabricator.wikimedia.org/P8883
  • SAL
    • 19:16 reedy@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Revert Switch property terms migration to WRITE_NEW on client wikis T225053 (duration: 00m 58s)
    • 18:15 jijiki: Restart hhvm and php-fpm on canary mw hosts
    • 17:42 otto@deploy1001: Synchronized wmf-config/InitialiseSettings.php: Retry - Revert "Switch high-traffic jobs to eventgate." (duration: 00m 58s)
    • 16:40 mobrovac@deploy1001: Synchronized wmf-config/InitialiseSettings.php: JobQueue: Revert switching high-traffic jobs to eventgate (duration: 00m 55s)
    • 16:34 mobrovac@deploy1001: scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see <> for details)

Actionables

None for now - TBA
NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.