Incidents/20160713-MediaWiki

From Wikitech

Summary

Any request that used the PoolCounter feature (search, rendering of pages on view) returned a 503 to the user because of an HHVM fatal error.

This happened because the duplicate PoolCounterClient.php entry point referenced in the MediaWiki configuration was removed in favor of the properly named PoolCounter.php one and extension.json. Reedy had prepared a patch to make the switch, but it had not been deployed when the wmf.10 train went ahead. Furthermore, the entry point was loaded with PHP's "include" and not "require", so PHP did not fatal when the file was missing (a failed "include" only raises a warning), and we didn't notice it immediately. The fatal occurred only when something actually invoked PoolCounter, leading to intermittent user-facing errors that depended on which pages users were viewing.

After checking fatal.log and seeing the error, Legoktm fixed up Reedy's patch and deployed it.

Timeline

This is a step-by-step outline of what happened to cause the incident and how it was remedied.

  • 17:48 First report by Steinsplitter in #wikimedia-operations about getting 503s
  • 17:50 Confirmed by others in channel on the <https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions> page
  • 17:53 legoktm looks at fatal.log on fluorine and sees errors like:
    2016-07-13 17:53:08 [V4aABApAMEoAAFRU2eIAAABO] mw1239 mediawikiwiki 1.28.0-wmf.10 fatal ERROR: [5c5c2c0c] PHP Fatal Error: Class undefined: PoolCounter_Client
  • 17:56 legoktm fixes, rebases and merges https://gerrit.wikimedia.org/r/298096 ("PoolCounterClient.php -> extension.json")
  • 17:57 legoktm syncs to mw1017 and tests using X-Wikimedia-Debug
  • 17:58 legoktm deploys to all servers
  • 18:00 confirmed fixed by #wikimedia-operations channel members
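
The fatal.log entry quoted at 17:53 has enough structure to be counted mechanically, for example per deployed version. The following is a minimal Python sketch, assuming every line follows the layout of that single sample. The field names (request_id, channel, error_id and so on) and both function names are assumptions made for illustration, not the actual log schema or an existing tool.

```python
import re
from collections import Counter

# Layout inferred from the one sample line in the timeline; the field
# names are guesses for illustration, not an official schema.
FATAL_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<request_id>[^\]]+)\] "
    r"(?P<host>\S+) (?P<wiki>\S+) (?P<version>\S+) "
    r"(?P<channel>\S+) (?P<level>\w+): "
    r"\[(?P<error_id>[^\]]+)\] (?P<message>.*)$"
)


def parse_fatal_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = FATAL_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_fatals_by_version(lines):
    """Count parseable fatal entries per MediaWiki version string."""
    counts = Counter()
    for line in lines:
        record = parse_fatal_line(line)
        if record is not None:
            counts[record["version"]] += 1
    return counts
```

Calling count_fatals_by_version on the lines of a log file would show whether a newly deployed version, such as 1.28.0-wmf.10 here, is producing fatals that the previous one did not.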

Conclusions

What weaknesses did we learn about and how can we address them?

This should have been noticed much sooner. We should have noticed in beta that the file was missing, and then again when wmf.10 was first deployed.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

  • Set up PoolCounter on beta cluster (bug T38891)
  • Audit configuration to make sure all extensions are being loaded using "require" and not "include"
  • Set up proper canary deploys that would notice fatals earlier...
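
The configuration audit in the second actionable could start from a simple text scan. The sketch below, in Python, flags lines of PHP source that use "include" or "include_once" so they can be reviewed and switched to "require". It is a line-based heuristic rather than a PHP parser, so it can produce false positives (for example, the word inside a string literal), and the function name find_soft_includes is hypothetical.

```python
import re

# Matches the PHP keywords `include` / `include_once` when they are not part
# of a longer identifier, a variable ($include) or a method call (->include).
SOFT_INCLUDE = re.compile(r"(?<![\w$>])(include(?:_once)?)\b")


def find_soft_includes(php_source):
    """Return (line_number, keyword, stripped_line) for each suspicious line.

    Heuristic only: lines that start with a comment marker are skipped,
    but string literals and inline comments are not understood.
    """
    findings = []
    for lineno, line in enumerate(php_source.splitlines(), start=1):
        stripped = line.lstrip()
        if stripped.startswith(("//", "#", "*", "/*")):
            continue
        match = SOFT_INCLUDE.search(line)
        if match:
            findings.append((lineno, match.group(1), stripped))
    return findings
```

Running find_soft_includes over the text of each configuration file gives a list of candidate lines to check by hand. A missing file loaded with "require" fails at load time instead of later, when the class is first used.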