Incidents/20160126-WikimediaDomainRedirection

From Wikitech

wikimedia.org domain redirection outage on 2016-01-26.

Summary

Due to a faulty configuration change deployed to the production application server cluster, all wiki sites on "*.wikimedia.org" domains redirected /wiki/* requests to www.wikimediafoundation.org. The problem persisted until the mistake was discovered, the change was reverted, and the invalid redirects were purged from the caching layer to undo the effects of HTTP caching. Many pages under the wikimedia.org domains were unavailable for about one hour (~18:14 to 19:19 UTC).

Timeline

Pre-Outage

  • Puppet SWAT start is delayed because the SWAT deploy is running long. We started our Puppet SWAT ~20 minutes late, pushing the entire incident well past the Puppet SWAT window. RobH and cmjohnson1 are on Puppet SWAT.
  • 17:20 cmjohnson: disabling puppet on mw cluster (part of puppetswat and pushing apache changes)
  • 17:33 An Apache configuration change, https://gerrit.wikimedia.org/r/#/c/265642/, gets merged. This patch causes all web sites under the .wikimedia.org domain to answer /wiki/* requests with an HTTP redirect to www.wikimediafoundation.org
  • 18:00 patch owner goes away from IRC for meeting + dinner, no longer testing
  • 18:10 cmjohnson ran apache-fast-test with incomplete URLs

Outage

  • 18:14 outage begins. cmjohnson1: starting puppet on mw cluster - this is the push of the Puppet SWAT patches to more than the single test mw system
  • 18:24 first user reports of 404 errors in #wikimedia-operations when creating pages
  • 18:28 https://phabricator.wikimedia.org/T124804 gets created
  • 18:31 < mutante> revert this https://gerrit.wikimedia.org/r/#/c/265642/4/modules/mediawiki/files/apache/sites/wikimedia.conf
  • 18:36 https://gerrit.wikimedia.org/r/#/c/265642/ gets reverted in https://gerrit.wikimedia.org/r/#/c/266551/
  • 18:44 _joe_: running salt --batch-size=20 -C 'G@cluster:appserver and G@site:eqiad' cmd.run 'puppet agent -t --tags mw-apache-config'
  • 19:09 Revert of the Apache configuration change on the application server cluster completes
  • 19:10 akosiaris starts issuing bans on the Varnish text caching cluster to clear bad cache entries in a specified order (first backends then frontends per DC, in eqiad, codfw, ulsfo, esams order)
  • 19:19 akosiaris and Joe finish issuing Varnish bans on all text Varnish caches (most of the problem is resolved for most users)
  • 19:36 akosiaris issues Varnish bans on mobile Varnish caches for completeness (small trailing problems, only for a fraction of mobile users)
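
The ban order in the 19:10 entry can be written down as a small plan generator. This is a minimal sketch, not the procedure the operators actually scripted: it assumes the entry means "for each data centre in the listed order, ban backends first and then frontends", and the function and constant names are hypothetical. Banning backends first means a frontend that has just been cleared should not be refilled with a bad redirect from a backend that still holds one.

```python
from typing import List, Sequence, Tuple

# Data centre order as listed in the 19:10 timeline entry.
DC_ORDER = ("eqiad", "codfw", "ulsfo", "esams")
# Assumed reading of the entry: within each data centre, backends before frontends.
LAYER_ORDER = ("backend", "frontend")


def ban_plan(
    dcs: Sequence[str] = DC_ORDER,
    layers: Sequence[str] = LAYER_ORDER,
) -> List[Tuple[str, str]]:
    """Return (data centre, cache layer) steps in the order bans are issued."""
    return [(dc, layer) for dc in dcs for layer in layers]


if __name__ == "__main__":
    # Dry run only: print the steps, do not contact any cache host.
    for step, (dc, layer) in enumerate(ban_plan(), start=1):
        print(f"{step}. issue ban on {layer} text caches in {dc}")
```

The sketch only produces the ordering. The actual ban expression and the mechanism for running it on each cache host are left out because the timeline does not record them.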

Conclusions

The Apache configuration backing the MediaWiki application cluster is a complex system that has its origins in the early days of the projects. Although some work has already been done to simplify and automate it over time, it still carries a significant amount of technical debt and lacks a rigid framework for testing it and for making changes to it safely. This should be rectified. Until then, more rigid procedures around review and deployment of changes should prevent this kind of incident from happening again.

Actionables

  • Status: Done. Revisit Puppet SWAT and general +2 merge procedures around Apache configuration changes
  • Implement rigid testing framework for Apache configuration changes:
    • Status: Unresolved (bug T45266). Write and implement tests for Wikimedia's Apache configuration (redirects.conf, etc.)
    • Status: Done (bug T72068). Jenkins: Re-enable lint checks for Apache config in operations-puppet
    • Status: Unresolved (bug T114801). operations-apache-config-lint replacement doesn't check syntax
      • Include a base URL file for all Apache testing, which should cover all projects and a broad base of sub-projects; currently each opsen generates this file as needed
  • Update status.wikimedia.org to count unexpected HTTP redirects as downtime, and make Icinga page for failures of this kind
  • Ideally we'd be able to test Apache config in beta. However, beta uses a mostly separate Apache config. Alex M has been working on slowly merging it with the production Apache config, and this patch was part of that effort. This specific issue looks like it would have been catchable in beta, as wwwportals is now part of the common (both beta and prod) config. TODO: Merge the rest?
  • Automate and/or better document the Varnish ban procedure for operations staff, so it can be carried out with more speed and confidence in these scenarios.
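
The actionables on a shared base URL file and on treating HTTP redirects as downtime both depend on the same basic check: does a response send the client to a different host than the one requested? The following is a minimal sketch of that check, not part of apache-fast-test or any existing Wikimedia tooling. The function name and example URLs are hypothetical, and a real check would also need an allowlist for redirects that are intentional.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

# HTTP status codes that instruct the client to follow a Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def unexpected_redirect(request_url: str, status: int, location: Optional[str]) -> bool:
    """Return True if the response redirects the request to a different host."""
    if status not in REDIRECT_STATUSES or not location:
        return False
    # Resolve relative Location values against the requested URL.
    target = urljoin(request_url, location)
    return urlsplit(target).hostname != urlsplit(request_url).hostname


if __name__ == "__main__":
    # Illustrative values shaped like the redirect described in the summary.
    print(unexpected_redirect(
        "https://commons.wikimedia.org/wiki/Main_Page",
        301,
        "https://www.wikimediafoundation.org/",
    ))
```

Because the function takes a status code and a Location value rather than performing the request itself, it can be run against recorded responses for every entry in a base URL file, or wrapped in a monitoring probe.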