Incident documentation/20120430-MWCoreVersions


Summary

We experienced problems with images, scripts, and other static resources (anything served from bits.wikimedia.org) failing to load. The problem started at 2012-04-30 23:34 UTC (4:34pm PDT) and was cleared up by 2012-05-01 00:26 UTC (5:26pm PDT).

Details

This started as a problem isolated to test2.wikipedia.org, as part of the 1.20wmf2 deployment. It turned out that, in the process of deploying MediaWiki to the Apache hosts, we filled up the main partition on some of them (notably mw60), so the deployed files on those hosts were incomplete. This manifested itself as some URLs (notably this one) returning a persistent 404 error, even though the file appeared to be there when checked on other hosts. Appending "?foobar" to the end of a broken URL caused the resource to load. We determined that requests for the broken URL were being sent to mw60, which didn't have the file. Because we have enabled a Varnish feature that persistently sends requests for a given URL to a single back end, the 404 was persistent.
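The behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes the back end is chosen by hashing the request URL, which is consistent with the "?foobar" observation but is not confirmed by this report. The host names other than mw60, the example path, and all function names are hypothetical.

```python
import hashlib

# Hypothetical back-end pool; only mw60 is named in the report.
BACKENDS = ["host-a", "host-b", "mw60", "host-c"]
# Hosts assumed to be missing the file because their partition filled up.
MISSING_ON = {"mw60"}


def pick_backend(url, backends=BACKENDS):
    """Map a URL to one back end deterministically by hashing the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def fetch_status(url):
    """Simulated HTTP status: 404 if the chosen back end lacks the file."""
    return 404 if pick_backend(url) in MISSING_ON else 200


def find_url_on(host, base="/static/example.css", limit=1000):
    """Try query-string variants of `base` until one hashes to `host`."""
    for i in range(limit):
        candidate = f"{base}?v={i}"
        if pick_backend(candidate) == host:
            return candidate
    raise LookupError(f"no variant of {base} mapped to {host}")
```

Under this model, a URL that hashes to the broken host fails on every request, while changing the query string changes the hash input and may route the same file to a healthy host.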

In an attempt to fix the problem, we deleted the "php-1.19" directory from all hosts. We thought it was no longer needed, since 1.20wmf1 is in use. This freed up a substantial amount of space. However, php was symlinked to php-1.19, and many scripts expected to find the images there, resulting in 503 errors for all sites (both 1.20wmf1 and 1.20wmf2). We deleted the broken php->php-1.19 symlink, symlinked php to php-1.20wmf1, and attempted to push the link out. Unfortunately, our sync scripts dereferenced the link and started pushing out the whole directory. Switching tactics, we used dsh to manually delete the php directory on each host and replace it with the proper symlink. After this, the problem was resolved.
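Two of the steps above lend themselves to a small local check: confirming that no symlink still points at a directory before deleting it, and repointing a symlink in place rather than copying what it points to. The following is a minimal sketch, not the procedure that was actually run. The function names are hypothetical, it assumes a POSIX filesystem, and it assumes the link and the version directories sit side by side in one parent directory.

```python
import os
import tempfile


def links_pointing_at(root, dirname):
    """List symlinks directly under root whose target is dirname."""
    hits = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.islink(path) and os.readlink(path).rstrip("/") == dirname:
            hits.append(name)
    return hits


def repoint_symlink(link_path, new_target):
    """Atomically repoint link_path at new_target (POSIX rename semantics)."""
    parent = os.path.dirname(os.path.abspath(link_path))
    if not os.path.isdir(os.path.join(parent, new_target)):
        raise FileNotFoundError(f"target does not exist: {new_target}")
    tmp_link = os.path.join(parent, f".{os.path.basename(link_path)}.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # rename() replaces the link itself; it does not follow it.
    os.replace(tmp_link, link_path)


def demo():
    """Rebuild the layout from the report in a temp dir and fix the link."""
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "php-1.19"))
    os.mkdir(os.path.join(root, "php-1.20wmf1"))
    link = os.path.join(root, "php")
    os.symlink("php-1.19", link)
    before = links_pointing_at(root, "php-1.19")  # still referenced
    repoint_symlink(link, "php-1.20wmf1")
    after = links_pointing_at(root, "php-1.19")  # safe to delete now
    return before, os.readlink(link), after
```

Running `links_pointing_at` before removing php-1.19 would have shown that php still referenced it. If php has already been replaced by a real directory, as happened after the sync, `os.replace` will refuse to overwrite it and the directory has to be removed first.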

Things to work on

  • Need to use tools that check the availability of a resource on every host when 404s are reported. Had we done this, we would have caught the mw60 issue sooner than we did.
  • Need to make php symlink fixup part of the deployment script process, and hopefully deprecate the use of these symlinks. See bug 36363.
  • Need to fix up partitioning on our Apaches, so that we aren't constantly hitting artificial resource constraints. Done.
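The first item above could be approached with a per-host check along the following lines. This is a rough sketch rather than an existing tool: the function names are hypothetical, and it assumes each back end can be reached directly over HTTP and will serve the resource when sent the public Host header, which may not match the actual network setup.

```python
import urllib.error
import urllib.request


def http_status(host, path, site="bits.wikimedia.org", timeout=5):
    """Fetch path directly from one back end, sending the public Host header."""
    req = urllib.request.Request(f"http://{host}{path}", headers={"Host": site})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 503
    except OSError:
        return None  # host unreachable or timed out


def failing_hosts(hosts, path, fetch=http_status):
    """Return {host: status} for every host that did not answer 200."""
    statuses = {host: fetch(host, path) for host in hosts}
    return {host: status for host, status in statuses.items() if status != 200}
```

Passing the fetch function as a parameter keeps the comparison logic testable without network access. For example, with a stub that returns 404 only for mw60, `failing_hosts` reports just that host.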