Incidents/20151026-MediaWiki

Summary

During the SF morning SWAT deploy I deployed patches that moved the Search EventLogging schema from the CirrusSearch repository to the WikimediaEvents repository. Upon deployment, ResourceLoader started emitting an error about duplicate module registration. This occurred because the CirrusSearch patch that removed the schema should have been deployed before the WikimediaEvents change that added it. I did not notice immediately because these errors are not included in `fatalmonitor` on fluorine, which I had open in my shell to monitor deployment issues. After a couple of minutes I noticed the errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor and started the revert process.

```
MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php: ResourceLoader duplicate registration error. Another module has already been registered as schema.Search
```

Timeline

15:13 ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Move search schema from cirrussearch -> wikimediavents (duration: 00m 19s)
15:13 varnish starts reporting errors to logstash
15:14 Reports in #wikimedia-operations of the site going down
15:18 Revert the three patches to WikimediaEvents in gerrit
15:21 ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents: rollback (duration: 00m 18s)
15:22 All error graphs return to normal

Conclusions

The fatalmonitor on fluorine is the easiest monitor to keep on screen while working from a laptop with minimal screen space, but it does not contain all information about fatals on the site. Perhaps fatalmonitor could integrate 5xx reporting from graphite or logstash. Rolling back changes in gerrit takes much too long when there are multiple patch sets and time is of the essence; the rollback should be done on tin directly, and the deployment branches fixed up after the production issue has been resolved. Finally, ResourceLoader should not fatal the site due to configuration issues. The problem should be logged and the site should continue serving requests as best it can.
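
The last point could look roughly like the sketch below. It is written in Python purely for illustration and is not MediaWiki's ResourceLoader code: `ModuleRegistry`, its `strict` flag and the `skipped` list are hypothetical names. The assumption is that a duplicate registration keeps the first definition, logs a warning, and lets the request continue.

```python
import logging

logger = logging.getLogger("resourceloader_sketch")


class ModuleRegistry:
    """Toy module registry (hypothetical; not MediaWiki's ResourceLoader)."""

    def __init__(self, strict=False):
        # strict=True mimics the behaviour seen in this incident:
        # a duplicate name raises and takes the request down.
        self.strict = strict
        self.modules = {}
        self.skipped = []

    def register(self, name, definition):
        """Register a module; return False if the name was already taken."""
        if name in self.modules:
            if self.strict:
                raise RuntimeError(f"duplicate registration: {name}")
            # Lenient mode: keep the first definition, log, and carry on.
            logger.warning("Ignoring duplicate registration of %s", name)
            self.skipped.append(name)
            return False
        self.modules[name] = definition
        return True


# Both extensions briefly register the same schema module,
# as happened while the two patches were deployed out of order.
registry = ModuleRegistry()
registry.register("schema.Search", {"source": "CirrusSearch"})
registry.register("schema.Search", {"source": "WikimediaEvents"})
```

Keeping the first definition is an arbitrary choice in this sketch; the point is only that the conflict ends up in the logs instead of turning into a site-wide fatal.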

Actionables

Include information about 5xx rate in fatalmonitor (bug T116627)
ResourceLoader should not fatal the site due to configuration issues (bug T116628)