Incidents/20140618-Wikitech

Summary

Wikitech failed for about 10 minutes. The proximate cause was fixing puppet; the ultimate cause was a bug in Apache puppet refactoring.

Timeline

Andrew B notices that puppet runs are failing on wikitech, fixes duplicate definition problem
Puppet is fixed, thus breaking Apache
Andrew B fixes hotfixes apache
Proper apache puppet fix is merged

Rough outline:

Earlier in afternoon I discovered that puppet was failing to run on virt1000 (aka wikitech) along with the other puppet servers (e.g. palladium). This was due to a duplicate package definition (libapache2-mod-passenger) that crept in somewhere around the 'Kill apache_module' patch [1].

Despite this being a catalog compilation failure, icinga failed to notify us about puppet failures.

I fixed the duplicate package [2] and forced a puppet update on palladium. Everything went fine, puppet ran cleanly and palladium's clients continued happily. So, I forced a puppet run on virt1000, which also ran cleanly. Alas, during the intervening puppet outage, virt1000 had accumulated a backlog of unapplied patches. Another Apache refactor patch [3] added a new (spurious) sites-enabled file to wikitech with some settings that conflicted with the proper wikitech virthost. When apache2 restarted, it just chose one, and chose poorly, such that the only functional vhost on wikitech redirected to virt1000. This worked poorly.

I fixed the accidental renaming with a couple of hurried patches [4][5]. After a bit of fumbling, wikitech is happy again.

[1] https://gerrit.wikimedia.org/r/#/c/140211/

[2] https://gerrit.wikimedia.org/r/#/c/140571/

[3] https://gerrit.wikimedia.org/r/#/c/140218/

[4] https://gerrit.wikimedia.org/r/140576

[5] https://gerrit.wikimedia.org/r/140582