Incidents/20140910-BetaCluster


Summary from Jeremy B

superm401/mattflaschen noticed the beta cluster bits 503s around 01:00 UTC (#wikimedia-labs).

In between Matt's report and Krinkle's reply, the puppetmaster on deployment-salt was killed by the oom-killer:

> Sep 10 05:51:58 deployment-salt kernel: [8255978.329157] Out of memory: Kill
> process 8394 (puppet) score 371 or sacrifice child
> Sep 10 05:51:58 deployment-salt kernel: [8255978.330110] Killed process 8394
> (puppet) total-vm:1647904kB, anon-rss:1487172kB, file-rss:2416kB
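
That oom-kill can be confirmed from the kernel log on deployment-salt. A minimal check, assuming the usual Ubuntu log locations (the instance may log elsewhere):

```
# Look for oom-killer activity on the puppetmaster
# (paths are the stock Ubuntu defaults, not verified on this instance).
dmesg | grep -i 'out of memory'
grep -Ei 'oom|out of memory' /var/log/kern.log /var/log/syslog
```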

I noticed it when Krinkle replied to Matt about seven hours later (~08:00 UTC), and then I started investigating.

  • Looked at `varnishadm health.debug`. It said both backends were sick, but I misread it.
  • Looked at `varnishlog` and immediately saw the 404 responses for the health checks.
  • Did my own curl tests against both backends; both returned 404 (see the sketch after this list).
  • Logged into one backend and looked at the Apache config (an arbitrary file in sites-enabled); it used srv/mediawiki/... as the docroot, and that path did not exist on the filesystem.
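
For reference, the direct backend test was roughly the following (a sketch: the backend hostnames match the instances mentioned below, but the port and request path are assumptions, not the exact varnish probe URL):

```
# Hit each apache backend directly and print only the HTTP status code.
curl -s -o /dev/null -w '%{http_code}\n' http://deployment-mediawiki01/
curl -s -o /dev/null -w '%{http_code}\n' http://deployment-mediawiki02/
# Both returned 404 instead of the expected 200, matching what varnish saw.
```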

Sometime around then I tried the steps from bug 70597. For some steps I didn't get an error message, but I also couldn't tell for sure whether they worked. I still don't know whether any of those steps require advanced Jenkins privileges (e.g. to mark a node offline/online).

I also looked at the puppet logs, tried a puppet run, and investigated the puppetmaster state. Normally I might have hesitated to do some of that on a new, unfamiliar system, but given that I knew the change to /srv/mediawiki had been made recently and the boxes were so completely broken, I figured it couldn't do too much damage.
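
The puppet poking was along these lines (a sketch under assumptions: the service name, lock-file path, and agent flags are stock Puppet 3 defaults and may differ on these instances):

```
# On deployment-salt: bring the puppetmaster back up after the oom-kill.
sudo service puppetmaster start

# On a client (e.g. deployment-bastion): remove a stale agent lock file
# left behind by a killed run, then force a run and watch the output.
sudo rm -f /var/lib/puppet/state/agent_catalog_run.lock
sudo puppet agent --test
```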

[10:14:53] <jeremyb> !log deployment-bastion killed puppet lock (file)
[10:15:09] <jeremyb> !log deployment-salt started puppetmaster && puppet run
[10:15:27] <jeremyb> !log deployment-mediawiki0[12] both had good puppet runs
[10:16:34] <jeremyb> !log deployment-salt had an oom-kill recently. and some box (maybe master, maybe client?) had a disk fill up
(^^ the same oom-kill quoted above)
[10:17:07] <jeremyb> !log deployment-bastion good puppet run
[10:20:02] <jeremyb> !log deployment-bastion /var at 97%, freed up ~500MB. apt-get clean && rm -rv /var/log/account/pacct*
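
The /var cleanup on deployment-bastion amounted to roughly the following (the clean/rm commands are the ones from the !log above; the df/du checks are standard and only illustrative):

```
# See how full /var is and what is eating the space.
df -h /var
sudo du -xsh /var/* | sort -h | tail

# Free space: drop cached .debs and old process-accounting logs.
sudo apt-get clean
sudo rm -rv /var/log/account/pacct*
```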

I addressed some puppet run problems (not sure whether these were before or after the !logs above): some quick hacks to get things working (e.g. killing the duplicate resource definition), plus a change which I thought was right, but it turns out we're moving off of that path too.

I fixed a few other things; for example, deployment-jobrunner01 was set in LDAP (a puppetVar via the wikitech interface) to use a non-existent hostname as its trebuchet master:

  • deployment_server_override=deployment-scap.eqiad.wmflabs.
  • deployment-videoscaler01.eqiad.wmflabs still has that (old?) setting

But I'm not worrying about that tonight. I audited the rest of the project; the other instances all have the same trebuchet master.
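
A quick way to see why that override was broken, and to audit other instances, is something like this (the hostname is the one from the override above; the ldapsearch base DN and puppetVar layout are assumptions about the wikitech LDAP tree, not verified):

```
# The override pointed at a hostname that does not resolve:
getent hosts deployment-scap.eqiad.wmflabs || echo 'no such host'

# Audit which instances carry a deployment_server_override puppetVar
# (base DN is illustrative; the real LDAP tree may differ).
ldapsearch -x -b 'ou=hosts,dc=wikimedia,dc=org' \
    '(puppetVar=deployment_server_override=*)' dn puppetVar
```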

Conclusions

What weaknesses did we learn about, and how can we address them?

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Bugzilla bugs or RT tickets linked for every step.

put them here