Incidents/20150515-LabsOutage
Summary
Around 00:00 on 2015-05-17 labvirt1003 entered an unhealthy state: ssh failed and many new processes started hanging. Existing VMs running on the system continued to accept ssh connections and behave somewhat reasonably. Around 08:00 several critical VMs were shut down gracefully. At 09:00 labvirt1003 was rebooted, all VMs gradually restarted, and services returned to normal.
No good explanation has been found for this problem.
Timeline
Because the cause of this outage is unknown, this timeline includes many details that may or may not be correlated to the outage.
- [18:00] In order to fix the 'Venom' vulnerability Moritz updated several qemu packages on all labvirt100* hosts: qemu-keymaps qemu-system qemu-system-arm qemu-system-common qemu-system-mips qemu-system-misc qemu-system-ppc qemu-system-sparc qemu-system-x86 qemu-utils. A simple suspend/resume was found adequate to switch a running VM over to the new binary, so Andrew started a simple script which suspended and resumed each labs VM in turn. A few instances have difficulties resuming from the suspended state, so Andrew continues to watch the running script and monitor CPU graphs on ganglia. Loads all appear normal.
- [23:00] Timo reports clock drift on several instances. Andrew assumes this is due to the instance suspension and advises that clocks be reset via ntpd. (This is likely unrelated to the outage: The clock drift is seen on instances both on and off labvirt1003.)
- [00:00] Ganglia loses contact with labvirt1003. A few minutes later, icinga alerts sound: SSH, puppet, RAID, eth0. Daniel briefly investigates and attributes this to 'libvirt on labvirt being really busy.' At this point the only shinken alarm to fire is a brief warning about the tool-labs homepage being down for a minute or two. Other monitored VMs are behaving well enough to keep shinken happy.
- [04:00] Andrew checks in on the suspend/resume script before bed, sees that labvirt1003 has dropped off of Ganglia, and kills the script. Labvirt1003 won't accept ssh connections, but hosted VMs seem generally fine. Via the console Andrew checks'top' and sees one instance, wikidata-wdq-mm, running at 800% CPU usage. In dmesg there are many warnings about CPUs ramping down due to overheating. Andrew's (incorrect) diagnosis: CPU-starvation due to overheating and a rampant VM. Andrew stops wikidata-wdq-mm to ease CPU usage.
- [04:30] It's Friday night and Andrew has IRC all to himself. Due to the fact that most hosted VMs are running fine, Andrew opts not to reboot the system pending a proper diagnosis and/or a fix to the (later determined nonexistent) cooling problems. Andrew files a ticket and sends emails, watches things for a while to ensure stability, and goes to bed.
- [06:30] Mark, Giuseppe and Ariel trickle on to IRC and begin investigating the problem. Giuseppe sees process lockups and many blocked processes, quickly lands in an uninterruptible sleep while tracing a process. Diagnosis: Reboot needed. Mark confirms via ILO that nothing is wrong with fans or temperature. Everyone is hesitant to reboot due to Andrew's warning that this will cause an outage.
- [08:30] Marc-Andre shuts down instances in preparation for a host reboot.
- [08:55] rebooting labvirt1003
- [09:00] labvirt1003 is rebooted and inspected for normal function
- [09:08] Marc-Andre starts all hosted instances gradually.
- [09:27] All instances are up, service is restored to normal.
Conclusions
There's no clear explanation for this issue, but a few possible triggers:
- The qemu package update
- Increased load due to frequent suspend/resume actions
- Some Cron job that ran at midnight
One hypothesis is that it was a kernel bug which was triggered by the cron job at midnight (and aligned by some activity pattern induced by the suspend/resume actions resulting from the qemu update):
- The error was only seen on labvirt1003, while the other nine updated systems didn't have such a problem
- The error occured briefly after midnight; there are multiple timed-out kernel tasks starting at 0:04:49 (these kick in after a two minute timeout, so started around 0:02:49)
Apart from interrupting people's weekend, the troubleshooting/resolving process went reasonably well. Andrew could have avoided a lot of commotion by simply rebooting the system as soon as he noticed the problem, but that wasn't the obvious right course at the time.
Actionables
Largely unknown. Some monitoring/alerting changes are in order (since people, in particular Andrew, 'noticed' the ganglia warnings rather than receiving pages.)
- Noticed that both tools-checker-01 and its failovers live on the same host; those need separated (T99347)
- We should consider to updating the kernels running on labvirt*, the updated kernel in trusty is based on a new stable upstream kernel (there's no obvious bugfix in the list of chanegs which would explain that error, though)
Affected Instances
Despite the host itself being a mess, most VMs were largely unaffected. Of course all were shutdown and restarted when the server it self was rebooted.
+--------------------------------------+-------------------------------+---------+------------+-------------+-------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+-------------------------------+---------+------------+-------------+-------------------------------------+ | 7095e081-1d5a-4f32-8e34-3fe0663d917b | accounts-appserver2 | ACTIVE | - | Running | public=10.68.16.54 | | a64d56a6-88f0-4d8c-bdc6-b1ed180755bd | bacdir | ACTIVE | - | Running | public=10.68.16.143 | | d1746dad-c8c6-4be6-acb1-ac83bcb15bb2 | backpuppet | ACTIVE | - | Running | public=10.68.16.172 | | 7289bc39-db91-457d-bced-31dc8b17b709 | bastion-pdns | ACTIVE | - | Running | public=10.68.16.128 | | c066f1a9-c757-4323-85f2-bf669205e415 | bastion-restricted1 | ACTIVE | - | Running | public=10.68.16.68, 208.80.155.155 | | a645edb8-fe00-43e1-9839-f9c0567e077d | butterfly-m4m | ACTIVE | - | Running | public=10.68.17.165 | | 670fb64c-58e0-487a-9a4b-242a70f0b0c3 | contributors-metrics | ACTIVE | - | Running | public=10.68.16.141, 208.80.155.168 | | d0da5ac8-34ca-43a9-b63d-64ee438c29cc | deployment-cxserver03 | ACTIVE | - | Running | public=10.68.16.150 | | 8a9c2ff4-ef08-4fd9-89aa-bc954e982c2d | deployment-db2 | ACTIVE | - | Running | public=10.68.17.94 | | 05cec48f-ed80-40a9-b6fc-eff9d3c40fbe | deployment-fluoride | ACTIVE | - | Running | public=10.68.16.190 | | 91247f8c-e524-4d20-9c9e-2fc2ae3cdc23 | deployment-mediawiki01 | ACTIVE | - | Running | public=10.68.17.170 | | 811cd53f-855b-490f-b28e-c80184600dd5 | deployment-mx | ACTIVE | - | Running | public=10.68.17.78, 208.80.155.193 | | fb5507a9-6488-47b8-9737-ed739f8faff5 | deployment-redis01 | ACTIVE | - | Running | public=10.68.16.177 | | ee28b0f9-7071-4724-a852-63b0a95a7416 | deployment-sentry2 | ACTIVE | - | Running | public=10.68.17.204 | | 1754601e-6b04-49d1-a1f9-0c85e361379b | deployment-zookeeper01 | ACTIVE | - | Running | public=10.68.17.157 | | cf61a2a9-7dc2-4569-8fb7-021cc63b464c | embed-demo | ACTIVE | - | Running | public=10.68.17.81 | | d516ce0d-c3e3-42fc-bb4e-55383ca35fb4 | encoding01 | ACTIVE | - | Running | public=10.68.17.26 | | 4b42c3dc-b4ae-4e75-a9e6-a1a8dde60998 | etcd02 | ACTIVE | - | Running | public=10.68.16.131 | | 2aa1b8db-9913-463b-9c1d-538e225350f9 | filippo-test-jessie | ACTIVE | - | Running | public=10.68.17.232 | | fea432bb-9dee-408e-a12f-f650ab050817 | flow-localpuppet | ACTIVE | - | Running | public=10.68.17.28 | | 511f9e77-0e15-4745-87f9-1a46f9c5a1fa | huggle-win32 | ACTIVE | - | Running | public=10.68.17.56 | | 89c0282a-3960-4db8-aa1d-1c02245b8a2b | integration-publisher | ACTIVE | - | Running | public=10.68.16.255 | | 37e1bf72-d42e-4cfd-b5c2-f042232f1b1c | integration-slave-trusty-1021 | ACTIVE | - | Running | public=10.68.16.8 | | 8fb72713-cfef-4748-b540-d2d595f1c0c7 | integration-zuul-server | ACTIVE | - | Running | public=10.68.17.180 | | 2c793c19-3e84-436c-81d4-62b4f24f6f82 | labs-vmbuilder-precise | SHUTOFF | - | Shutdown | public=10.68.16.115 | | 51782b35-4149-461d-bb02-78ab146d5e6c | language-dev | ACTIVE | - | Running | public=10.68.16.78 | | eec5b5c3-64a2-4d4e-9e73-b095e942e717 | language-testlink | ACTIVE | - | Running | public=10.68.16.252 | | e1d96d55-1df8-4da0-9412-b56938b7691c | maps-warper | ACTIVE | - | Running | public=10.68.17.33 | | 100416ee-c2cc-4f39-b560-dc50a0e447ed | marathon-master-01 | ACTIVE | - | Running | public=10.68.17.197 | | 3fcb96a0-7bd2-42c2-9b1e-6a63ddf7ef8e | marathon-slave-02 | ACTIVE | - | Running | public=10.68.18.57 | | 1341e84e-4f43-49ce-9859-d45aad4518da | master | ACTIVE | - | Running | public=10.68.16.153 | | 9e19aa6b-644c-4d9f-bd00-64e24c6feb37 | math-min | ACTIVE | - | Running | public=10.68.16.76 | | f596f00e-5771-43f6-805e-7bf97c97a0f9 | mobile-hierator4 | ACTIVE | - | Running | public=10.68.18.55 | | d6451356-145a-4c57-aabd-f89eb622ee99 | officetools | SHUTOFF | - | Shutdown | public=10.68.16.43, 208.80.155.161 | | d121be17-8399-4a09-88e7-f7e725bac897 | oophpui | ACTIVE | suspending | Running | public=10.68.16.148 | | ae42ae26-22d2-4f39-a92b-f125de8bf5d8 | otrs-test | ACTIVE | - | Running | public=10.68.16.236 | | c3a82ada-8d29-4915-b2ef-44d85994a7ab | otto-hadoop-master01 | ACTIVE | - | Running | public=10.68.18.39 | | 1c193409-2e52-4e82-a111-d9be6649cd9c | pdbhandler-1 | ACTIVE | - | Running | public=10.68.16.205 | | 7a5acd2f-070c-4392-8d83-bea76627e6af | phabdb | ACTIVE | - | Running | public=10.68.16.223 | | 71fc5647-2660-4c58-a34b-3ba11a30cdf0 | pronunciationrecording | ACTIVE | - | Running | public=10.68.17.41 | | 91a881b6-fdd6-494f-a987-98ca39278f3a | puppet-jmm-client1 | ACTIVE | - | Running | public=10.68.18.60 | | 46002ce4-49b7-46eb-943e-09ed2bd7d341 | quarry-main-01 | ACTIVE | - | Running | public=10.68.17.14 | | 634a0b71-1ea5-4655-b623-5d18d1a7016f | reflex | ACTIVE | - | Running | public=10.68.18.6 | | eb9c97ca-da20-4c6b-a05c-ade3336978dc | relic | ACTIVE | - | Running | public=10.68.16.162, 208.80.155.197 | | eaffba1b-e4cf-42c8-8384-67c3f9f7ec48 | sca1 | ACTIVE | - | Running | public=10.68.16.206 | | fb03cb95-1447-4705-98a6-aeefce3420b0 | sectools-web1 | ACTIVE | - | Running | public=10.68.16.168 | | 984833be-dfb9-4afc-ac31-08402be207bc | securepoll-farm | ACTIVE | - | Running | public=10.68.17.178 | | 37d67171-f2d2-4e5e-a29b-4ffc757019cb | servermon-tests | ACTIVE | - | Running | public=10.68.17.229 | | 7d318f8e-e116-41a3-afe6-28b9f1653494 | signwriting-ase-wiki | ACTIVE | - | Running | public=10.68.17.99, 208.80.155.178 | | f935259d-e57b-4afb-8e22-e83167d77be5 | staging-elastic01 | ACTIVE | - | Running | public=10.68.16.96 | | 00112297-e081-482d-b767-48f4359c8882 | staging-mc2 | ACTIVE | - | Running | public=10.68.17.71 | | 63356051-954e-4d82-965f-2718d5976fe9 | staging-ms-be01 | ACTIVE | - | Running | public=10.68.17.245 | | e2dc144b-bdb0-4833-8167-73e7c1e3aa3b | staging-mw01 | ACTIVE | - | Running | public=10.68.18.1 | | 20cd2542-f813-4f62-88cf-d2ee9b8b4632 | staging-rdb2 | ACTIVE | - | Running | public=10.68.16.240 | | 43e70ef9-78a6-407d-a64c-f6f1e523fd31 | structured-wikiquote | ACTIVE | - | Running | public=10.68.17.3 | | 1ad65adb-bb34-4db9-8b2f-a4d97bbfead4 | testtest | ACTIVE | - | Running | public=10.68.17.15 | | 213e3592-b964-4a18-bd48-9f05a33bce29 | tools-checker-01 | ACTIVE | - | Running | public=10.68.16.97 | | 31d1ff6e-f1ed-4dae-a3fd-8df42e81f215 | tools-checker-02 | ACTIVE | - | Running | public=10.68.16.17, 208.80.155.229 | | 30b98f1d-1c5a-49c1-b800-f4c535addc12 | tools-exec-07 | ACTIVE | - | Running | public=10.68.16.36 | | 651bbe97-12a5-4aa4-b324-4a2f89b0e91e | tools-exec-1216 | ACTIVE | - | Running | public=10.68.17.255, 208.80.155.225 | | 4131bf3d-774f-49b5-83a0-0d7ea146e872 | tools-exec-1219 | ACTIVE | - | Running | public=10.68.18.40, 208.80.155.228 | | 44aba1c0-8cc7-4346-9069-755566e3bc22 | tools-exec-1407 | ACTIVE | - | Running | public=10.68.18.16, 208.80.155.147 | | 78af272f-f268-4541-932d-63f5c931b31f | tools-exec-wmt | ACTIVE | - | Running | public=10.68.16.41 | | 8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5 | tools-mail | ACTIVE | - | Running | public=10.68.16.27, 208.80.155.162 | | 1ae5da93-f64e-4964-9cf6-ca2ad7f60721 | tools-services-02 | ACTIVE | - | Running | public=10.68.18.36 | | 141d6240-9d5c-4991-9e21-e80a371e49ea | tools-trusty | ACTIVE | - | Running | public=10.68.16.63, 208.80.155.194 | | 5fdc0eb2-bed9-483a-842b-f29c14b39a25 | tools-webgrid-generic-1401 | ACTIVE | - | Running | public=10.68.18.51 | | 1353e0ae-bca2-421e-9834-d3e92df30185 | tools-webgrid-lighttpd-1202 | ACTIVE | - | Running | public=10.68.18.46 | | 87fb80ae-5f4d-4a33-8048-1f87a7e7cf1f | tools-webgrid-lighttpd-1207 | ACTIVE | - | Running | public=10.68.16.215 | | 60b6b6e5-f407-4f05-965c-46ff33d9a3db | tools-webgrid-lighttpd-1210 | ACTIVE | - | Running | public=10.68.17.163 | | 25cdede2-3cc2-4981-860a-fb0d1b61159f | tools-webgrid-lighttpd-1402 | ACTIVE | - | Running | public=10.68.16.35 | | ad1b90fc-3fd0-410c-b1d8-a726109bd84b | tools-webgrid-lighttpd-1407 | ACTIVE | - | Running | public=10.68.17.251 | | 802b0bb3-cb8a-4b57-952d-ef89b1781342 | topicmaps-beta-launch | ACTIVE | - | Running | public=10.68.17.128 | | 937d2acb-c0c5-4edd-932c-e7e0bd4b9b41 | ttmserver-elasticsearch01 | ACTIVE | - | Running | public=10.68.17.73 | | 0302f09b-0257-40ec-bb8f-0bca6a8c399c | ttmserver-salt01 | ACTIVE | - | Running | public=10.68.17.115 | | e7486031-d9db-464b-be5d-8ae2ed27060a | verpremotemx | ACTIVE | - | Running | public=10.68.16.179 | | 9c5338a3-ecbf-498d-bdd6-efcd30df4c44 | wdq-mm-01 | ACTIVE | - | Running | public=10.68.16.218 | | 40f83247-91af-418a-8130-25077deb50fc | wikidata-wdq-mm | ACTIVE | - | Running | public=10.68.16.56 | | 98fb4235-d09c-4ca0-8551-a180874a7e89 | wikimetrics1 | ACTIVE | - | Running | public=10.68.16.80 | +--------------------------------------+-------------------------------+---------+------------+-------------+-------------------------------------+