Incident documentation/20150224-LabsOutage

Summary

At 05:30 on 2015-02-24 all of the labs instances hosted on virt1012 lost network connectivity. Investigation turned up very little, so virt1012 was rebooted and each instance was reset and restarted in turn. Instances were largely back to normal by 08:00.

Timeline

  • [05:30] Instances drop off network. Shinken fails to notify IRC until much later, for unclear reasons. Because virt1012 contains mostly refugees from last week's outage, nearly all the same instances are affected as during that outage.
  • [06:00] Yuvi begins investigating, but is derailed by the fact that nova reports confusing host information for the migrated instances:
| OS-EXT-SRV-ATTR:host                 | virt1012                                                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | virt1005.eqiad.wmnet                                                                |

hypervisor_hostname seems to reflect the original host rather than the current running state, but that's not at all obvious and the field is poorly documented.

  • [06:30] Andrew Bogott joins the investigation, frantically restarts nova services on labnet1001 and virt1012 to no avail.
  • [06:45] Yuvi and Andrew (independently) decide that this is a networking issue, as nova seems happy to schedule new instances on virt1012 and they start up and are promptly unreachable. No actual network symptoms or warnings are evident, though.
  • [07:10] Andrew makes preparations for rebooting virt1012 and runs a batch job to suspend all instances, in hopes that they can be resumed after the reboot without actual instance reboots. nova-compute crashes during the 'suspend', leaving lots of instances in an ERROR state.
  • [07:15] Giuseppe joins investigation, looks for network issues, finds some suspicious log lines:
lldpd[1534]: lldp_send: unable to send packet on real device for vnet15: No buffer space available
dnsmasq-dhcp[2355]: DHCP packet received on br1102 which has no address
  • [07:35] Out of ideas, everyone agrees to reboot virt1012. It restarts, instance networking is restored, and we've learned nothing.
  • [07:55] To clear the ERROR and SHUTDOWN states of many instances, Andrew runs 'nova reset-state --active' and then 'nova reboot' for all instances. Most instances recover and are fine; a few remain in the ERROR state. A repeat of reset-state/reboot restores everything to working order.
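
The 07:55 recovery step can be sketched as a small batch script. This is a minimal sketch under stated assumptions, not the commands that were actually run: it assumes the instance IDs have already been collected (for example from the 'nova list' output under Affected Instances), and the helper names recovery_commands and run_batch are hypothetical. It uses only the 'nova reset-state --active' and 'nova reboot' subcommands named in the timeline, and it defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess


def recovery_commands(instance_ids):
    """Build the reset-state/reboot command pair for each instance ID."""
    commands = []
    for instance_id in instance_ids:
        # Reset the recorded state first, then reboot, as in the timeline.
        commands.append(["nova", "reset-state", "--active", instance_id])
        commands.append(["nova", "reboot", instance_id])
    return commands


def run_batch(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Returns the list of commands that exited with a non-zero status.
    """
    failed = []
    for command in commands:
        print(" ".join(command))
        if dry_run:
            continue
        # Assumes the nova client is installed and credentials are configured.
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(command)
    return failed


if __name__ == "__main__":
    # Two IDs taken from the affected-instance list, for illustration only.
    ids = [
        "0153e94a-2f43-4b87-ac06-d6be88cb0c6c",  # opengrok-web
        "eded2579-f205-4d34-b935-d8fa0361c237",  # cephticon3
    ]
    run_batch(recovery_commands(ids))
```

A second pass, like the repeat described at 07:55, would call the same functions with only the IDs of the instances that are still in the ERROR state.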

Conclusions

No explanation has been found for the cause of this problem. Next time it happens, a swift reboot is probably the best approach.

Actionables

Most actionables are identical to those listed for last week's outage -- in general we need to reduce the vulnerability of Tools to such failures.

We should also investigate ways of improving shinken notification so that failures don't silence the exact warnings about those failures.

Affected Instances

Here is a complete list of instances affected, as reported by "nova list --all-tenants --host virt1012":

+--------------------------------------+-------------------------+---------+----------------+-------------+-------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+---------+----------------+-------------+-------------------------------------+
| 684e5f6f-3fbf-42a1-a44e-2953448448fd | accounts-mwoauth | ACTIVE | - | Running | public=10.68.17.44 |
| 6f147938-fe0c-4868-abd6-da7d0667c07f | bastion2 | ACTIVE | - | Running | public=10.68.16.66, 208.80.155.153 |
| 8b1c0173-bfad-45d4-a464-1f6772339b66 | bbdevel | ACTIVE | - | Running | public=10.68.17.31 |
| b141a571-5fb0-4c62-90b2-e76268511a6b | catalogcompiler | ACTIVE | - | Running | public=10.68.16.24 |
| eded2579-f205-4d34-b935-d8fa0361c237 | cephticon3 | SHUTOFF | - | Shutdown | public=10.68.16.132 |
| d33dba53-45b6-4525-943a-ad24b100cd48 | cg-puppetmaster | ACTIVE | - | Running | public=10.68.17.133 |
| e97213bf-a56e-4d79-a6e3-508e6157d19c | cvresearch-web | ACTIVE | - | Running | public=10.68.16.91 |
| f8d3db4c-6fd3-4ea7-9b80-ef1c50e4b0f8 | dannyb | ACTIVE | - | Running | public=10.68.16.139 |
| d0da5ac8-34ca-43a9-b63d-64ee438c29cc | deployment-cxserver03 | ACTIVE | - | Running | public=10.68.16.150 |
| aec2245c-e965-4f06-8496-a0d9abf519ee | deployment-db1 | ACTIVE | - | Running | public=10.68.16.193 |
| 0294f3af-eba5-4ac8-9205-7c1aba8808d9 | deployment-logstash1 | ACTIVE | - | Running | public=10.68.16.134 |
| 09bbebac-631d-4126-ad3e-48ef64e72eb8 | deployment-restbase02 | ACTIVE | - | Running | public=10.68.16.234 |
| fcaa135e-3fce-4fae-afe0-ded789fe6f6a | deployment-sca01 | ACTIVE | - | Running | public=10.68.17.54 |
| 4b75951b-d874-4805-a4a7-28bd5c3ba6b4 | designate-devel | ACTIVE | - | Running | public=10.68.17.102 |
| 07976e9b-6676-46d8-afc2-ee1cfd3a25b1 | dns-test-dzahn | REBOOT | reboot_started | Shutdown | public=10.68.16.232 |
| 4c354f1e-c733-4e01-ab25-a7954e3d8d67 | dynamicproxy-gateway | ACTIVE | - | Running | public=10.68.16.65, 208.80.155.156 |
| 5d3003d5-2d9a-4c88-80a5-b9a3db8415ea | etcd02 | ACTIVE | - | Running | public=10.68.17.157 |
| 8eb0fd07-292d-448e-9124-76b5e51317b5 | grantreview-dev | ACTIVE | - | Running | public=10.68.16.158 |
| 51dc707e-f9f8-4e59-a4d9-5073908d22e1 | hovercards | ACTIVE | - | Running | public=10.68.17.136 |
| 33da85da-eea6-4bf5-977c-5a75cc32215f | incident-test | ACTIVE | - | Running | public=10.68.17.109 |
| 2198a8cf-da89-416b-bed5-a0521b122d0f | integration-slave1010 | ACTIVE | - | Running | public=10.68.17.200 |
| 77a8ccdd-f29e-4d07-89e4-ab9f13f0248b | ipsec-c8 | ACTIVE | - | Running | public=10.68.16.38 |
| 0db39308-f680-414a-8d15-48fc21d65859 | ipsec-c9 | ACTIVE | - | Running | public=10.68.17.202 |
| 42cdf096-eedf-4f64-8995-ab9874976a5f | jawiki-echo | SHUTOFF | - | Shutdown | public=10.68.16.174 |
| e9ac64d1-5594-49dc-98fb-573ba98d29ef | joal-hadoop-master | ACTIVE | - | Running | public=10.68.16.220 |
| d6ff0d45-023e-4299-9b27-cdc93500cf30 | joal-hadoop-worker1 | ACTIVE | - | Running | public=10.68.17.203 |
| 7f65413f-b449-458b-968d-89afb15f8915 | joal-hadoop-worker2 | ACTIVE | - | Running | public=10.68.17.206 |
| a25019a1-fa5f-4c5a-be57-ff975c2e4552 | legoktm | ACTIVE | - | Running | public=10.68.17.84 |
| 3cc2200a-5432-4fe0-8db6-33ec35732b12 | map | ACTIVE | - | Running | public=10.68.16.181 |
| e381861f-aa7e-4d7a-894e-c919e86c73e3 | mathoid2 | ACTIVE | - | Running | public=10.68.16.194 |
| b7076a0a-008f-4282-88b2-c5c4ea0b3ace | megacron-two | ACTIVE | - | Running | public=10.68.16.49, 208.80.155.149 |
| e082a270-d3f1-4934-8659-3224d06de417 | mlp | SHUTOFF | - | Shutdown | public=10.68.16.3 |
| d05d5ec9-48d1-42cc-96de-b40c485bed51 | mwui | ACTIVE | - | Running | public=10.68.16.61 |
| 20853100-12c6-480d-9eb7-a3d9e1864280 | nemobis | ACTIVE | - | Running | public=10.68.17.131 |
| 0153e94a-2f43-4b87-ac06-d6be88cb0c6c | opengrok-web | ERROR | - | Shutdown | public=10.68.16.184 |
| d65b95f9-d28f-484c-a98f-7119aa085b89 | openid-wiki | SHUTOFF | - | Shutdown | public=10.68.16.185 |
| 46eb5169-4168-47ff-a7c9-037b8c2097ce | openstack-juno-testing | ACTIVE | - | Running | public=10.68.17.87 |
| 1e4668ee-d3aa-4ff7-9dbe-819267f6ec84 | otto-cass2 | ACTIVE | - | Running | public=10.68.17.60 |
| 6dbee509-88fb-43ac-990c-f8e47f16d7e2 | pirsquared-dev | ACTIVE | - | Running | public=10.68.17.57 |
| 9fcfcd5f-7467-43b3-bc54-bfd02e5fbfc3 | pubsubhubbub | ACTIVE | - | Running | public=10.68.16.69 |
| 89040a4e-6eeb-466b-8a22-484fe37f2eb0 | rcstream | ACTIVE | - | Running | public=10.68.17.114, 208.80.155.180 |
| 0142a505-7072-479b-ae41-de35e6ee4110 | staging-palladium | ACTIVE | - | Running | public=10.68.17.17 |
| 23b5633f-1e4f-41d3-b8f3-c3c557c08f92 | testlabs-dnsbreak | ACTIVE | - | Running | public=10.68.16.169 |
| 6c9327c4-990c-48da-849e-21ce845054ad | testlabs-dnscmdline | ACTIVE | - | Running | public=10.68.16.215 |
| 120cc401-ed7a-44c5-b905-2d0eae23b6af | tools-exec-03 | ACTIVE | - | Running | public=10.68.16.32, 208.80.155.142 |
| 30b98f1d-1c5a-49c1-b800-f4c535addc12 | tools-exec-07 | ACTIVE | - | Running | public=10.68.16.36, 208.80.155.146 |
| 5cd684db-d0a6-4241-a11f-daf4c1b2f717 | tools-exec-09 | ACTIVE | - | Running | public=10.68.17.64, 208.80.155.152 |
| 523df61c-07f0-41ba-924d-e2b8e474b4d7 | tools-exec-cyberbot | ACTIVE | - | Running | public=10.68.16.39 |
| 96c37c36-970b-4cc7-a7ba-d1ee90a225b5 | tools-submit | ACTIVE | - | Running | public=10.68.17.1 |
| cdce426b-ef6f-47e7-96e4-bcb3647f4709 | tools-webgrid-04 | ACTIVE | - | Running | public=10.68.17.174 |
| 79aeb31c-a1c1-41af-9e00-df2c7e248924 | tools-webgrid-tomcat | ACTIVE | - | Running | public=10.68.16.29 |
| 8d92c507-d253-425d-b7f4-2af3678a39ae | tools-webproxy | ACTIVE | - | Running | public=10.68.16.4, 208.80.155.131 |
| 7d4a9768-c301-4e95-8bb9-d5aa70e94a64 | tools-webproxy-01 | ACTIVE | - | Running | public=10.68.17.139 |
| 1b7e971a-36d8-4d5c-8130-7211a4d00e2e | tools-webproxy-02 | SHUTOFF | - | Shutdown | public=10.68.17.145 |
| 8df6ac98-05b8-4d60-bc5c-15b3c206c450 | toolsbeta-mail | ACTIVE | - | Running | public=10.68.16.113 |
| 5584c52e-619d-424b-b060-14b385754147 | toolsbeta-master | ACTIVE | - | Running | public=10.68.16.146 |
| 31e8206d-fa5c-4e62-a805-8cfb7def1f46 | toolsbeta-puppetmaster3 | ACTIVE | - | Running | public=10.68.16.92 |
| a0a49d0b-0fae-45bf-938b-7c942567fa8c | upload-wizard | SHUTOFF | - | Shutdown | public=10.68.16.228 |
| 29704829-3c25-4f79-b6f6-ed985946ec1f | wdq-bg1 | ACTIVE | - | Running | public=10.68.16.156 |
| 0bb0601e-381d-491d-a580-c13ca963da8c | wdq-bg2 | ACTIVE | - | Running | public=10.68.16.159 |
| 29f1b7a1-b3a5-4be7-8af3-81e620d69555 | wdq-bg3 | ACTIVE | - | Running | public=10.68.16.168 |
| 09733bff-d485-40d8-9a7f-4a3670509741 | wikidata-reports | SHUTOFF | - | Shutdown | public=10.68.17.34 |
| 303fcd47-12dd-4c00-8961-13dd82d03ee7 | wikimetrics-staging1 | ACTIVE | - | Running | public=10.68.16.77 |
+--------------------------------------+-------------------------+---------+----------------+-------------+-------------------------------------+
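
To find the instances that still need attention in a listing like the one above, the table can be parsed and grouped by status. The following is a rough sketch that assumes the pipe-delimited layout shown here; the function names parse_nova_table and count_by_status are hypothetical, and the sample input is a three-row excerpt of the table with shortened border lines.

```python
from collections import Counter


def parse_nova_table(text):
    """Parse pipe-delimited 'nova list' output into a list of row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the +----+ border lines.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            # The first data line holds the column names.
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def count_by_status(rows):
    """Count instances per value of the Status column."""
    return Counter(row["Status"] for row in rows)


# Excerpt of the table above; border lines shortened for brevity.
SAMPLE = """\
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
| 0153e94a-2f43-4b87-ac06-d6be88cb0c6c | opengrok-web | ERROR | - | Shutdown | public=10.68.16.184 |
| eded2579-f205-4d34-b935-d8fa0361c237 | cephticon3 | SHUTOFF | - | Shutdown | public=10.68.16.132 |
| 6f147938-fe0c-4868-abd6-da7d0667c07f | bastion2 | ACTIVE | - | Running | public=10.68.16.66, 208.80.155.153 |
+----+------+--------+------------+-------------+----------+
"""

if __name__ == "__main__":
    rows = parse_nova_table(SAMPLE)
    print(count_by_status(rows))
    # Names of instances that are not ACTIVE and may need reset-state/reboot.
    print([row["Name"] for row in rows if row["Status"] != "ACTIVE"])
```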