Incidents/20150518-LabsOutage

Summary

Labvirt1001 failed at around 16:50 on May 18th. All hosted instances (including two bastions) became unresponsive. The system was rebooted and all instances were restarted; normal service resumed by 17:20.

Timeline

  • [ 16:00 ] Andrew restarts a script that suspends/resumes instances affected by the Venom issue. (This fact is included because it may correlate this outage with one a few days ago.)
  • [ 16:45 ] Shinken sends 'host down' alerts for a couple of tools instances. Andrew notes that both instances are on labvirt1001 and that labvirt1001 is not responding to ssh attempts. He connects to the serial console, which is also unresponsive.
  • [ 16:53 ] Andrew reboots labvirt1001. Once it's up he runs a dist-upgrade to address potential (speculative!) kernel issues, and reboots a second time.
  • [ 17:05 ] Andrew runs a scripted 'start' of each instance formerly running on labvirt1001.
  • [ 17:20 ] All instances have resumed normal operation.
  • [ 17:30 ] Inspection of logs by Coren shows no unusual activity until 14:42:02, at which point the kernel loses its lunch entirely. A half dozen broken RIPs cause kernel oopses on as many cores, with every other core being halted as softlocked within 30 seconds of that (most likely as a direct consequence). As far as can be told, the kernel simply ground abruptly to a halt, bringing all of userland with it; there are no indications of anything amiss or of unusual activity beforehand.
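
The script used for the 17:05 'start' step is not recorded here. As a rough sketch only, the following shows one way such a step could be prepared: it parses an ASCII table in the format of the Affected instances listing and prints one `nova start <id>` command per instance. The function names, the `skip_names` parameter, and the choice to print commands rather than execute them are all assumptions made for illustration; the three sample rows are copied from the listing in this report.

```python
import shlex


def parse_nova_table(text):
    """Parse an ASCII table (the layout printed by `nova list`) into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +-----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first |...| line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def start_commands(rows, skip_names=()):
    """Build one `nova start <id>` command per row.

    skip_names is a hypothetical parameter for instances that should stay
    off (for example, ones that were already shut down before the outage).
    """
    return [
        "nova start " + shlex.quote(row["ID"])
        for row in rows
        if row["Name"] not in skip_names
    ]


# Three rows copied from the Affected instances table, for illustration.
SAMPLE = """\
+--------------------------------------+-------------------------+---------+------------+-------------+-------------------------------------+
| ID                                   | Name                    | Status  | Task State | Power State | Networks                            |
+--------------------------------------+-------------------------+---------+------------+-------------+-------------------------------------+
| 4ede1bb6-9af7-4e37-b006-bcb9f0b39182 | bastion1                | ACTIVE  | -          | Running     | public=10.68.16.5, 208.80.155.129   |
| 48cf430a-41ec-4548-8d0d-363a59770d5f | educationdashboard-i18n | SHUTOFF | -          | Shutdown    | public=10.68.16.235                 |
| 5154e889-ce2c-46c2-b07c-9f383707e417 | tools-exec-1201         | ACTIVE  | -          | Running     | public=10.68.17.49, 208.80.155.203  |
+--------------------------------------+-------------------------+---------+------------+-------------+-------------------------------------+
"""

if __name__ == "__main__":
    instances = parse_nova_table(SAMPLE)
    # Print the commands for review instead of running them directly.
    for command in start_commands(instances, skip_names={"educationdashboard-i18n"}):
        print(command)
```

Printing the commands first gives the operator a chance to review the list (and to leave deliberately stopped instances alone) before anything is started.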

Affected instances

+--------------------------------------+-------------------------------+-----------+------------+-------------+-------------------------------------+
| ID                                   | Name                          | Status    | Task State | Power State | Networks                            |
+--------------------------------------+-------------------------------+-----------+------------+-------------+-------------------------------------+
| 61d805aa-c65e-47a0-ae91-2acc78c18ead | abusefilter-global-main       | ACTIVE    | -          | Running     | public=10.68.16.121                 |
| 5946324b-38ab-4b6d-a154-cb79114237b9 | bastion-restricted-pdns       | ACTIVE    | -          | Running     | public=10.68.17.243, 208.80.155.209 |
| 4ede1bb6-9af7-4e37-b006-bcb9f0b39182 | bastion1                      | ACTIVE    | -          | Running     | public=10.68.16.5, 208.80.155.129   |
| 582bffe4-fa20-4f13-a47b-a58dcac2c01d | citoidtest                    | ACTIVE    | -          | Running     | public=10.68.16.182                 |
| e2c147fd-dc09-417a-8d11-ce8a2a652467 | deployment-bastion            | ACTIVE    | -          | Running     | public=10.68.16.58, 208.80.155.191  |
| 94d3fa4b-5bf7-4c15-acdf-35d20bb4942d | deployment-cache-text02       | ACTIVE    | -          | Running     | public=10.68.16.16, 208.80.155.135  |
| a9a9522f-883d-4f24-b290-960c57a91f2d | deployment-elastic08          | ACTIVE    | -          | Running     | public=10.68.17.188                 |
| 15d6d50c-aae9-4320-9e48-ea3022fba95f | deployment-memc03             | ACTIVE    | -          | Running     | public=10.68.16.15                  |
| b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05          | ACTIVE    | -          | Running     | public=10.68.16.120                 |
| d46df8b9-6c41-409d-9853-b2b4dc876088 | deployment-pdf01              | ACTIVE    | -          | Running     | public=10.68.16.73                  |
| a71ec107-2c2a-4a5a-bdb5-d35d1ca95302 | deployment-restbase01         | ACTIVE    | -          | Running     | public=10.68.17.227                 |
| 5a4610ff-3fb1-443c-92f2-995ed63d3e79 | deployment-rsync01            | ACTIVE    | -          | Running     | public=10.68.17.66                  |
| abb50762-93d0-4c9d-8853-adbbb6b56e00 | deployment-salt               | ACTIVE    | -          | Running     | public=10.68.16.99                  |
| ec228fb1-7cca-4c1b-9f5f-63bfc0aee45c | deployment-test               | ACTIVE    | -          | Running     | public=10.68.16.149                 |
| 02bed745-a849-4f95-8d0c-ef2633b19ac0 | deployment-urldownloader      | ACTIVE    | -          | Running     | public=10.68.16.135                 |
| d32eb3e7-b00b-41d2-bb11-ed879d2ddfcb | diffengine                    | ACTIVE    | -          | Running     | public=10.68.17.127                 |
| 48cf430a-41ec-4548-8d0d-363a59770d5f | educationdashboard-i18n       | SHUTOFF   | -          | Shutdown    | public=10.68.16.235                 |
| 51d29587-7e81-42fd-bb99-1a384913eefe | ee-flow-extra                 | ACTIVE    | -          | Running     | public=10.68.16.102                 |
| 3c0dc5b1-b0f0-4abd-abc9-d7b987d1f051 | etcd01                        | ACTIVE    | -          | Running     | public=10.68.16.130                 |
| 06eeac90-a10e-4c7c-8c49-bbb566fc3936 | etcd03                        | ACTIVE    | -          | Running     | public=10.68.16.132                 |
| 05820cfb-748e-4d16-8273-4b0c623fcd58 | firstinstance                 | SHUTOFF   | -          | NOSTATE     | public=10.68.16.212                 |
| 9f6c33ca-55c0-45d2-9e3a-23564f0fdc63 | graphite-trusty               | ACTIVE    | -          | Running     | public=10.68.17.181                 |
| b76d3c87-ed26-4ad0-aa66-49bca3d7496b | huggle-d2                     | ACTIVE    | -          | Running     | public=10.68.17.194                 |
| 82d56208-ce2d-438d-88fa-b2c77bacdf9d | icinga                        | ACTIVE    | -          | Running     | public=10.68.16.195                 |
| 92196d8b-2520-4fc1-b4f8-93c29c4661fb | integration-raita             | ACTIVE    | -          | Running     | public=10.68.16.53                  |
| 879166bf-36c6-45a5-b26b-fb7b6d3c0520 | integration-slave-trusty-1013 | ACTIVE    | -          | Running     | public=10.68.18.28                  |
| f8e5e68b-a6f7-4f9d-ac5d-280bb3e260d4 | integration-slave-trusty-1015 | ACTIVE    | -          | Running     | public=10.68.18.30                  |
| 9cea892a-e35d-47fa-afe3-4e0483f556cb | kafka02                       | ACTIVE    | -          | Running     | public=10.68.17.240                 |
| e0ec7d63-3d7a-4897-a6e5-834dd015a1ad | kartotherian1                 | ACTIVE    | -          | Running     | public=10.68.16.117                 |
| 04f65926-79fb-4c6e-b91f-276eb0e19e44 | language-replag-slave         | SHUTOFF   | -          | Shutdown    | public=10.68.16.248                 |
| 0e82f3c8-af65-433a-89dc-0f3425e7f585 | maps-tiles2                   | ACTIVE    | -          | Running     | public=10.68.17.110                 |
| 8545bf86-f068-483b-a8b3-c5d43981fd17 | mediawiki-verp                | ACTIVE    | -          | Running     | public=10.68.17.11                  |
| 35519d06-b932-4073-9d5d-bb14689c15f8 | mwreview-proxy-test           | ACTIVE    | -          | Running     | public=10.68.16.83                  |
| 3ef62478-dc92-47b5-8385-78a3010115d3 | osmit-cruncher1               | SUSPENDED | resuming   | Shutdown    | public=10.68.17.92                  |
| 3ed900d6-e943-43c7-982a-99a1fda75aa8 | puppet-jmm-client2            | ACTIVE    | -          | Running     | public=10.68.16.101                 |
| 6a73ec36-5f5b-4074-9c30-128a738f91ee | puppet-jmm-salt-trusty-minion | ACTIVE    | -          | Running     | public=10.68.16.238                 |
| 2e2e7624-7264-4c28-9cf5-16ccefa794a4 | puppet-mailman                | ACTIVE    | -          | Running     | public=10.68.17.177                 |
| 0d61121b-5f29-4c3c-a5db-3a2b5f20ad56 | sol                           | ACTIVE    | -          | Running     | public=10.68.17.29                  |
| 2f7a4792-0e76-4e32-b53c-204d5c54c9b8 | staging-cache-text01          | ACTIVE    | -          | Running     | public=10.68.18.4                   |
| e40ecd84-c0b2-4148-a7c8-9f94ab34f5e4 | staging-eventlogging          | ACTIVE    | -          | Running     | public=10.68.16.199                 |
| 3842a06c-6b40-4bd8-845b-539adb9259df | staging-ms-be03               | ACTIVE    | -          | Running     | public=10.68.17.249                 |
| 555bef0f-c83b-41ea-bf09-1e359a17f4cf | staging-rdb01                 | ACTIVE    | -          | Running     | public=10.68.17.193                 |
| 01fc4af5-0f1a-44c7-bc17-9425f60d6235 | staging-tin                   | ACTIVE    | -          | Running     | public=10.68.16.110                 |
| dc88c3f6-b685-4e5a-8a54-436b63147497 | tools-bastion-02              | ACTIVE    | -          | Running     | public=10.68.16.44, 208.80.155.132  |
| 5154e889-ce2c-46c2-b07c-9f383707e417 | tools-exec-1201               | ACTIVE    | -          | Running     | public=10.68.17.49, 208.80.155.203  |
| 162e2a9e-2e97-4f7f-9fba-e377115e5bb6 | tools-exec-1202               | ACTIVE    | -          | Running     | public=10.68.16.57, 208.80.155.211  |
| 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 | tools-exec-1204               | ACTIVE    | -          | Running     | public=10.68.17.88, 208.80.155.213  |
| b75192c7-7019-4e6d-a94b-77207f513431 | tools-exec-1206               | ACTIVE    | -          | Running     | public=10.68.17.105, 208.80.155.215 |
| 6b0d7c1f-e514-4749-85cd-76379bf39d2a | tools-exec-1209               | ACTIVE    | -          | Running     | public=10.68.17.129, 208.80.155.218 |
| 37b3aee2-d1e5-419f-bee7-650266d89998 | tools-exec-1213               | ACTIVE    | -          | Running     | public=10.68.17.252, 208.80.155.222 |
| 44efcd75-808a-4af2-8c7a-8d70b703d7f2 | tools-exec-1217               | ACTIVE    | -          | Running     | public=10.68.18.20, 208.80.155.226  |
| 6e8a570f-6746-41d6-87d0-cdb165b384d3 | tools-exec-1218               | ACTIVE    | -          | Running     | public=10.68.18.19, 208.80.155.227  |
| b90fdc70-04e4-4de3-9c06-b8c6c4371d8b | tools-exec-1408               | ACTIVE    | -          | Running     | public=10.68.18.14, 208.80.155.152  |
| 523df61c-07f0-41ba-924d-e2b8e474b4d7 | tools-exec-cyberbot           | ACTIVE    | -          | Running     | public=10.68.16.39                  |
| 5a47a24d-56a2-428d-8d7c-a5bd362f3222 | tools-mailrelay-01            | ACTIVE    | -          | Running     | public=10.68.17.83, 208.80.155.188  |
| 006e08ad-eb3e-451a-bcb8-4a1fc4155d0c | tools-redis-slave             | ACTIVE    | -          | Running     | public=10.68.17.150                 |
| 4c758222-0b46-4d84-91fa-51a2df81c18e | tools-static-02               | ACTIVE    | -          | Running     | public=10.68.16.216                 |
| 0be801fb-7523-42b9-8c21-5fbceee07a53 | tools-webgrid-generic-1404    | ACTIVE    | -          | Running     | public=10.68.18.53                  |
| 00a8caee-b097-4959-bfd4-4660e93a1f7d | tools-webgrid-lighttpd-1409   | ACTIVE    | -          | Running     | public=10.68.18.43                  |
| 8d572912-bcb5-49a6-93de-b10bb90552b5 | tools-webgrid-lighttpd-1410   | ACTIVE    | -          | Running     | public=10.68.18.44                  |
| 3fd88e9c-cf82-4a64-82e2-e015ae90f489 | toolsbeta-exec-101            | ACTIVE    | -          | Running     | public=10.68.16.7                   |
| 8e7089bd-e14d-43cd-bb03-e2d626b0c7a1 | toolsbeta-exec-201            | ACTIVE    | -          | Running     | public=10.68.16.250                 |
| fde6e5e8-a78d-4950-8b6c-b12a84170a3d | wikidata-mobile               | ACTIVE    | -          | Running     | public=10.68.18.41                  |
| 4975124c-20b2-4b88-94b2-f619f42e00d3 | wikispy                       | ACTIVE    | -          | Running     | public=10.68.17.119                 |
| 53deb39f-1335-418a-b952-401d8c7466fa | wlmjurytool2014               | ACTIVE    | -          | Running     | public=10.68.17.134                 |
| a59fff7d-5404-4eb0-a021-c596b29167a5 | wmt-exec                      | ACTIVE    | -          | Running     | public=10.68.17.236                 |
+--------------------------------------+-------------------------------+-----------+------------+-------------+-------------------------------------+