Incident documentation/20150422-LabsOutage

Summary

Many labs instances were migrated to new virtualization hardware. A kernel bug on the new hosts caused the guest VMs to misbehave: poor response times, network interruptions, and a flurry of monitoring alerts. A kernel update on the affected hosts resolved the problem, but the accompanying reboots further interrupted many VMs.

The affected hosts were running a kernel with this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917. It was found by investigating the symptoms reported at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473 on comparable Precise->Trusty upgrades.

Timeline

prehistory

  • There are six new labs virtualization boxes, labvirt1001-1006. They run the same hardware as old, tried-and-true nodes virt1010, 1011 and 1012. virt1010 and 1011 are running Ubuntu Precise, virt1012 is running Trusty with 3.13.0-46 kernel. The new nodes use a stock install of Trusty, kernel version 3.13.0-24.
  • Andrew migrates select instances to the new labvirt hardware. Projects 'openstack,' 'testlabs,' and a few miscellaneous instances are moved to the hardware. No ill effects are observed.

2015-04-20

  • Andrew migrates the 'cvn' and 'staging' projects to labvirt hosts.

2015-04-21

  • Andrew runs a scripted migration of the deployment-prep project to labvirt hosts. This is the first large-scale migration to the new hardware.

2015-04-22

  • [02:00] Shinken starts to send many, many alerts to #wikimedia-releng, reporting deployment hosts to be flapping. Page loads fail intermittently.
  • [12:30] Andrew wakes up, begins a scripted migration of Tools instances to the labvirt hardware.
  • [13:00] Andrew converses with Tyler Cipriani and becomes aware of the deployment-prep issues, starts debugging in earnest.
  • [15:00] By this time it's clear that the issue is localized to instances on labvirt hardware. Scripted migration of tools is halted.
  • [16:30] The first working theory is that there is resource contention on labvirt1005 and 1006, as instances on those hosts are sending the most alerts. Ganglia graphs are spiky and concerning and most instances on those hosts are unresponsive, so Andrew reboots them. Symptoms are temporarily alleviated.
  • [18:00] It's clear now that reboots were insufficient and we still have issues, including on labvirt1001-1004.
  • [20:00] Alex Monk notes that ping times are very irregular; sometimes jumping to multiple seconds. Andrew confirms that this issue is also isolated to instances on labvirt hosts. Marc joins the debug effort.
  • [21:00] Marc notices clock drift on instances and quickly locates a kernel bug that fits: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1307473
  • [21:15] It's agreed to try a kernel upgrade. Andrew starts migrating tools hosts away from labvirt1001 so it can be restarted without causing further interruption.
  • [21:45] Andrew upgrades the kernel of labvirt1001 to 3.16.0-34 and reboots. The instance fails to start as it is unable to mount the filesystem.
  • [22:30] labvirt1001 is finally back up, running kernel 3.13.0-48. Instances seem to be running properly. Andrew migrates tools nodes away from labvirt1002.
  • [23:45] Andrew upgrades labvirt1002 to 3.13.0-48, reboots, restarts all instances.

2015-04-23

  • [14:00] labvirt1001 and labvirt1002 are declared healthy. Andrew migrates tools hosts away from labvirt1003-1006; Marc drains jobs from affected tools-exec nodes.
  • [14:24] labvirt1006 upgraded and rebooted
  • [15:08] labvirt1005 upgraded and rebooted
  • [15:42] labvirt1003 upgraded and rebooted
  • [16:06] labvirt1004 upgraded and rebooted
  • [16:30] All labvirt hosts up, all instances running.

Actionables

(Use https://phabricator.wikimedia.org/tag/incident-20150422-labsoutage/ for any follow up tasks)

All labvirt nodes are now upgraded and fine. When the other HP systems (virt1010, 1011, 1012) are re-imaged, it's critical that a dist-upgrade and reboot be run before any instances are migrated to them.

  • Guard against nova-compute being deployed on affected kernels (T97152)
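
A guard of this kind could be as simple as comparing the running kernel release against a known-good minimum before the compute service is enabled. The following is a minimal sketch only, not the check implemented for T97152. The function names and the threshold are assumptions made for illustration: 3.13.0-48 is chosen because it is the version the labvirt hosts were upgraded to in this incident, and the linked bug report, not this page, determines which releases are actually affected.

```python
import platform
import re

# Assumption: 3.13.0-48 is the release the labvirt hosts were upgraded to in
# this incident, so anything older is conservatively treated as suspect.
MIN_KNOWN_GOOD = (3, 13, 0, 48)


def parse_kernel_release(release):
    """Turn a release string such as '3.13.0-24-generic' into (3, 13, 0, 24)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def kernel_is_safe(release, minimum=MIN_KNOWN_GOOD):
    """Return True if the release is at least the known-good minimum."""
    return parse_kernel_release(release) >= minimum


def check_host(release=None):
    """Raise RuntimeError if this host runs a kernel older than the minimum.

    When no release is given, the running kernel is read from
    platform.release(), which reports the same string as `uname -r`.
    """
    if release is None:
        release = platform.release()
    if not kernel_is_safe(release):
        raise RuntimeError(
            "kernel %s is older than the known-good minimum; "
            "run a dist-upgrade and reboot before hosting instances" % release
        )
```

With this threshold, check_host("3.13.0-24-generic") raises for the stock Trusty kernel the new hosts shipped with, while check_host("3.13.0-48-generic") passes. A deployment tool could call check_host() with no argument before starting nova-compute on a freshly imaged host.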

Affected Instances

c6f1aa6d-a52b-4234-9835-630658a71940 | compiler
228e1ae7-eee6-4930-b706-a5b20423cfd1 | consul1
1b7b03fb-9c28-42d1-b332-32b1847bc64d | consul2
e4091940-9eda-43c9-b888-3c89c7e26cb3 | consul3
b52e2a1e-bb61-4819-b2f5-d552c1cfc825 | cvn-apache8
e43664a5-e763-469a-9948-2f2c6c539db2 | cvn-app4
9cb1f5db-e6b0-4e47-a508-29feb705bcf2 | cvn-app5
742f631a-6bcc-4bb9-8ab1-8cacc1d376da | dashboard-sentry
fd78c4b0-353e-4b8e-a079-dd59d6232751 | deployment-apertium01
e2c147fd-dc09-417a-8d11-ce8a2a652467 | deployment-bastion
9d05dbda-4103-432c-9449-498243e10db6 | deployment-cache-bits01
aa3c3550-96a7-4f20-a1ab-c88c01a8e5e9 | deployment-cache-mobile03
94d3fa4b-5bf7-4c15-acdf-35d20bb4942d | deployment-cache-text02
5e4a6717-6db8-4033-aa2b-14282dad290e | deployment-cache-upload02
d0da5ac8-34ca-43a9-b63d-64ee438c29cc | deployment-cxserver03
aec2245c-e965-4f06-8496-a0d9abf519ee | deployment-db1
8a9c2ff4-ef08-4fd9-89aa-bc954e982c2d | deployment-db2
cacacac3-010d-4ce6-a13c-004e12a17f5b | deployment-elastic05
1411d0ec-e934-4bfa-8327-81bfbbe4df32 | deployment-elastic06
52c5fd7a-9b14-45da-b531-7c7a458be5c2 | deployment-elastic07
a9a9522f-883d-4f24-b290-960c57a91f2d | deployment-elastic08
abca73aa-4b99-4442-a662-adbfcaadd40b | deployment-eventlogging02
05cec48f-ed80-40a9-b6fc-eff9d3c40fbe | deployment-fluoride
ff9aac2d-ba32-4b86-91be-5aa4181589f3 | deployment-jobrunner01
dfacf7e3-d60c-4990-9681-30610df4ae3d | deployment-kafka02
0294f3af-eba5-4ac8-9205-7c1aba8808d9 | deployment-logstash1
012b196d-c795-4ab6-94ea-e453214d39c2 | deployment-lucid-salt
e8cdee8b-d4b9-4ccb-8be5-944093ae3bf3 | deployment-mathoid
91247f8c-e524-4d20-9c9e-2fc2ae3cdc23 | deployment-mediawiki01
beb4a87e-cb4f-4e2f-a442-fda67ce20c98 | deployment-mediawiki02
2cfaf18c-e6ea-4c2d-b96f-df7f50b6bc9a | deployment-mediawiki03
8290c03a-a64c-4d22-bbce-f7c92afe30cd | deployment-memc02
15d6d50c-aae9-4320-9e48-ea3022fba95f | deployment-memc03
507ed00f-b7fc-42e8-803c-53224646598d | deployment-memc04
811cd53f-855b-490f-b28e-c80184600dd5 | deployment-mx
cec6f6dc-5ab0-420e-8bc0-871ad3c9999b | deployment-parsoid01-test
b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05
9a284217-479f-4cab-9652-58fb3659aa66 | deployment-parsoidcache02
d46df8b9-6c41-409d-9853-b2b4dc876088 | deployment-pdf01
54c66f88-4c39-487b-802b-2eec751f4300 | deployment-pdf02
fb5507a9-6488-47b8-9737-ed739f8faff5 | deployment-redis01
e9e794b0-df85-4181-8b8a-c4a784bf11e7 | deployment-redis02
a71ec107-2c2a-4a5a-bdb5-d35d1ca95302 | deployment-restbase01
088a0575-c09b-4c42-88a5-4ef57d8705c0 | deployment-restbase02
5a4610ff-3fb1-443c-92f2-995ed63d3e79 | deployment-rsync01
abb50762-93d0-4c9d-8853-adbbb6b56e00 | deployment-salt
fcaa135e-3fce-4fae-afe0-ded789fe6f6a | deployment-sca01
ee28b0f9-7071-4724-a852-63b0a95a7416 | deployment-sentry2
70e51d3c-f898-4c87-9b3f-e11bee0087d6 | deployment-stream
ec228fb1-7cca-4c1b-9f5f-63bfc0aee45c | deployment-test
eba7ec1f-8fcf-4ab3-a616-33f486cfb099 | deployment-upload
02bed745-a849-4f95-8d0c-ef2633b19ac0 | deployment-urldownloader
4d4ee285-2eaa-4286-9e98-4fd705c50de4 | deployment-videoscaler01
1754601e-6b04-49d1-a1f9-0c85e361379b | deployment-zookeeper01
be146908-b8ad-45ee-894b-9c7c1ed983ff | deployment-zotero01
071aaf64-0da6-463f-9793-1a847774b816 | designate-devel
e7720543-b214-41ea-824e-60626717509e | etcd1
0995d912-6091-4f1e-bd82-b2e547b558fd | etcd2
939f577c-1057-4fbb-aca4-c6436ffe3130 | etcd3
f7e8f15f-d5b3-4cf7-847b-612f4443b86c | etherpadt
78c56d53-1770-466b-9ad2-6955a539561c | integration-saltmaster
a4958be9-7226-4485-b569-5dedeaccc9be | integration-slave-trusty-1021
b1af424a-e4f8-4291-8ca0-e572c4db7ff5 | labs-bootstrapvz-jessie
06893e3c-cb62-40bb-b198-ba9f1be01725 | labs-vmbuilder-trusty
c3a82ada-8d29-4915-b2ef-44d85994a7ab | otto-hadoop-master01
5853c165-347a-4597-b9ff-a80288b9332d | otto-hadoop-worker01
022b065e-af9f-4731-9dda-bdbafdf31673 | otto-impala-master01
c1036bc0-58fb-4c1e-9b4c-3fb265816af0 | puppet-andrew
7cbdffce-592c-41f4-82bc-4287ed889e9c | puppet-jmm
cbf588f5-169d-4fa9-899b-0fc5419b0630 | puppet-jmm-debian
bd22137c-3d58-4d86-9ec5-dcd104259c4a | puppet-jmm-precise
9d9666b5-5c1e-402f-81e1-dd811eff2f1c | puppet-jmm-salt-trusty
6a73ec36-5f5b-4074-9c30-128a738f91ee | puppet-jmm-salt-trusty-minion
a0451388-9e99-4620-8c23-c8d64667dd12 | puppet-jmm-trusty
2e2e7624-7264-4c28-9cf5-16ccefa794a4 | puppet-mailman
1c9789e7-2e0f-4ecb-ba70-1561112574f7 | puppet-matanya
0d61121b-5f29-4c3c-a5db-3a2b5f20ad56 | sol
2f7a4792-0e76-4e32-b53c-204d5c54c9b8 | staging-cache-text01
887d3809-46b3-4281-a301-e0a2629eb790 | staging-db01
f935259d-e57b-4afb-8e22-e83167d77be5 | staging-elastic01
cfb5bbd0-f76a-416f-8794-f33d9571b43a | staging-elastic02
43224d6d-882a-4da0-9e4b-6a594edb3901 | staging-elastic03
929c9324-b6f2-444f-9a43-a17b012d1c5c | staging-elastic04
e40ecd84-c0b2-4148-a7c8-9f94ab34f5e4 | staging-eventlogging
abcb156c-640b-4e17-a5c0-fb8eefbfbd42 | staging-mc1
00112297-e081-482d-b767-48f4359c8882 | staging-mc2
63356051-954e-4d82-965f-2718d5976fe9 | staging-ms-be01
728acb5a-72a4-41ce-90a3-83173dc7673e | staging-ms-be02
3842a06c-6b40-4bd8-845b-539adb9259df | staging-ms-be03
bf30c55f-c79f-43c5-8ead-871955a5237f | staging-ms-fe01
e2dc144b-bdb0-4833-8167-73e7c1e3aa3b | staging-mw01
8d000f33-5751-4704-b150-026b2a0e2013 | staging-mx
d35af3fe-0e9e-41e3-82df-ae5bcad08812 | staging-ocg01
45bc9d67-a7ea-41a2-bb36-aa2c496f2119 | staging-palladium
555bef0f-c83b-41ea-bf09-1e359a17f4cf | staging-rdb01
140c242c-4e96-4552-9603-6d45f2d13439 | staging-rdb1
20cd2542-f813-4f62-88cf-d2ee9b8b4632 | staging-rdb2
923b0a84-729b-4b80-9fa1-e5c9e26ee330 | staging-sca01
10d943dc-5012-4a38-9201-fb614838e0a9 | staging-stream
3d06622c-c4cb-434e-9fe4-0d47ac8b3e9d | staging-test-tin
ed5f01c6-9157-4962-8360-7e696c2fdefb | staging-tin
ca015784-def3-4231-a225-e7844f179ce4 | tools-bastion-01
dc88c3f6-b685-4e5a-8a54-436b63147497 | tools-bastion-02
086da8f4-e7d2-4541-830b-9f946510e7dc | toolsbeta-quarry-labsdebrepo-test
605caf6e-f642-4cd3-8d42-268fb5e2c612 | tools-dev
4222c0f5-b3bd-41a9-94d2-30faad4202ce | tools-exec-01
eb6e8fad-8646-4251-a706-fc90bf0be0c9 | tools-exec-02
fa611e16-6b85-4f74-92a3-2ed1635fa481 | tools-exec-04
6a1a2095-8474-4378-8290-9dece5b9c3d8 | tools-exec-05
ad12146e-b225-47b2-97f0-330527688331 | tools-exec-06
30b98f1d-1c5a-49c1-b800-f4c535addc12 | tools-exec-07
cb2940d6-2560-4dc5-9e12-f894efd33dfc | tools-exec-08
5cd684db-d0a6-4241-a11f-daf4c1b2f717 | tools-exec-09
ec414ae4-a46f-425f-b9d5-950df155f137 | tools-exec-10
a4fc3c84-bc8e-42bf-9209-0549c9872e84 | tools-exec-11
47608ad4-1adc-4104-b1c5-96281a945ff8 | tools-exec-12
dcb7a789-5c33-42c5-85ea-b8dd50dcbf1b | trusty-manual
bf212ca9-9e05-427b-ac44-7155900f7cba | trusty-medium-1429649189
3d2e226d-a74d-4188-a0b5-1c56093d599c | util-abogott
fde6e5e8-a78d-4950-8b6c-b12a84170a3d | wikidata-mobile
095763a3-84ca-4c7a-90dc-d143be64722c | wikitech-test-network
bf85a20d-beba-47db-ae38-86b149acf9a2 | wt-test-api
ea6a001c-b2a9-44b8-8c47-17a51885232d | wt-test-compute
8e592f62-3dbc-43f1-ae4c-94cc807f6418 | wt-test-controller
36de46b1-c196-4253-adf9-0bb3b036ddb3 | zk1
533c8e50-facc-4659-bb9a-b934e5585be3 | zk2
920a53e7-91a1-4a60-a8c5-92db1ab8aa55 | zk3