Portal:Cloud VPS/Admin/notes/Meltdown Response
https://phabricator.wikimedia.org/T184189 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response
Rollout checklist (done in mediawiki style so it can be archived)
Summary
- Baseline performance on a labvirt with the existing kernel
- Figure out right kernel versions to move to
- Upgrade guests (with some special handling in Toolforge to ensure we don't have general guest issues with these kernels before fleet wide)
- Upgrade labvirts
- Reboots all around
Preparation
Commands
aptitude install linux-image-4.9.0-0.bpo.5-amd64
[x] Create a bunch of instances on labvirt1018
- Some Jessie
- Some Trusty
- Some Stretch
OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-4
[x] Profile performance on labvirt1018 on existing kernel
- over a few hours or a day?
- what performance charts or metrics are we watching here?
https://phabricator.wikimedia.org/T184189#3893388
[x] Choose upgrade candidate for Jessie (seems like linux-image-4.9.0-0.bpo.5 ...) -> Ack, but you should run "apt-get -y install linux-meta"
[x] Choose upgrade candidate for Stretch (seems like linux-image-4.9.0-5-amd64) -> Ack, but you should run "apt-get -y install linux-image-amd64"
[x] Choose upgrade candidate for Trusty (assumed same for labvirts and guests) -> No: on labvirts you need "apt-get -y install linux-image-generic-lts-xenial" and on instances "apt-get -y install linux-image-generic"
Labvirts:
- apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
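The per-distro/per-role package choices above can be captured in one small helper. A minimal sketch; the `kernel_pkg` function name is ours (not part of the rollout tooling), but the package names match the candidates chosen in this checklist:

```shell
# Sketch: map a distro and host role to the kernel meta-package chosen above.
# kernel_pkg is a hypothetical helper, not part of the actual rollout tooling.
kernel_pkg() {
  # $1 = distro (trusty|jessie|stretch), $2 = role (guest|labvirt)
  case "$1" in
    jessie)  echo linux-meta ;;
    stretch) echo linux-image-amd64 ;;
    trusty)
      # Trusty is the one case where labvirts and guests differ
      if [ "$2" = labvirt ]; then
        echo linux-image-generic-lts-xenial
      else
        echo linux-image-generic
      fi ;;
    *) return 1 ;;
  esac
}
```

For example, `kernel_pkg trusty labvirt` prints `linux-image-generic-lts-xenial`, matching the labvirt note above.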
[x] Upgrade labvirt1018 to Trusty Kernel candidate (reboot)
[x] Upgrade labvirt1018 pilot guest instances to candidate kernels (reboot)
[x] Profile performance on labvirt1018 on the candidate kernel
- over a few hours or a day?
https://phabricator.wikimedia.org/T184189#3893388
Guest updates: We know what to upgrade to for each distro and believe performance will be survivable
Question!
Should we reboot all Toolforge nodes, or a substantial subset, to see if we turn up any problems with guests on new kernels in our controlled environment before going full on? I think potentially yes.
Final answer is no: the performance impact has been fairly predictable, and while this would be more graceful for Tools it would require dual reboots.
Commands
apt-get install <kernel>
uname -r
Pilot in Toolforge
[x] Upgrade canary candidates for Trusty in Toolforge
- seems like kernel landed here as a security update already
apt-get -s install linux-image-generic (simulated install; confirms it's a no-op)
sudo apt-get update && sudo apt-get -y install linux-image-generic && sudo mv /boot/grub/menu.lst /boot/grub/menu.lst.old && sudo update-grub -y && sudo uname -r
tools-exec-1401 tools-exec-1402 tools-exec-1403 tools-exec-1404 tools-exec-1405
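The canary upgrade command above was applied to each of the exec nodes listed; a sketch of looping it over the hosts (the ssh invocation is illustrative — in practice this was run per-host):

```shell
# Sketch: run the Trusty canary upgrade on each exec node in turn.
# The ssh call is illustrative; echo shows which host we'd touch.
for i in $(seq 1401 1405); do
  host="tools-exec-$i"
  echo "upgrading $host"
  # ssh "$host" 'sudo apt-get update && sudo apt-get -y install linux-image-generic'
done
```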
[x] Upgrade canary candidates for Jessie in Toolforge
tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs
[x] Upgrade canary candidates for Stretch in Toolforge
- I think this is only PAWS?
- Yuvi's upgrade to k8s 1.9 caught the correct kernel so all have been updated since yesterday
tools-paws-master-01
15:11:51 up 1 day, 16:10, 0 users, load average: 0.72, 1.36, 1.54
Linux tools-paws-master-01 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
[x] Review performance implications over half a day
First canaries from 1001-1009 (pcid) and 1010-1019 (pcid and invpcid) (both should have headroom)
[x] figure out how to target guests on a particular labvirt https://phabricator.wikimedia.org/T184756
[x] send email to affected guest projects for pilot labvirts
[x] Update all guests on labvirt1017
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"
[x] Update labvirt1017.eqiad.wmnet
root@tools-bastion-03:~# exec-manage depool tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs (etc.)
andrew@tools-k8s-master-01:~$ kubectl cordon tools-worker-1012.tools.eqiad.wmflabs (etc.)
andrew@labcontrol1001:~$ nova list --all-tenants --host labvirt1017 > restartme.txt
Set 2 hour downtime for labvirt1017
root@labvirt1017:~# apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
root@labvirt1017:~# update-grub
root@labvirt1017:~# reboot
(wait)
andrew@labvirt1017:~$ dmesg | grep -i isolation
[    0.000000] Kernel/User page tables isolation: enabled
(restart everything in restartme.txt)
sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"
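The "(restart everything in restartme.txt)" step above can be sketched as a loop over the saved `nova list` output; assuming the file holds the standard table format, the instance ID is the second column (the awk parsing and the loop are ours, not a recorded command):

```shell
# Sketch: reboot every guest captured in restartme.txt.
# Assumes restartme.txt is the table output of `nova list --host labvirtNNNN`:
# lines look like "| <uuid> | <name> | ACTIVE | ... |".
awk -F'|' '/^\| [0-9a-f]/ {gsub(/ /, "", $2); print $2}' restartme.txt |
while read -r id; do
  openstack server reboot "$id"
done
```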
[x] Update all guests on labvirt1003
[x] update labvirt1003.eqiad.wmnet
[x] send notice email
[x] reboot labvirt1017
[x] reboot labvirt1003
[x] confirm labvirt kernels
andrew@labvirt1017:~$ dmesg | grep -i isolation
[    0.000000] Kernel/User page tables isolation: enabled
andrew@labvirt1003:~$ dmesg | grep -i isolation
[    0.000000] Kernel/User page tables isolation: enabled
[x] confirm all guest kernels
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"
One straggler that I fixed by changing the apt settings for the kernel (it had upgrades disabled somehow)
$ sudo cumin --force --timeout 120 -o json "host:labvirt1003" "dmesg | grep -i isolation"
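The dmesg checks above all key on the same string; a tiny sketch of that match as a reusable predicate (the `kpti_enabled` helper name is ours):

```shell
# Sketch: test whether a dmesg line reports KPTI as enabled.
# kpti_enabled is a hypothetical helper encoding the grep used above.
kpti_enabled() {
  case "$1" in
    *"page tables isolation: enabled"*) return 0 ;;
    *) return 1 ;;
  esac
}
```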
One casualty: ttmserver-elasticsearch01.ttmserver.eqiad.wmflabs didn't come back up. It's an old Trusty instance in a project that's a candidate for deletion and had many full drives, so I suspect it was unable to upgrade fully. No logs.
[x] wait over the weekend for performance indicators
Rollout to all guests
[x] Upgrade kernels on all remaining Trusty guests
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
[x] Upgrade kernels on all remaining Jessie guests to candidate pending reboot
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"
[x] Upgrade kernels on all remaining Stretch guests pending reboot
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"
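The three fleet-wide runs above differ only in the distro check and the package; a sketch of driving them from one loop over (check, package) pairs (the loop and the dry-run echo are ours; the cumin invocation is unchanged):

```shell
# Sketch: the fleet-wide per-distro upgrades as one loop.
# Each entry is "distro-check:package"; echo keeps this a dry run.
set -- \
  "lsb_release -si | grep Ubuntu:linux-image-generic" \
  "lsb_release -sd | grep jessie:linux-meta" \
  "lsb_release -sd | grep stretch:linux-image-amd64"
for pair in "$@"; do
  check=${pair%%:*}   # everything before the last colon: the distro check
  pkg=${pair##*:}     # everything after the last colon: the package
  echo sudo cumin --force --timeout 120 -o json "A:all" \
    "$check && apt-get -y install $pkg && update-grub"
done
```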
Labvirts: At this point all guests are pending a kernel upgrade post reboot along w/ the labvirts
- Labvirts to update (All Trusty currently on 4.4.0-81-generic)
- Make sure to grab both linux-image and linux-image-extra!
Commands
apt-get install <kernel>
uname -r
Remaining Main deployment labvirt pool
[x] silence tools.checker
- Ensure mgmt interface is available before rebooting!
[x] labvirt1001.eqiad.wmnet
[x] labvirt1002.eqiad.wmnet
[x] labvirt1003.eqiad.wmnet
[x] labvirt1004.eqiad.wmnet
[x] labvirt1005.eqiad.wmnet
[x] labvirt1006.eqiad.wmnet
[x] labvirt1007.eqiad.wmnet
-- break to see if any unexpected effects --
[x] labvirt1008.eqiad.wmnet
[x] labvirt1009.eqiad.wmnet
[x] labvirt1010.eqiad.wmnet
[x] labvirt1011.eqiad.wmnet
[x] labvirt1012.eqiad.wmnet
[x] labvirt1013.eqiad.wmnet
[x] labvirt1014.eqiad.wmnet
[x] labvirt1015.eqiad.wmnet
-- dormant --
[x] labvirt1016.eqiad.wmnet <== this is in the spare pool so not a useful canary for profiling normal workloads
[x] labvirt1017.eqiad.wmnet
[x] labvirt1018.eqiad.wmnet <== this is in the spare pool so not a useful canary (though we used it for initial profiling with generated load)
[x] labvirt1019.eqiad.wmnet <== new DB pair with 20
[x] labvirt1020.eqiad.wmnet <== new DB pair with 19
[] labvirt1021.eqiad.wmnet <== racked, not live yet
[] labvirt1022.eqiad.wmnet <== racked, not live yet
Post: at this point all guests and labvirts are rebooted with new kernels
[x] amend kernel whitelist to only include the relevant 4.4.0-109 kernel (which will catch on 21 and 22 when ready)
- stare at performance charts biting finger nails
- https://graphite.wikimedia.org/render/?width=959&height=320&areaMode=stacked&hideLegend=false&target=cactiStyle(averageSeries(servers.labvirt1001.cpu.total.{irq,user,system,steal,softirq,nice,irq,iowait}))
- https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1515702661.283&target=servers.labvirt1010.cpu.total.guest_nice&target=servers.labvirt1017.cpu.total.guest&areaMode=stacked&hideLegend=false&from=-8h
- https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1017.loadavg.05)&from=-7d
- https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1
- https://grafana.wikimedia.org/dashboard/db/labs-monitoring?refresh=5m&orgId=1
Checks for Arturo on Monday 15th
- check several random VMs from Andrew's email (htop and so on) to see whether performance is good on the new kernel
- physical servers: labvirt1003.eqiad.wmnet and labvirt1017.eqiad.wmnet
- check graphite trends on the physical servers
- if something breaks: 1) put a message in WhatsApp, 2) try to fix it myself