Portal:Cloud VPS/Admin/Meltdown Response


https://phabricator.wikimedia.org/T184189 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response

Rollout checklist (done in mediawiki style so it can be archived)

Summary

  • Baseline performance on a labvirt with the existing kernel
  • Figure out right kernel versions to move to
  • Upgrade guests (with some special handling in Toolforge to ensure we don't have general guest issues with these kernels before going fleet-wide)
  • Upgrade labvirts
  • Reboots all around

Preparation

Commands

aptitude install linux-image-4.9.0-0.bpo.5-amd64

[x] Create a bunch of instances on labvirt1018

  • Some Jessie
  • Some Trusty
  • Some Stretch
   OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-4

[x] Profile performance on labvirt1018 on existing kernel

  • over a few hours or a day?
  • what performance charts or metrics are we watching here?

https://phabricator.wikimedia.org/T184189#3893388

[x] Choose upgrade candidate for Jessie (Seems like: linux-image-4.9.0-0.bpo.5 ...) -> Ack, but you should run "apt-get -y install linux-meta"

[x] Choose upgrade candidate for Stretch (Seems like: linux-image-4.9.0-5-amd64) -> Ack, but you should run "apt-get -y install linux-image-amd64"

[x] Choose upgrade candidate for Trusty (Assumed same for labvirts and guests) No: On labvirt you need "apt-get -y install linux-image-generic-lts-xenial" and on instances "apt-get -y install linux-image-generic"

Labvirts:

  • apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
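The per-distro choices above can be captured in a small lookup so that scripts do not hard-code them in several places. This is only a sketch: the function name kernel_package_for and its two arguments are hypothetical, and the package names are the ones recorded in the checklist items above.

```shell
# Hypothetical helper: map a distro codename and host role to the kernel
# package named in the checklist above.
kernel_package_for() {
    distro="$1"   # trusty | jessie | stretch
    role="$2"     # guest | labvirt (only matters for trusty)
    case "$distro:$role" in
        trusty:labvirt) echo "linux-image-generic-lts-xenial" ;;
        trusty:*)       echo "linux-image-generic" ;;
        jessie:*)       echo "linux-meta" ;;
        stretch:*)      echo "linux-image-amd64" ;;
        *)              echo "unknown distro: $distro" >&2; return 1 ;;
    esac
}
```

For example, apt-get -y install "$(kernel_package_for jessie guest)" would install linux-meta.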

[x] Upgrade labvirt1018 to Trusty Kernel candidate (reboot)

[x] Upgrade labvirt1018 pilot guest instances to candidate kernels (reboot)

[x] Profile performance on labvirt1018 on candidate kernel

  • over a few hours or a day?

https://phabricator.wikimedia.org/T184189#3893388

Guest updates: We know what to upgrade to for each distro and believe performance will be survivable

Question!

Should we reboot all Toolforge nodes or a serious subset to see if we turn up any problems with guests on new kernels in our controlled environment before going full on? I think potentially yes.

Final answer is no: the performance impact has been fairly predictable, and although a staged Toolforge reboot would be more graceful for Tools, it would require dual reboots.

Commands

apt-get install <kernel>
uname -r

Pilot in Toolforge

[x] Upgrade canary candidates for Trusty in Toolforge

  • seems like kernel landed here as a security update already
apt-get -s install linux-image-generic (would confirm as a noop)

sudo apt-get update && sudo apt-get -y install linux-image-generic && sudo mv /boot/grub/menu.lst /boot/grub/menu.lst.old && sudo update-grub -y && sudo uname -r

tools-exec-1401 tools-exec-1402 tools-exec-1403 tools-exec-1404 tools-exec-1405
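The simulated install in this step ("would confirm as a noop") can be checked mechanically rather than by eye. The sketch below assumes the English-locale summary line that apt-get prints; the function name apt_sim_is_noop is hypothetical.

```shell
# Hypothetical helper: read `apt-get -s install ...` output on stdin and
# succeed only if the simulation reports that nothing would change.
# Assumes the English summary line, e.g.
#   0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
apt_sim_is_noop() {
    grep -Eq '^0 upgraded, 0 newly installed, 0 to remove'
}

# Usage (not run here):
#   apt-get -s install linux-image-generic | apt_sim_is_noop && echo "already current"
```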

[x] Upgrade canary candidates for Jessie in Toolforge

tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs

[x] Upgrade canary candidates for Stretch in Toolforge

  • I think this is only PAWS?
  • Yuvi's upgrade to k8s 1.9 caught the correct kernel so all have been updated since yesterday

tools-paws-master-01

15:11:51 up 1 day, 16:10,  0 users,  load average: 0.72, 1.36, 1.54

Linux tools-paws-master-01 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux

[x] Review performance implications over half a day

First canaries from 1001-1009 (pcid) and 1010-1019 (pcid and invpcid) (both should have headroom)
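The pcid and invpcid flags matter because they reduce the cost of the page table isolation that the new kernels enable. A rough way to confirm which of the two a host advertises is to read the flags line of /proc/cpuinfo. This is a sketch only; the function name pcid_support is hypothetical.

```shell
# Hypothetical helper: report which PCID-related CPU flags appear in a
# cpuinfo-style file (defaults to /proc/cpuinfo).
pcid_support() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line; all cores normally report the same set.
    flags=$(grep -m1 '^flags' "$cpuinfo")
    result=""
    for f in pcid invpcid; do
        # Pad with spaces so "pcid" does not match inside "invpcid".
        case " ${flags#*:} " in
            *" $f "*) result="$result $f" ;;
        esac
    done
    echo "${result# }"
}
```

An empty result would mean the host advertises neither flag.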

[x] figure out how to target guests on a particular labvirt https://phabricator.wikimedia.org/T184756

[x] send email to affected guest projects for pilot labvirts

[x] Update all guests on labvirt1017

$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"
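Since the same four commands are repeated for each pilot labvirt, they could be generated from the hostname. The sketch below only prints the commands for review and does not run cumin; the function name guest_upgrade_commands is hypothetical, and the command text is copied from this step.

```shell
# Hypothetical helper: print (not run) the four guest-upgrade cumin
# commands from this step for a given labvirt hostname.
guest_upgrade_commands() {
    host="$1"
    base="sudo cumin --force --timeout 120 -o json \"host:$host\""
    printf '%s "%s"\n' "$base" 'lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic'
    printf '%s "%s"\n' "$base" 'lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y'
    printf '%s "%s"\n' "$base" 'lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub'
    printf '%s "%s"\n' "$base" 'lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub'
}
```

Running guest_upgrade_commands labvirt1003 prints the same four lines with the host swapped, which can then be pasted one at a time.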

[x] Update labvirt1017.eqiad.wmnet

root@tools-bastion-03:~# exec-manage depool tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs (etc.)
andrew@tools-k8s-master-01:~$ kubectl cordon tools-worker-1012.tools.eqiad.wmflabs (etc.)

andrew@labcontrol1001:~$ nova list --all-tenants --host labvirt1017 > restartme.txt

Set 2 hour downtime for labvirt1017

root@labvirt1017:~# apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
root@labvirt1017:~# update-grub
root@labvirt1017:~# reboot

(wait)

andrew@labvirt1017:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled

(restart everything in restartme.txt)
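To act on restartme.txt, the instance IDs first have to be pulled out of the table that nova list writes. The following is a sketch under the assumption that the ID is the first column of that table and is a UUID; the function name instance_ids is hypothetical.

```shell
# Hypothetical helper: print the instance UUID from each data row of a
# `nova list` style table read on stdin. Border and header rows are skipped
# because their first column is not UUID-shaped.
instance_ids() {
    awk -F'|' '
        {
            id = $2
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            if (id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) print id
        }'
}

# Usage (not run here; the echo keeps it a dry run):
#   instance_ids < restartme.txt | while read -r id; do echo nova start "$id"; done
```

Whether nova start or nova reboot --hard is the right action depends on the state each instance is left in after the host reboot, so the dry run is worth reviewing first.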

sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"

[x] Update all guests on labvirt1003

[x] update labvirt1003.eqiad.wmnet

[x] send notice email

[x] reboot labvirt1017

[x] reboot labvirt1003

[x] confirm labvirt kernels

andrew@labvirt1017:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled

andrew@labvirt1003:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled
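The dmesg check used here can be wrapped so that it yields a single word per host, which is easier to scan across many machines. This is a sketch; the function name kpti_status is hypothetical, and "unknown" covers the case where the boot message is no longer in the kernel ring buffer.

```shell
# Hypothetical helper: classify page table isolation state from dmesg
# output on stdin as "enabled", "disabled", or "unknown".
kpti_status() {
    line=$(grep -i -m1 'page tables isolation' || true)
    case "$line" in
        *": enabled"*)  echo enabled ;;
        *": disabled"*) echo disabled ;;
        *)              echo unknown ;;
    esac
}

# Usage (not run here):
#   dmesg | kpti_status
```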

[x] confirm all guest kernels

$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"

One straggler that I fixed by changing the apt settings for the kernel (it had upgrades disabled somehow)

$ sudo cumin --force --timeout 120 -o json "host:labvirt1003" "dmesg | grep -i isolation"

One casualty: ttmserver-elasticsearch01.ttmserver.eqiad.wmflabs didn't come back up. It's an old Trusty instance in a project that's a candidate for deletion and had many full drives, so I suspect it was unable to upgrade fully. No logs.

[x] wait over the weekend for performance indicators

Rollout to all guests

[x] Upgrade kernels on all remaining Trusty guests

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"

[x] Upgrade kernels on all remaining Jessie guests to candidate pending reboot

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"

[x] Upgrade kernels on all remaining Stretch guests pending reboot

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"

Labvirts: At this point all guests are pending a kernel upgrade post reboot along w/ the labvirts

  • Labvirts to update (All Trusty currently on 4.4.0-81-generic)
  • Make sure to grab both linux-image and linux-image-extra!

Commands

apt-get install <kernel>
uname -r
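After the reboot, the uname -r output should show the target kernel rather than the old 4.4.0-81-generic. A minimal sketch of that comparison follows; the function name running_kernel_is is hypothetical, and the optional second argument exists only so the check can be tried without rebooting.

```shell
# Hypothetical helper: succeed if the running kernel release equals the
# expected one. The second argument overrides `uname -r` for dry runs.
running_kernel_is() {
    expected="$1"
    running="${2:-$(uname -r)}"
    [ "$running" = "$expected" ]
}

# Usage (not run here):
#   running_kernel_is 4.4.0-109-generic || echo "still on $(uname -r)"
```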

Remaining Main deployment labvirt pool

[x] silence tools.checker

  • Ensure mgmt interface is available before rebooting!


[x] labvirt1001.eqiad.wmnet

[x] labvirt1002.eqiad.wmnet

[x] labvirt1003.eqiad.wmnet

[x] labvirt1004.eqiad.wmnet

[x] labvirt1005.eqiad.wmnet

[x] labvirt1006.eqiad.wmnet

[x] labvirt1007.eqiad.wmnet

-- break to see if any unexpected effects --

[x] labvirt1008.eqiad.wmnet

[x] labvirt1009.eqiad.wmnet

[x] labvirt1010.eqiad.wmnet

[x] labvirt1011.eqiad.wmnet

[x] labvirt1012.eqiad.wmnet

[x] labvirt1013.eqiad.wmnet

[x] labvirt1014.eqiad.wmnet

[x] labvirt1015.eqiad.wmnet

-- dormant --

[x] labvirt1016.eqiad.wmnet <== this is in the spare pool so not a useful canary for profiling normal workloads

[x] labvirt1017.eqiad.wmnet

[x] labvirt1018.eqiad.wmnet <== this is in the spare pool so not a useful canary (though we used it for initial profiling with generated load)

[x] labvirt1019.eqiad.wmnet <== new DB pair with 20

[x] labvirt1020.eqiad.wmnet <== new DB pair with 19

[] labvirt1021.eqiad.wmnet <== racked, not live yet

[] labvirt1022.eqiad.wmnet <== racked, not live yet

Post: at this point all guests and labvirts are rebooted with new kernels

[x] amend kernel whitelist to only include the relevant 4.4.0-109 kernel (which will catch on 21 and 22 when ready)

  • stare at performance charts biting finger nails
  • https://grafana.wikimedia.org/dashboard/db/labs-monitoring?refresh=5m&orgId=1

Checks for Arturo on Monday 15th

  • check several random VMs from Andrew's email: htop and so on, to see if they are good with performance with the new kernel
  • physical servers: labvirt1003.eqiad.wmnet and labvirt1017.eqiad.wmnet
  • check for Graphite trends in physical servers
  • if something breaks: 1) put a message in WhatsApp, 2) try to fix it myself