Incidents/20170629-Labstore Kernel Upgrade


Summary

On 2017 June 29, at 15:00 UTC, we planned a scheduled upgrade of labstore1005 (the active NFS server in a DRBD pair) to the latest pre-installed kernel version, 4.9.25-1~bpo8+3, which was to provide important security fixes. A reboot was necessary for the kernel upgrade, so a week earlier we had rebooted the passive server in the pair, labstore1004. We failed over NFS to labstore1004, which had already been upgraded, and rebooted labstore1005, so that both servers would be running the same latest kernel. Soon after the failover, we noticed extremely high load on labstore1004. Due to similar load issues on 1004 a few months ago, when we failed over from it, we wondered if the server had some underlying issue, and tried failing back to 1005, where we had never experienced load issues in the past. Labstore1005 also reported high load, similar to 1004. We spent some time tweaking the IO scheduler and some sysctl settings, and forcing all clients to reconnect to release any stuck connections (many nfsd threads were in D state), but the load spikes persisted. We finally downgraded the kernel on 1004 to the previous version and failed back over, and saw load numbers return to normal immediately. We have since downgraded the kernel on 1005 as well and rebooted into it.

Unrelated to the kernel upgrade, we also had other issues surface after the first failover.

  • Incident documentation/20170629-ganeti occurred just as we completed the failover, making wikitech unreachable and taking down the LDAP servers in eqiad, which caused authentication issues for tool users. This caused some mild panic, but we soon discovered it was not labstore related.
  • The tools puppetmaster was rebuilt on 2017 June 27 (task T169350), and the puppet certs for the clients had been updated across tools. After the switchover, lots of k8s workers and the k8s master reported extremely high load, and puppet runs reported SSL cert related errors. We tried to resolve the load issues on the master by unmounting NFS on the instance and restarting all the k8s components, which failed with etcd misconfiguration errors. We then discovered that the certs presented by etcd were being validated by the master against its puppet certs, but the etcd services hadn't been restarted since the puppet certs had been updated, causing a mismatch. Restarting etcd fixed this portion of the problem (a sketch of the kind of check involved follows this list).
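
For reference, a minimal sketch of the kind of check that would have caught the mismatch: compare the fingerprint of the cert the running etcd service is presenting against the puppet-managed cert on disk, and restart etcd if they differ. The hostname, port, and cert filename below are assumptions based on the paths mentioned above, not the exact values from the incident.

 # Sketch: does the cert etcd is serving match the puppet cert on disk?
 # Hostname, port, and cert filename are assumptions for illustration.
 HOST=tools-k8s-etcd-01
 PORT=2379
 CERT=/var/lib/etcd/ssl/certs/cert.pem

 # Fingerprint of the cert the running service actually presents
 live=$(echo | openssl s_client -connect "$HOST:$PORT" 2>/dev/null \
          | openssl x509 -noout -fingerprint -sha256)

 # Fingerprint of the cert puppet has placed on disk
 disk=$(openssl x509 -noout -fingerprint -sha256 -in "$CERT")

 if [ "$live" != "$disk" ]; then
     echo "etcd on $HOST is serving a stale cert; restarting"
     sudo service etcd restart
 fi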

Timeline

On 2017/06/21

  • 19:13 madhuvishy: Rebooting labstore1004 (secondary in drbd pair) - Upgrades to kernel version 4.9.25-1~bpo8+3

On 2017/06/29

  • 15:09 chasemp: set downtimes for labstore1004/1005 failover see https://etherpad.wikimedia.org/p/labstore_reboots
  • 15:35~ chasemp: nfs-manage down on labstore1005; some shares refuse to unmount because they are busy and hold on to bind mounts. NFS is already down, so labstore1005 is rebooted to force the unmount
  • 15:36~ chasemp: nfs-manage up on labstore1004, shares come up as DRBD Primary
  • 15:38 chasemp: restart nfs-exportd on labstore1004
  • 15:40~ Madhu and Chase spot check tools. Lots of k8s worker nodes are reporting high load/puppet errors. Wikitech is down, some users report that they can't log in to tools, nfs-exportd is reporting keystone auth issues, and openstack server list on labcontrol1001 also reports 502. (Skip this subsection if you only want to know about the kernel upgrade. Read on if you care about the k8s/other madness.)
    • 15:45 We learn of Incident documentation/20170629-ganeti, which is the cause of the Keystone/openstack/wikitech issues (the ldap server is down). The ldap server recovers and we are now able to run nfs-exportd and openstack server list without errors, Wikitech resumes working, and users can log in/run sudo.
    • 15:45 - 16:30 - k8s workers are either reporting high load or puppet failures, so we try a series of things to fix the k8s cluster
      • 15:57 chasemp: reboot tools-docker-registery-01 for nfs
      • 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
      • 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
      • 16:15 madhuvishy: Load on k8s-master is super high (>100), unmounts all nfs mounts (nfs-mount-manager clean), stops puppet, and restarts all k8s components. kube-apiserver fails to start and reports errors about etcd cluster misconfiguration
      • 16:15 - 17:45
        • We knew that puppet certs had been updated across tools clients since task T169350, Andrew's new puppetmaster build. We looked at the etcd certs at /var/lib/etcd/ssl/certs on tools-k8s-etcd-[01-03], the etcd servers the tools-k8s-master was reporting as misconfigured. They seemed to be updated, but the certs in the master in the same location were from Feb 2016. We thought clearing the old certs on k8s master and running puppet would refresh them, but nothing happened
        • Yuvi hopped on our hangout and explained that the k8s-master was being presented certs by the etcd clients, and was trying to validate them against its own puppet certs.
        • Chase realizes that it's possible that the k8s-master's puppet certs were the new ones, but even though the etcd clients' puppet certs had been updated, the etcd services themselves hadn't been restarted and they were still presenting old certs to the master
        • Madhu restarts etcd on tools-k8s-etcd-[01-03]. Things get better, puppet runs succeed on k8s-master and the rebooted workers that were reporting puppet failures
        • A few more workers/other instances are still reporting high load, so we now reboot them
          • 17:12 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
          • 17:20 andrewbogott: rebooting tools-static-10
          • 17:22 andrewbogott: rebooting tools-static-11
          • 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
    • 17:45 The world seems okay-ish; Madhu and Andrew go to interview someone
  • 18:30 Chase notices high load on labstore1004 (the now primary NFS server). Restarts nfs kernel server
  • 18:34 Madhu drops out of interview to help look into the load issues
  • 18:34 - 18:44 Madhu notices that lots of nfsd threads on labstore1004 are going into D state. Reboots some instances on tools that seem to be IO-intensive. Load on labstore1004 is still high. Chase and Madhu recall a similar incident in January (Incident documentation/20170119-Labstore), and wonder if the server itself has underlying issues. The kernel is also a suspect, but we decide to fail over to labstore1005 first to rule out underlying server issues in 1004.
  • 18:45 ~ Fail over to labstore1005 as DRBD primary. Labstore1005 has at this point been rebooted and has the newer 4.9 kernel
  • 18:50 ~ We look at nethogs to see if clients are being particularly zealous. The bandwidth utilization of some clients seems suspiciously high given our traffic shaping settings, so Chase reapplies our tc settings.
  • 18:55 ~ Madhu cordons all k8s workers (disabling scheduling of new pods on them) and reboots them to make sure they reconnect to NFS (still a high number of nfsd threads in D state). The cordon/drain/reboot cycle is sketched after this day's timeline.
  • 19:00 - 20:30 Keith joins us for troubleshooting, and helps look into nfs specific things along with Chase.
    • We'd previously noticed suspicious messages reported by nfsd processes in dmesg (see task T169281). Chase adjusts net.core settings, and the number of those errors seems to go down, but we have no way to be sure.
    • Chase finds some commentary on similar reports on Red Hat forums and the like, e.g. https://access.redhat.com/solutions/28104, which suggest upgrading nfs-utils, a package that is not available on Debian
  • 20:33 Andrew and Madhu depool, reboot, and repool every exec/webgrid node to see if this will help. Load is still high after this round of rolling reboots.
  • 21:40 We think the next thing to do might be to downgrade the kernel, so in preparation we downgrade the labstore1004 kernel and reboot. (chasemp: reboot labstore1004 with grub set to gnulinux-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925>gnulinux-4.4.0-3-amd64-advanced-1773f282-5a1b-441e-865c-8b70a0ebc925). We are still operating with labstore1005 as primary at this point.
  • 22:16 Chase suggests that before we fail over to 1004 with the lower kernel version, as a last resort we try switching the IO scheduler from deadline to cfq on labstore1005 to see if it has an effect. We try it (chasemp: set cfq scheduler on labstore1005); this and the net.core tweak are sketched after this day's timeline.
  • 22:16 - 23:45 We observe load behavior, and notice that 5 minute and 15 minute averages, even though pretty high (~20-25), have been steadily trending down. Since the IO scheduler switch, load changes are less spiky and a little more stable, albeit high compared to the <5 load numbers we've always seen in the last 8 months of these servers being operational.
  • 23:00 Madhu uncordons k8s worker nodes
  • 23:45 Madhu suggests taking a 1.5 hour break, so that we have enough data to decide whether load will hold steady enough to wait a day and investigate the kernel further, or whether conditions are still bad and we should fail over to labstore1004 with the downgraded kernel.
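
The drain/cordon/reboot/uncordon cycle applied to the k8s workers above (17:12, 17:37, 18:55, 23:00) corresponds roughly to the sketch below; the node names and kubectl flags are illustrative, not the exact invocations used during the incident.

 # Per-worker reboot cycle (sketch); node names and flags are illustrative.
 for node in tools-worker-1005 tools-worker-1007 tools-worker-1008; do
     kubectl cordon "$node"                              # stop scheduling new pods here
     kubectl drain "$node" --ignore-daemonsets --force   # evict running pods
     ssh "$node" sudo reboot                             # reboot so NFS mounts reconnect cleanly
     # ... wait for the node to come back and report Ready ...
     kubectl uncordon "$node"                            # allow scheduling again
 done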
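
The labstore-side tuning attempts (the net.core adjustments between 19:00 and 20:30, and the deadline-to-cfq IO scheduler switch at 22:16) are runtime changes along these lines; the device name and the specific sysctl keys shown are assumptions, since the exact values tried were not recorded.

 # Inspect and switch the IO scheduler on a block device (device name assumed).
 cat /sys/block/sdb/queue/scheduler            # e.g. "noop [deadline] cfq"
 echo cfq | sudo tee /sys/block/sdb/queue/scheduler

 # Inspect and adjust net.core settings (keys shown are illustrative only).
 sysctl net.core.netdev_max_backlog net.core.somaxconn
 sudo sysctl -w net.core.netdev_max_backlog=30000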

It's a new day! In UTC at least. On 2017-06-30

  • 01:00 We reconvene (Keith, Bryan, Andrew, Madhu, Chase) - load is not doing well, and has only spiked further. We decide that failing back over to labstore1004 is the next step
  • 01:23 chasemp: fail nfs from labstore1005 to labstore1004 
  • 01:25 chasemp: reboot labstore1005
  • 01:25 - 2:10 Load is normal on labstore1004 (<5 and holding). See task T169290 for a dramatic graph depicting the load spikes and their fall after the kernel downgrade.
  • 2:15 Chase downgrades the kernel for labstore1005 too, and reboots
  • 2:20 Labstore1005 still hasn't come up. Keith hops onto the console and sees that it's waiting for a keypress at a "PCIe link training failure" error on boot. See task T169286. He presses F1 and it boots up
  • 02:29 chasemp: labstore1005 start drbd (reintroducing drbd sync traffic at this point, load still stays low; the DRBD status checks are sketched after this timeline)
  • 02:45 Things have held up well. We blame it all on the kernel and call it a day. Labstore1004 is still primary, and both labstore1004 and 1005 are on kernel version 4.4.0-3-amd64 #1 SMP Debian 4.4.2-3+wmf8
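
After DRBD was brought back on labstore1005 at 02:29, the resync traffic and connection state can be watched with standard drbd-utils commands like the following (resource names come from the labstore DRBD configuration):

 # Watch DRBD state after reintroducing the secondary (standard drbd-utils).
 cat /proc/drbd                  # roles, connection state, resync progress
 sudo drbdadm cstate all         # connection state per resource
 sudo drbdadm dstate all         # disk state per resource
 watch -n 5 cat /proc/drbd       # follow the resync as it catches up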

Conclusions

  • The new anti-Stack Clash kernel (4.9.25-1~bpo8+3) plays badly with NFS.
  • The nfs-manage script that we use to failover between the labstores had some issues (outdated commands, failure to unmount shares when busy) and needs some love
  • Etcd services need to get restarted in the k8s-etcd cluster when puppet certs are updated
  • Labstore1005 has some hardware issue (it reports a "PCIe link training failure" error on boot)
  • Installing kernels but not rebooting is spooky. The next reboot will pick up the change whether it is wanted or not.
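
A cheap way to make the "installed but not yet running" state from the last conclusion visible is to compare the running kernel with the newest kernel image on disk; this is a sketch of such a check, not an existing monitor:

 # Warn when the newest installed kernel is not the one currently running (sketch).
 running=$(uname -r)
 newest=$(ls -1 /boot/vmlinuz-* | sed 's|.*/vmlinuz-||' | sort -V | tail -n1)
 if [ "$running" != "$newest" ]; then
     echo "WARNING: running $running, but $newest is installed and will be picked up on the next reboot"
 fi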

Actionables

Tracking task for incident - task T169289

  • Investigate issues with kernel 4.9.25-1~bpo8+3 (task T169290)
  • Follow up on dmesg errors reported by nfsd (task T169281)
  • The nfs-manage failover script needs to be tested with real load and fixed (task T169570)
  • Fix puppet to notify k8s etcd cluster on puppet cert changes (task T169287)
  • Debug hardware issue reported on labstore1005 reboot (task T169286)
  • Stop installing kernel upgrades without rebooting to see them take effect. This is especially important on clusters, so that if there are issues with the new kernel we don't have to figure out how to make grub2 pick the desired kernel while under pressure to restore services (see the sketch below).
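
For the last actionable, grub2 on Debian can be pointed at a specific (older) kernel ahead of time rather than under pressure; a sketch, using the submenu>entry id pattern from the 21:40 log line above (the actual ids on a given host come out of its own grub.cfg):

 # List the menu entry / submenu ids grub knows about on this host
 grep -E "menuentry |submenu " /boot/grub/grub.cfg

 # Boot a specific older kernel on the next reboot only
 # (requires GRUB_DEFAULT=saved in /etc/default/grub, then update-grub)
 sudo grub-reboot 'gnulinux-advanced-<uuid>>gnulinux-4.4.0-3-amd64-advanced-<uuid>'
 sudo reboot

 # Or make that choice the default across reboots
 sudo grub-set-default 'gnulinux-advanced-<uuid>>gnulinux-4.4.0-3-amd64-advanced-<uuid>'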