Incidents/2019-05-29 NFS-keystone


Summary

What happened?

  • We were upgrading cloudcontrol1003 from Debian Jessie to Debian Stretch
  • That involved reallocating the control plane workload to cloudcontrol1004, which already runs Debian Stretch
  • The reallocation caused issues in NFS, affecting all NFS-attached OpenStack projects

Impact

  • Toolforge was down.
  • Other OpenStack projects using NFS had trouble accessing their NFS shares.

Detection

Mixed detection, mostly at the same time:

  • human reports
  • icinga alerts

Timeline

All times in UTC.

  • 11:51 UTC Andrew merged 512954 (Make cloudcontrol1004 the primary keystone host)
  • 11:54 UTC labstore1005 nfs-exportd fails to get project list from keystone
  • 11:55 UTC labstore1004 nfs-exportd fails to get project list from keystone
  • 11:57 UTC Arturo downtimes cloudservices1003.wikimedia.org because we plan to rebuild as stretch
  • 11:57 UTC Arturo detected stashbot on IRC not responding
  • 11:57 UTC icinga toolschecker alert on IRC about toolforge cron
  • 11:59 UTC icinga toolschecker page: stale file handle for toolforge project NFS
  • 12:08 UTC Arturo detects that nfs-exportd is using a hardcoded keystone server (cloudcontrol1003.wikimedia.org)
  • 12:18 UTC Andrew merged 513097 (nfs-exportd: use cloudcontrol1004 endpoint for now)
  • 12:25 UTC Krenair reports that Toolforge is apparently back
  • 12:29 UTC stashbot joins the #wikimedia-cloud-admin IRC channel, indicating that Toolforge is indeed recovering
  • 12:34 UTC Brooke identifies some code in nfs-exportd that can be improved to handle the situation in which keystone returns 401 (not authenticated); a sketch of this failure mode follows the timeline
  • 12:35 UTC stashbot joins the #wikimedia-cloud-admin IRC channel again
  • 12:37 UTC icinga page for cloudcontrol1004: keystone_novaobserver_delete_tokens, possibly a delayed message
  • 12:39 UTC Andrew decides to downtime a number of related checks in icinga while operations continue
  • 12:45 UTC Toolforge seems to be working correctly
  • 12:49 UTC Brooke merges 513105 (nfs-exportd: if auth errors happen, do not proceed)
  • 13:00 UTC we consider the incident resolved; all systems seem to be working and we understand the issue.
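
As a reference for 513105, here is a minimal sketch of the failure mode (this is not the actual nfs-exportd code, and the credential scoping shown is an assumption): when keystone rejects the daemon's credentials, the safe behaviour is to abort rather than treat the project list as empty, since an empty list would be read downstream as "no projects" and lead to exports being dropped.

  # Minimal sketch of the 513105 behaviour: abort on keystone auth failure
  # instead of proceeding with an empty project list.
  import sys

  from keystoneauth1 import exceptions as ka_exceptions
  from keystoneauth1 import session as ka_session
  from keystoneauth1.identity import v3
  from keystoneclient.v3 import client as keystone_client


  def get_projects(auth_url, username, password):
      """Return the list of project names, or abort on auth failure."""
      auth = v3.Password(
          auth_url=auth_url,
          username=username,
          password=password,
          project_name='admin',          # assumption: credential scoping
          user_domain_id='default',
          project_domain_id='default',
      )
      keystone = keystone_client.Client(session=ka_session.Session(auth=auth))
      try:
          return [p.name for p in keystone.projects.list()]
      except ka_exceptions.http.Unauthorized:
          # keystone answered 401: do NOT continue with an empty list, since
          # downstream code would read that as "no projects" and remove NFS
          # exports. Exit and let the next run retry.
          sys.exit('keystone auth failed, refusing to touch exports')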

Conclusions

What weaknesses did we learn about and how can we address them?


What went well?

  • Alerts went out relatively quickly
  • All the operations engineers in the WMCS team were immediately available to handle the outage and worked well together as a team.

What went poorly?

  • The outage was the result of an upgrade operation that we were unable to reproduce in a testing/staging environment
  • Since some of the storage environment is in flux at the moment, there were justified concerns that documentation would be outdated

Where did we get lucky?

  • The incident occurred at a time when the most people were online to assist

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, there should be an action item to create it.

Actionables

  • 513105 -- fix the bad branch in the nfs-exportd code (done)
  • 513128 -- nfs-exportd: get essential openstack information from yaml files (done; see the configuration sketch after this list)
  • Improve nfs-exportd to consult keystone for the project list so it can recognize deleted projects
  • hiera cleanup: novacontroller vs keystone_host
  • Stop hardcoding the keystone endpoint: many scripts/configs hardcode it (cloudcontrol1003.wikimedia.org)
  • Use a service FQDN everywhere rather than a concrete server (cloudcontrol: decide on FQDNs for service endpoints, task T223902)
  • Revisit the design of the nfs-exportd code: we could remove its ability to drop exports, but that raises security and practical concerns because we reuse IP addresses.
  • Possibly introduce a maintenance flag for openstack services to watch for, in hiera or a puppet-controlled file such as novaobserver.yaml
  • We don't know why keystone @ cloudcontrol1003 refused to auth queries when we started failing the service over to cloudcontrol1004.
  • Add unit tests to nfs-exportd.py -- with tests like gridconfigurator has, we could easily check how it behaves in the many possible failure states (see the test sketch after this list)
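
For the yaml actionable (513128) and the possible maintenance flag, here is a minimal sketch of the shape this could take; the file path and key names are illustrative assumptions, not the merged change:

  # Sketch: read keystone connection details from a puppet-managed YAML
  # file instead of hardcoding a server name in the script.
  import yaml

  CONFIG_FILE = '/etc/nfs-exportd/config.yaml'   # hypothetical path


  def load_openstack_config(path=CONFIG_FILE):
      """Return keystone auth settings written by puppet."""
      with open(path) as f:
          config = yaml.safe_load(f)
      return {
          'auth_url': config['keystone_api_url'],      # key names assumed
          'username': config['observer_username'],
          'password': config['observer_password'],
      }


  def in_maintenance(config):
      """Possible maintenance flag: skip export updates while it is set."""
      return bool(config.get('maintenance', False))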
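
Finally, a sketch of the kind of unit test proposed in the last actionable, assuming a get_projects() helper like the one sketched under the timeline lives in an importable nfs_exportd module (names and layout are assumptions, not the real nfs-exportd.py):

  # Sketch of a pytest-style test: a 401 from keystone must abort,
  # never silently return an empty project list.
  from unittest import mock

  import pytest
  from keystoneauth1 import exceptions as ka_exceptions

  import nfs_exportd  # hypothetical importable module


  def test_get_projects_aborts_on_401():
      with mock.patch.object(nfs_exportd, 'keystone_client') as fake_client:
          fake_client.Client.return_value.projects.list.side_effect = (
              ka_exceptions.http.Unauthorized()
          )
          with pytest.raises(SystemExit):
              nfs_exportd.get_projects(
                  'https://keystone.example.org:5000/v3',
                  'novaobserver', 'not-a-real-password')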