Incidents/20150602-gridengine-dns-failure

Summary

An investigation into why the gridengine master failed to fail over automatically during last week's outage found the cause to be inconsistent name resolution by labs instances, due to the ongoing DNS migration. Intermittent issues had been visible in job scheduling and reporting over the past several days, but they were made salient by switching the Tool Labs bastions to the new DNS server earlier that day.

Attempts to make the gridengine configuration compatible with the new node naming scheme lasted several hours, during which job scheduling was intermittently unstable and suffered random failures. A proper solution was eventually found, but a side effect of its implementation caused an unexpected failure during a subsequent manipulation step. This resulted in a complete outage of the scheduling system and required rolling the configuration back and manually fixing the database.

The effect of the outage was unreliable scheduling of new jobs over a period of approximately four hours, followed by a complete outage of that system for approximately 30 minutes. Already running jobs and web services were unaffected by the partial outage, as were all other systems serving Tool Labs.

Timeline

13:30 - Issues apparently related to the ongoing DNS change to designate are noted, as some gridengine commands intermittently fail authentication because of host name mismatches
13:40 - Problem is identified as a mismatch between the DNS servers used by clients (designate-backed pdns) and those used by the master and shadow masters (older, dnsmasq)
13:50 - Gridengine servers are switched to the new DNS backend; this solves the immediate authentication issue but shakes out more intermittent issues; begin trying to unify DNS resolution
14:27 - Attempts to convert the gridengine configuration to the new naming scheme have limited success; admin and submit hosts are trivially converted but exec nodes cause issues
15:06 - While running exec nodes are unaffected, the transition proves problematic because restarting the nodes causes them to fail
15:37 - A plan is formulated to transition exec nodes from the old names to the new ones, possibly over an extended period of time. Exec node entries are created with the new names and left unqueued.
15:47 - Transition is attempted with an exec node that only held restartable jobs (1401) and could thus be evacuated quickly
16:08 - Gridengine does not seem to deal well with exec nodes having been renamed (the old and new names conflict); intermittent authentication errors persist (which name is used seems to be random)
16:16 - Debugging the DNS inconsistency issue
17:05 - While examining the gridengine code, a "proper" if obscure way to handle renaming is found: host aliases (http://gridscheduler.sourceforge.net/htmlman/htmlman5/host_aliases.html) allow gridengine to treat both names as strictly equivalent, which completely fixes the name divergence issues.
17:10 - fix applied, gridengine returns to normal function
17:25 - While attempting to remove the "new name" exec node entries created at 15:37 (which are now unneeded), a nasty issue corrupts the configuration: because the names are now considered equivalent by gridengine, removing the newer redundant entries in fact removes both entries while bypassing the safeguards that prevent active nodes from being removed. The configuration becomes inconsistent and the gridengine masters shut down.
17:50 - configuration is returned to working state by doing a low-level database recovery of the underlying config db to a known good point in time, and manually deleting the redundant host entries
18:03 - gridengine returns to full health, with all previous issues gone thanks to the aliasing feature.
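
For illustration, the aliasing approach found at 17:05 can be sketched as a small generator for the aliases file. According to the host_aliases man page linked above, each line lists the names of one host, and the first name on the line is the unique name that all the others are translated to. The helper name and the host names below are hypothetical placeholders, not the actual Tool Labs names, and which of the old or new name should come first is an assumption left to the caller.

```python
def render_host_aliases(entries):
    """Render the text of a host_aliases file.

    `entries` is an iterable of (unique_name, aliases) pairs. Each pair
    becomes one line: the unique name first, then its aliases, separated
    by single spaces. (Hypothetical helper, not part of gridengine.)
    """
    lines = []
    seen = set()
    for unique_name, aliases in entries:
        names = [unique_name, *aliases]
        for name in names:
            # A name listed twice would make the mapping ambiguous.
            if name in seen:
                raise ValueError(f"{name} appears more than once")
            seen.add(name)
        lines.append(" ".join(names))
    return "\n".join(lines) + "\n"


# Hypothetical old/new names for two exec nodes.
EXAMPLE_ENTRIES = [
    ("exec-01.old.example", ["exec-01.new.example"]),
    ("exec-02.old.example", ["exec-02.new.example"]),
]

if __name__ == "__main__":
    print(render_host_aliases(EXAMPLE_ENTRIES), end="")
```

Generating the file from a single list of name pairs, rather than editing it by hand, is one way the puppetization mentioned under Actionables might keep the old and new names in sync.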

Conclusion

The change to DNS configuration required by the transition to Designate caused unexpected side effects in gridengine, as it relies on the fully qualified domain names of hosts for operation. Intermittent DNS resolution had caused issues as early as last week (from a distinct and unrelated cause), but matters came to a head when the Tool Labs bastions were converted to use the new DNS server. Since gridengine has no support for renaming hosts, and the host_aliases workaround was not known, the transition was more disruptive than it might have been.
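
As a rough illustration of the kind of divergence involved, the sketch below compares two views of name resolution (for example, answers collected from the old and the new DNS server) and reports addresses whose host names differ. The function name, the input format (address mapped to host name) and the sample data are assumptions made for this example; it does not query any DNS server itself.

```python
def find_name_mismatches(view_a, view_b):
    """Return {address: (name_a, name_b)} for addresses present in both
    views whose host names differ.

    Names are compared case-insensitively and without a trailing dot.
    (Hypothetical helper; both inputs are plain dicts gathered elsewhere.)
    """
    def normalize(name):
        return name.rstrip(".").lower()

    mismatches = {}
    for address in sorted(view_a.keys() & view_b.keys()):
        name_a, name_b = view_a[address], view_b[address]
        if normalize(name_a) != normalize(name_b):
            mismatches[address] = (name_a, name_b)
    return mismatches


# Hypothetical sample data using documentation-only addresses and names.
OLD_VIEW = {
    "192.0.2.10": "exec-01.old.example",
    "192.0.2.11": "master.old.example",
}
NEW_VIEW = {
    "192.0.2.10": "exec-01.new.example",
    "192.0.2.11": "MASTER.old.example.",
}

if __name__ == "__main__":
    for address, names in find_name_mismatches(OLD_VIEW, NEW_VIEW).items():
        print(address, *names)
```

Any address reported by such a check is a host that clients and masters would know under different names, which is the condition that led to the authentication failures described in the timeline.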

Long term fix is to get rid of GridEngine: it has no active upstream, and Debian doesn't consider it maintained enough to include it in Jessie. Short term fixes are listed below.

Actionables

Puppetize the aliases file (T101296)
Document what 'low level database recovery of the underlying config db' was in wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin
(Not sure how feasible that is; it involved BDB dump manipulation and is not amenable to a recipe).