Incident documentation/20160525-Toollabs Webservices

Summary

An incomplete deploy of the toollabs-webservice package caused restarted webservices to stop registering themselves with the proxy, resulting in an outage that lasted about 90 minutes.

Timeline

  1. https://gerrit.wikimedia.org/r/#/c/290612/ gets merged
  2. A new toollabs-webservice package gets built and deployed to all the bastions and webgrid nodes.
  3. Yuvi tests it with a few restarts, is happy, goes away
  4. Things happening at this point:
    1. webservice-runner (which is what the gridengine runs) now has a new parameter (--register-proxy) that is *required* to tell it to register with the proxy.
    2. The grid has running jobs that *do not* have this parameter specified
    3. This is fine for the moment, but as soon as they restart for any reason, they use the new webservice-runner script and lose their proxy registration, since --register-proxy isn't passed
  5. For unknown reasons, some or all tools are restarted by webservicemonitor
  6. Puppet's ensure => latest is used for deploying the toollabs-webservice package, and puppet is stuck on /public/dumps on tools-services-01, which runs webservicemonitor. So even when webservices get restarted by webservicemonitor, they do not get the --register-proxy flag, since tools-services-01 still has the old version of the code while the gridengine nodes have the new version.
  7. A qmod -rj run on all webservice nodes exacerbates the problem, since it restarts the jobs without the --register-proxy parameter, so they do not register
  8. Yuvi updates the package on tools-services-01 and deletes all webservice jobs. This causes webservicemonitor to bring them back up by launching new jobs, which do have the --register-proxy flag
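
The failure mode in steps 4 to 7 can be illustrated with a minimal sketch. This is not the actual webservice-runner code; build_parser and will_register are hypothetical names, and the only detail taken from the timeline is that --register-proxy is an opt-in flag that job command lines from before the deploy do not carry.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the runner's argument parsing (an assumption)."""
    parser = argparse.ArgumentParser(prog="webservice-runner")
    # Opt-in flag: when it is absent, the runner silently skips registration.
    parser.add_argument("--register-proxy", action="store_true")
    return parser


def will_register(argv):
    """Return True if a job started with these arguments registers with the proxy."""
    return build_parser().parse_args(argv).register_proxy


# Command line of a job submitted by the old code: the flag is missing.
old_job_args = []
# Command line of a job submitted by the updated code.
new_job_args = ["--register-proxy"]

print(will_register(old_job_args))  # False
print(will_register(new_job_args))  # True
```

Because the new behaviour is opt-in, a caller that was never updated does not fail loudly; it simply produces a job that runs but is unreachable through the proxy.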

Conclusions

  1. The Tools webservice infrastructure has a number of moving parts that are fairly fragile. The port assignment + proxy registration is particularly so
  2. Debian packages + ensure => latest in puppet do not make for a 'deployment system'
  3. Puppet staleness / errors should be looked at with more frequency on the tools project
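
As a rough illustration of the version skew behind conclusions 2 and 3, the sketch below flags hosts whose package version differs from the most common one. The function find_version_skew, the version strings, and all host names other than tools-services-01 are made up for the example; how the per-host versions would be collected is left open.

```python
from collections import Counter


def find_version_skew(versions_by_host):
    """Return the hosts whose version differs from the most common version."""
    if not versions_by_host:
        return {}
    # Treat the version installed on the most hosts as the expected one.
    expected, _ = Counter(versions_by_host.values()).most_common(1)[0]
    return {
        host: version
        for host, version in versions_by_host.items()
        if version != expected
    }


# Hypothetical inventory: version strings and the first two host names are invented.
inventory = {
    "tools-bastion-example": "1.1",
    "tools-webgrid-example": "1.1",
    "tools-services-01": "1.0",
}

print(find_version_skew(inventory))  # {'tools-services-01': '1.0'}
```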

Actionables

  1. Switch to an actual deployment system for toollabs-webservice bug T136168
  2. Enable email nags for tools puppet failures bug T136167
  3. Reduce probability of unrelated NFS failures causing puppet / other issues (https://phabricator.wikimedia.org/T136222 and others)
  4. Better paging for tools webservices being down (Volunteer noticed: T136162)