Incidents/20160525-Toollabs Webservices

From Wikitech

Summary

An incomplete deploy of the toollabs-webservice package caused restarted webservices to not register themselves with the proxy, causing an outage that lasted about 90 minutes.

Timeline

  1. https://gerrit.wikimedia.org/r/#/c/290612/ gets merged
  2. New toollabs-webservice package gets built and deployed onto all the bastions & webgrid nodes.
  3. Yuvi tests it with a few restarts, is happy, goes away
  4. Things happening at this point:
    1. webservice-runner (which is what the gridengine runs) now has a new parameter (--register-proxy) that is *required* to tell it to register with the proxy.
    2. Grid has jobs running that *do not* have this parameter specified
    3. This is fine for right then, but as soon as they restart for any reason, they'll use the new webservice-runner script and lose their proxy registration, since --register-proxy isn't passed
  5. For unknown reasons, some or all tools are restarted by webservicemonitor
  6. Puppet's ensure => latest is used for deploying the toollabs-webservice package, and puppet is stuck on /public/dumps on tools-services-01, which runs webservicemonitor. So even when webservices get restarted by webservicemonitor, they do not get the --register-proxy flag, since tools-services-01 still has the old version of the code while the gridengine nodes have the new version.
  7. The qmod -rj run on all webservice nodes exacerbates this problem, since it continues to run them without the --register-proxy parameter, causing them to not register
  8. Yuvi updates package on tools-services-01, and deletes all webservice jobs. This causes webservicemonitor to bring them back up by launching new jobs, which do have the --register-proxy flag
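
The failure mode in steps 4 to 7 can be illustrated with a minimal sketch. This is not the actual webservice-runner code: the parser, the function names and the example command lines are hypothetical, and only the --register-proxy flag name comes from the timeline. It assumes the flag is an opt-in switch, so a command line that predates it is still accepted by the new script but silently skips registration.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the runner's argument parser; only the
    # --register-proxy flag name is taken from the timeline.
    parser = argparse.ArgumentParser(prog="webservice-runner-sketch")
    parser.add_argument(
        "--register-proxy",
        action="store_true",
        help="register this webservice with the proxy on startup",
    )
    return parser


def will_register(argv):
    """Return True if a job launched with argv would register with the proxy."""
    args = build_parser().parse_args(argv)
    return args.register_proxy


# Command line of a job submitted by the old code: the flag is absent.
old_job_argv = []
# Command line of a job submitted by the upgraded code.
new_job_argv = ["--register-proxy"]

print(will_register(old_job_argv))  # False
print(will_register(new_job_argv))  # True
```

Because the old command line parses without error, nothing fails loudly at restart time; the webservice simply comes up without a proxy entry.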

Conclusions

  1. Tools webservice infrastructure has a number of moving parts that are kinda fragile. The port assignment + proxy registration is particularly so
  2. Debian packages + ensure => latest in puppet do not make for a 'deployment system'
  3. Puppet staleness / errors should be looked at with more frequency on the tools project
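
As a rough illustration of conclusions 2 and 3, a check along the following lines could flag hosts that disagree on the installed package version. It is a sketch under stated assumptions, not an existing tool: the function name, the version strings and every host name except tools-services-01 are hypothetical, and how the versions are collected from each host is left out.

```python
from collections import defaultdict


def find_version_skew(installed):
    """Group hosts by installed version; return {} when all hosts agree."""
    by_version = defaultdict(list)
    for host, version in installed.items():
        by_version[version].append(host)
    if len(by_version) <= 1:
        return {}
    return {version: sorted(hosts) for version, hosts in by_version.items()}


# Hypothetical snapshot: placeholder versions, example host names.
snapshot = {
    "tools-services-01": "old",
    "example-webgrid-node": "new",
    "example-bastion": "new",
}

skew = find_version_skew(snapshot)
for version, hosts in sorted(skew.items()):
    print(version, hosts)
```

The sketch only detects disagreement and does not try to order Debian version strings, which would need a proper version comparison.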

Actionables

  1. Switch to an actual deployment system for toollabs-webservice bug T136168
  2. Enable email nags for tools puppet failures bug T136167
  3. Reduce probability of unrelated NFS failures causing puppet / other issues (https://phabricator.wikimedia.org/T136222 and others)
  4. Better paging for tools webservices being down (Volunteer noticed: T136162)