Incident documentation/20141126-ocg


Summary

OCG (offline content generator) was unable to serve user requests (e.g. PDF versions of pages).


Timeline

  • 20141128T1140 a user reports on #wikimedia-operations that PDF generation is not working; investigation begins
  • 20141128T1147 disk space is suspected to be the root cause and is investigated first
  • 20141128T1200 PDFs older than 14 days are removed from the ocg100* servers; OCG does not recover
  • 20141128T1200 OCG logs in logstash indicate failures while talking to redis; investigation shifts to that
 Nov 25 15:37:43 ocg1002 ganglia-ocg[15741]: ocg_job_status_queue 503449
 Nov 25 15:38:19 ocg1002 ganglia-ocg[25920]: ocg_job_status_queue 0
  • 20141128T1220 it is discovered that the OCG configuration ships with a blank password
  • 20141128T1228 the offending configuration change is fixed

Conclusions

  • User impact on PDF generation started at 20141125T1537; no pages were issued
  • The alarm for "ocg.svc.eqiad.wmnet" had been silenced and therefore did not page
  • The icinga OCG health check issues WARNING even for CRITICAL conditions (e.g. HTTP 500 responses, connection refused)
  • OCG disks were almost full, at >90% utilization
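
The severity problem in the health check can be illustrated with a minimal sketch. It relies only on the standard Nagios/Icinga plugin convention that exit codes 0, 1, 2 and 3 mean OK, WARNING, CRITICAL and UNKNOWN. The function names `classify` and `check` are hypothetical and this is not the actual OCG check. Treating 4xx responses as WARNING is an assumption made for the example.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(http_status=None, connect_failed=False):
    """Map a probe result to a plugin exit code (hypothetical helper)."""
    if connect_failed:
        # Connection refused, timeout, DNS failure: the service is down.
        return CRITICAL
    if http_status is None:
        # No result at all: the check itself could not decide.
        return UNKNOWN
    if 200 <= http_status < 400:
        return OK
    if http_status >= 500:
        # Server-side errors such as HTTP 500 must page.
        return CRITICAL
    # Anything else (e.g. 4xx) is treated as WARNING; this is an assumption.
    return WARNING


def check(url, timeout=10):
    """Probe a URL and return the matching exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(http_status=resp.status)
    except urllib.error.HTTPError as exc:
        # HTTPError carries the status code of a 4xx/5xx response.
        return classify(http_status=exc.code)
    except (urllib.error.URLError, OSError):
        return classify(connect_failed=True)
```

A wrapper script would end with `sys.exit(check(url))` so that icinga sees the exit code.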

Actionables

  • Permanently silencing alarms for production services is discouraged. If silencing is desired for a given service, the "downtime" facility is preferred: downtime auto-expires after the chosen period, which keeps a silenced alarm from being forgotten.
  • OCG icinga health checks should correctly report CRITICAL vs WARNING conditions
  • Excessive disk utilization by the OCG service should be monitored and the space reclaimed automatically (e.g. based on utilization thresholds or file age thresholds)
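
The automatic reclamation suggested in the last actionable could look roughly like the sketch below. It is not the cleanup that OCG uses: the function names, the directory layout and the choice to combine both thresholds are assumptions. The defaults of 14 days and 90% are borrowed from the figures mentioned in this report.

```python
import os
import shutil
import time


def utilization(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def expired_files(root, max_age_days, now=None):
    """List files under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                expired.append(path)
    return sorted(expired)


def reclaim(root, max_age_days=14, threshold=0.90):
    """Delete expired files, but only when utilization reaches the threshold.

    Returns the list of removed paths (empty if nothing was done).
    """
    if utilization(root) < threshold:
        return []
    removed = []
    for path in expired_files(root, max_age_days):
        os.remove(path)
        removed.append(path)
    return removed
```

Run periodically (for example from cron) against the directory holding generated output, this removes old files once the disk fills past the threshold. If old files alone do not free enough space, utilization stays high and still needs an alert.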