Incidents/2017-06-13 ORES

Summary

ORES had an intermittent outage from 1600 to 1940 UTC on June 13th, 2017. The issue was traced to scb1001.eqiad.wmnet, where killing a CPU-heavy pdfrender process restored service.

Timeline

See the ORES Grafana dashboard for the incident window: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1497366649350&to=1497383640251&panelId=2&fullscreen

  • 1600 UTC: Error rate for ORES begins to rise (not noticed at the time; no Icinga alerts)
  • 1700 UTC: Deployment for task T167223 begins
  • 1715 UTC: During the canary check, the elevated error rate is noticed and task T167819 is filed with "Unbreak now" priority
  • 1740 UTC: The problem is determined to be independent of the deployment; the decision is made to continue with the deploy
  • 1816 UTC: Ops is pulled in (mutante responds). Rolling back the deployment is considered but rejected.
  • 1828 UTC: The problem is narrowed down to scb1001 specifically (see the per-host probe sketch after this timeline); logs on that host show no errors despite the intermittent 500s
  • 1923 UTC: Mutante notes that the pdfrender service is consuming a lot of CPU and kills it
  • 1940 UTC: Recovery confirmed.
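
The narrowing-down step above relied on checking the SCB backends individually. As a rough illustration only (not the commands actually run during the incident), the Python sketch below samples each backend directly and reports how many requests fail with a 5xx. The host list beyond scb1001, the port, and the path are assumptions.

  # Hypothetical per-host probe; hostnames other than scb1001, the port,
  # and the path are assumptions, not values taken from the incident log.
  import requests

  HOSTS = ["scb1001.eqiad.wmnet", "scb1002.eqiad.wmnet"]  # second host is illustrative
  PORT = 8081            # assumed ORES service port on the SCB hosts
  PATH = "/v2/scores/"   # assumed lightweight endpoint
  SAMPLES = 50

  for host in HOSTS:
      errors = 0
      for _ in range(SAMPLES):
          try:
              if requests.get(f"http://{host}:{PORT}{PATH}", timeout=5).status_code >= 500:
                  errors += 1
          except requests.RequestException:
              errors += 1
      print(f"{host}: {errors}/{SAMPLES} requests failed")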

Conclusions

  • Icinga did not alert us to the issue
  • For reasons still unknown, the errors were not being written to app.log
  • There appears to have been resource contention with the pdfrender service (a sketch for spotting the CPU hog follows this list)
  • Memory was very tight on the SCB cluster for the duration of the outage
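
During the incident the CPU hog was identified and killed by hand with standard tools (e.g. top). Purely for illustration, the sketch below shows how the top CPU consumers on a host could be listed programmatically; it is not what was run on scb1001.

  # Illustrative only: list the processes using the most CPU over a short window.
  import time
  import psutil

  procs = list(psutil.process_iter(["name"]))
  for p in procs:
      p.cpu_percent(None)        # prime the per-process CPU counters
  time.sleep(1.0)                # measure over a one-second window

  usage = []
  for p in procs:
      try:
          usage.append((p.cpu_percent(None), p.pid, p.info["name"]))
      except psutil.NoSuchProcess:
          pass                   # process exited during the window

  for cpu, pid, name in sorted(usage, reverse=True)[:5]:
      print(f"{cpu:6.1f}%  {pid:6d}  {name}")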

Actionables

  • task T167830 -- "Extend icinga check to catch 500 errors like those of the 20170613 incident" (a minimal check sketch follows this list)
  • task T146664 -- "Limit resources used by ORES"; longer term, move ORES to dedicated hardware (see task T157222)
  • task T167834 -- limit resources used by the pdfrender service
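
As a rough sketch of what the extended check in T167830 might look like (not the actual check implemented for that task), the script below samples the ORES endpoint and returns the standard Icinga plugin exit codes when the 5xx rate crosses a threshold. The URL, sample count, and thresholds are assumptions.

  # Hypothetical Icinga-style check: sample the ORES endpoint and alert on the
  # proportion of 5xx responses. URL, sample count, and thresholds are assumed.
  import sys
  import requests

  URL = "http://ores.svc.eqiad.wmnet:8081/v2/scores/"   # assumed endpoint
  SAMPLES = 20
  WARN, CRIT = 0.05, 0.20                               # assumed error-rate thresholds

  errors = 0
  for _ in range(SAMPLES):
      try:
          if requests.get(URL, timeout=5).status_code >= 500:
              errors += 1
      except requests.RequestException:
          errors += 1

  rate = errors / SAMPLES
  msg = f"ORES 5xx rate {rate:.0%} ({errors}/{SAMPLES} sampled requests)"
  if rate >= CRIT:
      print("CRITICAL: " + msg)
      sys.exit(2)
  if rate >= WARN:
      print("WARNING: " + msg)
      sys.exit(1)
  print("OK: " + msg)
  sys.exit(0)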