Incident documentation/20151216-ores


Summary

ORES stopped serving requests for about 3 hours because Redis ran out of disk space and the workers could no longer function.

Timeline

1130 UTC
A report was made via IRC that ORES was returning 503 "server overloaded" errors.
1300 UTC
Halfak saw this message and restarted the workers, which came back online briefly before going offline again.
1325 UTC
The web proxy for ores.wmflabs.org was changed to direct traffic to the staging server (CPU usage went to 100%, but most requests were still being served).
1345 UTC
Workers and logs were observed. The workers appeared to come back online and then crash with an error:
redis.exceptions.ResponseError: MISCONF Errors writing to the AOF file: No space left on device
1350 UTC
Halfak calls Yuvipanda for help. Steps taken:
  1. hand-edit the redis config file to turn off AOF
  2. move the files out of /srv/redis
  3. restart redis
1410 UTC
The web proxy for ores.wmflabs.org was changed to direct traffic back to the prod cluster. precached was turned back on.
1430 UTC
Victory declared. Yuvipanda goes back to sleep and Halfak continues monitoring
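
The first recovery step at 1350 UTC (turning off AOF) was done by editing the config file by hand. As a rough illustration only, the same setting can also be changed on a running server through Redis's CONFIG SET command. The sketch below assumes a redis-py style client that exposes config_get() and config_set(); the FakeRedis class is a hypothetical stand-in so the example runs without a server, and none of this is taken from the actual ORES setup.

```python
def disable_aof(client):
    """Turn off AOF persistence on a running Redis server.

    `client` is assumed to be a redis-py style object exposing
    config_get() and config_set(). Returns the setting before and after.
    """
    before = client.config_get("appendonly")
    client.config_set("appendonly", "no")
    after = client.config_get("appendonly")
    return before, after


class FakeRedis:
    """Hypothetical in-memory stand-in so the sketch runs without a server."""

    def __init__(self):
        self.config = {"appendonly": "yes"}

    def config_get(self, name):
        # redis-py returns a dict of matching settings; mimic that shape.
        return {k: v for k, v in self.config.items() if k == name}

    def config_set(self, name, value):
        self.config[name] = value
        return True


if __name__ == "__main__":
    # With a real server this would be something like:
    #   import redis; client = redis.Redis(host="localhost", port=6379)
    print(disable_aof(FakeRedis()))
```

A runtime change like this does not update the config file, so the setting would revert on restart unless the file is also edited, and it does not free any disk space by itself; the existing files still had to be moved out of /srv/redis.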

Conclusions

  1. Our monitoring does not catch when the webservers are up but the workers are down. We need to implement monitoring for the worker nodes.
  2. Our LRU policy for redis did not account for the additional disk space required by the AOF (append-only file) persistence strategy. We should switch to RDB (snapshot) persistence.
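
Both conclusions come down to a condition that went unnoticed until requests failed. As one possible illustration of the disk-space side, the sketch below checks how full the filesystem holding a directory is, using only the Python standard library. The function names and the 90% threshold are assumptions made for this example, not part of the ORES monitoring setup.

```python
import shutil


def disk_fraction_used(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.9):
    """Return True when the filesystem is at or above `threshold` full.

    The 0.9 default is an arbitrary illustrative value.
    """
    return disk_fraction_used(path) >= threshold


if __name__ == "__main__":
    # /srv/redis is the directory named in the timeline; "." keeps the
    # demo runnable on any machine.
    print(is_nearly_full("."))
```

A check like this would have to run on the Redis host and feed whatever alerting system is in use; it only reports the condition and does nothing to limit AOF growth.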

Actionables

  • Phab:T121658 -- Switch from AOF to RDB persistence strategy for ORES redis
  • Phab:T121656 -- Add monitoring to ORES workers
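
For the worker-monitoring item, one simple approach is an end-to-end probe that sends a real request and inspects the HTTP status, since in this incident the webservers stayed up and answered with 503 while the workers were down. The following is a minimal sketch under that assumption; the function names and status labels are hypothetical, and the URL to probe is left to the caller.

```python
import urllib.error
import urllib.request


def classify_status(status_code):
    """Map an HTTP status code to a coarse health label."""
    if status_code == 200:
        return "ok"
    if status_code == 503:
        # The symptom seen in this incident: web tier up, workers down.
        return "workers-unavailable"
    return "unknown"


def probe(url, timeout=10):
    """Request `url` and return a health label for the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still available.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return "unreachable"
```

Calling probe() against a URL that requires a worker to produce the response would distinguish "webserver up, workers down" from a fully healthy service, which a plain check of the web frontend cannot do.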