Incidents/20151216-ores

Summary

ORES stopped serving requests for 3 hours because Redis ran out of disk space, which left the workers unable to function.

Timeline

1130 UTC

A report was made via IRC that ORES was returning 503 "server overloaded" errors.

1300 UTC

Halfak saw this message. The workers were restarted and came back online briefly, only to go offline again.

1325 UTC

Web proxy for ores.wmflabs.org was changed to direct traffic to the staging server (CPU usage went to 100%, but most requests were being served effectively).

1345 UTC

Workers and logs are observed. Workers appear to come back online and then crash with an error:

redis.exceptions.ResponseError: MISCONF Errors writing to the AOF file: No space left on device

1350 UTC

Halfak calls Yuvipanda for help. Steps taken:

Hand-edit the Redis config file to turn off AOF
Move the files out of /srv/redis
Restart Redis
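
The first step above was done by hand. As a rough illustration only, the same edit could be scripted along these lines. The function name `disable_aof` is hypothetical and not part of ORES or Redis; the only thing taken from Redis itself is the `appendonly` directive in the config file. The sketch leaves commented-out lines alone and does not touch the files on disk or restart the server.

```python
import re

# Matches an active (uncommented) "appendonly <value>" line.
# [ \t] is used instead of \s so the match never spans a newline.
_APPENDONLY = re.compile(
    r"^([ \t]*appendonly[ \t]+)\S+[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def disable_aof(config_text: str) -> str:
    """Return config_text with every active 'appendonly' directive set to 'no'.

    Hypothetical helper: it only rewrites text. Writing the result back
    to the config file and restarting Redis are separate manual steps.
    """
    return _APPENDONLY.sub(r"\1no", config_text)
```

After writing the modified text back, Redis still has to be restarted for the change to take effect, as in the last step of the list.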

1410 UTC

Web proxy for ores.wmflabs.org was changed to direct traffic back to the prod cluster. precached was turned back on.

1430 UTC

Victory declared. Yuvipanda goes back to sleep and Halfak continues monitoring.

Conclusions

Our monitoring does not catch when the webservers are up but the workers are down. We need to implement monitoring for the worker nodes.
Our LRU policy for Redis did not account for the additional disk space required by the AOF (append-only file) persistence strategy. We should switch to RDB (snapshots).
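
The second conclusion comes down to disk usage growing unnoticed: an eviction policy limits what Redis keeps in memory, not how large the append-only file gets on disk. A minimal sketch of a free-space check that could have flagged this earlier is shown here. The function name `disk_headroom` and the 20% threshold are assumptions made for illustration; only the /srv/redis path in the usage note comes from this report.

```python
import shutil


def disk_headroom(path: str, min_free_fraction: float = 0.2):
    """Return (ok, free_fraction) for the filesystem that holds path.

    ok is False when the free fraction drops below min_free_fraction.
    The 0.2 default is an arbitrary illustrative threshold.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat that as no headroom.
        return False, 0.0
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction
```

A periodic job could call `disk_headroom("/srv/redis")` and raise an alert when `ok` is false.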

Actionables

Phab:T121658 -- Switch from AOF to RDB persistence strategy for ORES redis
Phab:T121656 -- Add monitoring to ORES workers
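
As a sketch of what the worker monitoring could look like, the check below separates "web servers down" from "web servers up but workers down", which is the case our monitoring missed. It assumes that a scoring request returns 503 when the workers cannot serve it, as seen during this incident. The function names, status labels, and any URLs passed in are hypothetical, not an existing ORES or monitoring API.

```python
import urllib.error
import urllib.request


def probe_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for a GET to url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def classify(front_status, scoring_status) -> str:
    """Combine a front-page probe and a scoring probe into one health label."""
    if front_status is None:
        return "web_down"
    if scoring_status == 503:
        return "workers_down"
    if scoring_status == 200:
        return "ok"
    return "degraded"
```

A monitoring job would call `probe_status` once on a cheap front-page URL and once on a URL that requires a worker to compute a score, then alert on anything other than `"ok"` from `classify`.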