Incidents/20151216-ores
Summary
ORES stopped serving requests for about 3 hours because Redis ran out of disk space and the workers could no longer function.
Timeline
- 1130 UTC
- a report was made via IRC that ORES was returning 503 "server overloaded" errors.
- 1300 UTC
- Halfak saw this message. The workers were restarted and came back online briefly, only to go offline again.
- 1325 UTC
- web proxy for ores.wmflabs.org was changed to direct traffic to the staging server (CPU usage went to 100%, but most requests were being served effectively)
- 1345 UTC
- workers and logs are observed. Workers appear to come back online and then crash with an error:
redis.exceptions.ResponseError: MISCONF Errors writing to the AOF file: No space left on device
- 1350 UTC
- Halfak calls Yuvipanda for help. Steps taken:
- hand-edit the Redis config file to turn off AOF
- physically move the files out of /srv/redis
- restart redis
- 1410 UTC
- web proxy for ores.wmflabs.org was changed to direct traffic back to the prod cluster.
- precached turned back on
- 1430 UTC
- Victory declared. Yuvipanda goes back to sleep and Halfak continues monitoring
Conclusions
- Our monitoring does not catch when the webservers are up but the workers are down. We need to implement monitoring for the worker nodes.
- Our LRU policy for Redis did not account for the additional disk space required by the AOF (append-only file) persistence strategy. We should switch to RDB (snapshot) persistence.
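The switch from AOF to RDB described in the last conclusion could be applied to a running server roughly as follows. This is a minimal sketch, assuming a client object that behaves like `redis.Redis` from the redis-py library (it needs `config_set` and `config_rewrite`). The function name `switch_to_rdb` and the snapshot schedule `"900 1"` are illustrative assumptions, not the values used on the ORES cluster.

```python
def switch_to_rdb(client, save_schedule="900 1"):
    """Disable AOF and enable RDB snapshots on a Redis server.

    `client` is assumed to behave like redis.Redis from redis-py.
    The default schedule "900 1" (snapshot after 900 seconds if at
    least 1 key changed) is an illustrative value only.
    """
    # Stop appending every write to the AOF file.
    client.config_set("appendonly", "no")
    # Enable periodic RDB snapshots instead.
    client.config_set("save", save_schedule)
    # Ask Redis to write the change back to its config file so it
    # survives a restart (requires Redis to have been started with one).
    client.config_rewrite()
```

Turning AOF off does not delete the existing AOF file, so the files already filling the disk would still need to be moved or removed, as was done during the incident.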
Actionables
- Phab:T121658 -- Switch from AOF to RDB persistence strategy for ORES redis
- Phab:T121656 -- Add monitoring to ORES workers
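One way to catch the "web servers up, workers down" failure mode is an end-to-end probe that requests a score and treats a 503 as a worker outage. The sketch below is only an illustration: the function names `classify` and `probe` are hypothetical, the URL to probe is left to the caller, and it assumes (based on this incident) that ORES answers 503 when the workers cannot process requests.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status (or None for no response) to a health state."""
    if status is None:
        return "web_down"       # no HTTP response at all
    if status == 503:
        return "workers_down"   # web layer answers, workers do not
    if 200 <= status < 300:
        return "ok"
    return "unknown"


def probe(url, timeout=10):
    """Request `url` and return the health state from classify()."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the status is in err.code.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return "web_down"
```

A monitoring system could call `probe` on a scoring URL at a fixed interval and alert on any result other than `"ok"`. A plain "is the web server responding" check would not distinguish `"workers_down"` from `"ok"`.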