
Incidents/20160925-ores

From Wikitech


Summary

On September 25, 2016, the ORES service had an elevated timeout ratio (~14%) for about six hours because it ran out of disk space due to overly verbose logging.

Timeline

Sept 25, 2016, 10:34:40 UTC: Icinga test on ORES failed due to a timeout.
14:13 UTC: phab:T146581 was created.
16:03 UTC: The fix was deployed in labs.
16:26 UTC: The fix was deployed in prod.

Conclusions

We should monitor disk space more closely and be careful about the verbosity of production service logs.
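
As an illustration of the two points above, here is a minimal Python sketch. It is not the fix that was deployed for ORES. The function names, the logger name "ores_example", and the 90% threshold are hypothetical choices made for this example. It uses only the standard library: shutil.disk_usage to read disk usage and the logging module to cap verbosity.

```python
import logging
import shutil


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the fraction of the disk in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def disk_nearly_full(path: str = ".", threshold: float = 0.9) -> bool:
    """Return True when the filesystem holding `path` is at or above `threshold`.

    The 0.9 default is an assumed value for illustration only.
    """
    usage = shutil.disk_usage(path)
    return used_fraction(usage.total, usage.used) >= threshold


def configure_quiet_logging(logger_name: str = "ores_example") -> logging.Logger:
    """Return a logger that drops DEBUG and INFO records.

    "ores_example" is a hypothetical logger name, not one used by ORES.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.WARNING)
    return logger


if __name__ == "__main__":
    log = configure_quiet_logging()
    log.debug("per-request detail")  # suppressed at WARNING level
    if disk_nearly_full("."):
        log.warning("disk usage is above the configured threshold")
```

A check like this could run periodically and feed an alert, while the WARNING level keeps routine per-request messages out of the production logs.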

Actionables

Make ORES logging less verbose (task T146581)
Add a Grafana monitor for disk space on ORES (task T147163)

