Incidents/2019-03-20 ORES

ORES in CODFW stopped processing requests. The result was sustained overload errors and a growing backlog of requests to process.

Summary

This is a short (<= 1 paragraph) summary of what happened. Please make sure to remove private information.

Timeline

All times in UTC

March 19
- 04:00 - We observe a very high, sustained request rate from Google Cloud in SFO. The sustained request rate brings EQIAD/CODFW near capacity. (grafana of external requests)

March 20
- 15:58 - DNS for oresrdb.svc.codfw.wmnet is switched over to oresrdb2002.
- 14:02 - oresrdb2002 is rebooted for maintenance.
- 14:10 - oresrdb2002 comes back up.
- 14:12:22 - The score cache redis starts and begins loading the redis databases from disk.
- 14:13:57 - The score cache redis finishes loading the file and starts accepting connections. It also begins a full resynchronization from the master.
- 14:12 - ORES codfw stops returning any scores (grafana of scores processed)
- 14:14 - ORES codfw begins to return overload errors (grafana of overload errors)
- 14:40 - Reversal of previous DNS change and forced restart of workers.

[14:40:21] <akosiaris> lemme reverst the switchover of the redis just in case
[14:43:21] <akosiaris> I 'll force a worker restart just to make sure it was that

- 14:43 - ORES codfw begins to return scores again (grafana of scores processed)
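
The 14:12:22 and 14:13:57 entries describe a redis instance that accepts connections while it is still resynchronizing from its master. As a rough illustration only (not the tooling used during this incident), the sketch below parses the text output of the redis INFO command and reports whether an instance has finished loading and, if it is a replica, finished syncing. The function names `parse_redis_info` and `replica_is_ready` are hypothetical; the field names `loading`, `role`, `master_link_status` and `master_sync_in_progress` are standard INFO fields.

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the text returned by the redis INFO command into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_is_ready(info: dict) -> bool:
    """Return True once the dataset is loaded and any replica sync has finished."""
    # "loading:1" means redis is still reading its dump file from disk.
    if info.get("loading") != "0":
        return False
    # A replica (role "slave" in INFO output) must also have a healthy,
    # fully synchronized link to its master.
    if info.get("role") == "slave":
        return (
            info.get("master_link_status") == "up"
            and info.get("master_sync_in_progress") == "0"
        )
    return True
```

A check like this could be polled after a reboot before traffic is pointed at the instance. It assumes the caller already has the INFO output as a string, for example from `redis-cli INFO`.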

Conclusions

Miscommunication between SRE team members led to the backup redis server being rebooted after it had been switched over to serve redis traffic.
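
One way to guard against this kind of mistake is to check, before a reboot, whether the service name currently resolves to the host in question. The following is a minimal sketch under stated assumptions, not an existing tool: the function name `serves_traffic` and its `resolve` parameter are hypothetical, and it only compares IPv4 addresses as returned by `socket.gethostbyname_ex`.

```python
import socket


def serves_traffic(service_name, host_name, resolve=socket.gethostbyname_ex):
    """Return True if service_name currently resolves to an address of host_name.

    `resolve` must behave like socket.gethostbyname_ex, returning a tuple of
    (hostname, alias_list, ip_address_list). It is injectable for testing.
    """
    service_addrs = set(resolve(service_name)[2])
    host_addrs = set(resolve(host_name)[2])
    # Any shared address means clients of the service may be talking to the host.
    return bool(service_addrs & host_addrs)
```

In this incident the service name would be oresrdb.svc.codfw.wmnet and the host would be oresrdb2002 (its fully qualified name is not given here). A True result would mean the reboot should wait until the DNS change has been reverted. Cached DNS answers on clients are not covered by this check.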

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (cookbook / runbook)? If that documentation does not exist, there should be an action item to create it.

Actionables

- Implement a better redis HA solution: https://phabricator.wikimedia.org/T122676