Incidents/2018-11-29 ores

Summary

ores.wikimedia.org returned HTTP 500 for all score requests for roughly three and a half hours on November 29th, from 6:25 to 9:51 UTC. The cause was a config change made as part of upgrading the Celery version used by ORES from three to four, which changed its task serializer.
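This report does not spell out how the serializer change led to the 500s. One plausible mechanism, shown here purely as an assumption, is that the side enqueueing tasks and the side consuming them ended up disagreeing on the serializer. The sketch below uses only the Python standard library; `encode_task`, `decode_task`, and `handle_request` are hypothetical names for illustration and are not ORES or Celery APIs.

```python
import json
import pickle

# Hypothetical serializer registry: name -> (dumps, loads).
# This imitates the idea of a configurable task serializer; it is not Celery code.
SERIALIZERS = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}


def encode_task(payload, serializer):
    """Producer side: serialize the payload and tag it with the serializer name."""
    dumps, _ = SERIALIZERS[serializer]
    return {"content_type": serializer, "body": dumps(payload)}


def decode_task(message, accepted):
    """Consumer side: refuse any content type that is not explicitly accepted."""
    content_type = message["content_type"]
    if content_type not in accepted:
        raise ValueError(f"refusing content type {content_type!r}")
    _, loads = SERIALIZERS[content_type]
    return loads(message["body"])


def handle_request(payload, producer_serializer, consumer_accepts):
    """Return an HTTP-like status: 200 if the task round-trips, 500 otherwise."""
    try:
        decode_task(encode_task(payload, producer_serializer), consumer_accepts)
        return 200
    except ValueError:
        return 500


if __name__ == "__main__":
    task = {"rev_id": 12345}  # made-up payload for illustration
    print(handle_request(task, "pickle", {"pickle"}))  # both sides agree
    print(handle_request(task, "json", {"pickle"}))    # only one side changed
```

In this toy model every request fails as soon as only one side switches serializer, which would fit an outage that begins when a service restart picks up a config merged the day before. Whether that is what actually happened here is not confirmed by this report.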

Timeline

  • November 28th, 12:04 UTC: The problematic Puppet change was merged
  • November 29th, 6:25 UTC: Logrotate restarted the uwsgi services of ORES, causing them to pick up the new config and start sending 500s
  • 9:51 UTC: The revert was created and deployed

Conclusions

  • Puppet should bind the ORES services to the ORES configs so that the services pick up config changes right away.
  • Logrotate should restart services at a better time. This is not really doable.
  • Contact numbers of WMDE staff should be available to SREs.

Links to relevant documentation

Actionables