Incidents/2019-05-06 zuul


Summary

CI was unresponsive and did not vote on Gerrit patches; afterwards it processed changes with a very slow response time. These were two distinct events with different root causes, but both resulted in a bad user experience.

Impact

The incident affected Zuul, the CI system, and therefore every CI user: mainly SREs, developers, and volunteers, whose patches either got no CI jobs at all or, later, only with a very large delay.

Detection

An Icinga alert fired but was not immediately taken into account by Release Engineering.

SREs reported errors in #-operations, #-releng, #-security, and directly to Antoine "hashar" Musso.

Timeline

All times in UTC.

Incident 1

  • 09:26: Antoine deploys a Zuul config change (https://phabricator.wikimedia.org/T105474).
  • 10:00: CI-reported processing times skyrocket for many repositories.
  • 10:33: second-to-last operations/puppet patch processed.
  • 10:35: Giuseppe (and then Arturo) notice CI is down. Attempts to reach anyone in Release Engineering on IRC fail.
  • 11:18: final operations/puppet patch processed before the fix.
  • 11:30 (approx.): people complain about CI not working in #wikimedia-operations.
  • 11:51: seeing how everyone's work is blocked, Giuseppe opens https://phabricator.wikimedia.org/T222605 and sets its priority to Unbreak Now!. Further attempts by several people to reach anyone in #wikimedia-releng fail.
  • 12:11: first automated alert about the issue: PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 (see the sketch after this timeline).
  • 12:56: mark attempts to reach Antoine on IRC (but Antoine misses it due to other notifications).
  • 14:00 (approx.): Antoine notices the problem.
  • 14:04: the Zuul patch that broke the gate-and-submit queue is reverted (https://phabricator.wikimedia.org/T222605).
  • 14:08: operations/puppet patches begin to be processed again.
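
The 12:11 alert fires on the number of work requests waiting in Zuul's Gearman server (graphed on the linked Grafana dashboard, critical above 140). As a rough illustration of what that number represents, the sketch below polls the standard Gearman admin protocol's "status" command directly; the hostname, port, and helper names are assumptions for this example and not part of the production check.

    #!/usr/bin/env python3
    # Illustrative sketch only: count work requests waiting in a Gearman server.
    # Assumptions (not taken from this report): the Zuul Gearman server listens on
    # contint1001 port 4730 and speaks the standard Gearman admin protocol, whose
    # "status" command returns one line per function:
    #   <function>\t<total jobs>\t<running>\t<available workers>
    import socket

    GEARMAN_HOST = 'contint1001.wikimedia.org'  # assumed hostname
    GEARMAN_PORT = 4730                         # Gearman default port

    def gearman_status(host=GEARMAN_HOST, port=GEARMAN_PORT):
        """Send the admin 'status' command and return the response lines."""
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(b'status\n')
            data = b''
            while not data.endswith(b'.\n'):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        return data.decode().splitlines()

    def waiting_requests(lines):
        """Sum jobs that are queued but not yet running, across all functions."""
        waiting = 0
        for line in lines:
            if line == '.':
                continue  # end-of-response marker
            _name, total, running, _workers = line.split('\t')
            waiting += int(total) - int(running)
        return waiting

    if __name__ == '__main__':
        # During incident 2, most of the waiting entries were merger:merge functions.
        print('waiting work requests:', waiting_requests(gearman_status()))

The production check appears to evaluate this queue depth from Graphite data rather than by polling Gearman directly, which is why the alert message reports a percentage of recent datapoints above the 140 threshold.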

Incident 2

(Figure caption) View of the Gearman queue from 14:00 UTC to 16:30 UTC. The peak is reached at 14:30 UTC and is slowly drained by the two zuul-merger processes running on contint1001 and contint2001. Around 16:10 UTC, the quick drain is due to deliberately making those jobs fail fast.

Conclusions

What went well?

  • In both cases the root cause was fairly easy to pinpoint: respectively, a config change and a huge number of merger:merge functions enqueued.

What went poorly?

  • It was difficult to get in touch with the people who could fix the issue: IRC pings and a UBN! ticket went unseen. (Perhaps a phone call would have been better?)
  • There is little shared knowledge about the CI stack beyond a few people; during European hours it is covered only by Antoine.

Where did we get lucky?

  • Root causes were "easy"

Links to relevant documentation


Actionables