Incident documentation/20150106-Zuul

From Wikitech
Jump to: navigation, search

Summary

On Jan 6th 2015 slightly after midnight, Zuul ended up stalled completely. The Zuul status page had a single change with all jobs completed and the events queue was filling not being processed. The reason was a ssh connection stalled while reporting a change result which is a blocking operation with no timeout. Restarting Gerrit at 8:35am UTC killed the stalled connection and unleashed Zuul. It dropped all events and resumed operations.

Timeline

00:06:46 UTC Zuul attempted to report results for https://gerrit.wikimedia.org/r/#/c/182962/ and the ssh gerrit connection stalled for some reason

01:13 UTC Aude reports on ops and qa mailling list that Zuul / Jenkins are broken

01:30 UTC Chris McMahon and others restarted the Jenkins Gearman connection based on the "Gearman deadlock" known issue and described at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues

08:10 UTC Antoine notices a +2ed patch did not get merged in

08:10 - 08:38 UTC Antoine investigate (see below)

08:38 UTC Andrew Bogott restart Gerrit killing the stalled TCP connection. Zuul drop its whole event queue and resume work

10:30 UTC Gerrit started to spurts out errors about not being to reach the ReviewDB database. That caused Zuul to be stuck in exactly the same way. A restart of Zuul fixed it.

Diagnostic

Diagnostic done by Antoine on gallium.wikimedia.org

The Zuul Gearman embedded daemon had all jobs properly registered with available workers (zuul-gearman.py status).

Some jobs were listed has being completed (zuul show running-jobs).

The debug log (/var/log/zuul/debug.log) had little activity beside receiving events from Gerrit.

I requested Zuul to dump a stacktrace by sending SIGUSR2 to the main server process (https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Debugging). The interesting thread trace is:

Thread: 140085626386176
  File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 784, in run
    while pipeline.manager.processQueue():
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1358, in processQueue
    item, nnfi, ready_ahead)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1330, in _processOneItem
    self.reportItem(item)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1415, in reportItem
    item.reported = not self._reportItem(item)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1456, in _reportItem
    ret = self.sendReport(actions, item.change, report)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1050, in sendReport
    ret = action_reporter.report(change, message)
  File "/usr/local/lib/python2.7/dist-packages/zuul/model.py", line 293, in report
    return self.reporter.report(change, message, self.params)
  File "/usr/local/lib/python2.7/dist-packages/zuul/reporter/gerrit.py", line 37, in report
    params)
  File "/usr/local/lib/python2.7/dist-packages/zuul/lib/gerrit.py", line 125, in review
    out, err = self._ssh(cmd)
  File "/usr/local/lib/python2.7/dist-packages/zuul/lib/gerrit.py", line 198, in _ssh
    out = stdout.read()
  File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 134, in read
    new_data = self._read(self._DEFAULT_BUFSIZE)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1215, in _read
    return self.channel.recv(size)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 584, in recv
    out = self.in_buffer.read(nbytes, self.timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 137, in read
    self._cv.wait(timeout)
  File "/usr/lib/python2.7/threading.py", line 243, in wait
    waiter.acquire()

That shows Zuul scheduler waiting for reporter/gerrit.py report() to reply, itself does a ssh connection to Gerrit to add a comment and votes on the change. That connection is stalled which is blocking the main scheduling thread. Killing that connection would unblock Zuul.

When restarting Gerrit, Zuul ended up not being able to process any event since it could not fetches informations from Gerrit. That caused Zuul to drop roughly 280 events from its queue. Users thus have to retrigger their patches (by commenting 'recheck') or re gate them (by removing themselves as a reviewer and re voting +2).

Conclusions

Zuul lacks a timeout when reporting a change.

When Gerrit is not reachable, Zuul drops them with a ssh exception when it should probably queue them and wait for Gerrit to resume.

The issue could have been fixed by restarting the Zuul service on gallium.

The lost of the events queue seems to be addressed by upstream patch Process event multiple times.

Actionables

  • Ask on ops list whether anyone thought about restarting Zuul
  • Train other people to be able to investigate Zuul issue
  • Report bug to upstream
  • Filled bug T85916 about the error "Gerrit Cannot open ReviewDb"