Incidents/20150106-Zuul
Summary
On Jan 6th 2015 slightly after midnight, Zuul ended up stalled completely. The Zuul status page had a single change with all jobs completed and the events queue was filling not being processed. The reason was a ssh connection stalled while reporting a change result which is a blocking operation with no timeout. Restarting Gerrit at 8:35am UTC killed the stalled connection and unleashed Zuul. It dropped all events and resumed operations.
Timeline
00:06:46 UTC Zuul attempted to report results for https://gerrit.wikimedia.org/r/#/c/182962/ and the ssh gerrit connection stalled for some reason
01:13 UTC Aude reports on ops and qa mailling list that Zuul / Jenkins are broken
01:30 UTC Chris McMahon and others restarted the Jenkins Gearman connection based on the "Gearman deadlock" known issue and described at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
08:10 UTC Antoine notices a +2ed patch did not get merged in
08:10 - 08:38 UTC Antoine investigate (see below)
08:38 UTC Andrew Bogott restart Gerrit killing the stalled TCP connection. Zuul drop its whole event queue and resume work
10:30 UTC Gerrit started to spurts out errors about not being to reach the ReviewDB database. That caused Zuul to be stuck in exactly the same way. A restart of Zuul fixed it.
Diagnostic
Diagnostic done by Antoine on gallium.wikimedia.org
The Zuul Gearman embedded daemon had all jobs properly registered with available workers (zuul-gearman.py status).
Some jobs were listed has being completed (zuul show running-jobs).
The debug log (/var/log/zuul/debug.log
) had little activity beside receiving events from Gerrit.
I requested Zuul to dump a stacktrace by sending SIGUSR2 to the main server process (https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Debugging). The interesting thread trace is:
Thread: 140085626386176 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap self.__bootstrap_inner() File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner self.run() File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 784, in run while pipeline.manager.processQueue(): File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1358, in processQueue item, nnfi, ready_ahead) File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1330, in _processOneItem self.reportItem(item) File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1415, in reportItem item.reported = not self._reportItem(item) File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1456, in _reportItem ret = self.sendReport(actions, item.change, report) File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1050, in sendReport ret = action_reporter.report(change, message) File "/usr/local/lib/python2.7/dist-packages/zuul/model.py", line 293, in report return self.reporter.report(change, message, self.params) File "/usr/local/lib/python2.7/dist-packages/zuul/reporter/gerrit.py", line 37, in report params) File "/usr/local/lib/python2.7/dist-packages/zuul/lib/gerrit.py", line 125, in review out, err = self._ssh(cmd) File "/usr/local/lib/python2.7/dist-packages/zuul/lib/gerrit.py", line 198, in _ssh out = stdout.read() File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 134, in read new_data = self._read(self._DEFAULT_BUFSIZE) File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1215, in _read return self.channel.recv(size) File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 584, in recv out = self.in_buffer.read(nbytes, self.timeout) File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 137, in read self._cv.wait(timeout) File "/usr/lib/python2.7/threading.py", line 243, in wait waiter.acquire()
That shows Zuul scheduler waiting for reporter/gerrit.py report() to reply, itself does a ssh connection to Gerrit to add a comment and votes on the change. That connection is stalled which is blocking the main scheduling thread. Killing that connection would unblock Zuul.
When restarting Gerrit, Zuul ended up not being able to process any event since it could not fetches informations from Gerrit. That caused Zuul to drop roughly 280 events from its queue. Users thus have to retrigger their patches (by commenting 'recheck') or re gate them (by removing themselves as a reviewer and re voting +2).
Conclusions
Zuul lacks a timeout when reporting a change.
When Gerrit is not reachable, Zuul drops them with a ssh exception when it should probably queue them and wait for Gerrit to resume.
The issue could have been fixed by restarting the Zuul service on gallium.
The lost of the events queue seems to be addressed by upstream patch Process event multiple times.
Actionables
- Ask on ops list whether anyone thought about restarting Zuul
- Train other people to be able to investigate Zuul issue
- Report bug to upstream
- Filled bug T85916 about the error "Gerrit Cannot open ReviewDb"