Incidents/2019-03-27 gerrit

Summary

Gerrit's CPU consumption has been elevated for some time, possibly because regularly-scheduled git gc was disabled on 2019-03-05 (see Timeline). Slow page loads have been noticeable. Today Gerrit stopped responding to queries entirely and had to be restarted, after which it worked again.

Impact

  • 800 requests returned a 5xx response from Apache during the outage
  • 5 servers failed their Puppet runs
  • Over the past several days there have been CI failures due to git pull timeouts
  • Over the past several days users have reported slow clones and slow patch uploads

Detection

Automated monitoring and human users noticed Gerrit becoming fully unresponsive at approximately the same time.

Timeline

All times in UTC.

  • 2019-03-05: regularly-scheduled git gc disabled to prevent corruption of the packfile list due to concurrent access during GC (jGit issue 544199)
  • approx 2019-03-20: Gerrit CPU usage begins running higher than usual
  • 2019-03-27 18:39: two users report that Gerrit seems down OUTAGE BEGINS
  • 18:40: Icinga reports socket timeouts for Gerrit
  • 18:47: thcipriani attempts to use jstack to gather stack traces but gets the error "Unable to open socket file: target process not responding or HotSpot VM not loaded"; re-runs with -F (see the command sketch after this timeline), which outputs:
    Attaching to process ID 13929, please wait...
    Debugger attached successfully.
    Server compiler detected.
    JVM version is 25.181-b13
    Deadlock Detection:
    
    Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
    
  • 18:48: thcipriani restarts the Gerrit service
  • 18:51: Icinga reports recovery OUTAGE ENDS
  • 19:27: thcipriani makes a local backup of the mediawiki/core repository and runs gerrit gc on it (see the gc sketch after this timeline)
  • 19:39: that gerrit gc completes; the repository shrank by approximately 353MB
  • 19:51: thcipriani repeats the process for the operations/puppet repository
  • 20:0x (exact time not noted): gerrit gc completes for operations/puppet, saving approximately 121MB
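
For reference, a minimal sketch of the thread-dump collection attempted at 18:47, in the spirit of the stack-trace documentation TODO under Actionables. The gerrit2 user, the pgrep pattern, and the output paths are assumptions about the host, not recorded facts from the incident:

    # Identify the Gerrit JVM PID (13929 during this incident); the match
    # pattern may need adjusting for the actual daemon command line.
    PID=$(pgrep -f 'gerrit.war daemon' | head -n1)

    # Normal thread dump; during this incident it failed with
    # "Unable to open socket file: target process not responding".
    sudo -u gerrit2 jstack "$PID" > /tmp/gerrit-threads.txt

    # Forced dump via the HotSpot serviceability agent; this produced the
    # output quoted above, but still could not print deadlock information.
    sudo -u gerrit2 jstack -F "$PID" > /tmp/gerrit-threads-forced.txt

The manual gc pass followed the same pattern for both repositories. The exact invocation was not recorded; the sketch below assumes the backup is a plain copy of the bare repository on disk and that gc is run through Gerrit's SSH command (host, port, paths, and user are illustrative):

    # Back up the bare repository before touching it (path is illustrative).
    sudo cp -a /srv/gerrit/git/mediawiki/core.git /srv/backup/mediawiki-core.git.bak

    # Record the size before and after to measure the savings
    # (approx 353MB for mediawiki/core, approx 121MB for operations/puppet).
    du -sh /srv/gerrit/git/mediawiki/core.git

    # Use Gerrit's own gc command rather than plain `git gc` so that JGit
    # manages the packfile list itself (the concurrent-access problem
    # behind jGit issue 544199).
    ssh -p 29418 <admin-user>@gerrit.wikimedia.org gerrit gc mediawiki/core

    du -sh /srv/gerrit/git/mediawiki/core.git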

Conclusions

What weaknesses did we learn about and how can we address them?

What went well?

  • Automated monitoring quickly detected the actual outage
  • Recovered from the outage relatively quickly
  • Lots of people were paying attention, so response was quick
  • Staggered Puppet runs meant that the impact on servers was minimal

What went poorly?

  • Unclear whether our theory of the root cause is actually correct; time will tell

Where did we get lucky?

  • The incident occurred while thcipriani was around
  • The release train was not impacted

Links to relevant documentation

No runbook was linked here at the time of this incident; see the documentation TODOs under Actionables below.

Actionables

  • task T184086: Install Gerrit's prometheus_exporter plugin so we have clearer visibility into Gerrit internals
  • 499289: Now that it should be safe to do so, re-enable regular git gc
  • TODO: update Gerrit documentation with things to check in case of high load (a starting-point sketch follows this list)
  • TODO: update Gerrit documentation on how to grab a stack trace (thcipriani had to frantically ctrl-r / search IRC scrollback for the commands)
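
As a starting point for the high-load documentation TODO above, a hedged sketch of what to look at when Gerrit is slow but still answering, using Gerrit's standard SSH admin commands. Host, port, and user are illustrative, and the caller needs the corresponding view/admin capabilities:

    # Queued and running tasks; a long backlog of git-upload-pack or gc
    # tasks is a common sign of overload.
    ssh -p 29418 <admin-user>@gerrit.wikimedia.org gerrit show-queue --wide --by-queue

    # Cache hit rates plus JVM memory, threads, and uptime.
    ssh -p 29418 <admin-user>@gerrit.wikimedia.org gerrit show-caches --show-jvm

    # Currently open SSH connections and what each one is doing.
    ssh -p 29418 <admin-user>@gerrit.wikimedia.org gerrit show-connections --wide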