Incidents/20140211-Parsoid

From Wikitech

Summary

Verbose logging combined with broken log rotation filled the disks on about three quarters of the Parsoid nodes, which caused the Parsoid daemons to stop accepting requests. This led to user-visible errors for VisualEditor users during a 17-minute window. It is estimated that fewer than 10% of VE page loads and saves were affected during this period.

Timeline

All times are UTC on Tuesday, 11 February 2014 (03:00 UTC = 7pm PST on Monday evening):

03:02 First disk space alerts for wtp* [1]
03:06 First connection refused alert
03:10 <springle> those look real. root full on wtp1008
03:12 Most wtp* servers now refusing connections [2]
03:12 Sean removes log file on wtp1008 and restarts service
03:14 parsoid.svc.eqiad.wmnet LVS check goes CRITICAL, sends pages but not to me
03:16 wtp1008 comes back up
03:17 Roan gets on IRC, having been dragged out of a conversation by Erik
03:20 wtp1021 and the LVS check come back up for unexplained reasons
03:20 Roan saves a copy of wtp1005's log for analysis; later discovered it was the wrong file
03:23 Roan starts a rolling restart of the Parsoid cluster using the command documented on wikitech
03:26 The LVS check goes CRITICAL again; wtp10{01,02,04,10} go down
03:28 Roan uses the old init script to restart Parsoid instead
03:29 Entire Parsoid cluster comes back up

Conclusions

  • Log rotation in Puppet was not properly tested, and did not run often enough to prevent failures
  • Current Parsoid logging via stdout/stderr redirection can block. Work on async logging is ongoing, but was not ready before this outage.
  • Disk space monitoring on Parsoid boxes should trigger much earlier
  • The Parsoid tests need to check logging volume more closely (a recursion bug in the error logging code produced megabytes of log data per error)
  • Salt restarts were using the old init script instead of upstart, see bug
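
The blocking problem in the second conclusion can be illustrated with a small sketch. Parsoid is a Node.js service, so the Python below is not its logging backend; it only shows the general idea of async logging under stated assumptions: a bounded in-memory queue sits between the service and the log sink, and records are dropped (and counted) when the sink stalls, for example on a full disk. The names `DroppingQueueHandler`, `make_async_logger`, and `max_pending` are hypothetical.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue records without blocking; drop and count them when the queue is full."""

    def __init__(self, q):
        super().__init__(q)
        self.dropped = 0  # number of records discarded because the queue was full

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # The sink is stalled (e.g. the disk is full): lose the log line
            # instead of blocking the request that produced it.
            self.dropped += 1


def make_async_logger(name, sink, max_pending=1000):
    """Wire a logger to `sink` (any logging.Handler) through a bounded queue.

    Returns (logger, listener, handler). The caller starts and stops the
    listener, which drains the queue on a background thread.
    """
    q = queue.Queue(maxsize=max_pending)
    handler = DroppingQueueHandler(q)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from any blocking root handler
    logger.addHandler(handler)
    listener = logging.handlers.QueueListener(q, sink)
    return logger, listener, handler
```

The design choice here is to prefer losing log lines over losing availability: the request path only ever does a non-blocking `put_nowait`, and the `dropped` counter can be exported to monitoring so that a stalled sink is still visible.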

Actionables

  • Status: Done - Fix log rotation, run it hourly instead of daily
  • Status: Done - Remove old init scripts and update documentation on the log file path
  • Status: Done - Lower the warning threshold on Parsoid node disk space to provide time to react
  • Status: Done - Finish migration to async logging backend in Parsoid so that a full disk does not affect service availability
  • Status: Unresolved - Check the logging volume in Parsoid unit tests; this is less critical once logging is async
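
The unresolved item on logging volume could be approached with a per-test log budget. The following is a minimal sketch in Python, not code from the Parsoid test suite: it counts the bytes a logger emits while an action runs and fails if they exceed a limit. The helper names and the 4096-byte default are assumptions chosen for illustration.

```python
import logging


class ByteCountingHandler(logging.Handler):
    """Count the bytes a logger would write, without writing anything."""

    def __init__(self):
        super().__init__()
        self.bytes_logged = 0

    def emit(self, record):
        line = self.format(record) + "\n"
        self.bytes_logged += len(line.encode("utf-8"))


def measure_log_bytes(logger, action):
    """Run `action()` and return how many bytes `logger` emitted meanwhile."""
    counter = ByteCountingHandler()
    logger.addHandler(counter)
    try:
        action()
    finally:
        logger.removeHandler(counter)
    return counter.bytes_logged


def assert_log_budget(logger, action, max_bytes=4096):
    """Fail if `action()` logs more than `max_bytes` (hypothetical default)."""
    used = measure_log_bytes(logger, action)
    if used > max_bytes:
        raise AssertionError(
            f"logged {used} bytes, budget is {max_bytes} bytes"
        )
    return used
```

A test that triggers a single error path and wraps it in `assert_log_budget` would catch a regression like the recursion bug described above, where one error produced megabytes of log data.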