Incidents/20140211-Parsoid

Summary

Verbose logging combined with broken log rotation filled the disks on about three quarters of the Parsoid nodes, which caused the Parsoid daemons to stop accepting requests. This produced user-visible errors for VisualEditor (VE) users during a 17-minute window. It is estimated that fewer than 10% of VE page loads and saves were affected during this period.

Timeline

All times UTC on Tuesday the 11th (03:00 UTC = 7pm PST on Monday evening):

03:02 First disk space alerts for wtp* [1]
03:06 First connection refused alert
03:10 <springle> those look real. root full on wtp1008
03:12 Most wtp* servers now refusing connections [2]
03:12 Sean removes log file on wtp1008 and restarts service
03:14 parsoid.svc.eqiad.wmnet LVS check goes CRITICAL, sends pages but not to me
03:16 wtp1008 comes back up
03:17 Roan gets on IRC, having been dragged out of a conversation by Erik
03:20 wtp1021 and the LVS check magically come back up (??)
03:20 Roan saves a copy of wtp1005's log for analysis; later discovered it was the wrong file
03:23 Roan starts a rolling restart of the Parsoid cluster using the command documented on wikitech
03:26 The LVS check goes CRITICAL again; wtp10{01,02,04,10} go down
03:28 Roan uses the old init script to restart Parsoid instead
03:29 Entire Parsoid cluster comes back up

Conclusions

The log rotation configured in Puppet was not properly tested and did not run often enough to prevent failures.
Parsoid currently logs via stdout/stderr redirection, which can block. Work on async logging is ongoing, but was not ready before this outage.
Disk space monitoring on Parsoid boxes should alert much earlier.
Logging volume needs to be checked more carefully in the Parsoid tests: a recursion bug in the error logging code produced megabytes of log data per error.
Salt restarts were using the old init script instead of upstart; see bug.
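
The blocking problem noted above can be illustrated with a small sketch. This is not Parsoid's code (Parsoid is a Node.js service), and it is not the async logging backend referred to in this report. It is a minimal Python illustration under stated assumptions: the names DroppingQueueHandler, build_async_logger, start_writer and max_pending are hypothetical. The idea is to put a bounded in-memory queue between the request path and the log writer, so that a slow or full disk causes log records to be dropped instead of stalling the service.

```python
import logging
import logging.handlers
import queue
import threading


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Hypothetical handler: enqueue without blocking, drop on overflow."""

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0  # number of records discarded because the queue was full

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)  # never blocks the caller
        except queue.Full:
            self.dropped += 1


def build_async_logger(name, max_pending=1000):
    """Return (logger, handler); records wait in a bounded in-memory queue."""
    handler = DroppingQueueHandler(queue.Queue(maxsize=max_pending))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root logger
    logger.addHandler(handler)
    return logger, handler


def start_writer(log_queue, sink):
    """Drain log_queue into sink on a background thread until None arrives."""

    def drain():
        while True:
            record = log_queue.get()
            if record is None:  # shutdown marker
                break
            sink.handle(record)  # only this thread can block on the disk

    thread = threading.Thread(target=drain, daemon=True)
    thread.start()
    return thread
```

In this sketch the request path only ever calls put_nowait, so the worst case under a full disk is lost log lines, counted in handler.dropped. Putting None on the queue stops the writer thread. Whether dropping records is acceptable is a design choice that the report does not settle.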

Actionables

Status: Done - Fix log rotation, run it hourly instead of daily
Status: Done - Remove old init scripts and update documentation on the log file path
Status: Done - Lower the warning threshold on Parsoid node disk space to provide time to react (RT 6851)
Status: Done - Finish migration to async logging backend in Parsoid so that a full disk does not affect service availability
Status: Unresolved - Check the logging volume in Parsoid unit tests (less critical once logging is async)
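
The unresolved item above, checking logging volume in tests, could look roughly like the following. This is a hedged Python sketch, not taken from the Parsoid test suite. ByteCountingHandler, logged_bytes and assert_log_budget are hypothetical names, and any byte budget passed in is an assumption to be tuned per test. It counts the bytes a piece of code writes to a logger and fails when a budget is exceeded, which is the kind of check that would flag an error path producing megabytes of output.

```python
import logging


class ByteCountingHandler(logging.Handler):
    """Counts the bytes each formatted record would occupy in a log file."""

    def __init__(self):
        super().__init__()
        self.total_bytes = 0

    def emit(self, record):
        line = self.format(record) + "\n"
        self.total_bytes += len(line.encode("utf-8"))


def logged_bytes(logger, action):
    """Run action() and return how many bytes it logged through logger."""
    counter = ByteCountingHandler()
    logger.addHandler(counter)
    try:
        action()
    finally:
        logger.removeHandler(counter)  # leave the logger as it was found
    return counter.total_bytes


def assert_log_budget(logger, action, max_bytes):
    """Fail if action() logs more than max_bytes (a hypothetical budget)."""
    used = logged_bytes(logger, action)
    if used > max_bytes:
        raise AssertionError(f"logged {used} bytes, budget is {max_bytes}")
    return used
```

A test would wrap a single failing request in assert_log_budget with a budget of a few kilobytes. The count only covers records the logger actually passes to its handlers, so the logger's level should match production settings.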