Incidents/20140910-swift-syslog
Summary
The swift frontends were maxed out on CPU following an rsyslog configuration change, which impacted the image scaler cluster and regular swift traffic.
Timeline
- 2014-09-10T08:52 https://gerrit.wikimedia.org/r/#/c/159348/ is submitted and merged, changing rsyslog and swift proxy configuration
- 2014-09-10T08:58 first alarm, HTTP timeout on ms-fe1002
- 2014-09-10T08:59 impact seen on image scalers, LVS alarm for rendering.svc.eqiad.wmnet
- 2014-09-10T09:01 HTTP 5xx alarm
- 2014-09-10T09:11 rolling restart of swift frontends
- 2014-09-10T09:12 LVS recovery for ms-fe.eqiad.wmnet
- 2014-09-10T09:14 LVS recovery for rendering.svc.eqiad.wmnet
Conclusions
This was a recurrence of Incident_documentation/20131205-Swift, in which swift busy-loops when the syslog socket goes away. The issue was thought to be fixed in the latest upstream swift, and the fix was confirmed during testing. As this incident demonstrates, however, that testing did not replicate the exact conditions under which the problem recurs. Extra care should be taken when deploying rsyslog configuration changes that restart rsyslog as a side effect.
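For illustration only, the general failure mode (a client retrying in a tight loop once its syslog socket disappears) can be avoided by backing off between reconnect attempts and dropping messages in the meantime. The sketch below is not swift's code or the upstream fix; the class name `BackoffSyslogSender`, its parameters, and the default delays are all hypothetical.

```python
import socket
import time


class BackoffSyslogSender:
    """Hypothetical helper: send datagrams to a syslog Unix socket and back
    off, instead of retrying in a tight loop, when the socket is gone."""

    def __init__(self, address="/dev/log", min_delay=0.5, max_delay=30.0,
                 clock=time.monotonic):
        self.address = address
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.clock = clock          # injectable so the backoff can be tested
        self.sock = None
        self.delay = min_delay      # delay to apply after the next failure
        self.next_attempt = 0.0     # earliest time a reconnect may be tried
        self.dropped = 0            # messages discarded while disconnected

    def _connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sock.connect(self.address)
        except OSError:
            sock.close()
            raise
        return sock

    def _schedule_retry(self, now):
        # Exponential backoff, capped at max_delay.
        self.next_attempt = now + self.delay
        self.delay = min(self.delay * 2, self.max_delay)

    def send(self, message):
        """Return True if the message was sent, False if it was dropped."""
        now = self.clock()
        if self.sock is None:
            if now < self.next_attempt:
                # Still backing off: drop the message, do not touch the socket.
                self.dropped += 1
                return False
            try:
                self.sock = self._connect()
                self.delay = self.min_delay
            except OSError:
                self._schedule_retry(now)
                self.dropped += 1
                return False
        try:
            self.sock.send(message.encode("utf-8"))
            return True
        except OSError:
            # Socket went away (e.g. the syslog daemon restarted).
            self.sock.close()
            self.sock = None
            self._schedule_retry(now)
            self.dropped += 1
            return False
```

The trade-off in this sketch is that log lines are lost while rsyslog is unavailable, in exchange for the caller never spinning on a dead socket.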
Related graphs:
- ganglia CPU http://ganglia.wikimedia.org/latest/?r=custom&cs=9%2F10%2F2014+8%3A51&ce=9%2F10%2F2014+9%3A20&m=cpu_report&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
- imagescaler apache workers sending replies http://ganglia.wikimedia.org/latest/?r=custom&cs=9%2F10%2F2014+8%3A51&ce=9%2F10%2F2014+9%3A20&m=ap_sending_reply&s=by+name&c=Image+scalers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
Actionables
See related actions for Incident_documentation/20131205-Swift