Incidents/2017-11-25 statsd

Summary

A change in service-runner was deployed on Nov 22nd which caused an ever-increasing amount of metrics to be sent to statsd, eventually overwhelming the machine with UDP traffic.

Timeline

20171121 The service-runner pull request to report GC metrics is merged, https://github.com/wikimedia/service-runner/pull/170
20171121T1344 cpjobqueue is deployed with GC metrics reporting enabled, http://tools.wmflabs.org/sal/log/AV_e0yGGF4fsM4DBdggW
20171123T0558 cxserver is deployed with GC metrics reporting enabled, http://tools.wmflabs.org/sal/log/AV_ndQBAwg13V6286Gp_
20171125T1117 The "carbon frontend relay drops" alarm starts going off
20171125T1305 Filippo starts investigating and bug T181333 is opened
20171125T1340 statsd traffic from scb machines is blocked to let statsd recover
20171125T1410 cxserver and cpjobqueue are roll-restarted to alleviate the metrics "leak"
20171125T1545 The root cause is found and Petr rolls back service-runner for cpjobqueue
20171125T1605 scb statsd traffic is unblocked
20171125T1609 Kartik rolls back cxserver to the previous version of service-runner

Conclusions

Services sending statsd traffic over UDP receive no feedback from the aggregation server, so each client needs to pace itself in order to avoid overwhelming the server. In addition, the alerts on the aggregation server covered its graphite carbon traffic, i.e. the stage after statsd aggregation has happened. Those alerts fired only once the machine was overwhelmed with network traffic and carbon traffic (over TCP) was impacted too.
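As an illustration of client self-pacing, the following is a minimal sketch and not the service-runner implementation. It assumes a per-window packet budget and relies on the standard statsd counter syntax, where a trailing `|@rate` tells the server how to scale a sampled value. The class name `AdaptiveSampler`, its parameters, and the metric name are hypothetical.

```python
import random
import time


class AdaptiveSampler:
    """Hypothetical client-side pacer that caps statsd packets per window."""

    def __init__(self, max_per_window, window=1.0,
                 clock=time.monotonic, rng=random.random):
        self.max_per_window = max_per_window  # assumed packet budget per window
        self.window = window                  # window length in seconds
        self.clock = clock
        self.rng = rng
        self.window_start = clock()
        self.seen = 0     # metrics offered during the current window
        self.rate = 1.0   # sample rate applied during the current window

    def _roll_window(self):
        now = self.clock()
        if now - self.window_start >= self.window:
            # Choose the next rate so that the volume seen in the previous
            # window would have fit within the budget.
            if self.seen:
                self.rate = min(1.0, self.max_per_window / self.seen)
            else:
                self.rate = 1.0
            self.seen = 0
            self.window_start = now

    def format_counter(self, name, value=1):
        """Return a statsd line to send, or None if the metric is sampled out."""
        self._roll_window()
        self.seen += 1
        if self.rng() >= self.rate:
            return None
        if self.rate < 1.0:
            # "|@rate" lets the statsd server scale the counter back up.
            return f"{name}:{value}|c|@{self.rate:g}"
        return f"{name}:{value}|c"
```

The rate only adapts at window boundaries, so a sudden burst within the first window is not limited. A production version would likely also need to bound the number of distinct metric names.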

Actionables

Alert on statsd UDP loss (phab:T181382)
Adaptive metrics auto-sampling in service-runner (phab:T181382)
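
One possible way to approach the first actionable on a Linux host is to compare two snapshots of the kernel's UDP counters, which are exposed in /proc/net/snmp as a header line and a value line that both start with `Udp:`. The sketch below is an assumption about how such a check could look, not the alert that was actually deployed. The function names and the 1% threshold are hypothetical, and the counters are host-wide rather than specific to the statsd port.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp: section found")
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, map(int, values)))


def udp_loss_ratio(before, after):
    """Fraction of inbound datagrams dropped between two snapshots."""
    dropped = after["InErrors"] - before["InErrors"]
    received = after["InDatagrams"] - before["InDatagrams"]
    total = dropped + received
    return dropped / total if total > 0 else 0.0


def should_alert(before, after, threshold=0.01):
    """Hypothetical alert rule: fire when loss exceeds the threshold."""
    return udp_loss_ratio(before, after) > threshold
```

In use, a check would read /proc/net/snmp twice, some seconds apart, pass both texts through `parse_udp_counters`, and feed the resulting dictionaries to `should_alert`.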