Incidents/2017-11-25 statsd

Summary

A service-runner change deployed on Nov 22nd caused an ever-increasing number of metrics to be sent to statsd, eventually overwhelming the statsd machine with UDP traffic.

Timeline

  • 20171121 The service-runner pull request to report GC metrics is merged, https://github.com/wikimedia/service-runner/pull/170
  • 20171121T1344 cpjobqueue is deployed with GC metrics reporting enabled, http://tools.wmflabs.org/sal/log/AV_e0yGGF4fsM4DBdggW
  • 20171123T0558 cxserver is deployed with GC metrics reporting enabled, http://tools.wmflabs.org/sal/log/AV_ndQBAwg13V6286Gp_
  • 20171125T1117 The "carbon frontend relay drops" alarm starts going off
  • 20171125T1305 Filippo starts investigating and bug T181333 is opened
  • 20171125T1340 statsd traffic from scb machines is banned to let statsd recover
  • 20171125T1410 cxserver and cpjobqueue are roll-restarted to alleviate the metrics "leak"
  • 20171125T1545 root cause is found and Petr rolls back service-runner for cpjobqueue
  • 20171125T1605 scb statsd traffic is unbanned
  • 20171125T1609 Kartik rolls back cxserver with the previous version of service-runner


Conclusions

Services sending statsd traffic over UDP receive no feedback from the aggregation server, so clients need to pace themselves to avoid overwhelming it. In addition, the aggregation server only had alerts on its graphite carbon traffic, i.e. on what happens after statsd aggregation. Those alerts fired only once the machine was overwhelmed with network traffic and carbon traffic (over TCP) was impacted too.
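
As an illustration of the client-side pacing mentioned above, here is a minimal sketch of a token-bucket limiter wrapped around a UDP statsd sender. It is not the service-runner implementation (which is a Node.js library). The class name PacedStatsdClient, the max_per_second budget, and the default host and port are hypothetical choices made for this example. Metrics over budget are dropped and counted locally instead of being sent.

```python
import socket
import time


class PacedStatsdClient:
    """Hypothetical statsd client that drops metrics once its send budget is spent."""

    def __init__(self, host="127.0.0.1", port=8125, max_per_second=1000,
                 clock=time.monotonic, sock=None):
        self.addr = (host, port)
        self.rate = float(max_per_second)
        self.tokens = float(max_per_second)  # start with a full bucket
        self.clock = clock
        self.last = clock()
        self.dropped = 0  # local counter of metrics discarded by the pacer
        if sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock = sock

    def _refill(self):
        # Add tokens in proportion to elapsed time, capped at one second's worth.
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def send(self, name, value, metric_type="c"):
        """Send one metric; return False if the pacer dropped it."""
        self._refill()
        if self.tokens < 1.0:
            self.dropped += 1
            return False
        self.tokens -= 1.0
        payload = f"{name}:{value}|{metric_type}".encode("ascii")
        self.sock.sendto(payload, self.addr)
        return True
```

With a cap like this, a metrics "leak" in the client shows up as a growing dropped counter on the sending side rather than as unbounded UDP traffic towards the aggregation server.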

Actionables