Incident documentation/20151126-graphite-grafana

From Wikitech

Summary

graphite.wikimedia.org was overwhelmed by legitimate requests and returned HTTP 500 errors to clients.

Timeline

  • 2015/11/25 14:48 - graphite starts throwing 500s to clients
  • 2015/11/25 14:52 - alert on icinga, investigation begins
  • 2015/11/25 14:57 - heavy query/dashboard suspected, uwsgi bounced
  • 2015/11/25 15:11 - large influx of requests to apache on graphite1001 identified as the root cause, likely from a misbehaving dashboard
  • 2015/11/25 15:24 - default refresh interval of the labs-monitoring grafana dashboard changed from 5s to 5m
  • 2015/11/25 15:53 - kafka dashboard also suspected and banned from apache
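
One way to spot a misbehaving dashboard in a web server access log is to count requests per Referer. The sketch below assumes the log uses the Apache "combined" format (request, status, size, referer, user agent). The function name `top_referers` and the sample hostnames are hypothetical and are not taken from the incident.

```python
import re
from collections import Counter

# Matches the tail of an Apache "combined" log line:
#   "request" status bytes "referer" "user-agent"
# Assumption: the server logs in this format.
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def top_referers(lines, n=5):
    """Return the n most common Referer values with their request counts."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # skip lines that do not look like combined-format entries
            counts[match.group("referer")] += 1
    return counts.most_common(n)
```

Passing the lines of an access log to `top_referers` gives a ranked list of referring pages. A dashboard that refreshes every few seconds would be expected to stand out near the top.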

Conclusions

Grafana dashboards that rely on intensive graphite queries can easily overwhelm graphite itself, particularly dashboards that refresh frequently, resulting in a denial of service.

In addition, misc varnish was observed to retry requests that receive a 5xx from a backend, further contributing to the thundering herd of requests.
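
A rough back-of-the-envelope model shows why the refresh interval matters so much. The sketch below assumes every open dashboard issues one graphite request per panel on each refresh. The panel and viewer counts and the `retry_factor` parameter are hypothetical, and only the 5s and 5m intervals come from the timeline.

```python
def requests_per_second(panels, viewers, refresh_seconds, retry_factor=1.0):
    """Estimate the request rate a dashboard generates against graphite.

    retry_factor is a hypothetical multiplier for retries issued by a
    proxy in front of graphite (1.0 means no retries).
    """
    return panels * viewers * retry_factor / refresh_seconds


# Hypothetical dashboard: 10 panels, open in 3 browsers.
before = requests_per_second(panels=10, viewers=3, refresh_seconds=5)
after = requests_per_second(panels=10, viewers=3, refresh_seconds=300)
print(f"5s refresh: {before:.2f} req/s, 5m refresh: {after:.2f} req/s")
```

In this model, moving from a 5s to a 5m refresh cuts the request rate by a factor of 60 regardless of the panel and viewer counts. Any retries on failure multiply the load at exactly the moment the backend is least able to serve it.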

Actionables

  • In progress: Make it easier to ban misbehaving dashboards from graphite (bug T119718)
  • In progress: Enforce a minimum refresh period for grafana dashboards hitting graphite (bug T119719)
  • Done: 500 errors from graphite shouldn't be retried by varnish (bug T119721)
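
As an illustration of the minimum refresh period idea, here is a minimal sketch that clamps the refresh setting of a dashboard definition. It assumes the dashboard is available as a JSON-like dict with a top-level `refresh` field holding strings such as "5s" or "5m" (or a false value when auto-refresh is off). The function names and the "1m" default are hypothetical, and this is not the mechanism tracked in the bugs above.

```python
import re

# Seconds per unit for interval strings such as "5s", "5m", "1h".
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_seconds(value):
    """Convert an interval string like '5m' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if not match:
        raise ValueError(f"unsupported interval: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def enforce_min_refresh(dashboard, minimum="1m"):
    """Return the dashboard with its refresh raised to at least `minimum`.

    Assumption: the refresh setting lives in a top-level "refresh" key.
    Dashboards with auto-refresh disabled are returned unchanged.
    """
    refresh = dashboard.get("refresh")
    if not refresh:
        return dashboard
    if interval_seconds(refresh) < interval_seconds(minimum):
        # Copy rather than mutate, so the caller keeps the original.
        return {**dashboard, "refresh": minimum}
    return dashboard
```

For example, a dashboard saved with `"refresh": "5s"` would come back with `"refresh": "1m"`, while one already set to "5m" would be left alone.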