Incident documentation/20151126-graphite-grafana

Summary

Graphite was overwhelmed with legitimate requests and returned 500s to clients.


  • 2015/11/25 14:48 - graphite starts throwing 500s to clients
  • 2015/11/25 14:52 - alert on icinga, investigation begins
  • 2015/11/25 14:57 - heavy query/dashboard suspected, uwsgi bounced
  • 2015/11/25 15:11 - large influx of requests to apache on graphite1001 identified as the root cause, likely from a misbehaving dashboard
  • 2015/11/25 15:24 - labs-monitoring grafana dashboard's default refresh interval changed from 5s to 5m
  • 2015/11/25 15:53 - kafka dashboard also suspected and banned from apache


Grafana dashboards that rely on intensive graphite queries can easily overwhelm graphite, particularly dashboards that refresh frequently, resulting in denial of service.
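To make the effect of the refresh interval concrete, here is a rough back-of-the-envelope sketch. The viewer and panel counts are hypothetical and were not measured during this incident; only the 5s and 5m intervals come from the timeline. It assumes each open dashboard issues one graphite request per panel on every refresh.

```python
def graphite_requests_per_second(viewers, panels, refresh_seconds):
    """Approximate steady-state request rate generated by one dashboard.

    Assumes one graphite request per panel per refresh, per open viewer.
    """
    if refresh_seconds <= 0:
        raise ValueError("refresh_seconds must be positive")
    return viewers * panels / refresh_seconds


# Hypothetical dashboard: 10 open browser tabs, 20 panels each.
fast = graphite_requests_per_second(viewers=10, panels=20, refresh_seconds=5)
slow = graphite_requests_per_second(viewers=10, panels=20, refresh_seconds=300)

print(f"5s refresh: {fast:.2f} req/s")
print(f"5m refresh: {slow:.2f} req/s")
```

Whatever the real viewer and panel counts were, moving from a 5s to a 5m refresh divides the request rate from that dashboard by 60 under this model.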

In addition, misc varnish was observed to retry requests on a 5xx from a backend, further contributing to the thundering herd of requests.
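The amplification caused by retries can be estimated with a small model. This is only a sketch: the number of retries varnish performed is not recorded here, so `max_retries` is a hypothetical parameter, and the model assumes each attempt fails independently with the same probability.

```python
def expected_backend_attempts(failure_probability, max_retries):
    """Expected backend requests per client request when every 5xx is retried.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expectation is the sum of p**k for k = 0..max_retries.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_probability ** k for k in range(max_retries + 1))


# Healthy backend: no extra load from retries.
print(expected_backend_attempts(0.0, max_retries=1))
# Backend failing every request: each client request hits it twice.
print(expected_backend_attempts(1.0, max_retries=1))
```

The worse the backend is doing, the more extra load the retries add, which is why retrying 500s makes recovery harder.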


  • Status: Unresolved - Make it easier to ban misbehaving dashboards from graphite (bug T119718)
  • Status: Unresolved - Enforce a minimum refresh period for grafana dashboards hitting graphite (bug T119719)
  • Status: Done - 500 errors from graphite shouldn't be retried by varnish (bug T119721)
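As an illustration of what enforcing a minimum refresh period could look like, the sketch below clamps the refresh setting of a dashboard definition. It is not the fix tracked in T119719. It assumes the dashboard is a dict with a top-level `refresh` value such as "5s" or "5m" (or a false value when auto-refresh is off), and the "1m" minimum is a hypothetical choice.

```python
import re

# Seconds per supported interval unit.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(text):
    """Convert an interval such as '5s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported interval: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def enforce_min_refresh(dashboard, minimum="1m"):
    """Return a copy of the dashboard with refresh raised to the minimum.

    Dashboards with auto-refresh disabled (missing or false value) and
    dashboards already at or above the minimum are returned unchanged.
    """
    refresh = dashboard.get("refresh")
    if not refresh:
        return dashboard
    if parse_interval(refresh) < parse_interval(minimum):
        return {**dashboard, "refresh": minimum}
    return dashboard


# Hypothetical dashboard definition with an aggressive refresh.
noisy = {"title": "example", "refresh": "5s"}
print(enforce_min_refresh(noisy)["refresh"])
```

A check like this could run when dashboards are saved or imported. Where exactly to hook it in is left open here.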