graphite.wikimedia.org was overwhelmed by legitimate requests and returned HTTP 500 errors to clients.
- 2015/11/25 14:48 - graphite starts throwing 500s to clients
- 2015/11/25 14:52 - alert on icinga, investigation begins
- 2015/11/25 14:57 - heavy query/dashboard suspected, uwsgi bounced
- 2015/11/25 15:11 - big influx of requests on graphite1001's apache identified as the root cause, likely a misbehaving dashboard
- 2015/11/25 15:24 - labs-monitoring grafana dashboard default refresh interval changed from 5s to 5m
- 2015/11/25 15:53 - kafka dashboard also suspected and banned from apache
grafana dashboards relying on intensive graphite queries can easily overwhelm graphite itself, particularly dashboards that refresh frequently, resulting in denial of service.
in addition, it has been observed that misc varnish retries the request on a 5xx from a backend, further contributing to a thundering herd of requests.
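
To make the two effects above concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the incident data: the viewer count, panel count and number of varnish retries are hypothetical placeholders, and the function name `request_rate` is invented for illustration. The only figures from the timeline are the 5s and 5m refresh intervals.

```python
def request_rate(viewers, panels, refresh_seconds, retries_on_5xx=0):
    """Approximate graphite requests per second caused by one dashboard.

    Assumes every open dashboard issues one graphite request per panel on
    each refresh. retries_on_5xx models the worst case in which the backend
    fails every request and the cache layer retries each one that many
    times (the actual retry count is an assumption, not from the report).
    """
    if refresh_seconds <= 0:
        raise ValueError("refresh_seconds must be positive")
    base = viewers * panels / refresh_seconds
    return base * (1 + retries_on_5xx)


if __name__ == "__main__":
    # Hypothetical dashboard: 10 open browser tabs, 20 panels each.
    fast = request_rate(viewers=10, panels=20, refresh_seconds=5)
    slow = request_rate(viewers=10, panels=20, refresh_seconds=300)
    print("5s refresh:", fast, "req/s")
    print("5m refresh:", slow, "req/s")
    print("reduction factor:", fast / slow)
    # Worst case while graphite is failing, assuming one retry per request.
    print("5s refresh with retries:", request_rate(10, 20, 5, retries_on_5xx=1), "req/s")
```

Whatever the real viewer and panel counts were, moving the refresh interval from 5s to 5m cuts the steady-state request rate from that dashboard by a factor of 60, while retries on 5xx multiply the load at exactly the moment the backend is least able to absorb it.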