Incident documentation/20151023-graphite-grafana

Summary

Graphite was unable to serve data (502 "bad gateway" from apache) because large queries issued by a grafana dashboard occupied its uwsgi workers.

Timeline

  • 16:16 - 502 bad gateway from graphite
  • 16:18 - investigation begins
  • 16:31 - recovery, root cause still unknown, large query suspected
  • 16:47 - stop graphite-index cronjob, suspected as a factor and later excluded
  • 17:57 - the offending queries are found and the related grafana dashboard deleted
  • 18:54 - offending client banned from apache on graphite1001

Conclusions

Graphite does not appear to support query cancellation or timeouts for local queries, so queries involving many time series can occupy all uwsgi workers, resulting in "bad gateway" responses from apache. In addition, grafana clients do not seem to reload a dashboard when its definition is updated. As a result, clients keep requesting the same (in this case problematic) dashboard, which is why a server-side ban was needed.
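
The worker exhaustion described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name simulate, the worker count, the query durations and the 5 second timeout are all hypothetical, and the model ignores request queueing, so it does not reproduce how uwsgi or apache actually behave. It only shows why a few long-running queries with no timeout can starve every other request, and how a per-query time limit bounds the damage.

```python
def simulate(num_workers, requests, timeout=None):
    """Count requests that find no free worker (a stand-in for a 502).

    requests: list of (arrival_time, duration) tuples, in seconds.
    timeout:  optional per-query limit; a query running longer is cut off.
    All values are hypothetical and chosen only for illustration.
    """
    busy_until = [0.0] * num_workers  # time at which each worker becomes free
    rejected = 0
    for arrival, duration in sorted(requests):
        if timeout is not None:
            duration = min(duration, timeout)
        free = [i for i, t in enumerate(busy_until) if t <= arrival]
        if not free:
            rejected += 1  # every worker is still busy with an earlier query
            continue
        busy_until[free[0]] = arrival + duration
    return rejected


# Four heavy dashboard queries arrive together, each running for 10 minutes.
heavy = [(0.0, 600.0)] * 4
# Ten light queries arrive later, one every 10 seconds, each taking 0.5 s.
light = [(1.0 + 10.0 * i, 0.5) for i in range(10)]

rejected_without_timeout = simulate(4, heavy + light)
rejected_with_timeout = simulate(4, heavy + light, timeout=5.0)
print(rejected_without_timeout, rejected_with_timeout)
```

In this model the heavy queries hold all four workers for the whole window when there is no timeout, so every light query is rejected. With the hypothetical 5 second limit, only the light query that arrives while the heavy ones are still running is rejected.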

Actionables

  • Not done: limit the impact of heavy/large graphite queries (bug T116767)