SRE/Observability/Dashboard guidelines

From Wikitech

Dashboard methods

Utilization Saturation Errors (USE)

This method is most effective for quickly diagnosing system performance issues. To quote Brendan Gregg's guide to USE:

 For every resource, check utilization, saturation, and errors.

The host overview dashboard shows an example of this method applied to inspect a single host's performance. Resources (CPU, network, etc.) are placed in rows: the left column shows the resource's utilization, while the right column shows saturation or errors, as applicable.
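The row and column arrangement described above can be sketched in a few lines of Python. This is only an illustration of the layout idea, not how the host overview dashboard is actually built: the `Panel` class, the `use_layout` function, and the resource names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    title: str
    row: int
    column: str  # "left" or "right"


def use_layout(resources):
    """Build one row per resource: utilization on the left,
    saturation or errors (whichever applies) on the right."""
    panels = []
    for row, (name, right_metric) in enumerate(resources):
        panels.append(Panel(f"{name} utilization", row, "left"))
        panels.append(Panel(f"{name} {right_metric}", row, "right"))
    return panels


# Hypothetical resources; the second element names what the right column shows.
layout = use_layout([("CPU", "saturation"), ("Network", "errors")])
for panel in layout:
    print(panel.row, panel.column, panel.title)
```

Keeping the same left/right convention on every row lets a reader scan down one column to compare utilization across resources.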

Four golden signals (4GS)

This method is described in detail in Google's SRE book and focuses on the system's user-impacting metrics. Specifically, it can be used as a basis for alerting and for diagnosing ongoing problems.

For examples of this method in use, see the swift or sessionstore dashboards, or any other service dashboard in the "Services" Grafana folder.
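The four signals named in the SRE book are latency, traffic, errors, and saturation. As a rough sketch, they could be derived from a window of request records as shown below. The `Request` record, the `golden_signals` function, and the `capacity_rps` parameter are assumptions made for illustration; in particular, treating saturation as traffic divided by a known capacity is a simplification, and a real service would use whichever resource is its actual bottleneck.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    ok: bool           # False if the request failed


def golden_signals(requests, window_s, capacity_rps):
    """Summarize one observation window into the four golden signals."""
    total = len(requests)
    traffic_rps = total / window_s
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Median latency as a simple summary; percentiles are also common.
        "latency_s": median(r.duration_s for r in requests) if total else 0.0,
        "traffic_rps": traffic_rps,
        "error_ratio": failed / total if total else 0.0,
        # Assumed definition: fraction of a known request-rate capacity in use.
        "saturation": traffic_rps / capacity_rps,
    }


sample = [Request(0.1, True), Request(0.2, True),
          Request(0.3, False), Request(0.4, True)]
signals = golden_signals(sample, window_s=2, capacity_rps=10)
print(signals)
```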

Data panel recommendations

  • Axes must be labeled
  • Y axis should be zero-based
  • Use fill zero, unless the graph is stacked
  • Ideally no more than four lines/metrics per panel
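These recommendations lend themselves to an automated check. The sketch below runs against a simplified, hypothetical panel description; the dictionary keys (`x_label`, `y_label`, `y_min`, `fill`, `stacked`, `metrics`) are invented for this example and are not Grafana's dashboard JSON schema, and `fill` is assumed to be a numeric setting where 0 means "fill zero".

```python
def check_panel(panel):
    """Return the recommendations a (hypothetical) panel dict does not follow."""
    problems = []
    if not panel.get("x_label") or not panel.get("y_label"):
        problems.append("axes must be labeled")
    if panel.get("y_min") != 0:
        problems.append("Y axis should be zero-based")
    # Fill is only expected to be zero when the graph is not stacked.
    if not panel.get("stacked", False) and panel.get("fill", 0) != 0:
        problems.append("use fill zero unless the graph is stacked")
    if len(panel.get("metrics", [])) > 4:
        problems.append("ideally no more than four lines/metrics per panel")
    return problems


good = {"x_label": "time", "y_label": "requests/s", "y_min": 0,
        "fill": 0, "stacked": False, "metrics": ["2xx", "4xx", "5xx"]}
bad = {"x_label": "time", "y_label": "", "y_min": None,
       "fill": 1, "stacked": False, "metrics": ["a", "b", "c", "d", "e"]}

print(check_panel(good))
print(check_panel(bad))
```

The last item is phrased as "ideally" in the list above, so a real checker might report it as a warning rather than a failure.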