SRE/Observability/Ownership

From Wikitech
Backlog | Services Description | Phabricator tag
Alerting | All things alerting, including Alertmanager, Icinga and Splunk-On-Call | #observability-alerting
Metrics | Aggregatable metrics systems and their interfaces, such as Prometheus, Thanos, Grafana and Graphite | #observability-metrics
Logging | The log pipeline, Logstash, and the OpenSearch ecosystem | #observability-logging
Incident Tooling | Incident workflow tooling, such as dispatch and related systems | #incident_tooling, #observability-ir-tools
Tracing | Distributed tracing support. Not developed yet, but planned. | #observability-tracing
Prometheus | Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time series database, collects them using an HTTP pull model, and supports flexible queries and real-time alerting.
Graphite | Graphite is a real-time time series data store and graph renderer. | https://phabricator.wikimedia.org/tag/graphite/
Alertmanager | Alertmanager is the service (and software) in charge of collecting, de-duplicating and sending notifications for alerts across WMF infrastructure. It is part of the Prometheus ecosystem, so Prometheus has native support for acting as an Alertmanager client. The alerts dashboard, implemented by Karma, can be reached at https://alerts.wikimedia.org/
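
To illustrate what "de-duplicating" means for alerts, the following is a minimal sketch in Python and not Alertmanager's actual implementation. It assumes that an alert's identity is its full label set, so two alerts with the same labels are treated as the same alert regardless of label order or annotations. The alert shape, the function names `fingerprint` and `deduplicate`, and the policy of keeping the first alert seen are all hypothetical simplifications chosen for this example.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical alert shape for this sketch:
# {"labels": {...}, "annotations": {...}}
Alert = Dict[str, Dict[str, str]]


def fingerprint(labels: Dict[str, str]) -> FrozenSet[Tuple[str, str]]:
    """Return an order-independent identity for a label set."""
    return frozenset(labels.items())


def deduplicate(alerts: Iterable[Alert]) -> List[Alert]:
    """Keep the first alert seen for each distinct label set.

    Assumption: keeping the first occurrence is a simplification
    made for illustration only.
    """
    seen: Dict[FrozenSet[Tuple[str, str]], Alert] = {}
    for alert in alerts:
        # setdefault stores the alert only if this label set is new.
        seen.setdefault(fingerprint(alert["labels"]), alert)
    # Dicts preserve insertion order, so output follows first-seen order.
    return list(seen.values())
```

With this sketch, two alerts carrying the labels `alertname=HostDown, instance=a` collapse into one entry even if their annotations differ, while an alert with `instance=b` is kept as a separate entry.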