SRE/Observability/Ownership


Alerting

Tools and processes for setting up and managing notifications and alerts to proactively detect and respond to incidents, anomalies, or other events that may impact the reliability or availability of our systems and services. Alerting helps ensure that potential issues are identified and addressed promptly. By monitoring metrics, logs, and system conditions, we configure alerting rules that trigger notifications when defined thresholds or conditions are met.

We are responsible for:

1. Alert Manager

Component of the Prometheus monitoring and alerting toolkit. It serves as a centralized alert management system that receives alerts from various sources and deduplicates, groups, and routes them to the appropriate recipients for further action.

Alert Manager provides functionalities such as alert aggregation, silencing, deduplication, and notification customization. It allows us to define alert rules and configure how alerts should be handled and escalated.
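
For illustration, here is a minimal sketch of creating a silence through Alert Manager's v2 HTTP API from Python. The host name, matcher labels, and comment are placeholders, not real WMF values.

 # Hypothetical sketch: creating a silence via Alertmanager's v2 HTTP API.
 # The host, labels, and comment below are placeholders, not real WMF values.
 from datetime import datetime, timedelta, timezone
 import requests
 
 ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder host
 
 now = datetime.now(timezone.utc)
 silence = {
     "matchers": [
         {"name": "alertname", "value": "HighErrorRate", "isRegex": False},
         {"name": "instance", "value": "appserver1001", "isRegex": False},
     ],
     "startsAt": now.isoformat(),
     "endsAt": (now + timedelta(hours=2)).isoformat(),
     "createdBy": "ops-oncall",
     "comment": "Planned maintenance window",
 }
 
 resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=10)
 resp.raise_for_status()
 print("silence id:", resp.json()["silenceID"])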

2. Icinga

Host and service monitoring software that uses a binary daemon, CGI scripts for the web interface, and binary plugins to check various things. Basically, automated testing of our site that screams and sends up alarms when it fails. It originated as a fork of the earlier project "Nagios", from which WMF transitioned in 2013.
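
As a sketch of what a check plugin looks like, the following follows the Nagios/Icinga plugin convention of printing a single status line (optionally with performance data after a "|") and exiting 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The disk-usage check and its thresholds are purely illustrative.

 #!/usr/bin/env python3
 # Illustrative Nagios/Icinga-style check plugin: one status line plus perfdata,
 # exit code 0/1/2/3 = OK/WARNING/CRITICAL/UNKNOWN. Thresholds are made up.
 import shutil
 import sys
 
 WARN_PCT, CRIT_PCT = 80, 90  # hypothetical disk-usage thresholds
 
 def main() -> int:
     usage = shutil.disk_usage("/")
     pct = usage.used / usage.total * 100
     perfdata = f"|used_pct={pct:.1f}%;{WARN_PCT};{CRIT_PCT}"
     if pct >= CRIT_PCT:
         print(f"DISK CRITICAL - {pct:.1f}% used {perfdata}")
         return 2
     if pct >= WARN_PCT:
         print(f"DISK WARNING - {pct:.1f}% used {perfdata}")
         return 1
     print(f"DISK OK - {pct:.1f}% used {perfdata}")
     return 0
 
 if __name__ == "__main__":
     sys.exit(main())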

3. Splunk On-Call (formerly VictorOps)

Incident management and response platform. It provides features to streamline incident response processes and ensure timely resolution of issues.

Metrics

The metrics we gather play a crucial role in assessing the health, performance, and reliability of systems and services. They provide valuable insights into the system's behavior and help SRE teams make data-driven decisions.

Here are some key metrics we collect:

  1. Service-Level Objectives (SLOs): SLOs define the desired level of reliability and performance for a service. They include metrics related to availability, latency, error rates, and other service-specific goals. SLOs provide a measurable target for system performance.
  2. Availability Metrics: Availability metrics track the uptime and accessibility of a system or service. They measure the percentage of time that a service is operational and available to users, indicating its reliability.
  3. Latency Metrics: Latency metrics measure the time it takes for a request to be processed or a response to be received. They help assess the performance of the system and ensure it meets the expected response time.
  4. Error Rates and Failure Metrics: Error rate metrics quantify the occurrence of errors or failures within the system. They include metrics such as HTTP error codes, database query errors, or exceptions. Tracking error rates helps us identify issues impacting the system's reliability and can guide improvements.
  5. Request Throughput Metrics: Throughput metrics measure the number of requests or transactions processed by the system per unit of time. They provide insights into our system's capacity and performance limits and help assess whether the system can handle the expected load.
  6. Saturation Metrics: Saturation metrics indicate the degree to which a resource or component is overloaded or saturated. They help identify potential performance bottlenecks and capacity limitations within the system.
  7. Incident Metrics: Incident metrics track the frequency, duration, and impact of incidents. They provide insights into our system's stability and help us identify areas for improvement in terms of incident response and mitigation.

These metrics, along with well-defined thresholds and alerting mechanisms, enable us to proactively monitor systems, detect anomalies, and take appropriate actions to maintain and improve the reliability and performance of services.
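
As a purely illustrative example of how the SLO, availability, and error-rate metrics above fit together, the following sketch computes an error budget from made-up request counts (not real WMF data):

 # Illustrative arithmetic only: how an availability SLO translates into an
 # error budget, using invented request counts rather than real WMF data.
 slo_target = 0.999           # 99.9% availability objective
 total_requests = 10_000_000  # requests served in the SLO window
 failed_requests = 4_200      # 5xx responses in the same window
 
 availability = 1 - failed_requests / total_requests
 error_budget = (1 - slo_target) * total_requests  # failures the SLO allows
 budget_used = failed_requests / error_budget
 
 print(f"availability:    {availability:.4%}")
 print(f"error budget:    {error_budget:.0f} failed requests")
 print(f"budget consumed: {budget_used:.1%}")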

We are responsible for:

1. Grafana

Data visualization and monitoring platform used to create and display interactive dashboards, graphs, and charts. It allows us to connect to various data sources, including databases, time-series databases, and monitoring systems, and visualize the data in real-time.
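
For illustration, a sketch of querying Grafana's HTTP API to list dashboards matching a search term; the host, API token, and search term are placeholders, and real access depends on whatever authentication the installation enforces.

 # Hypothetical sketch: listing dashboards via Grafana's HTTP search API.
 # Host, token, and search term are placeholders.
 import requests
 
 GRAFANA = "https://grafana.example.org"  # placeholder host
 TOKEN = "REDACTED"                       # placeholder API token
 
 resp = requests.get(
     f"{GRAFANA}/api/search",
     params={"query": "varnish", "type": "dash-db"},
     headers={"Authorization": f"Bearer {TOKEN}"},
     timeout=10,
 )
 resp.raise_for_status()
 for dash in resp.json():
     print(dash["title"], dash["url"])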

2. Graphite

Time-series data storage and visualization system. It is meant for collecting, storing, and rendering time-series data, primarily focusing on performance monitoring and graphing.

Components of Graphite:

  • Carbon - the data ingestion component. It accepts time-series data sent over the plaintext or pickle protocols and writes it to Whisper database files on disk (see the protocol sketch after this list).
  • Whisper - the default storage format for persisting time-series data. It employs a fixed-size database file structure optimized for high-performance and efficient storage of metric data.
  • Graphite Web - web-based user interface for querying and visualizing the stored time-series data. It allows us to create graphs, set various display options, and apply functions for data manipulation.
  • Render API - an HTTP API that allows us to programmatically retrieve time-series data and generate graphs or other visualizations.
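
As a minimal sketch of the ingestion side, the Carbon plaintext protocol takes one "<metric path> <value> <timestamp>" line per datapoint over TCP (port 2003 by default); the host and metric path below are placeholders, not real WMF names.

 # Sketch of the Carbon plaintext protocol: "<metric path> <value> <timestamp>"
 # lines over TCP. Host and metric path are placeholders.
 import socket
 import time
 
 CARBON_HOST, CARBON_PORT = "graphite.example.org", 2003  # placeholders
 
 line = f"servers.appserver1001.loadavg.1min 0.42 {int(time.time())}\n"
 with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
     sock.sendall(line.encode("ascii"))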

3. Prometheus

Monitoring and alerting system designed for collecting, storing, and analyzing time-series data. It focuses on monitoring the health and performance of systems and applications in a distributed environment.
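
For illustration, a sketch of an instant query against Prometheus' HTTP API (/api/v1/query); the host and the PromQL expression are placeholders.

 # Hypothetical sketch: instant PromQL query via the Prometheus HTTP API.
 # The host and expression are placeholders.
 import requests
 
 PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder host
 
 resp = requests.get(
     f"{PROMETHEUS}/api/v1/query",
     params={"query": 'sum(rate(http_requests_total{status=~"5.."}[5m]))'},
     timeout=10,
 )
 resp.raise_for_status()
 for result in resp.json()["data"]["result"]:
     print(result["metric"], result["value"])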

4. Thanos

Project that extends Prometheus by providing a highly scalable and fault-tolerant solution for long-term storage, global query federation, and high availability of Prometheus data. It allows us to overcome the limitations of single Prometheus instances, achieve durability for Prometheus data, and enable efficient querying and analysis across instances.

5. StatsD

Metrics aggregation server. We use it to aggregate metrics for Graphite. It flushes data points to Graphite at an interval of 60 seconds (the highest resolution supported by our Graphite configuration); otherwise only the last data point in a given minute would be stored. Wikimedia currently uses the statsite implementation (package, website), a C program that is wire-compatible with Etsy's original StatsD (written in Node.js).
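
As a sketch of the StatsD wire format, clients send "<name>:<value>|<type>" datagrams over UDP (type c for counters, ms for timers, g for gauges); the host and metric names below are placeholders, and the aggregation into 60-second buckets happens server-side.

 # Sketch of the StatsD wire format over UDP. Host and metric names are
 # placeholders; the server aggregates and flushes to Graphite every 60s.
 import socket
 
 STATSD_HOST, STATSD_PORT = "statsd.example.org", 8125  # placeholders
 
 sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
 sock.sendto(b"mediawiki.api.requests:1|c", (STATSD_HOST, STATSD_PORT))    # counter
 sock.sendto(b"mediawiki.api.latency:153|ms", (STATSD_HOST, STATSD_PORT))  # timer
 sock.close()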

Logging

Tools and processes for capturing and recording events, activities, and system information in chronological order. Logging involves storing data in log files or centralized logging systems, which can later be analyzed for troubleshooting, performance monitoring, incident investigation, and capacity planning.

Logging helps us capture and store valuable information about our system's behavior, events, and errors.

We gather and store our logs for:

  • debugging and troubleshooting.

When an issue or incident occurs, we can analyze the logs to understand the sequence of events leading up to the problem, identify root causes, and troubleshoot the issue effectively.

  • incident response.

Logs play a vital role in post-incident analysis. They enable us to reconstruct the timeline of events, investigate the factors contributing to the incident, and gain insights into the impact and scope of the problem. Logs help us in identifying patterns, trends, or anomalies that might have led to the incident and guide remediation efforts.

  • performance monitoring and optimization.

Logging provides visibility into system performance by capturing metrics and indicators related to latency, response times, resource utilization, and other performance-related factors. Analyzing performance logs allows us to identify performance bottlenecks, optimize resource usage, and improve the overall system efficiency.

  • performance and availability reporting.

Logs can be aggregated and analyzed to generate reports on system performance and availability. These reports provide visibility into the system's reliability, adherence to service level objectives (SLOs), and overall health.

We are responsible for:

1. OpenSearch

Community-driven search and analytics engine built from the Elasticsearch and Kibana codebases. It provides powerful search and indexing capabilities, allowing us to store, search, and analyze large volumes of data in near real-time.
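
For illustration, a sketch of a log search against OpenSearch's Elasticsearch-compatible REST API; the host, index pattern, and field names are placeholders rather than the real index layout.

 # Hypothetical sketch: searching recent error-level log events in OpenSearch.
 # Host, index pattern, and field names are placeholders.
 import requests
 
 OPENSEARCH = "https://opensearch.example.org:9200"  # placeholder host
 
 query = {
     "size": 20,
     "sort": [{"@timestamp": "desc"}],
     "query": {
         "bool": {
             "filter": [
                 {"term": {"level": "ERROR"}},
                 {"range": {"@timestamp": {"gte": "now-15m"}}},
             ]
         }
     },
 }
 
 resp = requests.post(f"{OPENSEARCH}/logstash-*/_search", json=query, timeout=10)
 resp.raise_for_status()
 for hit in resp.json()["hits"]["hits"]:
     print(hit["_source"].get("message"))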

2. Logstash

Data processing pipeline tool that collects, processes, and transforms data from various sources and sends it to our storage. It acts as the data ingestion and transformation layer in the ELK (Elasticsearch, Logstash, Kibana) stack.

3. kafka-logging

We utilize Kafka as a message broker to receive log events and distribute them for processing, storage, and analysis. It acts as a reliable and scalable transport layer for log data, allowing for decoupling of log producers and consumers.
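
As a rough sketch of the producer side of such a pipeline, the following sends a structured log event to a Kafka topic using the kafka-python client; the broker address, topic name, and event fields are placeholders, not the real schema or topic layout.

 # Hypothetical sketch: shipping a structured log event to a Kafka topic.
 # Broker, topic, and event fields are placeholders.
 import json
 import time
 from kafka import KafkaProducer  # kafka-python
 
 producer = KafkaProducer(
     bootstrap_servers="kafka-logging.example.org:9092",  # placeholder broker
     value_serializer=lambda event: json.dumps(event).encode("utf-8"),
 )
 
 event = {
     "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
     "host": "appserver1001",
     "level": "ERROR",
     "message": "database connection timed out",
 }
 producer.send("logging-events", event)  # placeholder topic name
 producer.flush()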

Tracing

Tracing the path of requests as they move through interconnected services and components. It includes capturing and correlating data about the request flow across different system boundaries, revealing valuable information about latency, dependencies, and performance bottlenecks. Tracing helps us understand and optimize the behavior of MediaWiki and other services by providing visibility into the interactions between different components.
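
As a general illustration of the idea (not a statement about which tracing stack is deployed here), the following sketch uses the OpenTelemetry Python API to nest child spans under a request-level span; the span names and attributes are invented.

 # General illustration of distributed tracing with the OpenTelemetry Python API.
 # Span names and attributes are placeholders; without an SDK/exporter configured,
 # this tracer is a no-op.
 from opentelemetry import trace
 
 tracer = trace.get_tracer("request-handler")
 
 def handle_request(page_title: str) -> None:
     # The parent span covers the whole request; child spans capture downstream
     # calls, so latency can be attributed to individual components.
     with tracer.start_as_current_span("handle_request") as span:
         span.set_attribute("page.title", page_title)
         with tracer.start_as_current_span("parser.parse"):
             pass  # parse wikitext
         with tracer.start_as_current_span("db.query"):
             pass  # fetch revision data
 
 handle_request("Observability")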

Incident Tooling

Tools that optimize and automate processes related to incident handling and response. They streamline collaboration, communication, and resolution, enabling efficient incident management by automating routine tasks and supporting timely action and effective teamwork.


see: Technology/SRE/Observability/Documentation