SRE/Observability/About

The Observability team

The Observability team, affectionately known as "o11y", is part of the Site Reliability Engineering (SRE) team within the Product and Technology organization at Wikimedia. Our mission is to equip teams across SRE and Technology with the tools, platforms, and insights they need to understand how their systems and services are performing. We use a variety of technologies, including Grafana, OpenSearch/Logstash, Prometheus, and AlertManager, to monitor and log system performance.

What is observability?

Observability is a term that originates from control theory and is used in the context of software engineering and operations to describe the ability to infer the internal states of a system based on its external outputs. In simpler terms, it's about being able to understand what's happening inside a system just by observing it from the outside, without needing to interfere with its operation.

In the context of software systems, observability typically involves collecting metrics, logs, and traces from applications and infrastructure, and then using this data to monitor system health, troubleshoot issues, understand system behavior, and improve performance and reliability.

In the "strict definition" the three pillars of observability are:

  • Metrics: Metrics are numerical values that represent the state of a system at a specific point in time. They are typically collected at regular intervals and can provide a high-level overview of system health and performance. Metrics can include data like CPU usage, memory consumption, network latency, error rates, transaction times, and more. They are crucial for identifying trends, detecting anomalies, and setting up alerts for when predefined thresholds are breached. Metrics can also be used to create dashboards for real-time monitoring, allowing teams to quickly identify and respond to performance issues or system failures.
  • Logs: Logs are timestamped records of discrete events that have occurred within a system. They provide a detailed and chronological account of what has happened in a system, making them invaluable for troubleshooting and debugging. Logs can include information such as system errors, transaction statuses, user activities, system messages, and more. By analyzing logs, teams can identify when and where an issue occurred, what led up to it, and what the effects were. This can help in diagnosing problems, understanding user behavior, and even in detecting security threats. Log data can be quite voluminous, so log management systems and log analysis tools are often used to collect, store, and analyze logs effectively.
  • Traces: Tracing provides a detailed view of how a transaction or operation flows through a system. Each trace records the path of an operation through a system, along with timing data for each step in the path. Tracing can be particularly useful in distributed systems, where an operation might involve multiple services running on different servers. Traces can help teams understand the relationship between different components of a system, identify bottlenecks, and troubleshoot performance issues. For example, if a user's request is taking longer than expected, a trace might reveal that the delay is due to a slow database query in one part of the system. Tracing requires instrumentation of the code to generate trace data, and specialized tools to collect and visualize this data.
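
To make the metrics pillar concrete, here is a minimal sketch of instrumenting a service with the Prometheus Python client. The metric names, labels, and listen port are hypothetical examples, not taken from any Wikimedia service.

    # Minimal sketch of metrics instrumentation with the Prometheus Python client.
    # Metric names, labels, and the listen port are hypothetical examples.
    import random
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
    LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

    def handle_request():
        with LATENCY.time():                       # record how long the work took
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(status="200").inc()        # count the outcome

    if __name__ == "__main__":
        start_http_server(8000)  # expose /metrics for a Prometheus server to scrape
        while True:
            handle_request()

A Prometheus server scrapes the /metrics endpoint at regular intervals, which produces the time series that dashboards and alert thresholds are built on.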
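For the logs pillar, the sketch below emits timestamped, structured (JSON) events with Python's standard logging module. The logger name, field names, and JSON shape are illustrative only and do not reflect the exact schema expected by Wikimedia's Logstash/OpenSearch pipeline.

    # Minimal sketch of structured, timestamped event logging.
    # Field names and the JSON layout are illustrative, not a required schema.
    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("example-service")

    def log_event(event, **fields):
        """Emit one discrete, timestamped event as a single JSON line."""
        record = {"timestamp": time.time(), "event": event, **fields}
        log.info(json.dumps(record))

    log_event("request_failed", status=500, path="/w/api.php", error="upstream timeout")

One event per line in a structured format is easy for a log shipper to parse and index, which is what makes later searching and correlation practical.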
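For the traces pillar, here is a minimal sketch using the OpenTelemetry Python SDK. OpenTelemetry is used here purely to illustrate code instrumentation and nested spans; this page does not prescribe a particular tracing stack.

    # Minimal sketch of trace instrumentation: a parent span for the request and a
    # child span for the database call, exported to the console for demonstration.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("example-service")

    def handle_request():
        with tracer.start_as_current_span("handle_request"):  # whole request
            with tracer.start_as_current_span("db_query"):     # a slow step shows up here
                pass  # stand-in for the actual database call

    handle_request()

Each span records its start time, duration, and parent, so the assembled trace shows where in the request the time was actually spent.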

While metrics, logs, and traces form the traditional pillars of observability, the Observability team at Wikimedia extends beyond these to include Alerting, Incident Response Tooling, and Performance Monitoring.

These areas, although not fitting neatly into the conventional observability framework, are crucial for system health and performance. Given their close ties to observability tasks, it's logical for them to fall under the observability umbrella. Let's explore these additional areas:

  • Alerting: This involves setting up and managing alerts that notify the team when defined conditions are met, such as when system performance degrades or an error rate exceeds a threshold. Alerting is crucial for proactively identifying and addressing issues before they impact users or system performance (a conceptual sketch follows this list).
  • Incident Response Tooling: This refers to the tools and processes that the team uses to respond to incidents. This can include everything from incident management platforms that help coordinate response efforts, to runbooks that provide step-by-step instructions for addressing common issues. The goal is to ensure that when incidents occur, the team can respond quickly and effectively to mitigate impact.
  • Performance Monitoring: This area incorporates components inherited from the former Performance team (https://wikitech.wikimedia.org/wiki/Performance); this section will evolve over FY2023/2024.
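
As a conceptual sketch of the alerting idea described above: fire a notification when an error rate crosses a threshold. In production this logic is expressed as alerting rules evaluated by Prometheus and routed by AlertManager rather than application code; the threshold and service name here are hypothetical.

    # Conceptual sketch of threshold-based alerting. In production this check is a
    # Prometheus alerting rule routed through AlertManager, not application code.
    def error_rate_exceeded(errors: int, total: int, threshold: float = 0.05) -> bool:
        """Return True (i.e. fire an alert) when the error rate crosses the threshold."""
        if total == 0:
            return False
        return errors / total > threshold

    if error_rate_exceeded(errors=37, total=500):
        # stand-in for paging or otherwise notifying the on-call engineer
        print("ALERT: error rate above 5% for example-service")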