Wikifunctions/Performance observability

This document captures the metrics we will use to measure the performance of the Wikifunctions project. The metrics fall into three categories:

  1. Performance Service-level objectives (SLOs)
  2. Resource saturation metrics
  3. Wikifunctions-specific metrics

Note that not all metrics will be implemented before launch; they are listed in descending order of urgency. Some of the goals will require additional support from the backend services and further iterations. This document simply provides a road map for the future performance-monitoring setup.

Another thing to note is that most metrics will be reported separately per service: the function orchestrator and the function evaluator will each have their own metrics.

Performance SLOs (current phase)

The performance SLOs are the service-level objectives the project should meet before launching to production. We have outlined the SLIs (indicators), but the SLOs (objectives) in the table are placeholder values borrowed from other Wikimedia services; the exact numbers need to be decided with consideration of user needs and organizational standards.

Since the staging environment on the beta cluster does not support metric export, we cannot observe the current level of performance until the service goes live.

SLI | SLO (per service) | Definition | Collection method
Request latency (server side) | Known simple function: 100ms; 50th percentile: 300ms; 99th percentile: 900ms | Time it takes to return a response to a request | Existing Prometheus metric (express_router_request_duration_seconds)
Error rate | 1% | Percentage of all received requests that result in an error | Existing Prometheus metric
System throughput capacity | 500 req/s | Rate of successful requests | Load-testing framework (e.g. Apache Bench)
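
To make the SLO targets concrete, here is a minimal load-test sketch in TypeScript (Node 18+, which provides a global fetch). It is not a replacement for a proper load-testing framework such as Apache Bench; the endpoint URL, request payload, and sequential request loop are illustrative assumptions.

```typescript
// Minimal load-test sketch: fire N requests at the orchestrator and report
// p50/p99 latency and error rate against the placeholder SLOs above.
const ENDPOINT = 'https://function-orchestrator.example.test/1/v1/evaluate'; // hypothetical URL
const TOTAL_REQUESTS = 1000;

async function timedRequest(): Promise<{ ms: number; ok: boolean }> {
  const start = performance.now();
  try {
    const res = await fetch(ENDPOINT, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: '{}', // placeholder payload; a real test would call a known simple function
    });
    return { ms: performance.now() - start, ok: res.ok };
  } catch {
    return { ms: performance.now() - start, ok: false };
  }
}

function percentile(sorted: number[], p: number): number {
  return sorted[Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))];
}

async function main() {
  const results: { ms: number; ok: boolean }[] = [];
  for (let i = 0; i < TOTAL_REQUESTS; i++) {
    results.push(await timedRequest()); // sequential; a real test would add concurrency
  }
  const latencies = results.map(r => r.ms).sort((a, b) => a - b);
  const errorRate = results.filter(r => !r.ok).length / results.length;
  console.log('p50 latency (target 300ms):', percentile(latencies, 50).toFixed(1), 'ms');
  console.log('p99 latency (target 900ms):', percentile(latencies, 99).toFixed(1), 'ms');
  console.log('error rate (target <= 1%):', (errorRate * 100).toFixed(2), '%');
}

main();
```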

Resource Saturation Metrics (current phase)

The resource saturation metrics show how much of the allocated resources is in use. They will guide throttling thresholds and requests for additional resources. These metrics do not require additional setup, as they are supported by default.

Metric | Upper limit | Definition | Collection method
CPU | 100% | Percentage of the allocated CPU in use | Container default (e.g. container_cpu_user_seconds_total)
Memory | 90% | Percentage of the allocated memory in use | Container default (e.g. container_memory_*)
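
As an illustration of how a saturation figure can be derived from the container defaults, the sketch below queries the Prometheus HTTP API for the CPU counter named above and divides by the allocated core count. The Prometheus URL, label selector, and core count are assumptions, not production values.

```typescript
// Sketch: compute CPU saturation for one service from the container counter above.
const PROMETHEUS = 'http://prometheus.example.test:9090'; // hypothetical Prometheus host
const ALLOCATED_CORES = 2; // illustrative allocation

// rate() over the cumulative CPU counter yields cores in use; dividing by the
// allocation gives a 0-1 saturation ratio. The label selector is an assumption.
const query =
  'sum(rate(container_cpu_user_seconds_total{container="function-orchestrator"}[5m]))';

async function cpuSaturation(): Promise<number> {
  const url = `${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(query)}`;
  const body = await (await fetch(url)).json();
  const coresInUse = Number(body.data.result[0]?.value[1] ?? 0);
  return coresInUse / ALLOCATED_CORES;
}

cpuSaturation().then(s =>
  console.log(`CPU saturation: ${(s * 100).toFixed(1)}% (upper limit 100%)`)
);
```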

Wikifunctions-Specific Metrics (future phase)

In addition to system-level SLOs and performance metrics, we'd like visibility into how Wikifunctions is used. The following metrics provide insight into function popularity, connectivity, and resource consumption.

Note that the current and historic WikiLambda content is stored in blobs, so querying the live content databases (DBs) for some of these metrics would be prohibitively expensive. To support performant direct queries on WikiLambda content, we can either set up alternative data storage or add secondary indexes to the existing DBs.

Alternatively, instead of reading real-time values, we could set up periodic scans that run daily or weekly. This mechanism can be implemented on Toolforge, like the Wikidata stats, with the intermediate data saved in replica DBs (see Data Services); a sketch of such a scan follows the table below. A benefit of this approach is that the queries can be crowdsourced to volunteers. Here is an example of the query configuration.

Metric | Format | Collection method
Traffic | Time series (requests per day) | Existing Prometheus metric
Number of functions | Time series (cumulative) | Replica DB
Number of function implementations | Time series (cumulative) | Replica DB
Number of implementations per function | Daily average, median, and top ten | Replica DB
Number of languages | Daily lists | Replica DB
Number of function calls | Time series (per day) | Additional metrics event hook and additional storage solution needed
Most called implementations | Daily top ten | Additional metrics event hook and additional storage solution needed
Most frequently referenced functions from other implementations | Current most-linked | Replica DB (needs an implementation-to-function call link table)
Function connectivity | Connectivity graph of the current top 100 referenced functions | Replica DB (needs an implementation-to-function call link table)
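
To illustrate the periodic-scan approach for the replica-DB metrics above, here is a hedged sketch of a daily "number of functions" count. It assumes the mysql2 driver, a Toolforge-style replica connection, and the WikiLambda wikilambda_zobject_labels table with wlzl_zobject_zid and wlzl_type columns; the actual schema, database, and host names should be verified against the WikiLambda extension before relying on this.

```typescript
// Sketch of a daily Toolforge-style scan for the "number of functions" metric.
// Table/column names, host, and database are illustrative assumptions.
import mysql from 'mysql2/promise';

async function countFunctions(): Promise<number> {
  const conn = await mysql.createConnection({
    host: 'wikifunctionswiki.analytics.db.svc.wikimedia.cloud', // hypothetical replica host
    user: process.env.TOOL_REPLICA_USER,
    password: process.env.TOOL_REPLICA_PASSWORD,
    database: 'wikifunctionswiki_p', // hypothetical replica database name
  });
  // Z8 is the function type; counting distinct ZIDs of that type approximates
  // "number of functions".
  const [rows] = await conn.execute(
    "SELECT COUNT(DISTINCT wlzl_zobject_zid) AS n FROM wikilambda_zobject_labels WHERE wlzl_type = 'Z8'"
  );
  await conn.end();
  return (rows as unknown as Array<{ n: number }>)[0].n;
}

countFunctions().then(n =>
  console.log(`${new Date().toISOString().slice(0, 10)}: ${n} functions`)
);
```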

Health Checks

To confirm the liveness and correctness of the WikiLambda API, we will implement periodic health checks that ping the API with basic requests and compare the responses against expected outcomes. The health checks are embedded in the WikiLambda API, and a Prometheus blackbox check will invoke the health-check endpoint. An alert will be triggered if any check fails.
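
As a sketch of what such an embedded health check could look like, the snippet below registers an endpoint on an Express app and runs a list of basic checks. The orchestrate callback, the route path, the ZID, and the expected output are all illustrative assumptions rather than the real test cases.

```typescript
import express from 'express';

// `orchestrate` stands in for the internal call that runs a function;
// wire it to the real implementation in the service.
type Orchestrate = (call: unknown) => Promise<unknown>;

export function registerHealthCheck(app: express.Application, orchestrate: Orchestrate) {
  // Each check pairs a basic request with the response we expect back.
  // The ZID and arguments are placeholders.
  const checks = [
    { name: 'echo-string', call: { zobject: 'Z10000', args: ['hello'] }, expected: 'hello' },
  ];

  app.get('/_health', async (_req, res) => {
    const results: { name: string; ok: boolean }[] = [];
    for (const check of checks) {
      try {
        const actual = await orchestrate(check.call);
        results.push({ name: check.name, ok: actual === check.expected });
      } catch {
        results.push({ name: check.name, ok: false });
      }
    }
    const healthy = results.every(r => r.ok);
    // A non-200 status here is what the Prometheus blackbox probe alerts on.
    res.status(healthy ? 200 : 500).json({ healthy, results });
  });
}
```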

Dashboards

The performance SLOs, the resource saturation metrics, and the traffic information will be presented as dashboards on Grafana.

All Wikifunctions-specific metrics except traffic will be hosted elsewhere, as they are not production-critical. Metrics like function call graphs are more complicated and may require custom graph drawing, like the wikidata-todo.toolforge.org charts.

Although most metrics will be separated by service, we will group the function orchestrator and function evaluator metrics side by side to provide insight into potential performance bottlenecks.

In addition to Grafana, we expect a stats.wikimedia.org page for the Wikifunctions project that will display viewership, contributor, and content information. In order to be included in the Analytics Query Service (AQS), we will need to edit the MW history include list and the Pageviews include list.

Alerting

Alerting is supported by Alertmanager, part of the Prometheus ecosystem. Current alerts can be browsed at alerts.wikimedia.org (NDA access only). Of the three metric categories, the performance SLOs and the resource saturation metrics should be equipped with alerting rules; alerting is optional for the Wikifunctions-specific metrics.

Currently, the Abstract Wikipedia team's alert receiver is defined in alertmanager.yml.erb in the Puppet repository.

There are multiple ways to implement alerts. The alerting rules can either be added to the Wikimedia alerts repository with alert routing set up (example), or added directly to the Grafana dashboards. Alerts are defined as Prometheus alerting rules.

  • Timing of the alert: immediately after the threshold is crossed (noisy to start with; adjust as we go)
  • Who to alert: AW team
  • Severity: warning for the initial phase, critical after the user base grows

Infrastructure

The main metric-reporting mechanism is backed by Prometheus, which supports production-critical metric reporting and alerting and has built-in Grafana integration. Specifically, this project interfaces with Prometheus through the Service Runner component of the Service Template Node, whose Prometheus client for Node.js exposes a node-statsd-style interface. The workflow and syntax for metric reporting can be found in the above-mentioned frameworks' READMEs. As an example, this is part of Citoid's metric-logging setup.
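
As a hedged sketch of that statsd-style workflow, the snippet below times every request and bumps an error counter. The metric names are illustrative, and the exact method names and metrics object wiring should be confirmed against the service-runner README rather than taken from this sketch.

```typescript
import express from 'express';

// Shape assumed for the statsd-style metrics object that service-runner
// injects into the service at startup; verify against the actual interface.
interface StatsdLike {
  increment(name: string): void;
  timing(name: string, valueMs: number): void;
}

export function instrument(app: express.Application, metrics: StatsdLike) {
  app.use((req, res, next) => {
    const start = Date.now();
    res.on('finish', () => {
      // One timing sample per request, plus a counter bump for 5xx responses.
      metrics.timing('wikifunctions.request_duration', Date.now() - start);
      if (res.statusCode >= 500) {
        metrics.increment('wikifunctions.request_errors');
      }
    });
    next();
  });
}
```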

The built-in Prometheus JS client already provides endpoints for the basic metrics, such as overall request latency and error rate. To log more advanced metrics (such as usage per function implementation), custom event metric hooks will need to be implemented.
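
A custom hook could look roughly like the following sketch, which uses the prom-client library directly to count calls per implementation. The metric name and label are assumptions, and a per-implementation label carries a cardinality cost, which is part of why the Data Storage section below considers alternatives.

```typescript
// Sketch of a custom event metric hook for per-implementation usage.
import { Counter, register } from 'prom-client';

const implementationCalls = new Counter({
  name: 'wikifunctions_implementation_calls_total', // illustrative metric name
  help: 'Function calls, labelled by the implementation ZID that served them',
  labelNames: ['implementation'],
});

// Call this from the orchestrator wherever an implementation is selected.
export function recordImplementationCall(implementationZid: string): void {
  implementationCalls.inc({ implementation: implementationZid });
}

// Service-runner-based services already expose a /metrics endpoint backed by
// the default registry; this helper is only shown for completeness.
export async function metricsText(): Promise<string> {
  return register.metrics();
}
```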

Data Storage

For metrics that are not production-critical (i.e. most Wikifunctions-specific metrics), Prometheus is not the ideal solution: these metrics require different sampling rates, storage space, and retention periods. For example, records of function invocations would require a large keyspace and a short retention period. Additional data storage options need to be set up to accommodate such data; this is an ongoing discussion tracked in T309792.

User Story

Many of the SLOs come from concrete user stories, which are captured below.

As a WMF Ops team member, I want to…  

  • access current and historical information and patterns of how the orchestrator and evaluator services are performing / have performed, using the tools I am familiar with, so that I can enable my team to make better decisions about resource requirements, capacity planning, and configuration changes
  • be confident that orchestrator and evaluator services emit logging and monitoring signals in a way that complies with established Foundation standards, so that I can enable my team to manage Wikifunctions services in a consistent manner
  • be alerted as soon as possible when the orchestrator or evaluator service has failed or is in a degraded state, so that I can enable my team to respond to restore the failed or degraded service
  • understand, in the event of a service failure or degradation, what events as captured in logs led up to that state, so that I can enable my team to conduct root cause analysis and prevent or mitigate similar issues in the future
  • understand how to respond to a service failure or degradation in a way that addresses the unique needs of the orchestrator or evaluator service, so that I can enable my team to be context aware and reduce time to resolution
  • understand and access current and prior security signals with or without alerts, so that I can enable my team to perform threat pattern analysis
  • identify calls from suspicious origins, so that I can alert my team to suspicious load and decide whether or not to block
  • identify and (automatically) respond to resource hogging orchestrator calls, so that I can prevent usage that could degrade the service's performance
  • enact lockdown to the desired level of granularity (Wikifunctions application, orchestrator or evaluator service, individual function), so that I can prevent abusive usage