SLO/WDQS

Status: draft

Organizational

Service

See Wikidata_Query_Service

Teams

The Search team is the sole team responsible for WDQS.

Architectural

Environmental dependencies

WDQS runs on bare-metal hosts (wdqs*) in the eqiad and codfw datacenters.

WDQS is split into the public cluster, which serves general traffic from query.wikidata.org, and the internal cluster, which serves requests coming from MediaWiki.

Service dependencies

WDQS runs on Blazegraph, a Java-based graph database that implements the RDF standard.

Each wdqs host contains a full copy of the entire dataset; that is to say there is no clustering or sharding.

The most common source of WDQS failures is a failure in the Blazegraph backend. Over the medium to long term, WDQS will need to be migrated off Blazegraph onto a different backend. For more information, see Wikidata_Query_Service/ScalingStrategy.

Client-facing

Clients

WDQS' clients are users who submit queries through query.wikidata.org, as well as automated bots.

Service Level Indicators (SLIs)

Uptime (availability) percentage: The percentage of all requests receiving a non-error response, defined as one of: HTTP 200 (success), HTTP 403 (client banned), or HTTP 429 (client throttled).

In plain English, this is the percentage of requests that either succeeded or were refused because the client was banned or temporarily throttled. We don't count banned or throttled requests against our success rate because they represent the service functioning as intended, rather than indicating a problem.
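
As a minimal sketch of how this SLI could be computed (not the production implementation, which is derived from time-series metrics), assuming a hypothetical list of per-request HTTP status codes:

    # Sketch of the uptime (availability) SLI, assuming we have the HTTP status
    # code of every request in the measurement window. Illustrative only.
    ACCEPTABLE = {200, 403, 429}  # success, client banned, client throttled

    def availability(status_codes):
        """Return the percentage of requests counted as non-errors."""
        if not status_codes:
            return 100.0
        ok = sum(1 for code in status_codes if code in ACCEPTABLE)
        return 100.0 * ok / len(status_codes)

    # Example: 9,500 successes, 200 throttled, 300 server errors -> 97.0
    print(availability([200] * 9500 + [429] * 200 + [500] * 300))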

Excessive lag percentage: The percentage of time that the maximum update lag of the cluster exceeds the desired threshold (currently, 10 minutes).

Each host reports an independent lag number representing how far behind that individual server is from the queue of update events. For this SLI we take the value from the highest-lag pooled host.
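
A similar sketch for this SLI, assuming hypothetical per-host lag samples (in seconds) taken at regular polling intervals; the fleet-wide value at each interval is the maximum across pooled hosts:

    # Sketch of the excessive lag SLI, assuming `samples` is a list where each
    # entry holds the update lag (in seconds) of every pooled host at one
    # polling interval. Hypothetical data shape; the real value comes from Grafana.
    THRESHOLD_SECONDS = 10 * 60  # 10-minute maximum acceptable update lag

    def excessive_lag_percentage(samples):
        """Percentage of intervals where the worst pooled host exceeds the threshold."""
        if not samples:
            return 0.0
        bad = sum(1 for per_host_lag in samples if max(per_host_lag) > THRESHOLD_SECONDS)
        return 100.0 * bad / len(samples)

    # Example: three polling intervals; only the second has a host over 10 minutes.
    print(excessive_lag_percentage([[30, 45], [700, 120], [50, 40]]))  # ~33.3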

Operational

Monitoring

WDQS hosts emit time-series metrics, which are visible via Grafana.

WDQS update lag can be monitored via this Grafana dashboard.

Troubleshooting

WDQS relies upon Blazegraph, which has a number of known issues.

Simple issues, such as a one-off deadlock in Blazegraph, can be fixed with a service restart on the affected host once the problem is diagnosed.

However, deeper failures, such as the entire fleet of wdqs hosts repeatedly going into Blazegraph deadlock despite service restarts, are harder to debug. In the past, when we encountered this condition, we checked the logs to make a best guess at which queries or clients were likely to be the source of the problem, and then manually banned the suspected user agents or IPs.

For more guidance on troubleshooting please refer to the WDQS runbook.

Deployment

See Wikidata_Query_Service#Production_Deployment

Service Level Objectives

Realistic targets

A realistic target for uptime percentage would be for 95% of queries to be acceptable (response code of 200, 403, or 429), with no particular upper bound on latency.

A realistic target for excessive lag percentage would be for the maximum lag to stay within the desired threshold (currently, 10 minutes) 95% of the time.

Ideal targets

An ideal target for uptime percentage would be for 99% of queries to be acceptable (response code of 200, 403, or 429).

An ideal target for excessive lag percentage would be for the maximum lag to stay within an aggressive threshold of 1 minute 99% of the time. This would imply near-instant propagation of changes when a user edits a Wikidata item.
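
To make the lag targets concrete, here is a rough sketch of the implied error budget, assuming a 30-day rolling window (the window length is an illustrative assumption, not part of the SLO):

    # Error-budget arithmetic for the lag targets, assuming a 30-day window.
    WINDOW_HOURS = 30 * 24  # 720 hours

    for target in (0.95, 0.99):
        budget_hours = WINDOW_HOURS * (1 - target)
        print(f"{target:.0%} objective -> {budget_hours:.1f} hours of allowed excessive lag per 30 days")

    # 95% -> 36.0 hours; 99% -> 7.2 hours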


Reconciliation

We are not currently in a state with WDQS where we have the resources required to hit the ideal SLOs, so the realistic SLO will be the actual SLO. This represents a level of service that lets us avoid paging for service instability: there is enough headroom built into the SLO that operators can wait until business hours to respond rather than potentially being woken up or working on the weekend.