SLO/WDQS
Status: draft
Organizational
Service
Teams
The Search team is the sole team responsible for WDQS.
Architectural
Environmental dependencies
WDQS runs on bare-metal hosts (wdqs*) in the eqiad and codfw datacenters.
WDQS is split into the public cluster, which serves general traffic from query.wikidata.org, and the internal cluster, which serves requests coming from MediaWiki.
Service dependencies
WDQS runs on Blazegraph, a Java-based graph database that implements the RDF standard.
Each wdqs host contains a full copy of the dataset; that is to say, there is no clustering or sharding.
The most common source of WDQS failures is a failure in the Blazegraph backend. Over the medium to long term, WDQS will need to be moved off of Blazegraph onto a different backend. For more information, see Wikidata_Query_Service/ScalingStrategy.
Client-facing
Clients
WDQS' clients are users who submit queries through query.wikidata.org, as well as automated bots.
Service Level Indicators (SLIs)
Uptime (availability) percentage: The percentage of all requests receiving a non-error response, defined as one of: HTTP 200 (success), HTTP 403 (client banned), or HTTP 429 (client throttled).
In plain English, this is the percentage of requests that either succeeded or were refused because the client was banned or temporarily throttled. We don't count banned or throttled requests against our success rate because they represent the service functioning as intended, rather than indicating a problem.
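As a rough illustration, here is a minimal sketch of how this SLI could be computed from per-status request counts. The counts and the function below are hypothetical stand-ins for whatever the real metrics pipeline provides, not existing tooling.

# Minimal sketch of the availability SLI. The status-code counts are
# hypothetical; in practice they would come from the metrics pipeline.

ACCEPTABLE_STATUSES = {200, 403, 429}  # success, banned, throttled

def availability_percentage(status_counts: dict[int, int]) -> float:
    # Percentage of requests answered with a non-error response.
    total = sum(status_counts.values())
    acceptable = sum(n for code, n in status_counts.items()
                     if code in ACCEPTABLE_STATUSES)
    return 100.0 * acceptable / total if total else 100.0

# Example: 9500 successes, 200 bans, 100 throttles, 200 server errors -> 98.0
print(availability_percentage({200: 9500, 403: 200, 429: 100, 500: 200}))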
Excessive lag percentage: The percentage of time that the maximum update lag of the cluster exceeds the desired threshold (currently, 10 minutes).
Each host reports an independent lag number representing how far behind that individual server is from the queue of update events. For this SLI we take the value from the highest-lag pooled host.
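A similar sketch for the excessive-lag SLI, assuming the per-host update lag of pooled hosts is sampled at a fixed interval. The host names and lag values below are purely illustrative.

LAG_THRESHOLD_SECONDS = 10 * 60  # the desired threshold: 10 minutes

def excessive_lag_percentage(samples: list[dict[str, float]]) -> float:
    # Each sample maps pooled hostname -> update lag in seconds.
    # A sample is "bad" when the highest-lag pooled host exceeds the threshold.
    if not samples:
        return 0.0
    bad = sum(1 for sample in samples
              if max(sample.values()) > LAG_THRESHOLD_SECONDS)
    return 100.0 * bad / len(samples)

# Example: the worst pooled host exceeds 10 minutes in 1 of 4 samples -> 25.0
print(excessive_lag_percentage([
    {"wdqs1004": 30, "wdqs2001": 45},
    {"wdqs1004": 700, "wdqs2001": 90},
    {"wdqs1004": 120, "wdqs2001": 60},
    {"wdqs1004": 15, "wdqs2001": 20},
]))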
Operational
Monitoring
WDQS hosts emit time-series metrics, which are visible via Grafana.
WDQS Update lag can be monitored via this Grafana dashboard.
Troubleshooting
WDQS relies on Blazegraph, which has a number of known issues.
Simple issues, such as a one-off deadlock in Blazegraph, can be fixed with a service restart on the affected host once the problem is diagnosed.
However, deeper failures, such as the entire fleet of wdqs hosts repeatedly going into Blazegraph deadlock despite service restarts, are harder to debug. In the past, when we encountered this condition, we checked the logs to make a best guess about which queries or clients were likely the source of the problem, and then manually banned the suspected user agents or IPs.
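As a hedged illustration of that triage step, the sketch below tallies requests by user agent from an access log. The log path and the assumption that the user agent is the last quoted field on each line are illustrative only, not a description of the actual log format or tooling.

import re
from collections import Counter

UA_PATTERN = re.compile(r'"([^"]*)"\s*$')  # assume user agent is the last quoted field

def top_user_agents(log_path: str, n: int = 10) -> list[tuple[str, int]]:
    # Count requests per user agent so an unusually chatty client stands out.
    counts: Counter[str] = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = UA_PATTERN.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts.most_common(n)

# Example usage (hypothetical log path):
# for agent, hits in top_user_agents("/var/log/nginx/wdqs-access.log"):
#     print(f"{hits:>8}  {agent}")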
For more guidance on troubleshooting please refer to the WDQS runbook.
Deployment
See Wikidata_Query_Service#Production_Deployment
Service Level Objectives
Realistic targets
A realistic target for uptime percentage would be that 95% of queries are acceptable (response code of 200, 403, or 429), with no particular upper bound on latency.
A realistic target for excessive lag percentage would be that, 95% of the time, the max lag is within the desired threshold (currently, 10 minutes).
Ideal targets
An ideal target for uptime percentage would be that 99% of queries are acceptable (response code of 200, 403, or 429).
An ideal target for excessive lag percentage would be that, 99% of the time, the max lag is within an aggressive threshold of 1 minute. This would imply near-instant propagation of changes when a user edits a Wikidata item.
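To make the difference between these targets concrete, here is a quick error-budget illustration for the time-based lag SLI. The 30-day evaluation window is an assumption for illustration, not a stated policy.

def lag_error_budget_hours(slo_percent: float, window_days: float = 30) -> float:
    # Hours per window during which max lag may exceed the threshold
    # before the objective is breached.
    return (1 - slo_percent / 100) * window_days * 24

print(lag_error_budget_hours(95))  # 36.0 hours under the realistic target
print(lag_error_budget_hours(99))  # ~7.2 hours under the ideal target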
Reconciliation
We are not currently in a state with WDQS where we have the resources to hit the ideal SLOs, so the realistic SLO will be the actual SLO. This represents a level of service that does not require paging for service instability: there is enough headroom built into the SLO that operators can wait until business hours to respond, rather than potentially being woken up or working over the weekend.