
Wikidata Query Service/WDQS Architecture re-design

From Wikitech

Team: Wikidata Platform
Date: 2026-5

Overview

This document describes the design of the architecture of the next version of the WDQS service. As we move the backend away from Blazegraph, we need to decouple the database from service layers, and handle bespoke logic that is currently implemented in the database itself. We want the next version of WDQS to be database agnostic, with an API contract that supports the SPARQL 1.1 spec. Embedding service logic in the database layer creates tight coupling to a specific engine, obscures system behavior, and limits observability. The proposed architecture addresses these issues by moving logic into the application layer, where it can be versioned, tested, measured, and evolved independently of the storage backend.

The design proposed in this document is still undergoing review.

Background

WDQS currently couples service logic directly into Blazegraph, resulting in operational fragility, poor observability, and vendor lock-in. Additionally, Blazegraph was abandoned in 2018 and will receive no further development or support. For these reasons, the Wikidata Platform team is migrating to a new backend system in FY27.

This document introduces a service architecture that decouples the query service from the storage layer[1]. Key features:

  • Proxy middleware handles query routing, logging, federation, and Blazegraph-specific query rewrites.
  • RDF databases remain backend-only, supporting SPARQL 1.1.
  • Observability shifts to the service layer. Authentication and rate limiting will be largely handled through integration with Wikimedia’s API Gateway (REST Gateway).

To support our long-term platform evolution goals, we want to provide a globally available, scalable, and maintainable WDQS platform with clear operational boundaries. The platform should enable a safe migration from Blazegraph while supporting existing query classes and SLOs, and it should accommodate future needs, particularly mapping query classes to a product-defined Quality of Service (QoS) framework.

Goals

  • Describe the functional goals of our target architecture, which decouples the database from the service layers and provides a path forward from Blazegraph-specific features (e.g. wikibase:label). The service should handle 20k requests/minute, be available globally as active/active[2] in both the eqiad and codfw data centers (DCs), and support latencies of up to 60 seconds at p95 (measured over a quarter). The minimum set of availability requirements is captured in the WDQS SLO page for the external, internet-facing endpoint.[3] Note that our active/active strategy is meant for high availability, not necessarily load sharing. Concretely, we need to size each DC so that it can serve all the traffic on its own, even if most of the time the traffic is split between the two (this is the strategy for all of our active/active services).
  • The service exposes a secondary store (RDF database) to MediaWiki’s wikidata-wiki. Its purpose is to allow performant read access patterns on graph data that are inefficient on a relational database like MariaDB. It should handle programmatic data reloads from dumps and explicit, automated reconciliation patterns. Real-time index updates from Wikidata edits will be supported using the existing change propagation pipeline (rdf streaming updater).
  • The service should support authentication and rate limiting in accordance with Wikimedia policy. This will be achieved by integrating with the API Gateway (REST Gateway).
  • The service is flexible enough to support deployments with different QoS requirements, according to Product needs. These QoS requirements should inform SLO needs (availability requirements, need for an SRE on-call rota).

Non-Goals

  • This is not a database evaluation. Our design is database agnostic, and assumes standard protocols available from vendors (SPARQL 1.1, HTTP).
  • This document does not try to solve all implementation details at once. It’s a living document and will be updated as we progress with the migration.
  • The proposed design is scoped to provide continuity for the current graph split (main and scholarly graphs) data model. While our capacity model accounts for dataset growth, the architecture assumes support for the current model and the continued existence of federated graphs in the Wikidata ecosystem as defined through community data governance.

We move business logic into an application service (wdqs-proxy), deployed on Kubernetes. The latter acts both as a frontend and pass-through proxy to RDF databases deployed in eqiad and codfw nodes. The middleware service receives incoming requests and transparently passes them through to the RDF databases. This architecture integrates with API Gateway (REST Gateway), to add a layer of defense against abusive queries, and allows us to develop and deploy features independently of the RDF database.

Request flow and query classes

WDQS serves a variety of query classes and needs to support query federation request flows.

Request flow

Figures 1 to 4 compare the request flow and query federation sequence in WDQS v1 (current), and the WDQS v2 (future) architecture.

In WDQS v1 (current), there is no clear separation of concerns between control and data planes. A client submits a SPARQL query directly to the RDF database, which performs a SPARQL federation policy check and handles execution of the local query as well as federation to the permitted external endpoints. Responses from both the local and federated sources are collected by the RDF database and merged in a response aggregation stage, after which they are passed back to the client. The service is exposed to the public internet behind a load balancer, but there is no API Gateway integration to manage authentication and rate limits. In case of traffic spikes, rate limits need to be manually applied via requestctl rules.[4] Logging and metrics are collected directly from the RDF database.


Figure 1: WDQS v1 (current): request flow


Figure 2: WDQS v2 (future): request flow

When a client sends a request to WDQS v2, it first hits the API Gateway (REST Gateway), which handles authentication, rate limiting, and routing. The request then enters the Proxy Service, which acts as the central control plane. The first step in the proxy is a SPARQL federation policy check: the service determines whether the query is allowed. Allowed queries are then passed through a query rewrite stage, where selected Blazegraph-specific extensions (wikibase:label) are translated into standard SPARQL 1.1 syntax. The rewrite stage is a migration aid: it contributes to query time only during the migration and will be removed as users make their queries compliant with the SPARQL 1.1 standard. Queries are forwarded to the RDF database, which handles execution of the local query as well as federation to the permitted external endpoints. Responses from both the local and federated sources are collected by the RDF database and merged in a response aggregation stage, after which they pass back through the proxy for final response handling. The API Gateway (REST Gateway) then returns the combined response to the client. Throughout this process, the proxy service emits logging and metrics asynchronously to Prometheus and the logging pipeline, capturing request details, response latency, and execution metadata for observability and operational monitoring.
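The federation policy check described above can be illustrated with a minimal sketch. It assumes the allow list is a set of endpoint IRIs and that federated endpoints appear as `SERVICE <iri>` clauses. The names `ALLOW_LIST`, `federated_endpoints`, and `is_allowed` are hypothetical, and the regular expression is a simplification: a real implementation would more likely parse the query, so that prefixed names, variables, comments, and string literals are handled correctly.

```python
import re

# Hypothetical allow list; in the design it is loaded from configuration.
ALLOW_LIST = {
    "https://query.wikidata.org/sparql",
    "https://dbpedia.org/sparql",
}

# Matches SERVICE <iri> and SERVICE SILENT <iri> clauses (full IRI form only).
SERVICE_RE = re.compile(r"\bSERVICE\s+(?:SILENT\s+)?<([^>]+)>", re.IGNORECASE)


def federated_endpoints(query: str) -> set[str]:
    """Return the endpoint IRIs named in SERVICE clauses of the query."""
    return set(SERVICE_RE.findall(query))


def is_allowed(query: str, allow_list: set[str] = ALLOW_LIST) -> bool:
    """A query passes when every federated endpoint is on the allow list."""
    return federated_endpoints(query) <= allow_list
```

A query without any SERVICE clause yields an empty set and therefore passes the check, which matches the pass-through behaviour for purely local queries.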


Figure 3. WDQS v1 (current): Query federation sequence


Figure 4. WDQS v2: Query federation sequence

While we don’t have a strict response budget, a hard query timeout on requests is enforced at 60 seconds. The 60 second timeout is defined by the current SLO for external endpoints and is our starting working hypothesis. As we map query classes to a quality of service model, we might have a tiered timeout configuration. While we don’t have clearly defined metrics for this, the service will be designed to support configuration variants.

Query Classes

This section provides an overview of the query classes currently being satisfied by WDQS.

1. Interactive Queries
  Purpose: Queries submitted by users via the public UI or API that expect a fast response.
  Characteristics:
  • Typically short, well-scoped SPARQL queries
  • Low to moderate resource usage
  • Latency-sensitive (e.g. p95 < 5s, p99 < 10s)
  Operational Handling:
  • Timeout applied to prevent blocking other queries
  • May be throttled or rate-limited per user or API key by the API Gateway (REST Gateway)

2. Long-Running / Analytical Queries
  Purpose: Complex queries that aggregate large amounts of data, often for research or analytics.
  Characteristics:
  • Heavy use of joins, federated queries, or full-graph scans
  • Can take tens of seconds to minutes
  • High resource usage
  Operational Handling:
  • Timeout applied to prevent blocking other queries
  • May be throttled or rate-limited per user or API key

3. Federated Queries
  Purpose: Queries that require data from external SPARQL endpoints.
  Characteristics:
  • Network latency dominates execution time
  • May fail due to external endpoint unavailability
  • Includes queries that combine local and remote data
  Operational Handling:
  • Timeout rules applied for external calls
  • Response may be partial if some endpoints fail
  • Logging might include endpoint-specific metrics

4. Administrative / Internal Queries
  Purpose: Queries executed by the system itself for monitoring, indexing, or internal operations.
  Characteristics:
  • Usually predictable and low-volume
  • Do not count against public rate limits
  • Likely to hit the internal endpoint
  Operational Handling:
  • Executed with high reliability but low priority on resource contention
  • Can bypass federation rules

Table 1: WDQS query classes.

Currently, we don’t have a standard, automated way to assign queries, or specific actors, to a query class. The tables below report the distribution of latency buckets for January - March 2026 on the internal (Table 2) and external (Table 3) facing endpoints.[5]

Query duration | Number of queries | % of total queries
1_less_10ms | 428,004 | 0.04
2_10ms_to_100ms | 932,379,697 | 94.8
3_100ms_to_1s | 51,038,187 | 5.2
4_1s_to_10s | 148,158 | 0.02
5_more_10s | 5,396 | < 0.01

Table 2. January - March 2026 query latencies measured in wdqs-internal (main and scholarly).

Internal endpoints serve specific actors: WikibaseQualityConstraints, Kartotherian, and a generic MediaWiki SparqlClient. This traffic originates within Wikimedia’s network and exhibits predictable volumes, patterns, and latency profiles.

Query duration | Number of queries | % of total queries
1_less_10ms | 23,988,800 | 2
2_10ms_to_100ms | 654,577,408 | 54
3_100ms_to_1s | 461,108,821 | 38.1
4_1s_to_10s | 47,463,706 | 3.9
5_more_10s | 24,154,242 | 2

Table 3. January - March 2026 query latencies measured in wdqs-external (main and scholarly).

External traffic consists of any requests made to the service from the public internet. It includes several personas, involving both human and non-human actors. Currently, it does not provide any quality of service guarantees. This traffic spans all query classes described in Table 1.

Capacity Model

External endpoints

The external WDQS endpoints are expected to handle ~20k requests/min (~333 rps), with a worst-case concurrency of ~20k in-flight requests given a 60s timeout. This is the expected worst case scenario traffic volume based on existing traffic, and does not factor in rate-limiting via the API Gateway (REST Gateway). Based on traffic patterns, growth projections[6], and efforts to move traffic away to other APIs, we expect the WDQS traffic to stay in the same order of magnitude in the upcoming 50 months.
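The relation between request rate, timeout, and worst-case concurrency follows Little's law (in-flight requests = arrival rate × time in system). The sketch below reproduces the arithmetic using the figures quoted above; the function names are illustrative only, and the worst case assumes every request runs until the timeout.

```python
REQUESTS_PER_MINUTE = 20_000  # expected worst-case traffic, from the text
TIMEOUT_SECONDS = 60          # hard query timeout, from the text


def arrival_rate_rps(requests_per_minute: float) -> float:
    """Convert a per-minute request volume into requests per second."""
    return requests_per_minute / 60


def worst_case_in_flight(requests_per_minute: float, timeout_seconds: float) -> float:
    """Little's law: concurrency = arrival rate * time spent in the system.

    The worst case assumes every request runs until the timeout.
    """
    return arrival_rate_rps(requests_per_minute) * timeout_seconds


if __name__ == "__main__":
    print(round(arrival_rate_rps(REQUESTS_PER_MINUTE)))                       # requests per second
    print(round(worst_case_in_flight(REQUESTS_PER_MINUTE, TIMEOUT_SECONDS)))  # in-flight requests
```

In practice most queries finish well below the timeout (see the latency buckets above), so typical concurrency should be far lower than this upper bound.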

We estimate that 4-10 wdqs-proxy instances (per data center) are sufficient under normal operating conditions, with additional headroom for failure scenarios. Instances with 2 cores and 2 GB of RAM should be sufficient.

The primary scaling constraints for the proxy service are memory usage per request and downstream backpressure rather than CPU or thread availability. In particular:

  • Large result sets and federated queries increase per-request footprint (handled by the RDF database)
  • Unbounded concurrency may overload backend RDF databases

To mitigate this, the proxy could enforce per-node backpressure controls to ensure backend stability[7]. Integration with API Gateway (REST Gateway) will provide global rate limiting.
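One common way to implement per-node backpressure is a bounded concurrency limiter that sheds load instead of queueing without limit. The following is a minimal sketch, not the planned implementation: `ConcurrencyLimiter` and `handle` are hypothetical names, and answering with HTTP 503 when the limit is reached is an assumption (the document does not specify the rejection behaviour).

```python
import threading
from typing import Callable


class ConcurrencyLimiter:
    """Caps the number of requests forwarded to a backend at the same time."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: callers are rejected rather than queued.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle(limiter: ConcurrencyLimiter, forward: Callable[[], str]) -> tuple[int, str]:
    """Forward a request if a slot is free, otherwise shed load."""
    if not limiter.try_acquire():
        # Assumed status code; the design does not specify one.
        return 503, "backend busy, retry later"
    try:
        return 200, forward()
    finally:
        limiter.release()
```

Rejecting early keeps the proxy's memory footprint bounded and gives the RDF database a predictable maximum number of concurrent queries per node.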

Backend load is estimated at ~350–400 effective qps when accounting for federated query fan-out. Assuming 50–100 qps per RDF database node, this requires 4-8 nodes per cluster[8], consistent with current (64 cores, 128GB memory) and planned capacity (>64 cores, >256GB memory, NVMe drives?). As of March 2026, the database holds 8B triples for Wikidata main and 9B triples for the scholarly graph. We measured indexing and latency using datasets of up to 20B triples (full graph); the main bottlenecks are memory and disk IOPS. With an estimated growth of 1B triples per year, the planned capacity should provide at least 5 years of runway.
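The node count above is a simple ceiling division of the backend load by the per-node throughput. A small sketch of that arithmetic follows, using the figures from the text. The function name is illustrative, and the floor of three nodes reflects the minimum host count mentioned in this document.

```python
import math


def nodes_needed(backend_qps: float, qps_per_node: float, min_nodes: int = 3) -> int:
    """Nodes required to serve backend_qps, never below the availability minimum."""
    return max(min_nodes, math.ceil(backend_qps / qps_per_node))


if __name__ == "__main__":
    # Pessimistic case: highest load estimate, lowest per-node throughput.
    print(nodes_needed(400, 50))
    # Optimistic case: lowest load estimate, highest per-node throughput.
    print(nodes_needed(350, 100))
```

The two extremes bracket the 4-8 node range quoted above.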

The dominant latency component remains query execution in the RDF database; proxy service overhead is expected to remain <1% of total latency.

Internal endpoints

The internal WDQS endpoints are expected to handle a similar request volume to the external ones (20k/min at peak), but with more predictable seasonality and query complexity, and lower load on database nodes (99% of the queries complete in < 1s). The current WDQS/Blazegraph nodes (3x per data center) are underutilized, and can already satisfy >100 qps. A deployment of two proxy instances (per data center) should suffice for the service layer. Capacity for the database nodes should be set to the minimum required by Data Platform SRE for availability (3x nodes per cluster).

Failure modes

Table 5 below summarizes known failure modes for the WDQS architecture, based on observed behaviour and patterns that will persist on a new system. It qualifies their impact, estimates their likelihood, and lists known mitigations (technology and design patterns) that may be applied to reduce the likelihood and/or impact of each failure mode.

# | Failure Mode | Likelihood | Impact | Known Mitigations
F1 | Proxy memory exhaustion / queue overflow | Medium | High | Enforce per-request memory limits, max concurrent requests per proxy instance, manually scale proxy pods.
F2 | RDF database overload[9] | Medium | High | API Gateway (REST Gateway) global rate limits, per-node QPS limits, and request backpressure; monitor latency and scale nodes dynamically.
F3 | Federated query failures / timeouts | Medium | Medium | Circuit breaker, timeout, retry policies; partial responses flagged; telemetry for endpoint reliability.
F4 | Federation allow-list reload failure | Low | Medium | Use atomic config swaps; pre-validate allow-list; refresh without dropping requests.
F5 | Multi-DC / node data inconsistency | Low | High | Track Kafka offsets, enforce data quality gate before re-pooling nodes; reconcile with full dump if needed; regular audit for consistency.
F6 | API Gateway (REST Gateway) downtime / bottleneck | Low | Critical | Fallback endpoints if possible, rate-limiting to prevent cascading failures.
F7 | Long-running analytical queries blocking resources | Medium | Medium | Cost-based rate limit at REST Gateway level; move analytic query classes to dedicated nodes (QoS); enforce hard timeout.
F8 | Query rewrite coverage gaps (Blazegraph extensions) | Medium | High | Validate traffic coverage using replayed queries; fallback errors clearly returned; incremental coverage updates pre-cutover.

Table 5. WDQS Failure modes.
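The circuit breaker listed as a mitigation for F3 can be sketched as follows. This is a generic illustration of the pattern, assuming one breaker per external endpoint wherever federated calls are issued. The class name, thresholds, and the simplified half-open behaviour (all requests are allowed again once the cool-down has elapsed) are assumptions, not part of the design.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling an endpoint after repeated failures, then retries later."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then allow trial calls.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # (Re)open the breaker and restart the cool-down.
            self._opened_at = self._clock()
```

Injecting the clock keeps the breaker testable without sleeping, and the open/closed state per endpoint is the kind of signal that could feed the endpoint reliability telemetry mentioned in the table.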

Architecture

How the pieces fit together.


Figure 5: WDQS v2 service components.

The system is composed of two components: a proxy service (wdqs-proxy) and a database node, each potentially consisting of multiple deployment units.

wdqs-proxy is implemented as a Java web service using the Quarkus web framework. It will be deployed on k8s, and its build and deployment lifecycle will be managed independently from other components. It will provide the following capabilities:

Component Description Notes
Federation Allow List Management
Manages the federation allow list to control which endpoints can federate with the service. The list resides in the mainline puppet configuration repository and must be loadable at service startup and refreshable without losing service availability. This mirrors the configuration management approach already solved for EventGate.
  • EventGate bundles files (a git repo) and updates via a MediaWiki Extension at periodic intervals. The team will discuss with SRE what capabilities are available before committing to an implementation.

Query Inspection and write protection
The service acts as a passthrough proxy that:
  • passes through only read operations and rejects INSERT/DELETE/UPDATE semantics. In the current architecture this is handled by nginx, which sets a specific header telling Blazegraph to reject SPARQL UPDATE queries. In v2 we can centralize this in the passthrough layer.
  • inspects queries and supports rewriting Blazegraph-specific SPARQL extensions to SPARQL 1.1 equivalents (if required).
  • The rewrite handler would process 60% to 70% of the WDQS traffic volume in the Wikidata main graph.

Query Logging
Logs incoming queries and their responses. Currently, query logs are emitted to EventGate directly from Blazegraph. The current implementation causes some federated query details to be missed (T399829). Moving logging to the service layer provides more control and a better view of incoming requests.
  • The team would still like to trace on which WDQS RDF database node each query was executed.

Metrics Reporting
The service will report metrics and integrate natively with Prometheus, reducing dependencies on bespoke exporters. This enables greater control over reported metrics, for example per-request latency, which is currently missing.
  • Observability Platform integration.
  • Report pod health and standard operational metrics.
  • Report service health and ad-hoc metrics (e.g. query latency).
  • We aim to replicate current WDQS metrics, and extend them with finer-grained query latency and QPS reporting.
  • This will allow finer-grained metrics on federated queries.

Logstash Integration
Integration with Logstash for log aggregation and analysis.
  • Observability Platform integration.
  • Service logs at the desired level of severity will be forwarded to Logstash.

Health status reporting
Per-pod information about the latest deployed version and the SPARQL endpoint federation allow list commit id.
  • This is to track configuration drift, especially with regard to the SPARQL federation allow list.

Table 3: WDQS proxy service components.
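Write protection in the passthrough layer can be partly enforced at the protocol level. In the SPARQL 1.1 Protocol, update operations are sent either with the `application/sparql-update` media type or as an `update` form parameter, while queries use `query` or `application/sparql-query`. The sketch below rejects the former; the function name and the calling convention (content type plus decoded form parameters) are assumptions for illustration, and a real proxy would likely combine this with query inspection.

```python
def is_update_request(content_type: str, form_params: dict) -> bool:
    """Return True when the request is a SPARQL 1.1 Protocol update operation."""
    # Drop media type parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/sparql-update":
        return True  # update sent directly as the request body
    # URL-encoded form submission carrying an "update" parameter.
    return "update" in form_params
```

A request that smuggles update syntax into the `query` parameter is not caught by this check; it would be expected to fail parsing on a compliant backend, since the query grammar has no update operations.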

In the future this approach will allow the team to extend the APIs, for instance by providing facades to limit the amount of SPARQL verbiage exposed to users, possibly introduce other API models in front of the RDF database, and open the door for asynchronous query scheduling.

A database node is composed of the RDF database (qlever-server), and system utilities to support its lifecycle and health (prometheus agent, index updating agent, logstash forwarder). This matches what is currently deployed on the wdqs fleet, minus the capabilities refactored in the wdqs-proxy service.

The qlever-server processes for WDQS v2 are deployed within a k8s cluster (we target DSE). A kubelet runs on each dedicated WDQS database host (on prem hardware), registering it as a regular member of the cluster. These nodes are tainted with a NoSchedule policy so that no standard workloads are placed on them. The qlever-server itself is managed as a standard Kubernetes Deployment whose pods carry a matching toleration, ensuring they are scheduled exclusively onto the dedicated DB nodes. wdqs-proxy, running on regular cluster nodes, routes queries to the qlever-server pods across the tainted nodes. A Kafka Update Agent (streaming-consumer-updater) will be deployed on k8s alongside the qlever-server pod. Each streaming-consumer-updater will consume (as its own consumer group) real-time triple update statements, propagated from wikidata.org, and update each qlever-server independently with SPARQL UPDATE semantics.
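The taint and toleration mechanism described above can be modelled in a few lines. This sketch mirrors the Kubernetes matching rules for the `NoSchedule` effect only. The taint key and value (`dedicated=wdqs-db`) are hypothetical, and the code is a model of the scheduler predicate, not a call into any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """True when the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with Exists matches every taint.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(node_taints: list, pod_tolerations: list) -> bool:
    """A pod may land on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint.effect == "NoSchedule"
    )
```

Note that a toleration only permits scheduling onto a tainted node. Keeping the database pods off regular nodes typically also requires a node selector or node affinity rule, which this sketch does not model.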


Figure 6. WDQS v2 database nodes are managed as k8s kubelets (tainted).

Key sub-components of database-nodes are:

Component Description Notes
RDF database
A SPARQL 1.1 compatible RDF database. This is the core storage and query engine of the database node, equivalent to what is currently deployed on the WDQS fleet, minus capabilities refactored into the proxy middleware service.

Kafka Update Agent
An agent that consumes triple updates from Kafka and updates the index in real time, ensuring the RDF database reflects the latest state of the data.
  • We will re-use the current WDQS rdf-streaming-updater-producer and rdf-streaming-updater-consumer.

Prometheus Monitoring
A Prometheus monitoring agent that supports the lifecycle and health of the database node, providing observability into the system's operation.
  • Observability Platform integration.
  • Reports index updater agent metrics for each node.
  • This supplements metrics with host-specific reporting (probes), including memory, network, I/O, and CPU stats. Mainly managed by SRE.

Logstash
Delivers host logs to Logstash.
  • Observability Platform integration.
  • Reports index updater logs.
  • OS and application logs, at a tunable level of severity.

Table 4: WDQS database node components

Data Model

Key entities and their relationships

Data is stored in RDF format and the read/write access patterns follow SPARQL 1.1 protocol standards.

API/Interface Design

If applicable: endpoints, function signatures, events

The service will expose these public HTTP APIs.

/sparql
  Verb: POST
  Description: Proxies (pass through) to a canonical SPARQL 1.1 endpoint for read and write (UPDATE) operations
  Response status code:
  • 200 on success
  • 4xx on client errors
  • 5xx on server errors and timeout
  Response example:
  • Response schema is not part of the SPARQL protocol spec

/federation/allowlist
  Verb: GET
  Description: Retrieves the endpoint federation allow list configuration for WDQS.
  Response status code:
  • 200 on success
  • 4xx on client errors
  • 5xx on server errors and timeout
  Response example:
    {
      "commit_id": "a1b2c3d4e5f67890abcdef1234567890abcdef12",
      "last_updated": "2026-03-18T10:15:30Z",
      "allow_list": [
        "https://query.wikidata.org/sparql",
        "https://dbpedia.org/sparql",
        "https://example.org/sparql"
      ]
    }

/prefixes
  Verb: GET
  Description: The predefined set of namespace abbreviations (prefixes) the service expects or recognizes when parsing queries or RDF data.
  Needs planning

Table 5: WDQS Public APIs

The service will produce the following events to EventPlatform

Event type Schema EventGate endpoint Description Needs data lake ingestion
WDQS query log An updated version of the current sparql/query schema.
  • Analytics-external
  • Produce via mediawiki-event-utilities
Yes

Table 6: EventPlatform event schemas

Data Infrastructure and Lifecycle

This component is not critical for the migration, but it does impact the sustainability of the platform long term. This section is a work in progress.

Currently we treat database indexing as a one-off event. Data is indexed when a node is imaged, or when it needs hard reconciliation. Index updates happen in real time via a streaming pipeline that consumes page change notifications from the mediawiki.page_change.v1 event stream. The generated streams of RDF mutations are then consumed by agents local to the RDF database, which update the DB using SPARQL UPDATE semantics. While we don’t plan to change this architecture during the migration process, moving to a new RDF database allows for significantly faster ingestion times (from days/weeks to a few hours) and offers a possibility to streamline batch data infrastructure on top of modern Data Platform capabilities.


Figure 7. Data preparation and indexing requirements

We would like WDQS to follow lambda architecture patterns we use for similar data movements, where datasets are updated in real-time and reconciliation happens at fixed points in time by reloading the full dataset.

  • Q&A: Dumps has a similar need for orchestrating data movements via Airflow

T405360 Implement an Airflow operator for moving data from point A to B

https://wikimedia.slack.com/archives/C055QGPTC69/p1769419827226519

  • Q&A: some aspects of indexing and its lifecycle might be vendor dependent and need to be ironed out.

We envision a pipeline where:

  1. The entity dumps are prepared for ingestion (eg. skolemization) and the main dump is split into main and scholarly graphs.

    1. Q&A: we might not need the full munging process

  2. A Data Quality gating step should inform whether the datasets are ready for publication.

  3. Main and scholarly triple datasets are transferred from HDFS to WDQS. There should be no manual SRE intervention required for this step.

  4. An automated process is responsible for depooling, re-indexing, backfilling from Kafka and re-pooling the WDQS node when ready. A Data Quality gating step should inform whether the datasets are ready for publication.
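The last step of the pipeline above (depool, re-index, backfill, gate, re-pool) can be sketched as an ordered sequence guarded by the data quality check. The function and the `actions` mapping are hypothetical placeholders for whatever tooling ends up performing each step. The only point illustrated is the ordering and the rule that a node is never re-pooled when the gate fails.

```python
from typing import Callable


def refresh_node(
    node: str,
    actions: dict[str, Callable[[str], None]],
    quality_gate: Callable[[str], bool],
) -> bool:
    """Run the refresh sequence for one node; return True if it was re-pooled."""
    # Take the node out of rotation, rebuild the index from the dump,
    # then catch up with the updates that arrived in the meantime.
    for step in ("depool", "reindex", "backfill"):
        actions[step](node)

    if not quality_gate(node):
        # Leave the node depooled so that it cannot serve stale data.
        return False

    actions["repool"](node)
    return True
```

Keeping the gate as a separate callable makes it possible to evolve the quality checks (for example, consumer lag or triple counts) without touching the orchestration logic.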

Reconciliation should be documented and essentially left unchanged.

Alternatives Considered

What other approaches did you evaluate? Why did you choose this one?

Alternative 1: Address Blazegraph bugs and performance issues; effectively become its maintainers.
  Pros:
  • WDQS has continuity without disrupting the user-facing API and the internal deployment model.
  Cons:
  • We effectively would become a database vendor.
  • If performance issues are not addressed in a reasonable timeframe (12–18 months), WDQS might effectively become unsuitable as a secondary store.
  Why Not:
  • The decision of migrating away from Blazegraph was reached years ago. There is no strong argument to question it.

Alternative 2: After identifying query classes and subsets needing rewriting, adopt the current WDQS approach and implement business logic, ad-hoc extensions (wikibase:label) and infrastructure into a new RDF database.
  Pros:
  • All code goes into a single system.
  • We could reuse the same deployment model of Blazegraph.
  Cons:
  • All concerns documented in Background would persist.
  • Tight coupling to a database vendor.
  Why Not:
  • All concerns documented in Background.

Table 7: Alternative approaches to the proposed design.

Security & Privacy

Authentication, authorization, data handling, compliance considerations

Authentication and authorization are delegated to the API Gateway (REST Gateway), which will act as a frontend to all incoming requests.

No new PII will be collected.

Testing Plan

How will you verify this works? Unit tests? Integration tests?

Internal testing of this architecture will begin in FY26Q4, leveraging internal infrastructure and traffic replaying capabilities we started developing and deployed in FY26Q3.

The service will be deployed concurrently with WDQS/Blazegraph and will collect telemetry data (query logs, metrics) in dedicated event streams and metric buckets. To support this, we will need a parallel EventPlatform stream to collect query logs. Similarly, we will need new Prometheus metrics and system logs to be available in the observability platform. Both automated alerting and manual reviews, as part of current operational practices, will be put in place to monitor system health.

Prior to rolling out (FY26Q4), we will perform a qualitative analysis of the result sets generated by queries run on both the old and new RDF databases, and we will quantify and qualify any drift between them. A report on data quality will be created.
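One way to quantify drift between two backends is to compare result sets as multisets of rows, since SPARQL results are unordered unless the query uses ORDER BY. The sketch below assumes each row is a mapping from variable name to a string value; the function names are illustrative. It does not handle blank node labels or value formatting differences, which would likely need normalization first.

```python
from collections import Counter


def row_key(binding: dict) -> tuple:
    """Order-independent, hashable representation of one result row."""
    return tuple(sorted(binding.items()))


def drift(old_rows: list, new_rows: list) -> dict:
    """Count rows missing from, or extra in, the new backend's results."""
    old = Counter(map(row_key, old_rows))
    new = Counter(map(row_key, new_rows))
    missing = old - new  # rows only the old backend returned
    extra = new - old    # rows only the new backend returned
    return {
        "missing": sum(missing.values()),
        "extra": sum(extra.values()),
        "equal": not missing and not extra,
    }
```

Using a multiset rather than a set keeps duplicate rows significant, which matters for queries without DISTINCT.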

Rollout Plan

How will you ship this? Phased rollout? Feature flags?

We will ship this as a phased rollout in accordance with migration planning. Feature development will be prioritized to support product needs at migration time. We will progressively migrate traffic from Blazegraph to the new WDQS service throughout FY27Q1 and Q2. This rollout will require some initial elastic capacity (5-6 extra database nodes in the external fleet and 1-2 extra nodes in the internal fleet), but as traffic from Blazegraph decreases, these nodes will be repurposed.

Risks & Open Questions

What could go wrong? What's still uncertain?

# | Risk | Likelihood | Impact | Status
R1 | Query rewrite coverage is incomplete at cutover | Medium | High | Mitigating
R2 | Proxy middleware adds latency that breaches SLO p95 target | Low / Medium | High | Under evaluation
R3 | Federation allow-list reload causes service disruption | Medium | Medium | Open
R4 | SLO gap between Blazegraph baseline and new service | Medium | Low | Planned
R5 | Kafka consumer lag causes index staleness | Medium | High | Mitigating
R6 | Vendor-specific indexing lifecycle assumptions | High | Medium | Open
R7 | Concurrent deployment telemetry creates ambiguous signal | Low | Medium | Accepted
R8 | API Gateway (REST Gateway) becomes a single point of failure | Low | Critical | Delegated
R9 | Performance degradation when streaming result sets vs materializing view in the RDF database before returning a response to clients | Medium | Low | Planned

Table 8: Roll out. Risks and open questions.

R1 Query rewrite coverage. The proxy is responsible for translating Blazegraph-specific SPARQL extensions (notably wikibase:label) into standard SPARQL 1.1. Research to date suggests this is tractable, but the edge case surface is large. If production traffic contains extension patterns not covered by the rewrite rules at cutover time, those queries will either fail or return incorrect results. Mitigation: traffic replaying infrastructure (developed in FY26 Q3) will be used to validate coverage before the migration launch in FY27Q1. We will also provide a period where the new endpoints are open for testing and encourage people to find the failure points.

R2 Latency budget. Introducing a JVM-based proxy service in the request path adds overhead. The service SLO requires p95 latency < 60 seconds for the external endpoint, but interactive queries have a tighter operational expectation. Quarkus is a reasonable choice for low-overhead Java services, but this has not yet been validated under realistic load. Mitigation: latency profiling against replayed traffic.

R3 Federation allow-list reload. The allow-list must be refreshable without dropping service. The mechanism for this is explicitly open: the EventGate pattern (periodic git repo bundle updates via a MediaWiki extension) is a candidate reference, but no implementation decision has been made. Until this is resolved, there is no clear operational story for updating the allow-list in production. This needs to be scoped and owned by the Wikidata Platform team, and will require alignment with SRE on capabilities.
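Independently of how the new list is delivered, the in-process reload can follow the "pre-validate, then swap atomically" pattern listed among the mitigations. The sketch below is one possible shape, not a decided implementation: `AllowListHolder` is a hypothetical name, and the validation rules (non-empty, HTTPS-only) are assumptions chosen for illustration.

```python
import threading
from typing import Iterable


class AllowListHolder:
    """Holds an immutable (commit_id, endpoints) snapshot that readers share."""

    def __init__(self, commit_id: str, endpoints: Iterable[str]):
        self._snapshot = (commit_id, frozenset(endpoints))
        self._reload_lock = threading.Lock()

    def snapshot(self) -> tuple:
        # Readers get a consistent, immutable view without taking a lock.
        return self._snapshot

    def reload(self, commit_id: str, endpoints: Iterable[str]) -> bool:
        """Validate the candidate list first; swap it in only if it is valid."""
        candidate = frozenset(endpoints)
        # Assumed validation rules: reject empty lists and non-HTTPS endpoints.
        if not candidate or not all(e.startswith("https://") for e in candidate):
            return False  # keep serving the previous snapshot
        with self._reload_lock:  # serialize concurrent reloads
            self._snapshot = (commit_id, candidate)
        return True
```

Because in-flight requests keep the snapshot they already read, a reload never changes the policy halfway through a request, and a rejected reload leaves the previous list in place. Exposing the snapshot's commit id is also what the per-pod health status reporting would need to track configuration drift.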

R4 SLO continuity during migration. The external WDQS endpoint has published SLOs. The new service will be deployed concurrently with Blazegraph and will collect telemetry in dedicated streams, but during the transition period there is a risk that the new service does not yet meet SLO targets, in particular if the Blazegraph fleet begins to be reduced before the new service has demonstrated parity.

R5 Kafka backfill and index staleness. Re-indexing a node from a dump and then backfilling from Kafka introduces a window during which the node's index is behind the live event stream. The automated depooling/re-pooling process needs to account for this lag and gate on a data quality check before re-pooling. The Airflow operator work (T405360) is a dependency here. If the backfill window is longer than expected or the Kafka offset tracking is incorrect, nodes could be re-pooled with stale data, causing query results to diverge between eqiad and codfw. In practice this is unlikely to be an issue, since QLever and Virtuoso are able to re-index the main and scholarly graphs in less than 5 hours.

R6 Vendor-specific indexing lifecycle. The design explicitly assumes SPARQL 1.1 protocol compatibility across RDF database vendors. However, the data lifecycle section acknowledges that [...] some aspects of indexing and its lifecycle might be vendor dependent [...]. QLever, for instance, is known for requiring periodic re-indexing to avoid performance issues.

R7 Concurrent deployment telemetry creates ambiguous signal. Running the new service in parallel with Blazegraph may produce conflicting or duplicate telemetry that makes it harder to interpret metrics during the transition period.

R8 API Gateway (REST Gateway) as a critical dependency. Authentication, rate limiting, and routing are fully delegated to the API Gateway (REST Gateway). This is the right architectural choice, but it means WDQS availability is now gated on the Gateway's availability.

R9 Streaming result sets vs materializing view in the RDF database before returning a response to clients. Blazegraph may start streaming large result sets back to the client without fully materializing them first. As a result, errors might appear in the response after the HTTP 200 status line has already been sent. This streaming approach is also used in other RDF databases we evaluated, and its implications need to be tested at scale.

Success Metrics

How will you measure if this is working?

The architecture is considered validated when the proxy middleware decouples the API endpoint lifecycle from the database deployment, query rewriting covers the full observed extension corpus with no correctness regressions, and the federation allow-list can be updated without service interruption. Observability is considered sufficient when latency per request and query log completeness are reported natively by the service, with no dependency on bespoke exporters. The design is database-agnostic if an RDF database can be swapped without changes to the service layer or public API contract.

Functionally, the architecture is considered successful if:

1. We have solved the problems currently facing users of the WDQS platform.

  • measurement: performance improvements in throughput, query latency, and query success rate

2. We are better positioned to avoid similar scalability issues in the future

  • measurement: a db migration is made easier by decoupling its functionality from the rest of our infra

References

  1. The architecture we’re describing is orthogonal to the choice of database system. That said, the database will still influence the user experience.
  2. https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service#Hardware
  3. Based on the availability requirements from our current SLO, with a worst case scenario production traffic load.
  4. The remediation is described in https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook/High_replication_lag_and_query_timeout#5._Mitigation_steps_if_bad_actors_have_been_identified
  5. Query duration buckets are based on the existing reporting classification. Latencies are reported only for successful queries (200 http status).
  6. Work in progress analysis from WMDE.
  7. This is largely orthogonal to the database architecture. Even with a global rate limit, different query classes can create uneven resource contention at the node level, so per-node backpressure can still be useful.
  8. The scaling model for the RDF database is vertical. However, Wikimedia SRE practices require at least three hosts per deployment for availability.
  9. The definition of “overload” will need to be refined based on the technology we will ultimately adopt.