Incidents/2026-05-13 wdqs
document status: final
Summary
| Incident ID | 2026-05-13 wdqs | Start | 2026-05-07 15:10:00 |
|---|---|---|---|
| Task | T425758 | End | 2026-05-11 13:50:00 |
| People paged | 0 | Responder count | Brian King, Ryan Kemper, Guillaume Lederrey, Gabriele Modena, Ben Tullis |
| Coordinators | Gabriele Modena | Affected metrics/SLOs | Both Uptime (availability) percentage as well as Excessive lag percentage |
| Impact | We served stale data for >20 hours from 6 nodes, and at peak 50% of WDQS external endpoint requests were timing out for users. | | |
…
Aggressive scrapers started hitting WDQS on 2026-05-07, reducing service availability and impacting both SLO/WDQS metrics: Uptime (availability) percentage and Excessive lag percentage.
Over the whole period we identified two issues at play:
- Blazegraph was under load and started to time out for a large population of users (>50% at peak).
- The streaming-updater-consumer service (responsible for real-time index updates) was throttled by the overloaded Blazegraph, so index updates were rejected (HTTP 429) and lag increased. This, in turn, triggered max lag protection in Wikibase, which throttled wikidata.org requests (edits).
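The second issue can be illustrated with a minimal sketch of a per-client throttle that exempts a trusted internal client. This is not the WDQS throttling implementation: the class name, the client identifiers, and the fixed-window budget are all hypothetical, chosen only to show how an exemption could keep index updates flowing while other clients receive 429 responses.

```python
from collections import defaultdict


class Throttle:
    """Hypothetical fixed-window request throttle with an exemption list."""

    def __init__(self, limit_per_window: int, exempt: frozenset = frozenset()):
        self.limit = limit_per_window
        self.exempt = exempt
        self.counts = defaultdict(int)

    def status_for(self, client_id: str) -> int:
        """Return the HTTP status the request would receive: 200 or 429."""
        if client_id in self.exempt:
            return 200  # exempt clients are never counted or rejected
        self.counts[client_id] += 1
        return 200 if self.counts[client_id] <= self.limit else 429

    def reset_window(self) -> None:
        """Start a new accounting window (driven by a timer in practice)."""
        self.counts.clear()
```

Without the exemption, an internal updater that shares a budget with external traffic is rejected as soon as the budget is exhausted, which matches the lag pattern described above.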
The incident began in the afternoon (UTC) of 2026-05-07. It was temporarily mitigated, but alerts started firing again overnight.
Upon reviewing the alerts on Friday (2026-05-08), we diagnosed that the entirety of eqiad was lagging and proceeded to depool the deployment to allow Wikidata changes (WDQS index updates) to propagate. As lag started increasing again, we applied rate limits to actors that were aggressively querying the service and causing timeouts.
Despite the aggressive global edge rate limiting applied on Friday, the outage persisted throughout the weekend. The initial rate-limiting rules were extrapolated from a Turnilo data cube based on a 1-in-128 sample of all incoming web requests across Wikimedia projects. Deeper analysis of WDQS logs (both offline on HDFS and in real time on the nodes themselves) on Monday (2026-05-11) identified a scraper that had not previously been captured by the webrequest sample (Turnilo). Once a requestctl rule was applied to the scraper signatures, query timeout rates returned to baseline.
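To illustrate how a 1-in-128 sample can miss or under-rank an actor, the sketch below uses a simple binomial model: it assumes each request is sampled independently with probability 1/128. The function names and the example volumes are hypothetical and are not taken from the incident data.

```python
from math import comb

SAMPLE_RATE = 1 / 128  # 1-in-128 sampling, as described above


def expected_sampled(requests: int, rate: float = SAMPLE_RATE) -> float:
    """Expected number of an actor's requests that land in the sample."""
    return requests * rate


def prob_fewer_than(requests: int, k: int, rate: float = SAMPLE_RATE) -> float:
    """P(actor appears fewer than k times in the sample), binomial model."""
    return sum(
        comb(requests, i) * rate**i * (1 - rate) ** (requests - i)
        for i in range(k)
    )


def extrapolate(sampled_rows: int, rate: float = SAMPLE_RATE) -> float:
    """Estimate the true request volume from a sampled row count."""
    return sampled_rows / rate
```

Under this model, an actor that sends 128 requests in a window has roughly a 37% chance of not appearing in the sample at all (`prob_fewer_than(128, 1)`). Traffic spread across many low-volume signatures can therefore rank low in a sampled view while still being visible in unsampled service logs.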
Timeline
All times in UTC.
- 2026-05-07 15:10 OUTAGE BEGINS
- 2026-05-07 15:19 Gabriele Modena starts a thread on #talk-to-wikidata-platform (internal comms)
- 2026-05-07 15:38 Brian King responds and, based on traffic analysis, manually applies rate limits to aggressive actors. The situation seemed contained, but alerts started firing again overnight.
- 2026-05-08 08:34 Gabriele Modena starts a thread on #talk-to-wikidata-platform (internal comms) and on #data-platform-sre for coordination (https://wikimedia.slack.com/archives/C055QGPTC69/p1778229258561559). Ben Tullis responds.
- 2026-05-08 09:04 We diagnose that the whole of eqiad is lagged and follow the runbook at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbooks/ElevatedMaxLagWDQS.
- 2026-05-08 09:40 Ben Tullis depooled the whole DC. Streaming-updater-consumer starts to backfill and catch up on depooled nodes. https://sal.toolforge.org/log/kA_1Bp4BffdvpiTruey1
  Figure: wdqs depooled from eqiad. Kafka consumer lag decreasing.
- 2026-05-08 09:52 While cross-referencing WDQS bespoke throttling logic and logs, we identify that the streaming-updater-consumer service running on localhost is being throttled
- 2026-05-08 10:17 Consumer lag is climbing on codfw nodes
  Figure: codfw wdqs Kafka consumer lag increasing.
- 2026-05-08 10:50 wdqs is repooled in eqiad https://sal.toolforge.org/log/rKQ2B54B1kByGTxAC5bL
- 2026-05-08 18:32 Ryan Kemper applies rate limits on actor signatures (based on Turnilo). This mitigates the issue only partially, and it persists throughout the weekend.
- 2026-05-11 09:11 webrequest_sampled (Turnilo) does not report traffic accurately enough. Gabriele Modena manually inspects service logs (on the nodes) and escalates to #mediawiki_security
- 2026-05-11 11:00 Gabriele Modena manually depools wdqs1011, wdqs1012, wdqs1014, wdqs1015, wdqs1016. The nodes had lag > 20h
- 2026-05-11 11:42 Rate limits targeting the scraper signatures are applied (requestctl rule).
- 2026-05-11 13:50 OUTAGE ENDS
- 2026-05-11 15:30 Post-outage cleanup finished. All consumers have caught up
- 2026-05-11 16:40 Ryan Kemper lifts rate limit rules that have accidentally impacted legitimate traffic.
Detection
The issue was detected via alerts firing.
- RdfStreamingUpdaterHighConsumerUpdateLag
- ElevatedMaxLagWDQS
- BlazegraphFailedServerRatioIncrease
The alerting was accurate. The alerts indicated the related runbooks in their message body.
Conclusions
We have learned that we cannot rely only on Turnilo (webrequest sample) to identify actors that need rate limits. We have also learned that streaming-updater-consumer should not be throttled by Blazegraph filter logic.
Communication with the community went well generally, but could use a better balance of speed and accuracy. We've aligned across teams on coordinating event and resolution status information in an incidents Slack channel to ensure we engage the community quickly and with clarity.
What went well?
- Alerts fired appropriately
- Incident response was prompt and responders were highly engaged.
What went poorly?
- We learned that some runbooks need updating (ElevatedMaxLag)
Where did we get lucky?
- …
Links to relevant documentation
- SLO/WDQS
- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook/High_replication_lag_and_query_timeout
Actionables
- https://phabricator.wikimedia.org/T426067: We updated the runbooks with additional information on troubleshooting traffic directly from logs (in real-time).
- https://phabricator.wikimedia.org/T425770: As a follow-up cleanup task, we developed a workaround for WDQS to not throttle streaming-updater-consumer requests. This will be deployed and tested in Wikidata Platform’s current sprint.
- https://phabricator.wikimedia.org/T425989: Investigate and document options to improve real-time traffic analysis for the WDQS telemetry.
- Ryan Kemper cleaned up previously defined requestctl rules that could have impacted legitimate traffic.
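As an illustration of the log-based traffic analysis referenced in the actionables above, the following minimal sketch counts requests per actor from unsampled log lines. The tab-separated line format (client, user agent, status) and the function name are assumptions made for the example; they do not describe the actual WDQS log format or the tooling documented in the runbooks.

```python
from collections import Counter
from typing import Iterable


def top_actors(lines: Iterable[str], n: int = 10):
    """Count requests per (client, user agent) from unsampled log lines.

    Assumes a hypothetical tab-separated format: client, user agent, status.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines rather than guessing
        client, user_agent, _status = fields[:3]
        counts[(client, user_agent)] += 1
    return counts.most_common(n)
```

Because every line is counted, a ranking like this does not depend on a sampling rate, at the cost of having to read the full log volume.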

