Incidents/2026-02-23 ml-serve
document status: in-review
Summary
| Incident ID | 2026-02-23 ml-serve | Start | 2026-02-23 17:20:00 |
|---|---|---|---|
| Task | T418722 | End | 2026-02-24 12:50:00 |
| People paged | | Responder count | Dawid Pogorzelski, Luca Toscano, Aiko Chou |
| Coordinators | Dawid Pogorzelski, Luca Toscano, Aiko Chou | | |
| Affected metrics/SLOs | No relevant SLOs exist. Increased rate of 5XX response codes from a subset of ML services. | | |
| Impact | A subset of ML services were unreachable for approximately 24h. The issue first affected services in codfw and, the following day, eqiad. | | |
Two production K8s Machine Learning clusters were updated to version 1.31, which included upgrading Istio from 1.15 to 1.24. Istio is used in those clusters both as an Ingress Gateway and as a mesh sidecar; the latter is essentially a transparent proxy that all deployed model servers use for their ingress and egress traffic. The new Istio version enforces a stricter validation policy for the "hosts" attribute in VirtualService definitions (essentially the proxy/route rules for the transparent proxy): if two or more VirtualServices have overlapping values, only one VirtualService is considered and the others are discarded. The model servers running on those clusters call services like the MediaWiki API in two ways:
- Explicit reference to the MW API VIP in the code, including the correct HTTP Host header: for example, mw-api-ro.discovery.wmnet with Host header en.wikipedia.org.
- Implicit reference to the MW API using high-level domains like en.wikipedia.org, without mentioning the MW API VIP at all. This is the "transparent proxy" feature mentioned above.
The ML team runs some model servers with the first configuration and others with the second. Sometimes one is more convenient than the other, so the team never decided which one was canonical.
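The overlap described above can be sketched as two VirtualService manifests claiming the same host. This is an illustrative example only (the resource names and routes are hypothetical, not the actual deployment-charts definitions):

```yaml
# Two VirtualServices claiming the same host. Under Istio 1.24's stricter
# validation, only one of them is kept; the other is silently discarded,
# breaking routing for the model server that relied on it.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: revertrisk-egress          # hypothetical name
spec:
  hosts:
    - en.wikipedia.org             # overlaps with the host below
  http:
    - route:
        - destination:
            host: mw-api-ro.discovery.wmnet
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: article-models-egress      # hypothetical name
spec:
  hosts:
    - en.wikipedia.org             # same host: overlapping definition
  http:
    - route:
        - destination:
            host: mw-api-ro.discovery.wmnet
```

Services using the explicit-VIP configuration were unaffected because they never relied on these transparent-proxy route rules.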
The affected model servers in this outage were the ones using the transparent proxy configuration:
- article-models
- articletopic-outlink
- revertrisk
- revision-models
- revscoring-articlequality
- revscoring-articletopic
- revscoring-draftquality
- revscoring-drafttopic
- revscoring-editquality-damaging
- revscoring-editquality-goodfaith
Timeline
All times in UTC.
- 2026-02-23 17:21 the ml-serve-codfw upgrade is considered complete
- 2026-02-23 17:30 start of the outage: the first 5XX response codes appear in the metrics and alerts start to fire.
- 2026-02-24 08:00 the update of ml-serve-eqiad starts
- 2026-02-24 10:15 the codfw issues are reported in the ML Slack channel; the earlier alerts are subsequently spotted.
- 2026-02-24 11:00 the update of ml-serve-eqiad completes and eqiad services become affected
- 2026-02-24 11:30 investigation into the issues starts
- 2026-02-24 16:00 end of the outage: a hot fix is deployed that resolves most of the HTTP 500s.
- 2026-02-27 11:30 the permanent fix is filed (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/12452890) and deployed to ml-staging-codfw to verify that everything works correctly.
- 2026-03-03 13:17 the permanent fix is deployed on all affected clusters
Detection
The issue was first reported in the talk-to-machine-learning Slack channel by consumers/users of the affected model servers.
Alerts had been firing in the Machine Learning team's IRC channel right after the codfw upgrade, but the team didn't notice them.
The alert volume was moderate compared to other WMF services, but the outage affected half of the running model servers. The alerts pointed to HTTP 5XX errors for each affected service.
Conclusions
What went well?
- The issue was resolved in a relatively short amount of time, considering when the first user report occurred.
What went poorly?
- The issue could have been spotted on the staging cluster first.
- Ad-hoc testing within the team after each cluster upgrade could have revealed the issue at an earlier stage (the upgrades were communicated and testing was requested).
- The existence of httpbb tests was not known to everyone involved.
- Nobody in the Machine Learning team spotted the alerts fired right after the codfw upgrade.
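Regarding the httpbb tests mentioned above: httpbb checks are driven by YAML test files, so a post-upgrade check for an affected model server could look roughly like the sketch below. The host and path are hypothetical, and the exact schema may differ from current httpbb versions:

```yaml
# Hypothetical httpbb test file: probe a model server endpoint after a
# cluster upgrade and assert it does not return a 5XX error.
https://inference.svc.codfw.wmnet:
- path: /v1/models/revertrisk    # hypothetical path
  assert_status: 200
```

Running such a file after each cluster upgrade would have surfaced the 5XX errors before users reported them.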
Where did we get lucky?
- We had the right WMF engineer online to help find the right Istio hot patch to restore the broken functionality.
Links to relevant documentation
- Relevant changes that led to issue resolution:
Actionables
- Consider delivering alerts to the relevant Slack channels and/or patrolling IRC or alerts.wikimedia.org.
- Possibly implement more SLOs, and fix the current ones, since the Istio metrics have changed:
- https://slo.wikimedia.org/objectives?expr={__name__=%22revertrisk-la-availability%22,%20service=%22revertrisk-la%22,%20team=%22ml%22}&grouping={}&from=now-4w&to=now
- https://slo.wikimedia.org/objectives?expr={__name__=%22revertrisk-la-latency%22,%20service=%22revertrisk-la%22,%20team=%22ml%22}&grouping={}&from=now-4w&to=now
- (There are two others about tone check, but they look fine as far as we can see.)
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | yes | N/A |
| | Were pages routed to the correct sub-team(s)? | yes | N/A |
| | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | yes | N/A |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | N/A |
| | Was a public wikimediastatus.net entry created? | no | N/A |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | no | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all "yes" answers above) | 10 | |