Incidents/2025-02-19 maps
document status: draft
Summary
Incident ID | 2025-02-19 maps | Start | 2025-02-19 09:10:00 |
---|---|---|---|
Task | T386648 | End | 2025-02-19 14:25:00 |
People paged | 0 | Responder count | 1 |
Coordinators | 0 | Affected metrics/SLOs | No relevant SLOs exist |
Impact | Due to a misconfiguration in the TLS certificate of the Kartotherian service on Kubernetes, around 31000 requests for maps.wikimedia.org ended up in HTTP 50X responses. |
The Kartotherian service on Kubernetes didn't have its TLS certificate properly configured (it was missing the maps.wikimedia.org SAN), and around 31000 HTTP requests ended up as 50X errors over the course of several hours.
Timeline
SAL link: https://sal.toolforge.org/production?p=0&q=elukey&d=2025-02-19
All times in UTC.
- 09:09 Luca pools Wikikube workers back into the Kartotherian LVS service (they were depooled due to a previous outage).
- 09:09 OUTAGE BEGINS
- 14:20 After a chat with Yiannis, Luca realizes that the HTTP 50X errors are visible in the Logstash Webrequest 50X dashboard and depools the Wikikube workers.
- 14:25 OUTAGE ENDS
- 16:30 The updated TLS certificate is configured and deployed in Wikikube.
Detection
This was a very subtle case: the stream of HTTP 50X responses was tiny and not large enough to be considered an immediate impact on the service, so no alert fired. Luca and Yiannis realized the impact only later, after Yiannis suggested checking the Webrequest 50X dashboard.
Conclusions
This was the second maps outage of the week; both times Luca (SRE Infra Foundations) was shifting Kartotherian's traffic from bare metal servers to Kubernetes. In this case, the TCP connections between the CDN (ATS) and the backend hosts (Wikikube workers) were established successfully (so the backends ended up in ATS's pool of usable origins), but the TLS handshake failed because of a missing SAN in the TLS certificate presented by the Wikikube workers for Kartotherian. Requests for maps.wikimedia.org (proxied to Kartotherian on Wikikube) make the CDN (and hence ATS) use that domain as the HTTP Host header and TLS SNI, so the certificate needs an extra SAN covering it for the TLS handshake to succeed.
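As an illustration of the failure mode (not a tool we actually used during the incident), the sketch below opens a TLS connection to a backend the same way ATS does: it sends maps.wikimedia.org as the SNI and lets certificate verification fail if the presented certificate has no matching SAN. The backend hostname/port and the CA bundle path are placeholders.

```python
import socket
import ssl

# Hypothetical Wikikube worker and port; in production the exact endpoint ATS
# connects to differs, and the certificate is signed by an internal CA.
BACKEND_HOST = "kubernetes1001.example.wmnet"
BACKEND_PORT = 443
PUBLIC_DOMAIN = "maps.wikimedia.org"  # sent as TLS SNI, matched against the cert's SANs


def check_backend_cert(host: str, port: int, sni: str) -> None:
    ctx = ssl.create_default_context()
    # ctx.load_verify_locations("/path/to/internal-ca.pem")  # needed if an internal CA signs the cert

    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            # server_hostname sets the SNI and the name verified against the certificate
            with ctx.wrap_socket(sock, server_hostname=sni) as tls:
                print(f"OK: {host}:{port} presented a certificate valid for {sni}"
                      f" ({tls.version()})")
    except ssl.SSLCertVerificationError as exc:
        # This is the failure ATS hit: the certificate had no SAN covering the domain
        print(f"FAIL: certificate from {host}:{port} does not cover {sni}: {exc}")


check_backend_cert(BACKEND_HOST, BACKEND_PORT, PUBLIC_DOMAIN)
```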
What went well?
- Once the problem was spotted in Logstash, rolling back to a good state was quick, since it was just a matter of depooling the Wikikube workers via confctl.
What went poorly?
- The issue was spotted several hours after its start, mainly because its volume was not significant enough over a short time window to warrant an alert.
- The Logstash Webrequest 50X dashboard was not consulted by Luca after pooling the Wikikube workers, even though it has been known and used for a long time. Checking the impact on external users should be priority number one when doing these procedures, especially after another outage happened a couple of days earlier.
Where did we get lucky?
- No luck registered this time :)
Links to relevant documentation
The most relevant page is probably the LVS page, where all the steps to properly configure a load-balanced service are listed. TLS is not mentioned there because it is not something LVS takes care of: it load balances TCP connections (it is an L4 load balancer), so upper layers are not its concern. We have several pages about adding a service to Kubernetes, but I didn't find it mentioned (yet) that the external domain must be added as a TLS SAN when setting up a new service that backs an external domain (TLS itself is the standard almost everywhere). It is something we know needs to be done, but it is also difficult to check or alert on, because of where it is defined (the service's helmfile config in the deployment-charts repo). Not all services back an external domain, and the tie between external and internal is not straightforward to make. For new services this is easy, since the Traffic team usually reviews the config before any live traffic is sent; in this case we had a special situation (half of the service on new Kubernetes capacity that was not properly configured).
Actionables
In the long run, SLOs should cover this use case, since we'll spot these issues through alerts on error budget burn. I don't think we should spend much time trying to add more finely tuned 50X-specific alerts; we should concentrate on rolling out SLOs everywhere.
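As a toy illustration of the error-budget argument (the numbers are made up, only the ~31000 failed requests come from this incident), the snippet below shows how a 50X stream too small to trip a raw error-rate alert still burns a very visible fraction of a hypothetical 99.9% availability budget.

```python
SLO_TARGET = 0.999            # hypothetical 99.9% availability objective
WINDOW_REQUESTS = 10_000_000  # made-up total maps.wikimedia.org requests in the SLO window
FAILED_REQUESTS = 31_000      # roughly the 50X volume seen during this outage

error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS  # requests we are allowed to fail
burn = FAILED_REQUESTS / error_budget              # fraction of the budget consumed

print(f"error budget for the window: {error_budget:,.0f} requests")
print(f"budget burned by this incident: {burn:.0%}")
# With these (made-up) numbers the single incident burns 310% of the budget,
# which a burn-rate alert would flag even though the absolute 50X rate was small.
```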
Adding an extra check that answers the question "Does this service's TLS cert need extra config, like a SAN?" would be very difficult in the current state of our Kubernetes infrastructure, since the configs are scattered across multiple disjoint places. I am open to suggestions, but I don't see how this could be done at the moment.
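For the record, here is a rough sketch of what such a check could look like, under the big assumption that we can build a mapping from external domains to the Kubernetes backends serving them (which is exactly the information that is scattered today); all hostnames and paths below are placeholders.

```python
import socket
import ssl

# Hypothetical mapping: external domain -> Kubernetes backends to probe.
# Today this tie lives implicitly across deployment-charts and the CDN config,
# which is why the check is hard to build in practice.
DOMAIN_TO_BACKENDS = {
    "maps.wikimedia.org": [("kubernetes1001.example.wmnet", 443)],
}


def cert_covers_domain(host: str, port: int, domain: str) -> bool:
    """True if the backend presents a certificate that verifies for `domain` as SNI."""
    ctx = ssl.create_default_context()
    # ctx.load_verify_locations("/path/to/internal-ca.pem")  # internal CA, if needed
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=domain):
                return True
    except (ssl.SSLError, OSError):
        return False


for domain, backends in DOMAIN_TO_BACKENDS.items():
    for host, port in backends:
        status = "ok" if cert_covers_domain(host, port, domain) else "missing SAN or TLS failure"
        print(f"{domain} via {host}:{port}: {status}")
```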
Spreading knowledge about this issue to the broader SRE team will surely help prevent it in the future.
Scorecard
| Question | Answer (yes/no) | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | no idea | |
| Were the people who responded prepared enough to respond effectively? | yes | |
| Were fewer than five people paged? | yes | |
| Were pages routed to the correct sub-team(s)? | yes (no pages) | |
| Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes (no pages) | |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | |
| Was a public wikimediastatus.net entry created? | no | |
| Is there a phabricator task for the incident? | yes | |
| Are the documented action items assigned? | no | |
| Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| Did existing monitoring notify the initial responders? | no | |
| Were the engineering tools that were to be used during the incident available and in service? | yes | |
| Were the steps taken to mitigate guided by an existing runbook? | no | |
| Total score (count of all “yes” answers above) | 9 | |