Incidents/2024-06-21 - 2024-06-24 Outage Incidents - Dumps Generation Issues
document status: draft
Summary
| Incident ID | 2024-06-21 - 2024-06-24 Outage Incidents - Dumps Generation Issues | Start | 2024-06-21 18:51 UTC |
|---|---|---|---|
| Task | T368098 | End | 2024-06-24 20:28 UTC |
| People paged | 3 | Responder count | 10 |
| Coordinators | sukhe | Affected metrics/SLOs | |
| Impact | Wikis inaccessible | | |
The Wikimedia full dumps put pressure on the databases, and this pressure resulted in errors whenever MediaWiki attempted to access the database. The speed and volume of those errors produced a flood of UDP logs that overwhelmed the network, causing an outage for the wikis.
Timeline
Thursday June 20th
wikimedia-operations: 18:15 UTC <+icinga-wm_> PROBLEM - MariaDB Replica Lag: s1 page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
mediawiki_security: 18:50 UTC <Amir1> db1206 starts lagging a lot when dumps start, something in the query is either wrong or it bombard the replica. Either way, it needs to be investigated.
mediawiki_security: 19:01 UTC <mutante> https://phabricator.wikimedia.org/T368098 < created for this
Friday June 21st
phabricator: 14:10 UTC: <xcollazo> I suspect significantly more pressure is being generated because we deleted the 20240520 dumps in T365501 due to T365155. Now that the 20240620 run of the dumps is going, there is no content for the dumps process to prefetch from. This should only happen on this run of the dumps, and next run on 20240720 will have prefetch available.
wikimedia-operations: 22:38 UTC <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
phabricator: 23:16 UTC <JJMC89> This is impacting the wikis - bots are receiving maxlag errors.
mediawiki_security: 23:18 UTC <denisse> Regarding db dumps task, it seems to be user facing: https://phabricator.wikimedia.org/T368098#9914664
wikimedia-operations: 23:43 UTC <+logmsgbot> !log brett@puppetmaster1001 dbctl commit (dc=all): 'set db1206 s1 weight to 1 - T368098', diff saved to https://phabricator.wikimedia.org/P65334 and previous config saved to /var/cache/conftool/dbconfig/20240621-234328-brett.json
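For context on the maxlag errors JJMC89 reports above: well-behaved bots send a `maxlag` parameter with their API requests, and the API refuses the request (returning a `maxlag` error and a Retry-After header) while replica lag exceeds that threshold, which is how lag on db1206 surfaced as bot-visible errors. The sketch below is a minimal illustration of such a client, assuming only the documented behaviour of the action API's maxlag parameter; it is not the code of any particular bot.

```python
"""Minimal sketch of a maxlag-aware API client (illustrative only).
Assumes the documented behaviour of the MediaWiki action API: when replica
lag exceeds the requested maxlag threshold, the API returns an error with
code "maxlag" and a Retry-After header instead of serving the request."""
import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # example endpoint


def api_get(params, maxlag=5, max_retries=5):
    params = dict(params, format="json", maxlag=maxlag)
    for _ in range(max_retries):
        resp = requests.get(API_URL, params=params, timeout=30)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            # Replicas are lagging; back off for as long as the server asks.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        return data
    raise RuntimeError("gave up: replicas stayed lagged for every attempt")


# Usage: a read that politely backs off while replicas such as db1206 lag.
print(api_get({"action": "query", "meta": "siteinfo"}))
```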
Saturday June 22nd
wikimedia-operations: 22:19 UTC <+jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 9.622s -

mediawiki_security: 22:22 UTC <Emperor> s4 looks sad though
mediawiki_security: 22:23 UTC <Emperor> ...and it's recovering
mediawiki_security: 22:26 UTC <Emperor> I can't seem to log in to superset though
mediawiki_security: 22:27 UTC <zabe> yeah, superset, phab and logstash are also gone
mediawiki_security: 22:36 UTC <swfrench-wmf> yeah, db ops/s across ~ all sections is still way down in eqiad, while latency is up
mediawiki_security: 22:39 UTC <Amir1> circuit breaking kicked in this time
mediawiki_security: 22:40 UTC <cdanis> also we need to post to the status page if we're still down
mediawiki_security: 22:40 UTC <cdanis> successful edits is at 0 so I'm guessing that's so?
mediawiki_security: 22:40 UTC <Emperor> I agree
mediawiki_security: 22:41 UTC <Emperor> (and the RED dashboard still has 5xxs elevated)
mediawiki_security: 22:41 UTC <TheresNoTime> (still down, someone should update the status yeah)
mediawiki_security: 22:42 UTC <cdanis> incident created
mediawiki_security: 22:42 UTC <Amir1> databases now are happy by they have many many connections without any actual queries
mediawiki_security: 22:43 UTC <taavi> cr1-eqiad<->asw2-c-eqiad link seems to be saturated
mediawiki_security: 22:43 UTC <swfrench-wmf> is it possible this is a network event in eqiad?
mediawiki_security: 22:45 UTC <Amir1> https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&viewPanel=17
mediawiki_security: 22:45 UTC <Amir1> we had a tiny ddos

mediawiki_security: 22:46 UTC <cdanis> we absolutely saturated some of the cross-router links in eqiad too
mediawiki_security: 22:46 UTC <taavi> seeing spikes in the access interfaces for many k8s nodes on asw2-c-eqiad?
mediawiki_security: 22:46 UTC <cdanis> which I think is still going on?
mediawiki_security: 22:46 UTC <taavi> do we have an incident doc yet? who's IC?
mediawiki_security: 22:50 UTC <Emperor> edit rates and suchlike all returning to normal AFAICS
mediawiki_security: 23:03 UTC <cwhite> I'm going to have to drop a lot more of these `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock).` logs. 5.5million per minute is unsustainable.
mediawiki_security: 23:06 UTC <swfrench-wmf> cwhite: ah, right ... I vaguely recall seeing a flood of those during a previous incident that involved request timeouts
mediawiki_security: 23:18 UTC <cdanis> I'm going to downgrade the incident
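For background on the circuit breaking Amir1 mentions at 22:39: the general pattern is that, after enough consecutive failures, callers stop sending work to a struggling backend for a cool-down period and fail fast instead, which limits the pile-up of connections and retries. The sketch below is a generic illustration of that pattern, not MediaWiki's actual database circuit-breaker implementation.

```python
"""Generic circuit-breaker sketch (illustrative only; MediaWiki's actual
database circuit breaking differs in its details). After repeated failures
the breaker opens and calls fail fast instead of adding load to a struggling
backend; after a cool-down, one trial call is allowed through again."""
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit again
        return result
```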
Sunday June 23rd
mediawiki_security: 01:35 UTC <Amir1> envoy telemetry is not giving me any internal network jump either: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1719092473404&to=1719097541096&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=All
mediawiki_security: 01:35 UTC <Amir1> my only conclusion was some maint scripts but that would show here: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1169&var-datasource=thanos&var-cluster=mysql&from=1719089686711&to=1719101545535
mediawiki_security: 01:35 UTC <Amir1> maaybe dumpers
mediawiki_security: 01:37 UTC <Amir1> yup it's dumps: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535
mediawiki_security: 01:40 UTC <Amir1> it seems to be only snapshot1012 and that host has enwiki dump running
mediawiki_security: 01:41 UTC <Amir1> I go rest. If it happens again, just kill enwiki dump
mediawiki_security 14:58 UTC <Amir1> My very unscientific gut feeling based idea of what happened is as follows: dumps put pressure on the infra, it basically trigger a full outage by causing so many errors being logged that congested the network, it recovers naturally after five minutes with the help of circuit breaking than dumper decides to go absolutely crazy (tbd: why? I'd guess it has some measures and thought because of the outage, the infra has more capacity than it actually has) and cause a smaller network congestion for half an hour
Monday June 24th
mediawiki_security 05:32 <_joe_> If dumps caused such an outage I hope we've opened a UBN! about it
mediawiki_security 07:34 <_joe_> I've retitled it and raised it to ubn!
mediawiki_security 07:35 <Emperor> OOI, which, to save my weak phab search skills?
mediawiki_security 07:39 <Emperor> ah, T368098
phabricator: 07:36 UTC <Joe> Dumps for english wikipedia caused a full sized outage as they saturated the network on saturday night:
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535&viewPanel=8
We definitely need to work on fixing whatever causes this amount of work to be done. Even if this a relatively one-off, we might need to kill the dumps generation if it causes issues again so the logic that clearly has no cap on the amount of data fetched in parallel needs to be fixed.
To clarify further, our running hypothesis regarding the outage is that dumps put an extraordinary strain on the databases, which in turn caused a flurry of errors to be logged from mediawiki, which overwhelmed network (so a new case of T365655)
mediawiki_security: 09:15 UTC <Amir1> the flurry of the logs also had a major impact on network saturation but they both together brought everything the a grinding halt
mediawiki_security 09:15 UTC <Amir1> (and the flurry of the logs itself was caused by the dumps)
mediawiki_security: 09:16 UTC <Amir1> I need to look into way to reduce log flood (it wasn't circuit breaking 500s)
mediawiki_security: 09:16 UTC <Emperor> logs saturating the network feels like something that's going to bite again
mediawiki_security: 09:28 UTC <topranks> the link from the core routers towards row C is what maxed out anyway
mediawiki_security: 10:08 UTC <Emperor> logs / pods > does sound like an accident waiting to happen a bit
mediawiki_security: 10:08 UTC <claime> yes
mediawiki_security: 10:08 UTC <claime> I'll take a look at it
mediawiki_security: 10:09 UTC <_joe_> claime: thank you for volunteering!
mediawiki_security: 12:33 UTC <cdanis> just want to make sure I understand -- dumps was the trigger (and in particular the from-scratch enwiki dump), but, the issues were all caused by us UDP ddosing ourselves internally with syslogs?
mediawiki_security: 13:33 UTC <cwhite> FWIW, the largest chunk of logs generated during the incident were warnings from `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks`
mediawiki_security: 13:49 UTC <_joe_> It's very important to remember that "the website not working" is priority 1000. Everything else broken is much much lower priority. So in case you need to pick, pick the website up :)
phabricator: 13:39 UTC <xcollazo> enwiki finished dumping at 2024-06-22 19:52:29, which doesn't correlate with the network event, which started around 2024-06-22 22:46:00.
However snapshot1012 is currently dumping commonswiki, so I suspect this was the one running when the event happened.
Additionally we are both running the 20240620 (i.e partial dump) as well as the 20240601 (i.e. full dump) of wikidatawiki. The full dump of wikidatawiki did had a network spike around the event time.
All these backed up runs are unfortunate, so sorry for the outage folks.
> @xcollazo do you see any complication with reducing parallelism of the enwiki dumps?
enwiki is now done, but I will monitor the run of commonswiki while I look whether we can temporarily lower the parallelism.
mediawiki_security: 15:30 UTC <Amir1> do we have a ticket for the log flood?
mediawiki_security: 15:46 UTC <cwhite> created https://phabricator.wikimedia.org/T368289
phabricator: 18:04 UTC <xcollazo> Ok after observing the dumps for a bit now, here are some conclusions:
- I still believe the extra pressure against the databases is attributable to not having prefetch, as discussed in T368098#9913625.
- I do not think network usage bursts coming from the snapshot servers are an indication of a bug, as discussed in T368098#9918320.
- There is an existing mechanism to control parallelism in dumps, called slots (unrelated to MCR slots), however this is currently a global setting that I would rather not lower. More discussion in T368098#9918849.
The hypothesis I have is that we will not see this issue once we are done with this particular run of the dumps that doesn't have prefetch. And in fact, up until now in my tenure as (reluctant) dumps maintainer, this is the first time I see a run affecting a database so heavily.
Here is a plan of action, considering this is an unbreak now:
- I will continue monitoring the remaining two wikis related to the offending 20240620 run: commonswiki and wikidatawiki.
- In the event that we have another SRE being paged, I will kill the remaining 20240620 dumps runs.
- In a couple days, we will start the next run of the dumps, namely the 20240701 run. I do not expect that run to be problematic, as it will leverage prefetching. However, in the event that we do see similar issues, I will work with @BTullis to lower the dumps slots, and/or pause the dumps as necessary.
Let me know if this sounds reasonable.
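For readers unfamiliar with the prefetch mechanism xcollazo describes, the sketch below illustrates the idea with hypothetical names (it is not the real XML-dumps code): when a previous run is available, most revision text is read back from it rather than from the database, so a run with no prefetch source, like 20240620, pushes essentially all of that load onto the replicas.

```python
"""Illustrative sketch of dump "prefetch" (hypothetical names; not the real
XML-dumps implementation). The point it demonstrates: with a previous run
available, most revision text never touches the database replicas; without
one, every revision becomes a database read."""
from typing import Callable, Dict, Iterable, Iterator, Tuple


def dump_revisions(
    revision_ids: Iterable[int],
    prefetch: Dict[int, str],                  # text from the previous run, possibly empty
    fetch_from_replica: Callable[[int], str],  # fallback database read
) -> Iterator[Tuple[int, str]]:
    for rev_id in revision_ids:
        text = prefetch.get(rev_id)
        if text is None:
            # No prefetch hit: fall back to a replica query. With the
            # 20240520 output deleted, every revision took this path.
            text = fetch_from_replica(rev_id)
        yield rev_id, text


# Toy usage: an empty prefetch dict models the 20240620 run.
db_reads = 0


def fake_replica_read(rev_id: int) -> str:
    global db_reads
    db_reads += 1
    return f"text of revision {rev_id}"


list(dump_revisions(range(1000), prefetch={}, fetch_from_replica=fake_replica_read))
print(db_reads)  # 1000: all of the load lands on the replicas
```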
mediawiki_security: 19:58 UTC <cwhite> we just had another blast of ~1bil `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks` logs
mediawiki_security: 20:03 UTC <Amir1> s4 not happy
wikimedia-operations: 20:03 UTC <+jinxer-wm> FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
wikimedia-operations: 20:04 UTC <+jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 6.955s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
wikimedia-operations: 20:04 UTC — sukhe whistles
mediawiki_security: 20:04 UTC <cwhite> more logs coming in
mediawiki_security: 20:04 UTC <cwhite> uhh, I think we're headed for another event like Sunday
mediawiki_security: 20:04 UTC <Amir1> rdbms_server_readonly
mediawiki_security: 20:04 UTC <sukhe> yeahhhh
mediawiki_security: 20:05 UTC <sukhe> dark clouds
mediawiki_security: 20:05 UTC <sukhe> here it is
phabricator 20:52 UTC <sukhe> This happened today again, starting at 20:05 and resulted in basically the same issues we observed over the weekend and in the previous incident. I have not identified the query and I leave that to people who know better, but the steps we took to resolve this were:
taavi@snapshot1017 ~ $ sudo systemctl stop commons*.service
<mutante> snapsho1017 - systemctl mask commonsrdf-dump ; systemctl mask commonsjson-dump T368098
(Please note: Puppet is still disabled there so that it doesn't start the service.)
And then subsequently, a patch by @Ladsgroup to reduce the log severity https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049261.
Tuesday June 25th
phabricator: 12:02 UTC <claime> Change 1049532 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log
https://gerrit.wikimedia.org/r/1049532
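Conceptually, rate limiting a log stream like udp2log amounts to putting something like a token bucket in front of the emitter: lines beyond the allowed rate are shed rather than blasted onto the network. The sketch below only illustrates that concept; the actual mechanism in the deployment-charts change above may differ, and the numbers are invented.

```python
"""Concept sketch of rate-limiting a log stream with a token bucket
(illustrative only; the real change lives in the deployment-charts patch
above and may use a different mechanism). Lines beyond the allowed rate are
dropped instead of saturating the network."""
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=1000, burst=2000)  # invented numbers
dropped = 0


def emit_log_line(line: str) -> None:
    """Forward a log line, or shed it when the bucket is empty."""
    global dropped
    if bucket.allow():
        pass  # forward to the log pipeline here
    else:
        dropped += 1  # shed load instead of flooding the network
```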
Wednesday June 26th
mediawiki_security: 16:15 UTC <cdanis> Amir1: dr0ptp4kt: there's another interesting thing here, which is that it was specifically the use of UDP packets for logging that allowed this outage to be so bad, as well
mediawiki_security: 16:15 UTC <cdanis> any TCP-based protocol would notice that its receiver was dropping packets like crazy, and slow down
mediawiki_security: 16:22 UTC <cdanis> if we weren't logging over UDP or if we had implemented network QoS by now (and tagged the UDP low), this also wouldn't've happened
mediawiki_security: 16:22 UTC <_joe_> yes we 100% have as an action item on the mediawiki side to finally ditch udp2log and move all those logging channels to go via kafka
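A rough illustration of cdanis's point, using plain sockets rather than the actual udp2log transport: a UDP sender keeps firing datagrams no matter how far behind the receiver falls, whereas a TCP sender is throttled as soon as the receiver stops draining its buffers. The host and port below are hypothetical.

```python
"""Toy demonstration of why UDP logging has no backpressure (plain sockets;
not the actual udp2log code). The sender never learns that the collector is
drowning, so an error storm turns directly into network load."""
import socket

HOST, PORT = "127.0.0.1", 8420  # hypothetical log collector address

# A "collector" that binds the port but never reads: its receive buffer
# fills and everything beyond that is silently dropped by the kernel.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind((HOST, PORT))

# UDP sender: sendto() succeeds immediately for every datagram, whether or
# not the collector keeps up; at scale this is what saturated the
# cross-router links in eqiad.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
line = b"LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending\n"
for _ in range(100_000):
    sender.sendto(line, (HOST, PORT))
print("sender finished without ever being told to slow down")

# With TCP, the same loop would stall: once the receiver stops reading and
# its window closes, sendall() blocks, throttling the sender instead of
# flooding the network.
```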
Detection
Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?
Copy the relevant alerts that fired in this section.
Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?
TODO: If human only, an actionable should probably be to "add alerting".
Conclusions
OPTIONAL: General conclusions (bullet points or narrative)
What went well?
- …
OPTIONAL: (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc
What went poorly?
- …
OPTIONAL: (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc
Where did we get lucky?
- …
OPTIONAL: (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc
Links to relevant documentation
- https://phabricator.wikimedia.org/T368098
- Traffic towards mwlog1002 during the incident, as measured at mwlog1002 (note: (1) interface is speed 1Gbps and (2) interpolated gaps suggest metric loss):
Actionables
- …
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | no | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | no | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all "yes" answers above) | | |