Incidents/2024-06-21 - 2024-06-24 Outage Incidents - Dumps Generation Issues

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2024-06-21 - 2024-06-24 Outage Incidents - Dumps Generation Issues
Start: 2024-06-21 18:51 UTC
End: 2024-06-24 20:28 UTC
Task: T368098
People paged: 3
Responder count: 10
Coordinators: sukhe
Affected metrics/SLOs:
Impact: Wikis inaccessible

The Wikimedia full dumps put pressure on the databases, and this pressure resulted in errors when MediaWiki attempted to access the database. The speed and volume of those errors caused a flood of UDP logs that overwhelmed the network, causing an outage for the wikis.
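
(For context on the failure mode: the MediaWiki logging path in question, udp2log, is fire-and-forget UDP, so an error storm turns directly into network load with no backpressure. A minimal sketch of that property; the hostname and port below are made up and this is not the real logging pipeline.)

```python
# Minimal sketch only: a fire-and-forget UDP log sender has no feedback loop,
# so an error storm translates directly into network traffic.
import socket

LOG_HOST, LOG_PORT = "log-relay.example.internal", 8420  # hypothetical endpoint

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def log_error(message: str) -> None:
    # sendto() returns as soon as the datagram is handed to the kernel;
    # nothing tells the caller that the link is saturated or packets are lost.
    sock.sendto(message.encode("utf-8"), (LOG_HOST, LOG_PORT))

# During the outage, every failing database access produced lines like this,
# millions of times per minute, with nothing in the path to slow them down.
for _ in range(1000):
    log_error("Wikimedia\\Rdbms\\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending")
```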

Timeline

Thursday June 20th

wikimedia-operations: 18:15 UTC <+icinga-wm_> PROBLEM - MariaDB Replica Lag: s1 page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.70 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

mediawiki_security: 18:50 UTC <Amir1> db1206 starts lagging a lot when dumps start, something in the query is either wrong or it bombard the replica. Either way, it needs to be investigated.

mediawiki_security: 19:01 UTC <mutante> https://phabricator.wikimedia.org/T368098  < created for this

Friday June 21st

phabricator: 14:10 UTC: <xcollazo> I suspect significantly more pressure is being generated because we deleted the 20240520 dumps in T365501 due to T365155. Now that the 20240620 run of the dumps is going, there is no content for the dumps process to prefetch from. This should only happen on this run of the dumps, and next run on 20240720 will have prefetch available.
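
(Background on "prefetch": the dump run normally reads revision text from the previous dump and only falls back to the database for revisions it cannot find there, so deleting the previous run removes that shield. A rough sketch of the idea follows; it is not the actual dumps code and the names are illustrative.)

```python
# Rough sketch of the prefetch idea; not the actual dumps implementation.
def get_revision_text(rev_id, prefetch_dump, db):
    """Prefer the previous dump's copy of a revision; query the database only
    when the revision is not available there (e.g. it is newer than that dump)."""
    if prefetch_dump is not None:
        text = prefetch_dump.lookup(rev_id)
        if text is not None:
            return text
    # With the 20240520 dumps deleted, there was no prefetch source at all,
    # so every revision of the 20240620 run took this expensive path.
    return db.fetch_revision_text(rev_id)
```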

wikimedia-operations: 22:38 UTC <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

phabricator: 23:16 UTC <JJMC89> This is impacting the wikis - bots are receiving maxlag errors.

mediawiki_security: 23:18 UTC <denisse> Regarding db dumps task, it seems to be user facing: https://phabricator.wikimedia.org/T368098#9914664

wikimedia-operations:  23:43 UTC <+logmsgbot> !log brett@puppetmaster1001 dbctl commit (dc=all): 'set db1206 s1 weight to 1 - T368098', diff saved to https://phabricator.wikimedia.org/P65334 and previous config saved to /var/cache/conftool/dbconfig/20240621-234328-brett.json

Saturday June 22nd

wikimedia-operations: 22:19 UTC <+jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 9.622s -

mediawiki_security: 22:22 UTC <Emperor> s4 looks sad though

mediawiki_security: 22:23 UTC <Emperor> ...and it's recovering

mediawiki_security: 22:26 UTC <Emperor> I can't seem to log in to superset though

mediawiki_security: 22:27 UTC <zabe> yeah, superset, phab and logstash are also gone

mediawiki_security: 22:36 UTC <swfrench-wmf> yeah, db ops/s across ~ all sections is still way down in eqiad, while latency is up

mediawiki_security: 22:39 UTC <Amir1> circuit breaking kicked in this time

mediawiki_security: 22:40 UTC <cdanis> also we need to post to the status page if we're still down

mediawiki_security: 22:40 UTC <cdanis> successful edits is at 0 so I'm guessing that's so?

mediawiki_security: 22:40 UTC <Emperor> I agree

mediawiki_security: 22:41 UTC <Emperor> (and the RED dashboard still has 5xxs elevated)

mediawiki_security: 22:41 UTC <TheresNoTime> (still down, someone should update the status yeah)

mediawiki_security: 22:42 UTC <cdanis> incident created

mediawiki_security: 22:42 UTC <Amir1> databases now are happy by they have many many connections without any actual queries

mediawiki_security: 22:43 UTC <taavi> cr1-eqiad<->asw2-c-eqiad link seems to be saturated

mediawiki_security: 22:43 UTC <swfrench-wmf> is it possible this is a network event in eqiad?

mediawiki_security: 22:45 UTC <Amir1> https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&viewPanel=17

mediawiki_security: 22:45 UTC <Amir1> we had a tiny ddos

mediawiki_security: 22:46 UTC <cdanis> we absolutely saturated some of the cross-router links in eqiad too

mediawiki_security: 22:46 UTC <taavi> seeing spikes in the access interfaces for many k8s nodes on asw2-c-eqiad?

mediawiki_security: 22:46 UTC <cdanis> which I think is still going on?

mediawiki_security: 22:46 UTC <taavi> do we have an incident doc yet? who's IC?

mediawiki_security: 22:50 UTC <Emperor> edit rates and suchlike all returning to normal AFAICS

mediawiki_security: 23:03 UTC <cwhite> I'm going to have to drop a lot more of these `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock).` logs.  5.5million per minute is unsustainable.

mediawiki_security: 23:06 UTC <swfrench-wmf> cwhite: ah, right ... I vaguely recall seeing a flood of those during a previous incident that involved request timeouts
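
(Back-of-envelope on cwhite's figure above, assuming an average serialized log line of roughly 1 KB, which is an assumption rather than a measured value: 5.5 million messages per minute is already most of a 1 Gbit/s link on its own.)

```python
# Back-of-envelope only; the ~1 KB average message size is an assumption.
messages_per_minute = 5_500_000
avg_message_bytes = 1_000

messages_per_second = messages_per_minute / 60               # ~91,667 msg/s
bits_per_second = messages_per_second * avg_message_bytes * 8

print(f"{messages_per_second:,.0f} msg/s ~= {bits_per_second / 1e9:.2f} Gbit/s")
# -> roughly 0.73 Gbit/s from this single log channel, before any other
#    traffic on the saturated cr1-eqiad <-> asw2-c-eqiad link.
```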

mediawiki_security: 23:18 UTC <cdanis> I'm going to downgrade the incident

Sunday June 23rd

mediawiki_security: 01:35 UTC <Amir1> envoy telemetry is not giving me any internal network jump either: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1719092473404&to=1719097541096&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=All

mediawiki_security: 01:35 UTC <Amir1> my only conclusion was some maint scripts but that would show here: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1169&var-datasource=thanos&var-cluster=mysql&from=1719089686711&to=1719101545535

mediawiki_security: 01:35 UTC <Amir1> maaybe dumpers

mediawiki_security: 01:37 UTC <Amir1> yup it's dumps: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535

mediawiki_security: 01:40 UTC <Amir1> it seems to be only snapshot1012 and that host has enwiki dump running

mediawiki_security: 01:41 UTC <Amir1> I go rest. If it happens again, just kill enwiki dump

mediawiki_security 14:58 UTC <Amir1> My very unscientific gut feeling based idea of what happened is as follows: dumps put pressure on the infra, it basically trigger a full outage by causing so many errors being logged that congested the network, it recovers naturally after five minutes with the help of circuit breaking than dumper decides to go absolutely crazy (tbd: why? I'd guess it has some measures and thought because of the outage, the infra has more capacity than it actually has) and cause a smaller network congestion for half an hour

Monday June 24th

mediawiki_security 05:32 <_joe_> If dumps caused such an outage I hope we've opened a UBN! about it

mediawiki_security 07:34 <_joe_> I've retitled it and raised it to ubn!

mediawiki_security 07:35 <Emperor> OOI, which, to save my weak phab search skills?

mediawiki_security 07:39 <Emperor> ah, T368098

phabricator: 07:36 UTC <Joe> Dumps for english wikipedia caused a full sized outage as they saturated the network on saturday night:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535&viewPanel=8

We definitely need to work on fixing whatever causes this amount of work to be done. Even if this a relatively one-off, we might need to kill the dumps generation if it causes issues again so the logic that clearly has no cap on the amount of data fetched in parallel needs to be fixed.

To clarify further, our running hypothesis regarding the outage is that dumps put an extraordinary strain on the databases, which in turn caused a flurry of errors to be logged from mediawiki, which overwhelmed network (so a new case of T365655)

mediawiki_security: 09:15 UTC <Amir1> the flurry of the logs also had a major impact on network saturation but they both together brought everything the a grinding halt

mediawiki_security 09:15 UTC <Amir1> (and the flurry of the logs itself was caused by the dumps)

mediawiki_security: 09:16 UTC <Amir1> I need to look into way to reduce log flood (it wasn't circuit breaking 500s)

mediawiki_security: 09:16 UTC <Emperor> logs saturating the network feels like something that's going to bite again

mediawiki_security: 09:28 UTC <topranks> the link from the core routers towards row C is what maxed out anyway

mediawiki_security: 10:08 UTC <Emperor> logs / pods > does sound like an accident waiting to happen a bit

mediawiki_security: 10:08 UTC <claime> yes

mediawiki_security: 10:08 UTC <claime> I'll take a look at it

mediawiki_security: 10:09 UTC <_joe_> claime: thank you for volunteering!

mediawiki_security: 12:33 UTC <cdanis> just want to make sure I understand -- dumps was the trigger (and in particular the from-scratch enwiki dump), but, the issues were all caused by us UDP ddosing ourselves internally with syslogs?

mediawiki_security: 13:33 UTC <cwhite> FWIW, the largest chunk of logs generated during the incident were warnings from `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks`

mediawiki_security: 13:49 UTC <_joe_> It's very important to remember that "the website not working" is priority 1000. Everything else broken is much much lower priority. So in case you need to pick, pick the website up :)

phabricator: 13:39 UTC <xcollazo> enwiki finished dumping at 2024-06-22 19:52:29, which doesn't correlate with the network event, which started around 2024-06-22 22:46:00.

However snapshot1012 is currently dumping commonswiki, so I suspect this was the one running when the event happened.

Additionally we are both running the 20240620 (i.e partial dump) as well as the 20240601 (i.e. full dump) of wikidatawiki. The full dump of wikidatawiki did had a network spike around the event time.

All these backed up runs are unfortunate, so sorry for the outage folks.

> @xcollazo do you see any complication with reducing parallelism of the enwiki dumps?

enwiki is now done, but I will monitor the run of commonswiki while I look whether we can temporarily lower the parallelism.

mediawiki_security: 15:30 UTC <Amir1> do we have a ticket for the log flood?

mediawiki_security: 15:46 UTC <cwhite> created https://phabricator.wikimedia.org/T368289

phabricator: 18:04 UTC <xcollazo> Ok after observing the dumps for a bit now, here are some conclusions:

  • I still believe the extra pressure against the databases is attributable to not having prefetch, as discussed in T368098#9913625.
  • I do not think network usage bursts coming from the snapshot servers are an indication of a bug, as discussed in T368098#9918320.
  • There is an existing mechanism to control parallelism in dumps, called slots (unrelated to MCR slots), however this is currently a global setting that I would rather not lower. More discussion in T368098#9918849.

The hypothesis I have is that we will not see this issue once we are done with this particular run of the dumps that doesn't have prefetch. And in fact, up until now in my tenure as (reluctant) dumps maintainer, this is the first time I see a run affecting a database so heavily.

Here is a plan of action, considering this is an unbreak now:

  1. I will continue monitoring the remaining two wikis related to the offending 20240620 run: commonswiki and wikidatawiki.
  2. In the event that we have another SRE being paged, I will kill the remaining 20240620 dumps runs.
  3. In a couple days, we will start the next run of the dumps, namely the 20240701 run. I do not expect that run to be problematic, as it will leverage prefetching. However, in the event that we do see similar issues, I will work with @BTullis to lower the dumps slots, and/or pause the dumps as necessary.

Let me know if this sounds reasonable.
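
(To illustrate the kind of cap under discussion: a sketch of bounding concurrency with a worker pool, not the dumps "slots" mechanism; the names and numbers are made up.)

```python
# Sketch of capping dump parallelism with a bounded worker pool; this is not
# the actual "slots" mechanism, just the general shape of such a cap.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DUMP_JOBS = 4  # hypothetical per-run cap

def dump_wiki(wiki: str) -> None:
    print(f"dumping {wiki}")  # placeholder for the real per-wiki dump work

queued_wikis = ["enwiki", "commonswiki", "wikidatawiki"]

# The pool size is the cap: no matter how many runs are backed up, at most
# MAX_CONCURRENT_DUMP_JOBS of them hit the database replicas at once.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_DUMP_JOBS) as pool:
    for wiki in queued_wikis:
        pool.submit(dump_wiki, wiki)
```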

mediawiki_security: 19:58 UTC <cwhite> we just had another blast of ~1bil `Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks` logs

mediawiki_security: 20:03 UTC <Amir1> s4 not happy

wikimedia-operations: 20:03 UTC <+jinxer-wm> FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy

wikimedia-operations: 20:04 UTC <+jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 6.955s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded

wikimedia-operations: 20:04 UTC — sukhe whistles

mediawiki_security: 20:04 UTC <cwhite> more logs coming in

mediawiki_security: 20:04 UTC <cwhite> uhh, I think we're headed for another event like Sunday

mediawiki_security: 20:04 UTC <Amir1> rdbms_server_readonly

mediawiki_security: 20:04 UTC <sukhe> yeahhhh

mediawiki_security: 20:05 UTC <sukhe> dark clouds

mediawiki_security: 20:05 UTC <sukhe> here it is

phabricator 20:52 UTC <sukhe> This happened today again, starting at 20:05 and resulted in basically the same issues we observed over the weekend and in the previous incident. I have not identified the query and I leave that to people who know better, but the steps we took to resolve this were:

taavi@snapshot1017 ~ $ sudo systemctl stop commons*.service

<mutante> snapsho1017 - systemctl mask commonsrdf-dump ; systemctl mask commonsjson-dump T368098

(Please note: Puppet is still disabled there so that it doesn't start the service.)

And then subsequently, a patch by @Ladsgroup to reduce the log severity https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049261.

Tuesday June 25th

phabricator: 12:02 UTC <claime> Change 1049532 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log

https://gerrit.wikimedia.org/r/1049532
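
(The real change lives in the deployment charts; purely to illustrate the rate-limiting idea, a token bucket in front of the log sender admits a bounded rate and drops the excess instead of forwarding everything. The numbers and names below are made up.)

```python
# Illustration of the rate-limiting concept only; the actual fix is a
# deployment-charts change, not application code like this.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec       # sustained messages per second
        self.capacity = burst          # how far short bursts may exceed the rate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                   # over budget: the caller drops the line

bucket = TokenBucket(rate_per_sec=10_000, burst=20_000)  # hypothetical limits

def maybe_send(line: str) -> None:
    if bucket.allow():
        pass  # hand the line to the UDP sender as before
    # else: the line is counted and dropped here instead of saturating the network
```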

Wednesday June 26th

mediawiki_security: 16:15 UTC <cdanis> Amir1: dr0ptp4kt: there's another interesting thing here, which is that it was specifically the use of UDP packets for logging that allowed this outage to be so bad, as well

mediawiki_security: 16:15 UTC <cdanis> any TCP-based protocol would notice that its receiver was dropping packets like crazy, and slow down

mediawiki_security: 16:22 UTC <cdanis> if we weren't logging over UDP or if we had implemented network QoS by now (and tagged the UDP low), this also wouldn't've happened

mediawiki_security: 16:22 UTC <_joe_> yes we 100% have as an action item on the mediawiki side to finally ditch udp2log and move all those logging channels to go via kafka
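
(Sketch of the difference being discussed, using kafka-python as an illustration rather than the planned implementation: a Kafka producer runs over TCP, buffers locally, and surfaces delivery failures, so a log storm backs up inside the sender instead of flooding router links. The broker address and topic below are made up.)

```python
# Illustrative sketch with kafka-python; not the planned production change.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.example.internal:9092"],  # hypothetical broker
    acks="all",                       # brokers must acknowledge each batch
    buffer_memory=32 * 1024 * 1024,   # bounded local buffer
    max_block_ms=1000,                # send() blocks, then errors, when the buffer is full
)

def log_error(message: str) -> None:
    # If the brokers or the network cannot keep up, this call slows down or
    # fails visibly instead of silently saturating a cross-router link.
    producer.send("mediawiki.error", value=message.encode("utf-8"))

log_error("example error line")
producer.flush()
```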

Detection

The database pressure was first detected by automated monitoring: Icinga paged for MariaDB replica lag on db1206 (s1) at 18:15 UTC on June 20th, and Amir1 linked the lag to the dumps about half an hour later. The user-facing impact was first reported by humans: JJMC89 noted on the Phabricator task (June 21st) that bots were receiving maxlag errors, and during the June 22nd outage responders in #mediawiki_security observed internal tools (Superset, Phabricator, Logstash) unreachable, elevated 5xx rates, and successful edits dropping to zero.

Relevant alerts (full text in the timeline above):

  • PROBLEM - MariaDB Replica Lag: s1 page on db1206 (2024-06-20 18:15 UTC and 2024-06-21 22:38 UTC)
  • MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (2024-06-22 22:19 UTC and 2024-06-24 20:04 UTC)
  • PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad (2024-06-24 20:03 UTC)

The alerts that fired pointed at symptoms (replica lag, latency, exhausted PHP-FPM workers) rather than at the cause (dumps running without prefetch) or the eventual failure mode (log traffic saturating the network), so most of the root-causing was done manually in IRC.

Conclusions

OPTIONAL: General conclusions (bullet points or narrative)

What went well?

OPTIONAL: (Use bullet points) for example: automated monitoring detected the incident, outage was root-caused quickly, etc

What went poorly?

OPTIONAL: (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

OPTIONAL: (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

Actionables

Action items identified in the timeline above (link to or create a Phabricator task for each):

  • Control the flood of Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks log messages: T368289 (filed by cwhite).
  • Rate limit udp2log traffic from mw-on-k8s: Gerrit change 1049532 (Clément Goubert).
  • Reduce the severity of the offending log message: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049261 (Ladsgroup).
  • Move the MediaWiki logging channels off udp2log (UDP) and onto Kafka, as raised on June 26th.
  • Dumps: ensure that a run without prefetch, or backed-up runs, cannot overload the database replicas (tracked in T368098).

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question and answer (yes/no):

People
Were the people responding to this incident sufficiently different than the previous five incidents? yes
Were the people who responded prepared enough to respond effectively? yes
Were fewer than five people paged? no
Were pages routed to the correct sub-team(s)? no
Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. no

Process
Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no
Was a public wikimediastatus.net entry created? no
Is there a phabricator task for the incident? yes
Are the documented action items assigned? no
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
Were the people responding able to communicate effectively during the incident with the existing tooling? no
Did existing monitoring notify the initial responders? yes
Were the engineering tools that were to be used during the incident available and in service? no
Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 6