User:ArielGlenn/dumps issues

This list of "recent" dumps issues is meant to assist the new dumps maintainers, as they figure out where to devote their resources. "Recent" means in the past couple of years, with a few older tasks added because they are representative of classes of issues.

Legend:

  • Incident is taken from the title of the Phabricator task.
  • Reporting date is when the Phabricator task was initially opened, not when the bug was first introduced or even noticed.
  • Task is a link to the Phabricator task with more details.
  • Impact is the impact it had on dumps production, if any.
  • Time to resolve is a guesstimate of the cumulative time spent by this dumps maintainer and does not include time spent by other developers, SRE, etc. These guesses are pretty arbitrary and may be fairly low.
  • Likelihood of recurrence is one of Unlikely, Low, Medium, Likely, or Guaranteed; usually this is an estimate of how likely an issue of this sort is to recur, rather than of the exact same issue recurring.
Dumps issues over the past year or so, with impact and risk of recurrence
Incident | Reporting date | Task | Impact | Time to resolve | Likelihood of recurrence
PHP Warning: XMLReader::read(): Memory allocation failed : growing input buffer | May 12 2023 | T336573 | Manual intervention needed: had to manually run the page range using bz2 prefetch files | Unresolved; several hours of investigation | Unlikely: we've seen maybe two of these sorts of errors in the last several years
various weekly and daily dumps run from systemd timers are broken | Apr 27 2021 | T281267 | Various jobs did not run due to a change in the systemd timer syntax in puppet manifests | 30 minutes (?) of investigation, and a patch merged | Unlikely: Possible to see a similar change but I don't see systemd timer manifests being tweaked again any time soon
Two page content jobs for wikidatawiki are taking days to complete. | May 16 2023 | T336742 | Eventually completed but would have been better with manual intervention to shoot the jobs and let them rerun | 1 hour (?) of investigation etc | Low: While we don't know the cause, I've seen this only a couple of times in several years.
SQL/XML dumps multiple errors "Couldn't update index json file, Continuing anyways" from 09 21 2022 near midnight | Sep 21 2022 | T318206 | No impact on dumps, surprisingly | This turned out to be a space issue; needed cleanup script time changes in puppet | Low: we have new hosts with lots of space for the next couple years
look into space issues on dumpsdata1001 and 1003 | Sep 19 2020 | T263318 | No impact on dumps | 30 mins (?) of investigation and cleanup | Low: we have new hosts with lots of space for the next couple years
Resolve space issues on dumpsdata1001, 1003 | Feb 25 2023 | T330573 | None on dumps | At least 8 hours total: quote/order/deploy new hosts and make them active | Low: we have new hosts with lots of space for the next couple years
Siteinfo v2 format job needs to be fixed up | Feb 9 2022 | T301373 | Dumps jobs kept retrying a successful job until this was fixed | 1 hour to investigate, fix, clean up | Low: We seldom add new datasets other than tables to the sql/xml dumps any more and could probably just defer until the new system is online
Make the Dumps Stats job faster | Feb 3 2023 | T328804 | None on dumps, but the job had to be shot if we wanted to do maintenance in between runs | 30 mins investigation, patch, etc | Medium: the job now takes 4 days to complete; we might need to tweak the number of cores one more time as these wikis get larger
Kiwix rsyncs not completing and stacking up on Clouddumps1001,2 | Aug 12 2020 | T260223 | Latest kiwix files not available for download | a few hours of investigation, discussion, patches in puppet, etc. (there were multiple issues over time on both ends) | Medium: a full mirror on their side can cause this, and they have no way to alert for that nor to get their upstream mirror to fix things quickly
Include linktarget data in public dumps | Aug 12 2022 | T315063 | Patch needed to add the new table to dumps output | 30 minutes to patch, merge, deploy, verify | Medium: we get requests for new tables from time to time
page_restrictions field incomplete in current and historical dumps | Apr 29 2020 (but a problem since 2008 probably) | T251411 | Dumps user must download the page_restrictions table and extract the info they need from there. | 30 minutes (?) of investigation; patch never merged; resolved by a planned schema change removing the obsolete field | Medium: Possible that another schema change will impact the dumps by causing a field to no longer be dumped, etc.
Failures for Cirrus Search dumps for wikidatawiki, zhwiki, enwiki for the 20230703 run | Jul 4 2023 | T341058 | Continuation of a previous issue; needed a few different patches from the Discovery team to sort it out; broken search dumps for those wikis that week | 15 minutes of investigation, discussion, etc | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Make Cirrus Search dump script more resilient to failures (elasticsearch restarts) | Oct 8 2020 | T265056 | Cirrus dumps for specific wikis unavailable that week | 30 minutes investigation, discussion | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Missing Cirrussearch dump (enwiki and wikidata) | Mar 1 2023 | T330936 | Cirrus dumps for specific wikis unavailable that week | 10 minutes investigation, reporting | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs | May 2 2023 | T335761 | Needed upstream (WME) fixes in generation of WME dumps metadata | 1-2 hours of investigation, patches, discussion, etc | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing Enterprise Dumps from 2022-10-01 run | Oct 4 2022 | T319269 | Patches in puppet needed as followup to migration to new clouddumps hosts | 1 hour (?) investigation and patching etc | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Failed to refresh WME API access token | Apr 25 2023 | T335368 | Missing Enterprise dumps for public download until this was fixed | 1 hour to investigate, discuss, patch, run | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing Enterprise Dumps from 2022-06-20 run | Jun 27 2022 | T311441 | Specific dumps were unavailable until the issue was fixed upstream | 1 hour to investigate, discuss, verify | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing projects on Enterprise HTML dumps | Mar 2 2022 | T302930 | Some wikis' Enterprise dumps remained missing until this was fixed | 30 mins (?) investigation, patch, deploy | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
The content model 'Json.JsonConfig' is not registered on this wiki (Collabwiki) | Apr 20 2023 | T335130 | Collabwiki dumps were broken until this was fixed | 30 mins of investigation, discussion, etc | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
abstracts dumps for dewikiversity fail with MWUnknownContentModelException from ContentHandler.php | Apr 10 2019 | T220594 | Abstract dumps were broken for that wiki until MW was patched | 2 hours (?) of investigation, discussion, looking at patches, etc. | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
content still marked as flow-board on urwikibooks breaks abstract dumps | Apr 12 2019 | T220793 | Manual intervention: ran a locally patched MW to generate the output files | 30 minutes of investigation, writing a patch, etc. | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
Data dumps for November aborted | Nov 3 2022 | T322363 | Shot by SRE due to https://phabricator.wikimedia.org/T322360 which needed to be fixed; this required MW patches (see https://wikitech.wikimedia.org/wiki/Incidents/2022-11-03_conf_disk_space) | 2 hours investigation, patches, discussion, etc. | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
incomplete conversion of flow revisions after disabling flow, breaks stubs dumps | Jul 24 2019 | T228921 | Manual intervention: patched MW on a testbed and ran it to produce the output files | a few hours of investigation, discussion about patches, etc | Medium: Exception handling in MW changes often enough, and changes to core methods dealing with revisions, titles, and pages can break things
"No server with index" and "Warning: Undefined index" from LoadBalancer::reconfigure | Nov 1 2022 | T322156 | Stubs dumps broken until this was patched | 2 hours investigation, discussion, testing, etc | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
Connections to all db servers for wikidata as wikiadmin from snapshot, terbium | Jun 20 2016 | T138208 | Obstructed DBA maintenance | Several hours of investigation, discussion, looking at patches etc | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
Flow dumps are broken on all wikis due to MediaWiki update | Mar 21 2022 | T304318 | Flow dumps for all wikis were broken until patched | 1 hour to investigate, verify, etc | Medium: touching 'requires' and other such things that can impact autoload is fragile
Flow (current and/or history) dumps failed on various wikis with php exhausting allowed memory | Feb 2 2022 | T300760 | Manual intervention: locally patched MW run to generate output files on time | 2 hours (?) investigation, patches, testing, etc | Medium: SettingsBuilder and similar things that may impact execution of FinalSetup() are fragile
Exception from dumps on group0 wikis after MediaWiki deployment | Jan 12 2022 | T299020 | All sql/xml dumps were broken for group0 wikis until this was patched | 1 hour investigation, patch, deploy, verify, etc | Medium: Expecting MediaWikiServices and similar things to be available in the constructor of dumps scripts is too soon (and hard to test for)
PHP Notice: Trying to access array offset on value of type int | Sep 23 2022 | T318423 | None on dumps; they ran to completion (Content Translation dumps) | 15 minutes to follow along, verify, etc | Medium: bugs crop up in extensions related to "misc" dumps from time to time
Some mw snapshot hosts are accessing main db servers | Aug 25 2016 | T143870 | Needed many patches, some of which broke dumps in the meantime, etc. Whack-a-mole, probably ok now | Many hours, different patches over several years | Likely: we still have long-running jobs that talk to a depooled db server, blocking maintenance unless the DBA shoots the dumps
Wikidata rdf lexemes dump failed for Fri Aug 4, due to db conn error | Aug 5 2023 | T343621 | The dump is missing for that week. | 30 minutes of work | Likely: Connection errors happen often. We should expect other incidents where retries are not done.
cleanup_xmldumps is failing on dumpsdata1005 | Mar 3 2023 | T331129 | None on dumps. Caused by a manual rsync of dumps files needed to bring a new NFS share online; went away by itself as expected | 15 minutes | Likely: likely to be seen when new NFS shares are brought online
Dumps Exception email did not list full exception | Dec 5 2022 | T324463 | No impact on dumps | Left as is; not worth the work to resolve, since the message is part of a zero-impact exception | Likely
Flow page content dumps not resilient when database goes away | Feb 8 2018 | T186801 | None on dumps; retries in the python dumps scripts get the dumps done regardless | 10 minutes to investigate, left unresolved | Likely, but we've not seen another incident of it since 2018
PHP Warning: failed to get text for revid [id] [Called from AbstractFilter::getText in /srv/mediawiki/php-1.39.0-wmf.7/extensions/ActiveAbstract/includes/AbstractFilter.php at line 195] | Apr 21 2022 | T306629 | None on dumps, the specific text is bad and is skipped/represented as empty | 15 minutes to investigate, left unresolved, needs maintenance script run that marks a blob as bad in the db | Likely: there's a fair amount of bad data in the dbs.
nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException) | Oct 20 2020 | T265989 | None on dumps, the bad revisions are represented as empty | 1 hour of investigation, action items still pending | Likely: there's a fair amount of bad data in the dbs.
MWContentSerializationException $entityId and $targetId can not be the same | Feb 28 2019 | T217329 | Manual intervention: ran locally patched MW to generate the output files | several hours to investigate, discuss patches, etc. Note that MW still does not properly handle these bad redirects. | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
XmlDumpWriter::openPage handles main namespace articles with prefixes that are namespace names AND are redirects incorrectly | Apr 8 2019 | T220316 | Manual intervention: ran locally patched MW to generate output files | several hours to investigate, patch, shepherd the patches through | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
XmlDumpWriter::writeRevision sometimes broken by duplicate keys in Link Cache | Apr 8 2019 | T220424 | Needed a patch to fix dumps | 1 hour (?) investigation, patch, etc. | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
investigate why content history dump of certain wikidata page ranges is so slow | Apr 20 2019 | T221504 | Needed a patch to make individual processes shorter (overall run time is still as long, but a problem with any one output file takes less time to rerun) | Unresolved: a few hours of investigation, patching, discussion, etc, but wikidata is just too effing big and json is a verbose format | Likely: We trend towards more verbosity rather than less in dump formats, and the data sources get larger and larger
PHP warnings about bad gzinflate thrown during XML stubs dumps Dec 2021 | Dec 1 2021 | T296823 | None on dumps; the bad entries are skipped or represented as empty. These are warnings, so they're annoying in logstash, but that's all | Unresolved; there is a pending patch that was never reviewed | Guaranteed: lots of bad data in them there dbs
New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps | Aug 22 2022 | T315902 | None on dumps, they keep running | Unresolved, several db-related patches later | Guaranteed: We'll see these errors on every run.