User:ArielGlenn/dumps issues

This list of "recent" dumps issues is meant to assist the new dumps maintainers, as they figure out where to devote their resources. "Recent" means in the past couple of years, with a few older tasks added because they are representative of classes of issues.

Legend:

  • Incident is taken from the title of the Phabricator task.
  • Reporting date is when the Phabricator task was initially opened, not when the bug was first introduced or even noticed.
  • Task is a link to the Phabricator task with more details.
  • Impact is the impact it had on dumps production, if any.
  • Time to resolve is a guesstimate of the cumulative time spent by this dumps maintainer and does not include time spent by other developers, SRE, etc. These guesses are pretty arbitrary and may be fairly low.
  • Likelihood of recurrence is one of Unlikely, Low, Medium, Likely, or Guaranteed; usually this is an estimate of how likely an issue of this sort is to recur, rather than of the exact same issue recurring.
Dumps issues over the past year or so, with impact and risk of recurrence
Incident | Reporting date | Task | Impact | Time to resolve | Likelihood of recurrence
PHP Warning: XMLReader::read(): Memory allocation failed : growing input buffer | May 12 2023 | T336573 | Manual intervention needed: had to manually run the page range using bz2 prefetch files | Unresolved; several hours of investigation | Unlikely: we've seen maybe two of these sorts of errors in the last several years
various weekly and daily dumps run from systemd timers are broken | Apr 27 2021 | T281267 | Various jobs did not run due to a change in the systemd timer syntax in puppet manifests | 30 minutes (?) of investigation, and a patch merged | Unlikely: Possible to see a similar change but I don't see systemd timer manifests being tweaked again any time soon
Two page content jobs for wikidatawiki are taking days to complete. | May 16 2023 | T336742 | Eventually completed but would have been better with manual intervention to shoot the jobs and let them rerun | 1 hour (?) of investigation etc | Low: While we don't know the cause, I've seen this only a couple of times in several years.
SQL/XML dumps multiple errors "Couldn't update index json file, Continuing anyways" from 09 21 2022 near midnight | Sep 21 2022 | T318206 | No impact on dumps, surprisingly | This turned out to be a space issue; needed cleanup script time changes in puppet | Low: we have new hosts with lots of space for the next couple years
look into space issues on dumpsdata1001 and 1003 | Sep 19 2020 | T263318 | No impact on dumps | 30 mins (?) of investigation and cleanup | Low: we have new hosts with lots of space for the next couple years
Resolve space issues on dumpsdata1001, 1003 | Feb 25 2023 | T330573 | None on dumps | At least 8 hours total: quote/order/deploy new hosts and make them active | Low: we have new hosts with lots of space for the next couple years
Siteinfo v2 format job needs to be fixed up | Feb 9 2022 | T301373 | Dumps jobs kept retrying a successful job until this was fixed | 1 hour to investigate, fix, clean up | Low: We seldom add new datasets other than tables to the sql/xml dumps any more and could probably just defer until the new system is online
Make the Dumps Stats job faster | Feb 3 2023 | T328804 | None on dumps, but the job had to be shot if we wanted to do maintenance in between runs | 30 mins investigation, patch, etc | Medium: the job now takes 4 days to complete; we might need to tweak the number of cores one more time as these wikis get larger
Kiwix rsyncs not completing and stacking up on Clouddumps1001,2 | Aug 12 2020 | T260223 | Latest kiwix files not available for download | a few hours of investigation, discussion, patches in puppet, etc. (there were multiple issues over time on both ends) | Medium: a full mirror on their side can cause this, and they have no way to alert for that nor to get their upstream mirror to fix things quickly
Include linktarget data in public dumps | Aug 12 2022 | T315063 | Patch needed to add the new table to dumps output | 30 minutes to patch, merge, deploy, verify | Medium: we get requests for new tables from time to time
page_restrictions field incomplete in current and historical dumps | Apr 29 2020 (but a problem since 2008 probably) | T251411 | Dumps user must download the page_restrictions table and extract the info they need from there. | 30 minutes (?) of investigation; patch never merged; resolved by a planned schema change removing the obsolete field | Medium: Possible that another schema change will impact the dumps by causing a field to no longer be dumped, etc.
Failures for Cirrus Search dumps for wikidatawiki, zhwiki, enwiki for the 20230703 run | Jul 4 2023 | T341058 | Continuation of a previous issue; needed a few different patches from the Discovery team to sort it out; broken search dumps for those wikis that week | 15 minutes of investigation, discussion, etc | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Make Cirrus Search dump script more resilient to failures (elasticsearch restarts) | Oct 8 2020 | T265056 | Cirrus dumps for specific wikis unavailable that week | 30 minutes investigation, discussion | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Missing Cirrussearch dump (enwiki and wikidata) | Mar 1 2023 | T330936 | Cirrus dumps for specific wikis unavailable that week | 10 minutes investigation, reporting | Medium: Search dumps have had a couple sorts of incidents in the past year related to run time or resilience, there might well be more
Missing Enterprise Dumps in 2023-04-20, 2023-05-01 and 2023-05-20 runs | May 2 2023 | T335761 | Needed upstream (WME) fixes in generation of WME dumps metadata | 1-2 hours of investigation, patches, discussion, etc | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing Enterprise Dumps from 2022-10-01 run | Oct 4 2022 | T319269 | Patches in puppet needed as followup to migration to new clouddumps hosts | 1 hour (?) investigation and patching etc | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Failed to refresh WME API access token | Apr 25 2023 | T335368 | Missing Enterprise dumps for public download until this was fixed | 1 hour to investigate, discuss, patch, run | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing Enterprise Dumps from 2022-06-20 run | Jun 27 2022 | T311441 | Specific dumps were unavailable until the issue was fixed upstream | 1 hour to investigate, discuss, verify | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
Missing projects on Enterprise HTML dumps | Mar 2 2022 | T302930 | Some wikis' Enterprise dumps remained missing until this was fixed | 30 mins (?) investigation, patch, deploy | Medium: The downloader and the generation of these dumps haven't undergone the intense scrutiny that the sql/xml dumps have, so we might see new issues cropping up
The content model 'Json.JsonConfig' is not registered on this wiki (Collabwiki) | Apr 20 2023 | T335130 | Collabwiki dumps were broken until this was fixed | 30 mins of investigation, discussion, etc | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
abstracts dumps for dewikiversity fail with MWUnknownContentModelException from ContentHandler.php | Apr 10 2019 | T220594 | Abstract dumps were broken for that wiki until MW was patched | 2 hours (?) of investigation, discussion, looking at patches, etc. | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
content still marked as flow-board on urwikibooks breaks abstract dumps | Apr 12 2019 | T220793 | Manual intervention: ran a locally patched MW to generate the output files | 30 minutes of investigation, writing a patch, etc. | Medium: We get hit by content-handler issues from time to time; MW does not deal with these gracefully.
Data dumps for November aborted | Nov 3 2022 | T322363 | Shot by SRE due to https://phabricator.wikimedia.org/T322360 which needed to be fixed; this required MW patches (see https://wikitech.wikimedia.org/wiki/Incidents/2022-11-03_conf_disk_space) | 2 hours investigation, patches, discussion, etc. | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
incomplete conversion of flow revisions after disabling flow, breaks stubs dumps | Jul 24 2019 | T228921 | Manual intervention: patched MW on a testbed and ran it to produce the output files | a few hours of investigation, discussion about patches, etc | Medium: Exception handling in MW changes often enough, and changes to core methods dealing with revisions, titles, and pages can break things
"No server with index" and "Warning: Undefined index" from LoadBalancer::reconfigure | Nov 1 2022 | T322156 | Stubs dumps broken until this was patched | 2 hours investigation, discussion, testing, etc | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
Connections to all db servers for wikidata as wikiadmin from snapshot, terbium | Jun 20 2016 | T138208 | Obstructed DBA maintenance | Several hours of investigation, discussion, looking at patches etc | Medium: Changes to tricky parts of MW like db conn handling can have unintended consequences, and we get hit by this sort of thing from time to time
Flow dumps are broken on all wikis due to MediaWiki update | Mar 21 2022 | T304318 | Flow dumps for all wikis were broken until patched | 1 hour to investigate, verify, etc | Medium: touching 'requires' and other such things that can impact autoload is fragile
Flow (current and/or history) dumps failed on various wikis with php exhausting allowed memory | Feb 2 2022 | T300760 | Manual intervention: locally patched MW run to generate output files on time | 2 hours (?) investigation, patches, testing, etc | Medium: SettingsBuilder and similar things that may impact execution of FinalSetup() are fragile
Exception from dumps on group0 wikis after MediaWiki deployment | Jan 12 2022 | T299020 | All sql/xml dumps were broken for group0 wikis until this was patched | 1 hour investigation, patch, deploy, verify, etc | Medium: Expecting MediaWikiServices and similar things to be available in the constructor of dumps scripts is too soon (and hard to test for)
PHP Notice: Trying to access array offset on value of type int | Sep 23 2022 | T318423 | None on dumps; they ran to completion (Content Translation dumps) | 15 minutes to follow along, verify, etc | Medium: bugs crop up in extensions related to "misc" dumps from time to time
Some mw snapshot hosts are accessing main db servers | Aug 25 2016 | T143870 | Needed many patches, some of which broke dumps in the meantime, etc. Whack-a-mole, probably ok now | Many hours, different patches over several years | Likely: we still have long-running jobs that talk to a depooled db server, blocking maintenance unless the DBA shoots the dumps
Wikidata rdf lexemes dump failed for Fri Aug 4, due to db conn error | Aug 5 2023 | T343621 | The dump is missing for that week. | 30 minutes of work | Likely: Connection errors happen often. We should expect other incidents where retries are not done.
cleanup_xmldumps is failing on dumpsdata1005 | Mar 3 2023 | T331129 | None on dumps. Caused by a manual rsync of dumps files needed to bring a new NFS share online; went away by itself as expected | 15 minutes | Likely: likely to be seen when new NFS shares are brought online
Dumps Exception email did not list full exception | Dec 5 2022 | T324463 | No impact on dumps | Left as is; not worth the work to resolve, since the message is part of a zero-impact exception | Likely
Flow page content dumps not resilient when database goes away | Feb 8 2018 | T186801 | None on dumps; retries in the python dumps scripts get the dumps done regardless | 10 minutes to investigate, left unresolved | Likely, but we've not seen another incident of it since 2018
PHP Warning: failed to get text for revid [id] [Called from AbstractFilter::getText in /srv/mediawiki/php-1.39.0-wmf.7/extensions/ActiveAbstract/includes/AbstractFilter.php at line 195] | Apr 21 2022 | T306629 | None on dumps, the specific text is bad and is skipped/represented as empty | 15 minutes to investigate, left unresolved, needs maintenance script run that marks a blob as bad in the db | Likely: there's a fair amount of bad data in the dbs.
nl.wiktionary.org edits from May 2004 corrupt "PHP Warning: gzinflate(): data error" (fatal RevisionAccessException) | Oct 20 2020 | T265989 | None on dumps, the bad revisions are represented as empty | 1 hour of investigation, action items still pending | Likely: there's a fair amount of bad data in the dbs.
MWContentSerializationException $entityId and $targetId can not be the same | Feb 28 2019 | T217329 | Manual intervention: ran locally patched MW to generate the output files | several hours to investigate, discuss patches, etc. Note that MW still does not properly handle these bad redirects. | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
XmlDumpWriter::openPage handles main namespace articles with prefixes that are namespace names AND are redirects incorrectly | Apr 8 2019 | T220316 | Manual intervention: ran locally patched MW to generate output files | several hours to investigate, patch, shepherd the patches through | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
XmlDumpWriter::writeRevision sometimes broken by duplicate keys in Link Cache | Apr 8 2019 | T220424 | Needed a patch to fix dumps | 1 hour (?) investigation, patch, etc. | Likely: Edge cases like this show up on a regular (once every 6 months?) basis
investigate why content history dump of certain wikidata page ranges is so slow | Apr 20 2019 | T221504 | Needed a patch to make individual processes shorter (overall run time is still as long, but a problem with any one output file takes less time to rerun) | Unresolved: a few hours of investigation, patching, discussion, etc, but wikidata is just too effing big and json is a verbose format | Likely: We trend towards more verbosity rather than less in dump formats, and the data sources get larger and larger
PHP warnings about bad gzinflate thrown during XML stubs dumps Dec 2021 | Dec 1 2021 | T296823 | None on dumps; the bad entries are skipped or represented as empty. These are warnings, so they're annoying in logstash, but that's all | Unresolved; there is a pending patch that was never reviewed | Guaranteed: lots of bad data in them there dbs
New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps | Aug 22 2022 | T315902 | None on dumps, they keep running | Unresolved, several db-related patches later | Guaranteed: We'll see these errors on every run.