Dumps/Troubleshooting


Dealing with problems

Broken SQL/XML dumps overview

The SQL/XML dumps can break in a few interesting ways.

  1. They no longer appear to be running. Is the monitor running? See below. If it is running, perhaps all the workers are stuck on a stage waiting for a previous stage that failed.
    Shoot them all and let the systemd timer job sort it out (a sketch of how to check is just below this list). You can also see if there have been emails about exceptions; fix the underlying problem and wait for systemd.
  2. A dump for a particular wiki has been aborted. This may be because someone shot the script for behaving badly, or because a host was powercycled in the middle of a run.
    The next systemd run should fix this up.
  3. A dump on a particular wiki has failed.
    Check emails for exceptions, track down the underlying issue (db outage? MW deploy of bad code? Other?), fix it, and wait for systemd to rerun it.
  4. A dump has hung on some step, the processes in the pipeline apparently reading/writing and yet no output being produced.
    We get email notifications to ops-dumps@wikimedia.org if there is a lockfile for a wiki and no file updated within the last 4 hours. These must be investigated on a case by case basis.
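
A minimal sketch of that check from a snapshot worker host; the timer grep pattern and the worker.py script name are assumptions, so check the puppet manifests for the real unit and script names:

  # list dump-related systemd timers and when they last/next fired
  systemctl list-timers | grep -i dump

  # long-running processes owned by the dumpsgen user; look at the elapsed times
  ps -u dumpsgen -o pid,etime,args

  # if the workers really are wedged, kill them and let the systemd timer restart things
  # (double-check the pids first; the worker.py script name is an assumption)
  sudo pkill -u dumpsgen -f worker.py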

If you end up needing to manually rerun part or all of a dump, follow the guidance at Dumps/Rerunning a job.

Details for SQL/XML and other dumps

The list below does not cover all possible issues, only some that come up from time to time.

Note: some issues are less likely to occur than others, and so each issue has been marked as "Unlikely", "Low", "Medium", "Likely" in order of ascending probability.

The web server's SQL/XML dumps index.html page was last updated more than 12 hours before the current (UTC) time

  • This can mean that the dumps themselves are stalled (Unlikely), the monitor job that updates the index.html file is broken in some way (Low), that the rsync which copies files from the internal hosts to the webserver is hung (Medium), or just that there is a lot of catching up the rsync must do, if for example an rsync target was pulled out for maintenance and then returned to service (Likely).
  • Check which is the case by going to the internal NFS share where dumps are written (see the server list for which one that is, verify when you get on the host by checking the role displayed as you log in). Look at the rsync process that is copying dumps around and see which host it is copying to; then check to see if data is actually being copied. If the rsync is hung, stop and restart it via systemctl.
  • If the rsync is working fine, but the index.html file on the internal NFS share has a date more than a couple of hours old, check whether the monitor process is running on the worker host (see the worker list for which one that is, and verify the role displayed at login). If it is not, try running the python monitor script from the command line as the dumpsgen user and look for errors. If there are none, try restarting the job via systemctl (see the sketch after this list). This is quite unlikely to be the issue.
  • If the monitor is running fine, perhaps the dumps themselves are not running. This is extremely unlikely and should be indicated on the index.html page ("Dump process is idle"). If it is not, something is seriously broken and you should ask for help.
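
A hedged sketch of those checks; the rsync unit name is a placeholder (list the units to find the real one), and dumps-monitor is the monitor unit mentioned later on this page:

  # on the NFS share host: find the dumps rsync and check whether it is actually moving data
  ps -ef | grep '[r]sync'
  cat /proc/<rsync-pid>/io; sleep 30; cat /proc/<rsync-pid>/io   # the read/write counters should grow

  # restart a hung rsync via its systemd unit (unit name varies; list the units to find it)
  systemctl list-units --all 'dumps*'
  sudo systemctl restart <the-rsync-unit>

  # on the worker host: check the monitor, and if needed run its script by hand as dumpsgen
  systemctl status dumps-monitor
  systemctl cat dumps-monitor          # the ExecStart line shows the real script path
  sudo systemctl restart dumps-monitor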

Email from the dumps exception checker or a systemd job

Email is ordinarily sent if a dump does not complete successfully, going to ops-dumps@wikimedia.org (which is an alias, not a list). If you want to follow and fix failures, add yourself to that alias.

Logs are kept of each SQL/XML dumps run. From any snapshot host, you can find the log for a given wiki and run at /mnt/dumpsdata/xmldatadumps/private/<wikiname>/<date>/dumplog.txt. From it you may glean more about the reason for the failure.

Logs that capture everything for all sql/xml dumps are available in /var/log/dumps/ and may also contain clues, but these are cluttered and split oddly by date; check the run-specific logs first.

Logs for "other" dumps are generally available on the worker host dedicated to running those, in a subdirectory of /var/log. For example, the cirrus search dumps are logged in subdirs cirrussearch-dump-s1, cirrussearch-dump-s2 and so on.

When one or more steps of an sql/xml dumps run fail, the index.html file for that dump usually includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.

Your first step in most cases when receiving such an email should be to check if the particular dump job for that wiki has run to completion already; perhaps the error was transient (a db server was depooled, or a bug in MediaWiki was deployed and fixed while you were away). The easiest way to check this is to look at the web page for the specific wiki and dump run and see if the job is shown as complete and the output files are listed as available for download. There are numerous retry mechanisms built in at every step of these dumps; often they silently handle issues and no manual intervention is required.
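
You can also check from the command line; this sketch assumes the public mirror layout at dumps.wikimedia.org and the machine-readable dumpstatus.json report that sql/xml runs publish alongside index.html:

  # human-readable status page for one wiki and run
  curl -s "https://dumps.wikimedia.org/<wikiname>/<rundate>/" | less

  # machine-readable per-job status; look for "done" vs "failed" on the job in question
  curl -s "https://dumps.wikimedia.org/<wikiname>/<rundate>/dumpstatus.json" | python3 -m json.tool | less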

Exceptions we see

These are some categories of errors we have seen or currently see, and approaches to dealing with them.

SQL/XML dumps
  • "Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways" (from sql/xml dumps)
    ignore these, no impact on dumps, difficult to track down, related to db connection shuffling patches.
"Other" dumps
  • "dumps.exceptions.BackupError: Too many consecutive failures, giving up: last failures on some-wikiname-here" (from adds-changes dumps)
    check to see if the specific wiki is newly created; if so, it will return zero revisions in response to queries, and we treat that as an error, which we can ignore
  • "dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/202XMMDD/some-wiki-and-date-here.json.gz" (from CirrusSearch dumps)
    make sure the file was not generated later on a retry; go to the primary NFS share for "other" dumps and look in /data/otherdumps/cirrussearch/ in the subdirectory for the run date. If not, check logs on the worker that runs "other" dumps, looking for the wiki-specific file in /var/log/cirrussearch-dump-sectionnumberhere (see the email for the section number). You'll likely end up making a Phab task with that info and tagging the search folks to have a look.
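
A quick way to make those checks, using the run date and section number from the email and the paths given above:

  # on the primary NFS share for "other" dumps: did a retry produce the file after all?
  ls -l /data/otherdumps/cirrussearch/<rundate>/ | grep <wikiname>

  # on the "other" dumps worker: check the log for that wiki in the section-specific subdir
  grep -i <wikiname> /var/log/cirrussearch-dump-s<N>/* | tail -50
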
Fetches of other datasets
  • " /usr/bin/python3 /usr/local/bin/wm_enterprise_downloader.py --creds /etc/dumps/wm_enterprise_creds --settings /etc/dumps/wm_enterprise_settings --retries 5" and a long unreadable stack trace followed by something like "ERROR - Failed to retrieve dump file info for wiki some-name-here and namespace some-number-here" (from enterprise dumps downloader)
    Check to see if retries got it. On the public web server (one of the clouddumps hosts), look in the subdir of /srv/dumps/xmldatadumps/public/other/enterprise_html/runs/ for the run date and see whether the files for the wiki are there. If they are not, and the error from the script doesn't seem transitory (a connection issue, a 500 from upstream, or the like), make sure the script is not still running, then run it by hand as the dumpsgen user on the web server, invoking it only for that namespace and wiki, and check for errors (a sketch follows this list).
  • "rsync: [sender] send_files failed to open "/zim/wikipedia/wikipedia_some-date-etc-here.zim" (in wmf.download.kiwix.org): Permission denied (13)" (from kiwix mirror rsync)
    These can be ignored, unless we see it for a bunch of files, in which case we'll need to contact upstream and let them know they have a perms problem.
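
For the enterprise downloader case above, a sketch of the check and the manual rerun; the options for limiting the run to one wiki and namespace are not shown here since they may change, so consult the script's --help first:

  # on the public web server (clouddumps host): did the retries produce the files after all?
  ls -l /srv/dumps/xmldatadumps/public/other/enterprise_html/runs/<rundate>/ | grep <wikiname>

  # if not: make sure the downloader is not still running, then rerun it by hand as dumpsgen,
  # limited to the failing wiki and namespace (see --help for the relevant options)
  pgrep -af wm_enterprise_downloader.py
  sudo -u dumpsgen /usr/bin/python3 /usr/local/bin/wm_enterprise_downloader.py --help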

Space issues

The current dumpsdata NFS shares have plenty of room for growth over the next few years. If space becomes tight, old dumps may have been copied over and not cleaned up, or perhaps some extensive testing was done and not cleaned up. (Unlikely)

  • First check that there's nothing huge in /data/temp/dumpsgen or any other subdir under /data/temp. "Huge" in this case means over 1T. If there is, toss it.
  • Next, if this is a primary NFS share for the "other" dumps, see if the /data/xmldatadumps directory is huge. If so, clean it up; the wiki subdirectories aren't needed for anything and can all be tossed.
  • If this is a primary NFS share for the sql/xml dumps, see if the /data/otherdumps directory is huge; if so, clean it up.
  • If this is a primary or secondary NFS share for the sql/xml dumps, check that the number of subdirectories for enwiki, elwiki and wikidatawiki (/data/xmldatadumps/public/enwiki/20XX and so on) is no more than 5 or 6. If it's not, the cleanup job is not running. Check permissions on the directories and files; they should be owned by the dumpsgen user. Try running the cleanup job by hand as the dumpsgen user from the command line and look for errors (see the sketch after this list).
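
A sketch of those space checks, run on the NFS host as the dumpsgen user or root; the cleanup job itself is not invoked here since its name comes from puppet:

  # anything huge (over ~1T) lurking under the temp area?
  du -sh /data/temp/* 2>/dev/null | sort -h | tail

  # trees that should be small or absent on this class of host
  du -sh /data/xmldatadumps /data/otherdumps 2>/dev/null

  # number of retained runs for the big wikis (should be 5 or 6 at most), and ownership
  ls -d /data/xmldatadumps/public/enwiki/20* | wc -l
  ls -ld /data/xmldatadumps/public/enwiki/20* | head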

A worker host (snapshot) dies

If it can be brought back up within a day, don't bother to take any measures, just get the box back in service. If there are deployments scheduled in the meantime, you may want to remove it from scap targets for mediawiki: edit hieradata/common/scap/dsh.yaml in our puppet repo for that.

If it's the testbed host (check the role in site.pp), just leave it and arrange for parts to be replaced; no services will be impacted.

If it will take more than a day to be fixed, swap it for the testbed/canary box, and remove it from scap targets for mediawiki:

  • open manifests/site.pp and find the stanza for the broken snapshot host, grab that role
  • now look for the snapshot host with role(dumps::generation::worker::testbed), and put the broken host's role there
  • in hieradata/hosts, look for a file with the name of the broken host, and one with the name of the testbed host. Swap their contents (see the sketch after this list).
  • edit hieradata/common/scap/dsh.yaml to remove the broken host as a mediawiki scap target
  • merge all the things, run puppet on the broken host first (if it's not too broken to run puppet) and then on the new production host (former testbed).
  • The sql/xmldumps process will run automatically on that host if it is not too late in the run (after the 14th for the full run, after the 25th for the partial run). If needed, start it by hand via systemctl (see the puppet manifest, you want 28 for maxjobs, and runtype one of "regular", "enwiki", "wikidatawiki" depending on which worker died and what it did).
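
A sketch of the puppet changes, using hypothetical host names snapshotNN (broken) and snapshotMM (testbed); the real names, roles and file names come from site.pp and hieradata:

  # in a checkout of the puppet repo
  $EDITOR manifests/site.pp                       # swap the two hosts' roles by hand
  mv hieradata/hosts/snapshotNN.yaml /tmp/snapshotNN.yaml
  mv hieradata/hosts/snapshotMM.yaml hieradata/hosts/snapshotNN.yaml
  mv /tmp/snapshotNN.yaml hieradata/hosts/snapshotMM.yaml
  $EDITOR hieradata/common/scap/dsh.yaml          # drop the broken host from the mediawiki targets
  git add -A && git commit -m "Swap broken snapshotNN with testbed snapshotMM"
  # after merging, on the broken host first and then the new production host:
  sudo puppet agent -t                            # or the local run-puppet-agent wrapper, if present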

Routine maintenance

Sometimes a host may need to be rebooted to pick up security updates.

A worker host should be rebooted when the dumps it runs are idle. For sql/xml dumps runners, this means between the 15th and 19th of the month for the full run, or the 25th to the end of the month for the partial run, except for the worker that handles wikidatawiki, which may not be available until the 17th or 18th of the month for the full run or the 26th of the month for the partial run. For the worker running the "other" dumps, there are windows on Saturday and Sunday during which a reboot can be scheduled without interrupting any job.

The fallback NFS share for sql/xml dumps runs a stats job once a dumps run has completed. Check to make sure it is not running before you reboot; grep for get_dump_stats.sh in the process list there.

The primary NFS share for sql/xml dumps runs has a filesystem mounted on worker hosts. The monitor job on one of those workers accesses the NFS share regularly, even in between dumps runs. Go to that worker, disable puppet, stop the monitor (systemctl stop dumps-monitor) and then do the NFS share reboot; once done, restart the monitor and verify that it seems ok, then re-enable puppet and run it just for good measure, making sure all looks normal.