Dumps/SQL-XML Dumps


We want mirrors! For more information see Dumps/Mirror status.

Docs for end-users of the xml/sql dumps can be found on meta. If you're a Toolforge user and want to use the dumps, check out Help:Shared storage for information on where to find the files.

Current Info

Older Info

  • For information about the initiative to upload these dumps to the Internet Archive, see the Nova Resource:Dumps project.
  • For historical information about the dumps, see Dumps/History.

Hodge Podge

For a list of various information sources about the dumps, see Dumps/Other information sources.


The following info is for folks who hack on, maintain and administer the dumps and the dump servers.

Setup

Current architecture

Rather than bore you with that here, see Dumps/Current Architecture.

Current hosts

For which hosts are serving data, see Dumps/Dump servers. For which hosts are generating dumps, see Dumps/Snapshot hosts. For which hosts are providing space via NFS for the generated dumps, see Dumps/Dumpsdata hosts.

Adding a new snapshot host

Install the new host and add it to site.pp in the snapshot stanza (see snapshot1005-9 for examples). Add the relevant hiera entries, documented in site.pp, according to whether the server will run the enwiki or wikidatawiki xml/sql dumps (only one server should do so for each of these huge wikis) or the misc cron jobs (one host should do so, and it should not also run xml/sql dumps).

Dumps run out of /srv/deployment/dumps/dumps/xmldumps-backup on each server. Deployment is done via scap3 from the deployment server.

Starting dump runs

  1. Do nothing. The dumps run automatically via a systemd timer job.
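If you want to confirm that the timers are in place and scheduled on a snapshot host, you can ask systemd directly. A minimal check, assuming standard systemd tooling (the grep pattern and the placeholder unit name are illustrative):

# list timers whose unit names mention dumps, with their last and next run times
systemctl list-timers --all | grep -i dump
# show recent log output for one of the corresponding services (substitute the real unit name)
journalctl -u <dump-service-unit> --since yesterday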

Deploying code updates

The python dump scripts are all in the operations/dumps.git repo, branch 'master'. Various supporting scripts that are not part of the dumps proper live in puppet; you can find them in the snapshot module.

The python dump scripts rely on a number of C utilities for manipulating MediaWiki xml files and/or bzip2-compressed files. These can be found in the operations/dumps/mwbzutils repo.

Getting a copy of the python scripts as a committer:

git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
git checkout master

ssh to the deployment host.

  1. cd /srv/deployment/dumps/dumps
  2. git pull
  3. scap deploy

Note: you likely need to be in the ops ldap group to do the scap.

Also note that pushed changes will not take effect until the next dump run; any run currently in progress uses the existing dump code until it completes.

If necessary, you can trigger dumps maintenance mode on one or several workers, wait for the dump jobs in progress to finish (anywhere from several hours to a day), and then put them back into service; when the dump scheduler starts again in the morning or evening UTC, it will use the new code.

Dumps maintenance mode is triggered per worker by touching the file "/srv/deployment/dumps/dumps/xmldumps-backup/maintenance.txt". This is specific to the version of the dumps scripts currently running; deploying and running a new version will bring the worker host out of maintenance mode automatically, because the maintenance.txt file will be in the previous version's base directory rather than the current one's.
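As a concrete sketch, using the path given above (removing the file again is how you would put the worker back into service, assuming no new version has been deployed in the meantime):

# put this worker into dumps maintenance mode
touch /srv/deployment/dumps/dumps/xmldumps-backup/maintenance.txt
# later, when the worker should pick up dump jobs again
rm /srv/deployment/dumps/dumps/xmldumps-backup/maintenance.txt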

Updating configuration files

Configuration file setup is handled in the snapshot puppet module; see [1] and [2]. You can look at the config files themselves in /etc/dumps/confs on any snapshot host to see the resulting values.

We use a single configuration file in production for all sql/xml dump runs, but within that file there are sections with values for specific large wikis, which override the values specified earlier. Example: enwiki, wikidatawiki, and bigwikis all have their own section. When the dumps script is run, the section name, if any, is specified along with the configuration file path.
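As a rough illustration only, the file is INI-style: general values come first and a per-wiki section overrides them when that section name is passed at run time. The section and key names below are made up for the example and are not the production settings:

# values used for most wikis
[wiki]
pagesPerChunkHistory=50000

# overrides that take effect when the enwiki section is selected
[enwiki]
pagesPerChunkHistory=10000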

The adds-changes dumps have their own configuration file, as do the "other" dumps.

Deployment-prep has its own configuration files for dumps, which follow the same structure as the production files.

Rerunning dumps

You really really don't want to do this. These jobs run via systemd timers. All by themselves. Trust me. Once the underlying problem (bad MW code, unhappy db server, out of space, etc) is fixed, it will get taken care of.

Okay, you don't trust me, or something's really broken. See Dumps/Rerunning a job if you absolutely have to rerun a wiki/job.

NFS share and/or web server issues

Out of space (Low)

See Dumps/Dump servers#Space issues if we are running out of space on the dumps web or rsync servers.

A dumpsdata host dies (Unlikely)

Coming soon... but in the meantime, see Dumps/XML-SQL Dumps/Swapping NFS servers, which explains the steps for swapping the primary and fallback xml/sql NFS servers while both are operational.

A dumpsdata host has NFS issues (Unlikely)

Maybe Icinga alerted, or maybe you noticed that the dumps snapshot hosts have extra high load and that there are NFS timeouts in their syslogs. First, check the obvious: is the array full? Is the box so loaded that something OOMed? Is there anything bizarre in the syslog or other logs?
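A few quick checks cover the obvious cases; none of this is dumps-specific:

df -h                                 # is the array or any other filesystem full?
dmesg -T | grep -i 'out of memory'    # did the OOM killer fire?
journalctl -p err --since today       # anything bizarre in the recent logs?
pgrep -a nfsd                         # is nfsd still running?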

Assuming you see nothing unusual, and nfsd is still running:

We do a lot of disk I/O on these NFS-mounted filesystems; multiple dumps jobs running in parallel on multiple hosts, plus an rsync to copy off data to the fallback dumpsdata host and the labstore boxes, could be more than the disks can handle. Check disk utilization and IOPs and see what's going on. Narrow spikes of 100% utilization are normal, but no more than that. If that's the problem, check if there was an rsync going when the alert was triggered; if so, you can try being more aggressive with rsync bandwidth caps (look for the BWLIMIT setting).
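For a quick look at what the disks are doing and whether a capped rsync is in flight, something like the following is enough; this assumes sysstat's iostat is installed on the host:

iostat -x 5        # per-device utilization and IOPS, refreshed every 5 seconds
pgrep -af rsync    # any rsync running? its command line will show the --bwlimit value, if one was set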

We applied a class in the past to adjust vm.min_free_kbytes for hosts with 16GB RAM that were also providing web service at the time. These settings have not been altered for the current dumpsdata hosts, which have 32GB RAM and do only NFS for dumps generation plus rsync to internal peers; perhaps they should be. There's an open ticket for the dumps public-facing servers ([3]) but not for the dumpsdata boxes.
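If you want to inspect or experiment with that setting on a dumpsdata host, it is an ordinary sysctl; the value below is only an example, not a recommendation:

sysctl vm.min_free_kbytes                   # show the current value (in kB)
sudo sysctl -w vm.min_free_kbytes=262144    # raise it for this boot only; puppetize the change if it helps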

Some of the past history of issues on the old dump NFS servers can be found in this phab task.

Note that NFS cache use has been a problem in the past with data consistency, so we have actimeo=0 on the clients. (See Phab task.) This could be revisited.
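actimeo=0 disables NFS attribute caching entirely, so clients always revalidate attributes with the server. To confirm a client really has it, look at the live mount options; note that the kernel may report it as the four ac* timeouts all set to 0:

# look for actimeo=0 or acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 among the options
grep dumps /proc/mounts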

A labstore host dies (web or nfs server for dumps) (Unlikely)

These are managed by Wikimedia Cloud Services. Should this situation arise, someone on that team should carry out the procedure below.

At the time of writing there are two labstore boxes that we care about: one serves web to the public plus NFS to the stats hosts; the other serves NFS to Cloud VPS instances/Toolforge.

  • Determine which box went down. You can look at hieradata/common.yaml and the values for dumps_dist_active_web, dumps_dist_nfs_servers, and dumps_dist_active_vps for this.
  • Remove the host from dumps_dist_nfs_servers.
  • Change dumps_dist_active_vps to the other server, if the dead server was the vps NFS server.
  • Change dumps_dist_active_web to the other server, if the dead server was NOT the vps NFS server (this means it was the stats NFS server, which is all that this setting controls).
  • Forcibly unmount the NFS mount for the dead host everywhere you can in Toolforge. Try Cumin first; if that fails, try clush for Toolforge. See #Notes on NFS issues and Toolforge load for more about this.
    • Hint: If using clush under pressure, try:
      clush -w @all 'sudo umount -fl /mnt/nfs/dumps-[FQDN of down server]'
      on tools-clushmaster-02.tools.eqiad.wmflabs
  • If the dead server was the web server:
    • The certificate should be active on both hosts, so that shouldn't be a problem thanks to the acme_chief module, but you still need to change the profile::dumps::distribution::web::is_primary_server hiera value on each host after you change DNS.
      • By all means still check!! From a shell as your user account on the host you can run
        echo | openssl s_client -showcerts  -connect localhost:443 2>/dev/null | openssl x509 -inform pem -noout -text
        on both servers to see the certs. Check the "Validity" section.
    • Change the 'dumps' entry in operations/dns.git's templates/wikimedia.org, and deploy to gdns according to DNS#authdns-update
    • Once that change has had some time to propagate (check the TTL; see the dig example after this list), test that the remaining server picked up the cert and is serving correctly (checking https://dumps.wikimedia.org should work). Trying puppet runs on the working server might be helpful here.
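To watch the DNS change propagate, a plain dig from outside the cluster is enough; the answer section shows the remaining TTL and the address the name currently resolves to:

dig +noall +answer dumps.wikimedia.org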
Notes on NFS issues and Toolforge load

This may no longer be true: WMCS believes we have found and corrected the root cause of the rising load issue when dumps servers are offline. This has yet to be tested, though.

Both hosts' NFS filesystems are mounted on all hosts that use either server for NFS, and the clients determine which nfs filesystem to use based on a symlink that varies from cluster to cluster. The dumps_dist_active_web setting only affects the symlink to the NFS filesystem on the stats hosts. Likewise, the dumps_dist_active_vps only affects the symlink to NFS filesystem on the VPSes (including Toolforge).
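On a client you can see both mounts and which one is currently selected. The mount paths follow the /mnt/nfs/dumps-<FQDN> pattern from the clush hint above, but the symlink path below is purely illustrative, since it varies from cluster to cluster:

mount | grep dumps           # both labstore NFS filesystems should show up here
readlink -f /mnt/nfs/dumps   # illustrative symlink name; shows which server this client is actually using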

If the server is the vps NFS server (the value of dumps_dist_active_vps), Toolforge is probably losing its mind by now. The best that can be done is to remove it from dumps_dist_nfs_servers, change dumps_dist_active_vps to the working server, and unmount that NFS share everywhere you possibly can. The earlier this is done, the better. Load will be climbing like mad on any Cloud VPS server, including Toolforge nodes, the entire time. This may or may not stop once you have unmounted everything.