Dumps/Current Architecture

Overview

User-visible files appear at http://dumps.wikimedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps; one of the worker nodes also acts as the monitor. The workers read from and write to dumps datastores, which export filesystems via NFS for dump file storage. Once dump files are created, they are eventually rsynced over to fallback NFS shares and to the public dumps webservers.

Architecture

Full project dumps with all content history are run once a month, starting on the 1st of the month; dumps containing only current content are run starting on the 20th of the month.

Dumps are run in "stages". The stages for the first run of the month look like this: stub XML files (files with all of the metadata for pages and revisions but without any page content) are dumped first. After that's been done for all the small wikis, all tables are dumped for them. This is then done for the 'big' wikis (see dblists.pp in our puppet repo; look for the definition of "$bigwikis"). Then the current page content for articles is dumped, first for the small wikis and then for the big ones. And so on.

The stages for the second run of the month are identical to the above, but without the full page history content dumps.
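
Purely to illustrate that ordering, here is a rough sketch of the first run as a Python list; the real stage lists live in the puppet-managed dump configuration and use their own format, and the job names below are only suggestive, not the actual job names.

# Illustrative only: rough ordering of the first run of the month.
FIRST_RUN_STAGES = [
    "stubs smallwikis",
    "tables smallwikis",
    "stubs bigwikis",
    "tables bigwikis",
    "articles smallwikis",
    "articles bigwikis",
    # ... remaining content jobs, ending with the full page history dumps
]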

A 'dump scheduler' manages the running of these stages on each available host, given the number of available cores and the order in which we want the jobs to run.
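
To make that concrete, here is a minimal sketch of the scheduling idea in Python; the arguments in the example call are made up, and the real dumpscheduler.py does considerably more bookkeeping (per-command counts, logging, failure handling).

# Minimal sketch only, not the real dumpscheduler.py: keep up to 'slots'
# commands running at once, starting the next one whenever a slot frees up.
import subprocess
import time

def run_scheduled(commands, slots):
    queue = list(commands)
    running = []
    while queue or running:
        # reap finished children
        running = [p for p in running if p.poll() is None]
        # top up to the configured number of slots
        while queue and len(running) < slots:
            running.append(subprocess.Popen(queue.pop(0), shell=True))
        time.sleep(1)

# e.g. run_scheduled(["bash ./worker <args here>"] * 4, slots=2)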

Worker nodes

There is a worker script which goes through the set of available wikis once for a given dump stage, starting with the wiki that has gone the longest without a dump. The dump scheduler starts up several copies of this script on each host, according to the number of free cores configured, starting it anew for the same or a later stage as determined by the stages list.

For each wiki, the worker script simply runs the Python script worker.py. To ensure that multiple workers don't try to dump the same wiki at the same time, the Python script locks the wiki before proceeding. Stale locks are eventually removed by a monitor script; if you try to run a dump of a wiki by hand while one is already in progress, you will see an error message to the effect that a dump of that wiki is already running, along with the name of the lock file.
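
The lock is just a file created atomically in a known place; here is a rough sketch of the idea in Python. The directory layout and file contents are illustrative, not what worker.py actually writes.

# Sketch of the per-wiki locking idea only; paths and contents are made up.
import os
import sys

def lock_wiki(lockdir, wiki):
    lockfile = os.path.join(lockdir, wiki + ".lock")
    try:
        # O_EXCL makes the create atomic: it fails if the lock already exists
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("dump of %s already in progress (lock file %s)" % (wiki, lockfile))
    os.write(fd, ("%s %d\n" % (os.uname().nodename, os.getpid())).encode())
    os.close(fd)
    return lockfile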

Monitor node

On one host, the monitor script runs a Python script for each wiki that checks for and removes stale lock files left by dump processes that have died, and updates the central index.html file showing the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html ). That is its sole function.
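
A "stale" lock here just means a lock file whose owning dump process has stopped making progress; a crude version of that check might look like the sketch below. The age threshold is made up, and as noted the real monitor also regenerates the central index.html.

# Crude sketch of the stale-lock check; the one-hour cutoff is arbitrary.
import os
import time

def remove_stale_lock(lockfile, max_age_secs=3600):
    try:
        age = time.time() - os.path.getmtime(lockfile)
    except OSError:
        return False    # lock already gone
    if age > max_age_secs:
        os.unlink(lockfile)
        return True
    return False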

On the Dumps datastores

On these hosts, old dumps are cleaned up regularly, and there is a systemd timer job that sends an email notification if any dump job appears to be hung; see cleanup_old_xmldumps.py and job_watcher.sh in the dumps module of our puppet repo.
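
The cleanup boils down to keeping only the most recent dump runs for each wiki. Roughly this idea, although cleanup_old_xmldumps.py is driven by per-wiki configuration rather than a single hardcoded count:

# Rough idea only: keep the newest 'keep' dated run directories for a wiki.
import os
import shutil

def prune_old_runs(wiki_dir, keep=10):
    # run directories are named by date (e.g. 20250301), so a string sort
    # puts them in chronological order
    runs = sorted(d for d in os.listdir(wiki_dir) if d.isdigit())
    for old in runs[:-keep]:
        shutil.rmtree(os.path.join(wiki_dir, old))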

Code

Check /operations/dumps.git, branch 'master', for the Python code in use. All dumps code run in production is in master, apart from a few scripts not directly related to dump production, which live in our puppet repo.

Getting a copy:

git clone https://gerrit.wikimedia.org/r/operations/dumps.git
git checkout master

Getting a copy as a committer:

git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
git checkout master

Programs used

See also Dumps/Software dependencies.

The scripts call mysqldump, various MediaWiki maintenance scripts, and a maintenance script from the Flow extension for dumping Flow content. The dump of abstracts relies on the ActiveAbstract MediaWiki extension. Curl is used for retrieving namespace information.
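
The namespace fetch is an ordinary MediaWiki siteinfo API query. Production does it with curl against the wiki's api.php; the equivalent request in plain Python, for illustration, looks like this:

# Standard MediaWiki siteinfo query; the dumps scripts do this with curl.
import json
from urllib.request import urlopen

def get_namespaces(api_url):
    url = api_url + "?action=query&meta=siteinfo&siprop=namespaces&format=json"
    with urlopen(url) as resp:
        data = json.load(resp)
    return data["query"]["namespaces"]

# e.g. get_namespaces("https://en.wikipedia.org/w/api.php")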

The worker.py script relies on a few C programs for various XML and bz2 operations; these are in the mwbzutils Wikimedia package.
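
Those tools are written in C for speed; purely to illustrate the kind of XML/bz2 streaming involved, here is the same sort of operation done (much more slowly) with Python's bz2 module:

# Not mwbzutils, just an illustration: stream-read a bz2-compressed stub
# file and count the <page> elements it contains.
import bz2

def count_pages(stub_path):
    count = 0
    with bz2.open(stub_path, "rt", encoding="utf-8") as stream:
        for line in stream:
            if "<page>" in line:
                count += 1
    return count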

File layout

Sites are identified by the database name, NOT by the host part of the URL used for viewing articles (e.g. enwiki rather than en.wikipedia.org).

ETOOMANYSCRIPTS

Currently systemd invokes fulldumps.sh, which invokes dumpscheduler.py, which calls worker, which runs worker.py, which may call xmlstubs or a similar job, which forks dumpBackup.php. Really it's a bit ridiculous.

Just for completeness, here's the rundown on all those pieces. The name in parentheses is the repo where the script may be found; a sketch of the bottom of this chain follows the list.

  1. fulldumps.sh (puppet) -- checks that the current date is within the acceptable range and then starts up the dump scheduler with the appropriate list of commands
  2. dumpscheduler.py (dumps) -- reads through a list of commands and runs several copies of each depending on local host resources, starting new ones as old ones complete. We use it to start copies of the worker bash script
  3. worker (dumps) -- bash script which loops through all wikis according to the specified config file, running all or a list of dump jobs via the worker.py script until all are complete or too many have failed
  4. worker.py (dumps) -- runs one or more dump jobs on a specified wiki, using the specified config file. For any given job it will fire off one or more processes to dump just that job.
  5. xmlstubs.py (dumps) -- runs pieces of stubs (metadata, no page content) and collects the output into one file. Runs dumpBackup.php to get the stub pieces.
  6. xmllogs.py (dumps) -- as xmlstubs but for page logs
  7. xmlabstracts.py (dumps) -- as xmlstubs but for abstracts (a short paragraph for each page, rather than full content)
  8. dumpBackup.php (MW) -- dumps metadata about pages, page content, page log info
  9. dumpTextPass.php (MW) -- invoked by dumpBackup.php to retrieve page content
  10. AbstractFilter.php (MW ActiveAbstract extension) -- plugin for dumpBackup.php to dump abstracts
  11. dumpBackup.php (MW Flow extension) -- dumps Flow content pages
  12. getSlaveServer.php (MW) -- used to get a db host where dump queries can be run
  13. getConfiguration.php (MW) -- used to get various MW global values needed for the dumps
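
As promised above, here is a hugely simplified sketch of the bottom of that chain: a stubs-style job shelling out to dumpBackup.php and compressing the output. The PHP path and flags are illustrative only; the real xmlstubs.py builds its command lines from the dump configuration and runs the dump in page-range pieces, recombining the output afterwards.

# Illustrative only: run dumpBackup.php for one page range and gzip the result.
import subprocess

def dump_stub_range(wiki, start, end, outfile):
    cmd = [
        "php", "/srv/mediawiki/maintenance/dumpBackup.php",   # path is made up
        "--wiki=" + wiki, "--stub", "--full",
        "--start=" + str(start), "--end=" + str(end),         # flags illustrative
    ]
    with open(outfile, "wb") as out:
        gzip_proc = subprocess.Popen(["gzip"], stdin=subprocess.PIPE, stdout=out)
        subprocess.check_call(cmd, stdout=gzip_proc.stdin)
        gzip_proc.stdin.close()
        gzip_proc.wait()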

This should be seriously cleaned up in the dumps rewrite.