Dumps/Current Architecture

From Wikitech

Overview

User-visible files appear at http://dumps.wikimedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps. Typically one of the worker nodes also acts as the monitor, since monitoring is a low-resource job.

Architecture

Full project dumps with all content history are run once a month, starting on the 1st or 2nd of the month; dumps with only current content are run near the end of the month.

Dumps are run in "stages". The stages for the first run of the month look like this: stub xml files (files with all of the metadata for pages and revisions but without any page content) are dumped first. After that's been done on all small wikis, all tables are dumped. This is then done for the 'big' wikis (see list here, look for the definition of "$bigwikis"). Then the current page content for articles is dumped, first for small wikis and then for big ones. And so on.

The stages for the second run of the month are identical to the above, but without the full page history content dumps.
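
The stage ordering described above can be pictured as an ordered list of (job, wiki group) pairs. The following is only a sketch: the job names, the group names and the placement of the full-history stages (the part covered by "And so on") are assumptions for illustration, not the production stages list.

```python
# Hypothetical stage list for the first run of the month.
# Job and group names are illustrative; the real stages list may differ.
FIRST_RUN_STAGES = [
    ("stubs", "small"),
    ("tables", "small"),
    ("stubs", "big"),
    ("tables", "big"),
    ("articles-current", "small"),
    ("articles-current", "big"),
    # Assumed position: the source only says "And so on" past this point.
    ("full-history", "small"),
    ("full-history", "big"),
]


def second_run_stages(stages):
    """Return the same stages minus the full page history content dumps."""
    return [(job, group) for job, group in stages if job != "full-history"]
```

Deriving the second run from the first keeps the two lists from drifting apart, which matches the statement that the runs are identical apart from the full-history dumps.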

A 'dump scheduler' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
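
As a rough illustration of what such a scheduler does, the sketch below runs a list of commands in order while keeping at most a fixed number running at once. It is not dumpscheduler.py: the function name `run_commands` and the `slots` parameter are hypothetical, and the real scheduler derives its limits from host resources rather than taking them as a plain argument.

```python
import subprocess
import time


def run_commands(commands, slots, poll_interval=0.1):
    """Run each command (an argv list) in order, at most `slots` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    pending = list(enumerate(commands))
    running = {}  # index in `commands` -> subprocess.Popen
    exit_codes = [None] * len(commands)
    while pending or running:
        # Fill any free slots with the next commands in the list.
        while pending and len(running) < slots:
            index, argv = pending.pop(0)
            running[index] = subprocess.Popen(argv)
        # Collect finished processes so that their slots free up.
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                exit_codes[index] = code
                del running[index]
        if running:
            time.sleep(poll_interval)
    return exit_codes
```

Because commands are started strictly in list order, putting earlier stages first in the list is enough to get the desired job order.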

Worker nodes

A worker script goes through the set of available wikis once for each dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts several of these scripts on each host, according to the number of free cores configured; as each one finishes, the scheduler starts the script again for the same or a later stage, as determined by the stages list.
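
The "longest without a dump first" ordering can be sketched as a simple sort. This is an illustration only: the function name, the use of epoch timestamps, and the choice to put never-dumped wikis at the front are assumptions, not a description of how the worker script stores or compares dump times.

```python
def order_by_staleness(last_dump_times):
    """Order wiki names so the one dumped longest ago comes first.

    `last_dump_times` maps a wiki name to the epoch time of its last dump,
    or to None if it has never been dumped (assumed to sort first).
    """
    def sort_key(wiki):
        stamp = last_dump_times[wiki]
        # False sorts before True, so never-dumped wikis lead the list.
        return (stamp is not None, stamp or 0.0)

    return sorted(last_dump_times, key=sort_key)
```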

For each wiki, the worker script simply runs the Python script worker.py on the given wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, the Python script locks the wiki before proceeding. Stale locks are eventually removed by a monitor script. If you try to run a dump of a wiki by hand when one is already in progress, you will see an error message to the effect that a wiki dump is already running, along with the name of the lock file.
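
One common way to implement this kind of per-wiki lock is to create the lock file atomically and fail if it already exists. The sketch below shows that technique; the lock file naming, the `WikiLockedError` class and both function names are hypothetical, and worker.py may well do this differently.

```python
import os


class WikiLockedError(Exception):
    """Raised when another process already holds the lock for a wiki."""


def lock_wiki(lock_dir, wiki):
    """Create the lock file for `wiki`, failing if it already exists."""
    path = os.path.join(lock_dir, wiki + ".lock")
    try:
        # O_CREAT | O_EXCL makes creation atomic: only one process can win.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise WikiLockedError(
            f"a dump of {wiki} is already running (lock file: {path})"
        )
    with os.fdopen(fd, "w") as lock:
        # Record the owner so a person inspecting the lock can find it.
        lock.write(f"{os.getpid()}\n")
    return path


def unlock_wiki(path):
    """Release a lock previously returned by lock_wiki()."""
    os.remove(path)
```

The error message includes the lock file name, mirroring the behaviour described above for manual runs.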

Monitor node

On one host, the monitor script runs a Python script for each wiki. This script checks for and removes stale lock files left by dump processes that have died, and updates the central index.html file, which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html). That is its sole function.
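
The stale-lock sweep can be sketched as follows. The source does not say how staleness is decided; this example assumes a lock is stale when its modification time is older than a configured age, which in turn assumes a live dump process refreshes its lock file periodically. The function name, the `.lock` suffix and the parameters are hypothetical.

```python
import os
import time


def remove_stale_locks(lock_dir, max_age_seconds, now=None):
    """Delete lock files in `lock_dir` older than `max_age_seconds`.

    Returns the names of the lock files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(lock_dir)):
        if not name.endswith(".lock"):
            continue  # leave unrelated files alone
        path = os.path.join(lock_dir, name)
        try:
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(name)
        except FileNotFoundError:
            # The owning process released the lock while we were looking.
            continue
    return removed
```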

Code

Check /operations/dumps.git, branch 'master' for the python code in use. Some tools are in the 'ariel' branch but all dumps code run in production is in master or, for a few scripts not directly related to dumps production, in our puppet repo.

Getting a copy:

git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git
cd dumps
git checkout master

Getting a copy as a committer:

git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
cd dumps
git checkout master

Programs used

See also Dumps/Software dependencies.

The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.

The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.

The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo in branch 'ariel'; see [1].

File layout

Sites are identified by raw database name currently. A 'friendly' name/hostname can be added for convenience of searching in future.

ETOOMANYSCRIPTS

Currently cron invokes fulldumps.sh, which invokes dumpscheduler.py, which calls worker, which runs worker.py, which may call xmlstubs or a similar job, which forks dumpBackup.php. Really it's a bit ridiculous.

Just for completeness here's the quick rundown on all those pieces. The name in parens is the repo where the script may be found.

  1. fulldumps.sh (puppet) -- checks to make sure that it's within the acceptable date range and then starts up the dump scheduler with the appropriate list of commands
  2. dumpscheduler.py (dumps) -- reads through a list of commands and runs several copies of each depending on local host resources, starting new ones as old ones complete. We use it to start copies of the worker bash script
  3. worker (dumps) -- bash script which loops through all wikis according to the specified config file, running all or a list of dump jobs via the worker.py script until all are complete or too many have failed
  4. worker.py (dumps) -- runs a dump of one or more jobs on a specified wiki, using the specified config file. For any given job it will fire off one or more processes to dump just that job.
  5. xmlstubs.py (dumps) -- runs pieces of stubs (metadata, no page content) and collects the output into one file. Runs dumpBackup.php to get the stub pieces.
  6. xmllogs.py (dumps) -- as xmlstubs but for page logs
  7. xmlabstracts.py (dumps) -- as xmlstubs but for abstracts (short paragraph for each page, rather than full content)
  8. dumpBackup.php (MW) -- dumps metadata about pages, page content, page log info
  9. dumpTextPass.php (MW) -- reads stub files and retrieves the page content for them; called directly by the dump scripts rather than by dumpBackup.php
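
The behaviour of the worker loop in item 3 (run every wiki until all are complete or too many have failed) can be sketched in Python, although the real script is written in bash. The names `run_until_done`, `dump_one` and `max_failures` are hypothetical, and the exact failure threshold used in production is not stated here.

```python
def run_until_done(wikis, dump_one, max_failures):
    """Dump each wiki in order, giving up once too many have failed.

    `dump_one(wiki)` should return True on success and False on failure.
    Returns (completed, failed) lists of wiki names.
    """
    completed, failed = [], []
    for wiki in wikis:
        if dump_one(wiki):
            completed.append(wiki)
        else:
            failed.append(wiki)
            if len(failed) >= max_failures:
                # Assume something systemic is wrong; stop this pass.
                break
    return completed, failed
```

Stopping after repeated failures avoids hammering every wiki when the underlying problem, such as an unavailable database, would affect them all.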

TODO: add the rest of the extensions that get called by regular dumps (Flow, Abstracts, etc)

This should be seriously cleaned up in the dumps rewrite.