Dumps/Phases of a dump run

Each dump of a wiki runs in several phases (jobs). If you open any completed wiki run from the main index page (http://dumps.wikimedia.org/backup-index.html), you'll see that wiki's jobs listed, most recently completed first, each with a brief explanation of what it does.

Backup jobs overview

  • dumps of various database tables, both private and public (via mysqldump; a sketch follows this list)
  • list of page titles in namespace 0, i.e. titles of articles (via mysqldump)
  • page abstracts for Yahoo (via dumpBackup.php with the ActiveAbstract/AbstractFilter.php filter)
    additional step to recombine chunks produced in parallel -- enwiki only
  • page stubs, gzipped (via dumpBackup.php; pass one of the two-pass sketch after this list)
    additional step to recombine chunks produced in parallel -- enwiki only
  • XML files with revision texts, bzipped (via dumpTextPass.php, fetchText.php; pass two of the same sketch)
    additional step to recombine chunks produced in parallel -- enwiki only
  • log of actions taken on pages (via dumpBackup.php)
  • 7z compression of the XML file with all revision texts, for full history only (see the recompression sketch after this list)
    additional step to recombine chunks produced in parallel -- enwiki only
  • rewrite of the XML file with revision text into a file of multiple bz2 streams, 100 pages per stream, for current revisions of articles only (via /usr/local/bin/recompressxml; the last sketch after this list shows how such a file is read)
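
For the table jobs, the underlying command is an ordinary mysqldump piped through gzip. A minimal sketch, in which the host, user, and table names are placeholders rather than the production values:

# dump one public table and compress it; credentials are prompted for
mysqldump -h db.example.org --user=dumpuser --password \
    enwiki user_groups | gzip > enwiki-user_groups.sql.gz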
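
The XML content dumps are produced in two passes: dumpBackup.php first writes a stub file holding page and revision metadata but no text, and dumpTextPass.php then reads the stubs back and fills in the revision text (using fetchText.php to pull text from storage). A hedged sketch of the two passes, assuming a standard MediaWiki checkout and with the file names invented for illustration:

# pass 1: stub XML (metadata only, no revision text), gzipped
php maintenance/dumpBackup.php --full --stub \
    --output=gzip:enwiki-stub-meta-history.xml.gz

# pass 2: reread the stubs and add the revision text, bzipped
php maintenance/dumpTextPass.php \
    --stub=gzip:enwiki-stub-meta-history.xml.gz \
    --output=bzip2:enwiki-pages-meta-history.xml.bz2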
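
The 7z files are recompressions of the bz2 full-history output rather than fresh dumps, and the multistream file packs pages into many small bz2 streams so that the companion index (one offset:pageid:title line per page) lets a reader decompress only the stream containing a given page. The exact recompressxml options are internal to the dumps setup; what follows is only a sketch of an equivalent recompression and of how the multistream index is used, with OFFSET and NEXT_OFFSET standing in for two consecutive offsets taken from the index:

# recompress the bz2 full history as 7z (7za reads the XML from stdin)
bzcat enwiki-pages-meta-history.xml.bz2 \
    | 7za a -si enwiki-pages-meta-history.xml.7z

# extract one bz2 stream (up to 100 pages) from a multistream file;
# bs=1 keeps the arithmetic simple but is slow for large offsets
dd if=enwiki-pages-articles-multistream.xml.bz2 bs=1 \
    skip=$OFFSET count=$((NEXT_OFFSET - OFFSET)) 2>/dev/null | bzcat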

Details

If the worker.py script is given a jobname it doesn't recognize, it will produce a list of all jobnames it knows about for the given wiki and the specified configuration. Use the --dryrun option so that it skips the prep work (e.g. cleaning up old dumps) that it would normally do before a run.

So for example,

python3 ./worker.py --job help --configfile wikidump.conf.dumps:en --dryrun enwiki

will produce a few lines of cruft you can ignore and then:

No job of the name specified exists. Choose one of the following:
noop (runs no job but rewrites md5sums file and resets latest links)
latestlinks (runs no job but resets latest links)
tables (includes all items below that end in 'table')
usertable 

...(snipped)...

metahistorybz2dump 
metahistory7zdump 
articlesmultistreamdump 

listing the jobs that it would run, in the order it would run them.
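
Given a valid jobname from that list, the same invocation runs just that job. For example (based on the options shown above, with --dryrun dropped so the job actually runs):

python3 ./worker.py --job usertable --configfile wikidump.conf.dumps:en enwiki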