Dumps/Current Architecture
Overview
User-visible files appear at http://dumps.wikimedia.org/backup-index.html
Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps; one of the worker nodes also acts as the monitor. The workers read from and write to the dumps datastores, which export filesystems via NFS for dump file storage. Once these files are created, they are eventually rsynced over to fallback NFS shares and to the public dumps webservers.
Architecture
Full project dumps with all content history are run once a month, starting on the 1st of the month; dumps with only current content are run starting on the 20th of the month.
Dumps are run in "stages". The stages for the first run of the month look like this: stub XML files (files with all of the metadata for pages and revisions but without any page content) are dumped first. After that has been done for all small wikis, all tables are dumped. The same steps are then done for the 'big' wikis (see dblists.pp, look for the definition of "$bigwikis"). Then the current page content for articles is dumped, first for small wikis and then for big ones. And so on.
The stages for the second run of the month are identical to the above, but without the full page history content dumps.
A 'dump scheduler' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
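As a rough illustration of the ordering described above, the scheduler can be thought of as working through a list of (dump step, wiki group) pairs in order. This is only a sketch with invented names, not the actual stages configuration:

# Hypothetical stage ordering for the run of the 1st, following the prose above;
# the real stages list is configuration maintained elsewhere and uses other names.
FIRST_RUN_STAGES = [
    ("xmlstubs", "smallwikis"),
    ("tables", "smallwikis"),
    ("xmlstubs", "bigwikis"),
    ("tables", "bigwikis"),
    ("articlesdump", "smallwikis"),
    ("articlesdump", "bigwikis"),
    # ...and so on, ending with the full page history content steps,
    # which are omitted from the stages of the run of the 20th
]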
Worker nodes
There is a worker bash script which goes through the set of available wikis once for a given dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts up several copies of this script on each host, according to the number of free cores configured, and starts the script anew for the same or a later stage, as determined by the stages list.
For each wiki, the worker script simply runs the python script worker.py on that wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, worker.py locks the wiki before proceeding. Stale locks are eventually removed by the monitor script; if you try to run a dump of a wiki by hand while one is already in progress, you will see an error message to the effect that a dump of that wiki is already running, along with the name of the lock file.
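A minimal sketch of the locking idea, assuming one lock file per wiki in a private directory; the actual file names, contents, and error text used by worker.py may differ.

import os
import socket
import sys

def lock_wiki(lockdir, wiki):
    """Take out a per-wiki lock, or bail out if a dump is already running."""
    lockfile = os.path.join(lockdir, "lock_" + wiki)  # hypothetical naming
    try:
        # O_EXCL makes creation fail if another process already holds the lock
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("dump of %s already in progress, see lock file %s" % (wiki, lockfile))
    with os.fdopen(fd, "w") as lock:
        # record who holds the lock so a monitor can judge staleness later
        lock.write("%s %d\n" % (socket.getfqdn(), os.getpid()))
    return lockfile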
Monitor node
On one host, the monitor script runs a python script for each wiki that checks for and removes stale lock files from dump processes that have died, and updates the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html). That is its sole function.
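A sketch of such a sweep, assuming lock staleness is judged by file age and that each wiki's latest status is available as an HTML snippet; the real monitor script has its own layout, thresholds, and templates.

import glob
import os
import time

STALE_AGE = 3600  # assumed staleness cutoff in seconds; the real value may differ

def remove_stale_locks(lockdir):
    """Remove lock files left behind by dump processes that have died."""
    for lockfile in glob.glob(os.path.join(lockdir, "lock_*")):
        if time.time() - os.path.getmtime(lockfile) > STALE_AGE:
            os.unlink(lockfile)

def write_backup_index(statuses, htmlpath):
    """Rebuild the central status page from per-wiki status snippets."""
    lines = ["<html><body>"]
    lines.extend(statuses[wiki] for wiki in sorted(statuses))
    lines.append("</body></html>")
    with open(htmlpath, "w") as outfile:
        outfile.write("\n".join(lines))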
On the Dumps datastores
On these hosts, old dumps are cleaned up regularly, and there is a systemd timer job that notifies by email if any dumps job appears to be hung; see cleanup_old_xmldumps.py and job_watcher.sh in the dumps module of our puppet repo.
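As a rough sketch of the cleanup side, assuming each wiki keeps only its newest few run directories; the real cleanup_old_xmldumps.py reads its keep counts and paths from configuration.

import os
import shutil

def cleanup_old_dumps(dumpsdir, keep=3):
    """Remove all but the newest 'keep' dump run directories per wiki."""
    for wiki in os.listdir(dumpsdir):
        wikidir = os.path.join(dumpsdir, wiki)
        if not os.path.isdir(wikidir):
            continue
        # run directories are named YYYYMMDD, so a string sort is a date sort
        rundates = sorted(entry for entry in os.listdir(wikidir)
                          if entry.isdigit() and len(entry) == 8)
        for olddate in rundates[:-keep]:
            shutil.rmtree(os.path.join(wikidir, olddate))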
Code
Check /operations/dumps.git, branch 'master', for the python code in use. All dumps code run in production is in master, apart from a few scripts not directly related to dumps production, which live in our puppet repo.
Getting a copy:
git clone https://gerrit.wikimedia.org/r/operations/dumps.git
git checkout master
Getting a copy as a committer:
git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
git checkout master
Programs used
See also Dumps/Software dependencies.
The scripts call mysqldump, various MediaWiki maintenance scripts, and a maintenance script from the Flow extension for dumping Flow content. The dump of abstracts relies on the ActiveAbstract MediaWiki extension. Curl is used for retrieving namespace information.
The worker.py script relies on a few C programs for various XML and bz2 operations; these are in the mwbzutils Wikimedia package.
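For illustration, a stubs job ultimately boils down to running dumpBackup.php with a stub output sink. A minimal sketch, with the php and script paths invented for the example, and leaving out the page-range splitting and the mwbzutils-based bz2 handling that the real jobs do:

import subprocess

def dump_stubs(wiki, outfile,
               php="/usr/bin/php",                                   # assumed path
               script="/srv/mediawiki/maintenance/dumpBackup.php"):  # assumed path
    """Run dumpBackup.php in stub mode, writing gzipped output (illustrative only)."""
    command = [php, script, "--wiki=" + wiki, "--stub",
               "--output=gzip:" + outfile]
    subprocess.run(command, check=True)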
File layout
- <base>/
  - index.html - Information about the server, links to other datasets and so on
  - backup-index.html - List of all wiki databases and their last-touched status, sorted by date
  - backup-index-bydb.html - The same list sorted by wiki dbname
  - <db>/
    - <date>/
      - index.html - List of items in the dump of database <db> for run date <date>
Sites are identified by the database name, NOT by the host part of the URL used for viewing articles.
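For example, the English-language Wikipedia is served at en.wikipedia.org but its database name is enwiki, so its dumps live under <base>/enwiki/<date>/.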
ETOOMANYSCRIPTS
Currently systemd invokes fulldumps.sh, which invokes dumpscheduler.py, which starts the worker bash script, which runs worker.py, which may call xmlstubs or a similar job, which forks dumpBackup.php. Really, it's a bit ridiculous.
Just for completeness, here's the rundown on all those pieces. The name in parentheses is the repo where the script may be found.
- fulldumps.sh (puppet) -- checks to make sure that it's within the acceptable date range and then starts up the dump scheduler with the appropriate list of commands
- dumpscheduler.py (dumps) -- reads through a list of commands and runs several copies of each depending on local host resources, starting new ones as old ones complete. We use it to start copies of the worker bash script (a simplified sketch of this loop appears after this list)
- worker (dumps) -- bash script which loops through all wikis according to the specified config file, running all or a list of dump jobs via the worker.py script until all are complete or too many have failed
- worker.py (dumps) -- runs a dump of one or more jobs on a specified wiki, using the specified config file. For any given job it will fire off one or more processes to dump just that job.
- xmlstubs.py (dumps) -- runs pieces of stubs (metadata, no page content) and collects the output into one file. Runs dumpBackup.php to get the stub pieces.
- xmllogs.py (dumps) -- as xmlstubs but for page logs
- xmlabstracts.py (dumps) -- as xmlstubs but for abstracts (short paragraph for each page, rather than full content)
- dumpBackup.php (MW) -- dumps metadata about pages, page content, page log info
- dumpTextPass.php (MW) -- invoked by dumpBackup.php to retrieve page content
- AbstractFilter.php (MW ActiveAbstract extension) -- plugin for dumpBackup.php to dump abstracts
- dumpBackup.php (MW Flow extension) -- dumps Flow content pages
- getSlaveServer.php (MW) -- used to get a db host where dump queries can be run
- getConfiguration.php (MW) -- used to get various MW global values needed for the dumps
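To make the scheduler's role a bit more concrete, here is a minimal sketch of the sort of loop dumpscheduler.py performs, with the resource accounting reduced to a fixed number of free cores and each stage barrier-synchronized; the real scheduler is smarter about moving workers on to the same or later stages as slots free up, and all names here are invented for the example.

import subprocess

def run_stages(stage_commands, free_cores):
    """Very simplified scheduler loop: run several concurrent copies of each
    stage command (e.g. the worker bash script), one stage after another."""
    for command in stage_commands:
        workers = [subprocess.Popen(command, shell=True)
                   for _ in range(free_cores)]
        for proc in workers:
            proc.wait()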
This should be seriously cleaned up in the dumps rewrite.