Obsolete:Dumps/Dumps 2.0: Redesign

Transcription of notes from mid-October 2015.

(in progress.)

Some goals of the redesign

No guarantee that we'll adopt all of these but let's keep them in mind.

adding new formats is easy
storage is no longer a problem; it can be easily expanded without forking out large sums for new disk arrays
NFS is gone
any dump step can be stopped and resumed, or a broken step can be rerun from where it broke, easily and with minimal extra time for completion
adding new dump steps for a given wiki is easy
worker cluster doesn't waste cores on a run and is easily/cheaply expandable
download servers do just that; they are not used for writing dumps being generated, and it's easy to add/remove those too
we can easily add new types of notifications as new dump files become available (example: IRC, twitter, api for user checks)
there is one notion of a dump job and one generic black box that handles job queue, runner, worker cluster. NOT written by us, NOT maintained by us!
it is easy(ier) to add new metadata for publication, about any dump step (time started. size of uncompressed file. etc.)
we could run en wikipedia in a week if we wanted to invest in an appropriately sized worker cluster.
we spend a lot less time uncompressing, reading, writing and recompressing dump files.
dumps run in closer to O(changes) than O(all revisions).
someone else can set up, use, maintain, add to and curse at this besides me
provide estimates of when a particular file might be ready; alternatively, know that a given run will always start at around the same time and date every month and have estimate of how long it takes to complete
allow multiple dump steps for the same wiki and date to run at the same time as long as there are no dependencies between them
allow users to monitor via web (or also api?) just the dump steps and wikis that interest them
ZIM format dumps?
be able to easily handle requests for partial table dumps, where the table includes some private data, but we can't just partially dump all rows, because some items would be hidden/oversighted, "deleted" for the unprivileged MW user
notion of incrementals: example, datasets which would indicate: for each table, add these rows, remove these rows, update these rows. for each page dump, remove these pages, delete these pages, update these pages by appending these revisions. (needs discussion)

Any functionality that is removed in the redesign must have equivalent or better functionality added.

Current dump types

XML Dumps: these consist of sql table dumps, XML dumps, plus accompanying MD5/SHA1 sums. There are pending requests for partial table dumps of those tables that contain some private data; due to the fact that in large part any or all of the fields in the rows of these tables might be hidden by "deletion" or oversight or similar, these are not yet implemented.
Other Dumps:
1. Central Auth table by cron every so often
2. Wikidata dumps in json format
3. "adds/changes" dumps in xml format, weekly
Other requests: we often get requests to "please include namespace X in this part of the dumps of wiki Y", or "please include this table for wiki Z". Because it's difficult to alter the dump run for just one wiki, these requests are usually left pending.
Binary-format incrementals: untested for production, these use a special binary format which is a bit inconvenient to extend when new fields are introduced. Periodic fulls and regular incrementals would be produced.
Media tarballs: Not generated since the move to swift storage for media.
1. It has been suggested to allow users to create such tarballs on demand, but would we really support hundreds of users all picking up all images for all of en wikipedia at the same time? What burden would this put on the varnish caches, or alternatively, on the swift servers?
2. Another common suggestion is the generation of media bundles of default thumbnail sizes or of the sizes of images used on wiki pages.
HTML pages from RESTBase: these are dumped by an entirely different setup.

Those are the basic dump types and we should expect to add others. How do Flow pages get dumped? What about JSON format? Would we be better off generating something like the revisions table in mysql format, prepopulating columns with defaults when we don't have the live values, rather than XML? I'm leaving aside questions of "which compression" for now.

Dump Components

There are a few components for all of these dumps:

A. Generation

B. Storage

C. Service for downloaders, mirrors, labs, analytics, others

D. Display of current status

E. Notifications of newly available files

Let's look at the case of XML dumps for en wikipedia (a large and so complicated case), and what these four components look like now.

A. Generation

Generation of en wiki dumps is done on a non-standard server with 64GB ram and 4 8-core cpus. The files with content for all revisions are done in parallel by 27 processes. (This could probably be expanded to 31 cores but that's irrelevant to this discussion.)

Each process reads the corresponding stubs file (for example, stubs-meta-history25.xml.gz), and searches for revision content for each revision included in the stubs file, ordered by page. The revisions are first checked for in the revision content dump from the most recent dump, and if not found there they are requested from the db servers.

Because producing just one of these files takes several days, and the run can be interrupted for various reasons (a bad MW deploy, a db server depooled, etc.), we produce them in pieces, closing the current file after 12 hours (configurable) and opening a new file. All filenames contain the first and last page id dumped in the filename. This is all handled in Export.php which is referenced by a MW maintenance script for the dumps.

As the MW php maintenance script for the dumps runs, it writes and errors or warnings generated during the process, as well as progress messages that report the number of pages and revisions processed so far, some speed estimates and an ETA (wrong) of when the job will complete.

These 27 processes are started and watched by a python module designed to run various groups of commands (pipelines, series of pipelines, etc) in parallel, and pass back messages from stdout and stderr on an ongoing basis to a caller-provided callback. In our case the callback function writes the line of output to a status file in the current dump directory and also opdates the index.hmtl file in the dump directory with the message for that dump step.

Once all 27 processes have completed, and if we have good return codes from them all, we do a basic sanity check of the files (checking for the xml mediawiki footer at the end, or for the compression end of file signature for those compression types that support it).

We then generate md5/sha1 hashes for the produced files and update the status of this dump step as completed, via files in the dump directory.

B. Storage

Since older content dumps are needed to run some dump types and steps, we have a server with disk arrays NFS-exported and available to all hosts used for dumps generation. This means that little disk space is used on the generating host. It does mean we have all of the downsides of NFS.

C. Service to downloaders

The same server with disk arrays NFS-exported to dump generating hosts also provides service to downloaders and mirror sites, as well as rsyncs to labs.

D. Display of current status

A monitor process runs on one generating host, checking for lock files produced by each running dump script on the NFS mounted filesystem, cleaning up locks deemed stale (too old), and writing a new index.html file listing all wikis, once per minute.

The dump script running a given dump step updates the index.html page in the dump directory for that date and wiki at regular intervals, typically as it gets progress reports from the php maintenance script that actually produces the output.

E. Notification of newly available files

The dump script updates files in a directory "latest" containing rss info; it also links to the newly created file, replacing the date in the filename with the string "latest" for the link itself. The monitor, as we said above, updates the main index.html page so anyone checking that will see that a new file is available as well; additionally, the dump run status file (a plain text file easily parseable) is updated so that a client could check the contents of this file periodically.

Other moving pieces

Logs:
We keep logs of all jobs run for a dump. All logging info gets appended to a text file for that wiki's dump of that day in the "private" (non-downloadable) dumps directory.
Lock files:
In the distant past we did not want to be running two dumps of the same wiki at the same time; there was no notion of discrete rerunnable dump steps. It's still desirable not to run the same job at the same time for the same wiki and date, but we could run different jobs at the same time if we redid logging, hash file updating, and status updating. These lock files are touched once a minute by the dump script as it runs. If it dies, the monitor script, on whichever generating host it is running, will notice eventually that the lock file is old (> 60 minutes, configurable), and remove it, making the wiki available for another process to dump it.
Cleanup of old dumps, or of files from a broken dump step:
The dump script checks the number of dumps we want to keep (configurable), and removes any extras, oldest first, at the beginning of a run. Via NFS of course. Files from a broken dump step are retained until the job is rerun, and only replaced when the new file is created. In case no new file with the exact same name is created, the old file will not be removed. Files that may be left around are the 12-hour checkpoint files with first and last pageid in the name.
Fairness:
Small wikis take little time to run, big wikis may take days. In olden times small wikis would all get stuck waiting behind a few big wikis which were running. Subsequently we split small and big wikis into 2 queues, but now that's no longer necessary; see "Staged dumps" below.
Memory use:
Long running php scripts are likely to have memory leaks. We avoid dealing with this by running scripts for short periods of time and concating the output together. We currently do this for abstracts, stubs and page logs; page content is handled with a separate script respwaned on possible failure, which happens often enough that we never have a problem.
Running dumps over all the wikis:
A bash script calls the dump script in a loop. The dump script looks for the wiki which has waited the longest to be dumped, and runs it, then returning. Eventually all wikis will have been dumped and the dump script will start on a new dump of a previously dumped wiki. This is what we call "rolling" dumps, but this has been replaced by "Staged dumps", see below.
Running multiple wikis at the same time:
We run multiple bash scripts on each generating host.

Staged dumps

These days, instead of doing rolling dumps where each wiki's dump is run from beginning to end until it is complete, before going on to the next available wiki, we run them in "stages". This was a special request by folks who wanted the stubs to be generated all at once early in the month, so that statistics could be produced from them and available also early in the month.

Current stages we run are as folows: stubs, then tables for small wikis; stubs, then tables for bug wikis; content dumps for articles for small wikis, then for big wikis; content dumps for all revisions, for small and the big wikis; and so on. En wiki is run on its separate non=standard server but in stages, for consistency but for no other reason.

The "scheduler" for these stages reads a list of what tasks we want to run (bash script with job names and config file, which indicates the group of wikis to run on). Each command is set to be run a specified number of times, taking some number of cores, and the scheduler knows how many free cores are available at the start of the run (passed as an argument). It's not perfect, as some st stages depend on other stages having run, and by a bit of bad timing we might have fewer dump scripts running a specific stage later on than we would like, as others gave up waiting for their dependencies to show up. But it's been pretty reasonable.

This does mean another layer, however: scheduler - bash scripts - dump scripts - processes running in parallel - scripts running specific page ranges with concatenation of output into a compressor.

Rethinking the dumps

In rethinking the dumps, whether XML or other types, there is no reason we need to keep to the current model for generation. For example... we could have an object file store that can handle billions of objects, keeping each page with all its revisions in a single object, for a given wiki. A dump might consist of asking MediaWiki for a list of all pages which have been deleted and removing the corresponding objects; a list of all pages which have been updated, for each of these retrieving and deleting the object, retrieving and appending the new revisions for these pages to the object and storing it again; and a list of all pages with old revisions that have been hidden or otherwise been made unavailable to unprivileged users, and removing those revisions from the pages in a similar way. (This assumes that we do not request a list of page moves; a page move would be considered page creation (for the new location of the page) plus page update (if a redirect is left behind) or page deletion (if not). Finally, after getting and adding a list of new pages to the object store, we could then produce a dump in whatever format we wished, by exporting all the objects to an output formatter. Note that each of these steps should be done in bite-size pieces, any job running for no longer than an hour, so that if it fails we simply rerun it.

For a rethink of the dumps, then we need to look at:

Storage: instead of NFS, something cheaper, more scalable that could server as an object store
MW modifications: get lists of new/changed/deleted pages via the API, get lists of hidden revisions via the API (?), specify a time limit to Special:Export or other content retrieval means so that we can limit jobs to short running times
Service to downloaders: separate cluster of web servers which reads from... local disks? the object store directly? a varnish cache which can handle at least the surge when a new dump for a large wiki is released? what about media bundles?
Generation:
- dump manager which knows how to break down each dump, whether an sql table or an XML dump step or something else, into small jobs, and knows how to take the resulting output and combine it; it must be able to specify to the job runner the resources required for any given job, OR each job must take the same number of resources (1 cpu only). The dump manager adds all jobs to a job queue, marking those that are available to run (some jobs cannot run immediately because they are dependent on jobs from other steps completing first). As jobs are marked as completed by the job runner, the dump manager will collect all completed output for one dump step on one wiki and convert it to the appropriate sort of output file(s) with desired compression, appropriate size for downloaders, etc. It will then remove those jobs from the queue. Failed jobs may be added for rerun after some delay (one hour?) in case something temporary (MW code push, db server problem) was at issue, and at some point marked as permanently failed.
- job runner which runs jobs on a cluster of hosts (job output would go into an object store but this would not be handled by the job runner nor the hosts in the cluster but by the jobs themselves, i.e. in our case php scripts, if we were to continue using some existing MW code). It would read, and claim jobs from the job queue, mark them as done if they run successfully, mark retries or final failures, but not ever remove them.
- dump client which feeds lists of dump steps to run on various groups of wikis, in the order we want them run, to the dump manager. This client could also allow reshuffling or removing or adding dump steps on one or a group of wikis, (as long as the manager says they haven't already run), it could mark a dump step on one or all wikis as hold, or resume, or fix, or kill. again by passing the appropriate message to the dump manager.
- logging facility for each job as it is running, logging output and errors per job with metadata for the run of the dump step for that job attached to the log
Status monitor: can deliver the status of any job or any price or any dump when asked, can be used to generate HTML files for download service, RSS feeds, notification in an IRC channel, or anything else we like. This would be used by the Notifier to notify of newly available files, by jobs that update index.html files on the download hosts, etc.
Notifier of newly available files: could be written as independent scripts running to publicize however we like, with common input from the status monitor
Failure alerter: notify via email, icinga or other means of jobs that have failed permanently, including all metadata going with the dump step for that job

A diagram for discussion with a few walkthroughs

Here's a diagram with the possible building blocks, for discussion; arrows indicate direction of data flow. arrows into a data source indicate writes, out indicate reads.

Basic workflow

The client would read dblists and lists of dump steps to run on groups of wikis, in the order we wish them to run. It would generate a list of all dump steps in order on each wiki as appropriate, and pass this list on to the dump manager.

The dump manager would split each dump step into small jobs with limited page ranges and with a one hour time limit option to the job script (php maintenance script). It would add these jobs to the job queue, marking some for running and others as waiting, depending on which jobs need output from previous jobs in order to start.

The job runner and workers would be a block box, some sort of grid computing set-up, except that the job runner would claim jobs as running, mark retries and permanent failures, and mark successfully completed jobs. It is assumed that a successfully completed job would be a job that has an exit code of zero.

The jobs themselves, run by the workers, would log to the logging facility as they produce output. Any error messages would be reported there as well. All log entries would include metadata about the job being run: at least, which dump step on which wiki for which date it is a part of. They would write their output files to the object store, and retrieve any files needed for the run (stubs, older page content dumps) from the object store as well.

The dump manager would check the job queue; once jobs for a dump step are completed, it would read them, combine them as appropriate, format them as needed with the desired compression, and write the result back to the object store. It would them remove them from the job queue and log the result. Metadata for each dump step would also be produced at this time and stored in the object store, such as: hash sums, size of uncompressed data, cumulative run time, configuration options for the run (examples only!).

If jobs are marked as permanent failure, the dump manager might requeue after a delay, and after a second round of failures log the failure as permanent, leaving the job in the queue. These jobs would require manual intervention.

The status monitor would watch the logs and be able to report on request to the new files notifier and to the download service about any new files produced; to the download service about the status of any job(s) for a dump step for a wiki and date being run; and to the failure alerter about permanent failures.

The failure alerter might notify via mail and icinga and possibly other means of permanent failures so that humans can investigate the problem.

The new files notifier might produce rss feeds, announce in an irc channel, and perhaps by other means. This would predate the publication to the world by a few minutes, as the download service would not yet have a copy.

The download service would retrieve new files from the object store as it learns they are there, making them available to the world. It would update its index.html files accordingly. It might produce hash files for an entire wiki's dump run from the individual metadata files it will pick up from the object store for each dump step for that wiki and date, etc.

As this whole process is carried out, the client can be used to check via the dump manager what dump steps are running and which are waiting, to shuffle dump steps around or remove them from scheduling, etc. It could also display job status per running or failed dump step, and it could be used to mark any dump step on one or a group of wikis as hold, resume, fix (rerun broken bits), kill, cleanup (remove all output produced from broken run).

Walkthrough for XML stubs dump of en wiki

This presumes use of existing (with small modifications) php maintenance scripts. Those might be replaced by something very different! But this is an example.

Client submits "date XXXXXX, wiki enwiki, dump step stubs" to dump manager.
Dump manager gets max page id for en wiki from dbs, creates jobs in 100K page chunks. Each job is an invocation of dumpBackups.php with a one hour cutoff passed as an argument. These jobs are all added to the job queue.
Job runner reads the jobs from the queue and by magic schedules them on hosts in the worker cluster using all resources as appropriate. This is some grid computing thing that we didn't write and we don't maintain.
Php scripts write their output to the object store. These would be (probably) uncompressed files, with 1st and last page id in the filename, as well as the wiki name, the date of the urn, and the dump step. They write all either progress output to the logging facility.
The job runner collects exit codes and reruns jobs or not as needed, until are all done. It marks them as done in the job queue.
The dump manager checks, for any completed job, the page range of the actual output against the page range requested. If the one hour cutoff was reached instead, the job is removed and new jobs covering the remainder of the page range are queued each requesting approximately the number of pages actually output by the initial job; the initial job is requeued with the actual page range written, and marked as complete.
Eventually all jobs are done. The dump manager sees that all jobs are done, combines the three types of stub output from jobs into compressed files of a suitable size for download (2gb max?), with first and last page names in the file name of the newly combined files. It must also remove the MW XML header and footer from intermediate files at this stage. These output fles are added to the object store. The dump manager announces the result to the logging facility.
The dump manager adds metadata for each file to the object store as well (md5/sha1 hashes, total run time to produce, dump step + wiki + date, jobs run (?), uncompressed file size, etc). This too is announced to the logging facility.
Status notifier receives request from new file notifier to check; it responds with all information about new files (file names in object store plus metadata file names in obj store)
New file notifier announces completed production of new files soon to be available for download
Status file receives request from download service to check; it responds with new file and metadata file information as above.
Download service retrieves files and metadata, produces combined metadata files covering the entire run so far, and updates index.html files.
Downloaders download and a thousand mirrors are built. Profit?

Walkthrough for XML page content dumps (full history) of en wiki

Still using the existing php scripts with small modifications.

Client submits "date XXXXXX, wiki enwiki, dump step pages meta history" to dump manager.
Dump manager gets max page id for en wiki from dbs, creates jobs in 20K page chunks. Each job is an invocation of dumpBackups.php with a one hour cutoff passed as an argument. These jobs are all added to the job queue. Creation of these jobs entails locating the stubs that cover the desired page range. In fact the history stubs should be rewritten into 20K page chunks, since page content dumps simply read the stubs and request revision content for each page and revision listed in the stub file. This means the dumps manager would produce these stub chunks and add them as temporary files to the object store, and the jobs it produces would reference these files. The dump manager must also locate the most recent page content history dump for en wiki that covers each page range, and split these up too into temporary files that cover only 20K pages. If it does not split them up, each job will read through a much large bz2-zipped file before it gets to the start of the pages and revisions that it needs, and this wasted time will add up. (Can we do better?) These temp files would also go into the object store.
All other steps continue as above, but at the end...
Dump manager removes all temp files for this step from the object store, once the output and metadata files have been successfully added to the object store and been logged.