We now run the en wikipedia XML dumps using code that produces some files in small chunks via threads running at the same time.
Running the dumps for en wikipedia used to take 1.5 months for the history phase, two weeks for the recompression, another week or so for the tables and the abstracts, adding up to over two months total.
There are a lot of things that can go wrong when one of these jobs is run, and we have had them all: out of space on the server storing the dumps, hardware issue with the server that stores the dumps, code push of MediaWiki that breaks the backups, power failures in the data center, network outages or packet loss, corrupted data carried over form one run to the next...
There is one basic thing we can do to try to produce dumps on a regular frequent schedule, given the above. We can try to make them consist of small restartable steps, none of which take very long, and which all together do not take very long. The obvious way to make these run faster is to split up the task of writing out metadata for pages and/or the page contents, so that we write several files containing different ranges of pages, in parallel.
The new code allows us to run separate stages of the dumps as options on the command line. It updates the index file, the status file and the checksums so all of those things stay in sync. It won't run a job if the required previous steps didn't run successfully. We can restart a dump form a given job and have it proceed through all the remaining jobs as well.
The new code allows us to run each of the XML file producing stages so they produce multiple small files in parallel instead of one large file with just one process writing to it. The way this works is not at all fancy; we run several jobs that each request a range of page IDs and the relevant revisions for those pages; each set of pages is written to its own file. Recombining these files is not too bad for most stages; for the history file it's horribly slow. For the bzip2 file we could probably get a parallel bzip2 together that would make this reasonably fast. For the 7z version we are probably stuck with 2 weeks for the recombine.
Notes about speed
- Each time we run one of these we'll tinker a bit with the page ranges; the goal is to get close to the same number of revisions in each history chunk, so that they all complete at roughly the same time.
- There are some early revisions in en wikipedia that are missing from the external store due to some old bug. We try to retrieve these and fail; failure means trying 5 times, waiting 5 seconds in between each retrieval (in case the database server is overloaded or otherwise having issues). This means a delay in production of the chunks with more of these early revisions.
- We try to use revision text written to the previous dump files. We use the pieces from the previous dump, which, since we are moving the page ranges around, may mean we read through a bunch of pages before we get to the ones that interest us. The jobs do take multiple prefetch files and they select only those files that contain the relevant pages in them someplace.
- Some early revisions don't get prefetched; I need to check on this.
See bug 27110 for outstanding bugs/missing features. Also:
- Not seeing progress reports for the articlesdumprecombine step. (I thought I saw it working on earlier recombine steps.)
In the meantime those curious about the code can look at my svn branch, specifically at the xmldumps-backup directory. Expect the code to be crappy; that's why it's there and not in trunk yet.
Other things we need
- dbzip2 was a parallel bzip2 brion was working on. It sure would be nice to have it fully functional.
- How about a parallel 7za with similar features?
- There's the huge list of items on the Research Data Proposals page too.