Dumps/History


(Please fill in interesting tidbits here.)

The first dumps of the projects that we still have lying around are from January 2001; Tim Starling turned them up during a perusal of the files on the old MediaWiki SourceForge site. At that time dumps consisted of tar-ing up the top-level directory, as far as I can tell by looking at the scripts from a January 2002 dump. This meant that you automatically got a copy of the images, the user pages, the current article versions, and all the scripts. This was when the projects were still using UseModWiki. The scripts from the Jan 2002 dump say "UseModWiki version 0.91 (February 12, 2001)".

Once MediaWiki became the platform, dumps were produced as SQL dumps of the various tables for a given project. In March of 2003 the en wikipedia dump (enwiki-cur.sql.gz) was just about 700 megabytes. We even have a description of how you dumped projects then. Those original backup scripts are still around on our bastion host in 2012! (Curious? Go to /home/wikipedia/bin-old/ and look at backup-all and the other backup-* scripts.) At that time, in order to do a dump, the database was taken offline for the duration of the dump (presumably to maintain the integrity of the output of the tables). For English Wikipedia this meant that the database was offline for an hour [1].

In mid-2005, with the adoption of MediaWiki 1.5, the storage format for text changed, so that plain SQL dumps were no longer feasible. Brion Vibber put together a new Python script using the MediaWiki export mechanisms to produce dumps. The first checkin I could find of WikiBackup.py is from January 2006.

In 2007 things started to get tough. There were really only two full-time developers (Brion and Tim), and the en wikipedia history dumps were getting bigger. The entire dump infrastructure was under stress; several hardware issues were encountered that year and in 2008. In January of 2008 there was a successful en wikipedia full history dump, but none after that for the remainder of the year.

In mid-2009 Tomasz took over the dumps, working on getting a server (dataset1) with much more storage. In the meantime Tim worked out the recompression of the revision texts on the production databases. Also during this time, locking of a table in order to dump it was turned off for good, because it was causing serious replication lag.

A full history run of en wikipedia from Jan 2010 was near completion when some sort of hiccup caused an interruption. After splitting up the bzip2 file into many, many blocks (with bzrecover, iirc) and rerunning the missing pieces, Tomasz was able to announce in March of 2010 the completion of the first set of en wikipedia dumps in over two years. It was later found that, due to a bug in the earlier recompression process, revision texts from January 1, 2005 through May 14, 2005 were missing [1], but these could be retrieved from a 2006 dump which was made available for download.

A second en wikipedia history dump completed in March 2010 [2] but it was discovered to be incomplete, missing about a third of the revisions.

At this point the en wikipedia history dumps took a hiatus again until September 2010, when the first parallel job run finished, producing 11 separate pieces of the bzip2-compressed revision history text. From January of 2011 through the middle of the year, full dumps succeeded on time (i.e. once a month) about half of the time. In April we got a beefy new server for the en wikipedia dumps [2], with 4 8-core CPUs and 64 GB of RAM, but we faced low-level kernel errors during the transition [3] and it didn't come into service until June [3]. The first run of the dumps on that host, split into 32 pieces, failed spectacularly, with over 2/3 of the history dump files truncated after the corresponding bzip2 process died [4]. After rerunning those, 16 at a time, the dumps were reconfigured to run with 27 chunks at a time, which turned out to be much more viable.

In November 2010 the server hosting the only copy of the dumps failed with kernel panics [4] and RAID errors [5]. The machine was down until December (see Dataset1#11-10-2010 - New_errors).

In September of 2011 the first run using checkpointed files for output completed successfully [6]; producing many small files means that if a problem is discovered with one file, only that one needs to be regenerated. That, 27 chunks at a time, and good hardware seemed to do the trick, because from August through December the en wikipedia dumps ran once a month, completing in an average of ten days or so.

In December of 2011 the first adds/changes dumps were produced [7]; these dump the new revision texts for each wiki over a 24-hour period. (Hmm, there seems to be a y2012 issue, better find out why they aren't running now.)

In January of 2012 the first "multiple stream" bzip2-compressed dumps of current pages were produced for en wikipedia [8]; these files consist of 100 pages per stream, each stream readable as a complete bzip2 file. Accompanying the file is an index which gives the page title and id of each article, along with the offset into the file where the stream containing it starts. These files might be useful to people producing offline readers or doing various types of analysis.
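To illustrate how the index can be used, here is a minimal Python sketch (not taken from any of the scripts mentioned above). It assumes colon-separated index lines of the form offset:page_id:title and uses placeholder filenames; it looks up a title, seeks to the start of the stream that contains it, and decompresses only that one bzip2 stream.

    import bz2

    # Hypothetical filenames; the real dump and index file names will differ.
    DUMP_PATH = "enwiki-pages-articles-multistream.xml.bz2"
    INDEX_PATH = "enwiki-pages-articles-multistream-index.txt"

    def find_stream_offset(title, index_path=INDEX_PATH):
        """Return the byte offset of the stream containing `title`, or None.

        Assumes each index line is colon-separated: offset:page_id:page_title.
        """
        with open(index_path, encoding="utf-8") as index:
            for line in index:
                offset, page_id, page_title = line.rstrip("\n").split(":", 2)
                if page_title == title:
                    return int(offset)
        return None

    def read_one_stream(offset, dump_path=DUMP_PATH):
        """Seek to `offset` and decompress a single bzip2 stream (roughly 100 pages)."""
        with open(dump_path, "rb") as dump:
            dump.seek(offset)
            decompressor = bz2.BZ2Decompressor()   # stops at the end of one stream
            pieces = []
            while not decompressor.eof:
                block = dump.read(64 * 1024)
                if not block:
                    break
                pieces.append(decompressor.decompress(block))
        return b"".join(pieces).decode("utf-8")

    offset = find_stream_offset("Some article title")
    if offset is not None:
        xml_text = read_one_stream(offset)   # XML for the ~100 pages in that stream
        print(xml_text[:500])

The point of the design is that a reader never has to decompress the whole file; it pays only for the one small stream containing the page it wants.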

For old development plans, see:

Notes

  1. FYI: comparison between enwiki-20100130-pages-meta-history.xml.7z and enwiki-20100312-pages-meta-history.xml.7z, xmldatadumps-admin-l, 2010-05-14
  2. Migrating storage2 to dataset1, xmldatadumps-admin-l, 2010-04-05
  3. Migrating storage2 to dataset1, xmldatadumps-admin-l, 2010-04-09
  4. New enwiki-Dump, xmldatadumps-admin-l, 2010-04-08