Obsolete:Data dump redesign (old)


This article is obsolete and is kept for historical purposes.

Database Dumps

The WMF provides public database XML and SQL dumps on a per-wiki basis at http://download.wikipedia.org/backup-index.html. These dumps effectively run every month or so per wiki, but over the last few years their reliability has not been consistent. In particular, the English-language Wikipedia dump has not completed successfully for so long that there is no longer any existing dump with every job intact. Having a publicly consumed service like this fail consistently is simply not acceptable to our community. Some stability and predictability needs to be added to the horror that is the dumps.

Current Uses

Many people consume these dumps, and their needs are being factored in. In their simplest form, the dumps are a backup mechanism for the WMF. For our external communities, the uses range from offline reader applications such as those used on the OLPC, to various statistical and analytical data mining, to a backup of last resort if anything should ever happen to the WMF projects.

Current Architecture

Database XML dumps of Today

The dump code is currently located in the Wikimedia SVN.

  • dumpBackup.php runs on a per-wiki basis and pulls an XML skeleton of both page metadata and revision metadata.
  • Upon successful completion, dumpTextPass.php is started, and it forks fetchRevision.php.
    • If there is a previous successful backup of the wiki, the old data dump is read and matched against the current metadata pull (see the prefetch sketch after this list).
      • If there is a match and there are no new revisions, we do not pull text from external storage.
      • Otherwise we pull all new data from external storage and the databases involved.
    • A lot of time passes, with little to no error checking beyond outright failure.
  • Progress gets recorded on a download status page.
    • If a job fails, the failure is recorded on the status page.
    • If we're lucky, the job finishes and everyone is happy.
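
The prefetch step above can be sketched roughly in Python. This is only an illustration of the idea, not the actual PHP code: index_previous_dump() and fetch_from_external_storage() are hypothetical names, and the real dumpTextPass.php streams the old dump as a prefetch source rather than indexing it in memory.

  import bz2
  import xml.etree.ElementTree as ET

  def index_previous_dump(path):
      """Map revision id -> text for every revision found in an old dump."""
      texts = {}
      with bz2.open(path, "rb") as f:
          for _, elem in ET.iterparse(f):
              if elem.tag.endswith("revision"):
                  rev_id = elem.findtext("{*}id")
                  texts[rev_id] = elem.findtext("{*}text") or ""
                  elem.clear()  # keep per-element memory small
      return texts

  def text_for_revision(rev_id, previous_texts, fetch_from_external_storage):
      """Reuse text from the previous dump when possible; otherwise hit ES."""
      if rev_id in previous_texts:
          return previous_texts[rev_id]           # no external storage access
      return fetch_from_external_storage(rev_id)  # new revision: pull from ES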

Problems

These are the main problems that we are looking to solve.

  • Dump process takes too long or doesn't finish at all
  • Files are too big
    • Hundreds of gigabytes for each dump
  • Corrupted output, even when a run reports success

Requirements

  • Regain the ability to run dumps and have them finish, ideally in one week or less on average and never more than two weeks.
  • Ability to retry jobs for any length of time in case of node failure
  • Do validation on both input and output so that we don't publish crap (see the validation sketch after this list).
  • Better efficiency in storage format.
  • Split the data in the dumps so that offline readers can consume it without needing the whole thing.
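
For the output side, validation can be as simple as streaming the finished dump back through an XML parser and checking a few counts before publishing. A minimal sketch, assuming a bzip2-compressed XML dump; the expected page and revision totals would come from whatever process produced the dump.

  import bz2
  import sys
  import xml.etree.ElementTree as ET

  def validate_dump(path):
      """Check that the dump is well-formed XML and report basic counts."""
      pages = revisions = 0
      try:
          with bz2.open(path, "rb") as f:
              for _, elem in ET.iterparse(f):
                  if elem.tag.endswith("revision"):
                      revisions += 1
                  elif elem.tag.endswith("page"):
                      pages += 1
                      elem.clear()  # free the finished page
      except (OSError, EOFError, ET.ParseError) as err:
          print("dump is corrupt: %s" % err, file=sys.stderr)
          return False
      print("%d pages, %d revisions" % (pages, revisions))
      return pages > 0 and revisions >= pages

  if __name__ == "__main__":
      sys.exit(0 if validate_dump(sys.argv[1]) else 1)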

Possible Solutions

Map Reduce

A MapReduce-based system divides the work into distinct chunks and processes them on individual nodes. This would have the advantage of farming jobs out to a managed cluster that can start, restart, and schedule sets of jobs.

Hadoop is one implementation of MapReduce; there are many others.

MapReduce Job System

Improvements

  • Job segmentation based on revision id, page id, or another unique identifier, to allow us to divide our data set. Ideally the chunks will come as ranges so that each job node can be given a specific and distinct data set to operate on (see the range sketch after this list).
  • Multiple job nodes (our job queue workers, for instance) can be set up to process the data.
    • Since each chunk is a distinct piece of work, any one unit of work that fails will not affect the success of the job as a whole, as long as it succeeds when re-submitted.
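
A minimal sketch of the range-based segmentation, assuming only that we can ask the database for the highest page id; the queue.submit() call in the usage comment is hypothetical.

  def page_id_ranges(max_page_id, chunk_size=100_000):
      """Yield half-open [start, end) page-id ranges that cover the whole wiki."""
      start = 1
      while start <= max_page_id:
          end = min(start + chunk_size, max_page_id + 1)
          yield (start, end)
          start = end

  # Hypothetical usage: one independent, re-submittable dump job per range.
  # for start, end in page_id_ranges(max_page_id=30_000_000):
  #     queue.submit("dump_pages", {"wiki": "enwiki", "start": start, "end": end})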

Cons

  • Adds complexity to the backup architecture and increases operational overhead
  • Because Hadoop uses a clustered file system, the network overhead is higher
    • And since we're talking XML, the data set is huge

Gearman

A distributed job system originally from LiveJournal.

Pros

  • Simple interface

Cons

  • The client is in charge of retries; the Gearman system will not restart anything on its own (see the retry sketch below).
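
Because Gearman leaves retries to the client, the dump controller would need its own retry loop around job submission. A rough sketch, with a hypothetical submit_job() that runs one chunk and raises on failure; the actual Gearman client API is not shown here.

  import time

  def run_with_retries(submit_job, payload, max_attempts=5, backoff_seconds=60):
      """Re-submit a failed chunk until it succeeds or we give up."""
      for attempt in range(1, max_attempts + 1):
          try:
              return submit_job(payload)
          except Exception as err:  # a real controller would catch narrower errors
              print("attempt %d failed: %s" % (attempt, err))
              if attempt == max_attempts:
                  raise
              time.sleep(backoff_seconds * attempt)  # simple linear backoff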

Follow up

  • Test install of Hadoop in the DC
  • Analyze throughput of each component a bit better
  • How big is our data set?
    • How big is the XML going over the wire within this system?
  • Costs of compression in both CPU time and disk speeds
    • Parallel bzip2
      • Validation is important here; can we *validate* the file in parallel? Decompression takes as long as compression with bzip2... (see the sketch after this list)
      • mw:Dbzip2
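
One way to parallelise validation, sketched below, assumes the dump is written as a multistream bzip2 file (as pbzip2 produces), with each compressed stream starting at a byte boundary. Candidate stream starts are found by scanning for the 10-byte stream header, and each stream is then decompressed in its own process. This is only a sketch; it ignores the tiny chance of the header pattern appearing by coincidence inside compressed data.

  import bz2
  import mmap
  import re
  from multiprocessing import Pool

  # "BZh" + compression level digit + first block magic (0x314159265359).
  STREAM_HEADER = re.compile(rb"BZh[1-9]\x31\x41\x59\x26\x53\x59")

  def check_stream(args):
      """Decompress one candidate stream; True if it is intact."""
      path, start, end = args
      with open(path, "rb") as f:
          f.seek(start)
          chunk = f.read(end - start)
      try:
          bz2.decompress(chunk)
          return True
      except (OSError, ValueError, EOFError):
          return False

  def validate_multistream(path, workers=8):
      with open(path, "rb") as f:
          with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
              offsets = [m.start() for m in STREAM_HEADER.finditer(mm)]
              offsets.append(mm.size())
      jobs = [(path, offsets[i], offsets[i + 1]) for i in range(len(offsets) - 1)]
      with Pool(workers) as pool:
          return len(jobs) > 0 and all(pool.map(check_stream, jobs))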

Notes

Notes from a conversation with Tim

  • 270M rows in enwiki.revision
  • I've said that a stopped slave could be better
  • so anyway, it's important to recognise that CPU might not be the limiting factor here; you need a lot of processes to load from ES, because you need to keep lots of requests queued up to the disks, to keep them busy and reduce seek overhead
    • but going from there to XML will take negligible CPU time
      • well, there will be decompression to do, but still, very small
  • then you want to recompress, that will be quite a bit more CPU-intensive
  • finally you have to write to the disk, and unless you have a very fast disk, that will be a bottleneck, so make sure you have the right sort of hardware lined up for that
  • What definitely *would* be much more user friendly is to use a compression scheme which allows random access, so that end users don't have to decompress everything all at once in the first place. -Anthony on foundation-l
    • Note that the OLPC reader uses a block-index technique for its bzip2 archive (see the reader sketch after these notes).
    • ZIP and .7z offer random access to sub-files which are individually compressed, but this requires a file dictionary -- potentially huuuuuge and slow to access if we have one sub-file per article.
  • Some universities (including ourselves) have offered storage capacity and some bandwidth to mirror the dumps and improve their availability, at no cost at all :).
    • Distribution is the easy part. :)
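
The block-index idea can be sketched as follows, assuming a multistream bzip2 dump in which each stream holds a batch of whole <page> elements, plus a hypothetical index file with one "offset:title" line per page. A reader looks up the offset for a title, decompresses just that one stream, and parses only its pages; the rest of the dump is never touched.

  import bz2
  import xml.etree.ElementTree as ET

  def lookup_offset(index_path, title):
      """Byte offset of the compressed stream holding `title` (hypothetical index format)."""
      with open(index_path, encoding="utf-8") as f:
          for line in f:
              offset, _, page_title = line.rstrip("\n").partition(":")
              if page_title == title:
                  return int(offset)
      raise KeyError(title)

  def read_article(dump_path, offset, title):
      """Decompress one stream starting at `offset` and pull out one article's text."""
      decomp = bz2.BZ2Decompressor()
      xml_bytes = bytearray()
      with open(dump_path, "rb") as f:
          f.seek(offset)
          while not decomp.eof:
              chunk = f.read(64 * 1024)
              if not chunk:
                  break
              xml_bytes.extend(decomp.decompress(chunk))
      # Each stream is assumed to hold whole <page> elements with no root,
      # so wrap them in a dummy root element to get well-formed XML.
      root = ET.fromstring(b"<pages>" + bytes(xml_bytes) + b"</pages>")
      for page in root.iter("page"):
          if page.findtext("title") == title:
              return page.findtext(".//text")
      raise KeyError(title)

  # Hypothetical usage:
  # text = read_article("enwiki.xml.bz2", lookup_offset("enwiki-index.txt", "Foo"), "Foo")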

External Consumers

Offline Readers