Obsolete:Dumps/Development 2012

This is a detailed explanation of ongoing tasks and their status.

As of the start of January 2012

Backups, availability

  • Files backed up to a local host: done
    Currently a copy of the XML dumps through November of 2010 is on tridge, our backup host. There is not room to copy over new files; this is an interim measure only.
    Dataset1001 has an rsync of all public data. (Should rsync /data/private too, I guess.)
  • (Public) dump content backed up to remote storage: In progress
    Google:
    --We have a copy of the latest complete dumps from November 2010 or before copied over to Google Storage.
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --Note that these can be downloaded only by folks with Google accounts.
    --We have been advised to use specific naming schemes for Google storage "buckets" which cannot be preempted by other users; the files are being moved now to a bucket with this naming scheme. A script for doing the bimonthly copy from the xml dumps server is ready but will need to be updated with the new naming scheme (a rough sketch of such a copy appears after this list).
    --We need to get developer keys for each developer who might run the script; this process is also underway.
    Archive.org:
    --We have contacts there now to help shepherd things along.
    --Code is in progress to use their S3-ish API (a rough sketch of an upload against it also appears after this list).
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --There is a lab project that was set up by Hydriz to copy all dumps ever produced; we need to discuss this. See the labs project page.
  • Off site backups: Not started
    This means a full copy of dumps, page stats, and MediaWiki tarballs, but also all private data, to durable media stored off-site.
    Questions to be answered (need discussion): How often? Do we do incrementals? What third party location would hold these backups? What media would we use?
  • Mirroring of the files: In progress
    We have had discussions with a couple of folks about possible mirrors; again, this would cover only public files. Needs followup. More info: Dumps/Mirror status.
    We have three mirror sites; see [1].
  • Make old dumps from every six months or so (2002 through 2009) available: In progress
    2002, 2003, 2005, 2006 available for download.
  • Old dumps from community members: In progress
    We have some leads. Needs followup.
  • Files copied to gluster cluster for access to labs: Done
    The last 5 good dumps are available in gluster storage at /publicdata-project on labstore1. This copy is up to date and is accessible by any instance at /public/datasets.
  • Manage toolserver copies of dumps somehow: Not started
    Until recently everyone had their own copies of whatever dumps they wanted lying around, taking up lots of space and requiring extra downloads. They were discussing holding all dumps in one centralized location. Can we provide an rsync of the last 5 to them, or (ewww) make the gluster cluster available to them?
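
For the bimonthly Google Storage copy mentioned above, here is a minimal sketch of the idea, assuming gsutil is installed and configured with the developer keys; the bucket name and source directory below are placeholders, not the agreed naming scheme.

    import subprocess
    import sys

    # Placeholders only -- not the real bucket naming scheme or paths.
    BUCKET = 'gs://example-wikimedia-dumps'
    DUMP_ROOT = '/data/xmldatadumps/public'

    def copy_run(wiki, date):
        """Copy one wiki's completed dump run to Google Storage via gsutil."""
        src = '%s/%s/%s' % (DUMP_ROOT, wiki, date)
        dest = '%s/%s/%s/' % (BUCKET, wiki, date)
        # -R copies the run directory recursively.
        return subprocess.call(['gsutil', 'cp', '-R', src, dest])

    if __name__ == '__main__':
        wiki, date = sys.argv[1], sys.argv[2]
        sys.exit(copy_run(wiki, date))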
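
For the Archive.org side, a hedged sketch of an upload against their S3-compatible endpoint using the boto library; the item name here is a placeholder and the keys come from the account's S3 credentials.

    import boto
    from boto.s3.key import Key

    def upload_to_archive(access_key, secret_key, item_name, local_path, remote_name):
        """Upload one dump file into an existing archive.org item via the S3-ish API."""
        conn = boto.connect_s3(access_key, secret_key, host='s3.us.archive.org')
        bucket = conn.lookup(item_name)  # the item ("bucket") must already exist
        key = Key(bucket, remote_name)
        key.set_contents_from_filename(local_path)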

Speed

  • The German Wikipedia (de wp) dump takes too long. Folks using the dump aren't interested in parallel runs. Needs discussion.
  • Be able to skip irrelevant parts of prefetch files by locating the page id in the appropriate bz2 block (see the sketch below): In progress
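
A minimal sketch of that idea, assuming the prefetch file is written as a multistream bz2 (each block a separately compressed stream, as with the multistream pages-articles dumps) and that an index of (first page id in block, byte offset) pairs has already been built; the names here are illustrative, not the actual dump code.

    import bisect
    import bz2

    def block_for_page(prefetch_path, index, page_id):
        """Seek straight to the bz2 block that should contain page_id and
        decompress only that block, instead of streaming the whole file.

        index is a sorted list of (first_page_id_in_block, byte_offset) tuples."""
        first_ids = [first_id for first_id, _ in index]
        pos = bisect.bisect_right(first_ids, page_id) - 1
        if pos < 0:
            return None
        start = index[pos][1]
        end = index[pos + 1][1] if pos + 1 < len(index) else None
        with open(prefetch_path, 'rb') as stream:
            stream.seek(start)
            data = stream.read(end - start) if end is not None else stream.read()
        # BZ2Decompressor ignores any trailing bytes from the next stream.
        return bz2.BZ2Decompressor().decompress(data)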

Robustness

  • Can rerun a specified checkpoint file, or rerun from that point on: Done
  • Safeguards against wrong or corrupt text in the XML files: In progress
    Need to use the sha1 hash (rev_sha1), as soon as that code in core is deployed and the column is populated (see the sketch after this list).
  • Automated random spot checks of dump file content: Not started
  • Restore missing db rev text from older dumps where possible: Not started
  • Scheduled and regular testing of dumps before new MW code deployment: Not started
  • Test suite for dumps: In progress. We now have a contractor, yay! See Dumps/Testing.
  • Easy deployment of new python scripts while current jobs are running: In progress
    Need to finish migration to the new deployment setup and make sure the worker can exit gracefully on demand (see the sketch after this list).
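
For the sha1 safeguard: MediaWiki's rev_sha1 column holds the SHA-1 of the revision text as a base-36 number zero-padded to 31 characters, so a dump-side check might look roughly like the sketch below (the exact encoding should be confirmed against the deployed core code).

    import hashlib

    BASE36_DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz'

    def base36_sha1(text):
        """Base-36 SHA-1 of the revision text (raw UTF-8 bytes), zero-padded
        to 31 characters, mirroring how core populates rev_sha1."""
        value = int(hashlib.sha1(text).hexdigest(), 16)
        digits = ''
        while value:
            value, remainder = divmod(value, 36)
            digits = BASE36_DIGITS[remainder] + digits
        return digits.rjust(31, '0')

    def text_looks_ok(rev_text, rev_sha1):
        """Compare dumped revision text against the rev_sha1 value from the database."""
        return base36_sha1(rev_text) == rev_sha1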
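
For the "exit gracefully on demand" piece, one common pattern is to check for a flag file between jobs; this is only a sketch with a hypothetical file name, not the actual worker code.

    import os
    import sys

    EXIT_FLAG = '/var/run/dumps/exit.flag'  # hypothetical path

    def run_jobs(jobs):
        """Run dump jobs one at a time; if an operator has dropped the flag file,
        stop cleanly after the last completed job so that new scripts can be
        deployed without killing a job mid-write."""
        for job in jobs:
            if os.path.exists(EXIT_FLAG):
                sys.exit('exit requested; stopping after the last completed job')
            job.run()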

Configuration, running, monitoring

  • Make wikitech docs for dumps suck less: Done
  • Start and stop runs cleanly via script: Done
    We can restart a dump from a given stage now, or from any checkpoint, and have it run through to completion.
  • Stats for number of downloads, bandwidth, bot downloads: Not started
  • Automated notification when run hangs: Not started
  • Packages for needed php modules etc., puppetization: In progress
    The "writeuptopageid" package needs to be updated and mwbzutils needs to be packaged; everything else other than the actual backup scripts is packaged and puppetized.
  • Docs and sample conf files for backup scripts: Done

Enhancement

  • Assign priorities to requests for new fields in dumps, implement: Not started
    See [2].
  • Incremental dumps: In progress
    Adds/changes content dumps have been deployed; they do not include deletes/undeletes and are not yet robust.
  • Multistream bz2 dumps of pages-articles for all wikis, plus scripts to put them to use: In progress
    Running for enwiki; needs to be deployed for the rest. A demo using them is now available; see [3]. (A sketch of reading the multistream index appears after this list.)
  • Full image dumps: In progress
    copy of production media to server for rsync: done
    rsync to external mirrors setup: done
    generation of tarballs per wiki: first run happening now (a rough tarball sketch appears after this list)
    script to "http-sync" media once it's in Swift: in progress
    Old plans: Obsolete:Dumps/Image dumps plans 2012
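
To illustrate the multistream scripts mentioned above, here is a sketch of reading the pages-articles-multistream index (one offset:pageid:title line per page, bz2-compressed) to find the byte offsets of the stream holding a given title; the returned offsets can then be used just like the block seek sketched in the Speed section.

    import bz2

    def stream_offsets(index_path, wanted_title):
        """Return (start, end) byte offsets of the bz2 stream containing
        wanted_title, or None if the title is not in the index."""
        offsets = []
        target = None
        with bz2.BZ2File(index_path) as index:
            for raw_line in index:
                line = raw_line.decode('utf-8')
                offset, _pageid, title = line.rstrip('\n').split(':', 2)
                offset = int(offset)
                if not offsets or offsets[-1] != offset:
                    offsets.append(offset)
                if title == wanted_title:
                    target = offset
        if target is None:
            return None
        following = [o for o in offsets if o > target]
        return target, (following[0] if following else None)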
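
For the per-wiki tarball generation step, a very rough sketch using Python's tarfile module; the directory layout and naming below are placeholders, not the actual media storage layout.

    import os
    import tarfile

    MEDIA_ROOT = '/data/media'       # placeholder: local copy of production media
    TARBALL_ROOT = '/data/tarballs'  # placeholder: where per-wiki tarballs land

    def make_media_tarball(wiki, date):
        """Write a single tar of one wiki's media directory."""
        src = os.path.join(MEDIA_ROOT, wiki)
        dest = os.path.join(TARBALL_ROOT, '%s-%s-media.tar' % (wiki, date))
        with tarfile.open(dest, 'w') as tar:
            tar.add(src, arcname=wiki)
        return dest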