Obsolete:Dumps/Development status 2011

This is a detailed explanation of ongoing tasks and their status.

As of the start of February 2011

Backups, availability

  • Files backed up to a local host: In progress
    Currently a copy of the XML dumps through November of 2010 is on tridge, our backup host. There is not room to copy over new files; this is an interim measure only.
    Dataset1 needs to be fixed; when it is functional again we will begin rsyncs of the XML files to it.
    Dataset1 does not see all 48 disks in any configuration that lets it reboot and continue to see them; see RT ticket #388 for the gory details. The next step is shipping it back and waiting for a fix from the vendor, which leaves us without anywhere to back up private wikis and private files in the interim. It also means we keep the public data on tridge much longer than anticipated, so we need to monitor space usage and make sure we don't impact the regular backups there. There isn't room for the newly generated XML dumps to be copied over there; they will have to go to Google only.
  • Files backed up to remote storage: In progress
    We have a copy of the latest complete dumps, from November 2010 or earlier, on Google Storage. This does not include private wikis or private files, so it is not a complete solution. Additionally, we expect to retain more copies of the XML files than we copy over to Google.
    We have been advised to use specific naming schemes for Google Storage "buckets" which cannot be preempted by other users; the files are being moved now to a bucket with this naming scheme. A script for doing the bimonthly copy from the XML dumps server is ready but needs to be updated with the new naming scheme; a sketch of this kind of copy job appears after this list.
    We need to get developer keys for each developer who might run the script; this process is also underway.
    When the new datacenter comes online, we will have a server there that hosts a current copy of all the files. That may not be for several months, however.
    We are looking into NAS solutions for data backups as well.
    We can and should copy files up to archive.org; we should contact folks there about an API for facilitating this. Not started yet.
  • Mirroring of the files: In progress
    We have had discussions with a couple of folks about possible mirrors; again, this would cover only the public files. Needs follow-up; see [1]. We have one contact doing a first rsync, which also needs follow-up.
  • Old dumps from community members: In progress
    We have some leads. Needs follow-up.
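
The copies described above are straightforward file transfers. The following is a minimal sketch of that kind of copy job, assuming hypothetical host names, paths, and bucket names and standard rsync and gsutil invocations; it is an illustration, not the actual production script.

    #!/usr/bin/env python
    # Sketch of a dump copy job: rsync the public XML dumps to a backup
    # host, then push the same tree to a Google Storage bucket with gsutil.
    # All host, path, and bucket names below are placeholders.
    import subprocess
    import sys

    DUMP_DIR = "/data/xmldatadumps/public"              # hypothetical source path
    BACKUP_TARGET = "backuphost.example.org::xmldumps"  # hypothetical rsync module
    GS_BUCKET = "gs://example-wikimedia-dumps"          # hypothetical bucket name

    def rsync_to_backup_host(source, target):
        """Copy the public dump tree to the local backup host."""
        return subprocess.call(["rsync", "-a", source + "/", target])

    def copy_to_google_storage(source, bucket):
        """Upload the same tree to Google Storage using gsutil."""
        return subprocess.call(["gsutil", "cp", "-R", source, bucket])

    if __name__ == "__main__":
        if rsync_to_backup_host(DUMP_DIR, BACKUP_TARGET) != 0:
            sys.exit("rsync to backup host failed")
        if copy_to_google_storage(DUMP_DIR, GS_BUCKET) != 0:
            sys.exit("copy to Google Storage failed")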

Speed

  • Dumps in batches of smaller wikis, larger wikis, enwikipedia in a third: Done.
    Smaller wikis run separately from larger wikis, and enwikipedia will run on its own host.
  • Parallel jobs for enwikipedia: In progress
    Works but needs tweaking; for the specific needs, see Obsolete:Dumps/Parallelization. A sketch of the page-range approach appears after this list.
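
The parallel run for enwikipedia works by splitting the dump into page-ID ranges and running one job per range. Below is a minimal sketch of that idea, assuming a made-up page count, chunk count, and a simplified dumpBackup.php invocation; the real design is described at Obsolete:Dumps/Parallelization.

    # Sketch of running dump jobs over page-ID ranges in parallel.
    # The page total, chunk count, and command line are placeholders.
    import subprocess
    from multiprocessing import Pool

    TOTAL_PAGES = 32000000   # hypothetical number of pages in the wiki
    NUM_CHUNKS = 16          # how many pieces to dump at once

    def page_ranges(total, chunks):
        """Yield (start, end) page-ID ranges covering 1..total."""
        size = total // chunks + 1
        for start in range(1, total + 1, size):
            yield start, min(start + size - 1, total)

    def run_chunk(bounds):
        """Run one dump job over the given page-ID range."""
        start, end = bounds
        cmd = ["php", "maintenance/dumpBackup.php", "--wiki=enwiki", "--full",
               "--start=%d" % start, "--end=%d" % end]
        return subprocess.call(cmd)

    if __name__ == "__main__":
        pool = Pool(NUM_CHUNKS)
        results = pool.map(run_chunk, list(page_ranges(TOTAL_PAGES, NUM_CHUNKS)))
        pool.close()
        pool.join()
        print("failed chunks: %d" % sum(1 for r in results if r != 0))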

Robustness

  • Ability to restart each phase of a wiki XML dump separately: Done.
  • Can run a single phase of one dump while a later dump of the same wiki runs: Done.
  • Can restart dump from a given stage and have it run through to completion: Done.
  • Can dump checkpoints of files with text content at arbitrary intervals: Done.
  • Can rerun a specified checkpoint file, or rerun from that point on: In progress
  • Detection of truncated files of text content (bz2 only): Done. (See the truncation-check sketch after this list.)
  • Safeguards against wrong or corrupt text in the XML files: In progress
    It would be best to have some sort of hash; currently we check against the revision length, which on some projects doesn't vary much, particularly for article stubs in a given topic area built from the same infoboxes and templates. (See the hash-check sketch after this list.)
  • Automated random spot checks of dump file content: Not started
  • Restore missing db rev text from older dumps where possible: Not started
  • Scheduled and regular testing of dumps before new MW code deployment: Not started
  • Stop after n failed dumps in a row: Done.
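
The truncation check for bz2 files amounts to making sure the compressed stream decompresses all the way to its end-of-stream marker. A minimal sketch of such a check is below; it is an illustration of the idea, not the production code.

    # Sketch of a truncation check for bz2 dump files: a file cut off
    # mid-stream raises EOFError before the end-of-stream marker is seen.
    import bz2
    import sys

    def bz2_is_complete(path, blocksize=10 * 1024 * 1024):
        """Return True if the bz2 file decompresses cleanly to its end."""
        try:
            with bz2.BZ2File(path, "rb") as infile:
                while infile.read(blocksize):
                    pass
        except (IOError, EOFError):
            return False
        return True

    if __name__ == "__main__":
        for name in sys.argv[1:]:
            print("%s: %s" % (name, "ok" if bz2_is_complete(name) else "TRUNCATED"))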
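
For the revision text safeguard, the current length check and a stronger hash check look roughly like the sketch below; the field names and inputs are illustrative, since the real check lives inside the dump code.

    # Sketch of a revision text sanity check: compare dumped text against
    # the stored revision length, and against a hash when one is available.
    import hashlib

    def text_looks_ok(text, expected_len, expected_sha1=None):
        """Return True if the dumped text matches the stored length and,
        when given, a stored SHA-1 of the text."""
        data = text.encode("utf-8")
        if len(data) != expected_len:
            return False
        if expected_sha1 is not None:
            return hashlib.sha1(data).hexdigest() == expected_sha1
        # Length alone is weak: stub articles built from the same
        # infoboxes and templates often come out to similar lengths.
        return True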

Configuration, running, monitoring

  • Alternate conf files, alternate lists of wikis: Done.
  • Per-wiki configuration of appropriate options: Done.
  • Start and stop runs cleanly via script: In progress
    We can restart a dump from a given stage now and have it run through to completion.
    We are working on the ability to restart the history phase of a dump from the middle after an interruption.
  • Stats for number of downloads, bandwidth, bot downloads: Not started
  • Automated notification when a run hangs: Not started (see the hang-check sketch after this list)
  • Packages for needed PHP modules etc., puppetization: In progress
    We need to update the "writeuptopageid" package; everything else other than the actual backup scripts is packaged and puppetized.
  • Docs and sample conf files for backup scripts: In progress
    A first draft of the sample conf file and documentation of the conf file options have been written.
  • Ability to add local notice to dump-specific index file: Done.
  • Automated stop of dumps after next complete job, for maintenance needs: Done.
  • Ability to add maintenance notice to main dumps generated index page: Done.
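
One simple approach to hang detection, sketched below, is to watch the age of a status file that a running dump keeps updating and mail an alert when it goes stale; the path, addresses, and threshold here are placeholders.

    # Sketch of a hang check: if the status file a running dump should be
    # updating has not changed for too long, send an alert by mail.
    import os
    import smtplib
    import time
    from email.mime.text import MIMEText

    STATUS_FILE = "/backups/public/enwiki/status.html"  # hypothetical path
    MAX_AGE = 6 * 3600                                  # seconds of silence allowed
    MAIL_FROM = "dumpsmonitor@example.org"              # hypothetical addresses
    MAIL_TO = "ops@example.org"

    def notify(message):
        """Send a plain-text alert through the local mail server."""
        msg = MIMEText(message)
        msg["Subject"] = "dump run appears to be hung"
        msg["From"] = MAIL_FROM
        msg["To"] = MAIL_TO
        server = smtplib.SMTP("localhost")
        server.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
        server.quit()

    if __name__ == "__main__":
        age = time.time() - os.path.getmtime(STATUS_FILE)
        if age > MAX_AGE:
            notify("%s not updated for %d seconds" % (STATUS_FILE, int(age)))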

Enhancement

  • Assign priorities to requests for new fields in dumps, implement: Not started
    See [2].
  • Accept or reject doing incremental dumps, implement: In progress
    We have a design for a proof of concept covering incrementals of added/changed content; it does not include deletes/undeletes. A sketch of the selection step appears after this list.
  • Dumps of image subsets: Not started
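
The proof-of-concept design for incrementals selects revisions added or changed since the previous run by remembering the highest revision ID already dumped. The sketch below shows that selection step, assuming a DB-API connection to the wiki database with MySQL-style parameter placeholders; deletes and undeletes are out of scope, as noted above.

    # Sketch of the add/change selection for incremental dumps: fetch all
    # revisions newer than the revision ID recorded at the end of the
    # previous run.  The caller supplies the database connection; the
    # query uses the standard MediaWiki revision columns.
    def added_changed_revisions(conn, last_max_rev_id):
        """Return (new_max_rev_id, rows) for revisions past the checkpoint."""
        cur = conn.cursor()
        cur.execute(
            "SELECT rev_id, rev_page, rev_timestamp"
            " FROM revision WHERE rev_id > %s ORDER BY rev_id",
            (last_max_rev_id,))
        rows = cur.fetchall()
        new_max = rows[-1][0] if rows else last_max_rev_id
        return new_max, rows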