Obsolete:Dumps/Development 2012

This is a detailed explanation of ongoing tasks and their status.

As of the start of January 2012

Backups, availability

  • Files backed up to a local host: done
    Currently a copy of the XML dumps through November of 2010 is on tridge, our backup host. There is not room to copy over new files; this is an interim measure only.
    Dataset1001 has an rsync of all public data. (Should rsync /data/private too, I guess.)
  • (Public) dump content backed up to remote storage: In progress
    Google:
    --We have a copy of the latest complete dumps from November 2010 or before copied over to Google Storage.
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --Note that these can be downloaded only by folks with Google accounts.
    --We have been advised to use specific naming schemes for Google storage "buckets" which cannot be preempted by other users; the files are being moved now to a bucket with this naming scheme. A script for doing the bimonthly copy from the xml dumps server is ready but will need to be updated with the new naming scheme (a rough sketch of such a copy appears after this list).
    --We need to get developer keys for each developer who might run the script; this process is also underway.
    Archive.org:
    --We have contacts there now to help shepherd things along.
    --Code is in progress to use their S3-ish API (a rough sketch of an upload against it also appears after this list).
    --We don't expect to copy all dumps produced but only a selection, probably a full run at six-month intervals.
    --There is a lab project that was set up by Hydriz to copy all dumps ever produced; we need to discuss this. See the labs project page.
  • Off site backups: Not started
    This means a full copy of dumps, page stats, and MediaWiki tarballs, but also all private data, to durable media stored off-site.
    Questions to be answered (need discussion): How often? Do we do incrementals? What third party location would hold these backups? What media would we use?
  • Mirroring of the files: In progress
    We have had discussions with a couple of folks about possible mirrors; again, this would cover only public files. Needs followup. More info: Dumps/Mirror status.
    We have three mirror sites; see [1].
  • Make old dumps from every six months or so (2002 through 2009) available: In progress
    2002, 2003, 2005, 2006 available for download.
  • Old dumps from community members: In progress
    We have some leads. Needs followup.
  • Files copied to gluster cluster for access to labs: Done
    The last 5 good dumps are available in gluster storage at /publicdata-project on labstore1. This copy is up to date and is accessible by any instance at /public/datasets.
  • Manage toolserver copies of dumps somehow: Not started
    Until recently everyone had their own copies of whatever dumps they wanted lying around, taking up lots of space and requiring extra downloads. They were discussing holding all dumps in one centralized location. Can we provide an rsync of the last 5 to them, or (ewww) make the gluster cluster available to them?
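
For the bimonthly Google Storage copy mentioned above, here is a minimal sketch of the idea, assuming gsutil is installed and configured with the developer keys; the bucket name and source directory below are placeholders, not the agreed naming scheme.

    import subprocess
    import sys

    # Placeholders only -- not the real bucket naming scheme or paths.
    BUCKET = 'gs://example-wikimedia-dumps'
    DUMP_ROOT = '/data/xmldatadumps/public'

    def copy_run(wiki, date):
        """Copy one wiki's completed dump run to Google Storage via gsutil."""
        src = '%s/%s/%s' % (DUMP_ROOT, wiki, date)
        dest = '%s/%s/%s/' % (BUCKET, wiki, date)
        # -R copies the run directory recursively.
        return subprocess.call(['gsutil', 'cp', '-R', src, dest])

    if __name__ == '__main__':
        wiki, date = sys.argv[1], sys.argv[2]
        sys.exit(copy_run(wiki, date))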
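
For the Archive.org side, a hedged sketch of an upload against their S3-compatible endpoint using the boto library; the item name here is a placeholder and the keys come from the account's S3 credentials.

    import boto
    from boto.s3.key import Key

    def upload_to_archive(access_key, secret_key, item_name, local_path, remote_name):
        """Upload one dump file into an existing archive.org item via the S3-ish API."""
        conn = boto.connect_s3(access_key, secret_key, host='s3.us.archive.org')
        bucket = conn.lookup(item_name)  # the item ("bucket") must already exist
        key = Key(bucket, remote_name)
        key.set_contents_from_filename(local_path)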

Speed

  • The German Wikipedia (de wp) dump takes too long. Folks using the dump aren't interested in parallel runs. Needs discussion.
  • Be able to skip irrelevant parts of prefetch files by locating the page id in the appropriate bz2 block (see the sketch below): In progress
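
A minimal sketch of that idea, assuming the prefetch file is written as a multistream bz2 (each block a separately compressed stream, as with the multistream pages-articles dumps) and that an index of (first page id in block, byte offset) pairs has already been built; the names here are illustrative, not the actual dump code.

    import bisect
    import bz2

    def block_for_page(prefetch_path, index, page_id):
        """Seek straight to the bz2 block that should contain page_id and
        decompress only that block, instead of streaming the whole file.

        index is a sorted list of (first_page_id_in_block, byte_offset) tuples."""
        first_ids = [first_id for first_id, _ in index]
        pos = bisect.bisect_right(first_ids, page_id) - 1
        if pos < 0:
            return None
        start = index[pos][1]
        end = index[pos + 1][1] if pos + 1 < len(index) else None
        with open(prefetch_path, 'rb') as stream:
            stream.seek(start)
            data = stream.read(end - start) if end is not None else stream.read()
        # BZ2Decompressor ignores any trailing bytes from the next stream.
        return bz2.BZ2Decompressor().decompress(data)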

Robustness

  • Can rerun a specified checkpoint file, or rerun from that point on: Done
  • Safeguards against wrong or corrupt text in the XML files: In progress
    Need to use the sha1 hash (rev_sha1), as soon as that code in core is deployed and the column is populated (see the sketch after this list).
  • Automated random spot checks of dump file content: Not started
  • Restore missing db rev text from older dumps where possible: Not started
  • Scheduled and regular testing of dumps before new MW code deployment: Not started
  • Test suite for dumps: In progress. We now have a contractor, yay! See Dumps/Testing.
  • Easy deployment of new python scripts while current jobs are running: In progress
    Need to finish migration to the new deployment setup and make sure the worker can exit gracefully on demand (see the sketch after this list).
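
For the sha1 safeguard: MediaWiki's rev_sha1 column holds the SHA-1 of the revision text as a base-36 number zero-padded to 31 characters, so a dump-side check might look roughly like the sketch below (the exact encoding should be confirmed against the deployed core code).

    import hashlib

    BASE36_DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz'

    def base36_sha1(text):
        """Base-36 SHA-1 of the revision text (raw UTF-8 bytes), zero-padded
        to 31 characters, mirroring how core populates rev_sha1."""
        value = int(hashlib.sha1(text).hexdigest(), 16)
        digits = ''
        while value:
            value, remainder = divmod(value, 36)
            digits = BASE36_DIGITS[remainder] + digits
        return digits.rjust(31, '0')

    def text_looks_ok(rev_text, rev_sha1):
        """Compare dumped revision text against the rev_sha1 value from the database."""
        return base36_sha1(rev_text) == rev_sha1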
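
For the "exit gracefully on demand" piece, one common pattern is to check for a flag file between jobs; this is only a sketch with a hypothetical file name, not the actual worker code.

    import os
    import sys

    EXIT_FLAG = '/var/run/dumps/exit.flag'  # hypothetical path

    def run_jobs(jobs):
        """Run dump jobs one at a time; if an operator has dropped the flag file,
        stop cleanly after the last completed job so that new scripts can be
        deployed without killing a job mid-write."""
        for job in jobs:
            if os.path.exists(EXIT_FLAG):
                sys.exit('exit requested; stopping after the last completed job')
            job.run()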

Configuration, running, monitoring

  • Make wikitech docs for dumps suck less: Done
  • Start and stop runs cleanly via script: Done
    We can restart a dump from a given stage now, or from any checkpoint, and have it run through to completion.
  • Stats for number of downloads, bandwidth, bot downloads: Not started
  • Automated notification when run hangs: Not started
  • Packages for needed php modules etc., puppetization: In progress
    The "writeuptopageid" package needs to be updated and mwbzutils needs to be packaged; everything else other than the actual backup scripts is packaged and puppetized.
  • Docs and sample conf files for backup scripts: Done

Enhancement

  • Assign priorities to requests for new fields in dumps, implement: Not started
    See [2].
  • Incremental dumps: In progress
    Adds/changes content dumps have been deployed; they do not include deletes/undeletes and are not yet robust.
  • Multistream bz2 dumps of pages-articles for all wikis, plus scripts to put them to use: In progress
    Running for enwiki; needs to be deployed for the rest. A demo using them is now available; see [3]. (A sketch of reading the multistream index appears after this list.)
  • Full image dumps: In progress
    copy of production media to server for rsync: done
    rsync to external mirrors setup: done
    generation of tarballs per wiki: first run happening now (a rough tarball sketch appears after this list)
    script to "http-sync" media once it's in Swift: in progress
    Old plans: Obsolete:Dumps/Image dumps plans 2012
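
To illustrate the multistream scripts mentioned above, here is a sketch of reading the pages-articles-multistream index (one offset:pageid:title line per page, bz2-compressed) to find the byte offsets of the stream holding a given title; the returned offsets can then be used just like the block seek sketched in the Speed section.

    import bz2

    def stream_offsets(index_path, wanted_title):
        """Return (start, end) byte offsets of the bz2 stream containing
        wanted_title, or None if the title is not in the index."""
        offsets = []
        target = None
        with bz2.BZ2File(index_path) as index:
            for raw_line in index:
                line = raw_line.decode('utf-8')
                offset, _pageid, title = line.rstrip('\n').split(':', 2)
                offset = int(offset)
                if not offsets or offsets[-1] != offset:
                    offsets.append(offset)
                if title == wanted_title:
                    target = offset
        if target is None:
            return None
        following = [o for o in offsets if o > target]
        return target, (following[0] if following else None)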
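
For the per-wiki tarball generation step, a very rough sketch using Python's tarfile module; the directory layout and naming below are placeholders, not the actual media storage layout.

    import os
    import tarfile

    MEDIA_ROOT = '/data/media'       # placeholder: local copy of production media
    TARBALL_ROOT = '/data/tarballs'  # placeholder: where per-wiki tarballs land

    def make_media_tarball(wiki, date):
        """Write a single tar of one wiki's media directory."""
        src = os.path.join(MEDIA_ROOT, wiki)
        dest = os.path.join(TARBALL_ROOT, '%s-%s-media.tar' % (wiki, date))
        with tarfile.open(dest, 'w') as tar:
            tar.add(src, arcname=wiki)
        return dest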