Obsolete:Dumps/Development 2011


Tasks overview

We have work in the following areas. See Obsolete:Dumps/Development status 2011 for a detailed explanation and the status of each of these items.

See also the tracking bug on Bugzilla, which covers some of these issues in finer detail: [1]

Backups, availability

  • The dump files should be backed up to a local host from which they would be immediately available at any time.
  • The files should also be backed up to remote storage.
  • We should work with other organizations and with individuals to set up mirroring of the files (a pull-mirror sketch follows this list).
  • Old copies of missing dumps should be obtained from community members so that they can be archived and mirrored.[1]
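
This is only an illustration of what a pull mirror could look like, assuming a mirror host with rsync access to the public dumps tree; the host name, rsync module, and local path below are hypothetical placeholders, and any real mirror would coordinate schedules and bandwidth limits with us.

    #!/usr/bin/env python
    # Pull-mirror sketch: sync the public dumps tree to local storage.
    # Source host/module and destination path are hypothetical placeholders.
    import subprocess

    SOURCE = "rsync://dumps.example.org/dumps/"   # hypothetical rsync module
    DEST = "/srv/dumps-mirror/"

    def sync():
        # --delete keeps the mirror consistent with the source;
        # --bwlimit avoids saturating the link during the transfer.
        subprocess.check_call(
            ["rsync", "-av", "--delete", "--bwlimit=10000", SOURCE, DEST])

    if __name__ == "__main__":
        sync()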

Speed

  • Dumps should be run in batches, with smaller wikis in one set, larger wikis in another, and enwikipedia in a third, so that every wiki project is dumped at a regular interval without waiting too long in the queue.
  • Dumps of larger wikis, or at least of enwikipedia, should be broken into pieces that can be run in parallel, with each piece taking approximately the same time to run (a rough sketch follows this list).
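
As a rough sketch only: one way to run the pieces in parallel is to split the page ID space into ranges and run one dump job per range. The ranges, wiki name, and dumpBackup.php invocation below are illustrative assumptions; a real run would size the pieces by revision count so that each one takes roughly the same time.

    # Sketch: run dump pieces over page-ID ranges in parallel.
    # Ranges, wiki name, and command-line flags are assumptions for illustration.
    from multiprocessing import Pool
    import subprocess

    RANGES = [(1, 500000), (500001, 2000000), (2000001, 5000000)]

    def dump_piece(piece):
        start, end = piece
        out = "pages-meta-history-%d-%d.xml.gz" % (start, end)
        cmd = ("php maintenance/dumpBackup.php --wiki=enwiki --full "
               "--start=%d --end=%d | gzip > %s" % (start, end, out))
        subprocess.check_call(cmd, shell=True)

    if __name__ == "__main__":
        Pool(processes=3).map(dump_piece, RANGES)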

Robustness

  • Each dump consists of a number of smaller steps; each step should be restartable from the beginning in case of failure, rather than requiring a restart of the entire dump (a minimal restartability sketch follows this list).
  • Individual steps of a dump should be able to be run independently of the main dump process, so that new dumps of a project can be generated while an old one is being tidied up.
  • We should have safeguards in place against writing wrong or corrupt text into the dump files.
  • We should have an automated means of doing random spot checks of dump file content for accuracy.
  • The text of some older revisions for various projects is missing both from the database and from current dumps. We should examine older dumps to see whether this content is available and can be restored to the database.
  • Dumps should be tested on a regular schedule before MediaWiki code is synced from the deployment branch, rather than relying on automatic syncing via puppet.[2]
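
A minimal sketch of the restartability idea, with a made-up status-file format: each step records its completion, and a rerun skips anything already done and resumes at the first unfinished step.

    # Restartable step runner sketch. Completed steps are recorded in a
    # status file; a rerun skips them. The status-file format is illustrative only.
    import os

    STATUS_FILE = "dumpstatus.txt"

    def done_steps():
        if not os.path.exists(STATUS_FILE):
            return set()
        with open(STATUS_FILE) as f:
            return set(line.strip() for line in f)

    def mark_done(name):
        with open(STATUS_FILE, "a") as f:
            f.write(name + "\n")

    def run_all(steps):
        finished = done_steps()
        for name, func in steps:
            if name in finished:
                continue            # already completed in an earlier run
            func()                  # raises on failure; step stays unmarked
            mark_done(name)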

Configuration, running, monitoring

  • The dump process should support alternate configuration files and alternate lists of wiki projects.
  • The dump process should support starting and stopping a number of runs via a script, rather than manually, and with appropriate cleanup at termination.
  • We should generate and keep statistics about the number of downloads for each project in a given time frame, about bandwidth usage, and about bot downloads. (Perhaps we should see whether organizations doing automated downloads would like to host a local copy.)
  • When a dump process hangs, we should be notified by some automated means so that we can investigate (a watchdog sketch follows this list).
  • All non-MediaWiki software needed for the dumps to run should be packaged and the installation of such packages puppetized.
  • The MediaWiki backups code should include setup and operation documentation and sample configuration files.
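
For the hang-detection item, a minimal watchdog sketch, assuming that a running dump regularly touches files under its output directory; the directory, staleness threshold, and notification address are placeholder assumptions.

    # Watchdog sketch: mail the operators if no dump output file has been
    # modified within the threshold. Paths, threshold, and address are
    # placeholder assumptions.
    import os
    import time
    import smtplib
    from email.mime.text import MIMEText

    DUMP_DIR = "/mnt/dumps/enwiki/latest"   # hypothetical run directory
    STALE_AFTER = 6 * 3600                  # seconds without progress
    NOTIFY = "ops@example.org"              # placeholder address

    def newest_mtime(path):
        mtimes = [os.path.getmtime(os.path.join(root, f))
                  for root, _, files in os.walk(path) for f in files]
        return max(mtimes) if mtimes else 0

    def check():
        if time.time() - newest_mtime(DUMP_DIR) > STALE_AFTER:
            msg = MIMEText("No dump progress in %s for over %d hours."
                           % (DUMP_DIR, STALE_AFTER // 3600))
            msg["Subject"] = "dump watchdog: possible hang"
            msg["From"] = NOTIFY
            msg["To"] = NOTIFY
            server = smtplib.SMTP("localhost")
            server.sendmail(NOTIFY, [NOTIFY], msg.as_string())
            server.quit()

    if __name__ == "__main__":
        check()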

Enhancement

  • There have been many requests for new data fields to be included in the dumps. These need to be prioritized and added as appropriate. See Research Data Proposals.
  • We have had many requests for "incremental" dumps that would include only the moves, deletions, changed content, and new pages since the last run. We should evaluate this carefully to see if it's doable.
  • People have asked us for image dumps. While we would not provide downloads of the entire set of images (8T? Sorry, folks ;-)) we should consider providing smaller, reasonably sized subsets for download.



[1] One might think that new copies of the dumps are all we need, but in the past it has been useful to have the old copies. For example, when new runs have had corrupted text content, we have been able to go back to the older copies, which are used as a source for retrieving text rather than requesting all of it from the database.

[2] In the past we have synced code that broke the dumps in a subtle fashion, and this was not discovered until many bad dumps had been produced. Because completed dumps are used as input for the next set, once the problem was found we had to invest some time in identifying which dumps were bad and moving them out of the way. Such issues can be avoided with regular scheduled testing of code before deployment.