Nova Resource:Dumps/Archive.org


Archive.org refers to the Internet Archive, a digital library that mainly hosts scanned books but accepts almost any freely licensed content.

We are currently working on moving the public datasets to the Internet Archive for preservation, although right now the work is mainly handled by volunteers (specifically Hydriz and Nemo).

Archiving

There is a Cloud VPS project called "Dumps" that is dedicated to the archiving processes run by volunteers. The datasets currently being archived are:

  1. Adds/Changes dumps (source) - Runs automatically via crontab
  2. Main database dumps - Runs automatically via an archiving daemon
  3. Wikimedia visitor project statistics (hourly versions, grouped by month) - Manually run
  4. Other available Wikimedia datasets

Code

The source code for all the files used in this project is available on GitHub. This code might eventually find its way into the Wikimedia Gerrit repository, but there are no plans to do so right now.

The archiving of all datasets is managed by an archiving daemon developed under the project "Balchivist". It regularly scans for new dumps and records them in a database; an "archive runner" then picks them up and archives them at a later stage. Code is available here.
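
The following is a minimal sketch of that scan-and-run flow, using SQLite as a stand-in database; the table layout and function names are hypothetical and do not reflect Balchivist's actual schema.

  # Minimal sketch of the scan -> record -> archive flow described above.
  # The table layout and function names are hypothetical, not Balchivist's schema.
  import sqlite3

  def scan_for_new_dumps(conn, available_dumps):
      """Record any dumps not yet in the database with status 'pending'."""
      cur = conn.cursor()
      for dump in available_dumps:
          cur.execute("SELECT 1 FROM dumps WHERE name = ?", (dump,))
          if cur.fetchone() is None:
              cur.execute("INSERT INTO dumps (name, status) VALUES (?, 'pending')",
                          (dump,))
      conn.commit()

  def archive_runner(conn, archive_one):
      """Pick up pending dumps and archive them at a later stage."""
      cur = conn.cursor()
      cur.execute("SELECT name FROM dumps WHERE status = 'pending'")
      for (name,) in cur.fetchall():
          archive_one(name)  # e.g. upload the dump to archive.org
          cur.execute("UPDATE dumps SET status = 'done' WHERE name = ?", (name,))
          conn.commit()

  if __name__ == "__main__":
      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE dumps (name TEXT PRIMARY KEY, status TEXT)")
      scan_for_new_dumps(conn, ["enwiki-20140601", "dewiki-20140601"])
      archive_runner(conn, archive_one=lambda name: print("archiving", name))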

Metadata

As for metadata, it is important to keep it correct and consistent, if not rich, so that items are easy to find, bulk download and link.
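
As an illustration only, a script might centralise its metadata conventions in one helper; the field values and title format below are made up, and modify_metadata() from the internetarchive library is assumed to be available for fixing up existing items.

  # Sketch only: the keys below are common archive.org metadata fields, but the
  # exact conventions used by this project may differ.
  from internetarchive import modify_metadata

  def build_metadata(wiki, date):
      """Build one consistent metadata dictionary for every dump item."""
      return {
          "title": "Wikimedia database dump of %s on %s" % (wiki, date),
          "subject": "wiki; dumps; %s" % wiki,
          "date": date,
          "licenseurl": "https://creativecommons.org/licenses/by-sa/3.0/",
      }

  # Hypothetical example: bring an existing item in line with the same conventions.
  modify_metadata("enwiki-20140601", metadata=build_metadata("enwiki", "2014-06-01"))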

Caveats

  • Never assume any data is safe. If we haven't archived something and you can archive it before us, do so! Upload it to archive.org and we will notice and fill in any gaps; please ping us to have your item moved into the collection.

Internet Archive tips

  • It's fine to upload to the "opensource" collection with the "wikiteam" keyword and let the collection admins among us sort it out later.
  • All new archival code should use the IA library: https://pypi.python.org/pypi/internetarchive (see the sketch after this list).
  • On duplication: first of all, be thankful for the Internet Archive's generosity and efficiency with little funding. Second, SketchCow> [...] uploading stuff of dubious value or duplication to archive.org: [...] gigabytes fine, tens of gigabytes problematic, hundreds of gigabytes bad.
    • For instance, it's probably pointless to archive two copies of the same XML dump, one compressed in 7z and one in bz2. Just archive the 7z copy; anyone needing fast consumption with bzcat or the like can rely on the original site.
  • As of summer 2014, upload is one or two orders of magnitude faster than it used to be. It's not uncommon to reach 350 Mb/s upstream to s3.us.archive.org.
  • Ask more on #wikiteam or #internetarchive at EFNet for informal chat, or on the archive.org forums for discoverability.
  • Especially when the files are huge, remember to disable automatic derive: it creates data transfer for no gain.
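
A minimal sketch of such an upload is below, assuming a recent version of the internetarchive library whose upload() accepts queue_derive and retries; the item identifier, file name and metadata are placeholders.

  # Hedged sketch: the identifier, file name and metadata are placeholders.
  from internetarchive import upload

  upload(
      "wiki-examplewiki-20140601",                    # hypothetical item identifier
      files=["examplewiki-20140601-history.xml.7z"],  # hypothetical dump file
      metadata={
          "title": "Example wiki dump (2014-06-01)",
          "collection": "opensource",                 # sorted by collection admins later
          "subject": "wikiteam",
          "mediatype": "web",
      },
      queue_derive=False,  # disable the automatic derive step for huge uploads
      retries=5,           # retry on temporary S3 slowdowns/errors
  )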

Development

Expansion of scope

  • Archive the main database dumps: Done
  • Archive the full media tarballs: In progress
    • A blueprint is currently being drafted for the best method to archive the media on all Wikimedia wikis (including Commons) while using minimal resources.

Robustness

  • Improve overall usability of items on Archive.org: Done
  • Better error handling/skipping errors: Done

Speed

  • Implement parallelization: Done
  • Tap into multipart uploading for S3: In progress
    • Multipart uploads are slower overall than direct uploads. Work is also being done to ensure that multipart uploads can easily be resumed.
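
For illustration, the sequence below sketches the generic S3 multipart protocol (initiate, upload parts, complete) against the archive.org S3-like endpoint; the item name, file, credentials and part size are placeholders, and real code would also need error handling and resume support.

  # Sketch of the generic S3 multipart protocol; all names and keys are placeholders.
  import re
  import requests

  ENDPOINT = "https://s3.us.archive.org"
  AUTH = {"authorization": "LOW ACCESSKEY:SECRET"}   # placeholder IA-S3 credentials
  ITEM = "wiki-examplewiki-20140601"                 # hypothetical item
  KEY = "examplewiki-20140601-history.xml.7z"        # hypothetical dump file
  PART_SIZE = 100 * 1024 * 1024                      # 100 MiB per part
  url = "%s/%s/%s" % (ENDPOINT, ITEM, KEY)

  # 1. Initiate the multipart upload and extract the UploadId from the XML reply.
  r = requests.post(url + "?uploads", headers=AUTH)
  upload_id = re.search(r"<UploadId>(.+?)</UploadId>", r.text).group(1)

  # 2. Upload the file in numbered parts, remembering each part's ETag.
  etags = []
  with open(KEY, "rb") as f:
      part = 1
      while True:
          chunk = f.read(PART_SIZE)
          if not chunk:
              break
          r = requests.put("%s?partNumber=%d&uploadId=%s" % (url, part, upload_id),
                           data=chunk, headers=AUTH)
          etags.append((part, r.headers["ETag"]))
          part += 1

  # 3. Complete the upload by listing every part number with its ETag.
  parts_xml = "".join("<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>" % (n, e)
                      for n, e in etags)
  requests.post("%s?uploadId=%s" % (url, upload_id), headers=AUTH,
                data="<CompleteMultipartUpload>%s</CompleteMultipartUpload>" % parts_xml)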