This project archives the public datasets generated by Wikimedia.
This project was created to provide a dedicated space for transferring Wikimedia dump files to the Internet Archive. The dumps were created as a possible backup in case of cluster-wide hardware failure, and they are also often used by researchers and bots. Sometimes these files are generated so that a Wikimedia project can be forked, when many people in a project have aims that differ from the original Wikimedia goal.
More information about the archiving process is available at Nova Resource:Dumps/Archive.org
Data currently being archived
Here is some information, with links, about the data that this project is archiving:
- Wikimedia main database dumps
- Wikimedia incremental dumps
- Wikidata JSON dumps
- Wikimania videos
- OpenStreetMap datasets
- dumps-N (where N is an integer): Main archiving servers
- dumps-stats: Wikimedia data manipulation, including the dumps above and other material relevant to Wikimedia research.
- Before the eqiad migration we used to have a 900 GB quota (hardly sufficient for comfortable work).
- Currently all heavy operations are conducted on /data/scratch/. We keep to a soft limit of 3 TB of space there; this disk usage is always temporary, and the data is deleted once it has been pushed to the Archive.
- Everything is retained locally only for very short periods, just the time needed to package it for archive.org.
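The soft limit described above could be checked with a small script before starting a heavy job. The following is only a sketch, not the project's actual tooling: the function names are hypothetical, and it assumes that "3 TB" means decimal terabytes and that summing file sizes under /data/scratch/ is an acceptable measure of usage.

```python
import os

# 3 TB soft limit from the notes above; treating TB as 10**12 bytes is an assumption.
SOFT_LIMIT_BYTES = 3 * 10**12


def directory_size(path):
    """Return the total size in bytes of the regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # do not count symlink targets
            try:
                total += os.path.getsize(full)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
    return total


def over_soft_limit(used_bytes, limit_bytes=SOFT_LIMIT_BYTES):
    """Return True when usage exceeds the soft limit."""
    return used_bytes > limit_bytes


if __name__ == "__main__":
    used = directory_size("/data/scratch/")
    status = "over" if over_soft_limit(used) else "within"
    print(f"{used} bytes used, {status} the soft limit")
```

Because the limit is soft, the sketch only reports the state and leaves any cleanup to the operator.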
Server admin log
- 12:43 arturo: briefly stopping VM dumps-5 and dumps-4 to migrate hypervisor
- 18:34 bstorm: removing files from /data/project/dumps/temp/wikidata and /data/project/dumps/temp/cirrussearch T255628
- 19:59 jeh: restart dumps-0
- 14:04 andrewbogott: moving dumps-5 to a new cloudvirt
- 13:58 andrewbogott: moving dumps-4 to a new cloudvirt
- 20:13 andrewbogott: rebooting dumps-1 to try to work around NFS issues
- 23:18 bd808: Deleted dumps-stats and bugzilla ([[ph...