Dumps/Mirror status

From Wikitech

We are interested in mirroring the dumps; if you can host them, or know of an organization that can, please add that information there. Check there also for the mirror requirements and the list of current mirrors.

Mirror setup for xml dumps and media

The mirror sites depend on generated lists of files/directories for rsync. Since some mirrors have limited space, we don't ask them to pick up the last 5 dumps as such, but rather the last 5 complete good dumps. Since determining which runs those are would take some poking around in the directories, we generate this list for them (along with similar lists for the last 1 and 2 good dumps). Script: [1], run out of cron on the snapshot host responsible for misc cron jobs (see hiera/hosts to check which one).

Dataset2 (pmtpa) is rsynced to dataset1001 (eqiad) every 2 hours; each run first checks that the previous rsync has completed.
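The "check that the previous rsync has completed" guard can be sketched with an exclusive non-blocking flock: if the previous run still holds the lock, the new cron invocation exits immediately instead of piling up. The lock path, source, and destination below are assumptions, not the real configuration.

```python
#!/usr/bin/env python3
"""Sketch: cron-driven rsync that refuses to start while a previous run is
still going, using a non-blocking flock. Paths and hosts are hypothetical."""
import fcntl
import subprocess
import sys

LOCKFILE = "/var/lock/dumps-rsync.lock"  # assumed path


def main():
    lock = open(LOCKFILE, "w")  # keep open for the life of the process
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # previous rsync still running; wait for the next cron slot
    subprocess.run(
        ["rsync", "-a", "--delete",
         "/data/xmldatadumps/public/",         # assumed source tree
         "dataset1001.wikimedia.org::data/"],  # assumed destination module
        check=False,
    )


if __name__ == "__main__":
    main()
```

flock is released automatically when the process exits, so a crashed rsync run cannot leave a stale lock behind the way a plain pidfile could.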

Rsync to our mirrors is available from dataset1001.

Mirrors in operation

  • C3SL (run entirely by them, see dataset2 rsyncd.conf for contact info)
  • Your.org (run by us and them jointly; in particular, image tarballs are done by us, see /home/wikipedia/doc/mirrors/your.org for contact info, login and server info)
  • muni.cz (run entirely by them, see dataset2 rsyncd.conf for contact info)

In progress

  • Host being set up at wansecurity.com -- initial rsync happening now
  • Historical mirror for media at Archive.org (See collection [2])
  • Set up but need to test: mirror.fr.wickedway.nl
  • Discussions about images via Vito, organization: Tiscali s.p.a.

Initial contact

  • Pinged someone at dattobackup.com
  • Checking contacts at Amazon re: Amazon Public Data Sets, which has been defunct for some time
  • Checked with Nemo_bis about GARR, need to work with them about their legal concerns
  • SJ looking into contacts at MIT


A gluster share has the last 5 good dumps available; it should be available to all labs instances (is it?). This is kept up to date by a script run out of cron on dataset1001, script: [3]
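Such a cron job could consume the last-5-good file list and copy just those trees with rsync --files-from. This is a sketch only: the mount point, list path, and flags are assumptions, not what the real script on dataset1001 does.

```python
#!/usr/bin/env python3
"""Sketch: keep a share populated with the last 5 good dumps by rsyncing the
paths named in a pre-generated file list. All paths here are hypothetical."""
import subprocess

SOURCE = "/data/xmldatadumps/public/"       # assumed dump tree
DEST = "/mnt/glusterpublicdata/"            # assumed gluster mount
FILELIST = "/data/xmldatadumps/public/rsync-list-last-5-good.txt"  # assumed


def build_rsync_cmd(filelist, source, dest):
    """Assemble the rsync invocation. --relative preserves the wiki/date
    layout under dest; --delete is deliberately omitted so a truncated or
    bad list cannot wipe the share."""
    return ["rsync", "-a", "--relative", "--files-from", filelist, source, dest]


def sync_last_good():
    subprocess.run(build_rsync_cmd(FILELIST, SOURCE, DEST), check=True)
```

Leaving out --delete means old runs accumulate until cleaned up separately, which trades some disk for safety against a bad list deleting everything labs users rely on.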

Historical dumps are available on Archive.org; this effort is entirely managed by Hydriz. See Dumps/Archive.org for more information.

We have copied one complete run of our public XML files (about 1.3T?) off to Google storage, which they have kindly donated to us. We are in the process of moving things around to comply with a better naming scheme (one not usurpable by other Google storage users). We'd like to run a copy once every two weeks, keeping the last five copies, plus one copy retained permanently every six months. Script here.
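The retention policy described above (biweekly copies, keep the 5 newest, plus one permanent copy per six-month window) can be sketched as a pure selection function over run labels. The YYYYMMDD label format and function names are assumptions for illustration.

```python
#!/usr/bin/env python3
"""Sketch: which biweekly copies to retain under the policy described above.
Run labels are assumed to be YYYYMMDD strings; this is not the real script."""
from datetime import datetime


def copies_to_keep(dates, n_recent=5):
    """Return the set of run labels to retain: the n_recent newest copies,
    plus the first copy in each half-year window (kept permanently)."""
    dates = sorted(dates)
    keep = set(dates[-n_recent:])
    seen_halves = set()
    for d in dates:
        dt = datetime.strptime(d, "%Y%m%d")
        half = (dt.year, 0 if dt.month <= 6 else 1)
        if half not in seen_halves:
            seen_halves.add(half)
            keep.add(d)  # first run of this half-year: keep forever
    return keep
```

Everything not in the returned set would be eligible for deletion on the next cleanup pass.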

Earlier mirror efforts are documented on the Offsite Backups page. We need to check whether any of these are still viable; email has been sent to Kul and Milos to see if any of these possibilities are still live.