Dumps/Mirror status

We are interested in having the dumps mirrored; please add information there if you can host a mirror or know of an organization that can. Check there for mirror requirements and for the list of current mirrors.

Mirror setup for xml dumps and media

The mirror sites depend on generated lists of files and directories for rsync. Since some mirrors have limited space, we don't ask them to pick up the last 5 dumps but rather the last 5 complete good dumps. Determining which ones those are would take some poking around in the directories, so we generate this list for them (along with similar lists for the last 1 and 2 good dumps). Script: [1], run from cron on the snapshot host responsible for misc cron jobs (see hiera/hosts to check which one).
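
The selection of the last N complete good dumps can be sketched roughly as follows. This is only an illustration, not the script linked above: it assumes run directories are named in YYYYMMDD form, and the `is_complete` predicate and the status values in the example are hypothetical stand-ins for however a run is actually judged complete and good.

```python
from typing import Callable, Iterable, List


def last_good_dumps(run_dates: Iterable[str],
                    is_complete: Callable[[str], bool],
                    count: int = 5) -> List[str]:
    """Return up to `count` of the most recent complete runs, newest first.

    run_dates: run directory names, assumed to be in YYYYMMDD form.
    is_complete: hypothetical predicate that decides whether a run is a
        complete good dump.
    """
    # YYYYMMDD strings sort chronologically, so reverse order is newest first.
    newest_first = sorted(run_dates, reverse=True)
    good = [run for run in newest_first if is_complete(run)]
    return good[:max(count, 0)]


# Made-up statuses: the newest run is still in progress and one older run failed.
statuses = {
    "20240101": "done",
    "20240120": "failed",
    "20240201": "done",
    "20240220": "done",
    "20240301": "in-progress",
}
latest_two = last_good_dumps(statuses, lambda run: statuses[run] == "done", count=2)
print(latest_two)
```

Calling the same function with `count` set to 1, 2 and 5 would give the three lists mentioned above; a run that is incomplete or failed is skipped rather than counted.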

Dataset2 (pmtpa) is rsynced to dataset1001 (eqiad) every 2 hours; each run first checks that the previous rsync has completed.
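
One common way to make a periodic job skip a run while the previous one is still going is a non-blocking lock file. The sketch below shows that pattern on a Unix-like host; it is an assumption about how such a check could be done, not a description of the actual job, and the lock path and rsync arguments in the comment are hypothetical.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Yield True if we obtained the lock, False if another run still holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor also releases the lock if we held it.
        os.close(fd)


def sync_once(command, lock_path):
    """Run `command` unless a previous run still holds the lock.

    Returns True if the command ran, False if the run was skipped.
    """
    with exclusive_lock(lock_path) as acquired:
        if not acquired:
            return False
        subprocess.run(command, check=True)
        return True


# Hypothetical invocation from a job scheduled every 2 hours:
# sync_once(["rsync", "-a", "/path/to/dumps/", "desthost::module/"],
#           "/var/lock/dumps-rsync.lock")
```

Because the lock is released when the process exits, a crashed run does not leave a stale lock behind, which is the main reason to prefer this over checking for the existence of a marker file.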

Rsync to our mirrors is available from dataset1001.

Mirrors in operation

See the hiera setting profile::dumps::distribution::mirrors in puppet for all the gory details.

In progress

The info here is obsolete and needs to be updated.

  • Host being set up at wansecurity.com -- initial rsync happening now
  • Historical mirror for media at Archive.org (See collection [2])
  • Set up but need to test: mirror.fr.wickedway.nl
  • Discussions about images via Vito, organization: Tiscali s.p.a.

Initial contact

The info here is obsolete and needs to be updated.

  • Pinged someone at dattobackup.com
  • Checking contacts at Amazon re: Amazon Public Data Sets, which has been defunct for some time
  • Checked with Nemo_bis about GARR, need to work with them about their legal concerns
  • SJ looking into contacts at MIT


WMF Cloud instances have an NFS-mounted copy of all publicly available dumps and datasets.

Historical dumps are available on Archive.org; that collection is entirely managed by Hydriz. See Dumps/Archive.org for more information.

We have copied one complete run of our public XML files (about 1.3T?) off to Google storage, which they have kindly donated to us. We are in the process of moving things around to comply with a better naming scheme (one that other Google storage users cannot usurp). We'd like to run a copy once every two weeks, keep the last five copies, and retain one copy permanently every six months. Script here.
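
The retention scheme described above could be expressed roughly as follows. This is a sketch under an assumed reading of the policy, not the script linked above: "one copy permanently every six months" is taken to mean the earliest copy in each calendar half-year, and the function name and example dates are made up.

```python
from datetime import date, timedelta
from typing import Iterable, Set


def copies_to_keep(copy_dates: Iterable[date], recent: int = 5) -> Set[date]:
    """Return the copy dates to retain: the newest `recent` plus one per half-year."""
    ordered = sorted(copy_dates)
    keep = set(ordered[-recent:]) if recent > 0 else set()

    # Assumption: the permanent copy is the earliest one in each half-year
    # (January to June, July to December).
    seen_halves = set()
    for copy_date in ordered:
        half = (copy_date.year, 1 if copy_date.month <= 6 else 2)
        if half not in seen_halves:
            seen_halves.add(half)
            keep.add(copy_date)
    return keep


# Made-up schedule: 30 copies taken two weeks apart, starting 2024-01-01.
copies = [date(2024, 1, 1) + timedelta(days=14 * i) for i in range(30)]
kept = copies_to_keep(copies)
for copy_date in sorted(kept):
    print(copy_date.isoformat())
```

Anything not in the returned set would be a candidate for deletion; a recent copy that is also the first of its half-year is simply counted once.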

Earlier mirror efforts are documented on the Offsite Backups page. We need to see if any of these are still viable. Email sent to Kul, Milos to see if any of these possibilities are still live.