Obsolete:Dumps/Image dumps plans 2012

See also Obsolete:Dumps/media

How we might handle image dumps

We might do the following for each project:

Phase 1

Dump the stubs (metadata) for all revisions of each page in ns 6 (NS_FILE). This can be done by applying the appropriate filter to a regular stubs dump, so the worst case will be en wiki, at about 1 day for this.
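
A minimal sketch of the filtering idea, assuming a standard stubs-meta-history XML as input; the real dumps code applies the namespace filter while generating the stubs, and the file names and schema version here are placeholders:

 import bz2
 import xml.etree.ElementTree as ET

 NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version varies by dump

 def filter_file_pages(infile, outfile):
     """Copy only <page> elements whose <ns> is 6 (NS_FILE) into a new file."""
     with bz2.open(infile, "rb") as inp, open(outfile, "wb") as out:
         out.write(b"<mediawiki>\n")
         for _, elem in ET.iterparse(inp):
             if elem.tag == NS + "page":
                 if elem.findtext(NS + "ns") == "6":
                     out.write(ET.tostring(elem))
                 elem.clear()  # keep memory bounded on large histories
         out.write(b"</mediawiki>\n")

 filter_file_pages("enwiki-stub-meta-history.xml.bz2", "enwiki-file-stubs.xml")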

Phase 2

Dump the "File:" description pages with all revisions, corresponding to the stub file generated above. We can use the standard prefetch mechanisms here to save wear and tear on the dbs.
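
To illustrate the prefetch idea (this is not the actual dumps implementation): revision text is looked up in the previous full-history dump first, and the database is only consulted when the old dump doesn't have it. The in-memory index below is a simplification; in practice both files would be streamed.

 import bz2
 import xml.etree.ElementTree as ET

 NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version varies

 def build_prefetch_index(prev_dump):
     """Map revision id -> text from the previous pages-meta-history dump."""
     texts = {}
     with bz2.open(prev_dump, "rb") as inp:
         for _, elem in ET.iterparse(inp):
             if elem.tag == NS + "revision":
                 texts[int(elem.findtext(NS + "id"))] = elem.findtext(NS + "text") or ""
                 elem.clear()
     return texts

 def revision_text(rev_id, prefetch, fetch_from_db):
     # Only hit the database when the previous dump lacks this revision.
     return prefetch[rev_id] if rev_id in prefetch else fetch_from_db(rev_id)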

Phase 3

Generate a list of image filenames as contained in the stubs, in the order that the stubs list them, convert them to urls (including the historical versions), and write out a file of these urls. Why don't we just write out a list of full path names that we could retrieve by copying from the image server? Because it won't work with SWIFT.
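
A sketch of the name-to-URL conversion, assuming the usual MediaWiki hashed upload layout; the base URL and the timestamp!Name.ext convention for archived versions are assumptions here, not a spec:

 import hashlib
 from urllib.parse import quote

 BASE = "https://upload.wikimedia.org/wikipedia/commons"  # per-project base differs

 def hash_path(name):
     """Two-level hashed directory for a (space -> underscore) image name, e.g. 'b/b5'."""
     digest = hashlib.md5(name.encode("utf-8")).hexdigest()
     return f"{digest[0]}/{digest[:2]}"

 def current_url(name):
     name = name.replace(" ", "_")
     return f"{BASE}/{hash_path(name)}/{quote(name)}"

 def archived_url(name, timestamp):
     # Older versions live under archive/ and carry the upload timestamp as a prefix.
     name = name.replace(" ", "_")
     return f"{BASE}/archive/{hash_path(name)}/{timestamp}!{quote(name)}"

 print(current_url("Example image.jpg"))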

Phase 4

Retrieve the images and put them into a tarball. We should re-use previous tarballs, only retrieving if the previous archive doesn't have the image.
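
A rough sketch of the reuse check; the tarball names, the (url, member path) pairing, and the plain urllib fetch are all placeholders for whatever the real retrieval step ends up being:

 import io
 import tarfile
 import urllib.request

 def build_tarball(urls_and_members, prev_tar_path, new_tar_path):
     """urls_and_members: pairs like ('https://upload.../b/b5/Foo.jpg', 'b/b5/Foo.jpg')."""
     with tarfile.open(prev_tar_path, "r") as prev, \
          tarfile.open(new_tar_path, "w") as new:
         have = set(prev.getnames())
         for url, member in urls_and_members:
             if member in have:
                 # Copy the bytes straight out of the previous run's archive.
                 new.addfile(prev.getmember(member), prev.extractfile(member))
             else:
                 # Only images we don't already have get fetched over the network.
                 data = urllib.request.urlopen(url).read()
                 info = tarfile.TarInfo(member)
                 info.size = len(data)
                 new.addfile(info, io.BytesIO(data))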

The first such run is going to be really hard on the squids and the image server. Perhaps we want the first run to generate a list of filenames after all, so that we can copy from the rsync'ed server in eqiad. Further runs could use the url update mechanism.

For large projects (enwiki), we'll want to create several smaller tarballs. This means splitting up the list of urls in a reasonable and predictable way, so that checking the previous run's tarballs for existing images isn't really expensive.
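
One predictable way to split, sketched below: bucket each image by a prefix of its name hash, so a given image lands in the same tarball from run to run. The bucket count (16 here) is an arbitrary example:

 import hashlib

 def tarball_for(image_name, project="enwiki"):
     """Stable image -> tarball assignment based on the first hex digit of the name hash."""
     bucket = hashlib.md5(image_name.replace(" ", "_").encode("utf-8")).hexdigest()[0]
     return f"{project}-images-{bucket}.tar"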

Directory structure of the tarballs: These might as well be set up with two levels of directory names, just as the actual image filesystem is now. So an image might be stored in 1/1a/Blot....jpg. This should be easy enough for end users of the images to work from.

Notes

Commons is going to be the big headache, for two reasons. First, it's huge. At 200GB per tarball it would take 100+ tarballs. And btw we don't have room to put those anywhere. We could generate a few and hand them off to people offsite, but it would be really nice to have a local host to put them on.

Secondly, someone who wants to get a copy of all the images in use on, say, fr wikipedia, would have to download the package for fr wikipedia which contains all the images locally uploaded there, and then all the Commons images. Ouch!

We could provide tarballs that include, for each project, all the images in use from Commons as well. But then we might have substantial duplication of images across these various tarballs. Since many projects discourage uploads or have turned them off altogether, this problem will only get worse over time.

In an ideal world with lots of space we would do two tarballs per project: locally hosted images and Commons-hosted images. We could look at how much room these might take; it should be possible to crunch some numbers and get some estimates.

Temporary "get it out the door now" approach

Since there are no dumps of images in the world at present, anything would be better than nothing. We could create 256 tarballs of Commons, one per directory, and make them available as space permits, without writing any code; this would at least get people something to start with. We could generate these on the rsync'ed image store in eqiad. We're talking around 70GB for an average such tarball, not unreasonable for anyone who wants to work with this image collection anyway.
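
Even this throwaway approach reduces to a small loop; a sketch, assuming a local mount of the rsync'ed originals (both paths below are made up):

 import os
 import tarfile

 SRC = "/mnt/upload/commons"        # hypothetical rsync'ed copy of the Commons originals
 DEST = "/data/image-tarballs"      # hypothetical scratch space

 for first in "0123456789abcdef":
     for second in "0123456789abcdef":
         subdir = f"{first}/{first}{second}"
         out = os.path.join(DEST, f"commons-images-{first}{second}.tar")
         with tarfile.open(out, "w") as tar:
             # Keep the same two-level layout inside the tarball.
             tar.add(os.path.join(SRC, subdir), arcname=subdir)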