Swift/Load Thumbnail Data

The "Swift" project is Current as of 2012-04-01. Owner: Bhartshorne. See also RT:1384

This page details how to load existing thumbnails into swift before it is deployed.

new method

ms5 is currently under severe load. To avoid increasing that load, we don't want to run 'find' on ms5 (as the old method below does).

capture incoming thumbnail requests

On ms5, we run tcpdump to capture incoming thumbnail requests. We ship them off to fenari, where they're processed and re-requested from the squids (which have just received the object back from ms5), so the re-requests are served from cache.

 root@ms5:~# tcpdump -i any -s 0 -A 'dst port 80 and dst host ms5.pmtpa.wmnet' | grep GET \
                  | grep -o "[^ ]*/thumb/[^ ]*" | nc fenari.wikimedia.org 29876

stuff images into swift

On fenari, we run a listener (sketched after the invocation below) that

  • processes the incoming list of URLs from ms5's tcpdump
  • holds on to each URL for about 30s to make sure it's present in squid
  • requests the URL from swift, which falls through to upload.wikimedia.org (the squids) on 404
 ben@fenari:~/swift$ ./urllistener-fifo
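
The actual urllistener-fifo isn't reproduced on this page. As a rough illustration only, a minimal listener along these lines could do the same job; the frontend host, hold time, and threading model here are assumptions, not the real script:

#!/usr/bin/env python
# sketch of a urllistener-fifo-style script (not the actual tool)
# assumptions: URLs arrive one per line on TCP port 29876, and
# SWIFT_FRONTEND is a swift proxy that falls through to
# upload.wikimedia.org on 404
import socket
import threading
import time
import urllib2

SWIFT_FRONTEND = "msfe-pmtpa-test.wikimedia.org:8080"  # assumed frontend
HOLD_SECONDS = 30    # give squid time to cache the object ms5 just served
LISTEN_PORT = 29876  # matches the nc target on ms5

def fetch_later(path):
    # wait, then GET the thumb from swift; a 404 there makes the proxy
    # fetch it from upload.wikimedia.org (squid cache) and store it
    time.sleep(HOLD_SECONDS)
    url = "http://%s/%s" % (SWIFT_FRONTEND, path.lstrip("/"))
    try:
        urllib2.urlopen(url, timeout=30).read()
    except Exception:
        pass  # best effort; missed thumbs regenerate on demand

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("", LISTEN_PORT))
server.listen(1)
while True:
    conn, addr = server.accept()
    buf = ""
    while True:
        data = conn.recv(4096)
        if not data:
            break
        buf += data
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            line = line.strip()
            if "/thumb/" in line:
                t = threading.Thread(target=fetch_later, args=(line,))
                t.daemon = True
                t.start()
    conn.close()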

watching progress

Ganglia is graphing the total number of objects as well as the number of new objects per 30s time slice. If either of these metrics stagnates, we should verify that the tcpdump and urllistener are still running (in a screen session ben/listener on fenari; you can connect as root).
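
As a manual cross-check of the Ganglia numbers, the total object count can be read straight from swift's standard account HEAD API, which returns an X-Account-Object-Count header. The account path and token below are placeholders, not the real collector's configuration:

#!/usr/bin/env python
# sketch: read swift's total object count by hand
# the account path and auth token are placeholders; substitute real ones
import urllib2

req = urllib2.Request("http://msfe-pmtpa-test.wikimedia.org:8080/v1/AUTH_mw")
req.add_header("X-Auth-Token", "AUTH_tk...")  # placeholder token
req.get_method = lambda: "HEAD"
resp = urllib2.urlopen(req)
print resp.info().getheader("X-Account-Object-Count")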




old method (as of 2012-01-30)

success condition

The goal is only 99% coverage; it's ok to not get 100% of thumbnails. Any that we miss will be picked up from ms5 or regenerated when they're requested.

Additionally, only publicly visible thumbnails are retrieved in this first iteration.

get a list of existing thumbnails from ms5

ionice -c 3 runs the find at the 'idle' I/O priority, so it only gets disk time when nothing else wants it.

cd /export/thumbs
for i in wikibooks wikimedia wikinews wikipedia wikiquote wikisource wikiversity wiktionary
do
  ionice -c 3 find $i -type f > /tmp/$i-filelist.txt
done

exclude commons for test clusters

Until we clear out the Google-generated stuff, it's just wasted space, so we skip anything on commons when testing. Note: for production we should include commons, i.e. skip this step.

for i in *-filelist.txt; do grep -v "/commons/" $i > ${i/.txt/-nocommons.txt}; done

transform the list into URLs

Turn the list of paths into a list of URLs that should be served by swift; for example, a path like wikipedia/en/thumb/… becomes http://${swift_frontend}/wikipedia/en/thumb/…. Loading each URL will cause the file to be fetched from ms5 and saved into swift.

swift_frontend="msfe-pmtpa-test.wikimedia.org:8080"
for i in *-nocommons.txt; do sed -e "s|^|http://${swift_frontend}/|" $i > ${i/.txt/-urls.txt}; done

load all these urls

Fetch them all from swift. Run this on hume or some other host in pmtpa.

cd ~ben/swift
for i in *-urls.txt; do ./geturls.py -t 30 $i; done
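
geturls.py isn't reproduced on this page. A minimal sketch of what such a fetcher might look like, assuming -t is a per-request timeout in seconds and each line of the input file is one URL:

#!/usr/bin/env python
# sketch of a geturls.py-style fetcher (not the actual script)
import sys
import urllib2
from optparse import OptionParser

parser = OptionParser(usage="%prog -t TIMEOUT urlfile")
parser.add_option("-t", type="int", dest="timeout", default=30,
                  help="per-request timeout in seconds")
opts, args = parser.parse_args()

for line in open(args[0]):
    url = line.strip()
    if not url:
        continue
    try:
        # a successful GET pulls the thumbnail from ms5 into swift
        urllib2.urlopen(url, timeout=opts.timeout).read()
    except Exception, e:
        # 99% coverage is fine; log the failure and move on
        print >> sys.stderr, "failed: %s (%s)" % (url, e)

Failures are only logged and skipped, consistent with the 99% coverage goal above.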