Swift/Dev Notes

The "Swift" project is Current as of 2012-04-01. Owner: Bhartshorne. See also RT:1384

Swift is a distributed object store used to hold large media files for many of the Wikimedia projects. It was created as part of the OpenStack project.

Documentation Organization

Swift architecture

You can read an excellent overview of the Swift architecture on the swift.openstack.org website. This page only talks about what's relevant to our installation.

Hardware

Proposed

Proposal and rationale for how much hardware we need for swift in our environment (note: this proposal is based on capacity calculations, not performance; it may need to be adjusted based on performance metrics):

Start serving thumbnails only in eqiad with:

  • 2 high performance misc servers for the frontend proxy
  • 3 ES servers for storage

During 2012Q1, expand to serve both thumbs and originals from both colos:

  • 3 high performance misc proxy servers in both pmtpa and eqiad
  • 5 ES storage servers in both pmtpa and eqiad

During 2012Q3, expand both clusters as needed for capacity

  • 3 high performance misc proxy servers in both pmtpa and eqiad
  • 8 ES storage servers in both pmtpa and eqiad

Analysis

Current disk usage (data extrapolated from ganglia):

  • Thumbnails (ms5): 8T used, growth of 1T/month -> 20T in 1 year
  • Originals (ms7): 18T used, growth of 1T/month -> 30T in 1 year
  • Combined: 26T used, growth of 2T/month -> 50T in 1 year

Current ES hardware comes with 12 2TB disks. The proposed disk layout is 2 disks in RAID 1 for the OS and 10 disks formatted directly with XFS and given to swift (note: the swift docs recommend against using RAID for node storage). This yields 20T of disk available to swift per storage node. Swift stores content in triplicate, so when comparing to current disk utilization we must divide storage capacity by 3; each additional swift storage node therefore gives us about 6.6T of usable space. Additionally, 3 nodes is the minimum cluster size allowed, though the docs suggest a minimum of 5 nodes.

Necessary ES nodes for current use and estimated growth:

  • Thumbnails: a 3 node cluster (20T) will be sufficient for 1 year.
  • Originals: a 3 node cluster (20T) will serve current images, growing to 5 nodes (33T) over the course of the year.
  • Combined: a 4 node cluster (26T) will serve current content, growing to 8 nodes (52T) over the course of the year.
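
As a sanity check, here is the arithmetic behind those node counts as a small Python sketch (the disk counts, replica factor, and projected usage all come from the analysis above):

  import math

  # Back-of-the-envelope capacity math for the ES storage nodes.
  DISKS_PER_NODE = 10   # 12 disks minus the 2-disk RAID 1 pair for the OS
  DISK_SIZE_TB = 2      # 2TB drives
  REPLICAS = 3          # swift stores every object in triplicate

  raw_per_node_tb = DISKS_PER_NODE * DISK_SIZE_TB         # 20T raw per node
  usable_per_node_tb = raw_per_node_tb / float(REPLICAS)  # ~6.6T usable per node

  def nodes_needed(projected_tb):
      """Smallest cluster (3-node minimum) that can hold projected_tb of content."""
      return max(3, int(math.ceil(projected_tb / usable_per_node_tb)))

  print(nodes_needed(20))  # thumbnails in 1 year -> 3 nodes
  print(nodes_needed(30))  # originals in 1 year  -> 5 nodes
  print(nodes_needed(50))  # combined in 1 year   -> 8 nodes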

Purchased

  • storage: Dell PowerEdge C2100 with 2 Xeon E5645 2.4GHz CPUs, 48G RAM, and 12 2TB 7.2kRPM SAS disks.
  • proxy: Dell R610 with 2 Xeon 5650 2.6GHz CPUs, 16G RAM, and 2 250G 7.2kRPM SATA disks.

Deploy order

Here is the list of which components we'll deploy, and in what order, to maximize testability and minimize risk.

Thumbnails, round 1

  • Build testing swift cluster in eqiad [DONE - accessible at msfe-test.wikimedia.org:8080]
  • do preliminary testing
  • move 1/256th to 1/16th of thumbnail load to the swift cluster via squid to load test
  • move all traffic back to normal servers
  • based on the load test results, build production scale swift cluster in both pmtpa and eqiad
  • move all thumbnail traffic to the swift cluster; swift's 404 handler points to ms5 and must write the data it fetches to swift
  • copy all existing thumbnails to swift behind the scenes
  • change swift's 404 handler to bypass ms5. The 404 handler still writes data to swift.

Originals

TBD

Thumbnails, round 2

  • change swift's 404 handler to stop writing to swift

MediaWiki integration

For the mediawiki documentation on the SwiftMedia extension, see the SwiftMedia extension page on mediawiki.org.

All interaction with Swift is done via the proxy (a collection of front end boxes). The proxy handles all communication with the backend object storage machines.

Thumbnail Upload

N/A: thumbnails aren't uploaded; they are generated on demand by the scalers (see Thumbnail Retrieval below).

Thumbnail Retrieval

Both originals and thumbnails are stored in the Swift cluster. Thumbnails are more interesting because the 404 handler will create thumbnails that don't exist yet. This section covers what happens when a thumbnail is requested right now, and then each step of the process of moving from where we are now to being entirely backed by swift.

Current Path

Note: This diagram shows how things work right now, before swift gets into the mix.

Testing Swift

Initially for load testing, but also to provide a gradual shift from the existing system to Swift as a backend for thumbnails, we wish to configure the squids to split traffic between ms5 and swift. By varying the amount of traffic sent to swift we can evaluate how it handles load. Even after we're done testing and want swift to handle all the traffic, we can move load back and forth if the system starts to get stressed.
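
As an illustration of the idea (not our actual squid configuration), a deterministic way to send a fixed fraction of thumbnail requests to swift is to hash the request URL into buckets and route a configurable number of buckets to swift:

  import hashlib

  BUCKETS = 256        # matches the 1/256th granularity mentioned above
  SWIFT_BUCKETS = 16   # 16/256 == 1/16th of thumbnail traffic goes to swift

  def backend_for(url):
      """Pick the backend for a thumbnail request, deterministically by URL."""
      bucket = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % BUCKETS
      return "swift" if bucket < SWIFT_BUCKETS else "ms5"

  print(backend_for("/wikipedia/commons/thumb/a/ab/Example.jpg/120px-Example.jpg"))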

Because the image scalers are writing the created images to NFS, there is no state that gets created on swift that is not also created on ms5, meaning that there will be no data loss if we move off of swift and back to the way it was before.

All Swift

When we are comfortable with swift handling thumbnails, we will move 100% of the thumbnail traffic from the squids to swift. Swift will still pass requests for images it does not have on to ms5. The only change at this step is removing the arrow from the squids to ms5.

At this point we can still roll back (off of swift) without any data loss.

Removing ms5

In order to remove ms5 from the picture, we'll have to take the logic that sits in the thumb-handler and move it elsewhere. Most of what this script does is create an appropriate URL then pass it along to the scalers. There are transformations necessary based on the content of the image so that the scalers know which project the image comes from.

This logic will most likely wind up in the 404 handler in swift.

An additional change at this step is that the scalers no longer write the scaled image locally. The only requirement is that they return the scaled image via HTTP; the swift proxy will write it to the data store. At this point the scalers are still using the NFS-based FileRepo, not the SwiftMedia extension; they are getting the originals from ms7 via NFS.
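
A minimal sketch of that flow, with placeholder URLs and names (the real logic will live in the swift proxy's 404 handler): fetch the rendered thumbnail from a scaler over HTTP, write it into swift, and hand it back to the client.

  import urllib.request

  SCALER_URL = "http://scalers.example"                  # placeholder scaler endpoint
  SWIFT_URL = "http://swift-proxy.example:8080/v1/ACCT"  # placeholder storage URL

  def handle_thumb_404(container, obj, auth_token):
      """On a missing thumbnail: have a scaler render it, store it in swift,
      and return the bytes so the original request can still be served."""
      # 1. Ask an image scaler to render the thumbnail and return it over HTTP.
      thumb = urllib.request.urlopen("%s/%s" % (SCALER_URL, obj)).read()

      # 2. Write the rendered thumbnail into swift so the next request finds it.
      put = urllib.request.Request(
          "%s/%s/%s" % (SWIFT_URL, container, obj),
          data=thumb,
          headers={"X-Auth-Token": auth_token},
          method="PUT")
      urllib.request.urlopen(put)

      # 3. Return the thumbnail to the client.
      return thumb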

Removing NFS from the scalers

Note: This is the desired end state, after all the changes to integrate swift into the thumbnail path are complete.

This step will have to come during the transition of uploading new content to swift (we're just talking about thumbnails here). That path is not yet written, so the details of switching from the current FileRepo (which looks at NFS via a local path) to SwiftMedia (which will talk to swift) as the method of getting the full size image are not yet clear.

Check back after that path is described for details on that switch.

Originals Uploads

Uploads (current system) are relatively simple.

  • The upload request goes to a mediawiki app server
  • every mediawiki app server has ms7 (originals) and ms5 (thumbs) mounted at /mnt/*
  • mediawiki (via the FileRepo extension) writes uploaded files to ms7 via NFS
    • upload location is in InitialiseSettings.php

Uploads under swift are relatively simple.

  • The upload request goes to a mediawiki app server
  • every mediawiki app server has the SwiftMedia extension
  • mediawiki (via the SwiftMedia extension) writes uploaded files to swift via HTTP
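
For illustration, here is roughly what that HTTP interaction looks like using the python-swiftclient library (a stand-in for whatever the SwiftMedia PHP code does internally; the auth URL, credentials, and container name are placeholders):

  import swiftclient

  # Placeholder credentials and endpoint; not our actual configuration.
  conn = swiftclient.Connection(
      authurl="http://swift-proxy.example:8080/auth/v1.0",
      user="mw:media",
      key="secret",
      auth_version="1.0")

  # Upload one original: an HTTP PUT into a container, no NFS involved.
  with open("Example.jpg", "rb") as f:
      conn.put_object("originals", "Example.jpg", contents=f.read())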

How do we test, deploy, and rollback swift for original uploads?

  • Idea: write to both local disk and swift on every upload?
    • brief chat with Ibaker - would need a superclass to swallow up both FileRepo and SwiftMedia and send the writes to both
    • basically means a duplicate of all the methods in Filerepo/SwiftMedia... lots of work?
  • Idea: note what time the switch is made; use the database to copy all uploaded files out of swift if we need to roll back
    • each db (enwiki, frwiki, enwiktionary, etc.) has a table 'image' with a file name and a timestamp; this should be enough to track the file (see the sketch after this list)
    • problems will come up around file renames, moves, deletions
    • this approach feels like it's the more dangerous of the two.
  • Idea: put a hook in when file operations take place to insert a job into the job queue. Use the job queue to make copies of the modified files on the other store (either FileRepo or SwiftMedia)
    • this might need hooks everywhere and be impractical
  • Idea: modify mediawiki (probably just FileRepo/SwiftMedia) to write an audit log of all file changes. Use this log to power an out of band syncer
    • if the udplogger is used, make sure that the stream parser can catch all entries. dropped packets here would suck.
    • maybe use a tcp log aggregator or an alternate method of consolidating the logs instead
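
For the rollback-via-database idea above, the query involved is simple; a sketch (the image table and its img_name/img_timestamp columns are standard MediaWiki schema, while the cutover timestamp and connection details are placeholders):

  import MySQLdb  # assumes the MySQLdb driver; any MySQL client works

  SWITCH_TIME = "20120401000000"  # placeholder: when uploads started going to swift

  def files_uploaded_since(dbname, switch_time=SWITCH_TIME):
      """List files uploaded after the cutover, i.e. those we would have to
      copy back out of swift on a rollback."""
      db = MySQLdb.connect(host="db.example", db=dbname)  # placeholder host
      cur = db.cursor()
      cur.execute(
          "SELECT img_name, img_timestamp FROM image WHERE img_timestamp >= %s",
          (switch_time,))
      return cur.fetchall()

  # e.g. files_uploaded_since("enwiki")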

For all of these deploy ideas, the deploy will have to happen in multiple parts:

  • initial deploy of something to synchronize swift with nfs
  • intermediate deploy of something to write to both (whether inline or out of band)
  • switch from FileRepo to SwiftMedia
    • switch back and forth freely, since both back ends are up to date
  • remove the write-to-both mechanism and commit to swift.

Questions:

  • is SwiftMedia aware of both the PHP scratch space and the ImageStash scratch space?

Originals Retrieval

Requests for files go straight to the Swift proxy servers. They don't know about our URL structure, so we have a 'rewrite.py' WSGI module sitting in line with the requests: it recognizes Wikipedia-formatted requests and rewrites them into Swift-format requests. It's documented in Extension:SwiftMedia. The Swift proxy server determines which object servers actually have the file and issues a request to one or more of those servers to fetch it. It would be quite reasonable to run a caching server in front of the Swift proxy servers; however, the proxy servers also sync between servers, so they might not have extra resources for caching. We will have to see if we want to combine these functions.
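
A toy sketch of the kind of rewriting such a WSGI middleware performs (the URL layout and container naming here are illustrative placeholders, not the actual rewrite.py logic documented in Extension:SwiftMedia):

  import re

  class MediaRewrite(object):
      """Toy WSGI middleware: map a Wikipedia-style media URL onto a
      swift-style /v1/<account>/<container>/<object> path and hand the
      request on to the swift proxy app."""

      def __init__(self, app, account="AUTH_media"):  # placeholder account
          self.app = app
          self.account = account

      def __call__(self, environ, start_response):
          # e.g. /wikipedia/commons/thumb/a/ab/Example.jpg/120px-Example.jpg
          m = re.match(r"^/(?P<site>[^/]+)/(?P<proj>[^/]+)/(?P<rest>.+)$",
                       environ["PATH_INFO"])
          if m:
              container = "%s-%s" % (m.group("site"), m.group("proj"))  # illustrative
              environ["PATH_INFO"] = "/v1/%s/%s/%s" % (
                  self.account, container, m.group("rest"))
          return self.app(environ, start_response)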

FAQ and random bits about how our cluster is deployed and why

how do thumbnails for private vs. public wikis differ?

why don't we use ssl, and what are the ramifications?

More detailed task breakdown

Hiding elsewhere:

Installation Notes

These are off on different pages:

old notes