Swift/Dev Notes
Swift is a distributed object store used to hold large media files for many of the Wikimedia projects. It was created as part of the OpenStack project.
Documentation Organization
- SwiftMedia MediaWiki Extension page
- Architecture, hardware, how it used to work, overall design for how it will work, etc. - this page
- Initial installation and puppetization notes
- How to set up a new swift cluster (post-puppetization)
- How To do Stuff on Swift
- How to add a new node to an existing swift cluster
- Swift/Account and user management
- Pre-deploy tests (qualification tests before doing the first production deploy)
- Tasks and Schedule for deploying Swift for thumbnails
- Plan to load thumbnails into swift
- Swift/Logging and Metrics
- Swift/Performance Metrics
- Swift/Deploy Plan - Thumbnails
Swift architecture
You can read an excellent overview of the Swift architecture on the swift.openstack.org website. This page only talks about what's relevant to our installation.
Hardware
Proposed
Proposal and rationale for how much hardware we need for swift in our environment (note - this proposal is based on capacity calculations, not performance. It may need to be adjusted based on performance metrics):
Start serving thumbnails only in eqiad with:
- 2 high performance misc servers for the frontend proxy
- 3 ES servers for storage
During 2012Q1, expand to serve both thumbs and originals from both colos:
- 3 high performance misc proxy servers in both pmtpa and eqiad
- 5 ES storage servers in both pmtpa and eqiad
During 2012Q3, expand both clusters as needed for capacity
- 3 high performance misc proxy servers in both pmtpa and eqiad
- 8 ES storage servers in both pmtpa and eqiad
Analysis
Current disk usage (data extrapolated from ganglia):
- Thumbnails (ms5): 8T used, growth of 1T/month -> 20T in 1 year
- Originals (ms7): 18T used, growth of 1T/month -> 30T in 1 year
- Combined: 26T used, growth of 2T/month -> 50T in 1 year
Current ES hardware comes with 12 2TB disks. The proposed disk layout is 2 disks in RAID 1 for the OS and 10 disks formatted directly with XFS and handed to swift (note - the swift docs recommend against using RAID for node storage). This yields 20T of raw disk per storage node. Swift stores content in triplicate, so when comparing against current disk utilization we must divide storage capacity by 3; each additional swift storage node therefore gives us about 6.6T of usable space. Additionally, 3 nodes is the minimum cluster size allowed, but the docs suggest a minimum of 5 nodes.
Necessary ES nodes for current use and estimated growth:
- Thumbnails: 3 node cluster (20T) will be sufficient for 1 year.
- Originals: 3 node cluster (20T) will serve current images, growing to 5 nodes (33T) over the course of the year
- Combined: 4 node cluster (26T) will serve current content, growing to 8 nodes (52T) over the course of the year
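A quick back-of-the-envelope check of the node counts above, as a Python sketch. The only inputs are the disk layout and the 3x replication described in this section:
 # Capacity check: 12 x 2TB disks per ES node, 2 reserved for the RAID 1 OS
 # mirror, and swift's 3x replication.
 import math

 DISK_TB = 2
 DATA_DISKS = 12 - 2                         # 10 disks left for swift
 REPLICAS = 3

 raw_per_node = DISK_TB * DATA_DISKS         # 20T raw per storage node
 usable_per_node = raw_per_node / REPLICAS   # ~6.6T usable per node

 def nodes_needed(tb):
     """Smallest cluster (3-node minimum) that holds tb of content."""
     return max(3, math.ceil(tb / usable_per_node))

 for name, now_tb, year_tb in [('thumbs', 8, 20), ('originals', 18, 30), ('combined', 26, 50)]:
     print(name, nodes_needed(now_tb), '->', nodes_needed(year_tb))
 # prints: thumbs 3 -> 3, originals 3 -> 5, combined 4 -> 8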
Purchased
- storage: Dell PowerEdge C2100 with 2 Xeon E5645 2.4GHz CPUs, 48G RAM, and 12 2TB 7.2kRPM SAS disks.
- proxy: Dell R610 with 2 Xeon 5650 2.6GHz CPUs, 16G RAM, and 2 250G 7.2kRPM SATA disks.
Deploy order
Here is the order in which we'll deploy components, chosen to maximize testability and minimize risk.
Thumbnails, round 1
- Build testing swift cluster in eqiad [DONE - accessible at msfe-test.wikimedia.org:8080]
- do preliminary testing
- move 1/256th to 1/16th of thumbnail load to the swift cluster via squid to load test
- move all traffic back to normal servers
- based on the load test results, build production scale swift cluster in both pmtpa and eqiad
- move all thumbnail traffic to the swift cluster; swift's 404 handler points to ms5 and must write the data it fetches back to swift.
- copy all existing thumbnails to swift behind the scenes
- change swift's 404 handler to bypass ms5. 404 handler still writes data to swift.
Originals
TBD
Thumbnails, round 2
- change swift's 404 handler to stop writing to swift
MediaWiki integration
For the mediawiki documentation on the SwiftMedia extension, see the SwiftMedia extension page on mediawiki.org.
All interaction with Swift is done via the proxy (a collection of front end boxes). The proxy handles all communication with the backend object storage machines.
Thumbnail Upload
n/a. thumbnails aren't uploaded.
Thumbnail Retrieval
Both originals and thumbnails are stored in the Swift cluster. Thumbnails are more interesting because the 404 handler will create thumbnails that don't exist yet. This section covers what happens when a thumbnail is requested right now, and then each step of the process of moving from where we are now to being entirely backed by swift.
Current Path
Note: this image shows how things work right now, before swift gets into the mix.
Testing Swift
Initially for load testing, but also to provide a gradual shift from the existing system to Swift as a backend for thumbnails, we wish to configure the squids to split traffic between ms5 and swift. By varying the amount of traffic sent to swift we can evaluate how it handles load. Even once testing is done and we want swift to handle all the traffic, the same mechanism lets us move load back and forth if the system starts to get stressed.
Because the image scalers are writing the created images to NFS, there is no state that gets created on swift that is not also created on ms5, meaning that there will be no data loss if we move off of swift and back to the way it was before.
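To make the fractions from the deploy plan (1/256th to 1/16th) concrete, here is a sketch of how a split keyed off MediaWiki's hashed upload paths could select a slice of thumbnail traffic. This is only an illustration; the real split would live in the squid configuration, and the exact ACLs are not shown here.
 import re

 # Thumb URLs look like /wikipedia/commons/thumb/c/c6/MtLyell.jpg/360px-MtLyell.jpg;
 # the 'c' and 'c6' path components are the first hex digits of the file name's MD5.
 THUMB_RE = re.compile(r'^/[^/]+/[^/]+/thumb/([0-9a-f])/([0-9a-f]{2})/')

 def goes_to_swift(path, shards):
     """Route to swift when the 2-hex-digit shard is in `shards`.

     One 2-digit shard ('c6')            -> roughly 1/256 of thumbnail traffic
     All 16 shards with one leading 'c'  -> roughly 1/16 of thumbnail traffic
     """
     m = THUMB_RE.match(path)
     return bool(m) and m.group(2) in shards

 print(goes_to_swift('/wikipedia/commons/thumb/c/c6/MtLyell.jpg/360px-MtLyell.jpg', {'c6'}))  # True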
All Swift
When we are comfortable with swift handling thumbnails, we will move 100% of the thumbnail traffic from the squids to swift. Swift will still pass on requests for images it does not have to ms5. The only change at this step is removing the arrow from the squids to ms5.
At this point we can still roll back (off of swift) without any data loss.
Removing ms5
In order to remove ms5 from the picture, we'll have to take the logic that sits in the thumb-handler and move it elsewhere. Most of what this script does is construct the appropriate URL and pass it along to the scalers; some transformations are necessary so that the scalers know which project the image comes from.
This logic will most likely wind up in the 404 handler in swift.
An additional change at this step is that the scalers no longer write the scaled image locally. The only requirement is that they return the scaled image via HTTP; the swift proxy will write it to the data store. At this point the scalers are still using the NFS-based filerepo, not the SwiftMedia extension. They are getting the originals from ms7 via NFS.
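A minimal sketch of that 404-handler idea, assuming the scalers expose an HTTP rendering endpoint and that the proxy writes the result back into swift. The endpoint name, container layout, and auth handling below are placeholders, not the real configuration:
 import urllib.request

 SCALER_URL = 'http://rendering.svc.example/thumb'             # hypothetical scaler endpoint
 SWIFT_URL = 'http://msfe-test.wikimedia.org:8080/v1/AUTHKEY'  # swift-format base URL

 def handle_thumb_404(container, obj, auth_token):
     # 1. Ask the image scalers to render the missing thumbnail; they return
     #    it over HTTP instead of writing it to NFS.
     with urllib.request.urlopen('%s/%s/%s' % (SCALER_URL, container, obj)) as resp:
         thumb = resp.read()
     # 2. Store the rendered thumbnail in swift so the next request is a hit.
     put = urllib.request.Request('%s/%s/%s' % (SWIFT_URL, container, obj),
                                  data=thumb, method='PUT',
                                  headers={'X-Auth-Token': auth_token})
     urllib.request.urlopen(put)
     # 3. Return the bytes to the client whose request triggered the 404.
     return thumb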
Removing NFS from the scalers
Note This is the desired end state, after all the changes to integrate swift into the thumbnail path are complete.
This step will have to come during the transition of uploading new content to swift (we're just talking about thumbnails here). That path is not yet written, so the details of switching from the current FileRepo (which looks at NFS via a local path) to SwiftMedia (which will talk to swift) as the method of getting the full size image are not yet clear.
Check back after that path is described for details on that switch.
Originals Uploads
Uploads (current system) are relatively simple.
- The upload request goes to a mediawiki app server
- every MediaWiki app server has ms7 (originals) and ms5 (thumbs) mounted at /mnt/*
- mediawiki (via the FileRepo extension) writes uploaded files to ms7 via NFS
- upload location is in InitialiseSettings.php
Uploads under swift are relatively simple.
- The upload request goes to a mediawiki app server
- every MediaWiki app server has the SwiftMedia extension
- mediawiki (via the SwiftMedia extension) writes uploaded files to swift via HTTP
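The SwiftMedia extension itself is PHP; the sketch below just shows the HTTP exchange that replaces the NFS write, in Python, using swift's v1.0 auth. The proxy, account, and container names are assumptions.
 import urllib.request

 PROXY = 'http://msfe-test.wikimedia.org:8080'

 def swift_auth(user, key):
     """Trade a user/key pair for a storage URL and token (swift v1.0 auth)."""
     req = urllib.request.Request(PROXY + '/auth/v1.0',
                                  headers={'X-Auth-User': user, 'X-Auth-Key': key})
     with urllib.request.urlopen(req) as resp:
         return resp.headers['X-Storage-Url'], resp.headers['X-Auth-Token']

 def upload_original(storage_url, token, container, name, data):
     """PUT the uploaded file into swift rather than writing it to ms7 over NFS."""
     req = urllib.request.Request('%s/%s/%s' % (storage_url, container, name),
                                  data=data, method='PUT',
                                  headers={'X-Auth-Token': token})
     urllib.request.urlopen(req)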
How do we test, deploy, and rollback swift for original uploads?
- Idea: write to both local disk and swift on every upload?
- brief chat with Ibaker - would need a superclass to swallow up both FileRepo and SwiftMedia and send the writes to both
- basically means duplicating all the methods in FileRepo/SwiftMedia... lots of work?
- Idea: note what time the switch is made; use the database to copy all uploaded files out of swift if we need to roll back
- each db (enwiki, frwiki, enwiktionary, etc.) has a table 'image' that has a file name and a timestamp. This should be enough to track the file.
- problems will come up around file renames, moves, deletions
- this approach feels like it's the more dangerous of the two.
- Idea: put a hook in when file operations take place to insert a job into the job queue. Use the job queue to make copies of the modified files on the other store (either FileRepo or SwiftMedia)
- this might need hooks everywhere and be impractical
- Idea: modify mediawiki (probably just FileRepo/SwiftMedia) to write an audit log of all file changes. Use this log to power an out-of-band syncer (sketched below, after this list)
- if the udplogger is used, make sure that the stream parser can catch all entries. dropped packets here would suck.
- maybe use a TCP log aggregator or an alternate method of consolidating the logs instead
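A sketch of what the out-of-band syncer for the audit-log idea could look like. The log format and the two store callbacks are invented for illustration:
 # Assumed log format, one line per file operation:
 #   <timestamp>\t<operation>\t<wiki>\t<file title>
 def sync_from_audit_log(log_path, copy_to_other_store, delete_from_other_store):
     with open(log_path) as log:
         for line in log:
             ts, op, wiki, title = line.rstrip('\n').split('\t')
             if op in ('upload', 'move', 'copy'):
                 copy_to_other_store(wiki, title)      # bring the other store up to date
             elif op == 'delete':
                 delete_from_other_store(wiki, title)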
For all of these deploy ideas, the deploy will have to happen in multiple parts:
- initial deploy of something to synchronize swift with nfs
- intermediate deploy of something to write to both (whether inline or out of band)
- switch from FileRepo to SwiftMedia
- switch back and forth freely, since both back ends are up to date
- remove the dual-write mechanism and commit to swift.
Questions:
- is SwiftMedia aware of both the php scratch space and the ImageStash scratch space?
Originals Retrieval
Requests for files go straight to the Swift proxy servers. The proxy servers don't know about our URL structure, so we have a 'rewrite.py' WSGI module that sits inline in the request path; it recognizes Wikipedia-formatted requests and rewrites them into Swift-format requests (a sketch follows the list below). It's documented in Extension:SwiftMedia. The Swift proxy server determines which object servers actually have the file and issues a request to one or more of those servers to fetch it. It would be quite reasonable to run a caching server in front of the Swift proxy server, but the proxy servers also sync between servers, so they might not have spare resources for caching. We will have to see whether we want to combine these functions.
- The proxy will accept either URLs in the native swift format or URLs that match what mediawiki currently uses.
- swift format: http://proxy/v1/AUTHKEY/container/path-to-file/may-contain-slashes
- mediawiki format example: http://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/MtLyell.jpg/360px-MtLyell.jpg
- If it gets requests that look like they were destined for the regular thumb store, it will internally rewrite them to the appropriate format and check for the object.
- The proxy expects to find all objects it's asked for locally
- if it doesn't, it will query the existing thumb store, store the object locally, and return the object. From the end user's perspective, it appears as though swift did have the object but was just a bit slow.
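A sketch of the kind of mapping rewrite.py has to perform, going from the mediawiki format to the swift format shown above. The container naming below is an illustration, not the production scheme:
 import re

 MW_THUMB = re.compile(r'^/(?P<site>[^/]+)/(?P<lang>[^/]+)/thumb/'
                       r'(?P<shard1>[0-9a-f])/(?P<shard2>[0-9a-f]{2})/(?P<rest>.+)$')

 def mediawiki_to_swift(path, account='AUTHKEY'):
     """Turn an upload.wikimedia.org-style path into a swift-format path."""
     m = MW_THUMB.match(path)
     if m is None:
         return None                                   # not a thumb request
     container = '%s-%s-thumb' % (m.group('site'), m.group('lang'))
     obj = '%s/%s/%s' % (m.group('shard1'), m.group('shard2'), m.group('rest'))
     return '/v1/%s/%s/%s' % (account, container, obj)

 print(mediawiki_to_swift('/wikipedia/commons/thumb/c/c6/MtLyell.jpg/360px-MtLyell.jpg'))
 # -> /v1/AUTHKEY/wikipedia-commons-thumb/c/c6/MtLyell.jpg/360px-MtLyell.jpg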
FAQ and random bits about how our cluster is deployed and why
how do thumbnails for private vs. public wikis differ?
why don't we use SSL, and what are the ramifications?
More detailed task breakdown
Hiding elsewhere:
Installation Notes
These are off on different pages:
old notes
- Full details for the storage cluster will be at Swift/Cluster_Ops
- Integration is done with the SwiftMedia extension (http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/SwiftMedia/)
- thumbnail documentation: Thumbnail_repository