
Media storage

FIXME: Remove information on Ceph; remove reference to pmtpa data center

This page describes Wikimedia's media storage infrastructure.

Context

When we talk about "media storage", we refer to the storage and serving of user-uploaded content (typically images, PDFs, video and audio files), served from upload.wikimedia.org. It includes both originally uploaded content and content generated from it by other services. The files can be broadly grouped into the following categories:

  • "Originals": originally uploaded content
  • Thumbnails: arbitrarily-sized thumbnails of original content, scaled on demand by image scalers
  • Transcoded videos: conversions of originally uploaded videos into multiple formats (Ogg, WebM) at multiple preset resolutions (360p, 720p, etc.)
  • Rendered content: media generated by the wikis themselves, such as timeline, math, score and captcha images

Components

The media storage architecture involves the following closely coupled components.

[Diagram: Media storage components]

Caching proxies

The upload.wikimedia.org domains are served by the usual, tiered layers of caching proxies.

The upload setup is special in a few ways:

  • There are special provisions for handling Range requests, which also required a special Varnish version, due to their importance for serving large video files.
  • The config contains rewriting rules to handle the conversion from upload.wikimedia.org URLs to ms-fe Swift API URLs.
  • The config has special support for handling 404 responses from media storage on thumbnail URLs, retrying the request against an image scaler instead.

The last two were written to replace the previous Swift middleware (written in Python), both to prepare for the Ceph transition and to eliminate some of the issues the middleware had with cascading failures feeding into each other during otherwise simple incidents. As of July 2013, those two are implemented but inactive, pending the full Ceph roll-out.
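
As a rough illustration of the rewrite rules mentioned above, the following Python sketch maps a public upload.wikimedia.org path onto a Swift API URL. The Swift account name (AUTH_mw) and the exact container/object layout are assumptions based on the naming scheme described under Architecture below; the authoritative rules live in the Varnish and Swift proxy configuration in puppet.

  # Hedged sketch: map a public upload.wikimedia.org path to a Swift API URL.
  # "AUTH_mw" and the container/object layout are assumptions, not the real config.
  import re

  def rewrite_upload_url(path):
      # e.g. /wikipedia/en/a/ab/Example.jpg                          (original)
      #      /wikipedia/en/thumb/a/ab/Example.jpg/220px-Example.jpg  (thumbnail)
      m = re.match(r'^/(?P<project>[^/]+)/(?P<lang>[^/]+)'
                   r'(?P<thumb>/thumb)?/(?P<rest>.+)$', path)
      if m is None:
          return None
      zone = 'thumb' if m.group('thumb') else 'public'
      shard = m.group('rest').split('/')[1]   # the "xy" hash component
      # The ".shard" suffix only applies to large wikis with sharded containers
      # (see the Architecture section below).
      container = '%s-%s-local-%s.%s' % (m.group('project'), m.group('lang'), zone, shard)
      return '/v1/AUTH_mw/%s/%s' % (container, m.group('rest'))

  print(rewrite_upload_url('/wikipedia/en/a/ab/Example.jpg'))
  # -> /v1/AUTH_mw/wikipedia-en-local-public.ab/a/ab/Example.jpg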

Media storage

This is the Swift storage layer where files/objects are stored and retrieved.

Image & video scalers

Image scalers are a special group of otherwise normal application servers running MediaWiki. Their sole purpose is to receive thumbnail-scaling requests for arbitrary originals & sizes and to scale them down on demand. While there are a number of constraints in place for resource usage & security purposes, they perform resource-intensive operations on foreign content and thus can frequently misbehave, which is why they are grouped separately.

Video scalers are similar, but because of the nature of their work they do not operate on a per-request basis; instead, they perform their work as part of job queue processing.
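
The contrast between the two dispatch models can be sketched as follows; the function names and preset list are purely illustrative, not the actual MediaWiki job classes or scaler internals.

  # Illustrative only: per-request image scaling vs. queued video transcoding.
  import queue

  def scale_image(original, width):               # stand-in for the real scaler
      return '%s at %dpx' % (original, width)

  # Image scalers: the work happens synchronously, while the HTTP request waits.
  def handle_thumb_request(original, width):
      return scale_image(original, width)

  # Video scalers: an upload only enqueues jobs; job runners transcode later,
  # outside any web request, once per preset format/resolution combination.
  transcode_jobs = queue.Queue()

  def handle_video_upload(original):
      for fmt, res in (('webm', '360p'), ('webm', '720p'), ('ogv', '360p')):
          transcode_jobs.put((original, fmt, res))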

Architecture

File/object structure

Files are grouped into containers whose names have five components: a project (e.g. wikipedia), a language (e.g. en), a repo, a zone and, optionally, a shard.

The project can also be "global" for certain global items. Note that there are a few exceptions to the project names, the most notable being Wikimedia Commons, which has a project name of "wikipedia" for historical reasons.

Rendered content (timeline, math, score and captcha) all have their zone set to render and their repo set to their respective category. Regular media files have their repo set to local. Zones are: public, for public unscaled media; thumb, for thumbnails/scaled media; transcoded, for transcoded videos; temp, for temporary files created by e.g. UploadStash; and deleted, for unscaled media whose on-wiki entries have been deleted. These are defined and categorized in the MediaWiki configuration option $wgFileBackends.

Historically, files were put under directories on a filesystem, and directories were sharded per wiki in a two-level hierarchy of 16 shards per level, totaling 256 uniformly sharded directories. In the Swift era, the hope was that such a sharding scheme would be unneeded, as the backend storage would handle that complexity. This hope ultimately proved untrue: for certain wikis, the number of objects per container is large enough that it created scalability problems in Swift. To address this issue, multiple containers were created for those large projects. These are sharded into a flat (one-level) set of 256 shards (00-ff), with the exception of the deleted zone, which is sharded into 1296 shards (00-zz). The list of large projects that have sharded containers is currently defined in three places: a) MediaWiki's $wmgSwiftBigWikis, b) Swift's shard_container_list (proxy-server.conf, via puppet) and c) Varnish's rewrite configuration in puppet.

The previous, two-level scheme is kept as the name of the object in all containers, as well as in the public upload.wikimedia.org URLs, irrespective of whether the project is large enough to have sharded containers. This was done for compatibility reasons, and also gives us the ability to shard more containers in the future if they grow large enough. For projects that are sharded, the name of the container shard matches the object's second-level shard, and the shard of derived content (thumbnails) remains the same as the shard of the original that produced it.

A few examples:
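
For instance, the Python sketch below derives the shards and container names for a file, following the rules just described. The MD5-based x/xy scheme is the one MediaWiki uses for hashed upload paths; the exact container name layout shown is an assumption reconstructed from this page.

  # Illustration of the sharding and container-naming rules described above.
  import hashlib

  def shard_of(filename):
      """Return the two-level x/xy shard for a file name."""
      h = hashlib.md5(filename.replace(' ', '_').encode('utf-8')).hexdigest()
      return h[0], h[:2]

  def container_name(project, lang, repo, zone, shard=None):
      name = '%s-%s-%s-%s' % (project, lang, repo, zone)
      return name + ('.' + shard if shard else '')

  x, xy = shard_of('Example.jpg')
  # The two-level scheme is kept as the object name in all containers:
  obj = '%s/%s/%s' % (x, xy, 'Example.jpg')

  # A small wiki: one container per zone, no shard suffix.
  print(container_name('wikipedia', 'el', 'local', 'public'), obj)
  # A large (sharded) wiki: 256 containers per zone, shard matches the object's xy.
  print(container_name('wikipedia', 'en', 'local', 'public', xy), obj)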

Thumbnail handling

When a user requests a page from a public wiki, links to scaled media needed for the page (e.g. http://upload.wikimedia.org/project/language/thumb/x/xy/filename.ext/NNNpx-filename.ext) are generated, but the scaled media themselves are not generated at that time. As the thumb sizes are arbitrary, it is not possible to pregenerate them either, therefore the only way to handle this is to generate them on demand and cache them. On the MediaWiki side, this is accomplished by using Thumbor to generate thumbnails.
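
As a small, illustrative example, a thumbnail URL of that shape can be decomposed into the original file and the requested width like this (the parsing code is a sketch, not the actual MediaWiki/Thumbor implementation):

  # Illustrative parser for the thumbnail URL pattern quoted above:
  #   /project/language/thumb/x/xy/filename.ext/NNNpx-filename.ext
  import re

  THUMB_RE = re.compile(
      r'^/(?P<project>[^/]+)/(?P<lang>[^/]+)/thumb/'
      r'(?P<x>[0-9a-f])/(?P<xy>[0-9a-f]{2})/(?P<name>[^/]+)/(?P<width>\d+)px-.+$')

  def parse_thumb_url(path):
      m = THUMB_RE.match(path)
      if m is None:
          return None
      return {
          'original': '/%s/%s/%s/%s/%s' % (m.group('project'), m.group('lang'),
                                           m.group('x'), m.group('xy'), m.group('name')),
          'width': int(m.group('width')),
      }

  print(parse_thumb_url('/wikipedia/en/thumb/a/ab/Example.jpg/220px-Example.jpg'))
  # -> {'original': '/wikipedia/en/a/ab/Example.jpg', 'width': 220}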

When Varnish can't find a copy of the requested thumbnail - whether it is a thumbnail that has never been requested before or one that has fallen out of the Varnish cache - Varnish hits the Swift proxies.

For private wikis, Varnish doesn't cache thumbnails, because MediaWiki-level authentication is required to ensure that the client has access to the desired content (i.e. is logged into the private wiki). Therefore, Varnish passes the requests to MediaWiki, which verifies the user's credentials. Once authentication is validated, MediaWiki proxies the HTTP request to Thumbor. A shared secret key between MediaWiki and Thumbor is used to increase security.
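
A very rough sketch of that private-wiki flow follows; the header name, the secret and the Thumbor endpoint are hypothetical placeholders for illustration, not the real configuration.

  # Hypothetical sketch: MediaWiki-level auth first, then a proxied request to
  # Thumbor carrying a shared secret. Names and URLs are placeholders.
  import urllib.request

  THUMBOR_URL = 'http://thumbor.example.internal'   # hypothetical endpoint
  SHARED_SECRET = 'not-the-real-secret'             # shared between MW and Thumbor

  def proxy_private_thumbnail(user_is_authorized, thumb_path):
      if not user_is_authorized:
          return 403, b''                            # no MediaWiki session: refuse
      req = urllib.request.Request(
          THUMBOR_URL + thumb_path,
          headers={'X-Shared-Secret': SHARED_SECRET})  # hypothetical header name
      with urllib.request.urlopen(req) as resp:
          return resp.status, resp.read()            # relay Thumbor's response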

Datacenter replication

This information is outdated.

We currently have Swift running in pmtpa and Ceph running in eqiad. MediaWiki has the capability of running with multiple backends, with one of them being the primary (where reads come from, and whose file-operation success MediaWiki cares most about). This is configured by means of the FileBackendMultiWrite setting for $wgFileBackends, after creating a local-ceph and a local-swift SwiftFileBackend instance.

This is currently disabled because of Ceph's stability issues, which, due to the synchronous nature of FileBackendMultiWrite, would propagate to production traffic. Until this is enabled, we have two ways of syncing files:

  • For original media content, MediaWiki has a journal mechanism that records all changes in a database table, and scripts exist to replay that journal to the other store.
  • For all other content, we have a tool of our own called swiftrepl (in operations/software), which traverses containers on both sides and syncs them (a toy illustration follows this list).
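
A toy illustration of the kind of traversal swiftrepl performs is shown below, using python-swiftclient; the endpoints, credentials and the name+ETag comparison are simplified assumptions rather than swiftrepl's actual logic.

  # Toy illustration of a swiftrepl-style sync between two Swift(-compatible) clusters.
  from swiftclient.client import Connection

  def sync_container(src, dst, container):
      _, src_objs = src.get_container(container, full_listing=True)
      _, dst_objs = dst.get_container(container, full_listing=True)
      dst_etags = {o['name']: o['hash'] for o in dst_objs}
      for obj in src_objs:
          # Copy objects that are missing or differ (by ETag) on the destination.
          if dst_etags.get(obj['name']) != obj['hash']:
              _, body = src.get_object(container, obj['name'])
              dst.put_object(container, obj['name'], contents=body)

  # Placeholder endpoints and credentials, for illustration only.
  src = Connection(authurl='https://ms-fe.pmtpa.example/auth/v1.0',
                   user='mw:media', key='placeholder')
  dst = Connection(authurl='https://ms-fe.eqiad.example/auth/v1.0',
                   user='mw:media', key='placeholder')
  sync_container(src, dst, 'wikipedia-en-local-public.ab')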

History

Historically, media storage consisted of a few NFS servers that all MediaWiki application servers mounted, with MediaWiki writing files using regular filesystem calls. This was unscalable, fragile and inelegant.

2008

The media storage is spread over three NFS servers:

  • /upload3 (amane): Most user-uploaded originals are stored here.
  • /upload4 (storage1): Thumbnails and (maybe) some original images as well?
  • /math (amane): TexVC-rendered images for MathML formulas.

2009

Media storage:

  • /mnt/upload5 (ms1): uploaded images, thumbs and texvc-rendered images.
  • /mnt/thumbs (ms4): Thumbnail storage.

2012

In 2012, a combined effort from the platform and technical operations teams was made to replace this with a separate infrastructure, with MediaWiki gaining the new FileBackend abstract interface that could scale from a simple local upload directory for small sites (including local development and CI) all the way to a large media storage cluster for WMF production.

OpenStack Swift was selected as the new platform. The Swift API was picked for its simplicity and because it is native to the Swift implementation.

Swift has certain limitations, in particular around geographically-aware replication between datacenters, which affected the eqiad migration, as well as shortcomings in data consistency and performance.

As of 2013, Ceph and its Swift-compatible API layer (radosgw) was being evaluated for the same purpose, with pmtpa running Swift and the new eqiad cluster running Ceph instead. A final decision between the two was taken in late 2013.

Examples

First request for a thumbnail image

  1. Request for http://upload.wikimedia.org/project/language/thumb/x/xy/filename.ext/NNNpx-filename.ext is received by an LVS server.
  2. The LVS server picks an arbitrary Varnish frontend server to handle the request.
  3. Frontend Varnish looks for cached content for the URL in its in-memory cache.
  4. Frontend Varnish computes a hash of the URL and uses that hash to consistently select a backend ATS server.
    • The consistent hash routing ensures that all frontend Varnish servers will select the same backend ATS server for a given URL to eliminate duplication in the backend cache layer.
  5. Frontend Varnish requests URL from backend ATS.
  6. Backend ATS looks for cached content for the URL in its SSD-based cache.
  7. On a cache miss, backend ATS requests the URL from the media storage cluster.
  8. Request for URL from media storage cluster received by an LVS server.
  9. The LVS server picks an arbitrary frontend Swift server to handle the request.
  10. The frontend Swift server rewrites the URL to map from the wiki URL space into the storage URL space.
  11. The frontend Swift server requests the new URL from the Swift cluster.
  12. The 404 response for the URL is caught in the frontend Swift server (this 404 handling is sketched after this list).
  13. The frontend Swift server constructs a URL to request the thumbnail from the Thumbor cluster in the same datacenter.
  14. The LVS server picks an arbitrary Thumbor server to handle the request.
  15. The Thumbor server requests the original image from Swift.
    • This goes back to the same LVS -> Swift frontend -> Swift backend path as the thumb request came down from the ATS backend server.
  16. Thumbor transforms the original into the requested thumbnail image.
  17. Thumbor stores the resulting thumbnail in Swift.
  18. Thumbor returns the thumbnail as an HTTP response to the frontend Swift server's request.
  19. The frontend Swift server returns the thumbnail image as an HTTP response to the backend ATS server.
  20. The frontend Swift server echoes the request to the inactive Thumbor cluster
    • It does not wait for a response. The thumbnail is re-generated in the inactive cluster and saved to the Swift cluster in that datacenter. This prevents having to replicate thumbnails between datacenters.
  21. The backend ATS server stores the response in its SSD-backed cache.
  22. The backend ATS server returns the thumbnail image as an HTTP response to the frontend Varnish server.
  23. The frontend Varnish server stores the response in its in-memory cache.
  24. The frontend Varnish server returns the thumbnail image as an HTTP response to the original requestor.
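
The 404-handling path in steps 12-20 above can be sketched roughly as follows; the host names and function signature are illustrative assumptions, not the actual Swift proxy middleware.

  # Hedged sketch of steps 12-20, as seen from the frontend Swift server. Host
  # names are hypothetical; the real logic lives in the Swift proxy middleware.
  import threading
  import urllib.request

  ACTIVE_THUMBOR = 'http://thumbor.active.example.internal'      # hypothetical
  INACTIVE_THUMBOR = 'http://thumbor.inactive.example.internal'  # hypothetical

  def handle_thumb_404(thumb_path):
      """Called when Swift returned 404 for a thumbnail object (step 12)."""
      # Steps 13-18: ask Thumbor to render the thumbnail; Thumbor fetches the
      # original from Swift, stores the result back into Swift, and returns it.
      with urllib.request.urlopen(ACTIVE_THUMBOR + thumb_path) as resp:
          body = resp.read()
      # Step 20: echo the request to the inactive datacenter's Thumbor without
      # waiting for a response, so that cluster stores its own copy too.
      threading.Thread(
          target=lambda: urllib.request.urlopen(INACTIVE_THUMBOR + thumb_path),
          daemon=True).start()
      # Step 19: return the thumbnail to the backend ATS server.
      return 200, body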

Common operations

Removing archived files

Occasionally, there is a need to eradicate the content of files that have been deleted & archived on the wikis (e.g. content that is illegal to distribute). To serve this purpose, there is a MediaWiki maintenance script, eraseArchivedFile.php, that handles the deletion of both the content and its thumbnails from all configured FileBackend stores, as well as the purging of those from frontend HTTP caches. The script takes either the filename as input:

user@mwmaint1002:~$ mwscript eraseArchivedFile.php --wiki commonswiki --filename 'Example.jpg' --filekey '*' 
Use --delete to actually confirm this script
Purging all thumbnails for file 'Example.jpg'...done.
Finding deleted versions of file 'Example.jpg'...
Would delete version 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg.jpg' (20130604053028) of file 'Example.jpg'
Done

or the filekey (e.g. as given in a Special:Undelete URL) as an argument:

user@mwmaint1002:~$ mwscript eraseArchivedFile.php --wiki commonswiki --filekey 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg'
Use --delete to actually confirm this script
Purging all thumbnails for file 'Example.jpg'...done.
Would delete version 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg.jpg' (20130604053028) of file 'Example.jpg'

(note how it needs to be invoked with --delete to confirm all actions)

Cleaning up thumbs

Syncing between stores

See also