Obsolete:Media server/2011 Media Storage plans

Media Storage Architecture

DRAFT

Actually, this is not even at the draft stage yet; this is still in the "collecting ideas" stage.

Current characteristics

Intro

This is a holding space for notes about the media storage architecture we want to roll out in the first part of 2011.

Instead of having one single large server that serves all web requests for media, with its filesystem NFS-mounted on the hosts that do scaling of media and on all the Apache webservers for uploads of media, we want a cluster of small boxes.

Instead of having one single large server that serves all web requests for scaled media, with its filesystem NFS-mounted on the hosts that do scaling of media, we want a cluster of boxes.

Features we definitely want

  • These boxes run Linux, not Solaris.
  • These boxes are commodity hardware, with (say) 500GB storage each instead of 16 or 32T.
  • No NFS mounts. Ever again.
  • Each media file is on more than one box; if a box dies we lose nothing and the cluster keeps running (see the placement sketch after this list).
  • File access via HTTP, so we can point the squids directly at the Media server.
  • We can toss a new box in the pool without disrupting the cluster performance.
  • Any changes to MediaWiki are at the FileRepository layer or lower; any extension, piece of MW core, or API call retrieving media or information about some media file is unaffected.
  • The system must be able to accommodate the new file upload setup that's in development, which includes a user uploading to a temporary area until the entire licensing process is completed.
    • Agreed, although I think this needs to be transparent to this system to a large degree -- Mark 16:19, 11 November 2010 (UTC)
  • We can have a similar pool in at least one other datacenter, with a complete copy of all media, for disaster recovery.
    • Let's call this "rack-awareness", such that we can make sure that files are distributed a) across multiple racks, and b) across multiple data centers, in a hierarchic manner. -- Mark 16:19, 11 November 2010 (UTC)
  • It's easy to get all this data copied off-site (if only because each server just shoves 500GB over the wire to some off-site location at the same time).
  • No proprietary software
  • Thumbs get replicated too; it's not so important that we have any given thumb if a box goes down, but it is important that we don't find ourselves having to regenerate most of them, as this would put a huge load on the scalers.
  • Get ExtensionDistributor off these boxes!! As well as any of the other random things that have been put over there: math, "portal" (= old javascript from the 2009 fundraiser), centralnotice, skins (aren't those on bits now?). I might be convinced to keep math over there; we'd have to talk about it.
    • This has nothing to do with the current architecture of the current system, so is not a requirement of the new system. -- Mark 16:19, 11 November 2010 (UTC)
  • Currently, deleted images are moved to a different location on the filesystem (where they could in theory be retrievable by web, but an htaccess file prevents it); they are instead retrieved by direct access to an NFS-mounted filesystem. See the wgLocalFileRepo settings in CommonSettings.php for this. We'll need to make sure that the new setup *does not use NFS* but still does not allow unauthorized access to these files.
  • Private images (i.e. images viewable only by users registered on private projects) should remain that way under this new architecture.
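
A minimal sketch of one way the "more than one box" and rack-awareness items above could be met, using rendezvous hashing over a hypothetical host list annotated with rack and datacenter labels. Everything here is made up for illustration; it is one possible placement scheme, not a decided design.

import hashlib

# Hypothetical hosts; names, racks, and datacenters are made up for illustration.
HOSTS = [
    {"name": "ms1", "rack": "a1", "dc": "dc1"},
    {"name": "ms2", "rack": "a2", "dc": "dc1"},
    {"name": "ms3", "rack": "b1", "dc": "dc2"},
    {"name": "ms4", "rack": "b2", "dc": "dc2"},
]

def score(filename, hostname):
    # Deterministic per (file, host) pair, so every box computes the same placement.
    return hashlib.sha1(("%s/%s" % (filename, hostname)).encode("utf-8")).hexdigest()

def place(filename, copies=2):
    """Pick `copies` hosts, never two in the same rack (a fuller version would
    also force copies into distinct datacenters, per the rack-awareness note)."""
    ordered = sorted(HOSTS, key=lambda h: score(filename, h["name"]))
    chosen, seen_racks = [], set()
    for h in ordered:
        if h["rack"] in seen_racks:
            continue
        chosen.append(h["name"])
        seen_racks.add(h["rack"])
        if len(chosen) == copies:
            break
    return chosen

print(place("Example.jpg"))  # two host names, no two from the same rack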

Some things we probably want

  • If we want to examine a file, it is sitting on the filesystem in the usual way; we do not split it over several boxes, nor do we need special tools to retrieve it. A simple cat or more or ls is all that is required. It would also be extremely nice if the file names remained more or less intact. I don't care if there is some simple transformation that deals with special characters, say, but one should be able to hunt for files on the filesystem with basically the names MediaWiki sees.
  • We don't write it all from scratch; we should be able to use pre-existing pieces and just write the glue (plus a new MediaWiki File Repository interface).
  • It would be nice if the back end were in Python or Perl, I guess, rather than in PHP, but please not Ruby or some esoteric language that we don't have a lot of in-house knowledge of. Also, please pick a language with reasonable performance.
  • It would be very cool if the backend piece were reusable by other folks instead of being a bit grody and particular to MediaWiki etc.
  • No dependencies on some esoteric kernel feature etc.
  • "Consistency" checking? As in "really, there are n copies of these files on live hardware", or "yes, these db pointers really point to the media we think"? Not sure what the need is nor what would meet that need. Along with that: what sort of tracking do we need for hosts that are down? What sort of problems would we face when we put them back into production? Would we want to reinitialize them from scratch if they were out of service for more than a few days? If they were just out for a day what would they need to get synced up?
  • Correctness: it would be nice if each file had a hash attached, so that "the right version" can be easily determined (a minimal checking sketch follows this list).
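
A minimal sketch of the correctness check in the last item above, assuming the canonical SHA-1 is stored alongside the file metadata; the stored_sha1 argument is hypothetical and would come from wherever we keep that metadata.

import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream the file so large media never have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_right_version(path, stored_sha1):
    # stored_sha1 is hypothetical here; it would come from the file metadata.
    return sha1_of_file(path) == stored_sha1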

Components that we need

  • Something that translates URLs of original or scaled media to the machines hosting them; presumably this is a database, which itself needs to be a cluster with replication, an easy way to fail the master over to another host, an easy way to add a new host to the cluster, etc. (a lookup sketch follows this list).
  • Something that receives and executes writes and moves of media files, updates of media files, creation of new directories on a media server filesystem, reads of media files, stats of media files, and deletes of scaled media files (purges); maybe a RESTful API.
  • Something that sends a purge request to the machines hosting a particular image
  • A mechanism for write/update/move of media, creation of directories, and deletes of media on all hosts with that media when one host is changed
  • At some point a media server may start getting full. We need to be able to refactor the mapping of media file URLs to media servers at that point.
  • Programs we use for scaling (like "convert" from ImageMagick) expect an input file and an output file on the local filesystem; we'll need to think about that, unless we decide that all thumbs must be hosted on the same servers as the original media, and that scaling is done by the media servers as well as serving read requests. I would recommend this, as scaling is quite memory-intensive and somewhat CPU-intensive. I don't know whether it's better to feed the result of a media file retrieval to stdin of convert, or whether it's better to write a temp copy and feed that to convert. Likewise the output could go directly to some process waiting to write it remotely on the media server, or it could go into a temp copy on disk which then gets sent over (see the pipeline sketch after this list).
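
A rough sketch of the URL-to-host translation component in the first item above. The table layout and names are hypothetical (sqlite3 stands in for whatever replicated database we actually use); the point is only that the lookup is a single cheap query, and that refactoring the mapping when a server fills up is just an update to this table once the files have been moved.

import sqlite3

# sqlite3 stands in for the real replicated mapping database; schema is illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE media_location (url_path TEXT PRIMARY KEY, host TEXT)")
db.execute("INSERT INTO media_location VALUES ('/wikipedia/commons/a/ab/Example.jpg', 'ms1')")

def host_for(url_path):
    """One cheap lookup per request: which box holds this original or thumb?"""
    row = db.execute("SELECT host FROM media_location WHERE url_path = ?",
                     (url_path,)).fetchone()
    return row[0] if row else None

def remap(url_path, new_host):
    """After files have been moved off a full server, record their new home."""
    db.execute("UPDATE media_location SET host = ? WHERE url_path = ?",
               (new_host, url_path))

print(host_for("/wikipedia/commons/a/ab/Example.jpg"))  # ms1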
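
On the scaling question in the last item: ImageMagick's convert will read from stdin and write to stdout when given "-" as the file name, so a scaler running on the media server itself could avoid temp copies entirely. A minimal sketch, with the trade-off against temp files left open as above:

import subprocess

def scale(original_bytes, width):
    """Pipe the original into convert and read the thumb back, no temp files."""
    proc = subprocess.run(
        ["convert", "jpeg:-", "-resize", str(width), "jpeg:-"],
        input=original_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout

# thumb_bytes = scale(open("Example.jpg", "rb").read(), 120)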

Current Stats

  • We currently get on the order of __ uploads per month at about __ GB total and an average file size of __, with our maximum file size restriction set at 100MB.
  • A year ago these numbers were...

Thumbnail generation on ms4, 7 December 2010, about 2:20 pm UTC:

Requests for thumbs to be scaled, with success (these were actually completed by the scalers): 25 to 30 thumbs / sec
Requests for thumbs that failed for various reasons, either because of permissions errors or because the scaler could not complete the request: 20 to 25 thumbs / sec

The above was generated by being in /opt/local/share on ms4 and running

dtrace -qs ./access_log.d | grep thumb-handl | grep 'php 200'
dtrace -qs ./access_log.d | grep thumb-handl | grep -v 'php 200'

respectively.

Note that you can only run this for a few minutes; after this it begins to lag.

Future considerations

  • As people download bigger files, their pipes aren't going to be big enough to finish downloads in a few seconds, which will mean more and more connections taking a while on the media servers (and later on the squids/varnish hosts).
  • We should plan for an upswing in video uploads, as the tools to make that easier become available.
  • Some day we may want load balanced serving of files in the case that we actually have clusters serving each file. The short to mid term plan I have in my head is that we have a single server for each little group of files, but they are replicated to another host to ensure availability in case of hardware issues.
  • If we abstract all the front end pieces to deal with media via a standard API (for example REST), then it will be easy to replace the backend with something that uses cloud storage as well, sometime in the future. Right now it's a PITA (see the interface sketch after this list).
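
A sketch of what "a standard API in front of the backend" could look like; the class and method names here are invented, the point being that a cloud-storage implementation could later slot in behind the same few verbs without the front end noticing.

class MediaBackend:
    """The handful of verbs the front end is allowed to use; everything else is hidden."""
    def get(self, path):
        raise NotImplementedError
    def put(self, path, data):
        raise NotImplementedError
    def delete(self, path):
        raise NotImplementedError

class ClusterBackend(MediaBackend):
    """Talks HTTP to our own media boxes (see the client sketch under Misc)."""
    pass

class CloudBackend(MediaBackend):
    """Some future cloud storage service, same interface."""
    pass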

Monitoring

  • We need to be alerted in all the usual ways: unusual activity, passing some percentage utilization of some resource. Presumably Nagios (a minimal check sketch follows).
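
A minimal sketch of a Nagios-style check for the "percentage utilization" case; the path and thresholds are placeholders, but the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL) is the standard Nagios plugin interface.

import os
import sys

def check_media_disk(path="/srv/media", warn=80, crit=90):
    """Nagios-style disk utilization check; path and thresholds are placeholders."""
    st = os.statvfs(path)
    used_pct = 100.0 * (1.0 - float(st.f_bavail) / st.f_blocks)
    if used_pct >= crit:
        print("CRITICAL: %s is %.0f%% full" % (path, used_pct))
        return 2
    if used_pct >= warn:
        print("WARNING: %s is %.0f%% full" % (path, used_pct))
        return 1
    print("OK: %s is %.0f%% full" % (path, used_pct))
    return 0

if __name__ == "__main__":
    sys.exit(check_media_disk())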

Misc

  • A distributed "filesystem" is OK if what it does is take get/put/post/delete requests for files and do the corresponding thing; it's easy to add servers; the death of a server does not lose data; data lives on the Unix filesystem in the normal way and can be retrieved without special tools; and it has a facility for redistributing files when servers get full, etc. If we find one we like that has reasonable performance, has all these features, isn't a big hairy mess to install or maintain, and is written in a language we mostly know, this project could just be about integrating it into MediaWiki, getting purging working, and getting the scalers to work with it (a minimal client sketch follows this list).
  • Do we want thumbs stored with their original media files on the same hosts? I dunno, maybe it would facilitate things. We should think about it I guess.
  • Does it make sense to upgrade our database structure at the same time? There are a number of pain points, the main one being that the unique id is the title. There are a couple of other features that might be nice to add, like tracking the upload app used (there are several ways to get your contribution into our db).
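
For the get/put/delete point in the first item of this section, a minimal sketch of the kind of HTTP client the glue code would amount to, using only the standard library; the host name and path are made up.

import http.client

def put_file(host, path, data):
    conn = http.client.HTTPConnection(host)
    conn.request("PUT", path, body=data)
    return conn.getresponse().status

def get_file(host, path):
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path)
    resp = conn.getresponse()
    return resp.status, resp.read()

def delete_file(host, path):
    conn = http.client.HTTPConnection(host)
    conn.request("DELETE", path)
    return conn.getresponse().status

# status, data = get_file("ms1.example.org", "/wikipedia/commons/a/ab/Example.jpg")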

Implementation ideas

  • We will have multiple upload-accepting servers, so we have the issue of where to keep temp files (for example, chunked uploads, which arrive in multiple requests, or uploads without metadata or license a la UploadWizard). Otherwise we have to store temporary uploads (even things like chunked uploads) in some place that's available to any upload server.
    • Simple: a service to request an upload server? For web users this could be as simple as a 302 redirect, sending you to upload23.wikimedia.org. For API users that introduces a new concept: you have to first do an API request to wikimedia.org for a free upload server, then start talking to upload23.wikimedia.org or whatever. Seems a bit lame.
    • Fancy: use the edit token to link a request with the upload-accepting server. In other words, while the hostname, web or API, is always upload.wikimedia.org, the edit token somehow tells us which upload server to direct traffic to. This also enables us to resume uploads much later. Too complex? (A token-routing sketch follows this list.)
      • Since an upload token is needed anyway, the same request can easily provide a URL for which that token is valid.
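
A sketch of the "fancy" option above: the upload token itself carries the chosen upload server plus an HMAC so it can't be forged, and the front end at upload.wikimedia.org just reads the token to route the request. The secret, token format, and host name here are all hypothetical.

import hashlib
import hmac

SECRET = b"not-the-real-secret"  # hypothetical shared secret

def issue_token(upload_host):
    """Hand the client a token that pins it to one upload-accepting server."""
    mac = hmac.new(SECRET, upload_host.encode("utf-8"), hashlib.sha1).hexdigest()
    return "%s:%s" % (upload_host, mac)

def route(token):
    """On upload.wikimedia.org, recover and verify the server the token names."""
    host, mac = token.split(":", 1)
    expected = hmac.new(SECRET, host.encode("utf-8"), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("bad upload token")
    return host

print(route(issue_token("upload23.wikimedia.org")))  # upload23.wikimedia.org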

Discussion

Media server/2011 Media Storage plans/Conference call 2010-12-14

Media server/2011 Media Storage plans/Conference call 2010-12-21