Obsolete:Media server/Homebrew

This proposal has been withdrawn.

Always good to start off with a problem statement. We currently have a reasonable (but excessively complicated) system for storing media files, creating thumbnails, and serving both the files and the thumbnails. The trouble is that the system isn't scaling well, because at its center is a single host (with another one like it for thumbnails) which serves up files over NFS to the processes which write (uploader and thumbnailer) and the processes which read (thumbnailer and apaches). One of the goals is to go to a simpler system, because things are only simple when they're new. As they get older, they get more complicated, stiff and hard to change. Kinda like people.

If we drop back and look at a simple MediaWiki install, the apache accepts uploads, invokes the thumbnailer, and serves up the files. What is to stop us from continuing that design? Simply that we need multiple hosts. If we incorporate the hash from the file path into the hostname, that lets us use DNS records to split the load across as many machines as we want. The machines don't need to be in the same rack, or even in the same data center. The DNS records can point to multiple machines.

There are several problems with this plan.

  • We didn't give out multiple hostnames; instead, upload.wikimedia.org is the hostname for everything. This means that we have to rewrite the URLs inside the caches (which works for both squid and varnish). Nothing wrong with that, but it adds complexity. It would be simpler to hand out the correct hostname in the first place. If we do this, it will mean that there may be a hostname lookup for every image. Some browsers will access the same host for at most four images at a time ... to avoid trampling it. That's why some sites will have multiple names for the same host in order to initiate more downloads. A dozen hostnames will cause a dozen DNS lookups, but the browser will also immediately start a download on all dozen hosts.
Place the upload hosts as subdomains and add a *.upload.wikimedia.org record pointing at the squids' round robin.
  • A possible mismatch between the resources needed to run the thumbnailer and those needed to serve up files. The thumbnailer needs to run on the host which holds the original files and from which the thumbnails will be served. It may be the case that two machines, one configured to be a thumbnailer (with, perhaps, more memory), and another one configured to serve files (with, perhaps, a faster I/O system), would be more efficient than two equally configured machines both doing thumbnailing and serving up files. My speculation is that if this is the case, then it would be solved by adding another machine into the cluster and rebalancing files.

Overview

One of the benefits of Homebrew is that it simplifies the architecture of the Media Server to bring it back into line with a simple MediaWiki running on just one machine. It does this by eliminating NFS (Network Failure System), and by distributing all the functions among a set of hosts. The more I try to document how it works, the more I realize that, except for certain details (which I will lay out below), it functions exactly the way a simple MediaWiki works. Thus, here are the differences:

  • Files are still uploaded to one (or more) apache webservers which handle the MediaWiki upload wizard. When, in the process, the file is ready to be exposed at its public URL, the apache webserver calculates the internal URL pointing to the right machine in the cluster, and uploads it using POST, much the same as the file was uploaded to it in the first place.
  • File updates happen using a REST PUT.
  • Deletes happen through a REST DELETE. Deleted files are made unavailable to the web, but a POST to a special URL will undelete them.
  • A new concept: replication of the repository. If there are never any errors, the upload process will keep all servers in sync. Any 404 error for a full-size file should be handled by fetching the file from a sister and storing a local copy. Thus, the tendency will be for the servers to stay identical. We can run rsync periodically to make sure that happens.
  • A new concept: rack awareness. We'll need to have a configuration file to control how the hash buckets get assigned to servers. For example, if we initially set it up with 4 TB per server, and then add a new server with 8 TB, the new server should be given twice as many hash buckets. The hash buckets should be assigned to ensure that there is a full copy of everything in each rack. There could be extra copies, and there will be while things are being moved around.

I'm thinking that the configuration file should be policy-free. It should be 100% mechanism, with policy being enforced by another program which writes the configuration. A sketch of what such a mechanism-only file might look like follows the list below.

  • A new concept: computed hostnames. We might ... or might not ... choose to expose the hostnames computed from the hash in the path to the file. If we expose the hostname, then the caches don't need to rewrite the hash into the hostname. Trouble is that the client browser then needs to resolve that hostname; likely a different hostname for each file. That could slow down browsing. On the other hand, some browsers limit the number of connections to each host (e.g. to upload.wikimedia.org). That's intended to keep from trampling a host. In our case, though, the connections would actually go to different hosts, so that would speed up page loads. If we choose not to do this, and continue to present just one hostname to the public, we will rewrite the hostname in the cache, which will then resolve the hostname privately (and that lookup will probably always be cached).
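The note above calls for a policy-free, mechanism-only configuration file. Here is a minimal sketch, in Python, of what that file and a rack-awareness check might look like; the bucket names, host names, and rack labels are hypothetical, and the real format would be whatever the policy program chooses to write.

    # Mechanism only: which hosts carry which hash buckets, and in which rack.
    # A separate policy program decides this mapping and writes it out.
    ALLOCATION = {
        "7c": [("ms1", "rack-a"), ("ms5", "rack-b")],
        "7d": [("ms2", "rack-a"), ("ms6", "rack-b")],
        # ... the remaining buckets of the 256-bucket space ...
    }

    def racks_with_full_copy(allocation):
        """Return the racks that hold a copy of every hash bucket."""
        racks = {rack for assignments in allocation.values()
                 for _host, rack in assignments}
        for assignments in allocation.values():
            racks &= {rack for _host, rack in assignments}
        return racks

    def check_rack_awareness(allocation, required_racks):
        """Raise if some required rack lacks a full copy of everything."""
        missing = set(required_racks) - racks_with_full_copy(allocation)
        if missing:
            raise ValueError("racks without a full copy: " + ", ".join(sorted(missing)))

The "make it so" tooling described further down would read a file like this and copy buckets and update DNS until reality matches it.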

Redundancy

An important part of a distributed file system is redundancy. You don't want the existence of a file to be dependent upon the uptime of a single server. ...

Pathnames

The Homebrew solution relies on the front-end cache's ability to rewrite the hash part of the existing path into the hostname. We then set up 256 hostnames in a namespace which get mapped to two hosts each. Each of these hosts has a full copy of everything which maps to that hash. So, for example, this file:

http://upload.wikimedia.org/wikipedia/commons/7/7c/Russ_Nelson_2.jpg

gets rewritten to this URL:

http://upload7c.wikimedia.org/wikipedia/commons/7/7c/Russ_Nelson_2.jpg

and gets served up by the machine which responds to that name. Similarly, this thumbnail:

http://upload.wikimedia.org/wikipedia/commons/thumb/7/7c/Russ_Nelson_2.jpg/220px-Russ_Nelson_2.jpg

gets rewritten to this URL:

http://upload7c.wikimedia.org/wikipedia/commons/7/7c/Russ_Nelson_2.jpg-thumb/220px-Russ_Nelson_2.jpg
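Here is a minimal sketch of the hostname rewrite described above, assuming the hash directory always appears as /x/xy/ in the path and that the internal hostnames follow the upload<xy>.wikimedia.org pattern from the examples. Only the hostname changes in this sketch; the thumbnail path reshuffle shown above is left out for brevity, and in production this logic would live in the squid/varnish rewriter rather than in Python.

    import re

    # match the one-char and two-char hash directories, with or without /thumb/
    HASH_DIR = re.compile(r"/(?:thumb/)?([0-9a-f])/([0-9a-f]{2})/")

    def rewrite_host(url):
        """Rewrite upload.wikimedia.org to the hashed internal hostname."""
        m = HASH_DIR.search(url)
        if not m:
            return url                      # not a hashed media path; leave alone
        bucket = m.group(2)                 # e.g. "7c"
        return url.replace("//upload.wikimedia.org/",
                           "//upload%s.wikimedia.org/" % bucket, 1)

    # rewrite_host("http://upload.wikimedia.org/wikipedia/commons/7/7c/Russ_Nelson_2.jpg")
    # -> "http://upload7c.wikimedia.org/wikipedia/commons/7/7c/Russ_Nelson_2.jpg"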

Plus / Minus

Plus:

  • We're making a pack of mini-me's out of a standard MediaWiki.
    • Right now the scaler has access to both the media and thumb servers via NFS.
    • The Apache servers have access to the media and thumb servers via NFS.
    • The new system will shuffle the functionality around.
    • The media and thumb files for a particular original will be on the same host.
    • The Apache server for those files will run on that same host.
    • The scaler will run on every host, and create thumbnails as needed.
  • Existing knowledge, configuration and tools carry over.
  • Preserves existing URL space.
  • Lightweight compared to a general-purpose distributed file system.
  • Nothing at the center (other than the database which we have to have anyway).

Minus:

  • Unproven design (but then again, whatever we write to adapt our system to a third-party filesystem will also be unproven).
  • Need to write the management tools which will enforce the replication and balancing as new servers are added.
  • Locks us into a caching system which can rewrite URLs. Both squid and Varnish can do this.

RussNelson 21:36, 29 November 2010 (UTC)


Thoughts

Thoughts ... just thoughts ... not clear that they're good thoughts or bad thoughts.

  • We could run a DNS server on each host in the cluster, and let it answer (or remain silent on) DNS queries for the hash buckets it holds. This would require a custom DNS server ... but djbdns makes that trivial to implement. So, the DNS server could look at the hash buckets present on the machine, and answer the DNS queries ... or not. This has some interesting benefits. There would be no central authority for the Media Server. Machines that have the files would serve them up. Machines that don't have the files wouldn't answer the query. Hash buckets would get migrated from one machine to another simply by creating the subdirectory which is supposed to hold the files. The DNS server would notice, start answering, and when it got a query for a file it didn't have, it would go ask the other server(s) for that bucket for a copy of the file. When a machine went offline, it would automatically stop answering DNS queries. A sketch of the answer-or-stay-silent decision follows the numbered list below.
I see some difficulties with this scheme.
  1. Without some process to guarantee that all files in a hash bucket have been copied, the copying-on-read scheme would likely take forever to copy EVERY file in the bucket.
  2. The least-loaded server will answer the DNS query fastest, which is good. But how does one cluster server discover which other cluster servers are supposed to have copies of its files? This system needs to be able to accept multiple A record responses and merge them into one. That's actually not that hard, since a DNS client is just another program, and could wait for multiple responses to return on the same port. So, rather than having to modify the resolver library, which is used by every program on the machine, the cluster software could have its own "resolve-all" routine.
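The answer-or-stay-silent idea can be sketched without writing the DNS plumbing itself. The following is only the decision logic, assuming a hypothetical layout in which every hash bucket a node carries exists as a directory under /srv/media/<bucket>; the real implementation would sit inside whatever custom DNS server (djbdns-based or otherwise) ends up being used.

    import os
    import re

    MEDIA_ROOT = "/srv/media"                     # assumed on-disk layout
    BUCKET_HOST = re.compile(r"^upload([0-9a-f]{2})\.wikimedia\.org$")

    def should_answer(query_name, own_ip):
        """Return an A record for query_name if this node holds the bucket,
        or None to stay silent and let another node answer."""
        m = BUCKET_HOST.match(query_name)
        if not m:
            return None
        bucket = m.group(1)
        if os.path.isdir(os.path.join(MEDIA_ROOT, bucket)):
            return (query_name, "A", own_ip)
        return None

Migrating a bucket to a node is then just "create the directory and fill it": the node starts answering for that bucket as soon as the directory exists.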

Questions

Ops

  • How is data redundancy handled?
    • Each hash bucket (the first characters of the filename's MD5 sum; currently 2 in published URLs) holds a collection of files. Each of these files is duplicated on a second, third, or fourth (as policy dictates) server. A sketch of the bucket computation follows this question list.
      • So for redundancy of x nodes, we need x*2 (or x*3) nodes?
  • What happens when a node fails?
    • The node's hash buckets get assigned to other nodes in the cluster, and are rebuilt from the other copies (sisters) in the cluster.
  • What happens when multiple nodes fail?
    • The node's hash buckets get assigned to other nodes in the cluster, and are rebuilt from sisters in the cluster. It's possible but unlikely that all the copies of a hash bucket will be lost. If that happens, the files will be restored from a sister in another data center or from a backup.
  • What happens when a filesystem fills up on one node?
    • The hash buckets get redistributed. More likely, however, is that the policy constraints for number of copies of files and amount of free space per system won't be met, and an alarm will be raised.
      • Will this happen automatically? If so, how is it handled? If not, how will it be accomplished?
  • How is the data evenly distributed across the nodes?
    • The hash buckets get redistributed. Same as when you add a new node with no files on it.
  • How will we monitor this architecture for faults?
    • The job isn't done until there's a control panel and meters, no matter how it's implemented underneath.
  • If we wish to add authentication/authorization support to this, how will it work?
    • Shared secret and an authorization token. The cluster servers won't be visible to the outside world.
      • This isn't already supported by MediaWiki; is this another layer that will need to be written?
  • When something goes wrong, who will we contact?
    • The person who wrote the code. Same thing you'd do no matter what software was being used.
      • Only one person will be writing this code; what happens if you are no longer available at some point in time?
  • How are we going to manage the DNS records that map file hashes -> hostnames? This has to be visible to MediaWiki (to calculate hostnames) and also has to work at the DNS level.
    • No it doesn't. There will be 256 DNS records which will point to some smaller number of hosts.
      • What are these DNS records for? How are they going to get entered into DNS? Will they ever change? If so, how will they be maintained?
  • Each image-storing host has a twin.
    • Or triplet or quadruplet, as dictated by policy. I prefer the term "sister". The policy determines the mechanism, and the mechanism determines which nodes carry which hashes. If a hash bucket is published in the DNS, it can be used to rsync by other sisters. If rsync succeeds, the sister registers with the DNS server as having a copy of the hash. When a file is uploaded, it is uploaded to all sisters.
      • What do you mean with "registers with DNS"? Does this require dynamic DNS?
    • As a paranoia check, the 404 handler will look on a sister for a missing file, and save a local copy if it's found there. Presumably there is a DNS rotation to the sister that's less broken, but thereafter there has to be a recovery/catchup procedure.
      • What do you mean by DNS rotation? Do we have to replace nodes via DNS?
    • If a host has been seen to be unreachable, it is removed from the DNS server. A host will periodically check to make sure that it's listed in the DNS server as serving up the files it thinks it has. If its most recent rsync succeeded, it will relist itself.
      • Who removes the host from DNS?
  • how rack-awareness (cross-colo failure recovery) works
    • The sisters in the other racks will be consulted, but on a lower priority than any local sisters.
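For reference, here is a minimal sketch of how the hash bucket and the internal hostname could be derived from a filename. It assumes MediaWiki's usual scheme of hashing the underscore-normalized name with MD5 (which is where the /7/7c/ in the published URLs comes from) and the upload<bucket>.wikimedia.org naming used elsewhere on this page.

    import hashlib

    def hash_bucket(filename):
        """First two hex digits of the MD5 of the underscore-normalized name."""
        name = filename.replace(" ", "_")
        return hashlib.md5(name.encode("utf-8")).hexdigest()[:2]

    def internal_host(filename):
        return "upload%s.wikimedia.org" % hash_bucket(filename)

    # hash_bucket("Russ_Nelson_2.jpg") should give the "7c" seen in the example
    # URLs above, assuming the hashing really matches MediaWiki's.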

Application

  • since we aren't using NFS, a description of how the frontend upload servers are going to:
    • distribute files to these servers
      • HTTP PUT to a hostname derived from the hash of the filename (a sketch of these REST calls follows this list).
      • What happens if no server is responding to this hostname? We just can't store the file? That's a pretty weird result given that other files can be stored just fine.
        • This is a CANT_HAPPEN state; same as a crushed Ethernet cable, dead router, total power failure, etc. These kinds of things can already happen, and we'll continue to deal with them just as we do now.
    • add new versions of files
      • HTTP PUT (with mangled filenames?)
    • RARE: update files (and expire caches)
    • delete files (and expire caches)
    • undelete files
      • In MediaWiki today, deletion just means moving to a non-public area. Are there special "deleted files" servers, just as we have a deleted FileRepo zone?
        • Each server will have its own deleted zone.
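The REST operations above can be sketched as follows, using the Python requests library. The internal hostname, the path layout, and the /undelete suffix are assumptions for illustration; the actual API would be defined by the cluster software.

    import requests

    def store_file(host, path, fileobj):
        # initial upload or new version: PUT the bytes to the owning host
        return requests.put("http://%s%s" % (host, path), data=fileobj, timeout=30)

    def delete_file(host, path):
        # delete: the file becomes unavailable on the web but stays in the
        # server's deleted zone
        return requests.delete("http://%s%s" % (host, path), timeout=30)

    def undelete_file(host, path):
        # undelete via a special POST, as described in the Overview;
        # the "/undelete" suffix is a made-up convention for this sketch
        return requests.post("http://%s%s/undelete" % (host, path), timeout=30)

    # e.g. store_file("upload7c.wikimedia.org",
    #                 "/wikipedia/commons/7/7c/Russ_Nelson_2.jpg",
    #                 open("Russ_Nelson_2.jpg", "rb"))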


  • are there multiple thumbnail servers? Or are you saying each host is a thumbnailer too?
  • how do we assign a thumbnail server to a particular thing that needs scaling
  • where thumbnails go if they are stored permanently at all
    • Each host creates and serves up the thumbnails for its files. Each one is a stand-alone MediaWiki ... except that it's only for a portion of the files.


  • Multiple hosts = multiple hostnames for downloading?
We have that whole upload.wikimedia.org versus the hashed hostnames you suggest here; is there a transition to multiple hostnames, or do we redirect, or have some frontend server manage this transparently?
  • I was originally thinking that we might want to expose the hostnames to the rest of the world. There are good reasons for doing that, but I think better reasons for continuing to rewrite from upload.wikimedia.org to an internal hashed hostname.
  • I think we want to avoid exposing such hostnames to the world at all costs. If we ever wanted to switch away from such a system we would be stuck (people copy URLs to these images all over the place). I suppose we could maintain redirects forever, but that's ugly. And if we expand the hash system to go another level out, or we need to shuffle files around... eh. Let's have just one level of rewrites, shall we? -- ArielGlenn 15:57, 14 December 2010 (UTC)


  • security
    • everyone can read public files
    • presumably the upload servers, and the host pair twins, have remote write privs
    • if not via NFS mount, how?
      • REST


  • near-future application use cases
An appropriate answer to these questions may be that "it's handled at the PHP application level". However, since some DFS systems have authorization hooks, it is appropriate to note that at least the Homebrew system doesn't add anything here.
    • avatars: for the proposed enhancements of LiquidThreads and other discussion systems. Small images that are controlled by the user, but visible to all.
    • incomplete files: images that are uploaded but not yet published. Controlled by the user, and only visible to the user.

Concerns

  • This solution will likely only be used by us. It will have no community behind it. We will have to fully maintain this solution.
    • The same is true of any part of MediaWiki. If the alternative were to use software which had a userbase, we would certainly follow that path. Unfortunately, we have no choice but to write custom software; either to clusterize the existing repo, or to adapt the repo code to use an existing filesystem. That software will have no user base and we will have to fully maintain it; we have no choice in this.
      • Maintaining a new FileRepo class is much easier than maintaining a filesystem. A filesystem is a major undertaking, and requires a lot more domain specific knowledge than writing an interface to a filesystem's API.
  • Building a filesystem from scratch, no matter how simple the design, is not a matter to be taken lightly. A simple design often leaves holes that must later be filled. It is very likely we'll end up adding most of the features found in other DFSs when we realize our design is lacking a feature we need badly.
    • Or we'll find that we need a patch which is not needed by anybody else, and we'll have to maintain it in the third-party DFS forever. The future is closed to us. We can only base present actions on our guesses about the future.
      • We patch software all the time. Often it is accepted upstream and maintained by others. You are comparing a theoretical situation to a definite, absolutely going to happen scenario. We may use a DFS and never have a need to patch it.
  • A custom solution is almost always more difficult for operations to maintain. The operations team is already overburdened. A DFS with proper documentation and a strong community is a major factor in a decision like this.
    • We don't have a choice about a custom solution. We stick with the MediaWiki code we have, and write back-end redundancy management code, OR we write a custom solution which interfaces to whatever API the DFS presents us with ... with all of its warts and wiggles.
    • We don't have a choice about a custom solution. Do we want the custom solution to be in the path of minute-by-minute operation? Or do we want it to be in the path of week-by-week operation? Homebrew is designed so that the critical path to files has zero custom code in it; just squid rewriting of URLs. We are relying on time-tested code here, and not a brand-new interface to somebody's DFS. If there's a bug in Homebrew, it will show up during work hours. If there's a bug in DFS interface code, it will show up with 2/3rds likelihood outside of work hours. Homebrew is the conservative solution; adapting MediaWiki to use somebody else's DFS is the radical untested solution.
      • I fundamentally disagree with this. The system you are proposing is a DFS, with a REST API. We are just as likely to run into problems with this as we are to run into problems with another solution. I'd wager we are more likely to run into problems with a custom built system. (Ryan Lane)
      • This is not my definition of a custom solution. You seem to be arguing that because we have to adapt a FileRepo interface to a network API anyway, all solutions then become "custom". Here's how I see it: it is a given that we have to adapt the FileRepo code to handle the complexities of network APIs. This should be the easy part since the FileRepo methods already return success/failure or sometimes compound Status objects that encapsulate both a flag of success or failure and more detailed results. But with HomeBrew, we have the additional difficulty of writing the "other end" of that API and making it reliable. It's almost as if your perception here is flipped from what I perceive to be the norm -- that calling an API is hard but writing one from scratch is easy. Especially about something like transparently redundant file storage! NeilK 01:13, 14 December 2010 (UTC)
  • In the "minuses" it says: "Need to write the management tools which will enforce the replication and balancing as new servers are added." (Plus other similar tools.) In my mind that's the tough part. Getting a file onto or off of the disk or even renaming it are not hard things. It's all the extra crud that is hard to get right, and it's what has been killing us for quite some time. How did we wind up on ZFS? Because we wanted up to the minute (or at least the 5 minutes) redundancy, as well as the local snapshot capability. And it turned out that getting that with a reliable FS was hard. Sure, we need to scale. But that is not the only thing that needs to be changed in our infrastructure. A single MediaWiki instance running on one host, while simple, does not have these features which a large site needs. We can hack them together ourselves and go through all the pain that entails, or we can rely on the work of others to some degree, which will be maintained by others as well. Don't get me wrong, I understand very clearly the downsides to using a third party package: it doesn't quite do what you want, the bugs are a PITA, upstream patching, sudden changes with a new version, etc. etc. Nonetheless, people write packages to be used by others precisely so that we don't all have to write our own. What a waste of resources that would be, right? Anyways, I think we don't want to downplay the development of these management tools as something that can be done in just a little time and then we can rest easy. -- ArielGlenn 15:51, 14 December 2010 (UTC)

Ops

I believe that this is the best choice from an operational standpoint, for several reasons:

  • Everything breaks. That is not in question. The question is: when is it going to break, and how hard is it going to be to fix? Things can break at 2PM during daylight work hours with awake sysadmins, or they can break at 2AM with sleepy sysadmins. Sysadmin work between 2AM and 8AM is not merely unpleasant; it is risky. People make poor tactical decisions when deprived of sleep. Thus, we should design systems so that the risk is concentrated in activities initiated by sysadmin action during working hours, rather than in failures that can strike at any hour.
    • I see nothing supporting your argument that this homebrew solution is any less likely to break at odd hours, or in any way that will be easier to fix than any other system. I'd argue the exact opposite. With a well supported third party system, we have the backing of a community that has already likely run into problems we may run into. We'll have the ability to search mailing lists, FAQs, official documentation, blogs, etc. When we run into problems with a homebrew system, we'll have to rely on contacting a single author. This is extremely problematic for Ops people, especially if they aren't the ones who wrote the system.
  • This solution has been used by us for many years. The only thing that we are changing in the way of reading articles is hostname rewriting, to include the hash in the hostname. It is very very unlikely for this to introduce any risk of failure. Thus, the risk to reading articles will remain the same. Not so if we introduce code to read files off a cluster. The cluster could fail; our code could fail; they could fail at any time of the day or night, and we will need to rely on outside expertise; expertise which will not be familiar with how we are using the cluster.
    • This is not something we have used for years. We are using NFS. What you are proposing is a distributed abstraction of what we currently use, but it is different and will behave totally differently.
  • We need new code to handle file uploading no matter what, so this will be a risk. Right now, the upload code touches the filesystem. Since we have ruled out using FUSE and NFS, we will need to use an API. The API will be hitting the network, so it will need the complexity of handling network failure. The complexity of mapping our needs to their API (versus creating an API which has simplicity designed-in) means that there is no obvious difference in risk between a packaged DFS and Homebrew.
  • Same thing for renaming, moving, and deleting. Need an API.
  • Thumbnailing in Homebrew will run in the node's filesystem. It works the same as the standard MediaWiki code. Not so for the other DFSes under consideration. Likely we will have to copy files out of the DFS onto the thumbnailer, create the resized copy, and write it back into the DFS.
    • We don't currently do thumbnails on the media servers. There are pros and cons to this approach. This will need to be investigated more to see if the pros outweigh the cons.
  • For Homebrew, there is nothing at the center except a DNS server, which is designed for replication. Mogile requires a database server, which needs to be replicated. GlusterFS uses a hash which isn't centralized.
    • This is just as much a point of failure as other systems...
  • So ... in the critical minute-by-minute path, Homebrew wins on reliability. But Homebrew needs the replication pieces that other DFSes come with.
    • When a file is changed (uploaded, updated, renamed, or deleted), the change needs to happen on all nodes. If a node fails, it must be marked as failing.
    • When a squid detects that a node is giving us trouble, that node/hash combination must be marked as failing.
    • We need to have a configuration file which describes our policy (three copies of everything, with a per-node weighting of usage) which determines which node serves which hash.
    • We need a program which takes that hash/node allocation and "makes it so" by copying hashes and modifying DNS.
    • We need a program which does consistency checking. It must look into the folders to make sure that 1) files which should be there are, 2) files which should not are not, and 3) the DNS matches the hash allocation.
    • Actually, thinking about it, the latter is just a dry run of the former, so they should be the same program with a "-n" (do nothing) option. A rough sketch of that idea follows this list.
  • ^^ I ask again, why are we considering re-writing a DFS from scratch? --Ryan Lane 00:39, 14 December 2010 (UTC)
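To make the last few bullets concrete, here is a minimal sketch of the "make it so" / consistency-check tool as a single program with a -n (do nothing) option. The allocation format, directory layout, and node name are hypothetical; the real tool would also have to drive the rsync runs and the DNS updates.

    import os
    import sys

    MEDIA_ROOT = "/srv/media"                    # assumed on-disk layout

    # Desired state, as written by the policy program: bucket -> nodes.
    DESIRED = {
        "7c": ["ms1", "ms5", "ms9"],             # three copies of everything
        "7d": ["ms2", "ms6", "ms9"],
        # ... remaining buckets ...
    }

    def local_buckets():
        if not os.path.isdir(MEDIA_ROOT):
            return set()
        return set(os.listdir(MEDIA_ROOT))

    def plan(node):
        """Compare the desired allocation with what this node actually holds."""
        wanted = {b for b, nodes in DESIRED.items() if node in nodes}
        have = local_buckets()
        return sorted(wanted - have), sorted(have - wanted)

    if __name__ == "__main__":
        dry_run = "-n" in sys.argv
        missing, extra = plan("ms1")             # hypothetical node name
        print("buckets to fetch:", missing)
        print("buckets to drop: ", extra)
        if not dry_run:
            # here the real tool would rsync the missing buckets, drop the
            # extras, and update the DNS records accordingly
            pass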