Obsolete talk:Media server/Distributed File Storage choices

I haven't been closely involved with this, and don't have a ton of time to devote to it over the next couple of days, but just a couple of questions:

  • The requirements are expressed in terms of software libraries, but do we have any requirements about...
Requirements are stated more verbosely in Obsolete:Media server/2011_Media_Storage_plans. RussNelson 00:59, 1 December 2010 (UTC)
  • consistency or availability, particularly cross-colo?
  • performance?
  • scaling?
  • horizontally -- we add a new server, what happens to all the old files?
I believe that the presumption is that the new server takes over its fair share. It's my presumption, anyway. RussNelson 00:59, 1 December 2010 (UTC)
  • "diagonally" -- we want to add some servers with newer hardware, what's the procedure for balancing faster and slower machines?
Not planning to do this automatically, because I don't think we understand that problem yet. There will be a manual procedure for storing more or fewer files on each machine. RussNelson 00:59, 1 December 2010 (UTC)
How is this handled by existing third party code? -- ArielGlenn 20:28, 5 December 2010 (UTC)
In my survey of DFSs, I didn't see anybody claim to have solved that problem. RussNelson 15:58, 8 December 2010 (UTC)
  • I assume that vertical scaling is right out. :)
Gosh, it's worked so far! RussNelson 00:59, 1 December 2010 (UTC)
Or is it just that all the options here are roughly the same?
  • MogileFS is listed as not having files available over HTTP, but why? Doesn't it run over HTTP?
  • I strongly disagree with the requirement that files should be stored in the physical filesystem under a name similar to their title. I realize this is how MediaWiki works, but I would rather not have this baked into our systems going forward.
That was listed as a Want, not a Need, but I get your point. RussNelson 00:59, 1 December 2010 (UTC)
Ariel seems to believe this offers some sysadmin convenience, but I think that comes at a HUGE penalty for the application design. When it comes to media uploads, many files could legitimately have the same title, and we also have the issue of the file extension to deal with (as a wart on the title). Media files should be stored under guaranteed unique ids or an id generated from a content hash. Titles should be stored in a secondary place such as a database.
The namespace is flat (AFAIK) and so no, they can't have the same title. RussNelson 00:59, 1 December 2010 (UTC)
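A minimal sketch of the id-based storage NeilK describes above, assuming a content hash as the storage key and a database table for the title mapping; the function, table, and paths here are invented for illustration, not existing MediaWiki code:

```php
<?php
// Hypothetical sketch only: store a media file under a key derived from its
// content hash, and record the human-readable title separately in a database.
// Function, table, and path names are invented for illustration.

function storeMediaFile( string $localPath, string $title, PDO $db ): string {
	// Content-addressed key: identical bytes always map to the same key.
	$key = sha1_file( $localPath );

	// Lay files out by hash prefix so no single directory grows too large.
	$dest = sprintf( '/srv/media/%s/%s/%s',
		substr( $key, 0, 1 ), substr( $key, 0, 2 ), $key );
	if ( !is_dir( dirname( $dest ) ) ) {
		mkdir( dirname( $dest ), 0755, true );
	}
	copy( $localPath, $dest );

	// The title lives in the database, not in the filesystem path, so a
	// rename never touches the stored bytes.
	$stmt = $db->prepare( 'INSERT INTO media_index (title, storage_key) VALUES (?, ?)' );
	$stmt->execute( [ $title, $key ] );

	return $key;
}
```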

IMO Mogile ought to be an acceptable alternative (since I disagree with the two strikes it has against it in your current matrix). LiveJournal uses it for relatively similar use cases, and it is what everybody in the Web 2.0 space copied, including Flickr.

I'd like to hear a bit more about the "Homebrew" solution as it seems to rely on DNS and does not allow arbitrary files to be associated with arbitrary hosts. Maybe we can get away with this, as, unlike the Web 2.0 world, we don't have to do complex lookups related to permissions or other user characteristics just to serve a file. But I confess I am a bit skeptical that we've come up with something that the rest of the industry has missed, unless it can be shown to flow directly from our more relaxed requirements about (for instance) affiliating a file with a user, checking privacy, licensing, etc.

Groups of files (which have the same hash) are associated with arbitrary hosts. Currently we have 13TB; spread over 256 hash buckets, that's roughly 50GB per file group (13 TB / 256 ≈ 51 GB). It's straightforward to scale that to 65536 hosts (256 × 256) by using the secondary hash. RussNelson 00:59, 1 December 2010 (UTC)

The title->hostname hashing algorithm I see discussed there seems to suggest that all servers ought to have equal performance characteristics too, which won't be the case. Unless we have 256 "virtual" servers that need not map one-to-one onto the actual set of machines. How does this work if we have more than 256 machines?

Yes, that's exactly how it works. We apportion the file groups to servers as their performance dictates. Once we have more than 256 machines, we expand the hostname to include the secondary hash. We can roll out all of these changes on the fly by ensuring that both systems work and are coherent. In fact, we can probably go from having the existing Solaris machines to the cluster on the fly. As we go, adding machines, we copy files off the Solaris machines. Adding the 1st machine in the cluster should work identically to adding the Nth machine into it. RussNelson 00:59, 1 December 2010 (UTC)
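For concreteness, a rough sketch of that apportionment, assuming a hand-maintained map that hands buckets (the first two hex digits of the filename's md5sum) to hosts according to their capacity; the hostnames and the particular split are invented:

```php
<?php
// Hypothetical sketch of the "file group" apportionment described above.
// The first two hex digits of md5(filename) name one of 256 buckets; a
// manually maintained table gives bigger machines more buckets. Hostnames
// and the 0xc0 split below are invented for illustration.

function bucketFor( string $filename, int $hexDigits = 2 ): string {
	// 2 hex digits => 256 buckets; each extra digit multiplies that by 16.
	return substr( md5( $filename ), 0, $hexDigits );
}

function hostForBucket( string $bucket, array $bucketMap ): string {
	if ( !isset( $bucketMap[$bucket] ) ) {
		throw new RuntimeException( "no host assigned to bucket $bucket" );
	}
	return $bucketMap[$bucket];
}

// Example map: a faster machine takes buckets 00-bf, a slower one c0-ff.
// Rebalancing means editing this map and migrating whole buckets, not files.
$bucketMap = [];
foreach ( range( 0, 255 ) as $i ) {
	$bucketMap[sprintf( '%02x', $i )] = $i < 0xc0 ? 'ms-fast-1.example.net' : 'ms-slow-1.example.net';
}

echo hostForBucket( bucketFor( 'Example.jpg' ), $bucketMap ), "\n";
```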

NeilK 22:47, 29 November 2010 (UTC)

I'd like to know more about what pieces would need to be written for the "homebrew" solution as well. At least a repo method for uploads would have to be done. I'd also like to know what Varnish URL rewriting capabilities exist. If the rewriting piece turns out to hamper our choices for caching in any way, I would be opposed to it on principle. We would want to think about workarounds.
Homebrew pieces: 1) replication and migration tools. 2) We actually need to shuffle up the existing system, because we need to do image scaling without NFS mounts of a central computer. I think we should do the image scaling on the cluster hosts themselves. We didn't do that in the past because the central big machine needed to stick to its knitting, but clustered machines can serve multiple functions. We scale by adding more of them.
I agree that being able to rewrite URLs is an absolute requirement for Homebrew. Once I finish editing this (and sleep) it will be the first thing I look at tomorrow. RussNelson 00:59, 1 December 2010 (UTC)
As to filenames that can be humanly read, I have as a want that the filename on wiki be similar to the filename as stored (embedded in it, for example, as we do for the archived filenames). Why? Because if we ever have buggy mapping (and believe me, we have had all kinds of bizarre bugs with the current system), then we will be in a world of hurt trying to sort out which files go with which names. But if the filename is at least a part of the file as stored, we have a fighting chance of finding things gone missing or straightening them out. I would like to build in robustness of this sort wherever possible. Having said that, I listed it as a "want" because it's up for discussion, not set in stone. -- ArielGlenn 23:18, 29 November 2010 (UTC)
I see the problem with remapping, but that's only because MediaWiki thinks that the primary key is the title. It really should be the other way around: we should map the file id to a title. Then again, we have to deal with the system as it is. NeilK 23:22, 29 November 2010 (UTC)
Ariel noted to me that we need to preserve the existing project/hash/2ndhash/Title.jpg and thumbs/project/hash/2ndhash/Title.jpg/px-Title.jpg URLs. If we give up on titles in filenames and actually store files under arbitrary tokens, then we need an index which maps those names to the tokens. RussNelson 00:59, 1 December 2010 (UTC)
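One possible reading of that "want", sketched with invented names: even if files are keyed by generated tokens, the on-wiki title could be embedded in the stored filename the way archived filenames already embed a timestamp, so a broken index can still be reconstructed from the filesystem.

```php
<?php
// Hypothetical sketch only. Even when files are keyed by a generated token,
// embed the on-wiki title in the stored filename (as archived filenames
// embed a timestamp), so the mapping is recoverable by walking the disk.

function storedNameFor( string $token, string $title ): string {
	// e.g. "<token>!Some_title.jpg" (the separator is invented)
	return $token . '!' . str_replace( '/', '_', $title );
}

// Recovery direction: if the database mapping is ever suspect, the title
// can be read straight back out of the stored name.
function titleFromStoredName( string $storedName ): string {
	$pos = strpos( $storedName, '!' );
	return $pos === false ? $storedName : substr( $storedName, $pos + 1 );
}
```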

Notes on homebrew

Looking more closely at the existing URLs, they contain the first hex digit, and then the first two hex digits, of the md5sum of the filename. Thus, we have at most a 256-way split with the existing filename structure. If we have machines with, say, 4TB on them, then we can serve up at most a petabyte (256 × 4 TB). Given that we have 13TB now, a petabyte is only about an 80X expansion (1 PB ÷ 13 TB ≈ 80). I don't think a design which only supports that much expansion is reasonable.

One of the constraints is to preserve these URLs. There are two sources of the URLs, however: our own, which we hand out on the fly (and which we are free to change), and URLs presented to us from an external page. We only really need to preserve the latter. Depending on how many there are, it's possible that we could get away with a machine or two rewriting the URLs and re-presenting them to the caches.

If we're then handing out a new kind of permalink, we can create one which scales between machines better. For example, we could take the first three hex digits of the md5sum (256 × 16, or 4096 unique values) and put them into hostnames. Those hostnames would map onto a relatively small set of servers which would serve up the file in the usual manner. For reliability, we could keep duplicate copies of each of these 4096 bins of files and resolve each hostname into multiple IP addresses.
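To make the last two paragraphs concrete, a sketch (with invented domain names and URL layout) of deriving the 4096-bin permalink hostname from a filename, and of rewriting a legacy path into it:

```php
<?php
// Hypothetical sketch of the 4096-bin permalink scheme described above.
// The first three hex digits of md5(filename) pick one of 4096 bins; each
// bin gets its own hostname, and DNS can hold several A records per name.
// Domain names and URL layout are invented for illustration.

function permalinkFor( string $project, string $filename ): string {
	$md5 = md5( $filename );
	$bin = substr( $md5, 0, 3 );             // 16^3 = 4096 possible bins
	$host = "ms-{$bin}.upload.example.org";  // one DNS name per bin

	// Keep the familiar hash directories in the path so the on-disk layout
	// on each server can stay close to what exists today.
	return sprintf( 'http://%s/%s/%s/%s/%s',
		$host, $project, $md5[0], substr( $md5, 0, 2 ), rawurlencode( $filename ) );
}

// Rewriting a legacy URL is then just pulling the filename back out of the
// old path and recomputing; the old hash digits are ignored and re-derived.
function rewriteLegacyPath( string $path ): ?string {
	if ( !preg_match( '!^/([^/]+)/[0-9a-f]/[0-9a-f]{2}/([^/]+)$!', $path, $m ) ) {
		return null; // not a legacy project/hash/2ndhash/Title.jpg path
	}
	return permalinkFor( $m[1], rawurldecode( $m[2] ) );
}

echo rewriteLegacyPath( '/commons/a/ab/Some_title.jpg' ), "\n";
```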

One of my hesitations about the homebrew solution, besides the fact that we wind up maintaining another non-trivial piece of code, is that as described it won't be general purpose, so we will have spent a decent chunk of time developing something that is useful pretty much for us only. If we put in work making one of the third-party options work for us, that work can go back to the upstream project, at least in theory, and benefit other folks. I realize that's a bit of a meta reason to favor one choice over another, but I think we should keep it in mind anyway. Another point is that a third-party project with a decent community behind it has exactly that: a decent community of developers. If we go with homebrew, we'll have 1, maybe 2 people who know the code well, and who are probably immediately booked on other things once it's deployed. -- ArielGlenn 20:28, 5 December 2010 (UTC)
I agree with this concern 100%. Maybe 101%. Or more. The only reason I'm taking Homebrew seriously is that we have four choices:
  1. Use somebody's DFS with POSIX semantics using FUSE,
  2. a DFS mounted via NFS,
  3. Write PHP code that talks to a DFS API, or
  4. Write PHP code to a REST API that implements our Repo modification needs.

I'm eliminating #1 because I'm dubious about putting FUSE into a production path. There's a REASON why you do some things in the kernel. I'm eliminating #2 because NFS has always had reliability problems which I think are inherent in the design. Both of these solutions are code-free: we just configure the DFS to put files where we have always been mounting them. I don't believe this is the correct path. Requirements and capabilities change over time, which means that a newly-implemented solution will make different trade-offs between requirements and capabilities. Trying to preserve existing capabilities in the face of different requirements just pushes changes off into the future. Sometimes that is the right thing to do, when you don't understand the problem well enough. I think that right now we do understand the problem of a bigger Media Server. Thus, it's time to write code.

By the previous paragraph's reasoning, we are writing code. To address your concern, we should try to write the simplest, most understandable, most reusable code possible. On the one hand, that could be something like Domas' PHP interface to Mogile (proven to be reusable, since somebody already IS reusing it). On the other hand, that could be extending the FileRepo code so that it splits the store over different machines but preserves the structure of the local store, thus eliminating entities and configuration. I think the Homebrew solution will result in our removing configuration variables and code of which we are (already) the only user. RussNelson 15:58, 8 December 2010 (UTC)
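A rough illustration of that last idea, under the assumption (this is not the real FileRepo API) that the only new moving part is a bucket-to-host lookup placed in front of the existing path layout:

```php
<?php
// Hypothetical illustration only; not the real FileRepo API. The point is
// that the local-store path layout stays exactly as it is today, and the
// single new ingredient is a lookup of which machine owns a given bucket.

class SplitStoreSketch {
	/** @var string[] map of two-hex-digit bucket => storage hostname */
	private $bucketMap;

	public function __construct( array $bucketMap ) {
		$this->bucketMap = $bucketMap;
	}

	/** Same project/x/xy/Name layout the current local store already uses. */
	public function relativePath( string $project, string $name ): string {
		$md5 = md5( $name );
		return sprintf( '%s/%s/%s/%s', $project, $md5[0], substr( $md5, 0, 2 ), $name );
	}

	/** The only addition: which machine owns this file's bucket. */
	public function hostFor( string $name ): string {
		return $this->bucketMap[substr( md5( $name ), 0, 2 )];
	}

	public function urlFor( string $project, string $name ): string {
		return 'http://' . $this->hostFor( $name ) . '/' . $this->relativePath( $project, $name );
	}
}
```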