User:Andrewbogott/wmcs on a public cloud

From Wikitech

I originally wrote this in February of 2024 for internal discussion, and it has since been updated and discussed off-wiki among members of the WMCS team, including Arturo, Taavi, Francesco, and Vivian.

Why Don’t You Just Run This on a Public Cloud?

It is unusual for an organization the size of the WMF to run its own bare-metal servers. It is even more unusual for a team the size of WMCS (10 or so people) to run its own public cloud. From the outside, it often appears that we are doing things the hard way, and that the scope of our work cannot possibly be manageable.

It’s all true. The WMF does things the hard way, and the WMF always tackles problems that are comically huge compared to the size of the foundation. We also often do things the right way, and we usually tackle problems that need to be tackled.

The answers to how we do all that we do can be found all over the place; this page is meant to tackle the question of why. Why do we run our own cloud, when we could just pay AWS? Why run our own platform as a service when we could just run things on Heroku? Why buy servers, and pay for data centers rather than renting servers from Digital Ocean? In truth, we periodically consider all these options, and more. Nonetheless, any time we step close to the outsourcing ledge, we’re pulled back for one reason after another. Here are some of those reasons.


Values

FOSS

The WMF in general, and WMCS in particular, are strong advocates of Free and open-source software. We require our users to run FOSS, and all of the software installed to run our platform is Free and Open Source.

There will probably always be a bottom layer to our technical stack that involves proprietary technology. Currently that basement is embedded within server hardware: firmware, chips, routers, etc.

In a public cloud, we would still be able to enforce FOSS standards on our users for the software that we install. Nonetheless, several layers of the stack would (or at least could, outside of our control) move into the proprietary basement: virtualization, cloud UI, storage, etc. We would also lose control over hardware purchases themselves, where there are no perfect choices but nevertheless better and worse ones.

Privacy + Security

As a matter of current policy, the WMF stores all PII and other confidential data on-site, where it can be technically protected by our security engineers and legally protected by our in-house legal staff. WMCS virtual servers do not have direct access to on-wiki PII. That doesn’t mean that there is no PII hosted in Cloud VPS:

  • WMCS-hosted projects may generate their own PII, unrelated to on-wiki use. Our terms of use require that this collection be disclosed, but it is otherwise permitted.
  • Wiki replicas are partially sanitized but still contain confidential data (PII and redacted history) that is only protected via live database views. Replicating our current replica model in a public cloud would necessitate storing protected data off-site where it might be more subject to theft or subpoena.
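The "database view" protection mentioned above can be illustrated with a small, self-contained sketch. The table and column names below are invented for illustration; the real Wiki Replicas schema is far larger and more involved:

```python
import sqlite3

# Minimal sketch of sanitization-by-view: the raw table holds a
# confidential column, and replica users are granted access only to a
# view that omits it. (Invented schema; not the actual replica layout.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision_raw (rev_id INTEGER, rev_comment TEXT, actor_ip TEXT)")
conn.execute("INSERT INTO revision_raw VALUES (1, 'public edit summary', '192.0.2.7')")

# The view exposes only the public columns.
conn.execute("CREATE VIEW revision AS SELECT rev_id, rev_comment FROM revision_raw")

cols = [d[0] for d in conn.execute("SELECT * FROM revision").description]
print(cols)  # → ['rev_id', 'rev_comment']
```

The key property is that the confidential column never leaves the database server, which is exactly what becomes harder to guarantee once the underlying tables live on third-party hardware.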

Forkability

All Wikimedia content is licensed under Creative Commons licenses, to ensure that content can be shared and improved on by future readers. Similarly, our tech stack design is informed by the goal of forkability, meaning that our infrastructure and our content can be duplicated and reproduced in a new place by new people without requiring permission or assistance from the WMF.

  • All user, admin, and design documentation is public.
  • Hardware, software, and system configuration happen in public, and all code can be read or cloned by anyone at any time.
  • Information about the workloads our users are running on our cloud is also made public via interfaces like openstack-browser.toolforge.org.
  • Decisions, motivations, and processes are published and performed as publicly as possible.
  • All software is FOSS; any software developed in-house is made available to outside users with all source code freely available.
  • To the degree that our time and resources allow, we encourage and support third-party use of projects developed by WMF staff.

A move to the public cloud will reduce the forkability of WMCS platforms. Presumably if we migrate to a public cloud we’ll make use of cloud-provided services for storage, databases, secrets, orchestration, etc. Many of these features will make use of cloud-proprietary technology; some configuration will become more difficult to publish as it moves out of git-managed text files and into cloud-provider-managed web interfaces.

Paying people, not companies

The Wikimedia Movement is self-organizing and largely non-hierarchical. Politically and economically it more closely resembles an anarchist collective than a corporation. The WMF itself is a not-for-profit corporation which seeks to serve the demands of the Wikimedia community and, more broadly, the cause of information freedom.

Given this context, donors to the WMF might reasonably expect their donations to be spent primarily in support of similarly-aligned projects and organizations. WMF staff are, if not explicitly anti-capitalist, generally committed to causes that transcend and conflict with the drive for profit. We prefer to spend our dollars and our time building the commons rather than contributing to the financial wealth of specific individuals or companies.

Public clouds are typically for-profit institutions. A move of infrastructure to a public cloud would redirect a portion of our effort and our donors’ funds out of the commons and into shareholder pockets. Given a choice, we prefer to pay humans to build things rather than pay companies for things.

Distributed and decentralized internet

The original idea of the internet was to have a distributed and decentralized system, one that could survive the disconnection or failure of any of the nodes, or the organization supporting such nodes. This idea still holds true for many actors in the industry. At the same time, the promise of cheap, reliable, elastic and disposable systems presented by the major cloud providers is challenging this idea.

From this angle, one may consider that running any Wikimedia-related service on any of the major cloud providers is the wrong strategy, especially considering that we already run in our own datacenters, on our own hardware.

Value

Hardware Cost

Comparing costs between our current model and a public cloud model is hard, in part because there are many unanswered questions about how our infra would look if detached from a physical WMCS datacenter.

Metal-as-a-service or Openstack-on-Openstack

If we transplanted our existing stack onto a cloud service provider, our existing workflows would remain largely the same. We would spend a little bit more time dealing with the cloud provider itself, but less time troubleshooting hardware and interacting with dc-ops, most likely for a net savings.

Public cloud resources are more expensive in raw terms, but the flexibility of cloud capacity would allow us to expand and contract as needed; with less slack needed in our hypervisor capacity, we could maintain a much smaller RAM/CPU footprint than we keep now. Some preliminary estimates suggested hosting costs of around $900,000 per year to host our current infra on AWS, but with long-term rental agreements and more aggressive overprovisioning we could likely drop that number considerably, possibly by as much as two thirds.

We have around 150 servers, on a 5-year refresh schedule. Cost per server varies, but we probably spend an average of around $200,000 per year on those refreshes. Data center costs (largely rent and power) are more obscure as they are shared with the rest of WMF operations, but they are non-trivial.
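As a rough sanity check, the figures above can be combined into a back-of-the-envelope comparison. All inputs are the approximate numbers quoted on this page, not real accounting:

```python
# Back-of-the-envelope arithmetic using the rough figures quoted above.
servers = 150                      # current fleet size
refresh_years = 5                  # hardware refresh cycle
refresh_budget = 200_000           # approximate USD/year spent on refreshes

servers_refreshed_per_year = servers // refresh_years
cost_per_server = refresh_budget / servers_refreshed_per_year
print(servers_refreshed_per_year)  # → 30
print(round(cost_per_server))      # → 6667 (USD per server, amortized)

aws_estimate = 900_000             # preliminary AWS hosting estimate, USD/year
optimized = aws_estimate / 3       # after cutting "as much as two thirds"
print(round(optimized))            # → 300000
```

Even the optimistic cloud figure sits well above the hardware refresh budget, though the refresh budget excludes shared datacenter costs (rent, power) and staff time, so the comparison is indicative at best.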

Transit would be more expensive in the public cloud due to the lack of free peering, but we don’t have good numbers about WMCS transit usage.

It is possible that, with vigilance and aggressive FinOps work, a largely-intact shift of our stack to cloud-provided servers would not result in a cost increase for our team, and might be a savings for the WMF as a whole. All of these numbers would require much more scrutiny to draw a specific conclusion.

Cloud-native adoption

The alternative to a full transplant of our current stack into a public cloud would be a full adoption of cloud-native services. Toolforge would become a WMCS-maintained API pointing directly at cloud-provided Kubernetes-as-a-service endpoints; Cloud VPS projects would be mapped directly onto WMF-funded billing accounts; etc. Estimating the cost of such a move would be very complicated, as we would ultimately be providing (and purchasing) radically different services and resources.

It would be difficult to maintain many of our special, collaboration-enabling features with this move. Our auth and membership model is fundamentally more open than that provided by default by public clouds, so we would either lose such collaboration features or spend considerable engineering effort re-implementing them.

It’s very difficult to predict whether this move would reduce or increase engineering effort. One case study suggests that a move between cloud and self-hosted services involves different work, but not necessarily less work. Most WMCS staff effort (about six out of nine FTEs) is in pursuit of custom community-facing platforms that could not be replaced by public cloud services; in this realm there would be little savings.

Workload type

The workloads that run on Cloud VPS are often “bursty”, which is arguably the worst type of workload from a cost-effectiveness point of view on public clouds. [citation needed]

Practicality

Autonomy

Public clouds will always make design and product choices based on their needs, not ours. Our reliance on stock hardware and open source software means that, in the worst-case scenario, we can fork and maintain our own products when something is discontinued upstream. If we rely on an external service provider, we will be entirely subject to their decisions about whether or not to support the foundations that our products are built on. We will also run some risk of vendor lock-in, which could result in uncontrollable cost increases.

This risk could be mitigated by using a dual-cloud approach (to provide an easy backup migration path) or by limiting ourselves to cloud providers that offer extremely standardized products (e.g. bog-standard OpenStack). There would nevertheless be considerable engineering effort involved in any change between providers.

A current (tiny) example of this issue is our reliance on Rackspace for the hosting of wikitech-static.wikimedia.org; it makes use of a VPS product that Rackspace is likely to deprecate shortly, requiring engineering time to move and rebuild.

Migration path

Moving to a new platform with different features and limitations will require a substantial front-loaded effort. In some cases we may need to discontinue public-facing services in order to accommodate limitations of our new cloud platform.

Full engineering rebuilds are always costly and unpredictable; the benefits would need to be considerable to justify the effort, risk, and transitional expense of supporting both systems during transition.

Fungibility and abundance

Each Cloud VPS project closely resembles an ordinary public cloud account. We could, if we so desired, install telemetry software and produce exact metrics and ‘bills’ for transit, CPU usage, etc. The fact that we don’t do that is a feature, not a bug. Our approach to granting resources is largely decoupled from expense; we try to maintain a context of abundance, where any reasonable resource request will be granted, and even extreme resource requests are granted if the use case is compelling.

If Cloud VPS moves to a public cloud, this feeling of abundance will be rapidly replaced with a balance sheet. Rather than an open playground for volunteers, we’ll be in danger of becoming cost-conscious grantmakers who need to carefully track and justify every user expense. This change would be more financially responsible, but it would also be a sharp departure from the ‘anyone can edit’ ethos that Cloud VPS currently embodies.

We could, of course, avoid this descent into bean-counting by starting out with a fixed cloud budget and not tracking expenditures within it. There would nevertheless be a long-term risk to the project: the near-instant responsiveness of public cloud billing would make us a very easy (and easily measurable) target for cost cutting.

Fun

Engineers like doing lots of different things. Engineers don’t like having to mess with unpredictable black boxes; they like having control over the things they build. Engineers like feeling like they understand the whole stack, top to bottom.

We like running all this stuff ourselves. Our staff retention is great, and finding people who want to work here is easy, because working here is fun.