User:Andrewbogott/keep the cloud
Cancelling or deprecating cloud-vps would be a tremendous mistake.
Cloud-vps is not a nuisance or an anachronism; it is an effective and inexpensive tool that is vital to our organization and our movement.
- We run hundreds of projects and have a huge number of users and stakeholders, both inside and outside the foundation. There is no existing alternative home for these users, and not even a serious proposed future alternative home outside of AWS. Here are a few projects hosted on cloud-vps:
- The beta cluster ('deployment-prep'). This is still, for better or worse, vital to every MediaWiki release.
- The engine that builds Kiwix content ('mwoffliner')
- Internet archive bot, a collaboration with the Internet Archive to ensure viability of links to dead web pages ('cyberbot')
- Much of the WMF's deployment and CI pipeline ('integration' and others)
- 'Catalyst', a simple test and demo platform for MediaWiki patches
- The 'video' project which runs day and night to re-encode video content into open formats and upload them to Wikimedia Commons
- ...and literally 200 more.
- Cloud-vps is currently maintained by around 3 full time engineers, along with intermittent short-term collaboration with other SRE teams. We are extremely efficient and effective.
- Being able to do a lot with a small number of people may appear fragile or 'unsustainable', but that's mistaking a strength for a weakness. This efficiency isn’t thanks to heroic efforts or long hours; it is possible because of the modern tools, tech stacks, and thoughtful practices that we’ve developed over more than a decade.
- Naturally a larger team could provide better SLOs, but the offering has already endured through multiple staffing changes and a recent sabbatical by its lead SRE without loss of service. It is not precarious or unsustainable; rather, it is thriving and steadily improving its offerings.
Why do we need cloud-vps when we have Ganeti and Kubernetes?
Cloud-vps isn't primarily a virtualization platform; it is a system for providing tenant and role management, and a set of self-serve APIs for allocating and managing all kinds of cloud resources.
- Cloud-vps is the only values-compliant platform for non-SRE WMF teams to run their software. It is an open-source platform managed and controlled by the WMF. Without cloud-vps those teams will move their work to AWS or other commercial public clouds, incurring great expense and damaging the foundation's core open source principles.
- We provide block storage as a service, object storage, database-as-a-service, DNS as a service, automatic proxy generation (with SSL termination), Kubernetes as a service, and yes, VM management. Most of those services are customized specifically for WMF use cases, enabling ease of collaboration within projects and integration with much of the standard SRE toolchain.
- The above services are available via a web UI and also available via public APIs. This permits orchestration with OpenTofu and Ansible, for example.
- All of the above is provided with secure, true tenant isolation. After initial creation of a tenant, owners can manage and delegate access without intervention from SRE or WMCS staff.
- A third-party security audit and pentesting exercise performed in 2022 showed complete confidence in tenant isolation. Since that audit we confidently support storage of secure data and PII on cloud-vps.
- Cloud-vps is the only multitenant offering within the WMF's tech stack. Any other service (or engineer) acting anywhere else in our datacenters requires lengthy security vetting due to ever-present risk of privilege escalation.
- Ganeti and Kubernetes are great tech but they are in no way full solutions to the problems that cloud-vps solves. Replacing cloud-vps with Kubernetes is like trading your house for a bed and thinking you'll still have a place to sleep.
Cloud-vps runs OpenStack. Isn't the OpenStack project defunct?
OpenStack and its parent organization, the OpenInfra Foundation, are thriving.
- The most recent OpenInfra (formerly OpenStack) Summit (Suwon, SK 2024) had 1500 attendees.
- As of 2025-04-25, Canonical lists 21 open positions for engineers working with OpenStack.
- The most recently published user survey (2022) shows 300 public clouds running OpenStack, managing 40 million cores in production clouds. 85% of those clouds offered KAAS.
- There continue to be twice-yearly releases of all the OpenStack projects that we run. Support levels vary, but the community is active; Magnum (the KAAS offering) contains a major update in the most recent (March 2025) release.
- OpenInfra plans to join the Linux Foundation this year. "The OpenInfra Foundation enters 2025 with strong momentum. The number of member organizations increased by 15%, including two new Platinum members. Our projects are thriving as well, with OpenStack adoption surging and OpenInfra projects like Kata Containers, StarlingX and Zuul experiencing increased adoption. Coupling our global community—110,000-strong—with the Linux Foundation leverages the power of open source and sets the stage for continued success as we build the next decade of infrastructure."
- Release Engineering has recently begun upgrading to Zuul3 for CI/CD. Zuul3 is an OpenInfra project.
Does Cloud-vps follow standard SRE tech practices?
Not only is the cloud-vps team compatible with the greater SRE department, it enables and assists it.
- Our systems run on Debian, our software is deployed using puppet, and much of the infrastructure is orchestrated using cumin and cookbooks. Any SRE transferring into our team from elsewhere in the organization would be immediately comfortable with our tech stack.
- Cloud-vps is an important tool for the development of many SRE tools. It serves as a testing ground for new puppet code, and multiple SRE teams run test or dev clusters on cloud-vps. Our offering is designed with puppet integration for testing and planning future production deployments. Some example SRE projects running on cloud-vps are: 'pontoon', 'traffic', 'puppet-dev', and 'appservers'.
Cloud VPS is the best it has ever been
Over time the WMCS team has navigated a number of challenges to make sure the service offerings meet certain desirable quality and operational standards.
Often working to multi-year plans, we have a history of setting and then successfully reaching a number of milestones.
Some of the most relevant include:
- Decoupling compute servers from storage servers.
- The WMCS team spearheaded the adoption of the Ceph distributed storage technology in the WMF, which has since been adopted by other teams.
- Among other things, this allowed the WMCS team to greatly increase operational flexibility (VM live migration) and data resilience (more copies of the data)
- Cross-Realm traffic guidelines, a multi-team (SRE/WMCS) effort to define the relationship between our different networks and systems.
- Public OpenStack APIs, to enable operations using modern infrastructure-as-code tooling such as OpenTofu or Ansible.
- Re-architecture of the Wiki-replica proxy layer for additional reliability and operational simplicity.
- Rollout of Trove Database-as-a-Service offering, allowing for one-click database deployments and upgrades.
- Rollout of Magnum Kubernetes-as-a-Service offering, allowing for one-click Kubernetes deployments and upgrades.
- Streamlining of physical datacenter layout and footprint, in terms of rack space and switches.
- Re-architecture of edge network connectivity for increased efficiency and resilience, integrating the know-how of SRE NetOps for protocols such as BGP and OSPF.
- As of April 2025, Cloud VPS has gained IPv6 and VXLAN support, which unblocks a number of other projects and improvements to the platforms and services hosted within OpenStack.
Some of the items mentioned here have been executed in close collaboration with multiple engineers from different teams across the WMF.
All of the improvements had a direct, net positive impact for users and stakeholders of Cloud VPS, including different SRE teams, other WMF teams and the wider Wikimedia movement.
This is an essay, by one or more authors from a given viewpoint or moment in time. Feel free to update this page as needed, but use the discussion page to propose major changes.