Wikimedia Cloud Services team/EnhancementProposals/Toolforge on bare metal
We want to make Toolforge more scalable and sustainable so that it better meets the current and future needs of tool developers. That means providing a higher level of availability and reliability through well-defined SLOs, shared operational practices and incident response with SRE, and presence in multiple data centers. One idea to support this is to move Toolforge to the production infrastructure.
Proof-of-concept
This move would be complex and would involve many unknown risks. We would start by creating a proof-of-concept to untangle some of the expected complexities and test the riskier elements, so that we have a much better understanding of how much time and effort a true migration would take and whether the potential benefits are worth the investment.
Key questions to answer:
- How much effort would it take to host Toolforge outside of the Cloud VPS infrastructure?
- Which tools would be easy to migrate? Which tools would be hard to migrate?
- Would hosting on bare metal allow us to provide a higher level of availability and reliability?
- What would have to change operationally if Toolforge were on bare metal?
The intention is to represent this work as a hypothesis in the WMF FY25-26 Annual Plan under the WE6 objective. A draft text could be:
Hypothesis: If we create a proof of concept (POC) for running Toolforge on the production infrastructure (“bare metal”) and test migrating a subset of representative tools to the POC, we will better understand the limitations and challenges of hosting Toolforge outside of Cloud VPS, which will allow us to make a decision about whether to invest in a full migration.
This work is beginning to take shape in T407296.