Wikimedia Cloud Services team/EnhancementProposals/Toolforge on bare metal

From Wikitech

We want to make Toolforge more scalable and sustainable so that it better meets the current and future needs of tool developers. That means providing a higher level of availability and reliability through well-defined SLOs, operational practices and incident response shared with SRE, and a presence in multiple data centers. One idea to support this is to move Toolforge to the production infrastructure.

Proof-of-concept

This move would be complex and would involve many unknown risks. We would start by building a proof of concept to untangle some of the expected complexities and test the riskier elements. That would give us a much better understanding of how much time and effort a true migration would take, and whether the potential benefits are worth the investment.

Key questions to answer:

  • How much effort would it take to host Toolforge outside of the Cloud VPS infrastructure?
  • Which tools would be easy to migrate? Which tools would be hard to migrate?
  • Would hosting on bare metal allow us to provide a higher level of availability and reliability?
  • What would have to change operationally if Toolforge were on bare metal?

The intention is to represent this work as a hypothesis in the WMF FY25-26 Annual Plan under the WE6 objective. A draft text could be:

Hypothesis: If we create a proof of concept (POC) for running Toolforge on the production infrastructure (“bare metal”) and test migrating a subset of representative tools to the POC, we will better understand the limitations and challenges of hosting Toolforge outside of Cloud VPS, which will allow us to make a decision about whether to invest in a full migration.

This work is beginning to take shape in T407296.