User:Arturo Borrero Gonzalez/Notes/Goal proposal infra

This page contains a goal proposal for the WMCS team, related to the basic infrastructure we design/operate.

Problem and background

The Cloud VPS service has gained increased usage and importance over the last couple of years. End users can use Cloud VPS directly, or indirectly through other services such as Toolforge or PAWS, both of which are currently built as Cloud VPS projects. Services built on top of Cloud VPS are constrained by its capacity and affected by its availability and robustness, which are usually good, even though we don't offer warranties, SLAs or anything like that. As we keep building interesting services on top of Cloud VPS, the service keeps gaining importance, and its infrastructure becomes ever more important for the stability and robustness of the services we provide.

The Cloud VPS service is considered to be composed of some basic building blocks:

  • hypervisors (to run actual VM workloads)
  • network (by means of neutron and related networking configuration)
  • control (keystone, nova, glance, etc.; the central components of the setup)
  • others (DNS services, horizon, etc.)
  • storage (an open problem, both for VM instance disk storage and for project internal shared storage)

However, we have identified several flaws in the service that require both short-term and long-term action if we want to provide a robust IaaS offering on which we can build even more services.

Some of the issues already identified are:

  • we don't have well-defined long-term planning and architecture for the service. If it exists, it isn't written down anywhere. If it is written down somewhere, it is not well known by the engineers and technical contributors involved with Cloud VPS.
  • our development/staging environment is limited, and therefore we don't usually test components enough before deploying them in the production Cloud VPS service.
  • related to the above, we don't have an established and defined workflow that includes testing.
  • we haven't analyzed our current setups for single points of failure. We know some of them because we are experienced, not because we tested and planned for them.
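
As an illustration of what a first pass at the single point of failure analysis could look like, here is a minimal Python sketch. It is not a description of any existing tooling: the Component model, the replica counts and the dependencies below are all made up for the example, and a real inventory would need to come from the actual deployment. The idea is simply to flag building blocks that have no redundancy and to list which other blocks would be affected if one of them failed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A building block of the service (hypothetical model for this sketch)."""
    name: str
    replicas: int
    depends_on: List[str] = field(default_factory=list)


def single_points_of_failure(components):
    """Return the sorted names of components that have fewer than two replicas."""
    return sorted(c.name for c in components if c.replicas < 2)


def affected_by(components, failed):
    """Return the sorted names of components that depend, directly or
    transitively, on the component called `failed`."""
    # Invert the dependency edges: dependency name -> names that depend on it.
    dependents = {}
    for c in components:
        for dep in c.depends_on:
            dependents.setdefault(dep, set()).add(c.name)

    # Walk the inverted graph starting from the failed component.
    seen = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for name in dependents.get(current, ()):
            if name not in seen:
                seen.add(name)
                stack.append(name)
    return sorted(seen)


# Hypothetical inventory: replica counts and dependencies are assumptions,
# not a description of the real Cloud VPS deployment.
inventory = [
    Component("network", replicas=1),
    Component("control", replicas=2, depends_on=["network"]),
    Component("hypervisors", replicas=3, depends_on=["network", "control"]),
    Component("dns", replicas=1, depends_on=["network"]),
]

if __name__ == "__main__":
    for name in single_points_of_failure(inventory):
        print(name, "-> affects:", affected_by(inventory, name))
```

A table like this only captures what we already know, so it would complement actual failover testing rather than replace it.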

Proposed goal specification

Relation to annual planning

The proposed goal is framed within Technology Program 1: Availability, performance, and maintenance, specifically Outcome 4:

Wikimedia Cloud Services users can leverage a reliable and public Infrastructure as a Service (IaaS) product ecosystem for VPS hosting. 

It could probably be related to others as well.

Specific task details

  • Do failover testing of components that are already in HA setups
  • Create concrete and specific documentation on how to handle failover situations
  • Define a desired long-term Cloud VPS model, and the intermediate steps to get to it