Kubernetes/Clusters

This page makes use of RFC2119 terminology. See https://datatracker.ietf.org/doc/html/rfc2119

We have multiple Kubernetes clusters deployed in what we call the "production" realm. This page does not describe Kubernetes clusters maintained in other realms, e.g. Toolforge, which is maintained in the WMCS/Labs realm.

WikiKube

WikiKube is the canonical name of the set of Kubernetes clusters that hosts MediaWiki and related services. For historical reasons it is also known as "the Kubernetes cluster", "production", "main" (a transitory name that just stuck in some places) and "eqiad/codfw". Referring to it by the canonical name is the least confusing option, so we suggest you use that one.

The clusters are owned by the Service Operations SRE team.

Goal

The goal of these clusters is to serve production MediaWiki and related microservices traffic.

This also means they serve the bulk of our total traffic (>30k requests per second as of 2022-06-23).

Applications that are deployed in these clusters SHOULD fall into at least one of the following categories:

MediaWiki itself (appservers, API servers, job runners, etc.)
Services that MediaWiki relies on internally (EventBus, session store, etc.)
Services that MediaWiki relies on publicly/client-side (Citoid, Maps, etc.)
Services that provide an API that fundamentally depends on MediaWiki (mobileapps, wikifeeds, etc.)

For reliability reasons (mostly interference with the workloads powering end-user traffic), applications that are not related to the stated goal MUST NOT be deployed on these clusters.

Various miscellaneous services that clearly don't fit in the above categories, but also do not fit the stated goal, MAY be examined on a case-by-case basis with Service Operations, which will advise whether an application/service can or cannot be deployed in these clusters. Note that there are some legacy applications which don't fit in the newly defined scope of these clusters and are expected to be moved elsewhere.

Examples of applications that would be a bad fit for installing in these clusters are:

Monitoring: e.g. Grafana, Kibana, LibreNMS, AlertManager, Icinga, Puppetboard. This restriction also exists because monitoring should be functional even when these clusters are in an outage.
Critical infrastructure pieces: e.g. Netbox
Collaboration Tools: e.g. Phabricator, GitLab, Gerrit, Etherpad
Continuous Integration applications
Machine Learning applications, especially the training part, but also anything that relies on existing Kubeflow infrastructure. The ml-* clusters are probably better suited for these use cases.
Analytics applications: e.g. Turnilo, Superset
Stateful applications/Datastores: MySQL/MariaDB/Postgres/Cassandra/Memcached/Redis (with persistence enabled at least)/Varnish/ApacheTrafficServer are all bad use cases for these clusters. This has to do with how these applications are designed and created: running them here would, without significant investment, decrease their reliability and lead to more outages overall.

Datacenters

Services/applications MUST be deployed identically (barring some datacenter-specific configuration) in both main datacenters in an active/active fashion. The reasons for that are:

Possibility for a failover of a datacenter in case of emergencies
Ability to perform maintenance without the need for downtime windows
Decreased latency for some groups of end users, increasing readability

The ability to do the above is routinely checked using the Switch Datacenter procedure. Owners of services that consistently fail the procedure will be asked to undeploy them and deploy them elsewhere.
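The active/active requirement above can be sketched as a simple check. This is only an illustration, not part of any existing tooling: the function names are hypothetical, and the datacenter names eqiad and codfw are taken from the cluster names used on this page.

```python
from typing import Iterable, List

# Assumption: the two main datacenters, named as elsewhere on this page.
MAIN_DATACENTERS = frozenset({"eqiad", "codfw"})


def missing_datacenters(deployed_in: Iterable[str]) -> List[str]:
    """Return the main datacenters a service is NOT deployed in, sorted by name."""
    return sorted(MAIN_DATACENTERS - set(deployed_in))


def is_active_active(deployed_in: Iterable[str]) -> bool:
    """A service satisfies the rule only if no main datacenter is missing."""
    return not missing_datacenters(deployed_in)
```

A service deployed only in eqiad would be reported as missing codfw and would therefore not survive a datacenter switchover.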

Traffic Flow

Exposure of these services to end users or internal applications MUST happen via LVS or Ingress and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end users to the appropriate LVS/Ingress endpoints.

Internal applications SHOULD use the Services Proxy infrastructure to communicate with other services. For incoming HTTP(S) traffic no changes to the application are necessary. For outgoing HTTP(S) traffic, the application SHOULD set the correct HTTP Host Header of the endpoint (e.g. "www.wikidata.org") it wants to talk to in the HTTP requests it generates. TLS certificates SHOULD be generated for these services. Consult with Service Operations for how to do that.
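As a minimal sketch of the outgoing case, the snippet below builds an HTTP request that is addressed to a local proxy listener while carrying the Host header of the real endpoint. The listener address `http://localhost:6500` and the function name are assumptions made for illustration; consult the Services Proxy documentation for the actual listener assigned to each upstream service.

```python
import urllib.request

# Hypothetical local Services Proxy listener; the real address and port
# are not specified on this page.
PROXY_BASE = "http://localhost:6500"


def build_proxied_request(path: str, host: str) -> urllib.request.Request:
    """Build a request sent to the proxy but carrying the endpoint's Host header."""
    request = urllib.request.Request(PROXY_BASE + path)
    # An explicitly set Host header is kept by urllib instead of being
    # derived from the URL, so the proxy sees the intended endpoint name.
    request.add_header("Host", host)
    return request


example = build_proxied_request("/w/api.php", "www.wikidata.org")
```

Passing `example` to `urllib.request.urlopen` would send the request to the proxy address; the Host header is what identifies the service the application actually wants to reach.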

staging

staging, also known as staging-eqiad, is a sibling cluster to the above clusters. As such, almost everything that applies to WikiKube applies to this cluster too, unless noted below. All services deployed here are accessible internally via https://staging.svc.eqiad.wmnet:{port}.
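The URL pattern above can be expanded with a small helper such as the following sketch. The helper name and the port used in the example are hypothetical; each service has its own port.

```python
STAGING_HOST = "staging.svc.eqiad.wmnet"


def staging_url(port: int, path: str = "/") -> str:
    """Build the internal staging URL for a service listening on `port`."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid TCP port: {port}")
    # Normalise the path so the result always has exactly one leading slash.
    return f"https://{STAGING_HOST}:{port}/{path.lstrip('/')}"


# 4003 is a made-up port used only for illustration.
example_url = staging_url(4003, "/healthz")
```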

Goal

This cluster exists to allow developers to deploy and test new versions of their project without affecting user traffic. It complements the above clusters by providing a safety net (and nothing more) during a deployment. The idea is that if a deployment fails in staging, no attempt should be made to proceed with deploying to the above clusters.

Uses of this cluster other than the one described above (e.g. as a development environment, a CI runner, a quality assurance platform or a demo scene, to name a few) MUST NOT be allowed. This restriction is in addition to the restrictions mentioned for the above clusters.

Datacenters

While staging clusters exist in both eqiad and codfw, the primary home of the staging cluster is eqiad. staging-codfw is intended for SREs to adjust and test the configuration of Kubernetes itself. While developers can deploy there, it is strongly discouraged. The cluster is in a constant state of change. As such, while it can perform the same functions as the staging-eqiad cluster, it is usually not ready for that, nor should it be.

Since no real traffic is served by these clusters, there is no need for High Availability mechanisms.

Traffic Flow

There is no traffic flow for these clusters, so none of the things mentioned for the above clusters apply. Typically deployments SHOULD only have 1 replica in staging since it has fewer resources. TLS is automatically configured for all services deployed here.

ml-serve

The ml-serve cluster group runs the Kubeflow+KServe (formerly KFServing) stack. The cluster group comprises ml-serve-eqiad and ml-serve-codfw for production traffic, and ml-staging-codfw for staging. The owner of these clusters is the Machine Learning team. Despite the different ownership, the ML team and the Service Operations team share much of the infrastructure, capabilities and processes that the other clusters use.

Goal

Machine Learning related applications that are owned or helped into production by the ML team are deployed here. A first goal is to replace the ORES infrastructure that serves revision scores. Eventually, one more cluster will be created in the eqiad datacenter to allow for training Machine Learning applications using Kubeflow. Those applications will use the Kubeflow infrastructure to be deployed in the clusters described here. The name of that project is Lift Wing.

Datacenters

The reason for having clusters in the 2 main datacenters is the same as for the above clusters.

Traffic Flow

Exposure of these services to end users or internal applications MUST happen via LVS and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end users to the appropriate LVS endpoints. Internal applications SHOULD use the Services Proxy infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required. For outgoing HTTP(S) traffic, the application SHOULD set the correct HTTP Host Header of the endpoint it wants to talk to in the HTTP requests it generates. TLS certificates SHOULD be generated for these services. Consult with the ML Team or Service Operations.

dse-k8s

The dse-k8s cluster group currently comprises only the dse-k8s-eqiad cluster. This cluster is managed by the Data Engineering team. The cluster is based on the standards defined by the WikiKube cluster group above, which are also employed in the ml-serve clusters.

Goal

One of the primary goals of this cluster is to train Machine Learning applications using Kubeflow. This cluster was also known by the project code name of Lift Wing, but the scope has since been broadened to encompass more general Data Science and Engineering (DSE) workloads, hence the renaming of the cluster to dse-k8s.

Datacenters

We currently only have one cluster within this group, which is in eqiad. This cluster might be replicated in codfw in the future.

Traffic Flow

Exposure of these services to end users or internal applications MUST happen via LVS and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end users to the appropriate LVS endpoints. Internal applications SHOULD use the Services Proxy infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required. For outgoing HTTP(S) traffic, the application SHOULD set the correct HTTP Host Header of the endpoint it wants to talk to in the HTTP requests it generates. TLS certificates SHOULD be generated for these services. Consult with the ML, Data Engineering, or Service Operations teams.

aux

Observability, monitoring and critical infrastructure tooling have been deemed a bad fit for the WikiKube clusters, but the need to deploy such services in a more orchestrated and structured way exists. The owner of these clusters is the Infrastructure Foundations SRE team.

Goal

The aux cluster is currently a tightly scoped cluster that aims to serve the needs of Observability tooling and other SRE-supported critical infrastructure services. Things that at the time of this writing are considered to be a good fit are:

Observability infrastructure (e.g. Jaeger, Grafana, Puppetboard, etc)
Critical infrastructure pieces (e.g. Netbox)

Datacenters

At the time of this writing, 2022-10-20, the aux cluster is present only in eqiad. It is also only present on Ganeti VMs and has no hardware. Both are decisions that could be revisited in the future. (In particular, an aux cluster in codfw seems likely at some future point.)

Traffic Flow

Exposure of these services to end users or internal applications MUST happen via LVS or Ingress and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end users to the appropriate LVS/Ingress endpoints.

Creating a new cluster

Creating a new cluster is supported, but it requires a substantial amount of work (multiple days even for the fastest SRE team) and investment. SREs MUST consult with the Service Operations team before proceeding with the instantiation of a new cluster. Docs are at Kubernetes/Clusters/New.