This page makes use of RFC2119 terminology. See https://datatracker.ietf.org/doc/html/rfc2119
We have multiple Kubernetes clusters deployed in what we call the "production" realm. This page does not describe Kubernetes clusters maintained in other realms, e.g. Toolforge, which is maintained in the WMCS/Labs realm.
wikikube (aka eqiad/codfw)
eqiad and codfw are the historical names of the 2 clusters that are now named wikikube. They are owned by the Service Operations SRE team. These are our older Kubernetes clusters and have the historical benefit of using the DC names in short form. For the same historical reasons, the infrastructure in multiple places treats those as the primary Kubernetes clusters. This is something that is being worked on.
The goal of these clusters is to serve production MediaWiki and related microservices traffic.
This also means they serve the bulk of our total traffic (>30k requests per second as of 2021-10-22).
Applications that are deployed in these clusters SHOULD fall into at least one of the following categories:
- MediaWiki itself (appservers, API servers, job runners, etc.)
- Services that MediaWiki relies on internally (EventBus, session store, etc.)
- Services that MediaWiki relies on publicly/client-side (Citoid, Maps, etc.)
- Services that provide an API that fundamentally depends on MediaWiki (mobileapps, wikifeeds, etc.)
For reliability reasons (mostly interference with the workloads powering end-user traffic), applications that are not related to the stated goal MUST NOT be deployed on these clusters.
Various miscellaneous applications that clearly don't fit in the above categories, but also do not fit the stated goal, MAY be examined on a case-by-case basis with Service Operations, which will advise whether an application/service can or cannot be deployed in these clusters. Note that there are some legacy applications which don't fit in the newly defined scope of these clusters and are expected to be moved elsewhere.
Examples of applications that would be a bad fit for installing in these clusters are:
- Monitoring: e.g. Grafana, Kibana, LibreNMS, AlertManager, Icinga, Puppetboard. This restriction also exists because monitoring should be functional even when these clusters are in an outage.
- Critical infrastructure pieces: e.g. Netbox
- Collaboration Tools: e.g. Phabricator, GitLab, Gerrit, Etherpad
- Continuous Integration applications
- Machine Learning applications
- Analytics applications: e.g. Turnilo, Superset
- Stateful applications/Datastores: MySQL/MariaDB/Postgres/Cassandra/Memcached/Redis/Varnish/ApacheTrafficServer are all bad use cases for these clusters. Because of how these applications are designed, running them here would, without significant investment, decrease their reliability and lead to more outages overall.
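The scope rules above can be read as a three-way decision. The following is only an illustrative sketch: the category labels and the `wikikube_verdict` function are hypothetical names invented for this example, and the real decision is always made together with Service Operations.

```python
from enum import Enum


class Verdict(Enum):
    ALLOWED = "allowed"
    CASE_BY_CASE = "consult Service Operations"
    NOT_ALLOWED = "not allowed"


# Hypothetical labels for the in-scope categories listed on this page.
IN_SCOPE = {
    "mediawiki",
    "mediawiki-internal-dependency",
    "mediawiki-public-dependency",
    "mediawiki-dependent-api",
}

# Hypothetical labels for the bad-fit examples listed on this page.
BAD_FIT = {
    "monitoring",
    "critical-infrastructure",
    "collaboration-tools",
    "continuous-integration",
    "machine-learning",
    "analytics",
    "stateful-datastore",
}


def wikikube_verdict(category: str) -> Verdict:
    """Map an application category to a placement verdict for wikikube."""
    if category in IN_SCOPE:
        return Verdict.ALLOWED
    if category in BAD_FIT:
        return Verdict.NOT_ALLOWED
    # Anything else is examined case by case with Service Operations.
    return Verdict.CASE_BY_CASE
```

For example, `wikikube_verdict("monitoring")` yields `Verdict.NOT_ALLOWED`, while an unlisted category falls through to the case-by-case path.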
Services/applications MUST be deployed identically (barring some datacenter specific configuration) in both main Datacenters in an active/active fashion. The reasons for that are:
- Possibility for a failover of a datacenter in case of emergencies
- Ability to perform maintenance without the need for downtime windows
- Decreased latency for some groups of end users, increasing readability
These capabilities are routinely checked using the Switch Datacenter procedure. Owners of services that consistently fail the procedure will be asked to undeploy them and deploy them elsewhere.
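To make "deployed identically, barring some datacenter specific configuration" concrete, here is a minimal sketch of a parity check between two per-datacenter configurations. It is not part of the Switch Datacenter procedure; the function name `config_drift`, the dictionary layout and the key names are assumptions made for illustration.

```python
from __future__ import annotations


def config_drift(eqiad: dict, codfw: dict, dc_specific: set[str]) -> dict:
    """Return the keys whose values differ between the two datacenters.

    Keys listed in dc_specific are expected to differ and are ignored.
    Each reported key maps to an (eqiad_value, codfw_value) tuple, with
    None standing in for a key that is absent on one side.
    """
    missing = object()  # sentinel so that an absent key never equals a real value
    keys = (eqiad.keys() | codfw.keys()) - dc_specific
    return {
        key: (eqiad.get(key), codfw.get(key))
        for key in sorted(keys)
        if eqiad.get(key, missing) != codfw.get(key, missing)
    }


# Hypothetical configurations for one service in each datacenter.
eqiad_cfg = {"image": "svc:1.2", "replicas": 8, "datacenter": "eqiad"}
codfw_cfg = {"image": "svc:1.1", "replicas": 8, "datacenter": "codfw"}

drift = config_drift(eqiad_cfg, codfw_cfg, dc_specific={"datacenter"})
```

An empty result means the two deployments match apart from the keys declared as datacenter specific; here `drift` reports the mismatched `image` value.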
Exposure of these services to end-users or internal applications MUST happen via LVS and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end-users to the appropriate LVS endpoints. Internal applications SHOULD use the Services Proxy infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required. For outgoing HTTP(S) traffic, the application SHOULD set the HTTP Host header of the requests it generates to that of the endpoint it wants to talk to. TLS certificates SHOULD be generated for these services. Consult with Service Operations.
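As an illustration of the outgoing-traffic recommendation, the sketch below builds an HTTP request aimed at a local proxy listener while setting the Host header to the service it actually wants to reach. The proxy address, port, hostname and the helper name `build_proxied_request` are all hypothetical; check the Services Proxy documentation for the real listener of each service.

```python
import urllib.request


def build_proxied_request(
    proxy_url: str, target_host: str, path: str = "/"
) -> urllib.request.Request:
    """Build a request sent to proxy_url that carries target_host as Host."""
    request = urllib.request.Request(proxy_url.rstrip("/") + path)
    # An explicit Host header overrides the one urllib would otherwise
    # derive from the URL being contacted.
    request.add_header("Host", target_host)
    return request


# Hypothetical values: a local proxy listener and a made-up service hostname.
request = build_proxied_request(
    "http://localhost:6001",
    "example-service.discovery.example",
    "/v1/status",
)
# The request could then be sent with urllib.request.urlopen(request).
```

The request is only constructed here, not sent, so the example has no network dependency.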
staging
staging, also known as staging-eqiad, is a sibling cluster to the above clusters.
This cluster exists to allow developers to deploy and test new versions of their project without affecting user traffic. It is a complement to the above clusters, providing a safety net (and nothing more) during a deployment. The idea is that if a deployment fails in staging, no attempt should be made to proceed with deploying to the above clusters.
This cluster MUST NOT be used for any purpose other than the one described above (e.g. as a development environment, a CI runner, a quality assurance platform or a demo scene, to name a few). This restriction is in addition to the restrictions mentioned for the above clusters.
While staging clusters exist in both eqiad and codfw, the primary home of the staging cluster is eqiad.
staging-codfw is intended for SREs to adjust and test the configuration of Kubernetes itself. While developers can deploy there, it's strongly discouraged. The cluster is in a constant state of change. As such, while it can perform the same functions as the staging-eqiad cluster, it is usually not ready for that, nor should it be.
Since no real traffic is served by these clusters, there is no need for High Availability mechanisms.
There is no traffic flow for these clusters, so none of the traffic-flow requirements described for the above clusters apply. Typically deployments SHOULD only have 1 replica in staging since it has fewer resources. TLS is automatically configured for all services deployed here.
ml-serve-eqiad & ml-serve-codfw
The ml-serve clusters run the Kubeflow + KServe (formerly KFServing) stack. The owner of these clusters is the Machine Learning team. Despite the different ownership, the ML team and the Service Operations team share most of the infrastructure capabilities and processes that the other clusters use.
Machine Learning related applications that are owned or helped into production by the ML team are deployed here. A first goal is to replace the ORES infrastructure that serves revision scores. Eventually, 1 more cluster will be created in the eqiad datacenter to allow for training Machine Learning applications using Kubeflow. Those applications will use the Kubeflow infrastructure to be deployed in the clusters described here. The name of that project is Lift Wing.
The reason for having clusters in the 2 main Datacenters is the same as for the above clusters.
Exposure of these services to end-users or internal applications MUST happen via LVS and MUST be advertised in the DNS Discovery records. The Global traffic routing layer will take care of routing end-users to the appropriate LVS endpoints. Internal applications SHOULD use the Services Proxy infrastructure to communicate with other services. For incoming HTTP(S) traffic nothing is required. For outgoing HTTP(S) traffic, the application SHOULD set the HTTP Host header of the requests it generates to that of the endpoint it wants to talk to. TLS certificates SHOULD be generated for these services. Consult with the ML Team or Service Operations.
Creating a new cluster
Creating a new cluster is supported, but it requires a substantial amount of work (multiple days even for the fastest SRE team) and investment. SREs MUST consult with the Service Operations team before proceeding with the instantiation of a new cluster. Docs are at Kubernetes/Clusters/New