Wikimedia Cloud Services team/EnhancementProposals/GridEngine plans and timeline
This page contains information about WMCS plans and timeline for Toolforge GridEngine.
Toolforge currently uses Son of Grid Engine (a fork of the original GridEngine) to offer job scheduling to our technical community. This grid software is, however, considered obsolete, having been superseded by more modern approaches that handle similar functions.
The ultimate goal of the WMCS team is to stop using this grid software, and leverage Kubernetes instead.
Timeline
This timeline is so roughly guesstimated that the reader should treat the dates as simple placeholders. We hope that future edits to this section will introduce more precision.
- FY21/22 Q2 (Oct-Dec 2021): finish work & release the Toolforge Jobs Framework. Continue working on Toolforge buildpacks. Migrate Son of Grid Engine to Debian Buster.
- FY22/23 Q2 (Oct-Dec 2022): Ask community to begin migrating tools. Collect blocking issues.
- FY22/23 Q3 (Jan-Mar 2023): Toolforge buildpacks beta? Add features to support identified blocking issues. Introduce k8s as a service as a potential migration path? Tool migrations continue.
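The fiscal-quarter labels in the list above can be decoded mechanically. The following is a minimal sketch, assuming a fiscal year that starts in July, which is consistent with the date ranges given in the list. The function name `fiscal_quarter_label` is hypothetical and only used for illustration.

```python
# Month abbreviations are hard-coded so the output does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def fiscal_quarter_label(fy_start_year: int, quarter: int) -> str:
    """Return the calendar months covered by a fiscal quarter.

    Assumption: the fiscal year starts on 1 July of `fy_start_year`,
    so FY21/22 is passed as fy_start_year=2021.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be between 1 and 4")
    # Zero-based month offset counted from January of the start year:
    # Q1 begins in July (offset 6) and each quarter spans three months.
    offset = 6 + (quarter - 1) * 3
    year = fy_start_year + offset // 12
    first = offset % 12
    last = first + 2
    return f"{MONTHS[first]}-{MONTHS[last]} {year}"


if __name__ == "__main__":
    print("FY21/22 Q2:", fiscal_quarter_label(2021, 2))
    print("FY22/23 Q3:", fiscal_quarter_label(2022, 3))
```

Passing the first calendar year of the fiscal year keeps the interface unambiguous, since a label such as FY22/23 spans two calendar years.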
Use case continuity
We are aware our technical community relies on the grid for many of the most relevant Toolforge use cases.
In particular, there are a couple of use cases that may need some adaptation work in order to be fully supported on Kubernetes. For some of the current grid workflows, there may be no 1:1 functionality match on Kubernetes.
The following table tracks use case continuity.
Feature | In our Son of Grid Engine | In our Kubernetes | Comment |
---|---|---|---|
tools job scheduling | Native | Toolforge Jobs Framework customization* | Basically a 1:1 match |
mixing tool runtime environments | Native | Toolforge buildpacks customization* | Potentially equivalent solution |
tools web services | Native + customization* | Native | Already in place |
tool management via ssh | Native + customization* | Native + customization* | Already in place |
tool management via web interface | Not implemented | No plans so far | Easier with Kubernetes APIs anyway |
multitenancy | Native, based on POSIX semantics | Native, based on k8s namespaces | Already in place |
quotas and other cluster-level controls | Native | Native | Already in place |
tool development environment local replication | None, up to the user | Native, docker containers | Improvement! |
access to data services (toolsdb, wikireplicas, dumps, etc) | Yes | Yes | Basically a 1:1 match |
observability for individual tools | None ?? | Native, based on prometheus | Already in place |
observability service-wide | https://sge-status.toolforge.org/ | https://k8s-status.toolforge.org/ | Basically a 1:1 match |
send emails from within a tool | There is a customization | Done. https://phabricator.wikimedia.org/T286135 | |
\* In the context of the table, customization means that significant development effort is required to make the feature possible.
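To give a rough idea of what the "tools job scheduling" row means on the Kubernetes side, here is a minimal sketch of a generic Kubernetes `batch/v1` Job manifest built as a Python dictionary. This is not the Toolforge Jobs Framework and does not reflect its actual implementation. The helper `build_job_manifest`, the `tool-<name>` namespace pattern, the image name and the command are all assumptions made for illustration.

```python
import json


def build_job_manifest(tool_name: str, job_name: str,
                       image: str, command: list[str]) -> dict:
    """Build a plain Kubernetes batch/v1 Job manifest as a dictionary.

    Assumption: each tool has its own namespace named "tool-<name>".
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": job_name,
            "namespace": f"tool-{tool_name}",  # hypothetical naming scheme
        },
        "spec": {
            # Do not retry on failure; one-off behaviour for this sketch.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Jobs require a restart policy of Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": job_name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        tool_name="example",                          # hypothetical tool
        job_name="nightly-update",                    # hypothetical job name
        image="registry.example.org/python:latest",   # placeholder image
        command=["python3", "update.py"],
    )
    print(json.dumps(manifest, indent=2))
```

The resulting JSON could be fed to a Kubernetes client. The per-tool namespace in the metadata corresponds to the namespace-based multitenancy listed in the table.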
Reasoning
These are some of the reasons why we want to stop using our current grid implementation:
- there has not been a new release (bugfixes, security patches, or otherwise) since 2016
- the grid has poor controls and support for important aspects such as high availability, fault tolerance and self-recovery.
- maintaining a healthy grid requires plenty of manual operations, like manual queue cleanups in case of failures, hand-crafted scripts for pooling/depooling nodes, etc.
- there is no good/modern monitoring support for the grid, and we need to craft and maintain several monitoring pieces to be able to do proper maintenance.
- the grid is also strongly tied to the underlying operating system release version. Migrating from one Debian version to the next is often painful.
- the grid imposes a strong dependency on NFS, another old technology. We would like to reduce dependency on NFS overall, and in the future we will explore NFS-free approaches for Toolforge.
- in general, the grid is old software, old technology, which can be replaced by more modern approaches for doing the same thing.
Our desire is to cover all our grid-like technology needs with Kubernetes, a technology which has several benefits:
- good high availability, fault tolerance and self-recovery constructs and facilities.
- maintaining a running Kubernetes cluster requires few manual operations.
- there are good monitoring options for Kubernetes deployments.
- our current approach to deploying and upgrading Kubernetes is independent of the underlying operating system.
- while our current Kubernetes deployment uses NFS as a central component, there is support for using other, more modern, approaches for the kind of shared storage we need in Toolforge.
- in general, Kubernetes is a modern technology, with a vibrant and healthy community, that both enables new use cases and has enough flexibility to adapt legacy ones.