Wikimedia Cloud Services team/EnhancementProposals/GridEngine plans and timeline

The Toolforge Grid Engine was shut down in March 2024. Tools not migrated to newer runtimes were shut down. For details, see News/Toolforge Grid Engine deprecation.

This page contains information about WMCS plans and timeline for Toolforge GridEngine.

Toolforge currently uses Son of Grid Engine (a fork of the original GridEngine) to offer job scheduling functionality to our technical community. This grid software is, however, considered outdated and has been superseded by more modern approaches to the same functions.

The ultimate goal of the WMCS team is to stop using this grid software, and leverage Kubernetes instead.

Timeline

This timeline is so roughly guesstimated that the reader could pretty much treat the dates as simple placeholders. We hope that future edits to this section will introduce more precision.

FY21/22 Q2 (Oct-Dec 2021): finish work & release the Toolforge Jobs Framework. Continue working on Toolforge buildpacks. Migrate Son of Grid Engine to Debian Buster.
FY22/23 Q2 (Oct-Dec 2022): Ask community to begin migrating tools. Collect blocking issues.
FY22/23 Q3 (Jan-Mar 2023): Toolforge buildpacks beta? Add features to address identified blocking issues. Introduce k8s as a service as a potential migration path? Tool migrations continue.

Use case continuity

We are aware our technical community relies on the grid for many of the most relevant Toolforge use cases.

In particular, there are a couple of use cases that may need some adaptation work in order to be fully supported on Kubernetes. For some of the current grid workflows, there may be no 1:1 functionality match on Kubernetes.

The following table tracks use case continuity.

Toolforge grid-like features
Feature	In our Son of Grid Engine	In our Kubernetes	Comment
tools job scheduling	Native	Toolforge Jobs Framework customization*	Basically a 1:1 match
mixing tool runtime environments	Native	Toolforge buildpacks customization*	Potentially equivalent solution
tools web services	Native + customization*	Native	Already in place
tool management via ssh	Native + customization*	Native + customization*	Already in place
tool management via web interface	Not implemented	No plans so far	Easier with Kubernetes APIs anyway
multitenancy	Native, based on POSIX semantics	Native, based on k8s namespaces	Already in place
quotas and other cluster-level controls	Native	Native	Already in place
tool development environment local replication	None, up to the user	Native, docker containers	Improvement!
access to data services (toolsdb, wikireplicas, dumps, etc)	Yes	Yes	Basically a 1:1 match
observability for individual tools	None ??	Native, based on prometheus	Already in place
observability service-wide	https://sge-status.toolforge.org/	https://k8s-status.toolforge.org/	Basically a 1:1 match
send emails from within a tool	There is a customization	Done. https://phabricator.wikimedia.org/T286135

* In the context of the table, customization means that a significant development effort is required to make the feature possible.
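As a rough illustration of the job scheduling row in the table above, the sketch below shows how a scheduled grid-style job could be described as a plain Kubernetes CronJob manifest. It is not the Toolforge Jobs Framework and does not reflect its interface. The helper name `cron_job_manifest`, the tool name, the image name, and the `tool-<name>` namespace pattern are all assumptions made for the example.

```python
import json


def cron_job_manifest(tool, job_name, schedule, command, image):
    """Build a Kubernetes CronJob manifest (batch/v1) as a plain dict.

    Hypothetical helper for illustration only; the namespace pattern
    and the image name passed in below are assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {
            "name": job_name,
            # Assumption: one namespace per tool, named "tool-<name>".
            "namespace": f"tool-{tool}",
        },
        "spec": {
            # Standard five-field cron expression.
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            # Do not restart the container in place on failure.
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": job_name,
                                    "image": image,
                                    "command": ["/bin/sh", "-c", command],
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


manifest = cron_job_manifest(
    tool="example-tool",             # hypothetical tool name
    job_name="nightly-update",
    schedule="0 3 * * *",            # every day at 03:00
    command="./update.sh",
    image="example/python:latest",   # hypothetical image name
)
print(json.dumps(manifest, indent=2))
```

The resulting dictionary can be serialized to JSON or YAML and inspected; whether it could be applied as-is to a given cluster depends on that cluster's policies and quotas.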

Reasoning

These are some of the reasons why we want to stop using our current grid implementation:

there has not been a new release (bugfixes, security patches, or otherwise) since 2016
the grid has poor controls and support for important aspects such as high availability, fault tolerance and self-recovery.
maintaining a healthy grid requires plenty of manual operations, like manual queue cleanups in case of failures, hand-crafted scripts for pooling/depooling nodes, etc.
there is no good/modern monitoring support for the grid, and we need to craft and maintain several monitoring pieces to be able to do proper maintenance.
the grid is also strongly tied to the underlying operating system release version. Migrating from one Debian version to the next is often painful.
the grid imposes a strong dependency on NFS, another old technology. We would like to reduce dependency on NFS overall, and in the future we will explore NFS-free approaches for Toolforge.
in general, the grid is old software, old technology, which can be replaced by more modern approaches for doing the same thing.

Our desire is to cover all our grid-like technology needs with kubernetes, a technology which has several benefits:

good high availability, fault tolerance and self-recovery constructs and facilities.
maintaining a running kubernetes cluster requires few manual operations.
there are good monitoring options for kubernetes deployments.
our current approach to deploying and upgrading kubernetes is independent of the underlying operating system.
while our current kubernetes deployment uses NFS as a central component, there is support for using other, more modern, approaches for the kind of shared storage we need in Toolforge.
in general, kubernetes is a modern technology, with a vibrant and healthy community, that both enables new use cases and has enough flexibility to adapt legacy ones.
