
Wikimedia Cloud Services team/EnhancementProposals/GridEngine plans and timeline

From Wikitech
This page is currently a draft.
Material may not yet be complete, information may presently be omitted, and certain parts of the content may be subject to radical, rapid alteration. More information pertaining to this may be available on the talk page.
The Toolforge Grid Engine was shut down in March 2024. Tools not migrated to newer runtimes were shut down. For details, see News/Toolforge Grid Engine deprecation.

This page contains information about WMCS plans and timeline for Toolforge GridEngine.

Toolforge currently uses Son of Grid Engine (a fork of the original GridEngine) to offer job scheduling functionality to our technical community. This grid software and technology is, however, considered superseded by more modern approaches to the same functions.

The ultimate goal of the WMCS team is to stop using this grid software, and leverage Kubernetes instead.

Timeline

This timeline is so badly guesstimated that the reader should treat its dates as simple placeholders. We hope that future edits to this section will introduce more precision.

  • FY21/22 Q2 (Oct-Dec 2021): finish work on and release the Toolforge Jobs Framework. Continue working on Toolforge buildpacks. Migrate Son of Grid Engine to Debian Buster.
  • FY22/23 Q2 (Oct-Dec 2022): Ask community to begin migrating tools. Collect blocking issues.
  • FY22/23 Q3 (Jan-Mar 2023): Toolforge buildpacks beta? Add features to support identified blocking issues. Introduce k8s as a service as a potential migration path? Tool migrations continue.
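
The quarter labels above follow a fiscal year rather than the calendar year. As a minimal sketch, assuming the fiscal year starts on 1 July (an inference from the labels in this timeline, where FY21/22 Q2 is Oct-Dec 2021), a label can be converted to the first day of its quarter. The function name `fiscal_quarter_start` is hypothetical and only serves this illustration.

```python
import re
from datetime import date

# Assumption: the fiscal year starts on 1 July. This is inferred from the
# timeline labels (FY21/22 Q2 = Oct-Dec 2021, FY22/23 Q3 = Jan-Mar 2023).
FISCAL_YEAR_START_MONTH = 7


def fiscal_quarter_start(label: str) -> date:
    """Return the first day of the quarter named by a label like 'FY21/22 Q2'."""
    match = re.fullmatch(r"FY(\d{2})/(\d{2}) Q([1-4])", label.strip())
    if match is None:
        raise ValueError(f"unrecognised quarter label: {label!r}")
    first_year = 2000 + int(match.group(1))
    quarter = int(match.group(3))
    # Months elapsed since January of the fiscal year's first calendar year.
    months_from_january = (FISCAL_YEAR_START_MONTH - 1) + 3 * (quarter - 1)
    year = first_year + months_from_january // 12
    month = months_from_january % 12 + 1
    return date(year, month, 1)


if __name__ == "__main__":
    for label in ("FY21/22 Q2", "FY22/23 Q2", "FY22/23 Q3"):
        print(label, "starts on", fiscal_quarter_start(label).isoformat())
```

Under that assumption, the three labels in the timeline map to October 2021, October 2022 and January 2023, which matches the month ranges given in parentheses.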

Use case continuity

We are aware our technical community relies on the grid for many of the most relevant Toolforge use cases.

In particular, there are a couple of use cases that may need some adaptation work in order to be fully supported on Kubernetes. For some of the current grid workflows, there may be no 1:1 functionality match on Kubernetes.

The following table tracks use case continuity.

Toolforge grid-like features
| Feature | In our Son of Grid Engine | In our Kubernetes | Comment |
|---|---|---|---|
| tools job scheduling | Native | Toolforge Jobs Framework customization* | Basically a 1:1 match |
| mixing tool runtime environments | Native | Toolforge buildpacks customization* | Potentially equivalent solution |
| tools web services | Native + customization* | Native | Already in place |
| tool management via ssh | Native + customization* | Native + customization* | Already in place |
| tool management via web interface | Not implemented | No plans so far | Easier with Kubernetes APIs anyway |
| multitenancy | Native, based on POSIX semantics | Native, based on k8s namespaces | Already in place |
| quotas and other cluster-level controls | Native | Native | Already in place |
| tool development environment local replication | None, up to the user | Native, docker containers | Improvement! |
| access to data services (toolsdb, wikireplicas, dumps, etc) | Yes | Yes | Basically a 1:1 match |
| observability for individual tools | None ?? | Native, based on prometheus | Already in place |
| observability service-wide | https://sge-status.toolforge.org/ | https://k8s-status.toolforge.org/ | Basically a 1:1 match |
| send emails from within a tool | There is a customization | Done. https://phabricator.wikimedia.org/T286135 | |

\* In the context of the table, customization means that a significant development effort is required for making it possible.

Reasoning

These are some of the reasons why we want to stop using our current grid implementation:

  • there has not been a new release (bugfixes, security patches, or otherwise) since 2016
  • the grid has poor controls and support for important aspects such as high availability, fault tolerance and self-recovery.
  • maintaining a healthy grid requires plenty of manual operations, like manual queue cleanups in case of failures, hand-crafted scripts for pooling/depooling nodes, etc.
  • there is no good/modern monitoring support for the grid, and we need to craft and maintain several monitoring pieces to be able to do proper maintenance.
  • the grid is also strongly tied to the underlying operating system release version. Migrating from one Debian version to the next is often painful.
  • the grid imposes a strong dependency on NFS, another old technology. We would like to reduce dependency on NFS overall, and in the future we will explore NFS-free approaches for Toolforge.
  • in general, the grid is old software, old technology, which can be replaced by more modern approaches for doing the same thing.

Our desire is to cover all our grid-like technology needs with Kubernetes, a technology which has several benefits:

  • good high availability, fault tolerance and self-recovery constructs and facilities.
  • maintaining a running Kubernetes cluster requires few manual operations.
  • there are good monitoring options for Kubernetes deployments.
  • our current approach to deploying and upgrading Kubernetes is independent of the underlying operating system.
  • while our current Kubernetes deployment uses NFS as a central component, there is support for using other, more modern, approaches for the kind of shared storage we need in Toolforge.
  • in general, Kubernetes is a modern technology, with a vibrant and healthy community, that both enables new use cases and has enough flexibility to adapt legacy ones.

See also