News/Toolforge Trusty deprecation

This page details information about deprecating and removing hosts running Ubuntu Trusty (14.04) as an operating system from the Toolforge infrastructure. The login bastions and Grid execution hosts are still running Trusty and must be replaced with new instances.

The Ubuntu Trusty job grid was shut down on Monday 2019-03-25. Migration steps following the shutdown have changed slightly, so be sure to read them.


What is changing?

Timeline

  • (Done) 2019-01-11: Availability of Debian Stretch grid announced to community
  • (Done) Week of 2019-02-04: Weekly reminders via email to tool maintainers for tools still running on Trusty
  • Week of 2019-03-04:
    • (Done) Daily reminders via email to tool maintainers for tools still running on Trusty
    • (Done) Switch login.tools.wmflabs.org to point to Stretch bastion
  • (Done) 2019-03-25: Shutdown Trusty grid

What should I do?

SSH to the Stretch bastion

login.tools.wmflabs.org connects to the new Debian Stretch bastion.

Move a grid engine webservice

When possible, we recommend migrating web services to Kubernetes instead of the new grid:

:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org

:# Become your tool account
$ become YOUR_TOOL

:# Start the webservice as a Kubernetes container rather than a grid job
:# <type> is one of: php7.2, php5.6, python, python2, nodejs, golang, jdk8, ruby2, tcl
$ webservice --backend=kubernetes <type> start
:# -- OR --
:# Start the webservice as a Stretch grid job
:# <type> is one of: lighttpd, uwsgi-python, tomcat, generic, lighttpd-plain, nodejs, uwsgi-plain
$ webservice --backend=gridengine <type> start

See Help:Toolforge/Web#Backends for more information on migrating from grid engine to Kubernetes.

Python2 and Python3 webservices will need to rebuild their virtualenv environments on the new target runtime (Stretch grid or Kubernetes).
NodeJS webservices will need to rebuild their $HOME/www/js/node_modules on the new target runtime (Stretch grid or Kubernetes).

Move a continuous job

:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org
:# Become your tool account
$ become YOUR_TOOL

:# Start your job on the Stretch job grid
$ jstart ...

The exact commands needed to start each continuous job vary greatly from tool to tool. This would be a great time to make a page of reference material for yourself and other maintainers here on Wikitech in the Tool namespace and using the Tool template if you haven't already.

Move a cron job

The crontab data for every tool that still had a cron registered on the Trusty grid was backed up to $HOME/crontab.trusty.save before the Trusty cron server was shut down. This backup can be used to set up your crontab on the Stretch grid.

:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org
:# Become your tool account
$ become YOUR_TOOL

:# Load the backup of your crontab on the Stretch job grid
$ crontab $HOME/crontab.trusty.save

If your workload permits, please avoid scheduling cronjobs from midnight to 3am so you're not competing with other cronjobs for system resources. That time window is currently very crowded.
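
As a rough way to see which of your entries fall inside that window, the sketch below filters crontab text for lines whose hour field is a plain number from 0 to 2. The function name crowded_entries is made up for this example, and the filter deliberately ignores hour fields such as "*", "*/2" or "1,13", so treat it as a starting point rather than a complete check.

```shell
# Hypothetical helper: print crontab entries whose hour field is a plain
# number in the range 0-2 (midnight up to, but not including, 3am).
# Reads crontab text on stdin. Comments, blank lines, VAR=value lines and
# @daily-style shortcuts are skipped; list, range and step hour fields are
# not analyzed.
crowded_entries() {
    awk '
        /^[ \t]*(#|$)/ { next }                  # comments and blank lines
        $1 ~ /=/ || $1 ~ /^@/ { next }           # assignments and @shortcuts
        $2 ~ /^[0-9]+$/ && $2 + 0 <= 2 { print } # hour field is 0, 1 or 2
    '
}

# Usage on the bastion, as your tool account:
# crontab -l | crowded_entries
```

Any line it prints is a candidate for moving to a quieter hour by editing the second field of the entry.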

What are the primary changes with moving to Stretch?

Language runtime and library versions

The vast majority of the language runtimes and libraries installed on the grid nodes are upgraded in Stretch. The table below lists the primary packages in which users are likely to notice changes.

Runtime   Trusty version   Stretch version
Python3   3.4.0            3.5.3
PHP       5.5.9            7.2
Python2   2.7.5            2.7.13
NodeJS    0.10.25          8.11.1
Perl      5.18.2           5.24.1
Java      1.7.0            11.0.1
Ruby      1.9.3            2.3.3
Mono      5.12.0           5.12.0
TCL       8.6.1            8.6.0
R         3.2.3            3.3.3

Also note that the system-installed phpunit will not be present because there are no current packages for recent versions of PHP. To use phpunit, please install it via composer (instructions for setting up composer are at Help:Toolforge#Installing_MediaWiki_core).

Concurrency limits

  • Maximum of 16 active jobs simultaneously allowed per tool user
    • The scheduler will hold additional job submissions in the qw (queued/waiting) state until an active slot is available.
  • Maximum of 50 active and queued jobs simultaneously allowed per tool user
    • The scheduler will reject additional job submissions by exiting with a status code of 25 and writing "Unable to run job: job rejected: only 50 jobs are allowed per user (current job count: 50)" to stderr
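
Scripts that submit many jobs may want to react to that rejection rather than fail outright. The sketch below is one possible approach, assuming the exit status of 25 described above reaches your script unchanged. The helper name submit_with_retry and the MAX_TRIES and RETRY_DELAY variables are invented for this example and are not Toolforge settings.

```shell
# Hypothetical helper: run a submission command and retry while it exits
# with status 25 (the "job rejected" status described above).
# MAX_TRIES and RETRY_DELAY are illustrative knobs with arbitrary defaults.
submit_with_retry() {
    tries=0
    max_tries=${MAX_TRIES:-5}
    delay=${RETRY_DELAY:-60}
    while :; do
        status=0
        "$@" || status=$?
        # Any status other than 25 (success included) is passed straight back.
        if [ "$status" -ne 25 ]; then
            return "$status"
        fi
        tries=$((tries + 1))
        # Give up and report the rejection after max_tries attempts.
        if [ "$tries" -ge "$max_tries" ]; then
            return 25
        fi
        sleep "$delay"
    done
}

# Usage (the script path is a placeholder):
# submit_with_retry jsub ./my-script.sh
```

Waiting between attempts gives running jobs time to finish and free a slot. A tool that regularly hits the limit is probably better served by submitting fewer, larger jobs.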

Implementing these limits has allowed us to enable job submission from the continuous and task job queues.

Solutions to common problems

Having trouble with the new grid? If the answer to your problem isn't here, ask for help in #wikimedia-cloud or file a bug in Phabricator.

Rebuild virtualenv for python users

Because the Python executables and libraries are updated in Stretch, local virtualenvs must be deleted and re-created on the new bastion before anything that runs from them will work. Stale virtualenvs can cause several kinds of errors; the most obvious is an unexpected ImportError.

Using a requirements file may make this simpler in many cases, if your project doesn't already use one. You can create one by running pip freeze > requirements.txt in your tool folder with your old virtualenv still activated. Later, after you have deleted the old virtualenv and created a new one, you can run pip install -r requirements.txt to install the same packages into the new environment. For more information on this option, see pip's documentation on requirements files.

Example 1: Upgrading a Trusty grid engine based tool to the Stretch grid

Follow these steps if you manually submit jobs using jsub, or if you submit jobs using a crontab.

$ ssh <your-shell-name>@login-stretch.tools.wmflabs.org
$ become YOUR_TOOL
$ rm -rf venv     # This will destroy the virtualenv and all libraries, so make sure you know what you will need to install later!
$ virtualenv venv
$ source venv/bin/activate
$ pip install --upgrade pip # upgrade pip itself to avoid problems with older versions
$ pip install ... # Here you'd use the requirements file syntax if you have one, or you'd manually install each needed library.

Example 2: Upgrading a uWSGI webservice into a Kubernetes container

If you are currently running your uWSGI webservice under the Grid Engine backend (i.e., webservice uwsgi-python command), and you want to upgrade to a uWSGI webservice running under Kubernetes (i.e., webservice --backend=kubernetes python command), you should rebuild your virtualenv as follows:

$ ssh <your-shell-name>@login-stretch.tools.wmflabs.org
$ become YOUR_TOOL
$ webservice --backend=kubernetes python stop
$ webservice --backend=kubernetes python shell # do not skip this step – setting up the venv directly from the bastion may result in serious performance issues, compare T214086
$ rm -rf www/python/venv/ # this will destroy the virtualenv and all libraries, so make sure you know what you will need to install later!
$ python3 -m venv www/python/venv/
$ source www/python/venv/bin/activate
$ pip install --upgrade pip # upgrade pip itself to avoid problems with older versions
$ pip install -r www/python/src/requirements.txt # assuming your tool has a requirements.txt file
$ webservice --backend=kubernetes python start

Example 3: Upgrading a Kubernetes uWSGI webservice

If you are already using the Kubernetes backend, there is nothing you need to do -- the container will use the same Debian Jessie-based image as before.

PyYAML fails to install in Debian Stretch Python3 virtualenv

The new bastions are using systemd resource control to restrict the amount of RAM and CPU resources that a user can consume. We do this to attempt to keep a single user from using all of the shared resources of the bastion accidentally and thus making the bastion slow for everyone. The initial limits we had set were overly restrictive and caused gcc to fail when compiling PyYAML. This has been corrected by increasing the limits.

BotPassword or OAuth grant does not work from new job grid

Bot passwords and OAuth registrations can both include allowed IP range restrictions. The defaults for both are to allow usage from any IPv4 and IPv6 address. If you changed this when creating the bot password or OAuth consumer registration to restrict access to specific IP address ranges, you may have issues using the password or OAuth consumer from the new job grid. The Cloud VPS environment is nearing the end of a process of moving from the 10.0.0.0/8 private address range, which is shared with other internal servers operated by the Wikimedia Foundation, to a new 172.16.0.0/21 private subnet. The new job grid is the first end-user facing portion of Toolforge to be migrated to the new range.

The allowed IP ranges for bot passwords can be changed by the owner of the account using Special:BotPasswords. Either add the 172.16.0.0/21 CIDR to the list of allowed ranges or reset them to the defaults of 0.0.0.0/0 and ::/0.

The allowed IP ranges for an OAuth consumer registration can be changed by the original proposer of the registration using Special:OAuthConsumerRegistration/list. Either add the 172.16.0.0/21 CIDR to the list of allowed ranges or reset them to the defaults of 0.0.0.0/0 and ::/0.
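
If you are unsure whether a given address is covered by a range you have configured, the check can be done with plain shell arithmetic. The following is an illustrative sketch for IPv4 only. The function names ip_to_int and in_cidr are made up for this example, and the sample addresses in the usage comments are arbitrary rather than actual grid node addresses.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
# Assumes four decimal octets without leading zeros.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside CIDR range $2 (for example 172.16.0.0/21).
in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build a netmask with the top "bits" bits set.
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Usage:
# in_cidr 172.16.1.20 172.16.0.0/21 && echo "allowed"
# in_cidr 172.16.1.20 10.0.0.0/8    || echo "not covered by the old range"
```

A /21 covers 172.16.0.0 through 172.16.7.255, so a restriction that lists only 10.0.0.0/8 will not match any address in the new subnet.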

Lighttpd crashes on startup with message "parser failed somehow near here: (EOL)"

Lighttpd 1.4.40 made overriding keys in an existing array a fatal error. The Stretch version of lighttpd is 1.4.45. This change in the upstream application makes the advice at Help:Toolforge/Web/Lighttpd#Header, mimetype, character_encoding, error_handler for replacing existing mime-type mappings with new local versions obsolete.

Look for a $HOME/error.log line similar to Duplicate array-key '.js' just prior to the parser failure error message to help you find the entry in your $HOME/.lighttpd.conf file that needs to be removed.
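
A quick way to collect every offending key at once is to search the log for that message. This is a minimal sketch that assumes the message text appears in the log exactly as quoted above. The function name duplicate_array_keys is invented for the example.

```shell
# Hypothetical helper: list each distinct "Duplicate array-key" message
# found in a lighttpd error log, one per line.
# $1 is the path to the log file.
duplicate_array_keys() {
    grep -o "Duplicate array-key '[^']*'" "$1" | sort -u
}

# Usage, as your tool account:
# duplicate_array_keys "$HOME/error.log"
```

Each key it reports corresponds to an entry in $HOME/.lighttpd.conf that redefines a mapping lighttpd already has.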

'webservice stop' says service is not running, but 'webservice start' says service is running

BryanDavis has this advice:

  • webservice stop
  • rm $HOME/service.manifest
  • webservice [add other args here as needed] start

It is not fully understood what causes webservice to become confused about the state of the process, but deleting the service.manifest file generally seems to fix the issue.

Python: redis.exceptions.ResponseError: value is not an integer or out of range

The Python Redis client made a breaking change in v3.0.0: the prior StrictRedis class was renamed to Redis and replaced the older Redis class. The new behavior expects a different order of arguments for calls such as setex(). The expected order now matches the Redis protocol docs, so a call is written setex(name, time, value), rather than the more "pythonic" order that the prior implementation used. Typically this means that you need to swap the order of the time and value arguments in your calling code. See the library documentation for more breaking changes.

Delete a tool

Some tools were experiments that are done, others were made obsolete by other tools, some are just things that the original maintainer is tired of caring for. Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. Disabling a tool will immediately stop any running jobs including webservices and prevent maintainers from logging in as the tool. Disabled tools are archived and deleted after 40 days. Disabled tools can be re-enabled at any time prior to being archived and deleted.

Python 'oursql' package fails to compile

The latest official release of the Python 'oursql' package will not compile against MariaDB client libraries. See upstream bug report at https://github.com/python-oursql/oursql/issues/5. Oursql can be installed from a fork maintained at https://github.com/sqlobject/oursql, but the recommended long term solution is to migrate application code to the PyMySQL package instead.

SSH to login-stretch.tools.wmflabs.org fails with 'Permission denied (publickey)'

This is typically caused by the newer version of sshd that ships with Debian Stretch refusing to authenticate an insecure or deprecated public key type. Specifically, support for DSA (ssh-dss) keys was deprecated in OpenSSH 7.0. If your ssh public key starts with the string "ssh-dss", you are affected. RSA keys smaller than 1024 bits are also deprecated.

First make sure that you are passing a valid key by attempting to ssh to login-trusty.tools.wmflabs.org using the same public key and username. If this also fails, the problem is likely something other than the ssh key type. Join us in #wikimedia-cloud for interactive debugging help.

If you can ssh to login-trusty.tools.wmflabs.org with no errors, your key is probably of an unsupported type. Generate a new ssh key pair and upload the public key using the form at https://toolsadmin.wikimedia.org/profile/settings/ssh-keys. We currently recommend using either ed25519 or 4096-bit RSA keys. See Production shell access#Generating your SSH key for more information.
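
To check the key type without reading the file by eye, you can inspect the first field of the public key, which names its type. This sketch only detects DSA keys and does not check RSA key length. The function name check_key_type and the key path in the usage comment are examples, not fixed names.

```shell
# Hypothetical helper: report whether a public key file uses the deprecated
# DSA type. The first whitespace-separated field of an OpenSSH public key
# line is the key type (for example ssh-dss, ssh-rsa or ssh-ed25519).
# $1 is the path to the public key file.
check_key_type() {
    key_type=$(awk 'NR == 1 { print $1 }' "$1")
    if [ "$key_type" = "ssh-dss" ]; then
        echo "deprecated: $key_type"
        return 1
    fi
    echo "key type: $key_type"
}

# Usage (the path is an example; point it at your own public key):
# check_key_type "$HOME/.ssh/id_rsa.pub"
#
# Generating a replacement key pair with stock OpenSSH:
# ssh-keygen -t ed25519
# ssh-keygen -t rsa -b 4096
```

After generating a new pair, upload only the public half (the .pub file) through the form mentioned above.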

SSH to login-stretch.tools.wmflabs.org fails with 'Permission denied (publickey,hostbased)'

If you face this problem, make sure you are using the correct shell name. It is listed in your user preferences as "Instance shell account name" and is the name to use when logging in to any Toolforge server, whether Trusty or Stretch.

"Unable to run job: Error reading answer list from qmaster"

Attempting to start a job whose name includes non-ASCII characters using jsub, jstart, qcronsub, etc. may fail with an error message written to the job's err file like "Unable to run job: Error reading answer list from qmaster". This is a known bug in Son of Grid Engine.
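
Until the bug is fixed, one workaround is to check job names before submitting. The sketch below treats any byte outside the 7-bit ASCII range as a problem. The function name is_ascii_name is made up for this example.

```shell
# Hypothetical helper: succeed only if the given job name consists solely
# of ASCII bytes. Bytes 0x80-0xFF (octal 200-377) are stripped and the
# result is compared with the original.
is_ascii_name() {
    stripped=$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\377')
    [ "$stripped" = "$1" ]
}

# Usage (the job name is a placeholder):
# is_ascii_name "my-job" || echo "rename the job before submitting" >&2
```

Running tr under LC_ALL=C makes it operate on raw bytes, so every byte of a multi-byte UTF-8 character is caught.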

Monitoring tools

Why are we doing this?

Ubuntu Trusty was released in April 2014, and support for it (including security updates) will cease in April 2019. We need to shut down all Trusty hosts before the end of support date to ensure that Toolforge remains a secure platform. This migration will take several months because many people still use the Trusty hosts and our users are working on tools in their spare time.

During past operating system updates we were able to create a mixed grid containing hosts that ran multiple operating systems, and to control which one ran each job using command line arguments to jsub and webservice. The version of Sun Grid Engine (v6.2u5) that exists in Ubuntu Trusty is incompatible with "Son of" Grid Engine (v8.1.9) from Debian Stretch. Therefore the two grids must be entirely separate environments. Any cron jobs or web services that exist in the old grid (submitted from one of the current bastions) will not exist in the new grid. To schedule any job or service on the new Son of Grid Engine grid, you must log in to a bastion dedicated to that grid (currently tools-sgebastion-06.tools.eqiad.wmflabs) and submit it from there.

See also