News/Toolforge Trusty deprecation
This page details information about deprecating and removing hosts running Ubuntu Trusty (14.04) as an operating system from the Toolforge infrastructure. The login bastions and Grid execution hosts are still running Trusty and must be replaced with new instances.
The Ubuntu Trusty job grid was shut down on Monday 2019-03-25. Migration steps following the shutdown have changed slightly, so be sure to read them.
What is changing?
- New job grid running Son of Grid Engine on Debian Stretch instances
- New limits on concurrent job execution and job submission by a single tool
- New bastion hosts running Debian Stretch with connectivity to the new job grid
- New versions of PHP, Python2, Python3, and other language runtimes
- New versions of various support libraries
Timeline
- Done 2019-01-11: Availability of Debian Stretch grid announced to community
- Done Week of 2019-02-04: Weekly reminders via email to tool maintainers for tools still running on Trusty
- Week of 2019-03-04:
  - Done Daily reminders via email to tool maintainers for tools still running on Trusty
  - Done Switch login.tools.wmflabs.org to point to Stretch bastion
- Done 2019-03-25: Shut down Trusty grid
What should I do?
SSH to the Stretch bastion
login.tools.wmflabs.org connects to the new Debian Stretch bastion.
Move a grid engine webservice
When possible, we recommend migrating web services to Kubernetes instead of the new grid:
:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org
:# Become your tool account
$ become YOUR_TOOL
:# Start the webservice as a Kubernetes container rather than a grid job
:# <type> is one of: php7.2, php5.6, python, python2, nodejs, golang, jdk8, ruby2, tcl
$ webservice --backend=kubernetes <type> start
:# -- OR --
:# Start the webservice as a Stretch grid job
:# <type> is one of: lighttpd, uwsgi-python, tomcat, generic, lighttpd-plain, nodejs, uwsgi-plain
$ webservice --backend=gridengine <type> start
See Help:Toolforge/Web#Backends for more information on migrating from grid engine to Kubernetes.
Python2 and Python3 webservices will need to rebuild their virtualenv environments on the new target runtime (Stretch grid or Kubernetes).
NodeJS webservices will need to rebuild their $HOME/www/js/node_modules on the new target runtime (Stretch grid or Kubernetes).
Move a continuous job
:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org
:# Become your tool account
$ become YOUR_TOOL
:# Start your job on the Stretch job grid
$ jstart ...
The exact commands needed to start each continuous job vary greatly from tool to tool. This would be a great time to make a page of reference material for yourself and other maintainers here on Wikitech in the Tool namespace and using the Tool template if you haven't already.
Move a cron job
The crontab data for all tools which still had a cron registered on the Trusty grid was backed up to $HOME/crontab.trusty.save before the Trusty cron server was shut down. This backup can be used to set up your crontab on the Stretch grid.
:# Connect to the Stretch bastion
$ ssh <your-shell-name>@login.tools.wmflabs.org
:# Become your tool account
$ become YOUR_TOOL
:# Load the backup of your crontab on the Stretch job grid
$ crontab $HOME/crontab.trusty.save
If your workload permits, please avoid scheduling cronjobs from midnight to 3am so you're not competing with other cronjobs for system resources. That time window is currently very crowded.
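To see which crontab entries fall inside that crowded window, a sketch like the following may help. It assumes standard five-field crontab lines plus the @daily-style shortcuts, and it only recognizes a literal hour of 0, 1, or 2. The name crowded_cron_entries is a hypothetical helper, not a Toolforge command.

```shell
# Hypothetical helper: list crontab entries that run between midnight and 3am.
# Assumes standard five-field schedules. Hour lists, ranges, and steps are
# not expanded, so an entry such as "0 0-5 * * *" is not detected.
crowded_cron_entries() {
    awk '
        # Skip comment lines.
        /^[ \t]*#/ { next }
        # Shortcuts that all fire at 00:00.
        $1 ~ /^@(daily|midnight|weekly|monthly|yearly|annually)$/ { print; next }
        # Five schedule fields plus a command, with a literal hour of 0, 1, or 2.
        NF >= 6 && $2 ~ /^0?[0-2]$/ { print }
    ' "$1"
}
```

For example, running crowded_cron_entries "$HOME/crontab.trusty.save" prints the entries worth moving to a quieter hour before the backup is loaded.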
What are the primary changes with moving to Stretch?
Language runtime and library versions
The vast majority of the language runtimes and libraries installed on the grid nodes are upgraded in Stretch.
|Runtime||Trusty Version||Stretch Version|
Also note that the system-installed phpunit will not be present, because no current packages exist for recent versions of PHP. To use phpunit, please install it via composer (instructions for setting up composer are at Help:Toolforge#Installing_MediaWiki_core).
A table of the primary packages that users are likely to notice changes in is below.
|Trusty vs Stretch package version comparison|
Concurrent job limits
- Maximum of 16 active jobs simultaneously allowed per tool user
  - The scheduler will hold additional job submissions in the qw (queued/waiting) state until an active slot is available.
- Maximum of 50 active and queued jobs simultaneously allowed per tool user
  - The scheduler will reject additional job submissions by exiting with a status code of 25 and writing "Unable to run job: job rejected: only 50 jobs are allowed per user (current job count: 50)" to stderr.
Implementing these limits has allowed us to enable job submission from the continuous and task job queues.
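Scripts that submit many jobs may want to treat exit status 25 as "try again later" rather than as a hard failure. The following is a minimal sketch of that idea. The function name submit_with_retry and its arguments are hypothetical, and the retry count and delay are choices for you to make.

```shell
# Hypothetical helper: rerun a submission command while it exits with
# status 25, the status described above for the 50 job limit.
# Usage: submit_with_retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
submit_with_retry() {
    local max_attempts=$1 delay_seconds=$2 attempt=1 status
    shift 2
    while :; do
        if "$@"; then status=0; else status=$?; fi
        # Success, or any failure other than 25, is returned unchanged.
        [ "$status" -ne 25 ] && return "$status"
        # Give up once the attempt budget is used.
        [ "$attempt" -ge "$max_attempts" ] && return 25
        attempt=$((attempt + 1))
        sleep "$delay_seconds"
    done
}
```

A call such as submit_with_retry 5 60 jsub ... would try the submission up to five times, one minute apart. Keep the delay generous so that retries do not add load to the scheduler.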
Solutions to common problems
Having trouble with the new grid? If the answer to your problem isn't here, ask for help in #wikimedia-cloud or file a bug in Phabricator.
Rebuild virtualenv for python users
Since the Python executables and libraries are updated in Stretch, local virtualenvs will need to be deleted and re-created on the new bastion for anything that runs from those virtualenvs to work. Old virtualenvs are likely to cause several kinds of errors.
Using a requirements file may make this simpler in many cases, if your project doesn't already use one. You can create one by running pip freeze > requirements.txt in your tool folder with your virtualenv activated. Later, after you have deleted the old virtualenv and created a new one, you can run pip install -r requirements.txt to install the new environment. For more information on this option, see pip's documentation on requirements files.
Example 1: Upgrading a Trusty grid engine based tool to the Stretch grid
Follow these steps if you manually submit jobs using jsub, or if you submit jobs using a crontab.
$ ssh <your-shell-name>@login-stretch.tools.wmflabs.org
$ become YOUR_TOOL
$ rm -rf venv  # This will destroy the virtualenv and all libraries, so make sure you know what you will need to install later!
$ virtualenv venv
$ source venv/bin/activate
$ pip install --upgrade pip  # upgrade pip itself to avoid problems with older versions
$ pip install ...  # Here you'd use the requirements file syntax if you have one, or you'd manually install each needed library.
Example 2: Upgrading a uWSGI webservice into a Kubernetes container
If you are currently running your uWSGI webservice under the Grid Engine backend (i.e., the webservice uwsgi-python command), and you want to upgrade to a uWSGI webservice running under Kubernetes (i.e., the webservice --backend=kubernetes python command), you should rebuild your virtualenv as follows:
$ ssh <your-shell-name>@login-stretch.tools.wmflabs.org
$ become YOUR-TOOL
$ webservice --backend=kubernetes python stop
$ webservice --backend=kubernetes python shell  # do not skip this step – setting up the venv directly from the bastion may result in serious performance issues, compare T214086
$ rm -rf www/python/venv/  # this will destroy the virtualenv and all libraries, so make sure you know what you will need to install later!
$ python3 -m venv www/python/venv/
$ source www/python/venv/bin/activate
$ pip install --upgrade pip  # upgrade pip itself to avoid problems with older versions
$ pip install -r www/python/src/requirements.txt  # assuming your tool has a requirements.txt file
$ webservice --backend=kubernetes python start
Example 3: Upgrading a Kubernetes uWSGI webservice
If you are already using the Kubernetes backend, there is nothing you need to do -- the container will use the same Debian Jessie-based image as before.
PyYAML fails to install in Debian Stretch Python3 virtualenv
Task T215434 Resolved
The new bastions are using systemd resource control to restrict the amount of RAM and CPU resources that a user can consume. We do this to attempt to keep a single user from using all of the shared resources of the bastion accidentally and thus making the bastion slow for everyone. The initial limits we had set were overly restrictive and caused gcc to fail when compiling PyYAML. This has been corrected by increasing the limits.
BotPassword or OAuth grant does not work from new job grid
Bot passwords and OAuth registrations can both include allowed IP range restrictions. The defaults for both are to allow usage from any IPv4 and IPv6 address. If you changed this when creating the bot password or OAuth consumer registration to restrict access to specific IP address ranges, you may have issues using the password or OAuth consumer from the new job grid. The Cloud VPS environment is nearing the end of a process of moving from the 10.0.0.0/8 private address range that is shared with other internal servers operated by the Wikimedia Foundation to a new 172.16.0.0/21 private subnet. The new job grid is the first end-user facing portion of Toolforge to be migrated to the new range.
The allowed IP ranges for bot passwords can be changed by the owner of the account using Special:BotPasswords. Either add the 172.16.0.0/21 CIDR to the list of allowed ranges or reset them to the defaults, which allow usage from any IPv4 and IPv6 address.
The allowed IP ranges for an OAuth consumer registration can be changed by the original proposer of the registration using Special:OAuthConsumerRegistration/list. Either add the 172.16.0.0/21 CIDR to the list of allowed ranges or reset them to the defaults, which allow usage from any IPv4 and IPv6 address.
Lighttpd crashes on startup with message "parser failed somehow near here: (EOL)"
Lighttpd 1.4.40 made overriding keys in an existing array a fatal error. The Stretch version of lighttpd is 1.4.45. This change in the upstream application makes the advice at Help:Toolforge/Web/Lighttpd#Header, mimetype, character_encoding, error_handler for replacing existing mime-type mappings with new local versions obsolete.
Look for a $HOME/error.log line similar to "Duplicate array-key '.js'" just prior to the parser failure error message to help you find the entry in your $HOME/.lighttpd.conf file that needs to be removed.
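One way to pull those lines out of a long log is sketched below. It assumes only that the log contains the "Duplicate array-key '...'" phrase quoted above. The function name duplicate_array_keys is a hypothetical helper.

```shell
# Hypothetical helper: print each distinct "Duplicate array-key" message
# found in a lighttpd error log (defaults to $HOME/error.log).
duplicate_array_keys() {
    local log_file=${1:-$HOME/error.log}
    # -o prints only the matching part of each line; sort -u removes repeats.
    grep -o "Duplicate array-key '[^']*'" "$log_file" | sort -u
}
```

Each key it prints (for example '.js') is an entry to look for and remove in $HOME/.lighttpd.conf.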
'webservice stop' says service is not running, but 'webservice start' says service is running
BryanDavis has this advice:
webservice [add other args here as needed] start
It is not well understood what causes webservice to become confused about the state of the process, but deleting the service.manifest file generally seems to fix the issue.
Python: redis.exceptions.ResponseError: value is not an integer or out of range
The Python Redis client (redis-py) made a breaking change in v3.0.0 by renaming the prior StrictRedis class to Redis, which replaced the legacy Redis class. The new behavior expects a different order of arguments for calls such as setex(). The expected order of arguments now matches the Redis protocol docs, setex(name, time, value), rather than the more "pythonic" setex(name, value, time) order that the prior implementation used. Typically this means that you need to swap the order of the time and value arguments in your calling code. See the library documentation for more breaking changes.
Delete a tool
Task T170355 Resolved
Some tools were experiments that are done, others were made obsolete by other tools, some are just things that the original maintainer is tired of caring for. Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. Disabling a tool will immediately stop any running jobs including webservices and prevent maintainers from logging in as the tool. Disabled tools are archived and deleted after 40 days. Disabled tools can be re-enabled at any time prior to being archived and deleted.
Python 'oursql' package fails to compile
The latest official release of the Python 'oursql' package will not compile against MariaDB client libraries. See upstream bug report at https://github.com/python-oursql/oursql/issues/5. Oursql can be installed from a fork maintained at https://github.com/sqlobject/oursql, but the recommended long term solution is to migrate application code to the PyMySQL package instead.
SSH to login-stretch.tools.wmflabs.org fails with 'Permission denied (publickey)'
This is typically caused by the newer version of sshd provided by Debian Stretch refusing to authenticate an insecure or deprecated public key type. Specifically, support for DSA (ssh-dss) keys was deprecated in OpenSSH 7.0. If your ssh public key starts with the string "ssh-dss" you will be impacted by this. RSA keys smaller than 1024 bits are also deprecated.
First make sure that you are passing a valid key by attempting to ssh to login-trusty.tools.wmflabs.org using the same public key and username. If this also fails, the problem is likely something other than the ssh key type. Join us in #wikimedia-cloud for interactive debugging help.
If you can ssh to login-trusty.tools.wmflabs.org with no errors, your key is probably of an unsupported type. Generate a new ssh key pair and upload the public key using the form at https://toolsadmin.wikimedia.org/profile/settings/ssh-keys or Special:Preferences#mw-prefsection-openstack. We currently recommend using either ed25519 or 4096-bit RSA keys. See Production shell access#Generating your SSH key for more information.
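A quick local check of the key type can be sketched as follows. It assumes a standard OpenSSH public key file whose first field is the key type. The function names pubkey_type and is_dsa_pubkey are hypothetical helpers.

```shell
# Hypothetical helper: print the key type stored in a public key file.
pubkey_type() {
    local key_type
    # The first whitespace-separated field of a public key line is its type.
    # "|| true" tolerates a file that lacks a trailing newline.
    read -r key_type _ < "$1" || true
    printf '%s\n' "$key_type"
}

# Hypothetical helper: succeed if the public key file holds a DSA key.
is_dsa_pubkey() {
    [ "$(pubkey_type "$1")" = "ssh-dss" ]
}
```

If is_dsa_pubkey succeeds for the key you upload, that key needs replacing. The function does not check RSA key length, but ssh-keygen -l -f on the same file prints the bit length along with the fingerprint.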
SSH to login-stretch.tools.wmflabs.org fails with 'Permission denied (publickey,hostbased)'
If you face this problem, make sure you are using the correct shell name. It is listed in your user preferences as **Instance shell account name**, and it is the name to use when logging in to any Toolforge bastion, whether Trusty or Stretch.
"Unable to run job: Error reading answer list from qmaster"
Attempting to start a job with a name that includes non-ASCII characters using jsub, jstart, qcronsub, etc. may fail with an error message written to the job's err file like "Unable to run job: Error reading answer list from qmaster". This is a known bug in Son of Grid Engine.
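Until that bug is fixed, a script can check a job name before submitting it. The sketch below is one way to do so. The function name is_ascii_job_name is a hypothetical helper, and it treats anything outside printable ASCII (space through tilde) as unsafe, which is stricter than the bug description requires.

```shell
# Hypothetical helper: succeed only if a job name is plain printable ASCII.
is_ascii_job_name() {
    local leftover
    # In the C locale tr works byte by byte. Deleting every printable ASCII
    # byte leaves output only when the name contains something else.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d ' -~')
    [ -z "$leftover" ]
}
```

A wrapper script could call is_ascii_job_name "$name" and refuse to pass the name to jsub or jstart when the check fails.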
- Tools running jobs on Trusty grid engine in last 7 days
  - This report updates once per hour and will not report jobs that have been seen running on the Stretch grid in the same 7 day period.
  - Report has drill down pages for each maintainer and tool. Examples: bd808's tools, gridengine-status tool
  - Webservices that move from the Trusty grid directly to the Kubernetes cluster will not be removed from the report automatically.
- Jobs running on the Stretch grid in the last 7 days
- Stretch grid status
Why are we doing this?
Ubuntu Trusty was released in April 2014, and support for it (including security updates) will cease in April 2019. We need to shut down all Trusty hosts before the end of support date to ensure that Toolforge remains a secure platform. This migration will take several months because many people still use the Trusty hosts and our users are working on tools in their spare time.
During past operating system updates we were able to create a mixed grid which contained hosts running multiple operating systems, and to control which one ran each job using command line arguments to webservice. The version of Sun Grid Engine (v6.2u5) that exists in Ubuntu Trusty is incompatible with "Son of" Grid Engine (v8.1.9) from Debian Stretch. Therefore the two grids must be entirely separate environments. Any cron jobs or web services that exist in the old grid (submitted from one of the current bastions) will not exist in the new grid. To schedule any job or service on the new Son of Grid Engine grid, one must log into a bastion dedicated to that grid (currently tools-sgebastion-06.tools.eqiad.wmflabs) to submit them.