Nova Resource:Tools/Help

From Wikitech

If you have issues after the migration with a previously running tool, please check: Tool_Labs/Migration_to_eqiad


What is Tool Labs?

Tool Labs is a reliable, scalable hosting environment for community developers working on tools and bots that help users maintain and use wikis. The cloud-based infrastructure was developed by the Wikimedia Foundation and is supported by a dedicated group of Wikimedia Foundation staff and volunteers.

Tool Labs is a part of the Labs project, which is designed to make it easier for developers and system administrators to try out improvements to Wikimedia infrastructure, including MediaWiki, and to do analytics and bot work.


Tool Labs was developed in response to the need to support external tools and their developers and maintainers. The system is designed to make it easy for maintainers to share responsibility for their tools and bots, which helps ensure that no useful tool gets ‘orphaned’ when one person needs a break. The system is designed to be reliable, scalable and simple to use, so that developers can hit the ground running and start coding.


In addition to providing a well supported hosting environment, Tool Labs provides:

  • support for Web services, continuous bots, and scheduled tasks
  • access to replicated production databases
  • easily shared management of tool accounts, where tools and bots are stored
  • a grid engine for dispatching jobs
  • support for mosh, SSH, SFTP without complicated proxy setup
  • version control via Gerrit and Git
  • support for Redis

Architecture and terminology

Tool Labs has essentially four components: the bastion hosts, the grid, the web cluster, and the databases. Users access the system via one of two Tool Labs projects: ‘tools’ or ‘toolsbeta’. To request an account on the ‘tools’ project, where most tool and bot development is hosted and maintained, please see Tools Access Request.

Bastion hosts, grid, web cluster, databases

The four main components of Tool Labs, in a nutshell:

Bastion hosts

The bastion host is where users log in to Tool Labs. Currently, Tool Labs has two bastion hosts:


The two hosts are functionally identical, but we request that heavy processing (compiles, etc.) be done only on the secondary host, to keep interactive performance on the primary one snappy.

The grid

The Tool Labs grid, implemented with Open Grid Engine (the open-source fork of Sun Grid Engine) permits users to submit jobs from either a log-in account on the bastion host or from a web service. Submitted jobs are added to a work queue, and the system finds a host to execute them. Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going. For more information about the grid, please see #Submitting, managing and scheduling jobs on the grid.

The web cluster

The Tool Labs web cluster is fronted by a web proxy, which supports SSL and is open to the Internet. Any of the servers in the cluster can serve any of the hosted web tools, as Tool Labs uses a shared storage system; the proxy distributes requests among the web servers. The cluster uses suPHP to run scripts and CGI. Note that individual tool accounts have both a ~/public_html/ and a ~/cgi-bin/ directory in the home directory for storing web files. For more information, please see #Web services.

We have transitioned to a new web server system ('New Web'). This provides each tool with its own lighttpd server, with full configuration options. FCGI scripts are supported using configuration options, and WSGI is supported using flup.server.fcgi. See #Web services for more information.

The databases

Tool Labs supports two sets of databases: the production replicas and user-created databases, which are used by individual tools. The production replicas follow the same setup as production, and the information that can be accessed from them is the same as that which normal registered users (i.e.: not +sysop or other types of advanced permissions) can access on-wiki or via the API. Note that some data has been removed from the replicas for privacy reasons. User-created databases can be created by either a user or a tool on the replica servers or on a local ‘tools’ project database.

Projects: Tools and Toolsbeta

Like the rest of Labs, Tool Labs is organized into ‘projects’. Currently, Tool Labs consists of two projects: ‘tools’ and ‘toolsbeta’, which are described in more detail here:

The ‘tools’ project is where tools and bots are developed and maintained. ‘toolsbeta’ is used for experiments in the Tool Labs environment itself--things like new systems or experimental versions of system libraries that could affect other users. In general, every tool maintainer should work primarily on the "tools" project, only doing work on toolsbeta when changes to Tool Labs itself need to be tested to support their tool.


Developers working in Tool Labs do not have to create or set up virtual machines (i.e., Labs ‘instances’), as the Tool Labs project admins create and manage them. The term will come up in the Labs documentation; otherwise, don’t worry about it.

Rules of use

Tool Labs policies

All tools and bots developed and maintained on Tool Labs must adhere to the terms of use that will be available here when they are finalized:

Specifically, tools must be

Private information must be handled carefully, if at all. Note that private user information has been redacted from the replicated databases provided by the system.

As the Tool Labs environment is shared, we ask that you strive not to break things for others, and to be considerate when using system resources.

Individual wiki policies (these differ!)

When developing on Tool Labs, please adhere to the bot policies of the wikis your bot interacts with. Each wiki has its own guidelines and procedures for obtaining approval. The English Wikipedia, for example, requires that a bot be approved by the Bot Approvals Group before it is deployed, and that the bot account be marked with a ‘bot’ flag. See the English Wikipedia's bot policy for more information.

For general information and guidelines, please see Bot policy.


We’d love to hear from you! You can find us here:

  • On IRC: #wikimedia-labs on Freenode, a great place to ask questions, get help, and meet other Tool Labs developers. See Help:IRC for more information.

Getting access to Tool Labs

Anyone can view the source code and the output of most tools and bots, and anyone can get an account of their own as well.

To access Tool Labs you need:

  • to create a Labs account, which provides shell access (you must upload an SSH key)
  • to request access to the 'tools' project

Steps for creating a Labs account, creating and uploading an SSH key, and for requesting access to the 'tools' project are described in the next sections.

Creating a Labs account on Wikitech

Before you can access Tool Labs, you must create a Labs account on Wikitech, which is the general interface for everything Labs.

Sign up for a Labs account here: Request account (you will be asked to enter the new account's information)

The "Instance shell account name" you specify in the Create Account form will be your Unix username on all Labs projects. If you forget your username, you can always find it under Preferences > Instance shell account name.

Once you have created a Labs account you will be added to a list of users to be approved for shell access, which you can see here: Shell Access Requests.

Generating and uploading an SSH key

In order to access Labs servers using SSH, you must provide a public SSH key. Once you have created a Labs account, you can specify a public key on the 'OpenStack' tab of your Wikitech preferences.

Specify the SSH key here: OpenStack Preferences

Generating a key in Windows

To generate an SSH key in Windows:

  1. Open PuTTYgen
  2. Select an SSH-2 RSA key
  3. Click the Generate button
  4. Move your mouse around until the progress bar is full
  5. Type in a passphrase (you will need to remember this) and confirm it
  6. Save the private key and public key onto your local machine
  7. Right-click the text field labelled 'Public key for pasting into OpenSSH authorized_keys file' and copy its contents
  8. Paste them into the 'OpenStack' tab of your Wikitech preferences

Generating a key in Linux

Modern Unix systems include the OpenSSH client (if yours does not, install it). To generate a key, use:

ssh-keygen -t rsa

This will store your private key in $HOME/.ssh/id_rsa and your public key in $HOME/.ssh/id_rsa.pub. You can use different filenames (with the -f parameter), but these are the default filenames, so it's easiest not to change them.
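For example, a dedicated key for Labs can be generated non-interactively with -f and -C. A minimal sketch, in which the 'labs_rsa' filename and the comment are example values, and -N "" (no passphrase) is used only so the demo runs unattended; in practice you should set a passphrase:

```shell
# Sketch: generate a dedicated RSA key pair for Labs.
# We write into a temporary directory here; in practice you would
# use ~/.ssh and choose a passphrase instead of -N "".
KEYDIR="$(mktemp -d)"
ssh-keygen -t rsa -b 4096 -f "$KEYDIR/labs_rsa" -C "wikitech-labs-key" -N "" -q
# The .pub file is what you paste into the 'OpenStack' preferences tab:
cat "$KEYDIR/labs_rsa.pub"
```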

Requesting access to the 'tools' project

Once you have created a Labs account, you must request access to the ‘tools’ project by submitting a Tools Access Request.

Submit a request here: Tools Access Request

Requests for access are generally dealt with within the day (often faster), though response-time may be longer depending on admin availability. If you need immediate assistance, please contact us on IRC.

Receiving access to the 'tools' project

Once your 'tools' project access request has been processed, you will become a member of the 'tools' project, and will be able to access it using the "Instance shell account name" provided when creating your Labs account and the private key matching the public key you supplied for authentication. For more information about accessing the project, please see #Accessing Tool Labs and managing your files below.


You will be notified on Wikitech that your user rights were changed, that your request was linked from 'Nova Resource:Tools', and that you have been added to the project Nova Resource:Tools. You will also receive email explaining that your user rights have been changed and that you are now a member of the group 'shell'. In other words, your Tool Labs account is ready for you to use!

Storage and use

Although you access Tool Labs via your Labs account, we strongly recommend against saving data or tools in any space that is accessible to individuals only. Tools and bots should be maintained in Tool accounts, which have flexible memberships (i.e., multiple people can help maintain the code!). For more information about Tool accounts, please see #Joining and creating a Tool account.

Accessing Tool Labs and managing your files

Tool Labs can be accessed in a variety of ways--from its public IP to a GUI client. Please see Help:Access for general information about accessing Labs. Pointers to more information on specific means of access appear below.

Tools home page

The Tools home page:

The Tools home page is publicly available and contains a list of all currently hosted Tool accounts along with the name(s) of the maintainers for each. Individual tool accounts that have an associated web page will appear as links. Users with access to the 'tools' project can create new tool accounts here, and add or remove maintainers to and from existing tool accounts.


Users can SSH to the 'tools' project via its bastion host, provided that a public SSH key has been uploaded to the Labs account.


Note that if you plan to do heavy processing (compiling, etc.), you should SSH to the secondary bastion host instead.

Using 'take' to transfer ownership of uploaded files

Once you have logged in via SSH, you can transfer files via sftp and scp. Note that the transferred files will be owned by you. You will likely wish to transfer ownership to your tool account. To do this:

1. Become your tool account using 'become':

maintainer@tools-login:~$ become toolaccount

2. As your tool account, 'take' ownership of the files:

tools.toolaccount@tools-login:~$ take FILE

The 'take' command will change the ownership of the file(s) and directories recursively to the calling user (in this case, the tool account).

Handling permissions

If you're getting permission errors, note that you can also transfer files the other way around: copy the files as your tool account to /data/project/<toolname>.

Another, probably easier way is to make the tool's directory group-writable. For example, if your shell account's name is 'alice' and your tool's name is 'alicetools', you could do something like this after logging in with your shell account:

become alicetools
chmod -R g+w /data/project/alicetools
cp -rv /home/alice/* /data/project/alicetools/

Another option is to create a Git repository, check that repository out in your tool's directory and run a regular 'git pull' whenever you want to deploy new files.
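That pull-based workflow can be sketched as follows, using temporary directories in place of a real hosted repository and the tool's directory under /data/project/ (all paths here are illustrative):

```shell
# Sketch of deploying via 'git pull'. $WORK/src stands in for your
# hosted repository; $WORK/tool stands in for the tool's directory.
WORK="$(mktemp -d)"

# "Upstream" repository with an initial commit:
git init -q "$WORK/src"
cd "$WORK/src"
git config user.email 'dev@example.org'   # example identity
git config user.name 'Dev'
echo 'hello' > index.html
git add index.html
git commit -qm 'Initial check-in'

# Check the repository out once in the tool's directory:
git clone -q "$WORK/src" "$WORK/tool"

# Later, a new commit lands upstream...
echo 'v2' >> index.html
git commit -qam 'Update'

# ...and deploying is just a pull from the tool's checkout:
cd "$WORK/tool"
git pull -q
```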

Using multiple ssh agents

If you use multiple ssh-agents (to connect to your personal or company system, for example), see Managing Multiple SSH Agents for more information about setting up a primary and a Labs agent.

Putty and WinSCP

Note that instructions for accessing Tool Labs with Putty and WinSCP differ from the instructions for using them with other Labs projects. Please see Help:Access to ToolLabs instances with PuTTY and WinSCP for information specific to Tool Labs.

Other graphical file managers (e.g., Gnome/KDE)

For information about using a graphical file manager (e.g., Gnome/KDE), please see Accessing instances with a graphical file manager.

Joining and creating a Tool account

What is a Tool account?

Tool accounts, which can be created by any ‘tools’ project member, are fundamental to the structure and organization of Tool Labs. Although each tool account has a user ID, they are not personal accounts (like a Labs account); rather, they are service accounts, each consisting of a user and group ID (i.e., a Unix uid/gid pair) intended to run the actual tool or bot.

  • Unix user: tools.toolname
  • Unix group: tools.toolname

Members of the Unix group include:

  • the tool account creator
  • the tool account itself
  • (optionally, but encouraged!) additional tool maintainers

Maintainers may have more than one tool account, and tool accounts may have more than one maintainer. Every member of the group is authorized to sudo to the tool account. By default, only members of the group have access to the tool account's code and data.

A simple way for maintainers to switch to the tool account is with ‘become’:

maintainer@tools-login:~$ become toolname

In addition to the user/group pair, each tool account includes:

  • A home directory on shared storage: /data/project/toolname
  • A ~/public_html/ and a ~/cgi-bin/ directory, whose contents are served from the tool's web address
  • Database access credentials, which provide access to the production database replicas as well as to project-local databases
  • Access to the continuous and task queues of the compute grid

Joining an existing Tool account

All tool accounts hosted in Tool Labs are listed on the Tools home page. If you would like to be added to an existing account, you must contact the maintainer(s) directly.

If you would like to add (or remove) maintainers to a tool account that you manage, you may do so with the 'add' link found beneath the tool name on the Tools home page.

Creating a new Tool account

Members of the ‘tools’ project can create tool accounts from the Tools home page:

  1. Navigate to the Tools home page:
  2. Select the “create new tool” link (found beside “Hosted tools” near the top of the page)
  3. Enter a “Service group name”. The service group name will be used as the name of your tool account.

Do not prefix your service group name with 'tools.'. The management interface will do so automatically where appropriate, and a known issue will cause the account to be created improperly if you do.

Note: If you have only recently been added to the ‘tools’ project, you may get an error about not having appropriate credentials. Simply log out and back in to Wikitech to fix this.

The tool account will be created and you will be granted access to it within a minute or two. If you were already logged in to your Labs account through SSH, you will have to log off then back in before you can access the tool account.

Deleting a Tool account

You can't delete a tool account yourself, though you can delete the content of your directories. If you really want a tool account to be deleted, please contact an admin.

Using Toolsbeta

Nearly all tool development is done on the 'tools' project, and 99.9% of the time, creating a tool account on this project will serve your needs. However, if your tool or bot requires an experimental library or a significant change to the 'tools' infrastructure--anything that could potentially negatively impact existing tools--you should experiment with the new infrastructure on toolsbeta.

To request access to toolsbeta, please visit #wikimedia-labs on IRC. You can also request access via the labs-l mailing list or via Bugzilla.

Customizing a Tool account

Once you have created a tool account, there are a few things that you can customize to make the tool more easily understood and used by other users. These include:

  • adding a tool account description (the description will appear on the Tools home page beside the tool name)
  • creating a home page for your tool (if you create a home page for the tool, it will be linked from the Tools home page automatically)

Tool Labs will soon support mail to both Labs users and tool accounts (mail to a tool account will go to all maintainers by default). You can customize mail settings as well.

Creating a tool web page

To create a web page for your tool account, simply place an index.html file in the tool account's ~/public_html/ directory. The page can be a simple description of the tool or bot with basic information on how to set it up or shut it down, or it can contain an interface for the web service. To see examples of existing tool web pages, click any of the linked tool names on the Tools home page.

Note that some files, such as PHP files, will give a 500 error unless the owner of the file is the tool account.

You will also need to start a webservice for your tool.

1. Log into your Labs account and become your tool account:

maintainer@tools-login:~$ become toolname

2. Start the web service:

tools.toolname@tools-login:~$ webservice start
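Putting the page and the service together might look like the sketch below. A temporary directory stands in for the tool's home (on Tool Labs this would be /data/project/<toolname>), and the final `webservice start` is shown as a comment because it only works on Tool Labs itself:

```shell
# Sketch: a minimal web page for a tool account.
TOOLHOME="$(mktemp -d)"   # stand-in for the tool's home directory
mkdir -p "$TOOLHOME/public_html"
cat > "$TOOLHOME/public_html/index.html" <<'EOF'
<!DOCTYPE html>
<html>
  <head><title>mytool</title></head>
  <body><p>mytool: what it does and how to reach the maintainers.</p></body>
</html>
EOF
# On Tool Labs, as the tool account, you would then run:
#   webservice start
```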

Make the tool translatable

If your tool is used from the web, you will want to make it translatable. You can and should use the Intuition framework (PHP only), which handles message localisation for you.

Don't waste time reinventing the wheel: learn from our experience with MediaWiki by reading the message documentation tips and other internationalization hints.

Creating a tool description

To create a tool description:

1. Log into your Labs account and become your tool account:

maintainer@tools-login:~$ become toolname

2. Create a ‘.description’ file in the tool account’s home directory. Note that this file must be HTML:

tools.toolname@tools-login:~$ vim .description

3. Add a brief description (no more than 25 words or so) and save the file.

4. Navigate to the Tools home page. Your tool account description should now appear beside your tool account name.
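The file-creation step can also be done non-interactively instead of with vim; a sketch, where the description text is an example and a temporary directory stands in for the tool account's home:

```shell
# Sketch: write the HTML '.description' file without an editor.
TOOLHOME="$(mktemp -d)"   # stand-in for the tool account's home
printf '%s\n' 'Counts <b>article edits</b> per user.' > "$TOOLHOME/.description"
```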

Configuring bots and tools

Tool and bot code should be stored in your tool account, where it can be managed by multiple users and accessed by all execution hosts. Specific information about configuring web services and bots, along with information about licensing, package installation, and shared code storage, is available in the #Developing on Tool Labs section.

Note that bots and tools should be run via the grid, which finds a suitable host with sufficient resources to run each. Simple, one-off jobs can be submitted to the grid easily with the jsub command. Continuous jobs, such as bots, can be submitted with jstart.
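jsub can also be called from cron for recurring tasks. A hypothetical crontab entry might look like this; the job name, schedule, and script path are examples, and the exact flags should be checked against jsub's built-in help:

```shell
# Example crontab entry (edit with 'crontab -e'): submit mybot's
# update script to the task queue every day at 03:00.
0 3 * * * jsub -N mybot-update -once -quiet /data/project/mybot/update.sh
```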

Setting up code review and version control

Although it's possible to just stick your code in the directory and modify it manually every time you want to change something, your future self and your future collaborators will thank you if you instead use source control (a.k.a. version control) together with a code review tool. Wikimedia Labs makes it pretty easy to use Git for source control and Gerrit for code review, but you also have other options.

Setting up a local Git repository

It is fairly simple to set up a local Git repository to keep versioned backups of your code. However, if your tool directory is deleted for some reason, your local repository will be deleted as well. You may wish to request a Gerrit/Git repository to safely store your backups and/or to share your code more easily. Other backup/versioning solutions are also available. See User:Magnus Manske/Migrating from toolserver#GIT for some ideas.

To create a local Git repository:

1. Create an empty Git repository

maintainer@tools-login:~$ git init

2. Add the files you would like to back up. For example:

maintainer@tools-login:~$ git add public_html

3. Commit the added files

maintainer@tools-login:~$ git commit -m 'Initial check-in'

For more information about using Git, please see the git documentation.

Requesting a Gerrit/Git repository for your tool

Tool Labs users may request a Gerrit/Git repository for their tools. Access to Git is managed via Wikimedia Labs and integrated with Gerrit, a code review system.

In order to use the Wikimedia Labs code review and version control, you must upload your ssh key to Gerrit and then request a repository for your tool.

  1. Log in to Gerrit with your Labs account.
  2. Add your SSH public key (select “Settings” from the drop-down menu beside your user name in the upper right corner of the screen, and then “SSH Public Keys” from the Settings menu).
  3. Request a Gerrit project for your tool: Gerrit/New repositories

For more information about using Git and Gerrit in general, please see Git/Gerrit.

Database access

Tool and Labs accounts are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table), but otherwise the schema is, for all practical purposes, identical to the production databases and the databases are sharded into clusters in much the same way.

Database credentials (user name and password) are stored in a credentials file in the tool account’s home directory. To use these credentials with command-line tools by default, copy that file to '.my.cnf'.
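A sketch of that copy step, using temporary files for illustration; CREDFILE stands in for the credentials file in the tool's home directory, and the user/password values are made up. Command-line clients read ~/.my.cnf automatically:

```shell
# Sketch: make the tool's database credentials the default for
# command-line clients. All names and values here are examples.
HOMEDIR="$(mktemp -d)"                 # stand-in for the tool's home
CREDFILE="$HOMEDIR/credentials.cnf"    # stand-in for the credentials file
printf '[client]\nuser=p12345\npassword=secret\n' > "$CREDFILE"
cp "$CREDFILE" "$HOMEDIR/.my.cnf"      # clients now pick it up by default
```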

Naming conventions

As a convenience, each MediaWiki project database (enwiki, bgwiki, etc.) has an alias pointing to the cluster it is hosted on. The alias has the form:

<project>.labsdb

where '<project>' is the name of a hosted MediaWiki project (enwiki, bgwiki, bgwiktionary, cswiki, enwikiquote, enwiktionary, eowiki, fiwiki, idwiki, itwiki, nlwiki, nowiki, plwiki, ptwiki, svwiki, thwiki, trwiki, zhwiki, commonswiki, dewiki, wikidatawiki, arwiki, eswiki... for a complete list, look at the /etc/hosts file on tools-login).

The database names themselves consist of the mediawiki project name, suffixed with _p (an underscore, and a p), for example:

enwiki_p (for the English Wikipedia replica)

In addition, each cluster can be accessed by the name of its shard, sX.labsdb (for example, s1.labsdb hosts the enwiki_p database). As the cluster where a database is located can change, you should only use this name if your application requires it, e.g. for heavily cross-wiki tools which would otherwise open hundreds of database connections.

Connecting to the database replicas

You can connect to the database replicas (and/or the cluster where a database replica is hosted) by specifying your access credentials and the alias of the cluster and replicated database. For example:

To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.labsdb) and the alias of the database replica (enwiki_p) :

mysql --defaults-file="${HOME}"/ -h enwiki.labsdb enwiki_p

To connect to the Wikidata cluster:

mysql --defaults-file=~/ -h wikidatawiki.labsdb

To connect to Commons cluster:

mysql --defaults-file=~/ -h commonswiki.labsdb

There is also a shortcut for connecting to the replicas: sql <dbname>[_p]. The _p suffix is optional and implicit (i.e., the sql tool will add it if absent).

To connect to the English Wikipedia database replica using the shortcut, simply type:

sql enwiki

Connecting to the database replicas from other Labs instances

It is possible to connect to the Tool Labs database replicas from other Labs instances besides tools. This requires a bit of network configuration. First, as a convenience measure, extract the labsdb entries from the tools /etc/hosts file, like this:

grep '^192\.168\.99\.' /etc/hosts > labsdb.hosts

and copy the labsdb.hosts file to your instance.

Next copy the iptables rules from


to your instance and make sure they are applied (using iptables-restore; preferably in a startup script). The iptables rules perform an address translation from the virtual network to the real database host IPs with the correct port numbers.

On your instance, append the labsdb.hosts entries to your local /etc/hosts file. Lastly, you may want to copy the relevant credential files over to simplify authentication.
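The host-file part of these steps can be sketched as follows, operating on temporary files for illustration; on real instances you would work with /etc/hosts on both machines and load the copied rules with iptables-restore (the example IP entry is made up):

```shell
# Sketch of the /etc/hosts steps, on temporary stand-in files.
SRC="$(mktemp)"   # stands in for /etc/hosts on tools
DST="$(mktemp)"   # stands in for /etc/hosts on your instance
LH="$(mktemp)"    # stands in for the labsdb.hosts file
printf '127.0.0.1 localhost\n192.168.99.1 enwiki.labsdb\n' > "$SRC"

grep '^192\.168\.99\.' "$SRC" > "$LH"   # extract the labsdb entries on tools
cat "$LH" >> "$DST"                     # append them on your instance
# Then apply the copied iptables rules on the instance, e.g.:
#   iptables-restore < labsdb.rules     # (file name is an example)
```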

Creating new databases

User-created databases can be created on the servers hosting the replicas or on a database server local to the 'tools' project: tools-db. The latter tends to be a bit faster since that server sees less heavy activity, and tools-db is the recommended location for user-created databases when no interaction with the production replicas is needed. Users have all privileges and grant options on the databases they create. Databases with names ending in _p are readable by everyone.

Database names must start with the name of the credential user (not your user name), which can be found in the credentials file in your home directory (the name of the credential user looks something like 'p50252g21636'). The name of the credential user is followed by two underscores and then the name of the database:

<credentialUser>__<dbname> (for example, p50252g21636__mydb)

Note that users are granted complete control over their <credentialUser>__ databases, but nothing else.

Steps to create a user database on the replica servers

If you would like your database to interact with the replica databases (i.e., if you need to do actual SQL joins with the replicas, which can only be done on the same cluster) you can create a database on the replica servers.

To create a database on the replica servers:

1. Become your tool account:

maintainer@tools-login:~$ become toolaccount

2. Connect to the replica servers with the credentials. You must specify the host of the replica (e.g., enwiki.labsdb):

mysql --defaults-file="${HOME}"/ -h XXwiki.labsdb

3. In the mysql console, create a new database, where CREDENTIALUSER is your credential user (found in the credentials file in your home directory) and DBNAME is the name you want to give your database:

CREATE DATABASE CREDENTIALUSER__DBNAME;
You can then connect to your database using:

mysql --defaults-file="${HOME}"/ -h XXwiki.labsdb CREDENTIALUSER__DBNAME
Caution: Writing to tables (INSERT, UPDATE, DELETE) on the replica servers stops replication for all users on that server until the query finishes. Make sure that such queries can be executed in a time frame that does not disrupt other users' work too much, for example by processing data in smaller batches or rethinking your data flow. As a rule of thumb, queries should finish in less than a minute.
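One way to keep individual write queries short is to work in fixed-size chunks. A dry-run sketch that only prints each statement instead of executing it; the table name, condition, batch size, and row count are all illustrative:

```shell
# Dry-run sketch of batched writes: emit small LIMIT-ed DELETE
# statements instead of one long-running query. On Tool Labs you
# would pipe each statement to mysql and pause between batches.
BATCH=1000
TOTAL=3500    # pretend number of rows to clean up
DONE=0
while [ "$DONE" -lt "$TOTAL" ]; do
    echo "DELETE FROM mytool_cache WHERE expired=1 LIMIT $BATCH;"
    DONE=$((DONE + BATCH))
done
```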

Steps to create a user database on tools-db

To create a database on tools-db:

1. Become your tool account.

maintainer@tools-login:~$ become toolaccount

2. Connect to tools-db with the credentials:

mysql --defaults-file="${HOME}"/ -h tools-db

You could also just type:

sql local

3. In the mysql console, create a new database, where CREDENTIALUSER is your credential user (found in the credentials file in your home directory) and DBNAME is the name you want to give your database:

CREATE DATABASE CREDENTIALUSER__DBNAME;
You can then connect to your database using:

mysql --defaults-file="${HOME}"/ -h tools-db CREDENTIALUSER__DBNAME


Assuming that your tool account is called "mytool", this is what it would look like:

maintainer@tools-login:~$ become mytool
tools.mytool@tools-login:~$ cat | grep user
mysql --defaults-file="${HOME}"/ -h tools-db
create database 123something__wiki;

Joins between commons, centralauth and wikidata and other project databases

On every database slice that does not host a copy of those databases, there is now a database containing a federated version of the original, named commonswiki_f_p, centralauth_f_p, and wikidatawiki_f_p respectively.

Those databases use a network connection between the slices to access the data, which allows joins. But be aware of a very important limitation: joins or queries that do not use an indexed column to restrict the data will be much slower – by several orders of magnitude – and can impact general database performance. If you can use the "real" database and combine the data with application logic instead, doing so is generally the better solution.

In general, joins where there is a one-to-one map between a row and an index key will remain reasonably efficient.

Given the performance impact, databases with federated tables are not provided on the shards where the original lies.

Tables for revision or logging queries involving user names and IDs

The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when suppressed". You must instead use the corresponding revision_userindex or logging_userindex views for these types of queries. In those views, rows where the column would otherwise have been nulled are elided, which allows the indexes to be usable.

Example query that will use the appropriate index (in this case on the rev_user_text column, the rev_user column works the same way for user IDs):

SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_user_text="Foo"

Example query that fails to use an index because the table doesn't have them:

SELECT rev_id, rev_timestamp FROM revision WHERE rev_user_text="Foo"

Use the _userindex views for such queries so that they can take advantage of the indexes and run faster.

Metadata database

As described in bugzilla:48626, there is a table with automatically maintained meta information about the replicated databases:

MariaDB [meta_p]> DESCRIBE wiki;
| Field            | Type         | Null | Key | Default | Extra |
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |

Example data:

MariaDB [nlwiki_p]> select * from meta_p.wiki limit 1 \G
*************************** 1. row ***************************
          dbname: aawiki
            lang: aa
            name: Wikipedia
          family: wikipedia
            size: 1
           slice: s3.labsdb
       is_closed: 1
        has_echo: 0
 has_flaggedrevs: 0
has_visualeditor: 1
    has_wikidata: 1

Configuring MySQL Workbench

You can connect to databases on Tool Labs with MySQL Workbench (or similar client applications) via an SSH tunnel. Instructions for connecting via MySQL Workbench are as follows:

  1. Launch MySQL Workbench on your local machine.
  2. Click the plus icon next to "MySQL Connections" in the Workbench window (or choose "Manage Connections..." from the Database menu and click the "new" button).
Example configuration of MySQL Workbench for Wikimedia Tool Labs
  1. Set Connection Method to "Standard TCP/IP over SSH"
  2. Set the following connection parameters:
    • SSH Hostname: <Tool Labs bastion host (the host you normally SSH into)>
    • SSH Username: <your Tool Labs login username>
    • SSH Key File: <your Tool Labs SSH private key file>
    • SSH Password: password/passphrase of your private key (if set) - not your wiki login password.
    • MySQL Hostname: enwiki.labsdb (or whatever server your database lives on)
    • MySQL Server Port: 3306
    • Username: <your Tool Labs MySQL user name (from ~/.my.cnf)>
    • Password: <your Tool Labs MySQL password (from ~/.my.cnf)>
    • Default Schema: <name of your Tool Labs MySQL database, e.g. enwiki_p>
  3. Click "OK"

Replica-db hostnames can be found in /etc/hosts. Remember to add the _p suffix when setting a default schema for a replica database, e.g. enwiki_p.

Note: If you are using SSH keys generated with PuTTYgen (Windows users), you need to convert your private key to the 'OpenSSH' format. Load your private key in PuTTYgen, then click Conversions » Export OpenSSH key. Use this file as SSH Key File above.

Submitting, managing and scheduling jobs on the grid

Every non-trivial task performed in Tool Labs should be dispatched by the grid engine, which ensures that the job is run in a suitable place with sufficient resources. The basic principle of running jobs is fairly straightforward:

  • You submit a job to a work queue from a submission server (e.g., tools-login) or web server
  • The grid engine master finds a suitable execution host to run the job on, and starts it there once resources are available
  • As it runs, your job will send output and errors to files until the job completes or is aborted.

Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going.

What is the grid engine?

The grid engine is a highly flexible system for assigning resources to jobs, including parallel processing. The Tool Labs grid engine is implemented with Open Grid Engine (the open-source fork of Sun Grid Engine). You can find more documentation on the Open Grid Engine website.

Commonly used Grid Engine commands include:

  • qsub: submit jobs to the grid
  • qalter: modify job settings (while the job is waiting or running)
  • qstat: get information about a queued or running job
  • qacct: extracts arbitrary accounting information from the cluster logfile (also after job termination, useful for debugging)
  • qdel: abort or cancel a job

You can find detailed information about these commands in the Grid Engine Manual.

The Open Grid Engine commands are very flexible, but a little complex at first – you might prefer to use the helper scripts instead (jsub, jstart, jstop) described in more detail in the next sections.

Submitting simple one-off jobs using 'jsub'

Jobs with a finite duration can be submitted to the work queue with either Open Grid's 'qsub' command or the 'jsub' helper script, which is simpler to use and described in this section. (For information about qsub, please see the Open Grid Engine Manual.)

To run a finite job on demand (at intervals from cron, for instance, or from a web tool or the command line), simply use the 'jsub' command:

$ jsub [options…] program [args…]

By default, jsub will schedule the job to be run as soon as possible, and print the eventual output to files ('jobname.out' and 'jobname.err') in your home directory. Unless a job name is explicitly specified with jsub options, the job will have the same name as the program, minus extensions (e.g., if you have a program named foobot.py and start it with jsub, the job's name will be foobot).

Once your job has been submitted to the grid, you will receive output similar to the example below, which includes the job id and job name.

Your job 120 ("foobot") has been submitted

Example: The following example uses the jsub command to run a bot script (here named mybot.sh for illustration). The 'qstat' command returns job status information. By default, job output is placed in the 'mybot.out' and 'mybot.err' files in the home directory.

tools.shtest@tools-login:~$ jsub mybot.sh
Your job 105033 ("mybot") has been submitted
tools.shtest@tools-login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
105033 0.25000 mybot      tools.shtest r     05/24/2013 08:52:00 task@tools-exec-02.pmtpa.wmfla     1
tools.shtest@tools-login:~$ qstat
tools.shtest@tools-login:~$ ls
access.log  mybot.err
cgi-bin     mybot.out  public_html
tools.shtest@tools-login:~$ cat mybot.out

jsub options

In addition to a number of customized options, jsub supports many, but not all, qsub options:

jsub options
Option	Behavior

-stderr

Send errors that occur during job submission to stderr rather than the error output file (errors that occur while running the script are always sent to the error file).

-mem value

Request value amount of memory for the job, where value is a number suffixed by 'k', 'm' or 'g'. The default is 256m. For more information, please see Allocating additional memory.

-once

Run the named job only once; fail the job if a job by the same name is already running or queued. For more information, please see Running a job only once.

-continuous

Start a self-restarting job on the continuous queue. Please see the section on continuous jobs and jstart for more information.

-quiet

Suppress output if the job has been submitted successfully (e.g., if set, cron jobs will not send mail on successful submission).

-i, -o, and -e (qsub options)

Select the file used for standard input, output and error of the job, respectively. By default, jsub will append stdout and stderr to the files jobname.out and jobname.err in the tool account's home directory, and the job will have no standard input. If a directory is given for -o or -e, new files jobname.ojobid and jobname.ejobid are created there for each job.

-j y (qsub option)

Send standard output and error together to the output file.

-sync y (qsub option)

Normally, jsub queues up the job and returns immediately. The '-sync y' option waits for the job to complete instead. For more information, please see Synchronizing jobs.

-cwd (qsub option)

Start the script in the same directory you invoked jsub from (for more information, see the qsub docs).

-N jobname (qsub option)

Specify a job name. The default is the name of the program run, without extension. For more information, please see Naming jobs.

Naming jobs

The job name identifies the job and can also be used to control it (e.g., to suspend or stop it). By default, jobs are assigned the name of the program or script, minus its extension. For instance, if you started a program named 'foobot.py' with jsub, the job's name would be 'foobot'.

It's important to note that you can have more than one job, running or queued, bearing the same name. Some of the utilities that accept a job name may not behave as expected in those cases.

You can specify a different name for the job using jsub's -N option:

jsub -N NewName program [args…]

Allocating additional memory

By default, jobs are allowed 256MB of memory; you can request more (or less) with jsub’s -mem option (or qsub's -l h_vmem=memory). Keep in mind that a job that requests more resources may be penalized in its priority and may have to wait longer before being run until sufficient resources are available.

$ jsub -mem 500m program [args…]

For example, loading a PHP script via jsub requires at least 350MB of memory to work properly:

jsub -mem 350m php /data/project/yourproject/public_html/test.php

Synchronizing jobs

By default, jobs are processed asynchronously in the background. If you need to wait until the job has completed (for instance, to do further processing on its output), you can add the -sync y option to the jsub command:

$ jsub -sync y program [args...]

Running a job only once

If you need to make certain that the job isn't running multiple times (such as when you invoke it from a crontab), you can add the -once option. If the job is already running or queued, the grid engine will simply mark the failed attempt in the error file and return immediately.

$ jsub -once program [args...]

Quoted arguments

Jsub (actually qsub) always strips the quotes in the arguments of a job. If the arguments include any special bash characters like spaces, "|" or "&" then the job submission will likely fail, even when the arguments are given quoted to jsub (see bugzilla:48811).

The best way to avoid this issue is to use a wrapper script.

A simple workaround is to use two layers of quotes:

$ jsub program -arg1 "'^(foo|bar)$'"
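As a sketch of the wrapper-script approach (all names here are illustrative), the special characters live inside the script, so jsub never has to pass them as arguments:

```shell
# Create a hypothetical wrapper script; the quoted regex never passes
# through qsub's argument handling because it is inside the script.
cat > run-foobot.sh <<'EOF'
#!/bin/bash
exec echo 'pattern:' '^(foo|bar)$'
EOF
chmod +x run-foobot.sh

# The job would then be submitted simply as: jsub run-foobot.sh
# Running the script directly shows the argument survives intact:
./run-foobot.sh   # pattern: ^(foo|bar)$
```

Here 'echo' stands in for the real program; in practice you would exec your bot with its quoted arguments.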

Submitting continuous jobs (such as bots) with 'jstart'

Continuous jobs, such as bots, have a dedicated queue ('continuous') which is set up slightly differently from the standard queue:

  • Jobs started on the continuous queue are automatically restarted if they, or the node they run on, crash
  • In case of outage or lack of resources, continuous jobs will be stopped and restarted automatically on a working node
  • Only tool accounts can start continuous jobs
  • Continuous jobs are not restarted if they end normally (with the exit status 0)

For convenience, the jstart script (which accepts all the jsub options) facilitates the submission of continuous jobs:

$ jstart [options…] program [args…]

The jstart script will start the program in continuous mode (if it is not already running), and ensure that the program keeps running.

Note that the jstart script is equivalent to:

$ jsub -once -continuous program [args…]

jsub's '-once' option is important for ensuring that the job can be managed reliably with the 'job' and 'jstop' utilities. The '-continuous' option ensures that the job will be restarted automatically until it exits normally with an exit value of zero, indicating completion.

Managing Jobs

Each job submitted to the grid has a unique job id as well as a job name (which will not be unique if you have more than one instance running). The name and id identify the job, and can also be used to retrieve information about its status.

If you don’t know the job id, you can find it with either the ‘job’ command or the ‘qstat’ command. Both of these commands can also be used to return additional status information, as described in the next sections.

Finding a job id and status with the ‘job’ command

If you know that your job has only one instance running (if you used the -once option when starting it, for example) you can use the ‘job’ command to get its job id:

tools.xbot@tools-login:~$ job xbot

Use the job command’s -v (‘verbose’) option to return additional status information:

tools.xbot@tools-login:~$ job -v xbot
Job 'xbot' has been running since 2013-04-01T21:00:00 as id 717898

The verbose response is particularly useful from scripts or web services.

Once you know the job id, you can use the ‘qstat’ command to return additional information about it. See Returning the status of a particular job for more information.

Using ‘qstat’ to return status information

The ‘qstat’ command returns detailed information about the status of queued jobs. If you know the job id of a particular job, you can use qstat’s ‘-j’ option to return information about that job. If you use the ‘qstat’ command without options, it will return the status of all your currently running and pending jobs. More information about running qstat without options and with the -j option is included in the following sections. For more information about qstat in general, please see the Open Grid Manual.

Returning the status of all your queued jobs

To see the status of all of your running and pending jobs (including the job number), use the ‘qstat’ command without options. ‘qstat’ will then return the job id, priority, name, owner, state (e.g., r(unning) or s(uspended)), the date and time the job was submitted or started, and the name of the assigned job queue (e.g., continuous) for each job.

For example:

tools.xbot@tools-login:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
120    0.50000   xbot   tools.xbot         r     04/01/2013 21:00:00 continuous@tools-exec-01.pmtpa     1        

Common job states include:

  • r (running)
  • qw (queued/waiting)
  • d (deleted)
  • E (error)
  • s (suspended)

See the Open Grid Manual for a complete list of states and abbreviations.

Returning the status of a particular job

If you know the job id of a job, you can find out more about it using the 'qstat' command's '-j' option. For example, the following command returns detailed information about job id 990.

tools.toolname@tools-login:~$ qstat -j 990
job_number:                 990
exec_file:                 job_scripts/990
submission_time:            Wed Apr 13 08:32:39 2013
owner:                      tools.toolname
uid:                        40005
group:                      tools.toolname
gid:                        40005
sge_o_home:                 /data/project/toolname/
sge_o_log_name:             tools.toolname
sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /data/project/toolname
sge_o_host:                 tools-login
account:                    sge
stderr_path_list:           NONE:NONE:/data/project/toolname//taskname.err
hard resource_list:         h_vmem=256m
mail_list:                  tools.toolname@tools-login.pmtpa.wmflabs
notify:                     FALSE
job_name:                   epm
stdout_path_list:           NONE:NONE:/data/project/toolname//taskname.out
jobshare:                   0
hard_queue_list:            task
script_file:                /data/project/toolname/
usage    1:                 cpu=00:21:08, mem=158.09600 GBs, io=0.00373, vmem=127.719M, maxvmem=127.723M

Common shell exit codes[1] (as returned, e.g., by qacct) include the following. Note that there are no standard exit codes, aside from 0 meaning success; a non-zero status doesn't necessarily mean failure:

exit_status Meaning Example Comments
0 Success No errors, meaning success
1 Catchall for general errors let "var1 = 1/0" Miscellaneous errors, such as "divide by zero" and other impermissible operations
2 Misuse of shell builtins (according to Bash documentation) empty_function() {} Missing keyword or command
126 Command invoked cannot execute /dev/null Permission problem or command is not an executable
127 "command not found" illegal_command Possible problem with $PATH or a typo
128 Invalid argument to exit exit 3.14159 exit takes only integer args in the range 0 - 255
128+n Fatal error signal "n" kill -9 $PPID of script $? returns 137 (=128+9)
128+2=130 Script terminated by Control-C Ctrl-C Control-C generates SIGINT which is fatal error signal 2
128+9=137 Process terminated by kernel (no further signal handling performed) kill -9 $PPID of script Kernel immediately terminates any process sent this signal, generating SIGKILL which is fatal error signal 9
128+11=139 Segmentation fault (kernel killed process due to segfault) E.g. the program accessed an unassigned memory location, generating SIGSEGV which is fatal error signal 11
255 Exit status out of range exit -1 exit takes only integer args in the range 0 - 255

Consult the signal man pages (e.g., signal(7)) for a more comprehensive list of the values ("n") of the possible fatal error signals (SIG...) issued by the kernel.
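A few of the codes above can be reproduced directly in a plain shell (a sketch; the command names are illustrative):

```shell
# 0: a successful command
sh -c 'exit 0' && echo "status $?"                       # status 0

# 127: command not found (no_such_command_xyz does not exist)
sh -c 'no_such_command_xyz' 2>/dev/null || echo "status $?"   # status 127

# 137 = 128 + 9: the process was killed with SIGKILL
sh -c 'kill -9 $$' 2>/dev/null || echo "status $?"       # status 137
```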

Stopping jobs with ‘qdel’ and ‘jstop’

If you started a job with the 'jstart' command, or if you know there is only one job with the same name, then you can also use the 'jstop' utility command with the job name to stop it:

jstop job_name

You can also use the underlying ‘qdel’ command with a job’s number or name:

qdel job_number/job_name

This will also delete matching jobs that have only been queued, but not started yet. Do note that if you specify a 'job_name', all queued or running jobs with that name are deleted.

If you do not know the job number, you can find it using the ‘qstat’ command.

Restarting jobs

To stop and restart a running job in a single command (e.g. you made a bugfix), use:

qmod -rj job_number

Suspending and unsuspending jobs with ‘qmod’

Suspending a job allows it to be temporarily paused, and then resumed later. To suspend a job use:

qmod -sj job_id

The job will be paused (SIGSTOP). Note that the qstat command will return a state of ‘s’ for suspended jobs. If you do not know the job number, you can find it using the ‘qstat’ command.

To unsuspend the job and let it continue running use:

qmod -usj job_id

Unsuspended jobs should return to the 'r' state in qstat.

Scheduling jobs at regular intervals with cron

To schedule jobs to be run at specific days or time of days, you can use cron to submit the jobs to the grid.

Scheduling a command more often than every five minutes (for example * * * * * command) is highly discouraged, even if the command is "only" jsub. In these cases, you very probably want to use 'jstart' instead. The grid engine ensures that jobs submitted with 'jstart' are automatically restarted if they exit.

Creating a crontab

Crontabs are set (as on any Unix system) using "crontab -e" or "crontab FILE".

Note that the PATH is set differently for interactive shells and cron jobs.
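For example, a crontab entry that submits a job to the grid once a day might look like this (a sketch: the script name is illustrative, and -once and -quiet are the jsub options described above):

```shell
# Illustrative crontab line: at 03:15 UTC, submit daily-report.sh
# (hypothetical) to the grid; -once avoids duplicate jobs and
# -quiet suppresses cron mail on successful submission.
15 3 * * * jsub -once -quiet daily-report.sh
```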

Specifying time zones

The ‘tools’ project, like other hosting environments, uses the time zone UTC. If you need to schedule a job for another time zone, you can specify so in the crontab. For example, to schedule a job for midnight in Germany, you can use the crontab line:

0 22,23 * * * [ "$(TZ=:Europe/Berlin date +\%H)" = "00" ] && jsub ...

The above crontab line instructs the system to check at 22:00 UTC (23:00 CET and 0:00 CEST) and 23:00 UTC (0:00 CET and 1:00 CEST) whether it is midnight in Berlin, and if so, to call jsub. Note that you can't just replace "Berlin" with "Hamburg"; the values for TZ are limited to those found under /usr/share/zoneinfo. If you're unsure what your time zone's offset from UTC is, you can run the check hourly by replacing 22,23 with *.
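The TZ mechanism itself can be sanity-checked from the command line. At the Unix epoch (1970-01-01 00:00 UTC), Berlin was on CET (UTC+1), so with GNU date:

```shell
# Hour at the Unix epoch, in UTC and in Europe/Berlin (GNU date)
date -u -d @0 +%H                 # 00
TZ=:Europe/Berlin date -d @0 +%H  # 01
```

Note that inside a crontab, the % in the date format must be escaped as \% (as in the example above).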


Mail to users

Mail sent to a user (where user is a shell account) will be forwarded to the email address that user has set in their Wikitech preferences, provided it has been verified (the same address used by the 'Email this user' function on wikitech).

Any existing .forward in the user's home will be ignored.

Mail to tools

Mail can also be sent "to a tool", using an address that includes an arbitrary alphanumeric string ("anything"). Mail will be forwarded to the first of:

  • The email(s) listed in the tool's ~/.forward.anything, if present;
  • The email(s) listed in the tool's ~/.forward, if present; or
  • The wikitech email of the tool's individual maintainers.

Additionally, there is an alias that is mostly useful for automated email generated from within Labs.

~/.forward and ~/.forward.anything need to be readable by the user Debian-exim; to achieve that, you probably need to chmod o+r ~/.forward*.
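Putting those pieces together, setting up a forward for a tool account might look like this (the address is a placeholder):

```shell
# Run as the tool account: forward the tool's mail to a maintainer
# (maintainer@example.org is a placeholder address), then make the
# file world-readable so the Debian-exim user can read it.
echo 'maintainer@example.org' > "$HOME/.forward"
chmod o+r "$HOME/.forward"
```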

Processing email programmatically

In addition to mail forwarding, tools can have incoming mail sent to an arbitrary program by setting one of its .forwards (as above) to:

|jmail program

In that case, program will be invoked as a job on the grid and will have the email presented to it as its standard input. If program fails to run, or exits with a non-zero status, the email will bounce with the standard error included in the bounce message.

Please be aware that mail processing on the grid is limited in memory and in runtime (30s CPU time, 60s wall clock) so you should not do heavy processing in your script. If you need more than this, then have the initial script simply queue the email for later processing from another component.
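A minimal handler along these lines (a sketch; the function name and queue file are illustrative) just records the message for later processing instead of doing heavy work inline:

```shell
#!/bin/bash
# Sketch of a lightweight mail handler: read one email on standard
# input, extract the Subject header, and append it to a queue file
# that a separate job can process later.
extract_subject() {
    grep -i -m1 '^Subject:' | cut -d' ' -f2-
}

# Illustrative run with a fake message piped in as stdin:
printf 'From: someone\nSubject: hello tool\n\nbody text\n' \
    | extract_subject >> "$HOME/mail-queue.txt"
tail -n1 "$HOME/mail-queue.txt"   # hello tool
```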

Mail from tools

When sending mail from a job, the usual command line method of piping the message body to /usr/bin/mail may not work correctly because /usr/bin/mail attempts to deliver the message to the local MSA in a background process which will be killed if it is still running when the job exits.

If piping to a subprocess to send mail is needed, the message including headers may be piped to /usr/sbin/exim -odf -i.

# This does not work when submitted as a job
echo "Test message" | /usr/bin/mail -s "Test message subject"

# This does
echo -e "Subject: Test message subject\n\nTest message" | /usr/sbin/exim -odf -i

Developing on Tool Labs


All code in the ‘tools’ project must be open source. Please add a license at the beginning! Even if you have not yet deployed or finished your work, it is non-free software unless you explicitly license it.

You may use any free license of your choice, provided that it is OSI approved.

Heavy processing

If you will be doing heavy processing (e.g., compiles or tool test runs), please use the development environment (tools-dev) instead of the primary login host (tools-login), so as to help maintain the interactive performance of the primary login host.

The tools-dev host is functionally identical to tools-login.

Where to put shared Tool code and files

There are currently three possible approaches to sharing code between tools:

  • Shared code can be stored in git submodules, which allow users to keep a git repository within another git repository. Sharing code in this way retains the maintainability and other source controls advantages of git. For more information about git submodules, please see the git documentation.
  • Access to a tool's code can be delegated to other tools by adding them as service users. The list of service users for a tool can be accessed from the "Manage members" link on Special:NovaServiceGroup. (It may be appropriate to create a new 'tool' to house the shared code.)

Shared files

  • Shared config or other files may be placed in the '/shared' directory, which is readable (and potentially writeable) by all users in the tools project.

Web services

Every tool can have a dedicated web server started on a dedicated queue of the grid; that web server is lighttpd (documentation), which is lightweight enough to be run many times on a single node.

It is also possible to run your own webserver (e.g. to run a Scala-based tool). See #other web servers below.

General information

  • Tools get general error logs in ~/error.log
  • PHP scripts are automatically invoked with FCGI
  • The web server is mostly configurable (including adding other FCGI handlers), with customization handled through ~/.lighttpd.conf
  • Everything runs with the tool's UID, regardless of file ownership.

Starting and stopping the web server

You can start your web server (from the tool account) with the command:

webservice start

Likewise, you can use the webservice command to stop and restart your server, or to request its status.

If you've used the apache setup previously, you may have an access.log owned by root in your tool's home directory. You will need to remove or rename it before the web service can start.

Configuring the web server

As it starts, the web server reads any configuration in ~/.lighttpd.conf, and merges it with the default configuration (which is likely to be adequate for most tools).

Note: The merge sometimes fails if an option is already set in the default configuration. So instead of using   option = value   try   option += value.

Default configuration

This is the default (if you don't specify any other/additional settings in your tool's .lighttpd.conf)

server.modules = (

server.port = $port
server.use-ipv6 = "disable"
server.username = "$prefix.$tool"
server.groupname = "$prefix.$tool"
server.core-files = "disable"
server.document-root = "$home/public_html"
server.errorlog = "$home/error.log"
server.breakagelog = "$home/error.log"
server.follow-symlink = "enable"
server.max-connections = 20
server.max-keep-alive-idle = 60
server.max-worker = 5
server.stat-cache-engine = "fam"
ssl.engine = "disable"

alias.url = ( "/$tool" => "$home/public_html/" )

index-file.names = ( "index.php", "index.html", "index.htm" )
dir-listing.encoding = "utf-8"
server.dir-listing = "disable"
url.access-deny = ( "~", ".inc" )
static-file.exclude-extensions = ( ".php", ".pl", ".fcgi" )

accesslog.use-syslog = "disable"
accesslog.filename = "$home/access.log"

cgi.assign = (
  ".pl" => "/usr/bin/perl",
  ".py" => "/usr/bin/python",
  ".pyc" => "/usr/bin/python",
)

fastcgi.server += ( ".php" =>
        ((
                "bin-path" => "/usr/bin/php-cgi",
                "socket" => "/tmp/php.socket.$tool",
                "max-procs" => 1,
                "bin-environment" => (
                        "PHP_FCGI_CHILDREN" => "4",
                        "PHP_FCGI_MAX_REQUESTS" => "10000"
                ),
                "bin-copy-environment" => (
                        "PATH", "SHELL", "USER"
                ),
                "broken-scriptfilename" => "enable"
        ))
)

(config as of Mar 31, 2014)

Example configurations

FCGI Flask config
fastcgi.server += ( "/gerrit-patch-uploader" =>
        ((
                "socket" => "/tmp/patchuploader-fcgi.sock",
                "bin-path" => "/data/project/gerrit-patch-uploader/src/gerrit-patch-uploader/app.fcgi",
                "check-local" => "disable",
                "max-procs" => 1,
        ))
)

For Flask, the fcgi handler looks like this:

Url rewrite
url.rewrite-once += ( "/id/([0-9]+)" => "/index.php?id=$1",
                      "/link/([a-zA-Z]+)" => "/index.php?link=$1" )

Details: ModRewrite

Header, mimetype, error handler
# Allow Cross-Origin Resource Sharing (CORS) 
setenv.add-response-header  += ( "Access-Control-Allow-Origin" => "",
                                 "Access-Control-Allow-Methods" => "POST, GET, OPTIONS" )

# Set cache-control directive for static files and resources
$HTTP["url"] =~ "\.(jpg|gif|png|css|js|txt|ico)$" {
	setenv.add-response-header += ( "Cache-Control" => "max-age=386400, public" )
}

# Add custom mimetype
mimetype.assign  += ( ".bulk"  => "text/plain" )

# Add custom error-404 handler
server.error-handler-404  += "/error-404.php" 

Details: ModSetEnv  Mimetype-Assign   Error-Handler-404   HTTP access control (CORS)

Directory or file index
# Enable basic directory index
$HTTP["url"] =~ "^/?" {
	dir-listing.activate = "enable"
}

Deny access to hidden files
# Deny access to hidden files
$HTTP["url"] =~ "/\." {
	url.access-deny = ("")
}

Details: ModAccess

Custom index
# Enable index for specific directory 
$HTTP["url"] =~ "^/download($|/)" {
	dir-listing.activate = "enable"
}

# Custom index file or custom directory generator
index-file.names += ("")

Details: ModDirlisting

Request logging

Add the line:

# Enable request logging
debug.log-request-handling = "enable"
Apache-like cgi-bin directory

Add the following stanza:

$HTTP["url"] =~ "^/your_tool/cgi-bin" {
	cgi.assign = ( "" => "" )
}

This does require that cgi-bin be under your public_html rather than alongside it.

Enable Status & Statistics
# modify <toolname> for your tool
# this will enable counters at /<toolname>/server-status (resp. /<toolname>/server-statistics)
server.modules += ("mod_status")
status.status-url = "/<toolname>/server-status"
status.statistics-url = "/<toolname>/server-statistics"

Details: ModStatus

Using cookies

Since all tools in the 'tools' project reside under the same domain, you should prefix the name of any cookie you set with your tool's name. In addition, you should be aware that cookies you set may be read by every other web tool your user visits.

Accordingly, you should avoid storing privacy-related or security information in cookies. A simple workaround is to store session information in a database, and use the cookie as an opaque key to that information. Additionally, you can explicitly set a path in a cookie to limit its applicability to your tool; most clients should obey the Path directive properly.
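For instance, an opaque session key can be generated and scoped to the tool's path like this (a sketch: 'mytool' is a placeholder, and in practice your web framework would emit the header):

```shell
# Generate a random opaque session key; the real session data stays
# in the tool's database, keyed by this value. Only the key goes in
# the cookie, with Path limiting it to the (hypothetical) /mytool/.
key=$(openssl rand -hex 16)
printf 'Set-Cookie: mytool_session=%s; Path=/mytool/\n' "$key"
```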

Web logs

Your tool's web logs are placed in the tool account's ~/access.log in common format. Please note that the web logs are anonymized in accordance with the Foundation’s privacy policy. Each user IP address will appear to be that of the local host, for example. In general, the privacy policy precludes the logging of personally identifiable information; special permission from Foundation legal counsel is required if such information is required.

Error logs can be found in the tool account's ~/error.log; this includes the standard error of invoked scripts.

Other web servers

Warning: Please note that using lighttpd with fcgi is the suggested way to run a tool. However, for some languages, running a direct webserver is the simplest way.

First of all, create a wrapper script (for example 'server.sh'; the name is up to you) with the following contents:

exec portgrabber <tool name> <command to run tool>

and make sure it's executable: chmod +x ./server.sh

portgrabber will make sure the proxy is configured correctly, and will pass the port number to the tool as the last argument. For instance, Python's SimpleHTTPServer takes a port number as argument:

python -m SimpleHTTPServer 8000

would run the server on port 8000. To run it under the web proxy, you would use the following script:

exec portgrabber MyToolName python -m SimpleHTTPServer

Then submit it to the grid, using the 'webgrid-tomcat' queue:

jstart -q webgrid-tomcat ./server.sh

You can check the status with qstat, which will show a running job after 10 seconds or so. You can now reach your tool at its usual web address!

Note that, as with the lighttpd setup, your tool will receive URLs that include your tool prefix, e.g. /MyToolName/index.html instead of /index.html. You may need to adapt your tool configuration to handle this.


Pywikipediabot

The Python Wikipediabot Framework (pywikipedia or pywikibot) is a collection of Python tools that automate work on MediaWiki sites. Please consult mw:Manual:Pywikipediabot/Installation first.

A snapshot of the Pywikipedia ‘core’ branch (formerly ‘rewrite’) is maintained at ‘/shared/pywikipedia/core’. The ‘compat’ (formerly ‘trunk’) branch is maintained at ‘/shared/pywikipedia/trunk,’ but because of the possibility of session cookie leaks, as well as the difficulty of using compat in a centralized way, we recommend that you install ‘compat’ locally if you need to use this.

In general, we recommend using the shared ‘core’ files because the code is updated frequently. If you are a developer and/or would like to control when the code is updated, you may also choose to install 'core' locally in your tool directory.

Note that the shared 'core' code consists only of the source files; each bot operator will need to create their own configuration files (such as 'user-config.py') and set up PYTHONPATH and other environment variables. Please see Using the shared Pywikipedia files for more information.

Using the shared Pywikipedia files (recommended setup)

For most purposes, using the centralized ‘core’ files is recommended as the code is updated frequently. The shared files are available at /shared/pywikipedia/core, and steps for configuring your tool account are provided below. The configuration files themselves are stored in your tool account in the '.pywikibot' directory, or another directory, where they can be used via the -dir option (all of this is described in more detail in the instructions).

If you are a developer and/or would like to control when the code is updated, or if you would like to use the 'compat' branch instead of 'core' (not all the Pywikipedia scripts have been ported to 'core'), please see Installing Pywikipediabot locally for instructions.

To set up your Tools account to use the shared ‘core’ framework:

1. Become your tool-account

maintainer@tools-login:~$ become toolname

2. In your home directory, create (or edit, if it exists already) a ‘.bash_profile’ file to include the following line. The path should be on one line, though it may appear to be on multiple lines depending on your screen width. When you save the .bash_profile file, your settings will be updated for all future shell sessions:

export PYTHONPATH=/shared/pywikipedia/core:/shared/pywikipedia/core/externals/httplib2:/shared/pywikipedia/core/scripts

3. Import the path settings into your current session:

tools.tool@tools-login$ source .bash_profile

4. In your home directory, create a subdirectory named ‘.pywikibot’ (the ‘.’ is important!) for bot-related files:

tools.tool@tools-login$ mkdir .pywikibot

5. Configure Pywikipediabot. To create configuration files, use the following command and then follow the instructions. You may also use an existing configuration file (e.g., ‘’) that works on another system by copying it into your .pywikibot directory:

tools.tool@tools-login$ python /shared/pywikipedia/core/

6. Test out your setup. In general, all jobs should be run on the grid, but it’s fine to test your setup on the command line:

tools.tool@tools-login$ python /shared/pywikipedia/core/scripts/

You should see the following terminal output (or something similar):

Pywikibot [http] branches/rewrite (r11526, 2013/05/12, 18:51:23, OUTDATED) Python 2.7.3 (default, Aug  1 2012, 05:14:39) [GCC 4.6.3] unicode test: ok

Note that you do not run scripts using, but run scripts directly, e.g., python /shared/pywikipedia/core/scripts/

If you need to use multiple files, you can do so by adding -dir:<path where you want your> to every python command. To use the local directory, use -dir:. (colon dot).

For more information about Pywikipediabot, please see the Pywikipediabot documentation. The Pywikipedia mailing list ( and IRC (irc:// channel are good places to go for additional help. Other useful information about using the centralized 'core' files is available here: User:Russell Blau/Using pywikibot on Labs

Setup pywikibot on Labs (locally)

If you want to use the compat branch, we highly recommend installing it locally (it's almost impossible to use the shared files correctly and, if you try, you might leak session cookies to a location where anyone can read them, you might need additional libraries, etc.). For core, you can also install the files locally -- this would allow you to upgrade whenever it suits you, instead of always running the latest version.

Installing core

Similarly to the instructions given in this mail, do:

Clone the 'core' git repository:

$ git clone --recursive pywikibot-core
$ cd pywikibot-core

then you can compress the git repository by running

$ git gc --aggressive --prune
$ cd scripts/i18n/
$ git gc --aggressive --prune
$ cd ../../externals/httplib2/
$ git gc --aggressive --prune

which results in a repo of size ~9MB.

You have two choices for how to proceed and set up core: you can use an additional tool called virtualenv and install it as a module into a virtual environment, or you can run it from source - similar to compat - by using the integrated wrapper. For the second method no installation is needed.

install as module - virtualenv

If you would like to install a local version of the 'core' branch, we recommend that you use virtualenv, which is particularly useful if your code uses a lot of externals (e.g. IRC bots, image handling bots, etc.).

To set up the Pywikibot core branch from cloned repo:

Create a virtualenv. You can call it whatever you'd like (e.g., 'pwb', in this example); shorter names are easier:

$ virtualenv pwb

Activate it

$ source ~/pwb/bin/activate

and then do the following, which basically installs pwb-core as a symlink. This way, if you modify the directory, you don't need to install it again. This will also call python

$ cd pywikibot-core
$ python develop

To use the code from outside the virtual environment (e.g. to submit jobs to the grid engine), use:

$ /data/project/tooluser/pwb/bin/python /data/project/tooluser/path/to/

or
$ $HOME/pwb/bin/python /home/path/to/

Note: If you want to run a script in interactive mode to debug, you'll need to run source ~/pwb/bin/activate first.

run from sources - wrapper

After cd'ing into pywikibot-core, run

$ python

which will ask a series of questions about how you want to configure your local copy and will generate the required config files for you. Alternatively, if you already have config files from a previous version, you can copy those into the pywikibot-core directory.

Some bot scripts require extra packages to be installed -- see the file externals/README for more details.


Installing compat

Follow the instructions given in this mail and do:

Clone the 'compat' git repository:

$ git clone --recursive pywikibot-compat

then you might want to compress the code down to the necessary parts (this is what you definitely wanted to do on the TS, but on Labs it is not needed) by

$ cd pywikibot-compat
$ cd i18n/
$ git gc --aggressive --prune
$ cd ../externals/opencv/
$ git gc --aggressive --prune
$ cd ../pycolorname/
$ git gc --aggressive --prune
$ cd ../spelling/
$ git gc --aggressive --prune

(a first 'git gc --aggressive --prune' in the pywikibot-compat directory is not needed anymore)

This results in a repo of size ~25MB. Now you have to set up pywikibot by running (in fact, running any bot script - e.g. your favourite one - works)

$ python

similar to what is described in the core section above.

You may set up all externals manually if you want - but this is not needed in compat; see mw:Manual:Pywikipediabot/Installation#Dependencies for further info. If you do not install them, you may be asked to install some extra packages depending on which scripts you run.

You will also have to enter the password for your bot eventually.

Now you have finished the configuration of compat and can continue setting up the webspace and jobs to execute.

setup web-space

By default, the directory listing on is disabled. If you want to allow it for all users, log in to your tool account (as already described) and run:

$ cd ~/public_html
$ echo Options +Indexes > .htaccess

If you run a bot with the -log option, you will find the log files within the logs/ directory. If you want to allow users to access it from the web, do

$ cd ~/public_html
$ mkdir logs
$ cd logs
$ ln -s ~/pywikibot-core/logs core

If you want a specific file type to be handled differently by your browser, e.g. .log files like text files, use (see this):

$ echo AddType text/plain .log > .htaccess

and (don't forget to) clear your browser's cache afterwards.

Next you might want to consider your cgi-bin directory:

$ cd ~/cgi-bin

follow the hints given at Nova Resource:Tools/Help#Logs exactly. For example, even though the two commands

$ /usr/bin/python      # valid
$ /usr/bin/env python  # invalid

both work and do the same thing in an ordinary shell, only the first one is valid and works here; the second is invalid! Another point to mention is that PHP scripts go into public_html, not cgi-bin. Python scripts, on the other hand, can be placed in public_html or cgi-bin as you wish. I would recommend using public_html for documents and keeping it listable, whereas cgi-bin should be used for CGI scripts and be protected (not listable).
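To illustrate the shebang rule above, here is a minimal CGI script sketch (the filename and greeting text are made up); note the absolute /usr/bin/python shebang rather than /usr/bin/env python:

```python
#!/usr/bin/python
# Minimal CGI sketch: a CGI response is a header block, a blank line,
# then the body.
import sys

def render():
    return "Content-Type: text/plain\n\nHello from Tool Labs!\n"

if __name__ == "__main__":
    sys.stdout.write(render())
```

Save such a script in ~/cgi-bin and make it executable (chmod 755) so the web server can run it.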

setup job submission

After installing, you can run your bot directly via a shell command, though this is highly discouraged. You should use the grid to run jobs instead.

In order to set up the submission of the jobs you want to execute and use the grid engine, you should first consult Nova Resource:Tools/Help#Submitting, managing and scheduling jobs on the grid; if you are familiar with the Toolserver and its architecture, also consult Migrating from toolserver.

In general, Labs uses SGE and its commands such as qsub et al.; this is explained in this document, which you should use to get an idea of which command and which parameters you want to use.

To run a bot using the grid, you might want to be in the pywikipedia directory (though this is not required), which means you have to write a small wrapper script. The following example script ( is used to run

$ cat
cd /path/to/pywikipedia

To submit a job, set the permissions for the script and then use the 'jsub' command to send the job to the grid:

$ chmod 755
$ jsub -N job_name

Job output will be written to output and error files in your home directory called YOURJOBNAME.out and YOURJOBNAME.err, respectively (e.g., versiontest.out and versiontest.err in this example):

$ cat ~/versiontest.out
Pywikipedia [https] r/pywikibot/compat (r10211, 8fe6bdc, 2013/08/18, 14:00:57, ok)
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
[GCC 4.6.3]
use_api = True
use_api_login = True
unicode test: ok

An infinitely running job (e.g. an IRC bot) like this (a cronie entry from a TS submit host):

06 0 * * * qcronsub -l h_rt=INFINITY -l virtual_free=200M -l arch=lx -N script_wui $HOME/rewrite/ -log

becomes
$ jsub -once -continuous -l h_vmem=256M -N script_wui python $HOME/pywikibot-core/ -log

or shorter

$ jstart -l h_vmem=256M -N script_wui python $HOME/pywikibot-core/ -log

The first expression is good for debugging. Memory values smaller than 256MB seem not to work here, since that is the minimum. If you experience problems with your jobs, such as

Fatal Python error: Couldn't create autoTLSkey mapping

you can try increasing the memory value - which is also needed here, because this script uses a second thread for timing and that thread needs memory too. Therefore, finally use

$ jstart -l h_vmem=512M -N script_wui python $HOME/pywikibot-core/ -log

Now, in order to create a crontab, follow Scheduling jobs at regular intervals with cron and set up your crontab file:

$ crontab -e

and enter

06 0 * * * jstart -l h_vmem=512M -N script_wui python $HOME/pywikibot-core/ -log
additional configuration

Furthermore, additional tools to support you and your bot at work are available:

Tips for working collaboratively

How to use <programming-language-X> to write tools on labs

Do you have experience that might help another user? Please share it (or point to it) here!


Redis

Redis is a key-value store similar to memcache, but with more features. It can be easily used to do publish/subscribe between processes, and also maintain persistent queues. Stored values can be different data structures, such as hash tables, lists, queues, etc. Stored data persists across service restarts. For more information, please see the Wikipedia article on Redis.

A Redis instance that can be used by all tools is available on tools-redis, on the standard port 6379. It has been allocated a maximum of 7G of memory, which should be enough for most usage. You can set limits for how long your data stays in Redis; otherwise it will be evicted when memory limits are exceeded. See the Redis documentation for a list of available commands.

Libraries for interacting with Redis from PHP (phpredis) and Python (redis-py) have been installed on all the web servers and exec nodes. For an example of a bot using Redis, see SuchABot.

For quick & dirty debugging, you can connect directly to the Redis server with nc -C tools-redis 6379 and execute commands (for example "INFO").


Redis has no access control mechanism, so other users can accidentally or intentionally overwrite and access the keys you set. Even if you are not worried about security, it is highly probable that multiple tools will try to use the same key (such as lastupdated, etc.). To prevent this, it is highly recommended that you prefix all your keys with an application-specific, lengthy, randomly generated secret key.

You can very simply generate a good enough prefix by running the following command:

openssl rand -base64 32

PLEASE PREFIX YOUR KEYS! We have also disabled the redis commands that let users 'list' keys.
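The same kind of prefix can also be generated from Python. The sketch below (function names are illustrative) mirrors the openssl command and shows how a tool might namespace its keys before handing them to a client library such as redis-py:

```python
import base64
import os

def make_prefix(nbytes=32):
    # Equivalent of `openssl rand -base64 32`: random bytes, base64-encoded.
    return base64.b64encode(os.urandom(nbytes)).decode("ascii")

PREFIX = make_prefix()  # generate once and keep it with your tool's config

def key(name):
    # Namespace every Redis key under the secret prefix.
    return "%s:%s" % (PREFIX, name)

# With redis-py this would then be used as, e.g.:
#   r = redis.Redis(host="tools-redis", port=6379)
#   r.setex(key("lastupdated"), 3600, "...")  # value expires after an hour
```

Using setex (or a later EXPIRE) also honours the advice above about limiting how long your data stays in Redis.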

A note about memcache

There is no memcached on toollabs. Please use Redis instead.


The 'tools' project, like all labs projects, has access to a directory storing the public Wikimedia datasets (i.e. the dumps generated by Wikimedia). The most recent two dumps can be found in:


This directory is read-only, but you can copy files to your tool's home directory and manipulate them in whatever way you like.

If you need access to older dumps, you must manually download them from the Wikimedia downloads server.

/public/dumps/pagecounts-raw contains some years of the pagecount/projectcount data derived by Erik Zachte from Domas Mituzas' archives.

CatGraph (aka Graphserv/Graphcore)

CatGraph is a custom graph database that provides tool developers fast access to the Wikipedia category structure. For more information, please see the documentation.


If you run into problems, please feel free to come into the #wikimedia-labs IRC (chat) channel using and look for Coren (Marc-Andre Pelletier) or petan (Petr Bena). The labs-l mailing list at is another good place to ask for help, especially if the people in chat are not responding. You can also search for help pages, or look more widely with the custom search at .


What gets backed up?

The basic rule is: there is a lot of redundancy, but no backups of Labs projects beyond the filesystem's time travel feature for short-term disaster recovery. Labs users should make certain that they use source control to preserve their code, and make regular backups of irreplaceable data.

Time travel

Although Labs users are ultimately responsible for backing up files and important data, "time travel" provides snapshots of the file system at fixed intervals and provides a short-term disaster recovery option.

You can access hourly snapshots for the last three hours and daily snapshots for the last three days. The snapshots are beneath /data/project/.snapshot. Its subdirectories are auto-mounted, so this directory may appear empty even though there are snapshots. To see the timestamps at which backups were made, look at the following files:

  • /home/.snaplist
  • /data/project/.snaplist

These files contain a list of the timestamps at which backups were made. To access a snapshot subdirectory directly, append the timestamp to the directory (e. g. /data/project/.snapshot/20130609.2117).

To automount a snapshot, cd to the timestamp directory:

cd /data/project/.snapshot/<timestamp>

The snapshot will unmount itself after some period of inactivity.

Moving a tool from Toolserver to Tool Labs

We know that you are putting your free time into the development of tools to improve Wikimedia projects and that migrating your tools from Toolserver to Tool Labs requires additional work. Unfortunately, at some point in 2014, WMDE will discontinue the Toolserver, so staying is not an option.

The Tool Labs environment is designed to support the development and maintenance of tools and bots, but it differs from the Toolserver environment in ways that will impact migrating users. We are aware that the transition will require work, and--though ideally smooth--may not be entirely so. If you have questions or need assistance, please feel free to come into the #wikimedia-labs IRC (chat) channel using and look for Coren (Marc-Andre Pelletier) or petan (Petr Bena). The labs-l mailing list at is another good place to ask for help, especially if the people in chat are not responding.

If you want to copy files from the Toolserver to Tool Labs, keep in mind that ssh/scp between the two currently works in one direction only. You can ssh from the Toolserver to Tool Labs but not the other way around.

Please see Migration of Toolserver tools for more information and FAQs specific to moving tools and bots from Toolserver to Tool Labs. Also see Magnus Manske's experience when migrating a tool. If you are planning migration or have already accomplished it, please consider documenting the experience to help other users through the process.

Thank you for all your contributions!


Do I explicitly have to specify the license of my tools?

Yes. If you think "this is just a draft, nothing ready" and you do not put a license on your code, it is non-free software, contradicting the idea of Tool Labs. So please add a license from the beginning! You can use any OSI-approved license. Read more about the licenses on the Open Source Initiative's website:

What about file permissions? Who can see my code?

There are projects where users have root, so that all users in the project have full access to the whole project. This setup is not mandatory though: Tools can also use tool user IDs to control file permissions. On the tools project (which will be where toolserver tools migrate), you have full control over access permissions of your code and data. By default, only the tool maintainers have access (all the maintainers of a tool are in the tool's group).

Do stewards have a specific project on WMF Labs?

There is no plan to have distinct projects for different tool makers, but the tools are separated from each other. There is nothing that prevents you from sharing the maintenance of some tools between different stewards (in fact, it is recommended that you do so to ensure that there is always someone able to keep them running when needed).

Can I delete a tool?

No, you can't do this yourself. The reason is that you or other members might accidentally delete precious stuff. You can delete the content of your directories. If you really want a tool / a service group to be deleted, please contact an admin.

If you are planning to try out Tool Labs but don't know yet whether you are going to keep your tests as a later project, don't hesitate to create a tool (a service group) now and to create a new one later where you put the stuff you want to keep.

Can I rename a tool?

No, sorry, this is not possible. You'd have to create a new one and put your code in there.

Can I have a subdomain for my web service?

Sorry, not yet. This is still in discussion at WMF. Currently, your web services are available under<YOURTOOL>.

How do I access the database replicas?

  • In your home directory you will find a file with your credentials for mariadb. You need to specify this file and the server you want to connect to. Some examples:
mysql --defaults-file=~/ -h enwiki.labsdb # <- for English WP
mysql --defaults-file=~/ -h dewiki.labsdb # <- for German language WP
mysql --defaults-file=~/ -h wikidatawiki.labsdb # <- for Wikidata
mysql --defaults-file=~/ -h commonswiki.labsdb # <- for Commons
  • Alternatively you can rename the credentials file to .my.cnf and just run
mysql -h commonswiki.labsdb # <- for Commons
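If you connect from code rather than the mysql client, the credentials file is a standard MySQL options file that can be parsed with Python's configparser; the sample content below is made up, and drivers such as MySQLdb or pymysql (which can often take the options file directly via read_default_file) may be available depending on the host:

```python
import configparser

# Hypothetical contents of the replica credentials file in your home
# directory; the real values are generated for you and must stay secret.
SAMPLE = """\
[client]
user = u1234
password = not-a-real-password
"""

def read_credentials(text):
    # Parse the [client] section of a MySQL options file.
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    return cfg["client"]["user"], cfg["client"]["password"]

user, password = read_credentials(SAMPLE)
```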

Why can't I access user preferences in the replicas?

The db replication gives access to everything that is visible for logged-in users without special privileges. Others' user preferences are considered private information in Wikimedia Labs and are thus redacted from the replicas.

Why am I getting errors about must be installed for pthread_cancel to work?

Encountering this while trying to run jobs on the grid engine means you need to give your job more memory; the (obscure) error message is caused by the system being unable to load your executable and all its shared libraries. As a rule, most scripting languages require around 300-350M of virtual memory to load completely. See Allocating additional memory for more information.

Is there a GUI tool for database work?

Not in Tool Labs, but you can run one locally on your computer (for example MySQL Workbench). Here is how you connect to the database:

>For the login:
>For the database, it depends on the exact one you want to use, of course - for example: enwiki.labs

Why does public_html not work in my home directory?

Users do not and cannot have a public_html folder to themselves. The only web-accessible directories are in /data/project/<toolname>/public_html/*. To have a URL such as<username>/, you must create a tool called <username>, which will create a folder called /data/project/<username>/public_html/. Nobody--except you, the user--will *ever* be given access to your home or its files. Allowing public services to run from a home directory means that their management could not be shared or taken over if they end up abandoned, defeating the purpose.

I get a Permission denied error when running my script. Why's that?

Make sure that you are running your script from your tool account rather than your user account.

How can I detect if I'm running in Labs? And which project (tools or toolsbeta)?

There is a file that contains the project name on every Labs instance: /etc/wmflabs-project. Testing for its presence tells you that you are in WMF Labs, and checking its contents will tell you which project: "tools" for the Tool Labs, or "toolsbeta" for the experimental Tool Labs. If using PHP, check out $_SERVER['INSTANCENAME'] and $_SERVER['INSTANCEPROJECT'], which contain strings describing the current location, such as 'tools-login' and 'tools'.
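In Python, that check can be sketched as follows (the path is the real one from the text; the helper name is illustrative):

```python
import os

def wmflabs_project(path="/etc/wmflabs-project"):
    # Returns the project name ("tools", "toolsbeta", ...) or None when
    # the file is absent, i.e. we are not running inside WMF Labs.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

if wmflabs_project() == "tools":
    pass  # Tool Labs specific behaviour goes here
```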

My connection seems slow. Any advice?

When connecting to Tool Labs from Europe, you might have higher ping times. Try using mosh. To connect, use mosh -a (-a to force predictive echo) instead of ssh.

I want to ssh to bot instances outside of Labs. Any advice?

If you want to ssh to specific bot instances other than, it is helpful to create a new SSH key:

$ ssh-keygen
$ cat ~/.ssh/
ssh-rsa .... user@host

Copy the 'ssh-rsa .... user@host' line to your authorized keys in the labs console.

How can I check my filesystem usage?

If you would like information on your filesystem usage, please come ask Ryan on IRC.

My tool requires a package that is not currently installed in Tool Labs. How can I add it?

You might not be the only one missing that package. Please submit a ticket in bugzilla and ask the admins to install it project-wide. If they have reasons not to do so you can always install software locally / just for yourself.

See also this stub on how to get a sample Python application installed on the tools project as quickly as possible.

My tool needs a more specialized infrastructure than Tool Labs provides. What should I do?

Tool Labs is a simplified environment intended to be a direct Toolserver replacement for most small tools. When you need something more complicated, or when you need to manage specialized infrastructure that Tool Labs can't provide, you can probably do it with your own Labs project (instead of inside the "tools" or "toolsbeta" projects)!

I keep reading about puppet. What is it?

As a tool maintainer you don't have to worry about puppet.

The nutshell: puppet is a system by which you describe the configuration of a machine. When used, it will apply the necessary changes to make the machine you apply it to match that configuration.

In practice, the sysadmin would make any change of configuration intended for the machine in puppet (including what to install, files to edit, etc.) so that it can be reapplied to a blank machine to configure it "just like it was", to make a clone, and so on. In the case of a project where the tool maintainers do the system administration themselves, it might be desirable to actually configure and install the tool itself through puppet so that it is easy to return to a known state.

In the case of the Tool Labs, however, the actual tools would not normally be configured through puppet (it's possible, but not worthwhile): they live on a shared filesystem rather than on the individual machines. What puppet is used for is to maintain the components of the grid, making adding "one more compute node" or "an extra webserver" as simple as creating a new instance and setting puppet accordingly. When we find a tool has a dependency, the Tool Labs sysadmins will add it to puppet so that every host that is part of the grid (current and future) will have it configured accordingly without manual intervention.

What is the labsconsole?

It's the old name of what is now Wikitech (

I'm being prompted for a password when I try to 'become my-tool-account'. What's wrong?

If you see a "password is required" message when you try to become your tool (i.e., sudo yourtoolaccount), it is likely because you were logged in to your Labs account when you created the tool account. Unix group membership is checked only at login, so an existing session will not have access to the new tool group. Log out and then log in to your Labs account again to fix this problem.

Are there any plans for adding monitoring or profiling tools?

Yes, in the very long term and not guaranteed. Want to help? Find out more here: User:Yuvipanda/Icinga for tools