Note: This page is in a draft form as part of planned improvements to Toolforge developer documentation. Some information that was previously available here has been moved to the About Toolforge page. You may also find information you are looking for linked from Portal:Toolforge.


About Toolforge

See About Toolforge to learn more about what Toolforge is.

Tool Accounts

See Tool Accounts to learn what Tool Accounts are, and how to use them to create and maintain tools.

This page will help you understand what a Tool Account is, the first steps to create a Tool Account/tool, basic configurations, and how to add and remove maintainers.

Using Toolforge and managing your files

Toolforge can be accessed in a variety of ways – from its public IP to a GUI client. Please see Help:Access for general information about accessing Cloud VPS projects.

The tools list

The Toolforge tools list page is publicly available and contains a list of all currently-hosted Tool accounts along with their maintainers. Tool accounts that have an associated web page appear as links. Users with access to the 'tools' project can create new tool accounts here, and add or remove maintainers to and from existing tool accounts.

Updating files

After you can ssh successfully, you can transfer files via sftp and scp. Note that the transferred files will be owned by you. You will likely wish to transfer ownership to your tool account. To do this:

0. chgrp toolaccount FILE

1. become your tool account:

yourshellaccountname@tools-login:~$ become toolaccount
tools.toolaccount@tools-login:~$

2. As your tool account, take ownership of the files:

tools.toolaccount@tools-login:~$ take FILE

The take command will change the ownership of the file(s) and directories recursively to the calling user (in this case, the tool account).

Handling permissions

If you're getting permission errors, note that you can also transfer files the other way around: copy the files as your tool account to /data/project/<projectname>.

Another, probably easier, way is to make the tool's directory group-writable. For example, if your shell account's name is alice and your tool name is alicetools, you could do something like this after logging in as a shell user:

become alicetools
chmod -R g+w /data/project/alicetools
logout
cp -rv /home/alice/* /data/project/alicetools/
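For tools that prefer to do this from a script, the following is a minimal Python sketch of what chmod -R g+w does: it walks a directory tree and adds the group-write bit where it is missing. The function name add_group_write is hypothetical, and skipping symbolic links is a design choice of this sketch rather than documented Toolforge behaviour.

```python
import os
import stat


def add_group_write(root):
    """Add the group-write bit to root and everything below it.

    Returns the list of paths whose mode was changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            if os.path.islink(path):
                continue  # do not follow symlinks out of the tree
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if not mode & stat.S_IWGRP:
                os.chmod(path, mode | stat.S_IWGRP)
                changed.append(path)
    return changed
```

Run as the tool account, add_group_write("/data/project/alicetools") would play the role of the chmod line above. Only the owner of a file can change its mode, which is why the step has to happen after become.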

Using git

The best option is to create a Git repository to which project participants commit files. To access the files, become the tool account, check that repository out in your tool's directory, and thereafter run a regular git pull whenever you want to deploy new files.

See #Setting up code review and version control for more details about using source control for your tool.

Putty and WinSCP

Note that instructions for accessing Toolforge with Putty and WinSCP differ from the instructions for using them with other Cloud VPS projects. Please see Help:Access to Toolforge instances with PuTTY and WinSCP for information specific to Toolforge.

Other graphical file managers (e.g., Gnome/KDE)

For information about using a graphical file manager (e.g., Gnome/KDE), please see Accessing instances with a graphical file manager.

Installing MediaWiki core

You want to install MediaWiki core and make your installation visible on the web.

One-time steps per tool

First, you have to do some preparatory steps which you need only once per tool.

become <YOURTOOL>

If you have not installed composer yet:

mkdir ~/bin
curl -sS https://getcomposer.org/installer | php -- --install-dir=$HOME/bin --filename=composer

If your local bin directory is not in your $PATH (use echo $PATH to find out), then create or edit the file ~/.profile and add the lines:

# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
   PATH="$HOME/bin:$PATH"
fi

Finish your session as <YOURTOOL> and start a new one, or:

. ~/.profile

Now you are done with the one-time preparations.
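To check the result without reading $PATH by eye, a small Python sketch such as the following can help. The helper name dir_on_path is hypothetical; it only compares normalised path strings and does not check that the directory exists.

```python
import os


def dir_on_path(directory, path_value=None):
    """Return True if directory is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(
        os.path.normpath(os.path.expanduser(entry)) == wanted
        for entry in entries
    )
```

Calling dir_on_path("~/bin") in a fresh session as the tool account should return True once the ~/.profile change has taken effect.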

For each instance of core

The following steps are needed for each new installation of MediaWiki. We assume that you want to access MediaWiki via the web in a directory named MW — you are free to use another name. If not already done:

become <YOURTOOL>

Then:

cd ~/public_html

If you plan to submit changes:

git clone ssh://<YOURUSERNAME>@gerrit.wikimedia.org:29418/mediawiki/core.git MW

or else, if you only want to use MediaWiki without submitting changes:

git clone https://gerrit.wikimedia.org/r/mediawiki/core.git MW

will do and spares resources. Next, recent versions of MediaWiki have external dependencies, so you need to install those:

cd MW
composer install
git review -s

Run webservice start and then you should be able to access the initial pre-install screen of MediaWiki from your web browser as:

https://tools.wmflabs.org/<YOURTOOL>/MW/

and proceed as usual. See how to create new databases for your MediaWiki installations.


Make the Tool translatable

If your tool is used from the web, and assuming you think it's worth something at all, you want to make it translatable. You can and should use the Intuition framework (PHP only), which allows you to use translatewiki.net and delivers you the localisation.

Don't waste your time, learn from our experience with MediaWiki: read the message documentation tips and other internationalization hints.


Configuring Tools

Tool and bot code should be stored in your tool account, where it can be managed by multiple users and accessed by all execution hosts. Specific information about configuring web services and bots, along with information about licensing, package installation, and shared code storage, is available in the § Developing on Toolforge section.

Note that bots and tools should be run via the grid, which finds a suitable host with sufficient resources to run each. Simple, one-off jobs can be submitted to the grid easily with the jsub command. Continuous jobs, such as bots, can be submitted with jstart.

Setting up code review and version control

Although it's possible to just stick your code in the directory and mess with it manually every time you want to change something, your future self and your future collaborators will thank you if you instead use source control, a.k.a. version control and a code review tool. Wikimedia Cloud VPS makes it pretty easy to use Git for source control and Gerrit for code review, but you also have other options.

Using Diffusion

  • Go to toolsadmin
  • Find your tool
  • Click the create new repository button

Requesting a Gerrit/Git repository for your tool

Toolforge users may request a Gerrit/Git repository for their tools. Access to Git is managed via Wikimedia Cloud VPS and integrated with Gerrit, a code review system.

In order to use the Wikimedia Cloud VPS code review and version control, you must upload your ssh key to Gerrit and then request a repository for your tool.

  1. Log in to https://gerrit.wikimedia.org/ with your Wikimedia developer account username and password.
  2. Add your SSH public key (select “Settings” from the drop-down menu beside your user name in the upper right corner of the screen, and then “SSH Public Keys” from the Settings menu).
  3. Request a Gerrit project for your tool: Gerrit/New repositories

For more information about using Git and Gerrit in general, please see Git/Gerrit.

Setting up a local Git repository

It is fairly simple to set up a local Git repository to keep versioned backups of your code. However, if your tool directory is deleted for some reason, your local repository will be deleted as well. You may wish to request a Gerrit/Git repository to safely store your backups and/or to share your code more easily. Other backup/versioning solutions are also available. See User:Magnus Manske/Migrating from toolserver § GIT for some ideas.

To create a local Git repository:

1. Create an empty Git repository

maintainer@tools-login:~$ git init

2. Add the files you would like to back up. For example:

maintainer@tools-login:~$ git add public_html

3. Commit the added files

git commit -m 'Initial check-in'

For more information about using Git, please see the git documentation.

Enabling simple public HTTP access to local Git repository

If you've set up a local Git repository like the above in your tool directory, you can easily set up public read access to the repository through HTTP. This will allow you to, for instance, clone the Git repository to your own home computer without using an intermediary service such as GitHub.

First create the www/static/ subdirectory in your tool's home directory, if it does not already exist:

mkdir ~/www
mkdir ~/www/static/

Now go to the www/static/ directory, and make a symbolic link to your bare Git repository (the hidden .git subdirectory in the root of your repository):

cd ~/www/static/
ln -s ~/.git yourtool.git

Now change directory into the symbolic link you just created, and run the git update-server-info command to generate some auxiliary info files needed for the HTTP connectivity:

cd yourtool.git
git update-server-info

Enable a few Git hooks for updating said auxiliary info files every time someone commits, rewrites or pushes to the repository:

ln -s post-update.sample hooks/post-commit
ln -s post-update.sample hooks/post-rewrite
ln -s post-update.sample hooks/post-update
chmod a+x hooks/post-update.sample

You're done. You should now be able to clone the repository from any remote machine by running the command:

git clone http://tools-static.wmflabs.org/yourtool/yourtool.git

Using Github or other external service

Before you start, you might want to set up your Git user account.

# Login to your tool account
become mytool
# Your name
git config user.name "Your Name"
# Your e-mail (use the one you set up in Github)
git config user.email "your-mail@example.com"

Then you can clone the remote repository (as you always do):

git clone https://github.com/yourGithubName/yourGithubRepoName.git

You can do updates any way you want, but you might want to use this simple update script to securely update code:

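#!/bin/bash
# Stop the webservice, pull new code into public_html, then optionally restart it.

read -r -p "Stop the service and pull fresh code? (Y/n) " response
if ! [[ $response =~ ^([nN][oO]|[nN])$ ]]
then
	webservice stop
	# Abort if the checkout directory is missing instead of pulling somewhere else
	cd ./public_html || exit 1
	echo -e "\nUpdating the code..."
	git pull
	echo
	read -r -p "OK to start the service? (Y/n) " response
	if ! [[ $response =~ ^([nN][oO]|[nN])$ ]]
	then
		webservice start
	fi
fi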

Save the script above in your tool account's home folder as e.g. "update.sh". Don't forget to give execute permission to yourself and your tool group (i.e. `chmod 770 update.sh`).

There's also a tutorial for setting up the tool to be automatically updated whenever the GitHub repository is pushed to.

Database access

This is a brief summary of the /Database documentation page.


Tool and Tools users are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table). For most practical purposes this is identical to the production databases and sharded into clusters in much the same way.

Database credentials are generated on account creation and placed in a replica.my.cnf file in the home directory of both a Tool and a Tools user account. This file cannot be modified or removed by users.

Symlinking the access file can be practical:

 ln -s $HOME/replica.my.cnf $HOME/.my.cnf


To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.analytics.db.svc.eqiad.wmflabs) and the alias of the database replica (enwiki_p):

mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p

To connect to the Wikidata cluster:

mysql --defaults-file=$HOME/replica.my.cnf -h wikidatawiki.analytics.db.svc.eqiad.wmflabs

To connect to Commons cluster:

mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.analytics.db.svc.eqiad.wmflabs

There is also a shortcut for connecting to the replicas: sql <dbname>[_p]. The _p suffix is optional but implied (i.e. the sql tool will add it if absent).

To connect to the English Wikipedia database replica using the shortcut, simply type:

sql enwiki

To connect to ToolsDB where you can create and write to tables, type:

sql local

This sets the server to "tools.db.svc.eqiad.wmflabs" and the database to "". It's equivalent to typing:

mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs
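The naming convention described above can be sketched in Python as follows. This is not the implementation of the sql tool. The function names are hypothetical, and the host pattern <dbname>.analytics.db.svc.eqiad.wmflabs is an assumption generalised from the enwiki, wikidatawiki, and commonswiki examples on this page.

```python
TOOLSDB_HOST = "tools.db.svc.eqiad.wmflabs"
REPLICA_HOST_SUFFIX = ".analytics.db.svc.eqiad.wmflabs"  # assumed pattern


def replica_target(name):
    """Map a shortcut name such as "enwiki" or "local" to (host, database)."""
    if name == "local":
        return (TOOLSDB_HOST, "")
    # The _p suffix is optional on input and always present in the result.
    wiki = name[:-2] if name.endswith("_p") else name
    return (wiki + REPLICA_HOST_SUFFIX, wiki + "_p")


def mysql_command(name, defaults_file):
    """Build the argument list for the equivalent mysql invocation."""
    host, database = replica_target(name)
    command = ["mysql", "--defaults-file=" + defaults_file, "-h", host]
    if database:
        command.append(database)
    return command
```

For example, replica_target("enwiki") and replica_target("enwiki_p") both give the host and database used in the English Wikipedia command above, and mysql_command("local", ...) omits the database argument just as the ToolsDB command does.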


Connecting from a Servlet in Tomcat

  1. Create a directory "lib" in the directory "public_tomcat".
  2. Copy "mysql-connector-java-bin.jar" to "public_tomcat/lib".
  3. Use code like the following in your servlet. DBUSER, DBPASSWORD, and PROJECT are not defined here; supply your own values:
    import java.sql.Connection;
    import java.sql.Statement;
    import org.apache.tomcat.jdbc.pool.DataSource;
    import org.apache.tomcat.jdbc.pool.PoolProperties;
    
    String DBURL = "jdbc:mysql://tools.db.svc.eqiad.wmflabs:3306/";
    String DBDRIVER = "com.mysql.jdbc.Driver";
    String DATABASE = DBUSER + "__" + PROJECT;
    
    PoolProperties p = new PoolProperties();
    p.setUrl (DBURL + DATABASE);
    p.setDriverClassName(DBDRIVER );
    p.setUsername (DBUSER );
    p.setPassword (DBPASSWORD );
    p.setJdbcInterceptors(
    	"org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;" +
    	"org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer");
    DataSource datasource = new DataSource();
    datasource.setPoolProperties(p);
    Connection connection = datasource.getConnection ();
    Statement statement = connection.createStatement();
    
  4. Compile the servlet:
    javac -classpath javax.servlet.jar:tomcat-jdbc.jar myhttpservlet.java

Code samples for common languages

Copied with edits from mw:Toolserver:Database access#Program access (not all tested, use with caution!)

In most programming languages, it will be sufficient to tell MySQL to use the database credentials found in $HOME/.my.cnf assuming that you have created a symlink from $HOME/.my.cnf to $HOME/replica.my.cnf.

Below are various examples in a few common programming languages.

Bash

-- 2> /dev/null; date; echo '
/* Bash/SQL compatible test structure
 *
 * Run time: ? <SLOW_OK>
 */
SELECT 1
;-- ' | mysql -ch enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p > ~/query_results-enwiki; date;

C

#include <my_global.h>
#include <mysql.h>

...

 char *host = "tools.db.svc.eqiad.wmflabs";
 MYSQL *conn = mysql_init(NULL);

 mysql_options(conn, MYSQL_READ_DEFAULT_GROUP, "client");
 if (mysql_real_connect(conn, host, NULL, NULL, NULL, 0, NULL, 0) == NULL) {
    printf("Error %u: %s\n", mysql_errno(conn), mysql_error(conn));
    ...
 }

Perl

use User::pwent;
use DBI;

my $database = "enwiki_p";
my $host = "enwiki.analytics.db.svc.eqiad.wmflabs";

my $dbh = DBI->connect(
    "DBI:mysql:database=$database;host=$host;"
    . "mysql_read_default_file=" . getpwuid($<)->dir . "/replica.my.cnf",
    undef, undef) or die "Error: $DBI::err, $DBI::errstr";

Python

Using User:Legoktm/toolforge library is probably the easiest way. This wrapper library supports both Python 3 and legacy Python 2 applications and provides convenience functions for connecting to the Wiki Replica databases.

import toolforge
conn = toolforge.connect('enwiki') # You can also use "enwiki_p"
# conn is a pymysql.connection object.
with conn.cursor() as cur:
    cur.execute(query)  # Or something....

We used to recommend oursql as well, but as of 2019-02-20 it seems to be abandoned or at least not actively maintained and failing to compile against MariaDB client libraries.
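If you would rather not depend on a wrapper library, the credentials file can be read directly. The sketch below assumes that replica.my.cnf is an INI-style file with a [client] section containing user and password keys, and that values may be wrapped in quotes, as the PHP, C, and Java examples on this page suggest. The function name read_replica_credentials is hypothetical.

```python
import configparser
import os


def read_replica_credentials(path=None):
    """Return (user, password) from a replica.my.cnf-style file."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), "replica.my.cnf")
    # Disable interpolation so a "%" in the password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path) as handle:
        parser.read_file(handle)
    client = parser["client"]
    # Strip optional surrounding quotes from both values.
    user = client["user"].strip("'\"")
    password = client["password"].strip("'\"")
    return user, password
```

The returned pair can then be passed to whichever MySQL client library your tool uses. Keep the values in memory only and avoid logging them.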

PHP (using PDO)

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/replica.my.cnf");
$db = new PDO("mysql:host=enwiki.analytics.db.svc.eqiad.wmflabs;dbname=enwiki_p", $ts_mycnf['user'], $ts_mycnf['password']);
unset($ts_mycnf, $ts_pw);

$q = $db->prepare('select * from page where page_id = :id');
$q->execute(array(':id' => 843020));
print_r($q->fetchAll());
?>

PHP (using MySQLi)

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/replica.my.cnf");

$mysqli = new mysqli('enwiki.analytics.db.svc.eqiad.wmflabs', $ts_mycnf['user'], $ts_mycnf['password'], 'enwiki_p');

// YOUR REQUEST HERE

?>

Java

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

...

Class.forName("com.mysql.jdbc.Driver").newInstance();
Properties mycnf = new Properties();
mycnf.load(new FileInputStream(System.getProperty("user.home")+"/replica.my.cnf"));
String password = mycnf.getProperty("password");
// Strip a leading and a trailing double quote from the password, if present
password=password.substring((password.startsWith("\""))?1:0, password.length()-((password.endsWith("\""))?1:0));
mycnf.put("password", password);
mycnf.put("useOldUTF8Behavior", "true");
mycnf.put("useUnicode", "true");
mycnf.put("characterEncoding", "UTF-8");
mycnf.put("connectionCollation", "utf8_general_ci");
String url = "jdbc:mysql://enwiki.analytics.db.svc.eqiad.wmflabs:3306/enwiki_p";
Connection conn = DriverManager.getConnection(url, mycnf);


Submitting, managing and scheduling jobs on the grid

This is a brief summary of the /Grid documentation page.


Every non-trivial task performed in Toolforge should be dispatched by the grid engine, which ensures that the job is run in a suitable place with sufficient resources. The basic principle of running jobs is fairly straightforward:

  • You submit a job to a work queue from a submission server (e.g., -login) or web server
  • The grid engine master finds a suitable execution host to run the job on, and starts it there once resources are available
  • As it runs, your job will send output and errors to files until the job completes or is aborted.

Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going.

To schedule jobs to run on specific days or at specific times of day, you can use cron to submit the jobs to the grid.

Scheduling a command more often than every five minutes (e.g. * * * * * command) is highly discouraged, even if the command is "only" jsub. In these cases, you very probably want to use 'jstart' instead. The grid engine ensures that jobs submitted with 'jstart' are automatically restarted if they exit.


Email

Mail to users

Mail sent to user@tools.wmflabs.org (where user is a shell account) will be forwarded to the email address that user has set in their Wikitech preferences, if it has been verified (the same as the 'Email this user' function on wikitech).

Any existing .forward in the user's home will be ignored.

Mail to a Tool

Mail can also be sent "to a tool" with:

toolname.anything@tools.wmflabs.org

Where "anything" is an arbitrary alphanumeric string. Mail will be forwarded to the first of:

  • The email(s) listed in the tool's ~/.forward.anything, if present;
  • The email(s) listed in the tool's ~/.forward, if present; or
  • The wikitech email of the tool's individual maintainers.

Additionally, tools.toolname@tools.wmflabs.org is an alias pointing to toolname.maintainers@tools.wmflabs.org, which is mostly useful for automated email generated from within Cloud VPS.

~/.forward and ~/.forward.anything need to be readable by the user Debian-exim; to achieve that, you probably need to chmod o+r ~/.forward*.
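The lookup order above can be checked locally with a sketch like the following. It mirrors the documented precedence and the o+r requirement only; it is not how the mail server is implemented, and both function names are hypothetical.

```python
import os
import stat


def forward_source(tool_home, suffix):
    """Return the forward file that would apply to toolname.<suffix>, or None.

    None means neither file exists, so mail would go to the Wikitech
    addresses of the tool's maintainers.
    """
    candidates = [
        os.path.join(tool_home, ".forward." + suffix),
        os.path.join(tool_home, ".forward"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def is_world_readable(path):
    """Return True if the file has the o+r permission bit set."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

If forward_source returns a path for which is_world_readable is False, running chmod o+r on that file is the likely fix.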

Mail from Tools

From the Grid

When sending mail from a job, the usual command line method of piping the message body to /usr/bin/mail may not work correctly because /usr/bin/mail attempts to deliver the message to the local MSA in a background process which will be killed if it is still running when the job exits.

If piping to a subprocess to send mail is needed, the message including headers may be piped to /usr/sbin/exim -odf -i.

# This does not work when submitted as a job
echo "Test message" | /usr/bin/mail -s "Test message subject" user@example.com

# This does
echo -e "Subject: Test message subject\n\nTest message" | /usr/sbin/exim -odf -i user@example.com
  • Note: /usr/bin/echo supports -e in case your shell's internal echo command doesn't.
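The same approach from a Python job might look like the sketch below: build the message, headers included, and pipe it to exim with the flags shown above. The function names are hypothetical, and the To header is an addition of this sketch.

```python
import subprocess
from email.message import EmailMessage


def build_message(subject, body, recipient):
    """Build a plain-text message with Subject and To headers."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send_via_exim(msg, recipient, exim="/usr/sbin/exim"):
    """Pipe the full message to exim and wait for delivery to be attempted."""
    subprocess.run(
        [exim, "-odf", "-i", recipient],
        input=msg.as_bytes(),
        check=True,  # raise CalledProcessError if exim exits non-zero
    )
```

A job would call send_via_exim(build_message("Test message subject", "Test message", "user@example.com"), "user@example.com"). Because subprocess.run waits for exim to exit, the job does not finish while delivery is still in progress.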

From within a container

To send mail from within a Kubernetes container, use the mail.tools.wmflabs.org SMTP server.

Containers running on the Toolforge Kubernetes cluster do not install and configure a local mailer service like the exim service that is installed on grid engine nodes. Tools running in Kubernetes should instead send email using an external SMTP server. The mail.tools.wmflabs.org service name should be usable for this. This service name is used as the public MX (mail exchange) host for inbound SMTP messages to the tools.wmflabs.org domain and points to a server that can process both inbound and outbound email for Toolforge.
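A minimal Python sketch of sending through that server is shown below. Port 25 and the absence of authentication are assumptions that this page does not confirm, and the function names are hypothetical. The From address uses the tools.toolname alias described under "Mail to a Tool".

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "mail.tools.wmflabs.org"
SMTP_PORT = 25  # assumption: plain SMTP on the default port


def build_tool_mail(tool, recipient, subject, body):
    """Build a message sent on behalf of a tool."""
    msg = EmailMessage()
    msg["From"] = "tools.%s@tools.wmflabs.org" % tool
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_mail(msg, host=SMTP_HOST, port=SMTP_PORT):
    """Hand the message to the SMTP server for delivery."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Inside a container, send_mail(build_tool_mail("mytool", "user@example.com", "Report", "Done.")) would attempt delivery. Outside Toolforge the host name is unlikely to resolve.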

Web server

This is a brief summary of the /Web documentation page.

Every tool can have a dedicated web server running on either the job grid or Kubernetes. The default 'lighttpd' webservice type runs a lighttpd web server configured to serve static files and PHP scripts from the tool's $HOME/public_html directory.

You can start a tool's web server with the webservice command:

$ become my_cool_tool
$ webservice start

You can also use the webservice command to stop, restart, and check the status of the webserver. Use webservice --help to get a full list of arguments.

Developing on Toolforge

This is a brief summary of the /Developing documentation page.
  • License your source code and document that with a LICENSE or COPYING file in the tool's home directory and header comments in the source code. See Help:Toolforge/Developing § Licensing your source code for more help on why and how to select a license.
  • Use public version control (Gerrit, Diffusion, GitHub, Bitbucket, ...) for your tool's source code and deploy changes to the Toolforge servers by updating a checkout of that public version control. See Help:Toolforge § Setting up code review and version control for additional information.
  • Keep passwords and other credentials (OAuth secrets, etc) separated from the main application code so that they are not exposed publicly in your version control system of choice.
  • Create a page in the Tool: namespace documenting the basics of what your tool does and how to start and stop it.
  • Find co-maintainers for your tools who can help out at least with starting/stopping jobs when needed.
  • Make many small tools that each do one specific task rather than a catch-all tool that does many different tasks.


The full documentation page provides tips and instructions for developing code in the Toolforge, including specific language support.

Redis

Redis is a key-value store similar to memcache, but with more features. It can be easily used to do publish/subscribe between processes, and also maintain persistent queues. Stored values can be different data structures, such as hash tables, lists, queues, etc. Stored data persists across service restarts. For more information, please see the Wikipedia article on Redis.

A Redis instance that can be used by all tools is available on tools-redis, on the standard port 6379. It has been allocated a maximum of 12G of memory, which should be enough for most usage. You can set limits for how long your data stays in Redis; otherwise it will be evicted when memory limits are exceeded. See the Redis documentation for a list of available commands.

Libraries for interacting with Redis from PHP (phpredis) and Python (redis-py) have been installed on all the web servers and exec nodes. For an example of a bot using Redis, see gerrit-to-redis.

For quick & dirty debugging, you can connect directly to the Redis server with nc -C tools-redis 6379 and execute commands (for example "INFO").

Security

Redis has no access control mechanism, so other users can accidentally/intentionally overwrite and access the keys you set. Even if you are not worried about security, it is highly probable that multiple tools will try to use the same key (such as lastupdated, etc). To prevent this, it is highly recommended that you prefix all your keys with an application-specific, lengthy, randomly generated secret key.

You can very simply generate a good enough prefix by running the following command:

openssl rand -base64 32

PLEASE PREFIX YOUR KEYS! We have also disabled the redis commands that let users 'list' keys. This protection however should not be trusted to protect any secret data. Do not store plain text secrets or decryption keys in Redis for your own protection.
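In Python, the prefixing advice could be applied roughly as follows. The helper names are hypothetical. The client argument is expected to be a redis-py client, for example redis.Redis(host="tools-redis", port=6379), and is passed in so that the helpers themselves have no dependency on the library.

```python
import base64
import secrets


def make_prefix(nbytes=32):
    """Generate a random prefix, comparable to `openssl rand -base64 32`."""
    return base64.b64encode(secrets.token_bytes(nbytes)).decode("ascii")


def prefixed(prefix, name):
    """Return the key name to use in the shared Redis instance."""
    return prefix + ":" + name


def store_with_expiry(client, prefix, name, value, ttl_seconds):
    """Store a value under a prefixed key that expires after ttl_seconds."""
    client.set(prefixed(prefix, name), value, ex=ttl_seconds)
```

Generate the prefix once and keep it in a file readable only by your tool, rather than calling make_prefix on every run; otherwise each run would lose track of the keys written by the previous one.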

Can I use memcache?

There is no memcached on Toolforge. Please use Redis instead.

Elasticsearch

This is a brief summary of the /Elasticsearch documentation page.


Elasticsearch is a full text search system built on Apache Lucene. It can be used to index and search data stored as JSON documents. It is the technology used to power Wikimedia's CirrusSearch system.

An Elasticsearch cluster that can be used by all tools is available on tools-elastic-0[123], on the non-standard port 80. This Elasticsearch cluster is a shared resource and all documents indexed in it can be read by anonymous users from within Toolforge. Write access, which is needed to create new indexes and to store or update documents, requires a username and password.

See full documentation at /Elasticsearch for more information.

Dumps

The 'tools' project has access to a directory storing the public Wikimedia datasets (i.e. the dumps generated by Wikimedia). The most recent two dumps can be found in:

/public/dumps/public

This directory is read-only, but you can copy files to your tool's home directory and manipulate them in whatever way you like.
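Copying a dump into your tool's home directory can be scripted along these lines. The layout below /public/dumps/public is not described on this page, so the relative path is left as a parameter, and the function name copy_dump_file is hypothetical.

```python
import os
import shutil

DUMPS_ROOT = "/public/dumps/public"


def copy_dump_file(relative_path, destination_dir, dumps_root=DUMPS_ROOT):
    """Copy one file from the read-only dumps tree into destination_dir.

    Only the file contents are copied, not the permission bits, so the
    copy is created with your normal default permissions.
    """
    source = os.path.join(dumps_root, relative_path)
    os.makedirs(destination_dir, exist_ok=True)
    destination = os.path.join(destination_dir, os.path.basename(source))
    shutil.copyfile(source, destination)
    return destination
```

Dump files can be large, so check the free space available to your tool before copying, and delete copies you no longer need.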

If you need access to older dumps, you must manually download them from the Wikimedia downloads server.

/public/dumps/pagecounts-raw contains some years of the pagecount/projectcount data derived by Erik Zachte from Domas Mituzas' archives.

CatGraph (aka Graphserv/Graphcore)

CatGraph is a custom graph database that provides tool developers fast access to the Wikipedia category structure. For more information, please see the documentation.

Celery

It is possible to run a Celery worker in a Kubernetes container as a continuous job (for instance to execute long-running tasks triggered by a web frontend). The Redis service can be used as a broker between the worker and the web frontend. Make sure you use your own queue name so that your tasks get sent to the right workers.

Backups

What gets backed up?

The basic rule is: there is a lot of redundancy, but no user-accessible backups. Toolforge users should make certain that they use source control to preserve their code, and make regular backups of irreplaceable data. With luck, some files may be recoverable by Cloud Services administrators in a manual process. But this requires human intervention and will likely not rescue the file that was created five minutes ago and deleted two minutes ago. If necessary, ask on IRC or file a Phabricator task.

Troubleshooting

See Troubleshooting Toolforge for information about common issues and errors and to learn more about how to report problems when you encounter them.

Communication and support

We communicate and provide support through several primary channels. Please reach out with questions and to join the conversation.

Communicate with us

Connect                                            Best for
Phabricator Workboard  #Cloud-Services             Task tracking and bug reporting
IRC Channel            #wikimedia-cloud (connect)  General discussion and support
Mailing List           cloud@                      Information about ongoing initiatives, general discussion and support
Announcement emails    cloud-announce@             Information about critical changes (all messages mirrored to cloud@)
News wiki page         News                        Information about major near-term plans
Blog                   Clouds & Unicorns           Learning more details about some of our work