
Help:Toolforge/Database

From Wikitech
This page can be improved by breaking up its content into other docs. See phab:T232404. Contributions welcome!

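Tools and Toolforge users have access to two sets of databases: the wiki replicas, which are copies of the production wiki databases, and ToolsDB, where you can create and write to your own tables.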

On the wiki replicas, private user data has been redacted (some rows are elided and/or some columns are made NULL, depending on the table). For most practical purposes the replicas are identical to the production databases and are sharded into clusters in much the same way.

Database credentials are generated on account creation and placed in a file called replica.my.cnf in the home directory of both a Tool and a Tools user account. This file cannot be modified or removed by users.

Symlinking the access file can be practical:

$ ln -s $HOME/replica.my.cnf $HOME/.my.cnf


Connecting to the database replicas

You can connect to the database replicas (and/or the cluster where a database replica is hosted) by specifying your access credentials and the alias of the cluster and replicated database. For example:

To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.analytics.db.svc.wikimedia.cloud) and the alias of the database replica (enwiki_p):

$ mariadb --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.wikimedia.cloud enwiki_p

To connect to the Wikidata cluster:

$ mariadb --defaults-file=$HOME/replica.my.cnf -h wikidatawiki.analytics.db.svc.wikimedia.cloud

To connect to the Commons cluster:

$ mariadb --defaults-file=$HOME/replica.my.cnf -h commonswiki.analytics.db.svc.wikimedia.cloud

There is also a shortcut for connecting to the replicas: sql <dbname>[_p]. The _p suffix is optional but implicit (i.e. the sql tool will add it if absent).

To connect to the English Wikipedia database replica using the shortcut, simply type:

$ sql enwiki

To connect to ToolsDB where you can create and write to tables, type:

$ sql tools

This sets the server to "tools.db.svc.wikimedia.cloud" and the database to "". It is equivalent to typing:

$ mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud


Naming conventions

As a convenience, each MediaWiki project database (enwiki, bgwiki, etc.) has an alias to the cluster it is hosted on. The alias has the form:

${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud

where ${PROJECT} is the internal database name of a hosted Wikimedia project.

analytics vs web

The choice of "analytics" or "web" is up to you. The analytics service name connects to Wiki Replica servers where SQL queries will be allowed to run for a longer duration (currently 3 hours instead of 5 minutes),[1][2] but at the cost of all queries being potentially slower. Use of the web service name should be reserved for webservices which are running queries that display to users.

Language codes and project families

Wikipedia project database names generally follow the format ${LANGUAGE_CODE}${PROJECT_FAMILY}. ${LANGUAGE_CODE} is the ISO 639 two-letter code for the primary content language (e.g. en for English, es for Spanish, bg for Bulgarian, ...). ${PROJECT_FAMILY} is an internal label for the wiki's project family (e.g. wiki for Wikipedia, wiktionary for Wiktionary, ...). Some wikis such as Meta-Wiki have database names that do not follow this pattern (metawiki). The full mapping of wikis to database names is available via the db-names Toolforge tool.

The replica database names themselves consist of the Wikimedia project name, suffixed with _p (an underscore, and a p), for example:

enwiki_p for the English Wikipedia replica
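The naming rules above are mechanical enough to capture in a few lines of code. The following is a minimal sketch in Python. The helper names (wiki_dbname, replica_host, replica_db) are hypothetical and invented for illustration; they are not part of any Toolforge library.

```python
SERVICES = ("analytics", "web")


def wiki_dbname(language_code, project_family):
    """Build an internal database name, e.g. ("en", "wiki") -> "enwiki".

    Only valid for wikis that follow the usual pattern; exceptions such as
    metawiki have to be looked up instead.
    """
    return f"{language_code}{project_family}"


def replica_host(dbname, service="analytics"):
    """Return the cluster alias for a project database name."""
    if service not in SERVICES:
        raise ValueError(f"service must be one of {SERVICES}")
    return f"{dbname}.{service}.db.svc.wikimedia.cloud"


def replica_db(dbname):
    """Append the _p suffix unless it is already present."""
    return dbname if dbname.endswith("_p") else f"{dbname}_p"


print(replica_host(wiki_dbname("en", "wiki")))  # enwiki.analytics.db.svc.wikimedia.cloud
print(replica_db("enwiki"))  # enwiki_p
```

Rejecting unknown service names early is a design choice of this sketch; it turns a typo into an immediate error instead of a DNS lookup failure at connection time.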

Shards

In addition, each cluster can be accessed by the name of its Wikimedia production shard, which follows the format s${SHARD_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud (for example, s1.analytics.db.svc.wikimedia.cloud hosts the enwiki_p database). The shard that hosts a particular database can change over time. You should only use the shard name for opening a database connection if your application requires it for specific performance reasons, such as heavily crosswiki tools which would otherwise open hundreds of database connections.
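For the crosswiki case, one possible approach is to group the wikis by shard so that a tool opens one connection per shard instead of one per wiki. The sketch below assumes you have already fetched dbname and slice pairs from the meta_p.wiki table (described under "Metadata database"), where slice values look like "s3.labsdb". The function names are hypothetical, and "xxwiki" is a made-up entry used only to show the grouping.

```python
from collections import defaultdict


def group_by_shard(slices):
    """Group database names by shard.

    `slices` maps dbname -> slice value as stored in meta_p.wiki,
    for example {"aawiki": "s3.labsdb"}.
    """
    groups = defaultdict(list)
    for dbname, slice_name in slices.items():
        shard = slice_name.split(".", 1)[0]  # "s3.labsdb" -> "s3"
        groups[shard].append(dbname)
    return {shard: sorted(names) for shard, names in groups.items()}


def shard_host(shard, service="analytics"):
    """Return the service name for a shard, e.g. "s1" -> s1.analytics..."""
    return f"{shard}.{service}.db.svc.wikimedia.cloud"


# "xxwiki" is a made-up name for illustration only.
example = {"enwiki": "s1.labsdb", "aawiki": "s3.labsdb", "xxwiki": "s3.labsdb"}
for shard, dbnames in group_by_shard(example).items():
    print(shard_host(shard), dbnames)
```

Because shard assignments can change, the grouping should be recomputed from meta_p.wiki at run time rather than hard-coded.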

Old names

You may find outdated documentation that uses *.labsdb aliases (for example enwiki.labsdb) to refer to the Wiki Replica databases. These service names are deprecated and have not had new wikis added since January 2018. Please update the docs or code that you find these references in to use the ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud naming convention.
You may find outdated documentation that uses ${project}.{analytics,web}.db.svc.eqiad.wmflabs aliases (for example enwiki.web.db.svc.eqiad.wmflabs) to refer to the Wiki Replica databases. These service names are deprecated. Please update the docs or code that you find these references in to use the ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud naming convention.

Connection handling policy

Usage of connection pools (maintaining open connections without them being in use), persistent connections, or any kind of connection pattern that maintains several connections open even if they are unused is not permitted on shared MariaDB instances (Wiki Replicas and ToolsDB).

The memory and processing power available to the database servers is a finite resource. Each open connection to a database, even if inactive, consumes some of these resources. Given the number of potential users for the Wiki Replicas and ToolsDB, if even a relatively small percentage of users held open idle connections, the server would quickly run out of resources to allow new connections. Please close your connections as soon as you stop using them. Note that connecting interactively and being idle for a few minutes is not an issue—opening dozens of connections and maintaining them automatically open is.

Idle connections can and will be killed by database and system administrators when discovered. If you (for example, by connector configuration or application policy) then reopen those connections automatically and keep them idle, you will be warned to stop.
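One way to follow this policy in code is to open a connection only for the duration of a query and close it immediately afterwards. The following Python sketch shows the pattern with a hypothetical helper, run_query, that accepts any DB-API connection factory. The standard-library sqlite3 module stands in for the real database here so that the example runs anywhere; on Toolforge you would pass a callable that wraps your MariaDB client's connect function instead.

```python
import sqlite3
from contextlib import closing


def run_query(connect, sql, params=None):
    """Open a connection, run one query, and close the connection again.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            if params is None:
                cur.execute(sql)
            else:
                cur.execute(sql, params)
            return cur.fetchall()


# sqlite3 is only a stand-in so the sketch is runnable without a database server.
rows = run_query(lambda: sqlite3.connect(":memory:"), "SELECT 1 + 1")
print(rows)  # [(2,)]
```

contextlib.closing guarantees that close() is called even when the query raises an exception, so no idle connection is left behind.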

Connecting to the wiki replicas from other Cloud VPS projects

The *.{analytics,web}.db.svc.wikimedia.cloud servers should be directly accessible from other Cloud VPS projects as well as Toolforge (these are provided in DNS), but there is no automatic creation of database credential files. The easiest way to get user credentials for use in another project is to create a Toolforge tool account and copy its credentials to your Cloud VPS instance.

Connecting to the database replicas from your own computer

Since at the moment Wiki Replicas are not public (phabricator:T318191), you can access the database replicas from your own computer by setting up an SSH tunnel. If you use MySQL Workbench, you can find a detailed description for that application below.

Tunneling is a built-in capability of ssh. It allows creating a listening TCP port on your local computer that will transparently forward all connections to a given host and port on the remote side of the ssh connection. The destination host and port do not need to be the host that you are connecting to with your ssh session, but they do need to be reachable from the remote host.

In the general case, you need to add a port forwarding in your ssh tool. Windows 10 has OpenSSH included, so the ssh command can be used. On older versions of Windows, you can use PuTTY by adding the forwarding settings under Connection → SSH → Tunnels (see the PuTTY Tunnels Configuration screenshot).

PuTTY Tunnels Configuration

On Linux or Windows 10, you can add the option -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT to your ssh call, e.g.:

$ ssh -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306 yourusername@login.toolforge.org

This will set up a tunnel so that connections to port 3306 on your own computer will be relayed to the enwiki.analytics.db.svc.wikimedia.cloud database replica's MariaDB server on port 3306. This tunnel will continue to work as long as the SSH session is open.

The mariadb command line to connect using the tunnel from the example above would look something like:

$ mariadb --user=$USER_FROM_REPLICA.MY.CNF --host=127.0.0.1 --port=3306 --password enwiki_p

The user and password values needed can be found in the $HOME/replica.my.cnf credentials file for your Toolforge user account or a tool that you have access to.

Note that you need to explicitly use the 127.0.0.1 IP address; using localhost instead will give an error, as the client will try to connect over a Unix socket, which will not work.
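For scripts that connect through such a tunnel, the connection settings can be assembled from a local copy of replica.my.cnf. The Python sketch below uses a hypothetical helper, tunnel_connection_args. It assumes the file has a [client] section with user and password keys, and that the values may be wrapped in quotes; both are assumptions about the file layout, so check your own copy.

```python
import configparser
import pathlib


def tunnel_connection_args(cnf_path, database, local_port=3306):
    """Build connection settings for a database reached via an SSH tunnel."""
    # interpolation=None keeps "%" characters in passwords from being
    # treated as configparser interpolation markers.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(pathlib.Path(cnf_path).read_text())
    client = config["client"]
    return {
        "host": "127.0.0.1",  # not "localhost": that would select a Unix socket
        "port": local_port,
        "user": client["user"].strip("'\""),
        "password": client["password"].strip("'\""),
        "database": database,
    }


# Example (requires a local copy of the credentials file):
# args = tunnel_connection_args("replica.my.cnf", "enwiki_p")
```

The resulting dictionary can be passed as keyword arguments to a client library, for example pymysql.connect(**args).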

SSH tunneling for local testing which makes use of Wiki Replica databases

  1. Set up SSH tunnels: ssh -N yourusername@dev.toolforge.org -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306
    • -N prevents ssh from opening an interactive shell. This connection will only be useful for port forwarding.
    • The first port is the listening port on your machine and the second one is on the remote server. 3306 is the default port for MySQL.
    • For multiple database connections, add additional -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT sections to the same command or open additional ssh connections.
    • If you need to connect to more than one Wiki Replica database server, each database will need a different listening port on your machine (e.g. 3307, 3308, 3309, ...). Change the associated php/python connect command to send requests to that port instead of the default 3306.
  2. (optional) Edit your /etc/hosts file to add something like 127.0.0.1 enwiki.analytics.db.svc.wikimedia.cloud for each of the databases you're connecting to.
  3. You might need to copy over the replica.my.cnf file to your local machine for this to work.
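When several replica hosts are needed at once, keeping track of which local port maps to which host becomes error-prone. The steps above can be sketched as a small Python helper that builds the ssh command line and the port mapping together. The function name tunnel_command and its parameters are hypothetical; the bastion default is the login host used earlier on this page.

```python
import shlex


def tunnel_command(username, remote_hosts, first_local_port=3306,
                   bastion="login.toolforge.org", remote_port=3306):
    """Return (argv, ports) for one ssh call forwarding every host.

    Each remote host gets its own local listening port, counting up
    from first_local_port.
    """
    argv = ["ssh", "-N", f"{username}@{bastion}"]
    ports = {}
    for offset, host in enumerate(remote_hosts):
        local_port = first_local_port + offset
        ports[host] = local_port
        argv += ["-L", f"{local_port}:{host}:{remote_port}"]
    return argv, ports


argv, ports = tunnel_command("yourusername", [
    "enwiki.analytics.db.svc.wikimedia.cloud",
    "commonswiki.analytics.db.svc.wikimedia.cloud",
])
print(shlex.join(argv))  # the command to paste into a terminal
print(ports)             # which local port reaches which replica host
```

The returned ports dictionary is what your application would consult to pick the local port for each database connection.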

TLS connection failures

Some client libraries may attempt to enable TLS encryption when connecting to the Wiki Replica or ToolsDB databases. Depending on the backing server's configuration, this may either fail silently because TLS is not supported at all, or it may fail with authentication or decryption errors because TLS is partially enabled. In this second case, the problem is caused by MariaDB servers which do support TLS encryption but are using self-signed certificates which are not available to the client and do not match the service names used for connections from Cloud Services hosts.

The "fix" for these failures is to configure your client to avoid TLS encryption. How to do this will vary based on the client libraries in use, but should be something that you can find an answer for by searching the Internet/Stack Overflow/library documentation.

Databases

Replica database schema (tables and indexes)

The database replicas for the various Wikimedia projects follow the standard MediaWiki database schema described on mediawiki.org and in the MediaWiki git repository.

Many of the indexes on these tables are actually compound indexes designed to optimize the runtime performance of the MediaWiki software rather than to be convenient for ad hoc queries. For example, a naive query by page_title such as SELECT * FROM page WHERE page_title = 'NOFX'; will be slow because the index which includes page_title is a compound index with page_namespace. Adding page_namespace to the WHERE clause will improve the query speed dramatically: SELECT * FROM page WHERE page_namespace = 0 AND page_title = 'NOFX';

Stability of the mediawiki database schema

sql/mysql/tables-generated.sql shows the HEAD of the MediaWiki schema. Extra tables may be available because of additional extensions set up in production. Some tables may also have been redacted or filtered because they contain private data such as user passwords or private IP addresses. Beyond that, while we try to keep production synchronized with development HEAD, changes to the database structure may be applied ahead of their publication or, more commonly, lag behind it. The reason is that schema changes are applied to the production databases continuously, and because of the amount of data they can take anywhere from a few hours to a few months (for the more complex cases) to be finalized.

Core tables such as revision, page, user, and recentchanges rarely change, but cloud maintainers cannot guarantee that they never will, as they have to follow the production changes. While we are happy for people to set up scripts and tools on top of the database copies (wikireplicas), expect the schema to change every now and then. If you cannot make small tweaks from time to time to adapt to the latest schema changes, using the API instead of the database internals is suggested, as API changes have more guarantees of stability and a proper lifecycle and deprecation policy. That is not true for MediaWiki database internals, although compatibility views can sometimes be set up so that only minimal changes are required.

Tables for revision or logging queries involving user names and IDs

The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when supressed" (see also phab:T68786 for some more detail). One has to instead use the corresponding revision_userindex or logging_userindex for these types of queries. On those views, rows where the column would have otherwise been nulled are elided; this allows the indexes to be usable.

Example query that will use the appropriate index (in this case on the rev_actor column)

SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_actor=1234;

Example query that fails to use an index because the table doesn't have them:

SELECT rev_id, rev_timestamp FROM revision WHERE rev_actor=1234;

Prefer the _userindex views for these queries so that they can use an index and run faster.

Redacted tables

The majority of the user_properties table has been deemed sensitive and removed from the Wiki Replica databases. Only the disableemail, fancysig, gender, and nickname properties are available.

Unavailable tables

Some of the standard MediaWiki tables that are in use on Wikimedia wikis are not available. The following tables are missing or empty:

Metadata database

There is a table with automatically maintained meta information about the replicated databases: meta_p.wiki. See toolforge:db-names for a web-based list.

The database host containing the meta_p database is: meta.analytics.db.svc.wikimedia.cloud.

MariaDB [meta_p]> DESCRIBE wiki;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |
| is_sensitive     | decimal(1,0) | NO   |     | 0       |       |
+------------------+--------------+------+-----+---------+-------+

Example data:

MariaDB [meta_p]> select * from wiki limit 1 \G
*************************** 1. row ***************************
          dbname: aawiki
            lang: aa
            name: Wikipedia
          family: wikipedia
             url: https://aa.wikipedia.org
            size: 1
           slice: s3.labsdb
       is_closed: 1
        has_echo: 1
 has_flaggedrevs: 0
has_visualeditor: 1
    has_wikidata: 1
    is_sensitive: 0

Identifying lag

Extended replication lag (on the order of multiple days) can be an expected and unavoidable side effect of some types of production database maintenance (e.g. schema changes). When this cause is confirmed, expect wiki replicas to catch up automatically once the maintenance finishes.

The Wiki Replicas can "lag" behind the production databases because of a network or Wiki Replica database infrastructure problem, a production problem, maintenance (scheduled or unscheduled), excessive load, or production or user queries blocking the replication process.

To identify lag, see the replag tool or run the following query yourself on the database host you are connected to:

(u3518@enwiki.analytics.db.svc.wikimedia.cloud) [heartbeat_p]> SELECT * FROM heartbeat;
+-------+----------------------------+--------+
| shard | last_updated               | lag    |
+-------+----------------------------+--------+
| s1    | 2018-01-09T22:47:05.001180 | 0.0000 |
| s2    | 2018-01-09T22:47:05.001190 | 0.0000 |
| s3    | 2018-01-09T22:47:05.001290 | 0.0000 |
| s4    | 2018-01-09T22:47:05.000570 | 0.0000 |
| s5    | 2018-01-09T22:47:05.000670 | 0.0000 |
| s6    | 2018-01-09T22:47:05.000760 | 0.0000 |
| s7    | 2018-01-09T22:47:05.000690 | 0.0000 |
| s8    | 2018-01-09T22:47:05.000600 | 0.0000 |
+-------+----------------------------+--------+
8 rows in set (0.00 sec)

This table is based on the pt-heartbeat tool, not on SHOW MASTER STATUS. It produces very accurate results even if replication is broken, and it compares directly against the original master rather than the replica's direct master.

  • shard: s1-8. Each of the production masters. The wiki distribution can be seen at: https://noc.wikimedia.org/db.php
  • last_updated: Every 1 second, a row in the master is written with the date local to the master. Here you have its value, once replicated. As it is updated every 1 second, it has a measuring error of [0, 1+] seconds.
  • lag: The difference between the current date and the last_updated column (timestampdiff(MICROSECOND,`heartbeat`.`heartbeat`.`ts`,utc_timestamp())/1000000.0). Again note that updates to this table only happen every second (it can vary on production), so most decimals are meaningless.
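The same arithmetic can be reproduced client-side from a last_updated value, for example when logging how stale a result set is. This is a minimal Python sketch with a hypothetical function name, lag_seconds. It assumes the timestamp is in UTC, which is what the utc_timestamp() call in the formula above suggests.

```python
from datetime import datetime, timezone


def lag_seconds(last_updated, now=None):
    """Seconds between a heartbeat timestamp and `now` (default: current UTC time).

    `last_updated` is a string as shown in the heartbeat table,
    e.g. "2018-01-09T22:47:05.001180", assumed to be UTC.
    """
    heartbeat = datetime.fromisoformat(last_updated).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - heartbeat).total_seconds()


# A fixed "now" makes the example reproducible.
now = datetime(2018, 1, 9, 22, 47, 7, tzinfo=timezone.utc)
print(lag_seconds("2018-01-09T22:47:05.001180", now))  # 1.99882
```

As with the lag column itself, anything below one second of precision is not meaningful because the heartbeat row is only written about once per second.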

To directly query the replication lag for a particular wiki, use requests like:

MariaDB [fawiki_p]> SELECT lag FROM heartbeat_p.heartbeat JOIN meta_p.wiki ON shard = SUBSTRING_INDEX(slice, ".", 1) WHERE dbname = 'fawiki';

+------+
| lag  |
+------+
|    0 |
+------+
1 row in set (0.09 sec)

Please note that some seconds or a few minutes of lag is considered normal, due to the filtering process and the hops done before reaching the public hosts.

User databases

User-created databases can be created on a shared server: tools.db.svc.wikimedia.cloud. Database names must start with the name of the credential user followed by two underscores and then the name of the database: <credentialUser>__<DBName> (e.g. "s51234__mydb").

The credential user is not your user name. It can be found in your $HOME/replica.my.cnf file. The name of the credential user looks something like 'u1234' for a user and 's51234' for a tool account. You can also find the name of the credential user using a live database connection:

SELECT SUBSTRING_INDEX(CURRENT_USER(), '@', 1);

If your tool needs more than 25GB of storage, open connection limits that ToolsDB cannot support, or a Postgres runtime, Trove databases may be a better fit. ToolsDB is a shared resource that must impose connection and size limitations in exchange for zero administration requirements. Using Trove removes those limitations, but requires a small amount of administration. Tools can request the ability to create Trove databases via a Toolforge quota-request task.

Privileges on the database

Users have all privileges and have access to all grant options on their databases. Database names ending with _p are granted read access for everyone. Please create a ticket if you need more fine-grained permissions, like sharing a database only between 2 users, or other special permissions.

Public databases in ToolsDB (the ones with a name ending in _p) can also be accessed from Quarry and Superset.
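The naming rule for user databases and the _p convention for public ones can be combined in a small helper. The Python sketch below is illustrative only: the function names are hypothetical, and the restriction of the database name to letters, digits, and underscores is a conservative choice made for this sketch, not a documented ToolsDB rule.

```python
import re


def user_database_name(credential_user, name, public=False):
    """Build <credentialUser>__<DBName>, adding _p for public databases."""
    # Conservative check chosen for this sketch; not a documented rule.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("name may only contain letters, digits and underscores")
    full_name = f"{credential_user}__{name}"  # two underscores
    if public and not full_name.endswith("_p"):
        full_name += "_p"
    return full_name


def is_public(database_name):
    """Databases whose name ends with _p are readable by everyone."""
    return database_name.endswith("_p")


print(user_database_name("s51234", "mydb"))               # s51234__mydb
print(user_database_name("s51234", "mydb", public=True))  # s51234__mydb_p
```

Note that a private database whose chosen name happens to end in _p would be world-readable, so it is worth checking is_public() on the final name before creating it.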

Steps to create a user database

To create a database on tools.db.svc.wikimedia.cloud:

  1. Become your tool account.
    maintainer@tools-login:~$ become toolaccount
  2. Connect to tools.db.svc.wikimedia.cloud with the replica.my.cnf credentials:
    mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
    You could also just type:
    sql tools
  3. In the MariaDB console, create a new database (where CREDENTIALUSER is your credentials user, which can be found in your ~/replica.my.cnf file, and DBNAME the name you want to give to your database. Note that there are 2 underscores between CREDENTIALUSER and DBNAME):
    MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME;

You can then connect to your database using:

$ mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud CREDENTIALUSER__DBNAME

Or:

$ sql tools
MariaDB [(none)]> USE CREDENTIALUSER__DBNAME;

Example

Assuming that your tool account is called "mytool", this is what it would look like:

maintainer@tools-login:~$ become mytool
tools.mytool@tools-login:~$ mariadb --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
MariaDB [(none)]> select substring_index(current_user(), '@', 1) as uname;
+---------------+
| uname         |
+---------------+
| u123something |
+---------------+
1 row in set (0.00 sec)
MariaDB [(none)]> create database u123something__wiki;

Caution: The legacy tools-db service name was deprecated in September 2017 and removed in May 2019. Use tools.db.svc.wikimedia.cloud instead.

Note: Some projects, such as those using Django, can throw an exception like MySQLdb._exceptions.OperationalError: (1709, 'Index column size too large. The maximum column size is 767 bytes.') when migrated using the setup above. In most cases this can be fixed by altering the database charset to utf8. To avoid the problem in the first place, create the database with the following command, which specifies the charset:

MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME CHARACTER SET utf8;

ToolsDB read-only replica host

We maintain two copies of the ToolsDB database, using a MariaDB primary-replica setup.

The read-only replica host can be accessed using the same credentials and the following hostname: tools-readonly.db.svc.wikimedia.cloud

Using the read-only replica host is recommended if you have to run queries that take a long time to complete, as in this way you will reduce the load on the primary host.

Please note that the replica host can sometimes lag behind the primary host, but we are doing our best to keep this lag at a minimum.

ToolsDB Backups

We don't do offline backups of any of the databases in ToolsDB. ToolsDB users can back up their data using mariadb-dump (included in the mariadb image) if necessary:

# use umask to make the dump private (use unless the database is public)
$ toolforge jobs run --command "umask o-r; ( mariadb-dump --defaults-file=~/replica.my.cnf --host=tools-readonly.db.svc.wikimedia.cloud credentialUser__DBName > ~/DBname-$(date -I).sql )" --image mariadb backup

Note that we don't recommend storing backups permanently on NFS (/data/project, /home, or /data/scratch on Toolforge) or on any other Cloud VPS hosted drive. True backups should be kept offsite.

ToolsDB Caveats

The Toolforge team tries to keep ToolsDB configurations as close to MariaDB defaults as possible. This can lead to surprising behaviors, such as:

  1. Transactions not rolled back on query timeouts, which can be common during high load on a shared database (see this issue)

If you encounter an issue, feel free to add it above.

Query Limits

One can use max_statement_time (unit is seconds, it allows decimals):

SET max_statement_time = 300;

And all subsequent queries on the same connection will be killed if they run for longer than the given time.

For example:

mariadb[(none)]> SET max_statement_time = 10;
Query OK, 0 rows affected (0.00 sec)

mariadb[(none)]> SELECT sleep(20);
+-----------+
| sleep(20) |
+-----------+
|         1 |
+-----------+
1 row in set (10.00 sec)

It works on Quarry, too!

You can also set limits with a single SQL query. For example:

SET STATEMENT max_statement_time = 300 FOR
SELECT COUNT(rev_id) FROM revision_userindex
INNER JOIN actor
 ON rev_actor = actor_id
WHERE actor_name = 'Jimbo Wales'
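If a tool issues many queries, the per-statement form can be applied programmatically. The following Python sketch wraps an existing statement in SET STATEMENT max_statement_time = ... FOR. The helper name with_time_limit is hypothetical, and it only builds the SQL text; executing it is left to whichever client library you use.

```python
def with_time_limit(select_sql, seconds):
    """Prefix a statement so that it is killed after `seconds` seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Drop surrounding whitespace and a trailing semicolon before wrapping.
    statement = select_sql.strip().rstrip(";").rstrip()
    return f"SET STATEMENT max_statement_time = {float(seconds):.3f} FOR {statement}"


print(with_time_limit("SELECT COUNT(*) FROM page;", 300))
# SET STATEMENT max_statement_time = 300.000 FOR SELECT COUNT(*) FROM page
```

Query parameters should still be passed separately to the client's execute call rather than formatted into the string; the wrapper leaves any placeholders in the statement untouched.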

Example queries

See Help:MySQL queries. Add yours!

Connecting with...

MySQL Workbench

If you are using an ed25519 key, with a passcode, you might have issues configuring this. See the MySQL bug. Consider establishing a separate SSH tunnel outside of MySQL Workbench, then using MySQL Workbench with connection method "Standard (TCP/IP)" and hostname 127.0.0.1, the other credentials remaining unchanged.
Example configuration of MySQL Workbench for Toolforge

You can connect to databases on Toolforge with MySQL Workbench (or similar client applications) via an SSH tunnel.

Instructions for connecting via MySQL Workbench are as follows:

  1. Launch MySQL Workbench on your local machine.
  2. Click the plus icon next to "MySQL Connections" in the Workbench window (or choose "Manage Connections..." from the Database menu and click the "new" button).
  3. Set Connection Method to "Standard TCP/IP over SSH"
  4. Set the following connection parameters:
    • SSH Hostname: login.toolforge.org
    • SSH Username: <your Toolforge shell username>
    • SSH Key File: <your Toolforge SSH private key file>[3]
    • SSH Password: password/passphrase of your private key (if set) - not your wiki login password.
    • MySQL Hostname: enwiki.analytics.db.svc.wikimedia.cloud (or whatever server your database lives on)
    • MySQL Server Port: 3306
    • Username: <your Toolforge MariaDB user name (from $HOME/replica.my.cnf)>
    • Password: <your Toolforge MariaDB password (from $HOME/replica.my.cnf)>
    • Default Schema: <name of the database, e.g. enwiki_p>
  5. Click "OK"

Replica-db hostnames can be found in /etc/hosts. Remember to add the _p suffix when setting a default schema for replica databases, e.g. enwiki_p.

If you are using SSH keys generated with PuTTYgen (Windows users), you need to convert your private key to the 'OpenSSH' format. Load your private key in PuTTYgen, then click Conversions » Export OpenSSH key. Use this file as SSH Key File above.

If you are getting errors with SSL, you can try disabling it. From the menu bar: Database -> select your connection -> SSL -> Change "Use SSL" to "No".

Code samples for common languages

Copied with edits from mw:Toolserver:Database access#Program access (not all tested, use with caution!)

In most programming languages, it will be sufficient to tell MariaDB to use the database credentials found in $HOME/.my.cnf assuming that you have created a symlink from $HOME/.my.cnf to $HOME/replica.my.cnf.

Below are various examples in a few common programming languages.

Bash

-- 2> /dev/null; date; echo '
/* Bash/SQL compatible test structure
 *
 * Run time: ? <SLOW_OK>
 */
SELECT 1
;-- ' | mariadb -ch enwiki.analytics.db.svc.wikimedia.cloud enwiki_p > ~/query_results-enwiki; date;

C

#include <my_global.h>
#include <mysql.h>

...

 char *host = "tools.db.svc.wikimedia.cloud";
 MYSQL *conn = mysql_init(NULL);

 mysql_options(conn, MYSQL_READ_DEFAULT_GROUP, "client");
 if (mysql_real_connect(conn, host, NULL, NULL, NULL, 0, NULL, 0) == NULL) {
    printf("Error %u: %s\n", mysql_errno(conn), mysql_error(conn));
    ...
 }

Perl

use User::pwent;
use DBI;

my $database = "enwiki_p";
my $host = "enwiki.analytics.db.svc.wikimedia.cloud";

my $dbh = DBI->connect(
    "DBI:mysql:database=$database;host=$host;"
    . "mysql_read_default_file=" . getpwuid($<)->dir . "/replica.my.cnf",
    undef, undef) or die "Error: $DBI::err, $DBI::errstr";

Python

Without installing the toolforge library, this will work:

import configparser
import pathlib
import pymysql
import pymysql.cursors

replica = pathlib.Path.home().joinpath("replica.my.cnf")
config = configparser.ConfigParser()
config.read_string(replica.read_text())
connection = pymysql.connections.Connection(
    host="commonswiki.analytics.db.svc.wikimedia.cloud",
    database="commonswiki_p",
    user=config.get("client", "user"),
    password=config.get("client", "password"),
    cursorclass=pymysql.cursors.DictCursor,
)

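# Example query; replace it with your own.
query = "SELECT page_id, page_title FROM page WHERE page_namespace = 0 LIMIT 5"

with connection.cursor() as cur:
    cur.execute(query)
    rows = cur.fetchall()
connection.close()  # close the connection as soon as you are done with it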

Using User:Legoktm/toolforge library, however, is probably the easiest way. This wrapper library supports both Python 3 and legacy Python 2 applications and provides convenience functions for connecting to the Wiki Replica databases.

import toolforge
conn = toolforge.connect('enwiki') # You can also use "enwiki_p"
# conn is a pymysql.connections.Connection object.
with conn.cursor() as cur:
    cur.execute(query)  # Or something....
conn.close()  # close the connection as soon as you are done with it

We used to recommend oursql as well, but as of 2019-02-20 it seems to be abandoned or at least not actively maintained and failing to compile against MariaDB client libraries.

Python: Django

If you are using Django, first install mysqlclient (inside your tool's virtual environment, accessed via a webservice shell):

export MYSQLCLIENT_CFLAGS="-I/usr/include/mariadb/"
export MYSQLCLIENT_LDFLAGS="-L/usr/lib/x86_64-linux-gnu/ -lmariadb"
pip install mysqlclient

Then add the database to the settings.py file as follows, replacing s12345 with your credential user name:

import configparser
import os

HOME = os.environ.get('HOME')  # get environment variable $HOME

replica_path = HOME + '/replica.my.cnf'
if not os.path.exists(replica_path):  # fail early with a clear message
    raise FileNotFoundError('replica.my.cnf file not found')
config = configparser.ConfigParser()
config.read(replica_path)

DATABASES = {
    'default': {
         'ENGINE': 'django.db.backends.mysql',
         'NAME': 's12345__mydbname',                                 
         'USER': config['client']['user'],                          #for instance "s12345"
         'PASSWORD': config['client']['password'],
         'HOST': 'tools.db.svc.wikimedia.cloud',
         'PORT': '',
     }
}

PHP (using PDO)

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/replica.my.cnf");
$db = new PDO("mysql:host=enwiki.analytics.db.svc.wikimedia.cloud;dbname=enwiki_p", $ts_mycnf['user'], $ts_mycnf['password']);
unset($ts_mycnf, $ts_pw);

$q = $db->prepare('select * from page where page_id = :id');
$q->execute(array(':id' => 843020));
print_r($q->fetchAll());
?>

PHP (using MySQLi)

<?php
$ts_pw = posix_getpwuid(posix_getuid());
$ts_mycnf = parse_ini_file($ts_pw['dir'] . "/replica.my.cnf");
$mysqli = new mysqli('enwiki.analytics.db.svc.wikimedia.cloud', $ts_mycnf['user'], $ts_mycnf['password'], 'enwiki_p');
unset($ts_mycnf, $ts_pw);

$stmt = $mysqli->prepare('select * from page where page_id = ?');
$id = 843020;
$stmt->bind_param('i', $id);
$stmt->execute();
$result = $stmt->get_result();
print_r($result->fetch_all(MYSQLI_BOTH));
?>

Java

Class.forName("com.mysql.jdbc.Driver").newInstance();
Properties mycnf = new Properties();
mycnf.load(new FileInputStream(System.getProperty("user.home")+"/replica.my.cnf"));
String password = mycnf.getProperty("password");
password=password.substring((password.startsWith("\""))?1:0, password.length()-((password.endsWith("\""))?1:0));
mycnf.put("password", password);
mycnf.put("useOldUTF8Behavior", "true");
mycnf.put("useUnicode", "true");
mycnf.put("characterEncoding", "UTF-8");
mycnf.put("connectionCollation", "utf8_general_ci");
String url = "jdbc:mysql://enwiki.analytics.db.svc.wikimedia.cloud:3306/enwiki_p";
Connection conn = DriverManager.getConnection(url, mycnf);

Node.js

The mysql2 client provides a promise-based interface.

const mysql = require('mysql2/promise');
async function sample() {
  const connection = await mysql.createConnection({
    host: 'tools.db.svc.wikimedia.cloud', 
    port: 3306,
    database: 's12345__mydbname', 
    user: 's12345', 
    password: ''
  });
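  // mytable is a placeholder; replace it with one of your own tables.
  const [rows, fields] = await connection.execute('SELECT * FROM mytable WHERE name = ? AND age > ?', ['Morty', 14]);
  for (const row of rows) console.log(row);
  await connection.end(); // close the connection as soon as you are done with it
}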


See also

Note

  1. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/wmcs/db/wikireplicas/web_multiinstance.yaml#9
  2. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/wmcs/db/wikireplicas/analytics_multiinstance.yaml#9
  3. If your private key is in RFC 4716 format, you will have to convert it to a PEM key.