Help:Tool Labs/Database

From Wikitech
Jump to: navigation, search
Tool Labs HelpFAQGetting startedRulesAccessDatabasesJob gridWebHow toList of ToolsGlossary

Tool and Labs accounts are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table), but otherwise the schema is, for all practical purposes, identical to the production databases and the databases are sharded into clusters in much the same way.

Database credentials (credential user/password) are stored in the 'replica.my.cnf' file found in the tool account’s home directory. To use these credentials with command-line tools by default , copy 'replica.my.cnf' to '.my.cnf'.

If you do not have a 'replica.my.cnf' in your home directory, please create a ticket in Phabricator.


Connecting to the database replicas

You can connect to the database replicas (and/or the cluster where a database replica is hosted) by specifying your access credentials and the alias of the cluster and replicated database. For example:

To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.labsdb) and the alias of the database replica (enwiki_p) :

mysql --defaults-file="${HOME}"/replica.my.cnf -h enwiki.labsdb enwiki_p

To connect to the Wikidata cluster:

mysql --defaults-file=~/replica.my.cnf -h wikidatawiki.labsdb

To connect to Commons cluster:

mysql --defaults-file=~/replica.my.cnf -h commonswiki.labsdb

There is also a shortcut for connecting to the replicas: sql <dbname>[_p] The _p is optional, but implicit (i.e. the sql tool will add it if absent).

To connect to the English Wikipedia database replica using the shortcut, simply type:

sql enwiki

To connect to tools-db using the shortcut, type:

sql local

This sets server to "tools-db" and db to "". It's equivalent to typing-

mysql --defaults-file=~/replica.my.cnf -h tools-db

To connect to a given Labs database, say, 'labsdb1004.eqiad.wmnet':

mysql --defaults-file=replica.my.cnf --host labsdb1004.eqiad.wmnet


Naming conventions

As a convenience, each mediawiki project database (enwiki, bgwiki, etc) has an alias to the cluster it is hosted on. The alias has the form:

${PROJECT}.labsdb

where ${PROJECT} is either the internal database name of a hosted Wikimedia project or tools for the tools database server.

Wikipedia project database names generally follow the format ${LANGUAGE_CODE}${PROJECT_FAMILY}. ${LANGUAGE_CODE} is the ISO 639 two-letter code for the primary content language (e.g. en for English, es for Spanish, bg for Bulgarian, ...). ${PROJECT_FAMILY} is an internal label for the wiki's project family (e.g. wiki for Wikipedia, wiktionary for Wiktionary, ...). Some wikis such as Meta-Wiki have database names that do not follow this pattern (metawiki).

The replica database names themselves consist of the Wikimedia project name, suffixed with _p (an underscore, and a p), for example:

enwiki_p for the English Wikipedia replica

In addition each cluster can be accessed by the name of its Wikimedia production shard which follows the format s${SHARD_NUMBER}.labsdb (for example, s1.labsdb hosts the enwiki_p database). The shard where a particular database is located can change over time. You should only use the shard name for opening a database connection if your application requires it for specific performance reasons such as for heavily crosswiki tools which would otherwise open hundreds of database connections.

If you use a user database (see below), connect to a database server instead of one of the aliases. The target of the aliases can (and do!) change, and your database will seem to have disappeared if it does. The current database servers are c1.labsdb, c2.labsdb and c3.labsdb.

Connection handling policy

Usage of connection pools (maintaining open connections without them being in use), persistent connections, or any kind of connection pattern that maintains several connections open even if they are unused is not permitted.

The memory and processing power available to the database servers is a finite resource. Each open connection to a database, even if inactive, consumes some of these resources. Given the number of potential users for the database replicas if even a relatively small percentage of users held open idle connections the server would quickly run out of resources to allow new connections. Please close your connections as soon as you stop using them. Note that connecting interactively and being idle for a few minutes is not an issue—opening dozens of connections and maintaining them automatically open is.

Idle connections can and will be killed by database and system administrators when discovered. If you then reopen those connections automatically and keep them idle, you will be warned to stop.

Connecting to the database replicas from other Labs instances

The *.labsdb servers should be directly accessible from other Labs instances as well (these are provided in DNS), but there is no automatic creation of database credential files. Please create a task in the Labs project to have one created for you.

Connecting to the database replicas from your own computer

You can access the database replicas from your own computer by setting up an SSH tunnel. If you use MySQL Workbench, you can find a detailed description for that application below.

Tunneling is a built-in capability of ssh. It allows creating a listening TCP port on your local computer that will transparently forward all connections to a given host and port on the remote side of the ssh connection. The destination host and port do not need to be the host that you are connecting to with your ssh session, but they do need to be reachable from the remote host.

In the general case, need to add a port forwarding in your ssh tool. In Windows, you can use the tool PuTTY by add in Connection → SSH → Tunnels the following settings.

In Linux, you can add the option -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT to your ssh call, e. g.:

$ ssh -L 4711:enwiki.labsdb:3306 tools-login.wmflabs.org

This will set up a tunnel so that connections to port 4711 on your own computer will be relayed to the enwiki.labsdb database replica's MySQL server on port 3306. This tunnel will continue to work as long as the SSH session is open.

The mysql command line to connect using the tunnel from the example above would look something like:

$ mysql --user=$USER_FROM_REPLICA.MY.CNF --host=127.0.0.1 --port=4711 --password enwiki_p

The user and password values needed can be found in the $HOME/replica.my.cnf credentials file for your Tool Labs user account or a tool that you have access to.

Note that you need to explicitly use the 127.0.0.1 IP address; using localhost instead will give an "Access denied" error.

Steps to setup SSH tunneling for testing a tool labs application which makes use of tool labs databases on your own computer

  1. Setup SSH tunnels: ssh -N yourusername@tools-dev.wmflabs.org -L 3306:enwiki.labsdb:3306
    • -N prevents ssh from opening an interactive shell. This connection will only be useful for port forwarding.
    • The first port is the listening port on your machine and the second one is on the remote server. 3306 is the default port for MySQL.
    • For multiple database connections, add additional -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT sections to the same command or open additional ssh connections.
    • If you need to connect to more than one Labs database server each database will need a different listening port on your machine (e.g. 3307, 3308, 3309, ...). Change the associated php/python connect command to send requests to that port instead of the default 3306.
  2. (optional) Edit your /etc/hosts file to add something like 127.0.0.1 enwiki.labsdb for each of the databases you're connecting to.
  3. You might need to copy over the replica.my.cnf file to your local machine for this to work.

Databases

Replica database schema (tables and indexes)

The database replicas for the various Wikimedia projects follow the standard MediaWiki database schema described on mediawiki.org and in the MediaWiki git repository.

Many of the indexes on these tables are actually compound indexes designed to optimize the runtime performance of the MediaWiki software rather than to be convenient for ad hoc queries. For example, a naive query by page_title such at SELECT * FROM page WHERE page_title = 'NOFX'; will be slow because the index which includes page_title is a compound index with page_namespace. Adding page_namespace to the WHERE clause will improve the query speed dramatically: SELECT * FROM page WHERE page_namespace = 0 AND page_title = 'NOFX';

Tables for revision or logging queries involving user names and IDs

The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when supressed". One has to instead use the corresponding revision_userindex or logging_userindex for these types of queries. On those views, rows where the column would have otherwise been nulled are elided; this allows the indexes to be usable.

Example query that will use the appropriate index (in this case on the rev_user_text column, the rev_user column works the same way for user IDs):

SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_user_text="Foo"

Example query that fails to use an index because the table doesn't have them:

SELECT rev_id, rev_timestamp FROM revision WHERE rev_user_text="Foo"

You should use the indexes so queries will go faster (performance).

Redacted tables

language, skin, timecorrection, and varient properties have been deemed sensitive and are removed from user_properties table of Labs DB replicas.

Unused tables

Not all standard MediaWiki tables are in use on Wikimedia wikis. The following tables are missing or empty:

Metadata database

From bugzilla:48626 there is a table with automatically maintained meta information about the replicated databases: meta_p.wiki. See Quarry #4031 for an up-to-date version.

MariaDB [meta_p]> DESCRIBE wiki;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |
| is_sensitive     | decimal(1,0) | NO   |     | 0       |       |
+------------------+--------------+------+-----+---------+-------+

Example data:

MariaDB [nlwiki_p]> select * from meta_p.wiki limit 1 \G
 *************************** 1. row ***************************
          dbname: aawiki
            lang: aa
            name: Wikipedia
          family: wikipedia
             url: http://aa.wikipedia.org
            size: 1
           slice: s3.labsdb
       is_closed: 1
        has_echo: 1
 has_flaggedrevs: 0
has_visualeditor: 1
    has_wikidata: 1
    is_sensitive: 0

Identifying lag

If there is a network/labs db infrastructure problem, production problem, maintenance (scheduled or unscheduled), excessive load or production or user's queries blocking the replication process, labs replicas can get behind the production databases "lag".

To identify lag, see https://tools.wmflabs.org/replag/ or execute yourself on the database host you are connected to:

mysql> SELECT * FROM heartbeat_p.heartbeat;
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2015-11-24T12:21:13.500950 |    0 |
| s2    | 2015-11-24T12:21:13.501200 |    0 |
| s7    | 2015-11-24T12:21:13.501190 |    0 |
| s3    | 2015-11-24T12:21:13.501110 |    0 |
| s4    | 2015-11-24T12:21:13.501170 |    0 |
| s1    | 2015-11-24T12:21:13.500670 |    0 |
| s5    | 2015-11-24T12:21:13.500780 |    0 |
+-------+----------------------------+------+
7 rows in set (0.00 sec)

This table is based on the tool pt-heartbeat, not on SHOW MASTER STATUS, producing very accurate results, even if replication is broken, and directly comparing it to the original master, and not the replicas's direct master.

  • shard: s1-7. Each of the production masters. The wiki distribution can be seen at: https://noc.wikimedia.org/db.php
  • last_updated: Every 0.5 seconds, a row in the master is written with the date local to the master. Here you have its value, once replicated. As it is updated every 0.5 seconds, it has a measuring error of [0, 0.5] seconds.
  • lag: The difference between the current date and the last_updated column, rounded to seconds (timestampdiff(SECOND,`heartbeat`.`heartbeat`.`ts`,utc_timestamp())). Due to a MariaDB bug, decimals are truncated, but expect a number with 6 decimals soon.

Please note that some seconds or a few minutes of lag is considered normal, due to the filtering process and the hops done before reaching the public hosts.

Replica drift

This is a brief summary of the /Replica drift documentation page.


The replicated database are not exact copies of the production database, which causes the database to slowly drift from the production contents. This shows up in various queries, but queries that involve recently deleted/restored pages seem to be affected the most. The impact of this is kept as small as possible by regular database re-imports.

If you encounter replica drift, please report it as a comment on phab:T138967

User databases

User-created databases can be created on the database hosting the replica servers or on a shared server: tools.labsdb. Unless you need direct access to the replica databases (e.g. for a join), you should create your database on tools.labsdb. If you do need to create a database on a replica server, make sure to connect to that specific server (e.g. c1.labsdb) and not to an alias (e.g. enwiki.labsdb), as the alias-server relationship can (and do) change.

Database names must start with the name of the credential user (not your user name), which can be found in your ~/replica.my.cnf file (the name of the credential user looks something like 'u1234' for a user and 's51234' for a service group). (Note that it may take up to 20 minutes after your project account has been created for the replica.my.cnf file to show up in the directory.) The name of the credential user is followed by two underscores and then the name of the database:

credentialUser__DBName

Privileges on the database

Users have all privileges and have access to all grant options on their databases. Database names ending with _p are granted read access for everyone. Please create a ticket if you need more fine-grained permissions, like sharing a database only between 2 users, or other special permissions.

Steps to create a user database on tools.labsdb

To create a database on tools.labsdb:

  1. Become your tool account.
    maintainer@tools-login:~$ become toolaccount
  2. Connect to tools-db with the replica.my.cnf credentials:
    mysql --defaults-file="${HOME}"/replica.my.cnf -h tools.labsdb
    You could also just type:
    sql local
  3. In the mysql console, create a new database (where CREDENTIALUSER is your credentials user, which can be found in your ~/replica.my.cnf file, and DBNAME the name you want to give to your database. Note that there are 2 underscores between CREDENTIALUSER and DBNAME):
    MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME;

You can then connect to your database using:

mysql --defaults-file="${HOME}"/replica.my.cnf -h tools.labsdb CREDENTIALUSER__DBNAME

Example

Assuming that your tool account is called "mytool", this is what it would look like:

maintainer@tools-login:~$ become mytool
tools.mytool@tools-login:~$ cat replica.my.cnf | grep user
user='123something'
mysql --defaults-file="${HOME}"/replica.my.cnf -h tools.labsdb
create database 123something__wiki;
Warning Caution: The alias tools-db is deprecated.

Steps to create a user database on the replica servers

Warning Warning: User databases on replicas should only be used for temporary data storage. Databases may be removed without prior notice and will not be backed up or replicated in any way.

If you would like your database to interact with the replica databases (i.e., if you need to do actual SQL joins with the replicas, which can only be done on the same cluster) you can create a database on the replica servers.

To create a database on the replica servers:

1. Become your tool account:

maintainer@tools-login:~$ become toolaccount

2. Connect to the replica servers with the replica.my.cnf credentials. You must specify the host of the replica (e.g., enwiki.labsdb):

mysql --defaults-file="${HOME}"/replica.my.cnf -h XXwiki.labsdb
Warning Caution: There are multiple database servers, and XXwiki.labsdb may be switched between servers for load balancing or maintenance. Make sure your tool can survive removal of the database. You can connect to a specific server using cX.labsdb where X can be 1, 2 or 3 but this will not guarantee that your data will always be present.

3. In the mysql console, create a new database (where CREDENTIALUSER is your credentials user, which can be found in your ~/replica.my.cnf file, and DBNAME the name you want to give to your database. Note that there are 2 underscores between CREDENTIALUSER and DBNAME):

MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME

You can then connect to your database using:

mysql --defaults-file="${HOME}"/replica.my.cnf -h XXwiki.labsdb CREDENTIALUSER__DBNAME
Warning Caution: Writing (INSERT, UPDATE, DELETE) tables on the replica servers leads to the replication being stopped for all users on that server until the query finishes. So you need to make sure that such queries can be executed in a time frame that does not disrupt other users' work too much, for example by processing data in smaller batches or rethinking your data flow. As a rule of thumb, queries should finish in less than a minute.

Example queries

See Help:MySQL queries. Add yours!

Connecting with...

Python

Use User:Legoktm/wmflib.

MySQL Workbench

You can connect to databases on Tool Labs with MySQL Workbench (or similar client applications) via an SSH tunnel.

PGAdmin (for OpenStreetMap databases)

This guide assumes you already know what PGAdmin is and how to use it.

Connecting to OSM via the official CLI PostgreSQL

  1. SSH to tools-login.wmflabs.org or tools-dev.wmflabs.org
  2. psql -h osmdb.eqiad.wmnet -U osm gis
  3. Start querying


Connecting from a Servlet in Tomcat

  1. create directory "lib" in directory "public_tomcat"
  2. copy "mysql-connector-java-bin.jar" to "public_tomcat/lib"
  3. import org.apache.tomcat.jdbc.pool.DataSource;
    import org.apache.tomcat.jdbc.pool.PoolProperties;
    
    String DBURL    = "jdbc:mysql://tools-db.tools.eqiad.wmflabs:3306/";
    String DBDRIVER = "com.mysql.jdbc.Driver";
    String DATABASE = DBUSER + "__" + PROJECT;
    
    PoolProperties p = new PoolProperties();
    p.setUrl            (DBURL + DATABASE);
    p.setDriverClassName(DBDRIVER        );
    p.setUsername       (DBUSER          );
    p.setPassword       (DBPASSWORD      );
    p.setJdbcInterceptors(
    	"org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;" +
    	"org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer");
    DataSource datasource = new DataSource();
    datasource.setPoolProperties(p);
    Connection connection = datasource.getConnection  ();	
    Statement  statement  = connection.createStatement();
    
  4. javac -classpath javax.servlet.jar:tomcat-jdbc.jar myhttpservlet.java