From Wikitech
Jump to navigation Jump to search

There is an analytics-specific set of MariaDB servers which hosts full, unredacted replicas of the production MediaWiki databases.

Database setup

The production databases are split up into a number of shards. This system has the same setup, with a separate MariaDB database instance for each shard.

Each shard has a nice hostname that redirects to it:

  • s1-analytics-replica.eqiad.wmnet
  • s2-analytics-replica.eqiad.wmnet
  • ...
  • s8-analytics-replica.eqiad.wmnet
  • x1-analytics-replica.eqiad.wmnet
  • staging-db-analytics.eqiad.wmnet

The main gotcha now is to figure out what port to use to connect to each section, since as we said above there are multiple MySQL instances running. This is the scheme to use:

  • 331 + the digit of the section in case of sX. Example: s5 will be accessible at s5-analytics-replica.eqiad.wmnet:3315
  • 3320 for x1. Example: x1-analytics-replica.eqiad.wmnet:3320
  • 3350 for staging

We have created DNS SRV records to ease the use of the dbstore shard into scripts (see Python functions below for an example):

  • _s1-analytics._tcp.eqiad.wmnet
  • ...
  • _s8-analytics._tcp.eqiad.wmnet
  • _x1-analytics._tcp.eqiad.wmnet
  • _staging-analytics._tcp.eqiad.wment

These records look "weird" if you are not familiar with them, but they allow us to get a hostname/port combination from one DNS name! Please note: the mysql client is sadly not capable of reading those, so you cannot use them directly with it.

You can access these analytics replicas from either stat1007 or stat1006. To use the mysql client interactively, type: mysql -h sX-analytics-replica.eqiad.wmnet -P 331X -A. The -A disables tab autocompletion, which can be bothersome when pasting multi-line queries into the prompt. You'll then be dropped into the MySQL command line.[1]

If you'd rather generate a TSV file[2] and then retrieve it later, you can also do so from the command line. You can type type:

mysql -h s1-analytics-replica.eqiad.wmnet -P 3321 {{database name}} -e "{{your query;}}" > {{filename}}.tsv

It'll go off to generate the file on its own.[3]

As well as connecting directly, it's also possible to connect automatically from your programming language of choice, be it R or Python. For Python, we have the MySQLdb module installed on stat1006 and stat1007. For R, we have RMySQL.

The following Python snippets are an example about how it is possible to find the correct Mysql hostname+port combination given a wiki db name:

# The second function needs dnspython to work
import dns.resolver

def get_mediawiki_section_dbname_mapping(mw_config_path, use_x1):
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/dblists/all.dblist']
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/dblists/s[0-9]*.dblist')
    for dblist_section_path in dblist_section_paths:
        with open(dblist_section_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = dblist_section_path.strip().rstrip('.dblist').split('/')[-1]

    return db_mapping

def get_dbstore_host_port(db_mapping, use_x1, dbname):
    if dbname == 'staging':
        shard = 'staging'
    elif use_x1:
        shard = 'x1'
            shard = db_mapping[dbname]
        except KeyError:
            raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
    answers = dns.resolver.query('_' + shard + '-analytics._tcp.eqiad.wmnet', 'SRV')
    host, port = str(answers[0].target), answers[0].port
    return (host,port)

Example of a Python script to retrieve data for a specific wiki from a Notebook (credits to Neil for the work!):

import dns.resolver
import mysql.connector as mysql

query = "select count(*) from recentchanges"
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
cursor = conn.cursor()

This is of course a quick solution, it can be done in any number of ways. If you have comments/suggestions/improvements please feel free to improve :)

The first function, get_mediawiki_section_dbname_mapping, needs the path of the mediawiki-config repository local checkout. If you run your script on a stat100* or notebook100* host, then you can find the repository checked out under /srv/mediawiki-config. This is one source of the mapping between mediawiki db sections (s1..s8, x1) to wiki names. For example, itwiki is contained into two files:

  • dblists/s2.dblist
  • dblists/all.dblist

The above files means two things:

1) itwiki's database will be available on s2-analytics-replica.eqiad.wmnet

2) itiwiki's database will be available on x1-analytics-replica.eqiad.wmnet (if we need the tables related to the extensions, since x1 contains all the wikis).

This is a pre-requisite to use the get_dbstore_host_port, that starts from the mapping described above and uses DNS SRV records as explained above.

MySQL wrapper

On all the Analytics nodes (stat100[4,6,7] and notebook100[3,4]) there is a tool called analytics-mysql that should help using the new databases:

elukey@stat1006:~$ analytics-mysql itwiki --print-target

elukey@stat1006:~$ analytics-mysql itwiki --print-target --use-x1

elukey@stat1006:~$ analytics-mysql itwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 465860
Server version: 10.1.37-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1004.eqiad.wmnet. [itwiki]> Bye

elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables limit 3'
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'limit 3' at line 1
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables'
| Tables_in_itwiki         |
  1. The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (.cnf) files. On stat1006, the file is /etc/mysql/conf.d/research-client.cnf, while on stat1007 it's /etc/mysql/conf.d/analytics-research-client.cnf. These files are automatically referenced from /etc/mysql/my.cnf, which the command line mysql program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying .cnf file.
  2. The mysql utility doesn't have the ability to generate files in other formats like CSV.
  3. The file extension you choose doesn't actually affect the command, but since -e generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.