Data Platform/Systems/MediaWiki replicas

From Wikitech

The Data Platform MediaWiki replicas host full, unredacted copies of the main MediaWiki databases (MariaDB clusters s1-8) and the x1 extension storage cluster. Note that the full text of MediaWiki page revisions is not included, as that is stored in the external storage clusters.

The easiest way to query them is using Wmfdata-Python.

Clients

Analytics MariaDB is used in two ways:

  • interactive user queries (via Jupyter and the analytics-mysql wrapper described below)
  • a monthly sqoop job that imports a few private tables into Hadoop at the start of each month

Database setup

The production databases are split up into a number of shards. This system has the same setup, with a separate MariaDB database instance for each shard.

Each shard has a nice hostname that redirects to it:

  • s1-analytics-replica.eqiad.wmnet
  • s2-analytics-replica.eqiad.wmnet
  • ...
  • s8-analytics-replica.eqiad.wmnet
  • x1-analytics-replica.eqiad.wmnet
  • staging-db-analytics.eqiad.wmnet

The main gotcha is figuring out which port to use to connect to each section, since, as noted above, multiple MariaDB instances run on the same host. This is the scheme:

  • 331 followed by the section digit in the case of sX. Example: s5 is accessible at s5-analytics-replica.eqiad.wmnet:3315
  • 3320 for x1. Example: x1-analytics-replica.eqiad.wmnet:3320
  • 3350 for staging
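The scheme above can be expressed as a small helper, shown here as a sketch (the function name is ours, not part of any existing tooling):

```python
def analytics_replica_port(section):
    """Return the MariaDB port for an analytics replica section.

    Sections s1-s8 listen on 331X (3310 plus the section digit),
    x1 on 3320, and staging on 3350.
    """
    if section == 'x1':
        return 3320
    if section == 'staging':
        return 3350
    if len(section) == 2 and section[0] == 's' and section[1].isdigit():
        return 3310 + int(section[1])
    raise ValueError('unknown section: {}'.format(section))
```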

We have created DNS SRV records to make it easier to use the dbstore shards from scripts (see the Python functions below for an example):

  • _s1-analytics._tcp.eqiad.wmnet
  • ...
  • _s8-analytics._tcp.eqiad.wmnet
  • _x1-analytics._tcp.eqiad.wmnet
  • _staging-analytics._tcp.eqiad.wmnet

These records look "weird" if you are not familiar with them, but they allow us to get a hostname/port combination from a single DNS name. Please note: the mysql client is sadly not capable of reading SRV records, so you cannot use them directly with it.

You can access these analytics replicas from either stat1007 or stat1006. To use the mysql client interactively, type: mysql -h sX-analytics-replica.eqiad.wmnet -P 331X -A. The -A flag disables automatic completion of table and column names, which speeds up the connection and prevents completion from being triggered when pasting multi-line queries into the prompt. You'll then be dropped into the MySQL command line.[1]

If you'd rather generate a TSV file[2] and then retrieve it later, you can also do so from the command line. You can type:

mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 {{database name}} -e "{{your query;}}" > {{filename}}.tsv

It'll go off to generate the file on its own.[3]
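Since mysql -e emits tab-separated output, a TSV result can later be converted to CSV with Python's csv module if needed. A minimal sketch (the helper name is ours):

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    """Convert a mysql -e TSV result file to CSV."""
    with open(tsv_path, newline='') as src, open(csv_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
```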

As well as connecting directly, it's also possible to connect automatically from your programming language of choice, be it R or Python. For Python, we have the MySQLdb module installed on stat1006 and stat1007. For R, we have RMySQL.

You can use nslookup to find the port. For example:

$ nslookup -q=SRV _x1-analytics._tcp.eqiad.wmnet
Server:         10.3.0.1
Address:        10.3.0.1#53

Non-authoritative answer:
_x1-analytics._tcp.eqiad.wmnet  service = 0 1 3320 dbstore1005.eqiad.wmnet.
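The four values in the answer are the SRV record's priority, weight, port, and target (per RFC 2782). As a quick sketch, an nslookup answer line like the one above can be parsed in Python (the function name is ours):

```python
def parse_srv_answer(line):
    """Parse an nslookup SRV answer line such as
    '_x1-analytics._tcp.eqiad.wmnet  service = 0 1 3320 dbstore1005.eqiad.wmnet.'
    and return a (host, port) tuple. The fields after 'service =' are
    priority, weight, port, and target, in that order.
    """
    priority, weight, port, target = line.split('=', 1)[1].split()
    return target.rstrip('.'), int(port)
```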

The following Python snippets show how to find the correct MySQL hostname and port combination for a given wiki db name:

# The first function needs the glob module; the second needs dnspython
import glob

import dns.resolver

def get_mediawiki_section_dbname_mapping(mw_config_path, use_x1):
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/dblists/all.dblist']
    else:
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/dblists/s[0-9]*.dblist')
    for dblist_section_path in dblist_section_paths:
        # The section name (e.g. 's2') is the file name without the .dblist suffix.
        # Note: str.rstrip('.dblist') would strip a set of characters, not the suffix.
        section = dblist_section_path.split('/')[-1].rsplit('.', 1)[0]
        with open(dblist_section_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = section

    return db_mapping


def get_dbstore_host_port(db_mapping, use_x1, dbname):
    if dbname == 'staging':
        shard = 'staging'
    elif use_x1:
        shard = 'x1'
    else:
        try:
            shard = db_mapping[dbname]
        except KeyError:
            raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
                               .format(dbname))
    answers = dns.resolver.query('_' + shard + '-analytics._tcp.eqiad.wmnet', 'SRV')
    host, port = str(answers[0].target), answers[0].port
    return (host,port)

Example of a Python script to retrieve data for a specific wiki from a Notebook (credits to Neil for the work!):

import dns.resolver
import mysql.connector as mysql


query = "select count(*) from recentchanges"
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
    host=str(ans.target), 
    port=ans.port, 
    database="hewiki",
    option_files='/etc/mysql/conf.d/analytics-research-client.cnf',
    charset='binary',
    autocommit=False
)
cursor = conn.cursor()
cursor.execute(query)
cursor.fetchall()

This is of course just one quick solution; it can be done in any number of ways. If you have comments, suggestions, or improvements, please feel free to improve it :)

The first function, get_mediawiki_section_dbname_mapping, needs the path of a local checkout of the mediawiki-config repository. If you run your script on a stat100* or notebook100* host, you can find the repository checked out under /srv/mediawiki-config. This is one source of the mapping between MediaWiki db sections (s1..s8, x1) and wiki names. For example, itwiki is listed in two files:

  • dblists/s2.dblist
  • dblists/all.dblist

These files mean two things:

1) itwiki's database is available on s2-analytics-replica.eqiad.wmnet

2) itwiki's database is also available on x1-analytics-replica.eqiad.wmnet (useful if you need the extension-related tables, since x1 contains all the wikis).

This mapping is a prerequisite for get_dbstore_host_port, which starts from it and uses the DNS SRV records as explained above.
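A quick way to check this mapping yourself is to scan the dblists directory of the checkout. A sketch, assuming the path layout described above (the helper name is ours):

```python
import glob
import os

def wiki_section(mw_config_path, dbname):
    """Return the sX section whose dblist file contains dbname,
    or None if the wiki is not listed in any sX.dblist file."""
    pattern = os.path.join(mw_config_path, 'dblists', 's[0-9]*.dblist')
    for path in glob.glob(pattern):
        with open(path) as f:
            if dbname in (line.strip() for line in f):
                # Strip the '.dblist' suffix to get the section name.
                return os.path.basename(path)[:-len('.dblist')]
    return None
```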

MySQL wrapper

On all the Analytics nodes (stat10[08-11]) there is a tool called analytics-mysql that makes connecting to these databases easier:

elukey@stat1006:~$ analytics-mysql itwiki --print-target
dbstore1004.eqiad.wmnet:3312

elukey@stat1006:~$ analytics-mysql itwiki --print-target --use-x1
dbstore1005.eqiad.wmnet:3320

elukey@stat1006:~$ analytics-mysql itwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 465860
Server version: 10.1.37-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1004.eqiad.wmnet. [itwiki]> Bye

elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables limit 3'
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'limit 3' at line 1
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables'
+--------------------------+
| Tables_in_itwiki         |
+--------------------------+
[..]

It’s possible that you’re not allowed to run analytics-mysql directly, but are a member of a group that is allowed to run it, for example:

lucaswerkmeister-wmde@stat1007:~$ analytics-mysql wikidatawiki
ERROR 1045 (28000): Access denied for user 'lucaswerkmeister-wmde'@'10.64.21.118' (using password: NO)
lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1233204
Server version: 10.4.15-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]>
  1. The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (.cnf) files. On stat1006, the file is /etc/mysql/conf.d/research-client.cnf, while on stat1007 it's /etc/mysql/conf.d/analytics-research-client.cnf. These files are automatically referenced from /etc/mysql/my.cnf, which the command line mysql program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying .cnf file.
  2. The mysql utility doesn't have the ability to generate files in other formats like CSV.
  3. The file extension you choose doesn't actually affect the command, but since -e generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.