Data Platform/Systems/MediaWiki replicas

From Wikitech

The Data Platform MediaWiki replicas host full, unredacted copies of the main MediaWiki databases (MariaDB clusters s1-8) and the x1 extension storage cluster. Note that the full text of MediaWiki page revisions is not included, as that is stored in the external storage clusters (es1-7).

The easiest way to query them is using Wmfdata-Python. Another way to query them is to SSH into a stat machine and use the MariaDB monitor.

Audience

The Analytics MariaDB replicas are used in two ways:

  • interactive user queries (via Jupyter or the analytics-mysql wrapper described below)
  • a monthly Sqoop job that imports a few private tables into Hadoop at the start of each month

Writing queries

See Help:Wiki Replicas/Queries#Writing queries (this is the documentation for the separate, public MediaWiki replicas, so the access information doesn't apply, but the information on the data layout and query writing applies almost exactly).

Connecting via SSH

If you want to SSH into a stat machine and use the MariaDB monitor to type SQL commands, follow the instructions at SRE/Production access#Setting up your access to set up your SSH, then type:

ssh stat1008.eqiad.wmnet

Once you are connected, run one of the MariaDB monitor commands listed below.

Database setup

The production databases are split up into a number of shards (see this list). This system has the same setup, with a separate MariaDB database instance for each shard.

Each shard has its own host and port, with a matching MariaDB monitor bash command:

  • s1 (s1-analytics-replica.eqiad.wmnet, port 3311): enwiki
    mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 -A
  • s2 (s2-analytics-replica.eqiad.wmnet, port 3312): 17 large wikis, including the Chinese, Italian, Dutch, Polish, and Portuguese Wikipedias and the English Wiktionary
    mysql -h s2-analytics-replica.eqiad.wmnet -P 3312 -A
  • s3 (s3-analytics-replica.eqiad.wmnet, port 3313): most small wikis (~900), plus votewiki, the wikimania wikis, and private wikis
    mysql -h s3-analytics-replica.eqiad.wmnet -P 3313 -A
  • s4 (s4-analytics-replica.eqiad.wmnet, port 3314): commons, testcommons
    mysql -h s4-analytics-replica.eqiad.wmnet -P 3314 -A
  • s5 (s5-analytics-replica.eqiad.wmnet, port 3315): German Wikipedia, some other large wikis, and most new wikis since mid-2020
    mysql -h s5-analytics-replica.eqiad.wmnet -P 3315 -A
  • s6 (s6-analytics-replica.eqiad.wmnet, port 3316): French, Japanese, and Russian Wikipedias, and Wikitech
    mysql -h s6-analytics-replica.eqiad.wmnet -P 3316 -A
  • s7 (s7-analytics-replica.eqiad.wmnet, port 3317): CentralAuth, Meta, 10 large Wikipedias (including fawiki), and the French Wiktionary
    mysql -h s7-analytics-replica.eqiad.wmnet -P 3317 -A
  • s8 (s8-analytics-replica.eqiad.wmnet, port 3318): wikidata
    mysql -h s8-analytics-replica.eqiad.wmnet -P 3318 -A
  • x1 (x1-analytics-replica.eqiad.wmnet, port 3320): cognate_wiktionary, flowdb, wikishared, plus a database for every wiki
    mysql -h x1-analytics-replica.eqiad.wmnet -P 3320 -A
  • staging (staging-db-analytics.eqiad.wmnet, port 3350): staging (miscellaneous temporary tables, apparently used by people in the deployment group)
    mysql -h staging-db-analytics.eqiad.wmnet -P 3350 -A
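The hosts and ports follow a regular pattern (port 3310 plus the shard number for s1-s8, with x1 and staging as special cases), so they can be computed. The helper below is a hypothetical convenience function sketched from the table above, not part of any WMF tooling:

```python
# Hypothetical helper: derive the analytics replica host and port for a
# shard name, following the pattern in the table above.
SPECIAL_PORTS = {"x1": 3320, "staging": 3350}
SPECIAL_HOSTS = {"staging": "staging-db-analytics.eqiad.wmnet"}

def analytics_replica(shard):
    """Return (host, port) for a shard name like 's1', 'x1', or 'staging'."""
    host = SPECIAL_HOSTS.get(shard, f"{shard}-analytics-replica.eqiad.wmnet")
    if shard in SPECIAL_PORTS:
        port = SPECIAL_PORTS[shard]
    elif shard.startswith("s") and shard[1:].isdigit():
        # s1..s8 listen on 3311..3318
        port = 3310 + int(shard[1:])
    else:
        raise ValueError(f"unknown shard: {shard}")
    return host, port
```

Note that this only encodes the naming convention; for the authoritative mapping, use the DNS SRV records described below.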

We have created DNS SRV records to make it easier to use the dbstore shards in scripts (see the Python functions below for an example):

  • _s1-analytics._tcp.eqiad.wmnet
  • ...
  • _s8-analytics._tcp.eqiad.wmnet
  • _x1-analytics._tcp.eqiad.wmnet
  • _staging-analytics._tcp.eqiad.wmnet

These records look "weird" if you are not familiar with them, but they allow us to get a hostname/port combination from one DNS name! Please note: the mysql client is sadly not capable of reading those, so you cannot use them directly with it.

You can access these analytics replicas from most stat machines. To use the mysql client interactively, type:

mysql -h sX-analytics-replica.eqiad.wmnet -P 331X -A

The -A flag disables automatic table and column name completion, which makes the connection start faster; completion can also be bothersome when pasting multi-line queries into the prompt. You'll then be dropped into the MariaDB command line.[1]

If you'd rather generate a TSV file[2] and retrieve it later, you can also do so from the command line. Type:

mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 {database name} -e "{your query;}" > {filename}.tsv

It'll go off to generate the file on its own.[3]
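The resulting file is tab-separated with a header row, so it can be loaded back with, for example, Python's standard csv module (a minimal sketch; pandas.read_csv with sep='\t' also works):

```python
import csv

def read_tsv(path):
    """Load a `mysql -e` result file: tab-separated fields, header row first.
    Returns a list of dicts keyed by column name (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

This is just one way to consume the output; any TSV-aware tool will do.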

As well as connecting directly, it's also possible to connect programmatically from your language of choice, be it R or Python. For Python, the MySQLdb module may be installed; for R, there is RMySQL.

You can use nslookup to find the port. For example:

$ nslookup -q=SRV _x1-analytics._tcp.eqiad.wmnet
Server:         10.3.0.1
Address:        10.3.0.1#53

Non-authoritative answer:
_x1-analytics._tcp.eqiad.wmnet  service = 0 1 3320 dbstore1005.eqiad.wmnet.
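The four values in the answer are the SRV record's priority, weight, port, and target host. If you need them in a script, a hypothetical parser for a line in this format might look like the following (nslookup's exact output format can vary between versions, so treat this as a sketch):

```python
def parse_srv(line):
    """Parse an nslookup SRV answer line such as
    '_x1-analytics._tcp.eqiad.wmnet  service = 0 1 3320 dbstore1005.eqiad.wmnet.'
    into (priority, weight, port, target)."""
    # Everything after '=' is: priority, weight, port, target
    fields = line.split("=", 1)[1].split()
    priority, weight, port = (int(x) for x in fields[:3])
    target = fields[3].rstrip(".")  # drop the trailing dot of the FQDN
    return priority, weight, port, target
```

In practice the dnspython-based snippets below are more robust, since they read the record fields directly rather than scraping CLI output.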

The following Python snippets show how to find the correct MySQL hostname and port combination for a given wiki database name:

# The second function needs dnspython to work
import glob

import dns.resolver


def get_mediawiki_section_dbname_mapping(mw_config_path, use_x1):
    """Map each wiki db name to its section, based on the dblist files
    in a mediawiki-config checkout."""
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/dblists/all.dblist']
    else:
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/dblists/s[0-9]*.dblist')
    for dblist_section_path in dblist_section_paths:
        # removesuffix (Python 3.9+) rather than rstrip: rstrip('.dblist')
        # strips a *character set*, mangling names like 'all.dblist'.
        section = dblist_section_path.split('/')[-1].removesuffix('.dblist')
        with open(dblist_section_path, 'r') as f:
            for db in f:
                db_mapping[db.strip()] = section

    return db_mapping


def get_dbstore_host_port(db_mapping, use_x1, dbname):
    if dbname == 'staging':
        shard = 'staging'
    elif use_x1:
        shard = 'x1'
    else:
        try:
            shard = db_mapping[dbname]
        except KeyError:
            raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
                               .format(dbname))
    # Note: in dnspython >= 2.0 this method is named dns.resolver.resolve()
    answers = dns.resolver.query('_' + shard + '-analytics._tcp.eqiad.wmnet', 'SRV')
    host, port = str(answers[0].target), answers[0].port
    return (host, port)
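To sanity-check the mapping logic without a real mediawiki-config checkout, here is a self-contained sketch that builds made-up dblist files in a temporary directory (the wiki names and section assignments are invented for the demo). It also demonstrates why str.rstrip must not be used to drop the '.dblist' extension: rstrip strips a set of characters, not a suffix.

```python
import glob
import os
import tempfile

def build_db_mapping(mw_config_path):
    """Map each wiki db name to its section, from dblists/s[0-9]*.dblist files."""
    mapping = {}
    for path in glob.glob(os.path.join(mw_config_path, "dblists", "s[0-9]*.dblist")):
        # removesuffix (Python 3.9+) is the safe way to drop the extension.
        section = os.path.basename(path).removesuffix(".dblist")
        with open(path) as f:
            for db in f:
                if db.strip():
                    mapping[db.strip()] = section
    return mapping

# Demo with a throwaway directory and made-up dblist contents:
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "dblists"))
with open(os.path.join(tmp, "dblists", "s2.dblist"), "w") as f:
    f.write("itwiki\nnlwiki\n")
with open(os.path.join(tmp, "dblists", "s5.dblist"), "w") as f:
    f.write("dewiki\n")
mapping = build_db_mapping(tmp)
# mapping is {'itwiki': 's2', 'nlwiki': 's2', 'dewiki': 's5'}
```

The rstrip pitfall in one line: 'all.dblist'.rstrip('.dblist') evaluates to 'a', because every trailing character in the set {., d, b, l, i, s, t} is removed.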

Example of a Python script to retrieve data for a specific wiki from a Notebook (credits to Neil for the work!):

import dns.resolver
import mysql.connector as mysql


query = "select count(*) from recentchanges"
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
    host=str(ans.target), 
    port=ans.port, 
    database="hewiki",
    option_files='/etc/mysql/conf.d/analytics-research-client.cnf',
    charset='binary',
    autocommit=False
)
cursor = conn.cursor()
cursor.execute(query)
cursor.fetchall()

This is of course just a quick solution; it could be done in any number of ways. If you have comments, suggestions, or improvements, please feel free to improve it :)

The first function, get_mediawiki_section_dbname_mapping, needs the path of a local checkout of the mediawiki-config repository. If you run your script on a stat100* or notebook100* host, you can find the repository checked out under /srv/mediawiki-config. This is one source of the mapping between MediaWiki db sections (s1..s8, x1) and wiki names. For example, itwiki is listed in two files:

  • dblists/s2.dblist
  • dblists/all.dblist

These two files mean two things:

1) itwiki's database is available on s2-analytics-replica.eqiad.wmnet

2) itwiki's database is also available on x1-analytics-replica.eqiad.wmnet (useful if you need the tables related to extensions, since x1 contains a database for every wiki).

This mapping is a prerequisite for get_dbstore_host_port, which starts from it and uses the DNS SRV records as explained above.

MySQL wrapper

On all the Analytics nodes (stat1008 to stat1011) there is a tool called analytics-mysql that simplifies connecting to these databases:

elukey@stat1006:~$ analytics-mysql itwiki --print-target
dbstore1004.eqiad.wmnet:3312

elukey@stat1006:~$ analytics-mysql itwiki --print-target --use-x1
dbstore1005.eqiad.wmnet:3320

elukey@stat1006:~$ analytics-mysql itwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 465860
Server version: 10.1.37-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1004.eqiad.wmnet. [itwiki]> Bye

elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables limit 3'
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'limit 3' at line 1
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables'
+--------------------------+
| Tables_in_itwiki         |
+--------------------------+
[..]

It’s possible that you’re not allowed to run analytics-mysql directly, but are a member of a group that is allowed to run it, for example:

lucaswerkmeister-wmde@stat1007:~$ analytics-mysql wikidatawiki
ERROR 1045 (28000): Access denied for user 'lucaswerkmeister-wmde'@'10.64.21.118' (using password: NO)
lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1233204
Server version: 10.4.15-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]>
  1. The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (.cnf) files. The file might be /etc/mysql/conf.d/research-client.cnf, or /etc/mysql/conf.d/analytics-research-client.cnf. These files are automatically referenced from /etc/mysql/my.cnf, which the command line mysql program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying .cnf file.
  2. The mysql utility doesn't have the ability to generate files in other formats like CSV.
  3. The file extension you choose doesn't actually affect the command, but since -e generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.