Data Platform/Systems/MediaWiki replicas
The Data Platform MediaWiki replicas host full, unredacted copies of the main MediaWiki databases (MariaDB clusters s1-8) and the x1 extension storage cluster. Note that the full text of MediaWiki page revisions is not included, as that is stored in the external storage clusters.
The easiest way to query them is using Wmfdata-Python.
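For example, a minimal sketch using Wmfdata-Python (the exact call may differ between Wmfdata versions, so check its documentation):

import wmfdata

# Run a query against the itwiki replica and get the result back as a pandas DataFrame
edits = wmfdata.mariadb.run("SELECT COUNT(*) AS edits FROM recentchanges", "itwiki")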
Clients
Analytics MariaDB is used in two ways:
- user queries (Jupyter and the analytics-mysql wrapper described below)
- a monthly Sqoop job that imports a few private tables into Hadoop at the start of the month
Database setup
The production databases are split up into a number of shards. This system has the same setup, with a separate MariaDB database instance for each shard.
Each shard has a convenience hostname that points to it:
- s1-analytics-replica.eqiad.wmnet
- s2-analytics-replica.eqiad.wmnet
- ...
- s8-analytics-replica.eqiad.wmnet
- x1-analytics-replica.eqiad.wmnet
- staging-db-analytics.eqiad.wmnet
The main gotcha is figuring out which port to use for each section since, as noted above, multiple MariaDB instances run on these hosts. This is the scheme to use (a small Python illustration follows the list):
- 331 followed by the section digit for sX. Example: s5 is accessible at s5-analytics-replica.eqiad.wmnet:3315
- 3320 for x1. Example: x1-analytics-replica.eqiad.wmnet:3320
- 3350 for staging
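As a quick illustration of the scheme above, a small hypothetical helper (the function name is ours, not part of any provided tooling):

def analytics_replica_port(section):
    """Return the port of the analytics replica for a section (s1-s8, x1, or staging)."""
    if section == 'staging':
        return 3350
    if section == 'x1':
        return 3320
    # sN sections: 331 followed by the section digit, e.g. s5 -> 3315
    return 3310 + int(section[1:])

print(analytics_replica_port('s5'))  # 3315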
We have created DNS SRV records to make it easier to use the dbstore shards from scripts (see the Python functions below for an example):
- _s1-analytics._tcp.eqiad.wmnet
- ...
- _s8-analytics._tcp.eqiad.wmnet
- _x1-analytics._tcp.eqiad.wmnet
- _staging-analytics._tcp.eqiad.wmnet
These records look "weird" if you are not familiar with them, but they allow us to get a hostname/port combination from one DNS name! Please note: the mysql client is sadly not capable of reading those, so you cannot use them directly with it.
You can access these analytics replicas from either stat1007 or stat1006. To use the mysql client interactively, type mysql -h sX-analytics-replica.eqiad.wmnet -P 331X -A. The -A disables tab autocompletion, which can be bothersome when pasting multi-line queries into the prompt. You'll then be dropped into the MySQL command line.[1]
If you'd rather generate a TSV file[2] and retrieve it later, you can also do so from the command line:
mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 {{database name}} -e "{{your query;}}" > {{filename}}.tsv
It'll go off to generate the file on its own.[3]
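The output of mysql -e is tab-separated with a header row, so the file can be loaded back into Python with pandas, for example (the filename is just a placeholder):

import pandas as pd

# mysql -e writes tab-separated values with column names on the first line
df = pd.read_csv('results.tsv', sep='\t')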
As well as connecting directly, it's also possible to connect programmatically from your language of choice, be it R or Python. For Python, we have the MySQLdb module installed on stat1006 and stat1007. For R, we have RMySQL.
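For example, a minimal MySQLdb sketch (the wiki, section, and port below are only an illustration; the option file is the stat1007 one described in the footnotes):

import MySQLdb

# enwiki lives on s1, so we connect to the s1 replica on port 3311
conn = MySQLdb.connect(
    host='s1-analytics-replica.eqiad.wmnet',
    port=3311,
    db='enwiki',
    read_default_file='/etc/mysql/conf.d/analytics-research-client.cnf',  # credentials
)
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM recentchanges')
print(cur.fetchone())
conn.close()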
You can use nslookup to find the port. For example:
$ nslookup -q=SRV _x1-analytics._tcp.eqiad.wmnet
Server: 10.3.0.1
Address: 10.3.0.1#53
Non-authoritative answer:
_x1-analytics._tcp.eqiad.wmnet service = 0 1 3320 dbstore1005.eqiad.wmnet.
The following Python snippets are an example of how to find the correct MySQL hostname and port combination for a given wiki database name:
# The second function needs dnspython to work
import glob
import dns.resolver

def get_mediawiki_section_dbname_mapping(mw_config_path, use_x1):
    """Map each wiki database name to its MediaWiki section, based on the dblist
    files in a local checkout of the mediawiki-config repository."""
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/dblists/all.dblist']
    else:
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/dblists/s[0-9]*.dblist')
    for dblist_section_path in dblist_section_paths:
        # The section name is the file name without the .dblist extension, e.g. 's2'
        section = dblist_section_path.split('/')[-1].rsplit('.', 1)[0]
        with open(dblist_section_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = section
    return db_mapping
def get_dbstore_host_port(db_mapping, use_x1, dbname):
    """Resolve the analytics replica host and port for a wiki using DNS SRV records."""
    if dbname == 'staging':
        shard = 'staging'
    elif use_x1:
        shard = 'x1'
    else:
        try:
            shard = db_mapping[dbname]
        except KeyError:
            raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
                               .format(dbname))
    answers = dns.resolver.query('_' + shard + '-analytics._tcp.eqiad.wmnet', 'SRV')
    host, port = str(answers[0].target), answers[0].port
    return (host, port)
Example of a Python script to retrieve data for a specific wiki from a Notebook (credits to Neil for the work!):
import dns.resolver
import mysql.connector as mysql

query = "select count(*) from recentchanges"

# Find the s7 host and port via its DNS SRV record (hewiki is on the s7 section)
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
    host=str(ans.target),
    port=ans.port,
    database="hewiki",
    option_files='/etc/mysql/conf.d/analytics-research-client.cnf',
    charset='binary',
    autocommit=False
)
cursor = conn.cursor()
cursor.execute(query)
cursor.fetchall()
This is of course a quick solution; it can be done in any number of ways. If you have comments, suggestions, or improvements, please feel free to improve it :)
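If you prefer working with pandas, the rows can be loaded into a DataFrame instead of the bare fetchall() above; a minimal sketch reusing the cursor from the previous snippet:

import pandas as pd

rows = cursor.fetchall()
# cursor.description holds column metadata; the first element of each entry is the column name
df = pd.DataFrame(rows, columns=[col[0] for col in cursor.description])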
The first function, get_mediawiki_section_dbname_mapping, needs the path of a local checkout of the mediawiki-config repository. If you run your script on a stat100* or notebook100* host, you can find the repository checked out under /srv/mediawiki-config. This is one source of the mapping from MediaWiki db sections (s1..s8, x1) to wiki names. For example, itwiki is listed in two files:
- dblists/s2.dblist
- dblists/all.dblist
The above files mean two things:
1) itwiki's database will be available on s2-analytics-replica.eqiad.wmnet
2) itwiki's database will be available on x1-analytics-replica.eqiad.wmnet (if we need the tables related to the extensions, since x1 contains all the wikis).
This mapping is a prerequisite for get_dbstore_host_port, which starts from the mapping described above and uses the DNS SRV records explained earlier.
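Putting the two functions together, a usage sketch (the resolved hostname depends on the current dbstore topology, so the value below is only an example):

db_mapping = get_mediawiki_section_dbname_mapping('/srv/mediawiki-config', use_x1=False)
host, port = get_dbstore_host_port(db_mapping, use_x1=False, dbname='itwiki')
print(host, port)  # e.g. dbstore1004.eqiad.wmnet. 3312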
MySQL wrapper
On all the Analytics nodes (stat10[08-11]) there is a tool called analytics-mysql that makes it easier to use these databases:
elukey@stat1006:~$ analytics-mysql itwiki --print-target
dbstore1004.eqiad.wmnet:3312
elukey@stat1006:~$ analytics-mysql itwiki --print-target --use-x1
dbstore1005.eqiad.wmnet:3320
elukey@stat1006:~$ analytics-mysql itwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 465860
Server version: 10.1.37-MariaDB MariaDB Server
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql:research@dbstore1004.eqiad.wmnet. [itwiki]> Bye
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables limit 3'
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'limit 3' at line 1
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables'
+--------------------------+
| Tables_in_itwiki |
+--------------------------+
[..]
It’s possible that you’re not allowed to run analytics-mysql directly, but are a member of a group that is allowed to run it, for example:
lucaswerkmeister-wmde@stat1007:~$ analytics-mysql wikidatawiki
ERROR 1045 (28000): Access denied for user 'lucaswerkmeister-wmde'@'10.64.21.118' (using password: NO)
lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 1233204
Server version: 10.4.15-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [wikidatawiki]>
- ↑ The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (.cnf) files. On stat1006, the file is /etc/mysql/conf.d/research-client.cnf, while on stat1007 it's /etc/mysql/conf.d/analytics-research-client.cnf. These files are automatically referenced from /etc/mysql/my.cnf, which the command-line mysql program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying .cnf file.
- ↑ The mysql utility doesn't have the ability to generate files in other formats like CSV.
- ↑ The file extension you choose doesn't actually affect the command, but since -e generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.