Help:Wiki Replicas/Queries
SQL queries against the Wiki Replicas databases allow you to search and collect metadata about pages and actions on the Wikimedia movement's public wikis. Also consider the Superset web interface as an alternative for one-off queries.
Database layout
The database layout is available at mw:Manual:Database layout.
There are also two commands you can use to view the layout. SHOW TABLES will show the available tables in a database. DESCRIBE table_name will show the available columns in a specific table.
Sections
The various databases are stored in 'sections'. The sections are named with a leading 's' and a digit—for example s1, s2, etc.—followed by the internal domain name '{analytics,web}.db.svc.wikimedia.cloud' (e.g. s1.analytics.db.svc.wikimedia.cloud and s1.web.db.svc.wikimedia.cloud).
Data storage
There are a few tricks to how data is stored in the various tables.
- Page titles use underscores and never include the namespace prefix. (eg: page_title='The_Lord_of_the_Rings')
- User names use spaces, not underscores. (eg: actor_name='Jimbo Wales')
- Namespaces are integers. A key to the integers is available at mw:Manual:Namespace.
Views
Toolforge has exact replicas of Wikimedia's databases, however, certain information is restricted using MariaDB views.
For example, the user table view does not show things like user_password or user_email to Toolforge users. You can only access data from the public (redacted) databases marked as _p (eg: enwiki_p).
Alternative views
There are some alternative views in which the data from the underlying tables are redacted in a different way, so that the corresponding indices can be used. If you utilize the listed columns in your queries (especially in WHERE clause or ORDER BY statement) it is recommended to use these alternative views. This will speed up things quite a lot.
columns | canonical view | alternative view (recommended) |
---|---|---|
actor
|
see here | |
ar_actor |
archive |
archive_userindex
|
comment
|
see here | |
fa_actor |
filearchive |
filearchive_userindex
|
ipb_address
|
ipblocks
|
ipblocks_ipindex
|
log_namespace |
logging |
logging_logindex
|
log_title
| ||
log_page
| ||
log_actor |
logging |
logging_userindex
|
oi_actor |
oldimage |
oldimage_userindex
|
rc_actor |
recentchanges |
recentchanges_userindex
|
rev_actor |
revision |
revision_userindex
|
- Technical bit
The script that creates and maintains the indexes used by the alternative views is called maintain_replica_indexes.py. This script uses the definitions found in index-conf.yaml to generate those indexes. That last file is where you can learn, for instance, that an additional index called log_actor_deleted
is being created on the logging
table using log_actor, log_deleted
columns.
The alternative tables boost the queries not just using indexes; they also do so by subsetting the original table. For instance, actor_revision
table only includes those actors that match a row in the revision
table's rev_actor
column. The script that populates the alternative views can be found in maintain-views.py; this script uses maintain-views.yaml to populate those views. The last file is where you can find the definition of actor_revision
table, for instance.
Wiki text
Unfortunately there is no way to access the wikitext directly via the replica databases. You have to use the mw:API instead.
Accessing the databases
There are a variety of ways to access the databases.
preset shell script
From the command line, a shell script exists that automatically selects the correct sections for you. The sql shell script is a wrapper around the mariadb
command. Either method works, though the sql shell script selects the appropriate sections for you.
$ sql enwiki_p
This command will connect to the appropriate cluster (in this case, s1.analytics.db.svc.wikimedia.cloud) and give you a prompt where you can run queries.
$ sql enwiki_p < test-query.sql > test-query.txt
This command takes a .sql file that contains your query, selects the appropriate sections, runs the query, and outputs to a text file. The advantage here is that it doesn't add lines around the data (making a table), it instead outputs the data in a tab-delimited format.
$ sql enwiki_p < test-query.sql | gzip >test-query.txt.gz
This command does the same thing as the command above, but after outputting to a text file, it gzips the data. This can be very helpful if you're dealing with large amounts of data.
mariadb command
If you wish to use the mariadb command directly, a sample query would look like this:
$ mariadb --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.wikimedia.cloud enwiki_p -e "SHOW databases;"
The --defaults-file option points to the file replica.my.cnf where your credentials are stored (username, password). The -h option tells the client which host to access (in this case s1.analytics.db.svc.wikimedia.cloud) and enwiki_p is the database you want to access. The -e option specifies a a command to execute enclosed in quotes. You can also pipe the results to a file.
$ mariadb --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.wikimedia.cloud enwiki_p < test-query.sql > test-query.txt
mariadb command with default
Another way to use the mariadb command directly but without the need to specify the --defaults-file option each time, is to copy your replica.my.cnf to .my.cnf once.
$ ln -s replica.my.cnf .my.cnf
From that point your login data is automatically loaded by the mariadb
command. All subsequent commands then look similar as shown above.
$ mariadb -h s1.analytics.db.svc.wikimedia.cloud -e "show databases;"
Writing queries
Because the replica databases are read-only, nearly all of the queries you will want to run will be SELECT queries.
SELECT * FROM `user`;
This query selects all columns from the user table. More information about SQL queries are available below (in the example queries) and in the MariaDB documentation.
Queries end in a semi-colon (;). If you want to cancel the query, end it in \c. If you want to output in a non-table format, use \G.
Optimizing queries
To optimize queries, you can ask the server how the query is executed (a so-called EXPLAIN query). Because of the permissions restrictions on the Wiki Replica servers, a simple EXPLAIN query will not work. There is however a workaround using MariaDB's SHOW EXPLAIN FOR:
- Open 2 SQL sessions connected to the same backend database server
- In session 1:
SELECT CONNECTION_ID() AS conid;
- Note the number returned.
- Run the query to be explained.
- In session 2:
- Use the number you found above for <conid>.
SHOW EXPLAIN FOR <conid>;
If the rightmost column contains 'using filesort', your query is likely to take a long time to execute. Review whether the correct index is used, and consider switching to one of the alternative views.
A helpful tool for this is the online SQL Optimizer, which will display the execution plan for a given query. Superset can also show the plan for a currently executing query.
SHOW EXPLAIN FOR
. Successfully using SHOW EXPLAIN FOR
requires both sessions to be connected to the same instance.meta_p database
The "meta_p" database contains metadata about the various wikis and databases. It consists of one table: wiki and it is available on meta.{ANALYTICS,WEB}.db.svc.wikimedia.cloud
.
Example queries
- Help:Wiki Replicas/Queries/Example queries contains a large collection of examples which were once in this page directly.
- en:Wikipedia:Database reports (and its equivalents in other languages) contain many useful examples.
- queries on Superset are another good source of examples