Help:Wiki Replicas/Queries

SQL queries against the Wiki Replicas databases allow you to search and collect metadata about pages and actions on the Wikimedia movement's public wikis. Also consider the Superset web interface as an alternative for one-off queries.

Database layout

The database layout is available at mw:Manual:Database layout.

There are also two commands you can use to view the layout. SHOW TABLES will show the available tables in a database. DESCRIBE table_name will show the available columns in a specific table.
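For example, after connecting to enwiki_p you could run the following (page is one of the standard MediaWiki tables):

SHOW TABLES;
DESCRIBE page;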

Sections

The various databases are stored in 'sections'. The sections are named with a leading 's' and a digit—for example s1, s2, etc.—followed by the internal domain name '{analytics,web}.db.svc.wikimedia.cloud' (e.g. s1.analytics.db.svc.wikimedia.cloud and s1.web.db.svc.wikimedia.cloud).

Data storage

There are a few tricks to how data is stored in the various tables (see the example queries after this list).

  1. Page titles use underscores and never include the namespace prefix. (e.g. page_title='The_Lord_of_the_Rings')
  2. User names use spaces, not underscores. (e.g. actor_name='Jimbo Wales')
  3. Namespaces are integers. A key to the integers is available at mw:Manual:Namespace.
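For example, a couple of lookups that follow these conventions (a minimal sketch; the title and user name are only illustrative):

-- underscores and no namespace prefix in page_title; the namespace is an integer
SELECT page_id, page_len
FROM page
WHERE page_namespace = 0
  AND page_title = 'The_Lord_of_the_Rings';

-- user names use spaces in actor_name
SELECT actor_id
FROM actor
WHERE actor_name = 'Jimbo Wales';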

Views

Toolforge has replicas of Wikimedia's production databases; however, certain information is restricted using MariaDB views.

For example, the user table view does not expose fields such as user_password or user_email to Toolforge users. You can only access data from the public (redacted) databases, whose names end in _p (e.g. enwiki_p).

Alternative views

There are some alternative views in which the data from the underlying tables is redacted in a different way, so that the corresponding indexes can be used. If your queries use the listed columns (especially in a WHERE clause or an ORDER BY), it is recommended to use these alternative views; this can speed things up considerably (see the example after the table).

columns                               canonical view    alternative view (recommended)
actor                                 see the note below
ar_actor                              archive           archive_userindex
comment                               see the note below
fa_actor                              filearchive       filearchive_userindex
ipb_address                           ipblocks          ipblocks_ipindex
log_namespace, log_title, log_page    logging           logging_logindex
log_actor                             logging           logging_userindex
oi_actor                              oldimage          oldimage_userindex
rc_actor                              recentchanges     recentchanges_userindex
rev_actor                             revision          revision_userindex
Also see News/2019 Actor storage changes on the Wiki Replicas for additional information about slow performance when using the actor and/or comment tables.
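For example, a query for a user's most recent edits that goes through revision_userindex, and so can make use of its rev_actor index, might look like this (a sketch; the user name is only illustrative):

SELECT rev_id, rev_timestamp
FROM revision_userindex
JOIN actor ON actor_id = rev_actor
WHERE actor_name = 'Jimbo Wales'
ORDER BY rev_timestamp DESC
LIMIT 20;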
Technical bit

The script that creates and maintains the indexes used by the alternative views is called maintain_replica_indexes.py. It uses the definitions found in index-conf.yaml to generate those indexes. That file is where you can learn, for instance, that an additional index called log_actor_deleted is created on the logging table using the log_actor and log_deleted columns.

The alternative views speed up queries not only through indexes but also by subsetting the original table. For instance, the actor_revision view only includes those actors that match a row in the revision table's rev_actor column. The script that creates the alternative views is maintain-views.py; it uses the definitions in maintain-views.yaml. That file is where you can find the definition of the actor_revision view, for instance.

Wiki text

Unfortunately, there is no way to access the wikitext directly via the replica databases. You have to use the mw:API instead.
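For example, one way to fetch the current wikitext of a page through the API, using the standard query/revisions modules (a sketch; the page title is only illustrative):

$ curl 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&titles=The_Lord_of_the_Rings&formatversion=2&format=json'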

Accessing the databases

There are a variety of ways to access the databases.

Preset shell script

From the command line, you can use the sql shell script, which is a wrapper around the mariadb command. Either method works, but the sql script automatically selects the appropriate section for you.

$ sql enwiki_p

This command will connect to the appropriate cluster (in this case, s1.analytics.db.svc.wikimedia.cloud) and give you a prompt where you can run queries.

$ sql enwiki_p < test-query.sql > test-query.txt

This command takes a .sql file that contains your query, selects the appropriate section, runs the query, and outputs to a text file. The advantage here is that it doesn't draw lines around the data (making a table); instead it outputs the data in a tab-delimited format.
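For instance, test-query.sql could contain something as simple as:

SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 100;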

$ sql enwiki_p < test-query.sql | gzip > test-query.txt.gz

This command does the same thing as the one above, but pipes the output through gzip so the results are written as a compressed file. This can be very helpful if you're dealing with large amounts of data.

mariadb command

If you wish to use the mariadb command directly, a sample query would look like this:

$ mariadb --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.wikimedia.cloud enwiki_p -e "SHOW databases;"

The --defaults-file option points to the file replica.my.cnf where your credentials (username and password) are stored. The -h option tells the client which host to connect to (in this case s1.analytics.db.svc.wikimedia.cloud), and enwiki_p is the database you want to access. The -e option specifies a command to execute, enclosed in quotes. You can also pipe the results to a file.

$ mariadb --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.wikimedia.cloud enwiki_p < test-query.sql > test-query.txt

mariadb command with default

Another way to use the mariadb command directly without having to specify the --defaults-file option each time is to link (or copy) your replica.my.cnf to .my.cnf once.

$ ln -s replica.my.cnf .my.cnf

From that point on, your login data is automatically loaded by the mariadb command, and subsequent commands look like those shown above, just without the --defaults-file option.

$ mariadb -h s1.analytics.db.svc.wikimedia.cloud -e "show databases;"
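For reference, the replica.my.cnf (and thus .my.cnf) options file follows the standard MariaDB client configuration format and typically looks something like this (the values below are placeholders, not real credentials):

[client]
user = 'u12345'
password = 'xxxxxxxxxxxxxxxx'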

Writing queries

Because the replica databases are read-only, nearly all of the queries you will want to run will be SELECT queries.

SELECT * FROM `user`;

This query selects all columns from the user table. More information about SQL queries is available below (in the example queries) and in the MariaDB documentation.

Queries end in a semicolon (;). If you want to cancel a query you are typing, end it with \c instead. If you want the output in a vertical, non-table format, end the query with \G instead of the semicolon.
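For example, to display a single row of the page table in the vertical (non-table) layout:

SELECT * FROM page LIMIT 1\G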

Optimizing queries

To optimize queries, you can ask the server how a query is executed (a so-called EXPLAIN query). Because of the permission restrictions on the Wiki Replica servers, a simple EXPLAIN query will not work. There is, however, a workaround using MariaDB's SHOW EXPLAIN FOR (see the example after these steps):

  1. Open 2 SQL sessions connected to the same backend database server
  2. In session 1:
    • SELECT CONNECTION_ID() AS conid;
    • Note the number returned.
    • Run the query to be explained.
  3. In session 2:
    • Use the number you found above for <conid>.
    • SHOW EXPLAIN FOR <conid>;
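As a sketch, the two sessions might look like this (the connection id 12345 is just a placeholder for whatever CONNECTION_ID() returned):

-- session 1
SELECT CONNECTION_ID() AS conid;   -- suppose this returns 12345
SELECT COUNT(*) FROM revision WHERE rev_timestamp LIKE '2024%';

-- session 2, connected to the same server
SHOW EXPLAIN FOR 12345;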

If the rightmost column contains 'using filesort', your query is likely to take a long time to execute. Review whether the correct index is used, and consider switching to one of the alternative views.

A helpful tool for this is the online SQL Optimizer, which will display the execution plan for a given query. Superset can also show the plan for a currently executing query.

The Wiki Replica databases are split across several separate MariaDB instances, and SHOW EXPLAIN FOR only works when both sessions are connected to the same instance. Make sure to connect both sessions to the same database host when using it.

meta_p database

The "meta_p" database contains metadata about the various wikis and databases. It consists of one table: wiki and it is available on meta.{ANALYTICS,WEB}.db.svc.wikimedia.cloud.

(The columns of meta_p.wiki can be listed with DESCRIBE wiki; after connecting to the meta_p database.)
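For example, to list a few open Wikipedia databases and their URLs (a minimal sketch using the wiki table's dbname, url, family and is_closed columns):

SELECT dbname, url
FROM meta_p.wiki
WHERE family = 'wikipedia'
  AND is_closed = 0
LIMIT 10;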

Example queries
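As one small example, the following query returns the ten most recent revisions of an English Wikipedia article, using the conventions described above (the title is only illustrative):

SELECT rev_id, rev_timestamp
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 0
  AND page_title = 'The_Lord_of_the_Rings'
ORDER BY rev_timestamp DESC
LIMIT 10;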
