Help:MySQL queries

From Wikitech
Jump to navigation Jump to search

MySQL queries against the Wiki Replicas databases allow you to search and collect metadata about pages and actions on the Wikimedia movement's public wikis. Also consider the Quarry web interface as an alternative for one-off queries.

Database layout

The database layout is available at the MediaWiki wiki: mw:Manual:Database layout.

A dump of the currently running database layouts can be found here.

There are also two commands you can use to view the layout.   SHOW TABLES   will show the available tables in a database.   DESCRIBE table_name   will show the available columns in a specific table.

Slices

The various databases are stored in 'slices'. The slices are named with a leading 's' and a digit—for example s1, s2, etc.—followed by the internal domain name '{analytics,web}.db.svc.eqiad.wmflabs' (e.g. s1.analytics.db.svc.eqiad.wmflabs and s1.web.db.svc.eqiad.wmflabs). A table of this information is available here: List meta_p.wiki.

Data storage

There are a few tricks to how data is stored in the various tables.

  1. Page titles use underscores and never include the namespace prefix. (eg: page_title='The_Lord_of_the_Rings')
  2. User names use spaces, not underscores. (eg: actor_name='Jimbo Wales')
  3. Namespaces are integers. A key to the integers is available here.

Views

Toolforge has exact replicas of Wikimedia's databases, however, certain information is restricted using MySQL views.

For example, the user table view does not show things like user_password or user_email to Toolforge users. You can only access data from the public (redacted) databases marked as _p (eg: enwiki_p).

Alternative views

There are some alternative views in which the data from the underlying tables are redacted in a different way, so that the corresponding indices can be used. If you utilize the listed columns in your queries (especially in WHERE clause or ORDER BY statement) it is recommended to use these alternative views. This will speed up things quite a lot.

columns canonical view alternative view (recommended)
ar_actor archive archive_userindex
fa_actor filearchive filearchive_userindex
log_namespace logging logging_logindex
log_title
log_page
log_actor logging logging_userindex
oi_actor oldimage oldimage_userindex
rc_actor recentchanges recentchanges_userindex
rev_actor revision revision_userindex
Technical bit

The script that creates and maintains the indexes used by the alternative views is called maintain_replica_indexes.py. This script uses the definitions found in index-conf.yaml to generate those indexes. That last file is where you can learn, for instance, that an additional index called log_actor_deleted is being created on the logging table using log_actor, log_deleted columns.

The alternative tables boost the queries not just using indexes; they also do so by subsetting the original table. For instance, actor_revision table only includes those actors that match a row in the revision table's rev_actor column. The script that populates the alternative views can be found in maintain-views.py; this script uses maintain-views.yaml to populate those views. The last file is where you can find the definition of actor_revision table, for instance.

Wiki text

Unfortunately there is no way to access the wikitext directly via MySQL replica databases. You have to use the mw:API instead.

Accessing the databases

There are a variety of ways to access the databases.

preset shell script

From the command line, a shell script exists that automatically selects the correct slice for you. The sql shell script is a wrapper around the mysql command. Either method works, though the sql shell script selects the appropriate slice for you.

$ sql enwiki_p

This command will connect to the appropriate cluster (in this case, s1.analytics.db.svc.eqiad.wmflabs) and give you a MySQL prompt where you can run queries.

$ sql enwiki_p < test-query.sql > test-query.txt

This command takes a .sql file that contains your query, selects the appropriate slice, runs the query, and outputs to a text file. The advantage here is that it doesn't add lines around the data (making a table), it instead outputs the data in a tab-delimited format.

$ sql enwiki_p < test-query.sql | gzip >test-query.txt.gz

This command does the same thing as the command above, but after outputting to a text file, it gzips the data. This can be very helpful if you're dealing with large amounts of data.

mysql command

If you wish to use the mysql command directly, a sample query would look like this:

$ mysql --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.eqiad.wmflabs enwiki_p -e "SHOW databases;"

The --defaults-file option points to the file replica.my.cnf where your MySQL credentials are stored (username, password). The -h option tells MySQL which host to access (in this case s1.analytics.db.svc.eqiad.wmflabs) and enwiki_p is the database you want to access. The -e option tells MySQL to execute the following command enclosed in quotes. You can also have MySQL output the results to a file.

$ mysql --defaults-file=~/replica.my.cnf -h s1.analytics.db.svc.eqiad.wmflabs enwiki_p < test-query.sql > test-query.txt

mysql command with default

Another way to use the mysql command directly but without the need to specify the --defaults-file option each time, is to copy your replica.my.cnf to .my.cnf once.

$ ln -s replica.my.cnf .my.cnf

From that point your login data is automatically loaded by the mysql command. All subsequent commands then look similar as shown above.

$ mysql -h s1.analytics.db.svc.eqiad.wmflabs -e "show databases;"

Writing queries

Because the replica databases are read-only, nearly all of the queries you will want to run will be SELECT queries.

SELECT * FROM `user`;

This query selects all columns from the user table. More information about MySQL queries are available below (in the example queries) and in the MySQL SELECT.

Queries end in a semi-colon (;). If you want to cancel the query, end it in \c. If you want to output in a non-table format, use \G.

Optimizing queries

To optimize queries, you can ask the MySQL server how the query is executed (a so-called EXPLAIN query). Because of the permissions restrictions on the Wiki Replica servers, a simple EXPLAIN query will not work. There is however a workaround using MariaDB's SHOW EXPLAIN FOR:

  1. Open 2 SQL sessions connected to the same backend database server
  2. In session 1:
    • SELECT CONNECTION_ID() AS conid;
    • Note the number returned.
    • Run the query to be explained.
  3. In session 2:
    • Use the number you found above for <conid>.
    • SHOW EXPLAIN FOR <conid>;

If the rightmost column contains 'using filesort', your query is likely to take a long time to execute. Review whether the correct index is used, and consider switching to one of the alternative views.

A helpful tool for this is the online SQL Optimizer, which will display the execution plan for a given query. Quarry can also show the plan for a currently executing query.

meta_p database

The "meta_p" database contains metadata about the various wikis and databases. It consists of one table: wiki and it is available on every slice (s1–s8.analytics.db.svc.eqiad.wmflabs).

List meta_p.wiki

Example queries

See also

(This page was originally imported from http://wiki.toolserver.org/view/MySQL_queries)