MySQL
See also MariaDB.
Editing db.php
db.php is MediaWiki's configuration file for database connections. It sets a variable called $wgLBFactoryConf.
Common tasks
- Depooling
- If the server is down, comment it out in sectionLoads and (if necessary) groupLoadsByDB. To remove load from an overloaded but up server, set its load ratio in sectionLoads to zero. A server which is in sectionLoads but with zero load will delay maintenance scripts, they will wait for it to have low replication lag.
- Load tuning
- The load numbers in sectionLoads are ratios. They can be converted to a load fraction by dividing by the sum of loads in that section. So for example if you have (db1 => 100, db2 => 200), then db1 will have 33% and db2 will have 67%. Tweak these values to balance the I/O load as seen in Ganglia.
- Master switch
- The first server in a given section in sectionLoads is the master, so you can change the master by changing the order of servers or commenting out the first server. See switch master for the full procedure. The master should generally have zero read load, because the load from applications requiring a master connection will be substantial.
- Reclaiming disk space
- See Reclaim space on full db servers.
Comprehensive documentation
The primary source of documentation is includes/db/LBFactory_Multi.php, copied below:
- sectionsByDB
- A map of database names to section names
- sectionLoads
- A 2-d map. For each section, gives a map of server names to load ratios. For example: array( 'section1' => array( 'db1' => 100, 'db2' => 100 ) )
- serverTemplate
- A server info associative array as documented for $wgDBservers. The host, hostName and load entries will be overridden.
- groupLoadsBySection
- A 3-d map giving server load ratios for each section and group. For example: array( 'section1' => array( 'group1' => array( 'db1' => 100, 'db2' => 100 ) ) )
- groupLoadsByDB
- A 3-d map giving server load ratios by DB name.
- hostsByName
- A map of hostname to IP address.
- externalLoads
- A map of external storage cluster name to server load map
- externalTemplateOverrides
- A set of server info keys overriding serverTemplate for external storage
- templateOverridesByServer
- A 2-d map overriding serverTemplate and externalTemplateOverrides on a server-by-server basis. Applies to both core and external storage.
- templateOverridesByCluster
- A 2-d map overriding the server info by external storage cluster
- masterTemplateOverrides
- An override array for all master servers.
- readOnlyBySection
- A map of section name to read-only message. Missing or false for read/write.
Replication lag
Replication lag can be checked on https://dbtree.wikimedia.org/. Current tips are being gradually added to MariaDB/troubleshooting#Replication lag.
When a transaction completes on the master database server, the transaction is written to the binlog, and then the slave servers begin executing it. Replication lag is the time from when the binlog entry was written, to the time the slave server finishes executing that entry.
Replication lag is a performance problem, and so the solution is potentially complex, involving software optimisation or hardware expansion.
Long-running reads
The most common cause is when a long-running read query executes on the slave, thrashing the disk and causing all other queries (including write queries) to execute slowly. The more severe incarnation of this is when large numbers of heavy read queries are queued up on all slaves, all slaves become lagged, and the master dies due to stale data failover in MediaWiki.
To diagnose this, run the SQL query SHOW FULL PROCESSLIST and save the result for later analysis. Look queries that have been running for longer than a few minutes. Identify common features of these queries, such as the function comment. Note that if the server is running very slowly, innocuous queries will become long-running. Generally the longest-running of all queries is the culprit.
The best emergency response to this, when general site performance is affected, is to first kill all the long-running queries. You can do this with shellscript - DO NOT RUN IT UNLESS YOU ARE FULLY SURE YOU KNOW WHAT YOU ARE DOING:
hosts="server1 server2 server3" pattern="BadClass::overloadServers" for host in $hosts; do db-mysql $host -e 'show processlist' | grep "$pattern" | awk '{print "kill",$1,";"}' | db-mysql $host done
This typically buys you a few minutes while the users resubmit their queries. Use these few minutes to identify the culprit feature, to patch it out and deploy. You can buy a few minutes more by running the kill script in a loop.
General overload
If SHOW PROCESSLIST shows large numbers of read queries running, but the read queries are just the usual MediaWiki traffic, and nothing new or special stands out, then this suggests a general overload. Check ganglia, iostat, etc. for the server in question. If one server is especially overloaded, reduce its load ratio. If all slaves are overloaded, buy more hardware.
Listing queries
From any cumin host
db-mysql $host -e 'show processlist' db-mysql $host -e 'show full processlist'
Checking queries from a specific IP:
db-mysql $host mysql> select * from information_schema.processlist where user='wikiuser' and Info like '%<IP>%';
Killing queries
DO NOT RUN THIS UNLESS YOU KNOW WHAT YOU ARE DOING
examples:
mysql> kill <id>;
all the busy ones: pt-kill --match-user wikiuser --match-command Query --busy-time 120 --interval 1 --kill --print --victims all
from a certain IP: pt-kill --match-user wikiuser --match-command Query --match-info "<IP>" --interval 1 --kill --print --victims all
High write rate
If SHOW PROCESSLIST shows few running read queries, but the writer thread is constantly busy, and the replication lag is increasing, this implies a write-driven overload.
(Note that if lag is decreasing in this scenario, it implies that some previous event caused the replication lag but that the server is recovering.)
The solution to this is to identify the source of the writes, and to throttle them appropriately. Maintenance scripts, for instance, need to do short-running queries and to call wfWaitForSlaves(5); regularly.
Hardware and configuration problems can also be responsible. A server with write-behind caching disabled (either due to BIOS configuration or a dead RAID controller battery) will have very slow write performance and so will tend to exhibit write-driven lag.
Similarly, servers with innodb_flush_log_at_trx_commit=1 will be unacceptably slow for our write load. It should be set to zero instead.
Broken replication
Broken replication can be identified by a constantly increasing replication lag with no replicated write queries running. SHOW SLAVE STATUS will typically show an error message which allows you identify the problem in more detail.
Connection errors, when the slave fails to connect to the master, are a common problem for new slaves. These show up in the server's error log.
Unusual errors on established slaves may indicate data corruption. After identifying the source of the problem, the simplest solution is to wipe slave's hard drive and resynchronise it from another slave.
Maintenance Recipes
clear out old binlogs to free up space
Sometimes binlogs don't get automatically expunged. To safely delete old unneeded binlogs (good to keep 5-7 days worth if you can):
- look at the time stamps of the binlogs and decide the oldest one you want to keep (eg .010)
- purge all older logs:
PURGE BINARY LOGS TO 'mysql-bin.010';
This command can be safely run on a master. More info can be found in the mysql documentation for purge binary logs.
misc. snippets
show database / table sizes
size of one database, tables in detail
SELECT TABLE_NAME, table_rows, data_length, index_length, round(((data_length + index_length) / 1024 / 1024),2) "Size in MB" FROM information_schema.TABLES WHERE table_schema = "schema_name";
all databases
SELECT table_schema "Data Base Name", SUM( data_length + index_length) / 1024 / 1024 "Data Base Size in MB" FROM information_schema.TABLES GROUP BY table_schema ;