db-kill

From Wikitech

db-kill is an emergency-only tool, that simplifies the killing of long running queries on a host. It kills (until manually stopped) queries running for over 10 seconds or sleeping over 10 seconds on the given mysql instance. It is intended to complement the automatic query killer that kicks in at 60+ seconds and only kills queries longer than 60 seconds, when that is not enough to contain a query overload.

Details

db-kill is intended as an easy-to-use wrapper for pt-kill, with minimal options, for operators that are not too familiar with MariaDB administration.

Rather than connecting to a host, running SHOW PROCESSLIST and doing a bunch of complicated database commands, it tries to simplify by having some good defaults -when an emergency kill run is needed.

db-kill does:

  • kill only webrequests/jobqueue, with the use of a specific user (doesn't kill admin requests, dumps, root, etc)
  • set sane defaults for interval, busy-time and idle-time, and killing running or sleeping queries in pt-kill
  • make easy to run against any instance, allowing using hostname:section (e.g. db1099:s1)
  • print the full commandline for pt-kill, in case manual adjustments are needed
  • log all actions to a file for later reviewing
  • have a --dry-run option to monitor effects without actual kills
  • use automatically the "extra port", in case the default one is overloaded by other queries.

Usage

db-kill must be run as root from one of the cumin hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet), ideally from the same datacenter of the databases to target:

To kill long running queries on the s2 standalone mariadb instance db2059.codfw.wmnet:

db-kill db2059

If one wants to kill queries from the enwiki instance on the multi-instance host db1099.eqiad.wmnet:

db-kill db1099:s1

Alternatively, the following will be equivalent to the previous execution (s1 runs on port 3311):

db-kill db1099:3311

To test what will be killed on the standalone host db1117.eqiad.wmnet, without actually killing anything:

db-kill --dry-run db1117

Output:

Running:
/usr/bin/pt-kill --print --victims all --interval 5 --busy-time 10 --idle-time 10 --match-command 'Query|Execute|Sleep' --match-user wikiuser --log /var/log/db-kill.log h=db1117.eqiad.wmnet,P=3307
# 2021-10-13T10:27:42 KILL 2816313971 (Query 10 sec) SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecords  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = '4219499'
# 2021-10-13T10:27:42 KILL 2816313958 (Query 10 sec) SELECT /* ProofreadPage\Page\DatabaseIndexForPageLookup::findIndexesWhichLinkTo  */  page_namespace,page_title  FROM `page`,`pagelinks`    WHERE pl_namespace = 250 AND pl_title = 'Ксьондзівські_найми.pdf/8' AND (pl_from=page_id) AND pl_from_namespace = 252
# 2021-10-13T10:27:42 KILL 2816312509 (Query 10 sec) SELECT /* MediaWiki\Permissions\RestrictionStore::loadRestrictions  */  pr_type,pr_expiry,pr_level,pr_cascade  FROM `page_restrictions`    WHERE pr_page = 3823
...

Hit Ctrl-C to finish the monitoring and killing process- otherwise it will continue forever.

Arguments

db-kill --help
usage: db-kill [-h] [--dry-run] instance

Execute pt-kill on the given instance.

positional arguments:
  instance    Instance to connect to, in hostname, host:port or host:section
              format

optional arguments:
  -h, --help  show this help message and exit
  --dry-run   It prints the queries that would be killed, but does not send
              the kill commands

Code