User:Afeldman/percona-live-london-notes

From Wikitech

Tutorial 1: DBA Swiss Army Knife

tldr

The first part was an overview of storage performance (calculating IOPS, etc.) targeted at DBAs without a systems background. The rest mostly went over percona-toolkit, a combined fork of maatkit + aspersa. I was very familiar with maatkit but not the aspersa tools, some of which seem quite useful in front-line troubleshooting. Most interesting: percona is developing its own online schema change tool, pt-online-schema-change, based on the ideas of Facebook's OSC tool but not the same. This is great because while Facebook claims theirs works in all cases, and I'm sure it does, the version on launchpad has not been updated since the first commit and has some problems.

Notes

@lefred / lefred@percona.com / http://lefred.be/

percona toolkit

  • gpl v2, derived from maatkit + aspersa
  • swappiness - should be 0 on dbs
  • disk scheduler - deadline is preferable (noop is as good)
  • to see schedulers that are available, and which is in use
 # cat /sys/block/sda/queue/scheduler 
 noop anticipatory deadline [cfq] 
  • tool #2: the tool that manages your raid controller (i.e. MegaCli, megactl, arcconf, etc)
    • important to know info on underlying disks, raid, and write cache policy
    • write cache should be write-back (or no perf gain over normal disk) w/bbu
    • know bbu health: i.e. MegaCli -AdpBbuCmd -GetBbuStatus -aAll
  • IOPS
    • to calculate for magnetic disks, you need (e.g. from manufacturer data):
      • rotation speed
      • avg latency
      • avg seek time
    • for one disk:
 iops = 1 / (avg latency in sec + avg read/write seek time in sec)
  • raid level : penalty
    • 0 : 1
    • 1 : 2
    • 10 : 2
    • 5 : 4
    • 6 : 6
 functional iops = (total raw iops * percentage of read ops) + ((total raw iops * percentage of write ops) / RAID iops penalty)
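Putting the two formulas above together, a quick back-of-the-envelope sketch (the drive numbers and workload split are illustrative assumptions, not vendor specs):

```python
# Rough IOPS math from the formulas above. The drive numbers here are
# illustrative assumptions, not real vendor specs.

def disk_iops(avg_latency_s, avg_seek_s):
    """Single-disk IOPS: 1 / (avg rotational latency + avg seek time)."""
    return 1.0 / (avg_latency_s + avg_seek_s)

def functional_iops(raw_iops, read_pct, raid_penalty):
    """Front-end IOPS an array can serve: reads are free, writes pay the
    RAID write penalty (2 for RAID 1/10, 4 for RAID 5, 6 for RAID 6)."""
    write_pct = 1.0 - read_pct
    return (raw_iops * read_pct) + (raw_iops * write_pct) / raid_penalty

# An assumed 15k RPM disk: ~2.0 ms rotational latency, ~3.5 ms seek
one_disk = disk_iops(0.0020, 0.0035)

# 8 such disks in RAID 10 with a 70/30 read/write workload
raw = 8 * one_disk
usable = functional_iops(raw, read_pct=0.70, raid_penalty=2)
print(round(one_disk), round(raw), round(usable))  # 182 1455 1236
```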


  • tool 3: sysbench
    • i.e. check the output of random writes
 sysbench --test=fileio --file-test-mode=rndrw --file-total-size=10G --file-extra-flags=direct --num-threads=8 prepare / then run
 rndrw = combined rand read/write, rndwr = rand write, rndrd = randread, also seq.. 
    • try with different threads and find the max before perf degrades
      • use for tuning innodb io threads
    • flush kernel buffer cache before each test:
 echo 3 > /proc/sys/vm/drop_caches
 pt-diskstats -d sda\$  (= sda*)
 press "c" and enter when running for all columns

Know your environment

  • tool 5: pt-mysql-summary (make sure you have .my.cnf w/pw) [this is good]
  • tool 6: pt-show-grants
    • understand access

Important questions

  • Which queries use the most resources? (might not appear in a slow query log)
  • How are indexes used?
  • Are there duplicate keys?
  • To answer - capture data:
    • use slow query log (but remember, fast queries can be bad/wrong too)
    • use general query log (not the best, Query_time is missing in <= 5.1)
    • use tcpdump (maybe best) - capture at different times, i.e. peak vs. non-peak
 # time tcpdump -i any -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' > $(hostname)_queries_$(date +%Y-%m-%d_%H-%M).tcp
  • tool 7: pt-query-digest (replaces mk-query-digest)
 # pt-query-digest --watch-server "x.x.x.x:xxxx" --type tcpdump file_name.tcp
  • tool 8: pt-index-usage
    • read queries from a log and analyze how they use indexes (can't use tcpdump output directly)
    • has some options to transform updates, etc. into selects for explain purposes
  • tool 9: pt-duplicate-key-checker
  • tool 10: innotop
  • tool 11: pt-mext -- show many samples of "show global status" side-by-side
 # pt-mext -r -- mysqladmin ext -i10 -c5
    • pay attention to: Created_tmp_*, Handler_read_*, innodb_buffer_pool_read_*, innodb_os_log_written, Key_read_requests, Key_reads, Qcache*, Select_full_*join, Select_range, Select_scan, Table_locks_waited, Threads_running
    • if these keep increasing, your queries need optimizing.
    • innodb_os_log_written -- an hour of this should fit in your innodb logs
  • tool 13: mydumper (good for dumping myisam tables in parallel)
  • tool 14: xtrabackup - w00t
  • tool 15: pt-archiver -- archive rows from a mysql table into another with a low impact fwd-only job that won't impact oltp queries much or set locks
  • tool 16: common-schema -- assists in schema changes
  • tool 17: pt-online-schema-change -- online non-blocking schema changes similar to Facebook's [this is good]
  • tool 18: pt-upgrade - execute on multiple servers and check for differences
  • tool 19: pt-deadlock-logger - get readable info on latest deadlock
  • tool 20: pt-kill - pattern match killer, i.e. selects from user yy over xx minutes
  • tool 21: pt-variable-advisor - check variables and advise on possible issues
  • tool 23-25: pt-stalk, pt-collect, pt-sift (these are buggy at the moment)
  • replication tools: pt-slave-delay, pt-slave-find, pt-slave-restart (brute force slave start), pt-heartbeat, pt-table-checksum, pt-table-sync
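The rule of thumb above for innodb_os_log_written (an hour of it should fit in your innodb logs) can be checked with two samples of the counter taken some interval apart; a sketch with made-up sample values:

```python
# Check the redo-log sizing rule of thumb: ~1 hour of innodb_os_log_written
# should fit within innodb_log_file_size * innodb_log_files_in_group.
# The sampled counter values below are invented for illustration.

def redo_log_headroom(written_t0, written_t1, interval_s,
                      log_file_size, log_files_in_group=2):
    """Return (bytes written per hour, total redo log size)."""
    bytes_per_hour = (written_t1 - written_t0) * 3600.0 / interval_s
    total_log = log_file_size * log_files_in_group
    return bytes_per_hour, total_log

# Two samples of innodb_os_log_written taken 60s apart:
per_hour, log_size = redo_log_headroom(
    written_t0=123_456_789,
    written_t1=123_956_789,            # ~500 KB written in 60s
    interval_s=60,
    log_file_size=256 * 1024 * 1024,   # two 256M log files
)
print(per_hour <= log_size)  # True -> an hour of writes fits
```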

Tutorial 2: Advanced Mysql Scaling Strategies for Developers

tldr

Make sure your queries use indexes correctly. Had some well done exercises.

Notes

What is wrong with rdbms? Can be slow. Transactions can lock, stall, or deadlock. Forces a relational schema. sql sucks.

What's right? ACID - very very good at keeping data safe - 50 yrs of research around that.

rdbms: sql parser -> query optimizer -> storage engine

sorting (group by, order by)

range scans:

 select A,B from Y where X = 123 and Z IN (1,100,40,3)
 key (X, Z)  
  • optimizer can still do a range scan of Z=1..100 instead of using the full index, depending on the number of items in IN() and/or the estimated cardinality of Z (which may be wildly incorrect)

subqueries - bad - mysql just doesn't properly support them; output from a subquery may look like it could feed a quick index lookup in a WHERE clause, but it won't be executed that way. mariadb 5.3 will be the first mysql version to fix this.

clustered pk and secondary indexes

  • if primary key (id)
    • key (x,y) stored looks like: key (x,y,id)
  • primary key should be as small as possible, and you should always pick one explicitly (even if it's just an autoinc)
  • autoinc keys hamper insert performance as there is a mutex around the incrementer

optimizer uses cardinality to pick which index to use - in innodb this isn't persisted and might be estimated from reading only ~6 rows when a table is opened, so it can be wildly wrong.

 show index from table -- includes cardinality estimate

Where did a slow query actually spend its time?

 set profiling = 1;
 run the query
 show profile; -- see query optimization example 2

using the slow log as an enhanced general log (especially if you have the microtime patch), and making a digest

 set global long_query_time = 0;
 set log_slow_verbosity = 'microtime'; (or 'full')
 flush logs; 
 ...wait..
 pt-query-digest --limit=100% slow.log > slow.txt 

  • caching (layers): objects large and small (memcache) - if a large object is made of lots of small objects, and it's invalidated when one small object changes, cache the small objects
  • summary tables (i.e. the old business_sort @kiva) -- flexviews is a great tool for automating them: http://code.google.com/p/flexviews/
  • denormalize - put together columns from relevant tables to get rid of joins

queues: great way to reduce concurrency. i.e. if server can only perform well with 64 write threads, but 1000 threads might be trying to insert data, threads could write to an in memory queue instead, with 64 job workers processing it
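A minimal sketch of that queue pattern (the worker count and the "insert" are stand-ins; a real version would write to mysql instead of a list):

```python
# Sketch of capping write concurrency with a queue: many producers enqueue,
# a fixed pool of workers drains. The "insert" is a stand-in for a real
# database write.
import queue
import threading

NUM_WORKERS = 4                    # e.g. 64 in the talk's example
jobs = queue.Queue(maxsize=1000)   # bounded: producers block when full
results = []
results_lock = threading.Lock()

def worker():
    while True:
        row = jobs.get()
        if row is None:              # poison pill: shut this worker down
            jobs.task_done()
            return
        with results_lock:
            results.append(row)      # stand-in for: INSERT INTO ... VALUES
        jobs.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for i in range(100):                 # 100 client threads' worth of writes
    jobs.put(i)
for _ in threads:                    # one poison pill per worker
    jobs.put(None)
jobs.join()
for t in threads:
    t.join()

print(len(results))  # 100
```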

too many indexes? pt-index-usage + pt-duplicate-key-checker (USER_STATISTICS)

indexes != data, but when indexing strings, try to make sure a prefix index still captures the full column contents for most values (95% or more), e.g. where column is varchar(255) utf8:

 1. select count(distinct(column))
 2. select count(distinct left(column,20))
 2/1 = fraction of distinct values a 20-char prefix index can distinguish. size the index to be useful.
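The 2/1 ratio is just division; a toy calculation, assuming made-up counts from the two SELECT COUNT(DISTINCT ...) queries:

```python
# Sizing a string-prefix index: ratio of distinct prefixes to distinct
# full values. Counts below are assumed example output of the two
# SELECT COUNT(DISTINCT ...) queries, not real data.

def prefix_selectivity(distinct_prefix, distinct_full):
    """Fraction of distinct values a prefix index can tell apart."""
    return distinct_prefix / float(distinct_full)

full_20 = prefix_selectivity(97_500, 100_000)   # left(column, 20)
full_8 = prefix_selectivity(61_000, 100_000)    # left(column, 8)

# Aim for ~95%+: a 20-char prefix is enough here, 8 chars is not.
print(full_20 >= 0.95, full_8 >= 0.95)  # True False
```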

useful links: http://code.google.com/p/common-schema/ http://palominodb.com/blog/2011/10/12/why-are-your-indexes-larger-your-actual-data

Query Optimization Example 1

Given this query histogram in query-digest output:

# Rank Query ID           Response time   Calls   R/Call Apdx V/M   Item
# ==== ================== =============== ======= ====== ==== ===== ======
#    1 0xB79B29D99178325A 5972.9879 24.2%   14998 0.3983 1.00  0.04 SELECT Property PropertyDocument User Seller

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0   14998
# Exec time     24   5973s    28us      5s   398ms   477ms   126ms   393ms
# Lock time      0      4s       0   117ms   260us   167us     3ms   113us
# Rows sent      0  50.05k       0       6    3.42    5.75    2.31    2.90
# Rows examine  10   2.17G       0 162.99k 151.57k 158.07k  37.15k 158.07k
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      0 140.04k       0  14.02k    9.56   28.75  163.68    3.89
# Bytes sent     0  30.60M   1.43k   2.76k   2.09k   2.62k  433.34   1.96k
# Tmp tables     0  13.79k       0       1    0.94    0.99    0.23    0.99
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size  25   6.56G       0 597.78k 458.44k 590.13k 249.51k 590.13k
# Query size     0   7.39M     511     530  516.82  511.45    0.20  511.45

# Query_time distribution
#   1us
#  10us  ###
# 100us  #
#   1ms  #
#  10ms
# 100ms  ################################################################
#    1s  #
#  10s+

Start with an explain:

mysql> EXPLAIN 
    -> SELECT `Property`.`PropertyId`, `PropertyDocument`.*, `User`.`username`, `Seller`.* FROM `Property`
    ->  LEFT JOIN `PropertyDocument` ON PropertyDocument.PropertyId  = Property.PropertyId
    ->  INNER JOIN `User` ON User.id = Property.sellerId
    ->  INNER JOIN `Seller` ON Seller.userId = Property.sellerId 
	WHERE (Property.catalogNumber = "907" AND Property.subjectCode = "APT" 
	AND Property.districtId = 1956410 AND Property.status > 0) GROUP BY `Property`.`PropertyId` ORDER BY `Property`.`dateAdded` DESC LIMIT 3 OFFSET 9\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: Seller
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27978
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: Property
   partitions: NULL
         type: ref
possible_keys: merchantId
          key: merchantId
      key_len: 4
          ref: Seller.userId
         rows: 3
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: PropertyDocument
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: Property.PropertyId
         rows: 1
        Extra:
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: User
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: Seller.userId
         rows: 1
        Extra:
4 rows in set (0.01 sec)

Now let's look at the indexes and structure of the relevant tables:

mysql> SHOW CREATE TABLE `Seller`\G
*************************** 1. row ***************************
       Table: Seller
Create Table: CREATE TABLE `Seller` (
  `userId` int(10) unsigned NOT NULL,
 ..
  PRIMARY KEY (`userId`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW CREATE TABLE `Property`\G
*************************** 1. row ***************************
       Table: Property
Create Table: CREATE TABLE `Property` (
  `PropertyId` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `sellerId` int(10) unsigned NOT NULL,
..
  `dateAdded` datetime NOT NULL,
  `status` tinyint(4) DEFAULT NULL,
..
  `districtId` int(10) unsigned DEFAULT NULL,
  `catalogNumber` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
  `subjectCode` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
..
  PRIMARY KEY (`PropertyId`),
  KEY `merchantId` (`sellerId`)
) ENGINE=InnoDB AUTO_INCREMENT=255575 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> show table status like 'Property'\G
*************************** 1. row ***************************
           Name: Property
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 137289
 Avg_row_length: 195
    Data_length: 26787840
Max_data_length: 0
   Index_length: 11042816
      Data_free: 4194304
 Auto_increment: 259379
..
1 row in set (0.00 sec)

mysql> show table status like 'Seller'\G
*************************** 1. row ***************************
           Name: Seller
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 27523
 Avg_row_length: 57
    Data_length: 1589248
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: NULL
..
1 row in set (0.00 sec)

Solution

No index is being used for the Seller table, or to satisfy this:

WHERE (Property.catalogNumber = "907" AND Property.subjectCode = "APT" 
	AND Property.districtId = 1956410 AND Property.status > 0)

although a key on Property (merchantId) is being used to satisfy its join to Seller.

What mysql is doing: reading every row in Seller, using an index to pull the corresponding Property rows, and then checking each Property row against the WHERE. If there were an index on Property that covered at least a good chunk of the WHERE as well as merchantId, it would be used and mysql would no longer read every single row in Seller. Which column in the Property WHERE should be indexed? You could cover all of them if it makes sense, but it might not. Check the cardinality of each; picking one might be just as good as covering them all.
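A sketch of that cardinality check: given (invented) COUNT(DISTINCT col) results for the WHERE columns, rank them by selectivity and index the best one first:

```python
# Picking which Property WHERE column to index: compare selectivity
# (distinct values / total rows). The distinct counts are invented
# for illustration, not measured from the real table.

total_rows = 137_289   # Rows estimate from SHOW TABLE STATUS above

# Hypothetical COUNT(DISTINCT col) results:
distinct_counts = {
    "districtId":    11_000,
    "catalogNumber":  3_200,
    "subjectCode":       40,
    "status":             4,
}

selectivity = {col: n / total_rows for col, n in distinct_counts.items()}
best_first = sorted(selectivity, key=selectivity.get, reverse=True)
print(best_first[0])   # districtId: best leading column for a new index
```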

Query Optimization Example 2

Here is a very slow (10+ sec) query with a count distinct, group by, and limit+offset. Woo! It has to examine millions of rows in order to return a few.

  • As seen in query-digest:
# Profile
# Rank Query ID           Response time    Calls  R/Call  Apdx V/M   Item
# ==== ================== ================ ====== ======= ==== ===== =====
#    1 0xE09874F5486FB14A 22034.4158 63.6%   1489 14.7981 0.02  0.27 SELECT orders order_lines

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0    1489
# Exec time     63  22034s   251us     20s     15s     15s      2s     15s
# Lock time      0    62ms    31us     1ms    41us    47us    39us    38us
# Rows sent      0   2.96k       1       4    2.03    2.90    0.26    1.96
# Rows examine  59  14.94G       4  10.48M  10.27M  10.25M   1.39M  10.25M
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      0   4.25k       0       5    2.92    2.90    0.52    2.90
# Bytes sent     0 154.79k      93     132  106.45  118.34    3.94  102.22
# Tmp tables     0   1.45k       1       1       1       1       0       1
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0 446.68M 307.19k 307.19k 307.19k 307.19k       0 307.19k
# Query size     0 380.95k     261     262  261.98  258.32       0  258.32

# Query_time distribution
#   1us
#  10us
# 100us  #
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
  • explain:
mysql> EXPLAIN SELECT DISTINCT o.order_id
FROM orders o, order_lines ol
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ol
   partitions: NULL
         type: index
possible_keys: idx1,idx2,idx3,idx4,idx5
          key: idx1
      key_len: 8
          ref: NULL
         rows: 1436
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: o
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,ix_dx1
          key: PRIMARY
      key_len: 4
          ref: commerce.ol.order_id
         rows: 1
        Extra: Using where
2 rows in set (0.05 sec)
  • Note that explain doesn't display the very high "rows examined" we see in the query-digest histogram. Explain can be very useful, but completely masks some issues.
  • Examine the tables and indexes
mysql> SHOW CREATE TABLE `order_lines`\G
*************************** 1. row ***************************
       Table: order_lines
Create Table: CREATE TABLE `order_lines` (
  `line_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` int(11) unsigned NOT NULL,
  `inventory_id` int(11) NOT NULL,
  `trans_type` int(11) NOT NULL DEFAULT '0',
  `status_mode` int(11) NOT NULL DEFAULT '0',
  `status_` int(11) DEFAULT NULL,
  `sub_status` int(11) unsigned DEFAULT NULL,
  ....(another 45 columns)...
  PRIMARY KEY (`line_id`),
  KEY `idx1` (`order_id`,`line_id`),
  KEY `idx2` (`status_mode`,`status`),
  KEY `idx3` (`status_`,`order_id`),
  KEY `idx4` (`status_`,`sub_status`),
  KEY `idx5` (`c`,`trans_type`)
) ENGINE=InnoDB 
1 row in set (0.02 sec)
  • The optimizer is picking an index that applies to the group by and join, which seems sane. Let's try indexes that apply against part of the order_lines WHERE clause though. First trying a key that matches status_mode. Why? Status likely only contains a small number of values (1-4?), completed is likely just 0/1. Indexes there are unlikely to help due to bad cardinality.
mysql> EXPLAIN SELECT DISTINCT o.order_id
FROM orders o, order_lines ol use index(idx2)
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ol
   partitions: NULL
         type: range
possible_keys: idx2
          key: idx2
      key_len: 9
          ref: NULL
         rows: 1543214
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: o
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,ix_dx1
          key: PRIMARY
      key_len: 4
          ref: commerce.ol.order_id
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
  • That looks worse in the explain: still using where+temporary+filesort, and Rows is much larger. Let's try a key that covers trans_type, the other candidate to handle the WHERE.
mysql> EXPLAIN SELECT DISTINCT o.order_id
FROM orders o, order_lines ol use index(idx5)
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
    ->
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ol
   partitions: NULL
         type: range
possible_keys: idx5
          key: idx5
      key_len: 8
          ref: NULL
         rows: 185506
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: o
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,ix_dx1
          key: PRIMARY
      key_len: 4
          ref: commerce.ol.order_id
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)
  • That looks bad too, but is a lot better than when using idx2. How many rows actually match the order_lines WHERE?
mysql> select count(1) from order_lines ol where ol.status_mode = '5'
AND ol.status_ != '4'
AND ol.trans_type IN (2,5);
+----------+
| count(1) |
+----------+
|   124849 |
+----------+
1 row in set (0.22 sec)
  • Now let's use profiling to see what is actually happening with the three different indexes on order_lines.
mysql> set profiling = 1;
Query OK, 0 rows affected (0.00 sec)
  • Forcing idx2 (covers status_mode)
mysql> SELECT DISTINCT o.order_id
FROM orders o, order_lines ol use index(idx2)
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
    ->
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
order_id: 10444406
*************************** 2. row ***************************
order_id: 10704076
*************************** 3. row ***************************
order_id: 11031609
3 rows in set (6.46 sec)
  • Letting the optimizer pick idx1 (order_id)
mysql> SELECT DISTINCT o.order_id
FROM orders o, order_lines ol
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
    ->
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
order_id: 10444406
*************************** 2. row ***************************
order_id: 10704076
*************************** 3. row ***************************
order_id: 11031609
3 rows in set (16.22 sec)
  • and idx5 (trans_type)
mysql> SELECT DISTINCT o.order_id
FROM orders o, order_lines ol use index(idx5)
WHERE o.order_id = ol.order_id
AND ol.status_mode = '5'
AND o.completed = 1
AND ol.status_ != '4'
AND ol.trans_type IN (2,5)
    ->
GROUP BY ol.order_id
LIMIT 0, 25\G
*************************** 1. row ***************************
order_id: 10444406
*************************** 2. row ***************************
order_id: 10704076
*************************** 3. row ***************************
order_id: 11031609
3 rows in set (0.32 sec)
  • Profiling result for forcing idx2
mysql> show profile for query 1 ;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000093 |
| Opening tables       | 0.000015 |
| System lock          | 0.000005 |
| Table lock           | 0.000009 |
| init                 | 0.000040 |
| optimizing           | 0.000017 |
| statistics           | 0.000115 |
| preparing            | 0.000020 |
| Creating tmp table   | 0.000031 |
| executing            | 0.000003 |
| Copying to tmp table | 6.411286 |
| Removing duplicates  | 0.000014 |
| Sorting result       | 0.000016 |
| Sending data         | 0.000014 |
| end                  | 0.000003 |
| removing tmp table   | 0.000008 |
| end                  | 0.000004 |
| query end            | 0.000004 |
| freeing items        | 0.000038 |
| logging slow query   | 0.000002 |
| logging slow query   | 0.000076 |
| cleaning up          | 0.000004 |
+----------------------+----------+
22 rows in set (0.01 sec)
  • All of the time was in copying to tmp table, 6.4 seconds. Profiling for idx1 (no force index)
mysql> show profile for query 2 ;
+----------------------+-----------+
| Status               | Duration  |
+----------------------+-----------+
| starting             |  0.000067 |
| Opening tables       |  0.000013 |
| System lock          |  0.000005 |
| Table lock           |  0.000006 |
| init                 |  0.000027 |
| optimizing           |  0.000011 |
| statistics           |  0.000197 |
| preparing            |  0.000080 |
| Creating tmp table   |  0.000017 |
| executing            |  0.000003 |
| Copying to tmp table | 16.228572 |
| Removing duplicates  |  0.000013 |
| Sorting result       |  0.000015 |
| Sending data         |  0.000014 |
| end                  |  0.000003 |
| removing tmp table   |  0.000008 |
| end                  |  0.000004 |
| query end            |  0.000003 |
| freeing items        |  0.000032 |
| logging slow query   |  0.000002 |
| logging slow query   |  0.000085 |
| cleaning up          |  0.000004 |
+----------------------+-----------+
22 rows in set (0.00 sec)
  • 16 seconds to make the tmp table. So using the index that satisfies the join and group by is >50% slower than using an index that satisfies a moderately selective part of the WHERE clause, even though explain seems to indicate otherwise. idx5 looked better than idx2, and Rows should be apples-to-apples comparable between those two, unlike with idx1.
mysql> show profile for query 3 ;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000092 |
| Opening tables       | 0.000014 |
| System lock          | 0.000005 |
| Table lock           | 0.000009 |
| init                 | 0.000039 |
| optimizing           | 0.000017 |
| statistics           | 0.000103 |
| preparing            | 0.000039 |
| Creating tmp table   | 0.000026 |
| executing            | 0.000003 |
| Copying to tmp table | 0.315404 |
| Removing duplicates  | 0.000016 |
| Sorting result       | 0.000014 |
| Sending data         | 0.000013 |
| end                  | 0.000004 |
| removing tmp table   | 0.000006 |
| end                  | 0.000004 |
| query end            | 0.000004 |
| freeing items        | 0.000037 |
| logging slow query   | 0.000002 |
| cleaning up          | 0.000004 |
+----------------------+----------+
21 rows in set (0.00 sec)

Solution

Presto! Forcing idx5 turns a 16-second query into < 400ms, even though it doesn't match the group by or join. It results in by far the smallest temp table, and profiling shows that is the speed bottleneck in this query.

Data Recovery for Mysql

tldr

This talk was awesome. Not data recovery as in how to best use backups, but data recovery as in "someone just dropped a table and ( the backup system was broken | we're still screwed if we lose everything since yesterday's backup )"

Pre-Req

https://launchpad.net/percona-data-recovery-tool-for-innodb/trunk (always build and use the latest from trunk)

Important points to understand about Innodb

  • Doesn't store data as "tables" but as clustered primary indexes
  • Tables are thus identified by an index_id
  • The mapping of a table name to an index_id is in the data dictionary (SYS_TABLES, SYS_INDEXES)
  • This is in ibdata, which is split into pages
  • Find the index_id for the table you want to restore
  • Re-create the table definition
  • Extract lost data still on disk
  • Restore data into original mysql instance

Index identifier:

  • 8-byte int; the table name -> index_id mapping lives in the innodb dictionary
  • visible in table monitor output + I_S tables in percona server

Table monitor output:

 create table innodb_table_monitor (id int) engine=innodb;

Dumps the innodb data dict to the error log.

Basic structure

  • innodb page format - 16k
  • fil_header - 36b
  • page_header - 56b
  • infimum+supremum records - varies (first/last record markers within the btree page)
  • user records - varies
  • free space
  • page directory - varies
  • fil_trailer - fixed

fil header - start of all pages

  • fil_page_space - 4-byte checksum
  • page_offset - ordinal page number from the start of the space

..

  • fil_page_type = data, index, etc... - may need to search here for types of pages
  • In index pages, header includes page_index_id - match to index id in data dict.
  • Can also extract pages where type = 2ndry index + index_id matching tbl to extract partial data if PK pages aren't recoverable
  • Rows in page - infimum = first, supremum = last.
  • In primary-key order in terms of btree next-pointer linkage, not physically
  • Records are saved in insert order
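As an illustration of the page layout described above, a small parser for a few fil-header fields (offsets follow the standard innodb page layout; the page bytes below are synthetic test data, not from a real ibdata file):

```python
# Parse a few fil-header fields out of a 16 KB InnoDB page. Offsets follow
# the standard page layout; the page here is synthetic, not a real
# ibdata1 page.
import struct

PAGE_SIZE = 16 * 1024
FIL_PAGE_INDEX = 17855          # page type for btree index pages

def parse_fil_header(page):
    checksum, page_no = struct.unpack(">II", page[0:8])
    page_type = struct.unpack(">H", page[24:26])[0]
    info = {"checksum": checksum, "page_no": page_no, "type": page_type}
    if page_type == FIL_PAGE_INDEX:
        # index_id lives in the page header, 28 bytes past the fil
        # header (byte offset 66), as an 8-byte big-endian int
        info["index_id"] = struct.unpack(">Q", page[66:74])[0]
    return info

# Build a fake index page: page number 3, type INDEX, index_id 40
page = bytearray(PAGE_SIZE)
struct.pack_into(">II", page, 0, 0xDEADBEEF, 3)
struct.pack_into(">H", page, 24, FIL_PAGE_INDEX)
struct.pack_into(">Q", page, 66, 40)

print(parse_fil_header(bytes(page)))
```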

Row format

  • offsets - 1 byte for small, 2 bytes for longer types (varchar lengths stored here)
  • extra bytes - 5 for compact, 6 for redundant
  • field content - variable

Extra bytes (redundant row format)

  • record status - 2 bits
  • deleted_flag - 1 bit (useful for recovery)
  • min_rec_flag - 1 if the record is the minimum record
  • n_fields - number of fields in the record, 1 to 1023

..

  • next 16 bits - pointer to next record

Extra bytes (compact - what we normally use)

  • record status + deleted_flag + min_rec_flag - 4 bits to mark deleted recs and mark a predefined minimum in alpha order

..

  • Depending on recovery scenario, may just be looking for rows with the delete bit set.
  • row: (10, 'abc', 20)
    • actually stored: (10, trx_id, roll_ptr, 'abc', 20)
      • redundant: offsets (4 6 7 6 4) | extra bytes | next ptr | 00 00 00 0a | trx_id | roll_ptr | abc | 80 00 00 14
      • compact: offsets | extra bytes | next ptr | 00 00 00 0a | trx_id | roll_ptr | abc | 80 00 00 14
  • Fields in the pk, in order: 1. id, 2. db_trx_id, 3. db_roll_ptr, then the user columns

Data types
  • ints, floats = fixed size
  • varchar - variable
  • char - fixed if not utf-8, otherwise also variable
  • date, time - fixed
  • decimal - stored as string before 5.0.3, binary after
  • Blobs
    • field length (offset) is one or two bytes long
    • if record size < (univ page size / 2 - 200) -- ~7k -- the record is stored internally (in the PK page)
    • otherwise - 768 bytes in the page, the rest in an external page (unless using the new innodb format, then always external)

Innodb dictionary

  • (sys_indexes, sys_tables)
  • sys_* tables - map "table name" -> "index_id" + store other internal info
  • sys_* structure - always stored in redundant format with fixed sizes and positions in ibdata
  • sys_tables:
    • Rows like:
      • mydb/foo 40 8 1 0 0 null
    • One row per column in table
    • The first int is the index_id, so in the above, the table foo in mydb has an index_id of 40

Failure Scenarios

Wrong Delete

 "delete from `actor`" - wanted to del 1 row, but acidentally matched all
  • what happens? rows marked as deleted but not instantly removed from pages, first in buff pool, then flushed to disk
    • trx_id and roll_ptr are updated
    • row remain in pages until UNDO log is purged
  • What to do first
    • kill mysql asap -- kill -9 mysqld_safe + mysqld - don't want a safe shutdown!
    • find pages which belong to dropped table (via primary index)
    • fetch records from the pages using constraints_parser from the recovery toolkit

Dropped table/database

  • drop table actor
  • very often drop followed by create
  • bad because .frm files are removed; even worse when using innodb_file_per_table
  • What happens in innodb: page is marked as free or deleted from tbl space
  • What to do:
    • kill -9 mysql, remount mysql partition as ro and image or copy, recovery toolkit - fetch records from index
  • truncate table - equiv to drop/create. dict is updated but index_id is likely reused - what to do? same as with a drop

Wrong update statement

  • What happens? if the new row is same size, in-place update. otherwise, insert/delete
  • old values go to undo, roll_ptr points to the old value in undo space
  • what to do? kill -9, but no tool currently available to recover

Other issues:

  • removed ibdata1 file - take mysqldump asap before mysql is stopped + ibdconnect from recovery toolkit
  • wrong backups - restore and mysql doesn't start. try innodb_force_recovery + fetch records from index page

Corrupt table space

  • hw failure, os/fs failure, other bugs.
  • what to do? innodb_force_recovery + fetch_data.sh + constraints_parser

Innodb Recovery Tool

  • http://www.percona.com/software/mysql-innodb-data-recovery-tools/
    • page_parser - scans a byte stream, finds pages and sorts them by type/index id
      • accepts files like ibdata1 but can even scan a disk (i.e. /dev/sda1) not mounted rw in the case of fully deleted files, such as a drop table when using file-per-table
    • constraints_parser - fetches records from innodb pages.
      • accepts a page or directory of previously dumped pages
    • ibdconnect - a tool to connect an ibd file to the system tablespace
      • updating the innodb data dict (while mysql is down)
        • ibdconnect -o /var/lib/mysql/ibdata1 -f /my/file.ibd ..
    • fetch_data.sh - fetches data from partially corrupt tables using pk ranges
  • Recovery pre-reqs
    • media - ibdata, *.ibd, hdd image
    • table structs - old sql dump or frm files
  • How to get show create table for 'actor' from .frm files:
    • stop mysql and replace actor.frm
    • run mysql with innodb_force_recovery=4
    • show create table actor - you get the actual structure
  • If you don't have the frm file, but have the structure from an old sql dump, you can create a new frm file. Requires filling out a header in the toolkit source and recompiling.

Disk Abstraction Xrays

  • Std linux I/O is buffered, flushes to disk when buffer is full. If app or server crashes before the flush, the data is lost.
  • What about unbuffered I/O? Is it really?
  • Shows example of writing a big dd to an 8-disk raid 10, with dd set to unbuffered.
    • Look at bi/bo in iostat/vmstat - shows many seconds going by with no writes, then a very large write, repeat
    • runs sync during the quiet time, triggers a large write
  • unbuffered I/O only means it doesn't go into the buffer cache. There are other lower level buffers though.
    • Journaled file systems generally have a journaling buffer
    • The kernel also wants to combine small writes into larger sequential ones for efficiency
  • If file is opened with O_SYNC, file gets synced to disk on every write operation. This is slow.
    • Innodb generally uses buffered writes for log files, then calls fsync() on them as needed/configured. (i.e. per sync_binlog value)
  • O_DIRECT - each write really gets written, at least to the controller/driver
  • pt-diskstats - provides a wr_mrg col that shows kernel write merges
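The buffered vs. synced distinction above can be sketched with a few syscalls (a minimal sketch on Linux/Unix; InnoDB's actual I/O paths are more involved):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.log")

# Buffered write: write() returns once the data is in the page cache.
# It is only durable after an explicit fsync() -- the pattern InnoDB
# typically uses for its log files.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"x" * 4096)
os.fsync(fd)  # force the dirty page(s) down to stable storage
os.close(fd)

# O_SYNC: every write() blocks until the data reaches stable storage,
# so no fsync() is needed -- but each write pays the full sync cost.
fd = os.open(path, os.O_WRONLY | os.O_SYNC | os.O_APPEND)
os.write(fd, b"y" * 4096)
os.close(fd)

print(os.path.getsize(path))  # 8192: both writes landed
```

The difference only shows up in latency and crash behavior, not in the file contents; that's why the bi/bo bursts in vmstat are the easiest way to see buffering in action.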

B+Tree Indexes and Innodb

  • A quick in-depth dive into indexing

why index / what is a b+tree / b+tree characteristics / cost estimation formula / advantages / primary vs. secondary btree indexes / characteristics of an ideal primary key / in-order vs. random inserts (in-order better), why composite indexes, etc.

  • Why index? linear search is very slow - O(n)
    • Many diff types of indexes - hash (only in memory and ndb), bitmap (not in mysql), b+tree and derivs (myisam + innodb)
    • Indexes increase search performance but add cost to all writes
  • Generalized version of binary search tree - see wikipedia
    • Classic disk based structure for indexing records in ordered key set for fast sequential disk reads.
    • Reading a single record from a very large table results in only a few pages being read.
  • Any index structure other than b+tree is subject to overflow; hash indexes are faster though, up to O(1)
  • B+tree structure
    • every node can have p-1 key values and p node pointers (p is called the order of the tree)
    • the leaf node contains the data, internal nodes are only used as guides
    • leaf nodes are connected together as a doubly linked list
    • keys are stored in the nodes in sorted order
    • all leaf nodes are at the same height, that's why it's called a balanced tree
  • Assumptions:
    • h = height of tree
    • p = branching factor of the tree
    • n = total # of records in tree
    • p can be as big as (node size in bytes / key length in bytes) + 1
    • h = log n / log p
      • 1M rows, 4-byte keys => h = 2 (with 1 million rows, only 2 operations are needed to find your value)
      • 1B rows, 4-byte keys => h = 3 (growing to 1 billion rows adds only 1 additional operation to find a value)
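The h = log n / log p formula is easy to check numerically (a sketch assuming 16k pages and 4-byte keys, ignoring per-key pointer overhead):

```python
import math

def btree_height(n_records, page_size=16384, key_len=4):
    """h = log(n) / log(p), rounded up; the branching factor p is
    roughly how many keys fit in one page (pointer overhead ignored)."""
    p = page_size // key_len
    return math.ceil(math.log(n_records) / math.log(p))

print(btree_height(1_000_000))      # 2 page reads for 1M rows
print(btree_height(1_000_000_000))  # 3 -- 1000x more rows adds one level
```

The logarithmic growth is the whole point: the tree stays shallow no matter how big the table gets.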
  • Cost calcs:
    • Search cost for a single row
      • S = h I/O ops
    • Update cost for a single row
      • U = search cost + rewrite leaf page = (h + 1 I/O ops)
    • Insert cost for row
      • I = search cost + rewrite internal page + rewrite leaf page I = h + 1 + 1 = (h + 2 I/O ops)
    • Delete cost
      • D = search cost + rewrite internal page + rewrite leaf page D = h + 1 + 1 = (h + 2 I/O ops)
    • Above cost for updates are only if the new data is of the same size, otherwise cost = delete + insert
    • Inserts may require splits into multiple operations if the leaf node is full. worst case = rebuilding the tree up to the root node (but it's shallow, so that's h operations)
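The per-row I/O costs above, collected into a small sketch:

```python
def index_costs(h):
    """I/O cost (in page operations) per single-row operation on a
    B+tree of height h, per the formulas above. The update figure
    assumes the new data is the same size (in-place rewrite)."""
    return {
        "search": h,      # S: root-to-leaf traversal
        "update": h + 1,  # U: search + rewrite the leaf page
        "insert": h + 2,  # I: search + rewrite internal + leaf page
        "delete": h + 2,  # D: same shape as insert
    }

print(index_costs(2))  # e.g. the 1M-row tree of height 2 from above
```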
  • What's good about b+tree
    • height is shallow + large branching factor
    • generally every node in a tree corresponds to a page of data (page size btwn 2k-16k)
    • a tree needs rebalancing after an insert or delete. but the tree is wide and many keys fit in a node, so rebalancing isn't needed for most inserts/deletes
    • extremely efficient range scans since leaf nodes are linked together as a doubly linked list
    • so only need to traverse from root->leaf once, then read leaf nodes in a nice sequential read for length of the range read. entire tree can be read without visiting higher nodes at all.
    • implicit sorting, since nodes contain keys sorted in pkey order, therefore implicitly returned in sorted order. no external sorting needed.
  • B+tree index in innodb - typical, no strings attached
    • leaf nodes contain the data (what data depends on if primary or secondary)
  • primary keys
    • pri index holds the actual row data in its leaf nodes
    • pri index can also be called a clustered index because data is clustered around pkey values
    • always a single clustered index per table
  • secondary keys
    • secondary indexes have their key values plus the pk value in the index and no row data
    • pk values stored in the leaf nodes of a secondary index act as pointers to the data
    • this means fetching a row via a secondary index requires two lookups
  • cost of secondary index lookup
    • C = Height of secondary index tree + Height of primary index btree
  • characteristics of an ideal pkey index
    • create pkey on cols that are not updated often (btree reconstruction on every update, holds a lock)
    • keep the size as small as possible
    • select the column(s) to create key on that have sequential increasing value if possible
    • random value columns, such as a UUID are very bad candidates
  • order
    • in-order inserts result in good page fill percent, meaning can keep inserting into same page until full.
    • random can lead to page fragmentation - meaning gaps inside of a page, with more pages needed to store the same amount of data.
    • more pages can also equal disk extent fragmentation, so range scans may no longer be possible as sequential reads.
  • so, do your best to insert in sequential order
  • composite indexes
    • if two secondary indexes on last_name and hire_date + pri key on emp_no. search cost:
      • S = h(hire_date) + h(last_name) + merge and intersect cost + h(emp_no)
    • but if you create a composite index on (hire_date, last_name),
      • S = h(hire_date, last_name) + h(emp_no)
    • suppose 1000 rows that match hire_date = '1923-23-23'
    • 1000 rows that match last_name = 'Peek'.
    • 4 rows match both.
      • 4 row reads with a composite index, 2000 without
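The row-read arithmetic from that example (row counts taken from the notes; purely illustrative):

```python
# Two single-column indexes: each index identifies its full candidate
# set, and both sets must be fetched before they can be intersected.
matches_hire_date = 1000
matches_last_name = 1000
rows_read_without_composite = matches_hire_date + matches_last_name

# Composite index (hire_date, last_name): only rows matching both
# predicates are ever touched.
rows_read_with_composite = 4

print(rows_read_without_composite, rows_read_with_composite)  # 2000 4
```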
  • B+tree can only work with filters that filter at least on the prefix of the index
    • idx(first_name, last_name) can't be used for a lookup on just last_name.
    • also can be used for first_name LIKE 'abc%' but not '%abc'
  • selectivity
    • defined as unique values / total no. of records
    • primary index is most selective
    • suppose you index a column that stores gender, so only 2 distinct values
    • secondary only stores a pointer to data in the pri index
      • indexing a gender col or bool means each key value provides a huge number of pk pointers
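Selectivity as defined above, as a quick calculation (row counts are made up for illustration):

```python
def selectivity(distinct_values, total_rows):
    # selectivity = unique values / total records; 1.0 is ideal
    return distinct_values / total_rows

print(selectivity(1_000_000, 1_000_000))  # primary key: 1.0
print(selectivity(2, 1_000_000))          # gender column: 2e-06 -- each
                                          # key value drags in ~500k pk pointers
```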
  • speeding secondary index
    • perf can be improved a lot if you don't need the whole row and can avoid the extra pk lookup.
    • this can be done by including all the columns queried in the definition of the secondary index. (no need to add the pkey value though)
    • values that we want to look up but not filter on must be added to the right! this is known as the covering index optimization.
  • select hire_date, last_name, first_name where hire_date='xxx' and last_name='abc';
    • can be fully met by just idx(hire_date, last_name, first_name)
  • index trade-off analysis
    • can be done using the cost estimation formula above and with help from pt-query-digest
    • select optimal data type for columns, esp ones that are indexed, i.e. int vs bigint
    • when selecting cols for pk, select those that keep the pk short, are sequential, and rarely updated
    • use fixed length datatypes as much as possible. i.e. ip addresses may vary between say 4-10 bytes, so might want to use a varchar, but fixed will ensure that updates can always be done in place
    • avoid uuid pkeys if possible
    • inserts faster when pkey in order
    • when creating index on strings, don't need to do the full column, just a prefix idx(str_col(4)) - but analyze to see what an efficient prefix would be
    • b+tree indexes are only useful on cols with good selectivity

Why MariaDB?

tldr

It's a true open source, community-driven project that takes in all of Percona's innodb work (xtradb), plus unencumbered open backports of the cutting edge improvements Oracle is developing for Mysql 6 Enterprise (currently commercial only), and perhaps most importantly, they place a big emphasis on quality. All compiler warnings have been fixed (a mysql first) and unit tests provide 98% code coverage. Monty was pissed when Sun called 5.1 GA while it still had major bugs open (http://monty-says.blogspot.com/2008/11/oops-we-did-it-again-mysql-51-released.html), and MariaDB GA releases are intended to be as bulletproof as any mysql version to date (for whatever that's worth..)

notes

  • community developed, feature enhanced, backwards compat
  • monty program is just a sponsor, 64% of committers
  • aims: 100% compat with mysql, stable bug free releases (unlike mysql), GPLv2 - no enterprise versions
  • diffs
    • no ndb storage engine
  • much better test cases and coverage than mysql (~98%)
  • compiler warnings eliminated - none at all with gcc -Wall
  • lots of long open mysql bugs fixed
  • whats in 5.1:
    • xtradb
    • extended stats in slow query log (from percona)
    • processlist with microsecond precision
    • table elimination - don't need to open all tables when querying highly normalized data
    • pool of threads (mysql 6 enterprise backport) - use a limited threadpool to handle all queries vs. one thread per connection. better if most queries are short with few table/row locks. mariadb implementation is libevent based.
  • 5.2:
    • pluggable auth - can use ldap etc
    • user statistics (userstatsv2 - percona, google, weldon whipple) kb.askmonty.org/v/user-statistics
      • stats per client, user, index, table stats
    • segmented myisam key cache - mitigates thread key cache lock
    • includes sphinx storage engine - can run sphinx searches from sql without connecting separately
      • can join search results against tables
  • 5.3 - not GA / prod stable yet
    • adds group commit, much faster than the facebook implementation, they'll switch
    • checksums for binlog events (5.6 backports)
    • obtain binlog pos without blocking other queries
    • mysqldump --single-transaction --master-data = completely non-blocking (if xtradb)
    • row based replication for tables w/o pkey
    • progress reports for alter table! (including in mytop output)
    • optimizer changes (backported from 5.6) @@optimizer_switch flags. disk access and major join improvements + optimized subqueries
      • queries that are reasonable in oracle but impossible to run on mysql due to requiring near infinite resources, finally becoming possible!

Domas Live: Tuning Mysql at Facebook

  • mysqlatfacebook - extra resiliency + reduced operations effort
  • at fb, every single query has a matching memcache object
  • multiple cache sets per dc due to huge network reqs
  • 60 million mysql queries per second - apart from writes, that's all just long tail
  • main performance focus is on reliable throughput in prod
    • consistent performance is much better than often fast, sometimes slow
    • avoid perf stalls + ensure hw is used
    • fb always looks at 99th percentile, no avg or means!
    • worst offender analysis - topN and histograms instead of tier averages
  • emphasis on stalls
    • "dogpiles"
    • temporary slow down - even 0.1s is huge
  • tools for debugging stalls
    • dogpiled (fb inhouse, not currently open)
      • snapshot state of server at time of distress
    • aspersa (stalk, collect) (included in percona-toolkit)
    • http://poormansprofiler.org/
      • later iterations - apmp, hpmp, tpmp
    • gdb ftw
  • facebook identified all kinds of wrf stalls and limitations in mysql under load
    • extending a table holds a global mutex
    • drop table - sql and innodb mutexes
    • purge contention - unnecessary innodb dict lock
    • binlog reads - no commits can happen if an old event is being read
    • kernel mutex around transaction creation, lock creation/removal, and deadlock detection
      • fb has deadlock detection disabled on all prod mysql instances (hey, i wanted to do this on db40)
    • Background page flush not really bkgrnd
  • efficiency - increasing utilization of hw, memory to disk ratio, finding bottlenecks (usually disk), sometimes network, app os server software chokepoints
    • covering indexes can help with running a larger disk to ram ratio
  • disk efficiency
    • normally iops bound
    • allow higher queue lengths - can run at >8 pending ops per disk
    • xfs/deadline
    • 300 iops on 166rps disks
  • ram efficiency
    • compact records, compact objects etc
    • clustered and covering index planning
    • force index - avoid unnecessary io and cached pages
    • full table scans: etl queries, mysqldump, etc.. tune midpoint insertion lru for innodb to allow these to happen without killing oltp perf
    • eliminated 40% of i/o via force indexes
    • use o_direct for logs! not normally supported, but is in current my@fb
    • everything else (non mysql procs) writing on the db should use o_direct too if possible / posix fadvise
  • pure flash - solves all problems!
    • but it's still not very dense and expensive
  • fb heavily uses flashcache
    • fusionio for cache in front of multi-tb sata raids
    • has a very long warmup time tho
  • pure fusionio flash servers resulted in the "mysql 2x" project
    • pure flash provided more capacity than one mysql slave instance could make use of, so tried multi instances per server
    • multiple slaves per server, listening on different ports
    • worked just fine for utilizing the hardware, but turned out to be a huge ops pain
  • making use of some parallel replication with mk-slave-prefetch
    • also running a python based prefetcher that domas wrote, will be open sourced soon
    • apparently it works well but requires that queries be structured/commented a certain way
    • looking forward to parallel replication features built into mysql 5.6