External storage/Update 2011-08

A description of the project to bring the external storage service up to date.

The "External Storage Update" project is Completed as of 2011-11-14. Owner: Bhartshorne. See also RT:1300

Background and context

The "External storage" hosts are used to store the text of a wiki page in the form of compressed revisions of pages. When a user asks for a page or a diff to an older version of a page, mediawiki grabs the compressed content from the external store database and uses it to display the user's request.

External storage is made up of 14 clusters of sets of 3 databases running mysql. There is one cluster running on ms1, 2, and 3, and 11 clusters running on sets of 3 apache servers.

cluster rc1 is on the ms hosts
clusters 1 and 2 run on ms3 under a separate IP address and are holdovers from old fedora clusters. This data is not replicated to ms1 and 2; it only exists on ms3
clusters 3-10, 20, and 21 run on sets of 3 apache servers (clusters 11-19 no longer exist)
cluster 22 gets all new writes and runs on the ms hosts

The sharding to different clusters is in the main database for a wiki as the old_text column of the text table. The clusters are defined in dbs.php.

There are several types of content on the clusters. Text storage data describes all the types of content, what each of them means, and how to count how many there are of each type.

Goals

get external storage replicated to eqiad to protect us from apaches dying, accidental deletion, or physical disaster in Tampa
get external storage off of apache nodes and onto dedicated database hardware for reliability and performance
make all external store data available from dedicated hardware in both colos for clarity of configuration and performance
combine all external store data (ms & apaches) onto one mysql cluster
document external storage's setup, maintenance requirements, and procedures

Process

DONE

acquire hardware in eqiad and set it up (RT:1442). These boxes are Dell PowerEdge R510s.
dig into whether a recompress run is appropriate as part of this project
- due to Text storage data types changing in unexpected ways, recompression is put off.
verify ES hardware has sufficient disk space for data from all the apaches
dig into the existing clusters and look at database version, schema, and content differences. RT:1447
- found variance in schema across the external stores, most notably in the lack of autoincrement in some. I decided to leave them as is and merely copy the data; a recompress project can normalize that later.
- cluster1 and 2 were innodb, and I changed them to MyISAM when copying them over to the new es hosts but left the other schema differences untouched (no autoincrement and blob_text was mediumtext instead of longblob).
dig into each cluster and look for content differences within the in-theory equivalent datasets
- in each case that there were differences, I took the version with the most rows. I found no differences of more than one row per table.
copy actual data files (.MYI, .MYD, .frm) from one host per non-ms1-backed cluster to es1001 (more detail)
copy all data on es1001 to es1002 and es1003, establish mysql replication
use mysqldump to get the data out of cluster1 and cluster2, change the table type from InnoDB to MyISAM, import into es1001 (and let replication populate es100*)
compare the data on the apache hosts and cluster1/2 to the data in the es hosts and verify that it is the same.
set up es1004 as a replica of ms3

TODO

rebuild /a/ with LVM on all ES boxes
copy actual data files for the rc1 cluster from es1004 to es100{1,2,3}
copy the data for cluster22 from es1004 to es100{1,2,3}, set up es1001 as a slave of ms3
purchase and set up ES hardware for pmtpa
copy the data over and set up es1 as a slave of es1001
set up es{2,3,4} as slaves of es1
promote es1 as the new master, replacing ms3, reconfigure es1001 to slave off es1

Documentation

Starting from http://wikitech.wikimedia.org/view/External_storage, expand and improve documentation until it sufficiently describes the current environment and how to work with it
- architecture of the current cluster
- how to move new inserts from one cluster to the other
- how to set up a new cluster

Database replication - desired end state

We currently have 7 machines between pmtpa and eqiad (ms1,2,3 and es1001,2,3,4). We will replace ms1,2,3 with new hardware that matches the ES hosts (Dell R510s) called es1,2,3,4. We will consolidate all of the existing clusters into one (though the shards will remain in the table names).

If we need to fail over to eqiad, replication between es1 and es1001 should change direction. es1001 becomes the new master. The master (and all slaves actively serving reads) should always be in the same datacenter as the active apaches.

Here's a picture:

Putting the new hosts into rotation and testing

All the content from the ms3 cluster (rc1, cluster22, cluster1, cluster2) and the random srv clusters (cluster3 through 10, 20) are living together on the new ES hosts. This section describes how to put the new hardware into service and validate that it is working as expected.

db level test for ES content

Describe /home/ben/externalStorage/verify-copy.sh

application level test for the ES data content

The effect of this test is that you can issue a curl on the command line and get back content that comes from the DB instead of the cache.

Preparation:

choose a test host. In this example, srv190
choose a wiki to use for your tests. In this example, svwiki (because it has content in all ES dbs)
choose an external store cluster to test. In this example, cluster6.

Test Setup:

modify /home/w/conf/pybal/pmtpa/apaches and take a sample server out of rotation (eg srv190, used throughout this example)
on srv190, modify /usr/local/apache/common-local/wmf-config/InitialiseSettings.php to turn off revision caching for your test wiki
- add a line to the wgRevisionCacheExpiry array for your db and set it to 0 seconds:
'svwiki' => 0,

Test execution:

find the revision ID and blob ID of a few random pages on cluster6 (the blob ID is the numery part of the old_text column)
select page.page_title, revision.rev_id, text.old_text from text,revision,page where text.old_text like 'DB://cluster6%' and text.old_id = revision.rev_text_id and page.page_id = revision.rev_page limit 20;

sample row: | "Kommen_till_mig",_Herren_Jesus | 2406666 | DB://cluster6/35393 |
set up tcpdump on the old and new databases serving cluster6 (in this case, 10.0.2.166, 10.0.2.178, and es4)
tcpdump -A -s0 port 3306 2>&1 | grep "SELECT.*blob_text" | grep 35393
execute curl to retrieve the page using the revision ID
curl -v -H "Host: sv.wikipedia.org" "http://srv190.pmtpa.wmnet/w/index.php?oldid=2406666" > /tmp/before-test
You should see a line appear in the tcpdump output on one of the old DBs similar to this:
..^Py...s....SELECT /* DatabaseBase::selectField 208.80.152.165 */ blob_text FROM `blobs` WHERE blob_id = '35393' LIMIT 1
on srv190, modify /usr/local/apache/common-local/wmf-config/db.php:
- change the database for cluster6 from 10.0.2.166 and 10.0.2.178 to es4 (10.0.0.228)
- change the templateOverridesByCluster array to add cluster6's table name as blobs_cluster6
repeat your curl, saving it to a new file
curl -v -H "Host: sv.wikipedia.org" "http://srv190.pmtpa.wmnet/w/index.php?oldid=2406666" > /tmp/after-test
you should see a line appear in the tcpdump output on the new DB similar to this:
..^Py...s....SELECT /* DatabaseBase::selectField 208.80.152.165 */ blob_text FROM `blobs` WHERE blob_id = '35393' LIMIT 1
diff the before and after files; the only difference should be the line containing Served by srv190 in 0.208 secs (the number of seconds is likely to be different.)

Post test:

undo the changes to db.php on srv190
undo the changes to InitialiseSettings.php on srv190
put srv190 back into rotation in pybal