External storage/Srv-data-migration

This page contains historical information from 2011. It may be outdated or unreliable.
The "External Storage Update" project is current as of 2011-11-15. Owner: Bhartshorne. See also RT:1300

Details of the process used to migrate data from the srv external store hosts to the new es100x hosts.

players

List of external store hosts:

  • cluster3: 10.0.2.151, 10.0.2.163, 10.0.2.175
  • cluster4: 10.0.2.152, 10.0.2.164, 10.0.2.176
  • cluster5: 10.0.2.153, 10.0.2.165, 10.0.2.177
  • cluster6: 10.0.2.154, 10.0.2.166, 10.0.2.178
  • cluster7: 10.0.2.155, 10.0.2.167, 10.0.2.179
  • cluster8: 10.0.2.156, 10.0.2.168, 10.0.2.180
  • cluster9: 10.0.2.157, 10.0.2.169, 10.0.2.181
  • cluster10: 10.0.2.158, 10.0.2.170, 10.0.2.182
  • cluster20: 10.0.2.160, 10.0.2.172, 10.0.2.184
  • cluster21: 10.0.2.161, 10.0.2.173, 10.0.2.185
  • cluster22: ms3, ms2, ms1
  • cluster1: 10.0.7.1, 10.0.7.101
  • cluster2: 10.0.7.1, 10.0.7.101

setup

All working files live in ~ben/externalStorage/.

  • create a dsh file with all the above hosts
    • ext-srv-hosts.dsh (doesn't include ms1/2/3)
  • verify that all external store hosts have the same schema
    • dsh -M -F5 -f /tmp/ext-srv-hosts.dsh -- 'mysqldump --no-data --all-databases --comments=FALSE | md5sum'
    • they don't.
      • Some hosts are missing the 'autoincrement' flag on all their database tables.
      • some hosts are missing some records
      • hosts have a different number of databases
      • 10.0.2.184 has the most dbs and has autoincrement.
  • for each db on each cluster, find out which host to use from that cluster RT:1520
    • since some hosts are missing some records, do a count(*) and find the highest blob_id for each table on each cluster; if the counts differ across a cluster's hosts, choose the host with the most records, and if a host's count does not equal its highest blob_id, flag it as weird.
    • distribute count_entries.sh to all hosts (a sketch of the script follows this list)
      • for i in $(cat ext-srv-hosts.dsh) ; do scp count_entries.sh $i:; done
    • dsh -F5 -f ext-srv-hosts.dsh -- "bash ./count_entries.sh" > all_db_rowcounts.txt
      • I've discovered that this approach missed dbs that actually do exist on the hosts (e.g. srv163/etwiki wasn't in my output file but does exist on the host).
      • second try: for i in {1..4}; do dsh -F1 -f ext-srv-hosts.dsh -- "bash ./count_entries.sh" ; done > adr2.txt
      • cat adr2.txt | sort | uniq > adr2-unique.txt
      • diff adr2-unique.txt all_db_rowcounts.txt - yup, not the same; a single dsh pass really does miss some databases.
    • conclusion:
      • cluster3 - srv151 (all hosts equally impaired - some deleted rows)
      • cluster4 - srv152
      • cluster5 - srv153
      • cluster6 - srv166
      • cluster7 - srv155
      • cluster8 - srv168
      • cluster9 - srv181
      • cluster10 - srv158
      • cluster20 - srv160 - more rows than its peers
      • cluster21 - srv161 - more rows than its peer
  • create a list of all dbs present on any external store host and use it to build an SQL file that will create all the dbs
    • for i in {1..6}; do dsh -f ./ext-srv-hosts.dsh -m ms1,ms2,ms3 -F1 -- "mysql -e 'show databases;'" ; done | sort | uniq > all_dbs.txt
    • cat all_dbs.txt | grep -v -e '^mysql$' -e '^information_schema$' -e '^test$'| sed -e "s/^\(.*\)/CREATE DATABASE \1 ;/" > create_all_dbs.sql
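
count_entries.sh itself isn't reproduced on this page. A minimal sketch of what it has to do, per the selection rule above (a count(*) and the highest blob_id for the blobs table in every local database), might look like the following; the output format, the skipped system databases, and the error handling are assumptions, not the original script.

 #!/bin/bash
 # hypothetical sketch of count_entries.sh -- not the original script
 # prints "<host> <db> <row count> <max blob_id>" for each local db's blobs table
 host=$(hostname)
 for db in $(mysql -N -e 'SHOW DATABASES;' | grep -v -e '^mysql$' -e '^information_schema$' -e '^test$'); do
     # some dbs may not have a blobs table; skip errors for those
     mysql -N -e "SELECT '$host', '$db', COUNT(*), MAX(blob_id) FROM $db.blobs;" 2>/dev/null
 done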

action

  • create all databases on es1001
    • es1001: mysql < create_all_dbs.sql
  • stop mysql on es1001
  • copy MyISAM files from the source clusters to es1001, renaming them to blobs_cluster# as you go (see the sketch after this list)
    • for i in $(cat cluster_masters.dsh ) ; do scp send_dbs_to_es1001.sh $i: ; done
    • dsh -F3 -f ./cluster_masters.dsh -- "bash ./send_dbs_to_es1001.sh"
  • start and then stop mysql on es1001
  • run myisamchk on all tables:
    • es1001: myisamchk --silent /a/sqldata/*/*.MYI
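
send_dbs_to_es1001.sh is likewise not reproduced here. A hedged sketch of the job it performs (push each database's blobs MyISAM files to es1001, renamed to blobs_cluster# on the way) follows; the destination hostname, the per-host cluster number, and the assumption that the source tables are quiesced while the files are read are mine, not the original script's.

 #!/bin/bash
 # hypothetical sketch of send_dbs_to_es1001.sh -- not the original script
 # run on the chosen source host of one cluster; copies each database's blobs
 # table files into the same database on es1001, renamed blobs_cluster<N>
 CLUSTER=3                       # set per source host (3 on srv151, 4 on srv152, ...)
 DEST=es1001.eqiad.wmnet         # assumed hostname
 DATADIR=/a/sqldata              # same datadir as the myisamchk step above
 for dbdir in "$DATADIR"/*/; do
     db=$(basename "$dbdir")
     case $db in mysql|information_schema|test) continue ;; esac
     for f in "$dbdir"blobs.frm "$dbdir"blobs.MYD "$dbdir"blobs.MYI; do
         [ -e "$f" ] || continue
         ext=${f##*.}
         rsync -a "$f" "$DEST:$DATADIR/$db/blobs_cluster$CLUSTER.$ext"
     done
 done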

post

  • establish replication between es1001 and es1002
    • es1001: update mysql user table to allow repl to connect from 10.% rather than 10.0.%
    • es1001: mysqladmin shutdown
    • es1001: rsync -a /a/sqldata/ es1002:/a/sqldata
    • es1001: mysqld_safe&
    • es1002: mysqld_safe&
    • es1002: start replication
 es1001# mysql -u root mysql -e "update user set Host='10.%' where User='repl';"
 es1002# mysql -u root
 mysql> CHANGE MASTER TO
   MASTER_HOST='10.64.0.25',  # <-- es1001
   MASTER_USER='repl',
   MASTER_PASSWORD='repl_password',    # <-- in /home/w/doc
   MASTER_LOG_FILE='es1001-bin.000001',
   MASTER_LOG_POS=0;
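
To carry out the "start replication" step on es1002, the usual statements (stock MySQL, nothing specific to this migration) are the following; Slave_IO_Running and Slave_SQL_Running in the status output should both report Yes.

 mysql> START SLAVE;
 mysql> SHOW SLAVE STATUS\G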
  • verify the content (a sketch of verify-copy.sh follows this list)
    • fenari: for cluster in 3 4 5 6 7 8 9 10 20 21 ; do for i in $(cat /home/w/common/all.dblist); do echo -n "$i "; ./verify-copy.sh $cluster_host $i blobs es1002.eqiad.wmnet $i blobs_cluster$cluster ; done > /tmp/verify_cluster$cluster & done
    • fenari: tail -f /tmp/verify_cluster* | grep -v "failed: 0"
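
verify-copy.sh is not shown on this page either. A minimal sketch that takes the same six arguments as the invocation above (source host, db, table; destination host, db, table) and prints a "failed: N" line for the grep could look like this; the blob_text column name, the sampled CRC check, and the reliance on ~/.my.cnf for credentials are assumptions.

 #!/bin/bash
 # hypothetical sketch of verify-copy.sh -- not the original script
 # usage: verify-copy.sh <src_host> <src_db> <src_table> <dst_host> <dst_db> <dst_table>
 # prints "failed: N"; the caller above prefixes the db name, so
 # grep -v "failed: 0" leaves only the tables that need a closer look
 src_host=$1; src_db=$2; src_table=$3
 dst_host=$4; dst_db=$5; dst_table=$6
 q() { mysql -N -h "$1" -e "$2"; }      # assumes credentials in ~/.my.cnf
 failed=0
 # 1. row counts must match
 a=$(q "$src_host" "SELECT COUNT(*) FROM $src_db.$src_table")
 b=$(q "$dst_host" "SELECT COUNT(*) FROM $dst_db.$dst_table")
 [ "$a" = "$b" ] || failed=$((failed + 1))
 # 2. cheap content check: order-independent checksum over a sample of rows
 a=$(q "$src_host" "SELECT SUM(CRC32(blob_text)) FROM $src_db.$src_table WHERE blob_id % 100 = 0")
 b=$(q "$dst_host" "SELECT SUM(CRC32(blob_text)) FROM $dst_db.$dst_table WHERE blob_id % 100 = 0")
 [ "$a" = "$b" ] || failed=$((failed + 1))
 echo "failed: $failed"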

end state

as of 2011-09-29:

  • data from all apache servers is copied and verified
  • clusters 1 and 2 are copied and verified

State:

  • es100{1,2,3} have data for clusters 1-10, 20, 21
  • ms{1,2,3} and es1004 have data for cluster 22 and rc1