User:Marostegui

To be moved to a proper page:

OLD: db2019

NEW: db2051

- prepare puppet patch to make NEW a master and OLD a regular slave: https://gerrit.wikimedia.org/r/#/c/369880/1

- prepare mediawiki patch to make NEW a master in the hosts array: https://gerrit.wikimedia.org/r/#/c/369879/2

- prepare operations/software s4.hosts patch to put NEW at the bottom of the list: https://gerrit.wikimedia.org/r/#/c/369877/

- disable alerts on s4 codfw [DONE]

- stop puppet on both OLD and NEW [DONE]

- Manually edit my.cnf on NEW to set STATEMENT as the binlog format, together with the other master options below [DONE]

* binlog-format = STATEMENT [DONE]

* plugin_load                        = rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so [DONE]

* expire_logs_days = 30 [DONE]

* rpl_semi_sync_master_enabled       = 1 [DONE] 

* rpl_semi_sync_master_timeout       = 100 [DONE]

* rpl_semi_sync_master_wait_no_slave = 0 [DONE]

* rpl_semi_sync_slave_enabled        = 1 [DONE]

- Restart mysql on NEW and check binlog format and all the options [DONE & checked]
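The check after the restart can be scripted. The following is a minimal sketch, not the procedure actually used here: `check_vars` is a hypothetical helper that reads tab-separated `name<TAB>value` lines (for example the output of `mysql -BN -e "SHOW GLOBAL VARIABLES"`) and compares them with the expected values passed as arguments. It assumes the server reports boolean settings as ON/OFF, and that expected values contain no spaces.

```shell
# Hypothetical helper: compare server variables read from stdin
# (one "name<TAB>value" pair per line) against name=value arguments.
# Prints one line per mismatch or missing variable; returns 1 on any problem.
check_vars() {
    awk -F'\t' -v expected="$*" '
        BEGIN {
            # Split the space-separated name=value arguments into a lookup table.
            n = split(expected, pairs, " ")
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                want[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
            }
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "MISMATCH %s: got %s, want %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            # Report expected variables that never appeared in the input.
            for (k in want) if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
            exit bad
        }
    '
}
```

Possible usage on NEW, assuming the values set in my.cnf above: `mysql -BN -e "SHOW GLOBAL VARIABLES" | check_vars binlog_format=STATEMENT rpl_semi_sync_master_enabled=ON rpl_semi_sync_master_timeout=100 rpl_semi_sync_master_wait_no_slave=OFF rpl_semi_sync_slave_enabled=ON`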

- Disable GTID on slaves and master: [DONE]

    * STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = no; START SLAVE; [DONE]    

- Start topology changes and move slaves under NEW (give time between moves, so the new master can recover from lag)  [DONE]

For example: repl.pl --switch-sibling-to-child --parent=db2051.codfw.wmnet:3306 --child=db2037.codfw.wmnet:3306

Notes:

* Do not use repl.pl for delayed slaves.

* Multisource slaves require the additional --parent-set or --child-set="default_master_connection='s4'".

* I was planning not to change dbstore2001, as it will be rebuilt. dbstore2002 does not replicate s4 yet.

* Wait some time between switches so that the new master has time to recover from lag. This is not a problem in itself (a topology change fails with no transactions lost), but do not run all the moves back to back in a script without a pause.

* I also disable GTID on the master and the slaves beforehand to avoid issues, and enable it again on the slaves later.
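One way to keep the pauses between moves is to generate the command list first and review it before running anything. This is only a sketch: `plan_moves` is a hypothetical helper, the pause length is an arbitrary assumption, and it prints the commands rather than executing them. It uses only the repl.pl flags shown above and does not handle delayed or multisource slaves.

```shell
# Hypothetical helper: print one repl.pl command per child, separated by
# "sleep" lines so the new master can recover from lag between moves.
# Usage: plan_moves PARENT PAUSE_SECONDS CHILD...
plan_moves() {
    parent=$1
    pause=$2
    shift 2
    first=1
    for child in "$@"; do
        # No pause before the first move, only between moves.
        if [ "$first" -eq 0 ]; then
            echo "sleep $pause"
        fi
        echo "repl.pl --switch-sibling-to-child --parent=$parent --child=$child"
        first=0
    done
}
```

For instance, `plan_moves db2051.codfw.wmnet:3306 300 db2037.codfw.wmnet:3306` followed by any further children prints the plan to the terminal; it can then be run line by line after checking lag on NEW.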

- Once all the slaves have been moved under NEW, move NEW under the eqiad master so that it is at the same level as OLD [DONE]

repl.pl --switch-child-to-sibling --parent=db2019.codfw.wmnet:3306 --child=db2051.codfw.wmnet:3306  [DONE]

- Kill heartbeat on the OLD master and continue with the puppet patch as soon as possible [DONE]

- merge puppet patch and run puppet on both hosts and make sure heartbeat comes up on NEW [DONE]

- merge mediawiki patch [DONE]

- merge operations/software patch  [DONE]

(If this were eqiad, we would also deploy the change to DNS, but as far as I can see there is no codfw alias. This process should be documented on wikitech as a quick checklist, after discussing potential upcoming changes to it. The goal is not a page to follow blindly, but a checklist of things to keep in mind and a roadmap to follow, so that it is easier if this happens at 3am and you are alone.)

- move OLD under NEW (do this last, in case you have to revert) [DONE]

repl.pl --switch-sibling-to-child --parent=db2051.codfw.wmnet:3306 --child=db2019.codfw.wmnet:3306

- Once everything is validated, enable GTID on the master and slaves [DONE]

* STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID = Slave_pos; START SLAVE;
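To confirm the GTID setting took effect on each slave, the `Using_Gtid` field of `SHOW SLAVE STATUS` can be inspected. The sketch below assumes MariaDB's vertical (`\G`) output format and a single replication connection; `gtid_mode` is a hypothetical helper, and multisource slaves would need `SHOW ALL SLAVES STATUS` instead.

```shell
# Hypothetical helper: read "SHOW SLAVE STATUS\G" output on stdin and print
# the value of the Using_Gtid field. Returns 1 if the field is absent.
gtid_mode() {
    awk -F': *' '
        $1 ~ /^ *Using_Gtid$/ { print $2; found = 1 }
        END { exit !found }
    '
}
```

Possible usage: `mysql -h db2037.codfw.wmnet -e 'SHOW SLAVE STATUS\G' | gtid_mode`, which should print `Slave_Pos` once GTID is enabled as above, and `No` while it is disabled.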