MariaDB/misc

There are 5 "miscellaneous" shards: m1-m5.

  • m1: Basic ops utilities
  • m2: otrs, gerrit and others
  • m3: phabricator and other older task systems
  • m4: eventlogging system
  • m5: openstack and other labs-related dbs

During the last cleanup, many unused databases were archived and/or deleted, and a contact person was identified for each of them.

Section descriptions

m1

Current schemas

These are the current dbs, and what was needed to fail them over:

  • bacula: sudo service bacula-director restart after the migration. I had already made sure no jobs were running with status director. Tested afterwards with a list media
  • etherpadlite: etherpad-lite seems to have crashed after the migration and systemd took care of restarting it. Etherpad crashes at least once a week anyway, if not more often, so no big deal; tested by opening a pad
  • heartbeat: needs "manual migration": change the master role in puppet
  • librenms: required a manual kill of its connections; @netmon1001: apache reload
  • puppet: required a manual kill of its connections; this caused the most puppet spam. Either restart the puppet-masters or kill the connections **as soon** as the failover happens (see the connection-check sketch after this list)
  • racktables: went fine, no problems
  • rt: required a manual kill of its connections; @unobtinium: apache reload
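
Several of the services above needed their connections killed manually on the old master. A minimal connection-check sketch, assuming db1016 is the old m1 master and using librenms purely as an example schema name:

 # list clients still connected to a given schema (example: librenms)
 mysql --skip-ssl -h db1016 -e "SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, db, COUNT(*) AS connections FROM information_schema.processlist WHERE db='librenms' GROUP BY user, client, db;"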

Deleted/archived schemas

  • reviewdb: not really on m1 anymore (it was migrated to m2). To delete.
  • blog: to archive
  • bugzilla: to archive, then kill. Archived and dropped
  • bugzilla3: idem (kill). Archived and dropped
  • bugzilla4: idem, archive. Actually, we also have this on dumps.wm.org (https://dumps.wikimedia.org/other/bugzilla/), but that is the sanitized version, so keep this archive just in case, I guess
  • bugzilla_testing: idem (kill). Archived and dropped
  • communicate: ? Archived and dropped
  • communicate_civicrm: not fundraising! We're not sure what this is; we can check the users table to determine who administered it. Archived and dropped
  • dashboard_production: Puppet dashboard db. Never used it in my 3 years here; the product sucks. Kill with fire. - alex. Archived and dropped
  • outreach_civicrm: not fundraising; this is the contacts.wm thing, not used anymore, but in turn that means I don't know what "communicate" is then. We can look at the users tables for info on the admin. Archived and dropped
  • outreach_drupal: kill. Archived and dropped
  • percona: jynus. Dropped
  • query_digests: jynus. Archived and dropped
  • test: archived and dropped
  • test_drupal: er, kill with fire? Kill. Archived and dropped

Owners (or in many cases just people who volunteer to help with the failover)

  • bacula: Alex
  • etherpadlite: Alex
  • heartbeat: will be handled as part of the failover process by DBAs
  • librenms: Arzhel
  • puppet: Alex
  • racktables: jmm
  • rt: Daniel, alex can help

m2

Current schemas

These are the current dbs, and what was needed to fail them over:

  • reviewdb: jynus will have a look at the service on cobalt. (This is Gerrit, I can be around or we can ask Chad. -- Daniel) (I can be around. -- Chad)
  • otrs: alex
  • debmonitor: volans, moritz
  • frimpressions
  • heartbeat
  • iegreview
  • scholarships

Deleted/archived schemas

  • testotrs: alex: kill it with ice and fire
  • testblog: archive it like blog
  • bugzilla_testing: archive it with the rest of bugzillas

Owners (or in many cases just people who volunteer to help with the failover)

m3

Current schemas

  • phabricator_*: 57 schemas to support phabricator itself
  • rt_migration: schema needed for some crons related to phabricator jobs
  • bugzilla_migration: schema needed for some crons related to phabricator jobs
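
If the exact number of Phabricator schemas needs re-checking, a hedged one-liner against the m3 master (the only assumption is the phabricator_ naming pattern):

 # count schemas whose names start with phabricator_
 mysql -e "SELECT COUNT(*) AS phab_schemas FROM information_schema.schemata WHERE schema_name LIKE 'phabricator\_%';"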

Dropped schemas

  • fab_migration

m4

See eventlogging databases.

m5

Current schemas

  • labswiki: schema for wikitech (MediaWiki)
  • striker: schema for toolsadmin.wikimedia.org (Striker)
  • nodepooldb: Nodepool, connections are long/permanently established. Contact: Releng
  • ???: schema(s) for OpenStack

Example failover process

  1. Disable GTID on db1063, connect db2078 and db1001 to db1063 DONE
  2. Disable puppet @db1016, puppet @db1063 DONE
 puppet agent --disable "switchover to db1063"
  3. Merge gerrit: https://gerrit.wikimedia.org/r/420317 and https://gerrit.wikimedia.org/r/420318 DONE
  4. Run puppet and check config on dbproxy1001 and dbproxy1006 DONE
 puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg DONE
  5. Disable heartbeat @db1016 DONE
 killall perl
  6. Set old m1 master in read only DONE
 mysql --skip-ssl -hdb1016 -e "SET GLOBAL read_only=1"
  7. Confirm new master has caught up DONE
 mysql --skip-ssl -hdb1016 -e "select @@hostname; show master status\G show slave status\G"; mysql --skip-ssl -hdb1063 -e "select @@hostname; show master status\G show slave status\G"
  8. Start puppet on db1063 (for heartbeat)
 puppet agent -tv
  9. Switchover proxy master @dbproxy1001 and dbproxy1006 DONE
 systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio DONE
  10. Kill connections DONE
 ? which command was used? It would be nice to document it and put everything on the wiki
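 The exact command is not recorded here; one possible approach (a sketch, not necessarily what was run; 'repl' stands in for the real replication user) is to generate KILL statements for every non-system connection on the old master:
 # sketch: kill client connections on the old master, sparing system/replication users
 mysql --skip-ssl -h db1016 -BN -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE user NOT IN ('system user', 'repl', 'root');" | mysql --skip-ssl -h db1016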
  11. Run puppet on old master @db1016 DONE
 puppet agent -tv
  12. Set new master as read-write and stop slave DONE
 mysql -h db1063.eqiad.wmnet -e "SET GLOBAL read_only=0; STOP SLAVE;"
  13. Check services affected at https://phabricator.wikimedia.org/T189655 DONE
  14. RESET SLAVE ALL on new master DONE
  15. Change old master to replicate from new master (see the sketch after this list) DONE
  16. Update tendril master server id for m1 (no need to change dns) DONE
  17. Patch prometheus, dblists DONE
  18. Create decommissioning ticket for db1016 - https://phabricator.wikimedia.org/T190179
  19. Close T166344
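
Steps 1, 14 and 15 above don't record the exact statements. A minimal sketch of what they might look like in MariaDB, assuming db1063 is the new master and db1016 the old one; the replication user, password and binlog coordinates are placeholders (the real coordinates come from SHOW MASTER STATUS on db1063):

 # step 1 (sketch), on db1063 while it is still a replica: stop replication, disable GTID, resume
 mysql -h db1063.eqiad.wmnet -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE;"
 # step 14 (sketch), on the new master db1063: discard its old replication configuration
 mysql -h db1063.eqiad.wmnet -e "RESET SLAVE ALL;"
 # step 15 (sketch), on the old master db1016: start replicating from the new master
 mysql -h db1016.eqiad.wmnet -e "CHANGE MASTER TO MASTER_HOST='db1063.eqiad.wmnet', MASTER_PORT=3306, MASTER_USER='repl', MASTER_PASSWORD='********', MASTER_LOG_FILE='db1063-bin.000123', MASTER_LOG_POS=4; START SLAVE;"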