
There are 5 "miscellaneous" shards: m1-m5.

  • m1: Basic ops utilities
  • m2: otrs, gerrit and others
  • m3: phabricator and other older task systems
  • m5: openstack and other labs-related dbs

On the last cleanup, many unused databases were archived and/or deleted, and a contact person was identified for each of them.

Section descriptions


m1

Current schemas

These are the current dbs, and what was needed to fail them over:

  • bacula9: sudo service bacula-director restart after the migration. I had already made sure no jobs were running with status director. Tested afterwards with a list media
  • bacula: Nothing
  • etherpadlite: etherpad-lite seems to error out and terminate after the migration. Normally systemd takes care of it and restarts it instantly; however, if the maintenance window takes long enough, systemd will give up and stop trying to restart it, in which case a systemctl restart etherpad-lite will be required. etherpad crashes at least once a week anyway, so no big deal. Tested by opening a pad
  • heartbeat: needs "manual migration"- change master role on puppet
  • librenms: required manual kill of its connections; @netmon1001: apache reload
  • puppet: required manual kill of its connections. This caused the most puppet spam: either restart the puppet-masters or kill the connections as soon as the failover happens (see the connection-kill sketch after this list).
  • racktables: went fine, no problems
  • rt: required manual kill of its connections; @ununpentium: apache reload
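
Several of the entries above (and step 10 of the example failover process below) amount to killing client connections that are still attached to the old master. A minimal sketch of one way to do that, assuming the old master is db1016 and that the librenms user is the one to disconnect; both are illustrative placeholders, not the documented procedure:

 mysql --skip-ssl -hdb1016 -e "SHOW PROCESSLIST"
 # kill one lingering connection by the Id shown in the processlist (12345 is a placeholder)
 mysql --skip-ssl -hdb1016 -e "KILL 12345"
 # or generate KILL statements for every connection of a given application user
 mysql --skip-ssl -hdb1016 -N -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE user='librenms'" | mysql --skip-ssl -hdb1016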

Deleted/archived schemas

  • reviewdb: not really on m1 anymore (it was migrated to m2). To delete.
  • blog: to archive
  • bugzilla: to archive (kill); archived and dropped
  • bugzilla3: idem (kill); archived and dropped
  • bugzilla4: idem, archive; actually we also have a copy of this, but that one is the sanitized version, so keep this archive just in case, I guess
  • bugzilla_testing: idem (kill); archived and dropped
  • communicate: ?; archived and dropped
  • communicate_civicrm: not fundraising! We're not sure what this is; we can check the users table to determine who administered it; archived and dropped
  • dashboard_production: Puppet dashboard db. Never used it in my 3 years here, product sucks. Kill with fire. - alex; archived and dropped
  • outreach_civicrm: not fundraising, this is the contacts.wm thing, not used anymore, but in turn it means I don't know what "communicate" is then; we can look at the users tables for info on the
  • admin: archived and dropped
  • outreach_drupal: kill; archived and dropped
  • percona: jynus; dropped
  • query_digests: jynus; archived and dropped
  • test: archived and dropped
  • test_drupal: er, kill with fire? kill; archived and dropped

Owners (or in many cases just people who volunteered to help with the failover)

  • bacula9, bacula: Jaime
  • etherpadlite: Alex. Killed idle db connection.
  • heartbeat: will be handled as part of the failover process by DBAs
  • librenms: Arzhel. Killed idle db connection.
  • puppet: Alex
  • racktables: jmm
  • rt: Daniel, alex can help. Restarted apache2 on ununpentium to reset connections.


m2

Current schemas

These are the current dbs, and what was needed to fail them over:

  • reviewdb (Gerrit): normally needs a restart on gerrit1001 just in case. People: akosiaris, hashar
  • otrs: Normally requires restart of otrs-daemon, apache on mendelevium. People: akosiaris
  • debmonitor: Normally nothing is required. People: volans, moritz
    • Django smoothly fails over without any manual intervention.
    • At most check sudo tail -F /srv/log/debmonitor/main.log on the active Debmonitor host (debmonitor1001 as of Jul. 2019).
      • Some failed writes logged with HTTP/1.1 500 and a stacktrace like django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement') are expected, followed by the resume of normal operations with most write operations logged as HTTP/1.1 201.
    • In case of issues it's safe to try a restart performing: sudo systemctl restart uwsgi-debmonitor.service
  • heartbeat: Nothing required
  • recommendation-api: Normally requires a restart on scb. People: akosiaris
  • iegreview: Shared nothing PHP application; should "just work". People: bd808, Niharika
  • scholarships: Shared nothing PHP application; should "just work". People: bd808, Niharika
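
A minimal sketch of the restarts mentioned above, run on the respective hosts after the failover; apart from uwsgi-debmonitor.service, the systemd unit names are assumptions based on the service names given here, not confirmed:

 # on gerrit1001: restart Gerrit (unit name assumed)
 sudo systemctl restart gerrit
 # on mendelevium: restart the OTRS daemon and apache (unit names assumed)
 sudo systemctl restart otrs-daemon apache2
 # on the active Debmonitor host, only if the log keeps showing errors
 sudo systemctl restart uwsgi-debmonitor.service
 # on scb: restart recommendation-api (unit name not documented here)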

The dbproxies will need a reload (systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio). You can check which proxy is the active one with:

host m2-master.eqiad.wmnet

The passive one can be checked by running grep -iR m2 hieradata/hosts/* in the puppet repo.
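
To see which backend haproxy itself considers up after the reload, the CSV printed by "show stat" can be filtered; a sketch, assuming field 18 is the status column as in current haproxy versions:

 echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18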

Deleted/archived schemas

  • testotrs: alex: kill it with ice and fire
  • testblog: archive it like blog
  • bugzilla_testing: archive it with the rest of bugzillas

Owners (or in many cases just people who volunteered to help with the failover)

  • reviewdb: Daniel, Chad, Akosiaris on SRE side
  • otrs: Akosiaris
  • heartbeat: DBA
  • debmonitor: volans, moritzm
  • recommendationapi: bmansurov, #Research on Phabricator. Akosiaris on SRE side


m3

Current schemas

  • phabricator_*: 57 schemas to support phabricator itself
  • rt_migration: schema needed for some crons related to phabricator jobs
  • bugzilla_migration: schema needed for some crons related to phabricator jobs

Dropped schemas

  • fab_migration


m5

Current schemas

  • labswiki: schema for wikitech (MediaWiki)
  • striker: schema for Striker
  • nodepooldb: Nodepool; connections are long/permanently established. Contact: Releng
  • ???: schema(s) for OpenStack

Example Failover process

  1. Disable GTID on db1063, connect db2078 and db1001 to db1063 DONE
  2. Disable puppet @db1016 and @db1063 DONE
     puppet agent --disable "switchover to db1063"
  3. Merge gerrit: and DONE
  4. Run puppet and check config on dbproxy1001 and dbproxy1006 DONE
     puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
  5. Disable heartbeat @db1016 DONE
     killall perl
  6. Set the old m1 master in read only DONE
     mysql --skip-ssl -hdb1016 -e "SET GLOBAL read_only=1"
  7. Confirm the new master has caught up DONE
     mysql --skip-ssl -hdb1016 -e "select @@hostname; show master status\G show slave status\G"; mysql --skip-ssl -hdb1063 -e "select @@hostname; show master status\G show slave status\G"
  8. Start puppet on db1063 (for heartbeat)
     puppet agent -tv
  9. Switchover proxy master @dbproxy1001 and dbproxy1006 DONE
     systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  10. Kill remaining connections DONE
     ? which command was used - it would be nice to document it and put everything on the wiki (a hedged sketch is given after the m1 schema list above)
  11. Run puppet on the old master @db1016 DONE
     puppet agent -tv
  12. Set the new master as read-write and stop the slave DONE
     mysql -h db1063.eqiad.wmnet -e "SET GLOBAL read_only=0; STOP SLAVE;"
  13. Check services affected at DONE
  14. RESET SLAVE ALL on the new master DONE (a sketch of steps 14-15 follows after this list)
  15. Change the old master to replicate from the new master DONE
  16. Update the tendril master server id for m1 (no need to change dns) DONE
  17. Patch prometheus, dblists DONE
  18. Create decommissioning ticket for db1016 -
  19. Close T166344
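
Steps 14 and 15 are not spelled out above. A minimal sketch of what they could look like, assuming db1063 is the new master and db1016 the old one; the replication user, password and binlog coordinates are illustrative placeholders (the coordinates would come from SHOW MASTER STATUS on the new master), not values from this failover:

 # on the new master: drop its old replication configuration
 mysql --skip-ssl -hdb1063 -e "RESET SLAVE ALL"
 # on the old master: point it at the new master and start replicating
 mysql --skip-ssl -hdb1016 -e "CHANGE MASTER TO MASTER_HOST='db1063.eqiad.wmnet', MASTER_USER='<repl user>', MASTER_PASSWORD='<password>', MASTER_LOG_FILE='<binlog file>', MASTER_LOG_POS=<position>; START SLAVE;"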