These are the current dbs, and what was needed to fail them over:
bacula9: The Bacula metadata database. We make sure there is no backup running at the time so we avoid backup failures. Currently we stop bacula-dir (may require disabling puppet to prevent it from automatically restarting) to make sure no new backups start and potentially fail, as temporarily stopping the director should not have any user impact (see the example after this list). If backups are running, stopping the daemon will cancel the ongoing jobs. Consider rescheduling them (run) if they are important and time-sensitive, otherwise they will be scheduled again automatically at a later time, following configuration. Owners: Jaime, backup: Alex
dbbackups: Database backups metadata. On primary failover it needs a manual update, as it doesn't use the proxy; at the moment connections have to be migrated manually after failover: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668449 Owners: Jaime
etherpadlite: etherpad-lite seems to error out and terminate after the migration. Normally systemd takes care of it and restarts it instantly. However, if the maintenance window takes long enough, systemd will back off and stop trying to restart it, in which case a systemctl restart etherpad-lite will be required. etherpad crashes at least once a week anyway, if not more often, so no big deal; tested by opening a pad. Owners: Alex. Killed idle db connection on failover.
heartbeat: Its writes should stop/start automatically when switching its puppet primary/replica config. Old records will need cleanup after the switch (for Orchestrator); see: MariaDB#Misc_section_failover_checklist_(example_with_m2) Owners: DBAs.
librenms: Required a manual kill of its connections and an apache reload on netmon1001 (see the example after this list). Owners: Netops (Arzhel). Killed idle db connection.
rddmarc: ?
rt: Old ticket manager, kept read-only for reference of contracts/orders, etc. Owners: Daniel; Alex can help. Mostly used by RobH. Required a manual kill of its connections and an apache reload: restarted apache2 on ununpentium to reset connections.
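A minimal sketch of the manual steps mentioned above (host, service and db user names are illustrative assumptions, so check the actual units and grants before running anything). Quiescing Bacula before the failover:
sudo puppet agent --disable "m1 failover"    # keep puppet from restarting the director
sudo systemctl stop bacula-director          # unit may be named bacula-dir instead, depending on the package
Finding and killing idle connections for a given schema user, then forcing the application to reconnect:
sudo mysql -e "SELECT id, user, host, time FROM information_schema.processlist WHERE user = 'librenms'"
sudo mysql -e "KILL 12345"                   # repeat for each stale connection id from the query above
sudo systemctl reload apache2                # on the application host, e.g. netmon1001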
Deleted/archived schemas
bacula: old Bacula database (for Bacula 7.x). Archived into the backups "archive pool"
blog: to archive
bugzilla: to archive/kill; archived and dropped
bugzilla3: same as above; archived and dropped
bugzilla4: same, to archive. Actually, we also have this on dumps.wm.org (https://dumps.wikimedia.org/other/bugzilla/), but that is the sanitized version, so keep this archive just in case, I guess
bugzilla_testing: same as above; archived and dropped
communicate: ? archived and dropped
communicate_civicrm: not fundraising! We're not sure what this is; we can check the users table to determine who administered it. Archived and dropped
dashboard_production: Puppet dashboard db. Never used it in my 3 years here, product sucks. Kill with fire. - alex. Archived and dropped
outreach_civicrm: not fundraising; this is the contacts.wm thing, not used anymore. In turn that means I don't know what "communicate" is then; we can look at the users tables for info on the admin. Archived and dropped
outreach_drupal: kill; archived and dropped
percona: jynus; dropped
puppet: Required a manual kill of its connections; this caused the most puppet spam. Either restart the puppet-masters or kill connections **as soon as** the failover happens. Puppet no longer uses mysql, but its own postgres-backed storage. Was kept for a while for stats/observability. Owner: Alex
query_digests: jynus; archived and dropped
racktables: Migrated to Netbox, which uses Postgres. Finally removed. Owners: DC ops. jmm checked it after the failover; went fine, no problems.
test: archived and dropped
test_drupal: er, kill with fire? Archived and dropped
m2
Current schemas
These are the current dbs, and what was needed to fail them over:
otrs: Normally requires a restart of otrs-daemon and apache on mendelevium. People: arnoldokoth, lsobanski
debmonitor: Normally nothing is required. People: volans, moritz
Django smoothly fails over without any manual intervention.
At most check sudo tail -F /srv/log/debmonitor/main.log on the active Debmonitor host (debmonitor1001 as of Jul. 2019).
Some failed writes logged with HTTP/1.1 500 and a stacktrace like django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement') are expected, followed by the resume of normal operations with most write operations logged as HTTP/1.1 201.
In case of issues it's safe to try a restart by running: sudo systemctl restart uwsgi-debmonitor.service
heartbeat: Its writes should stop/start automatically when switching its puppet primary/replica config. Old records will need cleanup after the switch (for Orchestrator); see: MariaDB#Misc_section_failover_checklist_(example_with_m2) Owners: DBAs.
xhgui: performance team
excimer: performance team
recommendationapi: k8s service, nothing required, should "just work". People: akosiaris, only user is Android application.
sockpuppet: Sockpuppet detection service (also known as the similar-users service). The PySpark model currently generates the CSV files and the application needs to be restarted to reload them. Ideally the process that creates these files would simply update the database in place. https://phabricator.wikimedia.org/T268505. People: Hnowlan
mwaddlink (https://phabricator.wikimedia.org/T267214): The Link Recommendation Service is an application hosted on Kubernetes with an API accessible via HTTP. It responds to a POST request containing the wikitext of an article with a structured response of link recommendations for that article. It does not have caching or storage; the client (MediaWiki) is responsible for that. A MySQL table per wiki is used for caching the actual link recommendations (task T261411); each row contains serialized link recommendations for a particular article. https://wikitech.wikimedia.org/wiki/Add_Link . People: kostajh
dbproxies will need a reload (systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio). You can check which proxy is active with:
host m2-master.eqiad.wmnet
The passive proxy can be checked by running grep -iR m2 hieradata/hosts/* in the puppet repo (see the sketch below).
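A minimal verification sketch after the failover; the stats field positions below are an assumption about the haproxy CSV output, so adapt the filtering as needed:
host m2-master.eqiad.wmnet                                                            # should resolve to the active proxy
sudo systemctl reload haproxy                                                         # on the proxy host(s), to pick up the new primary
echo "show stat" | sudo socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18      # pxname, svname, status: backends should report UP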
Deleted/archived schemas
testotrs: alex: kill it with ice and fire
testblog: archive it like blog
bugzilla_testing: archive it with the rest of bugzillas
reviewdb + reviewdb-test (deprecated & deleted): Gerrit: Normally needs a restart on gerrit1001 just in case. People: akosiaris, hashar
m3
Current schemas
phabricator_*: 57 schemas to support phabricator itself
rt_migration: schema needed for some crons related to phabricator jobs
bugzilla_migration: schema needed for some crons related to phabricator jobs
heartbeat: Its writes should stop/start automatically when switching its puppet primary/replica config. Old records will need cleanup after the switch (for Orchestrator); see: MariaDB#Misc_section_failover_checklist_(example_with_m2) and the sketch below. Owners: DBAs.
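A rough illustration of the heartbeat record cleanup mentioned in each section; this is not the authoritative procedure, follow the MariaDB misc failover checklist linked above. On the new primary, stale rows written by the old primary can be identified and removed by server_id:
sudo mysql heartbeat -e "SELECT server_id, ts FROM heartbeat"
sudo mysql heartbeat -e "DELETE FROM heartbeat WHERE server_id != @@server_id"   # only after confirming which server_id belongs to the new primary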