Portal:Data Services/Admin/OSMDB
Jump to navigation
Jump to search
This page is currently a draft. More information and discussion about changes to this draft on the talk page. |
DON'T TRY THIS YET! This is a first pass.
The following is documentation for WMCS and OSM database administrators to be able to perform operations on the Postgresql database servers that make up the OSMDB service in Cloud Services.
Failing over OSMDB for maintenance
Most of the data on these instances is all replicated from the upstream OSM databases. However, there are a few writeable schemas that users have access to and may have no other backups of. Therefore, it is a good idea to fail things over and treat the data with some care. We have instituted no systems for automatic failover, and postgresql doesn't have a native one, so we must do things manually using the STONITH (Shoot The Other Node In The Head) method.
- With that in mind, the best practice is to first announce the maintenace and that things will briefly read-only and possible unavailable for a few moments.
- Disable puppet on both servers
- Change the DNS entry for osm.db.svc.eqiad.wmflabs wherever that is
- Once most things are pointing at the read-only replica, stop postgresql on the primary server with
systemctl stop postgresql
. Make sure it's stopped! If postgresql is still showing up inps
commands, try waiting for it to stop. The last resort iskill -9
, but really try not to do that. - On the replica, become root with
sudo -i
and runsu postgres -c '/usr/lib/postgresql/9.6/bin/pg_ctl promote -D /srv/postgres/9.6/main'
. Check /var/log/postgresql/postgresql-9.6-main.log for messages that it restarted as the master and is accepting queries. - In this kind of arrangement postgresql doesn't have a beautiful way of demoting a primary to a standby server. The pg_rewind tool is not going to work because we don't have the right configuration. Therefore, you are going to have to rebuild the database of the primary to fail back or to remain in this configuration with replication enabled.
- On the old primary, you need to delete the database or move it aside and then run
sudo -u postgres pg_basebackup -D /srv/postgres/9.6/main/ -h clouddb1004.clouddb-services.eqiad.wmflabs --checkpoint=fast --xlog-method=stream -U replication -W
and enter the replication password that you carefully extracted from the recovery.conf file at /srv/postgres/9.6/main/recovery.conf.
- On the old primary, you need to delete the database or move it aside and then run
- If everything is staying in this configuration, swap the puppet roles and enable puppet. If not, you should fail back once things are done by repeating the entire process.