Orchestrator

You may also be looking for the WikiFunctions function-orchestrator.

Orchestrator is a service for managing MySQL cluster replication. The data-persistence SRE team currently hosts a read-only deployment of it within WMF, replacing the old Tendril/Dbtree.

Operations

Current deployment

It's reachable from the public internet at https://orchestrator.wikimedia.org/ (access requires an NDA).

It runs on dborch1001.eqiad.wmnet, a Ganeti VM. Its backend database is named orchestrator, and it lives on db2093.codfw.wmnet (the db_inventory node in codfw, see T266003 for the background).

Adding a section to orchestrator

Deploy the orchestrator grants to the section (modules/role/templates/mariadb/grants/orchestrator.sql.erb in the puppet repo). This should be done on the active DC's primary instance, and also on both DCs' sanitarium hosts.
Clean up the heartbeat table so that there are no stale entries.
1. E.g. run this against all instances individually: set session sql_log_bin=0; delete from heartbeat where server_id=171974662 limit 1
Add the primary instance to orchestrator.
1. Ssh to the dborch node, and run sudo orchestrator -c discover -i FQDN
2. N.B. it needs to be the FQDN of the instance.
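
The heartbeat cleanup and discovery steps above can be sketched as a small shell helper. This is only an illustrative dry run: the function names are hypothetical, the server_id is the example value from above, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/bash
# Hypothetical helpers: they only print the commands described above.

# Print the SQL that removes one stale heartbeat row without writing to the binlog.
heartbeat_cleanup_sql() {
    local server_id="${1:?server_id required}"
    printf 'set session sql_log_bin=0; delete from heartbeat where server_id=%s limit 1\n' "$server_id"
}

# Print the discovery command for a primary; refuse bare hostnames,
# since orchestrator needs the FQDN of the instance.
discover_cmd() {
    local fqdn="${1:?FQDN required}"
    case "$fqdn" in
        *.*) printf 'sudo orchestrator -c discover -i %s\n' "$fqdn" ;;
        *)   echo "not an FQDN: $fqdn" >&2; return 1 ;;
    esac
}

heartbeat_cleanup_sql 171974662
discover_cmd db1103.eqiad.wmnet
```

Rejecting names without a dot is a deliberately crude FQDN check; it is only meant to catch the common mistake of passing a short hostname.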

Upgrading orchestrator

Orchestrator automatically deploys schema changes when it gets upgraded. It tracks these in the orchestrator_db_deployments table. On startup it checks whether the current version number is in that table, and if not it performs all schema changes. It will not detect if a later version has been deployed. This means that we need a full backup of the orchestrator database before doing an upgrade, as otherwise we have no way to roll back.

On dborch1001:

Update apt, so the new package is available: sudo apt update
Stop orchestrator: sudo systemctl stop orchestrator
Take a backup of the orchestrator backend database: sudo mysqldump --defaults-file=/etc/mysql/orchestrator_srv.cnf --ssl -h db2093.codfw.wmnet orchestrator > orchestrator.sql.$(date +"%Y-%m-%d")
Upgrade the orchestrator packages: sudo apt install orchestrator orchestrator-client

Test that the orchestrator binary works from the cmdline:

$ sudo orchestrator -c clusters-alias
2021-10-14 14:03:12 DEBUG Connected to orchestrator backend: orchestrator_srv:?@tcp(db2093.codfw.wmnet:3306)/orchestrator?timeout=1s
2021-10-14 14:03:12 DEBUG Orchestrator pool SetMaxOpenConns: 128
2021-10-14 14:03:12 DEBUG Initializing orchestrator
2021-10-14 14:03:12 INFO Connecting to backend db2093.codfw.wmnet:3306: maxConnections: 128, maxIdleConns: 32
db1103.eqiad.wmnet:3306	x1
db1104.eqiad.wmnet:3306	s8
db1107.eqiad.wmnet:3306	m3
...

Start orchestrator: sudo systemctl start orchestrator

Test that orchestrator-client works:

$ orchestrator-client -c clusters-alias
db1103.eqiad.wmnet:3306,x1
db1104.eqiad.wmnet:3306,s8
db1107.eqiad.wmnet:3306,m3
...

Test that the web UI works.

If a rollback is needed, unfortunately there's no good story. You need to have (or rebuild) the previous version of the orchestrator packages, and upload them to apt.wm.o, and go from there.
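
Because the database dump taken before the upgrade is the only way back to the old schema, it may be worth a quick sanity check before installing the new packages. The sketch below assumes mysqldump ran with its default comment settings, in which case the file ends with a "-- Dump completed" line; the function name is hypothetical.

```shell
#!/bin/bash
# Return success only if the dump file exists, is non-empty, and ends with
# mysqldump's completion marker (written unless comments are disabled).
dump_looks_complete() {
    local dump="${1:?dump file required}"
    [ -s "$dump" ] || return 1
    tail -n 1 "$dump" | grep -q '^-- Dump completed'
}
```

For example, after taking the backup: dump_looks_complete orchestrator.sql.$(date +"%Y-%m-%d") || echo "backup looks incomplete, do not upgrade". This only detects a truncated or empty dump; it does not prove the dump can be restored.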

Packaging

Updating orchestrator packaging to a new upstream version

Check out the orchestrator package repo: https://gerrit.wikimedia.org/r/admin/repos/operations/debs/orchestrator
On the master branch, run ./debian/repack v$VER. Note the leading v in the upstream version number. This will create a tarball in the current directory.
Move the tarball out of the git working dir: mv orchestrator_$VER.orig.tar.xz ..
Import it: gbp import-orig ../orchestrator_$VER.orig.tar.xz. This will add a commit to the upstream branch and a new upstream/$VER tag referencing it. It will then merge the new upstream branch into master.
Push these new branches directly to gerrit, as they are not reviewable:
1. git checkout upstream; git push; git push origin upstream/$VER
2. git checkout master; git push
Create a debian changelog entry for the new version: dch -D bullseye-wikimedia --force-distribution -v $VER-1. If you forget to do this, trying to build a package will fail horribly with dpkg-source: error: unrepresentable changes to source
Test building the package to make sure that still works, and then send a CR for review with your changes.
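
The version strings in the steps above are easy to mix up: the argument to repack carries a leading v, while the tarball, the upstream tag and the Debian version do not. A minimal sketch of the naming, using a hypothetical helper and an example version number:

```shell
#!/bin/bash
# Print the names derived from a bare upstream version such as 3.2.6.
packaging_names() {
    local ver="${1:?upstream version required}"
    ver="${ver#v}"    # tolerate an accidental leading v
    printf 'repack argument: v%s\n' "$ver"
    printf 'orig tarball:    orchestrator_%s.orig.tar.xz\n' "$ver"
    printf 'upstream tag:    upstream/%s\n' "$ver"
    printf 'debian version:  %s-1\n' "$ver"
}

packaging_names 3.2.6
```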

Creating a new orchestrator release

You will need a gpg key to sign the new release. git tag will prompt you for your gpg password when creating the new tag.

For simplicity, set 2 environment variables in your shell: $DEBVER for the new release you're creating, and $OLDDEBVER for the previous release. E.g.: DEBVER=3.2.6-1; OLDDEBVER=3.2.3-3
Add/update a debian changelog entry for $DEBVER. Send a CR for review for any changes.
1. If it doesn't already exist, create it with dch -D bullseye-wikimedia --force-distribution -v ${DEBVER:?}
Create a git tag for the release, and populate it with changes made since the last release: git tag -s -a -F <(echo orchestrator ${DEBVER:?}; echo; git log --no-decorate --oneline debian/${OLDDEBVER:?}..) -e debian/${DEBVER:?}. This will prompt you for a gpg password to sign the tag with.
Check that the new tag looks good: git show debian/${DEBVER:?}
Push the tag to the upstream repo: git push origin debian/${DEBVER:?}
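
The ${DEBVER:?} form used in these commands makes the shell abort with an error if the variable is unset or empty, which avoids accidentally creating or pushing a tag named just debian/. A small sketch of that behaviour, with a hypothetical helper that only prints the tag name and the log range:

```shell
#!/bin/bash
# Print the tag name and the git log range for a release.
# ${VAR:?} aborts the (sub)shell if VAR is unset or empty.
release_refs() {
    printf 'tag:   debian/%s\n' "${DEBVER:?}"
    printf 'range: debian/%s..\n' "${OLDDEBVER:?}"
}

# Run in subshells so a failed expansion does not kill the calling shell.
( DEBVER=3.2.6-1; OLDDEBVER=3.2.3-3; release_refs )
( unset DEBVER; OLDDEBVER=3.2.3-3; release_refs ) 2>/dev/null || echo "refused: DEBVER is not set"
```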

Building orchestrator packages

NOTE: you must build with golang 1.14 only! Because of this version requirement, it's not currently possible to build orchestrator on the standard build host. Until the issue "Puppet host certs do not contain Subject Alt Name entries" is fixed, or a workaround is implemented, we're limited to golang 1.14: Go stopped verifying the certificate common name by default in 1.15, and so requires a subject alt name.

Check out the orchestrator package repo: https://gerrit.wikimedia.org/r/admin/repos/operations/debs/orchestrator
Install the following prerequisites:
1. sudo apt install devscripts debhelper dh-golang
2. Install golang 1.14. Using buster backport packages on bullseye was the only way to make dh_golang happy.
Build with debclean -d && debuild -d -us -uc (-d is needed to work around the fact that the build requirement on golang 1.14 isn't being satisfied by a debian package).
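
Since a build with a newer Go toolchain would produce a binary that rejects the Puppet host certificates, a pre-build guard can help. This sketch parses the output of go version; the function name is hypothetical and the check is deliberately limited to the 1.14 series.

```shell
#!/bin/bash
# Succeed only if the given `go version` output reports a 1.14 or 1.14.x toolchain.
is_go_1_14() {
    case "$1" in
        "go version go1.14 "*|"go version go1.14."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example guard before building; an absent go binary yields an empty string,
# which is treated the same as a wrong version.
if is_go_1_14 "$(go version 2>/dev/null)"; then
    echo "golang 1.14 found"
else
    echo "golang 1.14 not found, do not build"
fi
```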

Uploading new orchestrator packages

This is a simplified version of Debian Packaging#Upload to Wikimedia Repo.

On apt1001: mkdir -p ~/orchestrator && rm -f ~/orchestrator/*.changes
From your build dir on your local machine: scp ../*changes ../*deb ../*dsc apt1001.eqiad.wmnet:orchestrator/
Back on apt1001: cd ~/orchestrator && sudo -i reprepro -C main include bullseye-wikimedia $PWD/*.changes
In #wikimedia-operations on irc: !log uploaded orchestrator $VERSION packages to apt.wm.o (bullseye) TXXXXXX

Troubleshooting

Entry in hostname_resolve that maps to a bare hostname

+--------------------+--------------------+---------------------+
| hostname           | resolved_hostname  | resolved_timestamp  |
+--------------------+--------------------+---------------------+
| pc1008.eqiad.wmnet | pc1008             | 2020-11-18 10:11:58 |
+--------------------+--------------------+---------------------+

This can cause a 'ghost' cluster to appear, containing the bare-hostname version of the host. To fix this:

systemctl stop orchestrator
orchestrator -c forget -i <instance> for all instances in the ghost cluster
orchestrator -c reset-hostname-resolve-cache
systemctl start orchestrator

Stopping orchestrator is required to stop it from reinserting the bad entry into hostname_resolve.

The entries can be queried via orchestrator -c show-resolve-hosts
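
To spot such entries, one option is to filter hostname/resolved-hostname pairs for resolved names that contain no dot. The sketch below assumes whitespace-separated input with the hostname in the first column and the resolved hostname in the second (for example, from a query against the hostname_resolve table); it makes no assumption about the exact output format of show-resolve-hosts, and the function name is hypothetical.

```shell
#!/bin/bash
# Read "hostname resolved_hostname" pairs on stdin and print the pairs
# whose resolved hostname has no dot, i.e. is a bare hostname.
find_bare_resolves() {
    awk 'NF >= 2 && $2 !~ /\./ { print $1, $2 }'
}

printf '%s\n' \
    'pc1008.eqiad.wmnet pc1008' \
    'db1103.eqiad.wmnet db1103.eqiad.wmnet' | find_bare_resolves
# prints: pc1008.eqiad.wmnet pc1008
```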

There appears to be 'fake lag'

If Orchestrator shows lag but replication seems to be working normally, the most likely cause is leftover records in the pt-heartbeat table from a previous, different topology. Cleanup of these records may have been skipped after a switchover/failover. Check the Orchestrator references for the post-switchover steps.

Common day to day operations

Moving replicas around

The safest option (or the dashboard on Smart mode - which is the one we use by default):

orchestrator -c relocate

The relocate command accepts any valid destination and figures out the best way to move a replica: if GTID is enabled, it uses GTID; if Pseudo-GTID is available, it uses that.
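
As an illustration, relocating a replica below a new parent might look like the following. The -i flag names the instance to move and -d its destination; the helper name and both hostnames are placeholders, and the command is only printed, not run.

```shell
#!/bin/bash
# Hypothetical helper: print (do not execute) a relocate command.
relocate_cmd() {
    local replica="${1:?replica required}" dest="${2:?destination required}"
    printf 'orchestrator -c relocate -i %s -d %s\n' "$replica" "$dest"
}

relocate_cmd replica1.example.org:3306 parent1.example.org:3306
```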

The following two methods ARE NOT preferred ways of moving replicas around, better to use relocate.

move-up
move-down

The following commands should not be used unless we are not worried about data integrity (or we stop replication on the involved hosts at the same time)

Classic file:pos relocation:
* move-up                                 Move a replica one level up the topology
* move-up-replicas                        Moves replicas of the given instance one level up the topology
* move-below                              Moves a replica beneath its sibling. Both replicas must be actively replicating from same master.
* move-equivalent                         Moves a replica beneath another server, based on previously recorded "equivalence coordinates"
* repoint                                 Make the given instance replicate from another instance without changing the binlog coordinates. Use with care
* repoint-replicas                        Repoint all replicas of given instance to replicate back from the instance. Use with care
* take-master                             Turn an instance into a master of its own master; essentially switch the two.
* make-co-master                          Create a master-master replication. Given instance is a replica which replicates directly from a master.
* get-candidate-replica                   Information command suggesting the most up-to-date replica of a given instance that is good for promotion

External link

https://wikimediastatus.net