Maps-migration

This page describes lessons learned and the steps taken when migrating the Maps servers.

Intro

In early 2019 we migrated maps from Debian Jessie to Debian Stretch (phab:T198622). The following lessons were learned to help with other migrations.

Steps taken for stretch migration

  • downtime the maps server (a sketch of these preparation steps follows this list)
  • disable puppet
  • depool
  • merge the puppet patch, e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486062/
  • run puppet on the other maps nodes, e.g. if maps1001 is being migrated, run puppet on maps100[234] to ensure that iptables rules are updated and maps1001 is excluded from the old cluster and can join the new one
  • run puppet on the install server (for the new stretch DHCP configuration)
  • reimage (make sure it isn't pooled after reimage)
    wmf-auto-reimage --phab-task-id T198622 --mask tilerator --conftool maps1001.eqiad.wmnet
  • mask tilerator until we can check that cassandra / postgresql are up to date
  • check that the maps1001 postgres slave is initialized, otherwise run:
sudo systemctl stop postgresql@9.6-main.service
sudo rm -r /srv/postgresql/9.6/main/*
sudo -u postgres /usr/bin/pg_basebackup -X stream -D /srv/postgresql/9.6/main -h <new_maps_master fqdn> -U replication -w && sudo systemctl restart postgresql@9.6-main.service
  • While the above command is running in the background, make sure recovery.conf is created with the necessary parameters in /srv/postgresql/9.6/main (an example is sketched after this list)
  • run puppet to create recovery.conf file
  • restart postgresql@9.6-main
  • check that cassandra joined the cluster
  • unmask tilerator
  • check that tilerator / kartotherian are working correctly (see the verification sketch after this list)
  • increase cassandra replication factor
    • SELECT * FROM system.schema_keyspaces; # to check the current replication factor
    • ALTER KEYSPACE "v4" WITH replication = {'class':'SimpleStrategy', 'replication_factor':3}; # for the v4 keyspace, make sure replication_factor is equal to the number of nodes
    • ALTER KEYSPACE "system_auth" WITH replication = {'class':'SimpleStrategy', 'replication_factor':2}; # for the system_auth keyspace, the replication factor should be total_nodes - 1
  • run nodetool repair on all maps nodes, sequentially
  • (nodetool repair can continue in the background while we continue with the next steps)
  • pool
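
A minimal sketch of the preparation steps (downtime, disabling puppet, depooling, and the puppet runs on the remaining nodes and the install server), assuming the usual WMF wrappers (icinga-downtime, disable-puppet, depool, run-puppet-agent) and cumin are available; hostnames, the cumin alias and the downtime duration are illustrative:

# On the icinga host: downtime the host being migrated
sudo icinga-downtime -h maps1001 -d 7200 -r "maps stretch migration - T198622"

# On maps1001: stop puppet and take the host out of the LVS pools
sudo disable-puppet "maps stretch migration - T198622"
sudo depool

# From a cluster management host, after the puppet patch has been merged:
# refresh iptables rules on the remaining maps nodes so maps1001 is excluded
sudo cumin 'maps100[2-4].eqiad.wmnet' 'run-puppet-agent'

# ...and pick up the new DHCP configuration on the install server
# (the 'A:installserver' alias is an assumption)
sudo cumin 'A:installserver' 'run-puppet-agent'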
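
For the recovery.conf step, a hedged example of what should end up in /srv/postgresql/9.6/main/recovery.conf on a PostgreSQL 9.6 streaming replica; the exact connection parameters (password handling, ssl options, application_name) depend on the puppet postgres role, so treat the values below as placeholders:

# recovery.conf for a PostgreSQL 9.6 streaming replica
standby_mode = 'on'
primary_conninfo = 'host=<new_maps_master fqdn> port=5432 user=replication password=<replication password>'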
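
A sketch of the post-reimage checks (cassandra membership, unmasking tilerator, basic service health); the /_info endpoint and the 6533/6534 ports are assumptions based on the usual service-runner defaults for kartotherian and tilerator:

# The reimaged node should show up as UN (Up/Normal) in the cluster
nodetool status

# Once cassandra and postgresql are confirmed to be up to date,
# unmask and start tilerator
sudo systemctl unmask tilerator
sudo systemctl start tilerator

# Basic health checks on the reimaged node
curl -s http://localhost:6533/_info   # kartotherian
curl -s http://localhost:6534/_info   # tilerator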

Things to note for cassandra

Increasing the replication_factor for the system_auth keyspace causes downtime, because privileges are cleared and clients accessing cassandra can no longer read or write. See https://wikitech.wikimedia.org/wiki/Incident_documentation/20190122-maps. So when doing this, make sure the servers in the cluster are depooled and downtimed, and immediately run the following commands to restore privileges:

GRANT SELECT ON ALL KEYSPACES to kartotherian;
GRANT SELECT ON ALL KEYSPACES to tilerator;
GRANT MODIFY ON ALL KEYSPACES to tilerator;
GRANT CREATE ON ALL KEYSPACES to tilerator;
GRANT MODIFY ON ALL KEYSPACES to tileratorui;
GRANT SELECT ON ALL KEYSPACES to tileratorui;
GRANT CREATE ON ALL KEYSPACES to tileratorui;
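
A sketch of how these grants can be re-applied, assuming the cassandra superuser credentials are at hand; the user name, password and contact host below are placeholders:

# Connect as the cassandra superuser and paste the GRANT statements above
cqlsh -u cassandra -p '<superuser password>' "$(hostname -f)"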