Maps-migration

This page describes lessons learned and the steps taken when migrating the Maps servers.

Intro

In early 2019 we migrated maps from Debian Jessie to Debian Stretch (phab:T198622). The following lessons were learned to help with other migrations.

Steps taken for stretch migration

  • downtime the maps server (a sketch of these preparation steps follows this list)
  • disable puppet
  • depool
  • merge the puppet patch, e.g. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486062/
  • run puppet on the other maps nodes, e.g. if maps1001 is being migrated, run puppet on maps100[234] to ensure that iptables rules are updated and maps1001 is excluded from the old cluster and can join the new one
  • run puppet on the install server (for the new stretch DHCP configuration)
  • reimage (make sure it isn't pooled after reimage)
    wmf-auto-reimage --phab-task-id T198622 --mask tilerator --conftool maps1001.eqiad.wmnet
  • mask tilerator until we can check that cassandra / postgresql are up to date
  • check that the maps1001 postgres slave is initialized, otherwise run:
sudo systemctl stop postgresql@9.6-main.service
sudo rm -r /srv/postgresql/9.6/main/*
sudo -u postgres /usr/bin/pg_basebackup -X stream -D /srv/postgresql/9.6/main -h <new_maps_master fqdn> -U replication -w && sudo systemctl restart postgresql@9.6-main.service
  • While the above command is running in the background, make sure recovery.conf is created with the necessary parameters in /srv/postgresql/9.6/main (an example is sketched after this list)
  • run puppet to create recovery.conf file
  • restart postgresql@9.6-main
  • check that cassandra joined the cluster
  • unmask tilerator
  • check that tilerator / kartotherian are working correctly (see the verification sketch after this list)
  • increase cassandra replication factor
    • SELECT * FROM system.schema_keyspaces; # to check the current replication factor
    • ALTER KEYSPACE "v4" WITH replication = {'class':'SimpleStrategy', 'replication_factor':3}; # for the v4 keyspace, make sure replication_factor is equal to the number of nodes
    • ALTER KEYSPACE "system_auth" WITH replication = {'class':'SimpleStrategy', 'replication_factor':2}; # for the system_auth keyspace, the replication factor should be total_nodes - 1
  • run nodetool repair on all maps nodes, sequentially
  • (nodetool repair can continue in the background while we continue with the next steps)
  • pool
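
A minimal sketch of the preparation steps (downtime, disabling puppet, depooling, and the puppet runs on the remaining nodes and the install server), assuming the usual WMF wrappers (icinga-downtime, disable-puppet, depool, run-puppet-agent) and cumin are available; hostnames, the cumin alias and the downtime duration are illustrative:

# On the icinga host: downtime the host being migrated
sudo icinga-downtime -h maps1001 -d 7200 -r "maps stretch migration - T198622"

# On maps1001: stop puppet and take the host out of the LVS pools
sudo disable-puppet "maps stretch migration - T198622"
sudo depool

# From a cluster management host, after the puppet patch has been merged:
# refresh iptables rules on the remaining maps nodes so maps1001 is excluded
sudo cumin 'maps100[2-4].eqiad.wmnet' 'run-puppet-agent'

# ...and pick up the new DHCP configuration on the install server
# (the 'A:installserver' alias is an assumption)
sudo cumin 'A:installserver' 'run-puppet-agent'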
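
For the recovery.conf step, a hedged example of what should end up in /srv/postgresql/9.6/main/recovery.conf on a PostgreSQL 9.6 streaming replica; the exact connection parameters (password handling, ssl options, application_name) depend on the puppet postgres role, so treat the values below as placeholders:

# recovery.conf for a PostgreSQL 9.6 streaming replica
standby_mode = 'on'
primary_conninfo = 'host=<new_maps_master fqdn> port=5432 user=replication password=<replication password>'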
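
A sketch of the post-reimage checks (cassandra membership, unmasking tilerator, basic service health); the /_info endpoint and the 6533/6534 ports are assumptions based on the usual service-runner defaults for kartotherian and tilerator:

# The reimaged node should show up as UN (Up/Normal) in the cluster
nodetool status

# Once cassandra and postgresql are confirmed to be up to date,
# unmask and start tilerator
sudo systemctl unmask tilerator
sudo systemctl start tilerator

# Basic health checks on the reimaged node
curl -s http://localhost:6533/_info   # kartotherian
curl -s http://localhost:6534/_info   # tilerator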

Things to note for cassandra

Increasing the replication_factor for the system_auth keyspace causes downtime, because privileges are cleared and clients accessing cassandra can no longer read or write. See https://wikitech.wikimedia.org/wiki/Incident_documentation/20190122-maps. So when doing this, make sure the servers in the cluster are depooled and downtimed, and immediately run the following commands to restore privileges:

GRANT SELECT ON ALL KEYSPACES to kartotherian;
GRANT SELECT ON ALL KEYSPACES to tilerator;
GRANT MODIFY ON ALL KEYSPACES to tilerator;
GRANT CREATE ON ALL KEYSPACES to tilerator;
GRANT MODIFY ON ALL KEYSPACES to tileratorui;
GRANT SELECT ON ALL KEYSPACES to tileratorui;
GRANT CREATE ON ALL KEYSPACES to tileratorui;
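
A sketch of how these grants can be re-applied, assuming the cassandra superuser credentials are at hand; the user name, password and contact host below are placeholders:

# Connect as the cassandra superuser and paste the GRANT statements above
cqlsh -u cassandra -p '<superuser password>' "$(hostname -f)"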