SRE/Service Operations/Documentation/Reboots

From Wikitech

There are times when we need to reboot our fleet. Here are some notes to help us go through it smoothly.

Datastores

Memcache cluster

Servers in this cluster need no depooling of any sort; the pool of remaining servers picks up the traffic of any unavailable server. The cached data in the memcache cluster is crucial for our latency, however, so it is highly recommended to reboot one server at a time per DC, waiting a minimum of 15-20 minutes between reboots.
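A minimal dry-run sketch of this pacing, reusing the sre.hosts.reboot-single cookbook that appears elsewhere on this page. The mc* host names are hypothetical placeholders; the function only prints the plan, so drop the echo prefixes to run for real:

```shell
#!/bin/sh
# Dry-run: print the reboot plan for one DC's memcache hosts, pausing
# 20 minutes (1200 s) between servers so caches can refill.
# The host names passed below are hypothetical examples.
reboot_plan() {
    for host in "$@"; do
        echo "sudo cookbook sre.hosts.reboot-single -r 'Memcache reboots' $host"
        echo "sleep 1200"
    done
}

reboot_plan mc1001.eqiad.wmnet mc1002.eqiad.wmnet
```

The sleep is deliberately generous: the point is to give the cache time to warm back up before the next server drops out.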

Redis misc

Redis hosts can be rebooted one after the other, waiting for replication to come back up after each reboot.

For each instance in /etc/redis:

sudo -i
redis-cli -p $instance_port
AUTH <password>
INFO Replication    # wait until this reports master_link_status:up before moving on
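The replication check can also be scripted. A small sketch, assuming the replica's INFO Replication output contains the line master_link_status:up once replication is healthy (the $instance_port and $password variables are placeholders):

```shell
#!/bin/sh
# Returns success (0) when the captured INFO Replication output shows
# the replica's link to its master as up.
replication_up() {
    # $1: output of `redis-cli -p $port INFO Replication`
    echo "$1" | grep -q 'master_link_status:up'
}

# Usage (on the redis host, with $instance_port and $password set):
#   replication_up "$(redis-cli -p "$instance_port" -a "$password" INFO Replication)" \
#       && echo "replication is back up"
```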

docker-registry doesn't use HA for redis, so rebooting its redis nodes will cause a short unavailability.

Etcd

Kubernetes

Production

Workers

sudo cookbook -d sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster wikikube-codfw --reason "Reason" --alias wikikube-worker-codfw --minimal-cordoning

sudo cookbook -d sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster wikikube-eqiad --reason "Reason" --alias wikikube-worker-eqiad --minimal-cordoning

Control plane

For each host in the control plane (kubemaster, kubetcd):

sudo cookbook sre.hosts.reboot-single -r "Reason" $host

Dragonfly supernodes

Lock scap to avoid overloading the registry servers with a big deployment:

deploy1003:~$ scap lock --all "Dragonfly supernodes reboot"

Then reboot using the sre.hosts.reboot-single cookbook.

Staging

Poolcounter

Each poolcounter server should be removed from mediawiki-config/wmf-config/ProductionServices.php with a Gerrit patch before it is rebooted, and that patch deployed to production using scap backport.

After the reboot, add the server back and remove the next one in the same patch, scap backport it, and repeat until all poolcounter servers have been rebooted and are back in mediawiki-config/wmf-config/ProductionServices.php.

Thumbor uses poolcounter as well, but will fail open if its poolcounter server is down. If the interruption is short, you can forgo swapping the servers out in Thumbor's deployment-charts/helmfile.d/services/thumbor/values-{eqiad,codfw}.yaml files.

Chartmuseum

Chartmuseum is addressed through helm-charts.discovery.wmnet. It is backed by one VM in each datacentre.

# Depool codfw
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=codfw' set/pooled=false
# Reboot codfw
sudo cookbook sre.hosts.reboot-single -r "May 2025 Reboots" chartmuseum2001.codfw.wmnet
# Repool codfw
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=codfw' set/pooled=true
# Rinse and repeat for eqiad
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=eqiad' set/pooled=false
sudo cookbook sre.hosts.reboot-single -r "May 2025 Reboots" chartmuseum1001.eqiad.wmnet
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=eqiad' set/pooled=true