SRE/Service Operations/Documentation/Reboots
There are times when we need to reboot our fleet. Here are some notes to help us go through it like
Mediawiki Bare Metal
appservers
For each $cluster of parsoid, api_appserver, appserver, and jobrunner; for each $dc of eqiad and codfw:
sudo cookbook sre.hosts.reboot-cluster -D $dc -c $cluster -p 5 -s 45
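The nested iteration above can be sketched as a small dry-run helper. This is only an illustration: the function name print_appserver_reboots is hypothetical, and it prints the commands rather than running them, so they can be reviewed and launched one at a time. The cookbook flags are copied unchanged from the command above.

```shell
#!/bin/sh
# Dry-run sketch: print one reboot-cluster invocation per (dc, cluster) pair.
# Nothing is executed; the function name is hypothetical.
print_appserver_reboots() {
    for dc in eqiad codfw; do
        for cluster in parsoid api_appserver appserver jobrunner; do
            echo "sudo cookbook sre.hosts.reboot-cluster -D $dc -c $cluster -p 5 -s 45"
        done
    done
}

print_appserver_reboots
```

Printing instead of executing keeps the operator in control of when each cluster is rebooted, which matters given the deployment caveat that follows.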
The cookbook will depool and repool the hosts. Beware that it may cause issues with deployments as hosts are still scap targets or proxies while rebooting.
Datastores
Memcache cluster
Servers in this cluster need no depooling of any kind, and we have a pool of servers to pick up the traffic of any unavailable server. The data cached in the memcache cluster is crucial for our latency, so it is highly recommended to reboot one server at a time per DC, waiting at least 15 to 20 minutes between reboots.
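A minimal sketch of that pacing, under stated assumptions: this page does not say which command is used for memcache hosts, so the sketch assumes sre.hosts.reboot-single (the cookbook shown for the Kubernetes control plane below). The host names and the function name plan_memcache_reboots are hypothetical. It only prints a plan for one DC, with a pause between consecutive hosts.

```shell
#!/bin/sh
# Dry-run sketch: print a one-host-at-a-time reboot plan for a single DC.
# WAIT_SECONDS defaults to 900 (15 minutes), the lower bound recommended above.
WAIT_SECONDS="${WAIT_SECONDS:-900}"

plan_memcache_reboots() {
    first=1
    for host in "$@"; do
        # Pause before every host except the first one.
        if [ "$first" -eq 0 ]; then
            echo "sleep $WAIT_SECONDS"
        fi
        first=0
        # Assumption: reboot-single is the right cookbook for these hosts.
        echo "sudo cookbook sre.hosts.reboot-single -r \"Reason\" $host"
    done
}

# Hypothetical host names, for illustration only.
plan_memcache_reboots mc-example1001.example.org mc-example1002.example.org
```

Run one plan per DC, and set WAIT_SECONDS=1200 for the longer 20-minute gap.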
Redis misc
Redis hosts can be rebooted one after the other, taking care to wait for replication to come back up after each reboot.
For each instance in /etc/redis:
sudo -i
redis-cli -p $instance_port
AUTH <password>
INFO Replication
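The manual check above can be scripted roughly as follows. On a replica, the output of INFO replication contains a master_link_status field that reads up once the link to its master is established. The helper name replica_link_up is hypothetical; it reads the INFO output on stdin and strips the carriage returns that redis-cli emits.

```shell
#!/bin/sh
# Sketch: succeed only if the INFO replication output on stdin reports
# master_link_status:up. The function name is hypothetical.
replica_link_up() {
    # redis-cli terminates lines with CRLF, so drop the carriage returns first.
    tr -d '\r' | grep -q '^master_link_status:up$'
}

# Example usage on a replica (not executed here; $instance_port and
# $password are placeholders for the values used in the steps above):
#   redis-cli -p "$instance_port" -a "$password" INFO replication | replica_link_up \
#       && echo "replication is up"
```

A master instance has no master_link_status field, so this check only applies to replicas; on a master, inspect connected_slaves in the same output instead.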
docker-registry doesn't use HA for redis, so the reboot of its redis nodes will cause a short unavailability.
Etcd
Kubernetes
Production
Workers
sudo cookbook -d sre.k8s.reboot-nodes --batchsize 3 -g main --reason "Reason" --alias wikikube-worker-codfw
sudo cookbook -d sre.k8s.reboot-nodes --batchsize 3 -g main --reason "Reason" --alias wikikube-worker-eqiad
Control plane
For each host in the control plane (kubemaster, kubetcd):
sudo cookbook sre.hosts.reboot-single -r "Reason" $host