SRE/Service Operations/Documentation/Reboots

From Wikitech

There are times when we need to reboot our fleet. Here are some notes to help us go through it smoothly.

Datastores

Memcache cluster

Servers in this cluster need no depooling of any sort; the pool of remaining servers picks up the traffic of any unavailable server. The cached data in the memcache cluster is crucial for our latency, however, so it is highly recommended to reboot one server at a time per DC, waiting a minimum of 15-20 minutes between reboots.
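A minimal dry-run sketch of this pacing, reusing the sre.hosts.reboot-single cookbook that appears elsewhere on this page. The mc* host names are hypothetical placeholders; the function only prints the plan, so drop the echo prefixes to run for real:

```shell
#!/bin/sh
# Dry-run: print the reboot plan for one DC's memcache hosts, pausing
# 20 minutes (1200 s) between servers so caches can refill.
# The host names passed below are hypothetical examples.
reboot_plan() {
    for host in "$@"; do
        echo "sudo cookbook sre.hosts.reboot-single -r 'Memcache reboots' $host"
        echo "sleep 1200"
    done
}

reboot_plan mc1001.eqiad.wmnet mc1002.eqiad.wmnet
```

The sleep is deliberately generous: the point is to give the cache time to warm back up before the next server drops out.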

Redis misc

Redis hosts can be rebooted one after the other, waiting for replication to come back up after each reboot.

For each instance in /etc/redis:

sudo -i
redis-cli -p $instance_port
AUTH <password>
INFO Replication    # wait until this reports master_link_status:up before moving on
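The replication check can also be scripted. A small sketch, assuming the replica's INFO Replication output contains the line master_link_status:up once replication is healthy (the $instance_port and $password variables are placeholders):

```shell
#!/bin/sh
# Returns success (0) when the captured INFO Replication output shows
# the replica's link to its master as up.
replication_up() {
    # $1: output of `redis-cli -p $port INFO Replication`
    echo "$1" | grep -q 'master_link_status:up'
}

# Usage (on the redis host, with $instance_port and $password set):
#   replication_up "$(redis-cli -p "$instance_port" -a "$password" INFO Replication)" \
#       && echo "replication is back up"
```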

docker-registry doesn't use HA for redis, so rebooting its redis nodes will cause a short unavailability.

Etcd

Kubernetes

Production

Workers

sudo cookbook -d sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster wikikube-codfw --reason "Reason" --alias wikikube-worker-codfw --minimal-cordoning

sudo cookbook -d sre.k8s.reboot-nodes --batchsize 15 --k8s-cluster wikikube-eqiad --reason "Reason" --alias wikikube-worker-eqiad --minimal-cordoning

Control plane

For each host in the control plane (kubemaster, kubetcd):

sudo cookbook sre.hosts.reboot-single -r "Reason" $host

Dragonfly supernodes

Lock scap to avoid overloading the registry servers with a big deployment:

deploy1003:~$ scap lock --all "Dragonfly supernodes reboot"

Then reboot using the sre.hosts.reboot-single cookbook.

Staging

Poolcounter

Each poolcounter server should be removed from mediawiki-config/wmf-config/ProductionServices.php with a Gerrit patch before it is rebooted, and that patch deployed to production using scap backport.

After the reboot, add the server back and remove the next one in the same patch, scap backport it, and repeat until all poolcounter servers have been rebooted and are back in mediawiki-config/wmf-config/ProductionServices.php.

Thumbor uses poolcounter as well, but will fail open if its poolcounter server is down. If the interruption is short, you can forgo swapping the servers out in Thumbor's deployment-charts/helmfile.d/services/thumbor/values-{eqiad,codfw}.yaml files.

Chartmuseum

Chartmuseum is addressed through helm-charts.discovery.wmnet. It is backed by one VM in each datacentre.

# Depool codfw
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=codfw' set/pooled=false
# Reboot codfw
sudo cookbook sre.hosts.reboot-single -r "May 2025 Reboots" chartmuseum2001.codfw.wmnet
# Repool codfw
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=codfw' set/pooled=true
# Rinse and repeat for eqiad
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=eqiad' set/pooled=false
sudo cookbook sre.hosts.reboot-single -r "May 2025 Reboots" chartmuseum1001.eqiad.wmnet
sudo confctl --object-type discovery select 'dnsdisc=helm-charts.*,name=eqiad' set/pooled=true