Service restarts

From Wikitech

This page collects procedures to restart services (or reboot the underlying server) in the WMF production cluster.

Application servers (also image/video scalers and job runners)

When rebooting an application server it should be depooled before the reboot. Whether a server has been correctly depooled can be checked by tailing /var/log/apache2/other_vhosts_access.log.

Restarts of HHVM should be spread out a little, e.g. by restarting one host at a time and waiting 15 seconds between restarts:

cumin -b 1 -s 15 'mw1*' 'service hhvm restart'

Jobrunners can be stopped completely with the following commands:

service jobchron stop
service jobrunner stop

Our infrastructure is resilient against job errors, so this is a safe operation, but please avoid stopping too many jobrunners at the same time.

Note that restarting jobrunner in the non-active datacenter will lead to surprises when puppet tries to stop it, see also bug T158288.

The mediawiki servers also run a local TLS terminator based on nginx, which is used for asynchronous processing of Restbase/Parsoid updates. The service is handled via pybal/confctl. A restart of nginx itself is also acceptable without depooling.

aqs

The aqs servers can be depooled/repooled via conftool (one at a time). Before repooling a server, make sure cassandra is resynced via nodetool (see the Cassandra section for details).

The aqs service is stateless and can be restarted on the individual servers (but only one at a time).

Bacula

Before rebooting a storage host or the director make sure no backup run is currently in progress. This can be checked on helium via:

 sudo bconsole 
 status director

Cache proxies (varnish) (cp)

The Varnish servers use a custom depool/repool mechanism for reboots:

 sudo touch /var/lib/traffic-pool/pool-once
 sudo reboot

The pool-once file is read by a systemd unit (traffic-pool.service), which gets run during shutdown. It depools the servers and repools them after the completed reboot.

To restart nginx, use:

 cumin 'foo*' -b 1 -s 15 'service nginx upgrade'

This performs a graceful online restart, one host at a time, with a 15 second delay in between.

To restart Varnishkafka:

systemctl restart varnishkafka-webrequest

Important note: restarting Varnishkafka resets its internal sequence number to 0, which affects the JSON messages/events sent to Kafka (they all carry that field). This is usually not a big problem, but if all the caching hosts are restarted at once it may trigger alarms for the Analytics team about inconsistent data in Hadoop (hours after the restarts). Please do the restarts in small batches and alert the Analytics team in advance.

Cassandra (as used in aqs and restbase)

Cassandra as used in restbase uses a multi-instance setup, i.e. one host runs multiple cassandra processes, typically named "a", "b", etc. For each instance there is a corresponding nodetool-NAME binary that can be used, e.g. nodetool-a status -r.

A restart of cassandra as used for restbase does not require a depooling of the server (restbase will pick a different cassandra node if the local one is unavailable).

Before starting check that no regular maintenance tasks, like compactions or bootstrap procedures, are ongoing:

# All good!

elukey@aqs1001:~$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 456709
Mismatch (Blocking): 0
Mismatch (Background): 3625
Pool Name                    Active   Pending      Completed
Commands                        n/a         0      651146733
Responses                       n/a         0      623716036


# Something is ongoing, please wait or double check with the service owners!

elukey@aqs1004:~$ nodetool-a compactionstats
pending tasks: 587
                                     id   compaction type                                           keyspace   table      completed          total    unit   progress
   cc5bf510-5f80-11e6-a76a-794b8f4f573c        Compaction   local_group_default_T_pageviews_per_article_flat    data   241794291874   366374730011   bytes     66.00%
Active compaction remaining time :   0h07m44s
elukey@aqs1004:~$ nodetool-a netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Large messages                  n/a         0              1
Small messages                  n/a         0     3097126002
Gossip messages                 n/a         0         905824
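The two checks above can be combined into a small helper. This is a minimal sketch, not an existing tool: the function names pending_compactions, netstats_mode and cassandra_quiet are made up for illustration, and the parsing assumes the output format shown in the examples above ("pending tasks: N" and "Mode: NORMAL").

```shell
# Print the number of pending compactions from `nodetool compactionstats`
# output on stdin. Fails if no "pending tasks:" line is present.
pending_compactions() {
    awk '/^pending tasks:/ { print $3; found = 1 } END { if (!found) exit 1 }'
}

# Print the node mode (NORMAL, STARTING, ...) from `nodetool netstats`
# output on stdin. Fails if no "Mode:" line is present.
netstats_mode() {
    awk '/^Mode:/ { print $2; found = 1 } END { if (!found) exit 1 }'
}

# Succeed only if no compactions are pending and the node is in NORMAL mode.
# $1 is the nodetool binary to call, e.g. nodetool or nodetool-a.
cassandra_quiet() {
    nt=$1
    pending=$("$nt" compactionstats | pending_compactions) || return 1
    mode=$("$nt" netstats | netstats_mode) || return 1
    [ "$pending" -eq 0 ] && [ "$mode" = "NORMAL" ]
}
```

On a multi-instance host this could be used as `cassandra_quiet nodetool-a && echo "nothing ongoing"`. It does not inspect streams or bootstrap progress, so it complements rather than replaces a look at the full output.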

The Cassandra instances can be restarted using the c-foreach-restart command, which figures out how many instances are running and proceeds step by step:

 c-foreach-restart

If you want to reboot a Cassandra server, first drain the instances using c-foreach-nt. Once the instances are drained, the server can be rebooted:

 c-foreach-nt drain

Before proceeding with the next node:

  • check whether the restarted node has correctly rejoined the cluster (the name of the tool is relative to the restarted service instance):
c-any-nt status -r

Directly after the restart the tool might throw an exception "No nodes are present in the cluster", but this usually sorts itself out within a few seconds. If the node has correctly rejoined the cluster, it should be listed with a "UN" prefix, e.g.:

UN  xenon-a.eqiad.wmnet              224.65 GB  256     ?       0d691414-4132-4854-a00d-1d2671e15728  rack1
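To check all nodes at once rather than reading the listing by eye, the status output can be filtered for anything that is not "UN". The following is a minimal sketch that assumes the two-letter status/state prefix shown above; nodes_not_un is a hypothetical helper name, not an existing command.

```shell
# Read `nodetool status -r` style output on stdin and print the state and
# address of every node that is not "UN" (Up/Normal).
# Exit status 0 means at least one node was listed and all of them are UN.
nodes_not_un() {
    awk '
        $1 ~ /^[UD][NLJM]$/ {
            seen = 1
            if ($1 != "UN") { print $1, $2; bad = 1 }
        }
        END { if (bad || !seen) exit 1 }'
}
```

For example: `c-any-nt status -r | nodes_not_un && echo "all nodes UN"`. An empty listing (such as the transient exception right after a restart) is treated as a failure, so the check can simply be retried.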

If you want more info about the status of the host:

# Regular status
elukey@aqs1002:~$ nodetool netstats
Mode: NORMAL   <===================
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0              0
Responses                       n/a         0             76


# After a restart
elukey@aqs1003:~$ nodetool netstats
error: org.apache.cassandra.db:type=StorageService
-- StackTrace --
javax.management.InstanceNotFoundException: org.apache.cassandra.db:type=StorageService
...

elukey@aqs1003:~$ nodetool netstats
Mode: STARTING <===================
Not sending any streams.

Cumin

For reboots make sure no one is currently using the host. After a reboot, the keyholder needs to be rearmed:

 sudo keyholder arm

(The passphrase is in pwstore in the cumin-master-key-passphrase file).

Druid

At the moment, Druid is 'experimental', in that its usefulness is still being evaluated at scale. It is intended to be used for the Analytics/Data_Lake. As of August 2016, if Druid goes down, the Analytics team will notice, and some Hadoop jobs may fail, but no real world service will be impacted. This may change as 2016 closes.

Please note two important things:

  • Zookeeper is running on the Druid hosts (used by the Druid daemons).
  • The Pivot UI (pivot.w.o) uses Druid as backend storage to show data (more specifically it is configured to retrieve data from druid1001.eqiad.wmnet).

See: Analytics/Cluster/Druid#Full Restart of services

DNS recursors (in production and labservices)

The client machines in production and labs use two name servers for name resolution and glibc handles the lookup in a redundant manner, so as long as only one service is restarted at a time, there's no user-visible effect of the restart.

Elasticsearch

The cluster continues to work fine as long as elasticsearch is only restarted on one node at a time (or the host rebooted). The overall cluster state can be queried from any node.

On an arbitrary elasticsearch node the following command returns the overall state of the elasticsearch cluster:

 curl -s 'localhost:9200/_cluster/health?pretty'

Initially the "status" field should be "green". After elasticsearch has been stopped/rebooted, the "number_of_nodes" will go down by one and the "status" will switch to "yellow". The search cluster will resync and return to "green", but it might take 1-2 hours to reach that state. Once it has recovered, the next node can be restarted/rebooted. See search cluster administration for more details about elasticsearch administration.
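Waiting for the cluster to return to "green" can be scripted. The sketch below is an illustration only: es_status and wait_for_green are hypothetical helper names, the 60 second default poll interval is an arbitrary choice, and the status is extracted with plain sed so that no JSON tooling is assumed on the host.

```shell
# Print the "status" value (green, yellow or red) from the JSON returned by
# /_cluster/health, read from stdin.
es_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Poll the local node until the cluster reports green.
# $1: seconds to sleep between polls (defaults to 60).
wait_for_green() {
    interval=${1:-60}
    until [ "$(curl -s 'localhost:9200/_cluster/health?pretty' | es_status)" = "green" ]; do
        sleep "$interval"
    done
}
```

Running `wait_for_green 120` on any node after a restart blocks until it is safe to move on to the next node.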

The time needed for recovery can be slightly decreased by disabling shard allocation during the downtime of a node. This can be done by running es-tool stop-replication / es-tool start-replication on any of the elasticsearch nodes:

 es-tool stop-replication
 reboot
 es-tool start-replication

The elasticsearch hosts also use Nginx for TLS termination. Restarting nginx using "service nginx restart" will kill currently open requests, so it's recommended to depool the server for the restart.

etcd

Warning: Please note that the eqiad etcd cluster is replicated via etcdmirror in codfw, and failing to replicate means paging people. The following hiera variable indicates which instance is running etcdmirror (at the moment conf2002):

profile::etcd::replication::active: true

Etcdmirror reads from the eqiad cluster and replicates its data in codfw. This means that if you reboot conf1* hosts in eqiad without first downtiming conf2002 in Icinga, you'll cause a page. Log in to conf2002 and check which host is the source for the replication:

elukey@conf2002:~$ sudo systemctl status etcdmirror-conftool-eqiad-wmnet.service
● etcdmirror-conftool-eqiad-wmnet.service - Etcd mirrormaker
   Loaded: loaded (/lib/systemd/system/etcdmirror-conftool-eqiad-wmnet.service; enabled)
   Active: active (running) since Mon 2017-07-17 14:52:18 UTC; 55min ago
 Main PID: 23540 (etcd-mirror)
   CGroup: /system.slice/etcdmirror-conftool-eqiad-wmnet.service
           └─23540 /usr/bin/python /usr/bin/etcd-mirror --strip --src-prefix /conftool --dst-prefix /conftool https://conf1001.eqiad.wmnet:2379 http://localhost:2378

In this case etcdmirror on conf2002 is pulling data from conf1001. The etcd nodes are internally clustered and can be rebooted one at a time. After a reboot, the cluster health can be checked via one of the following:

# codfw status
etcdctl -C https://conf2002.codfw.wmnet:2379 cluster-health

# eqiad status
etcdctl -C https://conf1002.eqiad.wmnet:2379 cluster-health
 /usr/local/bin/nrpe_etcd_cluster_health --url https://conf1001.eqiad.wmnet:2379

Warning: Whenever you reboot/restart any etcd host, please verify on conf2002 (or the current active host) that etcdmirror is active and not issuing errors in its logs.
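The replication source can also be pulled out of the status output shown earlier in this section, which helps when deciding which conf1* host needs extra care. A minimal sketch, assuming the etcd-mirror command line keeps the form shown above (an https:// source followed by an http://localhost destination); etcdmirror_source is a hypothetical helper name.

```shell
# Print the first https:// URL found in `systemctl status` output on stdin.
# With the command line shown above, this is the replication source.
etcdmirror_source() {
    grep -o 'https://[^[:space:]]*' | head -n 1
}
```

For example, on conf2002: `sudo systemctl status --no-pager --full etcdmirror-conftool-eqiad-wmnet.service | etcdmirror_source`.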

Warning: Whenever you reboot/restart any etcd host, please verify with Traffic that Pybal is happy and working correctly - https://phabricator.wikimedia.org/T169765

Warning: Check the notes about Zookeeper, since it is co-hosted on the conf* hosts.

More in depth info about the etcd cluster can be found in Etcd#Operations

Exim

The exim service/the mx* hosts can be restarted/rebooted individually without external impact; mail servers trying to deliver mails will simply re-try at a later point if the SMTP service is unavailable:

service exim4 restart

EventLogging

EventLogging is a python based service that reads/writes from Analytics Kafka (more info in Analytics/EventLogging). Do not confuse it with EventBus! If you need to restart the service or reboot the host you can follow Analytics/EventLogging/Oncall#Restart EventLogging, but please reach out to the Analytics IRC channel first just to be sure (#wikimedia-analytics).

failoid

Failoid is used for DNS discovery to indicate that a service is failing. Its iptables setup rejects connections immediately instead of letting the client run into a timeout. As such, Failoid instances can be rebooted one at a time unless there's currently an ongoing service outage.

Ganeti

Ganeti nodes can be upgraded without impact on the running VMs. To reboot a node, its virtual machines need to be migrated to other hosts, with the master node needing special attention.

Gerrit

The restart should be pre-announced on #wikimedia-operations (for maybe 15 minutes) to give people a heads-up:

service gerrit restart

Hadoop workers

Please coordinate with the Analytics team before taking any action, since there are multiple dependencies to consider before proceeding. For example, Camus might need to be stopped to prevent data loss/lag in HDFS.

Hadoop's master node (analytics1001.eqiad.wmnet) and its standby replica (analytics1002.eqiad.wmnet) are configured for automatic failover, but please read the following page: Analytics/Cluster/Hadoop/Administration#Manual Failover

Three of the Hadoop workers run an additional JournalNode process to ensure that the standby master node is kept in sync with the active one. These are configured in the puppet manifest. When rebooting JournalNode hosts it must be ensured that two additional JournalNode hosts are up and running.

service hadoop-hdfs-journalnode restart

The other Hadoop workers are running two services (hadoop-hdfs-datanode and hadoop-yarn-nodemanager). The services on the Hadoop workers should be restarted in this order:

service hadoop-yarn-nodemanager restart
service hadoop-hdfs-datanode restart

The service restarts have no user-visible impact (and the machines can also be rebooted). It's best to wait a few minutes before proceeding with the next node.

The Yarn node managers support graceful reload, so the Yarn containers running on a node are not all killed/restarted at the same time (the node manager dumps its state to disk and is able to restore its config and running containers while starting). This means that until a container finishes, new package upgrades (like open-jdk ones) will not be picked up and will show up in commands like lsof.

Haproxy

HAProxy servers are used for routing misc servers. They are currently a SPOF, so if you need to restart them, make sure they are not in use by using a different proxy.

HAProxy configuration can be reloaded without stopping it. However, HAProxy needs explicit configuration of the files used for config. If the names or number of files change (not only their contents), a reload doesn't work and a full service restart is required.

Hive

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Since it is not a stateless service, please contact the Analytics team before restarting it to avoid any failure in the Hadoop cluster. It is composed of two daemons, the server and the metastore.

Kafka brokers (analytics)

Several consumers might get upset by metadata changes due to broker restarts, so please make sure that the Analytics team is alerted beforehand.

One Kafka broker can be restarted/rebooted at a time:

service kafka restart

Make sure that all replicas are fully replicated. After restarting a broker, a replica election should be performed.

Warning: Restarting a broker will trigger a bug in the Kafka 0.9 truncate log function that sets all the logs' mtime to now (more info: https://phabricator.wikimedia.org/T136690). This will mess up the regular Kafka cleaning policy that we have set, namely removing all the files with an mtime older than 7 days. This could lead to excessive data stored in one disk partition and disk full alarms. To avoid this, the Analytics team deployed a limit for the total topic partition size (500GiB), so even in case of restarts we should be ok. Just alert the Analytics team on IRC to give them a heads up.

Kafka brokers (eventbus)

kafka100[123] and kafka200[123] are not Analytics brokers but part of EventBus, so you will need to follow EventBus/Administration.

Warning: Please sync with the Services team to coordinate the restart/reboot of kafka[12]00[123], since they might need to temporarily stop services like ChangeProp to avoid any risk of causing an outage.

After the restart, please check if events have been dropped following EventBus/Administration#Replaying failed events. If you find errors please report them to the Analytics team!

Labstore / NFS

Notes can be found at https://etherpad.wikimedia.org/p/labstore_reboots

Logstash

After rebooting a Logstash node running Elasticsearch (1004-1006), wait until the cluster state has recovered to "green" (see the "Elasticsearch" section for details). 1001 to 1003 and 1007 to 1009 run Logstash, Kibana (Apache proxied via Varnish) and a data-less Elasticsearch node. The multiple logstash endpoints are behind LVS. They can also be rebooted/restarted one at a time after depooling them. From experience it can take ~5 min for logstash to listen on its ports again, so allow enough time between (de)pools.

LVS

The LVS servers are configured in primary/backup pairs (configured on the routers and visible in puppet in modules/lvs/manifests/configuration.pp). To redirect the traffic from a primary to the backup, pybal can be stopped (traffic is then being redirected to the backup).

maps

The maps servers can be depooled/repooled via conftool (one at a time). Before repooling a server, make sure cassandra is resynced via nodetool (see the Cassandra section for details). Restarts of postgres on the master should be avoided while the download of the OSM data is in progress (triggered daily at 01:27 and visible by the replicate-osm process).

MySQL/MariaDB

Long-running queries from terbium maintenance and SPOFs in certain mysql services (masters, specialized slave roles, etc.) prevent easy restarts.

The procedure is, for a core production slave:

  • Depool from mediawiki
  • Wait for queries to finish
  • Stop replication mysql -e "STOP SLAVE"
  • Stop the server, /etc/init.d/mysql stop then reboot

For a core production master:

For a misc server:

  • Failover using HAProxy (dbproxy1***)
  • Some services need a reload due to long-running connections or persistent connections. This is documented on: MariaDB/misc

More info on ways to speed up this at MariaDB and MariaDB/troubleshooting

Memcached

Memcached is used as caching layer for MediaWiki and it is co-hosted with Redis on mcXXXX machines (eqiad and codfw). MediaWiki uses nutcracker (https://github.com/twitter/twemproxy) to abstract the connection to the memcached cluster with one local socket and to avoid "manual" data partitioning.

Restarting the service is very easy but please remember that the cache is only in memory and it is not persisted on disk before restarts. Direct consequences of a restart might be:

A complete restart of the memcached cluster must be coordinated carefully with ops and the performance team to establish a good procedure to avoid performance hits. If you need to stop memcached for a long maintenance (e.g. OS re-install, etc.) please remove the related host from Hiera first (example https://gerrit.wikimedia.org/r/#/c/273430/).

If you want to rapidly check if memcached is working after a restart or an upgrade:

mc1008:~$ echo stats | nc localhost 11211

cmd_set, cmd_get, total_items and curr_items should show values greater than zero (increasing over time). This is not exhaustive of course!
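The same quick check can be automated by parsing the stats output. This is a minimal sketch under stated assumptions: memcached_counters_ok is a hypothetical helper name, and it looks at the counters cmd_get, cmd_set, total_items and curr_items (curr_items being the name memcached itself uses for the current item count). It only verifies that the values are non-zero, not that they keep increasing.

```shell
# Read memcached `stats` output ("STAT name value" lines) on stdin and
# succeed only if each of the four counters is present and greater than zero.
memcached_counters_ok() {
    # memcached terminates lines with CRLF, so strip the carriage returns.
    tr -d '\r' | awk '
        $1 == "STAT" { val[$2] = $3 }
        END {
            n = split("cmd_get cmd_set total_items curr_items", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in val) || val[want[i]] + 0 <= 0) exit 1
        }'
}
```

For example: `echo stats | nc localhost 11211 | memcached_counters_ok && echo "memcached is serving traffic"`.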

Please remember that memcached on mcXXXX hosts is co-hosted with Redis. If you need to operate on the whole host rather than only memcached, read its section on this page carefully:

Redis is running with a special service name to allow its use as multi-instance (several Redis processes on the same node).

sudo service redis-instance-tcp_6379 restart

It is used in various places for different tasks like:

  • Storage of user sessions on mcXXXX hosts (co-hosted with Memcached)
  • Queue for Job tasks on rdbXXXX hosts

Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting (unlike Memcached). Please note that if you need to perform a complete stop of the service (e.g. OS re-install, etc..) you will need to depool the related host from service first (example https://gerrit.wikimedia.org/r/#/c/273430/). Useful references:

Please note that removing a mcXXXX host from the Redis pool will cause user sessions to be dropped. This is unavoidable since each mcXXXX host holds a partition of the sessions that is not replicated elsewhere (this will no longer be true once codfw replication is fully working, but hopefully this page will have been updated by then). Please plan a complete cluster maintenance carefully to avoid a massive loss of user sessions in a short time window. Please also inform Wikitech Ambassadors (https://lists.wikimedia.org/pipermail/wikitech-ambassadors/) and the performance team one day in advance.

Puppet will take time to roll out a change like de-pooling a Redis host from its pool, because it won't update all the hosts at once. This means that it usually takes ~30 minutes for all the connections to drain from a host. In this timeframe you will see errors in logstash. Please also make sure that all the client connections drop to zero before operating on the host (rebooting, re-installing the OS, etc.) using commands like:

redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" client list | wc -l
redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" monitor

memcached on other services

  • graphite uses memcached to cache queries; it's safe to upgrade.
  • swift frontend servers use memcached to cache lookups for container/account existence and auth tokens. It's safe to upgrade, but the frontend servers should be depooled for the restart.
  • Upgrading memcached on californium loses the sessions for horizon.wikimedia.org; users need to log in again.
  • Upgrading memcached on silver is fine; sessions on wikitech persist.

ntpd

We run four ntpd servers (chromium, hydrogen, acamar, achenar) and all of these are configured for use by the other servers in the cluster. As such, as long as only one server is restarted/rebooted at a time, everything is fine. The ntpd running locally on the individual servers can easily be restarted at any time.

ocg

For rebooting the servers, the ocg servers can be depooled/repooled via conftool (one at a time). Note https://phabricator.wikimedia.org/T120077 , though. The host needs to be changed before ocg1002 can be rebooted.

Otherwise the ocg service can simply be restarted via

service ocg restart

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs written in Java. Since it is not a stateless service, please contact the Analytics team before restarting it to pause its Bundles/Coordinators/Workflows to avoid any failure in the Hadoop cluster.

Ores (Redis)

Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting.

Rebooting a redis host is slightly more complex: The oresrdb primary host has a fallback slave, oresrdb1001/oresrdb1002. The fallback hosts can be rebooted without impact. The switchover occurs via DNS: https://gerrit.wikimedia.org/r/#/c/349434/

You can use "redis-cli client list" to monitor the rate at which existing connections from job runners are draining.

openldap

We run two openldap installations (the oit mirror and one for labs). Both use mirrormode replication. The respective clients are mail servers for the oit mirror and (primarily) labs instances for openldap-labs. The openldap servers (or the slapd process) can be rebooted/restarted one at a time; the clients will transparently try to reconnect to the other host of the respective cluster. The number of connected clients is shown in grafana for openldap-labs.

Parsoid

For service restarts, parsoid can simply be restarted using 'sudo service parsoid restart'.

When rebooting one of the wtp* hosts, they should be depooled via pybal/conftool (two systems at a time). Whether a server has been correctly depooled can be checked by tailing /var/log/parsoid/parsoid.log.

Pivot

Pivot is stateless and can be restarted at any time.

Pool counters

The pool counters in the inactive data centre can be rebooted right away. For the active data centre, they should be removed from and re-added to mediawiki-config one by one (example commit: https://gerrit.wikimedia.org/r/#/c/318509/). Before rebooting, it can be double-checked with "ss" (on port 7531) that no further mediawikis are connected.

Postgres

labsdb1004 (wikilabels)

The postgres database is used by Wiki Labels (used by Ores). After Postgres upgrades, wiki labels needs a manual restart, so restarts/upgrades should be coordinated with Aaron Halfaker.

labsdb1006/1007 (OSM)

Those can be restarted/upgraded anytime.

puppetdb

Clients only talk to Java-based frontend processes, but during the postgres update a few puppet runs will fail, so either this needs to be logged or icinga-wm temporarily disabled.

Prometheus

Prometheus is run in active/active mode, so to roll-restart you have to depool and repool in sequence. For example:

 confctl select name=<name>.codfw.wmnet set/pooled=no
 ... reboot
 confctl select name=<name>.codfw.wmnet set/pooled=yes

To get the current status:

 # confctl select service=prometheus get
 {"prometheus2003.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
 {"prometheus2004.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
 {"prometheus1003.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
 {"prometheus1004.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}

The Prometheus servers in esams/ulsfo (running on bastions) are not redundant. When restarting/rebooting them, a temporary loss of metrics data is acceptable.

puppetdb

Restarting the r/w instance via the puppetdb.service will lead to inevitable puppet failures from clients, the r/o backend server should be fine.

To reduce the Icinga spam, ircecho can be stopped temporarily on the Icinga host.

Puppet masters

For reboots, puppet masters can be depooled via a Puppet Hiera config like https://gerrit.wikimedia.org/r/#/c/349419/.

For service restarts, there will be a shower of puppet failures, so it's better to stop the Icinga log bot.

Redis

(Redis is also used on the memcached servers, please see that section for details. This section is about the rdb* servers hosting the redis job queue.)

Redis is running with a special service name to allow its use as multi-instance (several Redis processes on the same node).

sudo systemctl restart redis-instance-tcp_6378
sudo systemctl restart redis-instance-tcp_6379
sudo systemctl restart redis-instance-tcp_6380
sudo systemctl restart redis-instance-tcp_6381

Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting.

Putting redis entirely out of service is a little more complex: The rdb hosts have a fallback slave (using the subsequent number), e.g. rdb1001 has rdb1002 as its fallback. The fallback hosts can be rebooted without impact. The primary hosts need to be depooled via mediawiki-config. This can be done by commenting out the shard in wmf-config/ProductionServices.php (in $wmfAllServices['eqiad']['jobqueue_redis']).

You can use "redis-cli client list" to monitor the rate at which existing connections from job runners are draining.

relforge

The relforge* cluster is very similar to the elastic* search clusters, but only consists of two hosts, so rebooting/restarting the master causes a service interruption. The service is only used internally, but #wikimedia-discovery should be notified. For the restart of relforge, replication should be stopped and both nodes rebooted at the same time:

 es-tool stop-replication
 # restart service / reboot servers as needed
 es-tool start-replication

restbase

The restbase service on the individual hosts can be restarted without depooling, but only one host should be restarted at a time. The next node should only be restarted once restbase is listening on its external port again:

 netstat -ntulp | grep 7231
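The wait for the port can be scripted instead of re-running the command by hand. The sketch below is an assumption-laden illustration: port_listening and wait_for_restbase are hypothetical helper names, the 5 second poll interval is arbitrary, and the parsing assumes the usual Linux netstat column layout (local address in column 4, state in column 6).

```shell
# Read `netstat -ntl` style output on stdin and succeed if some TCP socket
# is in LISTEN state on exactly the given port ($1).
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            # The port is whatever follows the last colon of the local address.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { if (!found) exit 1 }'
}

# Block until restbase listens on port 7231 again.
wait_for_restbase() {
    until netstat -ntl | port_listening 7231; do
        sleep 5
    done
}
```

Unlike a plain grep for 7231, this matches the port column only, so a PID or a longer port number containing the same digits does not produce a false positive.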

sca servers

The sca servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl --find".

scb servers

The scb servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl --find".

stat* servers

Some users might have long-running scripts on those servers. In case of a reboot, it's best to send a heads-up mail to analytics@lists.wikimedia.org a day ahead.

Swift

Frontend servers (ms-fe*) should be depooled via pybal when making service restarts or reboots. Whether a server has been correctly depooled can be checked by tailing /var/log/swift/proxy-access.log.

Backend servers can simply be rebooted/restarted one at a time (with a 30 second delay in between when restarting); an unresponsive host is automatically handled by the frontend servers.

The Swift services on frontends and backend servers should be restarted with

swift-init all restart

For frontends on jessie swift-init won't work; the restart should happen via systemd instead:

 systemctl restart swift-proxy

Thumbor

Thumbor nodes must be depooled/repooled when making service restarts or reboots. The restart occurs via the "thumbor-instances" service.

Yubico authentication servers

The authentication servers can be rebooted one at a time. After each reboot the keystore on the YubiHSM needs to be unlocked using

 sudo yhsm-keystore-unlock

Zookeeper

Zookeeper is used by Kafka and Hadoop for configuration management and leader election, plus recently by ChangeProp. Analytics and Services need Zookeeper up and running, so please give them a heads-up before proceeding.

Zookeeper nodes can be restarted one at a time via "service zookeeper restart". Once a node is restarted, before proceeding with the next one please verify its status using the following commands:

elukey@conf2001:~$ echo ruok | nc localhost 2181
imok

elukey@conf2001:~$ echo stats | nc localhost 2181
Zookeeper version: 3.4.5--1, built on 05/31/2017 10:10 GMT
Clients:
[...]

Latency min/avg/max: 0/0/947
Received: 91891
Sent: 91894
Connections: 4
Outstanding: 0
Zxid: $someid
Mode: [follower|leader]
Node count: 1473

Please also verify that there is an active leader using Cumin:

elukey@neodymium:~$ sudo cumin 'conf2*' 'echo stats | nc localhost 2181'

Warning: Please note that executing only "ruok" is not enough, since it only tells you the status of the daemon, not the health of the cluster! This caused Incident documentation/20170831-Zookeeper.
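The leader check can be sketched as a small script that looks at the Mode line of each node rather than at "ruok". This is an illustration only: zk_mode and zk_has_one_leader are hypothetical helper names, and the hosts to query have to be passed in explicitly.

```shell
# Print the Mode (e.g. leader or follower) from the output of the
# "stats" command on stdin. Fails if no Mode line is present.
zk_mode() {
    awk '$1 == "Mode:" { print $2; found = 1 } END { if (!found) exit 1 }'
}

# Succeed only if exactly one of the given hosts reports Mode: leader.
# Hosts that do not answer or report no Mode are simply not counted.
zk_has_one_leader() {
    leaders=0
    for host in "$@"; do
        mode=$(echo stats | nc "$host" 2181 | zk_mode) || continue
        if [ "$mode" = "leader" ]; then
            leaders=$((leaders + 1))
        fi
    done
    [ "$leaders" -eq 1 ]
}
```

It would be called with the full list of cluster members, e.g. `zk_has_one_leader host1 host2 host3` (placeholder names), and fails both when no leader exists and when more than one node claims to be leader.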

Please also double check https://grafana.wikimedia.org/dashboard/db/zookeeper after each restart.

One-off hosts

silver

After the reboot, mysql needs to be started manually (similar to the mysqld for the main databases):

 sudo -H bash
 /etc/init.d/mysql start

(Probably no longer needed with https://gerrit.wikimedia.org/r/#/c/292980/, doublecheck with the next update)

Also, the script that maintains mounts for labs projects queries wikitech running on silver to get a list of all projects. While it's also automatically restarted by puppet after up to 30 minutes, this can be fixed by running

 sudo service nfs-exports restart

on labstore1001.

sodium

Package installations/upgrades will fail, so this needs to be announced briefly ahead.