MariaDB/Rebooting a host

Clean shutdown

  1. First, get a list of instances on the host. The easiest way is to SSH into the host and check the MOTD, which will include something like this:
    DB section s1 (alias: mysql.s1)
    DB section s3 (alias: mysql.s3)
    
    See MariaDB/Multiinstance for more details.
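    If you want to double-check from the shell, the running systemd units also reveal the instances (a hedged alternative: this only lists units that are currently loaded, so a stopped instance will not show up):
    sudo systemctl list-units 'mariadb*'
    # e.g. mariadb@s1.service and mariadb@s3.service on a multi-instance host,
    # or just mariadb.service on a single-instance host
    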
  2. Check if the host needs to be depooled:
    • If any of the instances are known to dbctl, then the host will need to be depooled if it's in an active DC, and repooled afterwards.
      sudo dbctl instance <instance> get
      # E.g. for single-instance host:
      #   sudo dbctl instance db1123 get
      # E.g. for multi-instance host:
      #  sudo dbctl instance db1102:3312 get
      #  sudo dbctl instance db1102:3313 get
      #  sudo dbctl instance db1102:3320 get
      
    • clouddb* hosts also need to be depooled before being rebooted, see Portal:Data Services/Admin/Runbooks/Depool wikireplicas.
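    For hosts managed by dbctl, the depool/repool flow looks roughly like this (a hedged sketch: the instance name and commit messages are placeholders; check the dbctl documentation for the exact procedure):
    sudo dbctl instance db1123 depool
    sudo dbctl config commit -m "Depool db1123 for reboot TXXXXXX"
    # ...reboot and verify the host, then repool it:
    sudo dbctl instance db1123 pool
    sudo dbctl config commit -m "Repool db1123 after reboot TXXXXXX"
    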
  3. Downtime the host in Icinga for 1h.
    sudo cookbook sre.hosts.downtime --hours 1 -r "Rebooting dbXXXX TXXXXXX" '<fqdn>'
    
    If the host has any replicas, they will also need to be downtimed, to prevent replication alerts from firing.
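    One option is to re-run the same downtime cookbook for the replicas (a hedged sketch: the target argument is a Cumin-style host query, and the hostnames below are placeholders):
    sudo cookbook sre.hosts.downtime --hours 1 -r "Rebooting dbXXXX TXXXXXX" 'db1124.eqiad.wmnet,db1125.eqiad.wmnet'
    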
  4. With the default config, MariaDB dumps its buffer pool index to disk on shutdown and loads it automatically on start, shortening its warmup period. If you want to avoid this (e.g. because the current buffer pool is not fully loaded), connect to each mysql instance and run:
    mysql> SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;
    
    Setting this dynamically does not persist: the option will revert to ON the next time the instance starts.
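    To confirm the current value on an instance (standard MariaDB syntax):
    mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_dump_at_shutdown';
    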
  5. If the host is replicating from another host, stop replication manually.
    # Single-instance
    sudo mysql -e "STOP SLAVE"
    # Multi-instance, per instance:
    sudo mysql.<section>
    root@<host>:<section>[(none)]> STOP SLAVE;
    
    If the host is currently replicating a slow transaction, STOP SLAVE will hang until the slow transaction is completed. If you press Ctrl+C to retry later, be aware that MariaDB will still apply the STOP SLAVE command as soon as the slow transaction completes.
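    One way to confirm that both replication threads have actually stopped before shutting down (standard MariaDB status fields):
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'
    # Both should report: No
    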
  6. Stop mariadb instance(s) on the host
    # single-instance:
    sudo systemctl stop mariadb
    # multi-instance, for each section:
    sudo systemctl stop mariadb@<section>
    # Avoid using a wildcard (mariadb@*), stop one section at a time
    # e.g.:
    #  sudo systemctl stop mariadb@s2
    #  sudo systemctl stop mariadb@s3
    #  sudo systemctl stop mariadb@x1
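
    Before moving on, it is worth confirming that no server process is left running (a quick check; on newer MariaDB versions the process may be named mariadbd instead of mysqld):
    pgrep -a mysqld   # should print nothing once all instances are stopped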
    
  7. Unmount /srv and disable swap
    sudo umount /srv
    sudo swapoff -a
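
    To verify that the filesystem is unmounted and swap is off (standard util-linux commands; both should print nothing):
    findmnt /srv
    swapon --show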
    
  8. This is a good time to upgrade MariaDB to the latest minor version. It's usually safe to upgrade other packages too.
    sudo apt upgrade
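
    If the package lists might be stale, refresh them first and preview which MariaDB packages would be pulled in (standard apt usage):
    sudo apt update
    apt list --upgradable | grep -i maria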
    
  9. Finally reboot the host! You can simply use sudo reboot, but it's preferable to use the reboot-single cookbook, which will log to SAL, check that the host reboots successfully and wait for a successful Puppet run:
    cumin1002:~$ sudo cookbook sre.hosts.reboot-single -r "Reason for reboot" <fqdn-of-host>
    

After boot

On most production hosts, the mariadb instance or instances won't restart automatically. This is intended behavior, to prevent a crashed host from being pooled automatically with corrupt data or lag before its health can be checked manually.

  • If you just did an upgrade, or another kind of dangerous maintenance, it is better to avoid an automatic buffer pool load on startup. To do so, rename the file in each data directory from ib_buffer_pool to ib_buffer_pool.bak. This makes the old buffer pool dump unusable, while still allowing a new dump to be produced the next time the instance shuts down for a normal restart.
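    For example (a hedged sketch: the data directory is typically /srv/sqldata on single-instance hosts and /srv/sqldata.<section> on multi-instance ones; verify the actual datadir before renaming):
    sudo mv /srv/sqldata/ib_buffer_pool /srv/sqldata/ib_buffer_pool.bak
    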
  • If an upgrade is about to be done, also make sure mariadb doesn't start replication automatically by running:
    sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
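
    Once the maintenance is done, remember to clear the variable again so that future restarts resume replication normally (standard systemd usage):
    sudo systemctl unset-environment MYSQLD_OPTS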
    
  • After a clean reboot, you can start mariadb by running:
    # Single-instance:
    sudo systemctl start mariadb
    # Multi-instance, per instance:
    sudo systemctl start mariadb@<section>
    
    Where <section> is one of the sections present on that particular server (m1, x1, etc.). Don't worry: only the sections configured in Puppet will start; any others will fail to start if attempted.
  • Replication does not always start automatically. Check as follows (the output should show Slave_IO_Running: Yes and Slave_SQL_Running: Yes):
    # Single-instance
    sudo mysql -e "SHOW SLAVE STATUS"
    # Multi-instance, per instance:
    sudo mysql.<section>
    root@<host>:<section>[(none)]> SHOW SLAVE STATUS\G
    
  • If replication is stopped and should be running, you can run:
    # Single-instance
    sudo mysql -e "START SLAVE"
    # Multi-instance, per instance:
    sudo mysql.<section>
    root@<host>:<section>[(none)]> START SLAVE;
    
  • Force a Puppet run. The first Puppet run after a reboot is likely to have failed because MariaDB was not running, so now that you've started MariaDB you should run Puppet manually and make sure it completes with no errors.
    sudo run-puppet-agent
    
  • Check that the Prometheus mysqld exporter is running, and start it manually if it isn't:
    # Single-instance:
    sudo systemctl (status|start) prometheus-mysqld-exporter
    # Multi-instance, per instance:
    sudo systemctl (status|start) prometheus-mysqld-exporter@<section>
    
  • (Only if you are rebooting a clouddb* wikireplica host) Check that the wmf-pt-kill service is running:
    # Multi-instance, per instance:
    sudo systemctl status wmf-pt-kill@<section>
    
  • (Only if you are rebooting a primary host) Check that the pt-heartbeat service is running, and start it manually if it isn't; otherwise lag alerts will fire.
    # Single-instance:
    sudo systemctl (status|start) pt-heartbeat-wikimedia
    # Multi-instance, per instance:
    sudo systemctl (status|start) pt-heartbeat-wikimedia@<section>
    
    Note: We should try not to reboot primary db instances for obvious reasons, and should switch the active primary status to another host beforehand, but sometimes there is no choice!

If the server or the instance crashed

  • depool the host from production, if possible (dbctl, haproxy, etc.). If that is not possible, weigh the impact on availability against the risk of exposing bad or outdated data (e.g. a cache db vs the enwiki primary server)
  • determine the root cause of the crash with OS logs (syslog), hardware logs (mgmt interface), etc.
  • start the instance without replication starting automatically (systemctl set-environment MYSQLD_OPTS="--skip-slave-start")
  • start mariadb
  • check the error log with journalctl -u mariadb (or mariadb@<section>)
  • do a data check against another host of the same section (db-compare) to ensure all data is consistent between all the servers of that section
    • Most production hosts are configured to be durable on crash (innodb_flush_log_at_trx_commit=1). However, not every kind of crash can guarantee consistency (e.g. a HW RAID controller failure)
  • If the server looks good, start replication and repool it into service
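
A minimal sketch of the post-crash checks described above, assuming standard MariaDB statements and journalctl usage (db-compare itself is not shown; for multi-instance hosts use mariadb@<section> and sudo mysql.<section>):
    # Check the error log for crash-recovery messages
    sudo journalctl -u mariadb --since "-1h"
    # Confirm the durability setting mentioned above
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"
    # Only after the data has been verified (e.g. with db-compare):
    sudo mysql -e "START SLAVE"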