MariaDB/Rebooting a host

From Wikitech

On most production hosts, the mariadb instance or instances won't restart automatically. This is intended behavior: it prevents a crashed host from being pooled automatically with corrupt data or replication lag before its health can be checked manually.

After a clean reboot, you can start mariadb by running:

systemctl start mariadb

or, if it is a multi-instance host:

systemctl start mariadb@<section1>
systemctl start mariadb@<section2>

Where <section1>, <section2>, etc. are the sections set up on that particular server (m1, x1, etc.). Don't worry: only sections configured in puppet will start; any others will fail to start if tried.
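On a host with several sections, the start commands can be generated in a loop. The following is only a sketch: the function name start_sections and the RUN dry-run switch are illustrative helpers, not existing tooling, and m1 and x1 stand in for whatever sections the host actually has.

```shell
#!/bin/bash
# Sketch: print (or run) the start command for each MariaDB section.
# RUN defaults to "echo", so commands are only printed (dry run).
# Set RUN="" to really execute them (needs root on the db host).
RUN="${RUN-echo}"

start_sections() {
    local section
    for section in "$@"; do
        # One systemd unit per section: mariadb@<section>
        $RUN systemctl start "mariadb@${section}"
    done
}

# Example section names only; use the ones configured for the host.
start_sections m1 x1
```

Keeping the dry run as the default makes it possible to review the exact unit names before anything is started.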

Replication should start automatically, which can be checked with:

mysql -e "SHOW SLAVE STATUS\G"

(It should return Slave_IO_Running: Yes and Slave_SQL_Running: Yes)

If it is stopped and should be running, you can run:

mysql -e "START SLAVE"
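The replication check can also be scripted. This is a minimal sketch, assuming the status is read with mysql -e "SHOW SLAVE STATUS\G", whose output contains the Slave_IO_Running and Slave_SQL_Running fields. The function name replication_running is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) only if both replication threads report "Yes".
# Expects the output of `mysql -e "SHOW SLAVE STATUS\G"` on stdin.
replication_running() {
    awk -F': ' '
        # Field names are right-aligned in \G output, so allow leading blanks.
        $1 ~ /^[ \t]*Slave_IO_Running$/  { io = $2 }
        $1 ~ /^[ \t]*Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Usage on a db host (not executed here):
#   mysql -e "SHOW SLAVE STATUS\G" | replication_running && echo "replication OK"
```

An empty result (for example on a host that is not a replica) makes the function fail, which is the safer default for a health check.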

If the server or the instance crashed

  • depool the host from production, if possible (dbctl, haproxy, etc.). If it is not possible, weigh the impact on availability against the risk of exposing bad or outdated data (e.g. cache db vs enwiki primary server)
  • determine the root cause of the crash with os logs (syslog), hw logs (mgmt interface), etc.
  • start the instance without replication starting automatically (systemctl set-environment MYSQLD_OPTS="--skip-slave-start")
  • start mariadb
  • check the error log with journalctl -u mariadb (or mariadb@<section>)
  • run a table check against another host (db-compare) to ensure all data is consistent between all servers of the same section
    • Most production hosts are configured to be durable across a crash (innodb_flush_log_at_trx_commit=1). However, not every kind of crash preserves consistency (e.g. a HW RAID controller failure)
  • If the server looks good, start replication and repool it into service
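The start-up part of the steps above can be sketched as a dry-run script. The function name start_without_replication and the RUN switch are illustrative. The unset-environment step is an assumption, added so that later starts are not affected by the leftover option; it is not part of the list above.

```shell
#!/bin/bash
# Sketch: start an instance after a crash with replication held back.
# RUN defaults to "echo" (dry run); set RUN="" to really execute.
RUN="${RUN-echo}"

start_without_replication() {
    local unit="${1:-mariadb}"   # mariadb, or mariadb@<section>
    # Keep replication stopped while the host is being checked.
    $RUN systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
    $RUN systemctl start "$unit"
    # Assumption: clear the variable so future starts replicate normally.
    $RUN systemctl unset-environment MYSQLD_OPTS
    # Review the error log before going any further.
    $RUN journalctl -u "$unit" --no-pager
}

start_without_replication mariadb
```

Depooling, root-cause analysis and the data comparison remain manual steps; the script only covers the commands that are safe to sequence mechanically.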

In all cases, including normal restarts

  • Restart the Prometheus exporter as well: systemctl restart prometheus-mysqld-exporter should do the trick. Use prometheus-mysqld-exporter@<section> on multi-instance hosts
  • Avoid rebooting primary db instances where possible, and switch the active primary role to another host beforehand. Sometimes, though, a primary reboots without anyone choosing it!

    This page is a part of the SRE Data Persistence technical documentation