MariaDB/Upgrading a section
Rolling restarts are used to perform security updates or upgrade minor or major MariaDB versions.
See the Kernel and MariaDB version dashboard: https://zarcillo.wikimedia.org/ui/hosts
Monitor kernel update progress with: https://grafana.wikimedia.org/d/fcnrmzq/mariadb-kernel-versions
Monitor MariaDB update progress with: https://grafana.wikimedia.org/d/bd9fc9e2-6bb5-463d-a783-87c59d23b6f1/mariadb-versions
Using the dbtools scripts to update/reboot multiple hosts
The dbtools scripts repository contains rolling_restart.py and similar scripts
Updating the s section
Use rolling_restart.py. The script performs a rolling restart of database hosts while also updating the OS. It updates MariaDB as part of the OS update according to Puppet's configuration.
The script handles, in order:
- Silencing of alerts
- Depooling
- Updating OS including MariaDB (if a new version is present and configured in puppet)
- Rebooting (switches to a new kernel if found)
- Checking replication lag (except for RO `es*` sections)
- Repooling
- Enabling alerts
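The ordering above can be sketched as a simple loop. This is a minimal illustration and not the actual rolling_restart.py code: `RecordingActions`, `STEPS` and `rolling_restart` are hypothetical names, and each step only records what would happen instead of touching a host.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingActions:
    """Hypothetical stand-in: records each step instead of acting on a host."""
    log: list = field(default_factory=list)

    def step(self, name: str, host: str) -> None:
        self.log.append((host, name))


# Labels mirroring the list above; they are not real function names.
STEPS = [
    "silence_alerts",
    "depool",
    "update_os",
    "reboot",
    "check_replication_lag",
    "repool",
    "enable_alerts",
]


def rolling_restart(hosts, actions, skip_lag_check=lambda host: False):
    """Run every step on one host before moving on to the next host."""
    for host in hosts:
        for name in STEPS:
            if name == "check_replication_lag" and skip_lag_check(host):
                continue  # e.g. hosts in read-only es* sections
            actions.step(name, host)
    return actions.log
```

Finishing all steps on one host before starting the next is what makes the restart "rolling" in this sketch: a host is repooled before the following one is depooled.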
Usage:
- Check out its repository and auto_schema in a dedicated directory on a cumin host
- Edit the rolling_restart.py global variables
- Run in screen/tmux/byobu as:
sudo PYTHONPATH=../auto_schema python3 rolling_restart.py --run
Updating the es section
For the read only hosts:
# See:
./rolling_restart_es.py -h
# Example:
sudo ./rolling_restart_es.py -t T419961 -r'Security updates' 6.1.164 check
Updating the ms and pc sections
# See:
./rolling_restart_pc_ms.py -h
# Example:
sudo ./rolling_restart_pc_ms.py -t T419961 -r'Security updates' 12:6.1.164 13:6.12.74 reboot ms
Order of upgrades
- Upgrade clouddb* hosts.
- Upgrade Sanitarium hosts in both DCs
- Upgrade Sanitarium primaries in both DCs and ensure sanitarium host hangs from the 10.4 one in the active DC
- Upgrade the candidate master on the standby DC
- Upgrade the backup source in the standby DC (coordinate with Jaime)
- Upgrade the master in the standby DC
- Upgrade the candidate master in the primary DC
- Upgrade the backup source in the primary DC (coordinate with Jaime)
- Switchover the primary host in the primary DC to a Buster+10.4 host
- Upgrade the old primary and make it a candidate primary
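As a rough sketch, the order above can be encoded as a list of roles and used to sort a set of hosts into an upgrade plan. The role labels and the `plan_upgrades` helper are hypothetical names chosen for illustration; no existing tool is assumed to use them.

```python
# Role labels paraphrase the list above, in the same order (illustrative names).
UPGRADE_ORDER = [
    "clouddb",
    "sanitarium",
    "sanitarium_primary",
    "standby_candidate_master",
    "standby_backup_source",
    "standby_master",
    "primary_candidate_master",
    "primary_backup_source",
    "primary_master_switchover",
    "old_primary",
]


def plan_upgrades(hosts_by_role):
    """Return (role, host) pairs sorted into the documented upgrade order."""
    unknown = set(hosts_by_role) - set(UPGRADE_ORDER)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    return [
        (role, host)
        for role in UPGRADE_ORDER
        for host in hosts_by_role.get(role, [])
    ]
```

Rejecting unknown roles keeps a mistyped role from silently dropping a host out of the plan.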
Upgrading MariaDB minor version on a single host
Use the sre.mysql.upgrade cookbook.
Upgrading MariaDB major version on a single host
Use the sre.mysql.major-upgrade cookbook.
Upgrade procedure (legacy)
- Patch the dhcp file: [example]
- Run puppet on install1003 and install2003
- Depool the host (if needed) using software/dbtools/depool-and-wait
- Silence the host in Icinga, e.g. on a cumin host:
cookbook sre.hosts.downtime xxxx.wmnet -D1 -t TXXXXXX -r "reimage for upgrade - TXXXXXX"
- Stop MySQL on the host
- Run:
umount /srv; swapoff -a
- Run the reimage:
sudo -E sudo cookbook sre.hosts.reimage xxxx.wmnet -p TXXXXXX
- Wait until the host is up
- Run:
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
- Run:
chown -R mysql. /srv/*; systemctl start mariadb ; mysql_upgrade
- Run:
systemctl restart prometheus-mysqld-exporter.service
- Drop the host from Tendril and re-add it; otherwise its Tendril metrics won't get updated
- Check all the tables before starting replication (this can take up to 24h depending on the section)
- In a screen run:
mysqlcheck --all-databases
- If any corruption is discovered, fix it with the following:
journalctl -xe -u mariadb | grep table | grep Flagged | awk -F "table" '{print $2}' | awk -F " " '{print $1}' | tr -d "\`" | uniq >> /root/to_fix ; for i in `cat /root/to_fix`; do echo $i; mysql -e "set session sql_log_bin=0; alter table $i engine=InnoDB, force"; done
- Start the replica
- Wait until the host is up
- Repool the host.
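The corruption-fix one-liner in the procedure above does two things: it extracts table names from journal lines that mention "Flagged", then rebuilds each table with binary logging disabled for the session. The text-processing half can be sketched as follows; `flagged_tables` and `repair_statement` are hypothetical helper names, and the exact journal line format depends on the MariaDB version, so any input line used with it is illustrative only.

```python
def flagged_tables(journal_lines):
    """Mimic the grep/awk/tr/uniq pipeline on an iterable of journal lines."""
    tables = []
    for line in journal_lines:
        # grep table | grep Flagged
        if "table" not in line or "Flagged" not in line:
            continue
        # awk -F "table" '{print $2}': text after the first "table"
        after = line.split("table")[1]
        # awk -F " " '{print $1}': first whitespace-separated field
        fields = after.split()
        if not fields:
            continue
        # tr -d "`": strip backticks
        name = fields[0].replace("`", "")
        # uniq: drop adjacent duplicates only
        if not tables or tables[-1] != name:
            tables.append(name)
    return tables


def repair_statement(table):
    """Build the same SQL the shell loop passes to mysql -e."""
    return f"set session sql_log_bin=0; alter table {table} engine=InnoDB, force"
```

Like the shell version, this breaks if a database or table name itself contains the string "table", so the extracted list should be reviewed before any statement is run.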
The cookbook accepts a Cumin query to match one or more hosts and:
- Depools the host using the sre.mysql.depool cookbook
- Downtimes it
- Stops replication
- Upgrades packages including MariaDB using apt-get dist-upgrade
- Reboots it
- Starts MariaDB and runs mysql_upgrade, then restarts replication
- Optionally repools it if --repool was passed
- Logs in Phabricator
Previous manual steps (obsoleted)
- You should log that a maintenance is about to happen:
!log Upgrade db1111 T123456
- The package for the MariaDB server must be upgraded, usually with:
sudo apt upgrade 'wmf-mariadb*'
where wmf-mariadb* is the package version you want to upgrade to, e.g. wmf-mariadb104 for WMF's version of MariaDB 10.4. The WMF package is built to avoid side effects: it won't automatically try to stop, restart or alter a running instance in any way, so it can be installed at any time, even while a previous version is running. Unless there is a reason to do otherwise (e.g. minimizing upgrade downtime), it should probably be run after all current instances are shut down.
- Start mysql in a safe way, without starting replication automatically and after removing any old buffer pool dump:
sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
<for each datadir> sudo mv ib_buffer_pool ib_buffer_pool.bak
- mysql_upgrade must be run on every instance after startup and before replication starts. For single-instance hosts:
systemctl start mariadb
systemctl status mariadb # check it started correctly (some errors on first startup are ok during the upgrade, due to old table formats)
mysql_upgrade
For multiple-instance hosts, for each instance:
sudo systemctl start mariadb@<section>
sudo systemctl status mariadb@<section> # check it started correctly (some errors on first startup are ok during the upgrade, due to old table formats)
sudo mysql_upgrade -S /run/mysqld/mysqld.<section>.sock
Where <section> is each of the instances to upgrade on that host (e.g. s1 and s2, x1, s5 and s4, etc.)
- After the upgrade, if the mysql database changed, it is important to restart MariaDB. This is normally skippable for minor upgrades, but guarantees it started with the right formatting:
sudo systemctl restart mariadb
# or, for each section upgraded:
sudo systemctl restart mariadb@<section>
The remaining steps to get the server into a production state are the same as for a regular reboot/restart: MariaDB/Rebooting_a_host (restart replication, repool, re-enable monitoring, safety checks)
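For multi-instance hosts, the per-section commands from the manual steps above can be generated rather than typed by hand. This is a small sketch: `upgrade_commands` is a hypothetical helper, and it only builds command strings from the templates shown above without executing anything.

```python
def upgrade_commands(sections):
    """Return the start/status/mysql_upgrade commands for each section, in order."""
    commands = []
    for section in sections:
        commands.append(f"sudo systemctl start mariadb@{section}")
        commands.append(f"sudo systemctl status mariadb@{section}")
        commands.append(f"sudo mysql_upgrade -S /run/mysqld/mysqld.{section}.sock")
    return commands
```

For example, `upgrade_commands(["s1", "s2"])` yields six commands, three for s1 followed by three for s2, which can be reviewed and then run one at a time.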
This page is a part of the SRE Data Persistence technical documentation