User:Ladsgroup/runbook test
Decision Tree
Scenario 1: Database pages for replication lag
You will get a page on replication lag.
STEP: Depool the host
> $ sudo dbctl instance db2150 depool && sudo dbctl config commit -m "Depool db2150, reason why"
If depool gives an error, it usually means one of three things: the host is already depooled, the port is not set (sometimes it's not the default port), or it's a datacenter master.
STEP: Downtime it.
IMPACT: The user facing impact stops here
> $ sudo cookbook sre.hosts.downtime -r "reason why" --hours 24 db2150.codfw.wmnet
STEP: Create a ticket and tag DBA.
STEP: Determine which section the database belongs to
Several ways:
- Load https://orchestrator.wikimedia.org and it will show you the broken host in the notification on the top right
- OR load puppet hieradata/hosts/db1234.yaml.
STEP: Get inside MariaDB and check replication status
cumin1002$ sudo db-mysql db1234 -e "SHOW SLAVE STATUS\G"
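The output of SHOW SLAVE STATUS is verbose; the fields that matter most here are Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master and Last_SQL_Error. A minimal sketch of filtering them (the sample output below is made up for illustration):

```shell
# Hedged sketch: the sample text is stand-in output for the real command
# run from cumin:  sudo db-mysql db1234 -e "SHOW SLAVE STATUS\G"
sample='Slave_IO_Running: Yes
Slave_SQL_Running: No
Seconds_Behind_Master: NULL
Last_SQL_Error: Unknown column xyz in field list'
# Pull out the fields that say whether replication is running,
# how far behind it is, and why it broke.
echo "$sample" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_SQL_Error'
```

Slave_SQL_Running: No with a non-empty Last_SQL_Error means replication is broken rather than merely lagging.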
STEP: Check replication status
Go to the section's orchestrator. Eg. https://orchestrator.wikimedia.org/web/cluster/alias/s4
Yellow/Orange means replication lag. Red means replication is broken.
DECISION POINT:
- If one replica is broken go to 1.2
- If all replicas are broken go to 1.3 (and cry)
- If one replica is lagging, go to 1.4
- If more than one replica is lagging, but not most of them, go to 1.5
- If most or all replicas are lagging go to 1.6
STEP: Once replication is back up and has been caught up (replag being below 1 second), repool it:
https://wikitech.wikimedia.org/wiki/Dbctl#Fully_repool_a_host
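A minimal sketch of the readiness check before repooling; the sample line below is made-up output standing in for a real SHOW SLAVE STATUS run, and the repool procedure itself is on the linked Wikitech page:

```shell
# Hedged sketch: extract Seconds_Behind_Master and only repool once the
# lag is below 1 second. The sample stands in for real output from:
#   sudo db-mysql db2150 -e "SHOW SLAVE STATUS\G"
sample='          Seconds_Behind_Master: 0'
lag=$(echo "$sample" | awk '/Seconds_Behind_Master/ {print $2}')
if [ "$lag" = "0" ]; then
  echo "replag caught up, safe to repool"
fi
```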
1.2 One replica with broken replication
This usually means the replica's schema differs from the rest of the fleet in a way that is incompatible with the data being replicated.
Sometimes it means the replica is corrupted; in that case, just reclone it.
STEP: First, stop replication
> STOP SLAVE;
To fix it, check the replication error in the SHOW SLAVE STATUS output. If a field is missing, find the alter table that was missed and add that field, and so on.
Go to "Map of database maintenance" and see if a recent change could have caused it. If you can't work out what's going on, ask in the Data Persistence team IRC channel; if no one knows or no one is available, reclone the replica from a working replica (https://wikitech.wikimedia.org/wiki/MariaDB/Clone_a_host#Cloning)
Once done, restart replication
> START SLAVE;
STEP: Monitor the situation, either from Orchestrator or from Cumin.
Follow either of the options on the check-replication Wikitech page and make sure the Exec_Master_Log_Pos and Read_Master_Log_Pos offsets are moving forward
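One way to check "moving forward" is to sample the position twice, a few seconds apart, and compare. A hedged sketch; the two values below are made up, and in practice each comes from a separate SHOW SLAVE STATUS run:

```shell
# Hedged sketch: confirm replication is advancing by comparing two
# samples of Exec_Master_Log_Pos taken a few seconds apart.
pos_before=123456
pos_after=125000
if [ "$pos_after" -gt "$pos_before" ]; then
  echo "replication is moving forward"
else
  echo "replication looks stuck"
fi
```

If the position does not move between samples while Slave_SQL_Running is Yes, replication is stuck on something, not merely slow.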
1.3 More than one replica with broken replication
It means the master is missing a schema change, and a write relying on that change has landed on the master and been replicated, breaking replicas along the way.
The usual approach is to fix each replica by following the steps outlined in 1.2 (basically reverting the schema change), then make sure such writes won't happen again (e.g. by disabling that code path), and finally re-apply the schema change on the replicas.
1.4 One replica is lagging:
This usually means hardware issues or the replica is being overloaded with reads.
STEP: Check server health in grafana:
https://grafana.wikimedia.org/d/000000377/host-overview
DECISION POINT:
- If disk utilization is quite high all the time, check disk health
- If something else, check network
1.5 A couple of replicas are lagging but not most
This usually means the primary is receiving a lot of writes.
STEP: Go to processlist of a replica that’s lagging:
> SHOW FULL PROCESSLIST;
And check the replication process.
If it's doing a lot of small transactions, the primary is receiving many writes, usually from a maintenance script.
In that case, reduce the writes so the replicas can catch up. Go to the mwmaint host and check the list of running processes:
- mwmaint2002.codfw.wmnet: $ ps aux | grep -i wikidbname
- If you can confidently determine that script foo is causing writes, kill it.
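A hedged sketch of that filtering step; the ps output, wiki names and script names below are all made up for illustration:

```shell
# Hedged sketch: on the mwmaint host, list processes mentioning the
# affected wiki's database name. Sample lines stand in for real `ps aux`
# output; enwiki/dewiki and the script names are invented examples.
ps_output='www-data 1234 0.5 php maintenance/run.php someScript --wiki=enwiki
www-data 5678 0.1 php maintenance/run.php otherScript --wiki=dewiki'
echo "$ps_output" | grep -i 'enwiki'
```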
1.6 Most or all replicas are lagging
This usually means some large transaction has happened on the primary.
STEP: Go to processlist of a replica that’s lagging:
> SHOW FULL PROCESSLIST;
And check the replication process. It's usually the threads with "repl*" as the username.
If it’s one large query (look at the time), then one large transaction has reached replication.
In the case of one large transaction, we can't do much; we have to let the replicas process it. Any other action, such as skipping the transaction, will make things worse. Never do that in production.
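A hedged sketch of picking the replication threads out of the processlist; the sample rows below are made up, and on our replicas the replication threads typically show up as "system user" or a repl* account:

```shell
# Hedged sketch: filter SHOW FULL PROCESSLIST output down to replication
# threads. A large Time value on a single replication thread suggests one
# big transaction being applied.
processlist='1  system user  NULL  s8  98765  applying batch of row changes
2  wikiuser  10.x.x.x  s8  0  SELECT page_title FROM page'
echo "$processlist" | grep -E 'system user|repl'
```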
Scenario 2: Database is unresponsive
You will get a page on packet loss.
STEP: Determine whether it's a replica or a primary. Open the hiera file in puppet (hieradata/hosts/dbXXXX.yaml)
If it has values such as:
profile::mariadb::mysql_role: 'master'
Then it’s a master, otherwise it’s a replica.
OR
If you know the section. Go to orchestrator, e.g. https://orchestrator.wikimedia.org/web/cluster/alias/s8
And check topology tree
OR
Query zarcillo (from cumin) to get list of masters:
sudo db-mysql db1215 -A zarcillo -e "select * from masters"
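The hiera check above can be sketched as follows; the yaml line is a made-up stand-in for a real hieradata/hosts file:

```shell
# Hedged sketch: decide replica vs primary from the host's Puppet
# hieradata. The variable below stands in for hieradata/hosts/db1234.yaml.
hiera="profile::mariadb::mysql_role: 'master'"
if echo "$hiera" | grep -q "mysql_role: 'master'"; then
  echo "primary (master)"
else
  echo "replica"
fi
```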
DECISION POINT:
- If it’s a replica, go to 2.2
- If it’s a primary (both primary and secondary data centers), go to 2.3 (and cry)
2.2 One replica is down
STEP: Depool it. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica
STEP: Downtime it.
IMPACT: The user facing impact stops here
STEP: Create a ticket and tag DBA
DECISION POINT: See if you can ssh into the host.
- If you can, go to 2.4
- If you can’t, go to 2.5
2.3 One primary is down
This is a complex issue and a DBA must be involved during the incident.
STEP: If the primary belongs to the secondary datacenter and it is possible, depool the whole datacenter from MediaWiki requests. (TODO: Link to a dedicated runbook for depooling the secondary dc)
For primaries, it is usually easier to force a reboot and bring the host back online than to switch over to a new primary.
STEP: If you can’t ssh into the host:
Force a reboot from mgmt console
STEP: Once it’s up, make sure mariadb service is up and running:
$ sudo service mariadb status
STEP: Check logs for data corruption. To be done by a DBA.
STEP: If the DBA gives the green light in the previous step, set the master to read-write (it comes back in read-only mode after a reboot):
> $ sudo mysql -e "SET GLOBAL read_only = 0;"
This can be confirmed with:
> $ sudo mysql -e "SELECT @@read_only;"
DO NOT do this on the primary database of the secondary datacenter. Please be sure to inform DBAs on #wikimedia-data-persistence as well.
IMPACT: The user facing impact stops here
2.4 Replica is accessible but not mariadb
This usually means MariaDB has crashed due to software issues. It is quite rare and requires looking at logs.
STEP: Check the logs:
> sudo journalctl -xe -u mariadb
Depending on the reason of crash, take further actions.
2.5 Replica is not accessible at all
It means it’s a hardware issue. TODO: This should be its own generalized runbook.
STEP: Log in to the mgmt console and get the logs:
> getsel
STEP: Depending on the issue, create a ticket for DcOps.
Quick links
- Map of databases maintenances on Wikitech, updated automatically
- Host overview on Grafana
- Topology and clusters overview and monitoring on Orchestrator.wikimedia.org
Conclusion:
In case of doubt, always check with a DBA (See https://office.wikimedia.org/wiki/Contact_list#Data_Persistence for contact details of DBAs in case of emergency).