You probably reached this page because you received an alert saying that one or more sections were taking too long to back up.
This alert is set up for backupmon1001 at operations-puppet:modules/profile/manifests/dbbackups/check.pp, and it can be configured and disabled at operations-puppet:/hieradata/role/common/dbbackups/monitoring.yaml.
It has two main root causes, which have to be considered and handled separately: (1) an anomaly has happened, or (2) the dataset has grown beyond a reasonable size limit.
Long running backups for MediaWiki metadata or misc database hosts
Backups of misc and regular (metadata) MediaWiki databases, which should generally be under 1 to 2 TB in size, should not take more than 2-4 hours (including compression and inventory). The exception is m2 (because of the OTRS database), which as of September 2023 takes 7 hours to dump. Outside of that exception, a long-running backup may mean that a backup process is stuck or otherwise in a weird state. Backup freshness is normally monitored separately, but this alert can make it more apparent that a backup is in that state and should be killed or debugged.
Because of the redundancy between datacenters, and because a single run neither deletes previous runs nor affects future ones, there is usually no hurry to attend to this alert, and the database backups owner can handle it when available. Common debugging methods are:
- Compare with previous backups of the same type, section and datacenter on the database backups dashboard
- Check the status of the backup on the host where it is running (use the dashboard to find out which one) and run:
ps aux | grep backup
or similar to see if it is blocked or slow due to an IO, filesystem, kernel, hardware issue, etc.
- Read the logs from the backups (on cumin hosts for the remote-backup side, and on the local dbprov host for the dumps and preparation)
- Kill ongoing backups unless specially needed and retry them
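The process check in the list above can be made a little more specific. The following is a minimal sketch, assuming the backup processes have "backup" somewhere in their command line; the function names list_backup_procs and filter_io_blocked are hypothetical and not part of the backup tooling. It lists matching processes with their state and elapsed time, and filters for those in uninterruptible sleep (state D), which often points to an IO or filesystem stall.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the backup tooling.

# List processes whose command line mentions "backup", keeping the ps header.
# The [b] in the pattern stops awk from matching its own command line.
list_backup_procs() {
    ps -eo pid,stat,etime,args | awk 'NR == 1 || /[b]ackup/'
}

# Read output in the format above on stdin and print only the rows whose
# STAT column starts with D (uninterruptible sleep, usually waiting on IO).
filter_io_blocked() {
    awk 'NR > 1 && $2 ~ /^D/'
}
```

Running `list_backup_procs | filter_io_blocked` a few times in a row shows whether the same process stays in state D. A process that is always in D with a growing elapsed time is a candidate for the IO, filesystem or kernel investigation mentioned above.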
Long running backups for ES cluster (MediaWiki content) database hosts
ES backups can suffer from the same problems as the rest of the hosts (and the same guidelines should be used in that case), but, unlike the previous sections, es database hosts can contain up to 12 TB of data. In addition, as the hosts have traditionally not been dedicated, a full backup could take 2 or 3 days with low concurrency (1 or 2 threads only). This is particularly worrying because backups that take a long time to run usually take a long time to recover too.
So the most likely cause of the alert for es backups is that the dataset has slowly grown so much that it now takes many hours to back up (in which case killing and restarting the backups will not fix anything).
To minimize this, monitoring of long-running backups was set to alarm if a backup crosses a configured threshold of hours, so that in those cases the cluster can be partitioned into smaller ones. The same server(s) that are in read-write mode then hold two clusters: a static one in read-only mode, which needs no regular backups except for the first time it is set to read-only, and one in read-write mode that is faster to back up.
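As an illustration of the threshold comparison only, here is a minimal sketch. It is not the actual check, which is defined in the puppet module; the function name exceeds_threshold and its arguments are hypothetical. It takes the start and end of a backup as Unix timestamps and a threshold in hours, and succeeds only when the elapsed time is strictly above the threshold.

```shell
#!/bin/sh
# Hypothetical helper, not the real check.
# Usage: exceeds_threshold START_EPOCH END_EPOCH THRESHOLD_HOURS
# Exit status 0 means the backup ran longer than the threshold.
exceeds_threshold() {
    elapsed=$(( $2 - $1 ))      # duration in seconds
    limit=$(( $3 * 3600 ))      # threshold converted to seconds
    [ "$elapsed" -gt "$limit" ]
}
```

For example, `exceeds_threshold 0 30000 8 && echo "too long"` prints "too long", because 30000 seconds is more than 8 hours (28800 seconds).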
When es backups take more time than the threshold configured as agreed by the DBAs, the email signals that it is time to execute this process of creating a new set of read-write cluster(s): TODO: link to checklist for bug T342685
To create the new blobs tables:
ladsgroup@mwmaint2002:/srv/mediawiki/php-1.41.0-wmf.29/extensions/WikimediaMaintenance/storage$ ./make-all-blobs es2024 blobs_cluster29
And then rotate the clusters in MediaWiki. See this patch as an example. Make sure that no new wiki gets created between running make-all-blobs and deploying the MediaWiki change.
Running the check manually
If something has been fixed and you want to run the check manually to verify the fix, run the following on the backup monitoring host:
systemctl start check-es-backups-duration.service
It will either confirm that there were no long-running backups in the last week, or it will resend the email.
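To look at what the manual run did without waiting for an email, the unit's last result and recent log lines can be inspected with standard systemd tools. This is a minimal sketch: the function name show_check_outcome and the one-hour window are assumptions, and this page does not say whether the unit reports a failure when it finds long-running backups, so treat the Result value as a hint rather than a verdict.

```shell
#!/bin/sh
UNIT=check-es-backups-duration.service

# Hypothetical helper: print the unit's last Result property
# (for example "success" or "exit-code"), then its recent journal entries.
show_check_outcome() {
    systemctl show -p Result --value "$UNIT"
    journalctl -u "$UNIT" --since "1 hour ago" --no-pager
}
```

Call `show_check_outcome` right after the `systemctl start` command above. Reading the journal may require root or membership in a group with journal access.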