Data Platform/Systems/PostgreSQL/Backup and Restore

Performing a manual backup

[1] On the deployment server, run the following commands.

Make sure to replace "postgresql-test" with the appropriate cluster name.
brouberol@deploy1003:~$ cat <<EOF >manual-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: postgresql-test-on-demand-backup-$(date "+%Y%m%d%H%M%S")
spec:
  method: barmanObjectStore
  cluster:
    name: postgresql-test
EOF
brouberol@deploy1003:/home/brouberol# kubectl create -f manual-backup.yaml -n postgresql-test
backup.postgresql.cnpg.io/postgresql-test-on-demand-backup-20240917141528 created

You can then check the state of the backup with kubectl.

brouberol@deploy1003:/home/brouberol# kubectl get backups -n postgresql-test  --sort-by='{metadata.creationTimestamp}' --no-headers | tac
postgresql-test-on-demand-backup-20240917141528   72s   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240917000000       14h   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240916073251       30h   postgresql-test   barmanObjectStore   completed
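
If you only need the state of a single backup, you can read its phase directly from the resource status (the .status.phase field is part of the CloudNativePG Backup resource):

brouberol@deploy1003:~$ kubectl get backups -n postgresql-test -o jsonpath='{.status.phase}{"\n"}' postgresql-test-on-demand-backup-20240917141528
completed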

Restoring a backup by replaying all WAL files

[2] The easiest way to restore the latest archived state of a pre-existing database is to deploy it with the following values:

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
backups:
  enabled: false

When deploying the cluster, this will start a full-recovery pod that will replay all WAL files, after which the PG pods themselves will be deployed. Once everything is deployed successfully, redeploy the cluster with all of the above values either removed or commented out. This makes sure the cluster starts in normal mode and re-enables backups (a backup will immediately be triggered, for good measure).
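
For instance, after the final redeploy, you can confirm that the cluster is back in a healthy state and that the post-recovery backup was indeed triggered (resource names below assume the postgresql-test cluster):

brouberol@deploy1003:~$ kubectl get clusters.postgresql.cnpg.io postgresql-test -n postgresql-test
brouberol@deploy1003:~$ kubectl get backups -n postgresql-test --sort-by='{metadata.creationTimestamp}' --no-headers | tac | head -n 1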

Performing a Point In Time Recovery (PITR)

[2] A PITR allows you to restore the database to its state at a certain point in time.

You can list the available base backups, most recent first, by running:

root@deploy1003:~# kubectl get backups -n postgresql-test  --sort-by='{metadata.creationTimestamp}' --no-headers | tac
postgresql-test-daily-backup-20240917000000   13h   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240916073251   30h   postgresql-test   barmanObjectStore   completed

Say we'd like to restore postgresql-test-daily-backup-20240917000000. To do this, first extract the creation timestamp of the backup resource:

root@deploy1003:~# kubectl get backups -n postgresql-test  -o jsonpath='{.metadata.creationTimestamp}{"\n"}' postgresql-test-daily-backup-20240917000000
2024-09-17T00:00:00Z

You then need to convert this timestamp into the format understood by the barman (https://pgbarman.org/) tooling used to restore the backup.

$ python3 -c'import datetime as dt;print(dt.datetime.fromisoformat("2024-09-17T00:00:00Z").strftime("%Y-%m-%d %H:%M:%S.%f%z"))'
2024-09-17 00:00:00.000000+0000
Make sure to run this one-liner with Python ≥ 3.11, as datetime.fromisoformat() only accepts the trailing Z suffix from 3.11 onwards.
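
On older Python versions, an equivalent one-liner replaces the Z suffix with an explicit UTC offset before parsing:

$ python3 -c 'import datetime as dt;print(dt.datetime.fromisoformat("2024-09-17T00:00:00Z".replace("Z", "+00:00")).strftime("%Y-%m-%d %H:%M:%S.%f%z"))'
2024-09-17 00:00:00.000000+0000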

You can then deploy the cluster with the following values:

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
  recoveryTarget:
    targetTime: "2024-09-17 00:00:00.000000+0000"
backups:
  enabled: false

Once the cluster is deployed and running, redeploy it with all of the above values commented out or deleted. This makes sure the cluster starts in normal mode and re-enables backups (a backup will immediately be triggered, for good measure).

If, however, you want to perform a PITR to a timestamp that does not align with when a backup was performed, you can simply specify the timestamp to which you'd like the database to be restored. The operator will select the closest backup completed before that target, restore it, and replay the WAL until it reaches the specified timestamp.
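
For example, to restore to an arbitrary instant between two base backups, pass that instant as the recovery target (the timestamp below is purely illustrative):

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
  recoveryTarget:
    targetTime: "2024-09-17 09:30:00.000000+0000"
backups:
  enabled: false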

Importing data from an external database

The process for importing data from a database hosted outside of Kubernetes, described in the CloudNativePG documentation, can be performed by deploying the following values:

cluster:
  initdb:
    import:
      host: an-db1001.eqiad.wmnet
      user: airflow-test-k8s
      dbname: airflow-test-k8s
      password: xxxx # commit this in the private puppet repo

external_services:
  postgresql: [analytics]

Once the cluster has been deployed, exec into the PG primary pod and run a psql shell. Check that the data has indeed been replicated. Once you are satisfied it has, delete the password from the private puppet repository, run puppet on the deployment server, then redeploy the cluster with the above values either removed or commented out.
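
As a sketch, assuming the primary pod is named postgresql-test-1 and using a hypothetical some_table to spot-check the imported data, the verification could look like:

brouberol@deploy1003:~$ kubectl exec -it postgresql-test-1 -n postgresql-test -- psql airflow-test-k8s
airflow-test-k8s=# \dt
airflow-test-k8s=# SELECT count(*) FROM some_table; -- compare with the same count on the source database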

Once the data has been imported, the states of the two databases will diverge if the source database is still being written to. If the import is being used to facilitate a database migration, you should first turn off all writes to the source database, import it into Kubernetes, then redeploy the application with its configuration pointing at the newly deployed CloudNativePG cluster.

Checking the status of the out-of-band backups of the backups

We perform a daily sync of all CloudNativePG clusters' S3 buckets to db1208.eqiad.wmnet, so that they get included in Bacula out-of-band backups. This daily sync is performed on db1208.eqiad.wmnet via the backup_s3_postgresql systemd timer.
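
You can check when the timer last fired and when it will next run, or trigger a sync manually, with standard systemd commands:

brouberol@db1208:~$ systemctl list-timers backup_s3_postgresql.timer
brouberol@db1208:~$ sudo systemctl start backup_s3_postgresql.service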

If that daily sync is failing, check the error logs via:

$ ssh db1208.eqiad.wmnet
brouberol@db1208:~$ systemctl status backup_s3_postgresql.service
○ backup_s3_postgresql.service - Create local copies of S3 buckets containing PostgreSQL backups to the local filesystem
     Loaded: loaded (/lib/systemd/system/backup_s3_postgresql.service; static)
     Active: inactive (dead) since Mon 2024-10-14 09:21:23 UTC; 4h 12min ago
TriggeredBy:  backup_s3_postgresql.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2296733 ExecStart=/usr/bin/rclone -v --config /srv/postgresql_backups/rclone.conf --exclude /rclone.conf sync everything: /srv/postgresql_backups (code=exited, status=0/SUCCESS)
   Main PID: 2296733 (code=exited, status=0/SUCCESS)
        CPU: 30.644s

brouberol@db1208:~$ sudo journalctl -u backup_s3_postgresql.service -n 20
Oct 14 09:21:20 db1208 rclone[2296733]: INFO  : postgresql-airflow-test-k8s.dse-k8s-eqiad/postgresql-airflow-test-k8s/base/20241010T000000/backup.info: Copied (new)
...
Oct 14 09:21:23 db1208 rclone[2296733]: INFO  : postgresql-airflow-test-k8s.dse-k8s-eqiad/postgresql-airflow-test-k8s/base/20241014T000000/data.tar.gz: Copied (new)
Oct 14 09:21:23 db1208 rclone[2296733]: INFO  :
Oct 14 09:21:23 db1208 rclone[2296733]: Transferred:               1.497 GiB / 1.497 GiB, 100%, 93.372 MiB/s, ETA 0s
Oct 14 09:21:23 db1208 rclone[2296733]: Transferred:         5035 / 5035, 100%
Oct 14 09:21:23 db1208 rclone[2296733]: Elapsed time:        16.9s
Oct 14 09:21:23 db1208 systemd[1]: backup_s3_postgresql.service: Deactivated successfully.
Oct 14 09:21:23 db1208 systemd[1]: Finished backup_s3_postgresql.service - Create local copies of S3 buckets containing PostgreSQL backups to the local filesystem.
Oct 14 09:21:23 db1208 systemd[1]: backup_s3_postgresql.service: Consumed 30.644s CPU time.

Investigating why WALs are not being archived

The primary instance should archive its WAL files every 5 minutes. First off, find which instance is currently acting as the PG primary. Then tail the logs from that instance, either using kubectl or the logstash dashboard.
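
The CloudNativePG Cluster resource reports the current primary in its status. A minimal example, assuming the namespace matches the cluster name as elsewhere on this page:

brouberol@deploy2002:~$ kubectl get clusters.postgresql.cnpg.io postgresql-airflow-test-k8s -n postgresql-airflow-test-k8s -o jsonpath='{.status.currentPrimary}{"\n"}'
postgresql-airflow-test-k8s-2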

brouberol@deploy2002:~$ kubectl logs postgresql-airflow-test-k8s-2 | grep archive

These logs should shed some light on the current issue.

Here's an example of such an investigation.

  1. https://phabricator.wikimedia.org/T372282
  2. https://phabricator.wikimedia.org/T372283