Data Platform/Systems/CloudnativePG/Clusters

The CloudnativePG operator makes it straightforward to deploy PostgreSQL clusters in Kubernetes.

We have created a custom cloudnative-pg-cluster chart that allows users to create a cluster from a very simple YAML values file, such as this one:

fullnameOverride: postgres-airflow-analytics

By default, each cluster uploads base backups and WALs to a dedicated S3 bucket and performs a daily base backup at midnight UTC.
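
For reference, the daily backup is driven by a CloudnativePG ScheduledBackup resource managed by the chart. A minimal sketch of what such a resource looks like (the resource name and exact schedule shown here are illustrative, not the chart's actual defaults):

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: postgresql-test-daily-backup
spec:
  # six-field cron expression (with seconds): every day at midnight UTC
  schedule: "0 0 0 * * *"
  backupOwnerReference: self
  cluster:
    name: postgresql-test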

Observability

Dashboard

All cloudnative-pg clusters are instrumented using Prometheus and can be observed via this Grafana dashboard.

Monitoring

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-platform/cloudnative-pg.yaml

Logs

https://logstash.wikimedia.org/app/dashboards#/view/1a9f3aa0-8095-11ef-a846-9f500bbb7f51?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-15m%2Cto%3Anow))

Operations

Finding which instance is the current PG primary

brouberol@deploy2002:~$ kubectl get cluster postgresql-test -ojson | jq -r .status.currentPrimary
postgresql-test-2
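
If the cnpg kubectl plugin is available on the deployment server (this may not be the case), its status subcommand also reports the current primary, along with replication and backup information:

brouberol@deploy2002:~$ kubectl cnpg status postgresql-test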

Performing a manual backup

[1] On the deployment server, run the following commands.

Make sure to replace "postgresql-test" with whatever cluster name is appropriate.
brouberol@deploy1003:~$ cat <<EOF >manual-backup.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: postgresql-test-on-demand-backup-$(date "+%Y%m%d%H%M%S")
spec:
  method: barmanObjectStore
  cluster:
    name: postgresql-test
EOF
brouberol@deploy1003:/home/brouberol# kubectl create -f manual-backup.yaml -n postgresql-test
backup.postgresql.cnpg.io/postgresql-test-on-demand-backup-20240917141528 created

You can then check the state of the backup with kubectl.

brouberol@deploy1003:/home/brouberol# kubectl get backups -n postgresql-test  --sort-by='{metadata.creationTimestamp}' --no-headers | tac
postgresql-test-on-demand-backup-20240917141528   72s   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240917000000       14h   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240916073251       30h   postgresql-test   barmanObjectStore   completed
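
To inspect a single backup in more detail, you can query its Backup resource directly, for example to read its phase (using the on-demand backup created above):

brouberol@deploy1003:~$ kubectl get backup postgresql-test-on-demand-backup-20240917141528 -n postgresql-test -o jsonpath='{.status.phase}{"\n"}'
completed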

Restoring a backup by replaying all WAL files

[2] The easiest way to restore the latest archived state of a pre-existing database is to deploy it with the following values:

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
backups:
  enabled: false

When the cluster is deployed, this will start a full-recovery pod that replays all WAL files, after which the PG pods themselves will be deployed. Once everything is deployed successfully, redeploy the cluster with the values above either removed or commented out. This ensures the cluster starts in normal mode and re-enables backups (a backup will be immediately triggered for good measure). You can track the cluster's progress as shown below.
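
One way to keep an eye on the recovery and subsequent bootstrap is to watch the cluster phase reported by the operator (the cluster and namespace names below should match your deployment). Once everything has converged you should see something like:

brouberol@deploy1003:~$ kubectl get cluster postgresql-test -n postgresql-test -o jsonpath='{.status.phase}{"\n"}'
Cluster in healthy state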

Performing a Point In Time Recovery (PITR)

[2] A PITR allows you to restore the database to the state it was in at a given point in time.

You can list the available base backups, most recent first, by running:

root@deploy1003:~# kubectl get backups -n postgresql-test  --sort-by='{metadata.creationTimestamp}' --no-headers | tac
postgresql-test-daily-backup-20240917000000   13h   postgresql-test   barmanObjectStore   completed
postgresql-test-daily-backup-20240916073251   30h   postgresql-test   barmanObjectStore   completed

Say we'd like to restore postgresql-test-daily-backup-20240917000000. To do this, first extract the creation timestamp of the backup resource:

root@deploy1003:~# kubectl get backups -n postgresql-test  -o jsonpath='{.metadata.creationTimestamp}{"\n"}' postgresql-test-daily-backup-20240917000000
2024-09-17T00:00:00Z

You then need to convert this timestamp to the format understood by the https://pgbarman.org/ tools used to restore the backup.

$ python3 -c'import datetime as dt;print(dt.datetime.fromisoformat("2024-09-17T00:00:00Z").strftime("%Y-%m-%d %H:%M:%S.%f%z"))'
2024-09-17 00:00:00.000000+0000
Make sure to run this one-liner with Python 3.11 or later, as datetime.fromisoformat() only accepts the Z suffix from Python 3.11 onwards.

You can then deploy the cluster with the following values:

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
  recoveryTarget:
    targetTime: "2024-09-17 00:00:00.000000+0000"
backups:
  enabled: false

Once the cluster is deployed and running, redeploy it with the values above either commented out or deleted. This ensures the cluster starts in normal mode and re-enables backups (a backup will be immediately triggered for good measure).

If, however, you want to perform a PITR to a timestamp that does not align with when a backup was taken, you can simply specify the timestamp to which you'd like the database to be restored. The operator will select the most recent backup completed before that target, restore it, and replay the WAL until it reaches the specified timestamp, as in the example below.
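
For instance, to restore the cluster to an arbitrary point in time between two base backups (the timestamp below is purely illustrative), deploy it with:

mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-test
  recoveryTarget:
    targetTime: "2024-09-17 08:30:00.000000+0000"
backups:
  enabled: false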

Increasing the storage size of a cluster

[3] As we're storing the data on Ceph, we can resize the volume in which PG stores its data on the fly. To do this, simply adjust the cluster.storage.size cluster configuration value.

Assuming the cluster was deployed with the following configuration

cluster:
  storage:
    size: 10Gi

then change it to

cluster:
  storage:
    size: 15Gi

and redeploy the cluster. The volumes should be automatically upsized within minutes; you can verify this as shown below.
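
One way to check that the resize went through is to look at the capacity reported by the cluster's PersistentVolumeClaims (the namespace below is illustrative and should match the cluster being resized):

brouberol@deploy1003:~$ kubectl get pvc -n postgresql-test -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage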

Bear in mind that volumes cannot be downsized.

Increasing the WAL storage size of a cluster

Follow the same instructions as in Data Platform/Systems/CloudnativePG/Clusters#Increasing the storage size of a cluster, but adjust the cluster.walStorage.size parameter instead, as shown in the example below.
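
For example, assuming the cluster was deployed with a 5Gi WAL volume (an illustrative value), you would bump it like so and redeploy:

cluster:
  walStorage:
    size: 10Gi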

Increasing the max number of connections on PostgreSQL

PostgreSQL server configuration tunables can be set and tweaked via the cluster.postgresql.parameters YAML dictionary (except for a handful of parameters that the operator manages itself). For example, if you'd like to increase the max_connections setting to 300 in a cluster using all default configuration values, change the cluster configuration to

cluster:
  postgresql:
    parameters:
      max_connections: 300
As each connection spawns a dedicated backend process on the PostgreSQL server, keep an eye on resource consumption and don't hesitate to increase the cluster's CPU and RAM allocation if needed (see the sketch below).
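
A sketch of what such a resource bump could look like, assuming the chart exposes standard Kubernetes requests/limits under a cluster.resources key (this key name is an assumption; check the chart's values.yaml for the actual location):

cluster:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi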

Increasing the max number of client connections on PGBouncer

[4] A subset of PGBouncer configuration options can be tweaked from the release YAML values themselves. For example, if PGBouncer is getting near its client connection limit, you can increase it by tweaking the pooler.parameters dictionary entries:

pooler:
  parameters:
    max_client_conn: 200

All PGBouncer configuration options are available here: https://www.pgbouncer.org/config.html

Import data from an external database

The process for importing data from a database running outside of Kubernetes, described in the CloudnativePG documentation, can be performed by deploying the following values:

cluster:
  initdb:
    import:
      host: an-db1001.eqiad.wmnet
      user: airflow-test-k8s
      dbname: airflow-test-k8s
      password: xxxx # commit this in the private puppet repo

external_services:
  postgresql: [analytics]

Once the cluster has been deployed, exec into the PG primary pod and run a psql shell (see the example below). Check that the data has indeed been imported. Once you are satisfied that it has, delete the password from the private puppet repository, run puppet on the deployment server, then redeploy the cluster with the values above either removed or commented out.
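
A possible way to do this, assuming the primary pod is postgresql-airflow-test-k8s-1 (substitute the actual primary pod, namespace and database names):

brouberol@deploy1003:~$ kubectl exec -it postgresql-airflow-test-k8s-1 -n postgresql-airflow-test-k8s -c postgres -- psql -d airflow-test-k8s
airflow-test-k8s=# \dt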

Once the data has been imported, the two databases will diverge if the source database is still being written to. If the import is used to facilitate a database migration, you should first turn off any writes to the source database, import it into Kubernetes, then redeploy the application with its configuration pointing to the newly deployed CloudnativePG cluster.

Checking the status of the out-of-band backups of backups

We perform a daily sync of all CloudnativePG clusters' S3 buckets to db1208.eqiad.wmnet, so that they get included in the Bacula out-of-band backups. This daily sync runs on db1208.eqiad.wmnet, via the backup_s3_postgresql systemd timer.

If that daily sync is failing, check the error logs via:

$ ssh db1208.eqiad.wmnet
brouberol@db1208:~$ systemctl status backup_s3_postgresql.service
○ backup_s3_postgresql.service - Create local copies of S3 buckets containing PostgreSQL backups to the local filesystem
     Loaded: loaded (/lib/systemd/system/backup_s3_postgresql.service; static)
     Active: inactive (dead) since Mon 2024-10-14 09:21:23 UTC; 4h 12min ago
TriggeredBy:  backup_s3_postgresql.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2296733 ExecStart=/usr/bin/rclone -v --config /srv/postgresql_backups/rclone.conf --exclude /rclone.conf sync everything: /srv/postgresql_backups (code=exited, status=0/SUCCESS)
   Main PID: 2296733 (code=exited, status=0/SUCCESS)
        CPU: 30.644s

brouberol@db1208:~$ sudo journalctl -u backup_s3_postgresql.service -n 20
Oct 14 09:21:20 db1208 rclone[2296733]: INFO  : postgresql-airflow-test-k8s.dse-k8s-eqiad/postgresql-airflow-test-k8s/base/20241010T000000/backup.info: Copied (new)
...
Oct 14 09:21:23 db1208 rclone[2296733]: INFO  : postgresql-airflow-test-k8s.dse-k8s-eqiad/postgresql-airflow-test-k8s/base/20241014T000000/data.tar.gz: Copied (new)
Oct 14 09:21:23 db1208 rclone[2296733]: INFO  :
Oct 14 09:21:23 db1208 rclone[2296733]: Transferred:               1.497 GiB / 1.497 GiB, 100%, 93.372 MiB/s, ETA 0s
Oct 14 09:21:23 db1208 rclone[2296733]: Transferred:         5035 / 5035, 100%
Oct 14 09:21:23 db1208 rclone[2296733]: Elapsed time:        16.9s
Oct 14 09:21:23 db1208 systemd[1]: backup_s3_postgresql.service: Deactivated successfully.
Oct 14 09:21:23 db1208 systemd[1]: Finished backup_s3_postgresql.service - Create local copies of S3 buckets containing PostgreSQL backups to the local filesystem.
Oct 14 09:21:23 db1208 systemd[1]: backup_s3_postgresql.service: Consumed 30.644s CPU time.
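
If needed, the sync can also be re-triggered manually by starting the service and following its logs (this assumes you have the required sudo rights on db1208):

brouberol@db1208:~$ sudo systemctl start backup_s3_postgresql.service
brouberol@db1208:~$ sudo journalctl -u backup_s3_postgresql.service -f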

Investigating why WALs are not being archived

The primary instance should archive its WAL files every 5 minutes. First, find which instance is currently acting as the PG primary. Then tail the logs from that instance, either using kubectl or the logstash dashboard.

brouberol@deploy2002:~$ kubectl logs postgresql-airflow-test-k8s-2 | grep archive

These logs should shed some light on the current issue.
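
The cluster status also exposes a ContinuousArchiving condition set by the operator, which summarizes whether WAL archiving is currently healthy:

brouberol@deploy2002:~$ kubectl get cluster postgresql-airflow-test-k8s -ojson | jq '.status.conditions[] | select(.type == "ContinuousArchiving")'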

Here's an example of such an investigation.

References

  1. https://phabricator.wikimedia.org/T372282
  2. https://phabricator.wikimedia.org/T372283
  3. https://phabricator.wikimedia.org/T372276
  4. https://phabricator.wikimedia.org/T374950