Admin database backup
Now that a server is dedicated to the admin databases (#3833 (closed)), we should back up the postgresql instance to ensure the data of all the admin/support tools is correctly secured.
Migrated from T3889 (view on Phabricator)
Activity
- Vincent Sellier added System administration label
- Vincent Sellier changed the description
- Antoine R. Dumont added priority:Normal label
- Author Owner
Unfortunately I didn't find any tool that natively supports backing up postgresql through zfs snapshots. We have a couple of alternatives:
- Use borg to implement a clean backup of the postgresql data directory based on hooks [1]
- Use a tool like pgbackrest [2]; the backup type and storage target need to be chosen (S3, storage server, ...)
- Implement a backup based on zfs snapshots, which should not be too complicated, but we would need to manage all the plumbing to deal with full/incremental backups and the history cleaning

[1] https://torsion.org/borgmatic/#it's-your-data.-keep-it-that-way.
[3] migrated/migration$1258
The zfs snapshots can be stored as plain files or sent to another server. As a first step, we could send the backups to banco.
Example:
- Creating an initial full backup
zfs send -v data/postgresql@001 | pigz -9 > /tmp/pg_001-full.gz
- Creating an incremental backup based on the previous backup
zfs send -v -i data/postgresql@001 data/postgresql@002 | pigz -9 > /tmp/pg_002-inc.gz
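For reference, restoring from these files would mean replaying the full stream and then the incremental one; a minimal sketch, assuming a scratch target dataset named data/postgresql_restore:
pigz -dc /tmp/pg_001-full.gz | zfs receive data/postgresql_restore
pigz -dc /tmp/pg_002-inc.gz | zfs receive data/postgresql_restore
The full receive recreates the @001 snapshot on the target, which is exactly the base the incremental receive expects.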
On dali, a compressed full snapshot takes 45 GB and ~15 min to produce. This can probably be improved by compressing on a server with more CPUs.
root@dali:~# zfs snapshot data/postgresql@test
root@dali:~# zfs list -t all
NAME                   USED  AVAIL     REFER  MOUNTPOINT
data                  61.9G   131G       24K  /data
data/postgresql       61.8G   131G     61.8G  /srv/postgresql/14/main
data/postgresql@test   374K      -     61.8G  -
root@dali:~# time zfs send -v data/postgresql@test | pigz -9 > /data/backup.gz
full send of data/postgresql@test estimated size is 66.9G
total estimated size is 66.9G
TIME        SENT   SNAPSHOT data/postgresql@test
09:39:43   37.9M   data/postgresql@test
09:39:44   73.7M   data/postgresql@test
...
09:55:38   66.9G   data/postgresql@test

real    15m57.296s
user    55m32.894s
sys     1m58.312s
root@dali:~# ls -alh /data
total 45G
drwxr-xr-x  2 root root    3 Jan 27 09:39 .
drwxr-xr-x 19 root root 4.0K Jan 26 06:03 ..
-rw-r--r--  1 root root  45G Jan 27 09:55 backup.gz
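If we want to offload the compression, one option (a sketch only, the remote host and target path are assumptions) is to ship the raw stream over ssh and run pigz on a machine with more CPUs, at the cost of pushing the uncompressed stream through the network:
zfs send -v data/postgresql@001 | ssh <bigger-host> "pigz -9 > /srv/backups/pg_001-full.gz"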
- Vincent Sellier added state:wip label
- Author Owner
The dali database directory tree was prepared so that the WALs live on a dedicated zfs dataset:
root@dali:~# date
Tue Feb 8 18:48:57 UTC 2022
root@dali:~# systemctl stop postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 5ms ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# systemctl status postgresql@14-main
● postgresql@14-main.service - PostgreSQL Cluster 14-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2022-02-08 18:48:58 UTC; 10s ago
    Process: 2705743 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 14-main stop (code=exited, status=0/SUCCESS)
   Main PID: 31293 (code=exited, status=0/SUCCESS)
        CPU: 1d 6h 12min 2.894s

Feb 08 18:48:57 dali systemd[1]: Stopping PostgreSQL Cluster 14-main...
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Succeeded.
Feb 08 18:48:58 dali systemd[1]: Stopped PostgreSQL Cluster 14-main.
Feb 08 18:48:58 dali systemd[1]: postgresql@14-main.service: Consumed 1d 6h 12min 2.894s CPU time.
Warning: journal has been rotated since unit was started, output may be incomplete.
root@dali:~# mv /srv/postgresql/14/main/pg_wal /srv/postgresql/14/main/pg_wal-save
root@dali:~# zfs create -o atime=off -o compression=zstd -o mountpoint=/srv/postgresql/14/main/pg_wal data/postgresql/wal
root@dali:~# zfs list
NAME                  USED  AVAIL     REFER  MOUNTPOINT
data                 66.8G   126G       24K  /data
data/postgresql      66.7G   126G     66.7G  /srv/postgresql/14/main
data/postgresql/wal    24K   126G       24K  /srv/postgresql/14/main/pg_wal
root@dali:~# cp -r /srv/postgresql/14/main/pg_wal-save/* /srv/postgresql/14/main/pg_wal
root@dali:~# chown -R postgres: /srv/postgresql/14/main/pg_wal
root@dali:~# systemctl start postgresql@14-main
root@dali:~# date
Tue Feb 8 18:51:23 UTC 2022
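To double-check that the new dataset picked up the intended properties, a plain zfs get does the job (output omitted here):
zfs get atime,compression,mountpoint data/postgresql/wal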
- Author Owner
The first local snapshots worked:
root@dali:~# zfs list -t all
NAME                                                        USED  AVAIL     REFER  MOUNTPOINT
data                                                       66.7G   126G       24K  /data
data/postgresql                                            66.6G   126G     66.6G  /srv/postgresql/14/main
data/postgresql@autosnap_2022-02-08_19:04:44_monthly       1.47M      -     66.6G  -
data/postgresql@autosnap_2022-02-08_19:04:44_daily          194K      -     66.6G  -
data/postgresql/wal                                        31.8M   126G     14.9M  /srv/postgresql/14/main/pg_wal
data/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly   16.3M      -     31.3M  -
data/postgresql/wal@autosnap_2022-02-08_19:04:44_daily       13K      -     15.0M  -
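The autosnap_* naming indicates the snapshots are taken by sanoid; a minimal sanoid.conf sketch consistent with what is visible here (the template name and exact retention values are assumptions, the real configuration is managed by puppet):
[data/postgresql]
  use_template = admin_db
  recursive = yes

[template_admin_db]
  daily = 30
  monthly = 2
  autosnap = yes
  autoprune = yes
recursive = yes is what also snapshots the data/postgresql/wal child dataset.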
The remote sync will be implemented later. The snapshots will probably be sent to an azure VM with some storage (it will depend on the tests), as we don't have servers with free zfs storage (except saam, but I'm not comfortable using the archive's main zfs storage for that).
- Phabricator Migration user mentioned in commit swh-sysadmin-provisioning@a8a9c4ce
- Author Owner
- backup01 vm created on azure
- zfs installed (will be reported in puppet):
  - add the contrib repository
  - install zfs:
    # apt install linux-headers-cloud-amd64 zfs-dkms
- configure the zfs pool:
root@backup01:~# fdisk /dev/sdc -l
Disk /dev/sdc: 200 GiB, 214748364800 bytes, 419430400 sectors
Disk model: Virtual Disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D0FB08C6-F046-F340-AC8B-D6C9372015D5

Device     Start       End   Sectors  Size Type
root@backup01:~# zpool create data -o ashift=12 wwn-0x6002248004cd2d289007c07940782f60
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   360K   193G       96K  /data
root@backup01:~# zfs set canmount=noauto data
root@backup01:~# zfs set mountpoint=none data
root@backup01:~# zfs list
NAME   USED  AVAIL     REFER  MOUNTPOINT
data   480K   193G       96K  none
root@backup01:~# zfs create data/sync
root@backup01:~# zfs create data/sync/dali
root@backup01:~# zfs list
NAME             USED  AVAIL     REFER  MOUNTPOINT
data             732K   193G       96K  none
data/sync        192K   193G       96K  none
data/sync/dali    96K   193G       96K  none
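An optional hardening step, not shown in the transcript and purely an assumption on my side: keeping the replicated hierarchy read-only so nothing can write into it between syncs (zfs receive still works into a readonly dataset):
zfs set readonly=on data/sync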
- the opnsense gateway on azure needed to be updated to add the admin ip range (192.168.50.0/24)
- a new rule was added on opnsense to allow ssh connections from the backup server
- puppet configuration to replicate the admin database snapshots, done in the following diffs (see the syncoid sketch below)
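For context, the puppet-managed syncoid-dali-postgresql.service visible in the logs below presumably boils down to a pull along these lines (flags and ssh user are assumptions):
/usr/sbin/syncoid --recursive --no-sync-snap root@dali:data/postgresql data/sync/dali/postgresql
--no-sync-snap makes syncoid rely on the existing sanoid snapshots instead of creating its own.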
- Author Owner
The replication is in place:
root@backup01:~# zfs list -t all
NAME                                                                 USED  AVAIL     REFER  MOUNTPOINT
data                                                                 120G  72.5G       96K  none
data/sync                                                            120G  72.5G       96K  none
data/sync/dali                                                       120G  72.5G       96K  none
data/sync/dali/postgresql                                            120G  72.5G     73.2G  none
data/sync/dali/postgresql@autosnap_2022-02-08_19:04:44_monthly      22.9G      -     72.3G  -
data/sync/dali/postgresql@autosnap_2022-02-18_00:00:01_daily        3.11G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-19_00:00:01_daily        2.43G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-20_00:00:01_daily        2.39G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-21_00:00:01_daily        2.44G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-22_00:00:00_daily        2.47G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-23_00:00:02_daily        2.56G      -     73.2G  -
data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily           0B      -     73.2G  -
data/sync/dali/postgresql/wal                                        600M  72.5G     88.4M  none
data/sync/dali/postgresql/wal@autosnap_2022-02-08_19:04:44_monthly  61.9M      -     61.9M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-18_00:00:01_daily    90.9M      -      107M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-19_00:00:01_daily    94.7M      -      111M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-20_00:00:01_daily    55.4M      -     87.5M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-21_00:00:01_daily    50.1M      -     98.2M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-22_00:00:00_daily    57.7M      -      106M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-23_00:00:02_daily    52.8M      -     68.8M  -
data/sync/dali/postgresql/wal@autosnap_2022-02-24_00:00:00_daily       0B      -     88.4M  -
The retention will be 2 monthly snapshots and 30 daily snapshots. The space used should be around 200 GB, so we will probably have to extend the data disk a little.
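On backup01 this retention would typically be expressed as a pruning-only sanoid policy; a sketch, assuming sanoid also handles the pruning there (the template name is made up):
[data/sync/dali/postgresql]
  use_template = backup
  recursive = yes

[template_backup]
  daily = 30
  monthly = 2
  autosnap = no
  autoprune = yes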
The initial transfer took ~40 min:
Feb 24 09:55:15 backup01 systemd[1]: Starting ZFS dataset synchronization of...
Feb 24 09:55:16 backup01 syncoid[93491]: NEWEST SNAPSHOT: autosnap_2022-02-24_00:00:00_daily
Feb 24 09:55:16 backup01 syncoid[93491]: INFO: Sending oldest full snapshot data/postgresql@autosnap_2022-02-08_19:04:44_monthly (~ 72.4 GB) to new target filesystem:
Feb 24 10:52:12 backup01 syncoid[93491]: INFO: Updating new target filesystem with incremental data/postgresql@autosnap_2022-02-08_19:04:44_monthly ... autosnap_2022-02-24_00:00:00_daily (~ 47.4 GB):
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Succeeded.
Feb 24 11:29:17 backup01 systemd[1]: Finished ZFS dataset synchronization of.
Feb 24 11:29:17 backup01 systemd[1]: syncoid-dali-postgresql.service: Consumed 19min 51.324s CPU time.
The read speed on the azure disk is the same as the write speed, so if a full recovery is ever needed, we can expect a recovery time of less than 1h. We can probably reduce this time by changing the data disk configuration on azure, but at some point we will be limited by the vpn bandwidth.
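For reference, a full recovery would essentially be the replication in the other direction; a sketch, restoring onto a scratch dataset on dali so the live one is not touched (snapshot and dataset names are illustrative):
zfs send data/sync/dali/postgresql@autosnap_2022-02-24_00:00:00_daily | ssh root@dali zfs receive data/postgresql_restore
The wal child dataset would be restored the same way.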
- Vincent Sellier removed state:wip label
- Vincent Sellier closed
- Phabricator Migration user mentioned in commit swh-sysadmin-provisioning@6ca24e01