When I deleted the on-demand snapshots this morning, all automatic snapshots were blocked again (they appeared with a size of 0 B in the Rancher UI).
Finally, changing the automatic snapshot retention to 2 (the default is 5) seems to clean up the corrupted snapshots.
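For reference, on a standalone RKE2 server the equivalent retention setting lives in the server config; a minimal sketch, assuming direct access to the node (Rancher-provisioned clusters normally drive this from the cluster spec instead):

# a sketch: lower the automatic snapshot retention on an RKE2 server node
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
etcd-snapshot-retention: 2
EOF
systemctl restart rke2-server   # restart so the new retention takes effect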
Data on each node:
root@rancher-node-admin-rke2-mgmt1:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42548
-rw------- 1 root root 76881952 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt1-1718706304
-rw------- 1 root root 76881952 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt1-1718706602

root@rancher-node-admin-rke2-mgmt2:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42507
-rw------- 1 root root 76546080 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt2-1718706302
-rw------- 1 root root 76546080 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt2-1718706604

root@rancher-node-admin-rke2-mgmt3:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42431
-rw------- 1 root root 76439584 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt3-1718706300
-rw------- 1 root root 76439584 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt3-1718706604
ConfigMap rke2-etcd-snapshots on cluster-admin-rke2:
ᐅ kubectl --context cluster-admin-rke2 get configmap rke2-etcd-snapshots -n kube-system
NAME                  DATA   AGE
rke2-etcd-snapshots   6      298d
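The individual snapshot entries can be listed from the ConfigMap's data keys; a small sketch using jq:

# a sketch: list which snapshot entries the ConfigMap currently holds
kubectl --context cluster-admin-rke2 get configmap rke2-etcd-snapshots \
  -n kube-system -o json | jq -r '.data | keys[]'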
During the Kubernetes upgrades on the admin cluster, I created a snapshot before each upgrade (1.27.14 and 1.28.10). The first one finished without any problem; unfortunately, the second one failed.
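For reference, an on-demand snapshot can also be taken directly on an RKE2 server node; a minimal sketch (the snapshot name is illustrative):

# a sketch: take an on-demand etcd snapshot on a server node before an upgrade
rke2 etcd-snapshot save --name pre-upgrade-1.28.10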
After trying to clean the snapshots, the result is worse than at the beginning: all the snapshots now show a size of 0 MB.
Then, following this Rancher document, I checked the etcd leader, the etcd leases, and the etcd ConfigMaps.
K="kubectl --context archive-staging-rke2"
Checking etcd leader:
ᐅ for etcd in $(eval "$K" get pods -n kube-system -l component=etcd | awk 'NR>1{print $1}'); do
    echo "$etcd"
    eval "$K" -n kube-system exec "$etcd" \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --write-out table
  done
etcd-rancher-node-staging-rke2-mgmt1
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 381b0c8c394cd86a |  3.5.9  |  66 MB  |   true    |   false    |    24     | 574272617  |     574272617      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
etcd-rancher-node-staging-rke2-mgmt2
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 3921c7c16f9db3d4 |  3.5.9  |  66 MB  |   false   |   false    |    24     | 574272625  |     574272625      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
etcd-rancher-node-staging-rke2-mgmt3
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | aa5bdc1bc2613335 |  3.5.9  |  66 MB  |   false   |   false    |    24     | 574272631  |     574272631      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Checking rke2 and rke2-etcd leases:
ᐅ for name in rke2 rke2-etcd; do
    eval "$K" -n kube-system get lease "$name"
  done
NAME   HOLDER                            AGE
rke2   rancher-node-staging-rke2-mgmt3   530d
NAME        HOLDER                            AGE
rke2-etcd   rancher-node-staging-rke2-mgmt2   306d
Checking rke2 and rke2-etcd ConfigMaps:
ᐅ for name in rke2 rke2-etcd; do
    echo "$name"
    eval "$K" -n kube-system get cm "$name" -o yaml | grep holderIdentity
  done
rke2
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-staging-rke2-mgmt2","leaseDurationSeconds":45,"acquireTime":"2024-06-13T16:02:45Z","renewTime":"2024-06-13T16:03:09Z","leaderTransitions":565}'
rke2-etcd
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-staging-rke2-mgmt2","leaseDurationSeconds":45,"acquireTime":"2024-06-13T16:02:44Z","renewTime":"2024-06-13T16:03:08Z","leaderTransitions":512}'
On the test-staging-rke2 cluster, after adding 2 control-plane/etcd nodes, the lease and the ConfigMaps seem to be consistent.
Node etcd-rancher-node-test-rke2-mgmt1 is the leader:
ᐅ kbt exec etcd-rancher-node-test-rke2-mgmt1 -n kube-system \
    -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
    --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
    --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
    endpoint status --write-out table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | fac31d16f74c8ff6 |  3.5.9  |  46 MB  |   true    |   false    |     3     |   751200   |       751200       |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
ᐅ for cm in rke2 rke2-etcd; do
    echo '---'
    kbt get cm "$cm" -n kube-system -o yaml | \
      awk '/(name|kind):|holderIdentity/'
  done
---
kind: ConfigMap
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-test-rke2-mgmt1","leaseDurationSeconds":45,"acquireTime":"2024-03-06T17:16:17Z","renewTime":"2024-06-12T15:33:09Z","leaderTransitions":299}'
  name: rke2
---
kind: ConfigMap
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-test-rke2-mgmt1","leaseDurationSeconds":45,"acquireTime":"2024-03-06T17:16:17Z","renewTime":"2024-06-12T15:33:07Z","leaderTransitions":316}'
  name: rke2-etcd
So only the mgmt1 snapshots are available in the Rancher UI.
Yesterday I forgot to update the firewall aliases with the new nodes' IPs, but after cleaning up all the on-demand snapshots, creating a new manual snapshot from the Rancher web UI gives 3 snapshots of mgmt1.
On the archive-staging-rke2 cluster, aligning the holder identity of the rke2 lease with the one in the rke2-etcd lease brings the snapshots back to a successful state.
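This kind of alignment can be done with a merge patch on the lease; a minimal sketch, assuming the target holder is the one reported by the rke2-etcd lease above:

# a sketch: make the rke2 lease holder match the rke2-etcd lease holder
K="kubectl --context archive-staging-rke2"
eval "$K" -n kube-system patch lease rke2 --type merge \
  -p '{"spec":{"holderIdentity":"rancher-node-staging-rke2-mgmt2"}}'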
As on the test-staging-rke2 cluster, only the first control-plane/etcd node's snapshots are synchronized to the Rancher cluster.
On all Rancher downstream clusters, the holder identity of the rke2 lease and of the rke2 and rke2-etcd ConfigMaps has been updated to match the holder identity of the rke2-etcd lease.
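A sketch of how that update can be scripted, to be run once per downstream cluster context (the target holder is read from the rke2-etcd lease, and jq rewrites the JSON embedded in the leader annotation):

# a sketch: align the rke2 lease and the ConfigMap leader annotations
# with the holder identity of the rke2-etcd lease
NODE=$(kubectl -n kube-system get lease rke2-etcd -o jsonpath='{.spec.holderIdentity}')
kubectl -n kube-system patch lease rke2 --type merge \
  -p "{\"spec\":{\"holderIdentity\":\"$NODE\"}}"
for cm in rke2 rke2-etcd; do
  current=$(kubectl -n kube-system get cm "$cm" \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}')
  updated=$(printf '%s' "$current" | jq -c --arg n "$NODE" '.holderIdentity = $n')
  kubectl -n kube-system annotate cm "$cm" --overwrite \
    "control-plane.alpha.kubernetes.io/leader=$updated"
done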
All the on-demand snapshots have been deleted.
Some on-demand snapshots still appear in the Rancher web UI; let the Rancher automatic snapshots do their job (they should do the cleaning).
I'm not sure the fleet-agent registration errors on test-staging-rke2 are related to the snapshot issue.
(kubectl --context test-staging-rke2 logs -l app=fleet-agent -n cattle-fleet-system)
Automatic snapshots have been processed: the test-staging-rke2 and archive-staging-rke2 clusters are fine; cluster-admin-rke2 and archive-production-rke2 once again have an empty snapshot (and the on-demand snapshots still appear in the web UI).
- save the last snapshot on node rancher-node-staging-rke2-mgmt1 (/root/etcd-snapshots-backups-20240710/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605);
- remove all data in the rke2-etcd-snapshots ConfigMap;
- remove the last snapshot on node rancher-node-staging-rke2-mgmt1 (see the sketch after this list).
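A sketch of those three steps, assuming direct access to the node (paths are the ones above; the kubectl context is illustrative):

# 1. save the last snapshot aside
mkdir -p /root/etcd-snapshots-backups-20240710
cp /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605 \
   /root/etcd-snapshots-backups-20240710/

# 2. empty the rke2-etcd-snapshots ConfigMap
#    (a JSON merge patch with null drops the whole .data field)
kubectl --context cluster-admin-rke2 -n kube-system \
  patch configmap rke2-etcd-snapshots --type merge -p '{"data":null}'

# 3. remove the snapshot file itself
rm /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605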
The Rancher web UI no longer contains any snapshots (there are 0 etcdsnapshot resources on the Rancher cluster).
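This can be double-checked on the Rancher (local) cluster; a sketch, assuming the etcdsnapshots resource belongs to the rke.cattle.io group (verify with kubectl api-resources | grep -i etcdsnapshot):

# a sketch: list the snapshot objects known to the Rancher cluster
kubectl --context local get etcdsnapshots.rke.cattle.io -A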
The next automatic snapshots should create a clean snapshot.
On cluster-admin-rke2 the new snapshots appear with a 0 B size in the Rancher UI.
The ConfigMaps and the rke2 lease match the holder identity of the rke2-etcd lease:
ᐅ kba get lease rke2-etcd -n kube-system -o jsonpath='{.spec.holderIdentity}'
rancher-node-admin-rke2-mgmt1
ᐅ kba get lease rke2 -n kube-system -o jsonpath='{.spec.holderIdentity}'
rancher-node-admin-rke2-mgmt1
ᐅ kba get cm rke2-etcd -n kube-system \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
    jq -r '.holderIdentity'
rancher-node-admin-rke2-mgmt1
ᐅ kba get cm rke2 -n kube-system \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
    jq -r '.holderIdentity'
rancher-node-admin-rke2-mgmt1
On the master nodes' filesystem the snapshots seem to be fine:
On-demand snapshots on test-staging-rke2 and archive-staging-rke2 no longer fail and block the snapshot process.
It creates 3 snapshots of the first master node (mgmt1). The rke2-etcd-snapshots ConfigMap is updated on the downstream cluster and the etcdsnapshot resources are created on the Rancher cluster.
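Both sides can be verified with something like (a sketch; context names as used above):

# downstream cluster: the ConfigMap should list the new entries
kubectl --context test-staging-rke2 -n kube-system get configmap rke2-etcd-snapshots
# Rancher cluster: the matching etcdsnapshot resources should exist
kubectl --context local get etcdsnapshots.rke.cattle.io -A | grep test-staging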
I removed all the on-demand snapshots on all downstream clusters and everything is still fine.
I deleted all the snapshot backups I had made on all nodes.