When I deleted the on-demand snapshots this morning, all automatic snapshots were blocked again (they appeared with a size of 0 B in the Rancher UI).
Finally, changing the automatic snapshot retention to 2 (the default is 5) seems to clean up the corrupted snapshots.
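For reference, on a standalone RKE2 server the equivalent retention setting lives in the server config; a minimal sketch, assuming direct access to the node (Rancher-provisioned clusters normally drive this from the cluster spec instead):

# a sketch: lower the automatic snapshot retention on an RKE2 server node
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
etcd-snapshot-retention: 2
EOF
systemctl restart rke2-server   # restart so the new retention takes effect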
Data on each node:
root@rancher-node-admin-rke2-mgmt1:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42548
-rw------- 1 root root 76881952 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt1-1718706304
-rw------- 1 root root 76881952 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt1-1718706602

root@rancher-node-admin-rke2-mgmt2:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42507
-rw------- 1 root root 76546080 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt2-1718706302
-rw------- 1 root root 76546080 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt2-1718706604

root@rancher-node-admin-rke2-mgmt3:~# ls -trl /var/lib/rancher/rke2/server/db/snapshots/
total 42431
-rw------- 1 root root 76439584 Jun 18 10:25 etcd-snapshot-rancher-node-admin-rke2-mgmt3-1718706300
-rw------- 1 root root 76439584 Jun 18 10:30 etcd-snapshot-rancher-node-admin-rke2-mgmt3-1718706604
ConfigMap rke2-etcd-snapshots on cluster-admin-rke2:
ᐅ kubectl --context cluster-admin-rke2 get configmap rke2-etcd-snapshots -n kube-system
NAME                  DATA   AGE
rke2-etcd-snapshots   6      298d
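The individual snapshot entries can be listed from the ConfigMap's data keys; a small sketch using jq:

# a sketch: list which snapshot entries the ConfigMap currently holds
kubectl --context cluster-admin-rke2 get configmap rke2-etcd-snapshots \
  -n kube-system -o json | jq -r '.data | keys[]'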
During the Kubernetes upgrades on the admin cluster, I created a snapshot before each upgrade (1.27.14 and 1.28.10). The first one finished without any problem; unfortunately, the second one failed.
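For reference, an on-demand snapshot can also be taken directly on an RKE2 server node; a minimal sketch (the snapshot name is illustrative):

# a sketch: take an on-demand etcd snapshot on a server node before an upgrade
rke2 etcd-snapshot save --name pre-upgrade-1.28.10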
After trying to clean the snapshots, the result is worse than at the beginning: all the snapshots now show a size of 0 MB.
Then, following this Rancher document, I checked the etcd leader, the etcd leases, and the etcd ConfigMaps.
K="kubectl --context archive-staging-rke2"
Checking etcd leader:
ᐅ for etcd in $(eval "$K" get pods -n kube-system -l component=etcd | awk 'NR>1{print $1}'); do
    echo "$etcd"
    eval "$K" -n kube-system exec "$etcd" \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --write-out table
  done
etcd-rancher-node-staging-rke2-mgmt1
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 381b0c8c394cd86a |  3.5.9  |  66 MB  |   true    |   false    |    24     | 574272617  |     574272617      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
etcd-rancher-node-staging-rke2-mgmt2
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 3921c7c16f9db3d4 |  3.5.9  |  66 MB  |   false   |   false    |    24     | 574272625  |     574272625      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
etcd-rancher-node-staging-rke2-mgmt3
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | aa5bdc1bc2613335 |  3.5.9  |  66 MB  |   false   |   false    |    24     | 574272631  |     574272631      |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Checking rke2 and rke2-etcd leases:
ᐅ for name in rke2 rke2-etcd; do
    eval "$K" -n kube-system get lease "$name"
  done
NAME   HOLDER                            AGE
rke2   rancher-node-staging-rke2-mgmt3   530d
NAME        HOLDER                            AGE
rke2-etcd   rancher-node-staging-rke2-mgmt2   306d
Checking rke2 and rke2-etcd ConfigMaps:
ᐅ for name in rke2 rke2-etcd; do
    echo "$name"
    eval "$K" -n kube-system get cm "$name" -o yaml | grep holderIdentity
  done
rke2
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-staging-rke2-mgmt2","leaseDurationSeconds":45,"acquireTime":"2024-06-13T16:02:45Z","renewTime":"2024-06-13T16:03:09Z","leaderTransitions":565}'
rke2-etcd
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-staging-rke2-mgmt2","leaseDurationSeconds":45,"acquireTime":"2024-06-13T16:02:44Z","renewTime":"2024-06-13T16:03:08Z","leaderTransitions":512}'
On the test-staging-rke2 cluster, after adding 2 control-plane/etcd nodes, the lease and the ConfigMaps seem to be consistent.
Node etcd-rancher-node-test-rke2-mgmt1 is the leader:
ᐅ kbt exec etcd-rancher-node-test-rke2-mgmt1 -n kube-system \
    -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
    --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
    --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
    endpoint status --write-out table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | fac31d16f74c8ff6 |  3.5.9  |  46 MB  |   true    |   false    |     3     |   751200   |       751200       |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
ᐅ for cm in rke2 rke2-etcd; do
    echo '---'
    kbt get cm "$cm" -n kube-system -o yaml | \
      awk '/(name|kind):|holderIdentity/'
  done
---
kind: ConfigMap
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-test-rke2-mgmt1","leaseDurationSeconds":45,"acquireTime":"2024-03-06T17:16:17Z","renewTime":"2024-06-12T15:33:09Z","leaderTransitions":299}'
  name: rke2
---
kind: ConfigMap
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"rancher-node-test-rke2-mgmt1","leaseDurationSeconds":45,"acquireTime":"2024-03-06T17:16:17Z","renewTime":"2024-06-12T15:33:07Z","leaderTransitions":316}'
  name: rke2-etcd
So only the mgmt1 snapshots are available in the Rancher UI.
Yesterday I forgot to update the firewall aliases with the new nodes' IPs, but after cleaning up all the on-demand snapshots, creating a new manual snapshot from the Rancher web UI gives 3 snapshots of mgmt1.
On the archive-staging-rke2 cluster, aligning the holder identity of the rke2 lease with the one in the rke2-etcd lease brings the snapshots back to a successful state.
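This kind of alignment can be done with a merge patch on the lease; a minimal sketch, assuming the target holder is the one reported by the rke2-etcd lease above:

# a sketch: make the rke2 lease holder match the rke2-etcd lease holder
K="kubectl --context archive-staging-rke2"
eval "$K" -n kube-system patch lease rke2 --type merge \
  -p '{"spec":{"holderIdentity":"rancher-node-staging-rke2-mgmt2"}}'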
As on the test-staging-rke2 cluster, only the first control-plane/etcd node's snapshots are synchronized to the Rancher cluster.
On all Rancher downstream clusters, the holder identity of the rke2 lease and of the rke2 and rke2-etcd ConfigMaps has been updated to match the holder identity of the rke2-etcd lease.
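A sketch of how that update can be scripted, to be run once per downstream cluster context (the target holder is read from the rke2-etcd lease, and jq rewrites the JSON embedded in the leader annotation):

# a sketch: align the rke2 lease and the ConfigMap leader annotations
# with the holder identity of the rke2-etcd lease
NODE=$(kubectl -n kube-system get lease rke2-etcd -o jsonpath='{.spec.holderIdentity}')
kubectl -n kube-system patch lease rke2 --type merge \
  -p "{\"spec\":{\"holderIdentity\":\"$NODE\"}}"
for cm in rke2 rke2-etcd; do
  current=$(kubectl -n kube-system get cm "$cm" \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}')
  updated=$(printf '%s' "$current" | jq -c --arg n "$NODE" '.holderIdentity = $n')
  kubectl -n kube-system annotate cm "$cm" --overwrite \
    "control-plane.alpha.kubernetes.io/leader=$updated"
done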
All the on-demand snapshots have been deleted.
Some on-demand snapshots still appear in the Rancher web UI; let the Rancher automatic snapshots do their job (they should do the cleaning).
I'm not sure the fleet-agent registration errors on test-staging-rke2 are related to the snapshot issue.
(kubectl --context test-staging-rke2 logs -l app=fleet-agent -n cattle-fleet-system)
Automatic snapshots have been processed: the test-staging-rke2 and archive-staging-rke2 clusters are fine; cluster-admin-rke2 and archive-production-rke2 once again have an empty snapshot (and the on-demand snapshots still appear in the web UI).
- save the last snapshot on node rancher-node-staging-rke2-mgmt1 (/root/etcd-snapshots-backups-20240710/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605);
- remove all data in the rke2-etcd-snapshots ConfigMap;
- remove the last snapshot on node rancher-node-staging-rke2-mgmt1 (see the sketch after this list).
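A sketch of those three steps, assuming direct access to the node (paths are the ones above; the kubectl context is illustrative):

# 1. save the last snapshot aside
mkdir -p /root/etcd-snapshots-backups-20240710
cp /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605 \
   /root/etcd-snapshots-backups-20240710/

# 2. empty the rke2-etcd-snapshots ConfigMap
#    (a JSON merge patch with null drops the whole .data field)
kubectl --context cluster-admin-rke2 -n kube-system \
  patch configmap rke2-etcd-snapshots --type merge -p '{"data":null}'

# 3. remove the snapshot file itself
rm /var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1720623605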
The Rancher web UI no longer contains any snapshots (there are 0 etcdsnapshot resources on the Rancher cluster).
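This can be double-checked on the Rancher (local) cluster; a sketch, assuming the etcdsnapshots resource belongs to the rke.cattle.io group (verify with kubectl api-resources | grep -i etcdsnapshot):

# a sketch: list the snapshot objects known to the Rancher cluster
kubectl --context local get etcdsnapshots.rke.cattle.io -A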
The next automatic snapshots should create a clean snapshot.
On cluster-admin-rke2 the new snapshots appear with a 0 B size in the Rancher UI.
The ConfigMaps and the rke2 lease match the holder identity of the rke2-etcd lease:
ᐅ kba get lease rke2-etcd -n kube-system -o jsonpath='{.spec.holderIdentity}'
rancher-node-admin-rke2-mgmt1
ᐅ kba get lease rke2 -n kube-system -o jsonpath='{.spec.holderIdentity}'
rancher-node-admin-rke2-mgmt1
ᐅ kba get cm rke2-etcd -n kube-system \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
    jq -r '.holderIdentity'
rancher-node-admin-rke2-mgmt1
ᐅ kba get cm rke2 -n kube-system \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
    jq -r '.holderIdentity'
rancher-node-admin-rke2-mgmt1
On the master nodes' filesystem the snapshots seem to be fine:
On-demand snapshots on test-staging-rke2 and archive-staging-rke2 no longer fail and block the snapshot process.
It creates 3 snapshots of the first master node (mgmt1). The rke2-etcd-snapshots ConfigMap is updated on the downstream cluster and the etcdsnapshot resources are created on the Rancher cluster.
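Both sides can be verified with something like (a sketch; context names as used above):

# downstream cluster: the ConfigMap should list the new entries
kubectl --context test-staging-rke2 -n kube-system get configmap rke2-etcd-snapshots
# Rancher cluster: the matching etcdsnapshot resources should exist
kubectl --context local get etcdsnapshots.rke.cattle.io -A | grep test-staging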
I removed all the on-demand snapshots on all downstream clusters and everything is still fine.
I deleted all the snapshot backups I had made on all nodes.