Rancher's etcd backups failed
During the Debian upgrade of the Kubernetes clusters [1], the control-plane nodes were rebooted. As usual in this situation, the etcd snapshots failed.
[1] #5415 (closed)
Activity
- Guillaume Samson added activity::MRO label
- Guillaume Samson assigned to @guillaume
- Antoine R. Dumont changed the description
- Guillaume Samson
In test-staging-rke2, here are the failed snapshots:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736383505 rancher-node-test-rke2-mgmt1
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736369100 rancher-node-test-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736351102 rancher-node-test-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736401502 rancher-node-test-rke2-mgmt2
```
Edited by Guillaume Samson
- Guillaume Samson
In test-staging-rke2:
  - delete corrupted snapshots (see the sketch below);
  - check the rke2-etcd-snapshots configmap;
  - realign leases and configmaps.

```
ᐅ for context in $(kubectx | awk '/test-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
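The exact commands used for the "delete corrupted snapshots" step are not recorded in this comment; below is a minimal sketch of one way to find the corrupted ETCDSnapshot objects on the Rancher side, assuming `kbl` is a kubectl alias pointing at the Rancher local cluster as in the listings above.

```bash
# Sketch only: list the ETCDSnapshot object names whose snapshotFile status is
# not "successful" (the object names differ from the snapshot file names),
# then delete them one by one.
kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
  -l rke.cattle.io/cluster-name=test-staging-rke2 \
  -o jsonpath='{range .items[*]}{.metadata.name} {.snapshotFile.status}{"\n"}{end}' | \
  awk '$2!="successful"{print $1}'
# kbl delete etcdsnapshots.rke.cattle.io -n fleet-default <object-name>
```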
Edited by Guillaume Samson
- Guillaume Samson
In archive-staging-rke2, here are the failed snapshots:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736382602 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736368202 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736350201 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736400603 rancher-node-staging-rke2-mgmt3
```
- Guillaume Samson
In archive-staging-rke2:
  - delete corrupted snapshots;
  - check the rke2-etcd-snapshots configmap;
  - realign leases and configmaps.

```
ᐅ for context in $(kubectx | awk '/archive-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
```
- Guillaume Samson (thread resolved)
In archive-production-rke2, the last s3 snapshots failed because MinIO was unreachable during the cluster-admin-rke2 upgrade:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/2025-01-09T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803 s3 failed 2025-01-09T15:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736434802 s3 failed 2025-01-09T15:00:02Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 s3 failed 2025-01-09T15:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 rancher-node-production-rke2-mgmt1 successful 2025-01-09T15:00:00Z
```
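Since these failures were caused by MinIO being unreachable, the bucket can be probed directly before the next scheduled run. This is only a sketch, assuming the same `rke2-etcd` mc alias used later in this issue:

```bash
# List the most recent objects for this cluster; an immediate error here means
# the S3 endpoint is still unreachable.
mc ls rke2-etcd/backup-rke2-etcd/archive-production-rke2/ | tail -n 5
```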
- Guillaume Samson (thread resolved)
In the archive-staging-rke2 and test-staging-rke2 clusters, the last snapshots still failed:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736436604 rancher-node-staging-rke2-mgmt3
~ ᐅ date -d @1736436604
Thu Jan 9 04:30:04 PM CET 2025
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736437504 rancher-node-test-rke2-mgmt2
~ ᐅ date -d @1736437504
Thu Jan 9 04:45:04 PM CET 2025
```
- Guillaume Samson
Oddly, everything works fine on cluster-admin-rke2:

```
~ ᐅ for context in $(kubectx | awk '/cluster-admin-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster cluster-admin-rke2
    rancher-node-admin-rke2-mgmt3
Leases and configmaps in cluster cluster-admin-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2 rancher-node-admin-rke2-mgmt2
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2-etcd rancher-node-admin-rke2-mgmt2
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=cluster-admin-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/2025-01-09T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736435703 s3 successful 2025-01-09T15:15:03Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736435702 s3 successful 2025-01-09T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736435702 s3 successful 2025-01-09T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736435703 rancher-node-admin-rke2-mgmt1 successful 2025-01-09T15:15:03Z
```
- Guillaume Samson
On archive-production-rke2, the failed snapshots I deleted yesterday are back:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3~/failed/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803 s3 failed 2025-01-09T15:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736434802 s3 failed 2025-01-09T15:00:02Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 s3 failed 2025-01-09T15:00:00Z
```

I removed these snapshot references from the rke2-etcd-snapshots configmap on the archive-production-rke2 cluster, without deleting the etcdsnapshots in the Rancher cluster. It seems to be fine:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      {printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736485200 rancher-node-production-rke2-mgmt1 successful 2025-01-10T05:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736467203 rancher-node-production-rke2-mgmt1 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736485204 s3 successful 2025-01-10T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736416804 s3 successful 2025-01-09T10:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736452801 s3 successful 2025-01-09T20:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736416805 rancher-node-production-rke2-mgmt1 successful 2025-01-09T10:00:05Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736398803 s3 successful 2025-01-09T05:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736467204 s3 successful 2025-01-10T00:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736452803 s3 successful 2025-01-09T20:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736416801 s3 successful 2025-01-09T10:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736452803 rancher-node-production-rke2-mgmt1 successful 2025-01-09T20:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736452804 s3 successful 2025-01-09T20:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736398804 s3 successful 2025-01-09T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736467203 s3 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736416805 s3 successful 2025-01-09T10:00:05Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736398803 s3 successful 2025-01-09T05:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736485200 s3 successful 2025-01-10T05:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736467203 s3 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736485204 s3 successful 2025-01-10T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 rancher-node-production-rke2-mgmt1 successful 2025-01-09T15:00:00Z
```
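For reference, removing a stale entry from the rke2-etcd-snapshots configmap can be done with a JSON patch. The exact commands used here are not recorded, and the key below is only illustrative (taken from the failed rows above):

```bash
# Sketch: drop one data key from the rke2-etcd-snapshots configmap on the
# downstream cluster (run against the archive-production-rke2 context).
kubectl --context archive-production-rke2 -n kube-system patch cm rke2-etcd-snapshots \
  --type=json \
  -p='[{"op":"remove","path":"/data/etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803"}]'
```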
- Guillaume Samson
On test-staging-rke2, the snapshots are still corrupted:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 rancher-node-test-rke2-mgmt1
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 rancher-node-test-rke2-mgmt3
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 rancher-node-test-rke2-mgmt3
~ ᐅ date -d @1736469903
Fri Jan 10 01:45:03 AM CET 2025
~ ᐅ date -d @1736455502
Thu Jan 9 09:45:02 PM CET 2025
~ ᐅ date -d @1736487904
Fri Jan 10 06:45:04 AM CET 2025
```
The corrupted snapshots are not in the rke2-etcd-snapshots configmap:

```
~ ᐅ kbt get cm -n kube-system rke2-etcd-snapshots \
    -o go-template='{{range $snap,$snap_value:=.data}}{{printf "%s\n" $snap}}{{end}}' | \
    grep "1736487904\|1736455502\|1736469903"
~ ᐅ
```
Here are the snapshots on node rancher-node-test-rke2-mgmt1:

```
root@rancher-node-test-rke2-mgmt1:~# rke2 etcd-snapshot ls 2> /dev/null | awk '/mgmt1-(1736487904|1736455502|1736469903)/'
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 48472096 2025-01-09T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 48472096 2025-01-09T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 48472096 2025-01-10T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 48472096 2025-01-10T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 48472096 2025-01-10T05:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 48472096 2025-01-10T05:45:04Z
```
I deleted the corrupted snapshots:
```
root@rancher-node-test-rke2-mgmt1:~# rke2 etcd-snapshot delete $(rke2 etcd-snapshot ls 2> /dev/null | \
    awk '/mgmt1-(1736487904|1736455502|1736469903)/{print $1}' | \
    uniq | while read -r snap;do printf "%s " "$snap";done)
WARN[0000] Unknown flag --agent-token found in config.yaml, skipping
WARN[0000] Unknown flag --cni found in config.yaml, skipping
WARN[0000] Unknown flag --disable found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-extra-mount found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-extra-mount found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
WARN[0000] Unknown flag --node-taint found in config.yaml, skipping
WARN[0000] Unknown flag --node-taint found in config.yaml, skipping
WARN[0000] Unknown flag --private-registry found in config.yaml, skipping
WARN[0000] Unknown flag --snapshotter found in config.yaml, skipping
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 deleted.
```
There are no more corrupted snapshots in the Rancher cluster:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      {printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736282702 rancher-node-test-rke2-mgmt1 successful 2025-01-07T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736282704 s3 successful 2025-01-07T20:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736297102 s3 successful 2025-01-08T00:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736297103 s3 successful 2025-01-08T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736297105 rancher-node-test-rke2-mgmt1 successful 2025-01-08T00:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736264704 s3 successful 2025-01-07T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736333102 s3 successful 2025-01-08T10:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736315101 s3 successful 2025-01-08T05:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736264705 s3 successful 2025-01-07T15:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736282702 s3 successful 2025-01-07T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736333104 s3 successful 2025-01-08T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736315102 s3 successful 2025-01-08T05:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736264703 s3 successful 2025-01-07T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736264703 rancher-node-test-rke2-mgmt1 successful 2025-01-07T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736282701 s3 successful 2025-01-07T20:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736315101 rancher-node-test-rke2-mgmt1 successful 2025-01-08T05:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736297105 s3 successful 2025-01-08T00:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736315103 s3 successful 2025-01-08T05:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736333104 rancher-node-test-rke2-mgmt1 successful 2025-01-08T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736333101 s3 successful 2025-01-08T10:45:01Z
```
The configmap and leases seem to be fine:

```
~ ᐅ for context in $(kubectx | awk '/test-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
I hope the next snapshot will be ok.
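Rather than waiting for the next scheduled run, an on-demand snapshot could be taken on the node to check that the cleanup worked; a sketch only, run as root like the commands above:

```bash
# Trigger a manual snapshot and list the result (stderr silenced as above to
# hide the config.yaml flag warnings).
rke2 etcd-snapshot save 2> /dev/null
rke2 etcd-snapshot ls 2> /dev/null | tail -n 5
```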
- Guillaume Samson
I carried out the same checks and actions on archive-staging-rke2 as on test-staging-rke2. I'm afraid the next snapshot will fail again. If so, I will delete all snapshots.
- Guillaume Samson
In the archive-staging-rke2 and test-staging-rke2 clusters I deleted all local snapshots.
- Guillaume Samson
In archive-production-rke2 I realigned the leases and configmaps after the control-plane upgrade:
- from:
```
ᐅ for context in $(kubectx | awk '/archive-production-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt3"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt3"
    Lease rke2-etcd rancher-node-production-rke2-mgmt2
```
- to:
```
ᐅ for context in $(kubectx | awk '/archive-production-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
```
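The realignment commands themselves are not shown in the thread; below is a minimal sketch of how the Lease side could be pointed back at the etcd leader. The configmap keeps its holder inside the control-plane.alpha.kubernetes.io/leader annotation (a JSON string), so it is easiest to adjust with `kubectl edit`.

```bash
# Sketch only: set the rke2-etcd Lease holder to the current etcd leader.
kubectl --context archive-production-rke2 -n kube-system patch lease rke2-etcd \
  --type=merge -p '{"spec":{"holderIdentity":"rancher-node-production-rke2-mgmt1"}}'
# Repeat for the "rke2" Lease, and edit the leader annotation on the matching configmaps.
```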
Edited by Guillaume Samson
- Guillaume Samson
In the archive-production-rke2, archive-staging-rke2 and test-staging-rke2 clusters, after disabling s3 snapshots:
  - realign and check etcd leaders, configmaps and leases;
```
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
```
  - remove all s3 snapshots from the rke2-etcd-snapshots configmap, as there are no more snapshots in MinIO;

```
ᐅ mc ls -r rke2-etcd/backup-rke2-etcd/ | grep -v ".metadata"
[2025-01-13 11:15:11 CET] 100MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736763304
[2025-01-13 16:15:08 CET] 100MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736781301
[2025-01-13 21:15:10 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736799301
[2025-01-14 01:15:11 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736813704
[2025-01-14 06:15:12 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736831705
[2025-01-13 11:15:08 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736763303
[2025-01-13 16:15:09 CET] 104MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736781302
[2025-01-13 21:15:09 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736799301
[2025-01-14 01:15:08 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736813702
[2025-01-14 06:15:09 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736831704
[2025-01-13 11:15:08 CET] 102MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736763303
[2025-01-13 16:15:11 CET] 102MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736781304
[2025-01-13 21:15:11 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736799302
[2025-01-14 01:15:09 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736813702
[2025-01-14 06:15:11 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736831705
```
  - check the etcdsnapshots objects in the Rancher cluster.

```
## test-staging-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736815502 rancher-node-test-rke2-mgmt1 successful 2025-01-14T00:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736801101 rancher-node-test-rke2-mgmt1 successful 2025-01-13T20:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736833504 rancher-node-test-rke2-mgmt1 successful 2025-01-14T05:45:04Z

## archive-staging-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736800205 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T20:30:05Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736832602 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T05:30:02Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736782201 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T15:30:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736814602 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T00:30:02Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736764204 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T10:30:04Z

## archive-production-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736830804 rancher-node-production-rke2-mgmt1 successful 2025-01-14T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736798404 rancher-node-production-rke2-mgmt1 successful 2025-01-13T20:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736812800 rancher-node-production-rke2-mgmt1 successful 2025-01-14T00:00:00Z
```
If the noon snapshots are successful, I will enable s3 snapshots.
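For reference, s3 snapshots for a Rancher-provisioned RKE2 cluster are toggled through the cluster's spec.rkeConfig.etcd.s3 block. The sketch below assumes the provisioning.cattle.io/v1 API and is not necessarily how it was done here (the Rancher UI edits the same field):

```bash
# Sketch: disable s3 snapshots by clearing the s3 block; re-enabling means
# restoring the endpoint, bucket and cloudCredentialName fields.
kbl -n fleet-default patch clusters.provisioning.cattle.io archive-staging-rke2 \
  --type=merge -p '{"spec":{"rkeConfig":{"etcd":{"s3":null}}}}'
```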
- Guillaume Samson
The last local snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/^2025-01-14T10/&&$1!~/admin/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736848805 rancher-node-production-rke2-mgmt1 successful 2025-01-14T10:00:05Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736850600 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T10:30:00Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736851501 rancher-node-test-rke2-mgmt1 successful 2025-01-14T10:45:01Z
```
I will enable s3 snapshots.
- Guillaume Samson
The last snapshots were successful except in archive-staging-rke2:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-14T15/&&$1!~/admin/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736866801 s3 successful 2025-01-14T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736866801 rancher-node-production-rke2-mgmt1 successful 2025-01-14T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736866804 s3 successful 2025-01-14T15:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736866804 s3 successful 2025-01-14T15:00:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736869502 s3 successful 2025-01-14T15:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736869504 s3 successful 2025-01-14T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736869503 s3 successful 2025-01-14T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736869503 rancher-node-test-rke2-mgmt1 successful 2025-01-14T15:45:03Z
```
where the rke2-etcd lease has changed since this morning:

```
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt2
```
therefore the last etcdsnapshots status is 'Missing: true', as the node name is not the right one:

```
ᐅ kbl describe etcdsnapshots.rke.cattle.io -n fleet-default archive-staging-rke2-etcd-snapshot-rancher-node-staging-r-6735c
Name:         archive-staging-rke2-etcd-snapshot-rancher-node-staging-r-6735c
Namespace:    fleet-default
Labels:       rke.cattle.io/cluster-name=archive-staging-rke2
              rke.cattle.io/machine-id=7064d06c186885368ecd982d0609d833143d8c169cbd122d8b4c59ca6eb7301
Annotations:  etcdsnapshot.rke.io/snapshot-file-name: etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
              etcdsnapshot.rke.io/storage: local
API Version:  rke.cattle.io/v1
Kind:         ETCDSnapshot
Metadata:
  Creation Timestamp:  2025-01-14T16:10:19Z
  Generation:          1
  Owner References:
    API Version:           cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Machine
    Name:                  custom-adf8ed96fd1c
    UID:                   515e124d-49db-47e2-a063-53331c671590
  Resource Version:        908341232
  UID:                     bbce182f-9c59-4b90-b9d4-f1be193aca9c
Snapshot File:
  Location:   file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
  Name:       etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
  Node Name:  rancher-node-staging-rke2-mgmt2
Spec:
  Cluster Name:  archive-staging-rke2
Status:
  Missing:  true
Events:  <none>
```
- Guillaume Samson
In archive-staging-rke2:
  - delete all corrupted snapshots;
  - save local snapshots on nodes mgmt2 and mgmt3 (see the sketch after the output below);
  - delete all local snapshots;
  - delete s3 snapshots;
  - disable s3 snapshots;
  - delete all data: entries in the rke2-etcd-snapshots configmap;
  - check and realign configmaps and leases.
```
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
```
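The "save local snapshots" step is not detailed above; here is a sketch of what it might look like on a management node, assuming the default RKE2 snapshot directory and a dated backup folder like the ones cleaned up later in this thread:

```bash
# Copy the local snapshots aside before deleting them (the target path is an
# assumption; only the folder naming matches what is mentioned later).
mkdir -p /root/rke2_snapshots_20250115
cp -a /var/lib/rancher/rke2/server/db/snapshots/. /root/rke2_snapshots_20250115/
```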
- Guillaume Samson
The last archive-staging-rke2 snapshot was successful:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T10/&&$1~/staging/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736937004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T10:30:04Z
```
All the last snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T10/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736935203 s3 successful 2025-01-15T10:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736935201 rancher-node-production-rke2-mgmt1 successful 2025-01-15T10:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736935203 s3 successful 2025-01-15T10:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736935201 s3 successful 2025-01-15T10:00:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736937004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T10:30:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736936105 s3 successful 2025-01-15T10:15:05Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736936102 rancher-node-admin-rke2-mgmt1 successful 2025-01-15T10:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736936105 s3 successful 2025-01-15T10:15:05Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736936102 s3 successful 2025-01-15T10:15:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736937904 rancher-node-test-rke2-mgmt1 successful 2025-01-15T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736937904 s3 successful 2025-01-15T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736937902 s3 successful 2025-01-15T10:45:02Z
```
I will enable the s3 backups for archive-staging-rke2.
- Guillaume Samson
On all management nodes, I cleaned up the backup folders rke2_snapshots_20250113 and rke2_snapshots_20250115.
- Guillaume Samson
All the last local and s3 snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200 rancher-node-production-rke2-mgmt1 successful 2025-01-15T15:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736953201 s3 successful 2025-01-15T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736953204 s3 successful 2025-01-15T15:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200 s3 successful 2025-01-15T15:00:00Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004 s3 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt2-1736955001 s3 successful 2025-01-15T15:30:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt3-1736955004 s3 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736954102 s3 successful 2025-01-15T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104 rancher-node-admin-rke2-mgmt1 successful 2025-01-15T15:15:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104 s3 successful 2025-01-15T15:15:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736954102 s3 successful 2025-01-15T15:15:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903 s3 successful 2025-01-15T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736955904 s3 successful 2025-01-15T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736955901 s3 successful 2025-01-15T15:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903 rancher-node-test-rke2-mgmt1 successful 2025-01-15T15:45:03Z
```
```
ᐅ mc ls -r rke2-etcd/backup-rke2-etcd/ | awk '$6!~/\.metadata/&&$1~/2025-01-15/&&$2~/^16/'
[2025-01-15 16:00:06 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200
[2025-01-15 16:00:08 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt2-1736953201
[2025-01-15 16:00:11 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt3-1736953204
[2025-01-15 16:30:08 CET] 74MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004
[2025-01-15 16:30:03 CET] 73MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt2-1736955001
[2025-01-15 16:30:07 CET] 74MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt3-1736955004
[2025-01-15 16:15:09 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104
[2025-01-15 16:15:06 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736954102
[2025-01-15 16:15:06 CET] 104MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736954102
[2025-01-15 16:45:04 CET] 38MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903
[2025-01-15 16:45:05 CET] 38MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt2-1736955904
[2025-01-15 16:45:03 CET] 37MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt3-1736955901
```
Here are the etcd leaders with the corresponding holder identity leases and configmaps:

```
ᐅ for context in $(kubectx | awk '/-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
---
Etcd leader in cluster cluster-admin-rke2
    rancher-node-admin-rke2-mgmt3
Leases and configmaps in cluster cluster-admin-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2 rancher-node-admin-rke2-mgmt2
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2-etcd rancher-node-admin-rke2-mgmt2
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
Note the mismatch between the etcd leader and the holder identity objects in cluster-admin-rke2. I didn't touch it because the backup works fine.
I am delighted to close this issue.
- Guillaume Samson closed