Rancher's etcd backups failed
During the Debian upgrade of the Kubernetes clusters [1], the control-plane nodes were rebooted. As usual in this situation, the etcd snapshots failed.
[1] #5415 (closed)
Activity
- Guillaume Samson added activity::MRO label
- Guillaume Samson assigned to @guillaume
- Antoine R. Dumont changed the description
- Guillaume Samson
In test-staging-rke2, here are the failed snapshots:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736383505 rancher-node-test-rke2-mgmt1
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736369100 rancher-node-test-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736351102 rancher-node-test-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736401502 rancher-node-test-rke2-mgmt2
```
Edited by Guillaume Samson
- Guillaume Samson
In test-staging-rke2:
  - delete corrupted snapshots (see the sketch below);
  - check the rke2-etcd-snapshots configmap;
  - realign leases and configmaps.

```
ᐅ for context in $(kubectx | awk '/test-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
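The exact commands used for the "delete corrupted snapshots" step are not recorded in this comment; below is a minimal sketch of one way to find the corrupted ETCDSnapshot objects on the Rancher side, assuming `kbl` is a kubectl alias pointing at the Rancher local cluster as in the listings above.

```bash
# Sketch only: list the ETCDSnapshot object names whose snapshotFile status is
# not "successful" (the object names differ from the snapshot file names),
# then delete them one by one.
kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
  -l rke.cattle.io/cluster-name=test-staging-rke2 \
  -o jsonpath='{range .items[*]}{.metadata.name} {.snapshotFile.status}{"\n"}{end}' | \
  awk '$2!="successful"{print $1}'
# kbl delete etcdsnapshots.rke.cattle.io -n fleet-default <object-name>
```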
Edited by Guillaume Samson
- Guillaume Samson
In archive-staging-rke2, here are the failed snapshots:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736382602 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736368202 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736350201 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736400603 rancher-node-staging-rke2-mgmt3
```
- Guillaume Samson
In archive-staging-rke2:
  - delete corrupted snapshots;
  - check the rke2-etcd-snapshots configmap;
  - realign leases and configmaps.

```
ᐅ for context in $(kubectx | awk '/archive-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
```
- Guillaume Samson (thread resolved)
In archive-production-rke2, the last s3 snapshots failed because MinIO was unreachable during the cluster-admin-rke2 upgrade:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/2025-01-09T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803 s3 failed 2025-01-09T15:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736434802 s3 failed 2025-01-09T15:00:02Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 s3 failed 2025-01-09T15:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 rancher-node-production-rke2-mgmt1 successful 2025-01-09T15:00:00Z
```
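Since these failures were caused by MinIO being unreachable, the bucket can be probed directly before the next scheduled run. This is only a sketch, assuming the same `rke2-etcd` mc alias used later in this issue:

```bash
# List the most recent objects for this cluster; an immediate error here means
# the S3 endpoint is still unreachable.
mc ls rke2-etcd/backup-rke2-etcd/archive-production-rke2/ | tail -n 5
```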
- Guillaume Samson (thread resolved)
In the archive-staging-rke2 and test-staging-rke2 clusters, the last snapshots still failed:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736436604 rancher-node-staging-rke2-mgmt3
~ ᐅ date -d @1736436604
Thu Jan 9 04:30:04 PM CET 2025
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736437504 rancher-node-test-rke2-mgmt2
~ ᐅ date -d @1736437504
Thu Jan 9 04:45:04 PM CET 2025
```
- Guillaume Samson
Oddly, everything works fine on cluster-admin-rke2:

```
~ ᐅ for context in $(kubectx | awk '/cluster-admin-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster cluster-admin-rke2
    rancher-node-admin-rke2-mgmt3
Leases and configmaps in cluster cluster-admin-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2 rancher-node-admin-rke2-mgmt2
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2-etcd rancher-node-admin-rke2-mgmt2
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=cluster-admin-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/2025-01-09T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736435703 s3 successful 2025-01-09T15:15:03Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736435702 s3 successful 2025-01-09T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736435702 s3 successful 2025-01-09T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736435703 rancher-node-admin-rke2-mgmt1 successful 2025-01-09T15:15:03Z
```
- Guillaume Samson
On archive-production-rke2, the failed snapshots I deleted yesterday are back:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3~/failed/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803 s3 failed 2025-01-09T15:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736434802 s3 failed 2025-01-09T15:00:02Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 s3 failed 2025-01-09T15:00:00Z
```

I removed these snapshot references from the rke2-etcd-snapshots configmap on the archive-production-rke2 cluster, without deleting the etcdsnapshots in the Rancher cluster. It seems to be fine:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=archive-production-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      {printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736485200 rancher-node-production-rke2-mgmt1 successful 2025-01-10T05:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736467203 rancher-node-production-rke2-mgmt1 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736485204 s3 successful 2025-01-10T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736416804 s3 successful 2025-01-09T10:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736452801 s3 successful 2025-01-09T20:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736416805 rancher-node-production-rke2-mgmt1 successful 2025-01-09T10:00:05Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736398803 s3 successful 2025-01-09T05:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736467204 s3 successful 2025-01-10T00:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736452803 s3 successful 2025-01-09T20:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736416801 s3 successful 2025-01-09T10:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736452803 rancher-node-production-rke2-mgmt1 successful 2025-01-09T20:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736452804 s3 successful 2025-01-09T20:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736398804 s3 successful 2025-01-09T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736467203 s3 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736416805 s3 successful 2025-01-09T10:00:05Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736398803 s3 successful 2025-01-09T05:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736485200 s3 successful 2025-01-10T05:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736467203 s3 successful 2025-01-10T00:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736485204 s3 successful 2025-01-10T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736434800 rancher-node-production-rke2-mgmt1 successful 2025-01-09T15:00:00Z
```
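For reference, removing a stale entry from the rke2-etcd-snapshots configmap can be done with a JSON patch. The exact commands used here are not recorded, and the key below is only illustrative (taken from the failed rows above):

```bash
# Sketch: drop one data key from the rke2-etcd-snapshots configmap on the
# downstream cluster (run against the archive-production-rke2 context).
kubectl --context archive-production-rke2 -n kube-system patch cm rke2-etcd-snapshots \
  --type=json \
  -p='[{"op":"remove","path":"/data/etcd-snapshot-rancher-node-production-rke2-mgmt3-1736434803"}]'
```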
- Guillaume Samson
On test-staging-rke2, the snapshots are still corrupted:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 rancher-node-test-rke2-mgmt1
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 rancher-node-test-rke2-mgmt3
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 rancher-node-test-rke2-mgmt3
~ ᐅ date -d @1736469903
Fri Jan 10 01:45:03 AM CET 2025
~ ᐅ date -d @1736455502
Thu Jan 9 09:45:02 PM CET 2025
~ ᐅ date -d @1736487904
Fri Jan 10 06:45:04 AM CET 2025
```
The corrupted snapshots are not in the rke2-etcd-snapshots configmap:

```
~ ᐅ kbt get cm -n kube-system rke2-etcd-snapshots \
    -o go-template='{{range $snap,$snap_value:=.data}}{{printf "%s\n" $snap}}{{end}}' | \
    grep "1736487904\|1736455502\|1736469903"
~ ᐅ
```
Here are the snapshots on node rancher-node-test-rke2-mgmt1:

```
root@rancher-node-test-rke2-mgmt1:~# rke2 etcd-snapshot ls 2> /dev/null | awk '/mgmt1-(1736487904|1736455502|1736469903)/'
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 48472096 2025-01-09T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 48472096 2025-01-09T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 48472096 2025-01-10T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 48472096 2025-01-10T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 s3://backup-rke2-etcd/test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 48472096 2025-01-10T05:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 48472096 2025-01-10T05:45:04Z
```
I deleted the corrupted snapshots:
```
root@rancher-node-test-rke2-mgmt1:~# rke2 etcd-snapshot delete $(rke2 etcd-snapshot ls 2> /dev/null | \
    awk '/mgmt1-(1736487904|1736455502|1736469903)/{print $1}' | \
    uniq | while read -r snap;do printf "%s " "$snap";done)
WARN[0000] Unknown flag --agent-token found in config.yaml, skipping
WARN[0000] Unknown flag --cni found in config.yaml, skipping
WARN[0000] Unknown flag --disable found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-controller-manager-extra-mount found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kube-scheduler-extra-mount found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
WARN[0000] Unknown flag --node-taint found in config.yaml, skipping
WARN[0000] Unknown flag --node-taint found in config.yaml, skipping
WARN[0000] Unknown flag --private-registry found in config.yaml, skipping
WARN[0000] Unknown flag --snapshotter found in config.yaml, skipping
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736455502 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736469903 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 deleted.
INFO[0004] Snapshot etcd-snapshot-rancher-node-test-rke2-mgmt1-1736487904 deleted.
```
There are no more corrupted snapshots in the Rancher cluster:

```
~ ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -l rke.cattle.io/cluster-name=test-staging-rke2 \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      {printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736282702 rancher-node-test-rke2-mgmt1 successful 2025-01-07T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736282704 s3 successful 2025-01-07T20:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736297102 s3 successful 2025-01-08T00:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736297103 s3 successful 2025-01-08T00:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736297105 rancher-node-test-rke2-mgmt1 successful 2025-01-08T00:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736264704 s3 successful 2025-01-07T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736333102 s3 successful 2025-01-08T10:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736315101 s3 successful 2025-01-08T05:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736264705 s3 successful 2025-01-07T15:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736282702 s3 successful 2025-01-07T20:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736333104 s3 successful 2025-01-08T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736315102 s3 successful 2025-01-08T05:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736264703 s3 successful 2025-01-07T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736264703 rancher-node-test-rke2-mgmt1 successful 2025-01-07T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736282701 s3 successful 2025-01-07T20:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736315101 rancher-node-test-rke2-mgmt1 successful 2025-01-08T05:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736297105 s3 successful 2025-01-08T00:45:05Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736315103 s3 successful 2025-01-08T05:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736333104 rancher-node-test-rke2-mgmt1 successful 2025-01-08T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736333101 s3 successful 2025-01-08T10:45:01Z
```
The configmap and leases seem to be fine:

```
~ ᐅ for context in $(kubectx | awk '/test-staging-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
I hope the next snapshot will be ok.
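Rather than waiting for the next scheduled run, an on-demand snapshot could be taken on the node to check that the cleanup worked; a sketch only, run as root like the commands above:

```bash
# Trigger a manual snapshot and list the result (stderr silenced as above to
# hide the config.yaml flag warnings).
rke2 etcd-snapshot save 2> /dev/null
rke2 etcd-snapshot ls 2> /dev/null | tail -n 5
```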
- Guillaume Samson
I carried out the same checks and actions on archive-staging-rke2 as on test-staging-rke2. I'm afraid the next snapshot will fail again. If so, I will delete all snapshots.
- Guillaume Samson
In the archive-staging-rke2 and test-staging-rke2 clusters I deleted all local snapshots.
- Guillaume Samson
In archive-production-rke2 I realigned the leases and configmaps after the control-plane upgrade:
- from:
```
ᐅ for context in $(kubectx | awk '/archive-production-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt3"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt3"
    Lease rke2-etcd rancher-node-production-rke2-mgmt2
```
- to:
```
ᐅ for context in $(kubectx | awk '/archive-production-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
```
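The realignment commands themselves are not shown in the thread; below is a minimal sketch of how the Lease side could be pointed back at the etcd leader. The configmap keeps its holder inside the control-plane.alpha.kubernetes.io/leader annotation (a JSON string), so it is easiest to adjust with `kubectl edit`.

```bash
# Sketch only: set the rke2-etcd Lease holder to the current etcd leader.
kubectl --context archive-production-rke2 -n kube-system patch lease rke2-etcd \
  --type=merge -p '{"spec":{"holderIdentity":"rancher-node-production-rke2-mgmt1"}}'
# Repeat for the "rke2" Lease, and edit the leader annotation on the matching configmaps.
```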
Edited by Guillaume Samson
- Guillaume Samson
In the archive-production-rke2, archive-staging-rke2 and test-staging-rke2 clusters, after disabling s3 snapshots:
  - realign and check etcd leaders, configmaps and leases;
```
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
```
  - remove all s3 snapshots from the rke2-etcd-snapshots configmap, as there are no more snapshots in MinIO;

```
ᐅ mc ls -r rke2-etcd/backup-rke2-etcd/ | grep -v ".metadata"
[2025-01-13 11:15:11 CET] 100MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736763304
[2025-01-13 16:15:08 CET] 100MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736781301
[2025-01-13 21:15:10 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736799301
[2025-01-14 01:15:11 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736813704
[2025-01-14 06:15:12 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736831705
[2025-01-13 11:15:08 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736763303
[2025-01-13 16:15:09 CET] 104MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736781302
[2025-01-13 21:15:09 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736799301
[2025-01-14 01:15:08 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736813702
[2025-01-14 06:15:09 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736831704
[2025-01-13 11:15:08 CET] 102MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736763303
[2025-01-13 16:15:11 CET] 102MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736781304
[2025-01-13 21:15:11 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736799302
[2025-01-14 01:15:09 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736813702
[2025-01-14 06:15:11 CET] 103MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736831705
```
  - check the etcdsnapshots objects in the Rancher cluster.

```
## test-staging-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736815502 rancher-node-test-rke2-mgmt1 successful 2025-01-14T00:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736801101 rancher-node-test-rke2-mgmt1 successful 2025-01-13T20:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736833504 rancher-node-test-rke2-mgmt1 successful 2025-01-14T05:45:04Z

## archive-staging-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736800205 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T20:30:05Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736832602 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T05:30:02Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736782201 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T15:30:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736814602 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T00:30:02Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736764204 rancher-node-staging-rke2-mgmt1 successful 2025-01-13T10:30:04Z

## archive-production-rke2
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736830804 rancher-node-production-rke2-mgmt1 successful 2025-01-14T05:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736798404 rancher-node-production-rke2-mgmt1 successful 2025-01-13T20:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736812800 rancher-node-production-rke2-mgmt1 successful 2025-01-14T00:00:00Z
```
If the noon snapshots are successful, I will enable s3 snapshots.
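For reference, s3 snapshots for a Rancher-provisioned RKE2 cluster are toggled through the cluster's spec.rkeConfig.etcd.s3 block. The sketch below assumes the provisioning.cattle.io/v1 API and is not necessarily how it was done here (the Rancher UI edits the same field):

```bash
# Sketch: disable s3 snapshots by clearing the s3 block; re-enabling means
# restoring the endpoint, bucket and cloudCredentialName fields.
kbl -n fleet-default patch clusters.provisioning.cattle.io archive-staging-rke2 \
  --type=merge -p '{"spec":{"rkeConfig":{"etcd":{"s3":null}}}}'
```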
- Guillaume Samson
The last local snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $4~/^2025-01-14T10/&&$1!~/admin/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736848805 rancher-node-production-rke2-mgmt1 successful 2025-01-14T10:00:05Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736850600 rancher-node-staging-rke2-mgmt1 successful 2025-01-14T10:30:00Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736851501 rancher-node-test-rke2-mgmt1 successful 2025-01-14T10:45:01Z
```
I will enable s3 snapshots.
- Guillaume Samson
The last snapshots were successful except in archive-staging-rke2:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-14T15/&&$1!~/admin/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736866801 s3 successful 2025-01-14T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736866801 rancher-node-production-rke2-mgmt1 successful 2025-01-14T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736866804 s3 successful 2025-01-14T15:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736866804 s3 successful 2025-01-14T15:00:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604 rancher-node-staging-rke2-mgmt2
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736869502 s3 successful 2025-01-14T15:45:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736869504 s3 successful 2025-01-14T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736869503 s3 successful 2025-01-14T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736869503 rancher-node-test-rke2-mgmt1 successful 2025-01-14T15:45:03Z
```
where the rke2-etcd lease has changed since this morning:

```
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt2
```
therefore the last etcdsnapshots status is 'Missing: true', as the node name is not the right one:

```
ᐅ kbl describe etcdsnapshots.rke.cattle.io -n fleet-default archive-staging-rke2-etcd-snapshot-rancher-node-staging-r-6735c
Name:         archive-staging-rke2-etcd-snapshot-rancher-node-staging-r-6735c
Namespace:    fleet-default
Labels:       rke.cattle.io/cluster-name=archive-staging-rke2
              rke.cattle.io/machine-id=7064d06c186885368ecd982d0609d833143d8c169cbd122d8b4c59ca6eb7301
Annotations:  etcdsnapshot.rke.io/snapshot-file-name: etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
              etcdsnapshot.rke.io/storage: local
API Version:  rke.cattle.io/v1
Kind:         ETCDSnapshot
Metadata:
  Creation Timestamp:  2025-01-14T16:10:19Z
  Generation:          1
  Owner References:
    API Version:           cluster.x-k8s.io/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Machine
    Name:                  custom-adf8ed96fd1c
    UID:                   515e124d-49db-47e2-a063-53331c671590
  Resource Version:        908341232
  UID:                     bbce182f-9c59-4b90-b9d4-f1be193aca9c
Snapshot File:
  Location:   file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
  Name:       etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736868604
  Node Name:  rancher-node-staging-rke2-mgmt2
Spec:
  Cluster Name:  archive-staging-rke2
Status:
  Missing:  true
Events:  <none>
```
- Guillaume Samson
In archive-staging-rke2:
  - delete all corrupted snapshots;
  - save local snapshots on nodes mgmt2 and mgmt3 (see the sketch after the output below);
  - delete all local snapshots;
  - delete s3 snapshots;
  - disable s3 snapshots;
  - delete all data: entries in the rke2-etcd-snapshots configmap;
  - check and realign configmaps and leases.
```
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
```
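The "save local snapshots" step is not detailed above; here is a sketch of what it might look like on a management node, assuming the default RKE2 snapshot directory and a dated backup folder like the ones cleaned up later in this thread:

```bash
# Copy the local snapshots aside before deleting them (the target path is an
# assumption; only the folder naming matches what is mentioned later).
mkdir -p /root/rke2_snapshots_20250115
cp -a /var/lib/rancher/rke2/server/db/snapshots/. /root/rke2_snapshots_20250115/
```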
- Guillaume Samson
The last archive-staging-rke2 snapshot was successful:

```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T10/&&$1~/staging/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736937004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T10:30:04Z
```
All the last snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T10/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736935203 s3 successful 2025-01-15T10:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736935201 rancher-node-production-rke2-mgmt1 successful 2025-01-15T10:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736935203 s3 successful 2025-01-15T10:00:03Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736935201 s3 successful 2025-01-15T10:00:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736937004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T10:30:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736936105 s3 successful 2025-01-15T10:15:05Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736936102 rancher-node-admin-rke2-mgmt1 successful 2025-01-15T10:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736936105 s3 successful 2025-01-15T10:15:05Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736936102 s3 successful 2025-01-15T10:15:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736937904 rancher-node-test-rke2-mgmt1 successful 2025-01-15T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736937904 s3 successful 2025-01-15T10:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736937902 s3 successful 2025-01-15T10:45:02Z
```
I will enable the s3 backups for archive-staging-rke2.
- Guillaume Samson
On all management nodes, I cleaned up the backup folders rke2_snapshots_20250113 and rke2_snapshots_20250115.
- Guillaume Samson
All the last local and s3 snapshots were successful:
```
ᐅ kbl get etcdsnapshots.rke.cattle.io -n fleet-default \
    -o jsonpath='{range .items[*]}{.snapshotFile.name} {.snapshotFile.nodeName} {.snapshotFile.status} {.snapshotFile.createdAt}{"\n"}{end}' | \
    awk 'BEGIN{format="%-60s %-35s %-20s %s\n"
      printf format,"Snapshotfile Name","Snapshotfile Node Name","Snapshotfile Status","Created at"
      printf format,"---","---","---","---"}
      $3!="successful"||$4~/^2025-01-15T15/{printf format,$1,$2,$3,$4}'
Snapshotfile Name Snapshotfile Node Name Snapshotfile Status Created at
--- --- --- ---
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200 rancher-node-production-rke2-mgmt1 successful 2025-01-15T15:00:00Z
etcd-snapshot-rancher-node-production-rke2-mgmt2-1736953201 s3 successful 2025-01-15T15:00:01Z
etcd-snapshot-rancher-node-production-rke2-mgmt3-1736953204 s3 successful 2025-01-15T15:00:04Z
etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200 s3 successful 2025-01-15T15:00:00Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004 s3 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt2-1736955001 s3 successful 2025-01-15T15:30:01Z
etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004 rancher-node-staging-rke2-mgmt1 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-staging-rke2-mgmt3-1736955004 s3 successful 2025-01-15T15:30:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736954102 s3 successful 2025-01-15T15:15:02Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104 rancher-node-admin-rke2-mgmt1 successful 2025-01-15T15:15:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104 s3 successful 2025-01-15T15:15:04Z
etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736954102 s3 successful 2025-01-15T15:15:02Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903 s3 successful 2025-01-15T15:45:03Z
etcd-snapshot-rancher-node-test-rke2-mgmt2-1736955904 s3 successful 2025-01-15T15:45:04Z
etcd-snapshot-rancher-node-test-rke2-mgmt3-1736955901 s3 successful 2025-01-15T15:45:01Z
etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903 rancher-node-test-rke2-mgmt1 successful 2025-01-15T15:45:03Z
```
```
ᐅ mc ls -r rke2-etcd/backup-rke2-etcd/ | awk '$6!~/\.metadata/&&$1~/2025-01-15/&&$2~/^16/'
[2025-01-15 16:00:06 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt1-1736953200
[2025-01-15 16:00:08 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt2-1736953201
[2025-01-15 16:00:11 CET] 134MiB STANDARD archive-production-rke2/etcd-snapshot-rancher-node-production-rke2-mgmt3-1736953204
[2025-01-15 16:30:08 CET] 74MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt1-1736955004
[2025-01-15 16:30:03 CET] 73MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt2-1736955001
[2025-01-15 16:30:07 CET] 74MiB STANDARD archive-staging-rke2/etcd-snapshot-rancher-node-staging-rke2-mgmt3-1736955004
[2025-01-15 16:15:09 CET] 105MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt1-1736954104
[2025-01-15 16:15:06 CET] 109MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt2-1736954102
[2025-01-15 16:15:06 CET] 104MiB STANDARD cluster-admin-rke2/etcd-snapshot-rancher-node-admin-rke2-mgmt3-1736954102
[2025-01-15 16:45:04 CET] 38MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt1-1736955903
[2025-01-15 16:45:05 CET] 38MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt2-1736955904
[2025-01-15 16:45:03 CET] 37MiB STANDARD test-staging-rke2/etcd-snapshot-rancher-node-test-rke2-mgmt3-1736955901
```
Here are the etcd leaders with the corresponding holder identity leases and configmaps:

```
ᐅ for context in $(kubectx | awk '/-rke2/');do
    echo -e "---\nEtcd leader in cluster $context"
    kubectl --context "$context" exec $(kubectl --context "$context" get po -n kube-system -l component=etcd --no-headers -o jsonpath='{range .items[0]}{.metadata.name}{end}') -n kube-system \
      -- etcdctl --cacert='/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt' \
      --cert='/var/lib/rancher/rke2/server/tls/etcd/server-client.crt' \
      --key='/var/lib/rancher/rke2/server/tls/etcd/server-client.key' \
      endpoint status --cluster | awk '/true/{split($1,a,":");print substr(a[2],3)}' | \
      xargs -I{} dig -x {} +short | awk -F '.' '{printf "\t%s\n",$1}'
    echo "Leases and configmaps in cluster $context"
    for name in rke2 rke2-etcd;do
      kubectl --context "$context" get cm -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | \
        awk '{split($3,a,",");printf "\t%-10s %-10s %s\n",$1,$2,substr(a[1],2)}'
      kubectl --context "$context" get leases -n kube-system "$name" -o jsonpath='{.kind} {.metadata.name} {.spec.holderIdentity}' | \
        awk '{printf "\t%-10s %-10s %s\n",$1,$2,$3}'
    done
  done
---
Etcd leader in cluster archive-production-rke2
    rancher-node-production-rke2-mgmt1
Leases and configmaps in cluster archive-production-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2 rancher-node-production-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-production-rke2-mgmt1"
    Lease rke2-etcd rancher-node-production-rke2-mgmt1
---
Etcd leader in cluster archive-staging-rke2
    rancher-node-staging-rke2-mgmt1
Leases and configmaps in cluster archive-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2 rancher-node-staging-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-staging-rke2-mgmt1"
    Lease rke2-etcd rancher-node-staging-rke2-mgmt1
---
Etcd leader in cluster cluster-admin-rke2
    rancher-node-admin-rke2-mgmt3
Leases and configmaps in cluster cluster-admin-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2 rancher-node-admin-rke2-mgmt2
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-admin-rke2-mgmt2"
    Lease rke2-etcd rancher-node-admin-rke2-mgmt2
---
Etcd leader in cluster test-staging-rke2
    rancher-node-test-rke2-mgmt1
Leases and configmaps in cluster test-staging-rke2
    ConfigMap rke2 "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2 rancher-node-test-rke2-mgmt1
    ConfigMap rke2-etcd "holderIdentity":"rancher-node-test-rke2-mgmt1"
    Lease rke2-etcd rancher-node-test-rke2-mgmt1
```
Note the mismatch between the etcd leader and the holder identity objects in cluster-admin-rke2. I didn't touch it because the backup works fine.
I am delighted to close this issue.
- Guillaume Samson closed