[rancher] staging and production admin clusters crash regularly
This is the root cause of #4874 (closed).
It is not yet clear whether the crashes are caused by etcd timeouts or by a problem communicating with the Rancher manager.
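One way to test the etcd-timeout hypothesis would be to query etcd directly on a management node. A sketch, assuming a default rke2 install (the cert paths under `/var/lib/rancher/rke2/server/tls/etcd/` are the rke2 defaults; adjust if the data dir was relocated):

```shell
# Run on a management node as root; requires etcdctl on the PATH.
# Paths below assume the default rke2 data dir — verify locally.
export ETCDCTL_API=3
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --write-out=table

# High fsync/commit durations here would point at disk latency on the
# node itself rather than at communication with the Rancher manager.
curl -sk https://127.0.0.1:2379/metrics \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds_sum'
```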
Several errors appear on the management nodes:
May 12 08:04:49 rancher-node-staging-rke2-mgmt1 rke2[1655994]: {"level":"warn","ts":"2023-05-12T08:04:25.756Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00077dc00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
May 12 08:04:49 rancher-node-staging-rke2-mgmt1 rke2[1655994]: time="2023-05-12T08:04:25Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
May 12 08:04:49 rancher-node-staging-rke2-mgmt1 rke2[1655994]: {"level":"warn","ts":"2023-05-12T08:04:40.758Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00077dc00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
May 12 08:04:49 rancher-node-staging-rke2-mgmt1 rke2[1655994]: time="2023-05-12T08:04:40Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:20Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (301509225 vs 301509321)"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:20Z" level=error msg="error syncing 'fleet-default/custom-8e8eb25d9b24-machine-plan': handler secret-watch: secret received was too old, requeuing"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:25Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (301509321 vs 301509449)"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:25Z" level=error msg="error syncing 'fleet-default/custom-8e8eb25d9b24-machine-plan': handler secret-watch: secret received was too old, requeuing"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:31Z" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (301509449 vs 301509526)"
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rancher-system-agent[841]: time="2023-05-12T08:04:31Z" level=error msg="error syncing 'fleet-default/custom-8e8eb25d9b24-machine-plan': handler secret-watch: secret received was too old, requeuing"
...
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rke2[1655994]: E0512 08:04:49.994435 1655994 leaderelection.go:330] error retrieving resource lock kube-system/rke2: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/rke2": context deadline exceeded
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rke2[1655994]: I0512 08:04:49.996688 1655994 leaderelection.go:283] failed to renew lease kube-system/rke2: timed out waiting for the condition
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rke2[1655994]: E0512 08:04:49.996876 1655994 leaderelection.go:306] Failed to release lock: resource name may not be empty
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 rke2[1655994]: time="2023-05-12T08:04:49Z" level=fatal msg="leaderelection lost for rke2"
...
The unit run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-18f779c5dea2e6c0e3963489872a79c7ce195b76cccc431ab724fa6a9f969f91-rootfs.mount has successfully entered the 'dead' state.
May 12 08:04:50 rancher-node-staging-rke2-mgmt1 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
...
May 12 08:05:21 rancher-node-staging-rke2-mgmt1 rke2[1722765]: time="2023-05-12T08:05:21Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
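The fatal `leaderelection lost for rke2` follows a run of etcd-client `DeadlineExceeded` retries, so counting those retries per crash window may help correlate the crashes with etcd load. A quick log-scraping sketch (a hypothetical helper, not part of any existing tooling here), fed with `journalctl -u rke2-server` output:

```python
import re

# Matches the rke2 etcd-client retry lines shown above; the DeadlineExceeded
# errors are what precede the lost leader election and the fatal exit.
DEADLINE_RE = re.compile(r'"ts":"(?P<ts>[^"]+)".*DeadlineExceeded')

def deadline_error_timestamps(journal_lines):
    """Return the timestamps of etcd DeadlineExceeded retries in the log."""
    hits = []
    for line in journal_lines:
        m = DEADLINE_RE.search(line)
        if m:
            hits.append(m.group("ts"))
    return hits

# Two of the lines from the excerpt above, trimmed to the JSON payload:
sample = [
    '{"level":"warn","ts":"2023-05-12T08:04:25.756Z","logger":"etcd-client",'
    '"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}',
    '{"level":"warn","ts":"2023-05-12T08:04:40.758Z","logger":"etcd-client",'
    '"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}',
]
print(deadline_error_timestamps(sample))
# → ['2023-05-12T08:04:25.756Z', '2023-05-12T08:04:40.758Z']
```

Clusters of such timestamps just before each `rke2-server.service` exit would support the etcd-timeout explanation; their absence would point back at the Rancher manager connection.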