Rancher has been unstable since at least 2023-01-09, which makes interaction with the internal (rocquencourt) cluster complicated and breaks the ArgoCD updates.
❯ http --header https://rancher.euwest.azure.internal.softwareheritage.org/
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 190
Content-Type: text/html
Date: Tue, 10 Jan 2023 10:30:36 GMT
Strict-Transport-Security: max-age=15724800; includeSubDomains
It seems the ingress controller is evicted from time to time:
default         61m   Normal    NodeHasInsufficientMemory   node/aks-default-36212332-vmss00000e           Node aks-default-36212332-vmss00000e status is now: NodeHasInsufficientMemory
ingress-nginx   60m   Warning   Evicted                     pod/ingress-nginx-controller-f8b6887b6-vt8xx   The node was low on resource: memory. Container controller was using 3396472Ki, which exceeds its request of 90Mi.
ingress-nginx   60m   Normal    Killing                     pod/ingress-nginx-controller-f8b6887b6-vt8xx   Stopping container controller
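For the record, the listing above comes from the cluster events; something along these lines should reproduce it (the grep filter is just an assumption about which reasons are relevant):

❯ kubectl get events -A --sort-by=.lastTimestamp | grep -E 'Evicted|NodeHasInsufficientMemory'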
The eviction itself seems completely legitimate: the controller started consuming ~3.3Gi instead of its 90Mi request, which is a lot of memory and looks weird.
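As a possible mitigation sketch (assuming the deployment is named ingress-nginx-controller, deduced from the evicted pod name, and carries the standard chart labels), the actual consumption can be checked with metrics-server and the memory request/limit raised so the scheduler accounts for the real usage:

❯ kubectl -n ingress-nginx top pod -l app.kubernetes.io/name=ingress-nginx
❯ kubectl -n ingress-nginx set resources deployment/ingress-nginx-controller \
    --requests=memory=512Mi --limits=memory=1Gi

Note that if the chart is managed by helm/argocd, the equivalent change should rather go through the chart values (controller.resources), otherwise it will be reverted on the next sync.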
The node itself then started failing (kubelet and container runtime reported down):
monitoring       0s   Warning   Unhealthy                pod/rancher-prometheus-prometheus-node-exporter-7v7nr        Readiness probe failed: Get "http://10.240.0.7:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
default          0s   Warning   ContainerRuntimeIsDown   node/aks-default-36212332-vmss00000e                         Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_runtime.s
default          0s   Warning   KubeletIsDown            node/aks-default-36212332-vmss00000e                         Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s
cert-manager     0s   Warning   Unhealthy                pod/rancher-certmanager-cert-manager-webhook-8fbbf9d4-h4q9m  Readiness probe failed: Get "http://10.244.0.15:6080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
tigera-operator  0s   Warning   BackOff                  pod/tigera-operator-66b9bfd96c-h5sxr                         Back-off restarting failed container
tigera-operator  5s   Normal    LeaderElection           configmap/operator-lock                                      aks-default-36212332-vmss00000E_e4ece5a6-5d68-4924-9560-da7f62b7b6d0 became leader
tigera-operator  5s   Normal    LeaderElection           lease/operator-lock                                          aks-default-36212332-vmss00000E_e4ece5a6-5d68-4924-9560-da7f62b7b6d0 became leader
calico-system    0s   Warning   Unhealthy                pod/calico-node-ngkss                                        Readiness probe failed: command "/bin/calico-node -felix-ready" timed out
monitoring       0s   Warning   Unhealthy                pod/rancher-prometheus-prometheus-node-exporter-7v7nr        Liveness probe failed: Get "http://10.240.0.7:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
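Once the kubelet answers again, the node conditions can be double-checked (node name taken from the events above):

❯ kubectl get nodes -o wide
❯ kubectl describe node aks-default-36212332-vmss00000e | grep -A8 'Conditions:'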
I forced a node image upgrade through the Azure interface:
Current version: AKSUbuntu-1804gen2containerd-2022.10.24
Latest version: AKSUbuntu-1804gen2containerd-2022.12.15
This should ensure the nodes are completely rotated and reinstalled with the latest OS image.
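For reference, the same node image rotation can be triggered from the CLI; the resource group, cluster and node pool names below are assumptions (the pool name is guessed from the aks-default-* node names):

❯ az aks nodepool upgrade \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name default \
    --node-image-only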