Rancher has been unstable since at least 2023-01-09, which makes interaction with the internal (rocquencourt) cluster complicated and breaks the ArgoCD updates.
❯ http --header https://rancher.euwest.azure.internal.softwareheritage.org/
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 190
Content-Type: text/html
Date: Tue, 10 Jan 2023 10:30:36 GMT
Strict-Transport-Security: max-age=15724800; includeSubDomains
It seems the ingress controller is evicted from time to time:
default         61m   Normal    NodeHasInsufficientMemory   node/aks-default-36212332-vmss00000e           Node aks-default-36212332-vmss00000e status is now: NodeHasInsufficientMemory
ingress-nginx   60m   Warning   Evicted                     pod/ingress-nginx-controller-f8b6887b6-vt8xx   The node was low on resource: memory. Container controller was using 3396472Ki, which exceeds its request of 90Mi.
ingress-nginx   60m   Normal    Killing                     pod/ingress-nginx-controller-f8b6887b6-vt8xx   Stopping container controller
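For the record, the listing above comes from the cluster events; something along these lines should reproduce it (the grep filter is just an assumption about which reasons are relevant):

❯ kubectl get events -A --sort-by=.lastTimestamp | grep -E 'Evicted|NodeHasInsufficientMemory'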
The eviction itself seems completely legitimate: the controller started consuming ~3.3Gi instead of its 90Mi request, which is a lot of memory and looks weird.
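As a possible mitigation sketch (assuming the deployment is named ingress-nginx-controller, deduced from the evicted pod name, and carries the standard chart labels), the actual consumption can be checked with metrics-server and the memory request/limit raised so the scheduler accounts for the real usage:

❯ kubectl -n ingress-nginx top pod -l app.kubernetes.io/name=ingress-nginx
❯ kubectl -n ingress-nginx set resources deployment/ingress-nginx-controller \
    --requests=memory=512Mi --limits=memory=1Gi

Note that if the chart is managed by helm/argocd, the equivalent change should rather go through the chart values (controller.resources), otherwise it will be reverted on the next sync.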
The node itself then started failing (kubelet and container runtime reported down):
monitoring       0s   Warning   Unhealthy                pod/rancher-prometheus-prometheus-node-exporter-7v7nr        Readiness probe failed: Get "http://10.240.0.7:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
default          0s   Warning   ContainerRuntimeIsDown   node/aks-default-36212332-vmss00000e                         Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_runtime.s
default          0s   Warning   KubeletIsDown            node/aks-default-36212332-vmss00000e                         Timeout when running plugin "/etc/node-problem-detector.d/plugin/check_kubelet.s
cert-manager     0s   Warning   Unhealthy                pod/rancher-certmanager-cert-manager-webhook-8fbbf9d4-h4q9m  Readiness probe failed: Get "http://10.244.0.15:6080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
tigera-operator  0s   Warning   BackOff                  pod/tigera-operator-66b9bfd96c-h5sxr                         Back-off restarting failed container
tigera-operator  5s   Normal    LeaderElection           configmap/operator-lock                                      aks-default-36212332-vmss00000E_e4ece5a6-5d68-4924-9560-da7f62b7b6d0 became leader
tigera-operator  5s   Normal    LeaderElection           lease/operator-lock                                          aks-default-36212332-vmss00000E_e4ece5a6-5d68-4924-9560-da7f62b7b6d0 became leader
calico-system    0s   Warning   Unhealthy                pod/calico-node-ngkss                                        Readiness probe failed: command "/bin/calico-node -felix-ready" timed out
monitoring       0s   Warning   Unhealthy                pod/rancher-prometheus-prometheus-node-exporter-7v7nr        Liveness probe failed: Get "http://10.240.0.7:9100/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
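Once the kubelet answers again, the node conditions can be double-checked (node name taken from the events above):

❯ kubectl get nodes -o wide
❯ kubectl describe node aks-default-36212332-vmss00000e | grep -A8 'Conditions:'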
I forced a node image upgrade through the Azure interface:
Current version: AKSUbuntu-1804gen2containerd-2022.10.24
Latest version: AKSUbuntu-1804gen2containerd-2022.12.15
This should ensure the nodes are completely rotated and reinstalled with the latest OS image.
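For reference, the same node image rotation can be triggered from the CLI; the resource group, cluster and node pool names below are assumptions (the pool name is guessed from the aks-default-* node names):

❯ az aks nodepool upgrade \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name default \
    --node-image-only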