Test the impact of a down management node on a Kubernetes cluster
During the crash of mucem, some pods with readiness probes enabled (graphql, webapp, ...) went down:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  Unhealthy    140m (x2 over 140m)  kubelet  Liveness probe failed: Get "http://10.42.239.210:5013/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  FailedMount  42m (x11 over 141m)  kubelet  MountVolume.SetUp failed for volume "config" : failed to sync configmap cache: timed out waiting for the condition
  Warning  Failed       41m (x3 over 42m)    kubelet  Error: failed to sync secret cache: timed out waiting for the condition
  Normal   Pulled       41m (x3 over 42m)    kubelet  Container image "container-registry.softwareheritage.org/swh/infra/swh-apps/graphql:20231115.1" already present on machine
  Normal   Created      41m                  kubelet  Created container graphql
  Normal   Started      41m                  kubelet  Started container graphql
  Warning  Unhealthy    41m (x2 over 41m)    kubelet  Startup probe failed: Get "http://10.42.239.210:5013/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
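For reference, the failing startup and liveness probes correspond to HTTP GET checks on the pod's port 5013. A minimal sketch of what such a probe block looks like in the pod spec (paths, thresholds, and timeouts here are illustrative assumptions, not the actual values from the graphql chart):

```yaml
# Hypothetical excerpt from the graphql deployment; values are illustrative.
containers:
  - name: graphql
    image: container-registry.softwareheritage.org/swh/infra/swh-apps/graphql:20231115.1
    ports:
      - containerPort: 5013
    startupProbe:
      httpGet:
        path: /
        port: 5013
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /
        port: 5013
      # "Client.Timeout exceeded while awaiting headers" means this deadline was hit
      timeoutSeconds: 1
      periodSeconds: 10
```

Note that a liveness probe failure restarts the container, which matches the Created/Started events above.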
It seems that a down management node can impact the readiness probes.
If this is confirmed, we probably need to consider running an etcd cluster, at least for production, to ensure the cluster's reliability.
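As a rough sketch of what that could look like, assuming a k3s-based cluster (an assumption about our setup), k3s supports an embedded etcd HA mode that requires at least three server nodes so etcd keeps quorum when one goes down:

```shell
# Sketch only: k3s HA with embedded etcd (assumes k3s; adapt for RKE2/kubeadm).
# On the first server node:
k3s server --cluster-init

# On two additional server nodes (FIRST_SERVER and TOKEN are placeholders):
k3s server --server https://FIRST_SERVER:6443 --token TOKEN
```

With three servers, losing a single management node like mucem would leave a two-member etcd majority, so the API server and probe-related control-plane operations (configmap/secret sync, etc.) should keep working.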