For about a week now, the etcd slowness has crept back up to the point where the Kube API server on all clusters is being regularly restarted, making a lot of things unstable.
Around that time, a few changes happened:
- ceph components on the proxmox ceph cluster were upgraded to 17.2.7 (from 17.2.6)
- the docker cache configuration was pushed at the rancher level and taken into account by all clusters
- the admin-rke2 management node was moved to the zfs snapshotter
- the staging management node (and others) was moved to the zfs snapshotter (a config sketch follows this list)
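The snapshotter switch itself is a single rke2 setting; a minimal sketch of what it looks like when set by hand in `/etc/rancher/rke2/config.yaml` (assumption: it may equally be pushed through Rancher's provisioning instead, and the containerd state directory has to live on a zfs dataset for the zfs snapshotter to work):

```
# /etc/rancher/rke2/config.yaml (sketch, on a management node)
snapshotter: zfs
```

followed by a restart of the rke2-server service for it to take effect.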
My gut feeling is that the issue is somehow caused by the ceph upgrade, which is going to be a pain to diagnose and fix.
etcd test command (run in the etcd pod):

```
etcdctl check perf \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --auto-compact --auto-defrag
```
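For the record, the same check can be launched from the node itself; a sketch assuming the usual rke2 layout where etcd runs as a static pod named `etcd-<node name>` in `kube-system` (the node name below is a placeholder):

```
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n kube-system exec etcd-rancher-node-test-rke2-mgmt1 -- \
  etcdctl check perf \
    --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    --auto-compact --auto-defrag
```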
output on test-staging-rke2:
```
Compacting with revision 119353594
Compacted with revision 119353594
Defragmenting "127.0.0.1:2379"
Defragmented "127.0.0.1:2379"
PASS: Throughput is 145 writes/s
Slowest request took too long: 1.479196s
Stddev too high: 0.169764s
FAIL
```
After moving the rancher storage of the rancher-node-test-rke2-mgmt1 node to uffizi's local storage:
```
Compacting with revision 119365112
Compacted with revision 119365112
Defragmenting "127.0.0.1:2379"
Defragmented "127.0.0.1:2379"
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.182145s
PASS: Stddev is 0.020533s
PASS
```
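To track this over time instead of spot-checking with `check perf`, the etcd disk latency histograms should show the same jitter; a sketch, assuming the rke2 etcd metrics are scraped by the same prometheus:

```
histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])))
```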
Checking the ceph metrics, there's pretty much no I/O on the ceph pools backing the kubernetes clusters, and certainly no pattern that tracks the "overall" load on the cluster.
Prometheus query:

```
rate(ceph_pool_rd_bytes[5m]) * on (job,instance,environment,pool_id) group_left(name) ceph_pool_metadata{name!="proxmox"}
```
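The write side can be checked the same way (a sketch reusing the same labels; `ceph_pool_wr_bytes` and `ceph_pool_wr` are the write counterparts exposed by the ceph-mgr prometheus module):

```
rate(ceph_pool_wr_bytes[5m]) * on (job,instance,environment,pool_id) group_left(name) ceph_pool_metadata{name!="proxmox"}
rate(ceph_pool_wr[5m]) * on (job,instance,environment,pool_id) group_left(name) ceph_pool_metadata{name!="proxmox"}
```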
After annotating the grafana dashboards with the ceph upgrade date, it's clear that the upgrade is the cause of the increased IO jitter. Moving to the zfs snapshotter actually decreases the (read) iops, plausibly because of better cache sharing across the layered snapshots.
So, I feel that the way forward is moving to three rke2 server nodes per cluster, using local storage, which would remove the need for redundant storage and ensure high availability for cluster management (solving #5159 (closed) in the process). Any opinions for/against @teams/sysadmin ?
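For reference, the per-node part of such a setup is small; a minimal sketch of the config on the 2nd and 3rd server nodes if they were joined by hand (hostname and token below are placeholders, and in practice this would be driven through Rancher's provisioning anyway):

```
# /etc/rancher/rke2/config.yaml on the additional server nodes (sketch)
server: https://rancher-node-staging-rke2-mgmt1:9345
token: <join token from the first server node>
```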
> So, I feel that the way forward is moving to three rke2 server nodes per cluster,
I read that as having 3 etcd instances per cluster, 1 per server node, right? (with the detail that the nodes running them use local storage instead of ceph).
I'm totally in favor of having at least 3 management nodes for the production and admin clusters to increase availability. Having them for staging as well seems sensible, to keep the environments alike.
Perhaps we should also have a look at #4680 (closed) but not necessarily in the critical path.
```
# Small OSDs: 16GiB
root@hypervisor3:~# for osd in {4..10}; do ceph tell osd.$osd injectargs '--osd_memory_target 17179869184'; done
# Large OSDs: 32GiB
root@hypervisor3:~# for osd in {11..16}; do ceph tell osd.$osd injectargs '--osd_memory_target 34359738368'; done
```
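Note that `injectargs` only changes the running daemons; if these values help, persisting them across OSD restarts would look something like this (same OSD ranges as above):

```
root@hypervisor3:~# for osd in {4..10}; do ceph config set osd.$osd osd_memory_target 17179869184; done
root@hypervisor3:~# for osd in {11..16}; do ceph config set osd.$osd osd_memory_target 34359738368; done
```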
rancher-node-metal02 ran out of memory and crashed around 19:50 UTC; the subsequent spike in apiserver traffic made the whole cluster management (and health checks...) unresponsive.
I temporarily moved the storage of rancher-node-production-rke2-mgmt1 to the scratch storage until the cluster came back to a stable state, then moved it back to ceph.