The #4506 (closed) issue was not enough: the local disk is still used as soon as a git repository is bigger than the tmp buffer limit declared for the git loader.
The nodes should be configured to replicate the current worker behavior, i.e. working in a tmpfs and falling back to the server swap (stored on the local hypervisor storage) when the tmpfs content doesn't fit in RAM.
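For reference, this is roughly the pod-level setup being discussed: a memory-backed `emptyDir` (i.e. a tmpfs) mounted on `/tmp`, as on the current static workers. This is only a sketch; the names, image and size are illustrative and not the actual loader deployment, and whether its content can overflow to the node swap then depends on the kubelet/cgroup configuration discussed in the comments below.

```yaml
# Sketch only: a pod whose /tmp is a memory-backed emptyDir (tmpfs).
# Name, image and sizeLimit are placeholders, not the real loader deployment.
apiVersion: v1
kind: Pod
metadata:
  name: loader-git-tmpfs-test
spec:
  containers:
    - name: loader
      image: softwareheritage/loaders   # placeholder image
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory    # back the volume with RAM instead of node disk
        sizeLimit: 4Gi    # arbitrary value for the sketch
```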
- The pod's ephemeral volume can indeed be backed by node memory.
- The `sizeLimit` parameter works as long as the specified size is lower than the node memory.
- We did not succeed in creating a pod using more than the node memory (excluding swap); on worker01 the biggest possible size is 7.7G.
- The default configuration (`memorySwap.swapBehavior: LimitedSwap`) was used, because this setting can only be declared in the kubelet configuration file and not as a kubelet command-line parameter. We haven't yet found how the kubelet configuration file can be modified (see the sketch after this list).
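For the record, this is roughly what the setting would look like in the kubelet configuration file, assuming a Kubernetes version where the `NodeSwap` feature gate is available; how to get such a file onto the Rancher-provisioned nodes is exactly the open question above.

```yaml
# Sketch of a kubelet configuration file (the path, e.g. /var/lib/kubelet/config.yaml,
# depends on how the nodes are provisioned).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false          # let the kubelet start on a node with swap enabled
featureGates:
  NodeSwap: true           # swap support is gated behind this feature gate
memorySwap:
  swapBehavior: LimitedSwap   # the default; UnlimitedSwap would let workloads use all the node swap
```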
> The behaviour of the `LimitedSwap` setting depends if the node is running with v1 or v2 of control groups (also known as "cgroups"):
>
> - cgroups v1: Kubernetes workloads can use any combination of memory and swap, up to the pod's memory limit, if set.
> - cgroups v2: Kubernetes workloads cannot use swap memory.
I believe that the behavior you've noticed is consistent with the cgroupsv2 behavior.
Have we tried rebooting the nodes using cgroups v1 (setting `systemd.unified_cgroup_hierarchy=false systemd.legacy_systemd_cgroup_controller=false` on the Linux cmdline)?
That's great! This seems strictly equivalent to what we currently use on static workers, so I think we can go ahead with this.
It may be worth testing the behavior when there's pressure on disk usage, for instance from a runaway worker.
From what I understand of volume/storage resource requests, I expect the pressure will only be visible when *scheduling* a new pod (i.e. the pod will be scheduled on a node with available disk space), rather than while a pod is running (i.e. no pods will be evicted if they go above their storage request). But I haven't actually seen this in action yet.
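To make that concrete, a local ephemeral storage request/limit on the loader container would look roughly like the snippet below (values are illustrative); the request part is what the scheduler takes into account when placing the pod on a node.

```yaml
# Illustrative values only: local ephemeral storage request/limit on a container.
resources:
  requests:
    ephemeral-storage: 2Gi   # considered when scheduling the pod onto a node
  limits:
    ephemeral-storage: 4Gi
```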
I've performed a couple of tests to probe the limits.
Unfortunately, there will be no eviction at all, as Kubernetes only monitors the kubelet and image filesystems:

> The kubelet supports the following filesystem partitions:
>
> - `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir, log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
> - `imagefs`: An optional filesystem that container runtimes use to store container images and container writable layers.
>
> Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not support other configurations.
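In other words, the only filesystem-related eviction signals the kubelet can act on are the `nodefs.*` and `imagefs.*` ones (plus memory and PIDs); there is no signal for a tmpfs mounted inside a pod. For illustration, the default-like hard eviction thresholds look like this in the kubelet configuration:

```yaml
# KubeletConfiguration excerpt (default-like values): the filesystem signals are
# limited to nodefs/imagefs, so a full in-pod tmpfs never triggers an eviction.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```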
When the tmpfs is full, the process will fail like any classic process:
```
swh@loader-cvs-swap-7546fc8454-w45bp:/tmp$ dd if=/dev/zero of=/tmp/bigfile bs=1024 count=4092000
dd: error writing '/tmp/bigfile': No space left on device   <------ Failed
2392641+0 records in
2392640+0 records out
2450063360 bytes (2.5 GB, 2.3 GiB) copied, 18.9851 s, 129 MB/s
```
```
~ ❯ kubectl describe node rancher-node-staging-worker4 | grep -i disk
  DiskPressure   False   Wed, 23 Nov 2022 15:28:25 +0100   Wed, 23 Nov 2022 12:07:21 +0100   KubeletHasNoDiskPressure   kubelet has no disk pressure
~ ❯ ssh rancher-node-staging-worker4.internal.staging.swh.network df -h /tmp/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            15G   15G     0 100% /tmp
```
If we find that it becomes a problem, I guess we will be able to configure an init container in the loader pods that checks there is at least X% free space on /tmp before letting the main container start.
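A minimal sketch of what such an init container could look like, assuming /tmp ends up being a volume shared at the node level (the image, volume definition and 90% threshold are placeholders; the actual X% value and volume type are still to be decided):

```yaml
# Sketch only: block the loader container from starting if /tmp is more than 90% used.
apiVersion: v1
kind: Pod
metadata:
  name: loader-with-tmp-check     # placeholder name
spec:
  volumes:
    - name: tmp
      hostPath:                   # assumption: /tmp is the node-level tmpfs; adjust to the real setup
        path: /tmp
  initContainers:
    - name: check-tmp-free-space
      image: busybox              # any image providing df/awk would do
      command:
        - sh
        - -c
        - |
          # Exit non-zero (and keep the pod in Init) if /tmp usage is above 90%.
          usage=$(df -P /tmp | awk 'NR==2 {gsub("%", "", $5); print $5}')
          [ "$usage" -le 90 ]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  containers:
    - name: loader
      image: softwareheritage/loaders   # placeholder image
      volumeMounts:
        - name: tmp
          mountPath: /tmp
```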