The #4506 (closed) issue was not enough: the local disk is still used as soon as a git repository is bigger than the tmp buffer limit declared for the git loader.
The nodes should be configured to replicate the current worker behavior, i.e. working in a tmpfs and falling back to the server swap (stored on the local hypervisor storage) when the tmpfs content doesn't fit in RAM.
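For reference, this is roughly the pod-level setup being discussed: a memory-backed `emptyDir` (i.e. a tmpfs) mounted on `/tmp`, as on the current static workers. This is only a sketch; the names, image and size are illustrative and not the actual loader deployment, and whether its content can overflow to the node swap then depends on the kubelet/cgroup configuration discussed in the comments below.

```yaml
# Sketch only: a pod whose /tmp is a memory-backed emptyDir (tmpfs).
# Name, image and sizeLimit are placeholders, not the real loader deployment.
apiVersion: v1
kind: Pod
metadata:
  name: loader-git-tmpfs-test
spec:
  containers:
    - name: loader
      image: softwareheritage/loaders   # placeholder image
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory    # back the volume with RAM instead of node disk
        sizeLimit: 4Gi    # arbitrary value for the sketch
```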
- The pod's ephemeral volume can indeed be backed by node memory.
- The `sizeLimit` parameter works as long as the specified size is lower than the node memory.
- We did not succeed in creating a pod using more than the node memory (excluding swap); on worker01 the biggest possible size is 7.7G.
- The default configuration (`memorySwap.swapBehavior: LimitedSwap`) was used, because this setting can only be declared in the kubelet configuration file and not as a kubelet command-line parameter. We haven't yet found how the kubelet configuration file can be modified (see the sketch after this list).
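For the record, this is roughly what the setting would look like in the kubelet configuration file, assuming a Kubernetes version where the `NodeSwap` feature gate is available; how to get such a file onto the Rancher-provisioned nodes is exactly the open question above.

```yaml
# Sketch of a kubelet configuration file (the path, e.g. /var/lib/kubelet/config.yaml,
# depends on how the nodes are provisioned).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false          # let the kubelet start on a node with swap enabled
featureGates:
  NodeSwap: true           # swap support is gated behind this feature gate
memorySwap:
  swapBehavior: LimitedSwap   # the default; UnlimitedSwap would let workloads use all the node swap
```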
> The behaviour of the `LimitedSwap` setting depends if the node is running with v1 or v2 of control groups (also known as "cgroups"):
>
> - cgroups v1: Kubernetes workloads can use any combination of memory and swap, up to the pod's memory limit, if set.
> - cgroups v2: Kubernetes workloads cannot use swap memory.
I believe that the behavior you've noticed is consistent with the cgroupsv2 behavior.
Have we tried rebooting the nodes using cgroups v1 (setting `systemd.unified_cgroup_hierarchy=false systemd.legacy_systemd_cgroup_controller=false` on the Linux cmdline)?
That's great! This seems strictly equivalent to what we currently use on static workers, so I think we can go ahead with this.
It may be worth testing the behavior when there's pressure on disk usage, for instance from a runaway worker.
From what I understand of volume/storage resource requests, I expect the pressure will only be visible when *scheduling* a new pod (i.e. the pod will be scheduled on a node with available disk space), rather than while a pod is running (i.e. no pods will be evicted if they go above their storage request). But I haven't actually seen this in action yet.
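To make that concrete, a local ephemeral storage request/limit on the loader container would look roughly like the snippet below (values are illustrative); the request part is what the scheduler takes into account when placing the pod on a node.

```yaml
# Illustrative values only: local ephemeral storage request/limit on a container.
resources:
  requests:
    ephemeral-storage: 2Gi   # considered when scheduling the pod onto a node
  limits:
    ephemeral-storage: 4Gi
```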
I've performed a couple of tests to probe the limits.
Unfortunately, there will be no eviction at all, as Kubernetes only monitors the kubelet and image filesystems:

> The kubelet supports the following filesystem partitions:
>
> - `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir, log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
> - `imagefs`: An optional filesystem that container runtimes use to store container images and container writable layers.
>
> Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not support other configurations.
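In other words, the only filesystem-related eviction signals the kubelet can act on are the `nodefs.*` and `imagefs.*` ones (plus memory and PIDs); there is no signal for a tmpfs mounted inside a pod. For illustration, the default-like hard eviction thresholds look like this in the kubelet configuration:

```yaml
# KubeletConfiguration excerpt (default-like values): the filesystem signals are
# limited to nodefs/imagefs, so a full in-pod tmpfs never triggers an eviction.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```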
When the tmpfs is full, the process will fail like any classic process:
```
swh@loader-cvs-swap-7546fc8454-w45bp:/tmp$ dd if=/dev/zero of=/tmp/bigfile bs=1024 count=4092000
dd: error writing '/tmp/bigfile': No space left on device   <------ Failed
2392641+0 records in
2392640+0 records out
2450063360 bytes (2.5 GB, 2.3 GiB) copied, 18.9851 s, 129 MB/s
```
```
~ ❯ kubectl describe node rancher-node-staging-worker4 | grep -i disk
  DiskPressure   False   Wed, 23 Nov 2022 15:28:25 +0100   Wed, 23 Nov 2022 12:07:21 +0100   KubeletHasNoDiskPressure   kubelet has no disk pressure
~ ❯ ssh rancher-node-staging-worker4.internal.staging.swh.network df -h /tmp/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            15G   15G     0 100% /tmp
```
If we find that it becomes a problem, I guess we will be able to configure an init container in the loader pods that checks there is at least X% free space on /tmp before letting the main container start.
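A minimal sketch of what such an init container could look like, assuming /tmp ends up being a volume shared at the node level (the image, volume definition and 90% threshold are placeholders; the actual X% value and volume type are still to be decided):

```yaml
# Sketch only: block the loader container from starting if /tmp is more than 90% used.
apiVersion: v1
kind: Pod
metadata:
  name: loader-with-tmp-check     # placeholder name
spec:
  volumes:
    - name: tmp
      hostPath:                   # assumption: /tmp is the node-level tmpfs; adjust to the real setup
        path: /tmp
  initContainers:
    - name: check-tmp-free-space
      image: busybox              # any image providing df/awk would do
      command:
        - sh
        - -c
        - |
          # Exit non-zero (and keep the pod in Init) if /tmp usage is above 90%.
          usage=$(df -P /tmp | awk 'NR==2 {gsub("%", "", $5); print $5}')
          [ "$usage" -le 90 ]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  containers:
    - name: loader
      image: softwareheritage/loaders   # placeholder image
      volumeMounts:
        - name: tmp
          mountPath: /tmp
```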