rke2 zfs snapshotter: garbage collection doesn't happen when zfs dataset doesn't exist
While migrating storage1 to the zfs snapshotter, I noticed that rancher-node-staging-rke2-mgmt1 ran out of disk space.
Turns out the zfs snapshotter was unable to run its garbage collection procedure as one of the zfs datasets was missing.
How to notice the issue: /var/lib/rancher/rke2/agent/containerd/containerd.log contains warnings like this:
level=warning msg="snapshot garbage collection failed" error="exit status 1: \"/usr/sbin/zfs list -Hp -o name,origin,used,available,mountpoint,compression,type,volsize,quota,referenced,written,logicalused,usedbydataset data/rancher/1800\" => cannot open 'data/rancher/1800': dataset does not exist\n" snapshotter=zfs
Another symptom: zfs list -t all takes a long time and dumps a lot of unexpected datasets.
Workaround: zfs create -o mountpoint=legacy data/rancher/{missing_dataset_id}
This creates an empty dataset in place of the missing one, which containerd is then able to garbage collect.
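
To find the missing dataset names without reading the log by hand, something like the following might help. It is only a sketch: it assumes the warning always has the shape of the log line quoted above, the function names missing_datasets and print_workaround are made up for illustration, and it only prints the zfs create commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: assumes the containerd warning looks like the line quoted in
# this issue. Function names are hypothetical, not part of rke2 or containerd.

# Read containerd log lines on stdin and print each dataset that the
# zfs snapshotter reported as missing, once per dataset.
missing_datasets() {
    grep 'snapshot garbage collection failed' \
        | sed -n "s/.*cannot open '\([^']*\)': dataset does not exist.*/\1/p" \
        | sort -u
}

# Read dataset names on stdin and print (not run) the workaround command
# for each of them.
print_workaround() {
    while IFS= read -r ds; do
        printf 'zfs create -o mountpoint=legacy %s\n' "$ds"
    done
}

# Dry run on an affected node:
#   missing_datasets < /var/lib/rancher/rke2/agent/containerd/containerd.log | print_workaround
```

Garbage collection may fail on a different dataset after the first one is recreated, so this may need to be repeated until the warning stops showing up in the log.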