SelectBlobs: Replace pyspark with pyarrow+datafusion

vlorentz requested to merge no-pyspark into master

This reduces the run time from 245 to 230 minutes, while using a third of the CPU load and less memory (which no longer needs to be configured manually).

Most importantly, this considerably reduces the dependency size: pyspark is 300MB of Java, while pyarrow and datafusion are 40MB and 20MB of manylinux wheels, respectively.

Edited by vlorentz