
provenance: Order rows in final files by the column they will be queried on

vlorentz requested to merge provenance-transposed into master

i.e. dir-in-rev is ordered by dir instead of rel, cnt-in-dir by cnt instead of dir, and cnt-in-rev by cnt instead of rel.

This allows using Parquet's page index (resp. row group statistics) to find the pages (resp. row groups) that contain all entries for a given query, instead of relying on Bloom filters or partitions to filter out whole row groups / files, which is less efficient.
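To illustrate why the ordering matters, here is a minimal pure-Python sketch of min/max pruning. It only simulates the idea behind row group statistics and does not use a Parquet library. The helper names (`row_group_stats`, `candidate_groups`), the group size and the sample values are all hypothetical.

```python
def row_group_stats(keys, group_size):
    """Split keys into fixed-size groups and return one (min, max) pair per group."""
    return [
        (min(keys[i:i + group_size]), max(keys[i:i + group_size]))
        for i in range(0, len(keys), group_size)
    ]


def candidate_groups(stats, key):
    """Indices of the groups whose [min, max] range may contain key."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= key <= hi]


# Hypothetical cnt-in-dir rows as (dir, cnt) pairs, initially ordered by dir.
rows = [(d, c) for d in range(4) for c in (7, 3, 9, 1)]

cnt_ordered_by_dir = [cnt for _, cnt in rows]
cnt_ordered_by_cnt = sorted(cnt_ordered_by_dir)

stats_by_dir = row_group_stats(cnt_ordered_by_dir, group_size=4)
stats_by_cnt = row_group_stats(cnt_ordered_by_cnt, group_size=4)

# Ordered by dir: every group spans cnt 1..9, so none can be skipped.
print(candidate_groups(stats_by_dir, 3))
# Ordered by cnt: only one group has a range covering cnt == 3.
print(candidate_groups(stats_by_cnt, 3))
```

When rows are sorted on the queried column, the min/max ranges of consecutive groups barely overlap, so a lookup only has to read the few groups whose range covers the key.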

This significantly changes the implementation: the files must now be built from an iterator over destinations instead of sources, so all traversals must go backward and check that every node is reachable from a head rev/rel (so that we don't traverse non-reachable nodes).
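As a rough illustration of the destination-first approach, the following sketch first marks the nodes reachable from the heads, then iterates over destinations in sorted order and walks one step backward. It is a simplified toy model, not the code in this merge request: the graph, the node names and both helper functions are hypothetical, and only direct predecessors are considered.

```python
from collections import deque


def reachable_from(heads, successors):
    """Forward BFS returning every node reachable from the given heads."""
    seen = set(heads)
    queue = deque(heads)
    while queue:
        node = queue.popleft()
        for child in successors.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def rows_ordered_by_destination(destinations, predecessors, reachable):
    """Yield (destination, source) rows sorted by destination, skipping unreachable nodes."""
    for dst in sorted(destinations):
        if dst not in reachable:
            continue
        for src in sorted(predecessors.get(dst, ())):
            if src in reachable:
                yield (dst, src)


# Hypothetical toy graph: revisions point to the directories they contain.
successors = {
    "rev1": ["dirA", "dirB"],
    "rev2": ["dirB"],
    "rev_orphan": ["dirB", "dirC"],  # not reachable from any head
}
heads = ["rev1", "rev2"]

# Invert the edges so that traversals can go backward, from destination to source.
predecessors = {}
for src, dsts in successors.items():
    for dst in dsts:
        predecessors.setdefault(dst, []).append(src)

reachable = reachable_from(heads, successors)
rows = list(rows_ordered_by_destination(["dirC", "dirB", "dirA"], predecessors, reachable))
print(rows)
```

Without the reachability check, the backward walk from `dirB` would also emit a row for `rev_orphan`, and `dirC` would be visited even though no head leads to it.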

Contains commits from !462 (merged)

Edited by vlorentz
