    provenance: Order rows in final files by the column they will be queried on · 604f3043
    vlorentz authored
    i.e., dir-in-rev is ordered by dir instead of rel, cnt-in-dir by cnt instead of dir,
    and cnt-in-rev by cnt instead of rel.
    
    This allows using Parquet's page index (resp. row group statistics) to find the pages
    (resp. row groups) that contain all entries for a given query, instead of relying
    on either Bloom filters or partitions to filter out whole row groups / files,
    which is less efficient.
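
To illustrate why sorting by the queried column helps, here is a minimal sketch in plain Python. It does not use Parquet itself: `RowGroup`, `write_row_groups` and `lookup` are hypothetical stand-ins that only mimic per-row-group min/max statistics, and the data is synthetic.

```python
import random
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max statistics."""
    rows: list      # (key, value) pairs stored in this group
    min_key: int    # smallest key in the group
    max_key: int    # largest key in the group


def write_row_groups(rows, group_size):
    """Split rows into fixed-size groups and record each group's key range."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        keys = [key for key, _ in chunk]
        groups.append(RowGroup(chunk, min(keys), max(keys)))
    return groups


def lookup(groups, key):
    """Return the matching values and the number of groups that had to be read."""
    scanned = 0
    matches = []
    for group in groups:
        # Statistics-based pruning: skip groups whose key range excludes `key`.
        if group.min_key <= key <= group.max_key:
            scanned += 1
            matches.extend(value for k, value in group.rows if k == key)
    return matches, scanned


# Synthetic data: 10,000 rows with keys drawn from 1,000 possible values.
rng = random.Random(0)
rows = [(rng.randrange(1000), i) for i in range(10_000)]

unsorted_groups = write_row_groups(rows, 500)
sorted_groups = write_row_groups(sorted(rows), 500)

unsorted_matches, unsorted_scanned = lookup(unsorted_groups, 500)
sorted_matches, sorted_scanned = lookup(sorted_groups, 500)
```

Both layouts return the same rows, but when the rows are sorted by the queried key the min/max ranges barely overlap, so `sorted_scanned` stays very small, whereas nearly every unsorted group has a range covering the key and must be read.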
    
    This significantly changes the implementation: the files must now be built
    from an iterator over destinations instead of sources, so all traversals must go
    backward and check that every node is reachable from a head rev/rel (so that we
    do not traverse unreachable nodes).
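
The backward traversal described above can be sketched as follows. This is an illustrative toy in Python, not the code from this commit: the graph representation (a dict of successors), the node names and the helpers `reachable_from` and `ancestors_by_destination` are all assumptions made for the example.

```python
from collections import defaultdict, deque


def reachable_from(heads, successors):
    """Forward BFS: every node reachable from at least one head."""
    seen = set(heads)
    queue = deque(heads)
    while queue:
        node = queue.popleft()
        for child in successors.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def ancestors_by_destination(destinations, successors, heads):
    """Yield (destination, ancestor) rows ordered by destination.

    Iterates over destinations and walks edges backward, ignoring any node
    that is not reachable from a head.
    """
    # Invert the edges so that we can traverse from children to parents.
    predecessors = defaultdict(list)
    for parent, children in successors.items():
        for child in children:
            predecessors[child].append(parent)

    reachable = reachable_from(heads, successors)

    for dst in sorted(destinations):
        if dst not in reachable:
            continue  # unreachable destination: produce no rows for it
        seen = set()
        stack = [dst]
        while stack:
            node = stack.pop()
            for parent in predecessors.get(node, ()):
                # Reachability check: never walk into unreachable nodes.
                if parent in reachable and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        for ancestor in sorted(seen):
            yield dst, ancestor


# Toy graph: rev1 is the only head; dirB is an orphan directory.
successors = {
    "rev1": ["dirA"],
    "dirA": ["cnt1", "cnt2"],
    "dirB": ["cnt2", "cnt3"],
}
rows = list(ancestors_by_destination(["cnt2", "cnt1", "cnt3"], successors, ["rev1"]))
```

Because the outer loop runs over sorted destinations, the rows come out already ordered by the column they will be queried on, and `dirB` and `cnt3` never appear since no head reaches them.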