
Replace orcxx with datafusion-orc + Arrow + ar_row

vlorentz requested to merge arrow_v2 into master

orcxx builds and links with the Apache ORC C++ library, which is a recurring source of issues:

  • linking failures we do not understand
  • dependencies (libsnappy, libzstd, ...) need to either be installed on the system or built as part of Apache ORC, the latter being another source of linking failures
  • starting with v2.0, Apache ORC C++ will download orc-format as part of its build process, from a location (archive.apache.org) that bans our CI; v2.1 will only partially mitigate this by first trying another location (dlcdn.apache.org), where the file will disappear as soon as the next version of orc-format is released

This change replaces orcxx with three components:

  • datafusion-orc, a library to parse ORC files into Arrow structures
  • Arrow, an in-memory columnar format
  • ar_row, a new crate I forked from orcxx: I removed all the ORC-parsing code, kept only the "columnar arrays -> vector of row structures" deserialization, and adapted it to Arrow. ar_row is pure Rust and has much less unsafe code than orcxx.
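To illustrate what "columnar arrays -> vector of row structures" means, here is a minimal sketch using only the Rust standard library. It does not use the actual ar_row or Arrow APIs: `Columns`, `Row`, and `columns_to_rows` are hypothetical names invented for this example, and plain `Vec`s stand in for Arrow arrays.

```rust
/// Hypothetical columnar batch: one vector per field, all of the same length.
/// In the real pipeline these would be Arrow arrays produced by datafusion-orc.
struct Columns {
    ids: Vec<i64>,
    /// `None` models a null value in a nullable column.
    names: Vec<Option<String>>,
}

/// Hypothetical row structure: one value per field.
#[derive(Debug, PartialEq)]
struct Row {
    id: i64,
    name: Option<String>,
}

/// Turns a columnar batch into a vector of rows.
/// Returns an error if the columns do not all have the same length.
fn columns_to_rows(columns: Columns) -> Result<Vec<Row>, String> {
    if columns.ids.len() != columns.names.len() {
        return Err(format!(
            "column length mismatch: {} ids vs {} names",
            columns.ids.len(),
            columns.names.len()
        ));
    }
    Ok(columns
        .ids
        .into_iter()
        .zip(columns.names)
        .map(|(id, name)| Row { id, name })
        .collect())
}

fn main() {
    let columns = Columns {
        ids: vec![1, 2, 3],
        names: vec![Some("a".to_string()), None, Some("c".to_string())],
    };
    let rows = columns_to_rows(columns).expect("columns have equal length");
    for row in &rows {
        println!("{:?}", row);
    }
}
```

The real crate has to handle many more types (nested structs, lists, decimals, timestamps), so this only shows the shape of the transformation, not its implementation.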

In terms of functionality, this is mostly equivalent, at the cost of a ~2% performance penalty on the first step of extract-swhids and a 0.7 -> 17GB increase in memory usage of the Rust part of that first step. Given that this first step also runs 96 instances of sort, each with a 100MB in-memory buffer (about 9.6GB in total), and is followed by a second step (merging sorted lists) that takes even longer, these extra resources should not matter in practice.

Edited by vlorentz
