Investigate more compact on-disk structures for node properties

These files are needlessly large:

-r--r--r--+ 2 ext-8972 super 103G Feb 10  2023 graph.property.author_id.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10  2023 graph.property.author_timestamp.bin
-r--r--r--+ 2 ext-8972 super  52G Feb 10  2023 graph.property.author_timestamp_offset.bin
-r--r--r--+ 2 ext-8972 super 103G Feb 11  2023 graph.property.committer_id.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10  2023 graph.property.committer_timestamp.bin
-r--r--r--+ 2 ext-8972 super  52G Feb 10  2023 graph.property.committer_timestamp_offset.bin
-r--r--r--+ 2 ext-8972 super 3.3G Feb 10  2023 graph.property.content.is_skipped.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10  2023 graph.property.content.length.bin
-r--r--r--+ 2 ext-8972 super 343G Feb 11  2023 graph.property.message.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 11  2023 graph.property.message.offset.bin
-r--r--r--+ 2 ext-8972 super 805M Feb 11  2023 graph.property.tag_name.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 11  2023 graph.property.tag_name.offset.bin

mostly because:

  1. .offset files store one u64 for each node
  2. timestamps are overwhelmingly in the 1980-2030 range
  3. timestamp offsets are overwhelmingly multiples of 30 in the -840 to +840 range (minutes)
  4. messages are base64-encoded
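To put rough numbers on the reasons above, here is a back-of-the-envelope sketch. It assumes timestamps are stored as seconds and uses the value ranges quoted in the list; the variable names are made up for illustration and nothing here is measured on the real files.

```python
import base64
import math

# Assumption: timestamps are in seconds; 1980-2030 spans about 50 years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
ts_span = int(50 * SECONDS_PER_YEAR)
ts_bits = math.ceil(math.log2(ts_span))      # 31 bits cover the whole span

# Timestamp offsets: multiples of 30 between -840 and +840 inclusive.
tz_values = len(range(-840, 841, 30))        # 57 distinct values
tz_bits = math.ceil(math.log2(tz_values))    # 6 bits are enough to index them

# base64 emits 4 output characters for every 3 input bytes.
raw = bytes(300)
b64_len = len(base64.b64encode(raw))         # 400 bytes for 300 bytes of input

print(f"timestamps: ~{ts_bits} bits needed for the common range")
print(f"timestamp offsets: {tz_values} common values, ~{tz_bits} bits")
print(f"base64 overhead: {b64_len / len(raw):.2f}x")
```

So the common case needs about 31 bits per timestamp and 6 bits per timestamp offset, and base64 inflates messages by roughly a third. Rare out-of-range values would still need an escape mechanism, which this estimate ignores.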

We could probably save 1-2TB with little runtime cost (and therefore save on cache too) with the following changes:

  1. use Elias-Fano to represent offsets (they are monotonically increasing AFAICT, except for -1 values, which we need a way to deal with: a separate bit array? making them equal to the previous offset?)
  2. use varints centered on 2010 for timestamps, so timestamps in the "normal" range are smaller (this will require a new set of offset files)
  3. do a similar smart thing for timestamp offsets
  4. don't base64-encode messages and/or store them in a rear-coded list (though RCLs imply sorting, so offsets wouldn't be monotonic anymore)
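As an illustration of change 2, here is a minimal sketch of one possible encoding: subtract a pivot, zigzag-map the signed difference, and write it as a LEB128-style varint. The pivot value (2010-01-01 UTC), the zigzag step, and all function names are assumptions made for this example, not an existing swh-graph format.

```python
from datetime import datetime, timezone
from typing import Tuple

# Assumption: "centered on 2010" is taken to mean 2010-01-01T00:00:00Z.
PIVOT = int(datetime(2010, 1, 1, tzinfo=timezone.utc).timestamp())


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes get small codes (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def encode_varint(value: int) -> bytes:
    """7 payload bits per byte; the high bit is set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position of the next unread byte)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def encode_timestamp(ts: int) -> bytes:
    return encode_varint(zigzag(ts - PIVOT))


def decode_timestamp(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    z, pos = decode_varint(buf, pos)
    return unzigzag(z) + PIVOT, pos


if __name__ == "__main__":
    for year in (1980, 2005, 2010, 2023, 2030):
        ts = int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())
        encoded = encode_timestamp(ts)
        assert decode_timestamp(encoded)[0] == ts
        print(year, len(encoded), "bytes")
```

With this particular scheme, timestamps across 1980-2030 take at most 5 bytes instead of a fixed 8, and those within about four years of the pivot take 4 or fewer. Since records become variable-length, random access by node id needs an index into the byte stream, which is why the change implies a new set of offset files.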

see also #4785 for edge label optimizations

Edited by vlorentz