Investigate more compact on-disk structures for node properties
These files are needlessly large:
-r--r--r--+ 2 ext-8972 super 103G Feb 10 2023 graph.property.author_id.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10 2023 graph.property.author_timestamp.bin
-r--r--r--+ 2 ext-8972 super 52G Feb 10 2023 graph.property.author_timestamp_offset.bin
-r--r--r--+ 2 ext-8972 super 103G Feb 11 2023 graph.property.committer_id.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10 2023 graph.property.committer_timestamp.bin
-r--r--r--+ 2 ext-8972 super 52G Feb 10 2023 graph.property.committer_timestamp_offset.bin
-r--r--r--+ 2 ext-8972 super 3.3G Feb 10 2023 graph.property.content.is_skipped.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 10 2023 graph.property.content.length.bin
-r--r--r--+ 2 ext-8972 super 343G Feb 11 2023 graph.property.message.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 11 2023 graph.property.message.offset.bin
-r--r--r--+ 2 ext-8972 super 805M Feb 11 2023 graph.property.tag_name.bin
-r--r--r--+ 2 ext-8972 super 205G Feb 11 2023 graph.property.tag_name.offset.bin
mostly because:
- .offset files store one u64 for each node
- timestamps are overwhelmingly in the 1980-2030 range
- timestamp offsets are overwhelmingly multiples of 30 in the -840 to 840 range
- messages are base64-encoded
We could probably save 1-2 TB (and the corresponding cache space) with little runtime cost, with the following changes:
- use Elias-Fano to represent offsets (they are monotonically increasing AFAICT, except for -1 values which we need to find a way to deal with. bit array? make them equal to the previous offset?)
- use varints centered on 2010 for timestamps, so timestamps in the "normal" range are smaller (this will require a new set of offset files)
- use a similarly compact encoding for timestamp offsets (since most are multiples of 30 in the -840 to 840 range, the quotient fits in a single byte)
- don't base64-encode messages and/or store them in a rear-coded list (though RCLs imply sorting, so offsets wouldn't be monotonic anymore)
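For the Elias-Fano idea, one way to deal with the -1 values before encoding could be the "make them equal to the previous offset" option combined with a bit array marking which entries are real. A minimal sketch (`normalize_offsets` is a hypothetical helper, not existing code):

```python
def normalize_offsets(offsets):
    # Elias-Fano needs a non-decreasing sequence; replace -1 "missing"
    # markers with the previous offset (keeping monotonicity) and record
    # which entries were actually present in a separate bit array.
    normalized = []
    present = []
    prev = 0
    for off in offsets:
        if off == -1:
            normalized.append(prev)
            present.append(False)
        else:
            normalized.append(off)
            present.append(True)
            prev = off
    return normalized, present
```

On lookup, a node whose bit is unset is reported as having no value, so the duplicated offset is never observed by callers.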
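The biased-varint idea for timestamps could look like the following sketch, assuming a standard LEB128 varint with zigzag encoding and an (assumed) bias of 2010-01-01; all names are hypothetical:

```python
EPOCH_2010 = 1262304000  # 2010-01-01T00:00:00Z, assumed center of the bias

def zigzag(n: int) -> int:
    # Map signed ints to unsigned so small magnitudes encode in few bytes
    # (assumes n fits in a signed 64-bit range)
    return (n << 1) ^ (n >> 63)

def encode_varint(value: int) -> bytes:
    # LEB128: 7 bits per byte, high bit set on continuation bytes
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamp(ts: int) -> bytes:
    # Bias toward 2010 so timestamps in the "normal" 1980-2030 range
    # take ~5 bytes instead of a fixed u64
    return encode_varint(zigzag(ts - EPOCH_2010))
```

A timestamp from early 2023 lands about 413 million seconds from the bias point, which zigzag-encodes into 30 bits and therefore 5 varint bytes instead of 8 fixed ones; the variable width is what forces the new set of offset files.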
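For the timestamp offsets, dividing by 30 maps the common -840..840 range to -28..28, which fits in one signed byte with room for an escape value. A sketch under those assumptions (the sentinel-byte fallback scheme and the function name are invented for illustration):

```python
import struct

def encode_tz_offset(minutes: int) -> bytes:
    # Most offsets are multiples of 30 in [-840, 840]: store minutes // 30
    # as one signed byte (range -28..28, so 0x7f never collides).
    if -840 <= minutes <= 840 and minutes % 30 == 0:
        return struct.pack('b', minutes // 30)
    # Assumed fallback for odd offsets: sentinel byte then a full i16
    return b'\x7f' + struct.pack('<h', minutes)
```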
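The rear-coded-list idea for messages and tag names amounts to sorting the strings and storing, for each one, only the length of the prefix shared with its predecessor plus the differing suffix. A toy sketch of the encoding side (hypothetical helper, ignoring the blocking that real RCLs use to bound decode cost):

```python
def rear_code(sorted_strings):
    # For a sorted list, each entry becomes (length of common prefix with
    # the previous string, remaining suffix) — similar tag names like
    # "v1.0"/"v1.1" then cost only a couple of bytes each.
    out = []
    prev = ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out
```

This is where the caveat in the bullet above bites: the strings must be stored in sorted order, so the per-node offsets into the list are no longer monotonic and can't go through Elias-Fano as-is.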