
shard format v2

This issue is dedicated to discussing ideas for updating the shard file format.

Ideas:

  • move the index and mph function after the header rather than having them at the end of the file
    • rationale: makes a partial shard (e.g. one that was only partially copied) somewhat recoverable
    • should not require a new shard format per se (all the offsets are stored in the header, so no version bump should be needed)
  • store the file size in the index as well as in the main payload area
    • rationale:
    • requires a version bump
  • add some form of CRC, at least for the header
    • rationale: detect bitrot in the header, which, if corrupted, can make the whole shard useless
    • requires a version bump
  • use an x MB alignment for all the sections of the shard file
    • rationale: XXX
    • does not require a version bump
  • assess whether the shard file format can be made usable on cloud storage (S3)
    • rationale: XXX
  • make the primary hash a "parameter" of the shard file (e.g. advertised in the header)
  • make the shard file able to store several hash indexes/mph functions
    • rationale: would that be useful?
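
To make the discussion more concrete, here is a minimal sketch of how three of the ideas above could fit together: sections placed right after the header, a CRC protecting the header, and aligned section offsets. Everything in it is an assumption for illustration and not the current shard layout: the `SHRD` magic, the field order, the choice of CRC32 (via `zlib.crc32`), and the 1 MiB value standing in for the undecided "x MB" alignment.

```python
import struct
import zlib

# Hypothetical header: magic, version, then the offsets of the three sections.
# None of these names or sizes come from the real shard format.
MAGIC = b"SHRD"
HEADER_FMT = "<4sIQQQ"  # magic, version, index_offset, mph_offset, payload_offset
HEADER_SIZE = struct.calcsize(HEADER_FMT)
CRC_FMT = "<I"
CRC_SIZE = struct.calcsize(CRC_FMT)
ALIGNMENT = 1 << 20  # 1 MiB, placeholder for the "x MB" alignment


def align_up(offset: int, alignment: int = ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(index_size: int, mph_size: int) -> tuple[int, int, int]:
    """Place index and mph function right after the header, payload last."""
    index_offset = align_up(HEADER_SIZE + CRC_SIZE)
    mph_offset = align_up(index_offset + index_size)
    payload_offset = align_up(mph_offset + mph_size)
    return index_offset, mph_offset, payload_offset


def pack_header(version: int, offsets: tuple[int, int, int]) -> bytes:
    """Serialize the header fields followed by a CRC32 of those bytes."""
    body = struct.pack(HEADER_FMT, MAGIC, version, *offsets)
    return body + struct.pack(CRC_FMT, zlib.crc32(body))


def unpack_header(raw: bytes) -> tuple[int, tuple[int, int, int]]:
    """Parse a header, raising ValueError if it is truncated or corrupted."""
    if len(raw) < HEADER_SIZE + CRC_SIZE:
        raise ValueError("truncated header")
    body = raw[:HEADER_SIZE]
    (stored_crc,) = struct.unpack(CRC_FMT, raw[HEADER_SIZE:HEADER_SIZE + CRC_SIZE])
    if zlib.crc32(body) != stored_crc:
        raise ValueError("header CRC mismatch")
    magic, version, index_off, mph_off, payload_off = struct.unpack(HEADER_FMT, body)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return version, (index_off, mph_off, payload_off)
```

With this layout, a reader that only received the first part of a shard could still validate the header and load the index and mph function, as long as the copy got past `payload_offset`. The trade-off is that the index and mph sizes must be known (or reserved) before the payload is written.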