
Migrate the content store to a new (internal) primary key scheme

Our content store, along with all of its associated data, uses the SHA1 of the data as its primary key.

This content store primary key is used in several places, notably:

  • As an accessor in our common object storage API
  • To construct on-disk paths and storage API objects
  • As a primary key on the (internal) tables that store metadata associated with each object
  • As a primary key in the archiver database
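
To make the coupling concrete, here is a minimal sketch of how a SHA1 key can serve both as the accessor and as the basis of an on-disk path. The function names, the `/srv/objects` root and the two-level directory layout are assumptions made for illustration, not the actual layout of our object storage.

```python
import hashlib
from pathlib import PurePosixPath


def content_key(data: bytes) -> str:
    """Return the hex SHA1 of the raw data, used here as the primary key."""
    return hashlib.sha1(data).hexdigest()


def object_path(root: str, key: str, depth: int = 2) -> PurePosixPath:
    """Build a sharded path root/aa/bb/<key> (hypothetical layout).

    The first `depth` byte pairs of the hex key become directory names.
    """
    parts = [key[2 * i: 2 * i + 2] for i in range(depth)]
    return PurePosixPath(root, *parts, key)


key = content_key(b"hello world\n")
path = object_path("/srv/objects", key)
print(key)
print(path)
```

Because the path is derived entirely from the key, changing the primary key scheme touches every layer that builds or parses such paths.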

There are some advantages to the current scheme:

  • It's intrinsic: you can check the integrity of your data store without storing any metadata about the contents
  • It shards naturally thanks to the statistical properties of the hashes
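
Both advantages can be sketched in a few lines. This is only an illustration under stated assumptions: the store is modelled as an in-memory dict, and `find_corrupted` and `shard_of` are hypothetical helpers, not part of our storage API.

```python
import hashlib
from collections import Counter


def find_corrupted(store: dict[str, bytes]) -> list[str]:
    """Return the keys whose data no longer hashes to the key itself.

    No extra metadata is needed: the key is the expected checksum.
    """
    return [k for k, data in store.items()
            if hashlib.sha1(data).hexdigest() != k]


def shard_of(key: str, n_shards: int) -> int:
    """Pick a shard from the leading 32 bits of the hex key."""
    return int(key[:8], 16) % n_shards


# Integrity: build a small store, then tamper with one object.
store = {hashlib.sha1(d).hexdigest(): d for d in (b"a", b"b", b"c")}
corrupted_key = hashlib.sha1(b"a").hexdigest()
store[corrupted_key] = b"tampered"
bad = find_corrupted(store)

# Sharding: count how 1600 distinct objects spread over 16 shards.
counts = Counter(
    shard_of(hashlib.sha1(str(i).encode()).hexdigest(), 16)
    for i in range(1600)
)
print(bad)
print(sorted(counts.items()))
```

A surrogate key would offer neither property for free, so the intrinsic hash would still have to be stored and checked somewhere.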

The scheme has a few drawbacks:

  • If there's a collision on your hash of choice, you lose: two distinct contents map to the same key, so they cannot both be stored
  • A hash is a big value (20 bytes for SHA1, probably 32+ bytes if we migrate to a successor such as SHA2, SHA3 or Blake2b), which makes any new metadata table instantly and inherently huge.
  • We want the next scheme to be as future-proof as possible, as migrating is not going to get easier in the future.
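
To give a rough sense of the size drawback, the sketch below multiplies digest sizes by a row count. The figure of 10 billion rows is an arbitrary assumption for illustration, and the result covers raw key bytes only, ignoring index and per-row overhead.

```python
import hashlib

# Digest sizes in bytes, read from hashlib rather than hard-coded.
DIGEST_SIZES = {
    "sha1": hashlib.sha1().digest_size,
    "sha256": hashlib.sha256().digest_size,
    "sha3_256": hashlib.sha3_256().digest_size,
    "blake2b (32-byte digest)": hashlib.blake2b(digest_size=32).digest_size,
}


def key_bytes(n_rows: int, key_size: int) -> int:
    """Raw bytes spent on the key column alone for n_rows rows."""
    return n_rows * key_size


N_ROWS = 10_000_000_000  # hypothetical row count, not a measured figure

for name, size in DIGEST_SIZES.items():
    total_gb = key_bytes(N_ROWS, size) / 1e9
    print(f"{name}: {size} bytes/key, {total_gb:.0f} GB of raw key data")
```

This cost is paid once per metadata table that uses the hash as its key, which is why each new table starts out large.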

Of course, this is only an "internal" matter: our data model and deduplication functionality depend heavily on purely intrinsic hashes, which allow us to retain the properties of a Merkle DAG and let anyone recompute the intrinsic identifier of any object.

This task is a meta-task to decide the new primary key scheme, with further subtasks to track the individual migration items.


Migrated from T698 (view on Phabricator)

Edited by Phabricator Migration user