
Dealing with repositories whose contents produce hash conflicts (example from GitLab)

The (in)famous pair of different files with the same length and the same SHA1 (SHAttered) is now being included as test data in cryptography-related projects. One example surfaced when we failed to load the https://gitlab.com/sequoia-pgp/sequoia repository, which contains such files.

 $ git clone https://gitlab.com/sequoia-pgp/sequoia
 [...]
 $ cd sequoia/openpgp/tests/data/messages
 $ sha1sum shattered-[12].pdf 
 38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
 38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf

It turns out that this poses no problem for git, nor for our SWHIDv1, because the files that collide under SHA1 do not collide under SHA1-git: indeed, both files are properly stored in the sequoia repository.

 $ git hash-object shattered-[12].pdf
 ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
 b621eeccd5c7edac9b7dcba35a8d5afd075e24f2
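
For reference, the two commands hash different inputs: `sha1sum` hashes the raw file bytes, while `git hash-object` hashes the bytes prefixed with a `blob <length>\0` header. The SHAttered collision was built for the files' own prefix, so it does not carry over once that header is prepended. A minimal Python sketch of both computations (the function names are illustrative, not taken from any of our code):

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain SHA1 of the raw bytes, as printed by `sha1sum`."""
    return hashlib.sha1(data).hexdigest()


def git_blob_sha1_hex(data: bytes) -> str:
    """SHA1-git of the bytes, as printed by `git hash-object` for a blob.

    Git hashes the header "blob <length>\\0" followed by the content.
    """
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python hashes.py shattered-1.pdf shattered-2.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            content = f.read()
        print(path, sha1_hex(content), git_blob_sha1_hex(content))
```

Run against the two PDFs, this should reproduce the outputs above: identical values in the first column and distinct values in the second.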

However, our current pipeline detects the SHA1 conflict and refuses to ingest these files.
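
To illustrate the kind of check involved, here is a hedged sketch of how an index keyed by a single hash can notice a conflict by comparing a second, stronger hash. It is not our actual pipeline: `ContentIndex`, `HashConflict` and `key_func` are hypothetical names introduced only for this example.

```python
import hashlib


class HashConflict(Exception):
    """Raised when two different contents map to the same key (hypothetical)."""


class ContentIndex:
    """Hypothetical in-memory index keyed by one hash (plain SHA1 by default)."""

    def __init__(self, key_func=None):
        # key_func is an assumption of this sketch; it lets the conflict
        # path be exercised without needing real colliding files.
        self._key_func = key_func or (lambda d: hashlib.sha1(d).hexdigest())
        self._sha256_by_key = {}

    def add(self, data: bytes) -> str:
        key = self._key_func(data)
        sha256 = hashlib.sha256(data).hexdigest()
        known = self._sha256_by_key.setdefault(key, sha256)
        if known != sha256:
            # Same key, different content: refuse the object.
            raise HashConflict(f"key {key} already maps to different content")
        return key
```

With the default key, adding both shattered PDFs would take the `HashConflict` branch, which corresponds to the behavior described above: the objects are rejected rather than archived.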

We need to design a way to archive such repositories instead of skipping them as we do today.


Migrated from T3775 (view on Phabricator)
