Dealing with repositories whose contents produce hash conflicts (example included from GitLab)
The (in)famous SHAttered pair, two different files with the same length and the same SHA1, is now being included as test data in cryptography-related projects. We ran into an example when we failed to load the https://gitlab.com/sequoia-pgp/sequoia repository, which contains these files.
$ git clone https://gitlab.com/sequoia-pgp/sequoia
[...]
$ cd sequoia/openpgp/tests/data/messages
$ sha1sum shattered-[12].pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-1.pdf
38762cf7f55934b34d179ae6a4c80cadccbb7f0a shattered-2.pdf
It turns out that this does not pose a problem for git, nor for our SWHIDv1, because the files that collide on SHA1 do not collide on SHA1-git: indeed, both files are properly stored in the sequoia project.
$ git hash-object shattered-[12].pdf
ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
b621eeccd5c7edac9b7dcba35a8d5afd075e24f2
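For reference, here is a minimal Python sketch of why the two digests differ. git does not hash the raw file: for a blob it hashes the header "blob <size>\0" followed by the content, so the SHA1 computation starts from a different prefix than the one the SHAttered pair was presumably crafted for. The function names sha1_hex and sha1_git_hex are made up for this illustration and are not taken from our codebase.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain SHA1 of the raw bytes, as printed by `sha1sum`."""
    return hashlib.sha1(data).hexdigest()


def sha1_git_hex(data: bytes) -> str:
    """SHA1 as computed by `git hash-object` for a blob.

    git prepends the header b"blob <size>\\0" to the content before hashing.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the two PDFs as command-line arguments.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            data = f.read()
        print(sha1_hex(data), sha1_git_hex(data), path)
```

Run on shattered-1.pdf and shattered-2.pdf, the first column should match the sha1sum output and the second column the git hash-object output for each file.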
But our current pipeline detects the SHA1 conflict and refuses to ingest these files.
We need to design a way to archive such repositories, instead of skipping them as we do today.
Migrated from T3775 (view on Phabricator)