I'll write my remarks down here for tracking purposes
It looks like there are a few actual collisions; they seem to be the known-colliding Google PDFs.
There are also a few of what look like false positives: most of the hashes are duplicates, save for one.
For these false positives, the hash in the database contains two 0x25 bytes next to one another; the "colliding" object only has a single 0x25 byte (and therefore a hash of the wrong length).
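A tiny sketch of what that mismatch looks like (the digest below is made up):

    # Hypothetical sha1 whose last two bytes happen to be 0x25 0x25.
    stored = bytes.fromhex("00112233445566778899aabbccddeeff00112525")
    assert len(stored) == 20  # valid sha1 length

    # If some layer collapses the two 0x25 bytes into one, the result is
    # one byte short, i.e. a hash of the wrong length.
    mangled = stored.replace(b"\x25\x25", b"\x25")
    print(len(mangled))  # 19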
Interpreted as ASCII, [0x25, 0x25] is "%%".
"%%" is an escape that can turn into "%" in a few situations (a short illustration follows the list):
- Python %-formatting (printf-style string interpolation)
- SQL LIKE operations
- "%" is also special in some HTTP operations; maybe something wrongly handles "%%" encoded as "%25%25" and ends up storing only "%".
Right now, I'm not convinced that the storage ever sees the objects with the mismatched hashes. To be more sure of that, I think we should make sure that all hash data in all exception arguments is passed as hex-encoded unicode strings, rather than as bytes objects left for Python to repr(); this would circumvent a lot of places where encoding or decoding the data in transit can go wrong.
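Something like this hypothetical helper, applied to everything that ends up in HashCollision's arguments (names are illustrative, not the actual swh.storage code):

    def hex_encode(data):
        """Recursively turn bytes into hex strings (hypothetical helper).

        If everything placed in HashCollision.args goes through this first,
        the exception only ever carries unicode strings, so no intermediate
        layer gets a chance to mis-encode or mis-decode raw bytes.
        """
        if isinstance(data, bytes):
            return data.hex()
        if isinstance(data, dict):
            return {k: hex_encode(v) for k, v in data.items()}
        if isinstance(data, (list, tuple)):
            return type(data)(hex_encode(v) for v in data)
        return data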
I'd also like to compare the time of the HashCollision event with the ctime of the object in the database; if the two timestamps are close to one another, it might be a sign that these collisions are the common insert/insert PostgreSQL race condition.
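Roughly this kind of check, with an arbitrary window (the timestamps and the 10-minute threshold below are made up):

    from datetime import datetime, timedelta, timezone

    def looks_like_insert_race(ctime, reported, window=timedelta(minutes=10)):
        """A HashCollision reported shortly after the stored object's ctime is
        consistent with two workers inserting the same content concurrently
        (the insert/insert race) rather than a real collision."""
        return abs(reported - ctime) <= window

    ctime = datetime(2020, 6, 15, 17, 58, tzinfo=timezone.utc)
    reported = datetime(2020, 6, 15, 18, 2, tzinfo=timezone.utc)
    print(looks_like_insert_race(ctime, reported))  # True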
Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
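For reference, the expected digest sizes; a defensive check could be as small as this (a sketch only; as it turns out below, the Postgres schema already enforces it):

    # Digest sizes in bytes for the hash algorithms stored on content objects.
    HASH_LENGTHS = {"sha1": 20, "sha1_git": 20, "sha256": 32, "blake2s256": 32}

    def check_hash_lengths(content):
        """Reject a content dict whose hashes don't have the expected length
        (illustrative only, not the actual storage code)."""
        for algo, expected in HASH_LENGTHS.items():
            value = content.get(algo)
            if value is not None and len(value) != expected:
                raise ValueError(f"{algo} is {len(value)} bytes, expected {expected}")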
> To be more sure of that, I think we should make sure that all hash data in all exception arguments is passed as hex-encoded unicode strings, rather than as bytes objects left for Python to repr(); this would circumvent a lot of places where encoding or decoding the data in transit can go wrong.
> I'd also like to compare the time of the HashCollision event with the ctime of the object in the database; if the two timestamps are close to one another, it might be a sign that these collisions are the common insert/insert PostgreSQL race condition.
cooking...
cooked...
They are quite close indeed (on the order of minutes apart).
$619 updated: the stored content has its ctime displayed, and the Sentry one has a 'date-reported-by-sentry' field, which corresponds to the dateCreated field in the Sentry event API call output.
> Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
That's the case.
It's declared in the pg schema [1]
and it's installed in the db.
$ psql service=swh
softwareheritage=> \dT+ blake2s256
                                          List of data types
 Schema |    Name    | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------------+---------------+------+----------+------------+-------------------+-------------
 public | blake2s256 | blake2s256    | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1
                                       List of data types
 Schema | Name | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+------+---------------+------+----------+------------+-------------------+-------------
 public | sha1 | sha1          | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha1_git
                                         List of data types
 Schema |   Name   | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+----------+---------------+------+----------+------------+-------------------+-------------
 public | sha1_git | sha1_git      | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \dT+ sha256
                                        List of data types
 Schema |  Name  | Internal name | Size | Elements |   Owner    | Access privileges | Description
--------+--------+---------------+------+----------+------------+-------------------+-------------
 public | sha256 | sha256        | var  |          | swhstorage |                   |
(1 row)

softwareheritage=> \d content
                                          Table "public.content"
   Column   |           Type           | Collation | Nullable |                  Default
------------+--------------------------+-----------+----------+--------------------------------------------
 sha1       | sha1                     |           | not null |
 sha1_git   | sha1_git                 |           | not null |
 sha256     | sha256                   |           | not null |
 length     | bigint                   |           | not null |
 ctime      | timestamp with time zone |           | not null | now()
 status     | content_status           |           | not null | 'visible'::content_status
 object_id  | bigint                   |           | not null | nextval('content_object_id_seq'::regclass)
 blake2s256 | blake2s256               |           |          |
Indexes:
    "content_pkey" PRIMARY KEY, btree (sha1)
    "content_object_id_idx" UNIQUE, btree (object_id)
    "content_sha1_git_idx" UNIQUE, btree (sha1_git)
    "content_blake2s256_idx" btree (blake2s256)
    "content_ctime_idx" btree (ctime)
    "content_sha256_idx" btree (sha256)
Publications:
    "softwareheritage"
> In #2332, @ardumont wrote:
> Finally, we should make sure that the storage implementations reject objects with hashes of the wrong length. I'm /almost/ sure that's the case, but we should be sure of it.
An interesting experiment: disabling the proxy buffer storage in the nixguix loader configuration.
And the number of HashCollision events dropped to 0 (no new event for that loader since yesterday around 6pm our time).