Recompute class to trigger adding/updating hash checksums in storage
Production ready.
Related #692 (closed)
Depends swh-model!3 (closed)
Migrated from D186 (view on Phabricator)
Add a storage primary_key option to generically check for corruption
This no longer fetches metadata from contents. The contents passed as parameter should be self-contained and consistent with the 'primary_key' configuration option. As of now, that primary key is 'sha1'.
Unknown or corrupted contents are skipped.
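To illustrate the corruption check described above, here is a minimal sketch, not the code from this diff. It assumes each content is a dict carrying its raw bytes under a hypothetical 'data' key and its declared hash under the primary key name; `is_consistent` and `filter_valid` are illustrative names only.

```python
import hashlib

PRIMARY_KEY = "sha1"  # assumed to mirror the 'primary_key' configuration option


def is_consistent(content, primary_key=PRIMARY_KEY):
    """Return True when content['data'] hashes to the declared primary key."""
    declared = content.get(primary_key)
    data = content.get("data")  # hypothetical key holding the raw blob
    if declared is None or data is None:
        return False  # unknown content: nothing to verify against
    return hashlib.new(primary_key, data).digest() == declared


def filter_valid(contents, primary_key=PRIMARY_KEY):
    """Yield only the contents whose data matches their primary key."""
    for content in contents:
        if is_consistent(content, primary_key):
            yield content  # unknown or corrupted contents are skipped
```

Because the hash name is looked up through `hashlib.new`, switching the primary key to another algorithm supported by hashlib would not require changing the check itself.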
Add some other improvements:
- docstring clarification
- Add a batch_size to retrieve contents' blobs
- Add another batch_size for the update contents operations
- Align option names with _ as per usual conventions
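The two batch sizes mentioned above can be pictured with a small grouping helper. This is a sketch under assumptions: `batched` is a hypothetical helper, not necessarily the one used in the diff, and how the retrieval and update batches are nested is not stated in this description.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

With such a helper, blobs could be fetched in small groups (for example 10 at a time) while the resulting updates are accumulated and sent in larger groups (for example 100 at a time), which keeps memory use bounded on the retrieval side.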
Rebase
- Recompute class to trigger an add/update hash checksums in storage
- Open task to recompute checksums
- Add a storage primary_key option to generically check for corruption
- Use boolean as default value
- Fix wrong configuration key
- refactoring: Adapt according to review
- Respect semantics regarding hashes (re)computation
- swh.indexer.tasks: Rename queue name to swh_indexer_rehash
This is now production ready (I updated the initial diff description accordingly).
Manual testing modus operandi:
- Injected a repository using the git-loader.
- Altered the db schema to add a new column:
      alter table content add column blake2b512 bytea null;
- Computed the list of sha1s from the schema:
      psql service=swh-dev -c "copy (select sha1 from content) to stdin" | sed -e 's/^\\\\x//g' > sha1s
- Checked the number of contents before the batch update:
      select count(*) from content;
      3991
      select count(*) from content where blake2b512 is null;
      3991
- Fed those sha1s for updates.
Configuration file ~/.config/swh/storage/rehash.yml:
    storage:
      cls: remote
      args:
        url: http://localhost:5002/
    compute_checksums:
      - blake2b512
    recompute_checksums: false
    batch_size_retrieve_content: 10
    batch_size_update: 100
And trigger the run:
cat sha1s | python3 -m swh.indexer.producer --task-name rehash --dict-with-key sha1
- Check everything has been updated when done:
      softwareheritage-dev=# select count(*) from content where blake2b512 is null;
       count
      -------
           0
      (1 row)

      softwareheritage-dev=# select count(*) from content where blake2b512 is not null;
       count
      -------
        3991
      (1 row)
Conclusion: working \o/
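The test above relies on the compute_checksums and recompute_checksums options. As a rough sketch of the semantics they suggest (an assumption, not the implementation in this diff): with recompute_checksums set to false, only checksums missing from a content are computed; with it set to true, every listed checksum is computed again. The `HASHERS` table and both function names are hypothetical.

```python
import hashlib

# Hypothetical mapping from checksum names to hash functions.
HASHERS = {
    "sha1": lambda data: hashlib.sha1(data).digest(),
    "sha256": lambda data: hashlib.sha256(data).digest(),
    # BLAKE2b with a 64-byte (512-bit) digest
    "blake2b512": lambda data: hashlib.blake2b(data, digest_size=64).digest(),
}


def checksums_to_compute(content, compute_checksums, recompute_checksums):
    """Select which of the configured checksums need computing."""
    if recompute_checksums:
        return list(compute_checksums)  # recompute everything listed
    # Otherwise only fill in the checksums the content does not have yet.
    return [name for name in compute_checksums if content.get(name) is None]


def rehash(content, data, compute_checksums, recompute_checksums=False):
    """Return a dict of the newly computed checksums for one content."""
    names = checksums_to_compute(content, compute_checksums, recompute_checksums)
    return {name: HASHERS[name](data) for name in names}
```

Under this reading, the run above (recompute_checksums: false, all 3991 rows with a null blake2b512) would compute blake2b512 for every content exactly once, and a second run would find nothing left to do.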
mentioned in commit 67be6108
Some references in the commit message have been migrated:
- T692 is now #692 (closed)