the scrubber DB stores a list of SWHID ranges that were checked
the postgresql checker queries the postgresql DB directly, to get all SWHIDs in a range
To read efficiently from Cassandra, we should use ranges of token(sha1, sha1_git, sha256, blake2s256) (for content and skipped_content) and token(sha1_git) (for other tables), where token is a Murmur3 hash (a non-cryptographic hash function). This means we have to change this design. The minimal change I see is to make it so that:
the scrubber DB stores a list of SWHID ranges that were checked (for postgresql), Murmur3 ranges (for cassandra), and sha1/sha256 ranges for the future objstorage checker
the postgresql checker is unchanged: queries the postgresql DB directly, to get all SWHIDs in a range and reports it in its table
the cassandra checker queries the cassandra DB directly, to get all objects within a Murmur3 range and reports it in its table
ditto for the future objstorage checker
However, this is going to multiply the number of tables, and I am not a huge fan of the scrubber breaking the swh-storage abstraction by querying the DB directly; but with this design we cannot really do otherwise, as batch queries to Cassandra require Cassandra-specific knowledge from the swh-storage caller.
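To make the Murmur3 range idea concrete, here is a rough sketch of how a cassandra checker could split the token ring and build the corresponding range query. The helper names (token_ranges, range_query) are made up for illustration and are not part of swh-storage or swh-scrubber; the only thing taken as given is that Cassandra's Murmur3 partitioner yields signed 64-bit tokens.

```python
# Hypothetical helpers, not existing swh code.
# Cassandra's Murmur3Partitioner produces signed 64-bit tokens.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def token_ranges(nb_ranges):
    """Split the whole token ring into nb_ranges contiguous, inclusive ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible tokens
    for i in range(nb_ranges):
        start = MIN_TOKEN + (total * i) // nb_ranges
        end = MIN_TOKEN + (total * (i + 1)) // nb_ranges - 1
        yield (start, end)


def range_query(table, key_columns):
    """Build a CQL query selecting all rows whose token is in a given range.

    The two %s placeholders are meant to be filled with the range bounds
    by whatever driver executes the query.
    """
    keys = ", ".join(key_columns)
    return (
        f"SELECT * FROM {table} "
        f"WHERE token({keys}) >= %s AND token({keys}) <= %s"
    )


ranges = list(token_ranges(4))
content_query = range_query(
    "content", ["sha1", "sha1_git", "sha256", "blake2s256"]
)
```

The scrubber DB would then only need to record which (start, end) pairs have been checked, which is exactly the Cassandra-specific knowledge mentioned above.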
The scrubber DB would store these (partition_id, nb_partitions) pairs, instead of SWHIDs (which is more opaque, but datastore-independent)
The postgresql and cassandra checker would query these endpoints
the objstorage would emulate it or something
The main downside is that the scrubber DB now has to store somewhat-opaque integers instead of SWHIDs, but we would need to do that anyway for Cassandra (with completely opaque integers). On the other hand, it might make it easier to compute scrubbing progress in pure SQL.
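As a rough illustration of the "progress in pure SQL" point, assuming a hypothetical checked_partition table (the table and column names are invented here, and sqlite3 only stands in for the real scrubber DB so the sketch runs on its own):

```python
import sqlite3

# Stand-in for the scrubber DB; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE checked_partition (
        datastore TEXT NOT NULL,
        object_type TEXT NOT NULL,
        partition_id INTEGER NOT NULL,
        nb_partitions INTEGER NOT NULL,
        PRIMARY KEY (datastore, object_type, partition_id, nb_partitions)
    )
    """
)

# Pretend 4 out of 16 content partitions and 1 out of 8 revision
# partitions were already checked.
rows = [("cassandra", "content", i, 16) for i in range(4)]
rows.append(("cassandra", "revision", 0, 8))
conn.executemany("INSERT INTO checked_partition VALUES (?, ?, ?, ?)", rows)

# Fraction of partitions checked, per object type, for one datastore.
PROGRESS_QUERY = """
    SELECT object_type, COUNT(*) * 1.0 / nb_partitions AS progress
    FROM checked_partition
    WHERE datastore = ?
    GROUP BY object_type, nb_partitions
    ORDER BY object_type
"""
progress = conn.execute(PROGRESS_QUERY, ("cassandra",)).fetchall()
```

With SWHID ranges, the same computation would need arithmetic on hash bounds; with (partition_id, nb_partitions) pairs it is a plain COUNT.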
A minor issue is that content_get_partition is meant to have low limits and to materialize its results immediately, as it is meant to be used over blocking queries, whereas the scrubber would benefit from long lazy iterators. I don't see how to address that unless we add streaming support to our RPC framework. Maybe by adding *_iter_partition endpoints instead of (or in addition to) *_get_partition, which would not be available through the RPC layer?
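For what it's worth, the shape of such an *_iter_partition function can be sketched on top of a paginated *_get_partition. This assumes the paginated endpoint takes page_token and limit arguments and returns an object with results and next_page_token attributes; Page and fake_get_partition below are stand-ins so the example is runnable, not real swh-storage code. A wrapper like this is still one blocking query per page, so it shows the interface only, not actual streaming.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Stand-in for the paged result of a *_get_partition endpoint."""
    results: List[str]
    next_page_token: Optional[str]


def iter_partition(get_partition, partition_id, nb_partitions, limit=1000):
    """Lazily yield every object of a partition, fetching page by page."""
    page_token = None
    while True:
        page = get_partition(
            partition_id, nb_partitions, page_token=page_token, limit=limit
        )
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


def fake_get_partition(partition_id, nb_partitions, page_token=None, limit=1000):
    """Fake backend: each partition holds 5 objects, paginated by offset."""
    objects = [f"obj-{partition_id}-{i}" for i in range(5)]
    start = int(page_token or 0)
    end = start + limit
    next_token = str(end) if end < len(objects) else None
    return Page(objects[start:end], next_token)


objects = list(iter_partition(fake_get_partition, 2, 16, limit=2))
```

A backend-specific implementation (not going through RPC) could replace the loop with a real lazy cursor while keeping the same signature.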