Implement storage scrubber for cassandra

changed milestone to %Regularly scrub journal, storage, and objstorage [Roadmap - Preserve]

added activity::Implementation label

assigned to @vlorentz

added Storage manager label

added #4686 (closed) as child task

added #4687 (closed) as child task

added priority:High label

mentioned in merge request swh-storage!1019 (closed)

Current design:

the scrubber DB stores a list of SWHID ranges that were checked
the postgresql checker queries the postgresql DB directly, to get all SWHIDs in a range

To read efficiently from Cassandra, we should use ranges of token(sha1, sha1_git, sha256, blake2s256) (for content and skipped_content) and token(sha1_git) (for other tables), where token is a non-cryptographic Murmur3 hash function. This means we have to change this design. The minimal change I see is to make it so that:

the scrubber DB stores a list of SWHID ranges that were checked (for postgresql), Murmur3 ranges (for cassandra), and sha1/sha256 ranges for the future objstorage checker
the postgresql checker is unchanged: queries the postgresql DB directly, to get all SWHIDs in a range and reports it in its table
the cassandra checker queries the cassandra DB directly, to get all objects within a Murmur3 range and reports it in its table
ditto for the future objstorage checker
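
As a rough illustration of what "Murmur3 ranges" could look like, here is a minimal sketch that splits the token ring into contiguous ranges. It assumes Cassandra's Murmur3Partitioner, whose tokens are signed 64-bit integers; the function name token_ranges and the even split are illustrative assumptions, not something decided in this issue.

```python
from typing import List, Tuple

# Token bounds of Cassandra's Murmur3Partitioner (signed 64-bit integers).
TOKEN_BEGIN = -(2**63)
TOKEN_END = 2**63 - 1


def token_ranges(nb_partitions: int) -> List[Tuple[int, int]]:
    """Split the token ring into nb_partitions contiguous, inclusive ranges.

    Hypothetical helper: each (start, end) pair could be used in a query
    such as `WHERE token(...) >= start AND token(...) <= end`.
    """
    if nb_partitions <= 0:
        raise ValueError("nb_partitions must be positive")
    ring_size = TOKEN_END - TOKEN_BEGIN + 1  # 2**64 possible tokens
    ranges = []
    for i in range(nb_partitions):
        start = TOKEN_BEGIN + (ring_size * i) // nb_partitions
        end = TOKEN_BEGIN + (ring_size * (i + 1)) // nb_partitions - 1
        ranges.append((start, end))
    return ranges
```

Each pair is what the scrubber DB would have to store for Cassandra under this design, which is why those integers are completely opaque compared to SWHID ranges.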

However, this is going to multiply tables, and I am not a huge fan of the scrubber breaking the swh-storage abstraction by querying the DB directly; but with this design we cannot really do otherwise, as batch queries to Cassandra require Cassandra-specific knowledge from the swh-storage caller.

Instead, we could redesign it like this:

For every table, add a *_get_partition endpoint to swh-storage, similar to the existing one for contents: content_get_partition(partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → PagedResult[Content, str]
The scrubber DB would store these (partition_id, nb_partitions) pairs instead of SWHIDs (the pairs are more opaque, but datastore-independent)
The postgresql and cassandra checkers would query these endpoints
The objstorage checker would emulate it in some way
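
The redesign above could be sketched roughly as follows. This is only an illustration under assumptions: PagedResult is a local stand-in with results and next_page_token fields, FakeStorage and its object_get_partition method are hypothetical placeholders for a backend exposing a *_get_partition endpoint, and is_valid stands for whatever integrity check the scrubber runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PagedResult:
    """Local stand-in for a paged result (assumed shape)."""
    results: List[int] = field(default_factory=list)
    next_page_token: Optional[str] = None


class FakeStorage:
    """Hypothetical in-memory backend; objects are plain integers."""

    def __init__(self, objects):
        self.objects = sorted(objects)

    def object_get_partition(self, partition_id, nb_partitions,
                             page_token=None, limit=1000):
        # Toy partitioning scheme: object value modulo nb_partitions.
        in_partition = [o for o in self.objects
                        if o % nb_partitions == partition_id]
        offset = int(page_token) if page_token is not None else 0
        next_offset = offset + limit
        next_token = (str(next_offset)
                      if next_offset < len(in_partition) else None)
        return PagedResult(in_partition[offset:next_offset], next_token)


def check_partition(storage, partition_id: int, nb_partitions: int,
                    is_valid: Callable[[int], bool],
                    limit: int = 1000) -> List[int]:
    """Fetch every page of one partition and return the corrupt objects."""
    corrupt = []
    page_token = None
    while True:
        page = storage.object_get_partition(
            partition_id, nb_partitions, page_token=page_token, limit=limit)
        corrupt.extend(o for o in page.results if not is_valid(o))
        page_token = page.next_page_token
        if page_token is None:
            return corrupt


storage = FakeStorage(range(20))
# Pretend object 9 is corrupt; it falls in partition 1 of 4.
corrupt = check_partition(storage, 1, 4, lambda o: o != 9, limit=2)
checked = {(1, 4)}  # what the scrubber DB would record for this run
```

The checker only ever sees (partition_id, nb_partitions) and a page token, so the same loop would work against either backend.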

The main downside is that the scrubber DB now has to store somewhat-opaque integers instead of SWHIDs, but it would need to do that anyway for Cassandra (with completely opaque integers). On the other hand, it might make it easier to compute scrubbing progress in pure SQL.
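
To make the "progress in pure SQL" point concrete, here is a small sketch. The table checked_partition and its columns are hypothetical (no schema is given in this issue), and sqlite3 is used only as a self-contained stand-in for the scrubber's real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per partition that was fully checked.
conn.execute(
    """
    CREATE TABLE checked_partition (
        object_type   TEXT    NOT NULL,
        partition_id  INTEGER NOT NULL,
        nb_partitions INTEGER NOT NULL,
        PRIMARY KEY (object_type, partition_id, nb_partitions)
    )
    """
)
conn.executemany(
    "INSERT INTO checked_partition VALUES (?, ?, ?)",
    [("revision", i, 8) for i in (0, 1, 2, 5)] + [("release", 0, 4)],
)


def progress(conn):
    """Fraction of partitions checked, per (object_type, nb_partitions)."""
    rows = conn.execute(
        """
        SELECT object_type, nb_partitions,
               COUNT(*) * 1.0 / nb_partitions AS fraction
        FROM checked_partition
        GROUP BY object_type, nb_partitions
        ORDER BY object_type
        """
    )
    return {(obj_type, nb): fraction for obj_type, nb, fraction in rows}


result = progress(conn)
```

Because nb_partitions is stored next to each partition_id, the fraction is a single COUNT divided by a column value, with no need to interpret the partition identifiers themselves.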

A minor issue is that content_get_partition is meant to have low limits and to materialize the results immediately, as it's meant to be used over blocking queries, whereas the scrubber would benefit from long lazy iterators. I don't see how to address that unless we add streaming support to our RPC framework. Maybe by adding *_iter_partition endpoints instead of (or in addition to) *_get_partition, which would not be available through the RPC layer?
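
For comparison, one way to get a lazy iterator without streaming RPC is a client-side generator over the paged endpoint. This is a sketch of that workaround, not of a backend-native *_iter_partition; iter_partition, fake_get_partition and the (results, next_page_token) tuple shape are assumptions made for the example.

```python
from itertools import islice
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[int], Optional[str]]  # (results, next_page_token)


def iter_partition(get_partition: Callable[..., Page], partition_id: int,
                   nb_partitions: int, limit: int = 1000) -> Iterator[int]:
    """Lazily yield objects of a partition, fetching one page at a time."""
    page_token = None
    while True:
        results, page_token = get_partition(
            partition_id, nb_partitions, page_token=page_token, limit=limit)
        yield from results
        if page_token is None:
            return


calls = []  # records the page_token of every request made to the backend


def fake_get_partition(partition_id, nb_partitions,
                       page_token=None, limit=1000):
    """Hypothetical paged endpoint over the integers 0..99."""
    calls.append(page_token)
    objects = list(range(partition_id, 100, nb_partitions))
    offset = int(page_token or 0)
    next_offset = offset + limit
    next_token = str(next_offset) if next_offset < len(objects) else None
    return objects[offset:next_offset], next_token


# Only the pages actually needed by the consumer are requested.
first_three = list(islice(iter_partition(fake_get_partition, 0, 10, limit=2), 3))
```

This keeps memory bounded by the page size, but each page is still a separate blocking round trip, so it does not replace a true streaming or backend-side iterator.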

We (@olasd, @douardda, and I) decided to go with the redesign. We won't need to worry about migrating current data, because we already lost it to faulty hardware (swh/infra/sysadm-environment#4727 (closed))

It's actually not clear that we've lost that data after all (only an index), but either way, yeah, we can start from scratch anyway.

mentioned in issue swh-storage#4672 (closed)

mentioned in merge request swh-storage!1034 (merged)

mentioned in merge request swh-storage!1026 (merged)

mentioned in merge request !35 (merged)

mentioned in merge request !36 (merged)

closed with merge request !36 (merged)

mentioned in merge request !37 (closed)
