[cassandra] Validate the replayed data

added Datastore Scrubber cassandra labels

marked this issue as related to #4373 (closed)

changed milestone to %Cassandra in production as primary storage [Roadmap - Tooling and infrastructure]

mentioned in commit swh/infra/ci-cd/swh-charts@3b30c28d

@vlorentz I have no idea how to deploy the scrubber, or what the impact would be on the scrubbers currently running against the production postgresql.

Is there any documentation somewhere explaining the data model and the configuration options?

added 2h of time spent at 2023-03-20

Starting from the postgresql config:

https://gitlab.softwareheritage.org/swh/infra/puppet/puppet-swh-site/-/blob/bab2d2e97a90e28c24f4283e2183997150ba22a2/data/common/common.yaml#L2090-209

You can do this:

      scrubber_db:
        cls: postgresql
        db: "%{alias('swh::deploy::scrubber::db::config')}"
      storage:
        cls: cassandra
        hosts: ...
        keyspace: ...
        objstorage:
          cls: noop

The other half of the configuration is to give it ranges in the CLI, as is done here: https://gitlab.softwareheritage.org/swh/infra/puppet/puppet-swh-site/-/blob/bab2d2e97a90e28c24f4283e2183997150ba22a2/site-modules/profile/manifests/swh/deploy/scrubber/checker/postgres.pp

Except that the new version uses --start-partition-id/--end-partition-id/--nb-partitions instead of --start_object/--end_object to split work across processes. So, for example, if you wanted 8 processes checking revisions in parallel with 64k checkpoints, you would run the CLI 8 times with these parameters:

--object-type revision --nb-partitions 65536 --start-partition-id 0 --end-partition-id 8192
--object-type revision --nb-partitions 65536 --start-partition-id 8192 --end-partition-id 16384
--object-type revision --nb-partitions 65536 --start-partition-id 16384 --end-partition-id 24576
--object-type revision --nb-partitions 65536 --start-partition-id 24576 --end-partition-id 32768
--object-type revision --nb-partitions 65536 --start-partition-id 32768 --end-partition-id 40960
--object-type revision --nb-partitions 65536 --start-partition-id 40960 --end-partition-id 49152
--object-type revision --nb-partitions 65536 --start-partition-id 49152 --end-partition-id 57344
--object-type revision --nb-partitions 65536 --start-partition-id 57344 --end-partition-id 65536
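The ranges above are simply the partition space cut into equal contiguous slices, so they can be generated rather than written by hand. Here is a minimal Python sketch of that arithmetic. The helper names `partition_ranges` and `cli_args` are hypothetical and not part of swh-scrubber, and the sketch assumes, based on the example above, that each range is half-open (start included, end excluded). It only prints the option strings and does not invoke the scrubber.

```python
def partition_ranges(nb_partitions: int, nb_workers: int) -> list[tuple[int, int]]:
    """Split [0, nb_partitions) into nb_workers contiguous ranges.

    Assumption: ranges are half-open, i.e. the end id of one worker
    is the start id of the next, as in the example in this thread.
    """
    if nb_workers <= 0 or nb_partitions < nb_workers:
        raise ValueError("need 0 < nb_workers <= nb_partitions")
    # Integer arithmetic keeps the slices contiguous even when
    # nb_partitions is not a multiple of nb_workers.
    bounds = [i * nb_partitions // nb_workers for i in range(nb_workers + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def cli_args(object_type: str, nb_partitions: int, nb_workers: int) -> list[str]:
    """Build one option string per worker (hypothetical helper)."""
    return [
        f"--object-type {object_type} --nb-partitions {nb_partitions} "
        f"--start-partition-id {start} --end-partition-id {end}"
        for start, end in partition_ranges(nb_partitions, nb_workers)
    ]


if __name__ == "__main__":
    # 8 workers over 65536 partitions, as in the example above.
    for line in cli_args("revision", 65536, 8):
        print(line)
```

Changing the number of workers or the object type then only means changing the arguments of `cli_args`, as long as --nb-partitions stays the same for every process checking a given object type.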

For some reason the documentation isn't showing on https://docs.softwareheritage.org/devel/apidoc/swh.scrubber.cli.html but you can see it in the docstrings here:

https://gitlab.softwareheritage.org/swh/devel/swh-scrubber/-/blob/229c7f4f64c788f045492d5ddd2a9cd97fe9ecd2/swh/scrubber/cli.py

mentioned in merge request swh/devel/swh-scrubber!39 (merged)

Thanks. It seems the next legacy deployment in staging/production will need some adjustments to support these new parameters.

I guess we need to wait for the current production to be upgraded to have the latest version of the database.

Btw, I don't see a new migration script, what will happen to the current data?

> Btw, I don't see a new migration script, what will happen to the current data?

Drop the checked_range table (which was corrupted by ZFS issues) and run the initialization scripts to create checked_partition and its indexes.

Ack. We should remove all the upgrade scripts to avoid a mismatch between what is in the database and what the upgrade scripts expect (or, better, have a 5.sql perform the migration).