[cassandra] Validate the replayed data
We should ensure all the data was correctly imported into Cassandra.
To do so, the scrubbers should be executed at least once to detect any inconsistency compared to Kafka.
The replayer error reporter should also be deployed, to track the import errors.
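At its core, the validation reads each object back from the journal and checks that the Cassandra and PostgreSQL backends hold an identical representation of it. A minimal sketch of such a comparison loop, with the backend lookups passed in as callables; the names and signatures here are assumptions for illustration, not the actual swh code:

```python
import logging
from typing import Any, Callable, Iterable, Optional

logger = logging.getLogger(__name__)

# Hypothetical lookup type: maps (object_type, object id) to the
# backend's representation of that object, or None if it is missing.
Getter = Callable[[str, bytes], Optional[Any]]


def compare_objects(
    object_type: str,
    journal_objects: Iterable[dict],
    cassandra_get: Getter,
    postgres_get: Getter,
) -> int:
    """Compare each journal object against both backends.

    Returns the number of objects whose journal, cassandra and
    postgres representations do not all agree.
    """
    mismatches = 0
    for journal_obj in journal_objects:
        obj_id = journal_obj["id"]
        in_cassandra = cassandra_get(object_type, obj_id)
        in_postgres = postgres_get(object_type, obj_id)
        if journal_obj != in_cassandra or journal_obj != in_postgres:
            mismatches += 1
            logger.warning(
                "%s %s differs: cassandra=%r postgres=%r",
                object_type, obj_id.hex(), in_cassandra, in_postgres,
            )
    return mismatches
```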
Plan:

- Develop a CLI to easily compare the journal/cassandra/postgres representations
- Debug issues
- Refactor
- Let it run directly on a machine first
- Package and deploy
- Deploy in staging (3 replicas for now)
- Monitor
  - steady workload [1]
  - kafka ETA in the future ;) [2]
- Deploy in production (6 replicas)
- Analyze and fix papercuts
  - RawExtrinsicMetadata comparison
  - Content: check all hashes
  - OriginVisit [0] (not fixed; further analysis required to know what to do)
  - OriginVisitStatus ~> not reproduced
  - SkippedContent (null?) ~> no issue so far
  - Fetch content by each hash (so 4 queries per content); consider it None if any is missing (see the sketch after this list)
  - Adapt comparison behavior for the various object types
  - Add debug logging instructions and the means to turn them on at deployment (sketched after this list)
  - Dump all representations on disk so we can compare what went wrong during comparison, if any problem actually exists (see the dump sketch after this list)
  - Add a script to analyze the on-disk representations after the fact
- swh-charts: Deploy per object type
  - Deploy through swh-charts in staging
    - Issue with ceph volume mounting... (too many small files)
    - Use a local path on the rancher node instead
  - Fix remaining issue on the journal client
  - Deploy
- Monitor
- (ongoing) Synthesis of problematic objects: https://hedgedoc.softwareheritage.org/Kv0sxpfPQN-wDPhC8xhzBw
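For the "fetch content by each hash" item above, a minimal sketch of the intended check, assuming a hypothetical `content_get_by_hash(algo, hash)` lookup that returns None when the backend has no row for that hash:

```python
from typing import Any, Callable, Optional

# The four hash columns a Content is indexed by.
HASH_ALGOS = ("sha1", "sha1_git", "sha256", "blake2s256")


def fetch_content(
    content: dict,
    content_get_by_hash: Callable[[str, bytes], Optional[Any]],
) -> Optional[Any]:
    """Fetch a content by each of its hashes (4 queries per content).

    The content is considered None (missing) if any single lookup
    comes back empty.
    """
    rows = []
    for algo in HASH_ALGOS:
        row = content_get_by_hash(algo, content[algo])
        if row is None:
            return None  # any missing hash ~> whole content counts as missing
        rows.append(row)
    return rows[0]  # all four lookups found the content
```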
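For turning debug logs on at deployment without repackaging, one common pattern is to drive the log level from an environment variable that the chart can set; the variable name here is an assumption, not the deployed one:

```python
import logging
import os

# Read the log level from the environment so the deployment
# (e.g. the helm chart) can flip it to DEBUG without a code change.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    level=os.environ.get("SWH_SCRUBBER_LOG_LEVEL", "INFO").upper(),
)
logging.getLogger(__name__).debug("debug logging is enabled")
```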
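For the on-disk dumps and the after-the-fact analysis script, something along these lines would work; the dump directory and JSON layout are assumptions, not what was actually deployed:

```python
import json
from pathlib import Path

# Assumed dump location; the deployed path may differ.
DUMP_DIR = Path("/srv/softwareheritage/comparison-dumps")


def dump_mismatch(object_type: str, obj_id: bytes, representations: dict) -> None:
    """Write the journal/cassandra/postgres views of one object to disk."""
    target = DUMP_DIR / object_type / f"{obj_id.hex()}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # default=repr so bytes and model objects still serialize somehow
    target.write_text(json.dumps(representations, indent=2, default=repr))


def iter_mismatches(object_type: str):
    """Reload the dumped representations for offline analysis."""
    for path in sorted((DUMP_DIR / object_type).glob("*.json")):
        yield path.stem, json.loads(path.read_text())
```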
[0] The date of the origin is older in the journal than in the backend (same date between cassandra and postgresql); see #4707 (comment 167740)
[1] https://grafana.softwareheritage.org/goto/0UosGx0Sk?orgId=1
[2] https://grafana.softwareheritage.org/goto/VhGLMbASz?orgId=1