Software Heritage - Datastore Scrubber

Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.

This is a work in progress; some of the components described below do not exist yet (Cassandra storage checker, objstorage checker, recovery, and reinjection).

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute their checksums, and record any failure in a database, along with the data of the corrupt object.
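The core of a checker can be sketched as follows. This is a minimal illustration, not the scrubber's actual code: the function name `find_corrupt_objects` is hypothetical, and a plain SHA-1 over the raw bytes stands in for whatever checksum the data store really uses.

```python
import hashlib
from typing import Iterable, List, Tuple


def find_corrupt_objects(
    objects: Iterable[Tuple[str, bytes]],
) -> List[Tuple[str, bytes]]:
    """Return (expected_id, data) pairs whose SHA-1 does not match their ID.

    Hypothetical helper: the ID is assumed to be the hex SHA-1 of the data.
    """
    failures = []
    for expected_id, data in objects:
        actual_id = hashlib.sha1(data).hexdigest()
        if actual_id != expected_id:
            # Keep the corrupt bytes so a later recovery step can inspect them.
            failures.append((expected_id, data))
    return failures


good = b"hello"
store = [
    (hashlib.sha1(good).hexdigest(), good),      # intact object
    (hashlib.sha1(good).hexdigest(), b"hellp"),  # corrupted payload
]
corrupt = find_corrupt_objects(store)
```

Here `corrupt` holds only the second entry, together with its corrupt bytes, mirroring the idea of storing the failure along with the object's data.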

There is one "checker" for each datastore package: storage (postgresql and cassandra), journal (kafka), and objstorage.

The journal is "crawled" using its native streaming; others are crawled by range, reusing swh-storage's backfiller utilities, and checkpointed from time to time to the scrubber's database (in the checked_range table).
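Range-based crawling with checkpoints can be pictured with the sketch below. It makes several assumptions: object IDs are treated as 160-bit integers, an in-memory dictionary stands in for the checked_range table, and a checkpoint is written after every partition rather than "from time to time". The names `partition_bounds` and `crawl` are hypothetical, and the sketch does not use swh-storage's backfiller utilities.

```python
from typing import Callable, Dict, Tuple

ID_BITS = 160  # assumption: object IDs are SHA-1 sized integers


def partition_bounds(partition: int, nb_partitions: int) -> Tuple[int, int]:
    """Return the [start, end) bounds of one partition of the ID space."""
    space = 1 << ID_BITS
    start = partition * space // nb_partitions
    end = (partition + 1) * space // nb_partitions
    return start, end


def crawl(
    nb_partitions: int,
    checked: Dict[int, Tuple[int, int]],
    check_range: Callable[[int, int], None],
) -> None:
    """Check every partition not yet checkpointed, then record it as done."""
    for partition in range(nb_partitions):
        if partition in checked:
            continue  # already checkpointed by an earlier run
        start, end = partition_bounds(partition, nb_partitions)
        check_range(start, end)
        # Checkpoint: a restarted worker will skip this partition.
        checked[partition] = (start, end)


# Partition 0 was checked by a previous run; only 1..3 are visited now.
checked = {0: partition_bounds(0, 4)}
visited = []
crawl(4, checked, lambda start, end: visited.append((start, end)))
```

Because progress is recorded per range, an interrupted worker loses at most the range it was working on.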

Storage

For the storage checker, a checking configuration must be created before any checkers can be spawned.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

One or more checking workers can then be spawned by using the swh scrubber check storage command:

$ swh scrubber check storage chk-snp
[...]

Note

A configuration file is expected, as for most swh tools.
This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop

Note

The configuration section scrubber_db was renamed to scrubber in swh-scrubber version 2.0.0.

Recovery

Then, from time to time, jobs go through the list of known corrupt objects, and try to recover the original objects, through various means:

  • Brute-forcing variations until they match their checksum
  • Recovering from another data store
  • As a last resort, recovering from known origins, if any
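The first strategy, brute-forcing variations, can be sketched as below. The text does not say which variations are tried, so this example assumes the simplest case: a single flipped bit, with a plain SHA-1 as the checksum. The function name `recover_by_bit_flip` is hypothetical.

```python
import hashlib
from typing import Optional


def recover_by_bit_flip(corrupt: bytes, expected_sha1: str) -> Optional[bytes]:
    """Try every single-bit flip of `corrupt`; return the variant whose
    SHA-1 matches `expected_sha1`, or None if no variant matches."""
    candidate = bytearray(corrupt)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).hexdigest() == expected_sha1:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit  # undo before the next attempt
    return None


original = b"tree 1234"
expected = hashlib.sha1(original).hexdigest()
# Simulate corruption: flip one bit in the first byte.
damaged = bytes([original[0] ^ 0x04]) + original[1:]
recovered = recover_by_bit_flip(damaged, expected)
```

The search costs eight hash computations per byte of the object, so it is only practical for small objects or narrow classes of corruption; anything wider would fall through to the other strategies in the list.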

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.