Forked from Platform / Development / swh-scrubber
46 commits behind the upstream repository.
David Douard authored
These tables used to reference the datastore in which the invalid/missing
object was found, but did not keep the config entry, i.e. the checking
session during which the invalid/missing object was found. This can be an
issue when more than one checking session is executed on a given datastore.

This replaces the `datastore` field of the `corrupt_object`,
`missing_object` and `missing_object_reference` tables with `config_id`.

Adapt all the code accordingly.

Note that this slightly changes the CLI usage: the kafka checker now needs
a config entry, thus a kafka checking session can only target a given
object type (i.e. one kafka topic).

The migration script will fill the config_id column for corrupt_object
using the check_config entry that matches the object_type (of
corrupt_object) and datastore. For missing_object and
missing_object_reference, it will use the latter table to identify the
check_config entry corresponding to the object type of the reference_id
and the datastore, since it is a checking session on this object type that
will generate a missing object entry (which is generally not of the same
type). For the missing_object table, the config_id will be the one
extracted from missing_object_reference (joining on the missing_id
column).

Note that the migration script will fail if there are rows in one
of these tables for which there exists more than one possible
config_entry (i.e. with the same object_type and datastore).
bd8e324c

Software Heritage - Datastore Scrubber

Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.

This is a work in progress; some of the components described below do not exist yet (cassandra storage checker, objstorage checker, recovery, and reinjection).

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute checksums, and record any failure in a database, along with the data of the corrupt object.
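The core of such a check can be sketched in a few lines of Python. This is only an illustration, not the scrubber's code: it assumes objects are plain byte blobs identified by their Git-style SHA-1, and the names sha1_git and find_corrupt are hypothetical. The real checkers rely on the Software Heritage object model and cover more object types.

```python
import hashlib
from typing import Iterable, List, Tuple


def sha1_git(data: bytes) -> bytes:
    """Git-style blob hash: SHA-1 over a "blob <length>NUL" header plus the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).digest()


def find_corrupt(objects: Iterable[Tuple[bytes, bytes]]) -> List[Tuple[bytes, bytes]]:
    """Return the (id, data) pairs whose recomputed hash differs from the stored id.

    Hypothetical helper: a real checker would write these pairs to the
    scrubber database instead of returning them.
    """
    return [(obj_id, data) for obj_id, data in objects if sha1_git(data) != obj_id]
```

Keeping the corrupt data next to the expected identifier is what makes a later recovery attempt possible.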

There is one "checker" for each datastore package: storage (postgresql and cassandra), journal (kafka), and objstorage.

The journal is "crawled" using its native streaming; others are crawled by range, reusing swh-storage's backfiller utilities, and checkpointed from time to time to the scrubber's database (in the checked_range table).
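To give an idea of what crawling by range with checkpoints involves, here is a minimal sketch. It assumes a 160-bit identifier space split evenly into partitions, and a set of already-checked partition numbers standing in for the scrubber's database; partition_bounds and pending_partitions are hypothetical names, and the actual partitioning logic lives in swh-storage and may differ.

```python
from typing import Iterable, List, Tuple


def partition_bounds(
    partition_id: int, nb_partitions: int, id_bits: int = 160
) -> Tuple[int, int]:
    """Return the [start, end) integer bounds of one partition of the id space.

    Assumption: ids are uniformly distributed id_bits-wide integers and the
    space is split into nb_partitions contiguous, near-equal ranges.
    """
    if not 0 <= partition_id < nb_partitions:
        raise ValueError("partition_id out of range")
    space = 1 << id_bits
    start = partition_id * space // nb_partitions
    end = (partition_id + 1) * space // nb_partitions
    return start, end


def pending_partitions(nb_partitions: int, checked: Iterable[int]) -> List[int]:
    """List the partitions not yet checkpointed, so a restarted worker skips done ones."""
    done = set(checked)
    return [p for p in range(nb_partitions) if p not in done]
```

With 65536 partitions, as in the example below, each range covers the identifiers sharing one 16-bit prefix under this assumption.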

Storage

For the storage checker, a checking configuration must be created before being able to spawn a number of checkers.

A new configuration is created using the swh scrubber check init tool:

$ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql

One or more checking workers can then be spawned by using the swh scrubber check storage command:

$ swh scrubber check storage chk-snp
[...]

Note

A configuration file is expected, as for most swh tools.
This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop

Note

The configuration section scrubber_db has been renamed to scrubber in swh-scrubber version 2.0.0.

Recovery

Then, from time to time, jobs go through the list of known corrupt objects, and try to recover the original objects, through various means:

  • Brute-forcing variations until they match their checksum
  • Recovering from another data store
  • As a last resort, recovering from known origins, if any
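The first strategy can be illustrated with a small sketch. It assumes the corruption is a single flipped bit and that the checksum is a plain SHA-1 of the data; both are assumptions made for the example, and single_bit_flips and recover are hypothetical names, since the recovery component does not exist yet.

```python
import hashlib
from typing import Callable, Iterator, Optional


def single_bit_flips(data: bytes) -> Iterator[bytes]:
    """Yield every variation of data that differs from it by exactly one bit."""
    for i in range(len(data) * 8):
        candidate = bytearray(data)
        candidate[i // 8] ^= 1 << (i % 8)
        yield bytes(candidate)


def sha1_digest(data: bytes) -> bytes:
    """Stand-in checksum; a real object id would be computed differently."""
    return hashlib.sha1(data).digest()


def recover(
    corrupt: bytes,
    expected: bytes,
    checksum: Callable[[bytes], bytes] = sha1_digest,
) -> Optional[bytes]:
    """Return the first variation whose checksum matches expected, or None."""
    for candidate in single_bit_flips(corrupt):
        if checksum(candidate) == expected:
            return candidate
    return None
```

The search is linear in the number of bits for single flips; wider classes of variations grow quickly, which is one reason to fall back on the other two strategies.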

Reinjection

Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.