Software Heritage - Datastore Scrubber
Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.
This is a work in progress; some of the components described below do not exist yet (Cassandra storage checker, objstorage checker, recovery, and reinjection).
The Scrubber package is made of the following parts:
Checking
Highly parallel processes continuously read objects from a data store, compute checksums, and write any failure to a database, along with the data of the corrupt object.
There is one "checker" for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and objstorage.
The journal is "crawled" using its native streaming; the others are crawled by range, reusing swh-storage's backfiller utilities, and checkpointed from time to time in the scrubber's database (in the checked_range table).
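The range-based crawling and checkpointing described above can be sketched as follows. This is a simplified, in-memory illustration: the function and variable names (check_partition, scrub, checked_ranges, corrupt_log) are hypothetical stand-ins, not the real swh-scrubber API, and SHA-1 stands in for the actual object checksums.

```python
import hashlib

def check_partition(objects, expected_checksums, corrupt_log):
    """Recompute each object's checksum and compare it with the recorded
    one; corrupt objects are kept, payload included, for later recovery."""
    for obj_id, data in objects.items():
        recomputed = hashlib.sha1(data).hexdigest()
        if recomputed != expected_checksums[obj_id]:
            corrupt_log[obj_id] = data  # record the corrupt payload

def scrub(partitions, expected_checksums, checked_ranges, corrupt_log):
    """Walk partitions in order, skipping the ones already checkpointed
    (mirroring the role of the checked_range table)."""
    for part_id, objects in enumerate(partitions):
        if part_id in checked_ranges:
            continue  # already checked in a previous run
        check_partition(objects, expected_checksums, corrupt_log)
        checked_ranges.add(part_id)  # checkpoint after each partition
```

Checkpointing per partition is what lets an interrupted checker resume where it left off instead of rescanning the whole datastore.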
Storage
For the storage checker, a checking configuration must be created before any checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql
One (or more) checking workers can then be spawned using the swh scrubber check storage command:
$ swh scrubber check storage chk-snp
[...]
Note
- A configuration file is expected, as for most swh tools.
- This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:
scrubber:
cls: postgresql
db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
storage:
cls: postgresql
db: service=swh
objstorage:
cls: noop
Note
The configuration section scrubber_db has been renamed to scrubber in swh-scrubber version 2.0.0.
Recovery
Then, from time to time, jobs go through the list of known corrupt objects and try to recover the original objects through various means:
- Brute-forcing variations until they match their checksum
- Recovering from another data store
- As a last resort, recovering from known origins, if any
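The first of these strategies, brute-forcing variations, can be illustrated with a minimal sketch. It assumes the corruption is a single flipped bit and that SHA-1 is the checksum being matched; real corruption may require a much broader search, and the function name is hypothetical, not part of swh-scrubber.

```python
import hashlib

def brute_force_bit_flips(corrupt, expected_sha1):
    """Try every single-bit flip of the corrupt payload and return the
    first variant whose SHA-1 matches the expected checksum, or None."""
    for i in range(len(corrupt)):
        for bit in range(8):
            candidate = bytearray(corrupt)
            candidate[i] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).hexdigest() == expected_sha1:
                return bytes(candidate)
    return None  # no single-bit flip restores the checksum
```

The search space grows quickly with the number of corrupted bits, which is why recovering from another data store or a known origin is preferred when available.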
Reinjection
Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.
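A minimal sketch of that step, assuming a dict-like object store and SHA-1 checksums; the function name and the corrupt_log structure are illustrative, not the real swh-scrubber API. The key point is re-verifying the checksum before overwriting anything.

```python
import hashlib

def reinject(store, obj_id, recovered, expected_sha1, corrupt_log):
    """Replace the corrupt copy with the recovered one, but only after
    re-verifying its checksum; then drop it from the corruption log."""
    if hashlib.sha1(recovered).hexdigest() != expected_sha1:
        raise ValueError("recovered object does not match its checksum")
    store[obj_id] = recovered      # overwrite the corrupt copy
    corrupt_log.pop(obj_id, None)  # object is no longer corrupt
```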