Skip to content
Snippets Groups Projects
Forked from Platform / Development / swh-scrubber
73 commits behind the upstream repository.
Jérémy Bobbio (Lunar)'s avatar
Jérémy Bobbio (Lunar) authored
GitLab will display the content of the README file when browsing the
repository. But in case the file is a symlink, it will display the
path pointed by the symlink. There is a 6 year old issue about this:
https://gitlab.com/gitlab-org/gitlab/-/issues/15093

We can workaround the issue by having the content at the root of the
repository and a symlink to this file in the `docs/` directory.

Tested in swh/devel/swh-py-template!27
cdc1cb97
History

Software Heritage - Datastore Scrubber

Tools to periodically checks data integrity in swh-storage and swh-objstorage, reports errors, and (try to) fix them.

This is a work in progress; some of the components described below do not exist yet (cassandra storage checker, objstorage checker, recovery, and reinjection)

The Scrubber package is made of the following parts:

Checking

Highly parallel processes continuously read objects from a data store, compute checksums, and write any failure in a database, along with the data of the corrupt object.

There is one "checker" for each datastore package: storage (postgresql and cassandra), journal (kafka), and objstorage.

The journal is "crawled" using its native streaming; others are crawled by range, reusing swh-storage's backfiller utilities, and checkpointed from time to time to the scrubber's database (in the checked_range table).

Recovery

Then, from time to time, jobs go through the list of known corrupt objects, and try to recover the original objects, through various means:

  • Brute-forcing variations until they match their checksum
  • Recovering from another data store
  • As a last resort, recovering from known origins, if any

Reinjection

Finally, when an original object is recovered, it is reinjected in the original data store, replacing the corrupt one.