
Compare revisions

Commits on Source (97)
# Changes here will be overwritten by Copier
_commit: v0.3.3
_src_path: https://gitlab.softwareheritage.org/swh/devel/swh-py-template.git
description: Software Heritage datastore scrubber
distribution_name: swh-scrubber
have_cli: true
have_workers: false
package_root: swh/scrubber
project_name: swh.scrubber
python_minimal_version: '3.7'
readme_format: rst
# python: Reformat code with black 22.3.0
# python: Reformat code with black
73eee4e307c59c4930a97cab9f787ae8e284b163
972edfbca7888c14d54fef681b4a06328dbe7677
*.egg-info/
*.pyc
*.sw?
*~
.coverage
.eggs/
.hypothesis
.mypy_cache
.tox
__pycache__
build/
dist/
version.txt
.mypy_cache/
.vscode/
# these are symlinks created by a hook in swh-docs' main sphinx conf.py
docs/README.rst
docs/README.md
# this should be a symlink for people who want to build the sphinx doc
# without using tox, generally created by the swh-env/bin/update script
docs/Makefile.sphinx
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.4.0
hooks:
- id: trailing-whitespace
- id: check-json
- id: check-yaml
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: trailing-whitespace
- id: check-json
- id: check-yaml
- repo: https://gitlab.com/pycqa/flake8
rev: 3.8.3
hooks:
- id: flake8
- repo: https://github.com/python/black
rev: 25.1.0
hooks:
- id: black
- repo: https://github.com/codespell-project/codespell
rev: v1.16.0
hooks:
- id: codespell
- repo: https://github.com/PyCQA/isort
rev: 6.0.0
hooks:
- id: isort
- repo: local
hooks:
- id: mypy
name: mypy
entry: mypy
args: [swh]
pass_filenames: false
language: system
types: [python]
- repo: https://github.com/pycqa/flake8
rev: 7.1.1
hooks:
- id: flake8
additional_dependencies: [flake8-bugbear==24.12.12, flake8-pyproject]
# unfortunately, we are far from being able to enable this...
# - repo: https://github.com/PyCQA/pydocstyle.git
# rev: 4.0.0
# hooks:
# - id: pydocstyle
# name: pydocstyle
# description: pydocstyle is a static analysis tool for checking compliance with Python docstring conventions.
# entry: pydocstyle --convention=google
# language: python
# types: [python]
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
- id: codespell
name: Check source code spelling
additional_dependencies:
- tomli
stages: [pre-commit]
- id: codespell
name: Check commit message spelling
additional_dependencies:
- tomli
stages: [commit-msg]
- repo: https://github.com/PyCQA/isort
rev: 5.5.2
hooks:
- id: isort
- repo: https://github.com/python/black
rev: 22.3.0
hooks:
- id: black
- repo: local
hooks:
- id: mypy
name: mypy
entry: mypy
args: [swh]
pass_filenames: false
language: system
types: [python]
- id: twine-check
name: twine check
description: call twine check when pushing an annotated release tag
entry: bash -c "ref=$(git describe) &&
[[ $ref =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]] &&
(python3 -m build --sdist && twine check $(ls -t dist/* | head -1)) || true"
pass_filenames: false
stages: [pre-push]
language: python
additional_dependencies: [twine, build]
......@@ -6,7 +6,7 @@ In the interest of fostering an open and welcoming environment, we as Software
Heritage contributors and maintainers pledge to making participation in our
project and our community a harassment-free experience for everyone, regardless
of age, body size, disability, ethnicity, sex characteristics, gender identity
and expression, level of experience, education, socio-economic status,
and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
......
include Makefile
include requirements*.txt
include version.txt
include README.md
recursive-include swh py.typed
docs/README.rst
\ No newline at end of file
Software Heritage - Datastore Scrubber
======================================
Tools to periodically check data integrity in ``swh-storage``, ``swh-objstorage``
and ``swh-journal``, report errors, and (try to) fix them.
The Scrubber package is made of the following parts:
Checking
--------
Highly parallel processes continuously read objects from a data store,
compute checksums, and write any failure in a database, along with the data of
the corrupt object.
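
As an illustration of the check described above, here is a minimal sketch that recomputes a checksum for each object and reports the mismatches. It is a simplification under stated assumptions: ``StoredObject`` and ``find_corrupt`` are hypothetical names, and a plain SHA-1 of the raw bytes stands in for the identifier computation, which in Software Heritage is done by ``swh.model`` on a canonical serialization of each object type.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stand-in for an object read back from a datastore."""

    expected_id: bytes  # identifier the datastore claims for this object
    data: bytes  # raw bytes actually stored


def find_corrupt(objects: Iterable[StoredObject]) -> Iterator[StoredObject]:
    """Yield every object whose recomputed SHA-1 differs from its identifier."""
    for obj in objects:
        if hashlib.sha1(obj.data).digest() != obj.expected_id:
            yield obj


# One intact object and one whose payload was altered after hashing.
good = StoredObject(hashlib.sha1(b"hello").digest(), b"hello")
bad = StoredObject(hashlib.sha1(b"hello").digest(), b"hellp")
corrupt = list(find_corrupt([good, bad]))
```

A real checker would write each yielded object, together with its data, to the scrubber database rather than collecting it in a list.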
There is one "checker" for each datastore package: storage (postgresql and cassandra),
journal (kafka), and object storage (any backends).
The journal is "crawled" using its native streaming; others are crawled by range,
reusing swh-storage's backfiller utilities, and checkpointed from time to time
to the scrubber's database (in the ``checked_range`` table).
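
Crawling by range can be pictured as splitting the identifier space into ``--nb-partitions`` contiguous slices and remembering which slices are done. The sketch below assumes 20-byte (SHA-1 sized) identifiers and uses hypothetical helper names; the actual scrubber reuses swh-storage's backfiller utilities and stores its progress in the ``checked_range`` table, whose schema is not shown here, so a plain set stands in for it.

```python
from typing import Optional, Set, Tuple

ID_BITS = 160  # assumption: identifiers are 20 bytes long


def partition_bounds(partition_id: int, nb_partitions: int) -> Tuple[int, int]:
    """Return the half-open [start, end) integer range covered by one partition."""
    if not 0 <= partition_id < nb_partitions:
        raise ValueError("partition_id out of range")
    space = 1 << ID_BITS
    start = partition_id * space // nb_partitions
    end = (partition_id + 1) * space // nb_partitions
    return start, end


def resume_point(checked: Set[int], nb_partitions: int) -> Optional[int]:
    """Return the first partition not yet recorded as checked, or None if all are."""
    return next((p for p in range(nb_partitions) if p not in checked), None)


first = partition_bounds(0, 65536)
last = partition_bounds(65535, 65536)
next_todo = resume_point({0, 1, 3}, 4)
```

Because every bound is computed from the same formula, the end of one partition is exactly the start of the next, so the slices cover the whole space without gaps or overlap.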
Storage
+++++++
For the storage checker, a checking configuration must be created before being
able to spawn a number of checkers.
A new configuration is created using the ``swh scrubber check init`` tool:
.. code-block:: console
$ swh scrubber check init storage --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql
.. note::
A configuration file is expected, as for most ``swh`` tools.
This file must have a ``scrubber`` section with the configuration of
the scrubber database. For storage checking operations, this
configuration file must also have a ``storage`` configuration section.
See the `swh-storage documentation`_ for more details on this. A
typical configuration file could look like:
.. code-block:: yaml

   scrubber:
     cls: postgresql
     db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
   storage:
     cls: postgresql
     db: service=swh
     objstorage:
       cls: noop

.. note::
The configuration section ``scrubber_db`` has been renamed as
``scrubber`` in ``swh-scrubber`` version 2.0.0
One or more checking workers can then be spawned with the ``swh scrubber
check run`` command:
.. code-block:: console
$ swh scrubber check run chk-snp
[...]
Object storage
++++++++++++++
As with the storage checker, a checking configuration must be created before
being able to spawn a number of checkers.
A new configuration is created using the ``swh scrubber check init`` tool:
.. code-block:: console
$ swh scrubber check init objstorage --object-type content --nb-partitions 65536 --name check-contents
Created configuration check-contents [3] for checking content in datastore objstorage remote
.. note::
A configuration file is expected, as for most ``swh`` tools.
This file must have a ``scrubber`` section with the configuration of
the scrubber database. For object storage checking operations, this
configuration file must have:
- a ``storage`` configuration section if content ids are read from it (default)
- a ``journal`` configuration section if content ids are read from a kafka content
topic (this requires the ``--use-journal`` flag of the ``swh scrubber check run``
command)
- an ``objstorage`` configuration section targeting the object storage to check
See the `swh-storage documentation`_, `swh-objstorage documentation`_ and
`swh-journal documentation`_ for more details on this. A typical configuration
file could look like:
.. code-block:: yaml

   scrubber:
     cls: postgresql
     db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
   storage:
     cls: postgresql
     db: service=swh
     objstorage:
       cls: noop
   journal:
     cls: kafka
     brokers:
       - broker1.journal.softwareheritage.org:9093
       - broker2.journal.softwareheritage.org:9093
       - broker3.journal.softwareheritage.org:9093
       - broker4.journal.softwareheritage.org:9093
     group_id: swh.scrubber
     prefix: swh.journal.objects
     on_eof: stop
   objstorage:
     cls: remote
     url: https://objstorage.softwareheritage.org/

By default, an object storage checker detects both missing and corrupted contents.
To disable detection of missing contents, use the ``--no-check-references``
option of the ``swh scrubber check init`` command.
To disable detection of corrupted contents, use the ``--no-check-hashes``
option of the ``swh scrubber check init`` command.
One or more checking workers can then be spawned with the ``swh scrubber
check run`` command:
- if the content ids must be read from a storage instance
.. code-block:: console
$ swh scrubber check run check-contents
[...]
- if the content ids must be read from a kafka content topic of ``swh-journal``
.. code-block:: console
$ swh scrubber check run check-contents --use-journal
[...]
Journal
+++++++
As with the other checkers, a checking configuration must be created before being
able to spawn a number of checkers.
A new configuration is created using the ``swh scrubber check init`` tool:
.. code-block:: console
$ swh scrubber check init journal --object-type directory --name check-dirs-journal
Created configuration check-dirs-journal [4] for checking directory in datastore journal kafka
.. note::
A configuration file is expected, as for most ``swh`` tools.
This file must have a ``scrubber`` section with the configuration of
the scrubber database. For journal checking operations, this
configuration file must also have a ``journal`` configuration section.
See the `swh-journal documentation`_ for more details on this.
A typical configuration file could look like:
.. code-block:: yaml

   scrubber:
     cls: postgresql
     db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
   journal:
     cls: kafka
     brokers:
       - broker1.journal.softwareheritage.org:9093
       - broker2.journal.softwareheritage.org:9093
       - broker3.journal.softwareheritage.org:9093
       - broker4.journal.softwareheritage.org:9093
     group_id: swh.scrubber
     prefix: swh.journal.objects
     on_eof: stop

One or more checking workers can then be spawned with the ``swh scrubber
check run`` command:
.. code-block:: console
$ swh scrubber check run check-dirs-journal
[...]
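
The configuration requirements stated in the three notes above can be summarised in a small pre-flight check. This is only a sketch: ``missing_sections`` is a hypothetical helper, not part of the ``swh scrubber`` CLI, and it looks only at the presence of top-level sections in an already-parsed configuration mapping.

```python
from typing import Any, List, Mapping


def missing_sections(
    config: Mapping[str, Any], datastore: str, use_journal: bool = False
) -> List[str]:
    """List the top-level sections a checker of ``datastore`` needs but lacks."""
    required = ["scrubber"]  # the scrubber database is always needed
    if datastore == "storage":
        required.append("storage")
    elif datastore == "journal":
        required.append("journal")
    elif datastore == "objstorage":
        # Content ids come from the storage by default, or from kafka
        # when the checker is run with --use-journal.
        required.append("journal" if use_journal else "storage")
        required.append("objstorage")
    else:
        raise ValueError(f"unknown datastore: {datastore!r}")
    return [name for name in required if name not in config]


storage_ok = missing_sections({"scrubber": {}, "storage": {}}, "storage")
objstorage_gaps = missing_sections({"scrubber": {}}, "objstorage", use_journal=True)
```

Here ``storage_ok`` is empty, while ``objstorage_gaps`` names the ``journal`` and ``objstorage`` sections that would still have to be added.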
Recovery
--------
Then, from time to time, jobs go through the list of known corrupt objects,
and try to recover the original objects, through various means:
* Brute-forcing variations until they match their checksum
* Recovering from another data store
* As a last resort, recovering from known origins, if any
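
To make the first strategy concrete, the sketch below tries every single-bit variation of a corrupt payload until one matches the expected checksum. It is an assumption-laden illustration, not the scrubber's recovery code: the function name is hypothetical, only one-bit flips are explored, and a plain SHA-1 of the raw bytes stands in for the real identifier computation.

```python
import hashlib
from typing import Optional


def recover_single_bit_flip(corrupt: bytes, expected_sha1: bytes) -> Optional[bytes]:
    """Try every one-bit variation of ``corrupt``; return the one matching the checksum."""
    candidate = bytearray(corrupt)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).digest() == expected_sha1:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit  # restore it before moving on
    return None


original = b"software heritage"
expected = hashlib.sha1(original).digest()
damaged = bytearray(original)
damaged[3] ^= 0x10  # simulate a single flipped bit
recovered = recover_single_bit_flip(bytes(damaged), expected)
```

The search costs one hash per bit of the object, so it is only practical for small objects and narrow classes of damage; anything else has to fall back on the other two strategies.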
Reinjection
-----------
Finally, when an original object is recovered, it is reinjected in the original
data store, replacing the corrupt one.
.. _`swh-storage documentation`: https://docs.softwareheritage.org/devel/swh-storage/index.html
.. _`swh-objstorage documentation`: https://docs.softwareheritage.org/devel/swh-objstorage/index.html
.. _`swh-journal documentation`: https://docs.softwareheritage.org/devel/swh-journal/index.html
pytest_plugins = ["swh.storage.pytest_plugin", "swh.core.db.pytest_plugin"]
pytest_plugins = [
"swh.storage.pytest_plugin",
"swh.graph.pytest_plugin",
"swh.objstorage.pytest_plugin",
]
include ../../swh-docs/Makefile.sphinx
include Makefile.sphinx
Software Heritage - Datastore Scrubber
======================================
Tools to periodically check data integrity in swh-storage and swh-objstorage,
report errors, and (try to) fix them.
This is a work in progress; some of the components described below do not
exist yet (cassandra storage checker, objstorage checker, recovery, and reinjection).
The Scrubber package is made of the following parts:
Checking
--------
Highly parallel processes continuously read objects from a data store,
compute checksums, and write any failure in a database, along with the data of
the corrupt object.
There is one "checker" for each datastore package: storage (postgresql and cassandra),
journal (kafka), and objstorage.
Recovery
--------
Then, from time to time, jobs go through the list of known corrupt objects,
and try to recover the original objects, through various means:
* Brute-forcing variations until they match their checksum
* Recovering from another data store
* As a last resort, recovering from known origins, if any
Reinjection
-----------
Finally, when an original object is recovered, it is reinjected in the original
data store, replacing the corrupt one.
../README.rst
\ No newline at end of file
.. _swh-scrubber-cli:
Command-line interface
======================
.. click:: swh.scrubber.cli:scrubber_cli_group
:prog: swh scrubber
:nested: full
......@@ -6,10 +6,14 @@
:maxdepth: 2
:caption: Contents:
cli
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
.. only:: standalone_package_doc
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
[mypy]
namespace_packages = True
warn_unused_ignores = True
# 3rd party libraries without stubs (yet)
[mypy-dulwich.*]
ignore_missing_imports = True
[mypy-pkg_resources.*]
ignore_missing_imports = True
[mypy-psycopg2.*]
ignore_missing_imports = True
[mypy-pytest.*]
ignore_missing_imports = True
# [mypy-add_your_lib_here.*]
# ignore_missing_imports = True
[project]
name = "swh.scrubber"
authors = [
{name="Software Heritage developers", email="swh-devel@inria.fr"},
]
description = "Software Heritage datastore scrubber"
readme = {file = "README.rst", content-type = "text/x-rst"}
requires-python = ">=3.7"
classifiers = [
"Programming Language :: Python :: 3",
"Intended Audience :: Developers",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: OS Independent",
"Development Status :: 3 - Alpha",
]
dynamic = ["version", "dependencies", "optional-dependencies"]
[tool.setuptools.packages.find]
include = ["swh.*"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt", "requirements-swh.txt"]}
[tool.setuptools.dynamic.optional-dependencies]
testing = {file = ["requirements-test.txt"]}
[project.entry-points."swh.cli.subcommands"]
"swh.scrubber" = "swh.scrubber.cli"
[project.entry-points."swh.scrubber.classes"]
"postgresql" = "swh.scrubber.db:ScrubberDb"
[project.urls]
"Homepage" = "https://gitlab.softwareheritage.org/swh/devel/swh-scrubber"
"Bug Reports" = "https://gitlab.softwareheritage.org/swh/devel/swh-scrubber/-/issues"
"Funding" = "https://www.softwareheritage.org/donate"
"Documentation" = "https://docs.softwareheritage.org/devel/swh-scrubber/"
"Source" = "https://gitlab.softwareheritage.org/swh/devel/swh-scrubber.git"
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"
[tool.setuptools_scm]
fallback_version = "0.0.1"
[tool.black]
target-version = ['py37']
target-version = ['py39', 'py310', 'py311', 'py312']
[tool.isort]
multi_line_output = 3
......@@ -9,3 +56,40 @@ use_parentheses = true
ensure_newline_before_comments = true
line_length = 88
force_sort_within_sections = true
known_first_party = ['swh']
[tool.codespell]
ignore-words-list = "mor"
[tool.mypy]
namespace_packages = true
warn_unused_ignores = true
explicit_package_bases = true
# ^ Needed for mypy to detect py.typed from swh packages installed
# in editable mode
plugins = []
# 3rd party libraries without stubs (yet)
# [[tool.mypy.overrides]]
# module = [
# "package1.*",
# "package2.*",
# ]
# ignore_missing_imports = true
[tool.flake8]
select = ["C", "E", "F", "W", "B950"]
ignore = [
"E203", # whitespaces before ':' <https://github.com/psf/black/issues/315>
"E231", # missing whitespace after ','
"E501", # line too long, use B950 warning from flake8-bugbear instead
"W503", # line break before binary operator <https://github.com/psf/black/issues/52>
"E704", # multiple statements on one line (def)
]
max-line-length = 88
[tool.pytest.ini_options]
norecursedirs = "build docs .*"
asyncio_mode = "strict"
consider_namespace_packages = true
[pytest]
norecursedirs = build docs .*
asyncio_mode = strict
# Add here internal Software Heritage dependencies, one per line.
swh.core[http] >= 0.3 # [http] is required by swh.core.pytest_plugin
swh.core[http] >= 3.6.1
swh.loader.git >= 1.4.0
swh.model >= 5.0.0
swh.storage >= 1.1.0
swh.journal >= 0.9.0
swh.model >= 6.13.0
swh.storage >= 2.0.0
swh.journal >= 1.3.0
swh.objstorage >= 2.9.2
pytest
msgpack
pytest >= 8.1
pytest-mock
pyyaml
swh.core[testing] >= 3.0.0
swh.graph
types-pyyaml
msgpack-types
......@@ -2,3 +2,7 @@
# should match https://pypi.python.org/pypi names. For the full spec or
# dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html
dulwich
humanize
psycopg2
tenacity