Commits · 7e475942d2525619c3a5113e69a49b96720ba2b8 · Platform / Development / swh-scrubber

Mar 13, 2024

pytest: Fix tests execution with pytest 8.1 · 7e475942

Remove the --import-mode=importlib pytest option and use the
new consider_namespace_packages option to fix test execution
with the latest pytest release.

7e475942

Feb 06, 2024

Improve the db test to check maximum partition number is not exceeded · 3acaef34

David Douard authored 1 year ago

The checked_partition_iter_next() iterator, which returns the next
partition to check, should never return a partition number exceeding
the maximum number of partitions in the config, nor should it add
such a partition to the database.
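
The invariant can be illustrated with a small in-memory sketch. This is not the scrubber's implementation: `partition_iter_next` and the `checked` set are hypothetical stand-ins for `checked_partition_iter_next()` and the database table.

```python
from typing import Iterator, Set


def partition_iter_next(nb_partitions: int, checked: Set[int]) -> Iterator[int]:
    """Yield partition ids not yet recorded, always below nb_partitions.

    `checked` is a hypothetical stand-in for the rows stored in the database.
    """
    for partition_id in range(nb_partitions):
        if partition_id in checked:
            continue
        # Only ids produced by range(nb_partitions) are ever recorded,
        # so nothing out of bounds can reach the "database".
        checked.add(partition_id)
        yield partition_id


checked: Set[int] = {0, 2}
assigned = list(partition_iter_next(4, checked))
```

Once every partition is recorded, the iterator is simply exhausted instead of handing out an id equal to or above `nb_partitions`.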

3acaef34

Feb 05, 2024
- tox: Bump mypy to 1.8.0 · 3f228448
  Antoine Lambert authored 1 year ago
  
  Related to swh/meta#5075.
  3f228448
Feb 02, 2024
- Update for pytest-postgresql >= 5 · dc79e4bb
  Nicolas Dandrimont authored 1 year ago
  
  v2.3.0
  
  dc79e4bb
Dec 05, 2023
- Add latest blackify to git-blame-ignore-revs · 6b201db3
  David Douard authored 1 year ago
  
  6b201db3
- python: Fix black formatting after bump to 23.1.0 in pre-commit · 972edfbc
  David Douard authored 1 year ago
  
  v2.2.0
  
  972edfbc
- codespell: ignore the word 'mor' (stands for 'missing object reference') · 986cef51
  David Douard authored 1 year ago
  
  986cef51
Dec 03, 2023
- Apply swh-py-template 0.1.6 · 49215497
  David Douard authored 1 year ago
  
  49215497
Nov 30, 2023
- journal_checker: Fix crash on duplicate directory entries · 5b9b6a5b
  vlorentz authored 1 year ago and vlorentz committed 1 year ago
  
  5b9b6a5b
- Migrate to copier-based swh-py-template · a84e8048
  David Douard authored 1 year ago
  
  a84e8048
Nov 24, 2023

tests/test_cli: Fix test_locate_origins with dev version of swh-graph · c9a960f0

Antoine Lambert authored 1 year ago

The test now requires a running swh-graph server; otherwise connection errors appear.

Use swh-graph NaiveClient to avoid spawning a real graph server during
the tests.

c9a960f0

Oct 17, 2023

Add a cli command to get statistics for a given config entry · c0a4d44b

David Douard authored 1 year ago

As well as a command to list partitions being checked.

For example:

```
$ swh scrubber check stats snapshot_16 -j
{
  "config": {
    "name": "snapshot_16",
    "datastore": {
      "package": "storage",
      "cls": "postgresql",
      "instance": "postgresql:///?service=swh-storage"
    },
    "object_type": "snapshot",
    "nb_partitions": 65536,
    "check_hashes": true,
    "check_references": true
  },
  "min_duration": 0.002196,
  "max_duration": 0.107398,
  "avg_duration": 0.005969,
  "checked_partition": 65536,
  "running_partition": 0,
  "missing_object": 0,
  "missing_object_reference": 0,
  "corrupt_object": 0
}

$ swh scrubber check running cfg1

Running partitions for cfg1 [id=1, type=snapshot]:
0:	running since today (20 minutes)

```

c0a4d44b

Oct 16, 2023
- cli: add support for check-hashes and check-references flags for `check init` command · 781f84a1
  David Douard authored 1 year ago
  
  v2.1.0
  
  781f84a1
Oct 12, 2023

journal_checker: Check the 'check_references' flag is not set · a20e6735
David Douard authored 1 year ago

a20e6735

Add support for 2 check config flags in the check_config table · 566db2ac

David Douard authored 1 year ago

These flags make it possible to configure a checking session that includes
only one of the two possible checks (hash computation and reference validation).
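
As a rough illustration, a config entry carrying the two flags could be modelled as below. The `CheckConfig` class and its `enabled_checks` helper are hypothetical; only the field names `check_hashes` and `check_references` are taken from the `check stats` output shown earlier in this log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckConfig:
    """Hypothetical mirror of a check_config row, not the project's real class."""

    name: str
    object_type: str
    nb_partitions: int
    check_hashes: bool = True
    check_references: bool = True

    def enabled_checks(self) -> List[str]:
        """List the checks a session using this config would run."""
        checks = []
        if self.check_hashes:
            checks.append("hashes")
        if self.check_references:
            checks.append("references")
        return checks


# A session that only recomputes hashes and skips reference validation.
hashes_only = CheckConfig("snapshot_16", "snapshot", 65536, check_references=False)
```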

566db2ac

run mypy with --install-types by default · c08c10b1

David Douard authored 1 year ago

This allows removing the dependency on types-pyyaml in the [testing]
extra.

c08c10b1

Refactor data model to use config_id instead of datastore in xxx_object tables · bd8e324c

David Douard authored 1 year ago

These tables used to reference the datastore in which the invalid/missing
object was found, but did not keep the config entry, i.e. the checking
session during which the invalid/missing object was found. This can be an
issue when more than one checking session is executed on a given datastore.

This replaces the `datastore` field of the `corrupt_object`,
`missing_object` and `missing_object_reference` tables with `config_id`.

Adapt all the code accordingly.

Note that this slightly changes the CLI usage: the kafka checker now needs
a config entry, so a kafka checking session can only target a given
object type (i.e. one kafka topic).

The migration script fills the config_id column of corrupt_object
using the check_config entry that matches the object_type (of
corrupt_object) and datastore. For missing_object and
missing_object_reference, it uses the latter table to identify the
check_config entry matching the object type of the reference_id and the
datastore, since it is a checking session on this object type that
generates a missing object entry (which is generally not of the same
type). For the missing_object table, the config_id is the one
extracted from missing_object_reference (joining on the missing_id
column).

Note that the migration script will fail if there are rows in one
of these tables for which there exists more than one possible
config_entry (i.e. with the same object_type and datastore).
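
The failure condition can be sketched in Python as follows. This is an illustration only: the actual migration is an SQL script, `config_id_for` is a hypothetical helper, and raising an error when no entry matches at all is an assumption not stated in the commit message.

```python
from typing import Iterable, Tuple

# Hypothetical check_config rows: (config_id, object_type, datastore).
ConfigRow = Tuple[int, str, str]


def config_id_for(object_type: str, datastore: str, configs: Iterable[ConfigRow]) -> int:
    """Return the single config id matching (object_type, datastore).

    Raises ValueError when the match is ambiguous (more than one entry) or,
    by assumption, when there is no matching entry.
    """
    matches = [
        config_id
        for config_id, otype, dstore in configs
        if (otype, dstore) == (object_type, datastore)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one check_config entry for "
            f"({object_type}, {datastore}), found {len(matches)}"
        )
    return matches[0]


configs = [(1, "snapshot", "pg"), (2, "directory", "pg"), (3, "directory", "pg")]
snapshot_config = config_id_for("snapshot", "pg", configs)
```

With the sample rows above, resolving a `directory` row fails because two entries share the same object type and datastore.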

bd8e324c

Sep 21, 2023

Fix flake8 config in pre-commit · 84cbe40b

David Douard authored 1 year ago

The flake8-bugbear dependency was missing, which effectively disabled the
line-too-long check.

84cbe40b

Aug 24, 2023

scrubber.cli: Ensure to retrieve the correct datastore configuration · 24e4b467

Antoine R. Dumont authored 1 year ago

Previously, in production, this would retrieve the configuration of the other backend as
those configurations are named the same.

Refs. #4696

Verified

24e4b467

db: Allow to retrieve a configuration per name and optional datastore · ceea59b3

Antoine R. Dumont authored 1 year ago

This avoids returning only the first one when multiple configurations with the same name
exist for different backends to scrub.

Refs. #4696

Verified

ceea59b3

Jul 26, 2023

Ignore .hypothesis folder · eaaeb597

Antoine R. Dumont authored 1 year ago

It pops up after the tests have run.

v2.0.2 Verified

eaaeb597

scrubber.cli: Fix raised exception typo · 0ef7e397

Antoine R. Dumont authored 1 year ago

This was found while deploying the new version.

Verified

0ef7e397

test: Fix test for older click version · 53970d52

Antoine R. Dumont authored 1 year ago

With older click versions (e.g. 7.0-1), the text wrapping can be different, so some
docstring text ends up in the parsed command list. Check that the expected commands
are present instead of comparing the whole list [1] [2].

Refs. swh/infra/sysadm-environment#4992

[1] 'defined ...' is part of the first line of the docstring for the "init" subcommand.

```
10:21:42  E       AssertionError: assert ['init', 'defined...', 'journal', 'list', 'stalled', 'storage'] == ['init', 'journal', 'list', 'stalled', 'storage']
10:21:42  E         At index 1 diff: 'defined...' != 'journal'
10:21:42  E         Left contains one more item: 'storage'
10:21:42  E         Full diff:
10:21:42  E         - ['init', 'journal', 'list', 'stalled', 'storage']
10:21:42  E         + ['init', 'defined...', 'journal', 'list', 'stalled', 'storage']
10:21:42  E         ?         ++++++++++++++
```

[2] https://jenkins.softwareheritage.org/view/swh-debian%20(draft)/job/debian/job/packages/job/DSCRUB/job/gbp-buildpackage/31/console
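
A minimal sketch of the more tolerant check, using the two lists from the failing assertion in [1]. The `missing_commands` helper is hypothetical and is not the code of the actual test.

```python
from typing import Iterable, List


def missing_commands(listed: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected command names that are absent from the listed ones."""
    listed_set = set(listed)
    return [name for name in expected if name not in listed_set]


# With older click, 'defined...' leaks from the "init" docstring into the list.
listed = ["init", "defined...", "journal", "list", "stalled", "storage"]
expected = ["init", "journal", "list", "stalled", "storage"]

# Extra entries are tolerated; only missing commands would fail the check.
assert missing_commands(listed, expected) == []
```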

Verified

53970d52

Jul 12, 2023
- Update the README file · a2646a71
  David Douard authored 1 year ago
  
  v2.0.0
  
  a2646a71
Jul 10, 2023

Rename the 'scrubber_db' config section as 'scrubber' · e879bd14

David Douard authored 1 year ago

This is needed to make it compatible with swh.core's db upgrade tooling:
the name of the configuration section is expected to be the name of the swh module.

e879bd14

fix the 5->6 upgrade sql script · d9f89378

David Douard authored 1 year ago

The index of the old checked_partition table needs to be dropped before
recreating the new one (with the same name); the simplest way of doing this
is to cascade-drop the old checked_partition table before recreating the new index.

d9f89378

Add a couple of tests in test_cli · 140a935e

David Douard authored 1 year ago

In particular, this tests that `swh scrubber check --help` works
when no configuration file is set.

140a935e

Add a `--reset` flag to the `swh scrubber check stalled` command · 87412380

David Douard authored 1 year ago

This flag resets the partitions identified as stalled by setting
start_date and end_date to NULL.

The reset partitions should then be eligible for selection by a
scrubber worker.

87412380

Add a `swh scrubber check stalled` command listing stalled partitions · 67a743d0

David Douard authored 1 year ago

For a given configuration (hence storage, object_type and partition scheme),
list partitions that have had a start_date but no end_date for a long enough
time.

By default, the delay after which a partition is considered
stalled is computed from the last 10 partitions checked for the given
configuration.
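
The commit message does not state how the delay is derived from those 10 partitions. The sketch below assumes one plausible rule (a multiple of the longest recent duration) purely for illustration; `stalled_delay`, `stalled_partitions` and the `factor` parameter are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple

# (partition_id, start_date, end_date); end_date is None while still running.
Partition = Tuple[int, Optional[datetime], Optional[datetime]]


def stalled_delay(
    durations: Sequence[timedelta], sample: int = 10, factor: float = 2.0
) -> timedelta:
    """Assumed rule: `factor` times the longest of the last `sample` durations."""
    recent = durations[-sample:]
    if not recent:
        raise ValueError("no checked partition to derive a delay from")
    return max(recent) * factor


def stalled_partitions(
    partitions: Sequence[Partition], delay: timedelta, now: datetime
) -> List[int]:
    """Ids of partitions started more than `delay` ago that never finished."""
    return [
        partition_id
        for partition_id, start, end in partitions
        if start is not None and end is None and now - start > delay
    ]


now = datetime(2023, 7, 10, 12, 0)
delay = stalled_delay([timedelta(minutes=m) for m in (1, 2, 3)])
partitions = [
    (0, now - timedelta(minutes=10), None),  # started long ago, never finished
    (1, now - timedelta(minutes=1), None),   # still within the delay
    (2, now - timedelta(minutes=30), now - timedelta(minutes=29)),  # finished
    (3, None, None),                         # never started
]
stalled = stalled_partitions(partitions, delay, now)
```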

67a743d0

Refactor the checker stack · 9cd7414a

David Douard authored 1 year ago

A checker configuration must now be created before a checker
session can be started. This configuration is stored in the
database and consists of a triplet

  (datastore, object_type, nb_partitions)

Once done, any number of checkers can be started for this specific
checker configuration; each checker process will check partitions
one by one, using the status stored in the database to get the next
partition number to check on the next iteration.

This makes it possible to dynamically adapt the number of checker processes.

For example, checking the snapshots, with the hash space split into 4096
partitions and 4 parallel workers, could look like:

  $ export SWH_CONFIG_FILENAME=config.yml
  $ swh scrubber check init --object-type snapshot --nb-partitions 4096 --name cfg-snp
  Created configuration cfg-snp [3] for checking snapshot in postgresql storage

  $ for i in {1..4}; do (swh scrubber check storage cfg-snp &); done
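
The way workers share progress through the database can be sketched with an in-memory SQLite table. The real scrubber uses PostgreSQL; the simplified schema and the `claim_next_partition` helper are assumptions, and the sketch omits the locking that concurrent processes would need.

```python
import sqlite3
from typing import Optional


def claim_next_partition(
    db: sqlite3.Connection, config_id: int, nb_partitions: int
) -> Optional[int]:
    """Reserve the next partition for a config; return None when all are taken."""
    with db:  # commit the claim (or roll back on error)
        (next_id,) = db.execute(
            "SELECT COALESCE(MAX(partition_id) + 1, 0) "
            "FROM checked_partition WHERE config_id = ?",
            (config_id,),
        ).fetchone()
        if next_id >= nb_partitions:
            return None
        # Recording start_date marks the partition as being checked.
        db.execute(
            "INSERT INTO checked_partition (config_id, partition_id, start_date) "
            "VALUES (?, ?, datetime('now'))",
            (config_id, next_id),
        )
        return next_id


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE checked_partition ("
    " config_id INTEGER, partition_id INTEGER,"
    " start_date TEXT, end_date TEXT,"
    " PRIMARY KEY (config_id, partition_id))"
)
# Each call plays the role of one worker iteration asking for work.
claims = [claim_next_partition(db, 3, 4) for _ in range(5)]
```

Because the next partition number comes from the shared table and not from per-worker arguments, workers can be added or removed at any time without reconfiguring the others.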

9cd7414a

Jul 07, 2023

Extract part of the checked_partition table in a new check_config one · 369341bc

David Douard authored 1 year ago

This new table stores the "configuration" for a scrubber. A
configuration consists of a set of:

  (datastore, object_type, nb_partitions)

This comes with a migration script; WARNING: this script needs to be
checked before deployment on a production-sized database. Any
activity on the database should be stopped before execution.

This is the first step of a series to make the scrubber easier to deploy
on elastic infrastructure.

369341bc

Fix mypy/click: add swh.core[testing] in requirements-test.txt · 673a6e61

David Douard authored 1 year ago

It now needs types-click, which is a dependency of
swh.core[testing].

673a6e61

Jun 21, 2023

tox.ini: pass cassandra-related environment variables · edb1007b

Nicolas Dandrimont authored 1 year ago

This allows overriding the JAVA_HOME to run cassandra with a different
java version (which also happens to be needed in CI, as we force usage
of an old java for cassandra through that envvar).

edb1007b

tox.ini: use minversion instead of requires · 4e5334d7

Nicolas Dandrimont authored 1 year ago

This avoids reinstalling tox all the time.

4e5334d7

Apr 05, 2023
- storage_checker: Fix crash on directory with duplicate entries · 192ae3d1
  vlorentz authored 1 year ago and vlorentz committed 1 year ago
  
  and report it as corrupt instead.
  v1.0.3
  
  192ae3d1
Mar 28, 2023
- Make the 5.sql upgrade script actually work · 4400f3e9
  Nicolas Dandrimont authored 1 year ago
  
  v1.0.2
  
  4400f3e9
Mar 22, 2023
- docs: Add CLI · 7ab35978
  vlorentz authored 1 year ago
  
  v1.0.1
  
  7ab35978
- Add migration script for checked_range -> checked_partition · 0a6d7983
  vlorentz authored 1 year ago
  
  v1.0.0
  
  0a6d7983
- cli: Add example of CLI arguments for parallelization · 5967ae36
  vlorentz authored 1 year ago and vlorentz committed 1 year ago
  
  5967ae36
Mar 16, 2023

Move 'nb_partitions' before 'partition_id' in the index · 229c7f4f

vlorentz authored 2 years ago

It makes more sense to query a range of partition ids with a fixed nb_partitions
than a range of nb_partitions with a fixed partition id.

No migration because the next release will need to scrap the whole table
anyway.

229c7f4f