Commits · f3ef6e6a5a12a9018a5a3865676b41a8e7974d5d · Platform / Development / swh-storage

Feb 19, 2021

storage: Implement visit types filtering in origin_search method · f3ef6e6a

Antoine Lambert authored 4 years ago

Allow filtering searched origins by visit type.

Add a new optional visit_types parameter to origin_search method in
StorageInterface.

Implement visit types filtering in storage backends: an origin will be
returned if it has any of the requested visit types.

This is clearly not designed to be used in production due to performance
issues, but rather in testing environments with a small archive dataset.

Related to T2869

f3ef6e6a
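
A minimal sketch of the filtering rule described in this commit, not the actual backend code: an origin is returned when its URL matches the pattern and, if `visit_types` is given, it has at least one of the requested visit types. The `ORIGIN_VISITS` mapping and the `search_origins` helper are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory data: origin URL -> visit types recorded for it.
ORIGIN_VISITS: Dict[str, Set[str]] = {
    "https://example.org/repo-a": {"git"},
    "https://example.org/repo-b": {"hg", "svn"},
    "https://example.org/pkg-c": {"pypi"},
}


def search_origins(
    url_pattern: str, visit_types: Optional[List[str]] = None
) -> List[str]:
    """Return matching origin URLs, optionally filtered by visit type."""
    results = []
    for url, types in sorted(ORIGIN_VISITS.items()):
        if url_pattern not in url:
            continue
        # Keep the origin only if it has ANY of the requested visit types.
        if visit_types is not None and not types & set(visit_types):
            continue
        results.append(url)
    return results
```

The linear scan over every origin is in line with the caveat above: fine for a small test dataset, not for a production archive.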

Feb 17, 2021
- 167: Make the migration script unblocking · 7b4c1247
  Antoine R. Dumont authored 4 years ago
  
  Verified
  
  7b4c1247
Feb 16, 2021

Switch anonymized replayer test to use pytest parametrization · cc3eb4b1

Nicolas Dandrimont authored 4 years ago

This allows us to only read the kafka topics once instead of twice in
the same tests, which is apparently a hard thing to do in a way
compatible with both confluent-kafka 1.5 and 1.6.

cc3eb4b1

Feb 09, 2021

storage: Refactor OriginVisitStatus instantiation · e0e88b2f
Antoine R. Dumont authored 4 years ago

Tag v0.23.0 · Verified

e0e88b2f
db: Unify sql joins on origin_visit_status using "USING" · d30ca938
Antoine R. Dumont authored 4 years ago

Verified

d30ca938

storage.postgresql: Use origin_visit_status.type value as source · 046fe57c

Antoine R. Dumont authored 4 years ago

This stops using origin_visit.type as a fallback value, now that the database has been
migrated.

So this makes origin_visit_status.type a non-nullable column.

This also drops the now-redundant joins on the origin_visit table when reading.

Related to T2968

Verified

046fe57c

test_replay: Fix hang since confluent-kafka 1.6 release · 51df58e8

Antoine Lambert authored 4 years ago

Side effect of the following commit in librdkafka 1.6:
https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80

Tests were relying on a buggy behavior of the mocked kafka cluster: two
subsequent consumers set up with the same group id should receive a
different set of messages, rather than the same set of messages.

Also explicitly commit messages once consumed.

51df58e8

Feb 08, 2021
- postgresql: Fix dbversion() to return the max version instead of a random one. · b0383833
  vlorentz authored 4 years ago
  
  b0383833
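
A rough illustration of the kind of fix the title describes, under stated assumptions: it uses SQLite rather than PostgreSQL, and the simplified `dbversion` table layout and the `current_version` helper are hypothetical. Without an explicit ordering, a single-row read from such a table may return any row; ordering by version makes the result deterministic.

```python
import sqlite3

# Hypothetical, simplified schema (the real table may have more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dbversion (version INTEGER PRIMARY KEY, description TEXT)"
)
conn.executemany(
    "INSERT INTO dbversion (version, description) VALUES (?, ?)",
    [(2, "second upgrade"), (1, "initial schema"), (3, "third upgrade")],
)


def current_version(connection):
    """Return the highest recorded schema version, or None if there is none."""
    row = connection.execute(
        "SELECT version FROM dbversion ORDER BY version DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```
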
Feb 04, 2021

buffer: ensure objects are flushed in topological order · efd8815b

Nicolas Dandrimont authored 4 years ago

This new integration test checks that, when flushing the buffer storage,
the addition functions of the underlying storage backend are called in
topological order (content, directory, revision, release then snapshot).

This reduces the probability of "data consistency" regressions caused by
the use of the buffering storage proxy alone.

efd8815b
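
The ordering property checked by this test can be sketched as follows. This is a simplified stand-in, not the buffer proxy itself: `BufferSketch` and its `backend_add` callback are hypothetical, and only the object-type order is taken from the commit message.

```python
from typing import Callable, Dict, List

# Topological order stated in the commit message above.
OBJECT_TYPES = ("content", "directory", "revision", "release", "snapshot")


class BufferSketch:
    """Hypothetical buffer: collects objects per type, flushes them in order."""

    def __init__(self, backend_add: Callable[[str, List[object]], None]) -> None:
        self._backend_add = backend_add
        self._objects: Dict[str, List[object]] = {t: [] for t in OBJECT_TYPES}

    def add(self, object_type: str, obj: object) -> None:
        self._objects[object_type].append(obj)

    def flush(self) -> None:
        # Always walk the types in topological order, whatever the add order.
        for object_type in OBJECT_TYPES:
            batch = self._objects[object_type]
            if batch:
                self._backend_add(object_type, list(batch))
                batch.clear()
```

Recording the sequence of `backend_add` calls, as a test would do with a mock, is enough to assert the order.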

Return an accurate summary from buffer's flush() method · 1526107b

Nicolas Dandrimont authored 4 years ago

The earlier implementation would only return summary data from keys that
existed in the last `_add` backend method run, rather than collating all
the results.

1526107b
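
One way to collate per-call summaries, shown as a sketch rather than the actual implementation: sum the counters returned by each backend call instead of keeping only the last one. The `collate_summaries` helper and the summary keys in the usage note are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def collate_summaries(summaries: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum the counters of every summary instead of keeping only the last."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts key by key
    return dict(total)
```

For example, collating `{"content:add": 2}`, `{"directory:add": 1}` and `{"content:add": 3}` yields `{"content:add": 5, "directory:add": 1}`.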

buffer: add support for snapshots · 5b3e6c9f

Nicolas Dandrimont authored 4 years ago

This is mostly a consistency addition, considering that most (if not
all) loaders will only add a single snapshot.

The common pattern of loading objects in topological order (content >
directory > revision > release > snapshot), then flushing the storage,
is now fully consistent. Without this addition, the snapshot addition
would reach the backend storage before all other objects are added,
leading to potential inconsistencies if the flush of other object types
fails.

5b3e6c9f

buffer: add type annotations for tests · 18967ed4
Nicolas Dandrimont authored 4 years ago

18967ed4

Feb 01, 2021

storage: Make origin_get_latest_visit_status return OriginVisitStatus · 9a9f234e

Antoine R. Dumont authored 4 years ago

This used to return a Tuple[OriginVisit, OriginVisitStatus].

The tuple was required to provide the "type" information missing from the visit status.
It is no longer needed now that OriginVisitStatus holds the type information.

Verified

9a9f234e

Change origin_visit_status_get_random interface to return visit_status · 626b0bf8

Antoine R. Dumont authored 4 years ago

This used to return a Tuple[OriginVisit, OriginVisitStatus], which is no longer needed
now that OriginVisitStatus holds the type information.

Verified

626b0bf8

Write introduction to swh-storage. · f6ae8a06

vlorentz authored 4 years ago

Explains:

* when to use swh-web instead
* that `get_storage` should always be used to instantiate the storage
* `StorageInterface`
* model objects
* pagination
* backends

f6ae8a06

Jan 28, 2021

Correctly return origin_visit_status.type value everywhere · 76de53cb

Vincent Sellier authored 4 years ago

If the type is not present on an origin_visit_status, it
should be computed from the origin_visit.
Some methods only returned the origin_visit_status
value. This broke the webapp, which displayed an empty type on
the search result page.

Related to T3001

Verified

76de53cb

Jan 20, 2021
- db: Allow new status values not_found, failed to OriginVisitStatus · e433255d
  Antoine R. Dumont authored 4 years ago
  
  Related to T2961
  Tag v0.21.0 · Verified
  
  e433255d
Jan 18, 2021
- Add type to the origin_visit_status topic · d04165f5
  Vincent Sellier authored 4 years ago
  
  Useful when the type is not yet populated in the database. Related to T2966
  Tag v0.20.0 · Verified
  
  d04165f5
Jan 15, 2021
- Add persistence of the field OriginVisitStatus.type · c24d35f8
  Vincent Sellier authored 4 years ago
  
  (!) A new database upgrade (165.sql) is needed for the postgresql backend. Related to T2964
  Verified
  
  c24d35f8
- Make test_content_add_race fail for the right reason. · da55308f
  vlorentz authored 4 years ago
  
  Since 209de5db, it was failing because of: TypeError("content_add() got an unexpected keyword argument 'db'")
  da55308f
Jan 13, 2021

Adapt cassandra storage to ignore the new OriginVisitStatus.type field · 0b44b372
Vincent Sellier authored 4 years ago
```
Depends on D4848
Related to T2443
```
Tag v0.19.0 · Verified

0b44b372

Allow to use the JAVA_HOME environment for cassandra tests · 728c3eea

David Douard authored 4 years ago

This allows enforcing a specific version of Java. For
example, since cassandra does not seem to support Java 14 yet, this allows
running tests on bullseye:

  JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ pytest swh

728c3eea

Enforce hypothesis <6 to prevent test breakage · 30945a58

David Douard authored 4 years ago

hypothesis 6 upgraded a warning into an error: it now raises a FailedHealthCheck
when using a pytest fixture with a @given generative test set.

See https://hypothesis.readthedocs.io/en/latest/healthchecks.html

30945a58

Jan 08, 2021
- Make the CREATE_TABLES_QUERIES in cassandra/schema.py an explicit list · 74e6f58e
  David Douard authored 4 years ago
  
  prevent being fooled by a missing '\n'.
  74e6f58e
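
A sketch of the pitfall this change guards against. It assumes, without confirmation from the commit itself, that the queries were previously obtained by splitting one large string on blank lines; the table names and variables below are hypothetical.

```python
# Assumption: queries used to be derived by splitting one big string.
QUERIES_TEXT = """
CREATE TABLE a (id int PRIMARY KEY);

CREATE TABLE b (id int PRIMARY KEY);
CREATE TABLE c (id int PRIMARY KEY);
"""

# Tables b and c are separated by a single newline, so they are silently
# merged into one element and the list is shorter than intended.
split_queries = [q for q in QUERIES_TEXT.split("\n\n") if q.strip()]

# An explicit list has one element per statement, whatever the whitespace.
EXPLICIT_QUERIES = [
    "CREATE TABLE a (id int PRIMARY KEY);",
    "CREATE TABLE b (id int PRIMARY KEY);",
    "CREATE TABLE c (id int PRIMARY KEY);",
]
```
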
Dec 18, 2020
- Add a cli section in the doc · 2b35198d
  David Douard authored 4 years ago
  
  2b35198d
Nov 24, 2020
- storage.backfill: Allow cli run for origin_visit_status as well · 04ae89f4
  Antoine R. Dumont authored 4 years ago
  
  Verified
  
  04ae89f4
- conftest: Reference swh.core.db.pytest_plugin · 64ee8451
  Antoine R. Dumont authored 4 years ago
  
  As it's exposed through swh.storage.pytest_plugin, which is itself used by other swh modules, this needs to be declared to avoid build failures in those modules. Related to T2746
  Verified
  
  64ee8451
Nov 23, 2020
- requirements-test.txt: Drop no longer needed pytest-postgresql requirement · e289593f
  Antoine R. Dumont authored 4 years ago
  
  requirements-swh.txt already declares the swh.core[db] dependency, which transitively pulls it in. Related to T2746
  Tag v0.18.0 · Verified
  
  e289593f
Nov 13, 2020

backfill: Reverse flawed logic in SnapshotBranch generation · 0065d4df
Nicolas Dandrimont authored 4 years ago
```
The previous code would nullify all non-null branches, and try to create a
SnapshotBranch out of null branches.
```
0065d4df
migrate_extrinsic_metadata: don't crash when deb revisions aren't referenced by any snapshot · f5011362
vlorentz authored 4 years ago
```
As this happens for about 50 revisions in the archive.
```
f5011362

backfill: only flush the journal writer on every batch · 20d3f8e7

Nicolas Dandrimont authored 4 years ago

This module's use of write_addition predated the introduction of reliable
writing in swh.journal. Since this introduction, the backfiller has been
flushing the kafka writer after writing each single object, leading to a 3x
measured slowdown on backfilling contents.

20d3f8e7
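
The change in flushing granularity can be sketched like this. `CountingWriter` is a hypothetical stand-in for a journal writer and `backfill` is not the real backfiller; the sketch only shows one flush per batch instead of one per object.

```python
from typing import Iterable, List


class CountingWriter:
    """Hypothetical stand-in for a journal writer; counts flush() calls."""

    def __init__(self) -> None:
        self.written: List[object] = []
        self.flushes = 0

    def write(self, obj: object) -> None:
        self.written.append(obj)

    def flush(self) -> None:
        self.flushes += 1


def backfill(writer: CountingWriter, batches: Iterable[List[object]]) -> None:
    for batch in batches:
        for obj in batch:
            writer.write(obj)
        # Flush once per batch, not once per object.
        writer.flush()
```

With two batches of three and two objects, this performs two flushes instead of five.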

Nov 12, 2020
- Don't use string expansions in debug logging · 248a04b5
  Nicolas Dandrimont authored 4 years ago
  
  248a04b5
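
For context, a small sketch of why this matters, using only the standard `logging` module: formatting the message eagerly pays the cost even when debug logging is disabled, while passing the arguments to the logger defers formatting until a record is actually emitted. The `Expensive` class and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("backfill_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)  # debug messages are disabled


class Expensive:
    """Counts how many times it is rendered to a string."""

    def __init__(self) -> None:
        self.rendered = 0

    def __str__(self) -> str:
        self.rendered += 1
        return "expensive"


def log_eager(obj: Expensive) -> None:
    # The string is built before logger.debug() is even called.
    logger.debug("processing %s" % obj)


def log_lazy(obj: Expensive) -> None:
    # The logger formats the message only if the record is emitted.
    logger.debug("processing %s", obj)
```
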
Nov 09, 2020
- migrate_extrinsic_metadata: Remove log output when a CRAN origin is missing · 3eba73df
  vlorentz authored 4 years ago
  
  as this happens quite often and isn't an error.
  3eba73df
- migrate_extrinsic_metadata: add support for guessing the origin of more PyPI... · f3652a97
  vlorentz authored 4 years ago
  
  migrate_extrinsic_metadata: add support for guessing the origin of more PyPI packages from filenames.
  f3652a97
- migrate_extrinsic_metadata: use the retry proxy · c0a3d966
  vlorentz authored 4 years ago
  
  Because it makes a lot of get requests and doesn't handle failures, it crashed often.
  c0a3d966
- Make the retry proxy work on all functions. · aded45b9
  vlorentz authored 4 years ago
  
  The metadata migration script kept crashing otherwise.
  aded45b9
- Set the value_sanitizer argument of get_journal_writer. · 2e7d489e
  vlorentz authored 4 years ago
  
  The next version of swh-journal will remove the default value.
  2e7d489e
- cassandra: Fix content_missing_per_sha1_git implementation · 24cdc85c
  Antoine Lambert authored 4 years ago
  
  24cdc85c
Nov 05, 2020
- algos.snapshot.snapshot_resolve_alias: Don't return the branch list. · 84984a60
  vlorentz authored 4 years ago
  
  It complicates the signature and the code, and we don't have any use for it currently.
  Tag v0.17.1
  
  84984a60
- Add test for snapshot_resolve_alias with a missing branch. · fa868346
  vlorentz authored 4 years ago
  
  fa868346