- Apr 06, 2021
-
-
David Douard authored
This attribute is deprecated and on the verge of being replaced by RawExtrinsicMetadata objects, and the kafka journal currently in production contains a few invalid metadata entries that make the replayer unhappy. Closes T3201.
-
David Douard authored
-
David Douard authored
and explicitly check for extid objects in the journal in TestStorage.
-
- Mar 30, 2021
-
-
vlorentz authored
They can't have any extrinsic metadata, so fetching git revisions wastes a lot of time.
-
- Mar 26, 2021
-
-
vlorentz authored
It did not make sense for multiple reasons:
1. two extids can point to the same target (e.g. extids with type git and git-sha256, or two package managers with different checksums)
2. inserting two objects with the same target or extid in a single call actually wrote both, but would crash when reading
3. inserting extid1 then extid2 would write both to Kafka, but only extid1 would be inserted; when replaying on a new DB, extid2 may be inserted and extid1 ignored
Points 2 and 3 are simply fixable bugs, but 1 is an issue by design, and this commit fixes all of them at once.
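Point 1 above can be sketched with a multi-valued mapping: since two ExtIDs of different types can point at the same target, neither side of the mapping can be treated as a unique key. This is an illustrative sketch, not the real swh.storage code; all names here are assumptions.

```python
from collections import defaultdict

# Both directions of the mapping are set-valued, because neither the
# extid nor the target is unique on its own.
extid_to_targets = defaultdict(set)   # (extid_type, extid) -> {target}
target_to_extids = defaultdict(set)   # target -> {(extid_type, extid)}

def extid_add(extid_type: str, extid: str, target: str) -> None:
    extid_to_targets[(extid_type, extid)].add(target)
    target_to_extids[target].add((extid_type, extid))

# E.g. the same revision known under both its git and git-sha256 ids:
extid_add("git", "abc123", "swh:1:rev:0001")
extid_add("git-sha256", "def456", "swh:1:rev:0001")
```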
-
vlorentz authored
-
- Mar 22, 2021
-
-
vlorentz authored
For now, this has absolutely no effect on the API users, as rows are already deduplicated based on a subset of the fields hashed by the id.
- Mar 15, 2021
-
-
vlorentz authored
Content must be added to the objstorage before the DB and journal. Otherwise:
1. in case of a crash, the DB may "believe" we have the content, even though we did not have time to write it to the objstorage before the crash
2. the objstorage mirroring, which reads from the journal, may attempt to read from the objstorage before we finished writing it
This was already done unintentionally in the postgresql backend since 209de5db. This commit documents the behavior, makes the cassandra backend behave the same way, and adds a test.
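The write ordering described above can be sketched with a minimal in-memory backend (purely illustrative; the names and structures are assumptions, not the real swh.storage implementation):

```python
# Minimal sketch of the write ordering: objstorage first, then DB,
# then journal, so a crash can never leave the DB or journal claiming
# a content that the objstorage does not actually hold.

objstorage = {}   # hash -> raw bytes
db = set()        # hashes the database knows about
journal = []      # messages that mirrors replay from

def content_add(sha1: bytes, data: bytes) -> None:
    # 1. objstorage first: a crash after this step only leaks a blob
    objstorage[sha1] = data
    # 2. then the database...
    db.add(sha1)
    # 3. ...and finally the journal, so a mirror reading the journal
    # can safely fetch the blob from the objstorage.
    journal.append(sha1)

content_add(b"\x01", b"hello")
```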
- Mar 12, 2021
-
-
Antoine Lambert authored
Add an optional branch_name_exclude_prefix parameter to the snapshot_count_branches method of the Storage interface. It makes it possible to filter out branches whose name starts with a given prefix when counting. The purpose is to get accurate counters in swh-web, as pull request branches will be filtered out by default. Related to T2782
-
Antoine Lambert authored
Add an optional branch_name_include_substring parameter to snapshot_get_branches: if provided, only branches whose name contains the given substring will be returned. Add an optional branch_name_exclude_prefix parameter to snapshot_get_branches: if provided, branches whose name starts with the given prefix will not be returned. The purpose of these new features is to add a search form to the branches view of swh-web and to filter out pull request branches (whose names start with "refs/pull/") by default. Related to T2782
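The filtering semantics of the two parameters can be sketched over a plain dict of branches (an illustrative stand-in; the real parameters live on the swh.storage Storage interface and this helper is hypothetical):

```python
# Sketch of the branch-name filters: an optional substring that a
# branch name must contain, and an optional prefix that excludes it.

def filter_branches(branches, include_substring=None, exclude_prefix=None):
    names = sorted(branches)
    if include_substring is not None:
        names = [n for n in names if include_substring in n]
    if exclude_prefix is not None:
        names = [n for n in names if not n.startswith(exclude_prefix)]
    return names

branches = {
    "refs/heads/main": "rev1",
    "refs/heads/feature-x": "rev2",
    "refs/pull/42/head": "rev3",
}

# Filtering out pull request branches, as swh-web does by default:
no_prs = filter_branches(branches, exclude_prefix="refs/pull/")
```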
-
- Mar 11, 2021
-
-
David Douard authored
These endpoints allow adding ExtIDs to the storage and querying it for known ExtIDs from a SWHID (typically, retrieving the original VCS revision's intrinsic identifier from a SWHID). The underlying data structure is meant to be filled by loaders, using the `extid_add()` endpoint. This only provides the Postgresql implementation. Related to T2849.
-
- Mar 10, 2021
-
-
David Douard authored
-
David Douard authored
the latter has been deprecated for a while now.
-
David Douard authored
being miraculously listed the same.
-
Nicolas Dandrimont authored
This also checks the basic raw_extrinsic_metadata codepaths in the backfiller tests.
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
We convert the target attribute to a hashed ExtendedSWHID before returning the object.
-
- Mar 03, 2021
-
-
Antoine Lambert authored
With small limits (< 10), the snapshot branches query can degenerate into using the deduplication index on snapshot_branch (name, target, target_type), and the postgresql planner happily scans several hundred million rows. So ensure a minimum limit value of 10 before executing the query, for optimal performance when a small branches_count value is provided to the snapshot_get_branches method of the Storage interface. Related to P966
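The guard described above amounts to clamping the SQL LIMIT and trimming the extra rows afterwards, so callers still get exactly what they asked for. A minimal sketch, with hypothetical helper names (not the actual swh.storage query code):

```python
# Clamp the query limit to at least 10 so the planner does not pick
# the degenerate deduplication-index scan for tiny limits.
MIN_QUERY_LIMIT = 10

def snapshot_branches_query_limit(branches_count: int) -> int:
    return max(branches_count, MIN_QUERY_LIMIT)

def get_branches(all_branches, branches_count):
    # Fetch with the clamped limit, then trim to the requested count.
    fetched = all_branches[:snapshot_branches_query_limit(branches_count)]
    return fetched[:branches_count]
```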
-
vlorentz authored
-
Antoine Lambert authored
Ensure tests can be executed using hypothesis >= 6 by suppressing the function_scoped_fixture health check on tests that use a function-scoped fixture in combination with @given and do not need the fixture to be reset between individual hypothesis examples.
-
- Mar 01, 2021
-
- Feb 25, 2021
-
-
vlorentz authored
For now this does nothing as RawExtrinsicMetadata has no 'id' field, but the equality assertions will become errors when the next version of swh.model is released.
-
- Feb 19, 2021
-
-
Antoine Lambert authored
Enable filtering searched origins by visit types. Add a new optional visit_types parameter to the origin_search method in StorageInterface. Implement visit type filtering in the storage backends: an origin will be returned if it has any of the requested visit types. This is clearly not designed to be used in production due to performance issues, but rather in testing environments with a small archive dataset. Related to T2869
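The "any of the requested visit types" semantics can be sketched as a simple set intersection (an illustrative stand-in for StorageInterface.origin_search; the data layout here is an assumption):

```python
# Sketch of visit-type filtering: an origin matches when it has at
# least one of the requested visit types.

def origin_search(origins, url_pattern, visit_types=None):
    results = []
    for url, types in origins.items():
        if url_pattern not in url:
            continue
        if visit_types is not None and not set(visit_types) & set(types):
            continue
        results.append(url)
    return sorted(results)

origins = {
    "https://example.com/repo.git": {"git"},
    "https://example.com/pkg": {"pypi"},
}
```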
-
- Feb 17, 2021
-
-
Antoine R. Dumont authored
-
- Feb 16, 2021
-
-
Nicolas Dandrimont authored
This allows us to only read the kafka topics once instead of twice in the same tests, which is apparently a hard thing to do in a way compatible with both confluent-kafka 1.5 and 1.6.
-
- Feb 09, 2021
-
-
Antoine R. Dumont authored
-
Antoine R. Dumont authored
-
Antoine R. Dumont authored
This stops using origin_visit.type as a fallback value, now that the database has been migrated, and makes the origin_visit_status.type column not nullable. This also drops now-redundant joins on the origin_visit table when reading. Related to T2968
-
Antoine Lambert authored
Side effect of the following commit in librdkafka 1.6: https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80 Tests were relying on a buggy behavior of the mocked kafka cluster: two subsequent consumers set up with the same group id should receive a different set of messages, rather than the same set. Also explicitly commit messages once consumed.
-
- Feb 08, 2021
-
-
vlorentz authored
-
- Feb 04, 2021
-
-
Nicolas Dandrimont authored
This new integration test checks that, when flushing the buffer storage, the addition functions of the underlying storage backend are called in topological order (content, directory, revision, release then snapshot). This reduces the probability of "data consistency" regressions caused by the use of the buffering storage proxy alone.
-
Nicolas Dandrimont authored
The earlier implementation would only return summary data from keys that existed in the last `_add` backend method run, rather than collating all the results.
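The collation fix described above amounts to merging the per-flush summary counters instead of keeping only the last one. A minimal sketch with a hypothetical helper name (not the real buffer proxy code):

```python
from collections import Counter

# Merge the summary dicts returned by successive backend _add calls,
# summing counts per key instead of keeping only the last summary.
def collate_summaries(summaries):
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return dict(total)
```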
-
Nicolas Dandrimont authored
This is mostly a consistency addition, considering that most (if not all) loaders will only add a single snapshot. The common pattern of loading objects in topological order (content > directory > revision > release > snapshot), then flushing the storage, is now fully consistent; without this addition, the snapshot addition would reach the backend storage before all other objects are added, leading to potential inconsistencies if the flush of other object types fails.
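The flush ordering described above can be sketched as a tiny buffering proxy that always drains its buffers in topological order, so a snapshot only reaches the backend after the objects it refers to. This is an illustrative sketch under assumed names, not the real swh.storage buffer proxy:

```python
# Object types in topological order: a snapshot refers to revisions
# and releases, which refer to directories, which refer to contents.
FLUSH_ORDER = ["content", "directory", "revision", "release", "snapshot"]

class BufferingProxy:
    def __init__(self, backend):
        self.backend = backend  # records (object_type, objects) calls
        self.buffers = {otype: [] for otype in FLUSH_ORDER}

    def add(self, object_type, obj):
        self.buffers[object_type].append(obj)

    def flush(self):
        # Drain buffers in topological order, regardless of the order
        # in which objects were added.
        for otype in FLUSH_ORDER:
            if self.buffers[otype]:
                self.backend.append((otype, list(self.buffers[otype])))
                self.buffers[otype].clear()

backend = []
proxy = BufferingProxy(backend)
proxy.add("snapshot", "snp1")   # added first...
proxy.add("content", "cnt1")
proxy.flush()                   # ...but flushed last
```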
-
Nicolas Dandrimont authored
-
- Feb 01, 2021
-
-
Antoine R. Dumont authored
This returned a Tuple[OriginVisit, OriginVisitStatus]. This was required to have the missing "type" information for visit statuses. It is no longer needed, as OriginVisitStatus now holds the type information.
-
Antoine R. Dumont authored
This returned a Tuple[OriginVisit, OriginVisitStatus], which is no longer needed as OriginVisitStatus now holds the type information.
-