- Nov 09, 2021
-
-
David Douard authored
The idea is that we check the BaseModel validity at journal deserialization time so that we still have access to the raw object from kafka for complete reporting (object id plus raw message from kafka). This uses a new ModelObjectDeserializer class that is responsible for deserializing the kafka message (still using kafka_to_value) and then immediately creating the BaseModel object from that dict. Its `convert` method is then passed as the `value_deserializer` argument of the `JournalClient`. Then, for each object deserialized from kafka, if it is a HashableObject, check its validity by comparing the computed hash with its id. If it is invalid, report the error in logs and, if configured, register the invalid object via the `reporter` callback. In the cli code, a `Redis.set()` is used as such a callback (if configured): it simply stores invalid objects using the object id as key (typically its swhid) and the raw kafka message value as value. Related to T3693.
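A minimal sketch of how this could be wired up, assuming the class lives in swh.storage.replay, that its constructor takes a `reporter` callable, and that `get_journal_client` forwards `value_deserializer` to `JournalClient` (module paths, argument names and values below are illustrative):

```python
import redis

from swh.journal.client import get_journal_client
from swh.storage.replay import ModelObjectDeserializer  # assumed import path

# Reporter callback: store each invalid object in Redis, keyed by its object id
# (typically its swhid), with the raw kafka message as value.
redis_client = redis.Redis(host="localhost", port=6379)


def report_invalid_object(obj_id: str, raw_message: bytes) -> None:
    redis_client.set(obj_id, raw_message)


deserializer = ModelObjectDeserializer(reporter=report_invalid_object)

# convert replaces the plain kafka_to_value deserializer, so each message is
# turned into a BaseModel instance (and hash-checked) as soon as it is read.
client = get_journal_client(
    "kafka",
    brokers=["kafka1:9092"],
    group_id="swh-replayer",
    object_types=["revision", "release"],
    value_deserializer=deserializer.convert,
)
```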
-
David Douard authored
allowing this dict to be used independently of the fix_objects() function.
-
David Douard authored
keep the fix_objects() function for bw compat for now.
-
David Douard authored
allows choosing replayed object types from the cli.
-
vlorentz authored
* merged origin and artifact metadata
* added metametadata
* uses structures instead of dict
* removed raw_extrinsic_metadata_get_latest
-
- Oct 28, 2021
-
-
Antoine Lambert authored
It enables returning in an efficient way the list of unique snapshot identifiers resulting from the visits of an origin. Previously it was required to query all visits of an origin, then query all visit statuses for each visit, to extract such information. The introduced method enables extracting origin snapshots information in a single database query. Related to T3631
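A sketch of the difference at the storage API level; the method name used below for the new endpoint, `origin_snapshot_get_all`, is illustrative, and pagination is omitted for brevity:

```python
# Before: query all visits, then all visit statuses of each visit, and collect
# the snapshot ids by hand.
snapshots = set()
for visit in storage.origin_visit_get(origin_url).results:
    for status in storage.origin_visit_status_get(origin_url, visit.visit).results:
        if status.snapshot is not None:
            snapshots.add(status.snapshot)

# After: a single call, backed by a single database query, returns the unique
# snapshot identifiers reached from the origin's visits.
snapshots = set(storage.origin_snapshot_get_all(origin_url))
```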
-
- Oct 22, 2021
-
-
Antoine Lambert authored
Some revisions in the archive do not have a committer date, so work around it to avoid errors when walking on such revisions with the CommitterDateRevisionsWalker class.
-
- Oct 21, 2021
-
-
Antoine Lambert authored
-
- Oct 18, 2021
-
-
vlorentz authored
content_missing_by_sha1_git only checks the index and not the main table. This is incorrect, because contents should not be considered written before an entry is written to the main table, even if an entry exists in one of the indexes.
-
- Oct 11, 2021
- Oct 08, 2021
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
The size of individual revisions and releases is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions or releases with large messages or other metadata. The duration of the database operations for revision_add or release_add is also commensurate with the size of the objects added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on estimated sizes for batches of revision and release objects makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual revisions is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions with extensive histories. The duration of the database operation for revision_add is also commensurate with the number of revision parents added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulative number of revision parents per batch makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual directories is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of (very) large directories. The duration of the database operation for directory_add is also commensurate with the number of directory entries added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulative number of directory entries per batch makes this overuse of memory and of database transaction time much less likely.
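A hedged configuration sketch covering the three entries above, instantiating the buffer proxy with the new per-batch thresholds; the threshold key names and values shown here are assumptions for illustration, not taken from the entries:

```python
from swh.storage import get_storage

storage = get_storage(
    "buffer",
    storage={"cls": "remote", "url": "http://storage:5002/"},
    min_batch_size={
        "content": 10_000,
        "content_bytes": 100 * 1024 * 1024,
        "directory": 25_000,
        "directory_entries": 100_000,  # cap on cumulative directory entries per batch
        "revision": 100_000,
        "revision_parents": 100_000,  # cap on cumulative revision parents per batch
        "revision_bytes": 100 * 1024 * 1024,  # cap on estimated revision batch size
        "release": 100_000,
        "release_bytes": 100 * 1024 * 1024,  # cap on estimated release batch size
    },
)
```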
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
This was already the case (as grouper called on an empty iterator just returns no batches), but add a test to enforce it.
-
- Sep 29, 2021
-
-
David Douard authored
-
David Douard authored
due to a missing type annotation on the storage argument of _insert_objects(), we missed a bug in the processing of raw_extrinsic_metadata objects: a set() was passed as argument to the storage add methods.
-
David Douard authored
now the callable is expected to return a dict.
-
- Sep 28, 2021
-
-
vlorentz authored
-
- Sep 27, 2021
-
-
vlorentz authored
-
- Sep 23, 2021
-
-
Raphaël Gomès authored
This has a corresponding change in swh.model
-
- Sep 21, 2021
-
-
vlorentz authored
This is used by directory_ls and content_get.
-
Antoine Lambert authored
Methods snapshot_get and snapshot_get_branches should return None if the snapshot does not exist in the archive. Add missing tests to cover that case.
-
- Sep 16, 2021
-
-
Antoine Lambert authored
When searching for branches in an existing snapshot, a PartialBranches object must be returned regardless of the number of found branches. None should only be returned when a snapshot does not exist. This fixes an inconsistency between the postgresql and cassandra backends. Related to T3413
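A short illustration of the behavior this enforces (the snapshot id and branch prefix below are placeholders):

```python
branches = storage.snapshot_get_branches(
    snapshot_id, branches_from=b"refs/heads/does-not-exist"
)

if branches is None:
    ...  # the snapshot itself is missing from the archive
else:
    # The snapshot exists: a PartialBranches object is returned even when no
    # branch matches, with an empty branches mapping.
    for name, target in branches.branches.items():
        ...
```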
-
- Sep 15, 2021
-
-
Antoine R. Dumont authored
This impacts both the `extid_get_from_extid` and `extid_get_from_target` endpoints. When extid_version/extid_type are not provided, this keeps the existing behavior of returning all matching extids. Related to T3567
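A hedged sketch of calls using the new filters; the keyword names (`version`, `extid_type`, `extid_version`) and the `ObjectType` import path are assumptions to be checked against the actual interface:

```python
from swh.model.swhids import ObjectType  # assumed import path

# Resolve from the external id side, keeping only version 1 extids.
extids = storage.extid_get_from_extid("hg-nodeid", [hg_nodeid], version=1)

# Resolve from the archived object side, filtering on type and version;
# omitting the filters keeps the previous behavior of returning all matches.
extids = storage.extid_get_from_target(
    ObjectType.REVISION,
    [revision_id],
    extid_type="hg-nodeid",
    extid_version=1,
)
```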
-
- Sep 14, 2021
-
-
vlorentz authored
-
- Sep 10, 2021
- Sep 09, 2021
-
-
vlorentz authored
This should make it run up to 100 times faster, even on average directories.
-
vlorentz authored
Instead of fetching them one-by-one, with the very high latency this entails. This is preliminary work to make `directory_ls` less painfully slow.
-
vlorentz authored
And fall back to concurrent insertion.
-
- Sep 08, 2021
-
-
vlorentz authored
By reusing the 'steady state' main statement (which is quite large) across calls.
-
vlorentz authored
This adds a new config option for the cassandra backend, 'directory_entries_insert_algo', with three possible values:
* 'one-per-one' is the default, and preserves the current naive behavior
* 'concurrent' and 'batch' are attempts at being more efficient
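A sketch of selecting the new algorithm when instantiating the cassandra backend; the backend options other than 'directory_entries_insert_algo' are the usual ones, shown here with placeholder values:

```python
from swh.storage import get_storage

storage = get_storage(
    "cassandra",
    hosts=["cassandra-seed.internal"],
    keyspace="swh",
    objstorage={"cls": "memory"},
    # one of 'one-per-one' (default), 'concurrent' or 'batch'
    directory_entries_insert_algo="batch",
)
```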
-
- Sep 06, 2021
-
-
vlorentz authored
This will be used as a second pass on objects that failed with older versions of the script.
-
- Sep 03, 2021
-
-
vlorentz authored
-
- Aug 31, 2021
-
-
vlorentz authored
They were inaccurate and a performance bottleneck. We can/should use swh-counters instead, now.
-