Commits · 8e94afaa0ecc0d5c25bca34c01f985337fe617ae · Platform / Development / swh-storage

Sep 10, 2021
- migrate_extrinsic_metadata: Fix off-by-one error, causing the first_id to be skipped · 8e94afaa
  vlorentz authored 3 years ago
  
  8e94afaa
Sep 09, 2021
- cassandra: Make directory_ls fetch contents in batch instead of one-by-one · 5facf661
  vlorentz authored 3 years ago
  
  This should make it run up to 100 times faster, even on average directories.
  5facf661
- content_get: Fetch rows concurrently · 0570a426
  vlorentz authored 3 years ago
  
  Instead of fetching them one-by-one, with the very high latency this entails. This is preliminary work to make `directory_ls` less painfully slow.
  0570a426
- directory_entry_add_batch: Remove the temporary prepared statement entirely · 50fb54f2
  vlorentz authored 3 years ago
  
  And fall back to concurrent insertion.
  50fb54f2
Sep 08, 2021

directory_entry_add_batch: Reduce churn of prepared statements · da7e63ea
vlorentz authored 3 years ago
By reusing the 'steady state' main statement (which is quite large)
across calls.
da7e63ea

cassandra: Add option to select (hopefully) more efficient batch insertion algos · fc950deb

vlorentz authored 3 years ago

This adds a new config option for the cassandra backend,
'directory_entries_insert_algo', with three possible values:

* 'one-per-one' is the default, and preserves the current naive behavior
* 'concurrent' and 'batch' are attempts at being more efficient
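
As an illustration only, the three values could be dispatched as in the sketch below. The function name `insert_entries`, the callables `insert_one` and `insert_many`, and the worker count are hypothetical; they are not taken from swh-storage, whose actual implementation may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# The three values named in the commit message.
ALGOS = ("one-per-one", "concurrent", "batch")


def insert_entries(
    entries: Sequence[T],
    insert_one: Callable[[T], None],             # hypothetical: inserts a single row
    insert_many: Callable[[Sequence[T]], None],  # hypothetical: inserts all rows in one request
    algo: str = "one-per-one",
) -> None:
    """Insert directory entries using the configured strategy."""
    if algo == "one-per-one":
        # Naive behavior: one blocking round trip per entry.
        for entry in entries:
            insert_one(entry)
    elif algo == "concurrent":
        # Same per-entry inserts, but several are in flight at once.
        with ThreadPoolExecutor(max_workers=8) as pool:
            list(pool.map(insert_one, entries))
    elif algo == "batch":
        # Hand all entries to the backend in a single call.
        insert_many(entries)
    else:
        raise ValueError(f"unknown algo {algo!r}, expected one of {ALGOS}")
```

Keeping 'one-per-one' as the default argument mirrors the commit's choice to preserve the existing behavior unless the option is set.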

fc950deb

Sep 06, 2021
- migrate_extrinsic_metadata: Add an option to limit the number of revisions · 7dc2863e
  vlorentz authored 3 years ago
  
  This will be used as a second pass on objects that failed with older versions of the script.
  7dc2863e
Sep 03, 2021
- test_directory_get_entries_pagination: don't depend on result order · 834a49d0
  vlorentz authored 3 years ago
  
  834a49d0
Aug 31, 2021

cassandra: Remove stat_counters. · e8aad0ff

vlorentz authored 3 years ago

They were inaccurate and a performance bottleneck.

We can/should use swh-counters instead, now.

e8aad0ff

Aug 30, 2021
- postgresql: Fix a column order mismatch between the query and object builder · 3ad1bec1
  Vincent Sellier authored 3 years ago
  
  This resulted in OriginVisitStatus trying to put a snapshot id in the metadata field. Related to T3539
  3ad1bec1
- cassandra: generate statsd metrics on method calls · 999ea6bb
  Vincent Sellier authored 3 years ago
  
  Related to T3517
  999ea6bb
Aug 27, 2021

Add counting storage proxy · 47a6919f

vlorentz authored 3 years ago

It will be used in the Cassandra experiment.

Currently we use the built-in counters of the Cassandra backend; but in
addition to being inaccurate, they seem to be a bottleneck.

This proxy will be a lightweight solution for counting object insertion,
without needing to run Kafka on the test cluster.
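
A minimal sketch of what such a proxy might look like, assuming every insertion method is named `<object_type>_add` and takes a single iterable of objects. The class name `CountingProxy` and the in-memory `Counter` are illustrative assumptions, not the actual swh-storage code. Note that this sketch counts submitted objects, not objects that were actually new to the backend.

```python
from collections import Counter


class CountingProxy:
    """Hypothetical proxy that counts objects passed to *_add methods."""

    def __init__(self, storage):
        self.storage = storage
        self.counts = Counter()

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # so everything else is forwarded to the wrapped storage.
        attr = getattr(self.storage, name)
        if not (name.endswith("_add") and callable(attr)):
            return attr
        object_type = name[: -len("_add")]

        def wrapper(objects):
            objects = list(objects)  # materialize so we can count
            result = attr(objects)
            self.counts[object_type] += len(objects)
            return result

        return wrapper
```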

47a6919f

Aug 24, 2021

Add cvs as supported revision_type · b110d1b6
Nicolas Dandrimont authored 3 years ago

Tag: v0.36.0

b110d1b6

Add test for origin_visit_get_latest in presence of mismatched id and date orders · 8f1cdf65

vlorentz authored 3 years ago

It was unclear this actually worked; I had to write this test to realize
the code wasn't buggy.

Also replaced a conditional that is always False (because Cassandra
always returns results in the order of the clustering key) with an
assertion, so the code is less confusing.

8f1cdf65

cassandra: Bump next_visit_id when origin_visit_add is called by a replayer · cf880db3

vlorentz authored 3 years ago

When called by a replayer, the visit.visit field is set; but
origin.next_visit_id was never incremented, so on the next loader
run, the visit id would be 1 even if there is already a visit
with that id.
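
The fix can be pictured with the simplified sketch below, where an origin is a plain dict holding a `next_visit_id` counter. The function name `assign_visit_id` and the dict representation are assumptions made for illustration; the real backend stores this counter in Cassandra.

```python
from typing import Optional


def assign_visit_id(origin: dict, requested_visit_id: Optional[int] = None) -> int:
    """Return the visit id to use, keeping origin['next_visit_id'] ahead of it."""
    if requested_visit_id is None:
        # Loader case: allocate the next free id.
        visit_id = origin["next_visit_id"]
        origin["next_visit_id"] = visit_id + 1
        return visit_id
    # Replayer case: the id is imposed, so bump the counter past it.
    origin["next_visit_id"] = max(origin["next_visit_id"], requested_visit_id + 1)
    return requested_visit_id
```

With the bump in place, a loader run that follows a replayed visit 1 allocates id 2 instead of reusing 1.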

cf880db3

cassandra: Make content_missing query in batches · 54b5abfb

vlorentz authored 3 years ago

Instead of calling content_find() for each object, which needs to make
two queries for each.

Given the latency of Cassandra queries, this should be a significant
speed-up (possibly up to 100 times faster, as this is the value of
PARTITION_KEY_RESTRICTION_MAX_SIZE).

This also changes the schema, because CQL does not allow doing `IN`
queries on compound partition keys.
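
A rough sketch of the batching idea, assuming a hypothetical `query_present` callable that takes up to 100 ids and returns the subset that exists (standing in for one `IN` query). Only `PARTITION_KEY_RESTRICTION_MAX_SIZE` and its value of 100 come from the commit message; the helper names are illustrative.

```python
from itertools import islice
from typing import Callable, Hashable, Iterable, Iterator, List, Set

PARTITION_KEY_RESTRICTION_MAX_SIZE = 100  # value quoted in the commit message


def grouper(items: Iterable[Hashable], n: int) -> Iterator[List[Hashable]]:
    """Yield successive lists of at most n items."""
    it = iter(items)
    while chunk := list(islice(it, n)):
        yield chunk


def content_missing(
    ids: Iterable[Hashable],
    query_present: Callable[[List[Hashable]], Set[Hashable]],  # hypothetical IN query
) -> Iterator[Hashable]:
    """Yield ids that the backend does not know, one query per chunk."""
    for chunk in grouper(ids, PARTITION_KEY_RESTRICTION_MAX_SIZE):
        present = query_present(chunk)
        for id_ in chunk:
            if id_ not in present:
                yield id_
```

For 250 ids this issues 3 queries rather than 250 pairs of queries, which is where the expected speed-up comes from.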

54b5abfb

backfill: add extra where clause to use the right index for extid requests · 7113198f
Vincent Sellier authored 3 years ago
Related to T3485
7113198f

Aug 06, 2021
- cassandra: Fix crash when using _missing() functions with more than 100 ids with ScyllaDB. · 9f00eb9d
  vlorentz authored 3 years ago
  
  Tag: v0.35.1
  
  9f00eb9d
Jul 27, 2021
- sql: Adapt extid.extid_version comment · 912d04ee
  Antoine R. Dumont authored 3 years ago
  
  Tag: v0.35.0
  
  912d04ee
Jul 23, 2021

Implement storage of the ExtID.extid_version field · 7a380458

Nicolas Dandrimont authored 3 years ago

This field allows having multiple versions of the ExtID -> SWHID
mapping, for instance when the implementation of a loader changes in a
backwards-incompatible way.

For now, we don't change the API used to query or store ExtIDs. When
querying for the SWHIDs corresponding to a given external object, all
versions are returned, and the client is expected to do the filtering.
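
Client-side filtering could look like the sketch below. The `ExtID` dataclass here is a simplified stand-in whose field names are assumptions loosely modeled on the commit message (only `extid_version` is named there), not the real swh-model class.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ExtID:
    """Simplified, hypothetical stand-in for an ExtID -> SWHID mapping row."""

    extid_type: str
    extid: bytes
    extid_version: int
    target: str  # the SWHID, kept as a plain string here


def keep_version(extids: Iterable[ExtID], version: int) -> List[ExtID]:
    """Keep only the mappings produced with the given extid_version."""
    return [e for e in extids if e.extid_version == version]
```

A loader that only trusts mappings from its current implementation would call `keep_version(results, version)` on whatever the query returns.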

7a380458

Jul 07, 2021
- cassandra: Allow to configure the consistency level to use · 9747aed6
  Vincent Sellier authored 3 years ago
  
  The default ONE level is used to keep the previous behaviour. Related to T3396
  Tag: v0.34.0
  
  9747aed6
Jun 28, 2021
- postgresql: Add type annotation for 'db' argument · f1cac4fc
  vlorentz authored 3 years ago
  
  This allows mypy to actually type-check calls to db methods. This commit also fixes an issue found by mypy.
  Tag: v0.33.0
  
  f1cac4fc
- --amend · dd8a590b
  vlorentz authored 3 years ago
  
  dd8a590b
- Add endpoint raw_extrinsic_metadata_get_authorities · c5beb49a
  vlorentz authored 3 years ago
  
  This will make it easier for users of swh-web to discover metadata on a given SWHID, as you otherwise need to specify an authority to fetch metadata.
  c5beb49a
Jun 25, 2021
- cassandra: Add support for non-ASCII origin 'URLs'. · ec2fac44
  vlorentz authored 3 years ago
  
  We agreed a while ago they are IRIs, and we have some of them in the postgresql database already.
  Tag: v0.32.0
  
  ec2fac44
Jun 15, 2021
- Add endpoints to access REMD by id · 47575a69
  vlorentz authored 3 years ago
  
  This will be used by swh-web to allow downloading them from a non-JSON endpoint.
  Tag: v0.31.0
  
  47575a69
Jun 09, 2021
- mypy: Fix errors with release >= v0.900 · 036d2273
  Antoine Lambert authored 3 years ago
  
  036d2273
May 21, 2021

cassandra: Add partial support for ScyllaDB · 1d880a52

vlorentz authored 3 years ago

All features work except snapshot_count_branches, because ScyllaDB does not
support user-defined aggregates yet.

Migration tests hang when run after the regular tests, but I can't
figure out why. This should not be an issue for now, as we won't run
Scylla tests on the CI.

1d880a52

Finalize the config "local" deprecation in favor of "postgresql" · 8e3731ac

Antoine R. Dumont authored 3 years ago

This will remove further deprecation warnings from the tests, especially the ones from
other modules depending on the storage's pytest-plugin.

This also fixes some edge case configuration for the backfill and the storage rpc
backend which would have been broken if we switched to that new name prior to this.

Related to b487a21f

8e3731ac

May 19, 2021
- tests: Make test parameters order deterministic, so they don't crash pytest-xdist · a92a9684
  vlorentz authored 3 years ago
  
  pytest-xdist expects the parameters to be in the same order in all processes.
  a92a9684
- test_cassandra: Improve error when the process is started but not listening · 5a8d6052
  vlorentz authored 3 years ago
  
  5a8d6052
May 18, 2021
- Make the TenaciousProxyStorage also handle content_add_metadata · 0ed4a975
  David Douard authored 3 years ago
  
  Tag: v0.30.0
  
  0ed4a975
May 14, 2021
- Add missing schema migration for swh_directory_get_entries · 53c21d4c
  Nicolas Dandrimont authored 3 years ago
  
  Tag: v0.29.1
  
  53c21d4c
May 11, 2021

content_get: Add support for queries by sha1_git · f3283679

vlorentz authored 3 years ago

Before this commit, the only way to get Content objects from their sha1_git
was to call content_find for each object.
This was obviously neither convenient nor efficient.

Using this endpoint to batch calls reduces the runtime of the git-bare
vault cooker by 30%.

f3283679

Add endpoint directory_get_entries, to quickly list a directory's entries · e3cbd5ee

vlorentz authored 3 years ago

It spares a join with the content table, which should hopefully make
the vault (and possibly other users) faster when they don't need this
join.

e3cbd5ee

cassandra: Add tests checking directory_add and snapshot_add are atomic. · f140f634
vlorentz authored 3 years ago

f140f634

May 10, 2021
- Deprecate the "local" storage cls in favor of "postgresql" · b487a21f
  David Douard authored 3 years ago
  
  b487a21f
- Move all proxy storages in swh/storage/proxies/ · 91052539
  David Douard authored 3 years ago
  
  to clean up the swh.storage namespace a bit.
  91052539
May 07, 2021

Make the TenaciousProxyStorage retry when a single object add fails · 76170995

David Douard authored 3 years ago

This gives one-object batches a chance to be ingested, and reduces the
number of objects wrongly reported as non-ingested, e.g. during a
replayer session, where this situation can occur.
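
The behavior described above might be sketched as follows, assuming the proxy splits a failing batch in halves until it reaches single objects and then retries those a few times. The function name `tenacious_add`, the bisection strategy, and the `retries` default are assumptions for illustration and may not match the actual TenaciousProxyStorage.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tenacious_add(
    add: Callable[[Sequence[T]], None],  # hypothetical backend insertion call
    objects: Sequence[T],
    retries: int = 3,
) -> List[T]:
    """Try to add objects; return the ones that could not be ingested."""
    try:
        add(objects)
        return []
    except Exception:
        pass
    if len(objects) > 1:
        # Bisect so that one bad object does not sink the whole batch.
        mid = len(objects) // 2
        return tenacious_add(add, objects[:mid], retries) + tenacious_add(
            add, objects[mid:], retries
        )
    # Single-object batch: retry before reporting it as failed.
    for _ in range(retries):
        try:
            add(objects)
            return []
        except Exception:
            continue
    return list(objects)
```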

76170995

May 06, 2021
- Use swh.core 0.14 · 35ae94a4
  vlorentz authored 3 years ago
  
  It renamed db_name to dbname, which is a breaking change.
  Tag: v0.28.0
  
  35ae94a4