- Feb 04, 2022
- Feb 01, 2022
-
-
Nicolas Dandrimont authored
Extend the APIs for Revisions and Releases to honor the displayname field by default, unless the new `ignore_displayname` argument is set.
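A hedged usage sketch of the new argument (the argument name comes from the commit message; the exact endpoint signatures and the memory backend setup are assumptions):

```python
from swh.storage import get_storage

storage = get_storage("memory")  # any swh storage backend exposes the same interface
some_revision_id = b"\x00" * 20  # placeholder sha1_git, for illustration only

# Default: author/committer honor the curated displayname, when one is set.
revisions = storage.revision_get([some_revision_id])

# Opt out: return author/committer exactly as stored at ingestion time.
raw_revisions = storage.revision_get([some_revision_id], ignore_displayname=True)
```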
-
Nicolas Dandrimont authored
It was made flaky by d4ddd415.
-
- Jan 31, 2022
-
-
Nicolas Dandrimont authored
This opens up the possibility of eventually ignoring the `name` and `email` fields stored in the database in favor of parsing them again from the `fullname` field (and therefore updating our parsing logic without having to affect stored data).
-
Nicolas Dandrimont authored
This allows us to populate sensible name and email values out of the new displayname field, without having to store them.
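A toy illustration of the kind of parsing this enables, assuming a git-style `Name <email>` fullname; the real logic lives in swh.model and handles many more edge cases:

```python
from typing import Optional, Tuple

def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    # Derive (name, email) from b"Name <email>"; illustration only.
    open_bracket = fullname.find(b"<")
    if open_bracket == -1:
        return (fullname.strip() or None, None)
    name = fullname[:open_bracket].strip() or None
    email = fullname[open_bracket + 1:].rstrip(b">").strip() or None
    return (name, email)

assert split_fullname(b"Jane Doe <jane@example.org>") == (
    b"Jane Doe",
    b"jane@example.org",
)
```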
-
- Jan 25, 2022
-
-
vlorentz authored
I don't expect directory_get_raw_manifest to be used, but it is needed for tests, so why not.
-
- Jan 21, 2022
-
-
vlorentz authored
It will be replaced by what is currently called 'offset_bytes'
-
vlorentz authored
Only 'offset_bytes' is kept to store the timezone, to support swh-model v5.0.0. However, 'offset' and 'negative_utc' are still written to the postgresql database, just in case we need to roll back this change; they are not read anymore.
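For illustration, this is roughly how the legacy columns can be derived from a well-formed `offset_bytes` value at write time (a simplified sketch, not the actual swh code; real `offset_bytes` values may be arbitrary bytes and the real code handles malformed input):

```python
def legacy_offset_fields(offset_bytes: bytes):
    # Derive the legacy 'offset' (minutes) and 'negative_utc' columns from an
    # offset_bytes value such as b"+0200" or b"-0000".
    sign = -1 if offset_bytes.startswith(b"-") else 1
    hours, minutes = int(offset_bytes[1:3]), int(offset_bytes[3:5])
    negative_utc = offset_bytes == b"-0000"
    return sign * (hours * 60 + minutes), negative_utc

assert legacy_offset_fields(b"+0200") == (120, False)
assert legacy_offset_fields(b"-0000") == (0, True)
```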
-
- Jan 18, 2022
-
-
vlorentz authored
They should be a rare occurrence, so adding these indices allows us to count and enumerate them without expensive full table scans.
-
vlorentz authored
[2022-01-17T16:03:27.448Z] /var/lib/jenkins/workspace/DSTO/tests-on-diff@2/docs/index.rst:25:hardcoded link 'https://archive.softwareheritage.org/api/' could be replaced by an extlink (try using ':swh_web:`api/`' instead)
-
- Jan 12, 2022
-
-
vlorentz authored
Assuming all contents passed to content_missing() have (at least) one missing algo, the function used to perform a number of iterations quadratic in the size of its argument in the worst case (when all contents are found). With this commit, it starts by bucketing them by hash, so it does not need to iterate over *all* found contents for each content passed as argument.
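An illustrative sketch of the bucketing idea (not the actual swh.storage code, and with simplified semantics): index the found rows by hash once, so each queried content is checked in constant time instead of rescanning every found content.

```python
from collections import defaultdict

def missing_contents(queried, found, algos=("sha1", "sha1_git", "sha256")):
    # Bucket every hash of every found content, once.
    found_hashes = defaultdict(set)
    for content in found:
        for algo in algos:
            if content.get(algo):
                found_hashes[algo].add(content[algo])
    # Report a queried content as missing if none of its hashes was found.
    for content in queried:
        if not any(
            content.get(algo) in found_hashes[algo]
            for algo in algos
            if content.get(algo)
        ):
            yield content
```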
-
vlorentz authored
This is twice as fast, according to https://forge.softwareheritage.org/T3577#72791
-
- Jan 06, 2022
-
-
vlorentz authored
Instead of grouping ids in queries in arbitrary batches (which forces the server node to coordinate with other nodes to complete the query), this sends queries with one id each, directly to the right node. This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791 which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
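A rough sketch of the 'concurrent' approach (not the actual swh.storage code, which may rely on the Cassandra driver's own concurrency helpers and prepared statements): send one single-partition query per id, concurrently, so each request goes straight to the node owning that token instead of being fanned out by a coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

def missing_ids(session, table, ids, max_workers=20):
    query = f"SELECT id FROM {table} WHERE id = %s"  # single-partition lookup

    def exists(id_):
        return id_, session.execute(query, (id_,)).one() is not None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return [id_ for id_, found in executor.map(exists, ids) if not found]
```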
-
- Jan 04, 2022
-
-
David Douard authored
- Dec 22, 2021
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also:
- drops spurious copyright headers from those files, if present
- fixes a type issue revealed by the new mypy

Related to T3812
-
- Dec 15, 2021
- Dec 13, 2021
-
-
vlorentz authored
Using `int()` on `date.timestamp()` rounded it up (toward zero), but the semantics of `model.Timestamp` is that the actual time is `ts.seconds + ts.microseconds/1000000`, so all negative dates with a non-zero microseconds component were shifted one second up. In particular, this caused dates from `1969-12-31T23:59:59.000001` to `1969-12-31T23:59:59.999999` (inclusive) to collide with dates from `1970-01-01T00:00:00.000001` to `1970-01-01T00:00:00.999999`, which is how I discovered the issue.
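A minimal reproduction of the off-by-one-second effect in plain Python (not the actual swh code):

```python
import datetime
import math

d = datetime.datetime(
    1969, 12, 31, 23, 59, 59, 500000, tzinfo=datetime.timezone.utc
)
ts = d.timestamp()  # -0.5

# Old behaviour: int() truncates toward zero, shifting negative dates up a second.
assert int(ts) == 0  # collides with 1970-01-01T00:00:00.500000

# Correct behaviour: floor the seconds and keep a non-negative microsecond part.
seconds = math.floor(ts)       # -1
microseconds = d.microsecond   # 500000
assert seconds + microseconds / 1_000_000 == ts
```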
-
- Dec 09, 2021
-
-
vlorentz authored
-
- Dec 08, 2021
-
-
Antoine Lambert authored
Now that we have packaged tenacity 6.2 for debian buster and use it in production, we can remove the workarounds to support tenacity < 5.
-
- Dec 07, 2021
-
-
Antoine Lambert authored
Directory entries are now checked for name duplicates in swh-model so we must ensure the CrashyEntry class is properly initialized. Closes T3776
-
- Nov 09, 2021
-
-
David Douard authored
The idea is that we check the BaseModel validity at journal deserialization time, so that we still have access to the raw object from kafka for complete reporting (object id plus raw message from kafka). This uses a new ModelObjectDeserializer class that is responsible for deserializing the kafka message (still using kafka_to_value), then immediately creating the BaseModel object from that dict. Its `convert` method is then passed as the `value_deserializer` argument of the `JournalClient`. Then, for each object deserialized from kafka, if it is a HashableObject, check its validity by comparing the computed hash with its id. If it is invalid, report the error in logs and, if configured, register the invalid object via the `reporter` callback. In the cli code, a `Redis.set()` is used as such a callback (if configured), so it simply stores invalid objects using the object id as key (typically its swhid) and the raw kafka message value as value. Related to T3693.
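A condensed sketch of the flow described above (illustrative only; `MODEL_CLASSES` is a placeholder name, and the real class lives in swh.storage.replay with possibly different signatures):

```python
import logging

from swh.journal.serializers import kafka_to_value
from swh.model.model import Directory, Release, Revision, Snapshot

logger = logging.getLogger(__name__)

# Placeholder mapping from journal object type to model class.
MODEL_CLASSES = {
    "directory": Directory,
    "release": Release,
    "revision": Revision,
    "snapshot": Snapshot,
}

class ModelObjectDeserializer:
    def __init__(self, reporter=None):
        # reporter: callable taking (key, raw_message), e.g. a Redis.set
        self.reporter = reporter

    def convert(self, object_type, raw_message):
        d = kafka_to_value(raw_message)                 # kafka message -> dict
        obj = MODEL_CLASSES[object_type].from_dict(d)   # dict -> BaseModel
        # HashableObjects can be checked by recomputing their intrinsic id.
        if hasattr(obj, "compute_hash") and obj.compute_hash() != obj.id:
            logger.error("Invalid object: %s", obj.swhid())
            if self.reporter is not None:
                self.reporter(str(obj.swhid()), raw_message)
            return None
        return obj
```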
-
David Douard authored
allowing this dict to be used independently of the fix_objects() function.
-
David Douard authored
keep the fix_objects() function for backward compatibility for now.
-
David Douard authored
allows choosing replayed object types from the cli.
-
vlorentz authored
* merged origin and artifact metadata
* added metametadata
* uses structures instead of dict
* removed raw_extrinsic_metadata_get_latest
-
- Oct 28, 2021
-
-
Antoine Lambert authored
It enables returning, in an efficient way, the list of unique snapshot identifiers resulting from the visits of an origin. Previously it was required to query all visits of an origin, then query all visit statuses for each visit, to extract such information. The introduced method extracts origin snapshot information in a single database query. Related to T3631
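A hedged usage sketch, assuming the new endpoint is a `origin_snapshot_get_all` method taking an origin URL (the name and signature are assumptions based on the description):

```python
from swh.storage import get_storage

storage = get_storage("memory")  # any backend exposes the same interface

# One call instead of iterating over every visit and then every visit status:
snapshot_ids = storage.origin_snapshot_get_all("https://github.com/example/repo")
```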
-
- Oct 22, 2021
-
-
Antoine Lambert authored
Some revisions in the archive do not have a committer date, so work around it to avoid errors when walking such revisions with the CommitterDateRevisionsWalker class.
-
- Oct 21, 2021
-
-
Antoine Lambert authored
-
- Oct 18, 2021
-
-
vlorentz authored
content_missing_by_sha1_git only checks the index and not the main table. This is incorrect, because contents should not be considered written before an entry is written to the main table, even if an entry exists in one of the indexes.
-
- Oct 11, 2021
- Oct 08, 2021
-
-
Nicolas Dandrimont authored
The size of individual revisions and releases is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions or releases with large messages or other metadata. The duration of the database operations for revision_add or release_add is also commensurate with the size of the objects added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on estimated sizes for batches of revision and release objects makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual revisions is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions with extensive histories. The duration of the database operation for revision_add is also commensurate with the number of revision parents added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulated number of revision parents per batch makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual directories is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of (very) large directories. The duration of the database operation for directory_add is also commensurate to the number of directory entries added in a batch, so using the buffer proxy to limit the time individual database operations takes was not effective. Adding a threshold on cumulated number of directory entries per batch makes this overuse of memory and of database transaction time much less likely.