Commits · 3eff720a0c2cabb756a4635bb107659cae3e903a · Platform / Development / swh-storage

Mar 23, 2022

Add support for author=None and committer=None · 3eff720a

vlorentz authored 3 years ago

committer=None happens on some malformed commits generated by old dgit
versions, and author=None can happen for the same reason.

For now, this is not supported by swh-model, so tests temporarily
disable attrs checks that swh-model relies on.

3eff720a

Mar 22, 2022

pytest: Exclude build directory for tests discovery · 92c78ab5

Antoine Lambert authored 3 years ago

setuptools copies test modules into subdirectories of the build
directory, which makes pytest fail with ImportPathMismatchError
exceptions when invoked from the root directory of the module.

So ignore the build folder when discovering tests.

92c78ab5

Mar 16, 2022

backfill: Add missing raw_manifest to directories · 98b41c85

vlorentz authored 3 years ago

This was not covered by tests so far, because
swh.model.tests.swh_model_data.TEST_OBJECTS did not contain
any object with a raw_manifest.

But it will in swh-model > 5.0.0

98b41c85

Mar 15, 2022
- backfill: Make integer_ranges() work on str args + add typing to RANGE_GENERATORS · 8b65e42f
  vlorentz authored 3 years ago
  
  Without the type annotation, mypy errors with 'Cannot call function of unknown type' when called from a type-checked function.
  8b65e42f
- postgresql: Remove unused listener code from db.py · 77f7e6d1
  vlorentz authored 3 years ago
  
  77f7e6d1
Mar 11, 2022

origin_visit_get_latest: Order by visit id instead of date · ccde0975

vlorentz authored 3 years ago

This allows both the postgresql and cassandra backends to make efficient
queries by using an index (resp. clustering key) instead of scanning
all visits of the given origin then sorting by date.

This does not affect the results in the vast majority of cases,
as ids are always in increasing chronological order, unless an origin was
re-loaded from an old archive.
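
As an illustration of the trade-off described above, here is a minimal sketch using plain Python tuples in place of database rows. The visit ids, dates, and helper names are all hypothetical and are not taken from swh-storage; the sketch only shows when the two orderings agree and when they diverge.

```python
from datetime import date

# Hypothetical (visit id, visit date) pairs for a single origin.
visits = [(1, date(2020, 1, 1)), (2, date(2021, 1, 1)), (3, date(2022, 1, 1))]


def latest_by_id(rows):
    """Pick the visit with the highest id (what an index on the id can serve)."""
    return max(rows, key=lambda row: row[0])


def latest_by_date(rows):
    """Pick the visit with the most recent date (needs all visits to be compared)."""
    return max(rows, key=lambda row: row[1])


# Usual case: ids grow with time, so both orderings agree.
same_in_usual_case = latest_by_id(visits) == latest_by_date(visits)

# Origin re-loaded from an old archive: the newest id carries an older date,
# so the two orderings pick different visits.
reloaded = visits + [(4, date(2019, 6, 1))]
differs_after_reload = latest_by_id(reloaded) != latest_by_date(reloaded)
```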

ccde0975

origin_visit_get_latest: Materialize subquery on 'origin' table. · 600e87fb

vlorentz authored 3 years ago

postgresql's query planner does not understand the origin is unique, so it performs
a partial index scan on origin_visit_pkey, which is inefficient on origins with
many visits.

This commit itself is not enough to make it use the proper index,
but provides this necessary change that will be used by a future commit.

600e87fb

postgresql: Increase timeouts that often fail · b0cdab58

vlorentz authored 3 years ago

According to Sentry, in the last 30 days:

* directory_entry_get_by_path: 958 events, https://sentry.softwareheritage.org/share/issue/c4c2124953a145b2bd325f6f6b7df5a6/
* revision_get: 841 events, https://sentry.softwareheritage.org/share/issue/55fbe01c6f4d4c9bbf684c7608a62ad9/
* release_get: 14 events, https://sentry.softwareheritage.org/share/issue/37c53354541b4c4eaa1faf4e20a68418/
* origin_visit_find_by_date: 114 events, https://sentry.softwareheritage.org/share/issue/a674c12049a941968a717661a0226559/
* origin_get: 79 events, https://sentry.softwareheritage.org/share/issue/bf21d6bc7b24442eb18643d80d936d27/ ; 67 events, https://sentry.softwareheritage.org/share/issue/010a4b1e085a4e2089ba4897c6de6038/

b0cdab58

Mar 08, 2022
- Remove aiohttp from requirements.txt · 4e78014b
  David Douard authored 3 years ago
  
  It is not used by swh.storage.
  4e78014b
Mar 02, 2022

Move metrics handling from backends to RPC server · 284a4ab3

vlorentz authored 3 years ago

Motivation: this replaces code duplicated across the backends with a single
implementation, to be consistent with the objstorage (which has many more backends).

This also fixes the issue of metrics from 'extid_add' being missing
when using the postgresql storage.

284a4ab3

Feb 24, 2022

Update for swh.core 2.0.0 · 215162b2

David Douard authored 3 years ago

- Add the entry points expected by the new db handling features of swh.core 2:

  - add a ``swh.storage.get_datastore()`` function
  - add a ``swh.storage.postgresql.storage.Storage.get_current_version()`` method
  - move sql migration scripts to ``swh/storage/sql/upgrades``
  - modify sql initialization scripts to match swh.core 2 (remove
    dbversion management code).

- Update tests to use the new template-based database handling; this
  should have only minimal impact on test execution performance.

215162b2

Add types-toml to requirements-test.txt · 386fb4d6
David Douard authored 3 years ago

386fb4d6

Feb 10, 2022
- pre-commit: Bump hooks and add new one to check commit message spelling · f578377d
  Antoine Lambert authored 3 years ago
  
  To install the new hook: $ pre-commit install -t commit-msg
  f578377d
Feb 04, 2022
- revision_walker: Actually pass ignore_displayname to revision_log · 8bf07c39
  vlorentz authored 3 years ago
  
  I somehow forgot to stage this change in the previous commit.
  Tag: v0.43.1
  
  8bf07c39
- revision_walker: Add support for ignore_displayname. · 0f9f54b7
  vlorentz authored 3 years ago
  
  This is needed by the vault.
  0f9f54b7
- Add typing to revision_walker.py and make the state a dataclass · 75a7f093
  vlorentz authored 3 years ago
  
  75a7f093
- Require pytest to be <7.0.0 · a3a63d8d
  vlorentz authored 3 years ago
  
  a3a63d8d
Feb 01, 2022
- Introduce a new displayname field for persons in the PostgreSQL storage · 4544d7ca
  Nicolas Dandrimont authored 3 years ago
  
  Extend the APIs for Revisions and Releases to honor the field by default, unless the new `ignore_displayname` argument is set.
  Tag: v0.43.0
  
  4544d7ca
- Make test_release_add_get_arbitrary non-flaky · 97caa933
  Nicolas Dandrimont authored 3 years ago
  
  It was made flaky by d4ddd415.
  97caa933
Jan 31, 2022

Mostly use normalized Person objects in tests · f868f3c8

Nicolas Dandrimont authored 3 years ago

This opens up the possibility of eventually ignoring the `name` and
`email` fields stored in database in favor of parsing them again from
the fullname field (and therefore to update our parsing logic without
having to affect stored data).

f868f3c8

postgresql: Use Person.from_fullname if name and email are None · d4ddd415

Nicolas Dandrimont authored 3 years ago

This allows us to populate sensible name and email values out of the new
displayname field, without having to store them.

d4ddd415

Jan 25, 2022
- Fix directory_add to actually insert the manifest + add directory_get_raw_manifest · 6f025246
  vlorentz authored 3 years ago
  
  I don't expect directory_get_raw_manifest to be used, but it is needed for tests, so why not.
  Tag: v0.42.0
  
  6f025246
Jan 21, 2022

Stop using the deprecated 'TimestampWithTimezone.offset' attribute · 58749057
vlorentz authored 3 years ago
It will be replaced by what is currently called 'offset_bytes'.
58749057

Remove 'offset' and 'negative_utc' · 2e741388

vlorentz authored 3 years ago

This only keeps 'offset_bytes' to store the timezone, to support swh-model
v5.0.0.

However, this keeps writing 'offset' and 'negative_utc' to the postgresql
database, just in case we need to roll back this change.
But they are not read anymore.

2e741388

Jan 18, 2022

postgres: Add indices to keep track of objects with a raw_manifest · c68a4fd9
vlorentz authored 3 years ago
They should be a rare occurrence, so adding these indices allows us to count
and enumerate them without expensive full table scans.
Tag: v0.41.2

c68a4fd9

Fix sphinx error · 228de337

vlorentz authored 3 years ago

[2022-01-17T16:03:27.448Z] /var/lib/jenkins/workspace/DSTO/tests-on-diff@2/docs/index.rst:25:hardcoded link 'https://archive.softwareheritage.org/api/' could be replaced by an extlink (try using ':swh_web:`api/`' instead)

228de337

Jan 12, 2022

cassandra: Make content_missing run in linear time instead of quadratic · 40a57d43

vlorentz authored 3 years ago

Assuming all contents passed to content_missing() have (at least) a missing algo,
the function used to iterate over the size of the arg squared
in the worst case (when all contents are found).

With this commit, it starts with bucketing them by hash, so it does not
need to iterate over *all* found contents for each content passed as arg.
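
The difference between the two approaches can be sketched as follows. This is a simplified illustration, not the code of `content_missing()`: contents are reduced to dicts with a single hypothetical `sha1` key, and both function names are made up for the example.

```python
def missing_quadratic(contents, found, algo="sha1"):
    """Old shape: for each queried content, scan every found content."""
    return [
        content[algo]
        for content in contents
        if not any(row[algo] == content[algo] for row in found)
    ]


def missing_bucketed(contents, found, algo="sha1"):
    """New shape: bucket the found contents by hash once, then look each query up."""
    buckets = {}
    for row in found:
        buckets.setdefault(row[algo], []).append(row)
    return [content[algo] for content in contents if content[algo] not in buckets]


# Hypothetical data: three queried contents, two of which are found.
queried = [{"sha1": b"a"}, {"sha1": b"b"}, {"sha1": b"c"}]
found_rows = [{"sha1": b"a"}, {"sha1": b"c"}]
```

Both functions return the same result; only the number of comparisons changes, from one scan of `found` per queried content to one dictionary lookup per queried content.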

40a57d43

cassandra: Rewrite content_missing to run queries concurrently. · d5f1f0ec
vlorentz authored 3 years ago
This is twice as fast, according to
https://forge.softwareheritage.org/T3577#72791
d5f1f0ec

Jan 06, 2022

cassandra: Use concurrent queries in *_missing() instead of naive grouping · 4a245050

vlorentz authored 3 years ago

Instead of grouping ids in queries in arbitrary batches (which forces
the server node to coordinate with other nodes to complete the query),
this sends queries with one id each, directly to the right node.

This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791
which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
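
A rough sketch of the "one query per id, run concurrently" idea, using a thread pool and an in-memory set as a stand-in for the database. `STORED_IDS`, `id_exists`, and `missing_concurrent` are hypothetical names for illustration; the actual commit goes through the Cassandra driver, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the ids already present in the database.
STORED_IDS = {b"\x01", b"\x03"}


def id_exists(object_id):
    """Stand-in for a single-id query, which can be routed to the node owning that id."""
    return object_id in STORED_IDS


def missing_concurrent(ids, max_workers=8):
    """Issue one lookup per id concurrently and return the ids that were not found."""
    ids = list(ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        present = list(pool.map(id_exists, ids))
    return [object_id for object_id, ok in zip(ids, present) if not ok]


missing = missing_concurrent([b"\x01", b"\x02", b"\x03", b"\x04"])
```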

4a245050

Jan 04, 2022
- Improve documentation of the replay command · 259bf6fe
  David Douard authored 3 years ago
  
  Tag: v0.41.1
  
  259bf6fe
- Move the 'error_reporter' config entry in a dedicated 'replayer' section · 1071781d
  David Douard authored 3 years ago
  
  1071781d
Dec 22, 2021
- Add columns {,committer_}date_offset to rev/rel and raw_manifest to dir/rev/rel · f3232e66
  vlorentz authored 3 years ago
  
  Tag: v0.41.0
  
  f3232e66
Dec 16, 2021

Pin mypy and drop type annotations which make mypy unhappy · f09a54d4

Antoine R. Dumont authored 3 years ago

This also:
- drops spurious copyright headers from those files, if present
- fixes a type issue revealed by the new mypy

Related to T3812

f09a54d4

Dec 15, 2021
- Add tests checking round-tripping of dir/rev/rel/snp objects generated by Hypothesis · c40ceb32
  vlorentz authored 3 years ago
  
  c40ceb32
- Add test_revision_add_fractional_timezone · 45687d8c
  vlorentz authored 3 years ago
  
  45687d8c
Dec 13, 2021

postgresql: Fix off-by-one error in db_to_date on negative dates · fb1b3a06

vlorentz authored 3 years ago

Using `int()` on `date.timestamp()` rounded it up (toward zero),
but the semantics of `model.Timestamp` is that the actual time is
`ts.seconds + ts.microseconds/1000000`, so all negative dates were
shifted one second up.

In particular, this causes dates from
`1969-12-31T23:59:59.000001` to `1969-12-31T23:59:59.999999`
(inclusive) to smash into dates from
`1970-01-01T00:00:00.000001` to `1970-01-01T00:00:00.999999`,
which is how I discovered the issue.
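
The collision can be reproduced with the standard library alone. The sketch below is not the code of `db_to_date`; `split_truncating` and `split_flooring` are hypothetical helpers that only contrast `int()` with `math.floor()` when splitting a date into (seconds, microseconds).

```python
import math
from datetime import datetime, timezone


def split_truncating(date):
    """Buggy variant: int() truncates toward zero, shifting negative dates up."""
    return int(date.timestamp()), date.microsecond


def split_flooring(date):
    """Variant where seconds + microseconds / 1000000 is the actual time."""
    return math.floor(date.timestamp()), date.microsecond


before_epoch = datetime(1969, 12, 31, 23, 59, 59, 500000, tzinfo=timezone.utc)
after_epoch = datetime(1970, 1, 1, 0, 0, 0, 500000, tzinfo=timezone.utc)

# With truncation, both dates map to the same (seconds, microseconds) pair.
collides = split_truncating(before_epoch) == split_truncating(after_epoch)

# With flooring, the date before the epoch keeps a negative seconds value.
fixed = split_flooring(before_epoch)
```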

fb1b3a06

Dec 09, 2021
- postgresql: Add tests for db_to_date. · 34ca67e2
  vlorentz authored 3 years ago
  
  34ca67e2
Dec 08, 2021

proxies/retry: Remove no longer needed tenacity workarounds · 7cb4128e

Antoine Lambert authored 3 years ago

Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.

7cb4128e

Dec 07, 2021

test_cassandra: Fix failing tests since swh-model update · 615fb99e

Antoine Lambert authored 3 years ago

Directory entries are now checked for name duplicates in swh-model
so we must ensure the CrashyEntry class is properly initialized.

Closes T3776

615fb99e

Nov 09, 2021

Add support for redis-based reporting of invalid mirrored objects · 850a7553

David Douard authored 3 years ago

The idea is that we check the BaseModel validity at journal
deserialization time so that we still have access to the raw object from
kafka for complete reporting (object id plus raw message from kafka).

This uses a new ModelObjectDeserializer class that is responsible for
deserializing the kafka message (still using kafka_to_value) and then
immediately creating the BaseModel object from that dict. Its `convert`
method is then passed as the `value_deserializer` argument of the
`JournalClient`.

Then, for each deserialized object from kafka, if it's a HashableObject,
check its validity by comparing the computed hash with its id.

If it's invalid, report the error in logs and, if configured, register the
invalid object via the `reporter` callback.

In the cli code, a `Redis.set()` is used as such a callback (if configured).
So it simply stores invalid objects using the object id as key (typically its
swhid), and the raw kafka message value as value.
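
The check-then-report flow described above can be sketched as follows. This is an illustration under assumptions, not the ModelObjectDeserializer code: `compute_id` and `check_object` are hypothetical, a plain SHA-1 of the payload stands in for the model's id computation, and a dict's `__setitem__` stands in for `Redis.set()` since both take a key and a value.

```python
import hashlib
import logging

logger = logging.getLogger(__name__)


def compute_id(payload):
    """Hypothetical stand-in for the model's hash computation."""
    return hashlib.sha1(payload).digest()


def check_object(object_id, payload, raw_message, reporter=None):
    """Compare the computed hash with the claimed id; report the object if they differ."""
    if compute_id(payload) == object_id:
        return True
    logger.error("Invalid object %s", object_id.hex())
    if reporter is not None:
        # Key: the object id; value: the raw message, kept for complete reporting.
        reporter(object_id.hex(), raw_message)
    return False


# A dict stands in for the Redis instance in this sketch.
invalid_objects = {}
reporter = invalid_objects.__setitem__

payload = b"hello"
valid = check_object(compute_id(payload), payload, b"raw-1", reporter)
invalid = check_object(b"\x00" * 20, payload, b"raw-2", reporter)
```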

Related to T3693.

850a7553