- Oct 18, 2022
-
David Douard authored
Bump:
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0, and
- flake8 from 4.0.1 to 5.0.4.
Also freeze flake8 dependencies, and change flake8's repo config to github (the gitlab mirror being outdated).
-
- Oct 13, 2022
-
David Douard authored
-
David Douard authored
As usual when kafka is involved in tests, the new tests provided here are a bit slow...
-
David Douard authored
instead of depending on the proper behavior of the user of ProvenanceStoragePostgresql.
-
David Douard authored
it's not used, and keeping it makes the code unnecessarily complex.
-
- Oct 12, 2022
-
David Douard authored
-
- Oct 11, 2022
-
David Douard authored
The new ProvenanceStorageJournal is a ProvenanceStorageInterface proxy that pushes added objects to a swh-journal (typically a kafka broker). Journal messages are simple dicts with 2 keys: id (the sharding key) and value (a serializable version of the argument of the xxx_add() method). Use the 'kafka' pytest marker for all kafka-related tests (especially used for tox, see tox.ini).
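A minimal sketch of the message shape described above; the helper name and types are illustrative assumptions, not the actual ProvenanceStorageJournal code:

```python
from typing import Any, Dict, List

Sha1Git = bytes  # 20-byte sha1 identifier, as used throughout swh


def journal_messages(added: Dict[Sha1Git, Any]) -> List[Dict[str, Any]]:
    """Build one {id, value} message per object passed to an xxx_add() call."""
    return [
        {"id": sha1, "value": value}  # "id" doubles as the sharding key
        for sha1, value in added.items()
    ]
```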
-
David Douard authored
and fix all occurrences of the typo.
-
David Douard authored
Make them all accept a Dict[Sha1Git, xxx] as argument, i.e.:
- remove support for Iterable[bytes] in revision_add, and
- replace Iterable[bytes] with Dict[Sha1Git, bytes] for location_add.
Currently, the sha1 of the location path in location_add() is not really used by any backend, so computing these hashes is a waste of resources, but it makes the API of this interface much more consistent, which will be helpful for coming features (like the kafka journal).
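A hedged sketch of the unified argument shape; the helper name is illustrative, and using plain hashlib.sha1 for the path hash is an assumption:

```python
import hashlib
from typing import Dict, Iterable

Sha1Git = bytes


def locations_by_sha1(paths: Iterable[bytes]) -> Dict[Sha1Git, bytes]:
    """Key each location path by its sha1, matching the new interface."""
    return {hashlib.sha1(path).digest(): path for path in paths}


# e.g. storage.location_add(locations_by_sha1([b"src/main.c", b"README.md"]))
```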
-
- Oct 03, 2022
-
David Douard authored
Move ingestion-related code for the 3 "layers" into an algos/ submodule.
-
David Douard authored
- move everything (swh-)archive related into an archive/ submodule,
- move everything provenance-storage related into a storage/ submodule (which remains a not-ideal name, as it may be confused with the general 'storage == swh-storage' meaning in swh),
- rename rabbitmq's backend from api/ to storage/rabbitmq, and
- split interface.py in 3 parts (one for each interface: ProvenanceInterface, ProvenanceStorageInterface and ArchiveInterface).
-
David Douard authored
so that one can easily run tests for the revision or origin layer only.
-
David Douard authored
-
David Douard authored
-
- Sep 08, 2022
-
vlorentz authored
'python' may be Python 2, according to https://peps.python.org/pep-0394/#for-python-runtime-distributors
-
David Douard authored
This allows us to ignore git bombs and other suspiciously large repos.
-
- Sep 01, 2022
-
David Douard authored
Replace the (deprecated) HTTP RPC API used to access the swh-graph service in favor of the grpc server. To be able to test the (now) grpc-based ArchiveGraph, compressed graph versions of all 3 common test datasets (cmdbts2, out-of-order and with-merges) have been generated and included in this revision.
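A rough sketch of connecting to a swh-graph gRPC server; the address is hypothetical and the stub name mentioned in the comment is an assumption, not the exact generated API:

```python
import grpc

GRAPH_GRPC_SERVER = "localhost:50091"  # hypothetical local test server

channel = grpc.insecure_channel(GRAPH_GRPC_SERVER)
# Wait until the server is reachable (raises grpc.FutureTimeoutError otherwise).
grpc.channel_ready_future(channel).result(timeout=5)
# A generated stub, e.g. TraversalServiceStub(channel) (name assumed), would
# then issue traversal requests over this channel instead of HTTP calls.
channel.close()
```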
-
David Douard authored
This will be needed for testing the grpc swh-graph archive backend.
-
- Aug 30, 2022
-
David Douard authored
-
- Aug 12, 2022
-
Nicolas Dandrimont authored
When the visit_edges response is empty, swh.graph.client generates an empty tuple, which can't be unpacked. Work around the issue.
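A hedged sketch of the workaround: skip empty tuples before unpacking the (src, dst) edge pairs. The function name and response shape are illustrative:

```python
from typing import Iterable, Iterator, Tuple


def iter_edges(response: Iterable[tuple]) -> Iterator[Tuple[str, str]]:
    """Yield (src, dst) pairs, tolerating an empty visit_edges response."""
    for edge in response:
        if not edge:  # empty tuple: nothing to unpack, skip it
            continue
        src, dst = edge
        yield src, dst
```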
-
Nicolas Dandrimont authored
We're always passing the provenance-internal object types, not those of swh.storage.
-
Nicolas Dandrimont authored
Replace `revision_get_parents` with `revision_get_some_outbound_edges`, which can optionally retrieve more levels of history than just a single one. This allows us to do way fewer queries on the swh.graph or swh.storage backend if the revision exists there. The swh.storage backend does limited recursion, so we still process the origin in multiple steps to fetch the whole history.
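A sketch of the multi-step history walk described above. The method name comes from the commit message; its exact signature and return shape (here, an iterable of (src, dst) revision edges) are assumptions:

```python
def fetch_full_history(archive, heads):
    """Expand outbound edges batch by batch until the history is complete."""
    seen, frontier = set(), set(heads)
    edges = []
    while frontier:
        batch = list(archive.revision_get_some_outbound_edges(frontier))
        seen |= frontier
        edges.extend(batch)
        # Only recurse from revisions we have not queried yet.
        frontier = {dst for _, dst in batch} - seen
    return edges
```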
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
The context manager for the provenance storage rabbitmq client doesn't like being used multiple times over the lifetime of a process. Only use it once in the cli of the journal client.
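A sketch of the fix under that constraint: enter the storage's context manager exactly once per process instead of once per batch. All names here are illustrative, not the actual cli code:

```python
def run_journal_client(get_storage, batches):
    """Process all batches inside a single storage context."""
    with get_storage() as storage:  # single __enter__/__exit__ per process
        for batch in batches:
            storage.revision_add(batch)  # assumed add method, for illustration
```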
-
Nicolas Dandrimont authored
Instead of flushing if any entry is over the threshold, flush when the cumulative count goes over.
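A sketch of the new flush policy, assuming a simple buffering wrapper (all names are illustrative): keep a running total of buffered entries and flush once the cumulative count crosses the threshold, instead of flushing whenever a single batch exceeds it:

```python
class CumulativeBuffer:
    def __init__(self, flush, threshold=1000):
        self._flush = flush
        self._threshold = threshold
        self._batches, self._count = [], 0

    def add(self, entries):
        self._batches.append(entries)
        self._count += len(entries)
        if self._count >= self._threshold:  # cumulative count, not per-batch
            self._flush(self._batches)
            self._batches, self._count = [], 0
```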
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
The incremental copy of the archive to mmca is not atomic: the directory table needs to be copied first, then the directory_entry_* tables need to be updated. This means that the client can view inconsistent entries, where the directory has been synced but not all the entry rows. We return an empty list when one of these bogus entries is detected. This allows smooth fallback to the main database through the multiplexer.
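An illustrative sketch of that guard; how a bogus entry is actually detected is an assumption here (a stored entry count), as are all names. Returning an empty list lets the multiplexer fall back to the main database:

```python
def directory_entries(db, directory_id):
    """Return a directory's entries, or [] if the copy looks inconsistent."""
    directory = db.directory_get(directory_id)
    entries = db.directory_entries_get(directory_id)
    if directory is not None and len(entries) < directory.entry_count:
        return []  # partially synced directory: defer to the main database
    return entries
```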
-
Nicolas Dandrimont authored
The partial copy of the archive on mmca doesn't have them anyway.
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
The retry logic is not very refined, extending the timeouts makes more sense.
-
Nicolas Dandrimont authored
This is not quite working but it seems to reduce issues on worker termination a bit.
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-