  1. Oct 23, 2022
    • Fix documentation build · 82ad28bb
      vlorentz authored
      It was broken by 0e8da810 because the conditional import prevents
      Sphinx from detecting where the class is imported from (see the
      sketch below).
      
      Additionally, this attribute should not be on the interface,
      because proxies do not have it, but they are expected to have
      that interface.
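      
      For context, a minimal sketch of the kind of guarded import at issue
      (hypothetical module and class names, not the actual swh-storage code):
      
      ```
      # Hypothetical illustration (not the actual swh-storage code): the guard
      # keeps the module importable without the optional dependency, but Sphinx
      # can no longer resolve which module the class is really imported from.
      try:
          from swh.journal.writer import JournalWriter  # optional dependency
      except ImportError:
          JournalWriter = None  # type: ignore
      ```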
      v1.7.2
  2. Oct 21, 2022
    • tests: only flush() the kafka journal writer once per test · 0e8da810
      David Douard authored
      instead of flushing it n times per test. Since a call to the Kafka
      Producer.flush() takes about 1s, reducing the number of calls to this
      method significantly reduces the execution time of the tests.
      
      This required a small refactoring of the JournalBackfiller class to make
      the journal writer live outside the scope of run(), so the tests can
      access the journal writer instance and call its flush() method (see the
      sketch below).
      
      Requires swh-journal >= 1.2.0
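      
      A minimal sketch of the shape of that refactoring (illustrative names,
      not the actual swh-storage code; _iter_objects is a placeholder):
      
      ```
      from swh.journal.writer import get_journal_writer
      
      class JournalBackfiller:
          def __init__(self, config):
              # Created once on the instance so it outlives run(), letting
              # the tests reach it and flush a single time per test.
              self.writer = get_journal_writer(**config)  # e.g. cls="kafka"
      
          def run(self, object_type, start, end):
              for obj in self._iter_objects(object_type, start, end):
                  # _iter_objects stands in for the backfill read loop
                  self.writer.write_addition(object_type, obj)
              # no flush() here anymore: the caller decides when to flush
      
      def test_backfill(backfiller):
          backfiller.run("revision", start=None, end=None)
          backfiller.writer.flush()  # one ~1s Producer.flush() per test
      ```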
    • Make the replayer not crash on kafka messages that fail to be converted as model objects · fe0eaee8
      David Douard authored
      For example, a few directory messages in the current production Kafka
      cluster have entries containing the same name twice, which prevents the
      Directory model object from being created at all and makes the replayer
      crash.
      
      This change makes the replayer able to handle such cases: when the model
      object creation fails with a ValueError, the error is reported to the
      (Redis) error reporter, but the replaying process continues.
      
      Since there is no model object, the error is reported with a crafted
      error key of the form "{object_type}:{object_id}" if an object id is
      present in the data structure, or "{object_type}:uuid:{uuid4}" if no such
      id is present at all. For the record, the standard error key in Redis for
      a model object is its SWHID (if any).
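      
      A sketch of that handling (converter and reporter names are illustrative
      placeholders, not the actual replayer code):
      
      ```
      import uuid
      
      # `converters` maps an object type to its model-object constructor and
      # `report` writes to the Redis error reporter; both are placeholders
      # for the real replayer machinery.
      def replay_message(object_type, message, converters, report):
          try:
              return converters[object_type](message)  # may raise ValueError
          except ValueError as exc:
              object_id = message.get("id")
              key = (
                  f"{object_type}:{object_id}"
                  if object_id is not None
                  else f"{object_type}:uuid:{uuid.uuid4()}"
              )
              report(key, str(exc))  # record the failure in Redis...
              return None  # ...and keep replaying the remaining messages
      ```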
  3. Sep 29, 2022
    • postgresql: Remove merge join with origin_visit in origin_visit_get_latest · 657d31f6
      vlorentz authored
      I noticed that `origin_visit_get_latest` spends a lot of time doing index
      scans on `origin_visit_pkey`:
      
      ```
      swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                            QUERY PLAN
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
                     ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                           Index Cond: (origin = $0)
                           Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                           Rows Removed by Filter: 198
                     ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                           ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                                 Index Cond: (id = $0)
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
                     Index Cond: (origin = $0)
       Planning Time: 0.577 ms
       Execution Time: 1432.532 ms
      (18 rows)
      ```
      
      As far as I understand, this is because we do not have a foreign key to
      tell the planner that every row in `origin_visit_status` has a
      corresponding row in `origin_visit`, so it checks every row from
      `origin_visit_status` in this loop.
      
      Therefore, I rewrote the query to use a `LEFT JOIN`, which spares this
      check.
      
      First, here is the original query:
      
      ```
      swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Change the columns to filter directly on the "materialized" fields in ovs
      instead of those in ov (no actual plan change yet):
      
      ```
      swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Then, reorder the tables (obviously no plan change either):
      
      ```
      swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Finally, replace `INNER JOIN` with `LEFT JOIN`:
      
      ```
      swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..35.47 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
                     Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
      (10 rows)
      ```
      
      This would also work with a subquery just to get the value of `ov.date`,
      removing the actual join to `ov` entirely, but that was more annoying to
      implement because the function reuses `self.origin_visit_select_cols` as
      the column list; a rough sketch of that alternative follows below.
      
      All these EXPLAIN queries were run on staging.
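      
      For reference, the subquery alternative mentioned above could look
      roughly like this (an illustrative sketch, not what was merged):
      
      ```
      # Illustrative sketch (not what was merged): fetch ov.date through a
      # correlated subquery instead of joining origin_visit, which would mean
      # replacing self.origin_visit_select_cols with an explicit column list.
      origin_visit_get_latest_query = """
          SELECT ovs.*,
                 (SELECT date FROM origin_visit ov
                  WHERE ov.origin = ovs.origin AND ov.visit = ovs.visit
                 ) AS visit_date
          FROM origin_visit_status ovs
          WHERE ovs.origin = %(origin)s
            AND ovs.snapshot IS NOT NULL
            AND ovs.status = 'full'
          ORDER BY ovs.visit DESC, ovs.date DESC
          LIMIT 1
      """
      ```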
      v1.7.0
    • conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed · 44616afe
      vlorentz authored
      The hack crashes with pytest-cov >= 4 because
      'pytest_cov.embed.multiprocessing_start' is not in the hook list anymore.
      
      https://pytest-cov.readthedocs.io/en/latest/changelog.html
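      
      One way to express such a guard in conftest.py (a sketch, assuming the
      hack is keyed on the presence of the old hook):
      
      ```
      # Probe for the old hook instead of patching it unconditionally;
      # 'pytest_cov.embed.multiprocessing_start' was removed in pytest-cov 4.
      try:
          from pytest_cov.embed import multiprocessing_start  # pytest-cov < 4
      except ImportError:
          multiprocessing_start = None
      
      if multiprocessing_start is not None:
          ...  # install the old multiprocessing coverage hack here
      ```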
  4. Aug 22, 2022
    • origin_visit_add: Fix crash when adding multiple visits to the same origin simultaneously · b5836bab
      vlorentz authored
      This works by adding a RW lock on the row of the latest visit, which
      should block other transactions until the insertion is committed; other
      transactions will therefore generate a different (larger) visit id.
      
      This commit also slightly rewrites how the max visit id is
      computed, as we need to actually select a row to lock it,
      instead of using the `max()` aggregate function.
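      
      A sketch of the locking idea (illustrative SQL and names, not the actual
      swh-storage queries; `cur` is an open database cursor):
      
      ```
      # Lock the row of the latest visit with FOR UPDATE: concurrent inserts
      # for the same origin block here until this transaction commits, so
      # they each see a different latest visit and allocate a larger id.
      cur.execute(
          """
          SELECT visit FROM origin_visit
          WHERE origin = %s
          ORDER BY visit DESC
          LIMIT 1
          FOR UPDATE
          """,
          (origin_id,),
      )
      row = cur.fetchone()
      next_visit_id = (row[0] + 1) if row else 1
      ```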