Commits · debian/1.7.1-1_swh1 · vlorentz / Storage manager

Oct 19, 2022
- Updated debian changelog for version 1.7.1 · 2b68f268
  Jenkins for Software Heritage authored 2 years ago
  
  debian/1.7.1-1_swh1
  
  2b68f268
- Update upstream source from tag 'debian/upstream/1.7.1' · 735340cb
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '1.7.1'
with Debian dir f05f0e9091b40ea57a67d50d05421042fd01878e
```
  735340cb
- New upstream version 1.7.1 · 086d974f
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/1.7.1
  
  086d974f
- test_retry: Use proper way to mock sleep of retryable storage methods · 3c08d9f0
  Antoine Lambert authored 2 years ago
```
The previous implementation did not mock the sleep of retryable storage methods,
because RetryingProxyStorage sets up the retry features when it is instantiated.

So modify fixture to ensure sleep functions are mocked and return the mocks
in a dict indexed by storage method names.

This fixes debian buster package build for swh-storage.
```
  v1.7.1
  
  3c08d9f0
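
The pitfall described in 3c08d9f0 can be illustrated with a minimal sketch. `RetryingProxy` and `mock_sleeps` are hypothetical stand-ins, not the actual swh-storage classes or fixtures. The proxy captures its sleep function at instantiation, so the mock has to be installed on the already-built wrappers, and the mocks are returned in a dict keyed by method name.

```python
import time
from unittest import mock


class RetryingProxy:
    """Hypothetical retrying wrapper; it binds its sleep function when instantiated."""

    def __init__(self, func, attempts=3):
        self.func = func
        self.attempts = attempts
        # Bound once here: patching `time.sleep` later does not change this attribute.
        self.sleep = time.sleep

    def __call__(self, *args, **kwargs):
        for attempt in range(1, self.attempts + 1):
            try:
                return self.func(*args, **kwargs)
            except ValueError:
                if attempt == self.attempts:
                    raise
                self.sleep(1)


def mock_sleeps(proxies):
    """Replace the sleep of each proxy and return the mocks indexed by method name."""
    mocks = {}
    for name, proxy in proxies.items():
        proxy.sleep = mocks[name] = mock.Mock()
    return mocks
```

A test can then assert on `mocks["some_method"].call_count` without waiting for real delays.
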
Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · c1c2dbf0

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

c1c2dbf0

Oct 17, 2022
- docs: Add info about CPAN extrinsic metadata format · 17d9ad23
  Antoine Lambert authored 2 years ago
```
Related to T2833
```
  17d9ad23
Oct 10, 2022
- Updated debian changelog for version 1.7.0 · 7dc44381
  Jenkins for Software Heritage authored 2 years ago
  
  debian/1.7.0-1_swh1
  
  7dc44381
- Update upstream source from tag 'debian/upstream/1.7.0' · bbd7b5bd
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '1.7.0'
with Debian dir 81b2401295e3498ed142bd16cdfa97d2cb619643
```
  bbd7b5bd
- New upstream version 1.7.0 · e8896bd8
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/1.7.0
  
  e8896bd8
Sep 29, 2022

postgresql: Remove merge join with origin_visit in origin_visit_get_latest · 657d31f6

vlorentz authored 2 years ago

I noticed that `origin_visit_get_latest` spends a lot of time doing index
scans on `origin_visit_pkey`:

```
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 rows)
```

As far as I understand, this is because we do not have a FK to tell the
planner that every row in `origin_visit_status` does have a
corresponding row in `origin_visit`, so it checks every row from
`origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare
this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Change columns to filter directly on the "materialized" fields in ovs
instead of those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 rows)
```

This would also work with a subquery just to get the value of `ov.date`
and removing the actual join to `ov` entirely, but it was more annoying
to implement because the function reuses `self.origin_visit_select_cols`
as column list.

All these EXPLAIN queries were run on staging.

657d31f6
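
The rewrite in 657d31f6 relies on every `origin_visit_status` row having a matching `origin_visit` row, in which case the `INNER JOIN` and `LEFT JOIN` forms return the same row. The sketch below checks only that equivalence, using an in-memory SQLite database with a simplified, assumed schema and made-up data. It does not reproduce the PostgreSQL planner behaviour shown in the EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns needed for the comparison.
conn.executescript(
    """
    CREATE TABLE origin_visit (
        origin INTEGER, visit INTEGER, date TEXT,
        PRIMARY KEY (origin, visit)
    );
    CREATE TABLE origin_visit_status (
        origin INTEGER, visit INTEGER, date TEXT, status TEXT, snapshot TEXT,
        PRIMARY KEY (origin, visit, date)
    );
    INSERT INTO origin_visit VALUES
        (1, 1, '2022-01-01'), (1, 2, '2022-02-01');
    INSERT INTO origin_visit_status VALUES
        (1, 1, '2022-01-01', 'created', NULL),
        (1, 1, '2022-01-02', 'full', 'snap-a'),
        (1, 2, '2022-02-01', 'created', NULL),
        (1, 2, '2022-02-02', 'full', 'snap-b'),
        (1, 2, '2022-02-03', 'partial', NULL);
    """
)

COLUMNS = "ovs.origin, ovs.visit, ov.date, ovs.date, ovs.status, ovs.snapshot"
JOIN_COND = "ON ov.origin = ovs.origin AND ov.visit = ovs.visit"
FILTER = (
    "WHERE ovs.origin = 1 AND ovs.snapshot IS NOT NULL AND ovs.status = 'full' "
    "ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1"
)

# Original shape: origin_visit INNER JOIN origin_visit_status.
inner_row = conn.execute(
    f"SELECT {COLUMNS} FROM origin_visit ov "
    f"INNER JOIN origin_visit_status ovs {JOIN_COND} {FILTER}"
).fetchone()

# Rewritten shape: origin_visit_status LEFT JOIN origin_visit.
left_row = conn.execute(
    f"SELECT {COLUMNS} FROM origin_visit_status ovs "
    f"LEFT JOIN origin_visit ov {JOIN_COND} {FILTER}"
).fetchone()
```

If a status row had no corresponding visit row, the `LEFT JOIN` form would still return it with a NULL `ov.date`, whereas the `INNER JOIN` form would drop it. That is the only case where the two queries differ.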

conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed · 44616afe

vlorentz authored 2 years ago

The hack crashes on >= 4 because 'pytest_cov.embed.multiprocessing_start'
is not in the hook list anymore.

https://pytest-cov.readthedocs.io/en/latest/changelog.html

44616afe

Sep 28, 2022
- retry: Do not retry on SystemExit exceptions · 87d3f0d7
  vlorentz authored 2 years ago
```
It prevents process shutdown (unless the user presses Ctrl-C
several times in a row)
```
  87d3f0d7
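
A minimal sketch of the idea in 87d3f0d7, assuming a retry layer that inspects every exception it sees. The `retry` decorator is hypothetical and is not the swh-storage implementation. It retries on failures but lets `SystemExit` propagate on the first occurrence, so the process can shut down.

```python
import functools


def retry(attempts=3):
    """Hypothetical retry decorator that never retries on SystemExit."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                # Deliberately broad, to mimic a retry layer that sees everything.
                except BaseException as exc:
                    # Re-raise immediately on SystemExit, or when out of attempts.
                    if isinstance(exc, SystemExit) or attempt == attempts:
                        raise

        return wrapper

    return decorator
```
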
Sep 27, 2022
- docs: Update archive stats · 7da7067a
  vlorentz authored 2 years ago
  
  7da7067a
- Handle errors raised by fromisoformat. · 26995d42
  vlorentz authored 2 years ago
  
  26995d42
Sep 13, 2022
- Add date-based index to origin_visit · 1281ee7b
  Nicolas Dandrimont authored 2 years ago
```
This will help queries retrieving origin_visits by date by avoiding
having to scan all visits for the origin.
```
  1281ee7b
- SQL upgrade scripts don't need to bump dbversion anymore · 0786ab17
  Nicolas Dandrimont authored 2 years ago
  
  0786ab17
Aug 31, 2022
- docs: Document metadata formats for Gogs and Gitea · aa735cba
  vlorentz authored 2 years ago
  
  aa735cba
Aug 30, 2022

postgresql: Fix SQL query for origin_find_visit_by_date method · d038240c

Antoine Lambert authored 2 years ago

In that query, the interval alias was set on the visit column instead
of on the date difference computation, which could lead to the wrong visit
being returned because the results were ordered incorrectly.

d038240c

Aug 29, 2022
- Raise StorageArgumentException on ProgramLimitExceeded · 57e5431c
  vlorentz authored 2 years ago
```
So it is not logged on the server side, and so clients do not retry
```
  57e5431c
Aug 22, 2022

origin_visit_add: Fix crash when adding multiple visits to the same origin simultaneously · b5836bab

vlorentz authored 2 years ago

This works by adding a RW lock on the row of the latest visit,
which should block other transactions until the insertion
is committed, so other transactions will generate a different
(larger) visit id.

This commit also slightly rewrites how the max visit id is
computed, as we need to actually select a row to lock it,
instead of using the `max()` aggregate function.

b5836bab

Aug 16, 2022
- Updated debian changelog for version 1.6.0 · 5a40829d
  Jenkins for Software Heritage authored 2 years ago
  
  debian/1.6.0-1_swh1
  
  5a40829d
- Update upstream source from tag 'debian/upstream/1.6.0' · b9166577
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '1.6.0'
with Debian dir 86c67609298670467429d7d09f9eb0e562cfeb39
```
  b9166577
- New upstream version 1.6.0 · 34ba15b7
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/1.6.0
  
  34ba15b7
Aug 09, 2022

retry: Add constant 10s wait when retrying transient exceptions · 5335244f

vlorentz authored 2 years ago

They are typically caused by server shutdown and other temporary
failures that may take more time than the typical 0-3s delay
used by the retry proxy.

This should keep noisy exceptions like AdminShutdown out of the
Sentry dashboards.

5335244f
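
The wait policy from 5335244f could look roughly like the sketch below. The class and function names are hypothetical, and drawing the usual delay uniformly from 0 to 3 seconds is an assumption. The commit only states that the typical delay is 0-3s and that transient exceptions get a constant 10s wait.

```python
import random


class TransientRemoteException(Exception):
    """Hypothetical stand-in for the transient exception class."""


def wait_seconds(exc):
    """Return how long to wait before retrying after `exc`."""
    if isinstance(exc, TransientRemoteException):
        # Server shutdowns and similar failures may outlast the usual short delay.
        return 10.0
    # Assumed default: a short random delay between 0 and 3 seconds.
    return random.uniform(0, 3)
```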

Convert psycopg2 errors to TransientRemoteException instead of RemoteException · 7c7a721d

vlorentz authored 2 years ago

On the wire, this is done by making the server return a 503 error
instead of 500, which the RPC client generated by swh-core
interprets to change the exception class.

7c7a721d
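
As a rough illustration of the mechanism in 7c7a721d, a client can pick the exception class from the HTTP status code. The classes and the `exception_for_status` helper below are hypothetical stand-ins and do not reproduce the swh-core RPC client.

```python
class RemoteException(Exception):
    """Hypothetical stand-in for a generic server-side failure (HTTP 500)."""


class TransientRemoteException(RemoteException):
    """Hypothetical stand-in for a temporary failure worth retrying (HTTP 503)."""


def exception_for_status(status_code, message):
    """Map an HTTP status code to an exception instance, or None if not an error."""
    if status_code == 503:
        return TransientRemoteException(message)
    if status_code >= 500:
        return RemoteException(message)
    return None
```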

Aug 08, 2022
- Add anchor extrinsic-metadata-original-artifacts-json · b4f289c8
  vlorentz authored 2 years ago
```
It needs to be linked from swh/lister/crates/__init__.py
```
  b4f289c8
Aug 05, 2022
- Updated debian changelog for version 1.5.1 · 4e53af05
  Jenkins for Software Heritage authored 2 years ago
  
  debian/1.5.1-1_swh1
  
  4e53af05
- Update upstream source from tag 'debian/upstream/1.5.1' · 5b0437ec
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '1.5.1'
with Debian dir aecb33b9607d268dc35fe135a46a97eab940b3e5
```
  5b0437ec
- New upstream version 1.5.1 · 6dba4b5a
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/1.5.1
  
  6dba4b5a
- cassandra: Fix flakiness of test_directory_add_atomic[concurrent] · 280ecc38
  vlorentz authored 2 years ago
  
  v1.5.1
  
  280ecc38
- Fix crash of test_*_arbitrary when given objects with the same id · 68b93a60
  vlorentz authored 2 years ago
  
  68b93a60
- cassandra: Make origin_visit_status_get_random's interval consistent with postgresql · 9b9eb282
  vlorentz authored 2 years ago
```
The postgresql implementation uses '3 months', which is closer to 13 weeks
than to 12 weeks.
```
  9b9eb282
- Fix flakiness of test_origin_visit_status_get_random_nothing_found · 56f69e56
  vlorentz authored 2 years ago
```
start is increased from 13 to 14, because 13 weeks is 91 days,
ie. 30+31+30; so it is sometimes smaller than 3 months.

This was only hit rarely because the number of visits was small,
so this commit also increases the number of visits to make the
test more likely to fail if it should actually fail.
```
  56f69e56
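
The arithmetic behind 9b9eb282 and 56f69e56 can be checked with a short sketch. It assumes that "3 months" means three consecutive calendar months and uses month lengths from a single non-leap year, including for windows that wrap past December.

```python
import calendar


def three_month_spans(year):
    """Return the length in days of each window of three consecutive months."""
    days = [calendar.monthrange(year, month)[1] for month in range(1, 13)]
    # Windows starting in November and December wrap around to the same
    # year's January and February, which is a simplification.
    return [days[i] + days[(i + 1) % 12] + days[(i + 2) % 12] for i in range(12)]


spans = three_month_spans(2022)
shortest, longest = min(spans), max(spans)
thirteen_weeks = 13 * 7
fourteen_weeks = 14 * 7
```

Thirteen weeks (91 days) falls inside the range of possible three-month lengths, so a 13-week offset is sometimes shorter than 3 months. Fourteen weeks (98 days) exceeds every three-month span, which is why the test uses 14.
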
- Updated debian changelog for version 1.5.0 · 7d185377
  Jenkins for Software Heritage authored 2 years ago
  
  debian/1.5.0-1_swh1
  
  7d185377
- Update upstream source from tag 'debian/upstream/1.5.0' · eae94e6a
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '1.5.0'
with Debian dir 34ae119254de09bf2789c8e454ecae8cb67a614d
```
  eae94e6a
- New upstream version 1.5.0 · ea2f9f46
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/1.5.0
  
  ea2f9f46
- Fix flakiness in test_directory_add_get_arbitrary · 4825f40a
  vlorentz authored 2 years ago
```
By ignoring other attributes when raw_manifest is not None;
just like we already do in test_revision_add_get_arbitrary
and test_release_add_get_arbitrary.
```
  v1.5.0
  
  4825f40a
Aug 04, 2022

Stop logging and sending postgresql timeouts to Sentry · fc890595

vlorentz authored 2 years ago

They are very noisy, and clients are expected to retry a few times
before re-raising the exception on their side.

fc890595

Stop using `USE <keyspace>` with prepared statements · 1e7ede18

vlorentz authored 2 years ago

This caused the following warning:

```
WARNING  cassandra.protocol:libevreactor.py:361 Server warning: `USE <keyspace>` with prepared statements is considered to be an anti-pattern due to ambiguity in non-qualified table names. Please consider removing instances of `Session#setKeyspace(<keyspace>)`, `Session#execute("USE <keyspace>")` and `cluster.newSession(<keyspace>)` from your code, and always use fully qualified table names (e.g. <keyspace>.<table>).
```

This also prepends 'test' to the name of keyspaces used in tests, so
they are guaranteed to start with a letter (names starting with a digit cause
syntax errors in most statements).

1e7ede18
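
A minimal sketch of the two ideas in 1e7ede18, using hypothetical helpers. The test keyspace name gets a 'test' prefix so it always starts with a letter, and statements use fully qualified table names instead of relying on `USE <keyspace>`. Generating the name from a random UUID is an assumption for illustration.

```python
import uuid


def make_test_keyspace():
    """Return a keyspace name that is guaranteed to start with a letter."""
    # A bare hex UUID may start with a digit, hence the fixed prefix.
    return "test" + uuid.uuid4().hex


def qualified(keyspace, table):
    """Return the fully qualified table name to use in statements."""
    return f"{keyspace}.{table}"


def select_statement(keyspace, table, columns):
    """Build a SELECT statement that does not depend on a prior USE."""
    return f"SELECT {', '.join(columns)} FROM {qualified(keyspace, table)}"
```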

cassandra: Simplify SELECT statement formatting · 0aff4618
vlorentz authored 2 years ago

0aff4618