Commits · v1.7.1 · vlorentz / Storage manager

Oct 19, 2022

test_retry: Use proper way to mock sleep of retryable storage methods · 3c08d9f0

Antoine Lambert authored 2 years ago

The previous implementation did not mock the sleep of retryable storage methods,
because RetryingProxyStorage sets up the retry features when it is instantiated.

So modify the fixture to ensure sleep functions are mocked, and return the mocks
in a dict indexed by storage method name.

This fixes the Debian buster package build for swh-storage.
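
A minimal sketch of the problem and of the fixture's approach, using only the standard library. The `Retrying`, `FlakyStorage`, `RetryingProxy` and `mock_sleeps` names are hypothetical stand-ins, not the actual swh-storage classes: the point is only that each wrapped method holds its own sleep function once the proxy is instantiated, so the mock has to be installed per method.

```python
import time
from unittest import mock


class Retrying:
    """Hypothetical retry helper that captures its sleep function at construction."""

    def __init__(self, func, attempts=3):
        self.func = func
        self.attempts = attempts
        self.sleep = time.sleep  # bound once, when the proxy is instantiated

    def __call__(self, *args, **kwargs):
        for attempt in range(1, self.attempts + 1):
            try:
                return self.func(*args, **kwargs)
            except ConnectionError:
                if attempt == self.attempts:
                    raise
                self.sleep(1)


class FlakyStorage:
    """Fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0

    def content_add(self, items):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("temporary failure")
        return len(items)


class RetryingProxy:
    """Wraps the listed methods when the proxy is instantiated."""

    METHODS = ("content_add",)

    def __init__(self, storage):
        for name in self.METHODS:
            setattr(self, name, Retrying(getattr(storage, name)))


def mock_sleeps(proxy):
    """Replace each per-method sleep; return the mocks keyed by method name."""
    mocks = {}
    for name in proxy.METHODS:
        mocks[name] = mock.Mock()
        getattr(proxy, name).sleep = mocks[name]
    return mocks


proxy = RetryingProxy(FlakyStorage())
sleeps = mock_sleeps(proxy)
result = proxy.content_add(["a", "b"])
```

A test can then assert on `sleeps["content_add"]` without ever sleeping for real.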

3c08d9f0

Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · c1c2dbf0

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also point flake8's repo config at GitHub, since the GitLab mirror
is outdated.

c1c2dbf0

Oct 17, 2022
- docs: Add info about CPAN extrinsic metadata format · 17d9ad23
  Antoine Lambert authored 2 years ago
  
  Related to T2833
  17d9ad23
Sep 29, 2022

postgresql: Remove merge join with origin_visit in origin_visit_get_latest · 657d31f6

vlorentz authored 2 years ago

I noticed that `origin_visit_get_latest` spends a lot of time doing index
scans on `origin_visit_pkey`:

```
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 rows)
```

As far as I understand, this is because we do not have a FK to tell the
planner that every row in `origin_visit_status` does have a
corresponding row in `origin_visit`, so it checks every row from
`origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, which avoids
this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Change columns to filter directly on the "materialized" fields in ovs
instead of those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 rows)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 rows)
```

This would also work with a subquery just to get the value of `ov.date`
and removing the actual join to `ov` entirely, but it was more annoying
to implement because the function reuses `self.origin_visit_select_cols`
as column list.

All these EXPLAIN queries were run on staging.
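
The rewrite relies on the `INNER JOIN` and the `LEFT JOIN` returning the same row whenever every `origin_visit_status` row has a matching `origin_visit` row. A small sanity check of that equivalence, run against an in-memory SQLite database rather than PostgreSQL, might look like the sketch below. The toy schema and data are assumptions made for illustration, the join condition is written with `ON` instead of `USING`, and this says nothing about the planner behavior shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE origin_visit (
        origin INTEGER, visit INTEGER, date TEXT,
        PRIMARY KEY (origin, visit)
    );
    CREATE TABLE origin_visit_status (
        origin INTEGER, visit INTEGER, date TEXT, status TEXT, snapshot TEXT,
        PRIMARY KEY (origin, visit, date)
    );
    INSERT INTO origin_visit VALUES
        (1, 1, '2022-01-01'),
        (1, 2, '2022-02-01');
    INSERT INTO origin_visit_status VALUES
        (1, 1, '2022-01-01', 'created', NULL),
        (1, 1, '2022-01-02', 'full', 'snp1'),
        (1, 2, '2022-02-01', 'created', NULL),
        (1, 2, '2022-02-02', 'full', 'snp2'),
        (1, 2, '2022-02-03', 'partial', 'snp3');
    """
)

# Same shape as the queries above, with the join type left as a parameter.
QUERY = """
    SELECT ovs.origin, ovs.visit, ov.date, ovs.date, ovs.status, ovs.snapshot
    FROM origin_visit_status ovs
    {join} origin_visit ov
        ON ov.origin = ovs.origin AND ov.visit = ovs.visit
    WHERE ovs.origin = ?
      AND ovs.snapshot IS NOT NULL
      AND ovs.status = 'full'
    ORDER BY ovs.visit DESC, ovs.date DESC
    LIMIT 1
"""

inner = conn.execute(QUERY.format(join="INNER JOIN"), (1,)).fetchone()
left = conn.execute(QUERY.format(join="LEFT JOIN"), (1,)).fetchone()
```

Both variants select the latest full visit with a snapshot; they would only differ if a status row existed without its visit row.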

657d31f6

conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed · 44616afe

vlorentz authored 2 years ago

The hack crashes on >= 4 because 'pytest_cov.embed.multiprocessing_start'
is not in the hook list anymore.

https://pytest-cov.readthedocs.io/en/latest/changelog.html

44616afe

Sep 28, 2022
- retry: Do not retry on SystemExit exceptions · 87d3f0d7
  vlorentz authored 2 years ago
  
  It prevents process shutdown (unless the user presses Ctrl-C several times in a row)
  87d3f0d7
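
A minimal sketch of the idea, not the actual retry proxy code: the `should_retry` predicate and the `call_with_retry` loop are hypothetical names used only to show a retry loop letting `SystemExit` propagate immediately.

```python
def should_retry(exc: BaseException) -> bool:
    """Hypothetical predicate: never retry when the process is exiting."""
    return not isinstance(exc, SystemExit)


def call_with_retry(func, attempts=3):
    """Call func, retrying on failure unless the predicate says otherwise."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except BaseException as exc:
            if attempt == attempts or not should_retry(exc):
                raise


exit_calls = []
error_calls = []


def shutting_down():
    exit_calls.append(1)
    raise SystemExit(0)


def failing():
    error_calls.append(1)
    raise ValueError("still broken")


try:
    call_with_retry(shutting_down)
except SystemExit:
    pass  # propagated after a single attempt

try:
    call_with_retry(failing)
except ValueError:
    pass  # propagated after all attempts were used
```
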
Sep 27, 2022
- docs: Update archive stats · 7da7067a
  vlorentz authored 2 years ago
  
  7da7067a
- Handle errors raised by fromisoformat. · 26995d42
  vlorentz authored 2 years ago
  
  26995d42
Sep 13, 2022
- Add date-based index to origin_visit · 1281ee7b
  Nicolas Dandrimont authored 2 years ago
  
  This will help queries retrieving origin_visits by date by avoiding having to scan all visits for the origin.
  1281ee7b
- SQL upgrade scripts don't need to bump dbversion anymore · 0786ab17
  Nicolas Dandrimont authored 2 years ago
  
  0786ab17
Aug 31, 2022
- docs: Document metadata formats for Gogs and Gitea · aa735cba
  vlorentz authored 2 years ago
  
  aa735cba
Aug 30, 2022

postgresql: Fix SQL query for origin_find_visit_by_date method · d038240c

Antoine Lambert authored 2 years ago

In that query, the interval alias was set on the visit column instead
of on the date difference computation, which could lead to the wrong
visit being returned because the results were ordered incorrectly.
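
The effect of a misplaced alias can be reproduced on a toy table. This is a sketch under assumptions: it uses SQLite instead of PostgreSQL, a simplified hypothetical schema with integer days, and an alias named `gap` rather than the real query's column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin_visit (visit INTEGER, day INTEGER)")
conn.executemany(
    "INSERT INTO origin_visit VALUES (?, ?)",
    [(1, 10), (2, 50), (3, 90)],
)

target_day = 55

# Alias attached to the visit column: rows end up ordered by visit id.
buggy = conn.execute(
    "SELECT visit AS gap, abs(day - ?) FROM origin_visit "
    "ORDER BY gap LIMIT 1",
    (target_day,),
).fetchone()[0]

# Alias attached to the date difference: rows are ordered by closeness.
fixed = conn.execute(
    "SELECT visit, abs(day - ?) AS gap FROM origin_visit "
    "ORDER BY gap LIMIT 1",
    (target_day,),
).fetchone()[0]
```

With the misplaced alias the first visit wins regardless of the target date; with the alias on the difference, the closest visit (visit 2 here) is returned.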

d038240c

Aug 29, 2022
- Raise StorageArgumentException on ProgramLimitExceeded · 57e5431c
  vlorentz authored 2 years ago
  
  So it is not logged on the server side, and so clients do not retry
  57e5431c
Aug 22, 2022

origin_visit_add: Fix crash when adding multiple visits to the same origin simultaneously · b5836bab

vlorentz authored 2 years ago

This works by adding a RW lock on the row of the latest visit,
which should block other transactions until the insertion
is committed, so other transactions will generate a different
(larger) visit id.

This commit also slightly rewrites how the max visit id is
computed, as we need to actually select a row to lock it,
instead of using the `max()` aggregate function.
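
The serialization this provides can be illustrated in-process, with a `threading.Lock` standing in for the database row lock. This is only an analogy under that assumption: the `VisitAllocator` class is hypothetical and no SQL is involved.

```python
import threading


class VisitAllocator:
    """Hands out visit ids per origin; a lock plays the role of the row lock."""

    def __init__(self):
        self._visits = {}  # origin -> list of allocated visit ids
        self._locks = {}   # origin -> lock (stand-in for the locked row)
        self._guard = threading.Lock()

    def _lock_for(self, origin):
        with self._guard:
            return self._locks.setdefault(origin, threading.Lock())

    def add_visit(self, origin):
        # Read the latest visit and insert the next one while holding the
        # lock, so concurrent callers cannot compute the same id.
        with self._lock_for(origin):
            visits = self._visits.setdefault(origin, [])
            next_id = (visits[-1] if visits else 0) + 1
            visits.append(next_id)
            return next_id


allocator = VisitAllocator()
results = []


def worker():
    results.append(allocator.add_visit("https://example.org/repo"))


threads = [threading.Thread(target=worker) for _ in range(20)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Every concurrent caller ends up with a distinct visit id, which is the property the row lock is meant to guarantee.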

b5836bab

Aug 09, 2022

retry: Add constant 10s wait when retrying transient exceptions · 5335244f

vlorentz authored 2 years ago

They are typically caused by server shutdown and other temporary
failures that may take more time than the typical 0-3s delay
used by the retry proxy.

This should keep noisy exceptions like AdminShutdown out of the
Sentry dashboards.
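
A sketch of how such a wait policy could be expressed. The `wait_seconds` function is hypothetical, the exception class is a local stand-in for the one named in these commits, and the uniform random draw is only an assumption about how the 0-3s delay is picked.

```python
import random


class TransientRemoteException(Exception):
    """Local stand-in for the exception class named in these commits."""


def wait_seconds(exc: BaseException) -> float:
    """Hypothetical policy: constant 10s for transient errors, 0-3s otherwise."""
    if isinstance(exc, TransientRemoteException):
        return 10.0
    return random.uniform(0, 3)


transient_wait = wait_seconds(TransientRemoteException("server shutting down"))
other_wait = wait_seconds(ValueError("something else"))
```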

5335244f

Convert psycopg2 errors to TransientRemoteException instead of RemoteException · 7c7a721d

vlorentz authored 2 years ago

On the wire, this is done by making the server return a 503 error
instead of 500, which the RPC client generated by swh-core
interprets to change the exception class.
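
On the client side, the status-code dispatch could look roughly like the sketch below. The `exception_for_status` helper is hypothetical and is not the swh-core implementation; the two classes are local stand-ins, and making one a subclass of the other is an assumption.

```python
from typing import Optional


class RemoteException(Exception):
    """Local stand-in: the server failed while handling the request."""


class TransientRemoteException(RemoteException):
    """Local stand-in: the failure is expected to be temporary."""


def exception_for_status(status_code: int, message: str) -> Optional[Exception]:
    """Hypothetical mapping from an HTTP status code to an exception."""
    if status_code == 503:
        return TransientRemoteException(message)
    if status_code >= 500:
        return RemoteException(message)
    return None


transient = exception_for_status(503, "database unavailable")
permanent = exception_for_status(500, "unexpected error")
ok = exception_for_status(200, "")
```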

7c7a721d

Aug 08, 2022
- Add anchor extrinsic-metadata-original-artifacts-json · b4f289c8
  vlorentz authored 2 years ago
  
  It needs to be linked from swh/lister/crates/__init__.py
  b4f289c8
Aug 05, 2022
- cassandra: Fix flakiness of test_directory_add_atomic[concurrent] · 280ecc38
  vlorentz authored 2 years ago
  
  Tag: v1.5.1
  
  280ecc38
- Fix crash of test_*_arbitrary when given objects with the same id · 68b93a60
  vlorentz authored 2 years ago
  
  68b93a60
- cassandra: Make origin_visit_status_get_random's interval consistent with postgresql · 9b9eb282
  vlorentz authored 2 years ago
  
  The postgresql implementation uses '3 months', which is closer to 13 weeks than to 12 weeks.
  9b9eb282
- Fix flakiness of test_origin_visit_status_get_random_nothing_found · 56f69e56
  vlorentz authored 2 years ago
  
  start is increased from 13 to 14 because 13 weeks is 91 days, i.e. 30+31+30, so it is sometimes shorter than 3 months. This was hit only rarely because the number of visits was small, so this commit also increases the number of visits to make the test more likely to fail if it should actually fail.
  56f69e56
- Fix flakiness in test_directory_add_get_arbitrary · 4825f40a
  vlorentz authored 2 years ago
  
  By ignoring other attributes when raw_manifest is not None; just like we already do in test_revision_add_get_arbitrary and test_release_add_get_arbitrary.
  Tag: v1.5.0
  
  4825f40a
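
The week arithmetic behind the two `origin_visit_status_get_random` commits above can be checked with the standard library. This is a sketch: the `add_months` helper is hypothetical and only handles dates on the first of a month, and 2022 is used as the sample year.

```python
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months (sample dates are all the 1st)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, d.day)


twelve_weeks = timedelta(weeks=12)
thirteen_weeks = timedelta(weeks=13)
fourteen_weeks = timedelta(weeks=14)

# Length of every 3-month span starting on the 1st of a month in 2022.
spans = [
    add_months(date(2022, month, 1), 3) - date(2022, month, 1)
    for month in range(1, 13)
]
shortest, longest = min(spans), max(spans)
```

Three months range from 89 to 92 days in this sample, so 13 weeks (91 days) is closer than 12 weeks (84 days) but can still fall short, while 14 weeks (98 days) always exceeds it.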
Aug 04, 2022

Stop logging and sending postgresql timeouts to Sentry · fc890595

vlorentz authored 2 years ago

They are very noisy, and clients are expected to retry a few times
before re-raising the exception on their side.

fc890595

Stop using `USE <keyspace>` with prepared statements · 1e7ede18

vlorentz authored 2 years ago

This caused the following warning:

```
WARNING  cassandra.protocol:libevreactor.py:361 Server warning: `USE <keyspace>` with prepared statements is considered to be an anti-pattern due to ambiguity in non-qualified table names. Please consider removing instances of `Session#setKeyspace(<keyspace>)`, `Session#execute("USE <keyspace>")` and `cluster.newSession(<keyspace>)` from your code, and always use fully qualified table names (e.g. <keyspace>.<table>).
```

This also prepends 'test' to the name of keyspaces used in tests, so
they are guaranteed to start with a letter (a name starting with a digit
causes syntax errors in most statements).
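
A sketch of both points, assuming a random hexadecimal suffix for test keyspaces. The `make_test_keyspace` and `qualified` helpers and the `content` table name are illustrative, not taken from the actual test code.

```python
import uuid


def make_test_keyspace() -> str:
    """A random hex string may start with a digit, so prefix it with 'test'."""
    return "test" + uuid.uuid4().hex


def qualified(keyspace: str, table: str) -> str:
    """Fully qualified table name, used instead of a prior USE <keyspace>."""
    return f"{keyspace}.{table}"


keyspace = make_test_keyspace()
statement = f"SELECT * FROM {qualified(keyspace, 'content')} WHERE sha1 = ?"
```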

1e7ede18

cassandra: Simplify SELECT statement formatting · 0aff4618

vlorentz authored 2 years ago

0aff4618

Add test_directory_add_raw_manifest__different_entries · 2205fa6e

vlorentz authored 2 years ago

This reproduces what I think is the issue found in
https://jenkins.softwareheritage.org/job/debian/job/packages/job/DSTO/job/gbp-buildpackage/423/consoleFull

This does not fix the issue as it is a consequence of the design,
but documents this problematic behavior.

2205fa6e

Jul 13, 2022

postgresql: Increase some timeouts to get origin visits · fbe38038

Antoine Lambert authored 2 years ago

Even though the missing index to speed up origin visit queries has
been added to the replica database, the configured timeouts for
origin_visit_get_with_statuses and origin_visit_find_by_date
were still too low to avoid query timeouts in production.

After performing some tests locally, bumping them to 2000ms
makes the timeouts go away.

Related to T4386

fbe38038

Jul 12, 2022
- backfill: Add support for directories with duplicated entries · cfc86799
  vlorentz authored 2 years ago
  
  This uses Directory.from_possibly_duplicated_entries() to mangle entry names instead of crashing.
  cfc86799
Jul 08, 2022
- cli: move an import statement in the cli command · d6db4e44
  David Douard authored 2 years ago
  
  d6db4e44
Jul 06, 2022

do not always auto-create an OriginVisitStatus object in origin_visit_add() · e0825acb

David Douard authored 2 years ago

When the OriginVisit object given as argument already has its visit id
set (which is usually the case in a replayer-like session), it makes no
sense to auto-add the first OriginVisitStatus object related to this
visit; this behavior is expected only when origin_visit_add() is called
from a loading session.

Adapt tests accordingly: several tests depended on the auto-add
behavior of the origin_visit_add method for OriginVisit objects whose
visit id is given in the test dataset.
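
The resulting behavior can be sketched with a toy in-memory backend. The `Visit` dataclass and `InMemoryStorage` class are hypothetical simplifications, not the swh-storage API, and the `"created"` initial status is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Visit:
    origin: str
    visit: Optional[int] = None  # None means "let the storage pick an id"


class InMemoryStorage:
    def __init__(self) -> None:
        self.visits: List[Visit] = []
        self.statuses: List[Tuple[str, int, str]] = []

    def origin_visit_add(self, visit: Visit) -> Visit:
        # Only a visit without an id (loader case) gets an initial status.
        auto_status = visit.visit is None
        if auto_status:
            latest = max(
                (v.visit for v in self.visits if v.origin == visit.origin),
                default=0,
            )
            visit.visit = latest + 1
        self.visits.append(visit)
        if auto_status:
            self.statuses.append((visit.origin, visit.visit, "created"))
        return visit


storage = InMemoryStorage()
loaded = storage.origin_visit_add(Visit("https://example.org/a"))
replayed = storage.origin_visit_add(Visit("https://example.org/b", visit=7))
```

Only the loader-style call produces a status; the replayed visit keeps its id and relies on its statuses being replayed separately.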

e0825acb

Jul 01, 2022
- Add a Storage.flavor property to the postgresql backend · 47caf04e
  David Douard authored 2 years ago
  
  and add tests for 'mirror' and 'read_replica' flavors.
  47caf04e
- Update pytest_plugin for swh.core 2.10 · a00650ea
  David Douard authored 2 years ago
  
  a00650ea
Jun 03, 2022

Set current_version attribute to postgresql datastore · c19f53f1

Antoine R. Dumont authored 2 years ago

This also simplifies the db collaborator code by reusing core.db functions to check
that the code version and the actual db version match.

Related to T4305

c19f53f1

May 31, 2022
- pytest_plugin: use the stock pytest_postgresql postgresql factory · e64d64e4
  David Douard authored 2 years ago
  
  instead of swh-core's postgresql_fact one, since we actually do not use its custom features any more in swh-storage.
  e64d64e4
- Add missing __init__.py in proxies/ · a936cfda
  David Douard authored 2 years ago
  
  a936cfda
May 10, 2022
- docs: Describe metadata formats more precisely, and mention github and gitlab's · cb12394c
  vlorentz authored 2 years ago
  
  cb12394c
May 09, 2022
- add strict asyncio_mode in pytest.ini · 27d3c8a6
  Pratyush authored 2 years ago
  
  27d3c8a6
May 02, 2022
- Add function storage.algos.directory.directory_get · 95629534
  vlorentz authored 2 years ago
  
  It will be used in swh-storage to fetch a complete directory object, i.e. with the raw_manifest and all branches.
  Tag: v1.4.0
  
  95629534
Apr 28, 2022
- client: Migrate to _post method call to stop deprecation warnings · fb551411
  Antoine R. Dumont authored 2 years ago
  
  fb551411
Apr 26, 2022
- Bump mypy to v0.942 · a942679a
  vlorentz authored 2 years ago
  
  a942679a