- Oct 19, 2022
-
-
Antoine Lambert authored
The previous implementation did not mock the sleep function of retryable storage methods, because RetryingProxyStorage sets up the retry features when it is instantiated. Modify the fixture to ensure the sleep functions are mocked, and return the mocks in a dict indexed by storage method name. This fixes the Debian buster package build for swh-storage.
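A minimal sketch of the idea, using only the standard library. `Retrier`, `RetryingProxy`, `FlakyBackend` and `mock_sleeps` are hypothetical stand-ins invented for illustration; they are not the swh-storage classes or fixture. The point is that a retry helper which keeps its own reference to the sleep function at construction time has to be patched per method, after the proxy exists:

```python
import time
from unittest.mock import Mock


class Retrier:
    """Toy retry helper (hypothetical); it captures sleep when constructed."""

    def __init__(self, attempts=3, wait=1.0):
        self.attempts = attempts
        self.wait = wait
        self.sleep = time.sleep  # reference taken once, at construction

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, self.attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == self.attempts:
                        raise
                    self.sleep(self.wait)

        wrapper.retrier = self  # expose the helper so tests can reach it
        return wrapper


class RetryingProxy:
    """Hypothetical proxy that wraps backend methods when instantiated."""

    RETRYABLE = ("content_add", "origin_add")

    def __init__(self, backend):
        for name in self.RETRYABLE:
            setattr(self, name, Retrier().wrap(getattr(backend, name)))


def mock_sleeps(proxy):
    """Replace each method's sleep with a Mock; return them keyed by method name."""
    mocks = {}
    for name in proxy.RETRYABLE:
        mocked = Mock()
        getattr(proxy, name).retrier.sleep = mocked
        mocks[name] = mocked
    return mocks


class FlakyBackend:
    """Backend whose content_add fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def content_add(self, data):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient failure")
        return len(data)

    def origin_add(self, urls):
        return len(urls)


proxy = RetryingProxy(FlakyBackend())
sleep_mocks = mock_sleeps(proxy)
result = proxy.content_add([b"a", b"b"])
```

In a test, the returned dict lets each assertion target the sleep mock of the specific method under test, and no real waiting happens.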
-
- Oct 18, 2022
-
-
David Douard authored
pre-commit from 4.1.0 to 4.3.0, codespell from 2.2.1 to 2.2.2, black from 22.3.0 to 22.10.0 and flake8 from 4.0.1 to 5.0.4. Also freeze flake8's dependencies, and change flake8's repo config to GitHub (the GitLab mirror being outdated).
-
- Oct 17, 2022
-
-
Antoine Lambert authored
Related to T2833
-
- Sep 29, 2022
-
-
vlorentz authored
I noticed that `origin_visit_get_latest` spends a lot of time doing index scans on `origin_visit_pkey`:

```
swh=> explain analyze SELECT * FROM origin_visit ov
      INNER JOIN origin o ON o.id = ov.origin
      INNER JOIN origin_visit_status ovs USING (origin, visit)
      WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/')
        AND ovs.snapshot is not null AND ovs.status = 'full'
      ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)
```

As far as I understand, this is because we do not have a FK to tell the planner that every row in `origin_visit_status` does have a corresponding row in `origin_visit`, so it checks every row from `origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, which spares this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov
      INNER JOIN origin_visit_status ovs USING (origin, visit)
      WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/')
        AND ovs.snapshot is not null AND ovs.status = 'full'
      ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Change columns to filter directly on the "materialized" fields in ovs instead of those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov
      INNER JOIN origin_visit_status ovs USING (origin, visit)
      WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/')
        AND ovs.snapshot is not null AND ovs.status = 'full'
      ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs
      INNER JOIN origin_visit ov USING (origin, visit)
      WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/')
        AND ovs.snapshot is not null AND ovs.status = 'full'
      ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs
      LEFT JOIN origin_visit ov USING (origin, visit)
      WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/')
        AND ovs.snapshot is not null AND ovs.status = 'full'
      ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)
```

This would also work with a subquery just to get the value of `ov.date`, removing the actual join to `ov` entirely, but that was more annoying to implement because the function reuses `self.origin_visit_select_cols` as its column list.

All these EXPLAIN queries were run on staging.
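The rewrite relies on every `origin_visit_status` row having a matching `origin_visit` row, in which case `INNER JOIN` and `LEFT JOIN` return the same rows. Here is a small sketch of that equivalence using an in-memory SQLite database. The two-table schema below is a simplified assumption made for illustration, not the real swh-storage schema, and SQLite's planner says nothing about PostgreSQL's plans:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE origin_visit (
        origin INTEGER, visit INTEGER, date TEXT,
        PRIMARY KEY (origin, visit)
    );
    CREATE TABLE origin_visit_status (
        origin INTEGER, visit INTEGER, date TEXT, status TEXT, snapshot TEXT,
        PRIMARY KEY (origin, visit, date)
    );
    """
)
conn.executemany(
    "INSERT INTO origin_visit VALUES (?, ?, ?)",
    [(1, 1, "2022-09-01"), (1, 2, "2022-09-02")],
)
# Every status row below has a corresponding origin_visit row.
conn.executemany(
    "INSERT INTO origin_visit_status VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2022-09-01T00:00", "created", None),
        (1, 1, "2022-09-01T00:05", "full", "snap1"),
        (1, 2, "2022-09-02T00:00", "created", None),
        (1, 2, "2022-09-02T00:05", "partial", "snap2"),
    ],
)

QUERY = """
SELECT ovs.origin, ovs.visit, ov.date, ovs.date, ovs.status, ovs.snapshot
FROM origin_visit_status ovs
{join} origin_visit ov ON ov.origin = ovs.origin AND ov.visit = ovs.visit
WHERE ovs.origin = ? AND ovs.snapshot IS NOT NULL AND ovs.status = 'full'
ORDER BY ovs.visit DESC, ovs.date DESC
LIMIT 1
"""

inner_row = conn.execute(QUERY.format(join="INNER JOIN"), (1,)).fetchone()
left_row = conn.execute(QUERY.format(join="LEFT JOIN"), (1,)).fetchone()
```

If a status row ever lacked its visit row, the `LEFT JOIN` variant would return it with `NULL` visit columns instead of dropping it, so the equivalence holds only under that invariant.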
-
vlorentz authored
The hack crashes on pytest-cov >= 4 because 'pytest_cov.embed.multiprocessing_start' is not in the hook list anymore. https://pytest-cov.readthedocs.io/en/latest/changelog.html
-
- Sep 28, 2022
-
-
vlorentz authored
It prevents process shutdown (unless the user presses Ctrl-C several times in a row)
-
- Sep 27, 2022
- Sep 13, 2022
-
-
Nicolas Dandrimont authored
This will help queries retrieving origin_visits by date by avoiding having to scan all visits for the origin.
-
Nicolas Dandrimont authored
-
- Aug 31, 2022
-
-
vlorentz authored
-
- Aug 30, 2022
-
-
Antoine Lambert authored
In that query, the interval alias was set on the visit column instead of on the date difference computation, which could lead to the wrong visit being returned because the results were ordered incorrectly.
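To illustrate why the ordering key matters, here is a small pure-Python sketch. The `(visit_id, date)` tuples and the `closest_visit` helper are made up for illustration and do not reproduce the actual SQL. Ordering on the date difference picks the visit nearest the requested date, while ordering on the visit id, roughly what a misplaced alias would do, picks an unrelated one:

```python
from datetime import datetime


def closest_visit(visits, target):
    """Return the (visit_id, date) pair whose date is nearest to target."""
    return min(visits, key=lambda visit: abs(visit[1] - target))


visits = [
    (1, datetime(2022, 1, 1)),
    (2, datetime(2022, 6, 1)),
    (3, datetime(2022, 12, 1)),
]
target = datetime(2022, 5, 20)

# Correct: order by the absolute date difference.
correct = closest_visit(visits, target)

# Buggy variant: the ordering key is the visit id, not the date difference.
wrong = min(visits, key=lambda visit: visit[0])
```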
-
- Aug 29, 2022
-
-
vlorentz authored
So it is not logged on the server side, and so clients do not retry
-
- Aug 22, 2022
-
-
vlorentz authored
This works by adding a RW lock on the row of the latest visit, which should block other transactions until the insertion is committed, so other transactions will generate a different (larger) visit id. This commit also slightly rewrites how the max visit id is computed, as we need to actually select a row to lock it, instead of using the `max()` aggregate function.
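As an in-process analogy only (the commit itself relies on a database row lock, which is not reproduced here), the sketch below shows the same "read the latest id and insert the next one under a lock" pattern with threads. `VisitIdAllocator` is a hypothetical class written for this illustration:

```python
import threading


class VisitIdAllocator:
    """Hypothetical allocator: reads the latest id and stores the next under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}  # origin -> latest visit id

    def add_visit(self, origin):
        # Without the lock, two threads could read the same latest id
        # and both return latest + 1.
        with self._lock:
            visit_id = self._latest.get(origin, 0) + 1
            self._latest[origin] = visit_id
            return visit_id


allocator = VisitIdAllocator()
allocated = []
allocated_lock = threading.Lock()


def worker():
    for _ in range(50):
        visit_id = allocator.add_visit("https://example.org/repo")
        with allocated_lock:
            allocated.append(visit_id)


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Every concurrent caller waits for the previous insertion to finish, so each one observes a larger latest id and no id is handed out twice.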
-
- Aug 09, 2022
-
-
vlorentz authored
On the wire, this is done by making the server return a 503 error instead of 500, which the RPC client generated by swh-core interprets to change the exception class.
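A minimal sketch of the status-code-to-exception mapping described above. The exception class names and the helper function are hypothetical, chosen for illustration; they are not taken from swh-core:

```python
class RemoteError(Exception):
    """Hypothetical exception for a generic server-side failure (HTTP 500)."""


class TransientRemoteError(RemoteError):
    """Hypothetical exception for a temporary failure (HTTP 503)."""


def exception_class_for_status(status_code):
    """Pick the exception class a client would raise for an error status code."""
    if status_code == 503:
        return TransientRemoteError
    return RemoteError


def raise_for_response(status_code, message):
    """Raise the mapped exception for error responses; do nothing otherwise."""
    if status_code >= 500:
        raise exception_class_for_status(status_code)(message)


try:
    raise_for_response(503, "backend temporarily unavailable")
except TransientRemoteError as exc:
    caught = type(exc).__name__
```

Making the transient class a subclass of the generic one means callers that only handle the generic error keep working, while callers that care can catch the narrower class.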
- Aug 08, 2022
-
-
vlorentz authored
It needs to be linked from swh/lister/crates/__init__.py
-
- Aug 05, 2022
-
-
vlorentz authored
-
vlorentz authored
The postgresql implementation uses '3 months', which is closer to 13 weeks than to 12 weeks.
-
vlorentz authored
start is increased from 13 to 14, because 13 weeks is 91 days, i.e. 30+31+30, so it is sometimes smaller than 3 months. This was only hit rarely because the number of visits was small, so this commit also increases the number of visits to make the test more likely to fail if it should actually fail.
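The arithmetic can be checked with the standard library. This sketch assumes, as a simplification, that a "3 months" span starts on the first day of a month, so its length is the sum of three consecutive month lengths; the helper name is made up for illustration:

```python
import calendar


def days_in_three_months(year, month):
    """Total days in the three consecutive months starting at (year, month)."""
    total = 0
    for offset in range(3):
        index = (month - 1) + offset
        total += calendar.monthrange(year + index // 12, index % 12 + 1)[1]
    return total


# Length of a 3-month span for each possible starting month of 2022.
spans = {month: days_in_three_months(2022, month) for month in range(1, 13)}

shortest = min(spans.values())
longest = max(spans.values())
thirteen_weeks = 13 * 7
fourteen_weeks = 14 * 7
```

Since some 3-month spans are longer than 13 weeks, only a 14-week window is guaranteed to cover them all.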
- Aug 04, 2022
-
-
vlorentz authored
They are very noisy, and clients are expected to retry a few times before re-raising the exception on their side.
-
vlorentz authored
This caused the following warning:

```
WARNING cassandra.protocol:libevreactor.py:361 Server warning: `USE <keyspace>` with prepared statements is considered to be an anti-pattern due to ambiguity in non-qualified table names. Please consider removing instances of `Session#setKeyspace(<keyspace>)`, `Session#execute("USE <keyspace>")` and `cluster.newSession(<keyspace>)` from your code, and always use fully qualified table names (e.g. <keyspace>.<table>).
```

This also prepends 'test' to the name of keyspaces used in tests, so they are guaranteed to start with a letter (starting with a digit causes syntax errors in most statements).
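A small sketch of both ideas: generating a test keyspace name that always starts with a letter, and building fully qualified table names instead of relying on `USE`. The helper names and the random-suffix scheme are assumptions made for illustration, not the code from the commit:

```python
import re
import secrets


def make_test_keyspace_name():
    """Random keyspace name; the 'test' prefix guarantees a leading letter."""
    return "test" + secrets.token_hex(10)


def qualified_table(keyspace, table):
    """Fully qualified table name, so statements do not depend on USE."""
    return f"{keyspace}.{table}"


keyspace = make_test_keyspace_name()
statement = f"SELECT * FROM {qualified_table(keyspace, 'content')} LIMIT 1"

# Unquoted identifiers must start with a letter.
is_valid_identifier = re.fullmatch(r"[a-z][a-z0-9_]*", keyspace) is not None
```

A bare `secrets.token_hex(10)` may begin with a digit, which is why the prefix is needed.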
-
vlorentz authored
-
vlorentz authored
This reproduces what I think is the issue found in https://jenkins.softwareheritage.org/job/debian/job/packages/job/DSTO/job/gbp-buildpackage/423/consoleFull This does not fix the issue as it is a consequence of the design, but documents this problematic behavior.
-
- Jul 13, 2022
-
-
Antoine Lambert authored
Even though the missing index to speed up origin visit queries has been added to the replica database, the configured timeouts for origin_visit_get_with_statuses and origin_visit_find_by_date were still too low to avoid query timeouts in production. After performing some tests locally, bumping them to 2000ms makes the timeouts go away. Related to T4386
-
- Jul 12, 2022
-
-
vlorentz authored
This uses Directory.from_possibly_duplicated_entries() to mangle entry names instead of crashing.
-
- Jul 08, 2022
-
-
David Douard authored
-
- Jul 06, 2022
-
-
David Douard authored
When the OriginVisit object given as argument to be inserted already has its visit id set (which is usually the case in a replayer-like session), it makes no sense to auto-add the first OriginVisitStatus object related to this visit; this behavior is expected only when origin_visit_add() is called from a loading session. Adapt tests accordingly: several tests depended on the auto-add behavior of the origin_visit_add method for OriginVisit objects whose visit id is given in the test dataset.
-
- Jul 01, 2022
-
-
David Douard authored
and add tests for 'mirror' and 'read_replica' flavors.
-
David Douard authored
-
- Jun 03, 2022
-
-
Antoine R. Dumont authored
This also simplifies the db collaborator code by reusing core.db functions to check that the code version and the actual db version match. Related to T4305
-
- May 31, 2022
-
-
David Douard authored
instead of swh-core's postgresql_fact one, since we actually do not use its custom features any more in swh-storage.
-
David Douard authored
-
- May 10, 2022
-
-
vlorentz authored
-
- May 09, 2022
-
-
Pratyush authored
-
- May 02, 2022
-
- Apr 28, 2022
-
-
Antoine R. Dumont authored
-
- Apr 26, 2022
-
-
vlorentz authored
-