  1. Oct 19, 2022
    • test_retry: Use proper way to mock sleep of retryable storage methods · 3c08d9f0
      Antoine Lambert authored
      The previous implementation did not mock the sleep of retryable storage methods,
      because RetryingProxyStorage sets up the retry features when it is instantiated.
      
      Modify the fixture so that the sleep functions are mocked, and return the mocks
      in a dict indexed by storage method name.
      
      This fixes the Debian buster package build for swh-storage.
      v1.7.1
      3c08d9f0
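The fixture change described in this commit can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `retry`, `FlakyStorage`, `RetryingProxy`, and `mock_sleeps` are illustrative stand-ins, not the swh-storage or retry-library API. It only shows why a sleep callable captured at instantiation has to be replaced on each wrapped method, and how the mocks can be returned in a dict keyed by method name.

```python
import time
from unittest.mock import MagicMock


def retry(max_attempts=3, delay=1.0):
    """Hypothetical retry decorator: retry on any exception, sleeping in between."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    # Looked up on the wrapper at call time, so tests can swap it.
                    wrapper.sleep(delay)

        # The sleep callable is bound when the wrapper is built; patching
        # time.sleep afterwards would not change this reference.
        wrapper.sleep = time.sleep
        return wrapper

    return decorator


class FlakyStorage:
    """Hypothetical backend whose method fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def content_add(self, content):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("transient failure")
        return {"content:add": 1}


class RetryingProxy:
    """Hypothetical proxy that wraps retryable methods when it is instantiated."""

    RETRYABLE = ("content_add",)

    def __init__(self, storage):
        for name in self.RETRYABLE:
            setattr(self, name, retry(delay=10.0)(getattr(storage, name)))


def mock_sleeps(proxy):
    """Replace the sleep of every retryable method; return mocks by method name."""
    mocks = {}
    for name in proxy.RETRYABLE:
        mock = MagicMock()
        getattr(proxy, name).sleep = mock
        mocks[name] = mock
    return mocks
```

A test can then call `mock_sleeps(proxy)` after building the proxy and assert on `mocks["content_add"]` to check how many times a retry would have slept, without actually waiting.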
  2. Oct 18, 2022
  3. Oct 17, 2022
  4. Sep 29, 2022
    • postgresql: Remove merge join with origin_visit in origin_visit_get_latest · 657d31f6
      vlorentz authored
      I noticed that `origin_visit_get_latest` spends a lot of time doing index
      scans on `origin_visit_pkey`:
      
      ```
      swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                            QUERY PLAN
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
                     ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                           Index Cond: (origin = $0)
                           Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                           Rows Removed by Filter: 198
                     ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                           ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                                 Index Cond: (id = $0)
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
                     Index Cond: (origin = $0)
       Planning Time: 0.577 ms
       Execution Time: 1432.532 ms
      (18 rows)
      ```
      
      As far as I understand, this is because we do not have a FK to tell the
      planner that every row in `origin_visit_status` does have a
      corresponding row in `origin_visit`, so it checks every row from
      `origin_visit_status` in this loop.
      
      Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare
      this check.
      
      First, here is the original query:
      
      ```
      swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Change columns to filter directly on the "materialized" fields in ovs
      instead of those in ov (no actual change yet):
      
      ```
      swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Then, reorder tables (obviously no change either):
      
      ```
      swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..28.82 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
               Merge Cond: (ovs.visit = ov.visit)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
                     Index Cond: (origin = $0)
      (11 rows)
      ```
      
      Finally, replace `INNER JOIN` with `LEFT JOIN`:
      
      ```
      swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                                  QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=9.71..35.47 rows=1 width=113)
         InitPlan 1 (returns $0)
           ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
                 Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
         ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
               ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
                     Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
      (10 rows)
      ```
      
      This would also work with a subquery just to get the value of `ov.date`
      and removing the actual join to `ov` entirely, but it was more annoying
      to implement because the function reuses `self.origin_visit_select_cols`
      as column list.
      
      All these EXPLAIN queries were run on staging.
      v1.7.0
      657d31f6
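The result-level equivalence this commit relies on can be checked with a small sketch. It uses an in-memory SQLite database and a heavily simplified, hypothetical schema (the real tables have more columns and PostgreSQL-specific types), so it says nothing about the planner. It only shows that `INNER JOIN`, `LEFT JOIN`, and the subquery variant return the same row as long as every `origin_visit_status` row has a matching `origin_visit` row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the two tables; column sets are assumptions.
conn.executescript(
    """
    CREATE TABLE origin_visit (
        origin INTEGER, visit INTEGER, date TEXT,
        PRIMARY KEY (origin, visit)
    );
    CREATE TABLE origin_visit_status (
        origin INTEGER, visit INTEGER, date TEXT, status TEXT, snapshot TEXT,
        PRIMARY KEY (origin, visit, date)
    );
    """
)
conn.executemany(
    "INSERT INTO origin_visit VALUES (?, ?, ?)",
    [(1, 1, "2022-09-01"), (1, 2, "2022-09-02"), (1, 3, "2022-09-03")],
)
conn.executemany(
    "INSERT INTO origin_visit_status VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2022-09-01T00:00", "created", None),
        (1, 1, "2022-09-01T00:05", "full", "snap-a"),
        (1, 2, "2022-09-02T00:00", "created", None),
        (1, 2, "2022-09-02T00:05", "full", "snap-b"),
        (1, 3, "2022-09-03T00:00", "created", None),
        (1, 3, "2022-09-03T00:05", "partial", "snap-c"),
    ],
)

# Same filter and ordering as in the commit message, driven from ovs.
JOIN_QUERY = """
SELECT ovs.origin, ovs.visit, ov.date, ovs.date, ovs.status, ovs.snapshot
FROM origin_visit_status ovs
{join} origin_visit ov ON ov.origin = ovs.origin AND ov.visit = ovs.visit
WHERE ovs.origin = ? AND ovs.snapshot IS NOT NULL AND ovs.status = 'full'
ORDER BY ovs.visit DESC, ovs.date DESC
LIMIT 1
"""

# Variant without any join: a correlated subquery fetches the visit date.
SUBQUERY = """
SELECT ovs.origin, ovs.visit,
       (SELECT ov.date FROM origin_visit ov
        WHERE ov.origin = ovs.origin AND ov.visit = ovs.visit),
       ovs.date, ovs.status, ovs.snapshot
FROM origin_visit_status ovs
WHERE ovs.origin = ? AND ovs.snapshot IS NOT NULL AND ovs.status = 'full'
ORDER BY ovs.visit DESC, ovs.date DESC
LIMIT 1
"""


def latest_full_visit(join, origin_id):
    """Run the join query with the given join keyword and return one row."""
    return conn.execute(JOIN_QUERY.format(join=join), (origin_id,)).fetchone()


inner_row = latest_full_visit("INNER JOIN", 1)
left_row = latest_full_visit("LEFT JOIN", 1)
subquery_row = conn.execute(SUBQUERY, (1,)).fetchone()
```

If a status row had no matching visit row, the `LEFT JOIN` and subquery variants would still return it with a NULL visit date, while the `INNER JOIN` would skip it. The rewrite is therefore only safe under the invariant stated in the commit message.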
    • conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed · 44616afe
      vlorentz authored
      The hack crashes on >= 4 because 'pytest_cov.embed.multiprocessing_start'
      is not in the hook list anymore.
      
      https://pytest-cov.readthedocs.io/en/latest/changelog.html
      44616afe
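One way to gate a compatibility hack on the installed plugin version is sketched below. This is not the code from the commit: the helper names are hypothetical, and the sketch only decides whether the legacy code path may be used. It does not reproduce the hack or its replacement.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '4.0.0'."""
    return int(version_string.split(".")[0])


def installed_major(dist_name):
    """Major version of an installed distribution, or None if it is absent."""
    try:
        return major_version(version(dist_name))
    except PackageNotFoundError:
        return None


def use_legacy_hack(pytest_cov_major):
    """Assumption: the legacy hack is only usable with pytest-cov below 4."""
    return pytest_cov_major is not None and pytest_cov_major < 4
```

In a `conftest.py`, the result of `use_legacy_hack(installed_major("pytest-cov"))` could select between the old and the new code path.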
  5. Sep 28, 2022
  6. Sep 27, 2022
  7. Sep 13, 2022
  8. Aug 31, 2022
  9. Aug 30, 2022
  10. Aug 29, 2022
  11. Aug 22, 2022
    • origin_visit_add: Fix crash when adding multiple visits to the same origin simultaneously · b5836bab
      vlorentz authored
      This works by adding a RW lock on the row of the latest visit,
      which should block other transactions until the insertion
      is committed, so that other transactions will generate a different
      (larger) visit id.
      
      This commit also slightly rewrites how the max visit id is
      computed, as we need to actually select a row to lock it,
      instead of using the `max()` aggregate function.
      b5836bab
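The locking idea in this commit can be sketched as an in-memory analogy. This is not the PostgreSQL implementation: a per-origin `threading.Lock` stands in for the row lock (which in SQL would presumably be taken with something like `SELECT ... FOR UPDATE`, an assumption here), and `VisitStore` and `add_concurrently` are hypothetical names.

```python
import threading
from collections import defaultdict


class VisitStore:
    """Hypothetical store that allocates per-origin visit ids under a lock."""

    def __init__(self):
        self._visits = defaultdict(list)
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, origin):
        # Guard the creation of per-origin locks themselves.
        with self._locks_guard:
            return self._locks[origin]

    def add_visit(self, origin):
        # Holding the lock from "read latest id" to "insert" mirrors blocking
        # other transactions until the insertion is committed.
        with self._lock_for(origin):
            visits = self._visits[origin]
            latest = visits[-1] if visits else 0
            visit_id = latest + 1
            visits.append(visit_id)
            return visit_id


def add_concurrently(store, origin, n_threads):
    """Add one visit per thread and return the sorted allocated ids."""
    results = []
    results_lock = threading.Lock()

    def worker():
        visit_id = store.add_visit(origin)
        with results_lock:
            results.append(visit_id)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

Because the latest id is read from an actual element rather than computed as an aggregate, there is something concrete to lock. This matches the commit's remark that a row has to be selected for the lock to apply.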
  12. Aug 09, 2022
  13. Aug 08, 2022
  14. Aug 05, 2022
  15. Aug 04, 2022
  16. Jul 13, 2022
    • postgresql: Increase some timeouts to get origin visits · fbe38038
      Antoine Lambert authored
      Even though the missing index to speed up origin visit queries has
      been added to the replica database, the timeouts configured for
      origin_visit_get_with_statuses and origin_visit_find_by_date
      were still too low to avoid query timeouts in production.
      
      After performing some tests locally, bumping them to 2000ms
      makes the timeouts go away.
      
      Related to T4386
      v1.4.2
      fbe38038
  17. Jul 12, 2022
  18. Jul 08, 2022
  19. Jul 06, 2022
    • do not always auto-create an OriginVisitStatus object in origin_visit_add() · e0825acb
      David Douard authored
      When the OriginVisit object given as argument to be inserted already
      has its visit id set (which is usually the case in a replayer-like
      session), it makes no sense to auto-add the first OriginVisitStatus
      object related to this visit; this behavior is expected only when
      origin_visit_add() is called from a loading session.
      
      Adapt tests accordingly: several tests depended on the auto-add
      behavior of the origin_visit_add method for OriginVisit objects whose
      visit_id is given in the test dataset.
      e0825acb
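The behavior described in this commit can be sketched with a toy in-memory storage. The classes below are hypothetical simplifications, not the swh-model or swh-storage types, and the `"created"` status name for the auto-added object is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Visit:
    origin: str
    date: str
    visit: Optional[int] = None  # None means "let the storage pick an id"


@dataclass(frozen=True)
class VisitStatus:
    origin: str
    visit: int
    date: str
    status: str


class InMemoryStorage:
    """Hypothetical storage illustrating the conditional auto-add."""

    def __init__(self):
        self.visits: List[Visit] = []
        self.statuses: List[VisitStatus] = []

    def origin_visit_add(self, visits):
        added = []
        for visit in visits:
            if visit.visit is None:
                # Loading session: allocate an id and auto-add the first status.
                known = (v.visit for v in self.visits if v.origin == visit.origin)
                visit = replace(visit, visit=1 + max(known, default=0))
                self.statuses.append(
                    VisitStatus(visit.origin, visit.visit, visit.date, "created")
                )
            # Replayer-like session: the id is already set, so no status is
            # created; statuses are expected to be inserted separately.
            self.visits.append(visit)
            added.append(visit)
        return added
```

With this shape, tests that pass visits with a preset id must add the matching status objects explicitly, which is the adaptation the commit message mentions.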
  20. Jul 01, 2022
  21. Jun 03, 2022
  22. May 31, 2022
  23. May 10, 2022
  24. May 09, 2022
  25. May 02, 2022
  26. Apr 28, 2022
  27. Apr 26, 2022