postgresql: Remove merge join with origin_visit in origin_visit_get_latest
I noticed that origin_visit_get_latest
spends a lot of time doing index
scans on origin_visit_pkey
:
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using origin_url_idx on origin o_1 (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
-> Merge Join (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
Merge Cond: (ovs.visit = ov.visit)
-> Nested Loop (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
-> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
Index Cond: (origin = $0)
Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
Rows Removed by Filter: 198
-> Materialize (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
-> Index Scan using origin_pkey on origin o (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
Index Cond: (id = $0)
-> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
Index Cond: (origin = $0)
Planning Time: 0.577 ms
Execution Time: 1432.532 ms
(18 lignes)
As far as I understand, this is because we do not have a FK to tell the
planner that every row in origin_visit_status
does have a
corresponding row in origin_visit
, so it checks every row from
origin_visit_status
in this loop.
Therefore, I rewrote the query to use a LEFT JOIN
, so it will spare
this check.
First, here is the original query:
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=9.71..28.82 rows=1 width=113)
InitPlan 1 (returns $0)
-> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8)
Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
-> Merge Join (cost=1.13..2198.75 rows=115 width=113)
Merge Cond: (ovs.visit = ov.visit)
-> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85)
Index Cond: (origin = $0)
Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
-> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28)
Index Cond: (origin = $0)
(11 lignes)
Change columns to filter directly on the "materialized" fields in ovs instead of those on those in ov (no actual change yet):
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=9.71..28.82 rows=1 width=113)
InitPlan 1 (returns $0)
-> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8)
Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
-> Merge Join (cost=1.13..2198.75 rows=115 width=113)
Merge Cond: (ovs.visit = ov.visit)
-> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85)
Index Cond: (origin = $0)
Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
-> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28)
Index Cond: (origin = $0)
(11 lignes)
Then, reorder tables (obviously no change either):
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=9.71..28.82 rows=1 width=113)
InitPlan 1 (returns $0)
-> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8)
Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
-> Merge Join (cost=1.13..2198.75 rows=115 width=113)
Merge Cond: (ovs.visit = ov.visit)
-> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85)
Index Cond: (origin = $0)
Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
-> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28)
Index Cond: (origin = $0)
(11 lignes)
Finally, replace INNER JOIN
with LEFT JOIN
:
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=9.71..35.47 rows=1 width=113)
InitPlan 1 (returns $0)
-> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8)
Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
-> Nested Loop Left Join (cost=1.13..2396.79 rows=93 width=113)
-> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85)
Index Cond: (origin = $0)
Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
-> Index Scan using origin_visit_pkey on origin_visit ov (cost=0.56..8.59 rows=1 width=28)
Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)
This would also work with a subquery just to get the value of ov.date
and removing the actual join to ov
entirely, but it was more annoying
to implement because the function reuses self.origin_visit_select_cols
as column list.
All these EXPLAIN queries were run on staging.
Migrated from D8574 (view on Phabricator)