postgresql: Remove merge join with origin_visit in origin_visit_get_latest
I noticed that `origin_visit_get_latest` spends a lot of time doing index scans on `origin_visit_pkey`: ``` swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1) InitPlan 1 (returns $0) -> Index Scan using origin_url_idx on origin o_1 (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1) Index Cond: (url = 'https://pypi.org/project/simpleado/'::text) -> Merge Join (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1) Merge Cond: (ovs.visit = ov.visit) -> Nested Loop (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1) -> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1) Index Cond: (origin = $0) Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state)) Rows Removed by Filter: 198 -> Materialize (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1) -> Index Scan using origin_pkey on origin o (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1) Index Cond: (id = $0) -> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1) Index Cond: (origin = $0) Planning Time: 0.577 ms Execution Time: 1432.532 ms (18 lignes) ``` As far as I understand, this is because we do not have a FK to tell the planner that every row in `origin_visit_status` does have a corresponding row in `origin_visit`, so it checks every row from `origin_visit_status` in this loop. Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare this check. First, here is the original query: ``` swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=9.71..28.82 rows=1 width=113) InitPlan 1 (returns $0) -> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8) Index Cond: (url = 'https://pypi.org/project/simpleado/'::text) -> Merge Join (cost=1.13..2198.75 rows=115 width=113) Merge Cond: (ovs.visit = ov.visit) -> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) Index Cond: (origin = $0) Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state)) -> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28) Index Cond: (origin = $0) (11 lignes) ``` Change columns to filter directly on the "materialized" fields in ovs instead of those on those in ov (no actual change yet): ``` swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=9.71..28.82 rows=1 width=113) InitPlan 1 (returns $0) -> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8) Index Cond: (url = 'https://pypi.org/project/simpleado/'::text) -> Merge Join (cost=1.13..2198.75 rows=115 width=113) Merge Cond: (ovs.visit = ov.visit) -> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) Index Cond: (origin = $0) Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state)) -> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28) Index Cond: (origin = $0) (11 lignes) ``` Then, reorder tables (obviously no change either): ``` swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=9.71..28.82 rows=1 width=113) InitPlan 1 (returns $0) -> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8) Index Cond: (url = 'https://pypi.org/project/simpleado/'::text) -> Merge Join (cost=1.13..2198.75 rows=115 width=113) Merge Cond: (ovs.visit = ov.visit) -> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) Index Cond: (origin = $0) Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state)) -> Index Scan Backward using origin_visit_pkey on origin_visit ov (cost=0.56..590.92 rows=150 width=28) Index Cond: (origin = $0) (11 lignes) ``` Finally, replace `INNER JOIN` with `LEFT JOIN`: ``` swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Limit (cost=9.71..35.47 rows=1 width=113) InitPlan 1 (returns $0) -> Index Scan using origin_url_idx on origin o (cost=0.56..8.57 rows=1 width=8) Index Cond: (url = 'https://pypi.org/project/simpleado/'::text) -> Nested Loop Left Join (cost=1.13..2396.79 rows=93 width=113) -> Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs (cost=0.57..1606.07 rows=93 width=85) Index Cond: (origin = $0) Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state)) -> Index Scan using origin_visit_pkey on origin_visit ov (cost=0.56..8.59 rows=1 width=28) Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit)) (10 lignes) ``` This would also work with a subquery just to get the value of `ov.date` and removing the actual join to `ov` entirely, but it was more annoying to implement because the function reuses `self.origin_visit_select_cols` as column list. All these EXPLAIN queries were run on staging.
Please register or sign in to comment