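Optimizing a Postgres multi-join query with conditions on multiple tables
SUMMARY
When certain criteria are added to a multi-table join over tables with many rows, the query becomes orders of magnitude slower. I have tried many things to speed it up, including every type of table join, reordering the joins, reordering the WHERE clause, using subqueries, using CASE statements in the WHERE clause, and so on.
SQL details are below.
QUESTIONS
- Why does adding this simple condition cause the planner to radically change its execution plan?
- Is it possible to tell the planner which condition to evaluate first without drastically changing the query or resorting to subqueries (using WITH, for example)?
Note: I am trying to write a generic SQL generator API that lets callers specify arbitrary conditions at any point in the graph. The problem is that some of these calls are fast and others are not, because of the way Postgres plans the execution. An optimization designed specifically for this query will not help me meet the larger goal of a generic SQL builder.
DETAILS
I have a schema that stores vertices and edges in Postgres (a simple graph database):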
CREATE TABLE IF NOT EXISTS vertex (type text, id serial, name text, data jsonb, UNIQUE (id));
CREATE INDEX vertex_data_idx ON vertex USING gin (data jsonb_path_ops);
CREATE INDEX vertex_type_idx ON vertex (type);
CREATE INDEX vertex_name_idx ON vertex (name);
CREATE TABLE IF NOT EXISTS edge (src integer REFERENCES vertex (id), dst integer REFERENCES vertex (id));
CREATE INDEX edge_src_idx ON edge (src);
CREATE INDEX edge_dst_idx ON edge (dst);
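The schema stores graphs, one of which looks like this: PLANET -> CONTINENT -> COUNTRY -> REGION
There are 447554 total vertices and 3155047 total edges in my sample database, but the relevant data is here:
- 5 planets (each related to 5 continents)
- 25 continents (each related to 2500 countries)
- 62500 countries (25% of which are related to 100 regions each; the rest have no REGION relationship)
- 250000 regions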
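The distribution listed above can be checked directly against the schema. A minimal sketch that uses only the `vertex` table as defined earlier; the `vertices` alias is illustrative:

```sql
-- Count vertices per type to confirm the distribution described above.
SELECT type, count(*) AS vertices
FROM vertex
GROUP BY type
ORDER BY vertices;
```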
This query, which finds the planets where Spanish is spoken in any region, is fast:
SELECT DISTINCT
v1.name as name, v1.id as id
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
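Planning time: 6.289 ms Execution time: 0.744 ms
When I add a condition on an indexed column of the first table in the graph (v1), a condition that has no effect on the results, the query is 12657 times slower: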
SELECT DISTINCT
v1.name as name, v1.id as id
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v1.type = 'PLANET' AND
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
Planning time: 7.664 ms Execution time: 89010.096 ms
Here is the EXPLAIN (ANALYZE, BUFFERS) output for the first, fast call:
Unique  (cost=154592.03..155453.96 rows=114925 width=28) (actual time=0.585..0.616 rows=4 loops=1)
  Buffers: shared hit=92
  ->  Sort  (cost=154592.03..154879.34 rows=114925 width=28) (actual time=0.579..0.588 rows=4 loops=1)
        Sort Key: v1.name, v1.id
        Sort Method: quicksort  Memory: 17kB
        Buffers: shared hit=92
        ->  Nested Loop  (cost=37.96..142377.39 rows=114925 width=28) (actual time=0.155..0.549 rows=4 loops=1)
              Buffers: shared hit=92
              ->  Nested Loop  (cost=37.53..80131.76 rows=114925 width=4) (actual time=0.141..0.468 rows=4 loops=1)
                    Join Filter: (v2.id = e1.dst)
                    Buffers: shared hit=76
                    ->  Nested Loop  (cost=37.10..49179.08 rows=14270 width=8) (actual time=0.126..0.386 rows=4 loops=1)
                          Buffers: shared hit=60
                          ->  Nested Loop  (cost=36.68..41450.17 rows=14270 width=4) (actual time=0.112..0.304 rows=4 loops=1)
                                Join Filter: (v3.id = e2.dst)
                                Buffers: shared hit=44
                                ->  Nested Loop  (cost=36.25..37606.57 rows=1772 width=8) (actual time=0.092..0.209 rows=4 loops=1)
                                      Buffers: shared hit=28
                                      ->  Nested Loop  (cost=35.83..36646.82 rows=1772 width=4) (actual time=0.074..0.116 rows=4 loops=1)
                                            Buffers: shared hit=12
                                            ->  Bitmap Heap Scan on vertex v4  (cost=30.99..1514.00 rows=220 width=4) (actual time=0.039..0.042 rows=1 loops=1)
                                                  Recheck Cond: (data @> '{"languages":["spanish"]}'::jsonb)
                                                  Filter: (type = 'REGION'::text)
                                                  Heap Blocks: exact=1
                                                  Buffers: shared hit=5
                                                  ->  Bitmap Index Scan on vertex_data_idx  (cost=0.00..30.94 rows=392 width=0) (actual time=0.020..0.020 rows=1 loops=1)
                                                        Index Cond: (data @> '{"languages":["spanish"]}'::jsonb)
                                                        Buffers: shared hit=4
                                            ->  Bitmap Heap Scan on edge e3  (cost=4.84..159.12 rows=57 width=8) (actual time=0.023..0.037 rows=4 loops=1)
                                                  Recheck Cond: (dst = v4.id)
                                                  Heap Blocks: exact=4
                                                  Buffers: shared hit=7
                                                  ->  Bitmap Index Scan on edge_dst_idx  (cost=0.00..4.82 rows=57 width=0) (actual time=0.013..0.013 rows=4 loops=1)
                                                        Index Cond: (dst = v4.id)
                                                        Buffers: shared hit=3
                                      ->  Index Only Scan using vertex_id_key on vertex v3  (cost=0.42..0.53 rows=1 width=4) (actual time=0.008..0.011 rows=1 loops=4)
                                            Index Cond: (id = e3.src)
                                            Heap Fetches: 4
                                            Buffers: shared hit=16
                                ->  Index Scan using edge_dst_idx on edge e2  (cost=0.43..1.46 rows=57 width=8) (actual time=0.008..0.011 rows=1 loops=4)
                                      Index Cond: (dst = e3.src)
                                      Buffers: shared hit=16
                          ->  Index Only Scan using vertex_id_key on vertex v2  (cost=0.42..0.53 rows=1 width=4) (actual time=0.006..0.009 rows=1 loops=4)
                                Index Cond: (id = e2.src)
                                Heap Fetches: 4
                                Buffers: shared hit=16
                    ->  Index Scan using edge_dst_idx on edge e1  (cost=0.43..1.46 rows=57 width=8) (actual time=0.005..0.008 rows=1 loops=4)
                          Index Cond: (dst = e2.src)
                          Buffers: shared hit=16
              ->  Index Scan using vertex_id_key on vertex v1  (cost=0.42..0.53 rows=1 width=28) (actual time=0.006..0.009 rows=1 loops=4)
                    Index Cond: (id = e1.src)
                    Buffers: shared hit=16
Planning time: 6.940 ms
Execution time: 0.714 ms
And for the second, slow call:
HashAggregate  (cost=592.23..592.24 rows=1 width=28) (actual time=89009.873..89009.885 rows=4 loops=1)
  Group Key: v1.name, v1.id
  Buffers: shared hit=11644657 read=1240045
  ->  Nested Loop  (cost=2.98..592.22 rows=1 width=28) (actual time=9098.961..89009.833 rows=4 loops=1)
        Buffers: shared hit=11644657 read=1240045
        ->  Nested Loop  (cost=2.56..306.89 rows=522 width=32) (actual time=0.424..30066.007 rows=3092522 loops=1)
              Buffers: shared hit=454795 read=46267
              ->  Nested Loop  (cost=2.13..86.31 rows=65 width=36) (actual time=0.306..2120.293 rows=62500 loops=1)
                    Buffers: shared hit=239162 read=12162
                    ->  Nested Loop  (cost=1.70..51.10 rows=65 width=32) (actual time=0.261..574.490 rows=62500 loops=1)
                          Buffers: shared hit=488 read=562
                          ->  Nested Loop  (cost=1.27..23.95 rows=8 width=36) (actual time=0.205..1.206 rows=25 loops=1)
                                Buffers: shared hit=109 read=17
                                ->  Nested Loop  (cost=0.85..19.62 rows=8 width=32) (actual time=0.173..0.547 rows=25 loops=1)
                                      Buffers: shared hit=12 read=14
                                      ->  Index Scan using vertex_type_idx on vertex v1  (cost=0.42..8.44 rows=1 width=28) (actual time=0.123..0.153 rows=5 loops=1)
                                            Index Cond: (type = 'PLANET'::text)
                                            Buffers: shared hit=2 read=4
                                      ->  Index Scan using edge_src_idx on edge e1  (cost=0.43..10.18 rows=100 width=8) (actual time=0.021..0.039 rows=5 loops=5)
                                            Index Cond: (src = v1.id)
                                            Buffers: shared hit=10 read=10
                                ->  Index Only Scan using vertex_id_key on vertex v2  (cost=0.42..0.53 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=25)
                                      Index Cond: (id = e1.dst)
                                      Heap Fetches: 25
                                      Buffers: shared hit=97 read=3
                          ->  Index Scan using edge_src_idx on edge e2  (cost=0.43..2.39 rows=100 width=8) (actual time=0.031..8.504 rows=2500 loops=25)
                                Index Cond: (src = v2.id)
                                Buffers: shared hit=379 read=545
                    ->  Index Only Scan using vertex_id_key on vertex v3  (cost=0.42..0.53 rows=1 width=4) (actual time=0.010..0.013 rows=1 loops=62500)
                          Index Cond: (id = e2.dst)
                          Heap Fetches: 62500
                          Buffers: shared hit=238674 read=11600
              ->  Index Scan using edge_src_idx on edge e3  (cost=0.43..2.39 rows=100 width=8) (actual time=0.013..0.163 rows=49 loops=62500)
                    Index Cond: (src = v3.id)
                    Buffers: shared hit=215633 read=34105
        ->  Index Scan using vertex_id_key on vertex v4  (cost=0.42..0.54 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=3092522)
              Index Cond: (id = e3.dst)
              Filter: ((data @> '{"languages":["spanish"]}'::jsonb) AND (type = 'REGION'::text))
              Rows Removed by Filter: 1
              Buffers: shared hit=11189862 read=1193778
Planning time: 7.664 ms
Execution time: 89010.096 ms
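In the slow plan the estimates and the actual counts diverge early: the planner expects 1 PLANET row and finds 5, and it expects 100 rows from the edge_src_idx scan on e2 where each loop actually returns 2500. A sketch of a query that measures the real average out-degree per vertex type, for comparison against those estimates. It uses only the tables defined above; the column aliases are illustrative:

```sql
-- Average number of outgoing edges per source vertex, grouped by the source's type.
-- Compare avg_out_degree with the rows=... estimates on the edge_src_idx scans.
SELECT v.type,
       count(DISTINCT e.src) AS sources,
       count(*)              AS edges,
       round(count(*)::numeric / count(DISTINCT e.src), 1) AS avg_out_degree
FROM edge e
JOIN vertex v ON v.id = e.src
GROUP BY v.type
ORDER BY v.type;
```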
[Reposting as an answer because I need the formatting]
The edge table desperately needs a primary key (this implies NOT NULL for {src, dst}, which is good):
CREATE TABLE IF NOT EXISTS edge
(src integer NOT NULL REFERENCES vertex (id)
, dst integer NOT NULL REFERENCES vertex (id)
, PRIMARY KEY (src,dst)
);
CREATE UNIQUE INDEX edge_dst_src_idx ON edge (dst, src);
-- the estimates in the question seem to be off, statistics may be absent.
VACUUM ANALYZE edge; -- refresh the statistics
VACUUM ANALYZE vertex;
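To see whether statistics are actually present after the ANALYZE, the standard `pg_stats` view can be inspected. The sketch below also raises the per-column statistics target; the value 1000 is an illustrative assumption, not something taken from the question, and whether it improves the estimates here is untested:

```sql
-- What the planner currently believes about the columns used in the joins and filters.
SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE (tablename = 'vertex' AND attname = 'type')
   OR (tablename = 'edge' AND attname IN ('src', 'dst'));

-- Assumption: a larger sample may help with skewed columns; 1000 is illustrative only.
ALTER TABLE vertex ALTER COLUMN type SET STATISTICS 1000;
ALTER TABLE edge ALTER COLUMN src SET STATISTICS 1000;
ALTER TABLE edge ALTER COLUMN dst SET STATISTICS 1000;

-- The new target only takes effect once the tables are analyzed again.
ANALYZE vertex;
ANALYZE edge;
```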
I would also combine the {type, name} indexes (type seems to have very low cardinality). Maybe even make it UNIQUE and NOT NULL, but I don't know your data.
CREATE INDEX vertex_type_name_idx ON vertex (type, name);
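I think using a subquery will prevent postgresql from using the index. So try the following query to test the performance improvement from not using the index: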
select * from (
SELECT DISTINCT
v1.name as name, v1.id as id, v1.type as v1_type
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
) t1
where v1_type = 'PLANET'
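A related variant, not part of the answer above, is to isolate the selective REGION filter in a CTE. This is only a sketch: `spanish_regions` is a hypothetical name, and the CTE guarantees only that the filter is evaluated separately, not that the remaining joins are planned in any particular order.

```sql
-- Assumption: PostgreSQL 12 or later. On older versions drop the MATERIALIZED
-- keyword; there a plain WITH query is always computed separately.
WITH spanish_regions AS MATERIALIZED (
    SELECT id
    FROM vertex
    WHERE type = 'REGION'
      AND data @> '{"languages":["spanish"]}'::jsonb
)
SELECT DISTINCT v1.name AS name, v1.id AS id
FROM spanish_regions v4
JOIN edge e3   ON e3.dst = v4.id
JOIN vertex v3 ON v3.id = e3.src
JOIN edge e2   ON e2.dst = v3.id
JOIN vertex v2 ON v2.id = e2.src
JOIN edge e1   ON e1.dst = v2.id
JOIN vertex v1 ON v1.id = e1.src
WHERE v1.type = 'PLANET';
```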
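Thanks for your comment. I have tried a subquery and it does do what I expect, but unfortunately I am trying to create a generic query builder. These kinds of specific optimizations are useful when testing, but I am starting to think there is no generic way to force the planner to use one particular index before another without reorganizing the query into subqueries (which defeats the "generic query builder" directive). – Voluntari
@Voluntari I am not familiar enough with postgresql, but in mysql and oracle you can tell it not to use an index. –
@Msfvtp "I think using a subquery will prevent postgresql from using the index": that would be a major failure of any query optimizer. It is certainly not true of Oracle, and I doubt it is true of any mainstream RDBMS. –
Remove the 'LEFT JOIN's. They are not needed and only confuse the optimizer. –
The outer join on 'v4' is useless, because the 'where' condition effectively turns it into an inner join –
How did you get on with the answers below, Voluntari? – halfer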
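The comments about the LEFT JOINs can be sketched as a rewrite with plain inner joins. Combined with the `join_collapse_limit` setting, which at 1 stops the planner from reordering explicit JOINs, the written order becomes the join order. Starting from v4 is an assumption based on the fast plan in the question; a generic query builder would need its own rule for picking the starting table.

```sql
BEGIN;

-- With join_collapse_limit = 1 the planner keeps explicit JOINs in the order written.
-- SET LOCAL limits the change to this transaction.
SET LOCAL join_collapse_limit = 1;

SELECT DISTINCT v1.name AS name, v1.id AS id
FROM vertex v4                      -- assumption: most selective table goes first
JOIN edge e3   ON e3.dst = v4.id
JOIN vertex v3 ON v3.id = e3.src
JOIN edge e2   ON e2.dst = v3.id
JOIN vertex v2 ON v2.id = e2.src
JOIN edge e1   ON e1.dst = v2.id
JOIN vertex v1 ON v1.id = e1.src
WHERE v4.type = 'REGION'
  AND v4.data @> '{"languages":["spanish"]}'::jsonb
  AND v1.type = 'PLANET';

COMMIT;
```

Running the same statement under EXPLAIN (ANALYZE, BUFFERS) shows whether the plan now starts from the vertex_data_idx scan, as the fast plan does.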