查询检索邻居太慢

问题描述：

我有一个表示有向图一套Django的ORM模型，我试图找回所有的相邻顶点给定的顶点忽略的边缘方向：查询检索邻居太慢

class Vertex(models.Model): 
    pass 

class Edge(models.Model): 
    orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True) 
    dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True) 
    # ... other data about this edge ...

的查询Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()返回正确的结果，但在我的情况下，执行需要很长的时间。

通常，对于我的应用程序，在任何给定时间都会有大约50-100个顶点，并且大约有一百万条边。即使它减少到只有20个顶点和100000层的边缘，该查询需要大约一分半钟的执行：

for i in range(20): 
    Vertex().save() 

vxs = list(Vertex.objects.all()) 
for i in tqdm.tqdm(range(100000)): 
    Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save() 

v = vxs[0] 
def f1(): 
    return list(Vertex.objects.filter(
     Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()) 

t1 = timeit.Timer(f1) 

print(t1.timeit(number=1)) # 84.21138522100227

在另一方面，如果我起来将查询分为两个部分，我可以得到完全相同的导致只有毫秒屈指可数：

def f2(): 
    q1 = Vertex.objects.filter(edge_orig__dest=v).distinct() 
    q2 = Vertex.objects.filter(edge_dest__orig=v).distinct() 
    return list({x for x in itertools.chain(q1, q2)}) 

t2 = timeit.Timer(f2) 
print(t2.timeit(number=100)/100) # 0.0109818680600074

这第二个版本虽然有问题：

这不是原子。边界列表几乎可以保证在我的应用程序中的两个查询之间发生变化，这意味着结果不会代表单个时间点。
我无法对结果执行额外的处理和聚合，无需手动循环。（例如，如果我想要Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field'))）

为什么第二个版本比第一个版本跑得快得多？我该如何做到这一点，有没有办法让第一个跑得足够快，以供实际使用？

我目前正在使用Python 3.5.2，PostgreSQL 9.5.6和Django 1.11进行测试，但是如果这是我遇到的问题之一。

这里是由每个查询生成的SQL，以及Postgres的的explan：

第一个：

Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)) 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061) 

HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4) 
     Hash Cond: (app_vertex.id = app_edge.orig_id) 
     Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061)) 
     -> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8) 
       Hash Cond: (t4.dest_id = app_vertex.id) 
       -> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8) 
       -> Hash (cost=1.20..1.20 rows=20 width=4) 
        -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4) 
     -> Hash (cost=1541.00..1541.00 rows=100000 width=8) 
       -> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)

第二的：

Vertex.objects.filter(edge_orig__dest=v).distinct() 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
WHERE "app_edge"."dest_id" = 1061 

HashAggregate (cost=1531.42..1531.62 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Join (cost=848.11..1519.04 rows=4950 width=4) 
     Hash Cond: (app_edge.orig_id = app_vertex.id) 
     -> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4) 
       Recheck Cond: (dest_id = 1061) 
       -> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0) 
        Index Cond: (dest_id = 1061) 
     -> Hash (cost=1.20..1.20 rows=20 width=4) 
       -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)

@ khampson的版本也需要一分半钟才能运行，所以它也是不可行的。

Vertex.objects.raw(...) 

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061); 

HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4) 
    Group Key: app_vertex.id 
    -> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4) 
     Hash Cond: (app_vertex.id = app_edge.orig_id) 
     Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061)) 
     -> Hash Join (cost=1.45..2917.45 rows=100000 width=12) 
       Hash Cond: (t4.dest_id = app_vertex.id) 
       -> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8) 
       -> Hash (cost=1.20..1.20 rows=20 width=4) 
        -> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4) 
     -> Hash (cost=1541.00..1541.00 rows=100000 width=8) 
       -> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)

你能分享两个版本正在生成的SQL吗？ – Hamms

@Hamms增加了SQL和Postgres对每个查询的解释。 – AJMansfield

我可能会误解你的模型（实际上，我可能是），但你为什么要过滤'edge_orig__dest = v | edge_orig__dest = v'，而不仅仅是'edge_dest = v | edge_origt = v'？ – Hamms

答

这两个查询的查询计划完全不同。第一个（较慢）没有触及任何索引，并且正在执行两个left join s，这两个都会导致更多行被处理并返回。根据我对Django ORM语法意图的理解，听起来并不像你真的想在这里做left join那样。

我会建议考虑从Django的ORM内掉落下来到原SQL在这种情况下，和杂交两个。例如如果你把第一个，并将其转换为这样的事情：

SELECT DISTINCT "app_vertex"."id" 
FROM "app_vertex" 
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id") 
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id") 
WHERE ("app_edge"."dest_id" = 1061 
     OR T4."orig_id" = 1061);

两个问题有：如何该版本执行，以及它给你，你要寻找的结果？

欲了解更多关于原始查询的信息，请查看的Django doc。this section。

响应从OP评论：

因为我还建议查询的查询计划表明，它没有击中任何索引。

对于涉及的列，您是否在两个表上都有索引？我怀疑没有，特别是对于这个特定的查询，我们正在寻找一个单一的值，这意味着如果有索引，如果查询计划者确定顺序扫描是一个更好的选择，我会感到非常惊讶（OTOH，如果你是查找各种各样的行，例如表中超过10％的行，查询规划师可能会正确做出这样的决定）。

刚刚尝试过，它仍然需要一分半钟才能运行。我已经更新了我的问题以包含此版本的查询的解释。 – AJMansfield

@AJMansfield：添加到以上评论的后续回答中。 – khampson

我不确定是否有索引或没有索引。这些表只是使用普通的django migrations直接从模型类创建的，所以我不确定它是否创建了索引。 – AJMansfield

答

我提出的另一个查询可能是：

# Get edges which contain Vertex v, "only" optimizes fields returned 
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig_id', 'dest_id') 
# Get set of vertex id's to discard duplicates 
vertex_ids = {*edges.values_list('orig_id', flat=True), *edges_values_list('dest_id', flat=True)} 
# Get list of vertices, excluding the original vertex 
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)

这应该不需要任何连接和不应该从你所提到的竞争条件受到影响。

查询检索邻居太慢

相关推荐