How to order distinct tuples in a PostgreSQL query

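Question:

I am trying to submit a query in Postgres that returns only distinct tuples. In my sample query, I do not want duplicate entries where a cluster_id/feed_id combination appears more than once. If I run a simple: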

select distinct on (cluster_info.cluster_id, feed_id) 
    cluster_info.cluster_id, num_docs, feed_id, url_time 
    from url_info 
    join cluster_info on (cluster_info.cluster_id = url_info.cluster_id) 
    where feed_id in (select pot_seeder from potentials) 
    and num_docs > 5 and url_time > '2012-04-16'; 

I get exactly that, but I would also like to sort the result by num_docs. So, when I run the following:

select distinct on (cluster_info.cluster_id, feed_id) 
    cluster_info.cluster_id, num_docs, feed_id, url_time 
    from url_info join cluster_info 
    on (cluster_info.cluster_id = url_info.cluster_id) 
    where feed_id in (select pot_seeder from potentials) 
    and num_docs > 5 and url_time > '2012-04-16' 
    order by num_docs desc; 

I get the following error:

ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions 
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste... 

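I think I understand why I get the error (I cannot group by tuples unless I explicitly describe the group somehow), but how do I do that? Or, if my interpretation of the error is incorrect, is there a way to accomplish my original goal?

The leftmost ORDER BY items cannot disagree with the items of the DISTINCT ON clause. I quote the manual about DISTINCT: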

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
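
To make the quoted rule concrete, here is a minimal sketch on a made-up table. The table name demo_docs and its rows are assumptions for illustration only, not the asker's schema. The ORDER BY starts with the same expressions as DISTINCT ON, and the extra num_docs DESC term decides which row survives in each group.

```sql
-- Hypothetical demo table; not part of the original schema.
CREATE TEMP TABLE demo_docs (
    cluster_id int,
    feed_id    int,
    num_docs   int
);

INSERT INTO demo_docs (cluster_id, feed_id, num_docs) VALUES
    (1, 10, 6),
    (1, 10, 9),
    (2, 20, 7);

-- One row per (cluster_id, feed_id). Within each group the row with
-- the highest num_docs is kept because of the trailing ORDER BY term.
SELECT DISTINCT ON (cluster_id, feed_id)
       cluster_id, feed_id, num_docs
FROM   demo_docs
ORDER  BY cluster_id, feed_id, num_docs DESC;
```

With these three rows the query keeps (1, 10, 9) and (2, 20, 7). Dropping the first two ORDER BY terms reproduces the error from the question.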

Try:

SELECT * 
FROM (
    SELECT DISTINCT ON (c.cluster_id, feed_id) 
      c.cluster_id, num_docs, feed_id, url_time 
    FROM url_info u 
    JOIN cluster_info c ON (c.cluster_id = u.cluster_id) 
    WHERE feed_id IN (SELECT pot_seeder FROM potentials) 
    AND num_docs > 5 
    AND url_time > '2012-04-16' 
    ORDER BY c.cluster_id, feed_id, num_docs, url_time 
      -- first columns match DISTINCT 
      -- the rest to pick certain values for dupes 
      -- or did you want to pick random values for dupes? 
    ) x 
ORDER BY num_docs DESC; 

Or use GROUP BY:

SELECT c.cluster_id 
    , num_docs 
    , feed_id 
    , url_time 
FROM url_info u 
JOIN cluster_info c ON (c.cluster_id = u.cluster_id) 
WHERE feed_id IN (SELECT pot_seeder FROM potentials) 
AND num_docs > 5 
AND url_time > '2012-04-16' 
GROUP BY c.cluster_id, feed_id 
ORDER BY num_docs DESC; 

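If c.cluster_id and feed_id are the primary key columns of all tables (both, in this case) whose columns you include in the SELECT list, then this works as written, but only with PostgreSQL 9.1 or later.

Otherwise you need to GROUP BY the remaining columns, aggregate them, or provide more information.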
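
As a sketch of the "aggregate" option, assume the goal is one row per (cluster_id, feed_id) carrying the largest num_docs and the latest url_time. The table demo_urls, its rows, and the output aliases are hypothetical and chosen only for illustration.

```sql
-- Hypothetical demo table; not part of the original schema.
CREATE TEMP TABLE demo_urls (
    cluster_id int,
    feed_id    int,
    num_docs   int,
    url_time   date
);

INSERT INTO demo_urls (cluster_id, feed_id, num_docs, url_time) VALUES
    (1, 10, 6, '2012-04-17'),
    (1, 10, 9, '2012-04-18'),
    (2, 20, 7, '2012-04-19');

-- Every selected column is either in GROUP BY or wrapped in an
-- aggregate, so this runs on any PostgreSQL version.
SELECT cluster_id,
       feed_id,
       max(num_docs) AS max_num_docs,
       max(url_time) AS latest_url_time
FROM   demo_urls
GROUP  BY cluster_id, feed_id
ORDER  BY max_num_docs DESC;
```

Note that the two max() values are computed independently, so they may come from different source rows. If all values must come from the same row, the DISTINCT ON form is the better fit.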

I think I need GROUP BY, because for the second answer I get: ERROR: column "c.num_docs" must appear in the GROUP BY clause or be used in an aggregate function – WildBill 2012-04-21 21:25:38

Your first answer gives the following error: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 3: SELECT DISTINCT ON (c.cluster_id, feed_id) – WildBill 2012-04-21 21:26:08

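@WildBill: You may have missed the update to the first query. I fixed an error in my first version. As for the second query: if you provide the missing information (which columns belong to which table, which columns are the primary keys, and your PostgreSQL version), my answer could be more specific. – 2012-04-21 22:21:57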