Window functions in PostgreSQL
Question:
I'm facing a query design problem and am not sure whether my approach is unnecessarily complicated. One of the current analytic queries (for example) is:
with intervals as (
select
(select '09/27/2014'::date) + (n || ' minutes')::interval start_time,
(select '09/27/2014'::date) + ((n+60) || ' minutes')::interval end_time
from generate_series(0, (24*60*7), 60 * 4) n
)
select
extract(epoch from i.start_time)::numeric * 1000 as ts,
extract(epoch from i.end_time)::numeric * 1000 as end_ts,
sum(avg(messages.score)) over (order by i.start_time) as score
from messages
right join intervals i
on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
where messages.timestamp between '09/27/2014' and '10/04/2014'
group by i.start_time, i.end_time
order by i.start_time
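As you can probably tell, this query computes the average of the "score" attribute for the messages that fall into each time bucket, and then computes a cumulative sum of those averages across the buckets (using a window).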
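To see the aggregate-inside-a-window pattern in isolation, here is a minimal sketch on made-up inline data. The `sample` relation and its values are hypothetical and are not part of the original schema. The inner avg() is evaluated once per group, and the window sum() then runs over the grouped rows.

```sql
-- Hypothetical inline data: (bucket start, score). Not the real messages table.
with sample(bucket, score) as (
    values
        (timestamp '2014-09-27 00:00', 1.0),
        (timestamp '2014-09-27 00:00', 3.0),
        (timestamp '2014-09-27 04:00', 4.0),
        (timestamp '2014-09-27 08:00', 2.0),
        (timestamp '2014-09-27 08:00', 6.0)
)
select
    bucket,
    avg(score)                             as bucket_avg,  -- plain aggregate: one value per group
    sum(avg(score)) over (order by bucket) as running_sum  -- window function over the grouped rows
from sample
group by bucket
order by bucket;
```

With this data the per-bucket averages are 2, 4 and 4, so the running sum comes out as 2, 6 and 10.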
What I want to do next is find the top 5 (for example) messages.text values whose scores are closest to each bucket's average.
Right now, my only plan is:
1) Join messages with the time-buckets
2) Compute a score - avg(score) over (partition by start_time) as deviation and save it against each record of the joined relation
3) Compute a rank() over (order by deviation) as rank
4) Select where rank between 1 and 5
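I wrote this down as imperative steps because my first attempt involved using a window function inside a window function, rank() over (partition by start_time order by score - avg(score) over (partition by start_time)), and I did not even try to see whether that would work.
Could I get some advice on heading in the right direction?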
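Answer
Welp, here is what I have, and it seems to work:
Now accepting criticism of the performance, structure, and redundancies of my query! ^_^ (Apart from generating the time series directly instead of using all the twisted interval math, which will be fixed eventually.)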
with intervals as (
select
(select '09/29/2014'::date) + (n || ' minutes')::interval start_time,
(select '09/29/2014'::date) + ((n+60) || ' minutes')::interval end_time
from generate_series(0, (24*60*7), 60 * 4) n
), intervaled_messages as (
select
extract(epoch from i.start_time)::numeric * 1000 as ts,
extract(epoch from i.end_time)::numeric * 1000 as end_ts,
abs(score - avg(score) over (partition by i.start_time)) as deviation
from messages
right join intervals i
on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
where messages.timestamp between '09/29/2014' and '10/06/2014'
), ranked_messages as (
select ts, end_ts, deviation,
rank() over (partition by ts order by deviation) as rank,
row_number() over (partition by ts order by deviation) as row_number
from intervaled_messages
)
select ts, end_ts, deviation, rank
from ranked_messages
where rank between 1 and 5
and row_number between 1 and 5
order by ts;
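On the redundancy point: rank() and row_number() differ only when deviations tie, and for the same partition and ordering a row's row_number is never smaller than its rank. The row_number filter alone therefore already limits each bucket to five rows. A minimal sketch on made-up values (the `d` relation is hypothetical and stands for the deviations within one bucket):

```sql
-- Hypothetical deviations inside a single bucket, with a tie at the fifth position.
with d(deviation) as (
    values (0.1), (0.2), (0.3), (0.4), (0.5), (0.5), (0.9)
)
select
    deviation,
    rank()       over (order by deviation) as rnk,  -- tied rows share a rank; a gap follows the tie
    row_number() over (order by deviation) as rn    -- strictly 1, 2, 3, ...; order among ties is arbitrary
from d
order by deviation, rn;
```

Here rank() yields 1, 2, 3, 4, 5, 5, 7 and row_number() yields 1 through 7. Filtering on the rank alone would return six rows, while filtering on the row number returns exactly five and picks arbitrarily among the tied rows.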
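Answer
You could head in this direction (this is only my suggestion):
- Get the average score (over all records).
- Subtract the average from each row's score, i.e. apply MINUS to (row score, avg(score)). -- This will leave you with both positive and negative values
- Apply abs() to each result of the previous step and, in the same calculation, use rank() and order the results appropriately, then filter with WHERE rank BETWEEN 1 AND 5.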
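The steps above can be sketched as follows. This is only an illustration under stated assumptions: the `msgs` relation and its values are made up, and the average is taken over all rows as this answer describes, not per time bucket as in the question. The ranking is computed in a CTE because a window function cannot be referenced in the WHERE clause of the same query level.

```sql
-- Hypothetical inline data standing in for messages(text, score).
with msgs(text, score) as (
    values ('a', 1.0), ('b', 2.0), ('c', 3.0), ('d', 4.0),
           ('e', 5.0), ('f', 6.0), ('g', 10.0)
), deviations as (
    select
        text,
        score,
        -- average over all rows, subtracted from each row's score, then abs()
        abs(score - avg(score) over ()) as deviation
    from msgs
), ranked as (
    select
        text,
        score,
        deviation,
        rank() over (order by deviation) as rnk  -- smallest deviation ranks first
    from deviations
)
select text, score, deviation, rnk
from ranked
where rnk between 1 and 5
order by rnk, text;
```

To rank within each time bucket instead, add the same `partition by` clause to both window definitions.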
Note: 'generate_series()' also works with timestamps. 'generate_series('2014-09-27','2014-10-04','1 hour'::interval)' would probably do what you want. – wildplasser 2014-10-05 10:40:57
Correction: that should be 'generate_series('2014-09-27 00:00:00','2014-10-04 00:00:00','1 hour'::interval)' – wildplasser 2014-10-05 11:29:38
@wildplasser Ah yes, you're right - that's a good refactoring suggestion, I'll fix that! ^_^ – Slania 2014-10-05 14:25:30
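Following the suggestion in the comments, the bucket CTE could be generated directly from timestamps. This is a sketch, assuming the 4-hour step and 1-hour bucket width of the original queries are intentional; it uses the start date from the question.

```sql
-- Same buckets as the question's intervals CTE, without the minute arithmetic.
-- Assumption: buckets start every 4 hours and each one is 1 hour wide.
with intervals as (
    select
        s                     as start_time,
        s + interval '1 hour' as end_time
    from generate_series(
             timestamp '2014-09-27 00:00:00',
             timestamp '2014-10-04 00:00:00',
             interval '4 hours'
         ) as s
)
select start_time, end_time
from intervals
order by start_time;
```

Both bounds of generate_series are inclusive, so this produces 43 buckets, the same number as generate_series(0, (24*60*7), 60 * 4).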