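How can I efficiently find the running latest update for multiple records in SQL?
Given the following schema, how can I efficiently find the running latest update for multiple records in SQL?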
-- items which have periodic updates
CREATE TABLE items (
[id] int identity(1, 1) primary key,
[name] varchar(100) not null
);
-- item updates. updating an item generally means it has a new status, at a certain time.
CREATE TABLE updates (
[id] int identity(1, 1) primary key,
[item_id] int foreign key references items([id]),
[new_status] varchar(100) not null,
[update_date] datetime not null
);
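This is used to track the status of items as they pass through many states over time.
I have been trying to find an efficient query that answers the following question:
For many items, each of which can be in one of several states and for which we log every status update, how many items are in each state at the end of each day?
I have a SQLFiddle here with some sample data and my current attempt at this query. It works fine on a few items, but my database has thousands of them, so the query currently takes about 5 minutes to run.
Is there a more efficient query to answer this question?
Test data: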
-- let's say that we just created 3 new items
INSERT INTO items (name)
VALUES ('item1'), ('item2'), ('item3');
-- and they all start in the new state
INSERT INTO updates (item_id, new_status, update_date)
SELECT
[id],
[new_status] = 'new',
[update_date] = '2017-10-9 00:00:00.000'
FROM items
-- then we have them update over the course of a couple days
-- item 1
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item1'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-14 00:00:00.000'
FROM items WHERE [name] = 'item1';
-- item 2
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item2'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-12 00:00:00.000'
FROM items WHERE [name] = 'item2';
-- item 3
INSERT INTO updates (item_id, new_status, update_date)
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-11 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-13 00:00:00.000'
FROM items WHERE [name] = 'item3'
UNION
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-15 00:00:00.000'
FROM items WHERE [name] = 'item3';
Current query:
-- =======================
-- Running latest record
-- =======================
-- Goal: For a period of time, with multiple items, which have multiple updates,
-- find the number of items which are in each state at the end of a day.
--
-- Issue: how can I improve this query for a large database?
--
SELECT
dates.[update_date],
state = latest_update.[new_status],
volume = COUNT(*)
FROM items i -- start with the items that we want to count per day
CROSS JOIN (
SELECT DISTINCT [update_date] FROM updates
) dates -- the days to count for
CROSS APPLY (
-- this cross apply gets all updates for an item, that occurred on or before each date
SELECT
updates.*,
RN = ROW_NUMBER() OVER (PARTITION BY [item_id] ORDER BY [update_date] DESC)
FROM updates
WHERE [update_date] <= dates.[update_date] AND [item_id] = i.[id]
) latest_update
WHERE latest_update.RN = 1 -- only count the latest update
GROUP BY dates.[update_date], latest_update.[new_status]
ORDER BY dates.[update_date], latest_update.[new_status]
[Results]:
| update_date | state | volume |
|----------------------|-------------|--------|
| 2017-10-09T00:00:00Z | new | 3 |
| 2017-10-10T00:00:00Z | in progress | 2 |
| 2017-10-10T00:00:00Z | new | 1 |
| 2017-10-11T00:00:00Z | in progress | 2 |
| 2017-10-11T00:00:00Z | ready | 1 |
| 2017-10-12T00:00:00Z | complete | 1 |
| 2017-10-12T00:00:00Z | in progress | 1 |
| 2017-10-12T00:00:00Z | ready | 1 |
| 2017-10-13T00:00:00Z | complete | 1 |
| 2017-10-13T00:00:00Z | ready | 2 |
| 2017-10-14T00:00:00Z | complete | 2 |
| 2017-10-14T00:00:00Z | ready | 1 |
| 2017-10-15T00:00:00Z | complete | 3 |
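The GROUP BY clause at the end of the statement below groups the rows by the value of the new_status column. The database then presents the user with a list of the "distinct" values from the new_status column.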
select new_status,count(new_status) from updates group by new_status
In other words, if we ran the query without the count(new_status) part, it would be exactly the same as saying:
select distinct new_status from updates
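Because we asked for a count, the database counts the number of occurrences of each distinct value it grouped together and displays it in the count(new_status) column. By default the database does not name the column holding the count for each group, but you can name it like this: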
select new_status,count(new_status) as nmbr_items from updates group by new_status
One method is to use conditional aggregation:
select cast(update_date as date), new_status, count(*)
from (select u.*,
row_number() over (partition by cast(update_date as date) order by update_date desc) as seqnum
from updates u
) u
where seqnum = 1
group by cast(update_date as date), new_status
order by cast(update_date as date), count(*) desc;
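Your query counts the number of updates per day, rather than the latest update as of each day. I.e. if an item is new for 4 days in a row, your solution reports it as new on the first day but not on the following 3 days. – mcnnowak
Edit your question and show the results you want. –
+1 for the sample data; going forward, please include the expected results as text – TheGameiswar
The fiddle has the expected output on the query side. The question is more about how to get the correct answer efficiently for a large number of items. – mcnnowak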
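One possible way to avoid re-scanning updates for every item and date, offered as a sketch under assumptions rather than an answer from this thread: treat each update as a +1 for the status an item enters and a -1 for the status it leaves, sum those changes per day, and take a running total per status. It assumes SQL Server 2012 or later (for LAG and the windowed SUM), buckets update_date by calendar day, and breaks same-timestamp ties by updates.id. The CTE and column names (transitions, deltas, daily, grid, running, prev_status, delta) are made up for illustration.

```sql
WITH transitions AS (
    -- each update paired with the status the item held just before it
    SELECT
        [day]         = CAST([update_date] AS date),
        [new_status],
        [prev_status] = LAG([new_status]) OVER (PARTITION BY [item_id] ORDER BY [update_date], [id])
    FROM updates
),
deltas AS (
    -- +1 for the status entered, -1 for the status left (if there was one)
    SELECT [day], [status] = [new_status], [delta] = 1 FROM transitions
    UNION ALL
    SELECT [day], [status] = [prev_status], [delta] = -1 FROM transitions WHERE [prev_status] IS NOT NULL
),
daily AS (
    -- net change per day and status
    SELECT [day], [status], [delta] = SUM([delta])
    FROM deltas
    GROUP BY [day], [status]
),
grid AS (
    -- every update day paired with every status, so counts carry forward
    -- on days where a status has no changes
    SELECT d.[day], s.[status]
    FROM (SELECT DISTINCT [day] FROM daily) d
    CROSS JOIN (SELECT DISTINCT [status] FROM daily) s
),
running AS (
    -- running total of net changes = items in that status at the end of the day
    SELECT
        g.[day],
        g.[status],
        [volume] = SUM(COALESCE(dl.[delta], 0)) OVER (PARTITION BY g.[status] ORDER BY g.[day] ROWS UNBOUNDED PRECEDING)
    FROM grid g
    LEFT JOIN daily dl ON dl.[day] = g.[day] AND dl.[status] = g.[status]
)
SELECT [update_date] = [day], [state] = [status], [volume]
FROM running
WHERE [volume] > 0
ORDER BY [day], [status];
```

On the sample data above this should give the same state counts per day as the [Results] table, with update_date shown as a date rather than a datetime. Like the current query, it only reports days on which at least one update occurred. Whether it is actually faster on a real database depends on indexes and data volume, so it is worth comparing execution plans before relying on it.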