How can I efficiently find the running latest update across multiple records in SQL?

Problem description:

-- items which have periodic updates 
CREATE TABLE items (
    [id] int identity(1, 1) primary key, 
    [name] varchar(100) not null 
); 

-- item updates. updating an item generally means it has a new status, at a certain time. 
CREATE TABLE updates (
    [id] int identity(1, 1) primary key, 
    [item_id] int foreign key references items([id]), 
    [new_status] varchar(100) not null, 
    [update_date] datetime not null 
);

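This is used to track the status of items as they pass through many states over time.

I have been trying to find an efficient query that answers the following question:

For many items, each of which can be in one of several states, and where every status change is logged as an update, how many items are in each state at the end of each day?

I have a SQLFiddle here with some sample data and my current attempt at this query. It works fine on a handful of items, but my database has thousands of them, so the query currently takes about 5 minutes to run.

Is there a more efficient query that answers this question?

Test data: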

-- items which have periodic updates 
CREATE TABLE items (
    [id] int identity(1, 1) primary key, 
    [name] varchar(100) not null 
); 

-- item updates. updating an item generally means it has a new status, at a certain time. 
CREATE TABLE updates (
    [id] int identity(1, 1) primary key, 
    [item_id] int foreign key references items([id]), 
    [new_status] varchar(100) not null, 
    [update_date] datetime not null 
); 

-- let's just say that we just created 3 new items 
INSERT INTO items (name) 
    VALUES ('item1'), ('item2'), ('item3'); 

-- and they all start in the new state 
INSERT INTO updates (item_id, new_status, update_date) 
SELECT 
    [id], 
    [new_status] = 'new', 
    [update_date] = '2017-10-9 00:00:00.000' 
FROM items 

-- then we have them update over the course of a couple days 
-- item 1 
INSERT INTO updates (item_id, new_status, update_date) 
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000' 
FROM items WHERE [name] = 'item1' 
UNION 
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-12 00:00:00.000' 
FROM items WHERE [name] = 'item1' 
UNION 
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-14 00:00:00.000' 
FROM items WHERE [name] = 'item1'; 

-- item 2 
INSERT INTO updates (item_id, new_status, update_date) 
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-10 00:00:00.000' 
FROM items WHERE [name] = 'item2' 
UNION 
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-11 00:00:00.000' 
FROM items WHERE [name] = 'item2' 
UNION 
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-12 00:00:00.000' 
FROM items WHERE [name] = 'item2'; 

-- item 3 
INSERT INTO updates (item_id, new_status, update_date) 
SELECT [id], [new_status] = 'in progress', [update_date] = '2017-10-11 00:00:00.000' 
FROM items WHERE [name] = 'item3' 
UNION 
SELECT [id], [new_status] = 'ready', [update_date] = '2017-10-13 00:00:00.000' 
FROM items WHERE [name] = 'item3' 
UNION 
SELECT [id], [new_status] = 'complete', [update_date] = '2017-10-15 00:00:00.000' 
FROM items WHERE [name] = 'item3';

Current query:

-- ======================= 
-- Running latest record 
-- ======================= 
-- Goal: For a period of time, with multiple items, which have multiple updates, 
--  find the number of items which are in each state at the end of a day. 
-- 
-- Issue: how can i improve this query for a large database? 
-- 

SELECT 
    dates.[update_date], 
    state = latest_update.[new_status], 
    volume = COUNT(*) 
FROM items i -- start with the items that we want to count per day 
CROSS JOIN (
    SELECT DISTINCT [update_date] FROM updates 
) dates -- the days to count for 
CROSS APPLY (
    -- this cross apply gets all updates for an item, that occurred on or before each date 
    SELECT 
        updates.*, 
        RN = ROW_NUMBER() OVER (PARTITION BY [item_id] ORDER BY [update_date] DESC) 
    FROM updates 
    WHERE [update_date] <= dates.[update_date] AND [item_id] = i.[id] 
) latest_update 
WHERE latest_update.RN = 1 -- only count the latest update 
GROUP BY dates.[update_date], latest_update.[new_status] 
ORDER BY dates.[update_date], latest_update.[new_status]

[Results]:

| update_date          | state       | volume |
|----------------------|-------------|--------|
| 2017-10-09T00:00:00Z | new         | 3      |
| 2017-10-10T00:00:00Z | in progress | 2      |
| 2017-10-10T00:00:00Z | new         | 1      |
| 2017-10-11T00:00:00Z | in progress | 2      |
| 2017-10-11T00:00:00Z | ready       | 1      |
| 2017-10-12T00:00:00Z | complete    | 1      |
| 2017-10-12T00:00:00Z | in progress | 1      |
| 2017-10-12T00:00:00Z | ready       | 1      |
| 2017-10-13T00:00:00Z | complete    | 1      |
| 2017-10-13T00:00:00Z | ready       | 2      |
| 2017-10-14T00:00:00Z | complete    | 2      |
| 2017-10-14T00:00:00Z | ready       | 1      |
| 2017-10-15T00:00:00Z | complete    | 3      |
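
The CROSS APPLY in the current query numbers every earlier update of an item just to keep the first row. A possible variation, offered only as a sketch, asks for that single row directly with TOP (1) and adds a supporting index so each lookup can seek on item and date. The index name IX_updates_item_date is an assumption, and no timings were measured, so whether this helps on a large table would need to be checked against the actual execution plan.

```sql
-- Hypothetical supporting index; the name is illustrative, not from the original post.
CREATE INDEX IX_updates_item_date
    ON updates ([item_id], [update_date] DESC)
    INCLUDE ([new_status]);

SELECT
    dates.[update_date],
    state = latest_update.[new_status],
    volume = COUNT(*)
FROM items i -- one row per item
CROSS JOIN (
    SELECT DISTINCT [update_date] FROM updates
) dates -- the days to count for
CROSS APPLY (
    -- fetch only the most recent update on or before each date
    SELECT TOP (1) u.[new_status]
    FROM updates u
    WHERE u.[item_id] = i.[id]
      AND u.[update_date] <= dates.[update_date]
    ORDER BY u.[update_date] DESC
) latest_update
GROUP BY dates.[update_date], latest_update.[new_status]
ORDER BY dates.[update_date], latest_update.[new_status];
```

On the sample data this should return the same rows as the query above, since each item has at most one update per timestamp.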

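Edit your question and show the results you want. –

+1 for the sample data; going forward, please include the expected results as text – TheGameiswar

The fiddle has the expected output on the query side. The question is more about how to get the correct answer efficiently for a large number of items. – mcnnowak

Answer

The GROUP BY clause at the end of the statement below groups the rows by the values in the new_status column. The database then presents the user with a list of the distinct values from the new_status column.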

select new_status,count(new_status) from updates group by new_status

In other words, if we ran the query without the count(new_status) part, it would be exactly the same as saying:

select distinct new_status from updates

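Because we asked for a count, the database counts how many rows fall into each distinct value it grouped together and shows that number in the count(new_status) column. The database does not give that count column a name, but you can name it yourself like this: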

select new_status,count(new_status) as nmbr_items from updates group by new_status

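Can you add some explanation to your answer? – kvorobiev

Yes, I have seen your request for some explanation; I will add it as soon as I can. – russ

Answer

One method is to use conditional aggregation: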

select cast(update_date as date) as update_day, new_status, count(*) as volume 
from (select u.*, 
             -- number the updates within each calendar day, newest first 
             row_number() over (partition by cast(update_date as date) order by update_date desc) as seqnum 
      from updates u 
     ) u 
where seqnum = 1 
group by cast(update_date as date), new_status 
order by cast(update_date as date), count(*) desc;

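Your query counts the updates made on each day, not the latest update on or before each day. That is, if an item stays 'new' for 4 consecutive days, your solution reports 'new' on the first day but not on the following 3 days. – mcnnowak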
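
The behaviour asked for in the comment above (the latest update on or before each day) can be sketched by turning every update into a validity interval and counting which intervals cover each date. This is only an illustration, not a measured solution: it assumes SQL Server 2012 or later for LEAD, and the names dates, intervals, valid_from and valid_to are made up for the example.

```sql
WITH dates AS (
    -- the days to report on, taken from the update timestamps as in the question
    SELECT DISTINCT [update_date] FROM updates
),
intervals AS (
    -- each status is valid from its own update until the item's next update
    SELECT
        [item_id],
        [new_status],
        valid_from = [update_date],
        valid_to = LEAD([update_date]) OVER (PARTITION BY [item_id] ORDER BY [update_date])
    FROM updates
)
SELECT
    d.[update_date],
    state = s.[new_status],
    volume = COUNT(*)
FROM dates d
JOIN intervals s
    ON s.valid_from <= d.[update_date]
   AND (s.valid_to IS NULL OR s.valid_to > d.[update_date]) -- NULL means still current
GROUP BY d.[update_date], s.[new_status]
ORDER BY d.[update_date], s.[new_status];
```

Each item contributes exactly one covering interval per date, so a status that does not change keeps being counted on the following days. The updates table is scanned once for the window function instead of once per item and date, which may or may not be faster depending on indexes and data volume.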
