如何根据时间间隔删除数据

问题描述:

我在database.The表的表中有一个字段VehId(int)和trackdt(DATETIME)如何根据时间间隔删除数据

我的表有3000个十亿行(是的,这是正确的3000十亿)。所以我想删除旧数据。但我想基于时间间隔删除数据。

我想删除每辆车每20秒的记录。

下面是表我有

VehId Trackdt 
1  2017-05-20 00:00:30.000 
2  2017-05-20 00:00:32.000   
2  2017-05-20 00:00:42.000 
1  2017-05-20 00:00:40.000 
2  2017-05-20 00:00:52.000 
1  2017-05-20 00:00:50.000 
1  2017-05-20 00:01:00.000 
2  2017-05-20 00:01:02.000 
1  2017-05-20 00:01:10.000 
1  2017-05-20 00:01:20.000 
2  2017-05-20 00:01:12.000 
1  2017-05-20 00:01:30.000 
2  2017-05-20 00:01:22.000 
2  2017-05-20 00:01:32.000 

后删除数据应该像下面

VehId TRackdt 
1  2017-05-20 00:00:30.000 
2  2017-05-20 00:00:32.000 
1  2017-05-20 00:01:00.000 
2  2017-05-20 00:01:02.000 
1  2017-05-20 00:01:30.000 
2  2017-05-20 00:01:32.000 

我想下面的查询,但它花费过多时间

ALTER PROCEDURE [dbo].[DELETEINTERVALDATA] 

@FROMDATE DATETIME, 
@TODATE DATETIME, 
@INTERVAL INT, 
@FLAG INT 

AS 
BEGIN 
SET NOCOUNT ON; 

DECLARE @TRACKDT DATETIME 
DECLARE @I INT =1 
DECLARE @V INT =1 
DECLARE @COUNT INT 
DECLARE @VCOUNT INT 
DECLARE @STARTDATE DATETIME = '' 
DECLARE @VEHID INT 
DECLARE @TIMEDIFF INT 


CREATE TABLE #TEMPVEHICLE 
(
    SNO INT IDENTITY(1,1), 
    VEHID INT 
) 

CREATE TABLE #TEMPLOG 
(
    SNO INT IDENTITY(1,1), 
    TRACKDT DATETIME 
) 

IF (@FLAG = 1) 
    BEGIN 

     INSERT INTO #TEMPVEHICLE (VEHID) SELECT VEHID FROM VEHICLEMASTER ORDER BY VEHID   

     SELECT @VCOUNT = COUNT(SNO) FROM #TEMPVEHICLE 

     WHILE (@V <= @VCOUNT) 
      BEGIN    

       SELECT @VEHID = VEHID FROM #TEMPVEHICLE WHERE SNO = @V 

       INSERT INTO #TEMPLOG(TRACKDT) SELECT TRACKDT 
       FROM TRACKINGLOG WITH(NOLOCK) 
       WHERE TRACKDT BETWEEN @FROMDATE AND @TODATE AND VEHID = @VEHID 
       ORDER BY TRACKDT ASC 

       SELECT @COUNT = COUNT(SNO) FROM #TEMPLOG 

       WHILE (@I <= @COUNT) 
        BEGIN       
         SELECT @TRACKDT=TRACKDT FROM #TEMPLOG WHERE SNO = @I        
         IF (@I = 1) 
          BEGIN 
           SELECT @STARTDATE = @TRACKDT 
          END 
         ELSE 
          BEGIN 
           SELECT @TIMEDIFF = DATEDIFF(SECOND,@STARTDATE,@TRACKDT) 
           IF @TIMEDIFF <= 20 
            BEGIN 
             DELETE FROM TRACKINGLOG WHERE TRACKDT = @TRACKDT AND VEHID = @VEHID           
            END 
           ELSE 
            BEGIN 
             SELECT @STARTDATE = @TRACKDT 
            END 
          END 
         SELECT @I = @I + 1 
        END 
       TRUNCATE TABLE #TEMPLOG 
       SELECT @V = @V + 1,@STARTDATE= '',@I=1 
      END 
     DROP TABLE #TEMPLOG 
     DROP TABLE #TEMPVEHICLE 
    END 
    END 

如何我可以根据时间间隔写出删除数据的查询吗?应该快?

在此先感谢

+0

如果我理解正确,您希望保留每辆'vehid'的最小和最大'trackdt'?另外,什么是确切的表定义,包括索引? –

+0

不,不,我已经更新了我的问题..我想要每辆车的数据至少20秒的时间间隔。表中有ID列,这是PK,所以你可以把它算作我的猜测。 – user2928116

+0

@ user2928116:尝试我的查询。我希望你的SP是正确的。 –

请阅读有关Parameter sniffing。假设你的存储过程很慢但是正确,我已经更新了你的SP。请尝试:

ALTER PROCEDURE [dbo].[DELETEINTERVALDATA] 

@FROMDATE1 DATETIME, 
@TODATE1 DATETIME, 
@INTERVAL1 INT, 
@FLAG1 INT 

AS 
BEGIN 
SET NOCOUNT ON; 

DECLARE @TRACKDT DATETIME 
DECLARE @I INT =1 
DECLARE @V INT =1 
DECLARE @COUNT INT 
DECLARE @VCOUNT INT 
DECLARE @STARTDATE DATETIME = '' 
DECLARE @VEHID INT 
DECLARE @TIMEDIFF INT 

DECLARE @FROMDATE DATETIME = @FROMDATE1 
DECLARE @TODATE DATETIME = @TODATE1 
DECLARE @INTERVAL INT = @INTERVAL1 
DECLARE @FLAG INT = @FLAG1 


CREATE TABLE #TEMPVEHICLE 
(
    SNO INT IDENTITY(1,1), 
    VEHID INT 
) 

CREATE TABLE #TEMPLOG 
(
    SNO INT IDENTITY(1,1), 
    TRACKDT DATETIME 
) 

IF (@FLAG = 1) 
    BEGIN 

     INSERT INTO #TEMPVEHICLE (VEHID) SELECT VEHID FROM VEHICLEMASTER ORDER BY VEHID   

     SELECT @VCOUNT = COUNT(SNO) FROM #TEMPVEHICLE 

     WHILE (@V <= @VCOUNT) 
      BEGIN    

       SELECT @VEHID = VEHID FROM #TEMPVEHICLE WHERE SNO = @V 

       INSERT INTO #TEMPLOG(TRACKDT) SELECT TRACKDT 
       FROM TRACKINGLOG WITH(NOLOCK) 
       WHERE TRACKDT BETWEEN @FROMDATE AND @TODATE AND VEHID = @VEHID 
       ORDER BY TRACKDT ASC 

       SELECT @COUNT = COUNT(SNO) FROM #TEMPLOG 

       WHILE (@I <= @COUNT) 
        BEGIN       
         SELECT @TRACKDT=TRACKDT FROM #TEMPLOG WHERE SNO = @I        
         IF (@I = 1) 
          BEGIN 
           SELECT @STARTDATE = @TRACKDT 
          END 
         ELSE 
          BEGIN 
           SELECT @TIMEDIFF = DATEDIFF(SECOND,@STARTDATE,@TRACKDT) 
           IF @TIMEDIFF <= 20 
            BEGIN 
             DELETE FROM TRACKINGLOG WHERE TRACKDT = @TRACKDT AND VEHID = @VEHID           
            END 
           ELSE 
            BEGIN 
             SELECT @STARTDATE = @TRACKDT 
            END 
          END 
         SELECT @I = @I + 1 
        END 
       TRUNCATE TABLE #TEMPLOG 
       SELECT @V = @V + 1,@STARTDATE= '',@I=1 
      END 
     DROP TABLE #TEMPLOG 
     DROP TABLE #TEMPVEHICLE 
    END 
    END 
+0

感谢您的回应。但它仍然花费太多时间。我需要删除3000亿行。所以我认为需要几天时间才能删除,而我没有那么多时间 – user2928116

+0

好吧。也许别人会建议更好的方法。我们需要从概念上将其划分为效率时间段。 –

+0

参数嗅探不会使光标*任何*更快 –

如果在上一次之后的时间为20秒或更短时间,您想要删除一条记录。但是,如果之前的时间被删除,则比您想要与之前的时间进行比较等。 您需要某种类型的信息。递归表表达式或常规循环。

Declare @myTable table (vehid int, trackdt datetime) 
insert into @mytable values 
(1,  '2017-05-20 00:00:30.000'), 
(2,  '2017-05-20 00:00:32.000'),   
(2,  '2017-05-20 00:00:42.000'), 
(1,  '2017-05-20 00:00:40.000'), 
(2,  '2017-05-20 00:00:52.000'), 
(1,  '2017-05-20 00:00:50.000'), 
(1,  '2017-05-20 00:01:00.000'), 
(2,  '2017-05-20 00:01:02.000'), 
(1,  '2017-05-20 00:01:10.000'), 
(1,  '2017-05-20 00:01:20.000'), 
(2,  '2017-05-20 00:01:12.000'), 
(1,  '2017-05-20 00:01:30.000'), 
(2,  '2017-05-20 00:01:22.000'), 
(2,  '2017-05-20 00:01:32.000') 

--set @@rowcount to 1 
select 1 

while @@ROWCOUNT > 0 
begin 
    DELETE T1 
    from @mytable t1 

    --previous time 
    outer apply (select top 1 trackdt from @mytable where vehid = t1.vehid and trackdt < t1.trackdt order by 1 desc)t2 
    --previous time before that 
    outer apply (select top 1 trackdt from @mytable where vehid = t1.vehid and trackdt < t2.trackdt order by 1 desc)t3 

    --previous time was less or equal to 20 seconds 
    where DATEDIFF(second,t2.trackdt,t1.trackdt)<=20 

    --previous time before that was more than 20 seconds or there is no time before 
    and (DATEDIFF(second,t3.trackdt,t2.trackdt)>20 or t3.trackdt is null) 

end 

select * from @mytable 
+0

感谢您的回应Peter。你的查询似乎有点快,但它没有给我预期的输出。如果之前的时间(开始时间)和当前时间的差异小于20秒,那么我想删除当前时间,那么我将检查开始时间和新的当前时间之间的差异,如果差异小于20秒,则相同的当前时间将被删除如果大于20秒,我的当前时间将成为开始时间,然后我想检查开始时间和下一个当前时间的差异等等。因此,以前的时间将永远不会被删除。 – user2928116

+0

欲了解更多信息,你可以看到我的商店程序 – user2928116

+0

@ user2928116它似乎是适用于您的示例数据。你可以举一个例子,这个查询不能正常工作吗? – Peter

所需输出对应于秒间隔。

您可以使用窗口函数检索结果集中的先前值。例如,LAG(trackdt,1)将返回以前的值。接下来是LEADFIRST_VALUE将返回一组中的第一个值。

查询:

select * , 
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0 
from @mytable 

将返回每辆车第一trackdt值,当行被trackdt排序。

1 2017-05-20 00:00:30.000 2017-05-20 00:00:30.000 
1 2017-05-20 00:00:40.000 2017-05-20 00:00:30.000 
1 2017-05-20 00:00:50.000 2017-05-20 00:00:30.000 

这样我们就可以用datediff计算在分区中的当前和第一值之间的间隔。将所得的查询是有点难看:

select * , 
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0, 
    datediff(s,FIRST_VALUE(trackdt) over (partition by vehid order by trackdt), 
       trackdt) as interval 
from @mytable 

30除以将给予我们其中的每一行所属的间隔桶。

1 2017-05-20 00:01:10.000 2017-05-20 00:00:30.000 1 
1 2017-05-20 00:01:20.000 2017-05-20 00:00:30.000 1 
1 2017-05-20 00:01:30.000 2017-05-20 00:00:30.000 2 
2 2017-05-20 00:00:32.000 2017-05-20 00:00:32.000 0 
2 2017-05-20 00:00:42.000 2017-05-20 00:00:32.000 0 

计算其余不过,将第一行中,每30秒桶返回0:

select * , 
    FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0, 
    datediff(s,FIRST_VALUE(trackdt) over (partition by vehid order by trackdt), 
       trackdt) %30 as remainder 
from @mytable 

1 2017-05-20 00:01:00.000 2017-05-20 00:00:30.000 0 
1 2017-05-20 00:01:10.000 2017-05-20 00:00:30.000 10 
1 2017-05-20 00:01:20.000 2017-05-20 00:00:30.000 20 
1 2017-05-20 00:01:30.000 2017-05-20 00:00:30.000 0 
2 2017-05-20 00:00:32.000 2017-05-20 00:00:32.000 0 

我们可以用一个或一个以上的CTE整齐向上此查询并执行删除:

with start_times as 
(
    select * , 
     FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0 
    from @mytable 
), 
intervals as 
(
    select * , 
     datediff(s,t0,trackdt) %30 as rem 
    from start_times 
) 
delete 
from intervals 
where rem<>0 

该查询依赖于每隔30秒有一个值。

通常,您可以使用ROW_NUMBER()函数来标识间隔存储桶中的记录,并只需选择存储桶中的第一行。既然我们要保留第一行,我们挑不出什么,有一个ROW_NUMBER> 1:

with start_times as 
(
    select * , 
     FIRST_VALUE(trackdt) over (partition by vehid order by trackdt) as t0 
    from #mytable 
), 
intervals as 
(
    select * , 
     datediff(s,t0,trackdt) /30 as interval 
    from start_times 
), 
ordered as 
(
    select *, 
      ROW_NUMBER() over(partition by vehid,interval order by trackdt) row_num 
    from intervals 
) 
select vehid,trackdt 
from ordered 
where row_num>1 

即使此查询速度快,我不会用它在3万亿行的表。窗口化常常导致后台处理 - 临时结果存储在tempdb中以允许窗口计算。这将是更好的选择必须删除的行的ID,将它们插入到一个临时表,然后使用具有联接子句的DELETE:

DELETE HugeTable 
From HugeTable 
    INNER JOIN TempTable on TempTable.ID=HugeTable.ID 

即使是这样,你可能想批次删除。在这种情况下,你可以使用NTILE函数计算1和N之间的每一行批号:

select ID,NTILE(100) over(order by vehid,trackdt) as batch_number 
from ordered 
where row_num=1 

这将计算1和100之间的批号可以将此存储在临时表中删除一次一行。

+0

Trackdt不会有30秒的差异,它可能是5秒到60秒之间的任何东西。 – user2928116

+0

TIL'NTILE'。 +1 – dd4711