BigQuery - Removing duplicate records sometimes takes a long time
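Question:
We have implemented the following ETL process in the cloud: every hour, run a query against our local database => save the result as a CSV file and upload it to Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records using the following query: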
SELECT
  * EXCEPT (row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number
  FROM rawData.stock_movement
)
WHERE row_number = 1
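Since 8 a.m. this morning (Berlin local time), the deduplication step has been taking much longer than usual, even though the data volume is not much different from normal: removing the duplicates usually takes about 10 seconds, whereas this morning it sometimes took half an hour.
Is the performance of removing duplicate records unstable?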
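Answer:
It could be that you have many duplicate values for a particular id, so computing the row numbers takes a long time. To check whether this is the case, you can try: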
#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
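To see what a skewed result would look like, here is a minimal sketch of the same check run against inline sample data. The CTE `T` and its rows are hypothetical stand-ins for rawData.stock_movement, and the HAVING clause is an optional addition that hides ids without duplicates.

```sql
#standardSQL
-- Hypothetical sample rows standing in for rawData.stock_movement.
WITH T AS (
  SELECT 1 AS id, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 1, TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, TIMESTAMP '2017-04-03' UNION ALL
  SELECT 2, TIMESTAMP '2017-04-02')
SELECT id, COUNT(*) AS id_count
FROM T
GROUP BY id
HAVING COUNT(*) > 1   -- keep only ids that actually have duplicates
ORDER BY id_count DESC
LIMIT 5;
```

With this sample, only id 1 is returned, with an id_count of 3. If the real table shows a few ids with very large counts, skew is a plausible cause of the slowdown.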
With that said, it may be faster to remove the duplicates with this query instead:
#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);
Here is an example:
#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);
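The reason this may be faster is that, for each id, BigQuery only needs to keep the row with the largest timestamp in memory at any given point in time, rather than ordering every row that shares that id in order to number them.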
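For comparison, the ROW_NUMBER approach from the question can be run against the same inline sample rows to confirm that both queries keep the same records. This is only a sketch: the CTE `T` repeats the sample data from the example, and the alias `rn` is an arbitrary name chosen here in place of `row_number`.

```sql
#standardSQL
-- Same sample rows as in the ARRAY_AGG example.
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- Number the rows of each id from newest to oldest.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
  FROM T
)
WHERE rn = 1;  -- keep only the newest row per id
```

Both queries keep the 'baz' row for id 1 and the 'bar' row for id 2 (the order of the output rows is not guaranteed). Since the results match on the sample, the choice between them on the real table comes down to performance.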