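Pandas: joining two dataframes on start and end times that are not equal
Problem description:
I have two dataframes, each holding events with a start and an end time. The two dataframes have different start and end times because they measure different things. What I want to do is create new events that contain the information from both. These must be split at every boundary that occurs in either dataframe. For example:
DataFrame A: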
Start End
2016-12-30 18:51:00 2016-12-30 19:37:00
2016-12-30 20:03:00 2016-12-30 20:11:00
2016-12-30 20:12:00 2016-12-30 21:02:00
2016-12-30 21:02:00 2016-12-30 21:04:00
2016-12-30 21:10:00 2016-12-30 21:12:00
2016-12-30 21:12:00 2016-12-30 21:32:00
DataFrame B:
Start End
2016-12-30 18:33:45 2016-12-30 19:18:00
2016-12-30 19:18:00 2016-12-30 19:38:00
2016-12-30 19:38:00 2016-12-30 19:48:00
2016-12-30 19:48:00 2016-12-30 20:15:45
2016-12-30 20:15:45 2016-12-30 20:35:45
2016-12-30 20:35:45 2016-12-30 20:45:45
2016-12-30 20:45:45 2016-12-30 21:14:30
2016-12-30 21:14:30 2016-12-30 21:35:00
For these, the ideal output would be:
Start End
2016-12-30 18:51:00 2016-12-30 19:18:00
2016-12-30 19:18:00 2016-12-30 19:37:00
2016-12-30 20:03:00 2016-12-30 20:11:00
2016-12-30 20:12:00 2016-12-30 20:15:45
2016-12-30 20:15:45 2016-12-30 20:35:45
2016-12-30 20:35:45 2016-12-30 20:45:45
2016-12-30 20:45:45 2016-12-30 21:12:00
2016-12-30 21:12:00 2016-12-30 21:14:30
2016-12-30 21:14:30 2016-12-30 21:32:00
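There are a couple of ways I know of to do this. I could explode the dataframes to minute level and merge on the minute, but each dataframe has 2 million rows, so that would be a very long process.
I also have SQL that can do this, but when I tried to run it, it took so long that the DBA killed the process.
The SQL is: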
select
    a.UNIQUE_ID,
    a,
    b,
    c,
    d,
    CASE WHEN b.START < a.START THEN a.START
         ELSE b.START END AS START,
    CASE WHEN b.END > a.END THEN a.END
         ELSE b.END END AS END
from
    (select
        UNIQUE_ID,
        START,
        END,
        a,
        b
     from table_1
    ) a
join
    (select
        UNIQUE_ID,
        START,
        END,
        c,
        d
     from table_2
    ) b
    on 1=1
    AND a.UNIQUE_ID = b.UNIQUE_ID
    AND ((b.START between a.START and a.END)
      or (b.END between a.START and a.END)
      or (b.START < a.START and b.END > a.END)
      or (a.START < b.START and a.END > b.END)
    )
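This produces one row for every pair of events that, for the same UNIQUE_ID, share at least one minute between their start and end times. The CASE statements then trim each row down to the shared minutes.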
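For a single pair of rows, the rule described above amounts to taking the later of the two starts and the earlier of the two ends. A minimal Python sketch of that rule follows; `clip_pair` is a hypothetical helper name used only for illustration, and it treats the intervals as inclusive at both ends, as BETWEEN does.

```python
from datetime import datetime

def clip_pair(a_start, a_end, b_start, b_end):
    """Hypothetical helper: shared span of two inclusive intervals, or None."""
    start = max(a_start, b_start)   # same as the first CASE expression
    end = min(a_end, b_end)         # same as the second CASE expression
    # The four OR-ed join conditions reduce to this single overlap test.
    return (start, end) if start <= end else None

# First row of A against the first row of B from the sample data
shared = clip_pair(datetime(2016, 12, 30, 18, 51), datetime(2016, 12, 30, 19, 37),
                   datetime(2016, 12, 30, 18, 33, 45), datetime(2016, 12, 30, 19, 18))

# Two rows that do not overlap at all
disjoint = clip_pair(datetime(2016, 12, 30, 20, 3), datetime(2016, 12, 30, 20, 11),
                     datetime(2016, 12, 30, 19, 18), datetime(2016, 12, 30, 19, 38))
```

Written this way, the join condition is just `a.START <= b.END AND b.START <= a.END`, which may be easier to reason about than the four separate cases.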
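I can't think of an efficient way to replicate this SQL in Python using pandas. The only merge functions I know of in pandas require equal columns to merge on; they can't be ranges like the join I'm using.
Is there a kind of merge in pandas that I could use to do something like: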
AND ((b.START between a.START and a.END)
or (b.end between a.START and a.END)
or (b.START < a.START and b.end > a.end)
or (a.START < b.START and a.end > b.end)
)
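The only approach I can think of is to loop over each row of one DF, slice the other dataframe down to only the rows of DF b that share minutes with that row, merge the two slices, and concatenate all of those merges into a new DF, but that would take a very long time.
Any help is appreciated.
Answer
I'm going to use the implementation from my question, which asks something similar to what you have written: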
import pandas as pd
df_a = pd.DataFrame({'Start': ['2016-12-30 18:51:00',
                               '2016-12-30 20:03:00',
                               '2016-12-30 20:12:00',
                               '2016-12-30 21:02:00',
                               '2016-12-30 21:10:00',
                               '2016-12-30 21:12:00'],
                     'End': ['2016-12-30 19:37:00',
                             '2016-12-30 20:11:00',
                             '2016-12-30 21:02:00',
                             '2016-12-30 21:04:00',
                             '2016-12-30 21:12:00',
                             '2016-12-30 21:32:00']})
df_b = pd.DataFrame({'Start': ['2016-12-30 18:33:45',
                               '2016-12-30 19:18:00',
                               '2016-12-30 19:38:00',
                               '2016-12-30 19:48:00',
                               '2016-12-30 20:15:45',
                               '2016-12-30 20:35:45',
                               '2016-12-30 20:45:45',
                               '2016-12-30 21:14:30'],
                     'End': ['2016-12-30 19:18:00',
                             '2016-12-30 19:38:00',
                             '2016-12-30 19:48:00',
                             '2016-12-30 20:15:45',
                             '2016-12-30 20:35:45',
                             '2016-12-30 20:45:45',
                             '2016-12-30 21:14:30',
                             '2016-12-30 21:35:00']})
# Convert the strings to datetime
df_a['Start'] = pd.to_datetime(df_a['Start'], format='%Y-%m-%d %H:%M:%S')
df_a['End'] = pd.to_datetime(df_a['End'], format='%Y-%m-%d %H:%M:%S')
df_b['Start'] = pd.to_datetime(df_b['Start'], format='%Y-%m-%d %H:%M:%S')
df_b['End'] = pd.to_datetime(df_b['End'], format='%Y-%m-%d %H:%M:%S')
# Create labels for the two datasets
# These labels will help determine the overlaps downstream
df_a['Label'] = 'a'
df_b['Label'] = 'b'
# With the labels created, I can concatenate the dataframes now
df_concat = pd.concat([df_a, df_b])
df_concat = df_concat[['Label', 'Start', 'End']] # Ordering the columns
# Convert the dataframe to a list of tuples
df_concat_rec = df_concat.to_records(index=False)
# Here's where I'm using my answer that I had used in the other question
timelist_new = []
for time in df_concat_rec:
    timelist_new.append((time[0], time[1], 'begin'))
    timelist_new.append((time[0], time[2], 'end'))
timelist_new = sorted(timelist_new, key=lambda x: x[1])
key = None
keylist = set()
aggregator = []
for idx in range(len(timelist_new[:-1])):
    t1 = timelist_new[idx]
    t2 = timelist_new[idx + 1]
    t1_key = str(t1[0])
    t2_key = str(t2[0])
    t1_dt = t1[1]
    t2_dt = t2[1]
    t1_pointer = t1[2]
    t2_pointer = t2[2]
    if t1_dt == t2_dt:
        keylist.add(t1_key)
        keylist.add(t2_key)
    elif t1_dt < t2_dt:
        if t1_pointer == 'begin':
            keylist.add(t1_key)
        if t1_pointer == 'end':
            keylist.discard(t1_key)
    key = ','.join(sorted(keylist))
    aggregator.append((key, t1_dt, t2_dt))
# Filter out any records where there isn't an overlap
# (a single label such as 'a' has length 1, while 'a,b' is longer)
# and any records where the start and end dates are equal
filtered = [x for x in aggregator if len(x[0]) > 1 and x[1] != x[2]]
# Convert the list of tuples back to dataframe
final_df = pd.DataFrame.from_records(filtered, columns=['Label', 'Start', 'End'])
# Print final dataframe
print(final_df)
Output:
Label Start End
0 a,b 2016-12-30 18:51:00 2016-12-30 19:18:00
1 a,b 2016-12-30 19:18:00 2016-12-30 19:37:00
2 a,b 2016-12-30 20:03:00 2016-12-30 20:11:00
3 a,b 2016-12-30 20:12:00 2016-12-30 20:15:45
4 a,b 2016-12-30 20:15:45 2016-12-30 20:35:45
5 a,b 2016-12-30 20:35:45 2016-12-30 20:45:45
6 a,b 2016-12-30 20:45:45 2016-12-30 21:02:00
7 a,b 2016-12-30 21:02:00 2016-12-30 21:04:00
8 a,b 2016-12-30 21:10:00 2016-12-30 21:12:00
9 a,b 2016-12-30 21:12:00 2016-12-30 21:14:30
10 a,b 2016-12-30 21:14:30 2016-12-30 21:32:00
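For comparison, the same split can be sketched without a Python-level loop when the intervals inside each frame never overlap one another, as is the case in the sample data. The sketch below is only an illustration under that assumption: `split_overlaps` is a hypothetical helper name, not a pandas function, it drops the Label column, and it has not been benchmarked on frames of 2 million rows.

```python
import numpy as np
import pandas as pd

def split_overlaps(df_a, df_b):
    """Hypothetical helper: spans covered by both frames, cut at every boundary.

    Assumes the intervals within each frame do not overlap one another.
    """
    # Every distinct start/end from either frame becomes a cut point.
    cuts = np.unique(np.concatenate([
        df_a['Start'].values, df_a['End'].values,
        df_b['Start'].values, df_b['End'].values]))
    seg_start, seg_end = cuts[:-1], cuts[1:]

    def covered(df):
        df = df.sort_values('Start')
        starts, ends = df['Start'].values, df['End'].values
        # Last interval that starts at or before each segment start (-1 if none)
        idx = np.searchsorted(starts, seg_start, side='right') - 1
        ok = idx >= 0
        # The segment is covered only if that interval also reaches the segment end.
        ok[ok] = ends[idx[ok]] >= seg_end[ok]
        return ok

    keep = covered(df_a) & covered(df_b)
    return pd.DataFrame({'Start': seg_start[keep], 'End': seg_end[keep]})

# Small subset of the sample data
small_a = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:51:00', '2016-12-30 20:03:00']),
    'End': pd.to_datetime(['2016-12-30 19:37:00', '2016-12-30 20:11:00'])})
small_b = pd.DataFrame({
    'Start': pd.to_datetime(['2016-12-30 18:33:45', '2016-12-30 19:18:00',
                             '2016-12-30 19:48:00']),
    'End': pd.to_datetime(['2016-12-30 19:18:00', '2016-12-30 19:38:00',
                           '2016-12-30 20:15:45'])})
result = split_overlaps(small_a, small_b)
print(result)
```

Because every boundary is a cut point, each segment lies either wholly inside or wholly outside any given interval, so one `searchsorted` lookup per frame is enough to decide whether the segment is kept.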
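So I found a workaround that seems to be working, but I would still like to hear from anyone who has an answer for doing this in pandas. What I am doing is using the package pandasql to create a sqlite database from the DFs and execute the SQL that I know works. It is a pretty nifty package. – user6745154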
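The SQLite route mentioned in the comment can be sketched with the standard-library `sqlite3` module together with pandas' `to_sql` and `read_sql_query`, which avoids assuming anything about pandasql's own API. The table names, the `uid`, `start_ts` and `end_ts` column names, and the sample rows are assumptions made for this illustration (END is a keyword in SQLite, so the columns are renamed). Timestamps are kept as ISO-formatted strings, which compare correctly as text. Unlike the BETWEEN version, the strict inequalities here skip pairs that only touch at a boundary.

```python
import sqlite3
import pandas as pd

# Hypothetical sample frames; uid stands in for UNIQUE_ID
table_a = pd.DataFrame({
    'uid': [1, 1],
    'start_ts': ['2016-12-30 18:51:00', '2016-12-30 20:03:00'],
    'end_ts': ['2016-12-30 19:37:00', '2016-12-30 20:11:00']})
table_b = pd.DataFrame({
    'uid': [1, 1, 1],
    'start_ts': ['2016-12-30 18:33:45', '2016-12-30 19:18:00', '2016-12-30 19:48:00'],
    'end_ts': ['2016-12-30 19:18:00', '2016-12-30 19:38:00', '2016-12-30 20:15:45']})

conn = sqlite3.connect(':memory:')
table_a.to_sql('table_a', conn, index=False)
table_b.to_sql('table_b', conn, index=False)

# Two-argument MAX/MIN are SQLite scalar functions; they replace the CASE expressions.
query = """
SELECT a.uid,
       MAX(a.start_ts, b.start_ts) AS seg_start,
       MIN(a.end_ts, b.end_ts)     AS seg_end
FROM table_a AS a
JOIN table_b AS b
  ON a.uid = b.uid
 AND a.start_ts < b.end_ts
 AND b.start_ts < a.end_ts
ORDER BY seg_start
"""
overlaps = pd.read_sql_query(query, conn)
conn.close()
print(overlaps)
```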