Python: using pandas groupby to reduce a dataframe
Question:
In my dataframe, call it df, I have data that looks like:
serial  gps_dt  lat  long  dist
1       25Mar   x1   y1    NaN
1       26Mar   x2   y2    0.01
1       27Mar   x3   y3    1.25    (assume this is the 5th occurrence < 160)
2       24Mar   x4   y5    NaN
2       25Mar   x5   y5    2.1
2       26Mar   x6   y6    1.01
2       27Mar   x7   y7    175.2
2       28Mar   x8   y8    179.3   (assume this is the 5th occurrence > 160)
and so on. I already have a series, call it check, that tells me whether serial[i] == serial[i+1]. What I want to do now is: when they are equal, build a new dataframe containing serial, gps_dt_first, gps_dt_last, avg_lat, avg_long under the condition hdist < 160, provided there are at least 5 occurrences within that radius. If hdist > 160, I want to build another set if and only if it is the first of the next 5 occurrences greater than 160. For example, the output would look like this:
serial  gps_dt_first  gps_dt_last  avg_lat  avg_long
1       25Mar         27Mar        avg_x    avg_y
2       27Mar         28Mar        avg_x    avg_y
I have been looking at the pandas groupby documentation. The data is already sorted by serial, gps_dt in SAS. Do I still need to do df.groupby(['serial', 'gps_dt'])?
Once df is grouped (if that is needed), my idea for the code is (more of a pseudocode outline):
if check == true and hdist < 160 and 5 or more occurrences (how to count the occurrences?):
    result['serial'] = df.serial          # first in serial; how to extract?
    result['gps_dt_first'] = df.gps_dt    # first in gps_dt
    result['gps_dt_last'] = df.gps_dt     # last in gps_dt
    result['avg_lat'] = df.lat.mean()     # only for the subset of serial meeting the criteria
    result['avg_long'] = df.long.mean()   # same here
elif check == true and hdist > 160 and 5 or more occurrences:
    # do same as above
else:
    # delete
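The outline above can be sketched with pandas instead of row-by-row checks. Everything below is an assumption for illustration: a small hypothetical frame with the question's column names (using dist, as in the sample data), a count of rows per serial with dist < 160, and a keep-if-at-least-5 rule.

```python
import pandas as pd

# Hypothetical data: serial 1 has five fixes with dist < 160, serial 2 does not.
df = pd.DataFrame({
    "serial": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "gps_dt": ["22Mar", "23Mar", "24Mar", "25Mar", "26Mar", "27Mar",
               "26Mar", "27Mar", "28Mar"],
    "lat":  [10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 20.0, 20.1, 20.2],
    "long": [30.0, 30.1, 30.2, 30.3, 30.4, 30.5, 40.0, 40.1, 40.2],
    "dist": [float("nan"), 0.1, 0.2, 0.3, 0.4, 0.5,
             float("nan"), 170.2, 180.3],
})

# Count, per serial, how many rows have dist < 160 (NaN compares as False),
# and keep only serials with at least 5 such rows.
n_close = (df["dist"] < 160).groupby(df["serial"]).transform("sum")
kept = df[n_close >= 5]

# Collapse each surviving serial to one summary row.
summary = kept.groupby("serial").agg(
    gps_dt_first=("gps_dt", "first"),
    gps_dt_last=("gps_dt", "last"),
    avg_lat=("lat", "mean"),
    avg_long=("long", "mean"),
).reset_index()
print(summary)
```

The same pattern with dist >= 160 in place of dist < 160 would give the second set; the "first of the next 5 occurrences" rule would need an extra positional check on top of this.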
Answer:
If you have already read the documentation for groupby, you can do what the following explains:
- iterate over each element you got from groupby;
- perform one or more aggregate operations (including applying chained operations, or applying different operations to different columns);