Pandas LTM sum with duplicates
Problem description:
I want to compute a rolling last-twelve-months (LTM) sum of a numeric column, grouped by entity ID. My data looks something like this:
eID perioddate 123456
14 ABC 2011-01-31 31773.0
74 ABC 2011-01-31 31773.0
35 ABC 2011-01-31 31773.0
96 ABC 2011-01-31 31773.0
57 ABC 2011-04-30 11209.0
18 ABC 2011-04-30 11209.0
81 ABC 2011-07-31 11451.0
44 ABC 2011-07-31 11451.0
07 ABC 2011-07-31 11451.0
70 ABC 2011-10-31 20062.0
34 ABC 2011-10-31 20062.0
98 ABC 2011-10-31 20062.0
62 ABC 2012-01-31 42512.0
26 ABC 2012-01-31 42512.0
90 ABC 2012-01-31 42512.0
56 ABC 2012-01-31 42512.0
24 ABC 2012-04-30 41799.0
92 ABC 2012-04-30 41799.0
60 ABC 2012-07-31 41874.0
28 ABC 2012-07-31 41874.0
99 ABC 2012-07-31 41874.0
69 ABC 2012-10-31 46783.0
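I want every row to carry the rolling sum as soon as at least a full year of history is available, which would produce a new column like this: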
eID perioddate 123456 123456_ltm
14 ABC 2011-01-31 31773.0
74 ABC 2011-01-31 31773.0
35 ABC 2011-01-31 31773.0
96 ABC 2011-01-31 31773.0
57 ABC 2011-04-30 11209.0
18 ABC 2011-04-30 11209.0
81 ABC 2011-07-31 11451.0
44 ABC 2011-07-31 11451.0
07 ABC 2011-07-31 11451.0
70 ABC 2011-10-31 20062.0 74495.0
34 ABC 2011-10-31 20062.0 74495.0
98 ABC 2011-10-31 20062.0 74495.0
62 ABC 2012-01-31 42512.0 85234.0
26 ABC 2012-01-31 42512.0 85234.0
90 ABC 2012-01-31 42512.0 85234.0
56 ABC 2012-01-31 42512.0 85234.0
24 ABC 2012-04-30 41799.0 115824.0
92 ABC 2012-04-30 41799.0 115824.0
60 ABC 2012-07-31 41874.0 146247.0
28 ABC 2012-07-31 41874.0 146247.0
99 ABC 2012-07-31 41874.0 146247.0
69 ABC 2012-10-31 46783.0 172968.0
Based on similar questions, I have tried the following:
fx = lambda x: x.rolling(4).sum()
df[id + '_ltm'] = df.groupby(['eID','perioddate'])[id].apply(fx)
Unfortunately, the above just gives me NaNs. Am I missing something obvious?
Answer
I don't think a groupby is needed here, unless I'm missing something. All you need is rolling sum + merge.
v = df.set_index('perioddate')\
.drop_duplicates()['123456'].rolling(4).sum().to_frame()
v
123456
perioddate
2011-01-31 NaN
2011-04-30 NaN
2011-07-31 NaN
2011-10-31 74495.0
2012-01-31 85234.0
2012-04-30 115824.0
2012-07-31 146247.0
2012-10-31 172968.0
df.merge(v, left_on='perioddate', right_index=True)
eID perioddate 123456_x 123456_y
14 ABC 2011-01-31 31773.0 NaN
74 ABC 2011-01-31 31773.0 NaN
35 ABC 2011-01-31 31773.0 NaN
96 ABC 2011-01-31 31773.0 NaN
57 ABC 2011-04-30 11209.0 NaN
18 ABC 2011-04-30 11209.0 NaN
81 ABC 2011-07-31 11451.0 NaN
44 ABC 2011-07-31 11451.0 NaN
7 ABC 2011-07-31 11451.0 NaN
70 ABC 2011-10-31 20062.0 74495.0
34 ABC 2011-10-31 20062.0 74495.0
98 ABC 2011-10-31 20062.0 74495.0
62 ABC 2012-01-31 42512.0 85234.0
26 ABC 2012-01-31 42512.0 85234.0
90 ABC 2012-01-31 42512.0 85234.0
56 ABC 2012-01-31 42512.0 85234.0
24 ABC 2012-04-30 41799.0 115824.0
92 ABC 2012-04-30 41799.0 115824.0
60 ABC 2012-07-31 41874.0 146247.0
28 ABC 2012-07-31 41874.0 146247.0
99 ABC 2012-07-31 41874.0 146247.0
69 ABC 2012-10-31 46783.0 172968.0
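For reference, the same rolling sum + merge idea can be written as a self-contained sketch. It uses a subset of the rows from the question, and it makes a few assumptions that are not in the answer above: duplicates are dropped on perioddate explicitly, the rows are sorted by date before rolling, and the result column is renamed to 123456_ltm so that merge does not produce _x/_y suffixes.

```python
import pandas as pd

# A subset of the question's rows: one entity, duplicated quarterly values.
df = pd.DataFrame({
    "eID": ["ABC"] * 8,
    "perioddate": pd.to_datetime([
        "2011-01-31", "2011-01-31", "2011-04-30", "2011-07-31",
        "2011-10-31", "2011-10-31", "2012-01-31", "2012-04-30",
    ]),
    "123456": [31773.0, 31773.0, 11209.0, 11451.0,
               20062.0, 20062.0, 42512.0, 41799.0],
})

col = "123456"

# Keep one row per period, in date order, so that rolling(4)
# spans four distinct quarters rather than four duplicate rows.
quarterly = (
    df.drop_duplicates(subset="perioddate")
      .sort_values("perioddate")
      .set_index("perioddate")[col]
)

# Rolling sum over four quarters; the first three periods stay NaN.
ltm = quarterly.rolling(4).sum().rename(col + "_ltm")

# Broadcast the per-period result back onto every duplicated row.
out = df.merge(ltm.to_frame(), left_on="perioddate", right_index=True, how="left")
print(out)
```

Dropping duplicates on perioddate alone assumes that every row for a given date carries the same value, as in the sample data.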
Edit: If you do need the groupby, you can move everything into a dfGroupBy.apply call:
v = df.set_index('perioddate').groupby('eID', group_keys=False)\
.apply(lambda x: x.drop_duplicates()['123456'].rolling(4).sum()).T
v
eID ABC
perioddate
2011-01-31 NaN
2011-04-30 NaN
2011-07-31 NaN
2011-10-31 74495.0
2012-01-31 85234.0
2012-04-30 115824.0
2012-07-31 146247.0
2012-10-31 172968.0
The merge step remains the same.
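With more than one entity, a possible variant is sketched below. Note that it swaps the apply(...).T construction for groupby(...).transform and merges on both eID and perioddate, which is a different technique from the one shown above. The second entity "XYZ" and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical two-entity frame: the "ABC" rows come from the question,
# the "XYZ" rows are invented for this example.
df = pd.DataFrame({
    "eID": ["ABC"] * 6 + ["XYZ"] * 5,
    "perioddate": pd.to_datetime([
        "2011-01-31", "2011-01-31", "2011-04-30",
        "2011-07-31", "2011-10-31", "2012-01-31",
        "2011-01-31", "2011-04-30", "2011-07-31",
        "2011-10-31", "2011-10-31",
    ]),
    "123456": [31773.0, 31773.0, 11209.0, 11451.0, 20062.0, 42512.0,
               1.0, 2.0, 3.0, 4.0, 4.0],
})

col = "123456"
keys = ["eID", "perioddate"]

# One row per (entity, period), sorted so each entity's quarters are in order.
quarterly = df.drop_duplicates(subset=keys).sort_values(keys).copy()

# Rolling four-quarter sum computed separately within each entity.
quarterly[col + "_ltm"] = (
    quarterly.groupby("eID")[col]
             .transform(lambda s: s.rolling(4).sum())
)

# Attach the per-period result to every original (duplicated) row.
out = df.merge(quarterly[keys + [col + "_ltm"]], on=keys, how="left")
print(out)
```

Because the window is computed per eID, one entity's quarters never leak into another entity's sum.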
You said you want a full year, but the count starts in October...? –
The data points themselves are quarterly sums, so the 11/31/11 number will include data going back to 10/31/10 – rgk