Get the sum of people for overlapping age ranges in a pandas DataFrame
Question:
    target_value  title        people  start     end  twitter_map
0   AGE_13_TO_17  13 to 17     1       13        17   AGE_13_TO_17
1   AGE_13_TO_24  13 to 24     NaN     13        24   NaN
2   AGE_13_TO_34  13 to 34     NaN     13        34   NaN
3   AGE_13_TO_49  13 to 49     NaN     13        49   NaN
4   AGE_13_TO_54  13 to 54     NaN     13        54   NaN
5   AGE_OVER_13   Age Over 13  NaN     13        -    NaN
6   AGE_18_TO_24  18 to 24     7       18        24   AGE_18_TO_24
7   AGE_18_TO_54  18 to 54     NaN     18        54   NaN
8   AGE_OVER_18   Age Over 18  NaN     18        -    NaN
9   AGE_21_TO_34  21 to 34     NaN     21        34   NaN
10  AGE_21_TO_49  21 to 49     NaN     21        49   NaN
11  AGE_21_TO_54  21 to 54     NaN     21        54   NaN
12  AGE_25_TO_34  25 to 34     34      25        34   AGE_25_TO_34
13  AGE_25_TO_49  25 to 49     NaN     25        49   NaN
14  AGE_OVER_25   Age Over 25  NaN     25        -    NaN
15  AGE_35_TO_44  35 to 44     15      35        44   AGE_35_TO_44
16  AGE_OVER_35   Age Over 35  NaN     35        -    NaN
17  AGE_45_TO_54  45 to 54     1       45        54   AGE_45_TO_54
18  AGE_OVER_50   Age Over 50  NaN     50        -    NaN
19  AGE_55_TO_64  55 to 64     3       55        64   AGE_55_TO_64
20  AGE_OVER_65   65+          6       65        -    AGE_OVER_65
21  None          All Ages     NaN     All Ages  -    NaN
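I have the DataFrame shown above, which contains age start and age end values. Some of the age ranges overlap. I need to fill in the people column correctly for every row, based on the rows that already have specific values.
Expected output for the first two rows: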
    target_value  title        people  start     end  twitter_map
0   AGE_13_TO_17  13 to 17     1       13        17   AGE_13_TO_17
1   AGE_13_TO_24  13 to 24     8       13        24   NaN
Answer:
I will work with a simple example:
people  start  end
1       13     17
NaN     13     24
NaN     13     34
NaN     13     -
7       18     24
NaN     18     -
34      25     34
First replace - with infinity and convert everything to float:
import numpy as np
df = df.replace({'-': np.inf}).astype(float)
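The answer does not show how df was created, so here is a minimal sketch that builds the simple example by hand and then runs the same conversion. Building the frame from a dict is an assumption; only the values come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical construction of the simple example; '-' marks an open-ended range.
df = pd.DataFrame({
    "people": [1, np.nan, np.nan, np.nan, 7, np.nan, 34],
    "start": [13, 13, 13, 13, 18, 18, 25],
    "end": [17, 24, 34, "-", 24, "-", 34],
})

# Same conversion as in the answer: '-' becomes infinity, everything becomes float.
df = df.replace({"-": np.inf}).astype(float)

print(df.dtypes)
print(df)
```

After this step every column is float64, so the open-ended rows can be compared numerically with the others.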
Then select the rows where 'people' is given; these will be the input:
df_input = df.dropna()
Now define the following function:
def func(row):
    return df_input.loc[
        (df_input['start'] >= row['start']) & (df_input['end'] <= row['end']),
        'people'
    ].sum()
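For each row of the DataFrame, it sums all the input numbers whose age range falls inside that row's range (this is where infinity is useful: an open-ended range contains every input range that starts at or after its start).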
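The same containment test can also be written without a row-wise function, using NumPy broadcasting. This is an alternative sketch, not the answer's method; the names contains and totals are made up for illustration, and the frame is built already converted to float.

```python
import numpy as np
import pandas as pd

# The simple example, already converted to float (inf marks open-ended ranges).
df = pd.DataFrame({
    "people": [1, np.nan, np.nan, np.nan, 7, np.nan, 34],
    "start": [13, 13, 13, 13, 18, 18, 25],
    "end": [17, 24, 34, np.inf, 24, np.inf, 34],
}, dtype=float)
df_input = df.dropna()

# contains[i, j] is True when input range j lies entirely inside range i.
contains = (
    (df_input["start"].to_numpy()[None, :] >= df["start"].to_numpy()[:, None])
    & (df_input["end"].to_numpy()[None, :] <= df["end"].to_numpy()[:, None])
)

# Each row of the mask selects the input counts to add up.
totals = pd.Series(contains @ df_input["people"].to_numpy(), index=df.index)
print(totals)
```

The mask has one row per range and one column per input row, so memory grows with the product of the two. For a table of this size that hardly matters, and the row-wise function may be easier to read.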
Finally, apply the function:
In [36]: df.apply(func, axis=1)
Out[36]:
0 1.0
1 8.0
2 42.0
3 42.0
4 7.0
5 41.0
6 34.0
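To get the table the question asks for, the result can be written back into the people column. The following end-to-end sketch repeats the steps on the simple example; the assignment on the last line is an addition that the answer itself does not show.

```python
import numpy as np
import pandas as pd

# Simple example from the answer; built by hand here as an assumption.
df = pd.DataFrame({
    "people": [1, np.nan, np.nan, np.nan, 7, np.nan, 34],
    "start": [13, 13, 13, 13, 18, 18, 25],
    "end": [17, 24, 34, "-", 24, "-", 34],
})
df = df.replace({"-": np.inf}).astype(float)

# Rows with a known count are the input.
df_input = df.dropna()

def func(row):
    # Sum the known counts whose range lies inside this row's range.
    return df_input.loc[
        (df_input["start"] >= row["start"]) & (df_input["end"] <= row["end"]),
        "people",
    ].sum()

# Overwrite the people column with the computed totals.
df["people"] = df.apply(func, axis=1)
print(df)
```

Two things to keep in mind: a range that contains no input row gets 0 rather than NaN, and only input ranges that fit entirely inside a range are counted. In the question's full table, for instance, AGE_OVER_50 would get 9 (55 to 64 plus 65+), because the 45 to 54 bucket starts below 50 and cannot be split.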
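Comments:
The first three columns have been joined with the last three columns.
What is the expected output?
I gave an example in the first two rows... I hope it explains.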