Python: operating on a large pandas DataFrame
Question:
I have a pandas DataFrame named joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentil_75
For each row I want to classify the price like this:
if the price is below percentil_25 I give this product class 1, and so on.
So what I did is:
from collections import OrderedDict

classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []
for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if(float(joined.values[index][1]) <= float(joined.values[index][2])):
        classe_final['class'].append(1)
    elif(float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
        classe_final['class'].append(2)
    elif(float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But since my DataFrame is quite big, it takes forever.
Do you have any idea how I could do this quicker?
Answer
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
There are two ways. The first is to define a function and call apply:
# "class" is a reserved word in Python, so the function needs another name
def price_class(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(price_class, axis=1)
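The other way, which I think is better and will be much faster: add the 'class' column to your existing df using loc, and then just take a view of the 2 columns of interest: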
joined.loc[joined['price'] < joined['percentil_25'], 'class'] = 1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] = 2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] = 3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] = 4
# 'product' is the identifier column listed in the question
classe_final = joined[['product', 'class']]
Just for kicks, you could also use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4, np.where(joined['price'] > joined['percentil_50'], 3, np.where(joined['price'] > joined['percentil_25'], 2, 1)))
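This evaluates whether the price is greater than percentil_75; if so, the class is 4, otherwise it evaluates the next condition, and so on. It may be worth timing this against loc, but it is a lot less readable.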
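For readers who find the nested np.where hard to follow, the same chain of conditions can be sketched with np.select, which takes the first condition that matches. This is only an illustration: the toy values below are made up, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); column names follow the question
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "price": [5.0, 15.0, 25.0, 35.0],
    "percentil_25": [10.0, 10.0, 10.0, 10.0],
    "percentil_50": [20.0, 20.0, 20.0, 20.0],
    "percentil_75": [30.0, 30.0, 30.0, 30.0],
})

# Conditions are checked in order; the first one that holds wins
conditions = [
    joined["price"] > joined["percentil_75"],
    joined["price"] > joined["percentil_50"],
    joined["price"] > joined["percentil_25"],
]
choices = [4, 3, 2]

# Rows that match none of the conditions fall back to class 1
joined["class"] = np.select(conditions, choices, default=1)
print(joined[["product", "class"]])
```

Like the np.where version, this uses strict > comparisons, so a price exactly equal to a percentile lands in the lower class.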
Answer
Another solution. If someone asked me to bet on which one is the fastest, I'd go for this:
# each comparison that holds adds 1, so the result runs from 1 to 4
joined.set_index("product").eval(
    "1 + (price >= percentil_25)"
    " + (price >= percentil_50)"
    " + (price >= percentil_75)"
)
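Rather than betting, the candidates can be timed. The sketch below is a rough harness under stated assumptions: the random data, the row count, and the helper names with_loc and with_bool_sum are all hypothetical, and the boolean sum is written with plain pandas comparisons instead of eval. It first checks that both approaches agree, then times them.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: random prices against fixed percentile columns
rng = np.random.default_rng(0)
n = 100_000
joined = pd.DataFrame({
    "product": np.arange(n),
    "price": rng.uniform(0, 100, n),
    "percentil_25": np.full(n, 25.0),
    "percentil_50": np.full(n, 50.0),
    "percentil_75": np.full(n, 75.0),
})

def with_loc(df):
    # Boolean masks plus .loc assignment, one class at a time
    out = pd.Series(0, index=df.index)
    out.loc[df["price"] < df["percentil_25"]] = 1
    out.loc[(df["price"] >= df["percentil_25"]) & (df["price"] < df["percentil_50"])] = 2
    out.loc[(df["price"] >= df["percentil_50"]) & (df["price"] < df["percentil_75"])] = 3
    out.loc[df["price"] >= df["percentil_75"]] = 4
    return out

def with_bool_sum(df):
    # Start at 1 and add 1 for every percentile the price has reached
    return (
        1
        + (df["price"] >= df["percentil_25"]).astype(int)
        + (df["price"] >= df["percentil_50"]).astype(int)
        + (df["price"] >= df["percentil_75"]).astype(int)
    )

# Both approaches should assign the same class to every row
assert (with_loc(joined) == with_bool_sum(joined)).all()

for fn in (with_loc, with_bool_sum):
    seconds = timeit.timeit(lambda: fn(joined), number=10)
    print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```

Timings depend on the machine, the pandas version, and the size of the DataFrame, so it is worth running this on data shaped like your own before picking one.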
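Sorry, are you just wanting to determine the class of a product depending on where its price falls in each percentile? So = 25 and … – EdChum 2014-09-19 07:50:52
Yes, exactly @EdChum – woshitom 2014-09-19 07:56:58
Sorry, I just noticed you are using an ordered dict to store your values, so my answer is incorrect. What do you want to produce? Your code will produce a dict keyed on the product, with a list of the product's prices for each class in that dict, is this correct? Can you show a toy sample dataset and the expected output? – EdChum 2014-09-19 08:14:23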