Python:逐行替换数据帧值
问题描述:
我想在pandas数据框中逐行条件替换值,以便max(row)将保留,而行中的所有其他值将被设置为None
。 我的直觉走向apply()
,但我不确定这是正确的选择,还是如何去做。Python:逐行替换数据帧值
实施例(但可能有多个列):
tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
})
tmp
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
求购输出:
somemagic(tmp)
A B
0 None 3
1 None 4
2 3 None
3 None 33
4 None 10
5 None 9
6 7 None # on tie I don't really care which one is set to None
7 8 None
8 None 10
9 10 None # on tie I don't really care which one is set to None
,关于如何实现这一任何建议?
答
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
mask = (tmp.eq(tmp.max(axis=1), axis=0))
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True True
7 True False
8 False True
9 True True
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 7.0
7 8.0 NaN
8 NaN 10.0
9 10.0 10.0
,然后你可以添加NaN
如果列中的值相等:
mask = (tmp.eq(tmp.max(axis=1), axis=0))
mask['B'] = mask.B & (tmp.A != tmp.B)
print (mask)
A B
0 False True
1 False True
2 True False
3 False True
4 False True
5 False True
6 True False
7 True False
8 False True
9 True False
df = (tmp[mask])
print (df)
A B
0 NaN 3.0
1 NaN 4.0
2 3.0 NaN
3 NaN 33.0
4 NaN 10.0
5 NaN 9.0
6 7.0 NaN
7 8.0 NaN
8 NaN 10.0
9 10.0 NaN
计时(len(df)=10
):
In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
1000 loops, best of 3: 974 µs per loop
In [235]: %timeit (gh(tmp))
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.64 ms per loop
(len(df)=100k
):
In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
100 loops, best of 3: 7.42 ms per loop
In [245]: %timeit (gh(t1))
1 loop, best of 3: 8.81 s per loop
代码时序:
import pandas as pd
tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
})
tmp = pd.concat([tmp]*10000).reset_index(drop=True)
t1 = tmp.copy()
print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
def top(row):
data = row.tolist()
return [d if d == max(data) else None for d in data]
def gh(tmp1):
return tmp1.apply(top, axis=1)
print (gh(t1))
答
我会建议你使用apply()
。您可以如下使用它:
In [1]: import pandas as pd
In [2]: tmp= pd.DataFrame({
...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
...: })
In [3]: tmp
Out[3]:
A B
0 1 3
1 2 4
2 3 1
3 4 33
4 5 10
5 6 9
6 7 7
7 8 3
8 9 10
9 10 10
In [4]: def top(row):
...: data = row.tolist()
...: return [d if d == max(data) else None for d in data]
...:
In [5]: df2 = tmp.apply(top, axis=1)
In [6]: df2
Out[6]:
A B
0 NaN 3
1 NaN 4
2 3 NaN
3 NaN 33
4 NaN 10
5 NaN 9
6 7 7
7 8 NaN
8 NaN 10
9 10 10
我曾在我的脑海里完全一样:'TMP [tmp.eq(tmp.max(轴= 1),轴= 0)]':) –
谢谢你们!非常感激! :) – Ruslan