Creating column pairs in pandas
Question:
df=pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d'])
a b c d
0 1.23 -1.25 0.15 1.41
1 0.64 -0.16 0.46 -1.06
2 -0.98 0.54 -0.81 0.33
3 1.71 0.95 1.01 -0.81
How can I create a DataFrame that contains all possible pairs of columns?
p1 p2 c1 c2
a b 1.23 -1.25
a b 0.64 -0.16
a b -0.98 0.54
a b 1.71 0.95
a c 1.23 0.15
a c 0.64 0.46
a c -0.98 -0.81
a c 1.71 1.01
... ... ... ...
c d 0.15 1.41
c d 0.46 -1.06
c d -0.81 0.33
c d 1.01 -0.81
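Answer
Use:
- itertools.combinations to get all combinations of the columns
- a list comprehension over those combinations to build a list of DataFrames with renamed columns (df[[x[0], x[1]]] is df[['a', 'b']] for the first pair)
- concat with the keys parameter to build a MultiIndex
- finally some data cleaning: remove the third index level and name the new columns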
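from itertools import combinations
# all unordered pairs of column names
cc = list(combinations(df.columns, 2))
# one two-column frame per pair, with the columns renamed to c1/c2
dfs = [df[[x[0], x[1]]].rename(columns={x[0]: 'c1', x[1]: 'c2'}) for x in cc]
# keys=cc adds each pair as two outer index levels
df1 = pd.concat(dfs, keys=cc)
# drop the original row index (level 2), name the pair levels, then flatten
df1 = df1.reset_index(level=2, drop=True).rename_axis(('p1', 'p2')).reset_index()
print(df1)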
p1 p2 c1 c2
0 a b 1.23 -1.25
1 a b 0.64 -0.16
2 a b -0.98 0.54
3 a b 1.71 0.95
4 a c 1.23 0.15
5 a c 0.64 0.46
6 a c -0.98 -0.81
7 a c 1.71 1.01
8 a d 1.23 1.41
9 a d 0.64 -1.06
10 a d -0.98 0.33
11 a d 1.71 -0.81
12 b c -1.25 0.15
13 b c -0.16 0.46
14 b c 0.54 -0.81
15 b c 0.95 1.01
16 b d -1.25 1.41
17 b d -0.16 -1.06
18 b d 0.54 0.33
19 b d 0.95 -0.81
20 c d 0.15 1.41
21 c d 0.46 -1.06
22 c d -0.81 0.33
23 c d 1.01 -0.81
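To see why the last step drops level 2, it can help to look at what concat returns before the cleanup. This is only a minimal sketch: it uses the fixed sample values from the question instead of random data so the result is reproducible, and the name stacked is just illustrative.

```python
from itertools import combinations

import pandas as pd

# Fixed sample values (copied from the question) instead of random data.
df = pd.DataFrame({'a': [1.23, 0.64, -0.98, 1.71],
                   'b': [-1.25, -0.16, 0.54, 0.95],
                   'c': [0.15, 0.46, -0.81, 1.01],
                   'd': [1.41, -1.06, 0.33, -0.81]})

cc = list(combinations(df.columns, 2))
dfs = [df[[x, y]].rename(columns={x: 'c1', y: 'c2'}) for x, y in cc]

# Each key is a 2-tuple, so it contributes two index levels; the original
# row index of every piece becomes the third level (position 2).
stacked = pd.concat(dfs, keys=cc)

print(stacked.index.nlevels)     # number of index levels before cleanup
print(stacked.loc[('a', 'b')])   # the four rows belonging to the pair (a, b)
```

Dropping level 2 and naming the remaining two levels p1 and p2 is what turns this intermediate frame into the final output shown above.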
Answer
As long as your DataFrame isn't too large, a nested loop works well enough:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d'])
print(df)
a b c d
0 0.004477 -0.367254 -0.251733 -0.957313
1 0.996096 0.879603 1.499766 0.386398
2 -0.459716 0.186510 0.738449 -0.219747
3 0.606211 0.077233 1.583994 0.824706
master = pd.DataFrame()
known_pairs = []
n = df.shape[0]
for p1 in df:
    for p2 in df:
        pair = sorted((p1, p2))
        # skip self-pairs and pairs already seen in the other order
        if p1 != p2 and pair not in known_pairs:
            known_pairs.append(pair)
            # n rows holding the two column names of this pair
            tmp = pd.DataFrame(np.array([p1, p2] * n).reshape(n, 2), columns=['p1', 'p2'])
            tmp['c1'] = df[p1]
            tmp['c2'] = df[p2]
            master = pd.concat([master, tmp])
print(master)
p1 p2 c1 c2
0 a b 0.004477 -0.367254
1 a b 0.996096 0.879603
2 a b -0.459716 0.186510
3 a b 0.606211 0.077233
0 a c 0.004477 -0.251733
1 a c 0.996096 1.499766
2 a c -0.459716 0.738449
3 a c 0.606211 1.583994
0 a d 0.004477 -0.957313
1 a d 0.996096 0.386398
2 a d -0.459716 -0.219747
3 a d 0.606211 0.824706
0 b c -0.367254 -0.251733
1 b c 0.879603 1.499766
2 b c 0.186510 0.738449
3 b c 0.077233 1.583994
0 b d -0.367254 -0.957313
1 b d 0.879603 0.386398
2 b d 0.186510 -0.219747
3 b d 0.077233 0.824706
0 c d -0.251733 -0.957313
1 c d 1.499766 0.386398
2 c d 0.738449 -0.219747
3 c d 1.583994 0.824706
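Note that the loop calls pd.concat once per pair, so the accumulated frame is copied on every iteration. One possible variation is to collect the pieces in a list and concatenate once at the end. This is a sketch, not the answer's method: it assumes a fresh 0..N-1 index is acceptable and that df has at least two columns, and the helper name column_pairs is made up for illustration.

```python
from itertools import combinations

import pandas as pd


def column_pairs(df):
    """Return a long DataFrame with one block of rows per pair of columns.

    Hypothetical helper; assumes df has at least two columns.
    """
    pieces = []
    for p1, p2 in combinations(df.columns, 2):
        pieces.append(pd.DataFrame({
            'p1': p1,                  # scalar is broadcast to every row
            'p2': p2,
            'c1': df[p1].to_numpy(),
            'c2': df[p2].to_numpy(),
        }))
    # A single concat at the end; ignore_index gives a continuous index.
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
result = column_pairs(df)
print(result)
```

Using combinations also makes the known_pairs bookkeeping unnecessary, since each unordered pair is produced exactly once.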
Answer
Since you need combinations of the columns, you can use itertools.combinations
and then cumcount() to get the row index within each pair. Hope this helps.
import itertools
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.randn(4,4), columns=['a','b','c','d'])
r = list(itertools.combinations(df.columns.tolist(), 2))
new = pd.DataFrame(list(r*df.shape[0]),columns=['p1','p2']).sort_values(['p1','p2']).reset_index(drop=True)
new['count'] = new.groupby(['p1','p2']).cumcount()
new['c1'] = new.apply(lambda x: df.loc[x['count'],x['p1']],axis=1)
new['c2'] = new.apply(lambda x: df.loc[x['count'],x['p2']],axis=1)
new = new.drop('count',axis=1)
Output:
p1 p2 c1 c2
0 a b -0.157408 -0.293641
1 a b -0.205898 -0.527494
2 a b -0.740385 1.058200
3 a b 2.163202 0.584529
4 a c -0.157408 0.824047
5 a c -0.205898 0.016703
...
22 c d 0.260635 -0.958339
23 c d -0.641043 -1.199849
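The two apply calls do one df.loc lookup per output row, which may get slow on larger frames. If that matters, the same layout can probably be built without apply. The following is only a sketch of a NumPy-based alternative (a different technique from the cumcount approach above); the names left, right and pairs are illustrative, and a small fixed DataFrame stands in for the random one.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Small fixed frame so the result is easy to check by eye.
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})

pairs = list(combinations(df.columns, 2))
n = len(df)
left = [p for p, _ in pairs]    # first column name of each pair
right = [q for _, q in pairs]   # second column name of each pair

new = pd.DataFrame({
    # repeat every label n times so each pair gets one row per original row
    'p1': np.repeat(left, n),
    'p2': np.repeat(right, n),
    # df[left] selects one column per pair (repeats allowed); transposing
    # and flattening lays those columns end to end, matching the labels.
    'c1': df[left].to_numpy().T.ravel(),
    'c2': df[right].to_numpy().T.ravel(),
})
print(new)
```

This assumes all columns share a numeric dtype, as they do in the question.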
Thanks @jezrael, it works perfectly! – HappyPy
Glad I could help! Really interesting question, thanks ;) – jezrael