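Pandas index skips values
Question:
I am reading two CSV files, selecting data from specific columns, dropping NA/null values, and then using the rows in one file that meet a condition to print the corresponding data from the other file: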
data1 = pandas.read_csv(filename1, usecols = ['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols = ['X', 'Y', 'Z']).dropna()
i = 0
for item in data1['Y']:
    if item > -20:
        print(data2['X'][i])  # label-based lookup: fails if label i was dropped
    i += 1
However, this raises an error:
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976)
KeyError: 6L
It turns out that when I print data2['X'], I see that the index is missing the numbers of the dropped rows:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
7 -1.928778
...
How can I fix this and renumber the index values? Or is there a better way?
Answer:
I found the solution in another question, "Reindexing dataframes". Appending
.reset_index(drop=True)
after dropna() did the trick! The index is now consecutive:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
6 -1.928778
7 -1.925359
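To illustrate why the lookup failed and what reset_index changes, here is a minimal sketch. The DataFrame below is made-up data standing in for one of the CSV files, not the asker's actual data. It also shows .iloc, which looks rows up by position and so would work without renumbering.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row labeled 2 has a missing value in X.
df = pd.DataFrame({"X": [1.0, 2.0, np.nan, 4.0],
                   "Y": [10.0, 20.0, 30.0, 40.0]})

# dropna() removes the row but keeps the original labels of the others,
# so the index now has a gap where label 2 used to be.
dropped = df.dropna()
print(list(dropped.index))

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))

# Label-based access works for every label in 0..n-1 after renumbering.
by_label = renumbered["X"].loc[2]

# Position-based access works even on the frame with gaps in its index.
by_position = dropped["X"].iloc[2]
```

Both lookups return the same row here; which one fits better depends on whether the original row labels still matter later on.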
Answer:
Are your two files/DataFrames the same length? If so, you can use a boolean mask to do this (it avoids the for loop):
data2['X'][data1['Y'] > -20]
Edit: in response to the comment, here is what happens in between:
In [16]: df1
Out[16]:
   X  Y
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8
In [17]: df2
Out[17]:
    Y   X
0  64  75
1  65  73
2  36  44
3  13  58
4  92  54
# creates a pandas Series object of True/False, which you can then use as a "mask"
In [18]: df2['Y'] > 50
Out[18]:
0 True
1 True
2 False
3 False
4 True
Name: Y, dtype: bool
# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter
In [19]: df1['X'][ df2['Y'] > 50 ]
Out[19]:
0 0
1 1
4 4
Name: X, dtype: int64
# same as doing this (where the mask is applied to the whole DataFrame, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X']
Out[20]:
0 0
1 1
4 4
Name: X, dtype: int64
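Applied to the question's variables, the mask approach might look like the sketch below. The values are invented, and it assumes both frames have the same length and a matching 0..n-1 index (for example after reset_index(drop=True)). A boolean Series is matched to the target by index label, so frames whose rows were dropped independently may not line up; the positional variant at the end sidesteps labels and only needs equal lengths.

```python
import pandas as pd

# Hypothetical stand-ins for data1 and data2 with matching default indexes.
data1 = pd.DataFrame({"Y": [-30.0, -10.0, -25.0, 5.0]})
data2 = pd.DataFrame({"X": [-1.9, -2.0, -2.5, -2.7]})

# Boolean Series: True where the condition on data1 holds.
mask = data1["Y"] > -20

# Label-aligned selection: rows of data2 whose labels are True in the mask.
by_label = data2.loc[mask, "X"]

# Positional selection: a plain NumPy boolean array ignores index labels.
by_position = data2["X"][mask.to_numpy()]

print(by_label)
```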
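So it returns all the values in data2['X'] that have the same index as the values in data1['Y'] that are greater than -20? Definitely cleaner than my loop approach. Thanks for sharing, it's always good to learn new/different methods – stoves 2014-11-06 18:25:28
@stoves see the edit – 2014-11-06 18:52:09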