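Conditional filtering of a CSV file in Python
Question:
Please help! I've tried various approaches and packages to write a program that takes 4 inputs and, based on that combination of inputs, returns the writing-score statistics for the matching group from a CSV file. This is my first project, so I'd appreciate any insights, tips, or hints!
Here is a sample of the CSV (200 rows in total):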
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
This is what I have so far:
import csv
import numpy

# Read the file (Python 3 syntax; the original used Python 2's raw_input and .next())
with open('scores.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skip the header row
    data = []  # load the remaining rows for processing
    for row in csv_file_object:
        data.append(row)
data = numpy.array(data)

# Ask for the four inputs
gender = input('Enter gender [male/female]: ')
schtyp = input('Enter school type [public/private]: ')
ses = input('Enter socioeconomic status [low/middle/high]: ')
prog = input('Enter program status [general/vocation/academic]: ')

# Normalise the inputs to lower case (input() already returns strings)
prog = prog.lower()
gender = gender.lower()
schtyp = schtyp.lower()
ses = ses.lower()
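What I'm missing is how to filter so that I get statistics only for a specific group. For example, say I enter male, public, middle, and academic: I want the average writing score for that subset. I tried the groupby function from pandas, but that only gave me statistics for broad groups (e.g. public vs. private). I also tried a pandas DataFrame, but I could only filter on one input and wasn't sure how to get the writing score. Any tips would be much appreciated!
Answer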
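I agree with Ramon on the subsetting functionality: pandas is definitely the way to go, and its filtering/subsetting features are extremely powerful once you get used to them. It can be hard to wrap your head around at first, though (or at least it was for me!), so I dug some examples of the kind of subsetting you need out of some old code of mine. The variable itu below is a pandas DataFrame of data for different countries over time.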
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
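Thanks, TC Allen! That worked. Thank you for the key tips and pointers, since I've only just started learning to program :) – Mikaz 2014-10-07 22:18:52
Start by reading this [section](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) and see how you get on; basically, what you're asking for can be done – EdChum 2014-10-07 14:12:18
This looks like a typical case of boolean indexing across multiple columns of a DataFrame. You can try the approaches listed [here](http://*.com/questions/8916302/selecting-across-multiple-columns-with-python-pandas) – 2014-10-07 17:57:02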
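On the groupby attempt mentioned in the question: groupby also accepts a list of columns, so it is not limited to broad groups. The sketch below groups on all four columns at once and then looks up one combination. As before, it assumes comma-separated data with the column names from the question's header, inlines a few of the sample rows, and hard-codes the combination that would normally come from the prompts.

```python
import io
import pandas as pd

# A few rows from the question's sample (assumed comma-separated in the real file).
sample = io.StringIO("""id,gender,ses,schtyp,prog,write
70,male,low,public,general,52
121,female,middle,public,vocation,68
86,male,high,public,general,33
172,male,middle,public,academic,47
113,male,middle,public,academic,44
50,male,middle,public,general,59
11,male,middle,public,academic,34
""")
scores = pd.read_csv(sample)  # for the real file: pd.read_csv('scores.csv')

# Group on every input column at once; each distinct combination becomes one row.
keys = ['gender', 'schtyp', 'ses', 'prog']
stats = scores.groupby(keys)['write'].agg(['count', 'mean', 'min', 'max'])

# The lookup key is a tuple in the same order as `keys`.
combo = ('male', 'public', 'middle', 'academic')

if combo in stats.index:
    row = stats.loc[combo]  # statistics for that single combination
    print(row)
else:
    print('No rows match that combination.')
```

This computes the statistics for every combination in one pass, which is convenient if the program will be asked about several groups; for a single lookup, a boolean mask is usually simpler.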