How do I get data from different files and combine it into one array in Python?

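Problem description:

I need to collect data to test my hypothesis that typing with your dominant hand is faster than typing with your nondominant hand. I wrote the code below, which gives the participant a random word that they then have to copy. The code times how long it takes to type each word and saves that data to a new file. A new CSV file is created for each participant tested.

Now I need to write another script that finds the average for each hand for each participant and then creates one array containing the averages, so I can plot a graph showing whether my hypothesis holds. How would I go about getting the data from the different files and combining it into one array?

My script: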

import random 
import time 

name = raw_input('Enter name: ') # get some name for the file 
outfile = file(name + '.csv', 'w') # create a file for this user's data 

# load up a list of 1000 common words 
words = file('1-1000.txt').read().split() 

ntrials = 50 

answers = [] 
print """Type With Dominant Hand""" 
for i in range(ntrials): 
    word = random.choice(words) 
    tstart = time.time() 
    ans = raw_input('Please type ' + word + ': ') 
    tstop = time.time() 
    answers.append((word, ans, tstop - tstart)) 
    print >>outfile, 'Dominant', word, ans, tstop - tstart # write the data to the file 
    if (i % 5 == 3): 
     go = raw_input('take a break, type y to continue: ') 

print """Type With Nondominant Hand"""  
for i in range(ntrials): 
    word = random.choice(words) 
    tstart = time.time() 
    ans = raw_input('Please type ' + word + ': ') 
    tstop = time.time() 
    answers.append((word, ans, tstop - tstart)) 
    print >>outfile, 'Nondominant', word, ans, tstop - tstart # write the data to the file 
    if (i % 5 == 3): 
     go = raw_input('take a break, type y to continue: ') 

outfile.close() # close the file 

Sample results from the script above:

Dominant sit sit 1.81511306763 
Dominant again again 2.54711103439 
Dominant from from 1.53057098389 
Dominant general general 1.98939108849 
Dominant horse horse 1.93938016891 
Dominant of of 1.07597017288 
Dominant clock clock 1.6587600708 
Dominant save save 1.42030906677 
Nondominant story story 3.92807888985 
Nondominant of of 0.93910908699 
Nondominant test test 1.69210004807 
Nondominant low low 1.13296699524 
Nondominant hit hit 1.15252614021 
Nondominant you you 1.22019600868 
Nondominant river river 1.42011594772 
Nondominant middle middle 1.61595511436 

persons = ["billy", "bob", "joe", "kim"]
num_dom, total_dom, num_nondom, total_nondom = 0, 0.0, 0, 0.0
for person in persons:
    # each participant's data lives in <name>.csv
    data = open('%s.csv' % person, 'r').readlines()
    for line in data:
        if "Nondominant" in line:
            num_nondom += 1
            # the times have a decimal point, so parse them with float(), not int()
            total_nondom += float(line.split(' ')[-1].strip())
        elif "Dominant" in line:
            num_dom += 1
            total_dom += float(line.split(' ')[-1].strip())
        else:
            continue
dom_avg = total_dom / num_dom
nondom_avg = total_nondom / num_nondom
print "Average speed with Dominant hand: %s" % dom_avg
print "Average speed with Non-Dominant hand: %s" % nondom_avg

Fill the `persons` list with the names of your subjects, then do whatever you like with the data.
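
The loop above pools every participant into a single overall average per hand. Since the question asks for an average per hand for each participant, a per-participant variant might look like the following sketch. It is written for Python 3 (the code above is Python 2). The function name `per_person_averages` and the `directory` parameter are made up for illustration, and the sketch assumes every file contains at least one trial for each hand.

```python
import os

def per_person_averages(persons, directory="."):
    """Return {name: (dominant_avg, nondominant_avg)} for each participant."""
    results = {}
    for person in persons:
        times = {"Dominant": [], "Nondominant": []}
        path = os.path.join(directory, person + ".csv")
        with open(path) as infile:
            for line in infile:
                fields = line.split()
                # the hand label is the first field, the time is the last
                if fields and fields[0] in times:
                    times[fields[0]].append(float(fields[-1]))
        results[person] = (
            sum(times["Dominant"]) / len(times["Dominant"]),
            sum(times["Nondominant"]) / len(times["Nondominant"]),
        )
    return results
```

Calling `per_person_averages(["billy", "bob", "joe", "kim"])` would give one pair of averages per name, which can then be split into two lists for plotting.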

P.S. heltonbiker, I noted your idea and added it. I also fixed the newline bug by adding strip.

+0

It keeps giving me a ValueError: invalid literal for int() with base 10: '2.90565299988\n' for line 9, total_nondom += int(line.split(' ')[-1]) – user1864662

+1

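Replace it with `line.split(' ')[-1].strip()`. The strip removes the '\n' character, and then you can turn the value into an int. I like this answer, but it should include a loop showing how to handle multiple files! – heltonbiker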
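
One caveat on the comment above: the ValueError comes from the decimal point rather than from the newline, so `strip()` alone will not make `int()` accept these values. A small Python 3 sketch of the difference, using the value from the error message (the helper name `parse_time` is hypothetical):

```python
def parse_time(line):
    """Return the last space-separated field of a data line as a float."""
    return float(line.split(' ')[-1].strip())

raw = '2.90565299988\n'

# int() rejects the decimal point whether or not the newline is stripped
try:
    int(raw.strip())
    int_ok = True
except ValueError:
    int_ok = False

# float() parses the value; it also tolerates surrounding whitespace
value = float(raw)
```

Here `int_ok` ends up `False`, while `float()` accepts the string even with the trailing newline, so the strip is harmless but not strictly required once `float()` is used.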

import os

def avg_one(filename):
    # collect the typing times for each hand from one participant's file
    vals = {'Dominant': [], 'Nondominant': []}
    with open(filename) as infile:
        for line in infile:
            hand, _, _, t = line.strip().split()
            vals[hand].append(float(t))
    d = vals['Dominant']
    nd = vals['Nondominant']
    return (sum(d) / len(d), sum(nd) / len(nd))

data = []
for f in os.listdir('.'):
    if f.endswith('.csv'):
        data.append(avg_one(f))

# transpose the list of (dominant, nondominant) pairs into two tuples
doms, nondoms = zip(*data)

print "Dominant: " + repr(doms)
print "Nondominant: " + repr(nondoms)

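This assumes there are no other .csv files with a different format in the same directory (they would fail to parse). In general this needs more error checking, but it gets the idea across.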
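
As a rough idea of what that extra error checking could look like, the sketch below (Python 3, whereas the code above is Python 2) takes the hand label from the first field and the time from the last, and skips any line that does not fit instead of raising. The names `parse_trials` and `mean_or_none` are hypothetical, and silently skipping bad lines is only one possible policy.

```python
def parse_trials(lines):
    """Collect typing times per hand, counting lines that cannot be parsed."""
    vals = {'Dominant': [], 'Nondominant': []}
    skipped = 0
    for line in lines:
        fields = line.split()
        # expect at least a hand label and a time, with a known label first
        if len(fields) < 2 or fields[0] not in vals:
            skipped += 1
            continue
        try:
            t = float(fields[-1])
        except ValueError:
            skipped += 1
            continue
        vals[fields[0]].append(t)
    return vals, skipped

def mean_or_none(values):
    """Average a list of times, or return None when the list is empty."""
    return sum(values) / len(values) if values else None
```

Returning the `skipped` count lets the caller decide whether a file with many unparseable lines should be reported or dropped, and `mean_or_none` avoids a division by zero for a file with no trials for one hand.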

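This may look like another language if you are not familiar with numpy, but here is a solution that takes advantage of it (note the lack of loops!).

For testing, I created a second user data file with 1 second added to each entry.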

import glob 
import numpy as np 

usecols = [0, 3] # Columns to extract from data file 
str2num = {'Dominant': 0, 'Nondominant': 1} # Conversion dictionary 
converters = {0: (lambda s: str2num[s])} # Strings -> numbers 

userfiles = glob.glob('*.csv') 
userdat = np.array([np.loadtxt(f, usecols=usecols, converters=converters) 
        for f in userfiles]) 

# Create boolean arrays to filter desired results 
dom = userdat[..., 0] == 0 
nondom = userdat[..., 0] == 1 

# Filter and reshape to keep 'per-user' layout 
usercnt, _, colcnt = userdat.shape 
domdat = userdat[dom ].reshape(usercnt, -1, colcnt) 
nondomdat = userdat[nondom].reshape(usercnt, -1, colcnt) 

domavgs = np.average(domdat, axis=1)[:, 1] 
nondomavgs = np.average(nondomdat, axis=1)[:, 1] 

print 'Dominant averages by user: ', domavgs 
print 'Non-dominant averages by user:', nondomavgs 

Output:

Dominant averages by user:  [ 1.74707571 2.74707571] 
Non-dominant averages by user: [ 1.63763103 2.63763103] 

If you plan to do a lot of analysis, I would strongly recommend getting your head around numpy.

+0

It keeps giving me an IndexError: list index out of range, pointing to line 7, userdat = np.array([np.loadtxt(f, usecols=usecols) for f in userfiles]) – user1864662

+0

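OK, I've edited the solution so that your original file format will work. I suspect you hadn't added the extra column of '0' or '1' after the first column; now you don't need to. – subnivean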

+0

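One of the "mistakes" I regret most with Python was waiting so long to make numpy my main everyday tool. Once you get it, things become much easier... and some more become possible! – heltonbiker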