Python: Collecting data from a dataset

Question:

I am very new to Python, and I am trying to analyze the data in a dataset.

Say I have a dataset from a food tasting. For example:

{'review/appearance': 2.5, 'food/style': 'Cook', 'review/taste': 1.5, 'food/type': 'Vegetable' .... } 
{'review/appearance': 5.0, 'food/style': 'Instant', 'review/taste': 4.5, 'food/type': 'Noodle' ....}

I have 50,000 of these entries, and I am trying to find out how many different types of food there are with the code below:

from collections import Counter

data = list(parseData("/Path/to/my/dataset/file"))

def feature(datum):
    feat = [datum['food/type']]
    return feat

# make a separate list of food types
foodStyle = [feature(d) for d in data]

newFoodStyle = list()

# flatten the foodStyle list of lists into a single list
for sublist in foodStyle:
    for item in sublist:
        newFoodStyle.append(item)

uniqueFood = Counter(newFoodStyle)  # Counter maps each food type to how often it occurs

a = "There are %s types of food" % (len(uniqueFood))
print(a)

# print(uniqueFood) gives me Counter({'Noodle': 4352, 'Vegetable': 3412, ...})

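Now that I know how many different food types there are, I need help calculating the average 'review/taste' for each unique food type in the dataset.

I know there are 50K entries, so I would like to analyze only the 10 most-reviewed foods.

Do I need to loop over every entry, look it up against each key in uniqueFood, and build a separate list for each unique food, e.g. Noodle = list(), appending the corresponding 'review/taste' values to it?

Any tips or ideas on how to approach this would be greatly appreciated.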

Try using a set and taking its length: https://docs.python.org/2/library/sets.html – SatanDmytro
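
The comment's suggestion might look roughly like the sketch below. It uses the built-in set type, and `sample_data` is a made-up stand-in for the parsed dataset, since `parseData` itself is not shown in the question.

```python
# Hypothetical sample standing in for list(parseData(...)); not real data.
sample_data = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

# A set stores each food type only once, so its length is the
# number of distinct food types.
unique_types = set(d['food/type'] for d in sample_data)

print("There are %s types of food" % len(unique_types))
```

A set only answers "how many distinct types"; if the per-type counts are also needed, the Counter from the question is still the better fit.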

Answer

You can also use a dict:

data = list(parseData("/Path/to/my/dataset/file"))

# group the entries by food type
food_items = dict()
for datum in data:
    food_style = datum['food/type']
    if food_style in food_items:
        food_items[food_style].append(datum)
    else:
        food_items[food_style] = [datum]

# unique food list
unique_food = food_items.keys()

a = "There are %s types of food" % (len(unique_food))
print(a)

# avg 'review/taste' per food type (entries without a score count as 0)
avg = {
    key: sum(map(lambda i: i.get('review/taste', 0), values))/float(len(values))
    for key, values in food_items.items()
    if values
}
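
The question also asks for only the 10 most-reviewed foods. One possible way to extend the grouping idea is sketched below. The function name `top_reviewed_averages` and `sample_data` are hypothetical and used only for illustration, and the sketch assumes every entry has a 'food/type' key. Unlike the dict comprehension in this answer, it skips entries that have no 'review/taste' instead of counting them as 0.

```python
from collections import defaultdict

# Hypothetical sample standing in for list(parseData(...)); not real data.
sample_data = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

def top_reviewed_averages(data, n=10):
    """Return (food_type, review_count, mean_taste) for the n most-reviewed types."""
    tastes = defaultdict(list)
    for datum in data:
        # Skip entries without a taste score rather than treating them as 0.
        if 'review/taste' in datum:
            tastes[datum['food/type']].append(datum['review/taste'])

    # Rank food types by how many taste scores they collected, most first.
    ranked = sorted(tastes.items(), key=lambda kv: len(kv[1]), reverse=True)

    return [
        (food, len(scores), sum(scores) / float(len(scores)))
        for food, scores in ranked[:n]
    ]

for food, count, mean_taste in top_reviewed_averages(sample_data):
    print("%s: %d reviews, average taste %.2f" % (food, count, mean_taste))
```

This keeps one list of scores per food type, which is essentially the "separate list for each unique food" idea from the question, just stored in a single dict instead of in separate variables.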

Answer

I would suggest converting the data into a pandas DataFrame; sorting and averaging then become easy. For example:

import pandas as pd

datalist = []

dict1 = {'review/appearance': 2.5, 'food/style': 'Cook', 'review/taste': 1.5, 'food/type': 'Vegetable'}
dict2 = {'review/appearance': 5.0, 'food/style': 'Instant', 'review/taste': 4.5, 'food/type': 'Noodle'}
dict3 = {'review/appearance': 3.0, 'food/style': 'Instant', 'review/taste': 3.5, 'food/type': 'Noodle'}

datalist.append(dict1)
datalist.append(dict2)
datalist.append(dict3)

# one row per review, one column per key
resultsDF = pd.DataFrame(datalist)

print(resultsDF.head())

# mean taste score for each (style, type) combination
AverageResults = resultsDF.groupby(["food/style","food/type"])["review/taste"].mean().reset_index()
print(AverageResults)

Result:

  food/style  food/type  review/taste
0       Cook  Vegetable           1.5
1    Instant     Noodle           4.0
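
To narrow this down to the 10 most-reviewed food types, as the question asks, something along the following lines might work. It is only a sketch: `sample_data` is made-up illustration data, and it groups by 'food/type' alone rather than by style and type.

```python
import pandas as pd

# Hypothetical sample standing in for the parsed dataset; not real data.
sample_data = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

df = pd.DataFrame(sample_data)

# For each food type, count the taste scores and average them,
# then keep the 10 types with the most reviews.
summary = (
    df.groupby('food/type')['review/taste']
      .agg(['count', 'mean'])
      .sort_values('count', ascending=False)
      .head(10)
)

print(summary)
```

Here `count` ignores missing taste scores, so food types are ranked by how many entries actually carry a 'review/taste' value.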
