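Second Example: Mixed Rice
1. What's Cooking? Exploratory Data Analysis
This notebook provides a step-by-step analysis and solution for the given problem. It can also serve as a good starting point for learning how to explore, manipulate, transform, and learn from text data. It is divided into three main parts:
+ Exploratory analysis: as a first step, we use graphical visualization to explore the main characteristics of the data;
+ Text processing: here we apply some basic text processing techniques to clean and prepare the data for model development;
+ Feature engineering and data modeling: in this part, we extract features from the data and build a predictive model for the cuisine.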
In [1]:
# Data processing
import pandas as pd
import numpy as np
import json
from collections import Counter
import re
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Data source: https://github.com/woshizhangrong/train_raw
train_df = pd.read_json('E:/Whats_Cooking/train.json') # store as dataframe objects
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39774 entries, 0 to 39773
Data columns (total 3 columns):
cuisine 39774 non-null object
id 39774 non-null int64
ingredients 39774 non-null object
dtypes: int64(1), object(2)
memory usage: 932.3+ KB
In [3]:
print("The training data consists of {} recipes".format(len(train_df)))
train_df.head()
The training data consists of 39774 recipes
Out[3]:
 | cuisine | id | ingredients |
---|---|---|---|
0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |
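We have imported the data as a DataFrame object, and the output above shows what the first training samples look like. Each recipe is a separate row and has:
- a unique identifier, the "id" column;
- the cuisine type, which is our target variable;
- a list object containing the ingredients of the recipe, which will be the main source of explanatory variables in our classification problem.
Problem statement: predict the cuisine type from the given data (the ingredients). This is a classification task that requires text processing and analysis.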
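To make the record layout concrete, here is a minimal sketch of how one recipe might look in the raw JSON and how it splits into an (ingredients, cuisine) pair. The first record is row 3 of the preview above; the second record and all variable names are made up for illustration, and the array-of-objects layout is an assumption about train.json.

```python
import json

# Assumed layout of train.json: a JSON array with one object per recipe.
# The first record is row 3 of the preview; the second one is hypothetical.
raw = """
[
  {"id": 22213, "cuisine": "indian",
   "ingredients": ["water", "vegetable oil", "wheat", "salt"]},
  {"id": 1, "cuisine": "italian",
   "ingredients": ["olive oil", "garlic", "tomatoes"]}
]
"""

records = json.loads(raw)

# Explanatory variable: the ingredient list. Target variable: the cuisine.
pairs = [(rec["ingredients"], rec["cuisine"]) for rec in records]
for ingredients, cuisine in pairs:
    print(cuisine, "<-", ingredients)
```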
In [4]:
#Now let's explore a little bit more about the target variable
print("Number of cuisine categories: {}".format(len(train_df.cuisine.unique())))
train_df.cuisine.unique()
Number of cuisine categories: 20
Out[4]:
array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican',
'spanish', 'italian', 'mexican', 'chinese', 'british', 'thai',
'vietnamese', 'cajun_creole', 'brazilian', 'french', 'japanese',
'irish', 'korean', 'moroccan', 'russian'], dtype=object)
There are 20 different categories (cuisines) that we will predict.
This means the problem at hand is a multi-class classification.
In [5]:
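# value_counts() is already sorted by frequency, so its index gives the bar order.
# (.reset_index()["index"] raises a KeyError on pandas 2.x, where that column is named "cuisine".)
sns.countplot(y=train_df.cuisine, order=train_df.cuisine.value_counts().index)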
plt.title("Cuisine Distribution")
plt.show()
In [6]:
train_df.cuisine.value_counts()
Out[6]:
italian 7838
mexican 6438
southern_us 4320
indian 3003
chinese 2673
french 2646
cajun_creole 1546
thai 1539
japanese 1423
greek 1175
spanish 989
korean 830
vietnamese 825
moroccan 821
british 804
filipino 755
irish 667
jamaican 526
russian 489
brazilian 467
Name: cuisine, dtype: int64
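The counts above show a clearly imbalanced target. As a rough sketch in plain Python, with the counts copied from Out[6], the share of the largest class gives the accuracy that a trivial model predicting "italian" for every recipe would reach on this sample. That is a useful floor for any later classifier. The variable names are illustrative.

```python
# Counts copied from Out[6].
cuisine_counts = {
    "italian": 7838, "mexican": 6438, "southern_us": 4320, "indian": 3003,
    "chinese": 2673, "french": 2646, "cajun_creole": 1546, "thai": 1539,
    "japanese": 1423, "greek": 1175, "spanish": 989, "korean": 830,
    "vietnamese": 825, "moroccan": 821, "british": 804, "filipino": 755,
    "irish": 667, "jamaican": 526, "russian": 489, "brazilian": 467,
}

total = sum(cuisine_counts.values())
top_cuisine, top_count = max(cuisine_counts.items(), key=lambda kv: kv[1])

# Accuracy of always predicting the most frequent cuisine.
baseline = top_count / total

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = top_count / min(cuisine_counts.values())

print(total)                              # 39774
print(f"{top_cuisine}: {baseline:.3f}")   # italian: 0.197
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```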
In [7]:
print('Maximum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().max())
print('Minimum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().min())
Maximum Number of Ingredients in a Dish: 65
Minimum Number of Ingredients in a Dish: 1
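The minimum and maximum alone say little about the typical recipe. Below is a minimal sketch of a fuller summary, using a tiny made-up DataFrame in place of train_df. The name toy_df and its rows are hypothetical; only the column name matches the notebook.

```python
import pandas as pd

# Hypothetical stand-in for train_df.
toy_df = pd.DataFrame({
    "ingredients": [
        ["water", "vegetable oil", "wheat", "salt"],
        ["salt"],
        ["olive oil", "garlic", "tomatoes"],
    ]
})

# .str.len() also works on list-valued cells and returns each list's length.
n_ingredients = toy_df["ingredients"].str.len()
print(n_ingredients.describe())

# Recipes with a single ingredient are worth inspecting by hand.
print(toy_df[n_ingredients == 1])
```

On the real data, the same two calls on train_df would show the median recipe size and the one-ingredient recipes behind the minimum of 1.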
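What are the most common ingredients in the training sample? How many unique ingredients can we find in the dataset?
2. Text Processing
We continue the analysis with some simple data processing. The goal is to prepare the data for model development.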
In [8]:
# Prepare the data
features = []  # list of lists containing the recipes
for item in train_df['ingredients']:
    features.append(item)
In [9]:
ingrCounter = Counter()
features_processed = []  # here we will store the preprocessed training features
for item in features:
    newitem = []
    for ingr in item:
        ingr = ingr.lower()  # Case normalization: str.lower() returns a new string, so assign it back
        ingr = re.sub("[^a-zA-Z]", " ", ingr)  # Replace punctuation, digits and special characters with spaces
        ingr = re.sub(r'\b(oz|ounc|ounce|pound|lb|inch|inches|kg|to)\b', ' ', ingr)  # Remove unit words (and "to")
        ingrCounter[ingr] += 1
        newitem.append(ingr)
    features_processed.append(newitem)
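The cleaning steps in the cell above can also be wrapped in a small helper so they are easy to try on single strings. This is a sketch, not the notebook's code: clean_ingredient and UNITS are hypothetical names, and the final whitespace collapse is an extra step that the loop above does not perform.

```python
import re

# Same unit pattern as in the notebook cell, compiled once.
UNITS = re.compile(r"\b(oz|ounc|ounce|pound|lb|inch|inches|kg|to)\b")

def clean_ingredient(ingr):
    """Lowercase, keep letters only, drop unit words, collapse whitespace."""
    ingr = ingr.lower()
    ingr = re.sub(r"[^a-z]", " ", ingr)
    ingr = UNITS.sub(" ", ingr)
    return " ".join(ingr.split())  # extra step, not in the notebook loop

print(clean_ingredient("1 lb. Ground Beef"))   # ground beef
print(clean_ingredient("all-purpose flour"))   # all purpose flour
print(clean_ingredient("tomatoes"))            # tomatoes ("to" only matches as a whole word)
```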
In [10]:
ingr_df = pd.DataFrame(ingrCounter.most_common(15),columns=['ingredient','count'])
ingr_df
Out[10]:
 | ingredient | count |
---|---|---|
0 | salt | 18049 |
1 | onions | 7972 |
2 | olive oil | 7972 |
3 | water | 7457 |
4 | garlic | 7380 |
5 | sugar | 6434 |
6 | garlic cloves | 6237 |
7 | butter | 4848 |
8 | ground black pepper | 4785 |
9 | all purpose flour | 4632 |
10 | pepper | 4438 |
11 | vegetable oil | 4385 |
12 | eggs | 3388 |
13 | soy sauce | 3296 |
14 | kosher salt | 3113 |
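The Counter built above can also tell us how many distinct ingredients the data contains, since its length is the number of unique keys. Here is a minimal sketch on made-up recipes. The names toy_recipes and toy_counter are hypothetical; on the real data the equivalent call would be len(ingrCounter).

```python
from collections import Counter

# Hypothetical recipes, standing in for features_processed.
toy_recipes = [
    ["salt", "olive oil", "garlic"],
    ["salt", "water", "wheat"],
    ["soy sauce", "garlic", "salt"],
]

# Count every ingredient occurrence across all recipes.
toy_counter = Counter(ingr for recipe in toy_recipes for ingr in recipe)

print(len(toy_counter))            # number of distinct ingredients: 6
print(toy_counter.most_common(2))  # [('salt', 3), ('garlic', 2)]
```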
In [11]:
#f, ax=plt.subplots(figsize=(12,20))
sns.barplot(y=ingr_df['ingredient'].values, x=ingr_df['count'].values,orient='h')
#plt.ylabel('Ingredient', fontsize=12)
#plt.xlabel('Count', fontsize=12)
#plt.xticks(rotation='horizontal')
#plt.yticks(fontsize=12)
plt.title("Ingredient Count")
plt.show()
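Salt appears to be the most commonly used ingredient, which is no surprise! We also find water, onions, garlic and olive oil, which is not surprising either. :)
- Salt, water, onions and garlic are common ingredients, so we expect them to have poor predictive power for identifying the cuisine type.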
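This intuition is what the inverse document frequency (IDF) part of TF-IDF captures: an ingredient that appears in almost every recipe gets a low weight. The sketch below computes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, by hand on four made-up recipes. The formula is the smoothed variant that scikit-learn documents as its default. The recipes and the names toy_recipes and idf are illustrative only.

```python
import math

# Hypothetical recipes; each one is a set of ingredients.
toy_recipes = [
    {"salt", "feta cheese", "olive oil"},
    {"salt", "soy sauce", "garlic"},
    {"salt", "olive oil", "garlic"},
    {"salt", "water", "wheat"},
]

def idf(term, recipes):
    """Smoothed inverse document frequency of one ingredient."""
    n = len(recipes)                        # number of recipes
    df = sum(term in r for r in recipes)    # recipes containing the term
    return math.log((1 + n) / (1 + df)) + 1

# The more recipes contain an ingredient, the lower its weight.
for term in ["salt", "olive oil", "feta cheese"]:
    print(term, round(idf(term, toy_recipes), 3))
```

Salt appears in every toy recipe and gets the lowest possible weight of 1.0, while the ingredient seen only once gets the highest.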
3. Feature Engineering and Data Modeling
In [12]:
train_df['seperated_ingredients'] = train_df['ingredients'].apply(','.join)
In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(binary=True).fit(train_df['seperated_ingredients'].values)
X_train_vectorized = vect.transform(train_df['seperated_ingredients'].values)
X_train_vectorized = X_train_vectorized.astype('float')
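One detail of the cell above is worth seeing on a small example. With default settings, TfidfVectorizer splits text into single words, so a multi-word ingredient such as "olive oil" becomes the two features "olive" and "oil". The sketch below contrasts that with a comma-based tokenizer that keeps each ingredient whole. The two documents and the helper split_on_commas are hypothetical, and whether whole-ingredient features would help the model is not tested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical recipes, already joined with commas as in the notebook.
docs = ["olive oil,salt,feta cheese", "soy sauce,salt,sesame oil"]

# Default tokenization: one feature per word.
word_vect = TfidfVectorizer(binary=True).fit(docs)
print(sorted(word_vect.vocabulary_))

def split_on_commas(text):
    """Treat every comma-separated ingredient as a single token."""
    return [token.strip() for token in text.split(",")]

# Custom tokenizer: one feature per ingredient.
# token_pattern=None signals that the default word pattern is not used.
phrase_vect = TfidfVectorizer(
    binary=True, tokenizer=split_on_commas, token_pattern=None
).fit(docs)
print(sorted(phrase_vect.vocabulary_))
```

The word-level vocabulary has 8 entries and shares "oil" between the two recipes, while the ingredient-level vocabulary has 5 entries and keeps "olive oil" and "sesame oil" apart.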
In [14]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_transformed = encoder.fit_transform(train_df.cuisine)
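LabelEncoder assigns integer codes to the class names in sorted order. A plain-Python sketch of the same mapping, using the 20 cuisine names from Out[4], makes the codes easy to read back. The names cuisines, code_of and name_of are made up for this illustration.

```python
# Cuisine names copied from Out[4].
cuisines = [
    "greek", "southern_us", "filipino", "indian", "jamaican",
    "spanish", "italian", "mexican", "chinese", "british", "thai",
    "vietnamese", "cajun_creole", "brazilian", "french", "japanese",
    "irish", "korean", "moroccan", "russian",
]

# Sorted class names get the codes 0, 1, 2, ...
code_of = {name: code for code, name in enumerate(sorted(cuisines))}
name_of = {code: name for name, code in code_of.items()}

# First three recipes in the preview: greek, southern_us, filipino.
print([code_of[c] for c in ["greek", "southern_us", "filipino"]])  # [6, 16, 4]
print(name_of[9])                                                  # italian
```

With the fitted encoder itself, encoder.inverse_transform turns predicted codes back into cuisine names.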
In [15]:
print(X_train_vectorized)
y_transformed
(0, 2798) 0.15183517837377775
(0, 2427) 0.23007896012035983
(0, 2318) 0.3426671291173114
(0, 2202) 0.23913220198081458
(0, 2017) 0.10208411357610164
(0, 1889) 0.1645493089953018
(0, 1885) 0.26100924108701357
(0, 1541) 0.2663871237012894
(0, 1180) 0.35031170238526027
(0, 1103) 0.10531073154596084
(0, 1097) 0.38853112215987895
(0, 967) 0.3040361765035925
(0, 745) 0.3343204746101372
(0, 528) 0.14568369866765699
(0, 251) 0.1398962004921347
(0, 185) 0.20748802168948122
(1, 3012) 0.30913470576050534
(1, 2905) 0.23719808692764152
(1, 2798) 0.20426659039473835
(1, 2775) 0.3034717400305941
(1, 2373) 0.12082052495781231
(1, 2100) 0.3831099504645736
(1, 2017) 0.1373355900588895
(1, 1877) 0.1300036033814326
(1, 1724) 0.23580432530539203
: :
(39772, 350) 0.1941573519292017
(39772, 303) 0.27894483473192366
(39772, 287) 0.13398798263813363
(39772, 149) 0.13758614056520396
(39773, 2971) 0.1975464041226418
(39773, 2798) 0.17872306445833877
(39773, 2672) 0.15854644611127255
(39773, 2373) 0.1057119249320061
(39773, 2316) 0.4290979107163017
(39773, 2017) 0.12016178204711027
(39773, 1898) 0.2568680228381645
(39773, 1890) 0.15485523022733863
(39773, 1368) 0.2873688348522114
(39773, 1215) 0.14683027334043636
(39773, 1201) 0.1848275373976143
(39773, 1103) 0.12395978892263136
(39773, 1053) 0.1468886450663615
(39773, 869) 0.22475778151656522
(39773, 602) 0.20502059327274608
(39773, 583) 0.19438870133941094
(39773, 556) 0.2554909855209906
(39773, 551) 0.2728016336867552
(39773, 496) 0.27507217175058174
(39773, 251) 0.16466986060689143
(39773, 205) 0.23693690350347973
Out[15]:
array([ 6, 16, 4, ..., 8, 3, 13], dtype=int64)
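Each line of the printed sparse matrix reads (row, column) weight: the row is the recipe, the column is a vocabulary index, and the weight is the TF-IDF value. TfidfVectorizer applies L2 normalization by default, so the squared weights of one recipe should sum to 1. Here is a quick check on the 16 weights printed for row 0, assuming the printout shows that row in full:

```python
# Weights of row 0, copied from the printed sparse matrix.
row0 = [
    0.15183517837377775, 0.23007896012035983, 0.3426671291173114,
    0.23913220198081458, 0.10208411357610164, 0.1645493089953018,
    0.26100924108701357, 0.2663871237012894, 0.35031170238526027,
    0.10531073154596084, 0.38853112215987895, 0.3040361765035925,
    0.3343204746101372, 0.14568369866765699, 0.1398962004921347,
    0.20748802168948122,
]

# Squared L2 norm of the row; 1.0 means the row is unit length.
sum_sq = sum(w * w for w in row0)
print(len(row0), round(sum_sq, 6))  # 16 1.0
```

Because every row has unit length, recipes with many ingredients do not get larger feature vectors than short ones.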
Logistic Regression
In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X_train_vectorized, y_transformed , random_state = 0)
lr1 = LogisticRegression(C=10,dual=False)
lr1.fit(X_train , y_train)
lr1.score(X_test, y_test)
Out[16]:
0.794147224456959
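The score above is plain accuracy on the held-out part of the split. As a sanity check in plain Python, the reported figure can be tied to a whole number of correctly classified recipes. The check assumes that train_test_split used a 25% test share and rounded the test size up. Both are assumptions about scikit-learn's defaults, not something the notebook states.

```python
import math

n_samples = 39774                      # rows in train_df
n_test = math.ceil(0.25 * n_samples)   # assumed default test share, rounded up
n_train = n_samples - n_test

accuracy = 0.794147224456959           # value from Out[16]
n_correct = round(accuracy * n_test)   # accuracy = correct / test size

print(n_train, n_test)   # 29830 9944
print(n_correct)         # 7897
```

Under these assumptions the model labels 7897 of 9944 held-out recipes correctly, and 7897 / 9944 reproduces the printed score.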