1. What's Cooking? - Exploratory Data Analysis

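This notebook walks through a step-by-step analysis of, and solution to, the given problem. It can also serve as a starting point for learning how to explore, manipulate, transform, and learn from text data. It is divided into three main parts:
+ Exploratory analysis - as a first step, we explore the main characteristics of the data with the help of graphical visualization;
+ Text processing - here we apply some basic text processing techniques to clean and prepare the data for model development;
+ Feature engineering and data modeling - in this part we extract features from the data and build a predictive model of the cuisine.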

In [1]:

# Data processing

import pandas as pd

import numpy as np

import json

from collections import Counter

import re

# Visualization

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

In [2]:

Data source: https://github.com/woshizhangrong/train_raw

train_df = pd.read_json('E:/Whats_Cooking/train.json') # store as dataframe objects

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39774 entries, 0 to 39773
Data columns (total 3 columns):
cuisine        39774 non-null object
id             39774 non-null int64
ingredients    39774 non-null object
dtypes: int64(1), object(2)
memory usage: 932.3+ KB

In [3]:

print("The training data consists of {} recipes".format(len(train_df)))

train_df.head()

The training data consists of 39774 recipes

Out[3]:

	cuisine	id	ingredients
0	greek	10259	[romaine lettuce, black olives, grape tomatoes...
1	southern_us	25693	[plain flour, ground pepper, salt, tomatoes, g...
2	filipino	20130	[eggs, pepper, salt, mayonaise, cooking oil, g...
3	indian	22213	[water, vegetable oil, wheat, salt]
4	indian	13162	[black pepper, shallots, cornflour, cayenne pe...

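We have imported the data as a DataFrame object, and the output above shows the first rows of the training sample. Each recipe is a separate row and has:

a unique identifier, the "id" column;
the type of cuisine, which is our target variable;
a list object containing the ingredients of the recipe, which will be the main source of explanatory variables for our classification problem.

Problem statement: predict the type of cuisine from the given data (the ingredients). This is a classification task that requires text processing and analysis.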

In [4]:

#Now let's explore a little bit more about the target variable

print("Number of cuisine categories: {}".format(len(train_df.cuisine.unique())))

train_df.cuisine.unique()

Number of cuisine categories: 20

Out[4]:

array(['greek', 'southern_us', 'filipino', 'indian', 'jamaican',
       'spanish', 'italian', 'mexican', 'chinese', 'british', 'thai',
       'vietnamese', 'cajun_creole', 'brazilian', 'french', 'japanese',
       'irish', 'korean', 'moroccan', 'russian'], dtype=object)

There are 20 different categories (cuisines) to predict.
This means the problem at hand is a multi-class classification.

In [5]:

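# Order the bars by frequency; value_counts() already sorts in descending order,
# and using its index avoids relying on the column name produced by reset_index().
sns.countplot(y=train_df.cuisine, order=train_df.cuisine.value_counts().index)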

plt.title("Cuisine Distribution")

plt.show()

[Figure: horizontal bar chart titled "Cuisine Distribution", showing the number of recipes per cuisine]

In [6]:

train_df.cuisine.value_counts()

Out[6]:

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64
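
The counts above are far from uniform, so raw accuracy needs a reference point. The sketch below is not part of the original notebook; it copies the counts from the output above into a dictionary (the variable names are illustrative) and computes the accuracy a model would get by always predicting the most frequent cuisine.

```python
# Recipe counts per cuisine, copied from the value_counts() output above.
cuisine_counts = {
    "italian": 7838, "mexican": 6438, "southern_us": 4320, "indian": 3003,
    "chinese": 2673, "french": 2646, "cajun_creole": 1546, "thai": 1539,
    "japanese": 1423, "greek": 1175, "spanish": 989, "korean": 830,
    "vietnamese": 825, "moroccan": 821, "british": 804, "filipino": 755,
    "irish": 667, "jamaican": 526, "russian": 489, "brazilian": 467,
}

total = sum(cuisine_counts.values())
majority_cuisine = max(cuisine_counts, key=cuisine_counts.get)
majority_share = cuisine_counts[majority_cuisine] / total

print("Total recipes:", total)                       # 39774
print("Majority class:", majority_cuisine)           # italian
print("Majority-class baseline accuracy: {:.3f}".format(majority_share))  # 0.197
```

Any classifier trained on this data should therefore be judged against roughly 0.197 rather than against 1/20.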

In [7]:

print('Maximum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().max())

print('Minimum Number of Ingredients in a Dish: ',train_df['ingredients'].str.len().min())

Maximum Number of Ingredients in a Dish:  65
Minimum Number of Ingredients in a Dish:  1

What are the most common ingredients in the training sample? How many unique ingredients can we find in the dataset?
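
Both questions can be answered with a single Counter over the flattened ingredient lists. The following is a minimal sketch on a hypothetical three-recipe dataset (`recipes` stands in for `train_df['ingredients']`; it is not the real data).

```python
from collections import Counter
from itertools import chain

# Hypothetical mini-dataset standing in for train_df['ingredients'].
recipes = [
    ["salt", "olive oil", "garlic"],
    ["salt", "water", "sugar"],
    ["garlic", "soy sauce", "salt"],
]

# Flatten the list of lists and count every occurrence of each ingredient.
ingredient_counts = Counter(chain.from_iterable(recipes))

print("Unique ingredients:", len(ingredient_counts))      # 6
print("Most common:", ingredient_counts.most_common(2))   # [('salt', 3), ('garlic', 2)]
```

On the real data, passing `train_df['ingredients']` to `chain.from_iterable` should give the corresponding totals.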

2. Text Processing

We continue the analysis with some simple data processing. The aim is to prepare the data for model development.

In [8]:

# Prepare the data

features = [] # list of lists containing the recipes

for item in train_df['ingredients']:

    features.append(item)

In [9]:

ingrCounter = Counter()

features_processed= [] # here we will store the preprocessed training features

for item in features:

    newitem = []

    for ingr in item:

        ingr = ingr.lower() # Case Normalization - convert all to lower case (str.lower() returns a new string)

        ingr = re.sub("[^a-zA-Z]"," ",ingr) # Remove punctuation, digits or special characters

        ingr = re.sub((r'\b(oz|ounc|ounce|pound|lb|inch|inches|kg|to)\b'), ' ', ingr) # Remove different units

        ingrCounter[ingr] += 1

        newitem.append(ingr)

    features_processed.append(newitem)
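
The substitutions in this loop replace characters with spaces, which can leave leading, trailing or doubled spaces, so the same ingredient may end up under slightly different keys. A minimal sketch of the same steps wrapped in a function, with an extra whitespace-collapsing step, is shown below. The function name `normalize_ingredient` and the sample strings are illustrative assumptions; the unit list is the one used in the loop.

```python
import re

# Same unit pattern as in the loop above.
UNITS = r"\b(oz|ounc|ounce|pound|lb|inch|inches|kg|to)\b"

def normalize_ingredient(raw):
    """Lower-case, strip non-letters and unit words, then tidy whitespace."""
    text = raw.lower()                         # case normalization
    text = re.sub(r"[^a-z]", " ", text)        # digits, punctuation -> space
    text = re.sub(UNITS, " ", text)            # drop unit words
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of spaces

print(normalize_ingredient("All-Purpose Flour"))                # all purpose flour
print(normalize_ingredient("(10 oz.) Frozen Chopped Spinach"))  # frozen chopped spinach
```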

In [10]:

ingr_df = pd.DataFrame(ingrCounter.most_common(15),columns=['ingredient','count'])

ingr_df

Out[10]:

	ingredient	count
0	salt	18049
1	onions	7972
2	olive oil	7972
3	water	7457
4	garlic	7380
5	sugar	6434
6	garlic cloves	6237
7	butter	4848
8	ground black pepper	4785
9	all purpose flour	4632
10	pepper	4438
11	vegetable oil	4385
12	eggs	3388
13	soy sauce	3296
14	kosher salt	3113

In [11]:

#f, ax=plt.subplots(figsize=(12,20))

sns.barplot(y=ingr_df['ingredient'].values, x=ingr_df['count'].values,orient='h')

#plt.ylabel('Ingredient', fontsize=12)

#plt.xlabel('Count', fontsize=12)

#plt.xticks(rotation='horizontal')

#plt.yticks(fontsize=12)

plt.title("Ingredient Count")

plt.show()

[Figure: horizontal bar chart titled "Ingredient Count", showing the 15 most common ingredients]

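Salt appears to be the most commonly used ingredient, which is no surprise at all! We also find water, onions, garlic and olive oil, which is not surprising either. :)

Salt, water, onions and garlic are common across cuisines, so we expect them to have poor predictive power for identifying the type of cuisine.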
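
Inverse document frequency (IDF) is one way to encode this expectation numerically: a term that occurs in every recipe gets the smallest possible weight. The sketch below uses hypothetical toy recipes and the smoothed formula ln((1 + n) / (1 + df)) + 1, which is the one scikit-learn documents for TfidfVectorizer with its default `smooth_idf=True`. The helper `smoothed_idf` is written here for illustration and is not a library function.

```python
import math

# Hypothetical toy recipes, each reduced to a set of word tokens.
recipes = [
    {"salt", "olive", "oil", "feta"},
    {"salt", "soy", "sauce", "ginger"},
    {"salt", "tortillas", "cumin"},
]

def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)                           # number of documents
    df = sum(term in doc for doc in docs)   # documents containing the term
    return math.log((1 + n) / (1 + df)) + 1

print("idf(salt)  = {:.3f}".format(smoothed_idf("salt", recipes)))   # 1.000
print("idf(cumin) = {:.3f}".format(smoothed_idf("cumin", recipes)))  # 1.693
```

"salt" occurs in all three toy recipes and receives the minimum weight of 1, while "cumin" occurs in only one and is weighted more heavily.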

3. Feature Engineering and Data Modeling

In [12]:

train_df['seperated_ingredients'] = train_df['ingredients'].apply(','.join)

In [13]:

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(binary=True).fit(train_df['seperated_ingredients'].values)

X_train_vectorized = vect.transform(train_df['seperated_ingredients'].values)

X_train_vectorized = X_train_vectorized.astype('float')

In [14]:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y_transformed = encoder.fit_transform(train_df.cuisine)

In [15]:

print(X_train_vectorized)

y_transformed

  (0, 2798)	0.15183517837377775
  (0, 2427)	0.23007896012035983
  (0, 2318)	0.3426671291173114
  (0, 2202)	0.23913220198081458
  (0, 2017)	0.10208411357610164
  (0, 1889)	0.1645493089953018
  (0, 1885)	0.26100924108701357
  (0, 1541)	0.2663871237012894
  (0, 1180)	0.35031170238526027
  (0, 1103)	0.10531073154596084
  (0, 1097)	0.38853112215987895
  (0, 967)	0.3040361765035925
  (0, 745)	0.3343204746101372
  (0, 528)	0.14568369866765699
  (0, 251)	0.1398962004921347
  (0, 185)	0.20748802168948122
  (1, 3012)	0.30913470576050534
  (1, 2905)	0.23719808692764152
  (1, 2798)	0.20426659039473835
  (1, 2775)	0.3034717400305941
  (1, 2373)	0.12082052495781231
  (1, 2100)	0.3831099504645736
  (1, 2017)	0.1373355900588895
  (1, 1877)	0.1300036033814326
  (1, 1724)	0.23580432530539203
  :	:
  (39772, 350)	0.1941573519292017
  (39772, 303)	0.27894483473192366
  (39772, 287)	0.13398798263813363
  (39772, 149)	0.13758614056520396
  (39773, 2971)	0.1975464041226418
  (39773, 2798)	0.17872306445833877
  (39773, 2672)	0.15854644611127255
  (39773, 2373)	0.1057119249320061
  (39773, 2316)	0.4290979107163017
  (39773, 2017)	0.12016178204711027
  (39773, 1898)	0.2568680228381645
  (39773, 1890)	0.15485523022733863
  (39773, 1368)	0.2873688348522114
  (39773, 1215)	0.14683027334043636
  (39773, 1201)	0.1848275373976143
  (39773, 1103)	0.12395978892263136
  (39773, 1053)	0.1468886450663615
  (39773, 869)	0.22475778151656522
  (39773, 602)	0.20502059327274608
  (39773, 583)	0.19438870133941094
  (39773, 556)	0.2554909855209906
  (39773, 551)	0.2728016336867552
  (39773, 496)	0.27507217175058174
  (39773, 251)	0.16466986060689143
  (39773, 205)	0.23693690350347973

Out[15]:

array([ 6, 16,  4, ...,  8,  3, 13], dtype=int64)
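
The integer codes are easier to read once mapped back to cuisine names. LabelEncoder assigns codes in sorted order of the class labels, and `inverse_transform` reverses the mapping. The sketch below refits a separate encoder on the 20 labels listed earlier, so it runs without the dataset; in the notebook itself the fitted `encoder` could be used directly.

```python
from sklearn.preprocessing import LabelEncoder

# The 20 cuisine labels shown in the output of cell 4.
cuisines = [
    'greek', 'southern_us', 'filipino', 'indian', 'jamaican',
    'spanish', 'italian', 'mexican', 'chinese', 'british', 'thai',
    'vietnamese', 'cajun_creole', 'brazilian', 'french', 'japanese',
    'irish', 'korean', 'moroccan', 'russian',
]

demo_encoder = LabelEncoder().fit(cuisines)

# First three codes from the output above: 6, 16, 4.
decoded = list(demo_encoder.inverse_transform([6, 16, 4]))
print(decoded)
```

The three decoded labels are greek, southern_us and filipino, which matches the first three rows shown by `train_df.head()`.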

Logistic Regression

In [16]:

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_train_vectorized, y_transformed , random_state = 0)

lr1 = LogisticRegression(C=10,dual=False)

lr1.fit(X_train , y_train)

lr1.score(X_test, y_test)

Out[16]:

0.794147224456959
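
The score above comes from a single random split, and the vectorizer was fitted on all rows before splitting, so the IDF weights have seen the held-out recipes. One possible refinement, sketched here on hypothetical toy data rather than the real dataset, is to wrap the vectorizer and the classifier in a Pipeline and score it with cross-validation, so the vectorizer is refitted on the training folds only. The toy texts, labels, `cv=2` and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: comma-joined ingredient strings and their cuisines.
texts = [
    "soy sauce,ginger,rice", "soy sauce,sesame oil,scallions",
    "ginger,rice,scallions", "sesame oil,soy sauce,rice",
    "tortillas,cumin,salsa", "black beans,cumin,tortillas",
    "salsa,avocado,tortillas", "cumin,black beans,avocado",
]
labels = ["chinese"] * 4 + ["mexican"] * 4

# Vectorizer and classifier are fitted together inside each training fold.
model = make_pipeline(
    TfidfVectorizer(binary=True),
    LogisticRegression(C=10, max_iter=1000),
)

scores = cross_val_score(model, texts, labels, cv=2)
print("Fold accuracies:", scores)
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

On the real data, `train_df['seperated_ingredients']` and `train_df['cuisine']` would take the place of `texts` and `labels`, and a larger `cv` would be more appropriate.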
