Python-sklearn: A First Machine Learning Example (Part 7)

We finally have a classifier of our own. Next, let's look at its performance graphically.

In [37]:
dt_scores = cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10)

sb.boxplot(dt_scores)
sb.stripplot(dt_scores, jitter=True, color='white')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x113cd4b38>
[Figure: box plot of the decision tree's 10-fold cross-validation scores, with the individual fold scores overlaid as white points]

Hold on, we're not done yet. We should still compare this decision tree against other classification algorithms (classifiers) to see how well it really performs.

Next, we'll run a comparison using a random forest classifier.

We already know that random forest classifiers usually perform better than individual decision trees. The chronic weakness of decision trees is overfitting: a tree can classify its training set almost perfectly, yet perform poorly on the testing set, that is, on data it has never seen.
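
To make the overfitting claim concrete, here is a minimal sketch (not part of the original notebook) using scikit-learn's built-in copy of the iris data. An unconstrained tree typically scores perfectly on its training split but noticeably lower on the held-out split:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75)

# No depth limit, so the tree is free to memorize the training set
overfit_tree = DecisionTreeClassifier()
overfit_tree.fit(X_train, y_train)

print('Training accuracy: {}'.format(overfit_tree.score(X_train, y_train)))
print('Testing accuracy: {}'.format(overfit_tree.score(X_test, y_test)))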

How a random forest classifier works: it builds an ensemble of decision trees, where each tree's training set is drawn at random, with replacement, from the full training set, while each split in a tree considers only a random subset of the features, drawn without replacement. By having this ensemble of trees work together, it achieves higher classification accuracy.
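
To illustrate the "with replacement" part, here is a tiny sketch (illustrative only, not from the notebook) of a bootstrap sample over ten row indices. Some rows appear more than once and some not at all, so every tree sees a slightly different training set:

import numpy as np

row_indices = np.arange(10)  # stand-in for the row indices of a training set
bootstrap_sample = np.random.choice(row_indices, size=len(row_indices), replace=True)
print(bootstrap_sample)  # e.g. [3 3 7 0 9 1 4 4 8 2]: duplicates and omissions are expected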

Let's see whether the random forest classifier really does perform better.

The beauty of scikit-learn is that the training, testing, and parameter-tuning workflow is identical for every model, so all we need to do is swap in the new classifier.
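
As a quick sketch of that point (assuming the all_inputs and all_classes arrays built in the earlier steps of this series), swapping models is a one-line change, because every scikit-learn estimator exposes the same interface:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

# Any estimator can be dropped into the same cross-validation call
for classifier in [DecisionTreeClassifier(), RandomForestClassifier()]:
    scores = cross_val_score(classifier, all_inputs, all_classes, cv=10)
    print('{}: mean accuracy {:.3f}'.format(type(classifier).__name__, scores.mean()))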

In [40]:
from sklearn.ensemble import RandomForestClassifier

random_forest_classifier = RandomForestClassifier()

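# The hyperparameter grid to search; max_features goes up to 4 because the iris data has only four features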
parameter_grid = {'n_estimators': [5, 10, 25, 50],
                  'criterion': ['gini', 'entropy'],
                  'max_features': [1, 2, 3, 4],
                  'warm_start': [True, False]}

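# Stratified folds preserve the class proportions of the full data set in every fold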
cross_validation = StratifiedKFold(all_classes, n_folds=10)

grid_search = GridSearchCV(random_forest_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(all_inputs, all_classes)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

grid_search.best_estimator_
Best score: 0.9731543624161074
Best parameters: {'n_estimators': 5, 'max_features': 3, 'warm_start': True, 'criterion': 'gini'}
Out[40]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=True)

Now we can compare the performance of the two classifiers:

In [42]:
random_forest_classifier = grid_search.best_estimator_

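# Score each classifier with 10-fold cross-validation and gather the results into one DataFrame for plotting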
rf_df = pd.DataFrame({'accuracy': cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10),
                       'classifier': ['Random Forest'] * 10})
dt_df = pd.DataFrame({'accuracy': cross_val_score(decision_tree_classifier, all_inputs, all_classes, cv=10),
                      'classifier': ['Decision Tree'] * 10})
both_df = rf_df.append(dt_df)

sb.boxplot(x='classifier', y='accuracy', data=both_df)
sb.stripplot(x='classifier', y='accuracy', data=both_df, jitter=True, color='white')
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1141bff28>
[Figure: side-by-side box plots comparing the 10-fold cross-validation accuracy of the random forest and decision tree classifiers]

How about that? The two appear to perform about the same. This is probably because our data set has only 4 features to classify on, whereas a random forest tends to show its advantage only when there are hundreds of candidate features. In other words, there just isn't much room for improvement on this data set.

Step 6: Reproducibility

Ensuring that our work is reproducible is the final, and perhaps the most important, step of any analysis. We shouldn't stake much on a finding that we cannot reproduce; if an analysis cannot be reproduced, arguably we shouldn't have done it in the first place.

This notebook records every step we took, in full, and explains why we took each one.
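
One caveat on exact repeatability: a random forest is stochastic, and the grid-search winner above was left with random_state=None, so its scores will vary slightly between runs. A small sketch of how to pin it down (the seed value 42 is an arbitrary choice of mine):

from sklearn.ensemble import RandomForestClassifier

# Fixing random_state makes the classifier's behavior identical on every run
reproducible_classifier = RandomForestClassifier(n_estimators=5, max_features=3,
                                                 warm_start=True, random_state=42)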

In [43]:
%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
Installed watermark.py. To use it, type:
  %load_ext watermark
In [44]:
%load_ext watermark
In [45]:
%watermark -a 'Randal S. Olson' -nmv --packages numpy,pandas,scikit-learn,matplotlib,Seaborn
Randal S. Olson Fri Aug 21 2015 

CPython 3.4.3
IPython 3.2.1

numpy 1.9.2
pandas 0.16.2
scikit-learn 0.16.1
matplotlib 1.4.3
Seaborn 0.6.0

compiler   : GCC 4.2.1 (Apple Inc. build 5577)
system     : Darwin
release    : 14.5.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit

Finally, let's distill the core of Steps 1-5 into a single standalone program:

In [46]:
%matplotlib inline
import pandas as pd
import seaborn as sb
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score

# We can jump directly to working with the clean data because we saved our cleaned data set
iris_data_clean = pd.read_csv('iris-data-clean.csv')

# Testing our data: Our analysis will stop here if any of these assertions are wrong

# We know that we should only have three classes
assert len(iris_data_clean['class'].unique()) == 3

# We know that sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5

# We know that our data set should have no missing measurements
assert len(iris_data_clean.loc[(iris_data_clean['sepal_length_cm'].isnull()) |
                               (iris_data_clean['sepal_width_cm'].isnull()) |
                               (iris_data_clean['petal_length_cm'].isnull()) |
                               (iris_data_clean['petal_width_cm'].isnull())]) == 0

all_inputs = iris_data_clean[['sepal_length_cm', 'sepal_width_cm',
                             'petal_length_cm', 'petal_width_cm']].values

all_classes = iris_data_clean['class'].values

# This is the classifier that came out of Grid Search
random_forest_classifier = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                                max_depth=None, max_features=3, max_leaf_nodes=None,
                                min_samples_leaf=1, min_samples_split=2,
                                min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
                                oob_score=False, random_state=None, verbose=0, warm_start=True)

# All that's left to do now is plot the cross-validation scores
rf_classifier_scores = cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10)
sb.boxplot(rf_classifier_scores)
sb.stripplot(rf_classifier_scores, jitter=True, color='white')

# ...and show some of the predictions from the classifier
(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_classes, train_size=0.75)

random_forest_classifier.fit(training_inputs, training_classes)

for input_features, prediction, actual in zip(testing_inputs[:10],
                                              random_forest_classifier.predict(testing_inputs[:10]),
                                              testing_classes[:10]):
    print('{}\t-->\t{}\t(Actual: {})'.format(input_features, prediction, actual))
[ 4.6  3.6  1.   0.2]	-->	Iris-setosa	(Actual: Iris-setosa)
[ 5.2  2.7  3.9  1.4]	-->	Iris-versicolor	(Actual: Iris-versicolor)
[ 7.1  3.   5.9  2.1]	-->	Iris-virginica	(Actual: Iris-virginica)
[ 6.3  3.3  4.7  1.6]	-->	Iris-versicolor	(Actual: Iris-versicolor)
[ 6.7  3.3  5.7  2.5]	-->	Iris-virginica	(Actual: Iris-virginica)
[ 6.9  3.1  5.4  2.1]	-->	Iris-virginica	(Actual: Iris-virginica)
[ 5.1  3.3  1.7  0.5]	-->	Iris-setosa	(Actual: Iris-setosa)
[ 6.3  2.8  5.1  1.5]	-->	Iris-versicolor	(Actual: Iris-virginica)
[ 5.2  3.4  1.4  0.2]	-->	Iris-setosa	(Actual: Iris-setosa)
[ 6.1  2.6  5.6  1.4]	-->	Iris-virginica	(Actual: Iris-virginica)
[Figure: box plot of the random forest's 10-fold cross-validation scores produced by the standalone program]

Closing remarks:

For the data set we started with, we now have a complete, reproducible machine learning demo. We've met the target we set at the outset (accuracy > 90%), and the program is adaptable enough to handle any new input data. Not bad at all!