[Python & Machine Learning] --- Analyzing and Predicting California Housing Prices with Python 3

California housing price analysis and prediction

Steps
1. Frame the problem
2. Get the data
3. Explore and preprocess the data
4. Select a model and train it
5. Fine-tune the parameters
6. Present the final solution
7. Launch the system

Dataset sources

The dataset we use

A California housing price dataset based on the California census

Let's get to work

Step 1

  • Frame the problem

We will use California census data to predict house prices.
The data uses the district as its smallest unit, with fields such as population, median income, and median house value.

  • Clarify the problem

What are we building?
Keep one concept in mind: the pipeline, analogous to pipes in Linux.
Pipeline components run independently; if an upstream component stops producing output, the downstream component keeps working with the old data, which makes the whole system more robust.

1. Collect the data
2. Predict district house prices from the data
3. Use the price-prediction model to evaluate districts and produce investment advice

Questions:

  • Supervised or unsupervised learning?
  • Classification or regression?
  • Batch learning or online learning?
  • Metrics for judging model quality:
    1. Root mean square error (RMSE): the L2 norm; gives extra weight to large errors; the typical metric for regression problems
    2. Mean absolute error (MAE): the L1 norm; better suited when the dataset contains some extreme values

    $\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\tfrac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^2} \qquad \mathrm{MAE}(\mathbf{X}, h) = \tfrac{1}{m}\sum_{i=1}^{m}\bigl|h(\mathbf{x}^{(i)}) - y^{(i)}\bigr|$
    RMSE and MAE both measure the distance between two vectors: the vector of model predictions and the vector of true target values.
    RMSE is essentially the L2 norm; its physical interpretation is Euclidean distance.
    MAE is the L1 norm, corresponding to Manhattan distance: if we may only move vertically or horizontally, the distance between two points is the sum of the distances along each axis.
    The higher the order of the norm, the more it focuses on large values and the more it neglects small ones. This is why RMSE is more sensitive to extreme values than MAE. When extreme points are rare, however, RMSE performs very well and is the most commonly used metric.
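
To make the two metrics concrete, here is a minimal numpy sketch; the prices below are made-up illustrative values, not taken from the dataset:

import numpy as np

y_true = np.array([250000., 180000., 320000.])   # hypothetical true median house values
y_pred = np.array([240000., 200000., 310000.])   # hypothetical model predictions
errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))             # L2-style metric: penalizes the 20000 error most
mae = np.mean(np.abs(errors))                    # L1-style metric: treats all errors linearly
print(rmse, mae)                                 # roughly 14142.1 and 13333.3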

Step 2

  • 1. Get the data

This dataset involves no continuous stream of incoming data and no need to respond instantly to rapidly changing data, and the whole dataset is small enough to load into memory, so
an ordinary batch learning algorithm is perfectly adequate.
If the data were very large, we could use a distributed learning framework such as MapReduce.

In [1]:

import os
import pandas as pd
HOUSE_PATH = './dataset'
def load_housing_data(housing_path=HOUSE_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)
  • 2. Take a first look at the data

In [2]:

housing = load_housing_data()
housing.head()

Out[2]:

  longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

Longitude, latitude, district median housing age, total rooms, total bedrooms, district population, households, median income, median house value, and ocean proximity.

In [3]:

housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.5+ MB

total_bedrooms has only 20433 non-null entries, so it has missing values; fields with missing values need special attention.

  • pandas' value_counts() counts every distinct value in a Series and sorts the counts.

In [4]:

housing['ocean_proximity'].value_counts()

Out[4]:

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

The describe() method gives a summary of the numeric fields.
The count, mean, min, and max rows are self-explanatory. All of these statistics exclude null values; for example, the count for total_bedrooms is 20433 rather than 20640.
The std row shows the standard deviation, which measures how spread out the data is.
The 25%, 50%, and 75% rows show the corresponding percentiles. For example, 25% of the districts have a median housing age below 18 years, 50% below 29 years, and 75% below 37 years.
These points are often called the first quartile, the median, and the third quartile.

In [5]:

housing.describe()

Out[5]:

  longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

In [6]:

%matplotlib inline
import matplotlib.pyplot as plt
  • Plot histograms
  • hist function parameters:
    arr: the 1-D array to histogram
    bins: the number of bars; optional, default 10
    normed: whether to normalize the resulting histogram; default 0
    facecolor: bar color
    edgecolor: bar edge color
    alpha: transparency
    histtype: histogram type: 'bar', 'barstacked', 'step', 'stepfilled'

In [7]:

housing.hist(bins=50,figsize=(20,15))
plt.show()

(figure: histograms of each numeric attribute, 50 bins)

    1. Median income does not seem to be in US dollars. Its range is roughly 0 to 15, and what it actually means is not clear; we need to check with the team that collected the data. We might be told that the field has been scaled, with everything above 15 capped at 15 and everything below 0.5 raised to 0.5.
    Working with preprocessed data is very common in machine learning and usually not a problem, but to be safe we should always understand how the data we use was processed.
    2. Housing age and house value also look capped at both ends. The cap on house value may be a serious problem for us, because it is our prediction target: the system might never predict a price above $500,000. We must check with the downstream teams whether this is a problem for them. If they also need accurate predictions above $500,000, we essentially have two options (option b is sketched after this list):
        a) collect the true prices for the capped districts
        b) remove the capped districts from the dataset
    3. The fields have wildly different ranges. We should remember to apply feature scaling later.
    4. Finally, many of the histograms are long-tailed: they extend much further to the right than to the left. Such distributions can cause trouble for some machine learning algorithms, so later we will try to transform them into something closer to a bell shape.
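
If we went with option (b), a minimal pandas sketch would look like the following; the cutoff of $500,001 is read off the max in describe() above, and the name housing_uncapped is our own:

housing_uncapped = housing[housing['median_house_value'] < 500001.0]  # keep only districts below the price cap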

  • 3. Split the data into a training set and a test set

In [8]:

import numpy as np

Both shuffle and permutation reshuffle an array (i.e. randomly reorder its elements).
The difference:
shuffle operates on the original array in place, changing its order, and returns nothing.
permutation leaves the original array untouched and returns a new, shuffled array.
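
A quick illustration of the difference on a toy array:

arr = np.arange(5)
new_arr = np.random.permutation(arr)  # returns a new shuffled array; arr itself is unchanged
np.random.shuffle(arr)                # shuffles arr in place and returns None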

In [9]:

def split_train_test(data, test_ratio):
    np.random.seed(42)                                    # fix the seed so the split is reproducible
    shuffled_indices = np.random.permutation(len(data))   # shuffled copy of the row indices
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

In [10]:

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), 'training samples\n', len(test_set), 'test samples')
16512 training samples
 4128 test samples
  • Keep the test split consistent across runs

    To make evaluation as consistent as possible, samples that start out in the training set should stay in the training set,
    and samples that start out in the test set should stay in the test set.

In [11]:

import hashlib

In [12]:

def test_set_check(identifier, test_ratio, hash):
    # put a sample in the test set iff the last byte of the hash of its id is below 256 * test_ratio
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

In [13]:

# use the row index as the id
housing_with_id = housing.reset_index()                                      # adds an index column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

In [14]:

# build an id from the most stable features
housing_with_id['id'] = housing['longitude'] * 1000 + housing['latitude']
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'id')
  • sklearn provides APIs to make dataset splitting easy; the simplest is train_test_split

In [15]:

from sklearn.model_selection import train_test_split

In [16]:

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

For important features, we want the test set to contain a representative sample of every stratum.

For example, suppose experts tell us that median income has a very strong influence on house prices. Then we should make sure
every income level is covered in the test set.

In [17]:

housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)            # bucket incomes into categories
housing['income_cat'].where(housing['income_cat'] < 5, 5.0, inplace=True)  # merge everything above 5 into category 5

In [18]:

housing['income_cat'].hist()

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x122f1d90>

(figure: histogram of income_cat)

  • Stratified sampling based on income level

In [19]:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [20]:

housing['income_cat'].value_counts() / len(housing)

Out[20]:

3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
  • Comparing the sampling bias of the different sampling methods:
    (figure: table comparing income_cat proportions under stratified vs. purely random sampling)
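
The elided table can be reproduced with a sketch along these lines; income_cat_proportions and compare_props are names of our own choosing, and we recompute a purely random split for comparison:

def income_cat_proportions(data):
    return data['income_cat'].value_counts() / len(data)

rand_train_set, rand_test_set = train_test_split(housing, test_size=0.2, random_state=42)
compare_props = pd.DataFrame({
    'Overall': income_cat_proportions(housing),
    'Stratified': income_cat_proportions(strat_test_set),
    'Random': income_cat_proportions(rand_test_set),
}).sort_index()
compare_props['Rand. %error'] = 100 * compare_props['Random'] / compare_props['Overall'] - 100
compare_props['Strat. %error'] = 100 * compare_props['Stratified'] / compare_props['Overall'] - 100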

In [21]:

# drop the column that was only used for stratified sampling
for s in (strat_train_set, strat_test_set):
    s.drop(['income_cat'], axis=1, inplace=True)

4. Explore correlations between features

  • Back up the data and explore it

In [22]:

housing = strat_train_set.copy()
  • Visualize the geographic data

In [23]:

housing.plot(kind='scatter', x='longitude', y='latitude')

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0xdad4bb0>

(figure: scatter plot of districts by longitude and latitude)

  • Adjust the transparency for a better view of density

In [24]:

housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0xc711b90>

(figure: the same scatter plot with alpha=0.1; high-density areas stand out)

  • Look at how location relates to house value

In [25]:

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()

Out[25]:

<matplotlib.legend.Legend at 0xd7266b0>

(figure: district scatter plot where marker size encodes population and color encodes median house value)

  • Analyze and work with feature correlations

In [26]:

corr_matrix = housing.corr()                                    # corr() computes pairwise correlation coefficients
  • Features most correlated with house value

In [27]:

corr_matrix["median_house_value"].sort_values(ascending=False)   # 跟街区价格中位数特征的其他特征的相关系数

Out[27]:

median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
  • Scatter plots illustrating different values of the (linear) correlation coefficient:
    (figure: example scatter plots for various correlation coefficients)
  • The same idea with pandas

In [28]:

from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms","housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Out[28]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0D80AED0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0C67A0B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D849FB0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D828F50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0D884370>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D884610>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D8A4A50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D8CD090>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0D8EA6F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D9AAB70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D98D9B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D9143B0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0D932210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0E030A70>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0D758670>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0DFF6DD0>]],
      dtype=object)

(figure: scatter matrix of the four selected attributes)

  • The plots confirm that income is strongly correlated with house value

In [29]:

housing.plot(kind='scatter', x='median_income', 
             y='median_house_value', alpha=0.1)

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0xd945ab0>

(figure: scatter plot of median_income vs. median_house_value)

Step 3

  • Prepare the data for machine learning algorithms
  • Data preprocessing

In [30]:

housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']
  • Check the correlations again

In [31]:

corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

Out[31]:

median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64
  • Prepare the data for training

In [32]:

housing = strat_train_set.drop('median_house_value', axis=1)   # drop median_house_value from the features
housing_labels = strat_train_set['median_house_value'].copy()  # copy median_house_value as the labels
  • Clean the data

The total_bedrooms feature has some missing values that we need to handle.
There are generally three options:

1. housing.dropna(subset=["total_bedrooms"]) # option 1: drop the samples with missing values
2. housing.drop("total_bedrooms", axis=1) # option 2: drop the whole feature
3. median = housing["total_bedrooms"].median()
   housing["total_bedrooms"].fillna(median) # option 3: fill missing values with the median
  • sklearn's class for handling missing values: SimpleImputer

In [33]:

from sklearn.impute import SimpleImputer

In [34]:

imputer = SimpleImputer(strategy='median')

In [35]:

# the median is only defined for numeric columns
housing_num = housing.drop('ocean_proximity', axis=1)

In [36]:

imputer.fit(housing_num)
imputer.statistics_

Out[36]:

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

In [37]:

housing_num.median().values

Out[37]:

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

In [38]:

X = imputer.transform(housing_num)

In [39]:

housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.describe()

Out[39]:

  longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
count 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000
mean -119.575834 35.639577 28.653101 2622.728319 533.998123 1419.790819 497.060380 3.875589
std 2.001860 2.138058 12.574726 2138.458419 410.839621 1115.686241 375.720845 1.904950
min -124.350000 32.540000 1.000000 6.000000 2.000000 3.000000 2.000000 0.499900
25% -121.800000 33.940000 18.000000 1443.000000 296.000000 784.000000 279.000000 2.566775
50% -118.510000 34.260000 29.000000 2119.500000 433.000000 1164.000000 408.000000 3.540900
75% -118.010000 37.720000 37.000000 3141.000000 641.000000 1719.250000 602.000000 4.744475
max -114.310000 41.950000 52.000000 39320.000000 6210.000000 35682.000000 5358.000000 15.000100
  • Handle categorical features
  • For example, encode ocean_proximity as integers

In [40]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

Out[40]:

array([0, 0, 4, ..., 1, 0, 3])

In [41]:

# what the codes 0-4 stand for, i.e. the mapping from the enumerated categories to integers
encoder.classes_

Out[41]:

array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)

Now all the features are numeric and could be fed directly to a machine learning algorithm.
But think about what happens inside the algorithm, taking simple linear regression as an example:
$y = \sum_{j=0}^{n} \theta_j x_j$
The relative magnitude of the encoded values carries no meaning for house prices! Ocean proximity is an enumerated categorical feature; there should be no ordering among its values. How do we fix this?
For the five categories we set exactly one indicator to 1 and the rest to 0, since each house belongs to exactly one category.

  • One-hot encoding

In [42]:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# OneHotEncoder expects a single column of data, i.e. a column vector, so we reshape(-1, 1) into one column
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot    # a sparse matrix

C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:368: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  warnings.warn(msg, FutureWarning)

Out[42]:

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [43]:

housing_cat_1hot.toarray()

Out[43]:

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

OneHotEncoder (in this sklearn version) cannot take string data, only numeric data. So we must first convert the enumerated categories to integers and then to a one-hot encoding.
sklearn has a ready-made shortcut for this: LabelBinarizer.

In [44]:

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot

Out[44]:

array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]], dtype=int32)
  • Write your own transformer

Although scikit-learn provides a great many useful transformers, we sometimes still need to define our own.
scikit-learn relies on so-called duck typing rather than interface inheritance, so a custom transformer only needs to implement the fit, transform, and fit_transform methods to be usable as a transformer.
Here we implement a transformer for the combined features:

In [45]:

from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
  • Feature scaling

Feature scaling is an indispensable step.
The two most common kinds of scaling are:

  • min-max scaling: MinMaxScaler, which maps values into [0, 1]:

    $\frac{x - \min}{\max - \min}$

  • standardization: StandardScaler, which tolerates extreme values much better:

    $\frac{x - \mu}{\sigma}$

Note: never call fit on the whole dataset with any transformer (including scalers). Fit only on the training data.
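
For illustration, a minimal sketch applying both scalers to the imputed numeric data from above (housing_tr); in the full pipeline below this is handled by the std_scaler step:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

minmax_scaler = MinMaxScaler()
housing_minmax = minmax_scaler.fit_transform(housing_tr)   # every column mapped into [0, 1]

std_scaler = StandardScaler()
housing_std = std_scaler.fit_transform(housing_tr)         # every column scaled to zero mean, unit variance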

Transformation pipelines

Getting the data from its rawest form to the final form consumed by the machine learning algorithm may take several transformations, and this process can be assembled with scikit-learn's Pipeline.

The Pipeline constructor takes a list whose elements are 2-tuples of a name for the step and the transformer instance. Every element in the list except the final estimator must be a transformer (i.e. it must have a fit_transform method).

When we call the pipeline's fit method, it calls fit_transform on each transformer in order, and only fit on the final estimator. In fact, the pipeline exposes exactly the same methods as its final estimator.

In [46]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

That covers the numeric data; the enumerated categorical data needs converting as well. To handle numeric and categorical columns together, we can use scikit-learn's FeatureUnion.

In [47]:

from sklearn.base import BaseEstimator, TransformerMixin
    
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [48]:

from sklearn.pipeline import FeatureUnion

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    # wrapper: LabelBinarizer's fit_transform takes a single argument, so it cannot sit in a Pipeline directly
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

housing_prepared = full_pipeline.fit_transform(housing)

In [49]:

housing_prepared

Out[49]:

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

Step 4

Select a model and train it

1. A linear model

In [50]:

from sklearn.linear_model import LinearRegression

In [51]:

lin_reg = LinearRegression()
# supervised learning: housing_prepared holds the features, housing_labels the targets
lin_reg.fit(housing_prepared, housing_labels)

Out[51]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [52]:

# a quick sanity check
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
  • Use scikit-learn's mean_squared_error to look at the RMSE:

In [53]:

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

Out[53]:

68628.19819848923

Most districts' median house values lie between $120,000 and $265,000, so a typical error of $68,628 is clearly not acceptable.
The error on the training set is large.
This is plainly underfitting. The remedies for underfitting are:

  1. switch to a more powerful model
  2. engineer better features
  3. loosen the constraints on the model

Let's try another model:

2. A decision tree

In [54]:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

Out[54]:

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
  • Check the result

In [55]:

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

Out[55]:

0.0
  • Zero error on the training set means the model is overfitting

Use cross-validation to choose a better model

We could keep using train_test_split to carve the original training set into a training set and a validation set,
but scikit-learn provides an easier-to-use cross-validation facility.

In [56]:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

In [57]:

def display_scores(scores):
    print("RMSE:", scores)
    print("Mean RMSE:", scores.mean())
    print("RMSE std:", scores.std())

display_scores(rmse_scores)
RMSE: [69327.01708558 65486.39211857 71358.25563341 69091.37509104
 70570.20267046 75529.94622521 69895.20650652 70660.14247357
 75843.74719231 68905.17669382]
Mean RMSE: 70666.74616904806
RMSE std: 2928.322738055112
  • Evaluate linear regression with cross-validation:

In [58]:

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
RMSE: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean RMSE: 69052.46136345084
RMSE std: 2731.6740017983475

The decision tree is clearly overfitting.

3. A random forest

In [59]:

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print('training-set error:', forest_rmse)
C:\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
training-set error: 22256.354579885898

In [60]:

forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
(the same FutureWarning about n_estimators is printed once per CV fold; elided)
RMSE: [52868.21871547 49194.92340594 51713.77716246 55012.10310441
 50973.88861214 55990.4479905  52298.68285092 50953.09058463
 54428.48087563 53461.73225682]
Mean RMSE: 52689.53455589254
RMSE std: 1980.36257012708

We usually experiment with many models, especially with different parameters.
So it pays to persist a model to disk and reload it later, which lets us compare models quickly:

from sklearn.externals import joblib

joblib.dump(my_model, "my_model.pkl")

my_model_loaded = joblib.load("my_model.pkl")

Step 5

Fine-tune the parameters

  • Grid search

In [61]:

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

Out[61]:

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [62]:

grid_search.best_params_

Out[62]:

{'max_features': 6, 'n_estimators': 30}

In [63]:

grid_search.best_estimator_

Out[63]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)

In [64]:

# look at the score of every parameter combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
64560.56453532923 {'max_features': 2, 'n_estimators': 3}
55582.7022830937 {'max_features': 2, 'n_estimators': 10}
52946.41821544518 {'max_features': 2, 'n_estimators': 30}
60706.36585084631 {'max_features': 4, 'n_estimators': 3}
52620.50721816631 {'max_features': 4, 'n_estimators': 10}
50524.81738581646 {'max_features': 4, 'n_estimators': 30}
58527.66047437809 {'max_features': 6, 'n_estimators': 3}
52222.271025610295 {'max_features': 6, 'n_estimators': 10}
49935.06005093446 {'max_features': 6, 'n_estimators': 30}
58659.80511658877 {'max_features': 8, 'n_estimators': 3}
52363.81008209303 {'max_features': 8, 'n_estimators': 10}
50107.93319293479 {'max_features': 8, 'n_estimators': 30}
62413.99504781873 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54498.69196389087 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60335.933168485164 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52359.94950597992 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59131.91266427517 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51851.7761074725 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

Randomized search

scikit-learn's RandomizedSearchCV helps us search for good hyperparameters more conveniently when the hyperparameter space is large.
It is used the same way as GridSearchCV, with two extra benefits:

  1. if we let it run, say, 1000 iterations, it will try 1000 different values for each hyperparameter, making the explored space much larger
  2. as long as we have the compute, we can grow the search space simply by raising the iteration count (a sketch follows)
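
A sketch of what that might look like for the same random forest; the distributions and n_iter here are illustrative choices, not tuned values:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),   # each iteration samples a value from this range
    'max_features': randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)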

Ensemble methods

Combining several models into a stronger ensemble often works well, especially when the individual models make very different kinds of errors.
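
As a toy illustration (not part of the original notebook), we could blend the fitted linear model with the tuned forest from the grid search above by simply averaging their predictions; a real ensemble would weight or stack the sub-models:

lin_pred = lin_reg.predict(housing_prepared)                          # linear model from step 4
forest_pred = grid_search.best_estimator_.predict(housing_prepared)   # tuned forest from the grid search
blend_pred = (lin_pred + forest_pred) / 2                             # naive unweighted average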

Analyze the best models and their errors

Analyzing a model's errors can lead to deeper insights.

In [65]:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

Out[65]:

array([7.69687766e-02, 7.04319887e-02, 4.38660619e-02, 1.80630448e-02,
       1.66508742e-02, 1.79482850e-02, 1.59942989e-02, 3.27548930e-01,
       5.57736006e-02, 1.05319561e-01, 9.13965806e-02, 1.16638033e-02,
       1.38350194e-01, 1.03976446e-04, 3.83940095e-03, 6.08062375e-03])

In [66]:

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

Out[66]:

[(0.32754893001195934, 'median_income'),
 (0.13835019350471953, 'INLAND'),
 (0.10531956070724346, 'pop_per_hhold'),
 (0.09139658055040592, 'bedrooms_per_room'),
 (0.07696877663966344, 'longitude'),
 (0.07043198869339834, 'latitude'),
 (0.05577360056115168, 'rooms_per_hhold'),
 (0.04386606192454317, 'housing_median_age'),
 (0.018063044805051474, 'total_rooms'),
 (0.01794828495917463, 'population'),
 (0.016650874220719983, 'total_bedrooms'),
 (0.01599429894210943, 'households'),
 (0.01166380333543238, '<1H OCEAN'),
 (0.006080623745976418, 'NEAR OCEAN'),
 (0.0038394009525752966, 'NEAR BAY'),
 (0.00010397644587548845, 'ISLAND')]

Step 6

The final solution

Evaluate the final model on the test set

In [67]:

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

print(final_rmse)
47579.68069327446

Step 7

Launch, monitor, and maintain

  • Productionize the code and write tests
  • Write monitoring code
  • Keep a human-evaluation side channel as a sanity check
  • Set up a mechanism to retrain the model on fresh data regularly

Improvements

  1. Try a support vector machine (sklearn.svm.SVR) with different kernels and regularization factors C (a starting point is sketched below)
  2. Try replacing GridSearchCV with RandomizedSearchCV and see whether it finds better hyperparameters
  3. Try adding a transformer to the data-preparation pipeline that selects only the most important features
  4. Try using GridSearchCV to automatically decide whether certain data-preparation steps should be performed
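
For item 1, a possible starting point; the kernels and the C/gamma grids below are illustrative guesses, and note that SVR is slow on a dataset this size:

from sklearn.svm import SVR

svr_param_grid = [
    {'kernel': ['linear'], 'C': [10., 100., 1000.]},
    {'kernel': ['rbf'], 'C': [10., 100.], 'gamma': [0.01, 0.1]},
]
svr_search = GridSearchCV(SVR(), svr_param_grid, cv=5,
                          scoring='neg_mean_squared_error')
svr_search.fit(housing_prepared, housing_labels)
np.sqrt(-svr_search.best_score_)   # RMSE of the best parameter combination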