Learning the sklearn package

Reposted from: https://www.cnblogs.com/nolonely/p/6902860.html

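1. The official sklearn website: http://scikit-learn.org/stable/

The official site hosts many demos, along with a very useful flow chart that helps you choose a suitable method based on the characteristics of your dataset.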


2. A small example of using sklearn
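
The code for this example did not survive in this copy of the post. As a minimal sketch of what such an example might look like, the snippet below assumes the iris dataset and a k-nearest-neighbors classifier; both choices, along with the split ratio and variable names, are assumptions rather than the original author's code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a bundled dataset (assumed here: iris, 150 samples, 4 features).
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Every sklearn estimator follows the same fit / predict / score pattern.
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)

print(predictions[:10])  # predicted labels
print(y_test[:10])       # true labels
print(accuracy)          # fraction of correct predictions on the test split
```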

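3. sklearn datasets

The example above used one of sklearn's built-in datasets directly. The package includes many others, listed at http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets. Besides loading the bundled data, you can also generate synthetic data.

sklearn ships with built-in datasets; the original example used the housing dataset.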
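
The original housing code is missing here. It most likely relied on load_boston, which was removed from scikit-learn in version 1.2, so the sketch below swaps in the bundled load_diabetes dataset to show the same loader pattern. This is a substitution, not the author's example; fetch_california_housing is a housing alternative, but it downloads its data on first use.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Substitute dataset (assumption): diabetes is bundled with sklearn and needs no download.
data = datasets.load_diabetes()
X, y = data.data, data.target

# Fit a plain linear regression on the full dataset, just to show the workflow.
model = LinearRegression()
model.fit(X, y)

print(X.shape)               # (n_samples, n_features)
print(model.predict(X[:4]))  # predictions for the first four samples
print(y[:4])                 # the corresponding true targets
```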


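sklearn can also generate datasets. The example here is the generator used for regression models; the parameters and return values listed below match sklearn.datasets.make_regression.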



Parameters:

n_samples : int, optional (default=100)：The number of samples.

n_features : int, optional (default=100)：The number of features.

n_informative : int, optional (default=10)：The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targets : int, optional (default=1)：The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

bias : float, optional (default=0.0)：The bias term in the underlying linear model.

effective_rank : int or None, optional (default=None)

    If not None: The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

    If None: The input set is well conditioned, centered and gaussian with unit variance.

tail_strength : float between 0.0 and 1.0, optional (default=0.5)：The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noise : float, optional (default=0.0)：The standard deviation of the gaussian noise applied to the output.

shuffle : boolean, optional (default=True)：Shuffle the samples and the features.

coef : boolean, optional (default=False)：If True, the coefficients of the underlying linear model are returned.

random_state : int, RandomState instance or None, optional (default=None)：If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Returns:

X : array of shape [n_samples, n_features]：The input samples.

y : array of shape [n_samples] or [n_samples, n_targets]：The output values.

coef : array of shape [n_features] or [n_features, n_targets], optional：The coefficient of the underlying linear model. It is returned only if coef is True.
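
To make the parameter list concrete, here is a small sketch that calls make_regression with a few of the options described above. The specific values (100 samples, 3 features, 2 informative, noise of 10) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression

# Generate 100 samples with 3 features, of which only 2 drive the target.
# coef=True additionally returns the coefficients of the underlying linear model.
X, y, coef = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=2,
    n_targets=1,
    noise=10.0,       # standard deviation of the gaussian noise added to y
    coef=True,
    random_state=0,   # fixed seed so the data is reproducible
)

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,) because n_targets is 1
print(coef)     # one coefficient per feature; uninformative features get 0
```

Plotting a single feature column against y (for example with matplotlib's scatter) is a quick way to see the effect of the noise parameter.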


4. Model parameters

sklearn models share a highly uniform set of attributes and methods. You can use them to inspect a model's parameters and learned values.
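
The original snippet and its output are missing here. The sketch below assumes a LinearRegression model fitted on synthetic data and shows a few of the attributes and methods that sklearn estimators expose; the model and data are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data, so the linear model can fit it almost exactly.
X, y = make_regression(n_samples=100, n_features=2, noise=0.0, random_state=0)

model = LinearRegression()
model.fit(X, y)

print(model.coef_)         # learned weight for each feature
print(model.intercept_)    # learned bias term
print(model.get_params())  # constructor parameters the model was created with
print(model.score(X, y))   # R^2 on the given data for regressors
```

Attributes with a trailing underscore, such as coef_ and intercept_, exist only after fit has been called; get_params works on any estimator, fitted or not.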


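5. Standardization (normalization)

Normalization matters a great deal for machine learning when features span very different ranges, especially when the features also influence one another. In scikit-learn, standardization is done with preprocessing.scale(), which rescales each feature to zero mean and unit variance. After scaling, a model can usually learn more effectively from the data.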
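
A minimal sketch of preprocessing.scale() on a small array whose columns have very different ranges; the numbers are made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Three samples, three features with very different spreads.
a = np.array(
    [[10.0, 2.7, 3.6],
     [-100.0, 5.0, -2.0],
     [120.0, 20.0, 40.0]]
)

# scale() standardizes each column (feature) independently.
scaled = preprocessing.scale(a)

print(scaled)
print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

When a separate test set is involved, a StandardScaler fitted on the training data only is generally the safer choice, because it applies the training statistics to the test data instead of recomputing them.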

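6. Cross validation (1)

Cross validation in sklearn is very helpful for choosing the right model and the right model parameters. With it, we can see directly how different models or parameter values affect the accuracy of the results.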
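
The original code and plot are not available. As a sketch of the idea, the snippet below assumes the iris dataset and compares k-nearest-neighbors classifiers with different values of k using 10-fold cross validation; the dataset, the range of k, and the scoring metric are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross validation: ten accuracy values, one per held-out fold.
    scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
    k_scores.append(scores.mean())

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(k_range, key=lambda k: k_scores[k - 1])
print(best_k, k_scores[best_k - 1])
```

Plotting k_scores against k_range gives the kind of curve used to compare parameter values.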

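For k-nearest neighbors, a larger k makes the model more prone to underfitting than to overfitting.

To run the same comparison for a different machine learning model, replace the knn estimator with that model.

7. Cross validation (2)

The learning_curve function (in sklearn.learning_curve in the release the original post used; it lives in sklearn.model_selection in later releases) shows clearly how a model's learning progresses. Comparing the training and validation scores reveals whether the model is overfitting, and we can then adjust the model to counter it.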
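
A minimal sketch of computing a learning curve, assuming the digits dataset and an SVC with gamma=0.001; these choices are illustrative and not necessarily those of the original post.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Train on growing fractions of the training folds and score each with 5-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(gamma=0.001),
    X,
    y,
    cv=5,
    scoring="accuracy",
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
)

# Average over the folds: one value per training-set size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(n, round(tr, 3), round(te, 3))
```

A training score that stays well above the cross-validation score as the training set grows is the usual sign of overfitting.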

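8. Cross validation (3)

Three consecutive sections on cross validation show just how important validation is in machine learning. This time we use another function from the same module (sklearn.learning_curve in older releases, sklearn.model_selection in later ones) called validation_curve. It makes it easier to see whether overfitting appears as a model parameter changes, which in turn helps us choose better parameter values.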
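
A minimal sketch of validation_curve, assuming the digits dataset and an SVC whose gamma parameter is varied; the dataset and the parameter range are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Candidate values for the gamma parameter of SVC, spaced on a log scale.
param_range = np.logspace(-6, -2.3, 5)

# For each gamma value, run 5-fold CV and record train and validation accuracy.
train_scores, test_scores = validation_curve(
    SVC(),
    X,
    y,
    param_name="gamma",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for g, tr, te in zip(param_range, train_mean, test_mean):
    print(g, round(tr, 3), round(te, 3))
```

Parameter values where the training score keeps rising while the validation score falls are the ones to avoid.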

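9. Saving the model

Once a model is trained, we usually need to save it and load it again later for prediction, so saving and loading an sklearn model is an equally important step. This post uses two methods to store a model.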
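
The two methods are not named in what survives of the post. The sketch below assumes they are the standard-library pickle module and joblib, which are the two common options for persisting sklearn models; the classifier, dataset, and file names are illustrative.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC()
clf.fit(X, y)

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Method 1 (assumed): pickle from the standard library.
    pickle_path = os.path.join(tmp_dir, "clf.pickle")
    with open(pickle_path, "wb") as f:
        pickle.dump(clf, f)
    with open(pickle_path, "rb") as f:
        clf_from_pickle = pickle.load(f)

    # Method 2 (assumed): joblib, which handles large NumPy arrays efficiently.
    joblib_path = os.path.join(tmp_dir, "clf.joblib")
    joblib.dump(clf, joblib_path)
    clf_from_joblib = joblib.load(joblib_path)

# The restored models predict exactly like the original.
print(clf_from_pickle.predict(X[:1]))
print(clf_from_joblib.predict(X[:1]))
```

Both formats can execute arbitrary code on load, so only load model files from sources you trust, and reload them with the same scikit-learn version that saved them where possible.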

Learning resources on YouTube:

Morvan Zhou (周莫烦): https://www.youtube.com/user/MorvanZhou

Personal homepage: https://morvanzhou.github.io/tutorials/

Source code: https://github.com/MorvanZhou
