The sklearn.cross_validation module and data-splitting methods
1. The sklearn.cross_validation module
(Note: sklearn.cross_validation was deprecated in scikit-learn 0.18 in favor of sklearn.model_selection, which the examples in section 2 use; the classes themselves are the same.)
(1) sklearn.cross_validation.cross_val_score(): returns the classification accuracy obtained on each cross-validation fold.
For details see http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html
sklearn.cross_validation.cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
Selected parameters:
estimator: the estimator to evaluate; any classifier works. For example, a logistic-regression classifier: clf = sklearn.linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
cv: the cross-validation strategy. It can be an integer, a cross-validation generator, or an iterable. The default None means 3-fold cross-validation; an integer such as cv=5 means 5-fold cross-validation; an object is used directly as the generator. Additionally, when cv is an integer and y is given (a classification task), StratifiedKFold is used instead of plain KFold.
scoring: defaults to None, in which case the estimator's own score method defines the accuracy measure; otherwise it names the scoring function to use.
Example:
>>> sklearn.cross_validation.cross_val_score(clf, x, y, cv=5)
array([ 0.81564246, 0.81564246, 0.78651685, 0.78651685, 0.81355932])
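To make the mechanism concrete, here is a pure-Python sketch of what cross_val_score does: split the indices into cv folds, fit on each training portion, score on the held-out fold, and collect one score per fold. The kfold_indices helper and the MajorityClassifier toy estimator are illustrative inventions, not sklearn APIs.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled k-fold splits."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size

class MajorityClassifier:
    """Toy estimator (not sklearn's): always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self
    def score(self, X, y):
        return sum(yi == self.label_ for yi in y) / len(y)

def cross_val_score_sketch(estimator, X, y, cv=3):
    """Mimic cross_val_score: one accuracy per fold."""
    scores = []
    for train, test in kfold_indices(len(X), cv):
        estimator.fit([X[i] for i in train], [y[i] for i in train])
        scores.append(estimator.score([X[i] for i in test], [y[i] for i in test]))
    return scores

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 0, 0]
print(cross_val_score_sketch(MajorityClassifier(), X, y, cv=3))  # [1.0, 0.5, 1.0]
```

The real function additionally supports parallel fitting (n_jobs) and pluggable scorers, but the fold-fit-score loop above is the core idea.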
(2) sklearn.cross_validation.train_test_split()
For details see http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
sklearn.cross_validation.train_test_split(*arrays, **options)
Randomly splits the input arrays into training and test subsets according to the given proportions and returns them.
Selected parameters:
*arrays: the input samples
train_size: a float between 0 and 1, the proportion of samples to put in the training set.
test_size: a float between 0 and 1, the proportion of samples to put in the test set. If train_size is None, test_size defaults to 0.25.
random_state: if an int, the seed for the random-number generator.
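The splitting logic itself is simple. A minimal sketch, assuming a seeded shuffle-and-cut strategy (an approximation of the idea, not sklearn's actual implementation):

```python
import random

def train_test_split_sketch(X, y, test_size=0.25, random_state=None):
    """Shuffle the sample indices with a seeded RNG, then cut off test_size of them."""
    idx = list(range(len(X)))
    random.Random(random_state).shuffle(idx)  # same seed -> same split
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[i] for i in range(8)]
y = list(range(8))
X_train, X_test, y_train, y_test = train_test_split_sketch(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 6 2
```

This also shows why random_state matters: reusing the same seed reproduces the same shuffle, hence the same train/test partition.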
2. Data-splitting methods
(1) K-fold cross-validation: KFold, GroupKFold, StratifiedKFold
Example:
K-fold: the default CV strategy. In the old cross_validation API it took two main parameters, the number of samples and the number of folds k; in sklearn.model_selection (used below) only the number of folds n_splits is needed.
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)  # returns the number of folds: 2
print(kf)
# KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Output:
# TRAIN: [2 3] TEST: [0 1]
# TRAIN: [0 1] TEST: [2 3]
Here kf.split(X) yields the train and test index arrays of each split; the samples in X have indices 0, 1, 2, 3. In the first split, indices 0 and 1 form the test set and the remaining indices 2 and 3 the training set; in the second split, indices 2 and 3 form the test set and indices 0 and 1 the training set.
Stratified k-fold: like k-fold, the data are divided into k folds; the difference is that within each fold the proportion of every class matches its proportion in the original data set.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)  # returns the number of folds: 2
print(skf)
# StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Output:
# TRAIN: [1 3] TEST: [0 2]
# TRAIN: [0 2] TEST: [1 3]
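The stratification idea can be sketched in pure Python: split each class's indices into k folds separately, then merge, so every fold keeps the original class proportions. This round-robin scheme is a rough illustration of the principle, not sklearn's exact algorithm.

```python
from collections import defaultdict

def stratified_kfold_indices(y, n_splits):
    """Yield (train, test) index lists where each test fold preserves class ratios."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % n_splits].append(i)  # deal each class round-robin into the folds
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for f in range(n_splits) if f != k for i in folds[f])
        yield train, test

y = [0, 0, 1, 1]
for train, test in stratified_kfold_indices(y, 2):
    print(train, test)
# [1, 3] [0, 2]
# [0, 2] [1, 3]
```

On this tiny example the sketch reproduces the same index pairs as the StratifiedKFold run above: every test fold contains one sample of each class.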
(2) Leave-out methods: LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut
Example:
leave-one-out: each sample in turn serves as the validation set while the remaining N-1 samples form the training set, so LOO-CV fits N models, and the average of the N validation scores is the performance measure of the classifier. The old cross_validation.LeaveOneOut had a single parameter, the number of samples; model_selection.LeaveOneOut (used below) takes none.
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
# Output:
# [1 2 3] [0]
# [0 2 3] [1]
# [0 1 3] [2]
# [0 1 2] [3]
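Leave-one-out reduces to very little code; this sketch reproduces the index pairs printed above without sklearn:

```python
def leave_one_out(n_samples):
    """Each index in turn becomes the one-element test set."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]

for train, test in leave_one_out(4):
    print(train, test)
# [1, 2, 3] [0]
# [0, 2, 3] [1]
# [0, 1, 3] [2]
# [0, 1, 2] [3]
```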
leave-P-out: each split removes p samples from the whole set as the test set; with n samples this produces C(n, p) train/test pairs. Unlike LOO and KFold, the test sets overlap whenever p > 1.
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
# Output:
# [2 3] [0 1]
# [1 3] [0 2]
# [1 2] [0 3]
# [0 3] [1 2]
# [0 2] [1 3]
# [0 1] [2 3]
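LeavePOut enumerates every size-p subset of indices as a test set, which is exactly what itertools.combinations produces; the pure-Python sketch below yields the C(n, p) pairs in the same order as the run above:

```python
from itertools import combinations

def leave_p_out(n_samples, p):
    """Every size-p subset of indices becomes a test set once."""
    for test in combinations(range(n_samples), p):
        train = [j for j in range(n_samples) if j not in test]
        yield train, list(test)

splits = list(leave_p_out(4, 2))
print(len(splits))  # C(4, 2) = 6
print(splits[0])    # ([2, 3], [0, 1])
```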
leave-one-group-out (the old leave-one-label-out): this strategy splits the samples according to a third-party array of integer group labels. Each split takes the samples belonging to one group as the test set and the rest as the training set. In sklearn.model_selection the class is LeaveOneGroupOut:
from sklearn.model_selection import LeaveOneGroupOut

X = [1, 2, 3, 4]
groups = [1, 1, 2, 2]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, groups=groups):
    print("%s %s" % (train, test))
# Output:
# [2 3] [0 1]
# [0 1] [2 3]
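A pure-Python sketch of the same idea: each distinct group label, taken in sorted order, becomes the test set once.

```python
def leave_one_group_out(groups):
    """For each distinct group label, its samples form the test set."""
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield train, test

for train, test in leave_one_group_out([1, 1, 2, 2]):
    print(train, test)
# [2, 3] [0, 1]
# [0, 1] [2, 3]
```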
leave-P-groups-out (the old leave-P-label-out): like leave-one-group-out, but each split takes the samples of p groups as the test set and the rest as the training set. In sklearn.model_selection the class is LeavePGroupsOut:
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
groups = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, groups=groups):
    print("%s %s" % (train, test))
# Output:
# [4 5] [0 1 2 3]
# [2 3] [0 1 4 5]
# [0 1] [2 3 4 5]
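Leave-P-groups-out combines the two previous ideas: every size-p subset of distinct group labels becomes a test set. A pure-Python sketch:

```python
from itertools import combinations

def leave_p_groups_out(groups, n_groups):
    """Every size-n_groups subset of group labels defines one test set."""
    for held_out in combinations(sorted(set(groups)), n_groups):
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test

for train, test in leave_p_groups_out([1, 1, 2, 2, 3, 3], 2):
    print(train, test)
# [4, 5] [0, 1, 2, 3]
# [2, 3] [0, 1, 4, 5]
# [0, 1] [2, 3, 4, 5]
```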
(3) Random-permutation methods: ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit
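The core of these splitters is that each of n_splits iterations independently shuffles the indices and cuts off a test set, so unlike k-fold the test sets may overlap. A rough sketch of the ShuffleSplit idea (an approximation, not sklearn's code):

```python
import random

def shuffle_split(n_samples, n_splits, test_size, random_state=None):
    """Yield n_splits independent shuffled (train, test) partitions."""
    rng = random.Random(random_state)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle every iteration: splits can overlap
        yield sorted(idx[n_test:]), sorted(idx[:n_test])

for train, test in shuffle_split(6, n_splits=3, test_size=1/3, random_state=0):
    print(train, test)
```

GroupShuffleSplit and StratifiedShuffleSplit apply the same shuffle-and-cut scheme at the group level and within each class, respectively.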