XGBoost Study Notes (1)
Lecture 1: A First Look at XGBoost

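XGBoost (eXtreme Gradient Boosting) is a fast and efficient C++ implementation of Gradient Boosting Machines (GBM).
1. Introduction to XGBoost
1.1 GBM
- Combines weak learners in a gradient-descent fashion
Common weak learners:
Decision tree: each leaf node corresponds to a decision
Classification and regression tree (CART): each leaf node carries a prediction score, which makes it more flexible than a plain decision tree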

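The objective function usually has two parts:
(1) Loss function: depends on the task
Regression: squared residual
Classification: 0-1 loss, logistic loss, hinge loss (SVM)
…
(2) Regularization term: depends on model complexity
L1, L2 regularization…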
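
Written as a formula, the two parts above are commonly combined as shown below. The notation is an assumption of this sketch, following the usual presentation of boosted trees: l is the loss, \hat{y}_i the prediction for sample i, f_k the k-th weak learner, and \Omega the regularization term.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)
```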

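The first Boosting algorithm: AdaBoost
Friedman generalized AdaBoost into the general Gradient Boosting framework, which gives GBM: Boosting is treated as a numerical optimization problem and solved with a procedure similar to gradient descent.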
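
To make the "gradient descent over weak learners" idea concrete, here is a minimal sketch for squared loss on a one-dimensional toy problem. It is not how XGBoost is implemented: the helper names (fit_stump, gradient_boost), the toy data, and the choice of one-split stumps as weak learners are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Try every threshold that leaves at least one point on each side
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additively combine stumps, each fitted to the current negative gradient."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - pred)^2 the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


# Made-up toy data, roughly y = x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]

model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((mean_y - y) ** 2 for y in ys) / len(ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print("baseline MSE:", baseline_mse, "boosted MSE:", mse)
```

Each round takes a small step (scaled by learning_rate) in the direction that reduces the loss, which is the sense in which boosting resembles gradient descent.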

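1.2 What makes XGBoost special
Regularization (standard GBM implementations have no explicit regularization step):
Regularization helps reduce overfitting;

Parallel processing:
Uses OpenMP to run automatically in parallel on the multiple cores of a single machine's CPU;
Supports GPU acceleration;
Supports distributed training;

High flexibility (users can define their own optimization objective and evaluation function):
Only the first and second derivatives of the loss function are required;

Pruning:
When a new split yields a negative gain, GBM stops splitting;
XGBoost keeps splitting up to the specified maximum depth (max_depth) and then goes back and prunes;

Built-in cross-validation
Online learning
Interfaces for multiple languages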
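
The flexibility and pruning items above both rest on the same two quantities: the first derivative (gradient) and second derivative (hessian) of the loss with respect to the raw score. The sketch below, under stated assumptions, computes them for the logistic loss and plugs them into the regularized split gain from the standard second-order derivation of XGBoost. The function names and toy numbers are hypothetical; reg_lambda and gamma are named after XGBoost's L2 and minimum-split-loss parameters.

```python
import math


def logistic_grad_hess(margin, label):
    """Gradient and hessian of the logistic loss w.r.t. the raw score (margin)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - label, p * (1.0 - p)


def split_gain(left, right, reg_lambda=1.0, gamma=0.0):
    """Regularized gain of splitting a node into `left` and `right`.

    `left` and `right` are lists of (gradient, hessian) pairs.
    """
    def score(pairs):
        g_sum = sum(g for g, _ in pairs)
        h_sum = sum(h for _, h in pairs)
        return g_sum * g_sum / (h_sum + reg_lambda)

    return 0.5 * (score(left) + score(right) - score(left + right)) - gamma


# Four made-up samples, all starting from a raw score of 0 (probability 0.5)
labels = [1, 1, 0, 0]
stats = [logistic_grad_hess(0.0, y) for y in labels]

# A split that separates the positives from the negatives
gain_no_penalty = split_gain(stats[:2], stats[2:], reg_lambda=1.0, gamma=0.0)
gain_with_penalty = split_gain(stats[:2], stats[2:], reg_lambda=1.0, gamma=1.0)
print(gain_no_penalty, gain_with_penalty)
```

With gamma set to 1.0 the same split has negative gain, which is the kind of split that pruning would remove. A user-defined objective for xgb.train is passed through its obj argument and returns exactly these two quantities per sample.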

2. First contact with XGBoost
The general workflow for a data science task


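2.1 Defining the task
Dataset: the Mushroom dataset from the UCI Machine Learning Repository (shipped as demo data with the XGBoost package)
Task: decide whether a mushroom is poisonous based on its 22 features
Total samples: 8124, of which 4208 (51.8%) are edible and 3916 (48.2%) are poisonous; 6513 training samples and 1611 test samples
Features: in the demo, the 22 original features have been preprocessed into 126 features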

2.2 Importing packages

import xgboost as xgb
from sklearn.metrics import accuracy_score

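2.3 Reading the data
XGBoost can load text data in libsvm format. The file format (sparse features) consists of lines such as "1 53:1" or "0 47:0.3": the leading "1" or "0" is the sample's label, "53" and "47" are feature indices, and the number after the colon is the feature value. Data loaded by XGBoost is stored in a DMatrix object, which is optimized for storage efficiency and speed.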
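
To illustrate the line format just described, here is a small sketch that parses one libsvm-style line by hand. It only shows how the text is laid out; it is not how XGBoost reads the file, and the function name and sample line are made up.

```python
def parse_libsvm_line(line):
    """Split 'label index:value index:value ...' into a label and a sparse dict."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Made-up sample line: label 1, three non-zero features
label, features = parse_libsvm_line("1 3:1 10:1 53:0.3")
print(label, features)
```

Features that do not appear on a line are simply absent from the sparse representation.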

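my_workpath = './data/'
# Note: newer XGBoost releases expect the text format to be named in the URI,
# e.g. 'agaricus.txt.train?format=libsvm'
dtrain = xgb.DMatrix(my_workpath + 'agaricus.txt.train')
dtest = xgb.DMatrix(my_workpath + 'agaricus.txt.test')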

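2.4 Setting the training parameters

param = {'max_depth':2, 'eta':1, 'silent':0, 'objective':'binary:logistic'}

where:
max_depth: maximum depth of a tree; the default is 6 and the range is [1,∞];
eta: the shrinkage step size used during updates to prevent overfitting; eta makes the boosting process more conservative by shrinking the feature weights; the default is 0.3 and the range is [0,1];
silent: 0 prints runtime messages, 1 runs silently without printing them; the default is 0 (newer XGBoost releases replace this parameter with verbosity);
objective: defines the learning task and its learning objective; "binary:logistic" means logistic regression for binary classification, with probabilities as output.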
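2.5 Training the model

# Number of boosting iterations
num_round = 2

import time
# time.clock() was removed in Python 3.8; perf_counter() is the replacement
starttime = time.perf_counter()

bst = xgb.train(param, dtrain, num_round)

endtime = time.perf_counter()
print(endtime - starttime)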

2.6 Testing
Check the model's classification performance on the training set

train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()
train_accuracy = accuracy_score(y_train, train_predictions)
print('Train Accuracy: %.2f%%' % (train_accuracy * 100.0))

Use the trained model to predict on the test data

preds = bst.predict(dtest)
predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, predictions)
print('Test Accuracy: %.2f%%' % (test_accuracy * 100.0))
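
Because the objective outputs probabilities, the round() calls above act as a 0.5 cutoff. The sketch below makes that cutoff explicit and computes accuracy by hand; the function names and the probabilities are made up for illustration and do not come from the mushroom model.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up probabilities and labels
example_probs = [0.92, 0.08, 0.61, 0.35]
example_labels = [1, 0, 0, 0]

example_preds = to_labels(example_probs)
example_acc = accuracy(example_labels, example_preds)
print('Accuracy: %.2f%%' % (example_acc * 100.0))
```

An explicit threshold is also easier to change when the two kinds of error have different costs, as they arguably do when one class is "poisonous".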

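2.7 Visualizing the model
plot_tree() takes three arguments:
the model
the index of the tree (num_trees), starting from 0
the layout direction (rankdir); the default is top-to-bottom, and 'LR' is left-to-right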

import matplotlib.pyplot as plt
import graphviz
xgb.plot_tree(bst, num_trees=0, rankdir='LR')
plt.show()
# xgb.plot_tree(bst,num_trees=1, rankdir= 'LR' )
# plt.show()
# xgb.to_graphviz(bst,num_trees=0)
# xgb.to_graphviz(bst,num_trees=1)

3. Using XGBoost with scikit-learn

3.1 Loading LibSVM-format data

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt 
my_workpath = './data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

3.2 Training the model

num_round = 2
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')
bst.fit(X_train, y_train)

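3.3 Validation set
Hold out part of the training data as a validation set, and choose the model that performs best on it.
Here one third of the data is used for validation

from sklearn.model_selection import train_test_split

seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train, y_train, test_size=test_size, random_state=seed)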

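3.3 Learning curves
A learning curve shows how the model's predictive performance changes as some learning parameter varies, such as the number of training samples or the number of iterations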

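num_round = 100
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round, silent=True, objective='binary:logistic')
# Evaluate on both the training part and the held-out validation part after every round
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
# Note: recent XGBoost releases take eval_metric in the XGBClassifier constructor instead of fit()
bst.fit(X_train_part, y_train_part, eval_metric=['error', 'logloss'], eval_set=eval_set, verbose=True)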

Plot the learning curves

results = bst.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()

3.4 Early stopping
To prevent overfitting, training is stopped if performance on the validation set has not improved after a fixed number of iterations.

bst.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="error",eval_set=eval_set, verbose=True)
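
The stopping rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not XGBoost's internal code; the function name, the patience argument (playing the role of early_stopping_rounds) and the error values are made up.

```python
def early_stopping_round(validation_errors, patience):
    """Return (best_round, stopped_at) for a sequence of per-round validation errors.

    Training stops once `patience` rounds have passed without a new best error.
    """
    best_error = float("inf")
    best_round = -1
    for i, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_round = err, i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(validation_errors) - 1


# Made-up validation errors, one per boosting round
errors = [0.30, 0.20, 0.15, 0.16, 0.15, 0.17, 0.18, 0.14]
best_round, stopped_at = early_stopping_round(errors, patience=3)
print("best round:", best_round, "stopped at round:", stopped_at)
```

Note the trade-off this toy run exposes: the lower error in the last round is never reached, because training has already stopped. A larger patience value makes that less likely at the cost of more rounds.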

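3.5 Cross-validation (CV)
k-fold cross-validation:
Split the training data into k folds and repeat k times, each time holding out one fold for validation and training on the remaining k-1 folds;
The average performance over the k validation folds is taken as the estimate of the model's performance on the test set.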

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# shuffle=True is needed for random_state to have any effect
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(bst, X_train, y_train, cv=kfold)
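
To show what the k-fold procedure does with the sample indices, here is a minimal sketch without shuffling or stratification (both of which StratifiedKFold adds). The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validate = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validate))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train, validate in folds:
    print("train:", train, "validate:", validate)
```

Every sample is used for validation exactly once, which is why the averaged score uses all of the data without ever scoring a model on samples it was trained on.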

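3.6 Parameter tuning with GridSearchCV

GridSearchCV selects model parameters based on cross-validated evaluation: given a range (grid) for the parameters to tune, it evaluates the model for each parameter combination and returns the best model and its parameters.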

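# sklearn.grid_search was removed; GridSearchCV now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
# Search range for the number of boosting iterations
param_test = {'n_estimators': range(1, 51, 1)}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
# grid_scores_ was replaced by cv_results_ in newer scikit-learn releases
clf.cv_results_, clf.best_params_, clf.best_score_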
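
The search itself is conceptually simple: try every combination in the grid and keep the one with the best score. The sketch below shows that loop in plain Python. It is not scikit-learn's implementation; grid_search and toy_score are hypothetical names, and toy_score is a made-up stand-in for a cross-validated accuracy.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical score that peaks at n_estimators=30, max_depth=2
    return 1.0 - abs(params["n_estimators"] - 30) / 100 - abs(params["max_depth"] - 2) / 10


grid = {"n_estimators": [10, 20, 30, 40], "max_depth": [1, 2, 3]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the grid sizes (12 here), and with cross-validation each evaluation trains the model once per fold, so wide grids become expensive quickly.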

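3.7 Model evaluation summary
k-fold cross-validation is usually the gold standard for evaluating machine learning models (k = 3, 5, 10);
When there are many classes, or the classes are imbalanced, use stratified cross-validation;
When the training set is large enough that a train/test split introduces little bias into the performance estimate, or when the model is slow to train, use a train/test split;
For regression, use 10-fold cross-validation; for classification, use stratified 10-fold cross-validation.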
