Confidence intervals for XGBoost

Gradient Boosting methods are a very powerful tool for performing accurate predictions quickly, on large datasets, for complex variables that depend non-linearly on many features.


Moreover, the technique has been implemented in various ways: XGBoost, CatBoost, GradientBoostingRegressor, each having its own advantages, discussed here or here. Something these implementations all share is the ability to choose a given objective for training to minimize. And even more interesting is the fact that XGBoost and CatBoost offer easy support for a custom objective function.


Why do I need a custom objective?


Most implementations provide standard objective functions, like Least Square, Least Deviation, Huber, RMSE, … But sometimes, the problem you’re working on requires a more specific solution to achieve the expected level of precision. Using a custom objective is usually my favourite option for tuning models.


Can you provide us with an example?


Sure! Recently, I’ve been looking for a way to associate the prediction of one of our models with confidence intervals. As a short reminder, confidence intervals are characterised by two elements:


  1. An interval [x_l, x_u]
  2. The confidence level, i.e. the probability that the true value lies in this interval.

For instance, we can say that the 99% confidence interval of average temperature on earth is [-80, 60].


Associating confidence intervals with predictions allows us to quantify the level of trust in a prediction.


How do you compute confidence intervals?


You’ll need to train two models:

  • One for the upper bound of your interval
  • One for the lower bound of your interval

And guess what? You need specific metrics to achieve that: Quantile Regression objectives. Both the scikit-learn GradientBoostingRegressor and CatBoost implementations provide a way to compute these, using Quantile Regression objective functions, but both use the non-smooth standard definition of this regression:
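In the usual (pinball) form, this objective reads:

$$
L_\alpha(t, a) = \sum_i w_i \left[ \alpha \,(t_i - a_i)\,\mathbf{1}_{t_i \ge a_i} + (1 - \alpha)\,(a_i - t_i)\,\mathbf{1}_{t_i < a_i} \right]
$$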

where t_i is the i-th true value and a_i is the i-th predicted value. The w_i are optional weights used to weight the errors, and alpha defines the quantile.

For instance, using this objective function, if you set alpha to 0.95, 95% of the observations are below the predicted value. Conversely, if you set alpha to 0.05, only 5% of the observations are below the prediction. And 90% of real values lie between these two predictions.

Let’s plot it using the following code, for the range [-10, 10] and various alphas:

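A minimal sketch of such a plotting snippet (the helper name quantile_loss and the sampled alphas are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def quantile_loss(error, alpha):
    # error = prediction - true value; under-prediction (error < 0) is
    # penalised by alpha, over-prediction by (1 - alpha).
    return np.where(error < 0, -alpha * error, (1 - alpha) * error)

x = np.linspace(-10, 10, 500)
for alpha in (0.05, 0.5, 0.95):
    plt.plot(x, quantile_loss(x, alpha), label=f"alpha={alpha}")
plt.xlabel("error")
plt.ylabel("loss")
plt.legend()
plt.show()
```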

As you can see in the resulting plot below, this objective function is continuous but its derivative is not. There is a singularity at (0, 0): with respect to the error, it is a C_0 function, but not a C_1 function. This is an issue, as gradient boosting methods require an objective function of class C_2, i.e. one that can be differentiated twice to compute the gradient and Hessian matrices.

[Figure: the quantile regression objective plotted over [-10, 10] for several values of alpha]

If you are familiar with the MAE objective, you should have recognized that these quantile regression functions are simply the MAE, scaled and rotated. If you’re not, the screenshot below should convince you:

[Figure: the quantile regression objectives next to the MAE and the logcosh]

The logcosh objective


As a reminder, the formula for the MAE objective is simply


MAE(t, a) = (1/n) · Σ_i |t_i - a_i|

The figure above also shows a regularized version of the MAE, the logcosh objective. As you can see, this objective is very close to the MAE, but is smooth, i.e. its derivative is continuous and differentiable. Hence, it can be used as an objective in any gradient boosting method, and provides a reasonable rate of convergence compared to default, non-differentiable ones.

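Concretely, the first two derivatives of log cosh exist and are continuous everywhere, which is exactly what is needed later to supply a gradient and a Hessian:

$$
\frac{d}{dx}\log\cosh(x) = \tanh(x), \qquad \frac{d^2}{dx^2}\log\cosh(x) = \frac{1}{\cosh^2(x)}
$$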

And as it is a very close approximation of the MAE, if we manage to scale and rotate it, we’ll get a twice differentiable approximation of the quantile regression objective function.


You might have noticed that there is a slight offset between the curve of the MAE and the log cosh. We will explain that in detail a little further below.


The formula for the logcosh is straightforward:

logcosh(t, a) = (1/n) · Σ_i log(cosh(a_i - t_i))

Rotation and scaling of the logcosh


All we need to do now is to find a way to rotate and scale this objective so that it becomes a good approximation of the quantile regression objective. Nothing complex here. As logcosh is similar to the MAE, we apply the same kind of change as for the Quantile Regression, i.e. we scale it using alpha:

[Figure: smooth quantile regression using log cosh]

That can be done with these twelve lines of code:

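A sketch of those twelve-or-so lines (the helper name smooth_quantile_loss and the scaling convention are illustrative choices, and may differ in detail from the original snippet):

```python
import numpy as np
import matplotlib.pyplot as plt

def smooth_quantile_loss(error, alpha):
    # log(cosh) behaves like |error|, so scaling it by alpha on one side and
    # (1 - alpha) on the other gives a smooth version of the quantile loss.
    scale = np.where(error < 0, alpha, 1 - alpha)
    return scale * np.log(np.cosh(error))

x = np.linspace(-10, 10, 500)
for alpha in (0.05, 0.5, 0.95):
    plt.plot(x, smooth_quantile_loss(x, alpha), label=f"alpha={alpha}")
plt.legend()
plt.show()
```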

And this works, as shown below:

[Figure: the smoothed quantile regression objective for several values of alpha]

But wait a minute!


You might be curious as to why combining two non-linear functions like log and cosh results in such a simple, near linear curve.


The answer lies in the formula of cosh:

cosh(x) = (e^x + e^(-x)) / 2

When x is positive and large enough, cosh can be approximated by


cosh(x) ≈ e^x / 2   when x >> 0

Conversely, when x is negative enough, cosh can be approximated by


cosh(x) ≈ e^(-x) / 2   when x << 0

We begin to understand how combining these two formulae leads to such linear results. Indeed, as we apply the log to these approximations of cosh, we get:

log(cosh(x)) ≈ log(e^x / 2) = x - log(2)

for x >> 0. The same holds for x << 0:

log(cosh(x)) ≈ log(e^(-x) / 2) = -x - log(2)

It is now clear why these two functions closely approximate the MAE. We also get as a side benefit the explanation for the slight gap between the MAE and the logcosh. It’s log(2)!


Let’s try it on a real example


It is now time to ensure that all the theoretical maths we performed above works in real life. We won’t evaluate our method on a simple sine wave, as proposed in scikit here ;) Instead, we are going to use real-world data, extracted from the TLC trip record dataset, which contains more than 1 billion taxi trips.

The code snippet below implements the idea presented above. It defines the logcosh quantile regression objective log_cosh_quantile, which computes its gradient and its Hessian. Those are required to minimize the objective.
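A sketch of such an objective, written for the native xgboost.train custom-objective API (the factory name log_cosh_quantile comes from the text above; the error convention and the exact scaling are assumptions):

```python
import numpy as np

def log_cosh_quantile(alpha):
    """Smooth quantile objective: a log cosh approximation of the pinball loss.

    Returns a function with the (predt, dtrain) signature expected by
    xgboost.train for custom objectives.
    """
    def objective(predt, dtrain):
        y = dtrain.get_label()
        err = predt - y
        # Scale each side of the loss as in the quantile regression objective.
        scale = np.where(err < 0, alpha, 1 - alpha)
        # loss = scale * log(cosh(err)); its first and second derivatives are:
        grad = scale * np.tanh(err)
        hess = scale / np.cosh(err) ** 2
        return grad, hess
    return objective
```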

As stated at the beginning of this article, we need to train two models, one for the upper bound, and another one for the lower bound.

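A hedged sketch of that training step (the hyperparameter values and variable names are illustrative, not the tuned values behind the results below; X_train, y_train and X_test are assumed to come from the data preparation sketched further down):

```python
import xgboost as xgb

params = {"max_depth": 6, "eta": 0.1, "subsample": 0.8}  # illustrative values

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)

# One model per bound, each with its own quantile objective.
upper_model = xgb.train(params, dtrain, num_boost_round=300,
                        obj=log_cosh_quantile(0.95))
lower_model = xgb.train(params, dtrain, num_boost_round=300,
                        obj=log_cosh_quantile(0.05))

upper_bound = upper_model.predict(dtest)
lower_bound = lower_model.predict(dtest)
```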

The remaining part of the code simply loads data and performs minimal data cleaning, mainly removing outliers.

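As a rough illustration only, such a preparation step might look like the sketch below; the file name, the feature columns and the choice of fare_amount as the target are assumptions rather than the author's exact setup:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load only the first 100,000 rows of the January 2020 yellow taxi file.
df = pd.read_csv("yellow_tripdata_2020-01.csv", nrows=100_000)

# Minimal cleaning: drop rows with implausible values (outliers).
df = df[(df["fare_amount"] > 0) & (df["fare_amount"] < 200)]
df = df[(df["trip_distance"] > 0) & (df["trip_distance"] < 100)]

features = ["trip_distance", "passenger_count"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["fare_amount"], test_size=0.25, random_state=42)
```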

In this code, we have chosen to compute the 90% confidence interval. Hence we use alpha=0.95 for the upper bound, and alpha=0.05 for the lower bound.


Hyperparameter tuning has been done manually, using fairly standard values. It could certainly be improved, but the results are good enough to illustrate this paper.


The last lines of the script are dedicated to the plotting of the first 150 predictions of the randomly built test set with their confidence interval:
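A sketch of that plotting step, assuming y_test, lower_bound and upper_bound from the previous sketches:

```python
import matplotlib.pyplot as plt

n = 150  # first 150 test-set predictions
plt.figure(figsize=(12, 5))
plt.fill_between(range(n), lower_bound[:n], upper_bound[:n],
                 alpha=0.3, label="90% confidence interval")
plt.plot(y_test.values[:n], "k.", label="true value")
plt.legend()
plt.show()
```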

[Figure: the first 150 test-set predictions with their 90% confidence intervals]

Note that we have also included at the end of the script a counter to evaluate the number of real values that fall within their predicted confidence interval. On our test set, 22,238 out of 24,889 (89.3%) of the real values were within the calculated confidence interval.
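A possible version of that counter, under the same assumptions as above:

```python
import numpy as np

y_true = y_test.values
inside = np.sum((y_true >= lower_bound) & (y_true <= upper_bound))
print(f"{inside} / {len(y_true)} = {inside / len(y_true):.1%} "
      "of the true values fall inside the predicted interval")
```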

The model has been trained on the first 100,000 rows of the January 2020 file of the TLC trip record dataset.

Conclusion


With simple maths, we have been able to define a smooth quantile regression objective function, that can be plugged into any machine learning algorithm based on objective optimisation.


Using these regularized functions, we have been able to predict reliable confidence intervals for our prediction.


This method has the advantage over the one presented here of being parameter-free. Hyperparameter tuning is already a demanding step in optimizing ML models; we don’t need to increase the size of the configuration space with yet another parameter ;)

Original article: https://towardsdatascience.com/confidence-intervals-for-xgboost-cac2955a8fde
