Under the Hood — Decision Tree

This is the third article in a series in which we examine the “under the hood” workings of various ML algorithms, using their base math equations.

  1. Under the Hood — Linear Regression

  2. Under the Hood — Logistic Regression

  3. Under the Hood — Decision Tree

With so many optimized implementations out there, we sometimes focus too much on the library and the abstraction it provides, and too little on the underlying calculations that go into a model. Understanding these calculations can often be the difference between a good and a great model.

In this series, I focus on implementing the algorithms by hand, to understand the math behind them, which will hopefully help us train and deploy better models.

Decision Tree

Decision Trees are one of the most popular Machine Learning algorithms out there owing to their simplicity and intuitive nature. They form the base for many other algorithms, like Random Forests and Gradient Boosting Machines, which were the staple of most data science competitions up until a few years ago, and they remain among the most versatile and easiest-to-understand ML algorithms.

Decision Trees work very similarly to how we humans make decisions, by asking a series of questions. For instance, we often decide (or used to, until we could just look up the day’s weather forecast on our phones) whether or not to carry an umbrella while going out, based on various decision points, such as —

  1. Is it cloudy?

  2. What month is it?

  3. Has it rained at this time over the last couple of days?

  4. How far am I going?

  5. Do I own an umbrella? (This should be the first question)

Based on the answers to one or more of these questions, we may or may not take the umbrella when going out.

Decision Trees (for classification) work in a similar way, questioning one variable at a time, to find the best split between various classes in the data.

Consider the following data, with two input features — X1 and X2 — and one binary (0/1) target feature — Y.

[Table: sample data with 10 rows]

The algorithm takes the following steps to find an optimal split of the data.

[Figure 0: sample data with two classes]

1. For each input variable, calculate the split of data at various thresholds.

[Figure 1: splits of the data calculated at various thresholds]

2. Choose the threshold that gives the best split.

[Figure 2: the threshold that gives the best split is chosen]

3. Repeat steps 1 & 2 until convergence.

[Figure 3: splits calculated for the remaining data]

4. Combine all selected thresholds to form a chain (tree) of rules for classifying the data.

[Figure 4: selected thresholds combined to form the Decision Tree]

Let’s dive deeper into each of these steps, using our sample data.

1. For each input variable, calculate the split of data at various thresholds

This is the primary step in a Decision Tree algorithm, and comes with two major decision points —

  1. How do we choose the variables to threshold?

  2. How do we choose the thresholds?

For the first one, the most common strategy is to consider all variables, and check which variable + threshold combination gives the best split of the data.

The second point is where things get a bit more interesting. There can be various ways to choose the threshold for each variable —

  1. Choose random values within the range of each variable. For instance, for variable X2, the values range between 3 and 9 (assuming the last two rows are reserved for validating our “model”). We could choose integer thresholds at 3, 4, 5, 6, 7, 8, and 9.

    As is evident, this will not scale well as the range of the variable grows to hundreds, or when dealing with floating-point features. How many floating-point thresholds can we fit between 3.6 and 3.7?

  2. Another option is to sort the unique values of the variable and select a threshold between each pair of consecutive values, using the max/min/average. This reduces our search space to a subset of the values present in our data, since the thresholds are selected based on the values that actually occur rather than the entire range.

    For the variable X2, the thresholds would look like —

[Table: threshold strategies for variable X2]

Similarly, for X1 —

[Table: threshold strategies for variable X1]

There is an even better way to choose thresholds that lowers the search space further. We will look at it after we have an understanding of the “best split” and how a decision node is formed using these thresholds.

Let’s proceed with choosing the Threshold(Avg) values as our thresholds.
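
To make this concrete, here is a small Python sketch of the Threshold(Avg) strategy. The function name and the example values are illustrative, not taken from the original article; only the idea of averaging consecutive sorted unique values is.

```python
import numpy as np

def candidate_thresholds(values):
    """Candidate thresholds: averages of consecutive sorted unique values."""
    unique_vals = np.unique(values)                  # np.unique also sorts
    return (unique_vals[:-1] + unique_vals[1:]) / 2

# Illustrative feature column (not the article's actual data)
x2 = np.array([3, 4, 5, 6, 7, 8, 9])
print(candidate_thresholds(x2))                      # [3.5 4.5 5.5 6.5 7.5 8.5]
```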

2. Choose the threshold that gives the best split

To choose the threshold that gives the best split, we first need to define what a “best split” is.

To understand it visually, we can see in the following image that X1=t2 looks like the best split, as the left half of the split contains only blue points (a pure node). In other words, it can be said that if a randomly chosen point has X1<t2, the point is blue. We cannot make the same inference at t1 or t3.

[Figure: candidate splits of X1 at thresholds t1, t2, and t3]

Thus, at each threshold, we check whether the distribution of the points on either side of the threshold is better than before the split. We choose the threshold which gives the best distribution of data — ideally, the threshold that puts most points of a single class (blue, in our case) on one side while keeping few (preferably 0) instances of the other class (red, in our case) on that side of the split.

Mathematically, we compare the measure of impurity (density of points of each class) of the data before splitting with the average measure of impurity of the splits, and choose the threshold that gives the maximum difference. For our case, let’s consider the “Gini Impurity”, defined as —

G = 1 − Σᵢ pᵢ²

where k is the number of classes, and pᵢ is the probability of class i.

For our data, with two classes (k = 2), this becomes:

G = 1 − (p₀² + p₁²)
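
As a minimal sketch, here is the same definition in Python (the function name is mine; the calculation follows the formula above):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini Impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```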

To calculate the Gini Impurity for the thresholds we chose in Step 1, we start by sorting the data along the X2 variable —

[Table: data sorted by feature X2]

Our data has four points each of class 0 and class 1. Thus, the Gini Impurity before the split can be calculated as —

G(before split) = 1 − ((4/8)² + (4/8)²) = 1 − (0.25 + 0.25) = 0.5

If we consider the split at a threshold of X2=5, the data distribution looks like —

[Figure: data distribution after splitting at X2 = 5]

We end up with 4 rows on each side of the split. We can calculate the Gini Impurity of both splits as —

[Figures: Gini Impurity calculation for each side of the X2 = 5 split]

Thus, we can calculate the overall change (reduction) in impurity as —

ΔG = G(before split) − (G(left split) + G(right split)) / 2

This change in impurity (ΔG) is also known as the “Information Gain”. We calculate the same for all thresholds of X2 —
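
Combining the two ideas, here is a sketch of the gain calculation as it is used in this article, averaging the impurities of the two splits. (Many library implementations weight the split impurities by the number of samples in each split instead; the simple average is what matches the description and the numbers used here.)

```python
def information_gain(parent_labels, left_labels, right_labels):
    """Information Gain: impurity before the split minus the average
    impurity of the two resulting splits."""
    before = gini_impurity(parent_labels)
    after = (gini_impurity(left_labels) + gini_impurity(right_labels)) / 2
    return before - after

# Sanity check: splitting 4 + 4 points into two pure halves recovers the full 0.5
print(information_gain([0]*4 + [1]*4, [0]*4, [1]*4))   # 0.5
```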

[Table: information gain at each threshold of X2]

The above table shows that the best possible gain when making a decision using feature X2 is at the threshold value of 7.

We need to repeat the process for the other feature, X1. Following the same steps as we did for X2, we get —

[Table: information gain at each threshold of X1]

We notice that we get the same highest gain value (0.340) if we split the data at X1=5.5.

In such a scenario, where we get the same maximum gain for multiple features, the first feature from the left (as in the original data) is chosen. Following this rule for our case, the “best split” occurs at X1=5.5.
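
Using the helpers sketched above, the exhaustive search over features and thresholds could look like the following. The strict '>' comparison is what keeps the first (left-most) feature when gains are tied, as described above; the names are mine.

```python
import numpy as np

def best_split(X, y):
    """Return (feature_index, threshold, gain) for the highest-gain split."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):                       # features, left to right
        for t in candidate_thresholds(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            gain = information_gain(y, left, right)
            if gain > best[2]:                        # strict '>' keeps earlier ties
                best = (j, t, gain)
    return best
```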

Thus, with our decision node at X1=5.5, we can start creating a “Decision Tree” like —

[Figure: Decision Tree with a single decision node at X1 = 5.5, splitting the data into Split 1 and Split 2]

3. Repeat steps 1 & 2 until convergence

We repeat steps 1 & 2 on each of the resulting splits until all of them contain pure samples (have an impurity of zero).
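
A recursive sketch of this loop, reusing the helpers above; the dict-based node layout is my own choice, and a real implementation would usually also stop on a maximum depth or a minimum number of samples.

```python
def build_tree(X, y):
    """Split recursively until every node is pure (Gini Impurity of zero)."""
    if gini_impurity(y) == 0:                 # pure node: predict its single class
        return {"class": y[0]}
    feature, threshold, _ = best_split(X, y)
    mask = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask]),
        "right": build_tree(X[~mask], y[~mask]),
    }
```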

One of the datasets (Split 1) generated from the split at X1=5.5 in the previous step is already pure — it has only samples from class 0. Hence, we can say that this branch has reached its optimum point and we do not need to split it any further.

We work with data from Split 2 for selecting further thresholds —

[Table: data from Split 2]

If we follow the strategy of sorting the unique values of the feature and selecting thresholds at the average of consecutive values, we will end up with thresholds at 6.5, 7.5 and 8.5 for X1, and 3.5, 5, and 7.5 for X2. A quick visual inspection shows that most of these splits will be sub-optimal. The best split is at X1=8.5.

This brings us back to the point of choosing the best strategy to select thresholds. Now that we have seen how the best split is calculated, and considering that we only look at one feature at a time, we can say that the ideal threshold candidates are located at the points of transition between classes. That is, when the feature is sorted, only the points where the target changes from 0 to 1 (or the other way round) are good decision points; any other location would mean splitting points of the same class, which would lead to an increase in impurity across both our splits.

[Figure: Split 2 sorted by each feature, with candidate thresholds at the class transitions]

Thus we can see that when selecting a threshold for X1, our ideal candidate is 8.5, as there is no change in the target value (1) when moving from X1=6 to X1=7, or from 7 to 8.
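
A sketch of this refinement, keeping the same midpoint convention as before: candidate thresholds are kept only where the target value changes along the sorted feature.

```python
import numpy as np

def transition_thresholds(values, labels):
    """Candidate thresholds only at class transitions along the sorted feature."""
    order = np.argsort(values)
    v, t = np.asarray(values)[order], np.asarray(labels)[order]
    flips = t[:-1] != t[1:]                        # True where the class changes
    return (v[:-1][flips] + v[1:][flips]) / 2      # midpoints at those positions
```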

Thus, our gain at X1=8.5 —

ΔG = G(Split 2) − (0 + 0) / 2 = 0.32 − 0 = 0.32, since both resulting splits are pure

Similarly, for X2, our only candidate for threshold is X2=6, giving us an equivalent gain value of 0.32.
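
With the helpers sketched earlier, we can reproduce this number. The class counts below (four class-1 rows and one class-0 row in Split 2) are an assumption on my part, chosen to be consistent with the gain values quoted in this article rather than copied from the original table.

```python
split2_labels = [1, 1, 1, 1, 0]    # assumed composition of Split 2, not the original data
print(gini_impurity(split2_labels))                          # 0.32
# Splitting at X1 = 8.5 isolates the lone class-0 row, leaving two pure nodes:
print(information_gain(split2_labels, [1, 1, 1, 1], [0]))    # 0.32
```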

This gives us the second decision node at X1=8.5, rendering our decision tree as —

[Figure: the final Decision Tree, with decision nodes at X1 = 5.5 and X1 = 8.5]

As all our nodes are pure, no further splits are required.

Thus our end nodes are — Split-1 with class 0, Split-3 with class 1, and Split-4 with class 0.

We can predict the class of a new data point by traversing the tree and evaluating which end node it falls in.
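
For the tree we just built, this traversal boils down to two nested comparisons. A minimal sketch (whether the comparisons are strict is my assumption; at these thresholds it makes no difference for the data discussed here):

```python
def predict(x1, x2):
    """Traverse the tree: decision nodes at X1 = 5.5 and X1 = 8.5."""
    if x1 < 5.5:
        return 0    # Split-1
    elif x1 < 8.5:
        return 1    # Split-3
    return 0        # Split-4 (note that x2 is never consulted)
```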

Let’s validate the accuracy of our decision tree by testing with the two rows of data we did not use during training —

[Table: the two rows held out for validation]

For the first data point (3, 5), since X1 is less than 5.5, the point ends up in Split-1 with a predicted class of 0, the same as our actual target class.

[Figure: the point (3, 5) traversing the tree to Split-1]

Similarly, for the other data point, we see that the sample ends up in Split-3, with a prediction of 1, which matches the actual class.

[Figure: the second validation point traversing the tree to Split-3]

And that’s it. At its core, this is all that the Decision Tree algorithm does.

Is this really “all” that a Decision Tree does?

“Under the Hood” being the focus of this series, we took a look at the foundations of the Decision Tree, building one node at a time to fit our data.

While it’s true that this is what a Decision Tree does at its core, there are a few intricacies that I have skipped, like —

  1. Why Gini Impurity?

  2. What other impurity metrics are there to select the best split?

  3. Decision Trees’ tendency to over-fit the data.

  4. Feature Importance — We can see that both our decision points depend only on the variable X1 and not on X2. Does this mean we can discard X2?

  5. Pruning and aggregating nodes.

  6. How are predicted probabilities calculated?

I will cover these concepts in a parallel series focused on the various intricacies of the different ML algorithms we cover in this series.

What’s Next?

In the next article in this series, we will continue with Decision Trees, look under the hood of Decision Trees for regression, and learn how they are used to fit and output a continuous target value.

Translated from: https://towardsdatascience.com/under-the-hood-decision-tree-454f8581684e
