The Ultimate Guide of Classification Metrics for Model Evaluation

Model evaluation is an essential part of machine learning. When dealing with classification problems, there are many metrics to choose from, and the choice sometimes causes confusion. This article first discusses and compares the commonly used metrics for a binary classification problem, not only by their definitions but also by the situations in which one metric is preferred over another. Lastly, it discusses how to adjust the model in favor of certain metrics.

Common metrics for binary classification problems

Binary classification problems are typical supervised machine learning problems with binary target values. We usually refer to the target values as the positive class and the negative class. When evaluating the performance of a model, the most common and straightforward metric is accuracy: the number of correct predictions divided by the total number of predictions:

Accuracy = Number of correct predictions / Total number of predictions
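
To make this concrete, here is a minimal sketch of computing accuracy with scikit-learn; the y_true and y_pred arrays below are hypothetical labels invented for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Accuracy = correct predictions / total predictions
print((y_true == y_pred).mean())       # 0.625, computed by hand
print(accuracy_score(y_true, y_pred))  # 0.625, same value from scikit-learn
```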

There are at least two situations in which accuracy is not enough:

First, when the data is imbalanced. If the training data is imbalanced towards the negative class, say 99% of the data is in the negative class, then any model that always predicts “negative” will achieve 99% accuracy. Does that mean such a model is an exceptional model to deploy? Obviously not. Thus, besides the methods we can use to deal with imbalanced datasets, we need metrics that can distinguish the accuracy on the positive class from the accuracy on the negative class.

Second, when we want to know how a model performs at predicting one class in particular. For example, if I care more about the model predicting the positive class correctly than predicting the negative class correctly, accuracy by itself does not give me enough information.

Therefore, we need to consider other metrics to find the best model for different scenarios. Let’s discuss the “accuracy” of the positive and negative classes separately. In this article, I will combine machine learning model evaluation with traditional statistical hypothesis testing to illustrate the connections between certain concepts. In hypothesis testing, we usually define the null hypothesis as no effect, or negative, while testing the alternative hypothesis for a positive effect. In binary machine learning models, we test whether a data point is in the positive or the negative class. Thus, in machine learning, predicting that a data point is in the negative class is the same as failing to reject the null hypothesis (accepting H0) in hypothesis testing. Here is a table that connects hypothesis testing and machine learning:

|                               | Predict Negative (accept H0)       | Predict Positive (reject H0)      |
|-------------------------------|------------------------------------|-----------------------------------|
| Actual Negative (H0 is true)  | True Negative (TN)                 | False Positive (FP), Type I error |
| Actual Positive (H0 is false) | False Negative (FN), Type II error | True Positive (TP)                |

Columns 2 and 3 show the model’s predictions, whether we are using hypothesis testing or a machine learning model, while rows 2 and 3 represent the true class of the data. We define a True Negative (TN) as the model predicting negative when the data is in the negative class, and vice versa for a True Positive (TP). We further define a False Positive (FP) as the model predicting positive while the data is in the negative class, and a False Negative (FN) as the model predicting negative while the data is in the positive class. The true predictions are TN + TP, while the false predictions are FP + FN. In statistical analysis, an FP is a Type I error because you reject H0 (reject negative) when H0 is true (truly negative). An FN is a Type II error: accepting H0 (accepting negative) while H0 is false (truly positive). One minus the Type I error rate is the confidence level, while one minus the Type II error rate is the statistical power.

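Each cell of this table can be read directly off a confusion matrix. A minimal sketch with scikit-learn, reusing the hypothetical labels from the accuracy sketch above; for binary 0/1 labels, confusion_matrix puts true classes in rows and predicted classes in columns, so ravel() yields TN, FP, FN, TP in that order:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])  # hypothetical predictions

# Rows are true classes, columns are predicted classes (labels sorted: 0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp} (Type I), FN={fn} (Type II), TP={tp}")
# TN=2, FP=2 (Type I), FN=1 (Type II), TP=3
```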

Going back to binary model metrics, we can redefine accuracy as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

To test the “accuracy” of different classes, I will introduce two pairs of metrics: recall and precision, and sensitivity and specificity.

Recall and Precision: what and when

If we want to make sure that when the model predicts positive, the data is very likely truly positive, then we can check out precision:

Precision = TP / (TP + FP)

which is also called the Positive Predictive Value.

If we want to make sure the model misses as few truly positive points as possible, that is, when the data is positive, the model very likely predicts positive, then we can check out recall:

Recall = TP / (TP + FN)

which is the same formula as sensitivity.

Comparing precision and recall, we can see that the two formulas are the same except that precision has FP in the denominator while recall has FN. Thus, to increase precision, the model needs as few FPs as possible, while FNs can be ignored. In contrast, to increase recall, the model needs as few FNs as possible, without caring about FPs.

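Both metrics are a single call each in scikit-learn. A quick sketch on the same hypothetical labels; precision_score and recall_score implement exactly the two formulas above:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])  # hypothetical predictions

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 5 = 0.6
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```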

Let’s consider a scenario in which we are trying to find the targeted customers to deliver advertisements to. Let’s define customers in the positive class as the customers who would make a purchase after seeing the advertisements. The goal here is to show advertisements to positive customers so that they make a purchase. In this way, we make the best use of the advertisements and the company makes more profit. How do we evaluate the model to find the best one? We can only answer this question based on the business setting.

Scenario one: the cost of advertisements is high. The cost can be the actual cost of sending advertisements, measured by the time and effort of employees, or the cost of losing potential customers by showing them advertisements too frequently. When the advertisement cost is high, the model needs to be very precise in making positive predictions. Thus, precision is the best metric, since you want to make sure that most of the positive predictions are correct.

Scenario two: the cost of advertisements is low, so it is okay to show advertisements to customers who are actually in the negative class. Here the goal is to make sure that all customers in the positive class receive advertisements; we should not leave out any positive customers. Thus, the model needs to reduce False Negatives and increase recall.

Sensitivity and Specificity: what and when

The other pair of metrics is usually used in medical settings, for example, when testing whether a patient has a certain disease. They are sensitivity and specificity:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Sensitivity evaluates how many of all truly positive points the model correctly predicts as positive, while specificity evaluates how many of all truly negative points the model correctly predicts as negative. In the example of testing whether a person is healthy (negative) or sick (positive), sensitivity measures how many of all the patients are successfully identified. High sensitivity means the test correctly identifies patients with the disease. Specificity measures how many of all the healthy people test negative. The specificity of a test refers to how well the test identifies people who do not have the disease.

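scikit-learn has no dedicated specificity function, but specificity follows directly from the confusion matrix, or equivalently from recall_score with the negative class set as pos_label. A minimal sketch on the same hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                             # sensitivity = 0.75
print(tn / (tn + fp))                             # specificity = 0.5
print(recall_score(y_true, y_pred, pos_label=0))  # specificity again, 0.5
```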

The combinations of metrics

In some scenarios, we only care about lowering FP or FN, but in other cases we care about both. Besides accuracy, we have other metrics that measure both FP and FN, so we can adjust the model to lower them at the same time. The widely used ones are AUC (area under the ROC curve), F scores, etc. The ROC (receiver operating characteristic) curve shows the performance of a classification model at all classification thresholds. It plots the combination of the True Positive Rate (TPR) and the False Positive Rate (FPR):

TPR = TP / (TP + FN)
TPR is also sensitivity.

FPR = FP / (FP + TN)
FPR is 1 - specificity.

A higher AUC is always more desirable. In the perfect situation where FN and FP are both zero, TPR is one and FPR is zero, which makes the AUC equal to one. Thus, AUC evaluates both FN and FP.

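Here is a minimal sketch of the ROC curve and its AUC with scikit-learn. Note that roc_curve and roc_auc_score take the predicted probability (or score) of the positive class rather than hard labels; the y_score values below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # hypothetical labels
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # hypothetical P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # area under the ROC curve, 0.9375
```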

In the same way, we can use the Precision-Recall curve, instead of the ROC curve, to calculate the area under the Precision-Recall curve. The difference is that precision and recall are plotted on the axes. The two curves compare as follows:

[Figure: ROC curve vs. Precision-Recall curve. Image by Author]

Both curves help measure FN and FP at the same time. However, ROC curves should be used when there are roughly equal numbers of observations in each class, while Precision-Recall curves should be used when there is a moderate to large class imbalance.

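The Precision-Recall counterpart looks almost identical. A minimal sketch; average_precision_score is a closely related one-number summary of the same curve that scikit-learn also provides:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # hypothetical labels
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # hypothetical P(positive)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                    # area under the PR curve
print(average_precision_score(y_true, y_score))  # related one-number summary
```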

The F score is a formula that combines both recall and precision, thus it measures and helps lower FN and FP at the same time. The F1 score, for example, is the harmonic mean of the two: F1 = 2 × (Precision × Recall) / (Precision + Recall).

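A minimal sketch with scikit-learn: f1_score weights precision and recall equally, while fbeta_score lets you tilt the balance (beta > 1 favors recall, beta < 1 favors precision):

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])  # hypothetical predictions

print(f1_score(y_true, y_pred))             # 2 * P * R / (P + R), about 0.667
print(fbeta_score(y_true, y_pred, beta=2))  # weights recall higher, about 0.714
```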

How to adjust the model in favor of specific metrics?

The last question one may ask is: if I know I want to increase recall, and precision is not important to me, how should I adjust the model? For example, in a study of tumor detection, one may expect high recall, since we do not want to leave any tumor untreated. Thus, when using a machine learning model to solve this binary classification problem, we want to make sure that the number of False Negatives is as small as possible, which means being very cautious when predicting the negative class. If we are using logistic regression, for example, the default threshold for predicting the positive or negative class is 0.5. If we want to increase recall, we can decrease this threshold below 0.5. The model then makes fewer negative predictions, and there will be fewer False Negatives as well.

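Here is a minimal sketch of threshold adjustment with scikit-learn's LogisticRegression; make_classification stands in for a real dataset, and the 0.3 cutoff is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

# Synthetic stand-in data, imbalanced towards the negative class
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]       # predicted P(positive class)
pred_default = (proba >= 0.5).astype(int)  # the default 0.5 threshold
pred_lowered = (proba >= 0.3).astype(int)  # lower threshold: fewer negatives, fewer FN

# Recall goes up at the lower threshold; precision typically goes down
print(recall_score(y, pred_default), precision_score(y, pred_default))
print(recall_score(y, pred_lowered), precision_score(y, pred_lowered))
```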

This is the ultimate guide to binary classification model metrics. Thank you for reading!

Translated from: https://towardsdatascience.com/the-ultimate-guide-of-classification-metrics-for-model-evaluation-83e4cdf294d9
