Using Machine Learning to Predict Car Accidents

Road accidents account for a significant proportion of the serious injuries reported every year. Yet it is often challenging to determine which specific conditions lead to such events, which makes it harder for local law enforcement to reduce the number and severity of road accidents. We all know that some characteristics of vehicles and their surroundings play a key role (engine capacity, condition of the road, etc.). However, many questions remain open. Which of these factors are the leading ones? How much are external factors to blame, compared to the driver's skills?

We leveraged Machine Learning and the United Kingdom’s road accidents database to clarify these questions and, specifically, to have an impact in two major areas:

First, we developed a risk score that quantifies the likelihood of a driver having a fatal or serious accident, based solely on inputs gathered from individual and vehicle data. This score can be used both to shape driving rules and regulations and to inform drivers about the factors that increase their accident risk.

Second, we analysed situational information (such as road type, weather conditions, etc.) to estimate the severity of an accident. Such insights would help governments to better understand the sources of accidents and act to reduce them.

Data

We used more than 220,000 accident reports from the United Kingdom's Department for Transport, covering 2018. For each report, we have the information collected at the scene of the accident, including:

Casualty characteristics (e.g. gender, age and home area type)
Situational variables (e.g. weather, road type and light conditions)
Accident descriptors (e.g. severity, presence of police)
Vehicle descriptors (e.g. age, power, type, model)

Overall, the data provided by the Department for Transport can be grouped into driver information, which can be further broken down into vehicle and individual data, and external information (e.g. accident location and light conditions).

Driver Score

To understand driver risk factors, we created a Driver Score, using each driver’s unique characteristics. Every driver would be able to input information including their age and vehicle type to get back a value describing their risk of having a severe accident. In addition, the model is able to inform that driver about the major factors of their risk.

For example, for some drivers the major source of risk might be an old vehicle, while for others it might be living in a rural area where poor road conditions are common. With this information, individual drivers can make more informed decisions going forward, for instance by purchasing vehicles that pose a lower risk.

To develop a model estimating this score, we focused on ex-ante features only, which are known prior to an accident occurring. We defined the target variable as 0 if an observed accident has low severity or resulted in no casualty, and as 1 if it has severe or fatal consequences (For more information on how to develop a Risk Score, check this article). We trained multiple models on the driver and vehicle features, in order to be able to compare their performance.
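
The target definition above can be sketched as follows. This is a minimal illustration, assuming each record carries a severity label with the values "fatal", "serious" or "slight"; the field names and the helper `to_target` are hypothetical and are not taken from the original dataset schema.

```python
# Hypothetical severity labels; the real dataset may encode them differently.
SEVERE_LABELS = {"fatal", "serious"}


def to_target(severity: str) -> int:
    """Return 1 for severe or fatal accidents, 0 otherwise."""
    return 1 if severity.strip().lower() in SEVERE_LABELS else 0


# Toy records, invented purely for illustration.
accidents = [
    {"driver_age": 34, "vehicle_age": 3, "severity": "slight"},
    {"driver_age": 22, "vehicle_age": 21, "severity": "serious"},
    {"driver_age": 57, "vehicle_age": 8, "severity": "fatal"},
]

targets = [to_target(record["severity"]) for record in accidents]
print(targets)  # [0, 1, 1]
```

Only features known before an accident (here, the hypothetical `driver_age` and `vehicle_age`) would then be passed to the model alongside this target.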

The models used were:

Logistic Regression
Random Forest
XGBoost
Optimal Classification Trees (OCT)

Table: Comparison of the models' performance. Due to the heavily unbalanced distribution of the classes, we used the Area Under the ROC Curve (AUC) to compare the models.
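
To illustrate why accuracy can mislead on unbalanced classes, here is a minimal sketch that computes AUC from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The data and the helper names `auc` and `accuracy` are invented for illustration and do not come from the original analysis.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy data: 9 light accidents (0) and 1 severe accident (1).
y_true = [0] * 9 + [1]

# A "model" that gives everyone the same low score never flags a severe case.
constant_scores = [0.1] * 10
print(accuracy(y_true, constant_scores))  # 0.9
print(auc(y_true, constant_scores))       # 0.5
```

The constant scorer looks strong on accuracy simply because light accidents dominate, while its AUC of 0.5 shows that it cannot separate the two classes at all.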

As can be seen from the table above, Optimal Classification Trees achieved the highest out-of-sample performance. In addition, OCTs are almost fully interpretable because, unlike Random Forest and XGBoost, they are not an ensemble method. In the figure below, we can see a branch of the OCT decision tree, whose splitting criteria are sensible and close to what a human would intuitively expect. In this example, the model shows that if the accident involves a motorcycle with an engine capacity above 200cc and the vehicle is more than 20 years old, then the accident is likely to be severe.

By predicting the probability of getting in a severe accident, we can use the probability as a risk score that becomes the Driver Score. Through this score, we are able to highlight riskier and less risky drivers. By using a highly interpretable model, we can understand the features that drive most of the score by inspecting the decision tree and the variable importance.
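
The idea of using a predicted probability as the score can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: scikit-learn's `DecisionTreeClassifier` (a CART tree) stands in for Optimal Classification Trees, and the data and feature names (`driver_age`, `vehicle_age`, `engine_cc`) are synthetic and hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic ex-ante features (invented for illustration).
driver_age = rng.integers(18, 80, size=n)
vehicle_age = rng.integers(0, 30, size=n)
engine_cc = rng.integers(50, 3000, size=n)
X = np.column_stack([driver_age, vehicle_age, engine_cc])

# Synthetic target: older vehicles are made riskier by construction.
p_severe = 0.05 + 0.4 * (vehicle_age > 20)
y = (rng.random(n) < p_severe).astype(int)

# A shallow tree keeps the model easy to inspect.
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
model.fit(X, y)

# The predicted probability of the severe class serves as the score.
new_driver = np.array([[45, 25, 600]])
driver_score = model.predict_proba(new_driver)[0, 1]
print(f"Driver Score: {driver_score:.2f}")

# Variable importance indicates which features drive the score.
feature_names = ["driver_age", "vehicle_age", "engine_cc"]
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

Because the synthetic risk depends only on vehicle age, that feature should dominate the reported importances; on real data the ranking would of course reflect the actual records.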

Descriptive Statistics

However, driver risk only tells part of the story. We then moved on to analyse descriptive statistics related to the accidents. In particular, we used ex-post information such as the weather condition at the time of the accident, the lighting condition on the road and the road condition itself to better understand the drivers of accidents across the UK. By doing so, we would be able to understand not only how drivers can mitigate their risk (partly using their Driver Score), but also how external factors come into play.

By understanding the external factors driving the risk of an accident, the government can prioritise spending by targeting the major drivers of accidents first. For instance, if we find that the light condition is more important than the road condition itself, the Department for Transport can allocate its limited budget to lighting first and to road quality second.

We used the same target variable, though this time in a descriptive rather than a predictive setting, and a different set of models. As in the previous section, we used stratified sampling and 5-fold cross-validation when training the models, with the following hyper-parameter grids:

Logistic Regression

Lasso and Ridge regularisers with a hyper-parameter grid between 1.0 and 3.0 in steps of 0.1.

CART

Minimum Samples Split varied between 3 and 11
Minimum Samples Leaf between 5 and 13
Maximum Number of Features at each split either “None”, “sqrt”, or “log2”.

Random Forest

Bootstrap equal to “True”
Maximum Features either “sqrt” or “log2”
Minimum Samples Leaf either 5 or 10
Number of Trees either 400, 800, or 1000.

XGBoost

Learning Rate of 0.001, 0.01 or 0.1
Number of Trees of 2500, 2000 or 1500, respectively
Minimum Samples Leaf of 4, 8, or 12.
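
A grid search with stratified 5-fold cross-validation, as described above, might look like the following sketch. It assumes scikit-learn (the original tooling is not stated), uses the CART ranges listed above with a step of 2 to keep the run short (the step size is an assumption), and runs on synthetic data invented for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic, unbalanced data (invented for illustration).
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 1.5).astype(int)

# Grid based on the CART ranges above; the step of 2 is an assumption.
param_grid = {
    "min_samples_split": list(range(3, 12, 2)),
    "min_samples_leaf": list(range(5, 14, 2)),
    "max_features": [None, "sqrt", "log2"],
}

# Stratified folds keep the share of severe cases similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(f"Cross-validated AUC: {search.best_score_:.2f}")
```

The same pattern would apply to the other model families by swapping the estimator and its grid; scoring on AUC rather than accuracy keeps the selection consistent with the unbalanced classes.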

After training the models, we chose a Gradient Boosting Classifier, which achieved the best performance with an out-of-sample AUC of 0.72 and an accuracy of 0.87. Analysing the decision tree and the insights derived from it, we can highlight some key aspects. First of all, as can be seen from the following image, light accidents vary with the time of day and occur more often during rush hours. In contrast, severe or fatal accidents are distributed fairly uniformly across the day.
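
The comparison by time of day can be reproduced in outline by computing, for each severity class, the share of accidents that falls in each hour. The sketch below uses a handful of toy records invented for illustration; the helper `hourly_share` is hypothetical.

```python
from collections import Counter

# Toy records of (hour of day, severity); invented for illustration.
records = [
    (8, "light"), (8, "light"), (8, "light"), (17, "light"),
    (17, "light"), (13, "light"), (2, "severe"), (8, "severe"),
    (13, "severe"), (17, "severe"),
]


def hourly_share(records, severity):
    """Share of accidents of one severity class falling in each hour."""
    hours = [hour for hour, sev in records if sev == severity]
    counts = Counter(hours)
    return {hour: counts[hour] / len(hours) for hour in sorted(counts)}


print(hourly_share(records, "light"))
print(hourly_share(records, "severe"))
```

Normalising within each class makes the two distributions comparable even though light accidents are far more numerous than severe ones.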

Furthermore, we reported the ratio of severe to light accidents for each Police Force area across the UK, with rural areas highlighted in blue and metropolitan areas in red. Locations with larger circles have more severe accidents relative to the number of minor accidents, and vice versa. Intuitively, cities like London have larger-than-average circles, because the level of traffic and congestion in such a large city is higher, leading to more severe accidents.
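
The severe-to-light ratio per area can be computed along the following lines. This is a minimal sketch on toy data; the area names and the helper `severe_to_light_ratio` are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict

# Toy records of (police force area, severity); invented for illustration.
records = [
    ("Area A", "severe"), ("Area A", "light"), ("Area A", "light"),
    ("Area A", "light"), ("Area A", "light"),
    ("Area B", "severe"), ("Area B", "severe"), ("Area B", "light"),
    ("Area B", "light"),
]


def severe_to_light_ratio(records):
    """Ratio of severe to light accidents for each area."""
    counts = defaultdict(lambda: {"severe": 0, "light": 0})
    for area, severity in records:
        counts[area][severity] += 1
    # Areas with no light accidents are skipped to avoid dividing by zero.
    return {
        area: c["severe"] / c["light"]
        for area, c in counts.items()
        if c["light"] > 0
    }


ratios = severe_to_light_ratio(records)
print(ratios)  # {'Area A': 0.25, 'Area B': 1.0}
```

In a map like the one described, each ratio would set the size of the circle drawn for its area.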

However, it is interesting to notice that several locations across the UK, despite being rural, have a ratio of severe-to-light accidents just as high as London’s. This can provide useful insights for local administrations and the broader government, showing which areas in the UK are at highest risk and where improvements and policies should take priority. Examples include Lancashire and the northern areas of Wales close to the border with England. These areas, which our model identifies as being among those most at risk of severe accidents, have also recently been reported as some of the most dangerous by various news outlets.

Summary

We analysed road accident data across the UK to find insights that can drive decisions aimed at saving lives.
We tackled the issue by first developing a Driver Score, which assigns a level of risk to each driver. Each driver is then able to understand whether they are at risk and, most importantly, which characteristics are the main factors.
Second, we analysed external information over which the driver has no control, such as road conditions. We were able to highlight areas across the UK that should be prioritised, so that government funding can be focused on them.
Both of these analyses give us further insight into the underlying causes of accidents, empowering drivers and governments alike to prevent them before they happen.

To read more articles like this, follow me on Twitter, LinkedIn or my Website.

Translated from: https://towardsdatascience.com/using-machine-learning-to-predict-car-accidents-44664c79c942