【论文翻译】Deep Residual Learning for Image Recognition

论文题目:Deep Residual Learning for Image Recognition
论文来源:Deep Residual Learning for Image Recognition
翻译人:[email protected]实验室

Deep Residual Learning for Image Recognition

Kaiming He & Xiangyu Zhang & Shaoqing Ren & Jian Sun

图像识别领域的深度残差学习

Kaiming He & Xiangyu Zhang & Shaoqing Ren & Jian Sun

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

摘要

更深的神经网络更难训练。我们提出一种残差学习框架,用来减轻比以往所用网络深得多的网络的训练负担。我们明确地把各层重新表述为参照层输入来学习残差函数,而不是学习无参照的函数。我们提供了全面的实验证据,表明这些残差网络更容易优化,并且能够从大幅增加的深度中获得精度提升。在ImageNet数据集上,我们评测了深达152层的残差网络,其深度是VGG网络[40]的8倍,但复杂度仍然更低。这些残差网络的集成模型在ImageNet测试集上取得了3.57%的错误率,该结果获得了ILSVRC 2015分类任务的第一名。我们还在CIFAR-10上给出了100层和1000层网络的分析。

表示的深度对许多视觉识别任务都至关重要。仅凭极深的表示,我们就在COCO目标检测数据集上获得了28%的相对提升。深度残差网络是我们提交ILSVRC和COCO 2015竞赛参赛版本的基础,我们还在ImageNet检测、ImageNet定位、COCO检测和COCO分割任务上获得了第一名。

1. Introduction

Deep convolutional neural networks have led to a series of breakthroughs for image classification. Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence reveals that network depth is of crucial importance, and the leading results on the challenging ImageNet dataset all exploit “very deep” models, with a depth of sixteen to thirty. Many other nontrivial visual recognition tasks have also greatly benefited from very deep models.

深度卷积神经网络为图像分类带来了一系列突破。深度网络以端到端的多层方式自然地整合了低/中/高层特征和分类器,特征的“层次”可以通过堆叠层数(深度)来丰富。最近的研究表明网络深度至关重要,在具有挑战性的ImageNet数据集上,领先的结果都采用了“非常深”的模型,深度从16层到30层不等。许多其他非平凡的视觉识别任务也从非常深的模型中获益匪浅。

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation.

在深度的重要性的驱使下,一个问题随之出现:学习更好的网络是否就像堆叠更多层那样容易?回答这个问题的一大障碍是臭名昭著的梯度消失/爆炸问题,它从一开始就阻碍收敛。不过,这个问题已经在很大程度上被归一化的初始化和中间的归一化层所解决,使得几十层的网络能够在带反向传播的随机梯度下降(SGD)下开始收敛。

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in and thoroughly verified by our experiments. Fig. 1 shows a typical example.

当更深的网络能够开始收敛时,一个退化问题暴露了出来:随着网络深度的增加,精度先是达到饱和(这并不意外),然后迅速下降。出人意料的是,这种退化并不是由过拟合引起的,在一个深度合适的模型上增加更多的层反而会带来更高的训练误差,这一点在相关工作中已有报告,并被我们的实验充分验证。图1展示了一个典型的例子。


Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.

图1 20层和56层“普通”网络在CIFAR-10上的训练误差(左)和测试误差(右)。更深的网络训练误差更高,测试误差也随之更高。图4展示了ImageNet上的类似现象。

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

(训练精度的)退化表明,并不是所有系统都同样容易优化。让我们考虑一个较浅的架构,以及在它之上添加更多层得到的较深架构。对这个较深的模型存在一种构造解:添加的层是自身映射(恒等映射),其余层直接从训练好的较浅模型中复制而来。这种构造解的存在表明,较深的模型产生的训练误差不应高于较浅的模型。但实验表明,我们现有的求解器无法找到与这种构造解相当或更好的解(或者无法在可行的时间内找到)。

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as $\mathcal{H}({\rm{x}})$, we let the stacked nonlinear layers fit another mapping of $\mathcal{F}({\rm{x}}):=\mathcal{H}({\rm{x}})-{\rm{x}}$. The original mapping is recast into $\mathcal{F}({\rm{x}})+{\rm{x}}$. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

在本文中,我们通过引入一个深度残差学习框架来解决这个退化问题。我们不是期望每几个堆叠的层直接拟合所需的底层映射,而是明确地让这些层去拟合一个残差映射。形式上,把所需的底层映射记作 $\mathcal{H}({\rm{x}})$,我们让堆叠的非线性层去拟合另一个映射 $\mathcal{F}({\rm{x}}):=\mathcal{H}({\rm{x}})-{\rm{x}}$,于是原来的映射就变成 $\mathcal{F}({\rm{x}})+{\rm{x}}$。我们假设残差映射比原来的、无参照的映射更容易优化。在极端情况下,如果自身(恒等)映射是最优的,那么把残差推向零要比用一堆非线性层去拟合一个自身映射容易得多。

The formulation of $\mathcal{F}({\rm{x}})+{\rm{x}}$ can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

$\mathcal{F}({\rm{x}})+{\rm{x}}$ 的公式可以通过带“快捷连接”的前馈神经网络来实现(图2)。快捷连接就是跳过一层或多层的连接。在我们的情形中,快捷连接只是执行自身映射,它们的输出被加到堆叠层的输出上(图2)。自身快捷连接既不增加额外的参数,也不增加计算复杂度。整个网络仍然可以用SGD加反向传播进行端到端训练,并且可以用通用框架(例如Caffe[19])轻松实现,而无需修改求解器。


Figure 2. Residual learning: a building block
图2 残差学习:一个构造块

We present comprehensive experiments on ImageNet to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

我们在ImageNet上进行了全面的实验来展示退化问题并评估我们的方法。我们发现:1)我们的极深残差网络很容易优化,而对应的“普通”网络(简单堆叠层)在深度增加时表现出更高的训练误差;2)我们的深度残差网络可以轻松地从大幅增加的深度中获得精度提升,产生比以往网络好得多的结果。

Similar phenomena are also shown on the CIFAR-10 set, suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

类似的现象在CIFAR-10数据集上同样出现,这表明优化的困难以及我们方法的效果并不只是针对某个特定的数据集。我们在这个数据集上成功地训练了超过100层的模型,并探索了超过1000层的模型。

On the ImageNet classification dataset, we obtain excellent results by extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

在ImageNet分类数据集上,我们用极深的残差网络获得了出色的结果。我们152层的残差网络是ImageNet上出现过的最深的网络,但复杂度仍低于VGG网络。我们的集成模型在ImageNet测试集上的top-5错误率为3.57%,获得了ILSVRC 2015分类任务的第一名。这种极深的表示在其他识别任务上也有出色的泛化能力,使我们在ILSVRC和COCO 2015竞赛中进一步获得了ImageNet检测、ImageNet定位、COCO检测和COCO分割的第一名。这有力地证明残差学习的原理是通用的,我们期望它同样适用于其他视觉和非视觉问题。

2. Related Work

Residual Representations. In image recognition, VLAD is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector can be formulated as a probabilistic version of VLAD. Both of them are powerful shallow representations for image retrieval and classification. For vector quantization, encoding residual vectors is shown to be more effective than encoding original vectors.

残差表示:在图像识别中,VLAD是一种相对于字典用残差向量进行编码的表示,Fisher向量可以看作VLAD的概率版本。它们都是用于图像检索和分类的强大的浅层表示。对于矢量量化,编码残差向量被证明比编码原始向量更有效。

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning, which relies on variables that represent residual vectors between two scales. It has been shown that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

在低层视觉和计算机图形学中,为了求解偏微分方程(PDE),广泛使用的多重网格(Multigrid)法把系统重构为多个尺度上的子问题,每个子问题负责较粗尺度与较细尺度之间的残差解。多重网格的一种替代方法是分层基预处理,它依赖于表示两个尺度之间残差向量的变量。已经证明,这些求解器比不利用解的残差性质的标准求解器收敛得快得多。这些方法表明,一个好的重构或预处理可以简化优化过程。

Shortcut Connections. Practices and theories that lead to shortcut connections have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.

快捷连接:引出快捷连接的实践和理论已经被研究了很长时间。训练多层感知机(MLP)的一个早期做法,是添加一个从网络输入直连到输出的线性层。在[44, 24]中,少量中间层被直接连接到辅助分类器上,以解决梯度消失/爆炸问题。还有一些论文提出了借助快捷连接实现对层响应、梯度和传播误差进行中心化的方法。在[44]中,一个“inception”层由一个快捷分支和若干较深的分支组成。

Concurrent with our work, “highway networks” present shortcut connections with gating functions. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

与我们的工作同期,“highway networks”[41,42]提出了带有门控函数的快捷连接。这些门依赖于数据并且带有参数,而我们的自身快捷连接不带任何参数。当一个门控快捷连接“关闭”(趋近于0)时,highway network中的层表示的是非残差函数;相反,我们的公式总是在学习残差函数,我们的自身快捷连接永远不会关闭,所有信息总是被传递过去,同时还要学习额外的残差函数。此外,highway network并没有展示出深度极大增加(例如超过100层)时的精度提升。

3. Deep Residual Learning

3.1. Residual Learning

Let us consider $\mathcal{H}({\rm{x}})$ as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with ${\rm{x}}$ denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., $\mathcal{H}({\rm{x}})-{\rm{x}}$ (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate $\mathcal{H}({\rm{x}})$, we explicitly let these layers approximate a residual function $\mathcal{F}({\rm{x}}):=\mathcal{H}({\rm{x}})-{\rm{x}}$. The original function thus becomes $\mathcal{F}({\rm{x}})+{\rm{x}}$. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

让我们把 $\mathcal{H}({\rm{x}})$ 看作由若干堆叠的层(不一定是整个网络)来拟合的底层映射,${\rm{x}}$ 表示这些层中第一层的输入。如果假设多个非线性层可以渐近地逼近复杂函数,那么就等价于假设它们同样可以渐近地逼近残差函数,即 $\mathcal{H}({\rm{x}})-{\rm{x}}$(假设输入和输出维度相同)。因此,我们不是让堆叠的层去逼近 $\mathcal{H}({\rm{x}})$,而是明确地让这些层去逼近残差函数 $\mathcal{F}({\rm{x}}):=\mathcal{H}({\rm{x}})-{\rm{x}}$,原始函数因此变为 $\mathcal{F}({\rm{x}})+{\rm{x}}$。尽管两种形式都应该能够渐近地逼近所需的函数(如假设的那样),但学习的难易程度可能不同。

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

这种重构的动机来自关于退化问题的反直觉现象(图1,左)。正如我们在引言中讨论的,如果添加的层可以被构造为自身映射,那么更深模型的训练误差应当不大于与它对应的较浅模型。退化问题表明,求解器可能难以用多个非线性层来逼近自身映射。而在残差学习的重构下,如果自身映射是最优的,求解器只需把多个非线性层的权重推向零,就可以逼近自身映射。

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

在实际情况中,自身映射不太可能恰好是最优的,但我们的重构可能有助于对问题进行预处理。如果最优函数比起零映射更接近自身映射,那么求解器参照自身映射去寻找扰动,应当比把它当作一个全新的函数来学习更容易。我们通过实验(图7)表明,学到的残差函数通常具有较小的响应,这说明自身映射提供了合理的预处理。

3.2. Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:
$$ {\rm{y}}=\mathcal{F}({\rm{x}},\{ W_i \})+{\rm{x}} \qquad \qquad (1) $$
Here ${\rm{x}}$ and ${\rm{y}}$ are the input and output vectors of the layers considered. The function $\mathcal{F}({\rm{x}},\{ W_i \})$ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $\mathcal{F}=W_2 \sigma(W_1 {\rm{x}})$ in which $\sigma$ denotes ReLU and the biases are omitted for simplifying notations. The operation $\mathcal{F}+{\rm{x}}$ is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., $\sigma({\rm{y}})$, see Fig. 2).

我们对每几个堆叠的层采用残差学习。一个构造块如图2所示。在本文中,我们把一个构造块形式化地定义为:
$$ {\rm{y}}=\mathcal{F}({\rm{x}},\{ W_i \})+{\rm{x}} \qquad \qquad (1) $$
此处 ${\rm{x}}$ 和 ${\rm{y}}$ 分别表示所考虑的这些层的输入和输出向量,函数 $\mathcal{F}({\rm{x}},\{ W_i \})$ 表示将要学习的残差映射。以图2中含两层的例子来说,$\mathcal{F}=W_2 \sigma(W_1 {\rm{x}})$,其中 $\sigma$ 表示ReLU,为简化记号省略了偏置项。运算 $\mathcal{F}+{\rm{x}}$ 由一个快捷连接和逐元素相加来实现。我们在相加之后再采用第二个非线性(即 $\sigma({\rm{y}})$,见图2)。
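
下面用PyTorch给出公式(1)这个两层构造块的一个最小示意(并非论文官方代码;类名 BasicBlock、通道数64以及“卷积之后、激活之前接BN”等细节是按论文描述所做的假设),主要演示无参数的自身快捷连接,以及相加之后再做第二个非线性的顺序:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """两层3×3卷积的残差构造块:y = F(x, {W_i}) + x,相加之后再做第二个非线性。"""
    def __init__(self, channels=64):
        super().__init__()
        # F(x) = W2 · σ(W1 · x);按论文做法在每个卷积之后、激活之前接BN
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # 自身快捷连接,不引入任何参数
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # 逐元素相加:F(x) + x
        return self.relu(out)                     # 相加之后的第二个非线性 σ(y)

print(BasicBlock()(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```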

The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

公式(1)中的快捷连接既没有引入额外的参数,也没有增加计算复杂度。这不仅在实践中具有吸引力,在我们对普通网络和残差网络的比较中也尤为重要:我们可以公平地比较同时具有相同参数量、深度、宽度和计算成本(除了可以忽略不计的逐元素相加)的普通/残差网络。

The dimensions of ${\rm{x}}$ and $\mathcal{F}$ must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $W_s$ by the shortcut connections to match the dimensions:
$$ {\rm{y}}=\mathcal{F}({\rm{x}},\{ W_i \})+W_s{\rm{x}} \qquad \qquad (2) $$
We can also use a square matrix $W_s$ in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $W_s$ is only used when matching dimensions.

公式(1)中 ${\rm{x}}$ 和 $\mathcal{F}$ 的维度必须一致。如果不一致(例如改变了输入/输出的通道数),我们可以在快捷连接上做一个线性投影 $W_s$ 来匹配维度:
$$ {\rm{y}}=\mathcal{F}({\rm{x}},\{ W_i \})+W_s{\rm{x}} \qquad \qquad (2) $$
我们同样可以在公式(1)中使用一个方阵 $W_s$。不过我们的实验表明,自身映射足以解决退化问题,而且更经济,因此 $W_s$ 只在匹配维度时使用。
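
下面是公式(2)中线性投影 $W_s$ 的一个示意实现,假设用1×1卷积来完成(与后文3.3节“通过1×1卷积完成”的说法一致);投影后附加BN属于实现上的假设,并非公式(2)本身的一部分:

```python
import torch
import torch.nn as nn

class ProjectionShortcut(nn.Module):
    """用1×1卷积实现公式(2)中的线性投影 W_s,用于输入/输出维度不一致的情形。"""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        # 1×1卷积同时完成通道数匹配与空间下采样
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.proj(x))               # W_s x

shortcut = ProjectionShortcut(64, 128, stride=2)
print(shortcut(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```

把前面示意中 BasicBlock 的 identity 换成这里的投影输出,就得到公式(2)对应的构造块。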

The form of the residual function $\mathcal{F}$ is flexible. Experiments in this paper involve a function $\mathcal{F}$ that has two or three layers (Fig. 5), while more layers are possible. But if $\mathcal{F}$ has only a single layer, Eqn.(1) is similar to a linear layer: ${\rm{y}}=W_1{\rm{x}}+{\rm{x}}$, for which we have not observed advantages.

残差函数 $\mathcal{F}$ 的形式是灵活的。本文的实验包括 $\mathcal{F}$ 为2层或3层的情况(图5),更多的层也是可以的。但如果 $\mathcal{F}$ 只有一层,公式(1)就近似于一个线性层:${\rm{y}}=W_1{\rm{x}}+{\rm{x}}$,我们没有观察到这种形式的优势。

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function $\mathcal{F}({\rm{x}},\{ W_i \})$ can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

我们还注意到,虽然上述记号为简化起见是针对全连接层的,但它们同样适用于卷积层。函数 $\mathcal{F}({\rm{x}},\{ W_i \})$ 可以表示多个卷积层,逐元素相加是在两个特征图上逐通道进行的。

3.3. Network Architectures

We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

我们测试了多种普通/残差网络,并观察到了一致的现象。为了给讨论提供实例,下面我们描述用于ImageNet的两个模型。

Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets(Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).

普通网络。我们的普通基准网络(图3,中)主要受VGG网络(图3,左)的思想启发。卷积层大多使用3×3的过滤器,并遵循两个简单的设计规则:(i)对于相同尺寸的输出特征图,各层使用相同数量的过滤器;(ii)如果特征图尺寸减半,则过滤器数量翻倍,以保持每层的时间复杂度。我们直接用stride为2的卷积层进行下采样。网络以一个全局平均池化层和一个带softmax的1000路全连接层结束。图3(中)里含权重的层总共有34层。

It is worth noticing that our model has fewer filters and lower complexity than VGG nets(Fig. 3, left). Our 34- layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

值得注意的是,我们的模型比VGG网络(图3,左)拥有更少的过滤器和更低的复杂度。我们34层的基准网络有36亿次FLOPs(乘加运算),仅为VGG-19(196亿次FLOPs)的18%。


Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [40] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

图3. 用于ImageNet的网络结构示例。左:作为参考的VGG-19模型(196亿次FLOPs)。中:一个有34个参数层的普通网络(36亿次FLOPs)。右:一个有34个参数层的残差网络(36亿次FLOPs)。虚线的快捷连接增加了维度。表1给出了更多细节和其他变体。

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

残差网络。在上述普通网络的基础上,我们插入快捷连接(图3,右),把网络变成对应的残差版本。当输入和输出维度相同时,可以直接使用自身快捷连接(公式(1),图3中的实线快捷连接)。当维度增加时(图3中的虚线快捷连接),我们考虑两种选项(选项A的零填充做法见本段之后的示意代码):
(A)快捷连接仍然执行自身映射,对增加的维度用零填充,此选项不引入额外参数;
(B)用公式(2)中的投影快捷连接来匹配维度(通过1×1卷积实现)。
对于这两种选项,当快捷连接跨越两种尺寸的特征图时,都以stride为2执行。
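
承接上面的选项A,下面给出零填充自身捷径的一个示意。论文没有具体说明这里空间下采样的实现方式,下面用隔点取样来保持无参数,函数名 zero_pad_shortcut 为假设:

```python
import torch
import torch.nn.functional as F

def zero_pad_shortcut(x, out_channels, stride=2):
    """选项A:快捷连接仍为自身映射,对增加的通道用零填充,不引入参数。"""
    x = x[:, :, ::stride, ::stride]                    # 无参数的空间下采样(隔点取样)
    pad_channels = out_channels - x.size(1)
    # F.pad 从最后一维向前填充,(0,0,0,0,0,pad_channels) 的最后一对作用在通道维
    return F.pad(x, (0, 0, 0, 0, 0, pad_channels))

print(zero_pad_shortcut(torch.randn(1, 64, 56, 56), 128).shape)  # torch.Size([1, 128, 28, 28])
```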

3.4. Implementation

Our implementation for ImageNet follows the practice in [21, 40]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [12] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to $60\times10^4$ iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [13], following the practice in [16].

我们用于ImageNet的实现遵循[21, 40]中的做法。图像按其短边在[256, 480]区间内随机采样进行缩放,以做尺度增强[40]。从图像或其水平翻转中随机裁剪出224×224的区域,并减去每像素均值[21],还使用了[21]中的标准颜色增强。按照[16],我们在每个卷积之后、激活之前采用批量归一化(BN)[16]。我们按[12]中的方法初始化权重,并从头开始训练所有的普通/残差网络。我们使用SGD,mini-batch大小为256。学习率从0.1开始,当误差趋于平稳时除以10,各模型最多训练 $60\times10^4$ 次迭代。我们使用0.0001的权重衰减和0.9的动量。遵循[16]中的做法,我们不使用dropout[13]。
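
按照本段给出的超参数,可以用PyTorch写出如下示意性的训练配置。其中的占位模型、以及用 ReduceLROnPlateau 来近似“当误差趋于平稳时除以10”都是假设,尺度/颜色增强部分从略:

```python
import torch
import torch.nn as nn

# 占位模型,仅为让示例可以独立运行;实际应替换为34层普通网络或残差网络
model = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000))

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,              # 初始学习率0.1
                            momentum=0.9,        # 动量0.9
                            weight_decay=1e-4)   # 权重衰减0.0001

# 误差趋于平稳时把学习率除以10:每次评估后调用 scheduler.step(val_error)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """一次mini-batch的SGD+反向传播(论文中batch大小为256)。"""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))))
```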

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [40, 12], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

测试时,为了进行比较研究,我们采用标准的10-crop测试[21]。为了获得最佳结果,我们采用[40, 12]中的全卷积形式,并在多个尺度上(调整图像大小,使短边取{224, 256, 384, 480, 640})对得分取平均。

4. Experiments

4.1. ImageNet Classification

We evaluate our method on the ImageNet 2012 classification dataset [35] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

我们在包含1000个类别的ImageNet 2012分类数据集[35]上评估我们的方法。各模型在128万张训练图像上训练,并在5万张验证图像上评估。我们还在10万张测试图像上获得了由测试服务器报告的最终结果。我们同时评估top-1和top-5错误率。

Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

普通网络。 我们首先评测了18层和34层的普通网络。34层普通网络(见图3中间),18层的网络是一个相似的结构。具体结构见表1。


Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

表1. 用于ImageNet的网络架构。构造块显示在括号中(另见图5),并标注了堆叠的块数。下采样由stride为2的conv3_1、conv4_1和conv5_1执行。

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.

表2的结果表明,更深的34层普通网络比较浅的18层普通网络有更高的验证误差。为了找出原因,我们在图4(左)中比较了训练过程中的训练/验证误差。我们观察到了退化问题:尽管18层普通网络的解空间只是34层网络解空间的一个子空间,34层普通网络在整个训练过程中都有更高的训练误差。


Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.

表2. ImageNet验证集上的top-1错误率(%,10-crop测试)。这里的ResNet与对应的普通网络相比没有额外参数。图4展示了训练过程。


Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

图4. 在ImageNet上训练。细曲线表示训练误差,粗曲线表示中心裁剪(center crop)的验证误差。左:18层和34层的普通网络。右:18层和34层的ResNet。在该图中,残差网络与对应的普通网络相比没有额外参数。

We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN, which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error. The reason for such optimization difficulties will be studied in the future.

我们认为这种优化困难不太可能是由梯度消失引起的。这些普通网络是用BN训练的,这保证了前向传播的信号具有非零的方差;我们也验证了在BN下反向传播的梯度表现出健康的范数。因此前向和反向的信号都没有消失。事实上,34层普通网络仍然能达到有竞争力的精度(表3),这说明求解器在一定程度上是有效的。我们推测深的普通网络可能具有指数级低的收敛速度,从而影响了训练误差的降低。这种优化困难的原因将留待未来研究。

Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.

残差网络。接下来我们评估18层和34层的残差网络(ResNet)。其基准架构与上述普通网络相同,只是如图3(右)所示,在每对3×3过滤器之间添加了一个快捷连接。在第一组比较中(表2和图4右),我们对所有快捷连接使用自身映射,并用零填充来增加维度(选项A),因此与对应的普通网络相比没有任何额外参数。

We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.

从表2和图4中,我们有三个主要观察。首先,情况在残差学习下发生了逆转:34层ResNet比18层ResNet表现更好(错误率低2.8%)。更重要的是,34层ResNet表现出相当低的训练误差,并且能够泛化到验证数据上。这表明退化问题在这种设置下得到了很好的解决,我们能够从增加的深度中获得精度提升。

Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.

其次,与对应的普通网络相比,34层ResNet把top-1错误率降低了3.5%(表2),这得益于训练误差的成功降低(图4右与左)。这一比较验证了残差学习在极深系统上的有效性。

Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.

最后,我们还注意到18层的普通/残差网络精度相当(表2),但18层ResNet收敛得更快(图4右与左)。当网络“不太深”(这里是18层)时,目前的SGD求解器仍然能为普通网络找到较好的解。在这种情况下,ResNet通过在训练早期提供更快的收敛来简化优化。


Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.

表3. ImageNet验证集上的错误率(%,10-crop测试)。VGG-16的结果基于我们自己的测试。ResNet-50/101/152使用选项B,仅用投影来增加维度。


Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).

表4. ImageNet验证集上单模型结果的错误率(%)(带†的结果在测试集上报告)。


Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

表5. 集成模型的错误率(%)。top-5错误率在ImageNet测试集上,由测试服务器报告。

Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

自身(identity)捷径 vs 投影(projection)捷径:我们已经证实了无参数的自身捷径对训练有帮助。接下来我们研究投影捷径(公式(2))。在表3中我们比较三种选项:
- (A)零填充捷径用来增加维度,所有的捷径都是没有参数的自身捷径(跟表2和图4右一样)
- (B)投影捷径用来增加维度,其他的捷径都是没有参数的自身捷径。
- (C)所有的捷径都是投影捷径

Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.

表3显示这三种选项都明显好于对应的普通网络。B比A略好,我们认为这是因为A中零填充的维度确实没有进行残差学习。C比B略好,我们把这归因于许多(13个)投影捷径引入的额外参数。但A/B/C之间的差别很小,说明投影捷径对解决退化问题并不是必需的。因此在本文后面的部分我们不再使用选项C,以减少内存/时间复杂度和模型大小。自身捷径对于不增加下面将要介绍的瓶颈结构的复杂度尤为重要。

Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function $\mathcal{F}$, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.

深度瓶颈架构。接下来我们描述用于ImageNet的更深网络。出于对可承受训练时间的考虑,我们把构造块修改为瓶颈设计。对于每个残差函数 $\mathcal{F}$,我们使用3层的堆叠而不是2层(图5)。这三层分别是1×1、3×3和1×1的卷积,其中1×1层负责先降低维度、再增加(恢复)维度,使3×3层成为一个具有较小输入/输出维度的瓶颈。图5给出了一个例子,两种设计具有相近的时间复杂度。
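
下面是这种“瓶颈”构造块的一个示意实现(按图5右侧的例子取256→64→256的通道配置;类名与实现细节为假设,并非官方代码):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """瓶颈构造块:1×1(降维)-> 3×3 -> 1×1(恢复维度),相加之后再做ReLU。"""
    def __init__(self, channels=256, bottleneck_channels=64):
        super().__init__()
        self.reduce = nn.Sequential(               # 1×1卷积,负责降维
            nn.Conv2d(channels, bottleneck_channels, 1, bias=False),
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True))
        self.conv3x3 = nn.Sequential(              # 3×3卷积,在较小的维度上计算
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(               # 1×1卷积,恢复维度
            nn.Conv2d(bottleneck_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.conv3x3(self.reduce(x)))
        return self.relu(out + x)                  # 无参数自身捷径 + 第二个非线性

print(Bottleneck()(torch.randn(1, 256, 14, 14)).shape)   # torch.Size([1, 256, 14, 14])
```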


Figure 5. A deeper residual function $\mathcal{F}$ for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.

图5. 用于ImageNet的更深的残差函数 $\mathcal{F}$。左:用于ResNet-34的构造块(作用在56×56的特征图上),同图3。右:用于ResNet-50/101/152的“瓶颈”构造块。

The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

无参数的自身快捷连接对瓶颈架构尤为重要。如果把图5(右)中的自身快捷连接换成投影,可以证明时间复杂度和模型大小都会翻倍,因为快捷连接连到了两个高维端。所以自身快捷连接能为瓶颈设计带来更高效的模型。

50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.

50层ResNet:我们把34层网络中的每个2层构造块都替换成3层的瓶颈块,得到一个50层的ResNet(表1)。我们使用选项B来增加维度。该模型有38亿次FLOPs。

101-layer and 152-layer ResNets: We construct 101- layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).

101层和152层ResNet:我们用更多的3层瓶颈块构建了101层和152层的ResNet(表1)。值得注意的是,尽管深度显著增加,152层ResNet(113亿次FLOPs)的复杂度仍然低于VGG-16/19网络(153亿/196亿次FLOPs)。
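
按原论文表1的配置,各深度在conv2_x到conv5_x四个阶段使用的瓶颈块数量如下;下面这段小代码核对了“总层数 = 1个7×7卷积 + 每个瓶颈块3个卷积×块数之和 + 1个全连接”这笔账(仅作核算示意):

```python
# 原论文表1中各深度在conv2_x~conv5_x四个阶段使用的瓶颈块数量
block_counts = {
    50:  [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
}

for depth, blocks in block_counts.items():
    # 加权层总数 = 1个7×7卷积 + 每个瓶颈块3个卷积 × 块数之和 + 1个全连接层
    weighted_layers = 1 + 3 * sum(blocks) + 1
    assert weighted_layers == depth, (depth, weighted_layers)
    print(depth, blocks, '->', weighted_layers, '个加权层')
```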

The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).

50/101/152层的ResNet比34层的精度有相当大的提升(表3和表4)。我们没有观察到退化问题,因此可以从显著增加的深度中获得明显的精度提升。深度带来的好处在所有评价指标上都有体现(表3和表4)。

Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.

与最先进方法的比较。在表4中,我们与之前最好的单模型结果进行比较。我们34层的基准ResNet已经取得了非常有竞争力的精度。我们152层ResNet的单模型top-5验证错误率为4.49%,这一单模型结果超过了之前所有的集成模型结果(表5)。我们把六个不同深度的模型组合成一个集成模型(提交时其中只有两个152层的模型),在测试集上取得了3.57%的top-5错误率(表5)。该结果获得了ILSVRC 2015的第一名。

4.2. CIFAR-10 and Analysis

We conducted more studies on the CIFAR-10 dataset, which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.

我们在CIFAR-10数据集上进行了更多研究,该数据集包含10个类别的5万张训练图像和1万张测试图像。我们展示了在训练集上训练、在测试集上评估的实验。我们关注的是极深网络的行为,而不是追求最先进的结果,因此有意使用如下的简单架构。

The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of $6n$ layers with 3×3 convolutions on the feature maps of sizes $\{32, 16, 8\}$ respectively, with $2n$ layers for each feature map size. The numbers of filters are $\{16, 32, 64\}$ respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally $6n+2$ stacked weighted layers. The following table summarizes the architecture:
| output map size | 32×32 | 16×16 | 8×8 |
| --- | --- | --- | --- |
| # layers | 1+2n | 2n | 2n |
| # filters | 16 | 32 | 64 |

When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally $3n$ shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.

普通/残差架构遵循图3(中/右)的形式。网络输入是32×32的图像,并减去每像素均值。第一层是3×3卷积。然后我们在尺寸分别为 $\{32, 16, 8\}$ 的特征图上使用共 $6n$ 个3×3卷积层的堆叠,每种特征图尺寸使用 $2n$ 层,对应的过滤器数量分别为 $\{16, 32, 64\}$。下采样通过stride为2的卷积完成。网络以一个全局平均池化、一个10路全连接层和softmax结束,总共有 $6n+2$ 个堆叠的加权层。下方的表格总结了这个结构:
| 输出特征图尺寸 | 32×32 | 16×16 | 8×8 |
| --- | --- | --- | --- |
| 层数 | 1+2n | 2n | 2n |
| 过滤器数量 | 16 | 32 | 64 |

当使用快捷连接时,它们被连接到成对的3×3层上(共 $3n$ 个快捷连接)。在这个数据集上,我们在所有情况下都使用自身快捷连接(即选项A),因此我们的残差模型与对应的普通模型有完全相同的深度、宽度和参数量。
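
下面用PyTorch给出这个 $6n+2$ 层结构的普通网络版本的示意;残差版本只需按选项A在每对3×3层上再加无参数的快捷连接。函数名与实现细节为假设:

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def cifar_plain_net(n):
    """6n+2层的CIFAR-10普通网络:3个阶段,特征图{32,16,8},过滤器{16,32,64},
    每个阶段2n个3×3卷积层,末尾为全局平均池化 + 10路全连接(softmax在损失函数中)。"""
    layers = [conv_bn_relu(3, 16)]                       # 第一层3×3卷积
    cin = 16
    for i, cout in enumerate([16, 32, 64]):
        for j in range(2 * n):
            stride = 2 if (i > 0 and j == 0) else 1      # 每进入新阶段用stride=2下采样
            layers.append(conv_bn_relu(cin, cout, stride))
            cin = cout
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)]
    return nn.Sequential(*layers)

net = cifar_plain_net(n=3)                               # 6*3+2 = 20层
print(net(torch.randn(2, 3, 32, 32)).shape)              # torch.Size([2, 10])
```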

We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [12] and BN but with no dropout. These models are trained with a mini-batch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.

我们使用0.0001的权重衰减和0.9的动量,采用[12]中的权重初始化和BN,但不使用dropout。这些模型在两块GPU上以128的mini-batch大小训练。我们从0.1的学习率开始,在32k和48k次迭代时除以10,在64k次迭代时终止训练,这个方案是在45k/5k的训练/验证划分上确定的。训练时我们遵循[24]中简单的数据增强方式:每边填充4个像素,并从填充后的图像或其水平翻转中随机裁剪出32×32的区域。测试时,我们只评估原始32×32图像的单一视图。
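
这一段的训练设置大致可以写成下面的样子。占位模型与随机数据仅为让示例可独立运行,实际训练应换成带上述增强的CIFAR-10数据;按迭代数(而非epoch)调度学习率是对正文描述的直接对应:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# 正文描述的数据增强:每边填充4个像素,再从填充图像或其水平翻转中随机裁出32×32
# (实际使用时把 train_transform 传给 torchvision.datasets.CIFAR10)
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))        # 占位模型,仅为演示

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,       # 初始学习率0.1
                            momentum=0.9, weight_decay=1e-4)  # 动量0.9,权重衰减0.0001
# 在第32k和48k次迭代时把学习率各除以10(按迭代数而非epoch计)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32000, 48000], gamma=0.1)

for iteration in range(64000):                                # 64k次迭代后终止
    images = torch.randn(128, 3, 32, 32)                      # mini-batch大小128,这里用随机数据演示
    labels = torch.randint(0, 10, (128,))
    loss = nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                          # 每次迭代推进一次学习率调度
    break                                                     # 仅演示一步
```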

We compare $n = \{3, 5, 7, 9\}$, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [41]), suggesting that such an optimization difficulty is a fundamental problem.

我们比较了 $n=\{3,5,7,9\}$,分别对应20、32、44、56层的网络。图6(左)展示了普通网络的表现。深的普通网络随着深度增加表现出更高的训练误差。这个现象与在ImageNet(图4,左)和MNIST(见[41])上的现象类似,表明这种优化困难是一个根本性的问题。

Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.

图6(中)展示了ResNet的表现。与ImageNet的情形(图4,右)类似,我们的ResNet成功克服了优化困难,并在深度增加时展现出精度提升。


Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

图6. 在CIFAR-10上训练。虚线表示训练误差,粗线表示测试误差。左:普通网络,plain-110的误差高于60%,未显示。中:ResNet。右:110层和1202层的ResNet。

We further explore $n = 18$ that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging. So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet and Highway (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).

我们进一步研究了 $n=18$ 的情况,对应一个110层的ResNet。在这种情况下,我们发现0.1的初始学习率略大,难以开始收敛,因此先用0.01来预热训练,直到训练误差降到80%以下(大约400次迭代),然后再回到0.1继续训练。其余的训练方案与前文相同。这个110层的网络收敛得很好(图6,中),它比FitNet[34]和Highway[41]等其他又深又窄的网络参数更少,却取得了属于最先进水平的结果(6.43%,表6)。
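
先用0.01预热、待训练误差低于80%后再回到0.1的做法,可以用如下小段代码示意(阈值判断与修改学习率的方式为假设的实现):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                                   # 占位模型
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,    # 先用0.01预热
                            momentum=0.9, weight_decay=1e-4)

def maybe_end_warmup(train_error, warmed_up):
    """训练误差降到80%以下(约400次迭代)后,把学习率调回0.1继续训练。"""
    if not warmed_up and train_error < 0.80:
        for group in optimizer.param_groups:
            group['lr'] = 0.1
        return True
    return warmed_up

warmed_up = maybe_end_warmup(train_error=0.75, warmed_up=False)
print(warmed_up, optimizer.param_groups[0]['lr'])           # True 0.1
```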


Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [42].

表6. CIFAR-10测试集上的分类错误率。所有方法都使用了数据增强。对于ResNet-110,我们运行了5次,并按照[42]给出“最优(均值±标准差)”。

Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.

网络层响应分析。图7展示了层响应的标准差(std)。这里的响应是每个3×3层在BN之后、其他非线性(ReLU/相加)之前的输出。对于ResNet,这一分析揭示了残差函数的响应强度。图7显示ResNet的响应普遍比对应的普通网络小。这些结果支持了我们的基本动机(第3.1节),即残差函数通常可能比非残差函数更接近零。我们还注意到,更深的ResNet具有更小的响应幅度,这从图7中ResNet-20、56和110的比较可以看出。当层数更多时,ResNet的单个层对信号的修改往往更少。
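
统计“BN之后、非线性之前”的层响应标准差,可以用前向钩子(forward hook)来实现,下面是一个示意(以BN层的输出作为响应,占位模型仅为演示钩子的用法):

```python
import torch
import torch.nn as nn

def collect_response_std(model, inputs):
    """用前向钩子收集每个BN层的输出(即“BN之后、非线性之前”的响应)的标准差。"""
    stds, hooks = [], []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            # 对整个输出张量取标准差,作为该层响应强度的一个粗略度量
            hooks.append(module.register_forward_hook(
                lambda m, inp, out: stds.append(out.detach().std().item())))
    with torch.no_grad():
        model(inputs)                   # 跑一次前向,触发所有钩子
    for h in hooks:
        h.remove()
    return stds

# 占位模型:一个小的conv-BN-ReLU堆叠,仅为演示钩子用法
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
print(collect_response_std(model, torch.randn(4, 3, 32, 32)))
```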


Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.

图7. CIFAR-10上层响应的标准差(std)。响应是每个3×3层在BN之后、非线性之前的输出。上:各层按原始顺序显示。下:响应按降序排列。

Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set $n = 200$ that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this $10^3$-layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).

超过1000层的探索。我们探索了一个超过1000层的激进的深层模型。我们把 $n$ 设为200,对应一个1202层的网络,并按上述方式训练。我们的方法没有表现出优化困难,这个 $10^3$ 层的网络能够达到<0.1%的训练误差(图6,右),它的测试误差也仍然相当不错(7.93%,表6)。

But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout or dropout is applied to obtain the best results ([9, 25, 24, 34]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.

不过,对于这样激进的深层模型仍然存在一些开放问题。尽管两者的训练误差相近,1202层网络的测试结果比我们110层网络的要差。我们认为这是过拟合造成的:对于这个小数据集来说,1202层的网络(19.4M参数)可能大得没有必要。在这个数据集上,通常要使用maxout[9]或dropout等较强的正则化来获得最好的结果[9,25,24,34]。在本文中,我们没有使用maxout/dropout,只是简单地通过设计上又深又窄的架构来施加正则化,以免分散对优化困难这一焦点的注意力。但结合更强的正则化可能会进一步改进结果,这是我们未来要研究的方向。

4.3. Object Detection on PASCAL and MS COCO

Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 and COCO. We adopt Faster R-CNN as the detection method. Here we are interested in the improvements of replacing VGG-16 [40] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.

我们的方法在其他识别任务上具有良好的泛化性能。表7和表8给出了在PASCAL VOC 2007、2012和COCO上的目标检测基准结果。我们采用Faster R-CNN作为检测方法。这里我们感兴趣的是把VGG-16[40]替换为ResNet-101带来的提升。使用这两种模型的检测实现(见附录)是相同的,所以增益只能归功于更好的网络。最值得注意的是,在具有挑战性的COCO数据集上,我们在COCO的标准指标(mAP@[.5, .95])上获得了6.0%的提升,相对提升达28%。这一增益完全归功于学到的表示。


Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also appendix for better results.

表7. 使用基准Faster R-CNN在PASCAL VOC 2007/2012测试集上的目标检测mAP(%)。更好的结果见附录。


Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also appendix for better results.

表8. 使用基准Faster R-CNN在COCO验证集上的目标检测mAP(%)。更好的结果见附录。

Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.

在深度残差网络的基础上,我们在ILSVRC & COCO 2015竞赛的多个项目中获得了第一名:ImageNet检测、ImageNet定位、COCO检测和COCO分割。详情见附录。