【论文翻译】Deep residual learning for image recognition

论文题目:Deep residual learning for image recognition
论文来源: Deep residual learning for image recognition
翻译人:[email protected]实验室

Deep residual learning for image recognition

深度残差学习用于图像识别

Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions 1 , where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

摘要

深度神经网络更难训练。我们提出了一种残差的学习框架来简化以前使用的深度网络训练。本文重新创制了层,使得层能根据输入来学习残差函数而非原始函数。本文提供全面的依据表明:这些残差网络更易于优化并且可以通过增加深度获得准确性。在ImageNet数据集上,我们评估深度最大为152层的残差网络,虽然这相当于深8倍的VGG网络,但仍具有较低的复杂度。这些残留网络的组合模型(ensemble)在ImageNet测试仪上实现了3.57%的误差。该结果在ILSVRC 2015分类任务中获得第一名。我们还用100层和1000层的模型分析了CIFAR-10数据集。

表达的深度在许多视觉识别任务中至关重要。仅由于我们相当深的表达,我们在COCO目标检测数据集上获得了28%的相对提升。深度残差网是我们参加ILSVRC&COCO 2015竞赛上所使用模型的基础,并且在ImageNet检测、ImageNet本地化、COCO检测和COCO分割方面还获得了第一。

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 49, 39]. Deep networks naturally integrate low/mid/high-level features [49] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16]. Many other nontrivial visual recognition tasks [7, 11, 6, 32, 27] have also greatly benefited from very deep models.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [14, 1, 8], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 8, 36, 12] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

【论文翻译】Deep residual learning for image recognition

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 33, 48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

【论文翻译】Deep residual learning for image recognition

We present comprehensive experiments on ImageNet[35] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

Similar phenomena are also shown on the CIFAR-10 set[20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

On the ImageNet classification dataset [35], we obtain excellent results by extremely deep residual nets. Our 152layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

1. 引言

深度卷积神经网络在图像分类领域取得一系列突破。深度网络自然地将一个端到端多层模型中的低/中/高级特征以及分类器整合起来,而特征的“等级”可以通过堆叠层的数量(深度)来丰富。最近有结果显示:模型的深度发挥着至关重要的作用,这样导致了ImageNet竞赛的参赛模型都趋向于“非常深”(16层到30层)。许多其它的视觉识别任务也都受益于非常深的模型。

在深度的重要性的驱使下,出现了一个新的问题:训练一个更好的网络是否和堆叠更多的层一样简单呢?解决这一问题的障碍便是困扰人们很久的梯度消失/梯度爆炸,它从一开始便阻碍了模型的收敛。归一初始化(normalized initialization)和中间归一化(intermediate normalization)在很大程度上解决了这一问题,它使得数十层的网络在反向传播的随机梯度下降(SGD)上能够收敛。

当深层网路能够收敛时,又会出现一个退化问题(degradation problem):随着网络深度的增加,准确率达到饱和(不足为奇)然后迅速退化。令人意外的是这种退化并不是由过拟合造成的,并且即使在一个合理的深度模型中增加更多的层却导致了更高的错误率(我们的实验也证明了这点)。Fig.1展示了一个典型的例子。

【论文翻译】Deep residual learning for image recognition

(训练准确率的)退化的出现表明了并非所有系统都容易优化。让我们来比较一个浅层的结构和它的深层版本。有一种通过构建更深的模型结构的方案:恒等映射(identity mapping)来构建增加的层,而其它层直接从浅层模型中复制而来。这种结构方案也表明了一个更深的模型不应当产生比它的浅层版本更高的训练错误率。但实验表明:我们目前无法找到一个与这种结构方案相当或者更好的方案(或者说无法在可行的时间内实现)。

本文中,我们提出一种深度残差学习框架来解决这个退化问题。我们明确的让这些层来拟合残差映射(residual mapping),而不是让每一个堆叠的层直接来拟合所需的底层映射(desired underlying mapping)。假设所需的底层映射为H(x),我们让堆叠的非线性层来拟合另一个映射为F(x) := H(x)−x。因此原来的映射转化为F(x)+x。我们推断残差映射比原始未参考的映射(unreferenced mapping)更容易优化。在极端的情况下,如果某个恒等映射是最优的,那么将残差变为0比用非线性层的堆叠来拟合恒等映射更简单。

公式 F(x)+x 可以通过前馈神经网络的“shortcut连接”来实现(Fig.2)。Shortcut连接就是跳过一个或者多个层。在我们的例子中,shortcut连接只是简单的执行恒等映射,再将它们的输出和堆叠层的输出叠加在一起(Fig.2)。恒等的shortcut连接并不增加额外的参数和计算复杂度。完整的网络仍然能通过端到端的SGD反向传播进行训练,并且能够简单的通过公共库(例如Caffe)来实现而无需修改求解器。

【论文翻译】Deep residual learning for image recognition

我们在ImageNet数据集上进行了综合性的实验来展示这个退化问题并评估了我们提出的方法。本文表明了:1) 我们极深的残差网络很容易优化,但是对应的“plain”网络(仅是层的堆叠)在深度增加时却出现了更高的错误率。2) 我们的深度残差网络能够轻易的由增加层来提高准确率并且结果也大大优于以前的网络。

CIFAR-10数据集上也出现了类似的现象,这表明了我们提出的方法的优化难度和效果并不仅仅是对于一个特定数据集而言的。我们在这个数据集上成功的提出了超过100层的训练模型并尝试了超过1000层的模型。

极深的残差网络在ImageNet分类数据集上获得了优异的成绩。我们的152层的残差网络是目前ImageNet尚最深的网络并且比VGG网络的复杂度还要低。我们的组合模型在ImageNet测试集上的top-5错误率仅为3.57%,并赢得了ILSVRC 2015分类竞赛的第一名。这个极深的模型在其他识别任务上同样也具有非常好的泛化性能,这让我们在ILSVRC & COCO 2015 竞赛的ImageNet检测、ImageNet定位、COCO检测以及COCO分割上均获得了第一名的成绩。这强有力的证明了残差学习法则的通用性,因此我们将把它应用到其他视觉甚至非视觉问题上。

2. Related Work

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

Shortcut Connections. Practices and theories that lead to shortcut connections [2, 33, 48] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [33, 48]. In [43, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [38, 37, 31, 46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [43], an “inception” layer is composed of a shortcut branch and a few deeper branches.

Concurrent with our work, “highway networks” [41, 42] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

2. 相关工作

残差表达。在图像识别中,VLAD是残差向量对应于字典进行编码的一种表达形式,Fisher Vector可以看做是VLAD的一个概率版本。它们在图像检索和分类中都是强力的浅层表达,而残差向量编码在向量量化中比原始向量编码更加有效。

在低级视觉和计算机图形学中,为了求解偏微分方程(PDEs),通常使用Multigrid方法将系统重新表达成多尺度的子问题来解决,每一个子问题就是解决粗细尺度之间的残差问题。Multigrid的另外一种方式是分层基预处理,它依赖于代表着两个尺度之间残差向量的变量。实验证明 这些求解器比其他标准求解器的收敛要快得多,却并没有意识到这是该方法的残差特性所致。这些方法表明了一个好的重新表达或者预处理能够简化优化问题。

Shortcut连接。Shortcut连接已经经过了很长的一段实践和理论研究过程。训练多层感知器(MLPs)的一个早期实践就是添加一个连接输入和输出的线性层。在Szegedy2015Going及Lee2015deeply中,将一些中间层直接与辅助分类器相连接可以解决梯度消失/爆炸问题。在Szegedy2015Going中,一个“inception”层由一个shortcut分支和一些更深的分支组合而成。

与此同时,“highway networks”将shortcut连接与门控函数 结合起来。这些门是数据相关并且是有额外参数的,而我们的恒等shortcuts是无参数的。当一个门的shortcut是“closed”(接近于0)时,highway网络中的层表示非残差函数。而我们的模型总是学习残差函数;我们的恒等shortcuts从不关闭,在学习额外的残差函数时,所有的信息总是通过的。此外,highway网络并不能由增加层的深度(例如超过100层)来提高准确率。

3. Deep Residual Learning

3.1. Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions 2 , then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x)+x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

3.2. Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

【论文翻译】Deep residual learning for image recognition

Here x and y are the input and output vectors of the layers considered. The function F(x, { W i } ) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W2 σ(W 1 x) in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).

The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W s by the shortcut connections to match the dimensions:

【论文翻译】Deep residual learning for image recognition

We can also use a square matrix W s in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus W s is only used when matching dimensions.

The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: y = W 1 x + x, for which we have not observed advantages.

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x, { W i } ) can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

3.3. Network Architectures

We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [40] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).

It is worth noticing that our model has fewer filters and lower complexity than VGG nets [40] (Fig. 3, left). Our 34layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

3.4. Implementation

Our implementation for ImageNet follows the practice in [21, 40]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [12] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10 4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [13], following the practice in [16].

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fullyconvolutional form as in [40, 12], and average the scores at multiple scales (images are resized such that the shorter side is in { 224, 256, 384, 480, 640 } ).

3. 深度残差学习

3.1. 残差学习

我们将H(x)看作一个由部分堆叠的层(并不一定是全部的网络)来拟合的底层映射,其中x是这些层的输入。假设多个非线性层能够逼近复杂的函数,这就等价于这些层能够逼近复杂的残差函数,例如H(x)−x(假设输入和输出的维度相同)。所以我们明确的让这些层来估计一个残差函数:F(x):=H(x)−x而不是H(x)。因此原始函数变成了:F(x)+x。尽管这两个形式应该都能够逼近所需的函数(正如假设一样),但是学习的难易程度并不相同。

这个重新表达的动机是由退化问题这个反常的现象(Fig.1左)。正如我们在引言(introduction)中讨论的,如果增加的层能以恒等映射来构建,一个更深模型的训练错误率不应该比它对应的浅层模型的更大。退化问题表明了求解器在通过多个非线性层来估计恒等映射上可能是存在困难的。而伴随着残差学习的重新表达,如果恒等映射是最优的,那么求解器驱使多个非线性层的权重趋向于零来逼近恒等映射。

在实际情况下,恒等映射不太可能达到最优,但是我们的重新表达对于这个问题的预处理是有帮助的。如果最优函数更趋近于恒等映射而不是0映射,那么对于求解器来说寻找关于恒等映射的扰动比学习一个新的函数要容易的多。通过实验(Fig.7)表明,学习到的残差函数通常只有很小的响应,这说明了恒等映射提供了合理的预处理。

3.2. 用Shortcuts进行目标映射

我们在堆叠层上采取残差学习算法。一个构建块如Fig.2所示。本文中的构建块定义如下(1式):

【论文翻译】Deep residual learning for image recognition

其中x和y分别表示层的输入和输出。函数F(x,{Wi})代表着学到的残差映射。Fig.2中的例子包含两层,F=W2σ(W1x),其中σ代表ReLU,为了简化省略了偏置项。F+x操作由一个shortcut连接和元素级(element-wise)的加法来表示。在加法之后我们再执行另一个非线性操作(例如σ(y),见Fig.2)。

1式中的shortcut连接没有增加额外的参数和计算复杂度。这不仅是一个很有吸引力的做法,同时在对“plain”网络和残差网络进行比较时也是非常重要的。我们可以在参数、深度、宽度以及计算成本都相同的基础上对两个网络进行公平的比较(除了可以忽略不计的元素级的加法)。

在1式中,x和F的维度必须相同。如果不相同(例如,当改变了输入/输出的通道),我们可以通过shortcut连接执行一个线性映射Ws来匹配两者的维度(2式):

【论文翻译】Deep residual learning for image recognition

在1式中同样可以使用方阵Ws。但我们的实验表明:恒等映射已足够解决退化问题并且是经济的,因此Ws只是用来解决维度不匹配的问题。

残差函数F的形式灵活可变。本文实验中涉及到的函数F是两层或者三层的(Fig.5),当然更多层也是可行的。但是如果F只含有一层,1式就和线性函数:y=W1x+x一致,因此并不具有任何优势。

我们还发现不仅是对于全连接层,对于卷积层也是同样适用的。函数F(x,{Wi})可以表示多个卷积层,在两个特征图的通道之间执行元素级的加法。

3.3. 网络架构

我们在多个plain网络和残差网络上进行了测试并都观测到了一致的现象。接下来我们将在ImageNet上对两个模型进行讨论。

Plain网络。我们的plain网络结构(Fig.3中)主要受VGG网络 (Fig.3左)的启发。卷积层主要为3*3的滤波器并遵循以下两点要求:(i) 输出特征尺寸相同的层含有相同数量的滤波器;(ii) 如果特征尺寸减半,则滤波器的数量增加一倍来保证每层的时间复杂度相同。我们直接通过stride为2的卷积层来进行下采样。在网络的最后是一个全局的平均pooling层和一个1000 类的包含softmax的全连接层。加权层的层数为34,如Fig.3中所示。

值得注意的是我们的模型比VGG网络(Fig.3左)有更少的滤波器和更低的计算复杂度。我们34层的结构含有36亿个FLOPs(乘-加),而这仅仅只有VGG-19(196亿个FLOPs)的18%。

残差网络。 在plain网络的基础上,我们插入shortcut连接(Fig.3右)将网络变成了对应的残差版本。如果输入和输出的维度相同时,可以直接使用恒等shortcuts (Eq.1)(Fig.3中的实线部分)。当维度增加时(Fig.3中的虚线部分),我们有两个选项:(A) shortcut仍然使用恒等映射,在增加的维度上使用0来填充,这样做不会增加额外的参数;(B) 使用Eq.2的映射shortcut来使维度保持一致(通过1*1的卷积)。对于这两个选项,当shortcut跨越两种尺寸的特征图时,均使用stride为2的卷积。

【论文翻译】Deep residual learning for image recognition

3.4. 执行结果

针对ImageNet的网络实现遵循了Krizhevsky2012ImageNet和Simonyan2014Very。调整图像的大小使它的短边长度随机的从[256,480]中采样来增大图像的尺寸。 从一张图像或者它的水平翻转图像中随机采样一个224*224的crop,每个像素都减去均值。图像使用标准的颜色增强。我们在每一个卷积层之后,**层之前均使用batch normalization(BN)。我们根据He2014spatial来初始化权值然后从零开始训练所有plain/残差网络。 我们使用的mini-batch的尺寸为256。学习率从0.1开始,每当错误率平稳时将学习率除以10,整个模型进行60∗10460∗104次迭代训练。我们将权值衰减设置为0.0001,a 动量为0.9。根据 Ioffe2015Batch,我们并没有使用Dropout。

在测试中,为了进行比较,我们采取标准的10-crop测试。为了达到最佳的结果,我们使用Simonyan2014Very及He2014spatial中的全卷积形式,并在多个尺度的结果上取平均分(调整图像的大小使它的短边长度分别为{224,256,384,480,640}{224,256,384,480,640})。

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

4. Experiments

4.1. ImageNet Classification

We evaluate our method on the ImageNet 2012 classification dataset [35] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem - the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.

We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error 3 . The reason for such optimization difficulties will be studied in the future.

Residual Networks. Next we evaluate 18-layer and 34layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, expect that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.

We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.

Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.

Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.

Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameterfree (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.

Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design 4 . For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.

The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.

101-layer and 152-layer ResNets: We construct 101layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).

The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).

Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.

4.2. CIFAR-10 and Analysis

We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.

The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes { 32, 16, 8 } respectively, with 2n layers for each feature map size. The numbers of filters are { 16, 32, 64 } respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:

【论文翻译】Deep residual learning for image recognition

When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.

We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [12] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.

We compare n = { 3, 5, 7, 9 } , leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [41]), suggesting that such an optimization difficulty is a fundamental problem.

Fig. 6 (middle) shows the behaviors of ResNets. Also similar to the ImageNet cases (Fig. 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases.

We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging 5 . So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [34] and Highway [41] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).

Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.

Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 10 3 -layer network is able to achieve training error <0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).

But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [9] or dropout [13] is applied to obtain the best results ([9, 25, 24, 34]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.

4.3. Object Detection on PASCAL and MS COCO

Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [40] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric ([email protected][.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.

Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.

4. 实验

4.1. ImageNet分类

本文在1000类的ImageNet2012数据集上对我们的方法进行评估。训练集包含128万张图像,验证集包含5万张图像。我们在10万张测试图像上进行测试,并对top-1和top-5 的错误率进行评估。

Plain网络。 我们首先评估了18层和34层的plain网络。34层的网络如图Fig.3中所示;18层的结构也很相似,具体细节参见Table 1。

Table 2中展示的结果表明了34层的网络比18层的网络具有更高的验证错误率。为了揭示产生这种现象的原因,在Fig.4左中我们比较了整个训练过程中的训练及验证错误率。从结果中我们观测到了明显的退化问题——在整个训练过程中34 层的网络具有更高的训练错误率,即使18层网络的解空间为34层解空间的一个子空间。

我们认为这种优化上的困难不太可能是由梯度消失所造成的。因为这些plain网络的训练使用了BN,这能保证前向传递的信号是具有非零方差的。我们同样验证了在反向传递阶段的梯度由于BN而具有良好的范式,所以在前向和反向阶段的信号不会存在消失的问题。事实上34层的plain网络仍然具有不错的准确率(Table 3),这表明了求解器在某种程度上也是有效的。我们推测,深层的plain网络的收敛率是指数衰减的,这可能会影响训练错误率的降低。这种优化困难的原因我们将在以后的工作中进行研究。

残差网络。 接下来我们对18层和34层的残差网络ResNets进行评估。如Fig.3右所示,ResNets的基本框架和plain网络的基本相同,除了在每一对3*3的滤波器上添加了一个shortcut连接。在Table 2以及Fig.4右的比较中,所有的shortcuts都是恒等映射,并且使用0对增加的维度进行填充(选项 A)。因此他们并没有增加额外的参数。

我们从Table 2和Fig.4中观测到以下三点:第一,与plain网络相反,34层的ResNet比18层ResNet的结果更优(2.8%)。更重要的是,34 层的ResNet在训练集和验证集上均展现出了更低的错误率。这表明了这种设置可以很好的解决退化问题,并且我们可以由增加的深度来提高准确率。

第二,与对应的plain网络相比,34层的ResNet在top-1 错误率上降低了3.5% (Table 2),这得益于训练错误率的降低(Fig.4 右 vs 左)。这也验证了在极深的网络中残差学习的有效性。

最后,我们同样注意到,18层的plain网络和残差网络的准确率很接近 (Table 2),但是ResNet 的收敛速度要快得多。(Fig.4 右 vs 左)。如果网络“并不是特别深” (如18层),现有的SGD能够很好的对plain网络进行求解,而ResNet能够使优化得到更快的收敛。

恒等vs映射Shortcuts。 我们已经验证了无参数的恒等shortcuts是有助于训练的。接下来我们研究映射shortcut(Eq.2)。在Table 3中,我们比较了三种选项:(A) 对增加的维度使用0填充,所有的shortcuts是无参数的(与Table 2 和 Fig.4 (右)相同);(B) 对增加的维度使用映射shortcuts,其它使用恒等shortcuts;(C) 所有的都是映射shortcuts。

Table 3表明了三种选项的模型都比对于的plain模型要好。B略好于A,我们认为这是因为A中的0填充并没有进行残差学习。C略好于B,我们把这个归结于更多的(13个)映射shortcuts所引入的参数。在A、B、C三个结果中细小的差距也表明了映射shortcuts对于解决退化问题并不是必需的。所以我们在本文接下来的内容中,为了减少复杂度和模型尺寸,并不使用选项C的模型。恒等shortcuts因其无额外复杂度而对以下介绍的瓶颈结构尤为重要。

深度瓶颈结构。 接下来我们介绍更深的模型。考虑到训练时间的限制,我们将构建块修改成\emph{瓶颈}的设计。对于每一个残差函数F,我们使用了三个叠加层而不是两个(Fig.5)。 这三层分别是11、33 和11 的卷积,11 的层主要负责减少然后增加(恢复)维度,剩下的3*3的层来减少输入和输出的维度。Fig.5展示了一个例子,这两种设计具有相似的时间复杂度。

无参数的恒等shortcuts对于瓶颈结构尤为重要。如果使用映射shortcuts来替代Fig.5(右)中的恒等shortcuts,将会发现时间复杂度和模型尺寸都会增加一倍,因为shortcut连接了两个高维端,所以恒等shortcuts对于瓶颈设计是更加有效的。

【论文翻译】Deep residual learning for image recognition

50层 ResNet:我们将34层网络中2层的模块替换成3层的瓶颈模块,整个模型也就变成了50层的ResNet (Table 1)。对于增加的维度我们使用选项B来处理。整个模型含有38亿个FLOPs。

101层和152层 ResNets:我们使用更多的3层模块来构建101层和152层的ResNets (Table 1)。值得注意的是,虽然层的深度明显增加了,但是152层ResNet的计算复杂度(113亿个FLOPs)仍然比VGG-16(153 亿个FLOPs)和VGG-19(196亿个FLOPs)的小很多。

50/101/152层ResNets比34层ResNet的准确率要高得多(Table 3 和4)。而且我们并没有观测到退化问题。所有的指标都证实了深度带来的好处。 (Table 3 和4)。

与最优秀方法的比较。 在Table 4中我们比较了目前最好的单模型结果。我们的34层ResNets取得了非常好的结果,152层的ResNet的单模型top-5验证错误率仅为 4.49%,甚至比先前组合模型的结果还要好 (Table 5)。我们将6个不同深度的ResNets合成一个组合模型(在提交结果时只用到2个152层的模型)。这在测试集上的top-5错误率仅为3.57% (Table 5),这一项在ILSVRC 2015 上获得了第一名的成绩。

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

4.2. CIFAR-10 及分析

我们在包含5万张训练图像和1万张测试图像的10类CIFAR-10数据集上进行了更多的研究。我们在训练集上进行训练,在测试集上进行验证。我们关注的是验证极深模型的效果,而不是追求最好的结果,因此我们只使用简单的框架如下。

Plain网络和残差网络的框架如 Fig.3(中/右)所示。网络的输入是3232的减掉像素均值的图像。第一层是33的卷积层。然后我们使用6n个3*3的卷积层的堆叠,卷积层对应的特征图有三种:{32,16,8}{32,16,8},每一种卷积层的数量为2n 个,对应的滤波器数量分别为{16,32,64}{16,32,64}。使用strde为2的卷积层进行下采样。在网络的最后是一个全局的平均pooling层和一个10类的包含softmax的全连接层。一共有6n+2个堆叠的加权层。具体的结构见下表:

【论文翻译】Deep residual learning for image recognition

使用shortcut连接3*3的卷积层对(共有 3n个shortcuts)。在这个数据集上我们所有的模型都使用恒等shortcuts(选项 A),因此我们的残差模型和对应的plain模型具有相同的深度、宽度和参数量。

权重的衰减设置为0.0001,动量为0.9,采用了He2015Delving中的权值初始化以及BN,但是不使用Dropout,mini-batch的大小为128,模型在2块GPU 上进行训练。学习率初始为0.1,在第32000和48000次迭代时将其除以10,总的迭代次数为64000,这是由45000/5000的训练集/验证集分配所决定的。我们在训练阶段遵循Lee2015deeply中的数据增强法则:在图像的每条边填充4个像素,然后在填充后的图像或者它的水平翻转图像上随机采样一个32*32 的crop。在测试阶段,我们只使用原始32*32的图像进行评估。

我们比较了n={3,5,7,9}n={3,5,7,9},也就是20、32、44以及56层的网络。Fig.6(左) 展示了plain网络的结果。深度plain网络随着层数的加深,训练错误率也变大。这个现象与在ImageNet(Fig.4, 左)和MNIST上的结果很相似,表明了优化上的难度确实是一个很重要的问题。

Fig.6(中)展示了ResNets的效果。与ImageNet(Fig.4, 右)中类似,我们的ResNets能够很好的克服优化难题,并且随着深度加深,准确率也得到了提升。

我们进一步探索了n=18,也就是110层的ResNet。在这里,我们发现0.1的初始学习率有点太大而不能很好的收敛。所以我们刚开始使用0.01的学习率,当训练错误率在80%以下(大约400次迭代)之后,再将学习率调回0.1继续训练。剩余的学习和之前的一致。110层的ResNets很好的收敛了 (Fig.6, 中)。它与其他的深层窄模型,如FitNet和 Highway (Table 6)相比,具有更少的参数,然而却达到了最好的结果 (6.43%, Table 6)。

层响应分析 Fig.7展示了层响应的标准方差(std)。 响应是每一个3*3卷积层的BN之后、非线性层(ReLU/addition)之前的输出。对于ResNets,这个分析结果也揭示了残差函数的响应强度。Fig.7表明了ResNets的响应比它对应的plain网络的响应要小。这些结果也验证了我们的基本动机(Sec3.1),即残差函数比非残差函数更接近于0。从Fig.7中ResNet-20、56和110的结果,我们也注意到,越深的ResNet的响应幅度越小。当使用更多层是,ResNets中单个层对信号的改变越少。

超过1000层的尝试 我们探索了一个超过1000层的极其深的模型。我们设置n=200,也就是1202层的网络模型,按照上述进行训练。我们的方法对103103层的模型并不难优化,并且达到了<0.1%的训练错误率(Fig.6, 右),它的测试错误率也相当低(7.93%, Table 6)。

但是在这样一个极其深的模型上,仍然存在很多问题。1202层模型的测试结果比110层的结果要差,尽管它们的训练错误率差不多。我们认为这是过拟合导致的。这样一个1202层的模型对于小的数据集来说太大了(19.4M)。在这个数据集上应用了强大的正则化方法,如maxout或者 dropout,才获得了最好的结果。本文中,我们并没有使用maxout/dropout,只是简单的通过设计深层窄模型来进行正则化,而且不用担心优化的难度。但是通过强大的正则化或许能够提高实验结果,我们会在以后进行研究。

【论文翻译】Deep residual learning for image recognition

【论文翻译】Deep residual learning for image recognition

4.3. PASCAL和MS COCO上的目标检测

我们的方法在其它识别任务上展现出了很好的泛化能力。Table 7和8展示了在PASCAL VOC 2007 和 2012以及 COCO上的目标检测结果。我们使用Faster R-CNN作为检测方法。在这里,我们比较关注由ResNet-101 替换VGG-16所带来的的提升。使用不同网络进行检测的实现是一样的,所以检测结果只能得益于更好的网络。最值得注意的是,在COCO数据集上,我们在COCO的标准指标([email protected][.5, .95])上比先前的结果增加了6.0%,这相当于28%的相对提升。而这完全得益于所学到的表达。

基于深度残差网络,我们在ILSVRC & COCO 2015竞赛的ImageNet检测、ImageNet定位、COCO检测以及COCO分割上获得了第一名。

【论文翻译】Deep residual learning for image recognition