Paper: "Deep Residual Learning for Image Recognition" - Kaiming He

Source code from the paper: https://github.com/KaimingHe/deep-residual-networks

  • The paper opens by noting that the deeper a neural network is, the harder it is to train and optimize. **The main motivation (purpose) of ResNet** is to solve the degradation phenomenon in deep networks, which is an optimization problem: optimization with SGD becomes more difficult. Degradation problem: as the network depth increases, the training error does not decrease but instead goes up.

In this paper, we address the degradation problem by introducing a deep residual learning framework.

  • Two major problems in deep networks: vanishing/exploding gradients and the degradation problem

vanishing/exploding gradients, which hamper convergence from the beginning.

Solution: normalization, which keeps the pre-activations balanced between the activation function's linear (sensitive) region and its saturated region.

This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation.

An important assumption in deep learning is the i.i.d. (independent and identically distributed) assumption. Batch Normalization normalizes each layer's outputs (before they enter the activation function) so that they follow a standard normal distribution with mean = 0 and std = 1, which keeps the inputs of every layer identically distributed. BN was inspired by whitening the input images.
[Batch Normalization] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

See also: an in-depth explanation of batch normalization.
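
As a rough illustration of the normalization step described above, here is a minimal NumPy sketch (the function name is made up for these notes); it only covers the per-feature standardization and omits BN's learnable scale/shift parameters and the running statistics used at inference:

```python
import numpy as np

def batch_norm_forward(x, eps=1e-5):
    """Standardize pre-activations over a mini-batch.

    x: array of shape (batch, features), i.e. a layer's output before the
    activation function. Each feature is shifted and scaled to roughly zero
    mean and unit std over the batch. Full BN also learns gamma/beta and
    keeps running statistics for inference; both are omitted here.
    """
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)

# A mini-batch of 4 samples with 3 features, far from zero mean / unit std
x = np.random.randn(4, 3) * 10 + 5
x_hat = batch_norm_forward(x)
print(x_hat.mean(axis=0), x_hat.std(axis=0))   # approximately 0 and 1
```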

  • ResNet learns the residual function F(x)
    The assumption is that the residual mapping is easier to optimize.

Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
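
To make the F(x) + x formulation concrete, here is a minimal PyTorch-style sketch of a basic residual block (an illustration written for these notes, not the paper's released Caffe implementation; the class name and fixed channel width are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F(x) is two stacked 3x3 conv layers.

    The channel count is kept constant, so the identity shortcut needs
    no projection; BN follows each convolution.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(out + x)          # element-wise addition, then ReLU

# Shape is preserved, so the block can be stacked freely
y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))   # -> (1, 64, 56, 56)
```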

  • ResNet shortcut connections vs. Highway connections
    The two are similar in that both use shortcuts: in the forward pass, feature information from earlier layers can flow directly to later layers, and in backpropagation the gradient can also flow backward more easily, so very deep networks can be optimized well (see the sketch after this block).
  1. Compared with highway connections, the shortcut connections in ResNet introduce no extra parameters.
  2. The shortcuts in highway connections are "gated".

“highway networks” present shortcut connections with gating functions.

See also: shortcut connections and highway networks.
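
The contrast can be sketched in a few lines of PyTorch (illustrative layer names; the highway form follows the gating idea y = T(x) * H(x) + (1 - T(x)) * x, and fully-connected layers are used here only for brevity):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Gated shortcut: y = T(x) * H(x) + (1 - T(x)) * x.
    The transform gate T(x) costs an extra weight matrix and bias."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)        # T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))        # learned gate in (0, 1)
        return t * h + (1 - t) * x

class ResidualLayer(nn.Module):
    """Identity shortcut: y = F(x) + x, with no gating and no extra
    parameters on the shortcut path."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # F(x)

    def forward(self, x):
        return torch.relu(self.transform(x) + x)   # addition, then ReLU
```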

  • Identity mapping by shortcuts
    The identity mapping in ResNet is written as y = F(x, {W_i}) + x (Eqn. (1) in the paper),
    where F(x, {W_i}) denotes the residual mapping to be learned.

The operation F + x is performed by a shortcut connection and element-wise addition. The element-wise addition is performed on two feature maps, channel by channel.

After the addition, the ReLU activation is applied.
The number of channels of x and F must be equal; when the dimensions increase, two options can be used to match them:

  • with extra zero entries padded for the increasing dimensions; this option introduces no extra parameter;
  • the projection shortcut in Eqn. (2), i.e. a 1x1 convolution, which is equivalent to performing a linear projection W_s on the shortcut connection to match the dimensions: y = F(x, {W_i}) + W_s x (a minimal sketch follows right after this list).
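
Here is what the projection-shortcut option might look like in PyTorch when a stage both doubles the channels and halves the spatial size (an illustrative sketch for these notes, not the released implementation; the class name and stride choice are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleResidualBlock(nn.Module):
    """Residual block where F(x) changes the dimensions, so the shortcut
    applies a 1x1 convolution W_s (Eqn. (2)): y = F(x, {W_i}) + W_s x."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 projection shortcut: matches both channel count and spatial size
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.proj(x))   # element-wise addition, then ReLU

# 64 -> 128 channels, 56x56 -> 28x28 feature map
y = DownsampleResidualBlock(64, 128)(torch.randn(1, 64, 56, 56))   # -> (1, 128, 28, 28)
```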
  • Plain Network vs. Residual Network
    The former has no shortcuts; it is simply a stack of layers, e.g. VGG [from the Oxford Visual Geometry Group, built entirely from 3x3 convolutions]. Two design principles from VGG (a sketch of a plain stage built on these rules follows this list):
  • for the same output feature map size, the layers have the same number of filters;
  • if the feature map size is halved, the number of filters is doubled.
    See also: the Zhihu article 一文读懂VGG.
    [Figure/table from the paper: results comparing networks at different depths]
    This result shows that with ResNet, increasing the network depth reduces the error.
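
A plain (shortcut-free) stage that follows those two rules could be sketched like this in PyTorch (the helper name and stage widths are made up for illustration):

```python
import torch.nn as nn

def plain_stage(in_channels, out_channels, num_layers, downsample=False):
    """A plain, VGG-style stage: within the stage the feature-map size and
    the number of filters stay fixed; when the feature map is halved
    (stride-2 first conv), the caller doubles the filter count."""
    layers = []
    channels = in_channels
    for i in range(num_layers):
        stride = 2 if (downsample and i == 0) else 1
        layers += [
            nn.Conv2d(channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        channels = out_channels
    return nn.Sequential(*layers)

# Each time the feature map is halved, the number of filters is doubled
plain_net = nn.Sequential(
    plain_stage(64, 64, 2),
    plain_stage(64, 128, 2, downsample=True),
    plain_stage(128, 256, 2, downsample=True),
)
```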
  • Deeper Bottleneck Architectures
    [Figure from the paper: the bottleneck building block]
    To keep the training time affordable, the authors adopt a bottleneck design to study deeper ResNet architectures.
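
A rough PyTorch sketch of such a bottleneck block (1x1 reduce, 3x3, 1x1 restore); the class name and exact BN placement are assumptions for these notes, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """Bottleneck residual block: a 1x1 conv reduces the channels, a 3x3
    conv works on the narrower width, and a 1x1 conv restores the channels
    (e.g. 256 -> 64 -> 64 -> 256), keeping the 3x3 layer cheap in very
    deep networks."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        self.conv = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        self.restore = nn.Conv2d(bottleneck_channels, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        return F.relu(out + x)          # identity shortcut, then ReLU

y = BottleneckBlock(256, 64)(torch.randn(1, 256, 14, 14))   # shape preserved
```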
  • Exploring Over 1000 layers

The authors found that at depth = 1202 and depth = 110 the training errors are similar, but the test error of the 1202-layer network is worse than that of the 110-layer one; they attribute this to overfitting. Maxout and dropout can mitigate overfitting.

  • Maxout and Dropout
    Maxout takes a parameter k, which multiplies the number of parameters by a factor of k. A maxout layer can be viewed as an "activation function" that is piecewise linear, not fixed, and learnable; it acts as a hidden layer and a function approximator, effectively adding k extra nodes between the original input and output layers (see the sketch below).
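
A minimal PyTorch sketch of a maxout layer (the class name and sizes are illustrative), paired with dropout as a regularizer against overfitting:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout 'activation': k affine pieces per output unit; the layer
    outputs their element-wise maximum, giving a learnable piecewise-linear
    function approximator. The parameter count grows by a factor of k."""
    def __init__(self, in_features, out_features, k):
        super().__init__()
        self.out_features = out_features
        self.k = k
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x)                                  # (batch, out_features * k)
        z = z.view(x.size(0), self.out_features, self.k)    # k pieces per output unit
        return z.max(dim=2).values                          # take the maximum piece

# Maxout hidden layer with k = 3 pieces, followed by dropout
layer = nn.Sequential(Maxout(128, 64, k=3), nn.Dropout(p=0.5))
out = layer(torch.randn(8, 128))
print(out.shape)   # torch.Size([8, 64])
```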