Understand and Implement ResNet-50 with TensorFlow 2.0

Our intuition may suggest that deeper neural networks should be able to capture more complex features, and thus represent more complex functions than shallower ones. The question that naturally arises is: is learning a better network simply a matter of stacking more and more layers? What are the problems and benefits of this approach? These questions, along with several other important concepts, were discussed in the Deep Residual Learning for Image Recognition paper by K. He et al. (2015). The architecture is known as ResNet, and the paper introduced many must-know concepts related to deep neural networks (DNNs). All of them will be addressed in this post, including an implementation of a 50-layer ResNet in TensorFlow 2.0. What you can expect to learn from this post:

  1. The problem with very deep neural networks.
  2. The mathematical intuition behind ResNet.
  3. Residual blocks and skip connections.
  4. Structuring ResNet and the importance of 1×1 convolution.
  5. Implementing ResNet with TensorFlow.

Let’s begin!

Degradation Problem:

The main motivation of the original ResNet work was to address the degradation problem in deep networks: adding more layers to a sufficiently deep neural network first causes the accuracy to saturate and then to degrade. He et al. presented the following plot of training and test error on the Cifar-10 data set using a plain (vanilla) network:

Fig. 1: Classification error on Cifar-10 increases with the number of layers, for both training (left) and test data (right), in a plain DNN. Reference: [1]

As we can see, both the training error (left) and test error (right) of the deeper network (56 layers) are higher than those of the 20-layer network, and the gap does not close with more training epochs. At first glance this looks like overfitting, since more layers mean more parameters. But it is not overfitting: if it were, the training error of the deeper network would be lower, not higher. Let's understand what is actually going on.

One way of thinking about the problem is to consider a sufficiently deep neural network that already computes a set of features strong enough for the task at hand (e.g. image classification). If we add one more layer to this already very deep network, what should the additional layer do? Since the network can already compute strong features, the additional layer does not need to compute any extra features; it should simply copy the features already computed, i.e. perform an identity mapping (the kernels in the added layer produce exactly the same features as those of the previous layer). This seems like a very simple operation, but within a deep neural network it turns out to be surprisingly hard to learn.

Mathematical Intuition Behind ResNet:

Let us consider a DNN architecture (including the learning rate and other hyperparameters) that can reach a class of functions F. So for every f ∈ F, there exist parameters W that we can obtain by training the network on a particular data set. If f* denotes the function that we would really like to find (the result of the best possible optimization) but it does not lie within F, we instead try to find f1, the best approximation within F. If we design a more powerful architecture G, we expect to arrive at a better outcome g1 that improves on f1. But if F ⊈ G, there is no guarantee that this assumption holds; in fact g1 could be worse than f1, and this is the degradation problem. So the main point is: if the function class of the deeper network contains the function class of the simpler, shallower network, then we can guarantee that the deeper network will not reduce, and can only increase, the representational power of the original shallow network. This will become clearer once we introduce the residual block in the next section.
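The guarantee of the previous paragraph can also be restated compactly (the notation below is my own shorthand, not taken from the paper). Writing dist(·, f*) for how far a function is from the target f*,

$$
\mathcal{F} \subseteq \mathcal{G} \;\Longrightarrow\; \min_{g \in \mathcal{G}} \mathrm{dist}(g, f^{*}) \;\le\; \min_{f \in \mathcal{F}} \mathrm{dist}(f, f^{*}).
$$

Without the inclusion, the two minima range over unrelated sets and no such inequality is available, which is why simply making an architecture bigger buys nothing by itself.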

Residual Block:

The idea of the residual block is based entirely on the intuition explained above: the simpler function (shallower network) should be a subset of the more complex function (deeper network), so that the degradation problem can be addressed. Let us consider an input x, and denote the desired mapping from input to output by g(x). Instead of learning this function directly, we learn the simpler residual function f(x) = g(x) - x; the original mapping is then recast as f(x) + x. In the ResNet paper, He et al. hypothesized that it is easier to optimize the residual f(x) than the original g itself. Optimizing the residual also takes care of the dreaded identity mapping: if the extra layers in a very deep network should simply pass their input through unchanged, they only have to drive f towards zero. Let's look at the schematic of the residual block below:

Fig. 2: Residual block and the skip connection for identity mapping. Re-created following Reference: [3]

The residual learning formulation ensures that when the identity mapping is optimal (i.e. g(x) = x), the optimization simply drives the weights of the residual function towards zero. ResNet consists of many residual blocks, where residual learning is applied to every few (usually 2 or 3) stacked layers. The building block is shown in Figure 2 and its output can be written as y = f(x, W) + x, where W denotes the weights learned during training. The operation f + x is performed by a shortcut connection ('skipping' 2 or 3 layers) followed by element-wise addition. This is the simplest form of the block, where the skip connection involves no additional parameters. Element-wise addition is only possible when f and x have the same dimensions; when they do not, we multiply the input x by a projection matrix Ws, so that the dimensions of f and x match. In that case the output becomes y = f(x, W) + Ws · x, and the elements of the projection matrix are also trainable.
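In tf.keras these two cases translate almost directly into code. Below is a minimal sketch under the assumption that only the depth can mismatch (the helper name residual_add is mine; spatial down-sampling is handled later with strided convolutions):

```python
from tensorflow.keras import layers

def residual_add(x, f_x):
    """y = f(x) + x when the shapes already match,
    otherwise y = f(x) + Ws*x with a learned 1x1 projection as Ws."""
    if x.shape[-1] != f_x.shape[-1]:
        # Ws is realised as a trainable 1x1 convolution on the shortcut path
        x = layers.Conv2D(f_x.shape[-1], 1)(x)
    return layers.Add()([f_x, x])
```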

Building ResNet and 1×1 Convolution:

We will build the 50-layer ResNet following the method adopted in the original paper by He et al. The architecture used for ResNet-50 differs from the 34-layer architecture: the shortcut connection skips 3 layers instead of 2, and the schematic diagram below helps clarify this point.

Fig. 3: Left: skipping 2 layers in ResNet-34. Right: skipping 3 layers, including 1×1 convolutions, in ResNet-50. Reference: [1]

In ResNet-50, the stacked layers in the residual block are always 1×1, 3×3, and 1×1 convolution layers. The first 1×1 convolution reduces the depth, the features are then computed in the bottleneck 3×3 layer, and the next 1×1 layer increases the depth again. Using 1×1 filters to reduce and restore the dimension of the feature maps before and after the bottleneck layer was described by Szegedy et al. in the GoogLeNet model, in their Inception paper. Since there is no pooling layer inside the residual block, spatial dimensions are reduced by 1×1 convolutions with strides of 2 (a quick shape check follows). With these points in mind, let's build ResNet-50 using TensorFlow 2.0.
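As a quick sanity check of that last statement (the sizes here are arbitrary, chosen only for illustration), a 1×1 convolution changes only the number of channels, and giving it strides of 2 also halves the spatial resolution:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 56, 56, 256))             # (batch, height, width, channels)

reduced = layers.Conv2D(64, 1)(x)                  # 1x1 conv: depth 256 -> 64
print(reduced.shape)                               # (1, 56, 56, 64)

downsampled = layers.Conv2D(512, 1, strides=2)(x)  # 1x1 conv with strides of 2
print(downsampled.shape)                           # (1, 28, 28, 512)
```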

Building ResNet-50:

Before coding, it is worth revisiting the ResNet-34 architecture as presented in the original paper [1].

The only pooling layers are placed at the very beginning of the architecture and just before the dense layer at the end. To change dimensions elsewhere, 1×1 convolutions are used as described in the previous section. For the number of filters and other parameters, I followed the Keras example [2]. Now it is time to code. First, we define the simplest identity block, where the dimensions of the input do not change, only the depth; the code block is below.
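The original gist is not reproduced here; the following is a minimal sketch of such an identity block in tf.keras (the function name identity_block and its default arguments are illustrative, not necessarily those of the author's notebook), following the 1×1 → 3×3 → 1×1 bottleneck pattern described above:

```python
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    """Residual block whose shortcut is a pure identity:
    spatial size and depth of the output match the input."""
    f1, f2, f3 = filters          # e.g. (64, 64, 256); f3 must equal the input depth
    shortcut = x                  # identity skip connection

    # 1x1 convolution reduces the depth
    y = layers.Conv2D(f1, 1, strides=1)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)

    # 3x3 bottleneck convolution computes the features
    y = layers.Conv2D(f2, kernel_size, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)

    # 1x1 convolution restores the depth so it matches the shortcut
    y = layers.Conv2D(f3, 1, strides=1)(y)
    y = layers.BatchNormalization()(y)

    # element-wise addition f(x) + x, followed by the final activation
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)
```

Note that the last 1×1 convolution must bring the depth back to that of the input, otherwise the element-wise addition with the identity shortcut is not defined.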

The simplest residual block: no change in spatial dimensions, only in depth.

The other residual block changes the dimensions of the input by using 1×1 convolutions with strides of 2, so the skip connection must also go through a dimension change.
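Again the gist itself is not shown here; a sketch of such a block (continuing the previous snippet and its layers import, with the illustrative name conv_block) applies a strided 1×1 convolution on both the main path and the shortcut, so that the two tensors agree in shape before the addition:

```python
def conv_block(x, filters, kernel_size=3, strides=2):
    """Residual block that changes the depth and, with strides=2,
    halves the spatial size; the shortcut is a 1x1 projection (Ws)."""
    f1, f2, f3 = filters

    # projection shortcut: matches both the depth and the spatial size
    shortcut = layers.Conv2D(f3, 1, strides=strides)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # main path: strided 1x1 -> 3x3 -> 1x1
    y = layers.Conv2D(f1, 1, strides=strides)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)

    y = layers.Conv2D(f2, kernel_size, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)

    y = layers.Conv2D(f3, 1, strides=1)(y)
    y = layers.BatchNormalization()(y)

    # element-wise addition f(x) + Ws*x, followed by the final activation
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)
```

With strides=1 this block still changes the depth through the projection shortcut while keeping the spatial size, which is what the first block of a stage needs.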

Convolution with strides of 2: the dimensions of the input change.

Combining these two residual blocks, we can now build the complete 50-layer ResNet as shown below.
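A condensed sketch of how the two blocks above can be stacked into the 50-layer arrangement (3, 4, 6 and 3 bottleneck blocks per stage, as in the paper) is given below. The stem, the input shape and the function name build_resnet50 are my assumptions, written for 32×32 Cifar-10 images; the exact settings used for the reported results are in the linked notebook [4]:

```python
from tensorflow.keras import Model

def build_resnet50(input_shape=(32, 32, 3), n_classes=10):
    """ResNet-50-style model built from identity_block and conv_block."""
    inputs = layers.Input(shape=input_shape)

    # stem: a single convolution and the only early pooling layer
    x = layers.Conv2D(64, 3, padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(3, strides=2, padding='same')(x)

    # stage 1: 3 blocks with (64, 64, 256) filters; only the depth changes here
    x = conv_block(x, (64, 64, 256), strides=1)
    for _ in range(2):
        x = identity_block(x, (64, 64, 256))

    # stage 2: 4 blocks with (128, 128, 512) filters
    x = conv_block(x, (128, 128, 512))
    for _ in range(3):
        x = identity_block(x, (128, 128, 512))

    # stage 3: 6 blocks with (256, 256, 1024) filters
    x = conv_block(x, (256, 256, 1024))
    for _ in range(5):
        x = identity_block(x, (256, 256, 1024))

    # stage 4: 3 blocks with (512, 512, 2048) filters
    x = conv_block(x, (512, 512, 2048))
    for _ in range(2):
        x = identity_block(x, (512, 512, 2048))

    # pooling again only just before the dense classifier
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    return Model(inputs, outputs, name='resnet50_cifar10')
```

Counting weight layers: 1 stem convolution + 16 bottleneck blocks × 3 convolutions + 1 dense layer = 50, which is where the name comes from.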

Using a batch size of 64, 160 epochs, and data augmentation, an accuracy of roughly 85% on the training data and roughly 82% on the test data was achieved. A minimal sketch of such a training setup follows, and the resulting training and validation curves are shown after it.
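The sketch below is consistent with those numbers (batch size 64, 160 epochs, light augmentation), but the optimizer, learning rate and augmentation parameters are my assumptions, not necessarily what produced the reported accuracies; it reuses build_resnet50 from above:

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = build_resnet50(input_shape=(32, 32, 3), n_classes=10)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# light data augmentation: small shifts and horizontal flips
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

history = model.fit(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=160,
                    validation_data=(x_test, y_test))
```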

Training and validation accuracy/loss for the 50-layer ResNet on Cifar-10 data.

We can also plot the confusion matrix for all 10 classes of the Cifar-10 data.
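One possible way to produce such a plot, using scikit-learn and matplotlib on top of the trained model (variable names follow the training sketch above; this is not necessarily how the figure in the original post was made):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# predicted class = argmax of the softmax output for each test image
y_pred = np.argmax(model.predict(x_test), axis=1)
cm = confusion_matrix(y_test.flatten(), y_pred)

ConfusionMatrixDisplay(cm, display_labels=class_names).plot(
    cmap='Blues', xticks_rotation=45)
plt.tight_layout()
plt.show()
```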

Confusion matrix for Cifar-10 data trained with ResNet-50.

Discussion:

Here we have seen an example of implementing ResNet-50 with TensorFlow 2.0 and training it on Cifar-10 data. One important point of discussion is the order of Convolution, BatchNorm and Activation, which is still debated. The order used in the original BatchNorm paper is not considered best by many; see the GitHub issue here. I recommend trying parameters different from those used in the notebook to understand their effects.
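For reference, the two orderings most often compared are sketched below: post-activation as in the original papers, and the pre-activation variant proposed in He et al.'s follow-up work on identity mappings. The function names are illustrative; which ordering trains better is exactly the open question referred to above.

```python
from tensorflow.keras import layers

def post_activation_unit(x, filters):
    # ordering used in the original papers: Conv -> BatchNorm -> ReLU
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

def pre_activation_unit(x, filters):
    # "pre-activation" ordering: BatchNorm -> ReLU -> Conv
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return layers.Conv2D(filters, 3, padding='same')(x)
```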

A few of the important points you can take away from this post are:

  1. The distinction between degradation and overfitting, and why degradation occurs in very deep networks.
  2. Using 1×1 convolutions to increase and decrease the dimensions of feature maps.
  3. How the residual block helps to prevent the degradation problem.

That's all for now! Hope this helps you a little, and stay strong!

References:

[1] ResNet original paper: Kaiming He et al., Deep Residual Learning for Image Recognition.

[2] Keras example implementation.

[3] Alex Smola: ResNet intuition lecture.

[4] Notebook for the code used: GitHub link.

Translated from: https://towardsdatascience.com/understand-and-implement-resnet-50-with-tensorflow-2-0-1190b9b52691