Paper Reading: ResNeSt: Split-Attention Networks
1. Paper Overview
This paper sparked a lot of discussion on Zhihu already at the submission stage, and it was quite controversial. I think the controversy is not specific to ResNeSt but applies to many current academic papers. The core issue is this: the proposed model gains several points on the benchmarks, but is that gain due to the architectural change or to the many training tricks? The authors used a lot of tricks during training (detailed below), yet in the ablation study they did not compare against the original ResNet under the same settings, so readers cannot tell where the improvement actually comes from.
The author also pointed out that roughly half of EfficientDet's gains come from using EfficientNet as the backbone, and EfficientNet's own gains are largely due to various training tricks as well.
The paper makes two main contributions. First, it proposes a cross-channel attention mechanism; roughly speaking, it combines the strengths of SENet, SKNet, and ResNeXt to improve ResNet (ResNeSt is to SKNet what ResNeXt is to ResNet). Second, from an engineering standpoint, it provides a very strong backbone (ResNeSt): transferring it to object detection and segmentation directly gains several points, and the pretrained models can be used off the shelf.
There has been a lot of discussion about this paper. Here are a few links worth reading (the commentary is fairly balanced, neither hype nor bashing):
1. "ResNeSt got a strong reject from ECCV 2020 after topping the leaderboards: whose fault is it?" (Zhihu discussion)
And a few of the more useful comments:
@Hang Zhang (张航, the paper's author himself):
@打酱油的疯子
@Naiyan Wang
ResNeSt's leaderboard results:
The paper's two contributions:
As the first contribution of this paper, we explore a simple architectural modification of the ResNet [23], incorporating feature-map split attention within the individual network blocks. More specifically, each of our blocks divides the feature-map into several groups (along the channel dimension) and finer-grained subgroups or splits, where the feature representation of each group is determined via a weighted combination of the representations of its splits (with weights chosen based on global contextual information). We refer to the resulting unit as a Split-Attention block, which remains simple and modular. By stacking several Split-Attention blocks, we create a ResNet-like network called ResNeSt (S stands for "split"). Our architecture requires no more computation than existing ResNet-variants, and is easy to be adopted as a backbone for other vision tasks.

The second contribution of this paper is large scale benchmarks on image classification and transfer learning applications. We find that models utilizing a ResNeSt backbone are able to achieve state of the art performance on several tasks, namely: image classification, object detection, instance segmentation and semantic segmentation.
2. 1×1 Convolution as a Form of Attention
NIN [40] (Network in Network) first uses a global average pooling layer to replace the heavy fully connected layers, and adopts 1 × 1 convolutional layers to learn non-linear combination of the featuremap channels, which is the first kind of featuremap attention mechanism.
I should go back and read this (NIN) paper carefully later!
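To make the idea of 1 × 1 channel attention concrete, here is a minimal PyTorch-style sketch of a channel-attention gate built from global average pooling and 1 × 1 convolutions, in the same spirit as SENet. It is my own illustration, not code from NIN or ResNeSt; the module name and reduction ratio are made up for the example.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Toy channel-attention gate: global average pooling + two 1x1 convs.
    Illustrative sketch only, not the module from any of the cited papers."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # (N, C, H, W) -> (N, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # per-channel weights in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)                                        # reweight the channels

x = torch.randn(2, 64, 56, 56)
print(ChannelAttention(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```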
3. The ResNeSt Architecture
The overall structure:
In more detail:
Where G = KR comes from (G is the final number of feature-map groups):
Feature-map Group. As in ResNeXt blocks [61], the feature can be divided into several groups, and the number of feature-map groups is given by a cardinality hyperparameter K. We refer to the resulting feature-map groups as cardinal groups. We introduce a new radix hyperparameter R that indicates the number of splits within a cardinal group, so the total number of feature groups is G = KR.
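Below is a simplified PyTorch sketch of a Split-Attention operation with cardinality K and radix R, based on my reading of the paper rather than the official ResNeSt code; the reduction ratio, the hidden size, and the exact channel ordering are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Simplified Split-Attention sketch: a grouped 3x3 conv produces
    K * R feature groups, and the R splits inside each cardinal group are
    fused with softmax attention weights computed from global context."""
    def __init__(self, channels: int, cardinality: int = 1, radix: int = 2, reduction: int = 4):
        super().__init__()
        self.K, self.R = cardinality, radix
        self.conv = nn.Conv2d(channels, channels * radix, kernel_size=3, padding=1,
                              groups=cardinality * radix, bias=False)
        self.bn = nn.BatchNorm2d(channels * radix)
        inter = max(channels // reduction, 32)
        self.fc1 = nn.Conv2d(channels, inter, kernel_size=1, groups=cardinality)
        self.fc2 = nn.Conv2d(inter, channels * radix, kernel_size=1, groups=cardinality)

    def forward(self, x):
        N, C = x.shape[0], x.shape[1]
        feats = F.relu(self.bn(self.conv(x)))                     # (N, C*R, H, W)
        splits = feats.view(N, self.R, C, *feats.shape[2:])       # R splits of C channels each
        gap = splits.sum(dim=1).mean(dim=(2, 3), keepdim=True)    # fuse splits, then GAP -> (N, C, 1, 1)
        att = self.fc2(F.relu(self.fc1(gap)))                     # (N, C*R, 1, 1)
        att = att.view(N, self.K, self.R, C // self.K).transpose(1, 2)
        att = F.softmax(att, dim=1).reshape(N, self.R, C, 1, 1)   # softmax over the R splits
        return (splits * att).sum(dim=1)                          # weighted sum -> (N, C, H, W)

x = torch.randn(2, 64, 32, 32)
print(SplitAttention(64, cardinality=2, radix=2)(x).shape)  # torch.Size([2, 64, 32, 32])
```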
4. The SKNet Architecture
Note: convolutions with kernels of different sizes are applied, and an attention mechanism then lets the model learn when to use which kernel's output (the SAM module in DetectoRS is also an improvement built on this idea).
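For comparison, here is a rough PyTorch sketch of the selective-kernel idea: two branches with different kernel sizes are fused by softmax attention computed from a pooled global descriptor. This is illustrative only, not the official SKNet code; the branch count, channel sizes, and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveKernel(nn.Module):
    """Rough selective-kernel sketch: attention chooses, per channel,
    between a 3x3 and a 5x5 branch."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)
        inter = max(channels // reduction, 16)
        self.fc = nn.Linear(channels, inter)
        self.select = nn.Linear(inter, channels * 2)    # one score per branch per channel

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))                  # fuse branches, then GAP -> (N, C)
        z = F.relu(self.fc(s))
        a = self.select(z).view(-1, 2, u3.shape[1])     # (N, 2, C)
        a = F.softmax(a, dim=1)[..., None, None]        # softmax across the two branches
        return a[:, 0] * u3 + a[:, 1] * u5

x = torch.randn(2, 32, 28, 28)
print(SelectiveKernel(32)(x).shape)  # torch.Size([2, 32, 28, 28])
```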
5. Two Equivalent Implementations of ResNeSt
Note: these are the cardinality-major and radix-major views of ResNeSt that the author describes. The paper explains the block from the cardinality-major view (the same cardinality as in ResNeXt), but that layout has higher latency, so the released code actually implements the radix-major view.
In the figure above, the left is the cardinality-major layout, the right is the radix-major layout, and the middle shows an intermediate step of the conversion.
Reference link for the figure
6. ResNeSt's Modifications to the ResNet Architecture
Note: the modified ResNet is the real baseline of ResNeSt, yet the roughly 3-point improvement claimed at the start of the paper (on Cascade R-CNN) is measured against the original ResNet...
(1) Average pooling instead of strided convolution
Convolutional layers require handling featuremap boundaries with zero-padding strategies, which is often suboptimal when transferring to other dense prediction tasks. Instead of using strided convolution at the transitioning block (in which the spatial resolution is downsampled), we use an average pooling layer with a kernel size of 3 × 3.
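A minimal sketch of what this swap looks like in code (my illustration; the channel numbers are arbitrary):

```python
import torch.nn as nn

# Original transition: a strided 3x3 convolution does the downsampling.
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)

# ResNeSt-style transition: a 3x3 average pooling layer handles the stride,
# and the convolution that follows keeps stride 1.
avgpool_then_conv = nn.Sequential(
    nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
)
```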
(2) Two tricks already present in ResNet-D
Note: ResNet-D was proposed in the paper "Bag of Tricks for Image Classification with Convolutional Neural Networks".
Tweaks from ResNet-D.
We also adopt two simple yet effective ResNet modifications introduced by [26]:
(1) The first 7 × 7 convolutional layer is replaced with three consecutive 3 × 3 convolutional layers, which have the same receptive field size with a similar computation cost as the original design. (Comments I saw say this trick alone can be worth about one point.)
(2) A 2 × 2 average pooling layer is added to the shortcut connection prior to the 1 × 1 convolutional layer for the transitioning blocks with stride of two.
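A small PyTorch sketch of the two tweaks, with channel sizes assumed for illustration (not the authors' exact code):

```python
import torch.nn as nn

# (1) Deep stem: three consecutive 3x3 convolutions replacing the single 7x7 conv.
deep_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)

# (2) Shortcut of a stride-2 transition block: a 2x2 average pooling layer sits
#     before the 1x1 convolution, so the 1x1 conv itself no longer needs a stride.
shortcut = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=1, stride=1, bias=False),
    nn.BatchNorm2d(128),
)
```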
7. Tricks Used When Training ResNeSt
(1) Large mini-batch distributed training and many epochs
Following prior work [19,37], we train our models using 8 servers (64 GPUs in total) in parallel. Our learning rates are adjusted according to a cosine schedule [26,31]. We follow the common practice using linearly scaling-up the initial learning rate based on the mini-batch size. The initial learning rate is given by η = (B/256) · η_base, where B is the mini-batch size and we use η_base = 0.1. We use a mini-batch of size 8192 for ResNeSt-50, 4096 for ResNeSt-101, and 2048 for ResNeSt-{200, 269}. Training is done for 270 epochs with a weight decay of 0.0001 and momentum of 0.9, using a cosine learning rate schedule with the first 5 epochs reserved for warm-up.
Note: a cosine learning-rate schedule and warm-up are still used. I saw commenters say that a large mini-batch is a prerequisite for training lightweight networks, so it is probably quite important for ResNeSt as well.
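Putting the pieces together, here is a small sketch of the per-epoch learning rate implied by these settings (linear scaling by B/256, 5 warm-up epochs, cosine decay over 270 epochs). The function is my own illustration, not the authors' training code.

```python
import math

def learning_rate(epoch: int, batch_size: int, base_lr: float = 0.1,
                  total_epochs: int = 270, warmup_epochs: int = 5) -> float:
    """Linear scaling + warm-up + cosine decay, as described in the paper."""
    peak_lr = base_lr * batch_size / 256                        # eta = (B / 256) * eta_base
    if epoch < warmup_epochs:                                   # linear warm-up
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))   # cosine decay

print(round(learning_rate(epoch=0, batch_size=8192), 3))    # start of warm-up
print(round(learning_rate(epoch=135, batch_size=8192), 3))  # roughly mid-training
```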
(2) Label smoothing
Note: mainly used to prevent overfitting.
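A minimal sketch of label smoothing written as a loss function (my illustration; recent PyTorch versions also expose this via the `label_smoothing` argument of `nn.CrossEntropyLoss`):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy against smoothed targets: the true class keeps 1 - eps
    of the probability mass, and eps is spread uniformly over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)  # -log p(true class)
    uniform = -log_probs.mean(dim=-1)                                      # average over all classes
    return ((1 - eps) * nll + eps * uniform).mean()

logits = torch.randn(4, 1000)
target = torch.randint(0, 1000, (4,))
print(label_smoothing_loss(logits, target))
```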
(3) Auto Augmentation
These are data-augmentation policies for training classification networks that were found via a NAS-style search.
Link to a paper walkthrough
(4) Mixup training (for classification networks)
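A short sketch of mixup for classification (illustrative; alpha = 0.2 is a commonly used value, not necessarily the paper's exact setting):

```python
import torch

def mixup(images: torch.Tensor, labels_onehot: torch.Tensor, alpha: float = 0.2):
    """Blend random pairs of images and their one-hot labels with a
    Beta-distributed mixing coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[index]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[index]
    return mixed_images, mixed_labels
```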
(5) Large crop size
This follows the conclusion the author drew from EfficientNet: when increasing a network's depth and width, the input size should also be increased, otherwise it cannot surpass EfficientNet.
For fair comparison, we use a crop size of 224 when comparing our ResNeSt with ResNet variants, and a crop size of 256 when comparing with other approaches. (The "other approaches" here presumably means EfficientNet.)
(6) Regularization
Dropout, DropBlock, and weight decay are used.
The correct way to apply weight decay:
Finally, we also apply weight decay (i.e. L2 regularization) which additionally helps stabilize training. Prior work on large mini-batch training suggests weight decay should only be applied to the weights of convolutional and fully connected layers [19,26]. We do not subject any of the other network parameters to weight decay, including bias units, γ and β in the batch normalization layers. (That is, biases and BN parameters are excluded from weight decay.)
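In code this usually means building separate optimizer parameter groups; here is an illustrative sketch (not the authors' implementation):

```python
import torch
import torch.nn as nn

def split_parameters(model: nn.Module, weight_decay: float = 1e-4):
    """Apply weight decay only to conv/linear weights; biases and
    BatchNorm gamma/beta get no decay."""
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, (nn.Conv2d, nn.Linear)) and name == "weight":
                decay.append(param)
            else:                                   # biases and BN affine parameters
                no_decay.append(param)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.Linear(16, 10))
optimizer = torch.optim.SGD(split_parameters(model), lr=0.1, momentum=0.9)
```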
8. Ablation Study
The difference between ResNeSt-50-fast and ResNeSt-50:
In ResNeSt-fast setting, the effective average downsampling is applied prior to the 3 × 3 convolution to avoid introducing extra computational costs in the model. With the downsampling operation moved after the convolutional layer, ResNeSt-50 achieves 81.13% accuracy.
9. Comparing ResNeSt with Other Models
Note: these numbers should not be taken entirely at face value, since the training settings differ.
Note: the figure below shows that a ResNet trained with the full set of tricks and a larger input size can also reach accuracy close to EfficientNet.
10. Bonus Findings in the Appendix
Beyond the paper contributions, we empirically find several minor conclusions which may be helpful for peers:
– depth-wise convolution is not optimal for training and inference efficiency on GPU;
– model accuracy gets saturated on ImageNet with a fixed input image size;
– increasing input image size can get better accuracy and FLOPS trade-off;
– a bicubic upsampling strategy is needed for large crop sizes (≥ 320).
References
1. "ResNeSt got a strong reject from ECCV 2020 after topping the leaderboards: whose fault is it?" (Zhihu discussion)
3. "The strongest improved version of ResNet is here! ResNeSt: Split-Attention Networks"