Paper Reading: EfficientDet: Scalable and Efficient Object Detection

1、Paper Overview

This paper is the companion to Google's EfficientNet; it is recommended to read the EfficientNet paper first.
The paper makes two main contributions. First, it proposes a weighted BiFPN for better feature fusion, an idea that many later models build on. Second, based on the compound scaling method in EfficientNet, it proposes a compound scaling method suited to object detection.
Note: EfficientDet is a one-stage detector, and its baseline is RetinaNet.

The figure below shows EfficientDet sweeping the accuracy/efficiency leaderboard:

A natural question is: Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPS)? This paper aims to tackle this problem by systematically studying various design choices of detector architectures. Based on the one-stage detector paradigm, we examine the design choices for backbone, feature fusion, and class/box network, and identify two main challenges:

Challenge 1: efficient multi-scale feature fusion – Since introduced in [16], FPN has been widely used for multi-scale feature fusion. Recently, PANet [19], NAS-FPN [5], and other studies [13, 12, 34] have developed more network structures for cross-scale feature fusion. While fusing different input features, most previous works simply sum them up without distinction; however, since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. To address this issue, we propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion.

Challenge 2: model scaling – While previous works mainly rely on bigger backbone networks [17, 27, 26, 5] or larger input image sizes [8, 37] for higher accuracy, we observe that scaling up the feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. Inspired by recent works [31], we propose a compound scaling method for object detectors, which jointly scales up the resolution, depth, and width of the backbone, feature network, and box/class prediction network.

Finally, we also observe that the recently introduced EfficientNets [31] achieve better efficiency than previously commonly used backbones (e.g., ResNets [9], ResNeXt [33], and AmoebaNet [24]). Combining EfficientNet backbones with our proposed BiFPN and compound scaling, we have developed a new family of object detectors, named EfficientDet, which consistently achieves better accuracy with an order-of-magnitude fewer parameters and FLOPS than previous object detectors.

2、FPN Variants and Their Performance

Results:
Repeated PANet actually achieves quite decent accuracy as well; it is just a bit slower.

Note:

Notably, the original FPN [16] and PANet [19] only have one top-down or bottom-up flow, but for a fair comparison, here we repeat each of them 5 times (the same as BiFPN). We use the same backbone and class/box prediction network, and the same training settings for all experiments. As we can see, the conventional top-down FPN is inherently limited by the one-way information flow and thus has the lowest accuracy. While repeated PANet achieves slightly better accuracy than NAS-FPN [5], it also requires more parameters and FLOPS. Our BiFPN achieves similar accuracy to repeated PANet, but uses far fewer parameters and FLOPS. With the additional weighted feature fusion, our BiFPN further achieves the best accuracy with fewer parameters and FLOPS.

3、The Thought Process Behind BiFPN's Design

By studying the performance and efficiency of these three networks (Table 4), we observe that PANet achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation. To improve model efficiency, this paper proposes several optimizations for cross-scale connections. First, we remove those nodes that have only one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, it will contribute less to a feature network that aims at fusing different features. This leads to a simplified PANet, as shown in Figure 2(e). Second, we add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost, as shown in Figure 2(f). Third, unlike PANet [19], which has only one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion. Section 4.2 will discuss how to determine the number of layers for different resource constraints using a compound scaling method. With these optimizations, we name the new feature network the bidirectional feature pyramid network (BiFPN), as shown in Figures 2(f) and 3.
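The three optimizations can be made concrete by listing which nodes each BiFPN node fuses. Below is a small sketch that builds the connection table for one P3–P7 BiFPN layer; the node names (`P6_td`, `P5_out`, etc.) are my own illustrative labels, not identifiers from the official implementation:

```python
def bifpn_layer_connections(levels=(3, 4, 5, 6, 7)):
    """Return {output_node: [input_nodes]} for one BiFPN layer.

    Intermediate top-down nodes are suffixed "_td"; final outputs "_out".
    Resizing (up/downsampling) between levels is implied, not modeled.
    """
    lo, hi = min(levels), max(levels)
    conns = {}
    # Top-down path: each intermediate node fuses the input at its level
    # with the (upsampled) node one level above. The topmost level has no
    # "_td" node: it would have a single input edge, which is removed.
    for l in range(hi - 1, lo, -1):
        above = f"P{l+1}_in" if l + 1 == hi else f"P{l+1}_td"
        conns[f"P{l}_td"] = [f"P{l}_in", above]
    # Bottom-up path: middle-level outputs fuse the original input (the
    # extra same-level skip edge), the intermediate node, and the
    # (downsampled) output one level below.
    conns[f"P{lo}_out"] = [f"P{lo}_in", f"P{lo+1}_td"]
    for l in range(lo + 1, hi):
        conns[f"P{l}_out"] = [f"P{l}_in", f"P{l}_td", f"P{l-1}_out"]
    conns[f"P{hi}_out"] = [f"P{hi}_in", f"P{hi-1}_out"]
    return conns
```

Repeating the BiFPN then just means stacking this layer, feeding each layer's `_out` nodes in as the next layer's `_in` nodes.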

4、Weighted Feature Fusion

Learnable weight parameters w_i, representing the importance of feature maps from different levels:

Unbounded fusion: O = Σ_i w_i · I_i;    softmax-based fusion: O = Σ_i ( e^{w_i} / Σ_j e^{w_j} ) · I_i

Finally, for efficiency (softmax is costly to compute), the paper adopts the fast normalized fusion below:

Fast normalized fusion: O = Σ_i ( w_i / (ε + Σ_j w_j) ) · I_i, where each w_i ≥ 0 is ensured by applying a ReLU, and ε = 0.0001 avoids numerical instability.

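As a quick sanity check of the fast normalized fusion idea, here is a minimal NumPy sketch; the function name and shapes are my own, not the official implementation:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps: O = sum_i (w_i / (eps + sum_j w_j)) * I_i."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU keeps w_i >= 0
    w = w / (eps + w.sum())                                      # cheap normalization, no softmax
    return sum(wi * f for wi, f in zip(w, features))
```

For example, fusing two maps with weights [2.0, 1.0] gives output ≈ (2/3)·I_1 + (1/3)·I_2, so each normalized weight stays in [0, 1] just as with softmax, but using only a sum and a division.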

5、Network Architectures of the EfficientDet Family


6、Compound Scaling Settings for Object Detection

BiFPN width and depth: W_bifpn = 64 · (1.35^φ), D_bifpn = 3 + φ
Box/class network depth: D_box = D_class = 3 + ⌊φ/3⌋;    input resolution: R_input = 512 + 128·φ
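A small sketch of the detector compound scaling rules as given in the paper's Section 4.2, where D0..D7 correspond to φ = 0..7. The constants 64, 1.35, 512, and 128 are taken from the paper; note that the released configurations additionally snap the BiFPN width to hardware-friendly values (e.g., D2 uses 112 channels rather than the raw 116.64), which this sketch does not reproduce:

```python
import math

def efficientdet_scaling(phi):
    """Compute the scaled dimensions for EfficientDet-D{phi}."""
    return {
        "input_resolution": 512 + 128 * phi,             # R = 512 + 128 * phi
        "bifpn_channels": int(round(64 * 1.35 ** phi)),  # W = 64 * 1.35^phi (paper rounds further)
        "bifpn_layers": 3 + phi,                         # D_bifpn = 3 + phi
        "box_class_layers": 3 + math.floor(phi / 3),     # D_box = D_class = 3 + floor(phi/3)
    }
```

For φ = 0 this yields the D0 configuration (512 input, 64-channel BiFPN, 3 BiFPN layers, 3 box/class layers), and the whole family grows jointly along all dimensions with a single coefficient, mirroring EfficientNet's compound scaling.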

7、Training Setup Details

Note: our models are trained with batch size 128 on 32 TPUv3 chips.

We evaluate EfficientDet on the COCO 2017 detection dataset [18]. Each model is trained using the SGD optimizer with momentum 0.9 and weight decay 4e-5. The learning rate is first increased linearly from 0 to 0.08 during the initial 5% warm-up training steps and then annealed down using a cosine decay rule. Batch normalization is added after every convolution, with batch norm decay 0.997 and epsilon 1e-4. We use an exponential moving average with decay 0.9998. We also employ the commonly used focal loss [17] with α = 0.25 and γ = 1.5, and aspect ratios {1/2, 1, 2}. Our models are trained with batch size 128 on 32 TPUv3 chips. We use RetinaNet [17] preprocessing for EfficientDet-D0/D1/D3/D4, but for a fair comparison, we use the same auto-augmentation for EfficientDet-D5/D6/D7 when comparing with the prior art of AmoebaNet-based NAS-FPN detectors [37].
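The warm-up plus cosine schedule described above can be sketched as follows. The peak of 0.08 and the 5% warm-up fraction come from the text; decaying exactly to 0 at the final step is my assumption, since the excerpt does not state the cosine endpoint:

```python
import math

def learning_rate(step, total_steps, peak_lr=0.08, warmup_frac=0.05):
    """Linear warm-up from 0 to peak_lr over the first warmup_frac of training,
    then cosine decay from peak_lr down to 0 (assumed endpoint)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps           # linear ramp: 0 -> peak_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine anneal
```

With 1000 total steps the rate climbs to 0.08 by step 50 (5%) and then follows the half-cosine back toward 0.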

8、EfficientDet Performance on COCO

(Table: COCO accuracy, parameters, and FLOPS for the EfficientDet family versus prior detectors; image not reproduced here.)

9、How Much Do EfficientNet and BiFPN Each Contribute?

Starting from a RetinaNet detector [17] with a ResNet-50 [9] backbone and top-down FPN [16], we first replace the backbone with EfficientNet-B3, which improves accuracy by about 3 mAP with slightly fewer parameters and FLOPS. By further replacing FPN with our proposed BiFPN, we achieve an additional 4 mAP gain with far fewer parameters and FLOPS. These results suggest that EfficientNet backbones and BiFPN are both crucial for our final models.

