[paper] DeepLab-v3+

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Paper: https://arxiv.org/abs/1802.02611

Blog: https://research.googleblog.com/2018/03/semantic-image-segmentation-with.html

Code: https://github.com/tensorflow/models/tree/master/research/deeplab

DeepLabv3+ extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries.

We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
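As a rough sanity check on the "faster" claim, the following sketch compares the parameter counts of a standard convolution and its depthwise separable factorization (assuming a 3 × 3 kernel and 256 input/output channels, values chosen only for illustration):

```python
def conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise separable factorization: a depthwise k x k convolution
    (one spatial filter per input channel) followed by a pointwise
    1 x 1 convolution that mixes channels (bias ignored)."""
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 256, 256)             # 589,824 parameters
separable = separable_conv_params(3, 256, 256)  # 67,840 parameters, ~8.7x fewer
```

The same factorization reduces the multiply-accumulate count by roughly the same ratio, which is why applying it throughout the ASPP and decoder modules speeds up the network.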

Introduction

We consider two types of neural networks for semantic segmentation:

  1. spatial pyramid pooling module: captures rich contextual information by pooling features at different resolutions

  2. encoder-decoder structure: is able to obtain sharp object boundaries

For the spatial pyramid pooling module, it is computationally prohibitive to extract output feature maps that are only 8, or even 4, times smaller than the input resolution.
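To make this cost concrete, a small sketch (assuming a square 512 × 512 input, a size chosen purely for illustration) counts spatial positions in the encoder output at different output strides:

```python
def encoder_cells(input_size, output_stride):
    """Number of spatial positions in the encoder output for a square input."""
    side = input_size // output_stride
    return side * side

dense = encoder_cells(512, 4)    # output stride 4:  128 * 128 = 16384 positions
coarse = encoder_cells(512, 16)  # output stride 16:  32 *  32 =  1024 positions
# the denser map has 16x as many positions to compute and store per channel
```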

Encoder-decoder models lend themselves to faster computation (since no features are dilated) in the encoder path and gradually recover sharp object boundaries in the decoder path.

Attempting to combine the advantages of both methods, we propose to enrich the encoder module in the encoder-decoder network by incorporating multi-scale contextual information.

We show improvements in terms of both speed and accuracy by adapting the Xception model.

Contributions:

  1. propose a novel encoder-decoder structure which employs DeepLabv3 as a powerful encoder module.

  2. can arbitrarily control the resolution of extracted encoder features via atrous convolution to trade off precision and runtime, which is not possible with existing encoder-decoder models.

  3. adapt the Xception model for the segmentation task and apply depthwise separable convolution to both
    ASPP module and decoder module, resulting in a faster and stronger encoder-decoder network.

  4. attains a new state-of-the-art performance on the PASCAL VOC 2012 dataset.

  5. make our TensorFlow-based implementation of the proposed model publicly available.
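Contribution 2 rests on atrous (dilated) convolution. A minimal 1-D NumPy sketch ('valid' padding, correlation form, written only for illustration) shows how the rate enlarges the field of view without adding parameters or reducing resolution:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    The filter taps are applied `rate` samples apart, so the effective
    field of view grows with the rate while the weight count stays fixed."""
    k = len(w)
    span = (k - 1) * rate + 1            # effective filter extent
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * rate] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(8, dtype=float)            # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])
y1 = atrous_conv1d(x, w, rate=1)         # ordinary convolution: sums of adjacent triples
y2 = atrous_conv1d(x, w, rate=2)         # same 3 weights, but a 5-sample field of view
```

In the 2-D encoder the same mechanism lets the network keep feature maps at a chosen output stride instead of repeatedly downsampling.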

Related Work

Spatial pyramid pooling

Encoder-decoder

Depthwise separable convolution

Methods

Encoder-Decoder with Atrous Convolution

  • Atrous convolution

  • Depthwise separable convolution

  • DeepLabv3 as encoder

  • Proposed decoder

    In the work of DeepLabv3, the features are bilinearly upsampled by a factor of 16, which could be considered a naive decoder module.

    However, this naive decoder module may not successfully recover object segmentation details.

    We thus propose a simple yet effective decoder module, as illustrated in Fig. 2.

    1. The encoder features are first bilinearly upsampled by a factor of 4

    2. then concatenated with the corresponding low-level features from the network backbone that have the same spatial resolution

      apply another 1 × 1 convolution on the low-level features to reduce the number of channels

    3. apply a few 3 × 3 convolutions to refine the features

    4. another bilinear upsampling by a factor of 4

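The four decoder steps can be sketched at the shape level. This is a hedged illustration, not the released implementation: nearest-neighbour upsampling stands in for bilinear (same output shape, different interpolation), a 1 × 1 map stands in for the 3 × 3 refinement convolutions, and the 48 reduced channels are the value reported in the decoder design-choice experiments.

```python
import numpy as np

def upsample(x, factor):
    # nearest-neighbour stand-in for bilinear upsampling (matching shape)
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def pointwise_conv(x, c_out, rng):
    # 1 x 1 convolution == per-pixel linear map over channels
    w = rng.standard_normal((x.shape[-1], c_out))
    return x @ w

def decoder(encoder_feat, low_level_feat, rng):
    """Shape-level sketch of the DeepLabv3+ decoder:
    upsample x4 -> concat with channel-reduced low-level features
    -> refinement convs (stand-in: 1 x 1) -> upsample x4."""
    up = upsample(encoder_feat, 4)                  # step 1
    low = pointwise_conv(low_level_feat, 48, rng)   # reduce low-level channels
    fused = np.concatenate([up, low], axis=-1)      # step 2: same spatial size
    refined = pointwise_conv(fused, 256, rng)       # step 3 (stand-in for 3 x 3)
    return upsample(refined, 4)                     # step 4

rng = np.random.default_rng(0)
enc = rng.standard_normal((32, 32, 256))    # encoder output at output stride 16
low = rng.standard_normal((128, 128, 256))  # backbone low-level features, stride 4
out = decoder(enc, low, rng)                # (512, 512, 256)
```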


Figure 2. Our proposed DeepLabv3+ extends DeepLabv3 by employing an encoder-decoder structure. The encoder module encodes multiscale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries.

Modified Aligned Xception

The Xception model has shown promising image classification results on ImageNet with fast computation.

More recently, the MSRA team modified the Xception model (calling it Aligned Xception) and further pushed performance on the object detection task.


Figure 3. The Xception model is modified as follows: (1) more layers (same as MSRA’s modification except the changes in Entry flow), (2) all the max pooling operations are replaced by depthwise separable convolutions with striding, and (3) extra batch normalization and ReLU are added after each 3 × 3 depthwise convolution, similar to MobileNet.

A few more changes:

  1. do not modify the entry flow network structure, for fast computation and memory efficiency

  2. all max pooling operations are replaced by depthwise separable convolutions with striding, which enables applying atrous separable convolution to extract feature maps at an arbitrary resolution (another option is to extend the atrous algorithm to the max pooling operations)

  3. extra batch normalization and ReLU activation are added after each 3 × 3 depthwise convolution

Experimental Evaluation

Decoder Design Choices

We consider three places for different design choices:

  1. the 1 × 1 convolution used to reduce the channels of the low-level feature map from the encoder module

    Experiments show that a 1 × 1 convolution with 48 channels works best.

  2. the 3 × 3 convolution used to obtain sharper segmentation results

    Experiments show that two 3 × 3 convolutions with 256 channels work best.

  3. what encoder low-level features should be used

    Experiments show that using only the Conv2 features works best.

ResNet-101 as Network Backbone

  • Baseline

  • Adding decoder

  • Coarser feature maps


Table 3. Inference strategy on the PASCAL VOC 2012 val set when using ResNet-101 as feature extractor. train OS: The output stride used during training. eval OS: The output stride used during evaluation. Decoder: Employing the proposed decoder structure. MS: Multi-scale inputs during evaluation. Flip: Adding left-right flipped inputs.

Xception as Network Backbone

  • ImageNet pretraining

  • Baseline

  • Adding decoder

  • Using depthwise separable convolution

  • Pretraining on COCO

  • Pretraining on JFT

  • Test set results

  • Qualitative results

  • Failure mode


Table 5. Inference strategy on the PASCAL VOC 2012 val set when using modified Xception as feature extractor. train OS: The output stride used during training. eval OS: The output stride used during evaluation. Decoder: Employing the proposed decoder structure. MS: Multi-scale inputs during evaluation. Flip: Adding left-right flipped inputs. SC: Adopting depthwise separable convolution for both ASPP and decoder modules. COCO: Models pretrained on MS-COCO dataset. JFT: Models pretrained on JFT dataset.

Improvement along Object Boundaries

Employing the proposed decoder for both the ResNet-101 and Xception network backbones improves the performance compared to the naive bilinear upsampling.

Conclusion

Our proposed model "DeepLabv3+" employs the encoder-decoder structure where DeepLabv3 is used to encode the rich contextual information and a simple yet effective decoder module is adopted to recover the object boundaries. One could also apply the atrous convolution to extract the encoder features at an arbitrary resolution, depending on the available computation resources. We also explore the Xception model and atrous separable convolution to make the proposed model faster and stronger. Finally, our experimental results show that the proposed model sets a new state-of-the-art performance on the PASCAL VOC 2012 semantic image segmentation benchmark.