Context Encoding for Semantic Segmentation[CVPR2018]

This reads more like an application of the authors' 2017 encoding network (Deep TEN, CVPR 2017); the experiments section is the strongest part.

1.FCN framework

Good explanation

global receptive fields: conv (with nonlinearities) + downsampling

spatial resolution loss
  • encoder: dilated conv (see the sketch after this list)
    • pro: expands the receptive field without losing spatial resolution
    • con: isolates pixels from the global scene context, which can lead to misclassification
  • decoder: learned upsampling, e.g. DeepLabv3+
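A minimal sketch of the dilated-conv point above (shapes and channel counts are my own): a 3x3 conv with dilation=2 covers an effective 5x5 receptive field while keeping the spatial resolution, which downsampling would shrink.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)                                      # N x C x H x W feature map
plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # 3x3 receptive field
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # effective 5x5 receptive field

print(plain(x).shape, dilated(x).shape)  # both keep the 65 x 65 spatial resolution
```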
multi-scale objects

multi-resolution pyramid-based representations: SPP-style modules (see the sketch below)
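A rough sketch of a PSPNet-style pyramid pooling module, as one common instance of such multi-resolution representations (module and bin names are my assumptions, not this paper's code): pool the feature map to several grid sizes, project, upsample, and concatenate so every position carries multi-scale context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        # one branch per pyramid level: adaptive pooling + 1x1 projection
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        # upsample every pooled branch back to the input resolution and concat
        feats = [x] + [F.interpolate(stage(x), (h, w), mode="bilinear",
                                     align_corners=False) for stage in self.stages]
        return torch.cat(feats, dim=1)
```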
Q: Is capturing contextual information the same as increasing the receptive field size?

2.Architecture


Featuremap Attention

The core contribution of the paper: the dense feature map is passed through an encoding layer to obtain a context embedding; a fully connected layer then predicts channel-wise scores from it, which are used as weights on the feature map (a distinctive form of attention).
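A minimal sketch of this attention step with my own variable names, assuming the context embedding e has already been produced by the encoding layer: an FC + sigmoid predicts per-channel weights that gate the dense feature map.

```python
import torch
import torch.nn as nn

C = 512
fc = nn.Sequential(nn.Linear(C, C), nn.Sigmoid())

x = torch.randn(2, C, 60, 60)    # dense feature map, N x C x H x W
e = torch.randn(2, C)            # context embedding from the encoding layer
gamma = fc(e)                    # channel-wise attention weights in (0, 1)
y = x * gamma.view(2, C, 1, 1)   # scale each channel of the feature map
```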

Semantic Encoding Loss (SE-loss)

In practice this is just a multi-label classification loss; adding a classification-loss branch to a segmentation network improves the results.
eg: Learning Multi-level Region Consistency with Dense Multi-label Networks for Semantic Segmentation[IJCAI2017]
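A hedged sketch of such a multi-label "which classes are present" branch (layer names and sizes are my assumptions, not the paper's code): the context embedding predicts per-class presence logits trained with binary cross-entropy.

```python
import torch
import torch.nn as nn

num_classes = 21
se_head = nn.Linear(512, num_classes)   # predicts per-class presence logits
se_loss = nn.BCEWithLogitsLoss()

e = torch.randn(4, 512)                 # context embeddings for a batch of images
# ground-truth presence vector: 1 if the class appears anywhere in the image
target = torch.zeros(4, num_classes)
target[:, [0, 5, 12]] = 1.0
loss = se_loss(se_head(e), target)
```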

Encoding Layer

The cornerstone of this paper.

Comparison

A comparison between this method and traditional approaches: earlier pipelines used bag-of-words or Fisher vectors, with the dictionary typically obtained by clustering / GMM.

Steps

$r_{ik} = x_i - c_k$

$e_k = \sum_{i=1}^{N} \frac{\exp(-s_k \lVert r_{ik} \rVert^2)}{\sum_{j=1}^{K} \exp(-s_j \lVert r_{ij} \rVert^2)} \, r_{ik}$

e_k is the output for the k-th codeword; s is a learnable smoothing factor. The encoder ultimately outputs a fixed-length representation E = {e_1, …, e_K}, whose size depends on the number of codewords K and is independent of the number of input features N.

The embedding each codeword produces after encoding is a residual (the only difference from hard assignment is whether all pixel-wise features contribute via soft weights or only the nearest codeword is selected).
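A minimal PyTorch sketch of the encoding layer following the formulas above (naming is mine): soft-assign every descriptor to each codeword and aggregate the weighted residuals into K fixed-length outputs.

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    def __init__(self, channels, num_codes):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_codes, channels) * 0.1)  # c_k
        self.scale = nn.Parameter(torch.zeros(num_codes))                      # s_k

    def forward(self, x):                        # x: B x N x C descriptors
        r = x.unsqueeze(2) - self.codewords      # residuals r_ik: B x N x K x C
        dist = (r ** 2).sum(-1)                  # ||r_ik||^2:     B x N x K
        w = torch.softmax(-self.scale * dist, dim=2)   # soft assignment over the K codewords
        e = (w.unsqueeze(-1) * r).sum(1)         # aggregate over the N descriptors: B x K x C
        return e
```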

H×W×C =(reshape)=> N×C -(encoding)-> K×C -(fc)-> 1×C
Intuition: there are K visual centers (codewords), each composed of contributions from the C channels to different degrees; given the responses of these K visual centers, one can work backwards to an attention weight for each channel.
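A toy forward pass stitching the sketches above together to show this shape flow (it reuses the EncodingLayer class from the previous block; summing the K codeword outputs into a single C-dim context vector before the FC is my assumption for the K×C -> 1×C step).

```python
import torch
import torch.nn as nn

B, C, H, W, K = 2, 512, 60, 60, 32
x = torch.randn(B, C, H, W)

enc = EncodingLayer(C, K)                       # from the sketch above
fc = nn.Sequential(nn.Linear(C, C), nn.Sigmoid())

feats = x.view(B, C, H * W).permute(0, 2, 1)    # H x W x C  =>  N x C
e = enc(feats)                                  # N x C      ->  K x C
context = e.sum(dim=1)                          # K x C      ->  1 x C per image
gamma = fc(context)                             # channel-wise attention weights
y = x * gamma.view(B, C, 1, 1)                  # gated feature map
print(y.shape)                                  # torch.Size([2, 512, 60, 60])
```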

3. Excerpt from the paper

For a given input image, hand-engineered features are densely extracted using SIFT [38] or filter bank responses [30, 48]. Then a visual vocabulary (dictionary) is often learned and the global feature statistics are described by classic encoders such as Bag-of-Words (BoW) [8, 13, 26, 46], VLAD [25] or Fisher Vector [44]. The classic representations encode global contextual information by capturing feature statistics.