Convolutional neural networks (CNNs) are inherently limited in their ability to model geometric transformations because their building modules have fixed geometric structures.


In this work, the authors introduce two new modules to enhance the transformation modeling capability of CNNs: deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning those offsets from the target tasks, without additional supervision.


The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of their approach.



A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation.


In general, there are two ways.

The first is to build the training datasets with sufficient desired variations.

The second is to use transformation-invariant features and algorithms.




Both approaches share two drawbacks.

First, the geometric transformations are assumed fixed and known.

Second, handcrafted design of invariant features and algorithms could be difficult or infeasible for overly complex transformations, even when they are known.




The first new module is deformable convolution. It adds 2D offsets to the regular grid sampling locations of the standard convolution, which enables free-form deformation of the sampling grid. The offsets are learned from the preceding feature maps via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.


The second new module is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the earlier RoI pooling [15, 7]. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.


Deformable Convolution

The 2D convolution consists of two steps:

1) sampling using a regular grid R over the input feature map x;

2) summation of sampled values weighted by w.

The grid R defines the receptive field size and dilation. For example,

                        R = {(−1, −1),(−1, 0), . . . ,(0, 1),(1, 1)}

defines a 3 × 3 kernel with dilation 1.







For each location p0 on the output feature map y, we have

                        y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)        (1)

where pn enumerates the locations in R.
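The two steps above (sample on the grid R, then take the weighted sum) can be sketched in plain Python. This is only an illustrative sketch: the helper names `make_grid` and `conv_at` are hypothetical, positions are written as (row, column) pairs, and samples that fall outside the feature map are assumed to contribute zero.

```python
from itertools import product

def make_grid(kernel_size=3, dilation=1):
    """Regular sampling grid R as a list of integer (dy, dx) offsets."""
    r = kernel_size // 2
    steps = [d * dilation for d in range(-r, r + 1)]
    return list(product(steps, steps))

def conv_at(x, w, p0, grid):
    """Weighted sum of x sampled at p0 + p_n for every p_n in the grid.

    Assumption: samples outside the feature map contribute zero.
    """
    height, width = len(x), len(x[0])
    total = 0.0
    for (dy, dx), weight in zip(grid, w):
        yy, xx = p0[0] + dy, p0[1] + dx
        if 0 <= yy < height and 0 <= xx < width:
            total += weight * x[yy][xx]
    return total

R = make_grid(3, 1)
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [1.0] * 9                      # one weight per grid location
print(R[0], R[-1], len(R))         # (-1, -1) (1, 1) 9
print(conv_at(x, w, (1, 1), R))    # 45.0
```

Passing `dilation=2` to `make_grid` spreads the same nine locations over a 5 × 5 window, which is how R encodes both receptive field size and dilation.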


In deformable convolution, the regular grid R is augmented with offsets {∆pn|n = 1, ..., N}, where N = |R|. Eq. (1) becomes

                        y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)        (2)

Now, the sampling is on the irregular and offset locations pn + ∆pn. As the offset ∆pn is typically fractional, Eq. (2) is implemented via bilinear interpolation as

                        x(p) = \sum_{q} G(q, p) \cdot x(q)        (3)


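In Eq. (3), p denotes an arbitrary (fractional) location (p = p0 + pn + ∆pn for Eq. (2)), q enumerates all integral spatial locations in the input feature map x, and G(·, ·) is the bilinear interpolation kernel. Note that G is two-dimensional. It is separated into two one-dimensional kernels as

                        G(q, p) = g(q_x, p_x) \cdot g(q_y, p_y)        (4)

where g(a, b) = max(0, 1 − |a − b|).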
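The interpolation can be made concrete with a short Python sketch built on the one-dimensional kernel g(a, b) = max(0, 1 − |a − b|). The function names `bilinear` and `deform_conv_at` are hypothetical, positions are (row, column) pairs, the offsets are passed in by hand rather than predicted by a convolutional layer, and out-of-map neighbours are assumed to contribute zero.

```python
import math

def g(a, b):
    """One-dimensional bilinear kernel g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear(x, p):
    """Value of x at a fractional location p = (py, px).

    G(q, p) is non-zero only for the (up to) four integer neighbours of p,
    so only those q are visited. Assumption: q outside the map adds zero.
    """
    py, px = p
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < len(x) and 0 <= qx < len(x[0]):
                value += g(qy, py) * g(qx, px) * x[qy][qx]
    return value

def deform_conv_at(x, w, p0, grid, offsets):
    """Weighted sum of x sampled bilinearly at p0 + p_n + dp_n."""
    total = 0.0
    for (dy, dx), (oy, ox), weight in zip(grid, offsets, w):
        total += weight * bilinear(x, (p0[0] + dy + oy, p0[1] + dx + ox))
    return total

grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
x = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
w = [1.0] * 9
zero = [(0.0, 0.0)] * 9            # zero offsets: plain convolution
print(bilinear(x, (0.5, 0.5)))     # 2.0, the mean of 0, 1, 3, 4
print(deform_conv_at(x, w, (1, 1), grid, zero))  # 36.0
```

With all offsets set to zero the result equals the ordinary 3 × 3 weighted sum, which is a quick sanity check that the deformable form generalizes the regular one.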

RoI Pooling. Given the input feature map x and a RoI of size w × h with top-left corner p0, RoI pooling divides the RoI into k × k bins (k is a free parameter) and outputs a k × k feature map y. For the (i, j)-th bin (0 ≤ i, j < k), we have

                        y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p) / n_{ij}        (5)

where nij is the number of pixels in the bin. The (i, j)-th bin spans \lfloor i \frac{w}{k} \rfloor \le p_x < \lceil (i + 1) \frac{w}{k} \rceil and \lfloor j \frac{h}{k} \rfloor \le p_y < \lceil (j + 1) \frac{h}{k} \rceil.

As in Eq. (2), deformable RoI pooling adds offsets {∆pij |0 ≤ i, j < k} to the spatial binning positions. Eq. (5) becomes

                        y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p + \Delta p_{ij}) / n_{ij}        (6)

Typically, ∆pij is fractional. Eq. (6) is implemented by bilinear interpolation via Eq. (3) and (4).
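A minimal Python sketch of the pooling in Eq. (5) and Eq. (6) follows. It rests on several assumptions: `deform_roi_pool` and `bilinear` are hypothetical names, the feature map is indexed as x[row][column], p0 and the offsets are given as (x, y) pairs, the offsets are supplied by hand instead of being learned, and samples outside the map contribute zero.

```python
import math

def bilinear(x, py, px):
    """Bilinear sample of x at fractional (row, column) = (py, px)."""
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < len(x) and 0 <= qx < len(x[0]):
                wy = max(0.0, 1.0 - abs(qy - py))
                wx = max(0.0, 1.0 - abs(qx - px))
                value += wy * wx * x[qy][qx]
    return value

def deform_roi_pool(x, p0, w, h, k, offsets=None):
    """Average-pool a w x h RoI with top-left corner p0 = (x0, y0) into k x k bins.

    offsets[i][j] = (dx, dy) shifts every sample of bin (i, j);
    offsets=None gives plain RoI pooling. i runs along the width.
    """
    x0, y0 = p0
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dx, dy = offsets[i][j] if offsets else (0.0, 0.0)
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            n_ij = len(xs) * len(ys)   # number of pixels in the bin
            total = sum(bilinear(x, y0 + py + dy, x0 + px + dx)
                        for px in xs for py in ys)
            out[i][j] = total / n_ij
    return out

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
print(deform_roi_pool(fmap, (0, 0), 4, 4, 2))
# [[2.5, 10.5], [4.5, 12.5]]
shift = [[(0.5, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
print(deform_roi_pool(fmap, (0, 0), 4, 4, 2, shift)[0][0])  # 3.0
```

Shifting bin (0, 0) half a pixel to the right moves its average from 2.5 to 3.0 while the other bins are unchanged, showing that each bin is displaced independently.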


The offsets are obtained as follows. First, RoI pooling (Eq. (5)) generates the pooled feature maps. From these maps, an fc layer generates the normalized offsets \Delta \hat{p}_{ij}, which are then transformed into the offsets ∆pij in Eq. (6) by element-wise product with the RoI's width and height, as \Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w, h). Here γ is a pre-defined scalar that modulates the magnitude of the offsets. It is empirically set to γ = 0.1. The offset normalization is necessary to make the offset learning invariant to RoI size.
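The scaling step can be illustrated with a few lines of Python. This is a sketch under assumptions: `scale_offsets` is a hypothetical helper, the normalized offsets are given directly as (x, y) pairs rather than produced by an fc layer, and only the element-wise product with (w, h) and the factor γ are shown.

```python
def scale_offsets(normalized, w, h, gamma=0.1):
    """Turn normalized per-bin offsets into pixel offsets.

    Each normalized offset (nx, ny) becomes (gamma * nx * w, gamma * ny * h).
    """
    return [[(gamma * nx * w, gamma * ny * h) for (nx, ny) in row]
            for row in normalized]

normalized = [[(1.0, -0.5)]]       # one bin, hand-picked values
small = scale_offsets(normalized, w=20, h=10)
large = scale_offsets(normalized, w=40, h=20)
print(small)
print(large)
```

The same normalized offset yields a pixel shift twice as large on the RoI that is twice as big, so the displacement stays constant relative to the RoI. That is the sense in which normalization makes the learned offsets independent of RoI size.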


Position-Sensitive (PS) RoI Pooling
