Transposed convolution & Fully Convolutional Neural Network

Given a kernel (e.g. a 3×3 filter), we can construct the sparse Toeplitz matrix C whose non-zero elements are the kernel weights.

  • We can either say this kernel defines a direct convolution whose forward and backward passes are computed by multiplying with C and Cᵀ respectively.
  • We can also say this kernel defines a transposed convolution whose forward and backward passes are computed by multiplying with Cᵀ and (Cᵀ)ᵀ = C respectively.
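
As a concrete illustration, here is a minimal NumPy sketch (the kernel values, the 4×4 input size, stride 1 and no padding are assumptions): the matrix C maps a flattened 4×4 input to a flattened 2×2 output, so the direct convolution is y = C·x and the corresponding transposed convolution multiplies by Cᵀ.

```python
import numpy as np

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # any 3x3 weights
in_h = in_w = 4
out_h = out_w = in_h - 3 + 1                            # = 2 (stride 1, no padding)

# Build C (dense here for clarity); each row holds the 9 kernel weights
# at the input positions that one output element looks at, zeros elsewhere.
C = np.zeros((out_h * out_w, in_h * in_w))              # shape (4, 16)
for i in range(out_h):
    for j in range(out_w):
        row = i * out_w + j
        for ki in range(3):
            for kj in range(3):
                C[row, (i + ki) * in_w + (j + kj)] = kernel[ki, kj]

x = np.random.randn(in_h * in_w)
y = C @ x            # direct convolution: forward pass multiplies by C
z = C.T @ y          # transposed convolution: forward pass multiplies by C^T
print(C.shape, y.shape, z.shape)   # (4, 16) (4,) (16,)
```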

Direct Conv & Transposed Conv

  • It’s always possible to emulate a transposed convolution with a direct conv. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation.
  • Interpretation:
    • The simplest way to think about a transposed convolution on a given input is to imagine that input as the result of a direct convolution applied to some initial feature map.
    • The transposed convolution can then be seen as the operation that recovers the shape of this initial feature map.
    • Note that we only recover the shape, not the exact values of the input. In any case, the transposed convolution is not the inverse of the convolution!!!
  • To preserve the connectivity pattern between the direct conv and the transposed conv, the direct conv used to emulate the transposed conv may require a specific zero padding of its input (see the sketch after this list).
  • The connectivity consistency matters!!!
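
A hedged PyTorch check of this emulation (single channel, stride 1, no padding; the sizes are illustrative): a transposed convolution over a 2×2 input gives the same result as a direct convolution over that input zero-padded by kernel_size − 1 on each side, using the kernel flipped by 180°.

```python
import torch
import torch.nn.functional as F

w = torch.randn(1, 1, 3, 3)          # single-channel 3x3 kernel
y = torch.randn(1, 1, 2, 2)          # small feature map to be upsampled

out_transposed = F.conv_transpose2d(y, w)                      # shape (1, 1, 4, 4)

y_padded = F.pad(y, (2, 2, 2, 2))                              # add k-1 zeros per side
out_emulated = F.conv2d(y_padded, torch.flip(w, dims=[2, 3]))  # direct conv, flipped kernel

print(torch.allclose(out_transposed, out_emulated, atol=1e-6)) # True
```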

Fully Convolutional Net

  • only contains locally connected layers (like conv, pooling, upsampling); no dense layers are used in an FCN
    • reduces the number of parameters and computation time
    • the network can work regardless of the original image size, since all connections are local (see the sketch after this list)
  • A segmentation net contains two paths:
    • downsampling path: capture semantic/contextual information
    • upsampling path: recover spatial information (precise localization)
    • to further recover spatial information, we use skip connections
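
A minimal sketch of the "any input size" point (the architecture below is illustrative, not a specific published model): a network made only of convolutions, pooling and upsampling accepts inputs of any spatial size, because every layer is locally connected.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                   # downsample by 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 5, kernel_size=1),                   # 1x1 conv: per-pixel class scores
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
)

for size in [(64, 64), (100, 150)]:                    # different input sizes, same weights
    x = torch.randn(1, 3, *size)
    print(fcn(x).shape)                                # spatial dims follow the input
```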

Skip connection:

  • concatenating or summing feature maps from the downsampling path with feature maps from the upsampling path
  • Merging features from different resolution levels helps combine contextual information with spatial information (see the sketch below).
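
A hedged sketch of the two merge styles mentioned above (the tensor shapes are hypothetical): summing requires matching channel counts, while concatenation stacks the feature maps along the channel dimension; the coarser map is first upsampled to the skip resolution.

```python
import torch
import torch.nn.functional as F

down_feat = torch.randn(1, 64, 56, 56)        # feature map saved on the downsampling path
up_feat = torch.randn(1, 64, 28, 28)          # coarser feature map on the upsampling path

up_feat = F.interpolate(up_feat, size=down_feat.shape[2:],
                        mode='bilinear', align_corners=False)  # back to skip resolution

merged_sum = down_feat + up_feat                          # summation (FCN style)
merged_cat = torch.cat([down_feat, up_feat], dim=1)       # concatenation (U-Net style)
print(merged_sum.shape, merged_cat.shape)                 # (1,64,56,56) (1,128,56,56)
```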

Difference between FCN-32s, 16s, 8s

  1. FCN-32 : Directly produces the segmentation map from conv7, by using a transposed convolution layer with stride 32.
  2. FCN-16 : Sums the 2x upsampled prediction from conv7 (using a transposed convolution with stride 2) with pool4 and then produces the segmentation map, by using a transposed convolution layer with stride 16 on top of that.
  3. FCN-8 : Sums the 2x upsampled conv7 (with a stride 2 transposed convolution) with pool4, upsamples the result with a stride 2 transposed convolution, sums it with pool3, and applies a transposed convolution layer with stride 8 to the resulting feature maps to obtain the segmentation map (see the sketch after this list).
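
A minimal sketch of the FCN-8s head described above. The channel sizes, number of classes, and the kernel/padding choices of the transposed convolutions are assumptions; pool3, pool4 and conv7 stand for backbone feature maps at strides 8, 16 and 32, each reduced to class scores by a 1x1 convolution.

```python
import torch
import torch.nn as nn

num_classes = 21                                        # hypothetical number of classes
score_pool3 = nn.Conv2d(256, num_classes, 1)            # 1x1 conv on pool3
score_pool4 = nn.Conv2d(512, num_classes, 1)            # 1x1 conv on pool4
score_conv7 = nn.Conv2d(4096, num_classes, 1)           # 1x1 conv on conv7
up2_a = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1)
up2_b = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1)
up8   = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=16, stride=8, padding=4)

# Stand-in backbone outputs for a 256x256 image (strides 8, 16, 32).
pool3 = torch.randn(1, 256, 32, 32)
pool4 = torch.randn(1, 512, 16, 16)
conv7 = torch.randn(1, 4096, 8, 8)

fuse16 = up2_a(score_conv7(conv7)) + score_pool4(pool4)  # stride-16 fusion
fuse8  = up2_b(fuse16) + score_pool3(pool3)              # stride-8 fusion
out    = up8(fuse8)                                      # final stride-8 upsampling
print(out.shape)                                         # (1, 21, 256, 256)
```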

Unpooling

  • Introduce switch variables that record the locations of the maximum elements (when using max pooling), and then use them to unpool the feature map (see the sketch below).
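
A hedged PyTorch sketch of unpooling with recorded switch locations (input size is illustrative): max pooling returns the indices of its maxima, and MaxUnpool2d places the pooled values back at those positions, filling everything else with zeros.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # keep the "switch" locations
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)             # pooled: (1,1,2,2); switches: where each max came from
restored = unpool(pooled, switches)    # (1,1,4,4), sparse: maxima restored, rest is zero
print(restored)
```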
