Paper Highlights: Going Deeper with Convolutions

Increase the depth and width of the network while keeping the computational budget constant.


One piece of encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.

For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but could be put to real-world use, even on large datasets, at a reasonable cost.

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme.
The inspiration comes from Network In Network.

Related Work

Convolutional neural networks (CNNs) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max pooling) are followed by one or more fully-connected layers.

For larger datasets such as ImageNet, the recent trend has been to increase the number of layers and layer size, while using dropout to address the problem of overfitting.

Network-in-Network: When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers, typically followed by the rectified linear activation.

1 × 1 convolutions serve a dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.
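
To make the dimension-reduction role concrete, here is a minimal sketch in plain Python of a 1×1 convolution followed by ReLU. It is an illustration only, not the paper's implementation: the function name conv1x1, the nested-list data layout, the toy sizes, and the omission of bias terms are all assumptions made for this example.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU (hypothetical helper).

    feature_map: H rows, each a list of W pixels, each pixel a list of C_in floats.
    weights: C_out filters, each a list of C_in floats (bias omitted for brevity).
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            # Each output channel is a dot product over the channels of ONE pixel,
            # so the spatial size is unchanged and only the channel count changes.
            out_pixel = [max(0.0, sum(w * x for w, x in zip(filt, pixel)))
                         for filt in weights]
            out_row.append(out_pixel)
        out.append(out_row)
    return out


# Toy input: a 2x2 spatial grid with 4 channels per pixel.
fmap = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
        [[-1.0, -1.0, -1.0, -1.0], [2.0, 0.0, 2.0, 0.0]]]
# Two filters, so the channel dimension is reduced from 4 to 2.
w = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
reduced = conv1x1(fmap, w)
print(reduced)
```

The output keeps the 2×2 spatial layout but has only 2 channels per pixel, which is why a cheap 1×1 layer placed before an expensive 3×3 or 5×5 convolution lowers the cost of that later layer.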

Network In Network

We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator.

With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers.

Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size.
Two major drawbacks:
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting.
Another drawback of uniformly increased network size is the dramatically increased use of computational resources.

The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Solution: sparse connectivity.

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.

Clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. Solution: cluster the sparse matrices into relatively dense submatrices.

Architectural Details

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

This means, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer.

However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions.



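Suppose the input feature map is 28×28×192, and the 1×1, 3×3 and 5×5 convolutions have 64, 128 and 32 output channels respectively. With the structure in the left figure, the number of convolution kernel parameters is 1×1×192×64 + 3×3×192×128 + 5×5×192×32 = 387,072. The structure in the right figure inserts 1×1 convolution layers with 96 and 16 channels before the 3×3 and 5×5 convolutions respectively, so the kernel parameter count becomes 1×1×192×64 + (1×1×192×96 + 3×3×96×128) + (1×1×192×16 + 5×5×16×32) = 157,184, which is about 40% of the original.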
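
The arithmetic above can be checked with a short script. This is a sketch for verification only: the helper conv_params is a hypothetical name, bias terms are ignored, and only the three convolution branches mentioned in the text are counted.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


c_in = 192  # channels of the 28x28x192 input feature map

# Left-figure structure: each branch convolves the 192-channel input directly.
naive = (conv_params(1, c_in, 64)
         + conv_params(3, c_in, 128)
         + conv_params(5, c_in, 32))

# Right-figure structure: 1x1 reductions to 96 and 16 channels
# precede the 3x3 and 5x5 convolutions.
reduced = (conv_params(1, c_in, 64)
           + conv_params(1, c_in, 96) + conv_params(3, 96, 128)
           + conv_params(1, c_in, 16) + conv_params(5, 16, 32))

ratio = reduced / naive
print(naive, reduced, round(ratio, 3))  # 387072 157184 0.406
```

The two totals are 387,072 and 157,184, a ratio of roughly 0.41. Most of the saving comes from the 3×3 and 5×5 branches, whose cost scales with the number of input channels they see.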


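In a traditional CNN, after the convolutional layers extract features, the resulting feature maps are fed into a fully connected network followed by a softmax logistic regression layer to perform classification. Because fully connected layers tend to cause overfitting, dropout was later introduced and mitigates this problem well. In the NIN paper, the authors propose a new strategy: replacing the fully connected layers with global average pooling (GAP).
The idea of GAP is that, after all the mlpconv layers, each feature map output by the last mlpconv layer is averaged over its spatial positions, so every output feature map yields a single average value. These averages are then fed into the softmax as the confidence scores of the classes. One consequence is that the global average of each feature map corresponds one-to-one with a final class: the last mlpconv layer produces exactly as many classes as it outputs feature maps, so the number of feature maps output by the last mlpconv layer must be set equal to the total number of classes.
[Figure: the global average pooling layer]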
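
As a rough illustration of the GAP idea, the sketch below averages each feature map to one score and passes the scores through a softmax. It uses only the Python standard library; the function names, the toy 2×2 feature maps, and the choice of three classes are assumptions made for this example and are not taken from the NIN paper.

```python
import math


def global_average_pool(feature_maps):
    """Reduce K feature maps (each H rows of W floats) to K averages."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three 2x2 feature maps from a hypothetical last mlpconv layer,
# one map per class (so this toy problem has 3 classes).
maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [4.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
scores = global_average_pool(maps)     # one score per feature map / class
probs = softmax(scores)                # class probabilities
predicted = probs.index(max(probs))    # index of the most likely class
print(scores, predicted)
```

Note that global_average_pool has no trainable parameters, which is consistent with the claim that GAP is less prone to overfitting than a fully connected classifier.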
