CV: Translation and Commentary on the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapter 4


Overview: A survey of the recent architectures of deep convolutional neural networks

 

Original authors

Asifullah Khan1, 2*, Anabia Sohail1, 2, Umme Zahoora1, and Aqsa Saeed Qureshi1
1 Pattern Recognition Lab, DCIS, PIEAS, Nilore, Islamabad 45650, Pakistan
2 Deep Learning Lab, Center for Mathematical Sciences, PIEAS, Nilore, Islamabad 45650, Pakistan
[email protected]

Updating…

Related articles
CV: Translation and Commentary on the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapters 1–3
CV: Translation and Commentary on the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapter 4
CV: Translation and Commentary on the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapters 5–8

 

Table of Contents

4 Architectural Innovations in CNN

4.1 Spatial Exploitation based CNNs

4.1.1 LeNet

4.1.2 AlexNet

4.1.3 ZefNet

4.1.4 VGG

4.1.5 GoogleNet

4.2 Depth based CNNs

4.2.1 Highway Networks

4.2.2 ResNet

4.2.3 Inception-V3, V4 and Inception-ResNet

4.2.4 ResNext

4.3 Multi-Path based CNNs

4.3.1 Highway Networks

4.3.2 ResNet

4.3.3 DenseNets

4.4 Width based Multi-Connection CNNs

4.4.1 WideResNet

4.4.2 Pyramidal Net

4.4.3 Xception

4.4.4 Inception Family

4.5 Feature Map (ChannelFMap) Exploitation based CNNs

4.5.1 Squeeze and Excitation Network

4.5.2 Competitive Squeeze and Excitation Networks

4.6 Channel (Input) Exploitation based CNNs

4.6.1 Channel Boosted CNN using TL

4.7 Attention based CNNs 

4.7.1 Residual Attention Neural Network

4.7.2 Convolutional Block Attention Module

4.7.3 Concurrent Spatial and Channel Excitation Mechanism


 

 

 

4 Architectural Innovations in CNN

       Different improvements in CNN architecture have been made from 1989 to date. These improvements can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that the main thrust in CNN performance improvement came from the restructuring of processing units and the design of new blocks. Most of the innovations in CNN architectures have been made in relation to depth and spatial exploitation. Depending upon the type of architectural modification, CNNs can be broadly categorized into seven different classes, namely: spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention based CNNs. The taxonomy of deep CNN architectures presented in Fig. 4 shows these seven classes, while their summary is presented in Table 1.

       


                                                            Fig. 4: Taxonomy of deep CNN architectures.

           Table 1: Performance comparison of the recent architectures of different categories. Top-5 error rate is reported for all architectures.

 

4.1 Spatial Exploitation based CNNs

       CNNs have a large number of parameters and hyperparameters, such as the weights, biases, number of processing units (neurons), number of layers, filter size, stride, learning rate, activation function, etc. [119], [120]. As the convolutional operation considers the neighborhood (locality) of input pixels, different levels of correlation can be explored by using different filter sizes. Consequently, in the early 2000s, researchers exploited spatial filters to improve performance; various filter sizes were explored to evaluate their impact on the learning of the network. Different filter sizes encapsulate different levels of granularity; usually, small filters extract fine-grained and large filters extract coarse-grained information. In this way, by adjusting the filter size, CNNs can perform well on both coarse- and fine-grained details.

       

 

 

4.1.1 LeNet

       LeNet was proposed by LeCun in 1998 [65]. It is famous due to its historical importance, as it was the first CNN to show state-of-the-art performance on handwritten digit recognition tasks. It has the ability to classify digits without being affected by small distortions, rotation, and variation of position and scale. LeNet is a feed-forward NN that consists of five alternating convolutional and pooling layers, followed by two fully connected layers. In the early 2000s, GPUs were not commonly used to speed up training, and even CPUs were slow [121]. The main limitation of a traditional multilayer fully connected NN was that it considers each pixel as a separate input and applies a transformation on it, which was a huge computational burden, specifically at that time [122]. LeNet exploited the underlying basis of images, namely that neighboring pixels are correlated with each other and that features are distributed across the entire image. Therefore, convolution with learnable parameters is an effective way to extract similar features at multiple locations with few parameters. This changed the conventional view of training, in which each pixel was considered as a separate input feature, independent of its neighborhood, and the correlation among pixels was ignored. LeNet was the first CNN architecture that not only reduced the number of parameters and computation but was also able to automatically learn features.
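To make the alternating convolution-pooling design concrete, here is a minimal LeNet-style sketch in PyTorch; the exact layer widths, the tanh activations, and the 10-class head are illustrative assumptions rather than the precise LeNet-5 configuration:

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """Minimal LeNet-style CNN: alternating conv/pool stages plus a fully connected head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28
            nn.AvgPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14 -> 10x10
            nn.AvgPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of four 32x32 grayscale digit images.
logits = LeNetStyle()(torch.randn(4, 1, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```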

       

 

 

4.1.2 AlexNet

      Although LeNet [65] began the history of deep CNNs, at that time CNNs were limited to handwritten digit recognition tasks and did not scale well to all classes of images. AlexNet [21] is considered the first deep CNN architecture that showed groundbreaking results for image classification and recognition tasks. AlexNet was proposed by Krizhevsky et al., who enhanced the learning capacity of the CNN by making it deeper and by applying a number of parameter optimization strategies [21]. The basic architectural design of AlexNet is shown in Fig. 5. In the early 2000s, hardware limitations curtailed the learning capacity of deep CNN architectures by restricting them to a small size. In order to benefit from the representational capacity of CNNs, AlexNet was trained in parallel on two NVIDIA GTX 580 GPUs to overcome the shortcomings of the hardware. In AlexNet, the feature extraction stages were extended from 5 (LeNet) to 7 to make the CNN applicable to diverse categories of images. Although depth generally improves generalization for different resolutions of images, the main drawback associated with an increase in depth is overfitting. To address this challenge, Krizhevsky et al. (2012) exploited the idea of Hinton [56], [123], whereby their algorithm randomly skips some transformational units during training to force the model to learn features that are more robust. In addition, ReLU was employed as a non-saturating activation function to improve the convergence rate by alleviating the problem of vanishing gradients to some extent [53], [124]. Overlapping subsampling and local response normalization were also applied to improve generalization by reducing overfitting. Another adjustment was the use of large filters (11x11 and 5x5) at the initial layers, compared to previously proposed networks. Due to its efficient learning approach, AlexNet has significant importance in the new generation of CNNs and started a new era of research in the architectural advancement of CNNs.

       


                                                                  Fig. 5: Basic layout of AlexNet architecture.
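The following is a condensed, hedged PyTorch sketch of an AlexNet-style network reflecting the design choices discussed above: large 11x11 and 5x5 early filters, ReLU non-linearities, local response normalization after the first two stages, overlapping 3x3/stride-2 max pooling, and dropout in the classifier. The channel widths follow the commonly used single-GPU variant and are indicative only:

```python
import torch
import torch.nn as nn

alexnet_style = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # overlapping pooling
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                                     # randomly skips units during training
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)

print(alexnet_style(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```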

4.1.3 ZefNet

       Before 2013, the learning mechanism of CNNs was largely based on trial and error, without knowing the exact reason behind an improvement. This lack of understanding limited the performance of deep CNNs on complex images. In 2013, Zeiler and Fergus proposed an interesting multilayer Deconvolutional NN (DeconvNet), which became famous as ZefNet [28]. ZefNet was developed to quantitatively visualize network performance. The idea of visualizing network activity was to monitor CNN performance by interpreting neuron activations. In one of the previous studies, Erhan et al. (2009) exploited the same idea and optimized the performance of Deep Belief Networks (DBNs) by visualizing hidden layers' features [125]. In the same manner, Le et al. (2011) evaluated the performance of a deep unsupervised autoencoder (AE) by visualizing the image classes generated by the output neurons [126]. DeconvNet works in the same manner as a forward-pass CNN, but reverses the order of the convolutional and pooling operations. This reverse mapping projects the output of the convolutional layer back to visually perceptible image patterns and consequently gives a neuron-level interpretation of the internal feature representation learned at each layer [127], [128]. The objective of ZefNet was to monitor the learning scheme during training and thus use the findings for diagnosing potential problems associated with the model. This idea was experimentally validated on AlexNet using DeconvNet, which showed that only a few neurons were active, while other neurons were dead (inactive), in the first and second layers of the network. Moreover, it showed that the features extracted by the second layer exhibited aliasing artifacts. Based on these findings, Zeiler and Fergus adjusted the CNN topology and performed parameter optimization. Zeiler and Fergus maximized the learning of the CNN by reducing both the filter size and the stride to retain the maximum number of features in the first two convolutional layers. This readjustment in CNN topology resulted in performance improvement, which suggested that feature visualization can be used for the identification of design shortcomings and for timely adjustment of parameters.

       

 

4.1.4 VGG

       With the successful use of CNNs for image recognition, Simonyan et al. proposed a simple and effective design principle for CNN architectures. Their architecture, named VGG, was modular in its layer pattern [29]. VGG was made 19 layers deep, compared to AlexNet and ZefNet, to study the relationship of depth with the representational capacity of the network [21], [28]. ZefNet, a frontline network of the 2013-ILSVRC competition, suggested that small filters can improve the performance of CNNs. Based on these findings, VGG replaced the 11x11 and 5x5 filters with a stack of 3x3 filter layers and experimentally demonstrated that the stacked placement of 3x3 filters can induce the effect of a large filter (an effective receptive field equivalent to that of larger filters such as 5x5 and 7x7). The use of small filters provides the additional benefit of low computational complexity by reducing the number of parameters. These findings set a new trend in research to work with smaller filters in CNNs. VGG also regulates the complexity of the network by placing 1x1 convolutions in between the convolutional layers, which, in addition, learn a linear combination of the resultant feature maps. For the tuning of the network, max pooling is placed after the convolutional layers, while padding is performed to maintain the spatial resolution [46]. VGG showed good results both for image classification and localization problems. Although VGG was not at the top place of the 2014-ILSVRC competition, it gained fame due to its simplicity, homogeneous topology, and increased depth. The main limitation associated with VGG was its high computational cost: even with the use of small filters, VGG suffered from a high computational burden due to the use of about 140 million parameters.
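As a quick illustration of this point (the channel width C = 64 is an arbitrary choice), the PyTorch snippet below compares the parameter count of a single 5x5 convolution with that of two stacked 3x3 convolutions covering the same effective receptive field; the stacked version is cheaper and inserts an extra non-linearity:

```python
import torch.nn as nn

C = 64  # arbitrary channel width for illustration

# One 5x5 convolution vs. a stack of two 3x3 convolutions
# (same 5x5 effective receptive field, as exploited by VGG).
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# 5x5:      C*C*25 + C   = 102,464 weights + biases for C = 64
# two 3x3:  2*(C*C*9 + C) = 73,856
print(n_params(single_5x5), n_params(stacked_3x3))
```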

       

 

4.1.5 GoogleNet

      GoogleNet was the winner of the 2014-ILSVRC competition and is also known as Inception-V1. The main objective of the GoogleNet architecture was to achieve high accuracy with a reduced computational cost [99]. It introduced the new concept of the inception block in CNNs, which incorporates multi-scale convolutional transformations using a split, transform, and merge idea. The architecture of the inception block is shown in Fig. 6. This block encapsulates filters of different sizes (1x1, 3x3, and 5x5) to capture spatial information at different scales (both at fine- and coarse-grained level). In GoogleNet, conventional convolutional layers are replaced with small blocks, similar to the idea of substituting each layer with a micro NN, as proposed in the Network in Network (NIN) architecture [57]. The exploitation of the split, transform, and merge idea by GoogleNet helped in addressing the problem of learning the diverse types of variations present in the same category of different images. In addition to improving learning capacity, GoogleNet's focus was to make the CNN parameter-efficient. GoogleNet regulates the computation by adding a bottleneck layer with a 1x1 convolutional filter before employing the large-size kernels. It used sparse connections (not all output feature maps are connected to all input feature maps) to overcome the problem of redundant information, and reduced cost by omitting feature maps (channels) that were not relevant. Furthermore, connection density was reduced by using global average pooling at the last layer, instead of a fully connected layer. These parameter tunings caused a significant decrease in the number of parameters, from 40 million to 5 million. Other regulatory factors applied were batch normalization and the use of RMSprop as an optimizer [129]. GoogleNet also introduced the concept of auxiliary learners to speed up the convergence rate. However, the main drawback of GoogleNet was its heterogeneous topology, which needs to be customized from module to module. Another limitation of GoogleNet was a representation bottleneck that drastically reduces the feature space in the next layer and thus sometimes may lead to the loss of useful information.

       


                                                                Fig. 6: Basic architecture of inception block
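As a sketch of the split-transform-merge idea, the block below follows the Inception-V1 pattern with 1x1 bottlenecks before the 3x3 and 5x5 branches; the branch widths passed in the example are illustrative assumptions, not a prescribed configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-V1-style split-transform-merge block with 1x1 bottlenecks
    placed before the expensive 3x3 and 5x5 convolutions."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),  # bottleneck
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),  # bottleneck
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Merge: concatenate the multi-scale branches along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionBlock(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```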

 

 

4.2 Depth based CNNs

      Deep CNN architectures are based on the assumption that, with an increase in depth, the network can better approximate the target function with a number of nonlinear mappings and improved feature representations [130]. Network depth has played an important role in the success of supervised training. Theoretical studies have shown that deep networks can represent certain classes of functions more efficiently than shallow architectures [131]. Csáji presented the universal approximation theorem in 2001, which states that a single hidden layer is sufficient to approximate any function, but this comes at the cost of exponentially many neurons, often making it computationally unfeasible [132]. In this regard, Bengio and Delalleau [133] suggested that deeper networks have the potential to maintain the expressive power of the network at a reduced cost [134]. In 2013, Bengio et al. empirically showed that deep networks are computationally more efficient for complex tasks [84], [135]. Inception and VGG, which showed the best performance in the 2014-ILSVRC competition, further strengthened the idea that depth is an essential dimension in regulating the learning capacity of networks [29], [33], [99], [100].

       

 

4.2.1 Highway Networks

      Based on the intuition that the learning capacity can be improved by increasing the network depth, Srivastava et al. proposed a deep CNN in 2015, named Highway Network [101]. The main problem concerning deep Nets is slow training and convergence speed [136]. Highway Network exploited depth for learning an enriched feature representation by introducing a new cross-layer connectivity (discussed in Section 4.3.1); therefore, Highway Networks are also categorized under multi-path based CNN architectures. A Highway Network with 50 layers showed a better convergence rate than thin but deep architectures on the ImageNet dataset [94], [95]. Srivastava et al. experimentally showed that the performance of a plain Net decreases after adding hidden units beyond 10 layers [137]. Highway Networks, on the other hand, were shown to converge significantly faster than plain ones, even with a depth of 900 layers.

       

 

4.2.2 ResNet

     ResNet was proposed by He et al. and is considered as a continuation of deep Nets [31]. ResNet revolutionized the CNN architectural race by introducing the concept of residual learning in CNNs and devised an efficient methodology for the training of deep Nets. Similar to Highway Networks, it is also placed under the multi-path based CNNs, thus its learning methodology is discussed in Section 4.3.2. ResNet proposed a 152-layer deep CNN, which won the 2015-ILSVRC competition. The architecture of the residual block of ResNet is shown in Fig. 7. ResNet, which was 20 and 8 times deeper than AlexNet and VGG respectively, showed less computational complexity than previously proposed Nets [21], [29]. He et al. empirically showed that ResNet with 50/101/152 layers has a lower error on the image classification task than a 34-layer plain Net. Moreover, ResNet gained a 28% improvement on the famous image recognition benchmark dataset named COCO [138]. The good performance of ResNet on image recognition and localization tasks showed that depth is of central importance for many visual recognition tasks.

                                                                 Fig. 7: Residual block of ResNet.

       

 

4.2.3 Inception-V3, V4 and Inception-ResNet

    Inception-V3, V4 and Inception-ResNet are improved versions of Inception-V1 and V2 [33], [99], [100]. The idea of Inception-V3 was to reduce the computational cost of deeper Nets without affecting generalization. For this purpose, Szegedy et al. replaced large-size filters (5x5 and 7x7) with small and asymmetric filters (1x7 and 1x5) and used 1x1 convolution as a bottleneck prior to the large filters [100]. This makes the traditional convolution operation more like cross-channel correlation. In one of the previous works, Lin et al. exploited the potential of 1x1 filters in the NIN architecture [57]. Szegedy et al. [100] used the same concept in an intelligent way. In Inception-V3, the 1x1 convolutional operation was used to map the input data into 3 or 4 separate spaces that are smaller than the original input space, and all correlations in these smaller 3D spaces were then mapped via regular 3x3 or 5x5 convolutions. In Inception-ResNet, Szegedy et al. combined the power of residual learning and the inception block [31], [33]. In doing so, filter concatenation was replaced by the residual connection. Moreover, Szegedy et al. experimentally showed that Inception-V4 with residual connections (Inception-ResNet) has the same generalization power as plain Inception-V4, but with increased depth and width. However, they observed that Inception-ResNet converges more quickly than Inception-V4, which clearly depicts that training with residual connections accelerates the training of Inception networks significantly.
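To illustrate the asymmetric-factorization idea (not the exact Inception-V3 module layout), the snippet below compares a full 7x7 convolution with a 1x7 convolution followed by a 7x1 convolution, which covers the same receptive field with far fewer parameters; the channel counts are arbitrary assumptions:

```python
import torch.nn as nn

in_ch, out_ch = 192, 192

# A full 7x7 convolution ...
conv_7x7 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)

# ... factorized into an asymmetric 1x7 followed by a 7x1 convolution,
# covering the same 7x7 receptive field with far fewer parameters.
conv_1x7_7x1 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(conv_7x7), n_params(conv_1x7_7x1))  # ~1.81M vs ~0.52M
```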

       

 

4.2.4 ResNext

     ResNext, also known as Aggregated Residual Transform Network, is an improvement over the Inception Network [115]. Xie et al. exploited the concept of split, transform, and merge in a powerful but simple way by introducing a new term, cardinality [99]. Cardinality is an additional dimension, which refers to the size of the set of transformations [139], [140]. The Inception network not only improved the learning capability of conventional CNNs but also made the network resource-effective. However, due to the use of diverse spatial embeddings (such as 3x3, 5x5 and 1x1 filters) in the transformation branches, each layer needs to be customized separately. ResNext, in fact, derives its characteristic features from Inception, VGG, and ResNet [29], [31], [99]. ResNext utilized the deep homogeneous topology of VGG and simplified the GoogleNet architecture by fixing the spatial resolution to 3x3 filters within the split, transform, and merge block. It also uses residual learning. The building block of ResNext is shown in Fig. 8. ResNext used multiple transformations within a split, transform, and merge block and defined these transformations in terms of cardinality. Xie et al. (2017) showed that an increase in cardinality significantly improves performance. The complexity of ResNext was regulated by applying low-dimensional embeddings (1x1 filters) before the 3x3 convolution, whereas training was optimized by using skip connections [141].

       


                                           Fig. 8: ResNext building block.
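A ResNext-style bottleneck can be sketched with a grouped 3x3 convolution, where the number of groups plays the role of cardinality; the 256-channel width and the 32x4d setting below follow a commonly cited configuration and are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNext-style bottleneck: 1x1 reduce -> grouped 3x3 (cardinality) -> 1x1 expand,
    with an identity skip connection."""
    def __init__(self, channels=256, cardinality=32, group_width=4):
        super().__init__()
        mid = cardinality * group_width  # 128 for the 32x4d setting
        self.transform = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),     # split/transform via groups
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.transform(x) + x)  # merge + residual learning

print(ResNeXtBlock()(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```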

 

 

4.3 Multi-Path based CNNs

    Training of deep networks is a challenging task, and this has been the subject of much of the recent research on deep Nets. Deep CNNs generally perform well on complex tasks. However, deeper networks may suffer from performance degradation and gradient vanishing or explosion problems, which are not caused by overfitting but instead by an increase in depth [53], [142]. The vanishing gradient problem results not only in higher test error but also in higher training error [142]–[144]. For training deeper Nets, the concept of multi-path or cross-layer connectivity was proposed [101], [107], [108], [113]. Multiple paths or shortcut connections can systematically connect one layer to another by skipping some intermediate layers, to allow a specialized flow of information across the layers [145], [146]. Cross-layer connectivity partitions the network into several blocks. These paths also try to solve the vanishing gradient problem by making the gradient accessible to lower layers. For this purpose, different types of shortcut connections are used, such as zero-padded, projection-based, dropout, skip, and 1x1 connections.

       

4.3.1 Highway Networks

    The increase in the depth of a network improves performance mostly for complex problems, but it also makes training of the network difficult. In deep Nets, due to a large number of layers, the backpropagation of error may result in small gradient values at lower layers. To solve this problem, Srivastava et al. [101] in 2015 proposed a new CNN architecture named Highway Network, based on the idea of cross-layer connectivity. In Highway Network, the unimpeded flow of information across layers is enabled by imparting two gating units within a layer (equation (5)). The idea of a gating mechanism was inspired from Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs) [147], [148]. The aggregation of information by combining the l-th layer and the previous l − k layers' information creates a regularizing effect, making gradient-based training of very deep networks easy. This enables training of a network with more than 100 layers, even as deep as 900 layers, with the Stochastic Gradient Descent (SGD) algorithm. Cross-layer connectivity for Highway Network is defined in equations (5) and (6):
 

y = H_l(x_i, W_{H_l}) \cdot T_g(x_i, W_{T_g}) + x_i \cdot C_g(x_i, W_{C_g})        (5)

C_g(x_i, W_{C_g}) = 1 - T_g(x_i, W_{T_g})        (6)

In equation (5), T_g refers to the transformation gate, which expresses the amount of the produced output, whereas C_g is a carry gate. In a network, H_l(x_i, W_{H_l}) represents the working of the hidden layers, whereas C_g = 1 - T_g(x_i, W_{T_g}) behaves as a switch in a layer, which decides the path for the flow of information.
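A minimal sketch of a fully connected highway layer implementing equations (5) and (6) is given below; the choice of nn.Linear transformations, the ReLU inside H, and the negative bias initialization of the transform gate are illustrative assumptions rather than the exact published configuration:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Minimal highway layer: y = H(x) * T(x) + x * (1 - T(x)),
    i.e. equations (5) and (6) with the carry gate C = 1 - T
    (input and output dimensions assumed equal)."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # hidden transformation H_l(x, W_H)
        self.T = nn.Linear(dim, dim)   # transform gate T_g(x, W_T)
        # Bias the gate towards carrying the input, helpful when stacking many layers.
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        h = torch.relu(self.H(x))
        t = torch.sigmoid(self.T(x))
        return h * t + x * (1.0 - t)

layer = HighwayLayer(64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```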

 


4.3.2 ResNet

     To address the problem faced during the training of deeper Nets, He et al. proposed ResNet in 2015 [31], in which they exploited the idea of bypass pathways used in Highway Networks. The mathematical formulation of ResNet is expressed in equations (7) and (8):

g(x_i) = f(x_i) + x_i        (7)

f(x_i) = g(x_i) - x_i        (8)

where f(x_i) is the transformed signal and x_i is the original input. The original input x_i is added to f(x_i) through bypass pathways; in essence, f(x_i) = g(x_i) - x_i performs residual learning. ResNet introduced shortcut connections within layers to enable cross-layer connectivity, but these gates are data-independent and parameter-free in comparison to Highway Networks. In Highway Networks, when a gated shortcut is closed, the layers represent non-residual functions. However, in ResNet, residual information is always passed and identity shortcuts are never closed. Residual links (shortcut connections) speed up the convergence of deep networks, thus giving ResNet the ability to avoid gradient diminishing problems. ResNet, with a depth of 152 layers (20 and 8 times deeper than AlexNet and VGG, respectively), won the 2015-ILSVRC championship [21]. Even with increased depth, ResNet exhibited lower computational complexity than VGG [29].
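A basic residual block implementing g(x_i) = f(x_i) + x_i with a parameter-free identity shortcut might look as follows; the two 3x3 conv-BN stages for f are an illustrative choice:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: g(x) = f(x) + x (equation (7)),
    with a parameter-free identity shortcut; f is two 3x3 conv-BN stages."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # residual signal added to the identity path

print(ResidualBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```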

 

 

4.3.3 DenseNets

     In continuation of Highway Networks and ResNet, DenseNet was proposed to solve the vanishing gradient problem [31], [101], [107]. The problem with ResNet was that it explicitly preserves information through additive identity transformations, due to which many layers may contribute very little or no information. To address this problem, DenseNet used cross-layer connectivity, but in a modified fashion. DenseNet connects each layer to every other layer in a feed-forward fashion, thus the feature maps of all preceding layers are used as inputs to all subsequent layers. This establishes l(l+1)/2 direct connections in DenseNet, as compared to l connections between a layer and its preceding layer in traditional CNNs, and imprints the effect of cross-layer depth-wise convolutions. As DenseNet concatenates the previous layers' features instead of adding them, the network may gain the ability to explicitly differentiate between information that is added to the network and information that is preserved. DenseNet has a narrow layer structure; however, it becomes parametrically expensive with an increase in the number of feature maps. The direct access of each layer to the gradients through the loss function improves the flow of information throughout the network. This incorporates a regularizing effect, which reduces overfitting on tasks with smaller training sets.
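The concatenation-based connectivity can be sketched as follows; the growth rate, number of layers, and BN-ReLU-conv ordering are illustrative assumptions rather than a specific published DenseNet configuration:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: each layer receives the concatenation of all preceding
    feature maps and contributes `growth_rate` new maps (narrow layer structure)."""
    def __init__(self, in_ch, growth_rate=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            channels_in = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels_in), nn.ReLU(inplace=True),
                nn.Conv2d(channels_in, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate (rather than add) everything produced so far.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=16, growth_rate=12, n_layers=4)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 64, 32, 32]) = 16 + 4*12
```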

 

 

4.4 Width based Multi-Connection CNNs

     During 2012-2015, the focus was largely on exploiting the power of depth, along with the effectiveness of multi-path regulatory connections, in network regularization [31], [101]. However, Kawaguchi et al. reported that the width of the network is also important [149]. The multilayer perceptron gained the advantage of mapping complex functions over the perceptron by making parallel use of multiple processing units within a layer. This suggests that width is an important parameter in defining principles of learning, along with depth. Lu et al. (2017), and Hanin and Sellke (2017), have recently shown that NNs with the ReLU activation function have to be wide enough in order to hold the universal approximation property as depth increases [150]. Moreover, a class of continuous functions on a compact set cannot be arbitrarily well approximated by an arbitrarily deep network if the maximum width of the network is not larger than the input dimension [135], [151]. Although stacking multiple layers (increasing depth) may learn diverse feature representations, it may not necessarily increase the learning power of the NN. One major problem linked with deep architectures is that some layers or processing units may not learn useful features. To tackle this problem, the focus of research shifted from deep and narrow architectures towards shallow and wide architectures.

 

4.4.1 WideResNet

      A concern with deep residual networks is the feature reuse problem, in which some feature transformations or blocks may contribute very little to learning [152]. This problem was addressed by WideResNet [34]. Zagoruyko and Komodakis suggested that the main learning potential of deep residual networks is due to the residual units, whereas depth has a supplementary effect. WideResNet exploited the power of the residual blocks by making ResNet wide rather than deep [31]. WideResNet increased the width by introducing an additional factor k, which controls the width of the network. WideResNet showed that widening the layers may provide a more effective way of improving performance than making residual networks deeper. Although deep residual networks improved representational capacity, they have some demerits, such as time-intensive training, inactivation of many feature maps (the feature reuse problem), and gradient vanishing and exploding problems. He et al. addressed the feature reuse problem by incorporating dropout in residual blocks to regularize the network in an effective way [31]. Similarly, Huang et al. introduced the concept of stochastic depth by exploiting dropout to solve the vanishing gradient and slow learning problems [105]. It was observed that even a fractional improvement in performance may require the addition of many new layers. An empirical study showed that WideResNet had twice the number of parameters compared to ResNet, but can be trained in a better way than the deeper networks [34]. The wider residual network was based on the observation that almost all architectures before residual networks, including the most successful Inception and VGG, were wider as compared to ResNet. In WideResNet, learning is made effective by adding dropout in between the convolutional layers rather than inside a residual block.
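A hedged sketch of a wide residual block, with the width set by a widening factor k and dropout placed between the two convolutions, is shown below; the pre-activation ordering and the particular widths are assumptions for illustration:

```python
import torch
import torch.nn as nn

class WideResidualBlock(nn.Module):
    """Wide residual block: channel width scaled by a widening factor k,
    with dropout placed between the two 3x3 convolutions (pre-activation order)."""
    def __init__(self, channels, dropout=0.3):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Dropout(p=dropout),                     # dropout in between the convolutions
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.f(x) + x

base_width, k = 16, 4                                  # widening factor k broadens every block
block = WideResidualBlock(channels=base_width * k)
print(block(torch.randn(1, 64, 32, 32)).shape)         # torch.Size([1, 64, 32, 32])
```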

 

4.4.2 Pyramidal Net

      In earlier deep CNN architectures such as AlexNet, VGG, and ResNet, due to the deep stacking of multiple convolutional layers, the depth of the feature maps increases in subsequent layers. However, the spatial dimension decreases, as each convolutional layer is followed by a sub-sampling layer [21], [29], [31]. Therefore, Han et al. argued that in deep CNNs the enriched feature representation is compensated by a decrease in feature-map size [35]. The drastic increase in feature-map depth and, at the same time, the loss of spatial information limit the learning ability of CNNs. ResNet has shown remarkable results for the image classification problem. However, in ResNet, the deletion of a residual block in which the dimensions of both the spatial and feature-map (channel) axes vary (feature-map depth increases, while the spatial dimension decreases) generally deteriorates performance. In this regard, stochastic ResNet improved performance by reducing the information loss associated with dropping residual units [105]. To increase the learning ability of ResNet, Han et al. proposed Pyramidal Net [35]. In contrast to the drastic decrease in spatial width with an increase in depth in ResNet, Pyramidal Net increases the width gradually per residual unit. This strategy enables Pyramidal Net to cover all possible locations instead of maintaining the same spatial dimension within each residual block until down-sampling occurs. Because of the gradual increase in the depth of the feature maps in a top-down fashion, it was named Pyramidal Net. In Pyramidal Net, the depth of the feature maps is regulated by a step factor λ and is computed using equation (9):

D_l = D_{l-1} + \lambda / n        (9)

     where D_l denotes the dimension (feature-map depth) of the l-th residual unit, n is the total number of residual units, λ is a step factor, and λ/n regulates the increase in depth. The depth-regulating factor thus distributes the burden of the increase in feature maps across the units. Residual connections were inserted between the layers by using zero-padded identity mapping. The advantage of zero-padded identity mapping is that it needs a smaller number of parameters as compared to the projection-based shortcut connection, and hence may result in better generalization [153]. Pyramidal Net uses two different approaches for widening the network: addition-based and multiplication-based widening. The difference between the two types of widening is that the additive pyramidal structure increases channel width linearly, whereas the multiplicative one increases it geometrically [50], [54]. However, a major problem with Pyramidal Net is that, with the increase in width, a quadratic increase in both space and time occurs.
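The additive widening rule of equation (9) can be made concrete with a few lines of Python; the initial depth, step factor λ, and number of units below are illustrative values, and actual channel counts are rounded to integers:

```python
# Additive widening schedule of Pyramidal Net (equation (9)): the feature-map
# depth grows by a constant step lambda/n per residual unit.
def pyramidal_widths(initial_depth=16, lam=48, n=18):
    widths, depth = [], float(initial_depth)
    for _ in range(n):
        depth += lam / n                 # D_l = D_{l-1} + lambda / n
        widths.append(int(round(depth)))
    return widths

print(pyramidal_widths())  # gradually increases from ~19 up to 64 across the units
```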

 

 

4.4.3 Xception

     Xception can be considered as an extreme Inception architecture, which exploits the idea of depthwise separable convolution [21], [114]. Xception modified the original inception block by making it wider and replacing the different spatial dimensions (1x1, 5x5, 3x3) with a single dimension (3x3), followed by a 1x1 convolution to regulate computational complexity. The architecture of the Xception block is shown in Fig. 9. Xception makes the network computationally efficient by decoupling spatial and feature-map (channel) correlation. It works by first mapping the convolved output to low-dimensional embeddings using 1x1 convolution and then spatially transforming it k times, where k is a width-defining cardinality that determines the number of transformations. Xception makes computation easy by separately convolving each feature map across the spatial axes, followed by pointwise convolution (1x1 convolutions) to perform cross-channel correlation. In Xception, the 1x1 convolution is used to regulate the feature-map depth. In conventional CNN architectures, the conventional convolutional operation uses only one transformation segment, the inception block uses three transformation segments, whereas in Xception the number of transformation segments is equal to the number of feature maps. Although the transformation strategy adopted by Xception does not reduce the number of parameters, it makes learning more efficient and results in improved performance.


                                                                                 Fig. 9: Xception building block.
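The depthwise separable convolution underlying Xception can be sketched as a per-channel (grouped) 3x3 convolution followed by a 1x1 pointwise convolution; the channel sizes used in the example are arbitrary:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Xception-style separable convolution: each feature map is convolved spatially
    on its own (depthwise, groups = in_ch), then a 1x1 pointwise convolution models
    cross-channel correlations and sets the output feature-map depth."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

sep = DepthwiseSeparableConv(128, 256)
print(sep(torch.randn(1, 128, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```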

 

 

4.4.4 Inception Family

    The Inception family of CNNs also comes under the class of width-based methods [33], [99], [100]. In Inception networks, varying sizes of filters were used within a layer, which increased the output of the intermediate layers. The use of different filter sizes is helpful in capturing the diversity in high-level features. The salient characteristics of the Inception family are discussed in Sections 4.1.5 and 4.2.3.

 

4.5 Feature Map (ChannelFMap) Exploitation based CNNs

     CNNs became popular for MV tasks because of their hierarchical learning and automatic feature extraction ability [12]. The selection of features plays an important role in determining the performance of classification, segmentation, and detection modules. Conventional feature extraction techniques are generally static and limit the performance of the classification module because of the limited types of features [154]. In a CNN, features are dynamically selected by tuning the weights associated with a kernel (mask). Also, multiple stages of feature extraction are used, which can extract diverse types of features (known as feature maps or channels in CNNs). However, some of the feature maps play little or no role in object discrimination [116]. Enormous feature sets may create an effect of noise and thus lead to over-fitting of the network. This suggests that, apart from network engineering, the selection of feature maps can play an important role in improving the generalization of the network. In this section, the terms feature maps and channels are used interchangeably, as many researchers use the word channels for feature maps.


                                                                              Fig. 10: Squeeze and Excitation block.

 

 

4.5.1 Squeeze and Excitation Network

       The Squeeze-and-Excitation Network (SE-Network) was reported by Hu et al. [116]. They proposed a new block for the selection of feature maps (commonly known as channels) relevant to object discrimination. This new block was named the SE-block (shown in Fig. 10), which suppresses the less important feature maps but gives high weightage to the class-specifying feature maps. SE-Network reported a record decrease in error on the ImageNet dataset. The SE-block is a processing unit that is designed in a generic way and can therefore be added to any CNN architecture before the convolution layer. The working of this block consists of two operations: squeeze and excitation. The convolution kernel captures information locally, but it ignores the contextual relation (correlation) of features outside of its receptive field. To obtain a global view of the feature maps, the squeeze block generates feature-map-wise statistics by suppressing the spatial information of the convolved input. As global average pooling has the potential to learn the extent of the target object effectively, it is employed by the squeeze operation to generate feature-map-wise statistics using the following equation [57], [155]:

 

M_D(x_c) = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} x_c(i, j)        (10)

where M_D is the feature-map descriptor and m × n is the spatial dimension of the input feature map x_c. The output of the squeeze operation, M_D, is passed to the excitation operation, which models motif-wise interdependencies by exploiting a gating mechanism. The excitation operation assigns weights to the feature maps using a two-layer feed-forward NN, which is mathematically expressed in equation (11):

V_M = \sigma( w_2 \, \delta( w_1 M_D ) )        (11)

In equation (11), V_M denotes the weightage for each feature map, where δ and σ refer to the ReLU and sigmoid function, respectively. In the excitation operation, w_1 and w_2 are used as regulating factors to limit the model complexity and aid generalization [50], [51]. The output of the squeeze block is transformed by w_1 and passed through the ReLU activation function, which adds non-linearity. The gating mechanism is exploited in the SE-block using the sigmoid activation function, which models interdependencies among the feature maps and assigns a weight based on feature-map relevance [156]. The SE-block is simple and adaptively recalibrates each layer's feature maps by multiplying the convolved input with the motif responses.
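Equations (10) and (11) translate almost directly into a small module: global average pooling for the squeeze, a two-layer gating network with ReLU and sigmoid for the excitation, and per-feature-map rescaling of the input. The reduction ratio of 16 is the commonly used setting and is an assumption here:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: global average pooling (squeeze, eq. (10)),
    a two-layer gating network with ReLU and sigmoid (excitation, eq. (11)),
    then per-feature-map rescaling of the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # M_D: one statistic per feature map
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),  # w_1, ReLU
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),           # w_2, sigmoid
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                                  # recalibrate the feature maps

se = SEBlock(channels=64)
print(se(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```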

 

 

4.5.2 Competitive Squeeze and Excitation Networks

     Competitive Inner-Imaging Squeeze and Excitation for Residual Networks, also known as the CMPE-SE Network, was proposed by Hu et al. in 2018 [118]. Hu et al. used the idea of the SE-block to improve the learning of deep residual networks [116]. SE-Network recalibrates the feature maps based upon their contribution to class discrimination. However, the main concern with SE-Net is that, in ResNet, it only considers the residual information for determining the weight of each channel [116]. This minimizes the impact of the SE-block and makes the ResNet information redundant. Hu et al. addressed this problem by generating feature-map-wise statistics from both residual and identity-mapping based features. In this regard, a global representation of the feature maps is generated using the global average pooling operation, whereas the relevance of the feature maps is estimated by setting up a competition between the residual and identity-mapping based descriptors. This phenomenon is termed inner imaging [118]. The CMPE-SE block not only models the relationship between residual feature maps but also maps their relation with the identity feature maps and makes a competition between residual and identity feature maps. The mathematical expression for the CMPE-SE block is given by equation (12) of the original paper, where x_id is the identity mapping of the input, F_se represents the squeeze operation applied to the residual feature map u_r and the identity feature map x_id, and F_res denotes the implementation of the SE-block on the residual feature maps. The output of the squeeze operation is multiplied with the SE-block output F_res. The backpropagation algorithm thus tries to optimize the competition between identity and residual feature maps and the relationship between all feature maps in the residual block.

 

 

4.6 Channel (Input) Exploitation based CNNs

Image representation plays an important role in determining the performance of image-processing algorithms, including both conventional and deep learning algorithms. A good representation of the image is one that can define the salient features of an image from a compact code. In the literature, various types of conventional filters are applied to extract different levels of information from a single type of image [157], [158]. These diverse representations are then used as inputs to the model to improve performance [159], [160]. CNN, on the other hand, is an effective feature learner that can automatically extract discriminating features depending upon the problem [161]. However, the learning of a CNN relies on the input representation. The lack of diversity and the absence of class-discernible information in the input may affect CNN performance as a discriminator. For this purpose, the concept of channel boosting (boosting the input channel dimension) using auxiliary learners was introduced in CNNs to boost the representation of the network [36].

4.6.1 Channel Boosted CNN using TL

    In 2018, Khan et al. proposed a new CNN architecture named Channel Boosted CNN (CB-CNN), based on the idea of boosting the number of input channels to improve the representational capacity of the network [36]. The block diagram of CB-CNN is shown in Fig. 11. Channel boosting is performed by artificially creating extra channels (known as auxiliary channels) through deep generative models and then exploiting them through deep discriminative models. It introduces the concept that TL can be used at both the generation and discrimination stages. Data representation plays an important role in determining the performance of a classifier, as different representations may present different aspects of information [84]. To improve the representational potential of the data, Khan et al. exploited the power of TL and deep generative learners [24], [162], [163]. Generative learners attempt to characterize the data-generating distribution during the learning phase. In CB-CNN, autoencoders are used as generative learners to learn the explanatory factors of variation behind the data. The concept of inductive TL is used in a novel way to build a boosted input representation by augmenting the learned distribution of the input data with the original channel space (input channels). CB-CNN encodes the channel-boosting phase into a generic block, which is inserted at the start of a deep Net. For training, Khan et al. used a pre-trained network to reduce the computational cost. The significance of the study is that multiple deep learners are used, where generative learning models act as auxiliary learners that enhance the representational capacity of the deep CNN based discriminator. Although the potential of channel boosting was only evaluated by inserting the boosting block at the start, Khan et al. suggested that this idea can be extended by providing auxiliary channels at any layer in the deep architecture. CB-CNN has also been evaluated on a medical image dataset, where it shows improved results compared to previously proposed approaches. The convergence plot of CB-CNN on the mitosis dataset is shown in Fig. 12.


                                                                        Fig. 11: Basic architecture of CB-CNN.

 


            Fig. 12: Convergence plot of CB-CNN on the mitosis dataset. Loss and accuracy are shown on the y-axis, whereas the x-axis represents epochs. The training plot of CB-CNN shows that the model converges after about 14 epochs.
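Below is a hypothetical, minimal sketch of the channel-boosting idea described above: a small convolutional autoencoder generates auxiliary channels that are concatenated with the original input channels before being passed to a discriminative CNN. The layer sizes and the single-stage encoder-decoder are illustrative assumptions, not the published CB-CNN design:

```python
import torch
import torch.nn as nn

class ChannelBoostingBlock(nn.Module):
    """Hypothetical minimal channel-boosting block: a small convolutional
    autoencoder produces auxiliary channels, which are concatenated with the
    original input channels before the discriminative (possibly pre-trained) CNN."""
    def __init__(self, in_ch=3, aux_ch=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Conv2d(16, aux_ch, 3, padding=1)

    def forward(self, x):
        aux = self.decoder(self.encoder(x))   # auxiliary (generated) channels
        return torch.cat([x, aux], dim=1)     # boosted input: original + auxiliary channels

boost = ChannelBoostingBlock(in_ch=3, aux_ch=3)
boosted = boost(torch.randn(1, 3, 64, 64))
print(boosted.shape)  # torch.Size([1, 6, 64, 64]) -> fed to the deep CNN discriminator
```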

 

 

4.7 Attention based CNNs 

       Different levels of abstraction have an important role in defining the discrimination power of a NN. In addition to learning different levels of abstraction, focusing on features relevant to the context also plays a significant role in image localization and recognition. In the human visual system, this phenomenon is referred to as attention. Humans view a scene in a succession of partial glimpses and pay attention to context-relevant parts. This process not only serves to focus on the selected region but also deduces different interpretations of objects at that location, and thus helps in capturing the visual structure in a better way. A more or less similar kind of interpretability is added into RNNs and LSTMs [147], [148]. RNN and LSTM networks exploit attention modules for the generation of sequential data, and the new samples are weighted based on their occurrence in previous iterations. The concept of attention was incorporated into CNNs by various researchers to improve representation and overcome computational limits. This idea of attention also helps in making CNNs intelligent enough to recognize objects even in cluttered backgrounds and complex scenes.

 

4.7.1 Residual Attention Neural Network

Wang et al. proposed a Residual Attention Network (RAN) to improve the feature representation of the network [38]. The motivation behind incorporating attention in CNNs was to make the network capable of learning object-aware features. RAN is a feed-forward CNN, which was built by stacking residual blocks with attention modules. The attention module is branched off into trunk and mask branches, which adopt a bottom-up top-down learning strategy. The assembly of two different learning strategies into the attention module enables fast feed-forward processing and top-down attention feedback in a single feed-forward process. The bottom-up feed-forward structure produces low-resolution feature maps with strong semantic information, whereas the top-down architecture produces dense features in order to make an inference for each pixel. In previously proposed studies, a top-down bottom-up learning strategy was used by Restricted Boltzmann Machines [164]. Similarly, Goh et al. exploited the top-down attention mechanism as a regularizing factor in a Deep Boltzmann Machine (DBM) during the reconstruction phase of training. The top-down learning strategy globally optimizes the network in such a way that it gradually maps the output back to the input during the learning process [82], [164], [165]. The attention module in RAN generates an object-aware soft mask S_{i,FM}(x_c) at each layer [166]. The soft mask S_{i,FM}(x_c) assigns attention towards the object using equation (13) by recalibrating the trunk branch output T_{i,FM}(x_c), and thus behaves like a control gate for every neuron output:

 

H_{i,FM}(x_c) = S_{i,FM}(x_c) \times T_{i,FM}(x_c)        (13)

In one of the previous studies, the Transformation network [167], [168] also exploited the idea of attention in a simple way by incorporating it into the convolution block, but the main problem was that the attention modules in the Transformation network are fixed and cannot adapt to changing circumstances. RAN was made efficient towards recognition of cluttered, complex, and noisy images by stacking multiple attention modules. The hierarchical organization of RAN endowed it with the ability to adaptively assign a weight to each feature map based on its relevance in the layers [38]. The learning of the deep hierarchical structure was supported through residual units. Moreover, three different levels of attention (mixed, channel, and spatial attention) were incorporated, thus leveraging the capability to capture object-aware features at different levels [38].

 

 

4.7.2 Convolutional Block Attention Module

The significance of the attention mechanism and feature-map exploitation is validated through RAN and SE-Network [38], [111]. In this regard, Woo et al. came up with a new attention-based CNN, named the Convolutional Block Attention Module (CBAM) [37]. CBAM is simple in design and similar to SE-Network. SE-Network only considers the contribution of feature maps to image classification, but it ignores the spatial locality of the object in images. The spatial location of the object has an important role in object detection. CBAM infers the attention maps sequentially, by first applying feature-map (channel) attention and then spatial attention, to find the refined feature maps. In the literature, generally, 1x1 convolution and pooling operations are used for spatial attention. Woo et al. showed that pooling of features along the spatial axis generates an efficient feature descriptor. CBAM concatenates the average pooling operation with max pooling, which generates a strong spatial attention map. Likewise, feature-map statistics were modeled using a combination of max pooling and global average pooling operations. Woo et al. showed that max pooling can provide a clue about distinctive object features, whereas the use of global average pooling alone returns a suboptimal inference of feature-map attention. The exploitation of both average pooling and max pooling improves the representational power of the network. These refined feature maps not only focus on the important parts but also increase the representational power of the selected feature maps. Woo et al. empirically showed that formulating the 3D attention map via a serial learning process helps in reducing the number of parameters as well as the computational cost. Due to the simplicity of CBAM, it can be integrated easily with any CNN architecture.
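A hedged sketch of CBAM-style sequential channel-then-spatial attention is given below; the shared MLP with reduction ratio 16 and the 7x7 spatial kernel are the commonly reported settings and are assumed here for illustration:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM-style attention: channel attention from average- and max-pooled statistics
    passed through a shared MLP, followed by spatial attention from channel-wise
    average/max maps and a large-kernel convolution."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                 padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: avg-pooled and max-pooled descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: pool along the channel axis, concatenate, then convolve.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x

print(CBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```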

 

 

4.7.3 Concurrent Spatial and Channel Excitation Mechanism

     In 2018, Roy et al. extended the work of Hu et al. by incorporating the effect of spatial information in combination with feature-map (channel) information to make it applicable to segmentation tasks [111], [112]. They introduced three different modules: (i) squeezing spatially and exciting feature-map wise (cSE), (ii) squeezing feature-map wise and exciting spatially (sSE), and (iii) concurrent spatial and channel squeeze & excitation (scSE). In this work, an autoencoder based convolutional NN was used for segmentation, and the proposed modules were inserted after the encoder and decoder layers. In the cSE module, the same concept as that of the SE-block is exploited: the scaling factor is derived based on the combination of feature maps relevant to object detection. As spatial information has an important role in segmentation, in the sSE module spatial locality is given more importance than feature-map information. For this purpose, different combinations of feature maps are selected and exploited spatially for segmentation. In the last module, scSE, the attention for each channel is assigned by deriving the scaling factor from both spatial and channel information, thus selectively highlighting the object-specific feature maps [112].
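The three modules can be sketched compactly as follows; combining the cSE and sSE branches by element-wise addition is one of the combination strategies explored in the original work, and the reduction ratio is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation (scSE): a channel-excitation
    branch (cSE, as in SE-Net) and a spatial-excitation branch (sSE, a 1x1 convolution
    producing a per-pixel gate), combined here by element-wise addition."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1), nn.Sigmoid(),
        )
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

print(SCSEBlock(64)(torch.randn(2, 64, 48, 48)).shape)  # torch.Size([2, 64, 48, 48])
```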