CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks"


Overview: A survey of the recent architectures of deep convolutional neural networks

 

Original authors

Asifullah Khan1, 2*, Anabia Sohail1, 2, Umme Zahoora1, and Aqsa Saeed Qureshi1
1 Pattern Recognition Lab, DCIS, PIEAS, Nilore, Islamabad 45650, Pakistan
2 Deep Learning Lab, Center for Mathematical Sciences, PIEAS, Nilore, Islamabad 45650, Pakistan
[email protected]

 

Abstract

        Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNN is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. Availability of a large amount of data and improvements in the hardware processing units have accelerated the research in CNNs, and recently very interesting deep CNN architectures are reported. The recent race in developing deep CNNs shows that the innovative architectural ideas, as well as parameter optimization, can improve CNN performance. In this regard, different ideas in the CNN design have been explored such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of the processing units. However, the major improvement in representational capacity of the deep CNN is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is receiving substantial attention. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention. Additionally, this survey also covers the elementary understanding of CNN components and sheds light on its current challenges and applications.

       (Translation) Deep Convolutional Neural Networks (CNNs) are a special type of neural network that has shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNNs is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. The availability of large amounts of data and improvements in hardware processing units have accelerated research on CNNs, and very interesting deep CNN architectures have recently been reported. The recent race to develop deep CNNs shows that innovative architectural ideas, as well as parameter optimization, can improve CNN performance. To this end, different ideas have been explored in CNN design, such as the use of different activation and loss functions, parameter optimization, regularization, and the restructuring of processing units. However, the major improvement in the representational capacity of deep CNNs has been achieved by restructuring the processing units. In particular, the idea of using a block rather than a layer as the structural unit is receiving substantial attention. This survey therefore focuses on the intrinsic taxonomy present in recently reported deep CNN architectures and accordingly classifies the recent innovations in CNN architectures into seven categories, based respectively on spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention. In addition, this survey also covers the elementary understanding of CNN components and sheds light on current challenges and applications.

Keywords: Deep Learning, Convolutional Neural Networks, Architecture, Representational Capacity, Residual Learning, and Channel Boosted CNN.

(Translation) Keywords: deep learning, convolutional neural networks, architecture, representational capacity, residual learning, channel-boosted CNN

 

1 Introduction

        Machine Learning (ML) algorithms belong to a specialized area in Artificial Intelligence (AI), which endows intelligence to computers by learning the underlying relationships among the data and making decisions without being explicitly programmed. Different ML algorithms have been developed since the late 1990s, for the emulation of human sensory responses such as speech and vision, but they have generally failed to achieve human-level satisfaction [1]–[6]. The challenging nature of Machine Vision (MV) tasks gives rise to a specialized class of Neural Networks (NN), known as Convolutional Neural Network (CNN) [7].

   (Translation) Machine Learning (ML) algorithms belong to a specialized area of Artificial Intelligence (AI) that endows computers with intelligence by learning the underlying relationships among data and making decisions without being explicitly programmed. Since the late 1990s, a variety of ML algorithms have been developed to emulate human sensory responses such as speech and vision, but they have generally failed to reach human-level satisfaction [1]–[6]. The challenging nature of Machine Vision (MV) tasks gave rise to a specialized class of Neural Networks (NN), known as the Convolutional Neural Network (CNN) [7].

     CNNs are considered as one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval related tasks [8], [9]. The success of CNN has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have developed active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners of image processing competitions are employing deep CNN based models.

(Translation) CNNs are considered one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval related tasks [8], [9]. The success of CNN has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have set up active research groups to explore new CNN architectures [10]. At present, most of the frontrunners in image processing competitions are employing deep CNN based models.
The topology of CNN is divided into multiple learning stages composed of a combination of the convolutional layer, non-linear processing units, and subsampling layers [11]. Each layer performs multiple transformations using a bank of convolutional kernels (filters) [12]. The convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making it capable of learning suitable features. The output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstractions but also embeds non-linearity in the feature space. This non-linearity generates different patterns of activations for different responses and thus facilitates the learning of semantic differences in images. The output of the non-linear function is usually followed by subsampling, which helps in summarizing the results and also makes the input invariant to geometrical distortions [12], [13].  
The architectural design of CNN was inspired by Hubel and Wiesel's work and thus largely follows the basic structure of the primate's visual cortex [14], [15]. CNN first came to the limelight through the work of LeCun in 1989 for the processing of grid-like topological data (images and time-series data) [7], [16]. The popularity of CNN is largely due to its hierarchical feature extraction ability. The hierarchical organization of CNN emulates the deep and layered learning process of the neocortex in the human brain, which automatically extracts features from the underlying data [17]. The staging of the learning process in CNN shows quite a resemblance to the primate's ventral pathway of the visual cortex (V1-V2-V4-IT/VTC) [18]. The visual cortex of primates first receives input from the retinotopic area, where multi-scale highpass filtering and contrast normalization are performed by the lateral geniculate nucleus. After this, detection is performed by different regions of the visual cortex categorized as V1, V2, V3, and V4. In fact, the V1 and V2 portions of the visual cortex are similar to the convolutional and subsampling layers, whereas the inferior temporal region resembles the higher layers of CNN, which make inferences about the image [19]. During training, CNN learns through the backpropagation algorithm, by regulating the change in weights with respect to the input. Minimization of a cost function by CNN using the backpropagation algorithm is similar to the response-based learning of the human brain. CNN has the ability to extract low-, mid-, and high-level features. High-level features (more abstract features) are a combination of lower- and mid-level features. With its automatic feature extraction ability, CNN reduces the need for synthesizing a separate feature extractor [20]. Thus, CNN can learn good internal representations from raw pixels with diminutive processing.  
The main boom in the use of CNN for image classification and segmentation occurred after it was observed that the representational capacity of a CNN can be enhanced by increasing its depth [21]. Deep architectures have an advantage over shallow architectures when dealing with complex learning problems. Stacking multiple linear and non-linear processing units in a layer-wise fashion provides deep networks with the ability to learn complex representations at different levels of abstraction. In addition, advancements in hardware, and thus the availability of high computing resources, is also one of the main reasons for the recent success of deep CNNs. Deep CNN architectures have shown significant performance improvements over shallow and conventional vision-based models. Apart from its use in supervised learning, a deep CNN has the potential to learn useful representations from large-scale unlabeled data. The use of multiple mapping functions by CNN enables it to improve the extraction of invariant representations and, consequently, makes it capable of handling recognition tasks of hundreds of categories. Recently, it has been shown that different levels of features, including both low- and high-level, can be transferred to a generic recognition task by exploiting the concept of Transfer Learning (TL) [22]–[24]. Important attributes of CNN are hierarchical learning, automatic feature extraction, multi-tasking, and weight sharing [25]–[27].  
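As a rough, hypothetical illustration of the transfer learning idea mentioned above (not code from the paper), a CNN backbone pretrained on ImageNet can be reused for a new recognition task by freezing its learned low- and mid-level features and retraining only a small task-specific head. The sketch below assumes PyTorch and torchvision are available and uses an illustrative 10-class target task.

```python
import torch.nn as nn
from torchvision import models

# Hypothetical transfer-learning sketch: reuse ImageNet-learned low/mid-level features
# for a new 10-class task by freezing the backbone and training only a new classifier head.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False                               # keep the transferred representations fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # new task-specific layer (trainable)
```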

Various improvements in CNN learning strategy and architecture were performed to make CNN scalable to large and complex problems. These innovations can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that CNN based applications became prevalent after the exemplary performance of AlexNet on the ImageNet dataset [21]. Thus, major innovations in CNN have been proposed since 2012 and were mainly due to the restructuring of processing units and the designing of new blocks. Similarly, Zeiler and Fergus [28] introduced the concept of layer-wise visualization of features, which shifted the trend towards extraction of features at low spatial resolution in deep architectures such as VGG [29]. Nowadays, most of the new architectures are built upon the principle of simple and homogenous topology introduced by VGG. On the other hand, the Google group introduced an interesting idea of split, transform, and merge, with the corresponding block known as the inception block. The inception block for the very first time gave the concept of branching within a layer, which allows abstraction of features at different spatial scales [30]. In 2015, the concept of skip connections introduced by ResNet [31] for the training of deep CNNs became popular, and afterwards this concept was used by most of the succeeding networks, such as Inception-ResNet, WideResNet, ResNeXt, etc. [32]–[34].
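To make the skip-connection idea concrete, the following is a minimal sketch (not the authors' code) of an identity-shortcut residual block in PyTorch; the channel count and layer choices are illustrative assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of a residual (skip-connection) block in the spirit of ResNet [31]."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(x + y)    # identity shortcut: the block only has to learn a residual F(x)
```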

       

        In order to improve the learning capacity of a CNN, different architectural designs such as WideResNet, Pyramidal Net, and Xception explored the effect of multilevel transformations in terms of an additional cardinality and an increase in width [32], [34], [35]. Therefore, the focus of research shifted from parameter optimization and connection readjustment towards improved architectural design (layer structure) of the network. This shift resulted in many new architectural ideas such as channel boosting, spatial and channel-wise exploitation, and attention based information processing [36]–[38].

 
In the past few years, different interesting surveys have been conducted on deep CNNs that elaborate the basic components of CNN and their alternatives. The survey reported in [39] reviewed the famous architectures from 2012-2015 along with their components. Similarly, in the literature, there are prominent surveys that discuss different algorithms of CNN and focus on applications of CNN [20], [26], [27], [40], [41]. Likewise, the survey presented in [42] discussed a taxonomy of CNNs based on acceleration techniques. In this survey, on the other hand, we discuss the intrinsic taxonomy present in recent and prominent CNN architectures. The various CNN architectures discussed in this survey can be broadly classified into seven main categories, namely: spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention based CNNs. The rest of the paper is organized in the following order (shown in Fig. 1): Section 1 summarizes the underlying basics of CNN, its resemblance with the primate's visual cortex, as well as its contribution to MV. In this regard, Section 2 provides an overview of basic CNN components and Section 3 discusses the architectural evolution of deep CNNs, whereas Section 4 discusses the recent innovations in CNN architectures and categorizes CNNs into seven broad classes. Sections 5 and 6 shed light on applications of CNNs and current challenges, whereas Section 7 discusses future work and the last section draws conclusions.  


                                                                  Fig. 1: Organization of the survey paper.

 

2 Basic CNN Components

        Nowadays, CNN is considered the most widely used ML technique, especially in vision related applications. CNNs have recently shown state-of-the-art results in various ML applications. A typical block diagram of an ML system is shown in Fig. 2. Since CNN possesses both good feature extraction and strong discrimination ability, in an ML system it is mostly used for feature extraction and classification.

       

A typical CNN architecture generally comprises alternating layers of convolution and pooling followed by one or more fully connected layers at the end. In some cases, the fully connected layer is replaced with a global average pooling layer. In addition to the various learning stages, different regulatory units such as batch normalization and dropout are also incorporated to optimize CNN performance [43]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus achieving enhanced performance. This section briefly discusses the role of these components in CNN architecture.
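As an illustration of this typical layout (a minimal sketch under assumed channel sizes and an assumed 10-class output, not an architecture from the survey), the following PyTorch model alternates convolution and pooling stages, inserts batch normalization and dropout as regulatory units, and uses global average pooling in place of a large fully connected head.

```python
import torch.nn as nn

# Minimal sketch of the typical layout described above: alternating convolution/pooling
# stages, regulatory units (batch normalization, dropout), and a global-average-pooling head.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),                              # subsampling stage 1
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(2),                              # subsampling stage 2
    nn.AdaptiveAvgPool2d(1),                      # global average pooling instead of large FC layers
    nn.Flatten(),
    nn.Dropout(p=0.5),                            # regularization before the classifier
    nn.Linear(64, 10),                            # task-specific output layer (assumed 10 classes)
)
```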

 

2.1 Convolutional Layer

        The convolutional layer is composed of a set of convolutional kernels (each neuron acts as a kernel). These kernels are associated with a small area of the image known as a receptive field. The layer works by dividing the image into small blocks (receptive fields) and convolving them with a specific set of weights (multiplying elements of the filter with the corresponding receptive-field elements) [43]. The convolution operation can be expressed as follows:

       

$$ F_l^k = I_{x,y} * K_l^k \tag{1} $$

    where the input image is represented by $I_{x,y}$, $(x, y)$ denotes the spatial locality, and $K_l^k$ represents the $l$th convolutional kernel of the $k$th layer. Division of the image into small blocks helps in extracting locally correlated pixel values. This locally aggregated information is also known as a feature motif. Different sets of features within the image are extracted by sliding the convolutional kernel over the whole image with the same set of weights. This weight-sharing property of the convolution operation makes CNNs parameter-efficient as compared to fully connected networks. The convolution operation may further be categorized into different types based on the type and size of filters, the type of padding, and the direction of convolution [44]. Additionally, if the kernel is symmetric, the convolution operation becomes a correlation operation [16].
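A naive NumPy sketch of equation (1) is given below (illustrative only: single channel, no padding or stride, and, as is common in CNN implementations, the kernel is slid without flipping, i.e. as a correlation). It shows how the same kernel weights are reused over every receptive field of the image.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive sketch of equation (1): slide one kernel K over the image I with shared weights."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):                                     # each (x, y) is one receptive field
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

feature_map = conv2d(np.random.rand(28, 28), np.random.rand(3, 3))  # single-channel example
```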

 

2.2 Pooling Layer

        Feature motifs, which result as an output of the convolution operation, can occur at different locations in the image. Once features are extracted, their exact location becomes less important as long as their approximate position relative to others is preserved. Pooling or downsampling, like convolution, is an interesting local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region [45].

       

$$ Z_l = f_p\left(F_{x,y}^l\right) \tag{2} $$

Equation (2) shows the pooling operation, in which $Z_l$ represents the $l$th output feature map, $F_{x,y}^l$ shows the $l$th input feature map, and $f_p(.)$ defines the type of pooling operation. The use of the pooling operation helps to extract a combination of features which are invariant to translational shifts and small distortions [13], [46]. Reduction in the size of the feature map to an invariant feature set not only regulates the complexity of the network but also helps in increasing generalization by reducing overfitting. Different types of pooling formulations such as max, average, L2, overlapping, spatial pyramid pooling, etc. are used in CNNs [47]–[49].
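The following NumPy sketch (illustrative; non-overlapping square windows assumed) implements equation (2) for max and average pooling by summarizing each local neighborhood with its dominant or mean response.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Sketch of equation (2): summarize each local neighborhood of the feature map."""
    h, w = feature_map.shape
    # group pixels into non-overlapping (size x size) blocks, trimming any leftover border
    blocks = feature_map[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size)
    reduce = np.max if mode == "max" else np.mean
    return reduce(blocks, axis=(1, 3))          # dominant (or average) response per block

pooled = pool2d(np.random.rand(6, 6), size=2, mode="max")  # 6x6 map -> 3x3 summary
```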

 

 

2.3 Activation Function

        The activation function serves as a decision function and helps in learning complex patterns. Selection of an appropriate activation function can accelerate the learning process. The activation function for a convolved feature map is defined in equation (3).

       

$$ T_l^k = f_A\left(F_l^k\right) \tag{3} $$

In the above equation, $F_l^k$ is the output of a convolution operation, which is assigned to the activation function $f_A(.)$ that adds non-linearity and returns a transformed output $T_l^k$ for the $k$th layer. In the literature, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU [39], [48], [50], [51] are used to inculcate non-linear combinations of features. However, ReLU and its variants are preferred over other activations, as they help in overcoming the vanishing gradient problem [52], [53].
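For concreteness, a few of the activation functions listed above can be sketched element-wise in NumPy as follows (illustrative definitions only; the leaky-ReLU slope of 0.01 is an assumed default).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)               # f_A(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # keeps a small gradient for negative inputs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

feature_map = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(feature_map))                    # [0.  0.  0.  1.5]
print(leaky_relu(feature_map))              # [-0.02  -0.005  0.  1.5]
```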

 


         Fig. 2: Basic layout of a typical ML system. In ML related tasks, data is initially preprocessed and then assigned to a classification system. A typical ML problem follows three steps: stage 1 is related to data gathering and generation, stage 2 performs preprocessing and feature selection, whereas stage 3 is based on model selection, parameter tuning, and analysis. CNN has good feature extraction and strong discrimination ability; therefore, in an ML system it can be used for feature extraction and classification.

 

2.4 Batch Normalization

        Batch normalization is used to address the issues related to internal covariate shift within feature maps. Internal covariate shift is a change in the distribution of hidden units' values, which slows down convergence (by forcing the learning rate to a small value) and requires careful initialization of parameters. Batch normalization for a transformed feature map $F_l^k$ is shown in equation (4).

       

$$ N_l^k = \frac{F_l^k - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \tag{4} $$

      In equation (4), $N_l^k$ represents the normalized feature map, $F_l^k$ is the input feature map, and $\mu_B$ and $\sigma_B^2$ depict the mean and variance of a feature map for a mini-batch, respectively ($\varepsilon$ is a small constant added for numerical stability). Batch normalization unifies the distribution of feature-map values by bringing them to zero mean and unit variance [54]. Furthermore, it smoothens the flow of the gradient and acts as a regulating factor, which thus helps in improving the generalization of the network.
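A minimal NumPy sketch of equation (4) is shown below (illustrative; it assumes a single channel, and includes the learnable scale and shift used in practice alongside the small constant ε).

```python
import numpy as np

def batch_norm(F, gamma=1.0, beta=0.0, eps=1e-5):
    """Sketch of equation (4) for one channel: F has shape (batch, height, width).
    The mini-batch statistics are computed over the batch and spatial positions;
    gamma and beta are the learnable scale and shift."""
    mu_B = F.mean()                           # mini-batch mean, mu_B
    var_B = F.var()                           # mini-batch variance, sigma_B^2
    N = (F - mu_B) / np.sqrt(var_B + eps)     # normalized feature map N_l^k
    return gamma * N + beta

normalized = batch_norm(np.random.randn(8, 4, 4))   # a mini-batch of eight 4x4 feature maps
```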

 

2.5 Dropout

        Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. In NNs, multiple connections that learn a non-linear relation are sometimes co-adapted, which causes overfitting [55]. This random dropping of some connections or units produces several thinned network architectures, and finally one representative network is selected with small weights. This selected architecture is then considered as an approximation of all of the proposed networks [56].
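The sketch below illustrates this in NumPy using the common "inverted" dropout formulation (an assumption, not the paper's exact formulation): units are dropped with probability p during training, and activations are rescaled by 1/(1−p) so that the full network can be used unchanged at test time.

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Sketch of (inverted) dropout: randomly skip units with probability p during training."""
    if not training:
        return activations                          # at test time the full "averaged" network is used
    mask = np.random.rand(*activations.shape) >= p  # each unit survives with probability 1 - p
    return activations * mask / (1.0 - p)           # rescale so the expected activation is unchanged

thinned = dropout(np.random.rand(4, 8), p=0.5)      # one randomly thinned set of activations
```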

       

                                         

 

 

 

 
