Deep Learning Meets SAR


Abstract

Deep learning in remote sensing has become an international hype, but it is mostly limited to the evaluation of optical data. Although deep learning has been introduced in SAR data processing, its huge potential remains locked despite successful first attempts. For example, to the best knowledge of the authors, there is no single example of deep learning in SAR that has been developed up to operational processing of big data or integrated into the production chain of any satellite mission. In this paper, we provide an introduction to the most relevant deep learning models and concepts, point out possible pitfalls by analyzing special characteristics of SAR data, review the state-of-the-art of deep learning applied to SAR in depth, summarize available benchmarks, and recommend some important future research directions. With this effort, we hope to stimulate more research in this interesting yet under-exploited research field.

Index Terms—Benchmarks, deep learning, despeckling, InSAR, object detection, parameter inversion, SAR, SAR-optical data fusion, terrain surface classification.

I. MOTIVATION

In recent years, deep learning [1] has been developed at a dramatic pace, achieving great success in many fields. Unlike conventional algorithms, deep learning-based methods commonly employ hierarchical architectures, such as deep neural networks, to extract feature representations of raw data for numerous tasks. For instance, convolutional neural networks (CNNs) are capable of learning low- and high-level features from raw images with stacks of convolutional and pooling layers, and then applying the extracted features to various computer vision tasks, such as large-scale image recognition [2], object detection [3], and semantic segmentation [4]. Inspired by numerous successful applications in the computer vision community, the use of deep learning in remote sensing is now receiving wide attention [5]. As first attempts in SAR, deep learning-based methods have been adopted for a variety of tasks, including terrain surface classification [6], object detection [7], parameter inversion [8], despeckling [9], specific applications in InSAR [10], and SAR-optical data fusion [11].

For terrain surface classification from SAR and polarimetric SAR (PolSAR) images, effective feature extraction is essential. In conventional methods, these features are extracted based on expert domain knowledge and are usually applicable only to a small number of cases and data sets. Deep learning feature extraction, however, has proved to overcome, to some degree, both of the aforementioned issues [6]. For SAR target detection, conventional approaches mainly rely on template matching, where specific templates are created manually [12] to classify different categories, or on traditional machine learning approaches, such as Support Vector Machines (SVMs) [13], [14]; in contrast, modern deep learning algorithms aim at applying deep CNNs to extract discriminative features automatically for target recognition [7]. For parameter inversion, deep learning models are employed to learn the latent mapping function from SAR images to the estimated parameters, e.g., sea ice concentration [8]. Regarding despeckling, conventional methods often rely on hand-designed filters and may erroneously eliminate sharp features when denoising. Furthermore, the development of joint analysis of SAR and optical images has been motivated by the capacity of deep learning to extract features from both types of images. For applications in InSAR, only a few studies have been carried out, such as the work described in [10]. However, these algorithms neglect the special characteristics of the phase and simply use an out-of-the-box deep learning model.

Despite the first successes, and unlike the evaluation of optical data, the huge potential of deep learning in SAR and InSAR remains locked. For example, to the best knowledge of the authors, there is no single example of deep learning in SAR that has been developed up to operational processing of big data or integrated into the production chain of any satellite mission. This paper aims at stimulating more research in this interesting yet under-exploited research field.

In the remainder of this paper, Section II first introduces the most commonly used deep learning models in remote sensing. Section III describes the specific characteristics of SAR data that have to be taken into account to exploit the full potential of SAR combined with deep learning. Section IV details recent advances in the utilization of deep learning for the different SAR applications outlined in Section I. Section V reviews the existing benchmark data sets for different applications of SAR and their limitations. Finally, Section VI summarizes current research and gives an overview of promising future directions.


II. INTRODUCTION TO RELEVANT DEEP LEARNING MODELS AND CONCEPTS

In this section, we briefly review relevant deep learning algorithms originally proposed for visual data processing that are widely used in state-of-the-art research on deep learning in SAR. In addition, we mention the latest developments in deep learning, which are not yet widely applied to SAR but may help create the next generation of its algorithms. Fig. 1 gives an overview of the deep learning models we discuss in this section. Before discussing deep learning algorithms, we would like to stress that the importance of high-quality benchmark datasets in deep learning research cannot be overstated. Especially in supervised learning, the knowledge that can be learned by the model is bounded by the information present in the training dataset. For example, the MNIST [25] dataset played a key role in Yann LeCun's seminal paper about convolutional neural networks and gradient-based learning [26]. Similarly, there would be no AlexNet [27], the network that kick-started the current deep learning renaissance, without the ImageNet [28] dataset, which contains over 14 million images and 22,000 classes. ImageNet has been such an important part of deep learning research that, even more than 10 years after its publication, it is still used as a standard benchmark to evaluate the performance of CNNs for image classification.

A. Deep Learning Models

The main principle of deep learning models is to encode input data into effective feature representations for target tasks.

To exemplify how a deep learning framework works, we take the autoencoder as an example: it first maps input data to a latent representation via a trainable nonlinear mapping and then reconstructs the input through a reverse mapping. The reconstruction error is usually defined as the Euclidean distance between the input and the reconstructed input. The parameters of autoencoders are optimized by gradient descent-based optimizers, such as stochastic gradient descent (SGD), RMSProp [29], and Adam [30], during the backpropagation step.
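As a minimal illustration, the following PyTorch sketch wires these pieces together; the layer sizes, the MSE reconstruction loss, and the choice of Adam are illustrative assumptions, not prescriptions from the literature cited above.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Trainable nonlinear mapping to a latent representation
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Reverse mapping that reconstructs the input
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # squared Euclidean distance to the reconstruction

x = torch.randn(16, 784)     # a dummy batch of flattened inputs
loss = loss_fn(model(x), x)  # reconstruction error
optimizer.zero_grad()
loss.backward()              # backpropagation
optimizer.step()
```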

  1. Convolutional Neural Networks (CNN): With the success of AlexNet in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), where it scored a top-5 test error of 15.3% compared to 26.2% for the second-best entry, CNNs have attracted worldwide attention and are now used for many image understanding tasks, such as image classification, object detection, and semantic segmentation. AlexNet consists of five convolutional layers, three max-pooling layers, and three fully connected layers. One of the key innovations of AlexNet was the use of GPUs, which made it possible to train such large networks on huge datasets without using supercomputers. In just two years, VGGNet [2] overtook AlexNet in performance by achieving a 6.8% top-5 test error in ILSVRC-2014; the main difference was that it only used 3x3-sized convolutional kernels, which enabled it to have more channels and in turn capture more diverse features. ResNet [31], U-Net [32], and DenseNet [33] were the next major CNN architectures. The main feature of all these architectures was the idea of connecting not only neighboring layers but any two layers in the network by using skip connections. This helped reduce the loss of information across networks, mitigated the problem of vanishing gradients, and allowed the design of deeper networks. U-Net is one of the most commonly used image segmentation networks. It has an autoencoder-based architecture that uses skip connections to concatenate features from the first layer to the last, the second to the second-to-last, and so on; this way, fine-grained information from the initial layers is passed on to the final layers. U-Net was initially proposed for medical image segmentation, where data labeling is a big problem. The authors used heavy data augmentation on the input data, making it possible to learn from only a few hundred annotated samples. In ResNet, skip connections are used within individual blocks and not across the whole network. Since its initial proposal, it has seen many architectural tweaks, and even after 4-5 years its variants are still among the top scorers on ImageNet. In DenseNet, all layers are connected to all preceding layers, reducing the size of the network, albeit at the cost of memory usage. For more detailed explanations of different CNN models, interested readers are referred to [34].
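To make the skip-connection idea shared by ResNet, U-Net, and DenseNet concrete, the following is a hedged PyTorch sketch of a ResNet-style residual block; the channel count and kernel size are illustrative choices.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A ResNet-style block: the skip connection adds the input to the
    output, so information and gradients can bypass the convolutions."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # identity skip connection

y = ResidualBlock()(torch.randn(1, 64, 32, 32))  # same shape as the input
```

U-Net-style skip connections differ only in that they concatenate encoder features to decoder features (torch.cat) instead of adding them.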

  2. Recurrent Neural Networks (RNN): Besides CNNs, RNNs [35] are another major class of deep networks. Their main building blocks are recurrent units, which take the current input and the output of the previous state as input. They provide state-of-the-art results for processing data of variable length, such as text and time series data. Their weights can be replaced with convolutional kernels for visual processing tasks, such as image captioning and predicting future frames/points in visual time-series data. Long short-term memory (LSTM) [36] is one of the most popular RNN architectures: its cells can store values from any past instances while not being severely affected by the problem of vanishing gradients.

  3. Generative Adversarial Networks (GANs): Proposed by Ian Goodfellow et al. [37], GANs are among the most popular and exciting inventions in the field of deep learning. Based on game-theoretic principles, they consist of two networks called a generator and a discriminator. The generator's objective is to learn a latent space, through which it can generate samples from the same distribution as the training data, while the discriminator tries to learn to distinguish whether a sample comes from the generator or from the training data. This very simple mechanism is responsible for most cutting-edge algorithms in various applications, e.g., generating artificial photo-realistic images/videos, super-resolution, and text-to-image synthesis.
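The adversarial game can be summarized in a few lines of PyTorch; the toy generator/discriminator, data shapes, and optimizer settings below are illustrative assumptions, not a recipe from [37].

```python
import torch
import torch.nn as nn

# Toy generator G (noise -> 2-D sample) and discriminator D (sample -> prob.)
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 2)   # stand-in for samples from the training data
fake = G(torch.randn(32, 16))

# Discriminator step: label real samples 1 and generated samples 0.
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator into predicting 1 for generated samples.
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```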

B. Supervised, Unsupervised and Reinforcement Learning
  1. Supervised Learning: Most popular deep learning models fall under the category of supervised deep learning, i.e., they need labelled datasets to learn their objective functions. One of the big challenges of supervised learning is generalization, i.e., how well a trained model performs on test data. It is therefore vital that the training data truly represent the underlying distribution of the data, so that the model can handle unseen samples. If a model fits the training data well but fails on test data, this is called overfitting; the deep learning literature offers several techniques to avoid it, e.g., dropout [38].

  2. Unsupervised Learning: Unsupervised learning refers to the class of algorithms where the training data do not contain labels. For instance, in classical data analysis, principal component analysis (PCA) [39] can be used to reduce the data dimension, followed by a clustering algorithm to group similar data points. In deep learning, generative models such as autoencoders, variational autoencoders (VAEs) [40], and generative adversarial networks (GANs) [37] are among the popular techniques for unsupervised learning. Their primary goal is to generate output data from the same distribution as the input data. Autoencoders consist of an encoder part, which finds a compressed latent representation of the input, and a decoder part, which decodes that representation back to the original input. VAEs take autoencoders to the next level by learning the whole distribution instead of a single representation at the end of the encoder, which in turn can be used by the decoder to generate the whole distribution of outputs. The trick to learning this distribution is to learn the variance along with the mean of the latent representation at the encoder-decoder meeting point, and to add a KL-divergence-based loss term to the standard reconstruction loss of the autoencoder.
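A hedged sketch of the VAE ingredients described above: the reparameterized sampling step and the loss with its KL-divergence term, assuming the encoder outputs the mean and log-variance of a diagonal Gaussian (a common convention, not the only one).

```python
import torch

def reparameterize(mu, log_var):
    """Sampling trick: z = mu + sigma * eps keeps the graph differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_rec, mu, log_var):
    """Reconstruction error plus a KL term pulling N(mu, sigma^2) towards N(0, I)."""
    rec = torch.sum((x - x_rec) ** 2)                              # reconstruction
    kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp())   # KL divergence
    return rec + kl
```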

  3. Deep Reinforcement Learning (DeepRL): Reinforcement learning (RL) tries to mimic the human learning behavior, i.e., taking actions and then adjusting them in the future according to feedback from the environment. For example, young children learn to repeat or not repeat their actions based on the reaction of their parents. An RL model consists of an environment with states, actions that transition between those states, and a reward system for ending up in different states. The objective of the algorithm is to learn the best actions for given states using the feedback reward system. In classical RL algorithms, function approximators are used to calculate the probability of different actions in different states; DeepRL uses different types of neural networks to create these function approximators [41], [42]. Recently, DeepRL has received particular attention and popularity due to the success of Google DeepMind's AlphaGo [43], which defeated the world champion in the board game Go, a task that was considered impossible for computers until just a few years ago.
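As a minimal illustration of a neural network acting as such a function approximator, the following DQN-style temporal-difference update is sketched below; the state/action dimensions, discount factor, and toy transition are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy setting: 4-dimensional states and 2 discrete actions (hypothetical sizes).
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor for future rewards

state = torch.randn(1, 4)
action = torch.tensor([0])
reward = torch.tensor([1.0])
next_state = torch.randn(1, 4)

# TD target: observed reward plus discounted best value of the next state.
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values
q_value = q_net(state).gather(1, action.view(1, 1)).squeeze(1)
loss = nn.functional.mse_loss(q_value, target)
opt.zero_grad(); loss.backward(); opt.step()
```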

C. Relevant Deep Learning Concepts
  1. Automatic Machine Learning (AutoML): Deep networks have many hyperparameters to choose from, for example, the number of layers, kernel sizes, the type of optimizer, skip connections, and the like. There are billions of possible combinations of these parameters, and given the high computational, time, and energy costs, it is hard to find the best-performing network even from among a few hundred candidates. In the case of deep learning, the objective of AutoML is mainly to find the most efficient and highest-performing deep network for a given dataset and task. The first major attempt in this field was by Zoph et al. [44], who used DeepRL to find the optimum CNN for image classification. In their system, an RNN creates CNN architectures and, based on their classification results, proposes changes to them. This process loops until the optimum architecture is found. The algorithm was able to find networks competitive with the state-of-the-art, but it required over 800 GPUs, which is unrealistic for practical applications. Recently, there have been many new developments in the AutoML field, which have made it possible to perform such tasks in more intelligent and efficient ways. More details about the field of network architecture search can be found in [45].

  2. Geometric Deep Learning – Graph Neural Networks (GNNs): Apart from well-structured image data, there is a large amount of unstructured data in real life, e.g., knowledge graphs and social networks, that cannot be directly processed by a deep CNN. Usually, these data are represented in the form of graphs, where each node represents an entity and the edges delineate their mutual relations. To learn from unstructured data, geometric deep learning has been attracting increasing attention, and the most commonly used architecture is the GNN, which has also proven successful in dealing with structured data. In the terminology of graphs, the nodes of a graph can be regarded as feature descriptions of entities, and their edges are established by measuring their relations or distances and are encoded in an adjacency matrix. Once a graph is constructed, messages can be propagated among the nodes by simply performing matrix multiplications. Building on this idea, [46] proposed graph convolutional networks (GCNs), characterized by the use of graph convolutions, and [45] accelerated the process. Moreover, recurrent units in recurrent graph neural networks (RGNNs) [47], [48] have also proven capable of learning from graphs.
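To make the matrix-multiplication view concrete, the following NumPy sketch performs one graph-convolution step in the spirit of GCNs [46]; the symmetric normalization and the toy graph are illustrative choices.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: normalize the adjacency matrix (with
    self-loops), propagate node features by matrix multiplication, then
    apply a nonlinearity."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0)      # ReLU(A_norm X W)

# Toy graph: 3 nodes in a chain, 2 input features, 4 output features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.randn(3, 2)
W = np.random.randn(2, 4)
H = gcn_layer(A, X, W)  # propagated node representations, shape (3, 4)
```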

III. POSSIBLE PITFALLS

To develop tailored deep learning architectures and prepare suitable training datasets for SAR or InSAR tasks, it is important to understand that SAR data are different from optical remote sensing data, not to mention images downloaded from the internet. In this section, we discuss the special characteristics of, and the possible pitfalls encountered while, applying deep learning to SAR. What makes SAR data and SAR data processing by neural networks unique? SAR data are substantially different from optical imagery in many respects. The following points should be considered when transferring CNN experience and expertise from optical to SAR data:
• Dynamic Range. Depending on their spatial resolution, the dynamic range of SAR images can be up to 90 dB (TerraSAR-X high-resolution spotlight data with a resolution of about 1 m). Moreover, the distribution is extremely asymmetric, with the majority of pixels in the low-amplitude range (distributed scatterers) and a long tail representing bright discrete scatterers, in particular in urban areas. Standard CNNs are not able to handle such dynamic ranges and, hence, most approaches feature dynamic-range compression as a preprocessing step. In [49], the authors first take only amplitude values from 0 to 255 and then subtract the mean value of each image. In [11], [50], normalization is performed as a pre-processing step, which compresses the dynamic range significantly.
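A hedged sketch of such dynamic-range compression is shown below; the dB conversion and clipping threshold are illustrative assumptions rather than the exact pre-processing of [49] or [11].

```python
import numpy as np

def compress_sar_amplitude(amplitude, max_db=40.0):
    """Illustrative SAR pre-processing: convert amplitudes to dB, clip the
    long bright-scatterer tail, and normalize to [0, 1] for a CNN. The
    clipping threshold is an assumption, not a fixed standard."""
    eps = 1e-10
    db = 20.0 * np.log10(np.abs(amplitude) + eps)  # amplitude in dB
    db = np.clip(db - db.min(), 0.0, max_db)       # limit the dynamic range
    return db / max_db                             # normalize for the network
```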

• Signal Statistics. In order to retrieve features from SAR (amplitude or intensity) images, the speckle statistics must be considered. Speckle is a multiplicative, rather than an additive, phenomenon. This has consequences: while the optimum estimator of the radar brightness of a homogeneous image patch under speckle is a simple moving average (i.e., a convolution, as in the additive noise case), other detectors of edges and low-level features that are optimum under additive Gaussian noise may no longer be optimum in the case of SAR. A popular example is Touzi's CFAR edge detector [51] for SAR images, which uses the ratio of two spatial averages over adjacent windows. This operation cannot be emulated by the first layer of a standard CNN. Some studies use a logarithmic mapping of the SAR images prior to feeding them into a CNN [52], [9]. This turns speckle into an additive random variable and, as a side effect, reduces the dynamic range. But still, a single convolutional layer can only emulate approximations to optimum SAR feature estimators. It could be valuable to supplement the original log-SAR image with a few low-pass filtered and logarithmized versions as input to the CNN. Another approach is to apply a sophisticated speckle reduction filter before entering the CNN, e.g., non-local averaging [53], [54], [55].
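To illustrate why such ratio-based detectors do not fit into a single convolutional layer, the following sketch implements a CFAR-style ratio edge detector in the spirit of Touzi's approach [51]; the window size, shift, and vertical-edge orientation are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ratio_edge_detector(intensity, size=7, shift=4):
    """Edge strength as the ratio of mean intensities in two adjacent
    windows, which is robust under multiplicative speckle. The ratio is a
    nonlinear operation that a convolution alone cannot reproduce."""
    mean = uniform_filter(intensity, size=size)  # local spatial average
    left = np.roll(mean, shift, axis=1)          # window left of the pixel
    right = np.roll(mean, -shift, axis=1)        # window right of the pixel
    r = np.minimum(left / (right + 1e-10), right / (left + 1e-10))
    return 1.0 - r  # near 1 at strong vertical edges, near 0 in flat areas
```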

• Imaging Geometry. The SAR image coordinates, range and azimuth, are not arbitrary coordinates like East and North or x and y, but reflect the peculiarities of the image generation process. Layover always occurs at the near range of an object, shadow always at its far range. This means that data augmentation by rotating SAR images would lead to nonsense imagery that would never be generated by a real SAR.
• The Complex Nature of SAR Data. The most valuable information in SAR data lies in its phase. This applies to SAR image formation, which takes place in the complex signal domain, as well as to polarimetric, interferometric (InSAR), and tomographic SAR data processing. This means that the entire CNN must be able to handle complex numbers. For the convolution operation this is trivial. The nonlinear activation function and the loss function, however, require thorough consideration. Depending on whether the activation function acts on the real and imaginary parts of the signal independently, or only on its magnitude, and on where a bias is added, the phase will be distorted to different degrees.

If we use polarimetric SAR data for land cover or target classification, a nonlinear processing of the phase is even desirable, because the phase between different polarimetric channels has physical meaning and, hence, contributes to the classification process. In SAR interferometry and tomography, however, the absolute phase has no meaning, i.e., the CNN must be invariant to an arbitrary phase offset. Assume some interferometric input signal $x$ to a CNN and the output signal $\mathrm{CNN}(x)$ with phase $\phi\{\mathrm{CNN}(x)\}$. Invariance to an arbitrary phase offset $\phi_0$ then requires that the offset propagates linearly through the network:

$$\phi\{\mathrm{CNN}(x\,e^{j\phi_0})\} = \phi\{\mathrm{CNN}(x)\} + \phi_0 .$$

This linearity is violated, for example, if the activation function is applied to real and imaginary parts separately, or if a bias is added to the complex numbers. Another point to consider in regression-type InSAR CNN processing (e.g., for noise reduction) is the loss function. If the quantity of interest is not the complex number itself, but its phase, the loss function must be able to handle the cyclic nature of phases. It may also be advantageous for the loss function to be independent, at least to a certain degree, of the signal magnitude, to relieve the CNN from modelling the magnitude. A loss function that meets these requirements is, for example,

$$L = \frac{1}{N}\sum_{n=1}^{N}\left|\,e^{j\hat{\phi}_n} - e^{j\phi_n}\right| ,$$

where $\hat{\phi}_n$ and $\phi_n$ denote the estimated and reference phases of the $n$-th sample: this loss is $2\pi$-periodic in the phase difference and does not depend on the signal magnitude.
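A hedged PyTorch sketch of such a cyclic, magnitude-independent phase loss follows; expressing it through real trigonometric terms is an implementation choice, and the toy example only illustrates the invariance to phase wrapping.

```python
import math
import torch

def cyclic_phase_loss(phi_pred, phi_ref):
    """|e^{j phi_pred} - e^{j phi_ref}|, averaged over all samples: the loss
    is 2*pi-periodic in the phase difference and ignores signal magnitude."""
    real = torch.cos(phi_pred) - torch.cos(phi_ref)
    imag = torch.sin(phi_pred) - torch.sin(phi_ref)
    return torch.mean(torch.sqrt(real ** 2 + imag ** 2))

phi_ref = torch.tensor([0.1, 1.5])
print(cyclic_phase_loss(phi_ref + 2 * math.pi, phi_ref))  # ~0: wrapping is free
```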

• Simulation-based Training and Validation Data?

The prevailing lack of ground truth for regression-type tasks, like speckle reduction or InSAR denoising, might tempt us to use simulated SAR data for training and validation of neural networks. However, this bears the risk that our networks will learn models that are far too simplified. Unlike in the optical imaging field, where highly realistic scenes can be simulated, e.g., by PC games, the simulation of SAR data is more of a scientific topic without the backing of commercial companies and a huge market. SAR simulators focus on specific scenarios, e.g., vegetation (only distributed scatterers considered) or persistent (point) scatterers. The most advanced simulators are probably the ones for computing radar backscatter signatures of single military objects, like vessels. To our knowledge, though, there is no simulator available that can, e.g., generate realistic interferometric data of rugged terrain with layover, spatially varying coherence, and diverse scattering mechanisms. Often simplified scattering assumptions are made, e.g., that speckle is multiplicative. Even this is not true in general; pure Gaussian scattering can only be found for quite homogeneous surfaces and low-resolution SARs. As soon as the resolution increases, the chances of a few dominating scatterers appearing in a resolution cell increase as well, and the statistics become substantially different from those of fully developed speckle.

IV. RECENT ADVANCES IN DEEP LEARNING APPLIED TO SAR

In this section, we provide an in-depth review of deep learning methods applied to SAR data from six perspectives: terrain surface classification, object detection, parameter inversion, despeckling, SAR interferometry (InSAR), and SAR-optical data fusion. For each application, notable developments are presented in chronological order, and their advantages and disadvantages are reported. Each subsection is concluded with a brief summary.

A. Terrain Surface Classification

As an important direction among SAR applications, terrain surface classification using PolSAR images is rapidly advancing with the help of deep learning. Regarding feature extraction, most conventional methods rely on exploring physical scattering properties [56] and texture information [57] in SAR images. However, these features are mainly designed by humans based on specific problems and the characteristics of the data sources. Compared to conventional methods, deep learning is superior in terrain surface classification due to its capability of automatically learning discriminative features. Moreover, deep learning approaches, such as CNNs, can effectively extract not only polarimetric characteristics but also the spatial patterns of PolSAR images [6]. Some of the most notable deep learning techniques for PolSAR image classification are reviewed in the following.
Xie et al. [58] first applied deep learning to terrain surface classification using PolSAR images. They employed a stacked autoencoder (SAE) to automatically learn deep features from PolSAR data and then fed them to a softmax classifier. Remarkable improvements in both classification accuracy and visual effect proved that this method can effectively learn a comprehensive feature representation for classification purposes. Instead of simply applying an SAE, Geng et al. [61] proposed a deep convolutional autoencoder (DCAE) for automatically extracting features and performing classification. The first layer of the DCAE is a hand-crafted convolutional layer, where the filters are pre-defined, such as gray-level co-occurrence matrices and Gabor filters. The second layer of the DCAE performs a scale transformation, which integrates correlated neighboring pixels to reduce speckle. Following these two hand-crafted layers, a trained SAE, similar to [58], is attached for learning more abstract features. Tested on high-resolution single-polarization TerraSAR-X images, the method achieved remarkable classification accuracy. Based on the DCAE, Geng et al. [59] proposed a framework called the deep supervised and contractive neural network (DSCNN) for SAR image classification, which introduces histogram of oriented gradient (HOG) descriptors. In addition, a supervised penalty is designed to capture relevant information between features and labels, and a contractive restriction, which enhances local invariance, is employed in the subsequent trainable autoencoder layers. An example of applying the DSCNN to TerraSAR-X data over a small area in Norway is shown in Fig. 2. Compared to other algorithms, the DSCNN achieves a highly accurate and nearly noise-free classification map.

Fig. 2: Classification maps obtained from a TerraSAR-X image of a small area in Norway [59]. Subfigures (a)-(f) depict the results of classification using SVM (accuracy = 78.42%), the sparse representation classifier (SRC) (accuracy = 85.61%), random forest (accuracy = 82.20%) [60], SAE (accuracy = 87.26%) [58], DCAE (accuracy = 94.57%) [61], and contractive AE (accuracy = 88.74%). Subfigures (g)-(i) show the combination of DSCNN with SVM (accuracy = 96.98%), with SRC (accuracy = 92.51%) [62], and with random forest (accuracy = 96.87%). Subfigures (j) and (k) represent the classification results of DSCNN (accuracy = 97.09%) and DSCNN followed by spatial regularization (accuracy = 97.53%), which achieve higher accuracy than the other methods.

In addition to the aforementioned methods, many studies integrate SAE models with conventional classification algorithms for terrain surface classification. Hou et al. [64] proposed an SAE combined with superpixels for PolSAR image classification. Multiple layers of the SAE are trained on a pixel-by-pixel basis. Superpixels are formed based on Pauli-decomposed pseudo-color images. Outputs of the SAE are used as features in the final step of k-nearest-neighbor clustering of superpixels. Zhang et al. [65] applied a stacked sparse AE to PolSAR image classification by taking into account local spatial information. Qin et al. [66] applied adaptive boosting of RBMs to PolSAR image classification. Zhao et al. [67] proposed a discriminant DBN (DisDBN) for SAR image classification, in which discriminant features are learned by combining ensemble learning with a deep belief network in an unsupervised manner. Moreover, taking into account that most current deep learning methods aim at exploiting features either from the polarization information or from the spatial information of PolSAR images, Gao et al. [63] proposed a dual-branch CNN to learn features from both perspectives for terrain surface classification. This method is built on two feature extraction channels: one to extract polarization features from the 6-channel real matrix, and the other to extract spatial features from a Pauli decomposition. Next, the extracted features are combined using two parallel fully connected layers and finally fed to a softmax layer for classification. The detailed architecture of this network is illustrated in Fig. 3. Different variations of CNNs have been used for terrain surface classification as well. In [68], Zhou et al. first extracted a 6-channel covariance matrix and then fed it to a trainable CNN for PolSAR image classification. Wang et al. [69] proposed a fully convolutional network integrated with sparse and low-rank subspace representations for classifying PolSAR images. Chen et al. [70] improved CNN performance by incorporating expert knowledge of target scattering mechanism interpretation and polarimetric feature mining. In a more recent work [71], He et al. proposed combining features learned from nonlinear manifold embedding with the application of a fully convolutional network (FCN) to the input PolSAR images; the final classification was carried out in an ensemble approach by an SVM. In [72], the authors focused on the computational efficiency of deep learning methods, proposing the use of lightweight 3D CNNs. They showed that a classification accuracy comparable to other CNN methods is achievable while significantly reducing the number of learned parameters and therefore gaining computational efficiency. Apart from these single-image classification schemes using CNNs, the use of time series of SAR images for crop classification has been shown in [73], [74]. The authors of both papers experimented with recurrent neural network (RNN)-based architectures to exploit the temporal dependency of multi-temporal SAR images and improve classification accuracy. A unique approach to tackling PolSAR classification was recently proposed in [75], where for the first time the authors utilized an AutoML technique to find the optimum CNN architecture for each dataset. The approach takes into account the complex nature of PolSAR images, is cost-effective, and achieves high classification accuracy [75].

Most of the aforementioned methods rely primarily on preprocessing or transforming raw complex-valued data into features in the real domain and then feeding them into a common CNN, which constrains the possibility of directly learning features from the raw data. To tackle this problem, Zhang et al. [76] proposed a novel complex-valued CNN (CV-CNN) specifically designed to process the complex values in PolSAR data, i.e., the off-diagonal elements of a coherency or covariance matrix. The CV-CNN not only takes complex numbers as input but also employs complex weights and complex operations throughout its layers (a schematic sketch of such a complex convolution is given below). A complex-valued backpropagation algorithm is also developed for CV-CNN training. Other notable complex-valued deep learning approaches for classification using PolSAR images can be found in [77], [78], [79].

Although not completely related to terrain surface classification, it is also worth mentioning that the combination of SAR and PolSAR images with feed-forward neural networks has been extensively used for sea ice classification. This topic is not treated any further in this section, and the interested reader is referred to [80], [81], [82], [83], [84] for more information. Similar to the polarimetric signature, InSAR coherence provides information about physical scattering properties. In [85], interferometric volume decorrelation is used as a feature for forest/non-forest mapping, together with radar backscatter and the incidence angle. The authors used bistatic TanDEM-X data, where temporal decorrelation can be neglected. They compared different architectures and concluded that CNNs outperform random forests and that U-Net proved best for this segmentation task.
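The sketch below illustrates the complex convolution idea at the heart of CV-CNNs by representing a complex kernel with two real-valued convolutions; the class name, layer interface, and the bias-free choice are illustrative assumptions, not the exact design of [76].

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """A complex kernel (W_r + j W_i) applied to a complex input
    (x_r + j x_i) expands into four real convolutions by the
    distributive law. Bias is disabled so the expansion stays exact;
    a complex bias could be added separately."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)

    def forward(self, x_r, x_i):
        # (W_r + jW_i)*(x_r + jx_i) = (W_r*x_r - W_i*x_i) + j(W_r*x_i + W_i*x_r)
        real = self.conv_r(x_r) - self.conv_i(x_i)
        imag = self.conv_r(x_i) + self.conv_i(x_r)
        return real, imag
```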

To summarize, it is apparent that deep learning-based SAR and PolSAR classification algorithms have advanced considerably in the past few years. While the early focus was on low-rank representation learning using SAEs [58] and their modifications [61], later research has addressed a multitude of issues specific to SAR imagery, such as accounting for speckle [61], [59], preserving spatial structures [63], and handling the complex nature of the data [76], [77], [78]. It can also be seen that the scarcity of labeled data has driven researchers to use semi-supervised learning algorithms [79]. Finally, AutoML, an important field of machine learning that had not been exploited extensively by the remote sensing community, has found an application in PolSAR image classification [75].

B. Object Detection

Although various characteristics distinguish SAR images from optical RGB images, the SAR object detection problem is still analogous to optical image classification and segmentation in the sense that feature extraction from raw data is always the first and crucial step. Hence, given the success in the optical domain, there is no doubt that deep learning is one of the most promising ways to develop state-of-the-art SAR object detection algorithms. The majority of earlier works on SAR object detection using deep learning consists of taking successful deep learning methods for optical object detection and applying them, with minor tweaks, to military vehicle detection (MSTAR dataset; see subsection V-C) or to ship detection on custom datasets. Even small-sized networks are easily able to achieve more than 90% test accuracy on most of these tasks. The first attempt at military vehicle detection can be found in [7], where Chen et al. used an unsupervised sparse autoencoder to generate convolution kernels from random patches of a given input for a single-layer CNN, which generated features to train a softmax classifier for classifying military targets in the MSTAR dataset [87]. The experiments in [7] showed great potential for applying CNNs to SAR target recognition. With this discovery, Chen et al. [88] proposed A-ConvNets, a simple 5-layer CNN that was able to achieve state-of-the-art accuracy of about 99% on the MSTAR dataset.
Following this trend, more and more authors applied CNNs to the MSTAR dataset [89], [90], [91]. Morgan [89] successfully applied a modestly sized 3-layer CNN to MSTAR, and building upon it, Wilmanski et al. [92] investigated the effects of initialization and optimizer selection on the final results. Ding et al. [90] investigated the capabilities of a CNN model combined with domain-specific data augmentation techniques (e.g., pose synthesis and speckle adding) in SAR object detection. Furthermore, Du et al. [91] proposed a displacement- and rotation-insensitive CNN, and claimed that data augmentation on the training samples is necessary and critical in the preprocessing stage. On the same dataset, instead of treating the CNN as an end-to-end model, Wagner [93] and similarly Gao [94] integrated CNN and SVM, by first using a CNN to extract features and then feeding them to an SVM for final prediction. Specifically, Gao et al. [95] added a class separation term to the cross-entropy cost function as a regularization term, which they show explicitly facilitates intra-class compactness and separability, in turn improving the quality of the extracted features. More recently, Furukawa [96] proposed VersNet, an encoder-decoder segmentation network, to not only identify but also localize multiple objects in an input SAR image. Moreover, Zhang et al. [86] proposed an approach based on multi-aspect image sequences as a pre-processing step. In this contribution, backscattering signals from different viewing geometries are taken into account, followed by feature extraction using Gabor filters and dimensionality reduction, with the results eventually fed to a bidirectional LSTM model for joint recognition of targets. The flowchart of this SAR ATR framework is illustrated in Fig. 4. Besides military vehicle detection, ship detection is another frequently tackled SAR object detection task.

Early studies on applying deep learning models to ship detection [97], [98], [99], [100], [101] mainly consist of two stages: first, cropping patches from the whole SAR image, and then identifying whether the cropped patches belong to target objects using a CNN. Because of the fixed patch sizes, these methods were not robust enough to cater for variations in ship geometry, like size and shape. This problem was overcome by using region-based CNNs [102], [103], with creative use of skip connections and feature fusion techniques in later literature. For example, Li et al. [104] fuse the features of the last three convolution layers before feeding them to a region proposal network (RPN). Kang et al. [105] proposed a contextual region-based network that fuses features from different levels. Meanwhile, to make the most of features at different resolutions, Jiao et al. [106] densely connected each layer to its subsequent layers and fed the features from all layers to separate RPNs to generate proposals; in the end, the best proposal was chosen based on an intersection-over-union (IoU) score (see the sketch at the end of this subsection). In more recent works on SAR object detection, scientists have explored many other interesting ideas to complement current works. Dechesne et al. [107] proposed a multi-task network that simultaneously learned to detect, classify, and estimate the length of ships. Mullissa et al. [108] showed that CNNs can be trained directly on complex-valued SAR data; Kazemi et al. [109] performed object classification using an RNN-based architecture directly on received SAR signals instead of processed SAR images; and Rostami et al. [110] and Huang et al. [111] explored knowledge transfer, or transfer learning, from other domains to the SAR domain for SAR object detection. Perhaps one of the most interesting recent works in this application area is building detection by Shahzad et al. [112]. They tackle the problem of very high resolution (VHR) SAR building detection using an FCN [113] architecture for feature extraction, followed by CRF-RNN [114], which helps assign similar weights to neighboring pixels. This architecture produced building segmentation masks with up to 93% accuracy. An example of the detected buildings can be seen in Fig. 5, where the left subfigure is the amplitude of the input TerraSAR-X image of Berlin, and the right subfigure is the predicted building mask. Another major contribution of that paper addresses the lack of training data by introducing an automatic annotation technique, which annotates the TomoSAR data using Open Street Map (OSM) data. In summary, deep learning faces challenges on two fronts when applied to SAR object detection tasks. The first is the challenge of handling the characteristics of SAR imagery, such as imaging geometry, size of objects, and speckle noise. The second and bigger challenge is the lack of good-quality standardized datasets. As we observed, the most popular dataset, MSTAR, is too easy for deep networks, and for ship detection the majority of authors created their own datasets, which makes it very hard to judge the quality of the proposed algorithms and even harder to compare them.
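For reference, the intersection-over-union score used above for ranking proposals can be computed in a few lines; the corner-based box format is an assumption for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-10)
```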


C. Parameter Inversion

Parameter inversion from SAR images is a challenging field in SAR applications. As one important branch, ice concentration estimation is now attracting great attention due to its importance for ice monitoring and climate research [115]. Since there are complex interactions between SAR signals and sea ice [116], empirical algorithms face difficulties in interpreting SAR images for accurate ice concentration estimation. Wang et al. [8] resorted to a CNN for generating ice concentration maps from dual-polarized SAR images. Their method takes image patches of the intensity-scaled dual-band SAR images as input, and outputs the ice concentration directly. In [117], [118], Wang et al. employed various CNN models to estimate ice concentration from SAR images during the melt season. Labels were produced by ice experts via visual interpretation. The algorithm was tested on dual-pol RadarSat-2 data. Since the problem considered is the regression of a continuous value, mean squared error is selected as the loss function (a minimal sketch of such a patch-wise regression setup is given at the end of this subsection). Experimental results demonstrate that CNNs can offer more accurate results than comparable operational products. In a different application, Song et al. used a deep CNN, comprising five pairs of convolutional and max-pooling layers followed by two fully connected layers, for inverting rough surface parameters from SAR images [121]. The training of the network was based solely on simulated data due to the scarcity of real training data. The method was able to invert the desired parameters with reasonable accuracy, and the authors showed that training a CNN for parameter inversion purposes can be done quite efficiently. Furthermore, Zhao et al. [122] designed a complex-valued CNN to directly learn physical scattering signatures from PolSAR images. The authors notably proposed a framework to automatically generate labeled data, which led to an unsupervised learning algorithm for the aforementioned parameter inversion. On the whole, deep learning-based parameter estimation for SAR applications has not yet been fully exploited. Most of the focus of the remote sensing community has so far been devoted to classical problems that overlap with computer vision tasks, such as classification, object detection, segmentation, and denoising. We hope that in the future more studies will be carried out employing deep learning methods for geophysical and other parameter inversion tasks using SAR data.
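The patch-wise regression setup referenced above can be sketched as follows; the network depth, patch size, and the sigmoid output for a concentration in [0, 1] are illustrative assumptions, not the exact models of [8], [117], [118].

```python
import torch
import torch.nn as nn

# A small CNN mapping a dual-pol SAR patch to one continuous value,
# trained with mean squared error (layer sizes are illustrative).
model = nn.Sequential(
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid())  # concentration in [0, 1]

patches = torch.randn(8, 2, 64, 64)  # dummy dual-pol intensity patches
labels = torch.rand(8, 1)            # expert-labeled ice concentration
loss = nn.functional.mse_loss(model(patches), labels)
```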

D. Despeckling

Speckle, caused by the coherent interaction among scattered signals from sub-resolution objects, often makes the processing and interpretation of SAR images difficult. Therefore, despeckling is a crucial procedure before applying SAR images to various tasks. Conventional methods aim at removing speckle either spatially, using local spatial filters such as the Lee filter [123], the Kuan filter [124], and the Frost filter [125], or using wavelet-based methods [126], [127], [128]. For a full overview of these techniques, the reader is referred to [129]. In the past decade, patch-based methods for speckle reduction have gained high popularity due to their ability to preserve spatial features without sacrificing image resolution [130]. Deledalle et al. [131] proposed one of the first nonlocal patch-based methods applied to speckle reduction, taking into account the statistical properties of speckle combined with the original nonlocal image denoising algorithm introduced in [132]. A vast number of variations of the nonlocal method for SAR despeckling have been proposed, with the most notable ones included in [133], [134]. However, on one hand, the manual selection of appropriate parameters for conventional algorithms is not easy and is sensitive to the reference images. On the other hand, it is difficult for empirical despeckling methods to achieve a balance between preserving distinct image features and removing artifacts. To overcome these limitations, methods based on deep learning have been developed.

Inspired by the success of image denoising using a residual learning network architecture in the computer vision community [135], Chierchia et al. [52] first introduced a residual learning CNN for SAR image despeckling, presenting a 17-layer CNN that learns to subtract speckle components from noisy images. Considering that speckle noise is assumed to be multiplicative, a homomorphic approach with coupled log- and exp-transformations is applied before and after feeding images to the network. In this way, multiplicative speckle noise is transformed into an additive form and can be recovered by residual learning, where the log-speckle noise is regarded as the residual. As shown in Fig. 6, an input log-noisy image is mapped identically to a fusion layer via a shortcut connection, and then added element-wise to the learned residual image to produce a log-clean image. Afterwards, denoised images are obtained by an exp-transformation. Wang et al. [9] proposed a CNN, called ID-CNN, for image despeckling, which can directly learn denoised images via a component-wise division-residual layer with skip connections. In other words, no homomorphic processing is introduced for transforming the multiplicative noise into additive noise; instead, at the final stage, the noisy image is divided by the learned noise to yield the clean image.
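A condensed sketch of the homomorphic residual scheme of [52] is given below; the reduced depth and channel count are illustrative simplifications of the original 17-layer network.

```python
import torch
import torch.nn as nn

class ResidualDespeckler(nn.Module):
    """The network predicts the log-speckle component, which is subtracted
    from the log-noisy image before transforming back with exp."""
    def __init__(self, depth=5, channels=48):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        log_noisy = torch.log(noisy + 1e-10)    # multiplicative -> additive noise
        residual = self.body(log_noisy)         # learned log-speckle component
        return torch.exp(log_noisy - residual)  # back to the intensity domain
```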

As a step forward with respect to the two aforementioned residual-based learning methods, Zhang et al. [136] employed a dilated residual network, SAR-DRN, instead of simply stacking convolutional layers. Unlike [52] and similar to [9], SAR-DRN is trained in an end-to-end fashion using a combination of dilated convolutions and skip connections within a residual learning structure, which means that prior knowledge, such as a noise description model, is not required in the workflow. In [137], Yue et al. proposed a novel deep neural network architecture specifically designed for SAR despeckling. It uses a convolutional neural network to extract image features and reconstruct a discrete radar cross section (RCS) probability density function (PDF). It is trained by a hybrid loss function that measures the distance between the actual SAR image intensity PDF and the estimated one, which is derived from the convolution between the reconstructed RCS PDF and the prior speckle PDF. Experimental results demonstrated that the proposed despeckling neural network can achieve performance comparable to non-learning state-of-the-art methods. In [49], the problem of despeckling was tackled using a time series of images. Using a stack of images for despeckling is not unique to deep learning-based methods, as has been recently demonstrated in [138] as well. In [49], the authors utilized a multi-layer perceptron with several hidden layers to learn the non-linear intensity characteristics of training image patches. This approach has shown promising results and reported performance comparable with state-of-the-art despeckling algorithms. Again using single images instead of time series, the authors of [139] proposed a deep encoder-decoder CNN architecture with a focus on feature preservation, a known weakness of CNNs. They modified U-Net [32] in order to accommodate speckle statistical features. Another notable CNN approach was introduced in [120], where the authors used a nonlocal structure, while the weights for the pixel-wise similarity measures were assigned using a CNN. The results of this approach, called CNN-NLM, are reported in Fig. 7, where the superiority of the method with respect to both feature preservation and speckle reduction is clearly observed.

From the deep learning-based despeckling methods reviewed in this subsection, it can be observed that most methods employ CNN-based architectures trained on single images of the scene; they either output the clean image in an end-to-end fashion or propose residual-based techniques to learn the underlying noise model. With the availability of large archives of time series thanks to the Sentinel-1 mission, an interesting direction is to exploit the temporal correlation of speckle characteristics for despeckling applications. Another problem of supervised deep learning-based despeckling techniques is the lack of ground truth data. In many studies, the training data set is built by corrupting optical images with multiplicative noise, which is far from realistic for despeckling applied to real SAR data. Therefore, despeckling in an unsupervised manner would be highly desirable and worth attention.

E. InSAR

Interferometric SAR (InSAR) is one of the most important SAR techniques and is widely used for reconstructing the topography of the Earth's surface, i.e., digital elevation model (DEM) generation [140], [141], [56], and for detecting topographic displacements, e.g., monitoring volcanic eruptions [142], [143], [144], earthquakes [145], [146], land subsidence [147], and urban areas using time series methods [148], [149], [150]. The principle of InSAR is to first measure the interferometric phase between signals received by two antennas located at different positions and then extract topographic information from the obtained interferogram by unwrapping and converting the absolute phase to height. However, an actual interferogram often suffers from a large number of singular points, which originate from interference distortion and noise in the radar measurements. These points result in unwrapping errors and consequently low-quality DEMs. To tackle this problem, Ichikawa and Hirose [151] applied a complex-valued neural network (CVNN) in the spectral domain to restore singular points. With the help of the complex Markov random field (CMRF) filter [152], they aimed at learning the ideal relationship between the spectrum of neighboring pixels and that of the center pixels via a one-hidden-layer CVNN. Notably, the center pixels of each training sample are supposed to be ideal points, which means that singular points are not fed to the network during the training procedure. Similarly, Oyama and Hirose [153] restored singular points with a CVNN in the spectral domain. Related to topography extraction, Costante et al. [155] proposed a fully convolutional encoder-decoder architecture for estimating DEMs from single-pass image acquisitions. It is demonstrated that this model is capable of extracting high-level features from input radar images using an encoder section and then reconstructing a full-resolution DEM via a decoder section. Moreover, the network can potentially resolve the layover phenomenon in a single-look SAR image using contextual features. In addition to reconstructing DEMs, Schwegmann et al. [156] presented a CNN-based technique to detect subsidence deformations from interferograms. They employed a 9-layer network to extract salient information from interferograms and displacement maps in order to discriminate deformation targets from deformation-like targets. Furthermore, Anantrasirichai et al. [10], [157], [158] used a pre-trained CNN to automatically detect volcanic ground deformation from InSAR images. They divided each image into patches, relabeled them with binary labels, i.e., "background" and "volcano", and finally fed them to the network to predict volcano deformation. They further improved their method to be able to detect slow-moving volcanoes by using a time series of interferograms [159]. In another study related to automatic volcanic deformation detection, Valade et al. [154] designed and trained a CNN from scratch to learn a decorrelation mask from input wrapped interferograms, which was then used to detect volcanic ground deformation. The flowchart of this approach can be seen in Fig. 8. The training in both of the aforementioned works [159], [154] was based on simulated data. Another geophysically motivated example of using deep learning on InSAR data, which was actually proposed earlier than the above-mentioned CNN-based studies, can be found in [160], [161], [162], where the authors used simple feed-forward shallow neural networks for seismic event characterization and automatic seismic source
parameter inversion by exploiting the power of neural networks in solving non-linear problems.