【YOLT】《You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery》

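This post walks through an efficient satellite-imagery object detection algorithm that builds on YOLO, optimized mainly for high-resolution inputs and densely packed small objects.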

Paper: https://arxiv.org/pdf/1805.09512.pdf
Code: https://github.com/CosmiQ/yolt

1 Motivation

Detection of small objects in large swaths of imagery is one of the primary problems in satellite imagery analytics.

While object detection in ground-based imagery has benefited from research into new deep learning approaches, transitioning such technology to overhead imagery is nontrivial.

1.1 Challenges

Sheer number of pixels per image: over 250 million pixels
Geographic extent per image: > $64\ \mathrm{km}^{2}$
Objects of interest are minuscule: about 10 pixels in extent

1.2 Notable frameworks

Faster R-CNN: typically ingests 1000 × 600 pixel images
SSD: 300 × 300 or 512 × 512
YOLO: 416 × 416 or 544 × 544

However, none of these can come remotely close to ingesting the ~16,000 × 16,000 input sizes typical of satellite imagery.

Because of YOLO's speed, accuracy, and flexibility, the authors build their framework on YOLO.

2 Author’s algorithm

Excluding implementation details, algorithms must adjust for:

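Small spatial extent (objects are tiny): objects are small and densely clustered. In satellite imagery the objects of interest are small relative to the image and are often packed together, quite unlike the large, prominent objects in ImageNet. Object resolution is governed mainly by the ground sample distance (GSD), which defines the physical length covered by one pixel. Satellites typically operate at an altitude of around 350 km; the sharpest commercial satellite imagery reaches 30 cm GSD (30 cm per pixel), whereas ordinary digital satellite imagery only reaches 3-4 m. Small objects such as cars and boats may therefore be described by only a little over 10 pixels.
Complete rotation invariance: objects in satellite imagery can face in any direction, whereas objects in ImageNet are mostly upright, so the detector needs to be rotation invariant.
Training example frequency (few training samples): training data is scarce. High-quality labeled satellite imagery is lacking; SpaceNet has done a good deal of useful work here, but further improvement is still needed.
Ultra high resolution (images are huge): unlike the small inputs that detectors usually take, satellite images routinely run to hundreds of millions of pixels, so simply downsampling the image does not work.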

The paper's contribution is to address each of these issues separately.

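Terminology:

Ground sample distance (GSD)
The real-world size represented by one pixel of a satellite image. For example, 30 cm GSD means that one pixel in the image corresponds to 30 cm on the ground.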

Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery to 3-4 meter GSD for Planet imagery.

That is to say, for small objects such as cars, each object will be only ~15 pixels in extent even at the highest resolution.
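The relation between GSD and object size in pixels is a single division, sketched below. The 4.5 m car length is an assumed, illustrative value and not a figure from the paper; `pixel_extent` is a hypothetical helper name.

```python
def pixel_extent(object_size_m: float, gsd_m_per_px: float) -> float:
    """Return how many pixels an object of the given physical size spans."""
    if gsd_m_per_px <= 0:
        raise ValueError("GSD must be positive")
    return object_size_m / gsd_m_per_px


# Assumed, illustrative car length (not taken from the paper).
CAR_LENGTH_M = 4.5

# 0.30 m/px is the sharpest commercial GSD quoted above; 3.0 m/px is the
# lower end of the 3-4 m range.
for gsd in (0.30, 3.0):
    extent = pixel_extent(CAR_LENGTH_M, gsd)
    print(f"GSD {gsd:.2f} m/px -> car spans about {extent:.1f} px")
```

Under these assumptions a car covers roughly 15 pixels at 30 cm GSD and less than 2 pixels at 3 m GSD.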

3 Advantage

The proposed approach can rapidly detect objects of vastly different scales with relatively little training data over multiple sensors.

4 Method

Left: Model applied to a large 4000 × 4000 pixel test image downsampled to a size of 416 × 416; none of the 1142 cars in this image are detected (the small objects vanish after downsampling).

Right: Model applied to a small 416 × 416 pixel cutout; the excessive false negative rate is due to the high density of cars that cannot be differentiated by the 13 × 13 grid.
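A quick back-of-the-envelope check of why the downsampled image loses every car is sketched below. It assumes a car of about 15 pixels at full resolution (the figure quoted earlier) and a plain uniform resize; `downsampled_extent` is a hypothetical helper name.

```python
def downsampled_extent(object_px: float, src_size: int, dst_size: int) -> float:
    """Object extent in pixels after resizing a square image from src_size to dst_size."""
    if src_size <= 0 or dst_size <= 0:
        raise ValueError("image sizes must be positive")
    return object_px * dst_size / src_size


car_px = 15  # approximate car extent at 30 cm GSD
shrunk = downsampled_extent(car_px, src_size=4000, dst_size=416)
print(f"{car_px} px car -> {shrunk:.2f} px after a 4000 -> 416 resize")
```

After the resize a car covers fewer than two pixels, which leaves the detector almost nothing to work with.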

In short, the authors' method is:
Data augmentation + pre- and post-processing + a modified YOLO

4.1 Modified YOLO

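Baseline YOLO:
【YOLOv1】《You Only Look Once: Unified, Real-Time Object Detection》

Input: 416 × 416 (this input size and the 13 × 13 grid below are the YOLOv2 defaults; YOLOv1 uses a 448 × 448 input and a 7 × 7 grid)
Consider the default YOLO network architecture, which downsamples by a factor of 32 and returns a 13 × 13 prediction grid.
As a result, objects whose centroids are separated by less than 32 pixels are hard to tell apart.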

The authors' approach

Reduce the downsampling factor and use a 22-layer network:
we implement a network architecture that uses 22 layers and downsamples by a factor of 16. Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid.
Use Leaky ReLU as the activation function
Add a passthrough layer
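The grid sizes quoted above follow directly from dividing the input size by the downsampling factor. A minimal sketch (`prediction_grid` is a hypothetical helper name):

```python
def prediction_grid(input_size: int, downsample_factor: int) -> int:
    """Side length of the square prediction grid for a square input image."""
    if input_size % downsample_factor != 0:
        raise ValueError("input size must be a multiple of the downsampling factor")
    return input_size // downsample_factor


for factor in (32, 16):
    cells = prediction_grid(416, factor)
    print(f"downsample x{factor}: {cells} x {cells} grid, "
          f"each cell covers {factor} x {factor} input pixels")
```

Halving the downsampling factor quadruples the number of grid cells, so each cell has to account for a quarter of the area and fewer neighbouring cars end up in the same cell.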

The authors summarize their modifications to YOLO in the paper as follows.

4.1.1 Leaky ReLUs

Activation functions: ReLU, Leaky ReLU, PReLU, and RReLU
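For reference, Leaky ReLU differs from plain ReLU only in letting a small fraction of negative inputs through. The sketch below uses a negative slope of 0.1, which is the value Darknet uses for its leaky activation; treat it as an assumption here, since these notes do not state the slope.

```python
def relu(x: float) -> float:
    """Plain ReLU: negative inputs are clamped to zero."""
    return x if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Leaky ReLU: negative inputs are scaled by a small slope instead of zeroed."""
    return x if x > 0 else negative_slope * x


for value in (-2.0, 0.0, 3.0):
    print(f"x = {value:+.1f}  relu = {relu(value):+.2f}  leaky_relu = {leaky_relu(value):+.2f}")
```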


4.1.2 passthrough layer

Reference: YOLO v2 summary (linux+windows)

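This layer regroups neighbouring pixels of the previous feature map into extra channels. For example, a 26 × 26 × 512 feature map becomes a 13 × 13 × 2048 feature map. The exact implementation is in the authors' source code, but the transformation can be pictured as distributing the pixels of one 26 × 26 map over four 13 × 13 maps: take every second pixel horizontally and every second pixel vertically, which gives 2 × 2 = 4 maps per channel and 512 × 4 = 2048 channels in total. The number of feature maps grows by a factor of 4 and no activations are discarded, so the fine-grained detail of the 26 × 26 map is carried into the 13 × 13 map, which helps with detecting small objects.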
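The regrouping described above is a space-to-depth operation. The following is a minimal pure-Python sketch on nested lists, not the authors' implementation: `passthrough` is a hypothetical name, and the channel ordering produced by Darknet's reorg layer may differ from the simple ordering used here.

```python
def passthrough(fmap, stride=2):
    """Space-to-depth on a nested-list feature map.

    Input shape:  (H, W, C)
    Output shape: (H / stride, W / stride, C * stride * stride)
    """
    height, width = len(fmap), len(fmap[0])
    if height % stride or width % stride:
        raise ValueError("height and width must be divisible by stride")
    out = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            channels = []
            # Concatenate the channel vectors of the stride x stride neighbours.
            for di in range(stride):
                for dj in range(stride):
                    channels.extend(fmap[i + di][j + dj])
            row.append(channels)
        out.append(row)
    return out


# Tiny 4 x 4 x 1 example: each value encodes its (row, col) position as 10*row + col.
fmap = [[[10 * r + c] for c in range(4)] for r in range(4)]
out = passthrough(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # spatial size halves, channels x4
print(out[0][0])  # the four values of the top-left 2 x 2 block
```

Every input value appears exactly once in the output, which is the sense in which the layer keeps the fine-grained information while halving the spatial size.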

The network structure is shown in the figure below.

The red line marks the passthrough layer.

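$N_{f} = N_{boxes} \times (N_{class} + 5)$

$N_{f}$ is the number of filters in the final convolutional layer, $N_{boxes}$ is the number of boxes per grid cell (default is 5), and $N_{class}$ is the number of object classes.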
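As a worked example of the formula, the sketch below evaluates it for two class counts. The "+ 5" corresponds to the four box coordinates plus one confidence score predicted per box; `final_layer_filters` is a hypothetical helper name and the class counts are illustrative.

```python
def final_layer_filters(n_boxes: int, n_classes: int) -> int:
    """Number of output filters: each box predicts 4 coordinates, 1 confidence, and class scores."""
    if n_boxes <= 0 or n_classes <= 0:
        raise ValueError("n_boxes and n_classes must be positive")
    return n_boxes * (n_classes + 5)


# Illustrative class counts with the default of 5 boxes per grid cell.
for n_classes in (1, 5):
    print(f"{n_classes} class(es): N_f = {final_layer_filters(5, n_classes)}")
```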

4.2 pre-processing and post-processing

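Pre-processing: large images are split into many cutouts, with 15% overlap between neighbouring cutouts.
Post-processing: at test time the detections from the individual cutouts are merged back together with NMS (non-maximum suppression).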
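The two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the IoU threshold of 0.5 is an assumed value, and the detections are made-up boxes used only to show how a car seen in two overlapping cutouts collapses into one.

```python
from typing import List, Tuple

# x_min, y_min, x_max, y_max, score
Box = Tuple[float, float, float, float, float]


def tile_origins(length: int, window: int, overlap: float = 0.15) -> List[int]:
    """Start offsets of sliding windows covering [0, length) with the given overlap."""
    if length <= window:
        return [0]
    step = max(1, int(window * (1 - overlap)))
    origins = list(range(0, length - window, step))
    origins.append(length - window)  # last window sits flush with the image edge
    return origins


def to_global(box: Box, x0: int, y0: int) -> Box:
    """Shift a box from cutout coordinates into full-image coordinates."""
    x1, y1, x2, y2, score = box
    return (x1 + x0, y1 + y0, x2 + x0, y2 + y0, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


origins = tile_origins(1000, 416)  # window offsets along one axis of a 1000 px image
# Made-up detections: the same car seen in two overlapping cutouts, plus a distinct car.
detections = [
    to_global((360.0, 100.0, 375.0, 108.0, 0.9), origins[0], 0),
    to_global((8.0, 100.0, 22.0, 108.0, 0.8), origins[1], 0),
    to_global((50.0, 50.0, 65.0, 58.0, 0.7), origins[2], 0),
]
merged = nms(detections)
print(origins)
print(len(detections), "detections ->", len(merged), "after NMS")
```

The overlap exists so that an object cut by one window border is fully visible in the neighbouring window; NMS then removes the duplicate detections that the overlap produces.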

5 Dataset

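Cars: the COWC dataset, which has a GSD of 15 cm. To match the 30 cm GSD of current commercial satellite imagery, the images are processed with a Gaussian kernel, and a 3 m bounding box is labeled around each car at 30 cm GSD, for 13,303 samples in total.
Building footprints: 221,336 samples labeled from SpaceNet data at 30 cm GSD.
Airplanes: 230 samples labeled in eight DigitalGlobe images.
Boats: 556 samples labeled in three DigitalGlobe images.
Airports: 37 images containing airport runways are used as training samples, downsampled by a factor of 4.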

Training uses an initial learning rate of $10^{-3}$, a weight decay of 0.0005, and a momentum of 0.9.
Training takes 2-3 days on a single NVIDIA Titan X GPU.
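To make the three hyperparameters concrete, the sketch below applies one textbook SGD-with-momentum step to a single scalar weight, with L2 weight decay added to the gradient. This is the generic update rule, assumed here for illustration; the exact update in the authors' Darknet-based code may differ in detail, and the weight and gradient values are made up.

```python
def sgd_step(weight: float, grad: float, velocity: float,
             lr: float = 1e-3, momentum: float = 0.9,
             weight_decay: float = 0.0005):
    """One SGD-with-momentum update for a scalar weight (generic textbook form)."""
    # Weight decay acts as an extra gradient term that pulls the weight toward zero.
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


# Made-up starting point: weight 1.0, gradient 0.5, no accumulated velocity.
w, v = sgd_step(weight=1.0, grad=0.5, velocity=0.0)
print(f"weight = {w:.7f}, velocity = {v:.7f}")
```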