ScratchDet: Exploring to train single-shot object detectors from scratch

  1. the domain gap between source and target datasets
  2. the learning objective bias between classification and detection
  3. the architecture limitations of the classification network for detection


We study the impact of BatchNorm on training detectors from scratch, and find that using BatchNorm on the backbone and detection head subnetworks makes the detector converge well from scratch.


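Using pretrained models for object detection has serious limitations. First, fine-tuning can be regarded as a transfer learning problem, in which it is difficult to fill the domain gap perfectly between the source dataset and the target dataset. Second, classification and detection tasks have different degrees of sensitivity to translation. Classification favors translation invariance, so it needs downsampling operations (e.g., max pooling and stride-2 convolution) for better performance. In contrast, local texture information is more critical for object detection, so translation-invariant operations (e.g., downsampling) should be used with caution. Third, it is inconvenient to change the model architecture.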

ResNet/VGGNet + SSD

BN reparameterizes the optimization problem to make its landscape significantly smoother instead of reducing the internal covariate shift. BN helps the detector converge well without adapting a pretrained-model-based detector. An analysis of ResNet- and VGGNet-based SSD shows that the sampling stride of the first convolution layer has a relatively large impact on performance. A root block is introduced.

2. Related work

Object detectors with pretrained network.

Train-from-scratch object detectors



BatchNorm for SSD Trained from Scratch

DSOD uses DenseNet and did not identify the important role of BN.

BN on the backbone subnetwork

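We add BN on each conv layer in the backbone subnetwork and then train it from scratch. We can then use a relatively large learning rate (0.01 or 0.05) to further improve performance, from 72.5% to 77.8% and 78%, compared with 77.2% for the pretrained model. This further shows that adding BN to the backbone subnetwork is an important measure for improving SSD trained from scratch.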
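As a reminder of what BN computes on each conv output, here is a minimal pure-Python sketch of the training-mode transform for a single channel. The function name `batch_norm`, the sample activations, and the default `gamma`, `beta`, and `eps` values are illustrative assumptions, not taken from the paper.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode BN for one channel: normalize over the batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    # Biased (population) variance, which is what BN uses during training.
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical activations with a wide range, as might occur early in training.
activations = [0.5, 120.0, -30.0, 64.0]
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

Whatever the scale of the incoming activations, the output has roughly zero mean and unit variance before `gamma` and `beta` are applied, which is one intuition for why a larger learning rate becomes usable.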

BN on the detection head subnetwork

These results are very useful to explain the phenomenon that using large learning rate to train SSD with the original architecture from scratch or pretrained networks usually leads to gradient explosion, poor stability and weak prediction of gradients.

BN in the whole network

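With BN used in both parts (backbone and detection head) together with a relatively large learning rate, the SSD trained from scratch reaches higher accuracy than the pretrained SSD.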


Backbone Network redesign

Performance analysis of ResNet and VGGNet


We argue that this phenomenon is attributed to the downsampling operation in the first convolution layer (i.e., conv1_x with stride 2) of ResNet-101, which cuts off half of the raw image information. This operation significantly affects the detection accuracy, especially for small objects.


In summary, the downsampling operation in the first convolution layer has a bad impact on the detection accuracy, especially for small objects.
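The effect of the first-layer stride on resolution can be checked with the standard convolution output-size formula. The sketch below assumes a 300 × 300 input and the usual ResNet first layer (7 × 7 kernel, padding 3); the input size and the helper name `conv_out_size` are assumptions for illustration.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

input_size = 300  # assumed input resolution

# First conv layer with the original stride 2 versus stride 1.
out_stride2 = conv_out_size(input_size, kernel=7, stride=2, padding=3)
out_stride1 = conv_out_size(input_size, kernel=7, stride=1, padding=3)

# Ratio of spatial positions kept on the first feature map.
position_ratio = (out_stride1 ** 2) // (out_stride2 ** 2)

print(out_stride2, out_stride1, position_ratio)
```

With stride 2 each side is halved, so the first feature map has a quarter of the spatial positions. Under these assumptions, an object 16 pixels wide covers about 8 positions instead of 16 before any further downsampling.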

Backbone network redesign for object detection


We remove the downsampling operation (i.e., change the stride from 2 to 1) in the first conv layer and replace the 7 × 7 convolution kernel by a stack of several 3 × 3 convolution filters (denoted as the root block). With these improvements, Root-ResNet is able to exploit more local information from the image, so as to extract powerful features for small object detection.
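One way to see why a stack of 3 × 3 filters can stand in for the 7 × 7 kernel is to compare receptive fields. The sketch below assumes the root block uses three 3 × 3 convolutions with stride 1; the notes only say "several", so the count of three and the helper name `receptive_field` are assumptions.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, given as (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # stride multiplies the step between output positions
    return rf

single_7x7 = receptive_field([(7, 1)])
root_block = receptive_field([(3, 1)] * 3)  # assumed: three stacked 3x3 convs

# Weights per (input channel, output channel) pair, ignoring biases.
weights_7x7 = 7 * 7
weights_root = 3 * (3 * 3)

print(single_7x7, root_block, weights_7x7, weights_root)
```

Under this assumption the stack covers the same 7 × 7 window with fewer weights per channel pair and with extra nonlinearities between the layers.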

Furthermore, we replace the four convolution blocks (added by SSD to extract the feature maps with different scales) with four residual blocks appended to the end of the Root-ResNet. Each residual block is formed by two branches. One branch is a 1 × 1 convolution layer with stride 2 and the other one consists of a 3 × 3 convolution layer with stride 2 and a 3 × 3 convolution layer with stride 1. The number of output channels in each convolution layer is set to 128.
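For the two branches to be merged, they must produce the same spatial size. The shape check below is a sketch under several assumptions not stated in the notes: the branches are combined by element-wise addition as in a standard residual block, the 3 × 3 convolutions use padding 1, the 1 × 1 convolution uses padding 0, and the first extra block receives a hypothetical 38 × 38 feature map.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def residual_block_out_size(size):
    """Trace both branches of one extra residual block and check that they agree."""
    # Branch 1: 1x1 conv, stride 2 (padding 0 assumed).
    shortcut = conv_out_size(size, kernel=1, stride=2, padding=0)
    # Branch 2: 3x3 conv, stride 2, then 3x3 conv, stride 1 (padding 1 assumed).
    main = conv_out_size(size, kernel=3, stride=2, padding=1)
    main = conv_out_size(main, kernel=3, stride=1, padding=1)
    if shortcut != main:
        raise ValueError("branch outputs differ and cannot be added")
    return main

def extra_feature_map_sizes(size, num_blocks=4):
    """Spatial sizes after each of the stacked residual blocks."""
    sizes = []
    for _ in range(num_blocks):
        size = residual_block_out_size(size)
        sizes.append(size)
    return sizes

sizes = extra_feature_map_sizes(38)  # 38 is a hypothetical starting size
print(sizes)
```

With these padding choices both branches halve the resolution (rounding up), so the four blocks yield a pyramid of progressively smaller feature maps for multi-scale prediction.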


The larger the input size, the higher the mAP.

