YOLO V2 Overview

YOLO V1 in Detail

I. YOLO V2 Features

1. Trains jointly on the COCO object detection dataset and the ImageNet classification dataset, which lets the model detect object categories that have no detection labels (only classification labels).

2. Drops the fully connected layers that directly predict boxes; instead, convolutional layers predict offsets for anchor boxes.

3. Selects prior boxes by running k-means clustering on the training-set bounding boxes. The distance is not the usual Euclidean distance; instead it is computed as

d(box, centroid) = 1 − IOU(box, centroid)

The number of centroids is chosen as the best tradeoff between cluster count and mean IOU, which gives 5 centroids.

[Figure: mean IOU vs. number of clusters k, used to pick k = 5.]
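A minimal NumPy sketch of this clustering step, assuming boxes are represented by their (width, height) only; the function names and the empty-cluster fallback are my own, the paper only specifies the 1 − IOU distance:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IOU between (w, h) shapes assumed to share a common top-left corner,
    # so only box shape matters, not position.
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_priors(boxes, k=5, iters=100):
    # k-means over box shapes with d(box, centroid) = 1 - IOU(box, centroid).
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)   # (N, k) distance matrix
        assign = dist.argmin(axis=1)            # nearest centroid per box
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```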

4. The box center is still predicted relative to its grid cell, as in YOLO V1, but the width and height are obtained by scaling the anchor prior's width and height by a predicted factor, as shown in the figure below.

tx, ty, tw, th are the predicted localization outputs: tx and ty are passed through a sigmoid function, while tw and th are exponentiated.

[Figure: bounding-box prediction with dimension priors: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th]

The confidence is computed the same way as in V1.
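A small NumPy sketch of this decoding step, directly transcribing the equations above (variable names follow the figure; this is illustrative, not the author's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # (cx, cy): top-left offset of the grid cell; (pw, ph): anchor prior size.
    bx = sigmoid(tx) + cx          # center stays inside its grid cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)           # width/height scale the anchor prior
    bh = ph * np.exp(th)
    return bx, by, bw, bh
```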

5. The model trains on multiple image scales. Instead of a fixed input size, every 10 batches V2 randomly picks a size from {320, 352, …, 608} and resizes the network before training continues.
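A sketch of that multi-scale schedule, assuming a fully convolutional network that can be resized on the fly (the helper name is my own):

```python
import random

SIZES = list(range(320, 609, 32))   # {320, 352, ..., 608}, multiples of 32

def pick_input_size(batch_idx, current_size):
    # Every 10 batches, sample a new input resolution at random.
    if batch_idx % 10 == 0:
        return random.choice(SIZES)
    return current_size
```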

6. Uses Darknet-19, which, like VGG, stacks mostly 3×3 convolutions and always doubles the number of channels after pooling.

7. Replaces the fully connected layers with global average pooling.

8. Uses batch normalization to stabilize training, speed up convergence, and regularize the model.
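Items 6–8 combine into one building pattern. A partial PyTorch sketch (layer counts here are illustrative, not the full 19-layer network):

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k):
    # Darknet-19 basic block: conv (no bias) + batch norm + leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Channels double after every 2x2 max-pool, as in item 6.
stem = nn.Sequential(
    conv_bn(3, 32, 3), nn.MaxPool2d(2),
    conv_bn(32, 64, 3), nn.MaxPool2d(2),
    conv_bn(64, 128, 3),
)

# Item 7: a 1x1 conv to class scores plus global average pooling
# replaces the fully connected classifier head.
head = nn.Sequential(
    nn.Conv2d(128, 1000, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
```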

II. Training Strategy

1. Training for classification: We train the network on the standard ImageNet 1000-class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.

As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 0.001.
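One common reading of "polynomial rate decay with a power of 4" is the schedule below; the paper does not spell out the exact formula, so treat this as an assumption:

```python
def poly_lr(base_lr, epoch, total_epochs=160, power=4.0):
    # lr = base_lr * (1 - epoch/total)^power, decaying to 0 at the end.
    return base_lr * (1.0 - epoch / total_epochs) ** power

# Classification pretraining: base_lr = 0.1 over 160 epochs.
schedule = [poly_lr(0.1, e) for e in range(160)]
```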

2. Training for detection: We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
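The 125-filter count follows directly from the box layout; a minimal PyTorch sketch of the final layer (the 1024 input channels match the 3 × 3 × 1024 layers just described):

```python
import torch.nn as nn

num_anchors, num_classes = 5, 20                 # VOC settings
out_filters = num_anchors * (5 + num_classes)    # 5 coords per box -> 125

det_head = nn.Conv2d(1024, out_filters, kernel_size=1)
```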

We train the network for 160 epochs with a starting learning rate of 10⁻³, dividing it by 10 at 60 and 90 epochs. We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.
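The detection schedule is a plain step decay; a sketch:

```python
def detection_lr(epoch, base_lr=1e-3):
    # Divide the learning rate by 10 at epochs 60 and 90.
    if epoch >= 90:
        return base_lr / 100
    if epoch >= 60:
        return base_lr / 10
    return base_lr
```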

3. During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classification-specific parts of the architecture.
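A sketch of that dispatch rule; `detection_loss` and `classification_loss` are hypothetical stand-ins for the corresponding parts of the YOLOv2 objective:

```python
def joint_step(pred, sample, detection_loss, classification_loss):
    # Detection-set images carry box annotations and get the full loss;
    # classification-set images backpropagate only classification terms.
    if sample.get("boxes") is not None:
        return detection_loss(pred, sample)
    return classification_loss(pred, sample)
```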

4. Training on the detection and classification datasets at the same time runs into the problem that the labels are not mutually exclusive. In the detection data every dog is labeled simply "dog", while the classification data has fine-grained dog labels such as Husky and German Shepherd, yet softmax assumes the classes are mutually exclusive.

Simply switching to a multi-label formulation would discard information the datasets do provide: COCO's own labels are mutually exclusive (one sample never carries two COCO labels), and a multi-label model ignores that exclusivity, so a single sample could be predicted with several COCO labels at once.

In the end the authors build a WordTree from WordNet to solve this, computing a softmax only over the siblings under each node of the tree.
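On the WordTree, a leaf's probability is the product of the conditional probabilities along its path from the root, where each softmax runs only over a node's siblings. A NumPy sketch with an illustrative data layout (the index structure is my own; the paper specifies only the sibling-softmax idea):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def wordtree_prob(logits, path, siblings):
    # path: node indices from root to leaf, e.g. [animal, dog, husky]
    # siblings[n]: index group over which node n competes in its softmax
    p = 1.0
    for node in path:
        group = siblings[node]
        p *= softmax(logits[group])[group.index(node)]
    return p
```

For example, P(Husky) = P(Husky | dog) · P(dog | animal) · …, which is how coarse detection labels and fine-grained classification labels can coexist on one tree.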