论文翻译——YOLO9000: Better, Faster, Stronger

摘要

翻译:

我们介绍 YOLO9000,一个最先进的、实时的目标检测系统,可以检测超过 9000 个目标类别。首先,我们对 YOLO 检测方法提出了多项改进,既有新颖的想法,也有借鉴自先前工作的方法。改进后的模型 YOLOv2 在 PASCAL VOC 和 COCO 等标准检测任务上达到了最先进水平。借助一种新的多尺度训练方法,同一个 YOLOv2 模型可以在不同的输入尺寸下运行,从而在速度和精度之间轻松折衷。在 67 FPS 时,YOLOv2 在 VOC 2007 上取得 76.8 mAP;在 40 FPS 时取得 78.6 mAP,超过了使用 ResNet 的 Faster R-CNN 以及 SSD 等最先进方法,同时运行速度仍然明显更快。最后,我们提出了一种在目标检测和分类上联合训练的方法。利用该方法,我们在 COCO 检测数据集和 ImageNet 分类数据集上同时训练 YOLO9000。联合训练使 YOLO9000 能够对没有标注检测数据的目标类别给出检测预测。我们在 ImageNet 检测任务上验证了这一方法:尽管 200 个类中只有 44 个类有检测数据,YOLO9000 在 ImageNet 检测验证集上仍取得 19.7 mAP;在 COCO 中没有出现的 156 个类上取得 16.0 mAP。但 YOLO9000 能检测的远不止这 200 个类:它可以对超过 9000 个不同的目标类别给出检测预测,并且仍然实时运行。

 

We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP , outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don’t have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

1 引言

翻译:

通用目标检测应该是快速、准确的,能够识别各种各样的目标。自从神经网络的引入,检测框架变得越来越快和准确。然而,大多数检测方法仍然局限于一小组对象。

与分类和标记等其他任务相比,当前的目标检测数据集是有限的。最常见的检测数据集包含成千上万到几十万的图像,有几十到几百个标签[3][10][2]。分类数据集有数百万幅图像,有几万或几十万个类别。

 

General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.

Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags [3] [10] [2]. Classification datasets have millions of images with tens or hundreds of thousands of categories [20] [2].

翻译:

我们希望检测能够扩展到目标分类那样的规模。然而,为检测标注图像要比为分类或打标签标注图像昂贵得多(标签通常由用户免费提供)。因此,在不久的将来,我们不太可能看到与分类数据集规模相当的检测数据集。

 

We would like detection to scale to level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free). Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.

翻译:

我们提出了一种新方法,利用我们已经拥有的大量分类数据来扩展现有检测系统的覆盖范围。我们的方法使用目标分类的层次化视角,使我们能够把不同的数据集组合在一起。我们还提出了一种联合训练算法,使我们可以同时在检测数据和分类数据上训练目标检测器:利用有标注的检测图像学习精确定位目标,同时利用分类图像来扩大词汇量并提高鲁棒性。使用这种方法,我们训练了 YOLO9000,一个可以检测超过 9000 种不同目标类别的实时目标检测器。首先,我们改进基础的 YOLO 检测系统,得到最先进的实时检测器 YOLOv2;然后,我们使用数据集组合方法和联合训练算法,在来自 ImageNet 的 9000 多个类以及来自 COCO 的检测数据上训练模型。我们所有的代码和预训练模型都可以在 http://pjreddie.com/yolo9000/ 在线获取。

 


We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together. We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness. Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO. All of our code and pre-trained models are available online at http://pjreddie.com/yolo9000/.

2 更好

翻译:

与最先进的检测系统相比,YOLO 存在多方面的不足。把 YOLO 与 Fast R-CNN 做误差分析表明,YOLO 产生了大量的定位错误。此外,与基于区域建议(region proposal)的方法相比,YOLO 的召回率相对较低。因此,我们主要关注在保持分类精度的同时提高召回率和定位能力。计算机视觉领域总体上趋向于更大、更深的网络[6][18][17],更好的性能通常依赖于训练更大的网络或把多个模型集成在一起。然而,对于 YOLOv2,我们想要的是一个更精确、同时仍然很快的检测器。我们没有把网络做大,而是简化网络,并让表示变得更容易学习。我们汇集了以往工作中的各种想法,并结合我们自己的新概念来提升 YOLO 的表现。结果汇总见表 2。

 

YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy. Computer vision generally trends towards larger, deeper networks [6] [18] [17]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.

翻译:

批归一化。批归一化在显著改善收敛的同时,消除了对其他形式正则化的需求[7]。通过在 YOLO 的所有卷积层上加入批归一化,我们在 mAP 上获得了超过 2% 的提升。批归一化还有助于正则化模型:有了批归一化,我们可以在不产生过拟合的情况下去掉模型中的 dropout。

高分辨率的分类器。所有最先进的检测方法都使用在ImageNet[16]上预先训练好的分类器。从AlexNet开始,大多数分类器操作的输入图像小于256×256[8]。原YOLO在224×224处训练分类器网络,将分辨率提高到448进行检测。这意味着网络必须同时切换到学习对象检测和调整到新的输入分辨率。

对于 YOLOv2,我们首先在 ImageNet 上以完整的 448×448 分辨率对分类网络微调 10 个 epoch。这给了网络时间去调整它的滤波器,使其在更高分辨率的输入上工作得更好。然后我们再在检测任务上微调得到的网络。这个高分辨率分类网络使我们的 mAP 提升了近 4%。

 

 

 

Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in mAP . Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.

High Resolution Classifier. All state-of-the-art detection methods use classifier pre-trained on ImageNet [16]. Starting with AlexNet most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.

For YOLOv2 we first fine tune the classification network at the full 448×448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.

翻译:

使用锚框的卷积预测。YOLO 在卷积特征提取器之上使用全连接层直接预测边界框的坐标。Faster R-CNN 则不直接预测坐标,而是使用手工挑选的先验框来预测边界框[15]。Faster R-CNN 中的区域建议网络(RPN)只使用卷积层来预测锚框(anchor box)的偏移量和置信度。由于预测层是卷积的,RPN 会在特征图的每个位置上预测这些偏移量。预测偏移量而不是坐标简化了问题,也让网络更容易学习。

 

Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly Faster R-CNN predicts bounding boxes using hand-picked priors [15]. Using only convolutional layers the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

翻译:

我们去掉了 YOLO 中的全连接层,改用锚框来预测边界框。首先,我们去掉一个池化层,使网络卷积层的输出具有更高的分辨率。我们还把网络的输入从 448×448 缩小到 416×416。这样做是因为我们希望特征图中的位置数是奇数,从而恰好有一个中心格子。物体,尤其是大物体,往往占据图像的中心,所以最好在正中心有一个位置来预测这些物体,而不是由附近的四个位置共同预测。YOLO 的卷积层把图像下采样 32 倍,因此对于 416 的输入图像,我们得到 13×13 的输出特征图。

 

We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network’s convolutional layers higher resolution. We also shrink the network to operate on 416 input images instead of 448×448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image so it’s good to have a single location right at the center to predict these objects instead of four locations that are all nearby. YOLO’s convolutional layers downsample the image by a factor of 32 so by using an input image of 416 we get an output feature map of 13 × 13.

翻译:

在改用锚框的同时,我们也把类别预测机制与空间位置解耦,转而为每个锚框预测类别和 objectness。与 YOLO 一样,objectness 预测仍然预测真值框与建议框的 IOU,而类别预测则预测在存在目标的条件下该类别的条件概率。使用锚框后,精度略有下降。YOLO 每张图只预测 98 个框,而使用锚框后,我们的模型预测超过一千个框。不使用锚框时,我们的中间模型取得 69.5 mAP、81% 的召回率;使用锚框时,模型取得 69.2 mAP、88% 的召回率。尽管 mAP 有所下降,但召回率的提升意味着我们的模型还有更大的改进空间。

 

When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object. Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.

翻译:

维度聚类。在 YOLO 上使用锚框时,我们遇到了两个问题。第一个问题是框的尺寸是手工挑选的。网络可以学会适当地调整这些框,但如果一开始就为网络挑选更好的先验,就能让网络更容易学会预测出好的检测结果。

 

Dimension Clusters. We encounter two issues with anchor boxes when using them with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately but if we pick better priors for the network to start with we can make it easier for the network to learn to predict good detections.

翻译:

我们没有手工挑选先验,而是在训练集的边界框上运行 k-means 聚类,自动找到好的先验。如果我们使用基于欧氏距离的标准 k-means,较大的框会比较小的框产生更多的误差。然而,我们真正想要的是能带来高 IOU 得分的先验,而这与框的大小无关。因此,我们的距离度量采用:

d(box, centroid) = 1 − IOU(box, centroid)

Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for our distance metric we use:
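下面是一个基于 1 − IOU 距离的 k-means 的简化示意(Python/numpy,仅供理解,并非论文的官方实现;假设每个框只用宽高 (w, h) 表示,聚类时忽略位置):

import numpy as np

def iou_wh(boxes, centroids):
    # 只比较宽高的 IOU:相当于把两个框的左上角对齐后计算交并比
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    # 用 d(box, centroid) = 1 - IOU(box, centroid) 作为距离的 k-means
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = (1.0 - iou_wh(boxes, centroids)).argmin(axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = boxes[assign == i]
            if len(members):
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()
    return centroids, avg_iou

# 用法示例:boxes 是训练集中所有标注框的 (w, h),这里用随机数据演示
boxes = np.abs(np.random.randn(1000, 2)) + 0.1
anchors, avg_iou = kmeans_anchors(boxes, k=5)
print(anchors, avg_iou)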

翻译:

我们对不同的 k 值运行 k-means,并绘制与最近质心的平均 IOU,见图 2。我们选择 k = 5,作为模型复杂度和高召回率之间的一个良好折衷。聚类得到的中心与手工挑选的锚框有明显不同:又矮又宽的框更少,又高又瘦的框更多。表 1 比较了我们的聚类策略与手工挑选的锚框各自到最近先验的平均 IOU。仅用 5 个先验时,聚类中心的表现就与 9 个锚框相当,平均 IOU 为 61.0,后者为 60.9;如果使用 9 个聚类中心,平均 IOU 则会高得多。这表明用 k-means 生成边界框先验能让模型从更好的表示出发,使任务更容易学习。

 

We run k-means for various values of k and plot the average IOU with closest centroid, see Figure 2. We choose k = 5 as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes. We compare the average IOU to closest prior of our clustering strategy and the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.


翻译:

表 1:VOC 2007 上各框到最近先验的平均 IOU。使用不同生成方法时,VOC 2007 上的目标与其最近的、未经调整的先验框之间的平均 IOU。聚类的结果比使用手工挑选的先验要好得多。

Table 1: Average IOU of boxes to closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior using different generation methods. Clustering gives much better results than using hand-picked priors.

翻译:

直接位置预测。在 YOLO 上使用锚框时,我们遇到了第二个问题:模型不稳定,尤其是在训练早期的迭代中。大部分不稳定性来自对框 (x, y) 位置的预测。在区域建议网络中,网络预测 tx 和 ty 两个值,(x, y) 中心坐标按如下方式计算:

x = (tx × wa) + xa
y = (ty × ha) + ya

Direct location prediction. When using anchor boxes with YOLO we encounter a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) locations for the box. In region proposal networks the network predicts values tx and ty and the (x, y) center coordinates are calculated as:

翻译:

例如,预测 tx = 1 会把框向右移动一个锚框宽度的距离,预测 tx = −1 则会把框向左移动相同的距离。这个公式没有任何约束,因此无论是哪个位置在做预测,任何锚框都可能落到图像中的任何一点。在随机初始化的情况下,模型需要很长时间才能稳定下来,预测出合理的偏移量。我们没有预测偏移量,而是沿用 YOLO 的做法,预测相对于网格单元位置的位置坐标。这把真值限定在 0 到 1 之间。我们使用 logistic 激活函数把网络的预测约束在这个范围内。

 

For example, a prediction of tx= 1 would shift the box to the right by the width of the anchor box, a prediction of tx= −1 would shift it to the left by the same amount. This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets. Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range.

翻译:

网络在输出特征图的每个格子预测 5 个边界框。网络为每个边界框预测 5 个坐标:tx、ty、tw、th 和 to。如果该格子相对图像左上角的偏移为 (cx, cy),且先验框的宽高为 pw、ph,那么预测对应为:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)
Pr(object) · IOU(b, object) = σ(to)

The network predicts 5 bounding boxes at each cell in the output feature map. The network predicts 5 coordinates for each bounding box, tx, ty, tw, th, and to. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:
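按照上面的公式,可以用如下 Python 代码演示单个预测框的解码过程(仅为示意;σ 为 sigmoid,所有量以特征图格子为单位):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    # (cx, cy) 是该格子相对图像左上角的偏移,(pw, ph) 是先验框的宽高
    bx = sigmoid(tx) + cx          # 中心 x 被约束在本格子内
    by = sigmoid(ty) + cy          # 中心 y 被约束在本格子内
    bw = pw * np.exp(tw)           # 宽度 = 先验宽 * e^tw
    bh = ph * np.exp(th)           # 高度 = 先验高 * e^th
    objectness = sigmoid(to)       # 约等于 Pr(object) * IOU(b, object)
    return bx, by, bw, bh, objectness

# 示例:第 (3, 5) 个格子,先验宽高 (1.5, 2.0)
print(decode_box(0.2, -0.1, 0.3, 0.1, 1.2, cx=3, cy=5, pw=1.5, ph=2.0))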

翻译:

由于我们对位置预测加以约束,这种参数化更容易学习,网络也更加稳定。同时使用维度聚类和直接预测边界框中心位置,比使用锚框的版本把 YOLO 提升了近 5%。

 

Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.

翻译:

细粒度特征。改进后的 YOLO 在 13×13 的特征图上预测检测结果。这对大目标来说已经足够,但对于定位较小的目标,更细粒度的特征可能会有帮助。Faster R-CNN 和 SSD 都在网络中多个不同的特征图上运行它们的建议网络,以获得一系列分辨率。我们采取了不同的做法:只是简单地加一个 passthrough(直通)层,把前面 26×26 分辨率那一层的特征引过来。

 

Fine-Grained Features. This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.

翻译:

passthrough 层把相邻的特征堆叠到不同的通道(而不是空间位置)上,从而把高分辨率特征与低分辨率特征拼接起来,类似于 ResNet 中的恒等映射。这会把 26×26×512 的特征图变成 13×13×2048 的特征图,从而可以与原有特征拼接。我们的检测器运行在这个扩展后的特征图之上,因此可以利用细粒度特征。这带来了约 1% 的小幅性能提升。

 

The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest 1% performance increase.
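passthrough 层本质上是一次 space-to-depth 重排。下面用 numpy 演示这种形状变换(仅为示意,假设 HWC 布局;Darknet 实际实现中的通道排列细节可能不同):

import numpy as np

def space_to_depth(x, stride=2):
    # 把每个 stride×stride 的空间邻域堆到通道维:(H, W, C) -> (H/s, W/s, C*s*s)
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, c * stride * stride)

fine = np.random.randn(26, 26, 512).astype(np.float32)      # 前面 26x26x512 的特征
coarse = np.random.randn(13, 13, 1024).astype(np.float32)   # 主干输出的 13x13 特征
reorg = space_to_depth(fine)                                 # -> (13, 13, 2048)
fused = np.concatenate([coarse, reorg], axis=-1)             # 检测头运行在拼接后的特征上
print(reorg.shape, fused.shape)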

翻译:

多尺度训练。原始 YOLO 使用 448×448 的输入分辨率。加入锚框后,我们把分辨率改为 416×416。不过,由于我们的模型只使用卷积层和池化层,它可以随时改变输入尺寸。我们希望 YOLOv2 能鲁棒地运行在不同尺寸的图像上,所以把这一点训练进模型。我们不固定输入图像的尺寸,而是每隔几次迭代就改变一次网络:每 10 个 batch,网络随机选择一个新的图像尺寸。由于模型的下采样倍数为 32,我们从 32 的倍数中选取:{320, 352, …, 608}。因此最小的选项是 320×320,最大的是 608×608。我们把网络调整到该尺寸,然后继续训练。

 


Multi-Scale Training. The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416×416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320,352, ...,608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
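多尺度训练的调度逻辑可以写成如下示意代码(train_step 和 resize_network 是假设的占位接口,仅用于说明“每 10 个 batch 随机换一种 32 的倍数的输入尺寸”这一做法):

import random

INPUT_SIZES = list(range(320, 608 + 1, 32))   # {320, 352, ..., 608},都是 32 的倍数

def train_multiscale(num_batches, train_step, resize_network):
    size = 416
    for step in range(num_batches):
        if step % 10 == 0:                     # 每 10 个 batch 换一次尺寸
            size = random.choice(INPUT_SIZES)
            resize_network(size)               # 全卷积网络可以直接改变输入尺寸
        train_step(size)

# 用法示意
train_multiscale(30, train_step=lambda s: None,
                 resize_network=lambda s: print("resize to", s))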

翻译:

这种机制迫使网络学会在各种输入尺寸上做出好的预测,这意味着同一个网络可以在不同分辨率下进行检测。网络在较小尺寸下运行更快,因此 YOLOv2 在速度和精度之间提供了简单的折衷。在低分辨率下,YOLOv2 是一个廉价而相当准确的检测器:在 288×288 下,它以超过 90 FPS 的速度运行,mAP 几乎与 Fast R-CNN 相当,非常适合较小的 GPU、高帧率视频或多路视频流。在高分辨率下,YOLOv2 是最先进的检测器,在 VOC 2007 上取得 78.6 mAP,同时仍能以超过实时的速度运行。YOLOv2 与其他框架在 VOC 2007 上的比较见表 3。

 

This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions. The network runs faster at smaller sizes so YOLOv2 offers an easy tradeoff between speed and accuracy. At low resolutions YOLOv2 operates as a cheap, fairly accurate detector. At 288×288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high framerate video, or multiple video streams. At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007.


翻译:

表3:PASCAL VOC 2007的检测框架。YOLOv2比之前的检测方法更快、更准确。它还可以在不同的分辨率下运行,以便在速度和精度之间进行轻松的权衡。每个YOLOv2条目实际上都是相同的训练模型,有着相同的权重,只是大小不同而已。所有的时间信息都在Geforce GTX Titan X(原始的,不是Pascal模型)上。

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a Geforce GTX Titan X (original, not Pascal model).

翻译:

进一步的实验。我们在 VOC 2012 上训练 YOLOv2 做检测。表 4 给出了 YOLOv2 与其他最先进检测系统的性能对比:YOLOv2 取得 73.4 mAP,同时运行速度远快于其他竞争方法。我们也在 COCO 上训练,并在表 5 中与其他方法比较。在 VOC 度量(IOU = 0.5)下,YOLOv2 取得 44.0 mAP,与 SSD 和 Faster R-CNN 相当。

 

Further Experiments. We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare to other methods in Table 5. On the VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP , comparable to SSD and Faster R-CNN.

3 更快

翻译:

我们希望检测是准确的,但也希望它足够快。大多数检测应用(如机器人或自动驾驶汽车)依赖低延迟的预测。为了最大化性能,我们把 YOLOv2 从头开始设计得足够快。大多数检测框架依赖 VGG-16 作为基础特征提取器[17]。VGG-16 是一个强大而准确的分类网络,但它的复杂度并无必要:对一张 224×224 分辨率的图像做一次前向传播,VGG-16 的卷积层就需要 306.9 亿次浮点运算。

 

We want detection to be accurate but we also want it to be fast. Most applications for detection, like robotics or self-driving cars, rely on low latency predictions. In order to maximize performance we design YOLOv2 to be fast from the ground up. Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful, accurate classification network but it is needlessly complex. The convolutional layers of VGG-16 require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.

翻译:

YOLO 框架使用一个基于 GoogLeNet 架构[19]的自定义网络。这个网络比 VGG-16 更快,一次前向传播只需要 85.2 亿次运算。然而,它的精度比 VGG-16 稍差:对 224×224 的单次裁剪(single-crop)top-5 准确率,YOLO 的自定义模型在 ImageNet 上为 88.0%,而 VGG-16 为 90.0%。

 

The YOLO framework uses a custom network based on the Googlenet architecture [19]. This network is faster than VGG-16, only using 8.52 billion operations for a forward pass. However, its accuracy is slightly worse than VGG-16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s custom model gets 88.0% ImageNet compared to 90.0% for VGG-16.

翻译:

Darknet-19。我们提出一个新的分类模型作为 YOLOv2 的基础。我们的模型建立在前人网络设计工作以及该领域常识之上。与 VGG 系列模型类似,我们主要使用 3×3 滤波器,并在每次池化之后把通道数翻倍[17]。借鉴 Network in Network(NIN)的工作,我们使用全局平均池化进行预测,并使用 1×1 滤波器来压缩 3×3 卷积之间的特征表示[9]。我们使用批归一化来稳定训练、加快收敛并正则化模型[7]。我们最终的模型称为 Darknet-19,有 19 个卷积层和 5 个最大池化层,完整描述见表 6。Darknet-19 处理一张图像只需要 55.8 亿次运算,却能在 ImageNet 上达到 72.9% 的 top-1 准确率和 91.2% 的 top-5 准确率。

 


Darknet-19. We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7]. Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.
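按照论文表 6,Darknet-19 的层结构大致可以用下面的配置列表表示(仅为示意,细节以原文表 6 为准;(k, c) 表示核大小为 k、输出通道为 c 的卷积层,'M' 表示 2×2、步长 2 的最大池化):

DARKNET19 = [
    (3, 32), 'M',
    (3, 64), 'M',
    (3, 128), (1, 64), (3, 128), 'M',
    (3, 256), (1, 128), (3, 256), 'M',
    (3, 512), (1, 256), (3, 512), (1, 256), (3, 512), 'M',
    (3, 1024), (1, 512), (3, 1024), (1, 512), (3, 1024),
    (1, 1000),                       # 分类输出层,后接全局平均池化与 softmax
]
num_conv = sum(1 for layer in DARKNET19 if layer != 'M')   # 19 个卷积层
num_pool = DARKNET19.count('M')                             # 5 个最大池化层
print(num_conv, num_pool)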

翻译:

分类训练。我们使用 Darknet 神经网络框架[13],在标准的 ImageNet 1000 类分类数据集上训练网络 160 个 epoch,采用随机梯度下降,初始学习率 0.1,幂次为 4 的多项式学习率衰减,权重衰减 0.0005,动量 0.9。训练中我们使用标准的数据增强技巧,包括随机裁剪、旋转,以及色调、饱和度和曝光偏移。如上所述,在 224×224 的图像上完成初始训练后,我们再在更大的尺寸 448 上微调网络。微调时使用上述同样的参数,但只训练 10 个 epoch,并以 10−3 的学习率开始。在这个更高的分辨率下,我们的网络达到 76.5% 的 top-1 准确率和 93.3% 的 top-5 准确率。

 


Training for classification. We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework [13]. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts. As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10−3. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
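初始学习率 0.1、幂次为 4 的多项式衰减通常写成如下形式(示意,具体实现以 Darknet 框架为准):

def poly_lr(step, max_steps, base_lr=0.1, power=4):
    # lr = base_lr * (1 - step / max_steps) ** power
    return base_lr * (1.0 - step / float(max_steps)) ** power

print(poly_lr(step=50_000, max_steps=100_000))   # 训练进行到一半时:0.1 * 0.5^4 = 0.00625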

翻译:

检测训练。我们对这个网络做如下修改以用于检测:去掉最后一个卷积层,改为加上三个 3×3、每个有 1024 个滤波器的卷积层,最后再接一个 1×1 卷积层,其输出个数即检测所需的数量。对于 VOC,我们预测 5 个框,每个框有 5 个坐标和 20 个类别,所以是 125 个滤波器。我们还从最后的 3×3×512 层向倒数第二个卷积层加了一个 passthrough 层,使模型可以利用细粒度特征。我们训练网络 160 个 epoch,初始学习率为 10−3,并在第 60 和第 90 个 epoch 时除以 10。

 


Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features. We train the network for 160 epochs with a starting learning rate of 10−3, dividing it by 10 at 60 and 90 epochs.
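检测输出层的滤波器个数等于“每个格子的先验框数 × (4 个坐标 + 1 个 objectness + 类别数)”,对 VOC(5 个先验、20 类)即 125,可以用几行代码核对:

def detection_filters(num_anchors, num_classes):
    # 每个锚框预测 tx, ty, tw, th, to 共 5 个值,外加每个类别一个条件概率
    return num_anchors * (5 + num_classes)

print(detection_filters(5, 20))   # VOC: 5 * (5 + 20) = 125
print(detection_filters(5, 80))   # COCO: 5 * (5 + 80) = 425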

翻译:

我们使用 0.0005 的权重衰减和 0.9 的动量。我们使用与 YOLO 和 SSD 类似的数据增强,包括随机裁剪、颜色偏移等。我们在 COCO 和 VOC 上使用相同的训练策略。

 

We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.

4 更强

翻译:

我们提出了一种在分类和检测数据上联合训练的机制。该方法使用有检测标注的图像来学习检测专有的信息,例如边界框坐标预测和 objectness,以及如何对常见目标进行分类;同时使用只有类别标签的图像来扩大它能够检测的类别数量。训练时,我们把来自检测数据集和分类数据集的图像混合在一起。当网络看到一张有检测标注的图像时,我们基于完整的 YOLOv2 损失函数进行反向传播;当它看到一张分类图像时,我们只对架构中与分类相关的部分反向传播损失。

 

We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information like bounding box coordinate prediction and objectness as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect. During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classification-specific parts of the architecture.

翻译:

这种方法带来了一些挑战。检测数据集只有常见目标和比较笼统的标签,例如“狗”或“船”;分类数据集的标签范围则宽得多也深得多。ImageNet 有一百多个狗的品种,包括“Norfolk terrier”“Yorkshire terrier”和“Bedlington terrier”。如果我们想在两类数据集上一起训练,就需要一种连贯的方式来合并这些标签。大多数分类方法在所有可能的类别上使用一个 softmax 层来计算最终的概率分布,而使用 softmax 就假设了类别之间互斥。这给合并数据集带来了问题:比如你不能用这种模型合并 ImageNet 和 COCO,因为“Norfolk terrier”和“dog”这两个类并不互斥。我们也可以改用不假设互斥的多标签模型来组合数据集,但这种做法忽略了我们已知的数据结构,例如 COCO 的所有类别之间是互斥的。

 

This approach presents a few challenges. Detection datasets have only common objects and general labels, like “dog” or “boat”. Classification datasets have a much wider and deeper range of labels. ImageNet has more than a hundred breeds of dog, including “Norfolk terrier”, “Yorkshire terrier”, and “Bedlington terrier”. If we want to train on both datasets we need a coherent way to merge these labels. Most approaches to classification use a softmax layer across all the possible categories to compute the final probability distribution. Using a softmax assumes the classes are mutually exclusive. This presents problems for combining datasets, for example you would not want to combine ImageNet and COCO using this model because the classes “Norfolk terrier” and “dog” are not mutually exclusive. We could instead use a multi-label model to combine the datasets which does not assume mutual exclusion. This approach ignores all the structure we do know about the data, for example that all of the COCO classes are mutually exclusive.

翻译:

层次化分类。ImageNet 的标签取自 WordNet——一个组织概念及其相互关系的语言数据库[12]。在 WordNet 中,“Norfolk terrier”和“Yorkshire terrier”都是“terrier”的下位词,而“terrier”是“hunting dog”的一种,“hunting dog”是“dog”的一种,“dog”又属于“canine”,等等。大多数分类方法假设标签是扁平结构的,但要组合数据集,我们需要的恰恰是这种结构。WordNet 的结构是一个有向图而不是树,因为语言本身就很复杂。例如“dog”既是“canine”的一种,又是“domestic animal”的一种,而这两个在 WordNet 中都是 synset(同义词集)。我们没有使用完整的图结构,而是从 ImageNet 中的概念出发构建一棵层次树来简化问题。为了构建这棵树,我们检查 ImageNet 中的视觉名词,并查看它们在 WordNet 图中通向根节点(这里是“physical object”)的路径。许多 synset 在图中只有一条路径,所以我们先把这些路径全部加入树中。然后我们迭代地检查剩下的概念,加入使树增长尽可能少的路径。也就是说,如果一个概念到根节点有两条路径,一条会给树增加三条边,另一条只会增加一条边,我们就选择较短的那条。

 

Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume a flat structure to the labels however for combining datasets, structure is exactly what we need. WordNet is structured as a directed graph, not a tree, because language is complex. For example a “dog” is both a type of “canine” and a type of “domestic animal” which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet. To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case “physical object”. Many synsets only have one path through the graph so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root and one path would add three edges to our tree and the other would only add one edge, we choose the shorter path.

翻译:

最终的结果就是 WordTree,一个视觉概念的层次模型。为了用 WordTree 做分类,我们在每个节点上预测:在给定该 synset 的条件下,它的每一个下位词(hyponym)的条件概率。例如,在“terrier”节点,我们预测:

Pr(Norfolk terrier | terrier)
Pr(Yorkshire terrier | terrier)
Pr(Bedlington terrier | terrier)
…

The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree we predict conditional probabilities at every node for the probability of each hyponym of that synset given that synset. For example, at the “terrier” node we predict:

翻译:

如果我们想计算某个特定节点的绝对概率,只需沿着树中通向根节点的路径,把沿途的条件概率相乘。比如,如果想知道一张图片是不是 Norfolk terrier,我们计算:

Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier) × Pr(terrier | hunting dog) × … × Pr(mammal | animal) × Pr(animal | physical object)

If we want to compute the absolute probability for a particular node we simply follow the path through the tree to the root node and multiply the conditional probabilities. So if we want to know if a picture is of a Norfolk terrier we compute:

翻译:

出于分类的目的,我们假设图像中包含一个目标:Pr(physical object) = 1。为了验证这种做法,我们在用 1000 类 ImageNet 构建的 WordTree 上训练 Darknet-19 模型。为了构建 WordTree1k,我们加入了所有中间节点,这把标签空间从 1000 扩展到 1369。训练时,我们把真值标签沿树向上传播,因此如果一张图像被标注为“Norfolk terrier”,它同时也会被标注为“dog”“mammal”等等。为了计算条件概率,我们的模型预测一个 1369 维的向量,并对属于同一概念下位词的所有 synset 分别计算 softmax,见图 5。

 

For classification purposes we assume that the image contains an object: Pr(physical object) = 1. To validate this approach we train the Darknet-19 model on WordTree built using the 1000 class ImageNet. To build WordTree1k we add in all of the intermediate nodes which expands the label space from 1000 to 1369. During training we propagate ground truth labels up the tree so that if an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog” and a “mammal”, etc. To compute the conditional probabilities our model predicts a vector of 1369 values and we compute the softmax over all synsets that are hyponyms of the same concept, see Figure 5.
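“对同一概念的下位词分别做 softmax、再沿路径相乘”可以用下面的小例子演示(树结构和数值都是为演示而假设的):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 假设的迷你 WordTree:child -> parent,根节点为 "physical object"
TREE = {
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
}

def grouped_softmax(logits, tree):
    # 对每个父节点下的兄弟节点分别做 softmax,得到条件概率 Pr(child | parent)
    cond = {}
    for parent in set(tree.values()):
        group = [n for n in logits if tree.get(n) == parent]
        probs = softmax(np.array([logits[n] for n in group]))
        cond.update(dict(zip(group, probs)))
    return cond

def absolute_prob(name, tree, cond, p_object=1.0):
    # 沿树向上把条件概率连乘;分类时假设 Pr(physical object) = 1
    p = p_object
    while name in tree:
        p *= cond[name]
        name = tree[name]
    return p

logits = {"animal": 2.0, "dog": 1.5, "terrier": 0.7,
          "Norfolk terrier": 1.2, "Yorkshire terrier": 0.3}
cond = grouped_softmax(logits, TREE)
print(absolute_prob("Norfolk terrier", TREE, cond))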

翻译:

使用与之前相同的训练参数,我们这个层次化的 Darknet-19 达到了 71.9% 的 top-1 准确率和 90.4% 的 top-5 准确率。尽管增加了 369 个额外概念,并让网络预测一个树形结构,我们的准确率只是略有下降。以这种方式做分类还有一些好处:在新的或未知的目标类别上,性能会平滑地下降。例如,如果网络看到一张狗的照片,但不确定是哪种狗,它仍然会以很高的置信度预测“狗”,只是在各个下位词之间的置信度较低且分散。

 

Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite adding 369 additional concepts and having our network predict a tree structure our accuracy only drops marginally. Performing classification in this manner also has some benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain what type of dog it is, it will still predict “dog” with high confidence but have lower confidences spread out among the hyponyms.

翻译:

这种表述同样适用于检测。现在,我们不再假设每张图像都有一个目标,而是用 YOLOv2 的 objectness 预测器来给出 Pr(physical object) 的值。检测器预测一个边界框和一棵概率树。我们沿树自顶向下遍历,在每个分叉处走置信度最高的分支,直到低于某个阈值为止,然后预测该目标类别。

 

This formulation also works for detection. Now, instead of assuming every image has an object, we use YOLOv2’s objectness predictor to give us the value of Pr(physical object). The detector predicts a bounding box and the tree of probabilities. We traverse the tree down, taking the highest confidence path at every split until we reach some threshold and we predict that object class.
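检测时“自顶向下走最高置信度分支、低于阈值即停”的遍历过程,可以用如下示意代码表达(children、cond_prob 的结构和数值均为假设):

def predict_class(root, children, cond_prob, p_object, threshold=0.5):
    # 从根出发,每个分叉选条件概率最大的孩子;
    # 若继续向下会使绝对概率低于阈值,就停在当前节点并将其作为预测类别
    node, prob = root, p_object
    while children.get(node):
        best = max(children[node], key=lambda c: cond_prob[c])
        if prob * cond_prob[best] < threshold:
            break
        node, prob = best, prob * cond_prob[best]
    return node, prob

children = {"physical object": ["animal"], "animal": ["dog"],
            "dog": ["terrier"], "terrier": ["Norfolk terrier", "Yorkshire terrier"]}
cond_prob = {"animal": 0.95, "dog": 0.9, "terrier": 0.8,
             "Norfolk terrier": 0.55, "Yorkshire terrier": 0.45}
print(predict_class("physical object", children, cond_prob, p_object=0.9))
# 本例会停在 "terrier":再往下到 "Norfolk terrier" 会使绝对概率低于阈值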

翻译:

数据集与WordTree的组合。我们可以使用WordTree以一种合理的方式将多个数据集组合在一起。我们只需将数据集中的类别映射到树中的synset。图6显示了使用WordTree组合来自ImageNet和COCO的标签的示例。WordNet非常多样化,因此我们可以将此技术用于大多数数据集。

 

Dataset combination with WordTree. We can use WordTree to combine multiple datasets together in a sensible fashion. We simply map the categories in the datasets to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse so we can use this technique with most datasets.

翻译:

联合分类和检测。既然可以用 WordTree 组合数据集,我们就可以在分类和检测上训练联合模型。我们想训练一个超大规模的检测器,因此使用 COCO 检测数据集和完整 ImageNet 发布版中的前 9000 个类来创建组合数据集。我们还需要评估这一方法,所以把 ImageNet 检测挑战中尚未包含的类也加了进来。该数据集对应的 WordTree 有 9418 个类。ImageNet 是一个大得多的数据集,所以我们通过对 COCO 过采样来平衡数据集,使 ImageNet 与 COCO 的比例仅为 4:1。我们用这个数据集训练 YOLO9000。我们使用基础的 YOLOv2 架构,但只用 3 个先验框而不是 5 个,以限制输出大小。当网络看到检测图像时,我们按正常方式反向传播损失;对于分类损失,我们只在标签对应层级及其以上的层级反向传播。例如,如果标签是“狗”,我们不会把误差分配给树中更深层的预测(比如“德国牧羊犬”还是“金毛寻回犬”),因为我们没有这部分信息。

 

Joint classification and detection. Now that we can combine datasets using WordTree we can train our joint model on classification and detection. We want to train an extremely large scale detector so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset so we balance the dataset by oversampling COCO so that ImageNet is only larger by a factor of 4:1. Using this dataset we train YOLO9000. We use the base YOLOv2 architecture but only 3 priors instead of 5 to limit the output size. When our network sees a detection image we backpropagate loss as normal. For classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is “dog” we do not assign any error to predictions further down in the tree, “German Shepherd” versus “Golden Retriever”, because we do not have that information.

翻译:

当网络看到分类图像时,我们只反向传播分类损失。为此,我们只需找到对该类别预测概率最高的那个边界框,并只在它预测的那棵树上计算损失。我们还假设该预测框与(未知的)真值框至少有 0.3 的 IOU 重叠,并基于这一假设反向传播 objectness 损失。通过这种联合训练,YOLO9000 利用 COCO 的检测数据学会在图像中找到目标,并利用 ImageNet 的数据学会对各种各样的目标进行分类。

 

When it sees a classification image we only backpropagate classification loss. To do this we simply find the bounding box that predicts the highest probability for that class and we compute the loss on just its predicted tree. We also assume that the predicted box overlaps what would be the ground truth label by at least .3 IOU and we backpropagate objectness loss based on this assumption. Using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and it learns to classify a wide variety of these objects using data from ImageNet.
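对只有类别标签的图片,上述损失路由可以概括为如下伪代码式示意(其中 predictions、class_tree_loss、objectness_loss 都是假设的接口,仅说明“选出对该类预测概率最高的框、只在它的预测树上回传分类损失、并按至少 0.3 IOU 的假设回传 objectness 损失”这一流程):

def classification_image_loss(predictions, class_label,
                              class_tree_loss, objectness_loss,
                              assumed_iou=0.3):
    # predictions: 预测框列表,假设每个框带有 class_probs(各类别的绝对概率)
    best = max(predictions, key=lambda p: p["class_probs"][class_label])
    loss = class_tree_loss(best, class_label)          # 只在 best 的预测树上算分类损失
    loss += objectness_loss(best, assumed_iou)         # 假设与真值框至少有 0.3 IOU
    return loss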

翻译:

我们在 ImageNet 检测任务上评估 YOLO9000。ImageNet 检测任务与 COCO 只有 44 个目标类别是共享的,这意味着对于大多数测试图像,YOLO9000 只见过它们的分类数据而没有见过检测数据。YOLO9000 总体取得 19.7 mAP;在从未见过任何标注检测数据的 156 个不相交目标类上取得 16.0 mAP。这个 mAP 高于 DPM 取得的结果,而且 YOLO9000 是在只有部分监督的不同数据集上训练的[4],同时它还能实时检测其余 9000 多个目标类别。在分析 YOLO9000 在 ImageNet 上的表现时,我们发现它能很好地学习新的动物物种,但在学习服装、装备等类别时比较吃力。

 

We evaluate YOLO9000 on the ImageNet detection task. The detection task for ImageNet shares on 44 object categories with COCO which means that YOLO9000 has only seen classification data for the majority of the test images, not detection data. YOLO9000 gets 19.7 mAP overall with 16.0 mAP on the disjoint 156 object classes that it has never seen any labelled detection data for. This mAP is higher than results achieved by DPM but YOLO9000 is trained on different datasets with only partial supervision [4]. It also is simultaneously detecting 9000 other object categories, all in real-time. When we analyze YOLO9000’s performance on ImageNet we see it learns new species of animals well but struggles with learning categories like clothing and equipment.

翻译:

新的动物种类更容易学习,因为 objectness 预测可以很好地从 COCO 中的动物泛化过来。相反,COCO 没有任何服装类别的边界框标注,只有“person”,所以 YOLO9000 很难对“太阳镜”“泳裤”这类类别进行建模。

 

New animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO does not have bounding box label for any type of clothing, only for person, so YOLO9000 struggles to model categories like “sunglasses” or “swimming trunks”.

5 结论

翻译:

我们介绍了实时检测系统 YOLOv2 和 YOLO9000。YOLOv2 在各种检测数据集上都是最先进的,而且比其他检测系统更快;此外,它可以在多种图像尺寸下运行,在速度和精度之间提供平滑的折衷。YOLO9000 则是一个通过联合优化检测和分类来实时检测 9000 多个目标类别的框架。我们使用 WordTree 组合来自不同来源的数据,并用联合优化技术在 ImageNet 和 COCO 上同时训练。YOLO9000 是朝着缩小检测与分类之间数据集规模差距迈出的坚实一步。我们的许多技术可以推广到目标检测之外:ImageNet 的 WordTree 表示为图像分类提供了更丰富、更细致的输出空间;使用层次化分类来组合数据集在分类和分割领域也会有用;多尺度训练等训练技术可以在各种视觉任务中带来好处。未来,我们希望用类似的技术做弱监督图像分割,还计划在训练中用更强大的匹配策略为分类数据分配弱标签,从而改进检测结果。计算机视觉拥有海量的标注数据,我们将继续寻找把不同来源和不同结构的数据结合起来的方法,以构建更强大的视觉世界模型。

 

We introduce YOLOv2 and YOLO9000, real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can be run at a variety of image sizes to provide a smooth tradeoff between speed and accuracy. YOLO9000 is a real-time framework for detection more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification. Many of our techniques generalize outside of object detection. Our WordTree representation of ImageNet offers a richer, more detailed output space for image classification. Dataset combination using hierarchical classification would be useful in the classification and segmentation domains. Training techniques like multi-scale training could provide benefit across a variety of visual tasks. For future work we hope to use similar techniques for weakly supervised image segmentation. We also plan to improve our detection results using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labelled data. We will continue looking for ways to bring different sources and structures of data together to make stronger models of the visual world.