State-of-the-Art Image Classification Algorithm: FixEfficientNet-L2

FixEfficientNet is a technique that combines two existing techniques: FixRes from the Facebook AI Team [2] and EfficientNet [3], first presented by the Google AI Research Team. FixRes is short for Fix Resolution and tries to keep a fixed size for either the RoC (Region of Classification) used at training time or the crop used at test time. EfficientNet is a compound scaling of the dimensions of a CNN that improves both accuracy and efficiency. This article explains both techniques and why they are state-of-the-art.

FixEfficientNet was first presented, together with the corresponding paper, by the Facebook AI Research Team on April 20, 2020 [1]. The technique is used for Image Classification and is therefore a Computer Vision task. It is currently the state of the art and has the best results on the ImageNet dataset, with 480M parameters, a top-1 accuracy of 88.5%, and a top-5 accuracy of 98.7%.

But let’s dive in a bit deeper to get a better understanding of the combined techniques:

Understanding FixRes

Training Time

Until the Facebook AI Research Team proposed the FixRes technique, the state of the art was to extract a random square of pixels from an image. This square served as the RoC at training time. (Be aware that this technique artificially increases the amount of data.) The region was then resized to obtain an image of a fixed size (= the crop), which was fed to the Convolutional Neural Network [2].

RoC = rectangle/square in the input image; crop = pixels of the RoC, rescaled with a bilinear interpolation to a certain resolution.

Train-time scale augmentation

To get a better understanding of what FixRes does exactly, let's take a look at the math. Changing the size of the RoC in the input image affects the distribution of the apparent object size seen by the CNN. Suppose the object has a size of r x r in the input image. If the RoC is now scaled by a factor s, the apparent size of the object becomes rs x rs.

For the augmentation, the RandomResizedCrop transform of PyTorch is used. The input image has a size of H x W, from which a RoC is randomly selected. This RoC is then resized to a crop of size K_train x K_train.
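As a concrete illustration, the train-time pipeline described above looks roughly like the following in torchvision. Note that K_train = 224 and the normalization constants are typical ImageNet defaults and are assumptions here, not values quoted from the papers.

```python
from torchvision import transforms

# Minimal sketch of the standard train-time preprocessing (assumed ImageNet defaults).
K_train = 224
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(K_train),   # random RoC, rescaled to K_train x K_train
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```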

The scaling of the input image (H x W) to the crop that is output can be expressed by the following factor:

(Formula: scaling factor from input image to output crop; image by the author.)
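The original formula image is not reproduced here. Based on the derivation in the FixRes paper [2], and ignoring the aspect-ratio jitter of RandomResizedCrop, the train-time scaling factor presumably has the form

s_train = K_train / (σ · √(H · W)),

where σ is the random scale sampled by the augmentation, so that the side length of the RoC is roughly σ · √(H · W).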

Test time

At test time, the RoC is often centered in the image, which results in a so-called center crop. Both crops, the one from train time and the one from test time, have the same size, but they originate from different parts of the image. This often leads to a bias in the distribution of data seen by the CNN [2].
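For comparison with the train-time pipeline above, the standard test-time preprocessing looks roughly as follows. Resizing to K_test / 0.875 before the center crop is the common ImageNet evaluation convention and an assumption here, not a value taken from the papers.

```python
from torchvision import transforms

# Minimal sketch of the standard test-time preprocessing (center crop).
K_test = 224
test_transform = transforms.Compose([
    transforms.Resize(int(K_test / 0.875)),  # resize the smaller image side
    transforms.CenterCrop(K_test),           # centered crop of size K_test x K_test
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```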

Test-time scale augmentation

As previously described, the test-time augmentation is not the same as the train-time augmentation (keyword: center crop). The crop then has a size of K_test x K_test.

Under the assumption that the input image is a square (H = W), the scaling factor for the test-time augmentation can be expressed as follows.
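The original formula image is not reproduced here. Based on [2] and the same reasoning as for the train-time factor, it presumably reads

s_test = K_test^image / √(H · W) = K_test^image / H,

where K_test^image is the resolution to which the image is resized before the center crop. The ratio of apparent object sizes at test and train time is then roughly (σ · K_test^image) / K_train, and FixRes aims to bring this ratio close to 1 (in expectation over the random scale σ).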

What is the takeaway? Until FixRes was developed, the preprocessing for test and training time was designed separately, which led to this bias. Consequently, the Facebook AI Team looked for a solution that handles both preprocessing steps jointly and keeps them synchronized, and that solution is FixRes.

(Diagram of the standard preprocessing; image by the author.)

The standard preprocessing, as seen above, often enlarges the RoC at training time and decreases its size at test time.

The FixRes technique takes an either-or approach. It either reduces the train-time resolution and keeps the size of the test crop, or it increases the test-time resolution and keeps the size of the training crop. The aim is to present the object (here, the crow) at the same size in both cases, reducing the scale discrepancy between training and testing that the CNN has to cope with [2]. This looks like the following:

(Diagram of the FixRes preprocessing; image by the author.)
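In code, the second option of this either-or approach amounts to keeping the train-time pipeline unchanged and only rebuilding the test-time pipeline with a larger resolution. The value K_test = 320 below is purely illustrative and not the resolution chosen in the paper.

```python
from torchvision import transforms

# Sketch of the FixRes option "keep the training crop, increase the test resolution".
K_train, K_test = 224, 320   # K_test > K_train; 320 is only an example value

fixres_test_transform = transforms.Compose([
    transforms.Resize(int(K_test / 0.875)),
    transforms.CenterCrop(K_test),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```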

This results in two effects on how the data is fed to the CNN:

  1. The size of the object (here the crow) in the image is changed by the FixRes Scaling.

  2. The use of different crop sizes has an impact on how and when the neurons are activated.

The Problem of Varying Activation Statistics

Touvron et al. found that larger test crops, and above all the adjustment of the apparent object size, lead to better accuracy. However, there is a trade-off between adjusting the size of the object and the resulting change in activation statistics.

Tests showed that the activation map changes with the resolution of the image: K_test = 224 leads to a map of 7x7, K_test = 64 to a map of 2x2, and K_test = 448 to a map of 14x14. This shows that the activation distribution varies at test time and the values fall outside the range the classifier was trained for [1].
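These map sizes are consistent with a backbone whose overall downsampling stride is 32, which is an assumption about the architecture (typical for ResNet- and EfficientNet-style networks); a quick sanity check:

```python
# Final activation-map side length for an assumed overall stride of 32.
for k_test in (64, 224, 448):
    side = k_test // 32
    print(f"K_test = {k_test}: activation map of {side}x{side}")
# K_test = 64: activation map of 2x2
# K_test = 224: activation map of 7x7
# K_test = 448: activation map of 14x14
```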

To solve the problem of changing activation statistics, two solutions are presented:

  1. Parametric adaptation: A parametric Fréchet distribution is fitted to the output of the average pooling layer. The new distribution is then mapped via a scalar transformation to the old distribution and applied as an activation function.

  2. Fine-tuning: Another way to apply a correction is to fine-tune the model. The fine-tuning is only applied to the last layers of the CNN (a minimal sketch follows below).

During the fine-tuning stage, label smoothing is used [1].
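As a rough idea of what such a correction could look like in PyTorch, the sketch below freezes everything except the final classifier and fine-tunes it with label smoothing on images preprocessed at the test resolution. The backbone (a ResNet-50 stand-in), the choice of layers to unfreeze, the smoothing value, and the optimizer settings are all illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical fine-tuning sketch: adapt only the last layer(s) to the new test resolution.
model = models.resnet50(weights="IMAGENET1K_V1")   # stand-in backbone, not EfficientNet-L2

for param in model.parameters():        # freeze the whole network ...
    param.requires_grad = False
for param in model.fc.parameters():     # ... except the final classifier
    param.requires_grad = True

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing (value assumed)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)

# A short training loop over images preprocessed at the test resolution would go here.
```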

EfficientNet Architecture [3]

The authors have pre-trained several models, of which the EfficientNet-L2 shows the best results. But what is EfficientNet?

Like most algorithms in Image Classification, EfficientNet is based on CNNs. A CNN has three dimensions: width, depth, and resolution. Depth is the number of layers, width is the number of channels (e.g. a conventional RGB input has 3 channels), and resolution is the pixel size of an image.

EfficientNets introduced compound scaling, which makes use of all three dimensions:

Width Scaling — The width can be increased by using more channels (feature maps) in each layer. The accuracy gain diminishes pretty quickly, though.

Depth Scaling — This is the conventional and most typical way of scaling. By increasing depth, you increase the number of layers of your neural network. But adding more layers does not always improve performance: training usually takes more time, and due to vanishing gradients the performance can stagnate or even decrease with a higher number of layers.

Resolution Scaling — This means increasing the resolution and hence the number of pixels, e.g. from 200x200 to 600x600. The problem with this kind of scaling is that the accuracy gain fades at higher resolutions: up to a certain point accuracy increases, but the increments become smaller and smaller.

Scaling up each of these dimensions on its own leads to diminishing accuracy gains, and a balanced scaling of all three dimensions is necessary to achieve the best accuracy. Therefore, compound scaling is proposed:

(Compound scaling formulas; image by the author.)
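The original formula image is not reproduced here. According to the EfficientNet paper [3], the compound scaling rule fixes constants α, β, γ with a small grid search and then scales

depth: d = α^ɸ
width: w = β^ɸ
resolution: r = γ^ɸ

subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.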

ɸ specifies the available resources, while α, β, and γ are responsible for allocating them across depth, width, and resolution.

Tan and Le [3] “use neural architecture search to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets.” The Neural Architecture Search (NAS) optimizes FLOPS and accuracy.

Conclusion

The combination of both techniques leads to the currently best algorithm in image classification, closely ahead of EfficientNet Noisy Student. It is the current leading algorithm in both efficiency and accuracy. With a top-5 accuracy of 98.7% there is still room for improvement, but it is already quite accurate. So it remains to be seen when it will be replaced by a new technique.

Since this article does not include any implementation, you can try it yourself using the authors' official GitHub repository: http://github.com/facebookresearch/FixRes.

The pre-trained networks of the authors [1] can be seen below.

Screenshot of the GitHub repo.

I hope you understood and enjoyed it!

References

[1] Touvron, H., Vedaldi, A., Douze, M., & Jégou, H. (2020b). Fixing the train-test resolution discrepancy: FixEfficientNet. ArXiv:2003.08237 [Cs]. http://arxiv.org/abs/2003.08237

[2] Touvron, H., Vedaldi, A., Douze, M., & Jégou, H. (2020a). Fixing the train-test resolution discrepancy. ArXiv:1906.06423 [Cs]. http://arxiv.org/abs/1906.06423

[3] Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ArXiv:1905.11946 [Cs, Stat]. http://arxiv.org/abs/1905.11946

Translated from: https://towardsdatascience.com/state-of-the-art-image-classification-algorithm-fixefficientnet-l2-98b93deeb04c
