Abstract (Top-down):

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.

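In other words, top-down visual attention is widely used in image captioning and VQA: fine-grained analysis, and sometimes several steps of reasoning, give the model a deeper understanding of the image.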

Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings.

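That is, the bottom-up mechanism (built on Faster R-CNN) proposes image regions, each with its own feature vector, and the top-down mechanism decides how much weight each region's features receive.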
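To make "determines feature weightings" concrete, here is a minimal sketch in plain Python. It assumes the additive (tanh) scoring form commonly used for this kind of soft attention; the function names, the matrices `W_v`, `W_h`, `w_a`, and all numeric values are hypothetical toy choices, not the authors' code.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_down_attention(regions, context, W_v, W_h, w_a):
    """Weight bottom-up region features using a top-down context vector.

    regions: k feature vectors, one per proposed image region (bottom-up).
    context: task-specific state, e.g. an LSTM hidden state (top-down).
    Returns (weights, attended); attended is the weighted sum of regions.
    """
    projected_context = matvec(W_h, context)
    scores = []
    for v in regions:
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(W_v, v), projected_context)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * v[d] for w, v in zip(weights, regions))
                for d in range(dim)]
    return weights, attended

# Toy inputs (all values hypothetical): 3 regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [0.5, -0.5]
W_v = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
w_a = [1.0, 1.0]
weights, attended = top_down_attention(regions, context, W_v, W_h, w_a)
```

The weights sum to 1, so the attended vector is a convex combination of the region features: the regions come from the detector, and only the weighting depends on the task context.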

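Notes:

Faster R-CNN consists mainly of a CNN backbone, a region proposal network (RPN), and RoI pooling; equivalently, it is an RPN combined with a Fast R-CNN detector. The line of development runs R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, all representative applications of neural networks to object detection. R-CNN first showed that applying a CNN to object detection works well, and each later network refines its predecessor, with large gains in both speed and accuracy.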

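Main steps:

1. Feed the whole image into a CNN to obtain a feature map.

2. Feed the convolutional features into the RPN to obtain candidate boxes (region proposals) and the features for each one.

3. Run a classifier on the features extracted from each candidate box to decide whether it belongs to a particular class.

4. For candidate boxes assigned to a class, use a regressor to further adjust the box position.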
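Steps 3 and 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `classify_and_refine`, the score and delta values, and the class layout are hypothetical, and the box update uses the center/size parameterization common to the R-CNN family rather than any particular library's API.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_and_refine(proposal, class_scores, deltas):
    """Classify one candidate box, then adjust its position.

    proposal: (x1, y1, x2, y2) box proposed by the RPN.
    class_scores: raw classifier scores, one per class.
    deltas: per-class regression outputs (dx, dy, dw, dh).
    """
    # Step 3: pick the most probable class for this box.
    probs = softmax(class_scores)
    label = max(range(len(probs)), key=probs.__getitem__)

    # Step 4: shift the center by a fraction of the box size and
    # rescale width and height exponentially.
    x1, y1, x2, y2 = proposal
    width, height = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * width, y1 + 0.5 * height
    dx, dy, dw, dh = deltas[label]
    new_cx, new_cy = cx + dx * width, cy + dy * height
    new_w, new_h = width * math.exp(dw), height * math.exp(dh)
    refined = (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
               new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)
    return label, probs[label], refined

# Toy example (hypothetical numbers): two classes, box nudged to the right.
proposal = (10.0, 20.0, 50.0, 60.0)
class_scores = [0.2, 2.0]
deltas = [(0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0)]
label, confidence, refined = classify_and_refine(proposal, class_scores, deltas)
```

With all-zero deltas the box is returned unchanged, which is a quick sanity check on the parameterization.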

Introduction (Problems):

In this paper we adopt similar terminology and refer to attention mechanisms driven by non-visual or task-specific context as 'top-down', and purely visual feed-forward attention mechanisms as 'bottom-up'.

Put simply, attention driven by non-visual or task-specific context is called top-down, and purely visual, feed-forward attention is called bottom-up.

We first present an image captioning model that takes multiple glimpses of salient image regions during caption generation. Empirically, we find that the inclusion of bottom-up attention has a significant positive benefit for image captioning.

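The paper first presents an image captioning model that takes several glimpses of salient image regions while it generates a caption. Experiments show that adding bottom-up attention significantly improves image captioning.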

Captioning Model:

Within the captioning model, we characterize the first LSTM layer as a top-down visual attention model, and the second LSTM layer as a language model, indicating each layer with superscripts in the equations that follow.

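So the first LSTM layer acts as the top-down visual attention model and the second LSTM layer as the language model; superscripts in the paper's equations mark which layer a quantity belongs to.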
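The layer structure can be written out as below. This is a reader's reconstruction following the paper's superscript convention (1 for the attention LSTM, 2 for the language LSTM), so check it against the paper before relying on it. Here $v_1, \dots, v_k$ are the region features, $\bar{v}$ is their mean, $W_e \Pi_t$ is the embedding of the input word at step $t$, and $[\cdot,\cdot]$ denotes concatenation.

```latex
\begin{align}
x^1_t &= [\,h^2_{t-1},\ \bar{v},\ W_e \Pi_t\,], & h^1_t &= \mathrm{LSTM}(x^1_t, h^1_{t-1}) \\
a_{i,t} &= w_a^{\top} \tanh(W_{va} v_i + W_{ha} h^1_t), & \alpha_t &= \mathrm{softmax}(a_t) \\
\hat{v}_t &= \sum_{i=1}^{k} \alpha_{i,t}\, v_i \\
x^2_t &= [\,\hat{v}_t,\ h^1_t\,], & h^2_t &= \mathrm{LSTM}(x^2_t, h^2_{t-1}) \\
p(y_t \mid y_{1:t-1}) &= \mathrm{softmax}(W_p h^2_t + b_p)
\end{align}
```

The attention LSTM sees the previous language-model state, the global image context, and the previous word; the language LSTM sees the attended feature $\hat{v}_t$ and the attention LSTM's output.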

Note that the SCST approach uses ResNet-101 encoding of full images, similar to our ResNet baseline.

That is, SCST encodes the full image with ResNet-101, which makes it comparable to the ResNet baseline in the paper.

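Notes:
1. Baseline: a reference method that marks the starting point for performance. Anything that performs worse than the baseline is generally not acceptable, and the baseline itself leaves plenty of room for improvement toward, and beyond, the benchmark. While refining an algorithm and tuning its parameters, the goal is to beat the baseline.
Benchmark: a standard point of reference. A method serves as a benchmark when its performance has been studied widely and people are familiar with how that performance is reported and measured, so it can be used as a yardstick for judging other methods.
State of the art (SOTA): the most advanced. An algorithm is called SOTA when its performance is currently the best reported.
2. Self-critical sequence training (SCST) trains image captioning models with reinforcement learning. It is a form of the REINFORCE algorithm, but rather than estimating a separate baseline to normalize the rewards and reduce variance, it uses the output of its own test-time inference algorithm to normalize the rewards.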
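A minimal sketch of the SCST idea in plain Python: the reward of the greedily decoded caption (the test-time output) is subtracted from the reward of a sampled caption, and the difference scales the log-likelihood of the sample. The function name `scst_loss` and all numbers are hypothetical; in practice the rewards would be a sentence-level score such as CIDEr and the log-probabilities would come from the captioning model.

```python
def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption.

    sample_log_probs: log-probability of each sampled word.
    sample_reward: score of the sampled caption.
    greedy_reward: score of the greedily decoded caption, used in place
        of a learned baseline.
    """
    # The advantage is treated as a constant when differentiating.
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_log_probs)

# Toy numbers (hypothetical): the sample beats the greedy caption, so
# minimizing the loss raises the probability of the sampled words.
log_probs = [-0.5, -1.0, -0.25]
loss = scst_loss(log_probs, sample_reward=0.9, greedy_reward=0.7)
no_signal = scst_loss(log_probs, sample_reward=0.7, greedy_reward=0.7)
```

Samples that score better than the model's own greedy output are reinforced, samples that score worse are suppressed, and a sample that matches the greedy reward contributes no gradient.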

Paper: "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" (key passages translated, with extended notes)
