Paper Notes: Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition

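The authors argue that when people look at an image, they first locate the objects in it and the relationships between them, and only then describe each object's attributes in detail. Based on this observation, the paper designs a coarse-to-fine method: it first generates a skeleton sentence, then generates the corresponding attribute phrases, and finally merges the two into a complete caption. The overall pipeline is illustrated in the paper's overview figure.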

1. Skeleton-Attribute Decomposition

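Using the Stanford parser, each training caption is parsed and split into a brief structure (the skeleton sentence) and the attribute phrase attached to each part of it. In the paper's example, the lowest-level NP constituents are extracted first: "piggy bank" and "black bowtie". The last word of each NP is kept as an element of the skeleton sentence, and the words before it become the modifier (attribute) part. Everything at the lowest level that is not inside an NP stays in the skeleton. The resulting skeleton sentence is therefore: bank with bowtie.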
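The splitting rule above can be sketched in a few lines of Python. This is only an illustration, not the paper's code: it assumes the parser output has already been flattened into a list of lowest-level chunks, and the `(label, words)` chunk format, the `"OTHER"` label, and the function name `decompose` are all hypothetical choices made here. Every NP chunk is assumed to contain at least one word.

```python
def decompose(chunks):
    """Split a chunked caption into a skeleton sentence and attribute phrases.

    `chunks` is a list of (label, words) pairs (a hypothetical format).
    For each lowest-level NP, the last word goes into the skeleton and the
    preceding words become its attribute phrase. Non-NP chunks are copied
    into the skeleton unchanged.
    """
    skeleton = []
    attributes = []
    for label, words in chunks:
        if label == "NP":
            *modifiers, head = words          # head = last word of the NP
            skeleton.append(head)
            attributes.append((head, modifiers))
        else:
            skeleton.extend(words)            # e.g. prepositions, verbs
    return " ".join(skeleton), attributes


# The example from the notes, written out by hand instead of parsed.
chunks = [
    ("NP", ["piggy", "bank"]),
    ("OTHER", ["with"]),
    ("NP", ["black", "bowtie"]),
]
skeleton, attributes = decompose(chunks)
print(skeleton)    # bank with bowtie
print(attributes)  # [('bank', ['piggy']), ('bowtie', ['black'])]
```

In a real pipeline the chunks would come from a constituency parse rather than being typed in by hand.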

2. Skel-LSTM

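This part uses the standard soft-attention mechanism: at each time step the network computes attention weights over the image regions vij and forms the attended visual feature as their weighted sum (equation (2) of the paper).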

Note that the attended visual feature is reused below when generating the attribute phrases.
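As a rough illustration of soft attention, the sketch below turns one score per image region into normalized weights and then takes the weighted sum of the region features. It is a minimal example under stated assumptions, not the paper's formulation: how the scores are computed from the LSTM state is not shown in these notes, so they are simply passed in, and the names `softmax`, `attend`, `scores`, and `regions` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, regions):
    """Return (attention weights, attended feature).

    `scores` holds one unnormalized relevance score per region (assumed given),
    `regions` holds one feature vector per region, all of the same length.
    The attended feature is the attention-weighted sum of the region vectors.
    """
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [
        sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)
    ]
    return weights, attended


# Three toy regions with 2-dimensional features; the first gets the top score.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]
weights, attended = attend(scores, regions)
print(weights)
print(attended)
```

The region with the highest score dominates the attended feature, which is the behaviour the later refinement step tries to sharpen.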

3. Attr-LSTM

This is a plain encoder-decoder structure. Generating the attributes depends on the attended visual feature from above (the image information), on the word Skel-LSTM has just generated in the skeleton sentence, and on the corresponding history. The encoder vector is therefore built from these three sources and is fed to Attr-LSTM as its first input.

However, the attended visual feature was computed before Skel-LSTM generated the current skeleton word. Now that the word is known, the feature can be refined so that the visual information is more accurate.

The refinement works as follows:

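In Skel-LSTM, when the attended visual feature is the input at time step T, the output is a probability distribution over words, (p1, p2, …, pQ), where Q is the vocabulary size. By equation (2), the attended feature is simply a weighted combination of the visual regions at time T. Here each visual region vij is instead fed into Skel-LSTM on its own, which gives a separate word distribution for each region. These per-region distributions are then used to correct the attended feature; the exact formula is given in the paper.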
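The refinement formula itself is not reproduced in these notes, so the sketch below shows only one plausible reading of the idea, as an assumption: each region's attention weight is multiplied by the probability that the region alone assigns to the word Skel-LSTM actually emitted, the weights are renormalized, and the attended feature is recomputed. The function names and the toy numbers are hypothetical.

```python
def refine_attention(weights, region_word_probs):
    """Reweight attention by each region's probability for the emitted word.

    `weights` are the original attention weights (summing to 1).
    `region_word_probs[k]` is the probability that region k alone gives to
    the generated skeleton word (assumed to be computed elsewhere).
    Assumption: refined weight is proportional to weight * probability.
    """
    unnormalized = [a * p for a, p in zip(weights, region_word_probs)]
    total = sum(unnormalized)
    if total == 0.0:
        return list(weights)  # nothing to refine with; keep the original
    return [u / total for u in unnormalized]


def weighted_sum(weights, regions):
    """Attention-weighted sum of region feature vectors."""
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]


# Toy example: region 1 predicts the emitted word best, so it gains weight.
weights = [0.5, 0.3, 0.2]
region_word_probs = [0.1, 0.6, 0.3]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

refined = refine_attention(weights, region_word_probs)
refined_feature = weighted_sum(refined, regions)
print(refined)
print(refined_feature)
```

Under this reading, regions that are consistent with the generated word are emphasized before Attr-LSTM describes that word's attributes.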

References:

Wang Y, Lin Z, Shen X, et al. Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition[C]// Computer Vision and Pattern Recognition. IEEE, 2017:7378-7387.
