Reading notes on "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"

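This paper uses an attention mechanism to improve the Encoder-Decoder framework. In an Encoder-Decoder architecture, the Encoder encodes the input sequence into a single vector $h_{n}$. A potential problem is that if the original sequence carries a lot of information while the length of $h_{n}$ is fixed, $h_{n}$ cannot hold everything we need.
With attention, the Decoder can select the features it needs from the input sequence, which improves the performance of the Encoder-Decoder model.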

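First, let us review how an LSTM works. The structure of an LSTM is shown in the figure below:
[Figure: LSTM structure diagram]
* Red denotes inputs
* Blue denotes outputs
* Green denotes the memory cell
* Dashed lines denote variables from the previous time step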

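The expression for each component is as follows:

| Meaning | Expression |
| --- | --- |
| Data input | $z = g([x_{t}, y_{t-1}])$ |
| Input gate | $i = σ([x_{t}, y_{t-1}, c_{t-1}])$ |
| Forget gate | $f = σ([x_{t}, y_{t-1}, c_{t-1}])$ |
| Output gate | $o = σ([x_{t}, y_{t-1}, c_{t-1}])$ |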

Two nonlinear activation functions are involved:
$σ(u) = \frac{1}{1 + e^{-u}}$
$g(u) = h(u) = \tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$

Square brackets [ ] denote a linear transformation, which has the general form:
[Formula image: general form of the linear transformation]
Each function has its own parameters W, R, p, and b, which are learned during training.
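
The gate expressions above can be sketched in NumPy as a single LSTM step. This is a minimal sketch under stated assumptions, not the author's code. It assumes the bracket notation expands to $W x_{t} + R y_{t-1} + p \odot c_{t-1} + b$ (inferred from the parameter names W, R, p, b), and it assumes the standard memory-cell update $c_{t} = i \odot z + f \odot c_{t-1}$ and output $y_{t} = o \odot h(c_{t})$, neither of which appears in the table. The forget gate is called `f` in the code so that it does not clash with the activation function g. The names `linear`, `lstm_step`, and `init_params` are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def linear(params, x, y_prev, c_prev=None):
    # Assumed form of the bracket notation: W x + R y_prev + p * c_prev + b
    W, R, p, b = params
    out = W @ x + R @ y_prev + b
    if c_prev is not None:
        out = out + p * c_prev  # elementwise term on the previous cell state
    return out

def lstm_step(x, y_prev, c_prev, params):
    z = np.tanh(linear(params["z"], x, y_prev))          # data input, g = tanh
    i = sigmoid(linear(params["i"], x, y_prev, c_prev))  # input gate
    f = sigmoid(linear(params["f"], x, y_prev, c_prev))  # forget gate
    o = sigmoid(linear(params["o"], x, y_prev, c_prev))  # output gate
    c = i * z + f * c_prev   # assumed standard memory-cell update
    y = o * np.tanh(c)       # assumed standard output, h = tanh
    return y, c

def init_params(n_in, n_hidden, rng):
    # One (W, R, p, b) tuple per function; values are random placeholders.
    def one():
        return (rng.normal(scale=0.1, size=(n_hidden, n_in)),
                rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
                rng.normal(scale=0.1, size=n_hidden),
                np.zeros(n_hidden))
    return {name: one() for name in ("z", "i", "f", "o")}

rng = np.random.default_rng(0)
params = init_params(n_in=4, n_hidden=3, rng=rng)
y, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), params)
print(y.shape, c.shape)
```

Each function gets its own parameter tuple, which mirrors the statement that every function has different W, R, p, b.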

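A variant of the LSTM: the attention LSTM

The attention LSTM adds an attention weight $α_{t}$ of the same size as the input, computed from the input and the previous output/hidden state:
$α_{t} = \mathrm{softmax}(k(x_{t}, y_{t-1}))$, where k is a network that computes relevance scores

This weight is then used to reweight the original input:
$\hat{x}_{t} = ϕ(α_{t}, x_{t})$
Replacing the original $x_{t}$ with the weighted input gives the LSTM structure shown in the figure below:
[Figure: attention LSTM structure diagram]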
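
The input reweighting described above can be illustrated with a short NumPy sketch. The notes do not specify the scoring network k or the weighting function ϕ, so both are assumptions here: k is taken to be a single linear layer with hypothetical weights `W_x` and `W_y`, and ϕ is taken to be an elementwise product.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

def attend_input(x_t, y_prev, W_x, W_y):
    scores = W_x @ x_t + W_y @ y_prev  # hypothetical linear form of k
    alpha_t = softmax(scores)          # same size as x_t, sums to 1
    x_hat = alpha_t * x_t              # hypothetical phi: elementwise weighting
    return alpha_t, x_hat

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 3
x_t = rng.normal(size=n_in)
y_prev = rng.normal(size=n_hidden)
W_x = rng.normal(size=(n_in, n_in))
W_y = rng.normal(size=(n_in, n_hidden))
alpha_t, x_hat = attend_input(x_t, y_prev, W_x, W_y)
print(alpha_t.sum(), x_hat.shape)
```

The reweighted `x_hat` would then take the place of `x_t` in the usual LSTM gate computations.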

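Now let us return to the paper itself.

Structure

From input to output, the model still passes through two parts, an encoder and a decoder.
[Figure: overall model structure]
* Features (annotations): { $a_{1}, \ldots, a_{i}, \ldots, a_{L}$ }. Each $a_{i}$ is a D-dimensional feature; there are L of them, describing different regions of the image.
* Context: { $z_{1}, \ldots, z_{t}, \ldots, z_{C}$ }. Each $z_{t}$ is also a D-dimensional feature; there are C of them, one context vector per word.
* Output (caption): { $y_{1}, \ldots, y_{t}, \ldots, y_{C}$ }. The $y_{t}$ together form a caption. The sentence length C is variable. Each word $y_{t}$ is a K-dimensional probability vector, where K is the vocabulary size.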

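From the input image I to a

The features a are taken directly from the 14 × 14 × 512 feature map of the conv5_3 layer of an off-the-shelf VGG network. The number of regions is therefore L = 14 × 14 = 196, and the dimension is D = 512.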
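
The arithmetic above can be checked with a small NumPy sketch that flattens a 14 × 14 × 512 feature map into L annotation vectors of dimension D. The zero-filled array is only a stand-in for a real VGG activation, and the variable names are illustrative.

```python
import numpy as np

# Stand-in for a conv feature map of shape (height, width, channels).
feature_map = np.zeros((14, 14, 512), dtype=np.float32)

# Flatten the spatial grid so that each row is one annotation vector a_i.
a = feature_map.reshape(-1, feature_map.shape[-1])
L, D = a.shape
print(L, D)  # 196 512
```

With this row-major reshape, row `i` of `a` corresponds to the grid cell at row `i // 14`, column `i % 14`.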

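From a to z

At each time step t, the weight $α_{i}$ of each feature vector $a_{i}$ (written $α_{ti}$ when the time step is made explicit) is computed by the attention model $f_{att}$.
$e_{ti} = f_{att}(a_{i}, h_{t-1})$
$α_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$
Once the weights are computed, we can compute $\hat{z}_{t} = ϕ(\{a_{i}\}, \{α_{i}\})$
The paper considers two forms of the function $ϕ$: hard and soft.
The weight $α_{i}$ records how much attention is paid to each feature vector $a_{i}$.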
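
The attention step above can be sketched in NumPy as follows. This is a sketch under assumptions: the notes do not give the internals of $f_{att}$, so it is written here as a small one-hidden-layer MLP with hypothetical weights `W_a`, `W_h`, and `w`. For ϕ, the soft form is taken to be the weighted sum $\sum_{i} α_{i} a_{i}$ and the hard form a single $a_{i}$ sampled with probability $α_{i}$, following the definitions in the paper.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

def f_att(a, h_prev, W_a, W_h, w):
    # Hypothetical MLP: e_i = w . tanh(W_a a_i + W_h h_prev), for all i at once
    return np.tanh(a @ W_a.T + W_h @ h_prev) @ w

def soft_context(a, alpha):
    return alpha @ a  # weighted sum over the L regions, shape (D,)

def hard_context(a, alpha, rng):
    idx = rng.choice(len(alpha), p=alpha)  # sample one region index
    return a[idx]

rng = np.random.default_rng(0)
L, D, n_h, n_att = 196, 512, 8, 16
a = rng.normal(size=(L, D))          # annotation vectors a_1 .. a_L
h_prev = rng.normal(size=n_h)        # previous hidden state h_{t-1}
W_a = rng.normal(scale=0.01, size=(n_att, D))
W_h = rng.normal(scale=0.01, size=(n_att, n_h))
w = rng.normal(size=n_att)

e = f_att(a, h_prev, W_a, W_h, w)    # scores e_{ti}, shape (L,)
alpha = softmax(e)                   # weights alpha_{ti}, sum to 1
z_soft = soft_context(a, alpha)
z_hard = hard_context(a, alpha, rng)
print(alpha.shape, z_soft.shape, z_hard.shape)
```

The soft form is differentiable, so it can be trained with ordinary backpropagation. The hard form involves sampling and needs a different training procedure.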

From z to y

z is fed to the LSTM as input, and y is produced as the LSTM's output.
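
As a rough illustration of how an LSTM hidden state can be turned into the K-dimensional word probability $y_{t}$, here is a minimal sketch. The single linear projection (`W_out`, `b_out`) followed by a softmax is an assumption made for illustration; the notes do not describe the output layer, and the paper's own output layer may differ.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

def word_distribution(h_t, W_out, b_out):
    # Hypothetical output layer: project the hidden state to K logits,
    # then normalize into a probability vector over the vocabulary.
    return softmax(W_out @ h_t + b_out)

rng = np.random.default_rng(0)
K, n_h = 10, 8                       # toy vocabulary size and hidden size
h_t = rng.normal(size=n_h)           # stand-in for the LSTM hidden state
W_out = rng.normal(size=(K, n_h))
b_out = np.zeros(K)

y_t = word_distribution(h_t, W_out, b_out)
word_index = int(np.argmax(y_t))     # most probable word under this sketch
print(y_t.shape, word_index)
```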

Reference blog:
【图像理解】之Show, attend and tell算法详解 ([Image Understanding] A Detailed Explanation of the Show, Attend and Tell Algorithm)
