Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

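This is a NAACL 2018 paper on video captioning (a combination of CV and NLP). Paper link: https://arxiv.org/abs/1804.05448. The first author is a PhD student at the University of California, Santa Barbara (UCSB); the author's homepage is http://www.cs.ucsb.edu/~xwang/. The code has not been released yet (the author is not in the habit of releasing code).
Personal rambling: I read this paper mainly for two reasons.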

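1. His advisor, William Wang (http://www.cs.ucsb.edu/~william/), said on Weibo yesterday that their group has a video captioning paper that reaches state-of-the-art results.
2. I happened to come across this cross-modal paper while browsing arXiv yesterday, so it counts as a fairly recent paper.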

What the paper sets out to do (video captioning):
Input: video (frames + audio)　　　　　Output: a sentence
The paper gives an example of video captioning (see the example figure in the paper).
The paper reports a comparison against state-of-the-art methods (see the results table in the paper).

The paper also provides an ablation study (see the paper for the numbers).

Method

The paper proposes the hierarchically aligned cross-modal attention (HACA) framework (see the architecture figure in the paper).
The idea of the paper is to use an encoder-decoder approach that exploits both global and local features of the visual and audio modalities to predict the sentence.
　　encoder: ResNet visual features + VGGish audio features
　　decoder: global + local attentive decoder
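
To make "global and local features" a bit more concrete, here is a minimal NumPy sketch of one way an encoder could produce both from a single feature sequence: a low-level recurrent cell runs at every time step (local features), and a high-level cell runs once per chunk of steps (global features). This is only an illustration under my own assumptions, not the authors' code: a plain tanh RNN cell stands in for the LSTM, and the names `rnn_step`, `hierarchical_encode`, the chunk size `s`, and all dimensions are made up for the example.

```python
import numpy as np

def rnn_step(x, h, W_x, W_h):
    """One step of a plain tanh RNN cell (an assumed stand-in for an LSTM cell)."""
    return np.tanh(x @ W_x + h @ W_h)

def hierarchical_encode(features, s, hidden=8, seed=0):
    """Encode a (T, D) feature sequence at two temporal resolutions.

    The low-level cell runs at every time step; the high-level cell runs
    once after each chunk of `s` low-level steps, reading the latest
    low-level state. Assumes T >= s. Weights are random (untrained).
    """
    rng = np.random.default_rng(seed)
    T, D = features.shape
    W_lx = rng.normal(scale=0.1, size=(D, hidden))
    W_lh = rng.normal(scale=0.1, size=(hidden, hidden))
    W_hx = rng.normal(scale=0.1, size=(hidden, hidden))
    W_hh = rng.normal(scale=0.1, size=(hidden, hidden))

    h_low = np.zeros(hidden)
    h_high = np.zeros(hidden)
    local_states, global_states = [], []
    for t in range(T):
        h_low = rnn_step(features[t], h_low, W_lx, W_lh)
        local_states.append(h_low)
        if (t + 1) % s == 0:  # end of a chunk: advance the high-level cell
            h_high = rnn_step(h_low, h_high, W_hx, W_hh)
            global_states.append(h_high)
    return np.stack(local_states), np.stack(global_states)

# Hypothetical input: 12 time steps of 16-dimensional features.
frames = np.random.default_rng(1).normal(size=(12, 16))
local_feats, global_feats = hierarchical_encode(frames, s=4)
print(local_feats.shape, global_feats.shape)  # (12, 8) (3, 8)
```

With 12 steps and a chunk size of 4, the sketch yields 12 local states and 3 global states; the same routine could be applied separately to the visual and the audio feature sequences.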
　　
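A few points from the paper:
Attention Mechanism: take a weighted average over the features at every position (time step) of the sequence, and learn the weights of that weighted average.
Hierarchical Attentive Encoder: uses a low-level and a high-level encoder; the high-level encoder runs once for every s steps of the low-level encoder (I did not see an ablation study against a stacked two-layer LSTM), which yields the local and global features.
Globally and Locally Aligned Cross-modal Attentive Decoder: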

Global decoder: concatenates the global fusion context with the word embedding of the word previously generated from the global fusion feature.
Local decoder: concatenates the local fusion context with the word embedding of the word previously generated from the local fusion feature.

Finally, the contexts coming out of the global decoder and the local decoder are concatenated, and this concatenated feature is fed to a softmax to predict the sentence.
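
The points above (attention as a learned weighted average, one context per decoder, then concatenation and softmax) can be sketched for a single decoding step as follows. This is a rough NumPy illustration under my own assumptions, not the paper's implementation: the bilinear scoring function in `attend`, the single tanh layer standing in for each recurrent decoder, the random untrained weights, and all sizes (`H`, `Q`, `E`, `V`) are hypothetical. The cross-modal fusion of visual and audio contexts is left out, so each decoder here attends over one sequence only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def attend(states, query, W):
    """Soft attention: weighted average of `states` (T, H) given `query` (Q,).

    Scores use a bilinear form state @ W @ query, an assumption made for
    brevity; the weights sum to 1 over the T positions.
    """
    scores = states @ W @ query      # (T,)
    weights = softmax(scores)        # (T,)
    context = weights @ states       # (H,)
    return context, weights

rng = np.random.default_rng(0)
H, Q, E, V = 8, 6, 5, 10             # state, query, embedding, vocab sizes
global_states = rng.normal(size=(3, H))    # coarse (global) encoder states
local_states = rng.normal(size=(12, H))    # fine (local) encoder states
query = rng.normal(size=Q)                 # current decoder state
prev_word = rng.normal(size=E)             # embedding of the previous word

global_ctx, global_w = attend(global_states, query, rng.normal(size=(H, Q)))
local_ctx, local_w = attend(local_states, query, rng.normal(size=(H, Q)))

# Each decoder sees its context concatenated with the previous word embedding.
global_in = np.concatenate([global_ctx, prev_word])
local_in = np.concatenate([local_ctx, prev_word])

# Stand-in for the two recurrent decoders: one tanh layer each.
global_out = np.tanh(global_in @ rng.normal(scale=0.1, size=(H + E, H)))
local_out = np.tanh(local_in @ rng.normal(scale=0.1, size=(H + E, H)))

# Concatenate both decoder outputs and predict the next word with a softmax.
fused = np.concatenate([global_out, local_out])          # (2 * H,)
word_probs = softmax(fused @ rng.normal(size=(2 * H, V)))
print(word_probs.shape, int(np.argmax(word_probs)))
```

In a trained model the weight matrices would be learned parameters and this step would repeat for each word until an end-of-sentence token is produced.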

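Summary:

Adding attention on both the encoder side and the decoder side works well.
The hierarchical structure works fairly well. My personal feeling is that a hierarchical attentive encoder-decoder (autoencoder) structure like this should be well suited to processing sequences.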

