Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning

This is a CVPR 2018 Oral paper on visual dialog generation. Paper: https://arxiv.org/abs/1711.07613. Author homepage: http://qi-wu.me/home.html. The first author is an Assistant Professor in Chunhua Shen's group at the University of Adelaide. The code has not been released yet.
What the paper does:
Input: image + question (text)　　　Output: answer (text)
An example from the paper: [Figure: example visual dialog]
The paper also reports experimental results compared against state-of-the-art methods.

method

The framework of the paper works as follows.

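A CNN extracts the image features, and LSTMs encode the question, the answer, and the previous answers in the dialog history. The information is extracted with co-attention [ https://arxiv.org/abs/1612.05386 ]. The image, question, and history-answer features are then concatenated, and an LSTM followed by a softmax produces the answer to the current question.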
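The attend-then-concatenate step above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: plain dot-product attention stands in for the co-attention module, the feature vectors are made-up toy values, and the names `attend`, `image_regions`, and `history` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, items):
    """Dot-product attention: weight each item by its similarity to the query.

    Assumption: this is a simplified stand-in for co-attention.
    """
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]

# Toy features (hypothetical values): CNN region features, encoded history, encoded question.
image_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
history = [[0.2, 0.1], [0.9, 0.3]]
question = [1.0, 0.5]

attended_image = attend(question, image_regions)
attended_history = attend(question, history)

# Concatenate the attended features; a decoder LSTM with a softmax over the
# vocabulary would consume this joint vector to generate the answer.
joint = attended_image + attended_history + question
```

In a real model the query would also be refined by the attended image and history in turn, which this one-directional sketch leaves out.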
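To make the generated answers read the way a human would phrase them (a standard trick), the paper adds a GAN. The question and answer are first fed into an LSTM to obtain a new feature, and this new feature is then concatenated with the image and history-answer information (I do not understand why the four features are not simply concatenated directly). The concatenated feature is then fed into the GAN.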
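The two-stage feature construction described above can be sketched as below. This is only an illustration under assumptions: mean pooling stands in for the LSTM that encodes the question-answer pair, a single linear layer with a sigmoid stands in for the scoring network, and every name and number here is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_pool(vectors):
    """Stand-in for an LSTM encoder (assumption): average the token vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def discriminator_score(question_tokens, answer_tokens, image_feat,
                        history_feat, weights, bias):
    """Score how human-like an answer is, given its context."""
    # Step 1: encode question and answer together into one new feature.
    qa_feat = mean_pool(question_tokens + answer_tokens)
    # Step 2: concatenate that feature with the image and history features.
    joint = qa_feat + image_feat + history_feat
    # Step 3: map the joint feature to a probability in (0, 1).
    logit = sum(w * x for w, x in zip(weights, joint)) + bias
    return sigmoid(logit)

# Toy inputs (hypothetical values), all features are 2-dimensional.
question_tokens = [[0.1, 0.4], [0.3, 0.2]]
answer_tokens = [[0.5, 0.5]]
image_feat = [0.7, 0.1]
history_feat = [0.2, 0.6]
weights = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4]

score = discriminator_score(question_tokens, answer_tokens,
                            image_feat, history_feat, weights, bias=0.0)
```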
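To make the generated answers better suited to visual dialog (in practice this is a standard recipe for both visual dialog generation and pure dialog generation), the paper adds reinforcement learning. Two tricks are used: rewards are given at the word level (intermediate reward), and the generator is updated with teacher forcing [ https://arxiv.org/abs/1610.09038 ].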
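The two tricks can be illustrated with a small sketch of the corresponding losses. It assumes a REINFORCE-style policy-gradient objective with one reward per generated word, and a standard negative log-likelihood for the teacher-forcing update; the probabilities and rewards are made-up numbers, and the function names are hypothetical rather than taken from the paper.

```python
import math

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss with one (intermediate) reward per generated word."""
    return -sum(r * lp for lp, r in zip(log_probs, rewards))

def teacher_forcing_loss(gt_log_probs):
    """Negative log-likelihood of the ground-truth words, where the decoder is
    fed the ground-truth prefix instead of its own samples."""
    return -sum(gt_log_probs)

# Log-probabilities the generator assigned to the words it sampled (toy values).
sampled_log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
# Word-level rewards, e.g. scores for each partial answer (toy values).
word_rewards = [0.9, 0.2, 0.7]
# Log-probabilities of the ground-truth answer words under teacher forcing.
gt_log_probs = [math.log(0.6), math.log(0.7)]

rl_loss = policy_gradient_loss(sampled_log_probs, word_rewards)
mle_loss = teacher_forcing_loss(gt_log_probs)
```

With a single sentence-level reward, every word would receive the same credit; per-word rewards let a poorly chosen word be penalized without discounting the rest of the answer.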

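Summary: the paper uses a lot of tricks, but my impression is that none of them works especially well (hyperparameter tuning matters a lot).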
