Paper Reading: Instance Weighting in Dialogue Systems

A summary of three papers on instance weighting that I read recently.

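I. Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models (SIGDIAL 18)

This is the first paper to propose instance weighting here. The idea worth noting is that the weighting model is treated as a matching model. The key point is that this model can score examples for another network only because “high quality” pairs are selected heuristically and fed in as positive examples during training; that is what lets us treat its scores as “accurate”. This is crucial (at least intuitively).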

1. Motivation (seems reasonable):

Commonly used types of dialogue data: Twitter discussions (Ritter et al., 2010), online chat logs (Lowe et al., 2017), movie scripts (Danescu-Niculescu-Mizil and Lee, 2011), and movie and TV subtitles (Lison and Tiedemann, 2016).

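Several properties of the data itself strongly affect dialogue modelling:

① Many multi-turn dialogues, especially subtitles and chat logs, have no turn segmentation or speaker identification.

② Many specific entities in the dialogue data should not be learned by the model.

③ Dialogue quality is very uneven: many replies are meaningless, and in many situations a reply may not address what was actually asked.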

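In short, different pairs in dialogue data differ in quality, and we want the model to learn intrinsic dialogue patterns rather than just learning to mimic specific scenarios. The idea is therefore to use a weighting model to score each instance and multiply that score into the instance's loss, which rectifies the optimization.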
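To make "multiply the score into the loss" concrete, here is a minimal sketch. The function name `instance_weighted_loss` is my own, and averaging over the batch size (rather than, say, the sum of the weights) is an assumption; the notes above only say that the weight is multiplied onto each instance's loss.

```python
def instance_weighted_loss(per_example_losses, weights):
    """Mean of per-example losses, each scaled by its instance weight.

    Assumption: we divide by the number of examples; normalising by the
    sum of the weights would be another reasonable choice.
    """
    if len(per_example_losses) != len(weights):
        raise ValueError("need exactly one weight per example")
    total = sum(w * loss for w, loss in zip(weights, per_example_losses))
    return total / len(per_example_losses)


# Hypothetical numbers: the second pair is judged low quality (weight 0.1),
# so its large loss contributes little to the update.
losses = [0.2, 1.5, 0.9]
weights = [1.0, 0.1, 0.5]
weighted = instance_weighted_loss(losses, weights)
unweighted = instance_weighted_loss(losses, [1.0, 1.0, 1.0])
```

With all weights equal to 1 this reduces to the ordinary mean loss, so the weighting only changes how much each pair pulls on the gradient.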

2. Model structure

(1) Weighting model:

The two RNNs (drawn one above the other in the paper's figure) share parameters.

The selection of high-quality example pairs from a given corpus can be performed through a combination of simple heuristics.

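This weighting model is essentially a matching model, trained as a binary classifier with a sigmoid cross-entropy loss. The positive examples are true pairs that additionally pass the hand-engineered filters, while the negative examples are randomly sampled from the training corpus.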
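The training setup just described can be sketched as follows. This is only an illustration under my own assumptions: `build_weighting_examples` and `sigmoid_cross_entropy` are hypothetical names, one negative per positive is an arbitrary choice, and the heuristics that produce `high_quality_pairs` are taken as given.

```python
import math
import random


def build_weighting_examples(high_quality_pairs, corpus_responses, rng,
                             negatives_per_positive=1):
    """Label heuristically selected pairs 1 and random responses 0."""
    examples = []
    for context, response in high_quality_pairs:
        examples.append((context, response, 1))
        for _ in range(negatives_per_positive):
            # Random sampling may occasionally draw the true response;
            # a real pipeline might want to filter that case out.
            negative = rng.choice(corpus_responses)
            examples.append((context, negative, 0))
    return examples


def sigmoid_cross_entropy(logit, label):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
```

The matching model itself (the two parameter-sharing RNNs) would produce the `logit` for each (context, response) pair; it is left out here.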

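First, these heuristics have to be dataset-specific; they work well on a dataset with obvious defects, like the subtitles the authors use. Second, the negative examples here are simply random samples, which does no harm on a dataset where high-quality pairs are easy to tell apart.

But this setup probably does not help much in scenarios where the dataset has no real defects and the pairs merely differ naturally in quality.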

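(2) Retrieval model

① TF-IDF model

Each sentence is represented as a sparse bag-of-words vector of vocabulary size, each nonzero entry is replaced by its TF-IDF score, and the matching score is the cosine similarity between the two vectors.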
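A small pure-Python sketch of that scoring scheme, for illustration only. The helper names are mine, and the smoothed IDF formula is an assumption (the notes do not say which TF-IDF variant the paper uses); sparse vectors are stored as dicts from term to weight.

```python
import math
from collections import Counter


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    # Assumed variant: log((1 + N) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen when building idf are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

At retrieval time the context vector would be compared against every candidate response vector with `cosine`, and the candidates ranked by that score.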

② Dual Encoder (Lowe et al., 2017)

The architecture (shown in a figure in the paper) is a simple variant of the dual-encoder model.
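For reference, the scoring step of the original dual encoder of Lowe et al. can be sketched as below: the context and response are each encoded into a vector, and the match probability is a sigmoid over a bilinear form. This is not the modified variant used in the paper; the encoders are omitted, and `bilinear_match_probability` and its arguments are hypothetical names.

```python
import math


def bilinear_match_probability(context_vec, response_vec, M, bias=0.0):
    """sigmoid(c^T M r + b), with M given as a list of rows.

    context_vec and response_vec stand in for the final RNN states of
    the two encoders, which are assumed to be computed elsewhere.
    """
    # M r: one entry per row of M
    projected = [sum(m_ij * r_j for m_ij, r_j in zip(row, response_vec))
                 for row in M]
    logit = sum(c_i * p_i for c_i, p_i in zip(context_vec, projected)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy vectors with an identity matrix, so the logit is just a dot product.
identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = bilinear_match_probability([1.0, 0.0], [1.0, 0.0], identity)
orthogonal = bilinear_match_probability([1.0, 0.0], [0.0, 1.0], identity)
```

In training, M and the encoder parameters are learned jointly so that true (context, response) pairs get a probability close to 1.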

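3. Data and training

The retrieval models are trained on the full OpenSubtitles corpus. The heuristics select only 0.1% of it as “high quality” examples for training the weighting model. Embeddings are initialized with skip-gram word2vec vectors trained on the training set.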

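4. Discussion

Why not simply use the high-quality examples selected by the heuristics as a new training set?

For retrieval models: first, a gap between positives and negatives that is either too large or too small is bad, and that is the fundamental problem; second, retrieval models train quickly, so there is no need to throw away most of the data like this.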

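For generative models, the authors' suggestion is to use the weighting model to pick out the good examples (by thresholding, maybe) and then concentrate training on them. I doubt this works well either, since generation needs a large number of examples to train in the first place. In the paper's words: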

Filter out part of the training data to concentrate the training time on “interesting” examples with a high cohesion between the context and its response.
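The "by thresholding" reading is my guess, not something the paper spells out. Under that assumption, the filtering step would look roughly like this (the name `filter_by_weight` and the default threshold are hypothetical):

```python
def filter_by_weight(pairs, weights, threshold=0.5):
    """Keep only the pairs whose instance weight reaches the threshold."""
    if len(pairs) != len(weights):
        raise ValueError("need exactly one weight per pair")
    return [pair for pair, w in zip(pairs, weights) if w >= threshold]
```

Compared with multiplying the weight into the loss, this is a hard 0/1 decision, which is exactly why it throws away so much training data.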

I think generation still has to be handled the way the next paper does it.

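II. Learning to Converse with Noisy Data: Generation with Calibration (IJCAI 18)

A calibration network produces a matching score, which serves as the instance weight for the generative model. Both models are trained on the same training data, and the calibration net is not given any specially selected “high quality” examples. (Intuitively you would expect this not to work, and yet it does...)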

III. Learning Matching Models with Weak Supervision for Response Selection in Retrieval-based Chatbots
