paper reading--From Word Embeddings To Document Distances

Abstract

词嵌入（word embedding）：根据单词在语句中的局部共存性，学习单词语义层面的表示（semantically meaningful representations for words）。

单词移动距离（Word Mover’s Distance，WMD）：基于词嵌入，衡量文本文档（text documents）间距离的函数。WMD以一个文档的嵌入词移动至另一个文档的嵌入词的最小距离（the minimum amount of distance that the embedded words of one document need to “travel” to reach the embedded words of another document）作为两个文本文档间不相似度（dissimilarity）的度量。

introduction

文档表示的最常用的两种方法：
词袋模型（bag of words，BOW）；
词频逆文档频率（term frequency - inverse document frequency，TF-IDF）。
由于各文档的BOW（TF-IDF）表示通常近似正交性（frequent near-orthogonality），二者并不适于度量文档距离；另外，二者无法表示不同单词间的距离（not capture the distance between individual words）。

文档的低维隐含变量表示（a latent low-dimensional representation of documents）：

隐含语义索引（Latent Semantic Indexing，LSI）：对BOW特征空间进行特征分解；
主体模型（Latent Dirichlet Allocation，LDA）：将相似词按概率分配到不同的主题，并将文档表示这些主题的分布
通常，语义关系体现在词向量的运算上，即嵌入词向量间的距离能够表示语义。本文将文本文档表示为嵌入词的加权点云（a weighted point cloud of embedded words），文本文档A和B间的单词移动距离（Word Mover’s Distance，WMD）定义为：为匹配文档B的点云，文档A中的单词所需移动的最小累积距离（minimum cumulative distance）

word2vec：词嵌入过程，使用（浅层）神经网络语言模型学习单词的向量表示。

skip-gram模型：由输入层、投影层和输出层组成，用于预测相邻单词。通过最大化语料库中相邻单词的对数概率，训练各单词词向量，即给定单词序列 $w_{1},w_{2},\cdots ,w_{T}$ :
$\begin{aligned} \frac{1}{T}\sum_{t=1}^{T}\sum_{j\in nb(t)}log p(w_{j}|w{t}) \end{aligned}$
其中， $nb(t)$ 表示单词 $t$ 的相邻单词集合、 $p(w_{j}|w{t})$ 表示相应词向量, $v_{w_{j}}$ 和 $v_{w_{t}}$ 之间的层级归一化指数（hierarchical softmax）。由于结构简单和层级归一化指数，skip-gram能够使用台式机在数十亿单词上训练，因此能学到复杂的单词关系。

nBOW（n nnBOW representation）
向量 $\mathbf{d}$ 为 $n - 1$ 维单纯形（simplex），包含不同唯一词的两文档（different unique words）位于单纯形不同的区域中，但这两个文档的语义确可能相近。

词映射损失（word travel cost）

本文将单词对间的语义相似度包含进文档距离度量。单词不相似度通常采用在word2vec嵌入空间中的欧氏距离度量。单词 $i$ 和 $j$ 之间的距离为： $c(i, j) = \| \mathbf{x}_{i} - \mathbf{x}_{j}\|_{2}$ ，表示一个单词移动到另一个单词的代价

运输问题：
给定约束，将 $\mathbf{d}$ 移到 $\mathbf{d^{'}}$ 的最小加权累积代价为如下线性规化（linear program）的解：
$\begin{aligned} \mathop {\min }\limits_{\mathbf{T}\geqslant 0} \sum_{i.j=1}^{n}\mathbf{T}_{ij}c(i,j) \\ subject\; to: \; \sum_{j=1}^{n}\mathbf{T}_{ij}\; = \; d_{i} \quad \forall i \in \{1,\cdots, n\} \\ \sum_{j=1}^{n}\mathbf{T}_{ij}\; = \; d_{j}^{'} \quad \forall j \in \{1,\cdots, n\} \\ \end{aligned}$

松弛词移动距离（relaxed word moving distance）
通过放松WMD优化问题（relaxing the WMD optimization problem）并移除一个约束条件（removing one of the two constraints respectively），可以更紧致的下界（much tighter bounds）。

若移除第二个约束条件，优化问题为：
$\begin{aligned} \mathop {\min }\limits_{\mathbf{T}\geqslant 0} \sum_{i.j=1}^{n}\mathbf{T}_{ij}c(i,j) \\ subject\; to: \; \sum_{j=1}^{n}\mathbf{T}_{ij}\; = \; d_{i} \quad \forall i \in \{1,\cdots, n\} \end{aligned}$
由于WMD最优问题的解需要满足两个约束条件，移除一个后，解的可行域变大，因此松弛问题的解必为WMD距离的下界

最优流矩阵 $\mathbf{T}^{\ast}$ 为：
paper reading--From Word Embeddings To Document Distances
令 $\mathbf{T}$ 为松弛问题的任意可行解，对于任意单词i，其最近词为 $j^{\ast}=\argmin\limits_j c(i,j)$ ，则
$\begin{aligned} \sum_{j}\mathbf{T}_{ij}c(i,j)\ge \sum_{j}\mathbf{T}_{ij}c(i,j^{\ast}) = c(i,j^{\ast})\sum_{j}\mathbf{T}_{ij} = c(i,j^{\ast})d_{i} = \sum_{j}\mathbf{T}_{ij}^{\ast}c(i,j) \end{aligned}$
因此， $\mathbf{T^{\ast}}$ 必能生最小损失（a minimum objective value）。计算该解仅需确定 $j^{\ast}=\argmin\limits_j c(i,j)$ ，可在欧式word2vec空间中做最近邻搜索。对文档 $D$ 中的每个词向量 $x_{i}$ ，需要找到文档 $D^{'}$ 中的最相似的词向量 $x_{j}$ .
若移除第一个约束，最近邻搜索过程相反，即对文档 $D^{'}$ 中的每个词向量 $x_{j}$ ，需要找到文档 $D$ 中的最相似的词向量 $x_{i}$

令两个松弛解分别为 $l_{1}(d, d^{'})$ 、 $l_{2}(d, d^{'})$ ，通过取二者中的最大值，可得到更紧致的下界，称为松弛WMD（Relaxed WMD，RWMD）：
$\begin{aligned} l_{r}(d, d^{'})=max(l_{1}(d, d^{'}), l_{2}(d, d^{'})) \end{aligned}$

实验

数据集
paper reading--From Word Embeddings To Document Distances
文档分类

词嵌入（word embeddings）

reference:
Wu L , Yen I E H , Xu K , et al. Word Mover’s Embedding: From Word2Vec to Document Embedding[J]. 2018.

paper reading--From Word Embeddings To Document Distances

Abstract

introduction

实验

相关推荐