NLP-Task4——从one-hot到word2vec

词的表达

给定语料库 $\mathbb{D}=\left\{\mathcal{D}_{1}, \mathcal{D}_{2}, \cdots, \mathcal{D}_{N}\right\}$ ，其中包含N篇文档。
每篇文档 $\mathcal{D_i}$ 包含单词序列 $\left(\operatorname{word}_{I_{1}^{i}}, \operatorname{word}_{I_{2}^{i}}, \cdots, \operatorname{word}_{I_{n_{i}}^{i}}\right)$ ，其中 $I_{j}^{i} \in\{1,2, \cdots, V\}$ 表示单词的编号:
i表示第i篇文档
j表示文档中的第j个单词
$n_i$ 表示第i篇文档中包含 $n_i$ 个单词
$v=I^i_j$ 表示第i篇文档的第j个单词为 $\operatorname{word}_v$
所有单词来自于词汇表 $\mathbb{V}=\left\{\mathrm{word}_{1}, \text { word }_{2}, \cdots, \text { word }_{V}\right\}$ ，其中V表示词汇表的大小。
词的表达任务要解决的问题是：如何表达每个单词的 $\operatorname{word}_v$
最简单的表示方式是one-hot编码，对于词汇表的第v个单词 $\operatorname{word}_v$ 将其表示为: $\operatorname{word}_{v} \rightarrow(0,0, \cdots, 0,1,0, \cdots, 0)^{T}$
即第v位取值为1，剩余位取值为0.
这种表示方法有两个主要缺点：
无法表达单词之间的关系，对于任意一对单词 $\left(\text { word }_{i}, \text { word }_{j}\right)$ ，其向量的距离为 $\sqrt 2$ 。
单词维度过高。对于中文词汇表，其大小可能达到数十万，因此one-hot向量的维度也在数十万维，这对于存储、计算都消耗过大。
Bow:Bag of Words：词在文档中不考虑顺序，这称作词袋模型。

Word2Vec

CBOW模型

CBOW模型(Continuous bag-of-word)：根据上下文来预测一个单词。

一个单词上下文

网络结构

在一个单词上下文的CBOW模型中，输入的是前一个单词，输出是后一个单词；输入为输出的上下文，由于只有一个单词作为输入，因此称作一个单词的上下文。
一个单词上下文的CBOW模型如下：

其中：

N为隐层的大小，即隐向量 $\overrightarrow{\mathbf{h}} \in \mathbb{R}^{N}$ 。
网络输入 $\overrightarrow{\mathbf{x}}=\left(x_{1}, x_{2}, \cdots, x_{V}\right)^{T} \in \mathbb{R}^{V}$ ，它是输入单词（即上下文单词）的one-hot编码，其中只有一位为1，其他位都为0。
网络输出 $\overrightarrow{\mathbf{y}}=\left(y_{1}, y_{2}, \cdots, y_{V}\right)^{T} \in \mathbb{R}^{V}$ ，它是输出单词为词汇表个单词的概率。
相邻层之间为全连接：
- 输入层和隐层之间的权重矩阵为W，其尺寸 $V \times N$ 。
- 隐层和输出层之间的权重矩阵为 $\mathbf{W}^{\prime}$ ，其尺寸为 $N \times V$ 。

假设没有**函数，没有偏置项。给定输入 $\overrightarrow{\mathbf{x}} \in \mathbb{R}^{V}$ ,则其对应的隐向量 $\overrightarrow{\mathbf{h}} \in \mathbb{R}^{N}$ 为： $\overrightarrow{\mathbf{h}}=\mathbf{w}^{T} \overrightarrow{\mathbf{x}}$ 。令 $\mathbf{W}=\left[ \begin{array}{c}{\overrightarrow{\mathbf{w}}_{1}^{T}} \\ {\overrightarrow{\mathbf{w}}_{2}^{T}} \\ {\vdots} \\ {\overrightarrow{\mathbf{w}}_{V}^{T}}\end{array}\right]$
由于 $\overrightarrow{\mathbf{x}}$ 是个one-hot编码，假设它为此表 $\mathbb{V}$ 中第k个单词 $\operatorname{word}_k$ ，即：
$x_{1}=0, x_{2}=0, \cdots, x_{k-1}=0, x_{k}=1, x_{k+1}=0, \cdots, x_{V}=0$
则有： $\overrightarrow{\mathbf{h}}=\overrightarrow{\mathbf{w}}_{k}$ 。
即：W的第k行 $\overrightarrow{\mathbf{w}_k}$ 就是词表 $\mathbb{V}$ 中的第k个单词 $\operatorname{word}_k$ 的表达，称作单词 $\operatorname{word}_k$ 的输入向量。
给定隐向量 $\overrightarrow{\mathbf{h}}$ ，其对应的输出向量 $\overrightarrow{\mathbf{u}} \in \mathbb{R}^{v}$ 为： $\overrightarrow{\mathbf{u}}=\mathbf{W^{\prime}}^{T} \overrightarrow{\mathbf{h}}$
令： $\mathbf{W}^{\prime}=\left[\overrightarrow{\mathbf{w}}_{1}^{\prime}, \overrightarrow{\mathbf{w}}_{2}^{\prime}, \cdots, \overrightarrow{\mathbf{w}}_{V}^{\prime}\right]$
则有： $u_{j}=\overrightarrow{\mathbf{w}}_{ j}^{\prime} \cdot \overrightarrow{\mathbf{h}}$ ，表示词表 $\mathbb{V}$ 中，第j个单词 $\operatorname{word}_j$ 的得分。
$\overrightarrow{\mathbf{w}_j}^{\prime}$ 为矩阵 $\mathbf{W}^{\prime}$ 的第j列，称作单词 $\operatorname{word}_j$ 的输出向量。
$\overrightarrow{\mathbf{u}}$ 之后接入一层softmax层，则有：
$y_{j}=p\left(\text { word }_{j} | \overrightarrow{\mathbf{x}}\right)=\frac{\exp \left(u_{j}\right)}{\sum_{j^{\prime}=1}^{V} \exp \left(u_{j^{\prime}}\right)}, \quad j=1,2, \cdots, V$
即 $y_j$ 表示词汇表 $\mathbb{V}$ 中第j个单词 $\operatorname{word}_j$ 为真实输出单词的概率。
假设给定一个单词 $\operatorname{word}_{I}$ （它称作上下文），观测到它的下一个单词为 $\operatorname{word}_{O}$ 。
假设 $\operatorname{word}_{O}$ 对应的输出编号是 $j^{*}$ ，则网络的优化目标是：
$\begin{array}{c}{\max _{\mathbf{W}, \mathbf{W}^{\prime}} p\left(\operatorname{word}_{O} | \text { word }_{I}\right)=\max _{\mathbf{W}, \mathbf{W}^{\prime}} y_{j^{*}}=\max _{\mathbf{W}, \mathbf{W}^{\prime}} \log \frac{\exp \left(\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)}{\sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)}} \\ {=\max _{\mathbf{W}, \mathbf{W}^{\prime}}\left[\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}-\log \sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)\right]}\end{array}$
其中 $\overrightarrow{\mathbf{w}}_I$ 为输入单词 $\operatorname{word}_I$ 的输入向量。
考虑到 $u_{j}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}$ ，定义：
$\begin{array}{c}{E=-\log p\left(\operatorname{word}_{O} | \operatorname{word}_{I}\right)=-\left[\overrightarrow{\mathbf{w}}_{j^{*}}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}-\log \sum_{i=1}^{V} \exp \left(\overrightarrow{\mathbf{w}}_{i}^{\prime} \cdot \overrightarrow{\mathbf{w}}_{I}\right)\right]} \\ {=-\left[u_{j^{*}}-\log \sum_{i=1}^{V} \exp \left(u_{i}\right)\right]}\end{array}$
则优化目标： $\operatorname{min} E$

参数更新

定义 $t_j = \mathbb{I}(j=j^{*})$ ，即第j个输出单元对应于真实的输出单词 $\operatorname{word}_O$ 时，它为1，否则为0.
定义： $e_{j}=\frac{\partial E}{\partial u_{j}}=y_{j}-t_{j}$
它刻画了每个输出单元的预测误差：

当 $j=j^*$ 时： $e_j=y_j-1$ ，它刻画了输出概率（ $y_j$ ）与真实概率1之间的差距
当 $j \neq j^{*}$ 时： $e_j=y_j$ ，它刻画了输出概率与真实概率之间的差距

根据：
$u_{j}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{h}} \quad \rightarrow \quad \frac{\partial u_{j}}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=\overrightarrow{\mathbf{h}}$
则有：
$\frac{\partial E}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=\frac{\partial E}{\partial u_{j}} \times \frac{\partial u_{j}}{\partial \overrightarrow{\mathbf{w}}_{j}^{\prime}}=e_{j} \overrightarrow{\mathbf{h}}$
则 $\overrightarrow{\mathbf{w}_j}^{\prime}$ 更新规则为：
$\overrightarrow{\mathbf{w}}_{j}^{\prime(n e w)}=\overrightarrow{\mathbf{w}}_{j}^{\prime(o l d)}-\eta e_{j} \overrightarrow{\mathbf{h}}$
其物理意义为：

当估计过量（ $e_{j}>0 \rightarrow y_{j}>t_{j}$ ）时， $\overrightarrow{\mathbf{w}_j}^{\prime}$ 会减去一定比例的 $\overrightarrow{\mathbf{h}}$ 。
这发生在第j个输出单元不对应于真实的输出单元时。
当估计不足( $e_{j}<0 \rightarrow y_{j}<t_{j}$ )时， $\overrightarrow{\mathbf{w}_j}^{\prime}$ 会加上一定比例的 $\overrightarrow{\mathbf{h}}$ 。
这发生在第j个输出单元刚好对应于真实的输出单词时。
当 $y_{j} \simeq t_{j}$ 时，更新的幅度将非常微小。

定义： $\overrightarrow{\mathbf{E H}}=\frac{\partial E}{\partial \overrightarrow{\mathbf{h}}}=\left(\frac{\partial \overrightarrow{\mathbf{u}}}{\partial \overrightarrow{\mathbf{h}}}\right)^{T} \frac{\partial E}{\partial \overrightarrow{\mathbf{u}}}$
根据： $\overrightarrow{\mathbf{u}}=\mathbf{W}^{\prime T} \overrightarrow{\mathbf{h}} \quad \rightarrow \quad\left(\frac{\partial \overrightarrow{\mathbf{u}}}{\partial \overrightarrow{\mathbf{h}}}\right)^{T}=\mathbf{W}^{\prime}$
则有： $\overrightarrow{\mathbf{E H}}=\mathbf{W}^{\prime} \overrightarrow{\mathbf{e}}=\sum_{j=1}^{V} e_{j} \overrightarrow{\mathbf{w}}_{j}^{\prime}$
$\overrightarrow{\mathbf{EH}}$ 的物理意义为：词汇表 $\mathbb{V}$ 中所有单词的输出向量的加权和，其权重为 $e_j$ 。
考虑到 $\overrightarrow{\mathbf{h}}=\mathbf{W}^{T} \overrightarrow{\mathbf{x}}$ ，则有： $\frac{\partial E}{\partial w_{k, i}}=\frac{\partial E}{\partial h_{i}} \times \frac{\partial h_{i}}{\partial w_{k, i}}=E H_{i} \times x_{k}$
由于 $\overrightarrow{\mathbf{x}}$ 是one-hot编码，所以它只有一个分量非零，因此 $\frac{\partial E}{\partial \mathbf{W}}$ 只有一行非零，且该非零行就等于 $\overrightarrow{\mathbf{EH}}$ 。因此更新方程：
$\overrightarrow{\mathbf{w}}_{I}^{(n e w)}=\overrightarrow{\mathbf{w}}_{I}^{(o l d)}-\eta \overrightarrow{\mathbf{E} \mathbf{H}}$
其中 $\overrightarrow{\mathbf{w}}_{I}$ 为非零分量对应的W中的行，而W的其他行在本次更新中都保持不变。
考虑更新行的第k列，则：
$w_{I, k}^{(n e w)}=w_{I, k}^{(o l d)}-\eta \sum_{j=1}^{V} e_{j} w_{j, k}^{\prime}$
当 $y_{j} \simeq t_{j}$ 时，更新的幅度非常微小。
当 $y_j$ 与 $t_j$ 差距越大，则更新的幅度越大。
当给定许多训练样本，每个样本由两个单词组成，上述更新不断进行，更新的效果不断积累。

根据单词的共现效果，输出向量与输入向量相互作用并达到平滑。
- 输出向量 $\overrightarrow{\mathbf{w}}^{\prime}$ 的更新依赖于输入向量 $\overrightarrow{\mathbf{w}}_{I} : \quad \overrightarrow{\mathbf{w}}_{j}^{\prime(n e w)}=\overrightarrow{\mathbf{w}}_{j}^{\prime(o l d)}-\eta e_{j} \overrightarrow{\mathbf{h}}$ 。
  这里隐向量 $\overrightarrow{\mathbf{h}}$ 等于输入向量 $\overrightarrow{\mathbf{w}}_I$ 。
- 输入向量 $\overrightarrow{\mathbf{w}}_I$ 的更新依赖于输出向量 $\overrightarrow{\mathbf{w}}^{\prime} : \quad \overrightarrow{\mathbf{w}}_{I}^{(n e w)}=\overrightarrow{\mathbf{w}}_{I}^{(o l d)}-\eta \overrightarrow{\mathbf{E} \mathbf{H}}$ 。
  这里 $\overrightarrow{\mathbf{E H}}=\sum_{j=1}^{V} e_{j} \overrightarrow{\mathbf{w}}_{j}^{\prime}$ 为词汇表 $\mathbb{V}$ 中所有单词的输出向量的加权和，其权重为 $e_j$ 。
平衡的速度与效果取决于单词的共现分布，以及学习率。

Skip-Gram

CBOW模型是根据前几个单词（即上下文）来预测下一个单词，而Skip-Gram模型是根据一个单词来预测前几个单词（即上下文）。
在CBOW模型中：

同一个单词的表达（即输入向量 $\overrightarrow{\mathbf{w}}_I$ ）是相同的，因为参数 $\mathbf{W}$ 是共享的。
同一个单词的输出向量 $\overrightarrow{\mathbf{w}}_{O}^{\prime}$ 是不同的，因为输入向量随着上下文不同而不同。

在Skip-Gram模型中：

同一个单词的表达（即输入向量 $\overrightarrow{\mathbf{w}}_{O}^{\prime}$ ）是相同的，因为参数 $\mathbf{W}^{\prime}$ 是共享的
同一个单词的输入向量 $\overrightarrow{\mathbf{w}}_{I}$ 是不同的，因为输入向量随着上下文不同而不同。

网络结构

Skip-Gram网络模型如下。其中：

网络输入 $\overrightarrow{\mathbf{x}}=\left(x_{1}, x_{2}, \cdots, x_{V}\right)^{T} \in \mathbb{R}^{V}$ ，它是输入单词的one-hot编码，其中只有一位为1，其他都为0。
网络输出 $\overrightarrow{\mathbf{y}}_{1}, \overrightarrow{\mathbf{y}}_{2}, \cdots, \overrightarrow{\mathbf{y}}_{C}$ ，其中 $\overrightarrow{\mathbf{y}}_{c}=\left(y_{1}^{c}, y_{2}^{c}, \cdots, y_{V}^{c}\right)^{T} \in \mathbb{R}^{V}$ 是第c个输出单词为词汇表各单词的概率。
对于网络中的每个输出 $\overrightarrow{y_c}$ ，其权重矩阵都相同，为 $W^{\prime}$ ，这称作权重共享。
这里的权重共享隐含着：每个单词的输出向量是固定的、唯一的，与其他单词的输出无关。

Skip-Gram网络模型中，设网络第c个输出的第j个分量为 $u_{j}^{c}=\overrightarrow{\mathbf{w}}_{j}^{\prime} \cdot \overrightarrow{\mathbf{h}}$ ，则有：
$y_{j}^{c}=p\left(\operatorname{word}_{j}^{c} | \overrightarrow{\mathbf{x}}\right)=\frac{\exp \left(u_{j}^{c}\right)}{\sum_{k=1}^{V} \exp \left(u_{k}^{c}\right)} ; \quad c=1,2, \cdots, C ; \quad j=1,2, \cdots, V$
$y_{j}^{c}$ 表示第c个输出中，词汇表 $\mathbb{V}$ 中第j个单词 $\operatorname{word}_j$ 为真实输出单词的概率。
因为 $W^{\prime}$ 在多个单元之间共享，所以对于网络每个输出，其得分分布 $\overrightarrow{\mathbf{u}}_{c}=\left(u_{1}^{c}, u_{2}^{c}, \cdots, u_{V}^{c}\right)^{T}$ 是相同的，但是这并不意味着网络的每个输出都是同一个单词。

并不是网络每个输出中，得分最高的为预测单词。因为每个输出中，概率分布都相同，即 $\overrightarrow{\mathbf{y}}_{1}=\overrightarrow{\mathbf{y}}_{2}=\cdots=\overrightarrow{\mathbf{y}}_{C}$ 。
Skip-Grame网络的目标是：网络的多个输出之间的联合概率最大。

假设输入为单词 $\operatorname{word}_I$ ，输出单词序列为 $\operatorname{word}_{O_{1}}, \operatorname{word}_{O_{2}}, \cdots,$ word $o_{C}$ 。定义损失函数为： $E=-\log p\left(\operatorname{word}_{O_{1}}, \operatorname{word}_{O_{3}}, \cdots, \operatorname{word}_{O_{C}} | \text { word }_{I}\right)=-\log \prod_{c=1}^{C} \frac{\exp \left(u_{j_{c}^{*}}^{e}\right)}{\sum_{k=1}^{V} \exp \left(u_{k}^{c}\right)}$
其中 $j_{1}^{*}, j_{2}^{*}, \cdots, j_{C}^{*}$ 为输出单词序列对应于词典 $\mathbb{V}$ 中的下标序列。
由于网络每个输出得分分布都相同，令 $u_{k}=u_{k}^{c}=\overrightarrow{\mathbf{w}}_{k}^{\prime} \cdot \overrightarrow{\mathbf{h}}$ ，则上式化简为
$E=-\sum_{c=1}^{C} u_{j_{c}}^{c}+C \log \sum_{k=1}^{V} \exp \left(u_{k}\right)$

NLP-Task4——从one-hot到word2vec

词的表达

Word2Vec

CBOW模型

一个单词上下文

网络结构

参数更新

Skip-Gram

网络结构

相关推荐