【cs224n-12】Modeling contexts of use: Contextual Representations and Pretraining. ELMo and BERT.

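The previous post explained the difference between static word embeddings and contextual (dynamic) representations, and surveyed how pretrained models built on contextual representations developed. This post fills in the details.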

1. Representations for a word

We can get a representation for a word from the familiar static word-embedding models (Word2vec, GloVe, fastText) and apply it to downstream tasks.

Tips for unknown words with word vectors

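Simple and common solution

Training time: the vocabulary Vocab is { words occurring, say, ≥5 times } ∪ {<UNK>}. Map all rarer words (fewer than 5 occurrences in the dataset) to <UNK> and train a word vector for it. Runtime: use <UNK> in place of out-of-vocabulary (OOV) words.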
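
The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the helper names `build_vocab` and `encode` and the toy corpus are made up for the example.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokens, min_count=5):
    """Keep words seen at least min_count times; all other words share <UNK>."""
    counts = Counter(tokens)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    # Index 0 is reserved for <UNK> so that it gets its own trainable vector.
    return {w: i for i, w in enumerate([UNK] + words)}

def encode(tokens, vocab):
    """Map tokens to ids, sending out-of-vocabulary words to <UNK>."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]

# Toy corpus invented for illustration: "sat" is too rare to keep.
corpus = ["the"] * 6 + ["cat"] * 5 + ["sat"] * 2
vocab = build_vocab(corpus)
ids = encode(["the", "cat", "sat", "zebra"], vocab)
print(vocab)
print(ids)
```

Both the rare training word "sat" and the unseen word "zebra" end up with the same id, which is exactly the weakness discussed next.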

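Problem

There is no way to distinguish different UNK words, either by identity or by meaning.

Solutions

Use character-level models to build word vectors. Especially in QA, matching on word identity is important, even for words outside the word-vector vocabulary.

Try these tips (from Dhingra, Liu, Salakhutdinov, Cohen 2017): if a word that would be <UNK> at test time is not in your vocabulary but does appear in the unsupervised word embeddings you use, use that vector as is at test time. For any other word, just assign it a random vector and add it to your vocabulary.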
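
A rough sketch of these two tips follows. The function name `extend_vocab`, the list-based embedding table, and the 0.1 scale of the random vectors are assumptions made for the example, not details from the paper.

```python
import numpy as np

def extend_vocab(word, vocab, emb, pretrained, rng):
    """Return the id of `word`, adding a vector for it if it is new.

    vocab: dict word -> id, emb: list of vectors indexed by id,
    pretrained: dict word -> vector from unsupervised embeddings.
    """
    if word in vocab:
        return vocab[word]
    if word in pretrained:
        vec = pretrained[word]  # tip 1: reuse the pretrained vector as is
    else:
        # tip 2: a fresh random vector (the scale is an arbitrary choice)
        vec = rng.normal(scale=0.1, size=emb[0].shape)
    vocab[word] = len(emb)
    emb.append(vec)
    return vocab[word]

rng = np.random.default_rng(0)
vocab = {"<UNK>": 0, "the": 1}
emb = [np.zeros(3), np.ones(3)]
pretrained = {"zebra": np.array([0.1, 0.2, 0.3])}  # toy pretrained table

zebra_id = extend_vocab("zebra", vocab, emb, pretrained, rng)
blorp_id = extend_vocab("blorp", vocab, emb, pretrained, rng)
```

After these calls the two unseen words have distinct ids and distinct vectors, so a model can at least match on word identity.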

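Representing each word with a single vector has two big problems

1. A word type always gets the same representation, regardless of the context in which a word token occurs. Take the word star, which has an astronomical sense and an entertainment-industry sense. (We might want very fine-grained word sense disambiguation.)

2. We have only one representation per word, but words have different aspects, including semantics, syntactic behavior, and register/connotation. (Register: the same meaning can be expressed with several different words that share a word sense.)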

Did we all along have a solution to this problem?

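In a neural language model (NLM), we feed word vectors (perhaps trained only on the corpus) straight through LSTM layers.
Those LSTM layers are trained to predict the next word.
But that means these language models produce a context-specific word representation at every position.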

2. Pre-ELMo and ELMo

2.1 Peters et al. (2017): TagLM – “Pre-ELMo”

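Idea: we want the meaning of a word in context, but a standard RNN is trained only on small task-labeled data (e.g., NER).
Why not take a semi-supervised approach and train an NLM, not just word vectors, on a large unlabeled corpus?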

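Context-independent word embeddings plus the hidden states of an RNN language model are used as input features.
The char CNN / RNN output plus the token embedding is the input to the first bi-LSTM layer.
The resulting hidden states are concatenated with the hidden states of the pre-trained (frozen) bi-LM and fed into the second bi-LSTM layer.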

2.2 Peters et al. (2018): ELMo

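The breakout version of word token vectors, also called contextual word vectors.
Learn word token vectors using long contexts rather than context windows (here, the whole sentence, which could be longer).
Learn a deep Bi-NLM and use all of its layers in prediction.
Train a bidirectional LM.
Aim for a performant LM that is not overly large.
ELMo learns a task-specific combination of the biLM representations.
This is an innovation: TagLM used only the top layer of the stacked LSTM, whereas ELMo treats all biLSTM layers as useful.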
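
The task-specific combination can be written as a softmax-weighted sum of the layer outputs times a task scale, ELMo_k = γ · Σ_j s_j · h_{k,j}. Below is a minimal NumPy sketch of that formula; the function name `elmo_combine` and the toy layer states are made up for illustration, and in practice the weights and γ would be learned together with the downstream task.

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma=1.0):
    """Collapse biLM layers into one vector per token.

    layer_states: (num_layers, seq_len, dim) hidden states from the biLM.
    s_logits:     (num_layers,) unnormalized, task-specific layer weights.
    gamma:        task-specific scalar that rescales the whole vector.
    """
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()  # softmax-normalized layer weights
    # Weighted sum over the layer axis -> (seq_len, dim)
    return gamma * np.tensordot(s, layer_states, axes=1)

# Toy example: 3 layers whose states are all 0, all 1, and all 2.
layer_states = np.stack([np.full((4, 2), v) for v in (0.0, 1.0, 2.0)])
combined = elmo_combine(layer_states, s_logits=np.zeros(3))
print(combined.shape)
```

With equal logits every layer gets weight 1/3, so the result is the plain average of the layers. A task that needs more syntax can shift weight to the lower layers.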

ELMo: Weighting of layers

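The two biLSTM NLM layers have differentiated uses/meanings
- The lower layer is better for lower-level syntax, e.g.
  - part-of-speech tagging, syntactic dependencies, NER
- The higher layer is better for higher-level semantics
  - sentiment, semantic role labeling, question answering, SNLI
This seems interesting, but it would be even more interesting to see how it pans out with more than two layers.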

3. ULMfit

Train an LM (a biLM) on a large general-domain unsupervised corpus
Tune the LM on the target-task data
Fine-tune it as a classifier on the target task

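ULMfit emphases

Use a reasonably sized "1 GPU" language model, not a really huge one
Take a lot of care in LM fine-tuning
- Different per-layer learning rates
- Slanted triangular learning rate (STLR) schedule
- Gradual layer unfreezing and STLR when learning the classifier
- Classify using the concatenation [h_T, maxpool(h), meanpool(h)]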
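
Two of these items are easy to sketch. The STLR formula and its default values (cut_frac = 0.1, ratio = 32, lr_max = 0.01) are quoted from memory of the ULMFiT paper, so treat them as assumptions. The names `stlr` and `concat_pool` are made up for this example.

```python
import math
import numpy as np

def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at step t out of T total steps.

    Short linear warm-up to lr_max, then a long linear decay
    back down to lr_max / ratio.
    """
    cut = math.floor(T * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def concat_pool(h):
    """[h_T, maxpool(h), meanpool(h)] for hidden states h of shape (seq_len, dim)."""
    return np.concatenate([h[-1], h.max(axis=0), h.mean(axis=0)])

schedule = [stlr(t, T=100) for t in range(101)]
h = np.array([[1.0, 4.0],
              [3.0, 2.0]])  # toy hidden states: 2 time steps, dim 2
features = concat_pool(h)
print(max(schedule), features)
```

The pooled feature vector is three times the hidden size, and it lets the classifier see information from every time step rather than only the final state.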

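Using a large pretrained language model is a very effective way to improve performance.
A text classifier trained on supervised data alone needs a large amount of data to learn well.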

4. Transformer models

Transformers are not only very powerful, they also scale up to much larger sizes.

The Motivation for Transformers

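We want parallelization, but RNNs are inherently sequential.
Despite GRUs and LSTMs, RNNs still need an attention mechanism to handle long-range dependencies; otherwise the path length between states grows with the sequence.
But if attention gives us access to any state... maybe we can just use attention and don't need the RNN?

The basic building blocks of the Transformer: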

Dot-Product Attention (Extending our previous def.)

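Inputs: a query q and a set of key-value (k-v) pairs, which are mapped to an output
Query, keys, values, and output are all vectors
The output is a weighted sum of the values
The weight of each value is computed from the inner product of the query and the corresponding key
Queries and keys have the same dimensionality d_k; values have dimensionality d_v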
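
As a concrete illustration, here is a minimal NumPy sketch of attention for a single query, with the inner products turned into weights by a softmax. The function name `attend` and the toy vectors are invented for the example.

```python
import numpy as np

def attend(q, K, V):
    """Dot-product attention for one query.

    q: (d_k,) query, K: (n, d_k) keys, V: (n, d_v) values.
    Returns the (d_v,) output and the (n,) attention weights.
    """
    scores = K @ q                           # one inner product per key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights              # weighted sum of the values

q = np.array([1.0, 0.0])
K = np.array([[10.0, 0.0],
              [0.0, 10.0]])
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
out, weights = attend(q, K, V)
print(weights, out)
```

The query lines up with the first key, so almost all of the weight lands on the first value and the output is close to V[0]. Note that d_k = 2 and d_v = 3 here, which shows that the two dimensionalities need not match.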

Dot-Product Attention – Matrix notation

When we have multiple queries q, we stack them into a matrix Q
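
Under the standard definition, with the softmax applied row-wise, stacking the queries gives the following matrix form. The shape annotations use |Q| and |K| for the number of queries and keys, which is a notational choice for this note.

```latex
A(Q, K, V) = \operatorname{softmax}\!\left(Q K^{\top}\right) V,
\qquad
Q \in \mathbb{R}^{|Q| \times d_k},\quad
K \in \mathbb{R}^{|K| \times d_k},\quad
V \in \mathbb{R}^{|K| \times d_v},\quad
A(Q, K, V) \in \mathbb{R}^{|Q| \times d_v}
```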

Scaled Dot-Product Attention

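Problem: as d_k gets large, the variance of q^T k increases → some values inside the softmax get large → the softmax becomes very peaked → so its gradient gets smaller
Solution: scale by the size of the query/key vectors, i.e. divide the dot products by √d_k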
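
The effect of the scaling can be checked numerically. The sketch below assumes query and key entries drawn independently from a standard normal distribution, which is an idealization. The function names and sizes are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_k = 256
Q = rng.normal(size=(200, d_k))
K = rng.normal(size=(200, d_k))
V = rng.normal(size=(200, 16))

raw_var = (Q @ K.T).var()                     # grows roughly like d_k
scaled_var = (Q @ K.T / np.sqrt(d_k)).var()   # stays near 1
out = scaled_dot_product_attention(Q, K, V)
print(raw_var, scaled_var, out.shape)
```

With unit-variance entries the raw dot products have variance close to d_k, and dividing by √d_k brings it back to about 1, which keeps the softmax out of its saturated, small-gradient regime.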

Self-attention in the encoder

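The input word vectors are the queries, keys, and values
In other words, the word vectors themselves select each other
Word vector stack = Q = K = V
We'll see in the decoder why the definition keeps them separate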

Multi-head attention

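Problem with simple self-attention: there is only one way for words to interact with one another
Solution: multi-head attention
First map Q, K, V into h = 8 lower-dimensional spaces via W matrices
Then apply attention, concatenate the outputs, and pass them through a linear layer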
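
These steps can be sketched as follows for the self-attention case, where the same stack of word vectors X supplies the queries, keys, and values. The random weight matrices, d_model = 64, and the function names are placeholders chosen for the example; a trained model would learn the W matrices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Multi-head self-attention over X of shape (n, d_model).

    W_q, W_k, W_v, W_o: (d_model, d_model) projection matrices.
    """
    n, d_model = X.shape
    d_head = d_model // h  # each head works in a lower-dimensional space

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of features per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores) @ V                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # final linear layer

rng = np.random.default_rng(0)
n, d_model = 5, 64
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model))
                      for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Each head computes its own (n, n) attention pattern, so the words get h different ways to interact, while the output keeps the input shape.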

Complete transformer block

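Each block has two "sublayers"

Multi-head attention
A two-layer feed-forward network with ReLU

Each of these two sublayers also has:

A residual connection and layer normalization
- LayerNorm(x+Sublayer(x))
- Layer normalization rescales the input to mean 0 and variance 1, per layer and per training point (and adds two more parameters)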
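
A minimal sketch of LayerNorm(x + Sublayer(x)), using the feed-forward sublayer as the example. The two extra parameters appear here as `gain` and `bias`. Those names, the small `eps` constant, and the layer sizes are assumptions made for the illustration.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each position's features to mean 0, variance 1, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward network with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer_connection(x, sublayer, gain, bias):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x), gain, bias)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(4, d_model))  # 4 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
gain, bias = np.ones(d_model), np.zeros(d_model)

y = sublayer_connection(x, lambda t: feed_forward(t, W1, b1, W2, b2), gain, bias)
print(y.mean(axis=-1), y.var(axis=-1))
```

With gain 1 and bias 0 every position comes out with mean 0 and variance close to 1. The same wrapper would be applied around the multi-head attention sublayer.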

Complete Encoder

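The actual word representations are byte-pair encodings
A positional encoding is also added, so the same word at different positions gets a different overall representation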
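
The notes do not say which positional encoding is meant. The sketch below assumes the fixed sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), and assumes an even d_model. The all-ones "word" vector is a stand-in for a learned byte-pair embedding.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even feature indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=16)
word = np.ones(16)  # stand-in embedding for one token
rep_at_3 = word + pe[3]
rep_at_7 = word + pe[7]
print(np.abs(rep_at_3 - rep_at_7).max())
```

The same token embedding ends up with a different overall representation at position 3 than at position 7, which is what lets an order-agnostic attention layer tell positions apart.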

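In the encoder, each block takes its Q, K, and V from the previous layer
Blocks are repeated 6 times (vertically)
At each stage, multi-head attention lets you look at every position in the sentence, accumulate information, and push it up to the next layer. Information is pushed along the sequence step by step in either direction to compute the values of interest
This turns out to be very good at learning language structure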

Transformer Decoder

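The decoder has two slightly modified sublayers
Masked decoder self-attention over the previously generated outputs
Encoder-Decoder Attention, where the queries come from the previous decoder layer and the keys and values come from the encoder output
The blocks are likewise repeated 6 times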
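
The masking in the first decoder sublayer can be illustrated with a small NumPy sketch. To keep it short, the learned Q/K/V projections and the multiple heads are left out, so X is used directly as queries, keys, and values. That simplification and the function name are choices made for this example.

```python
import numpy as np

def masked_self_attention(X):
    """Self-attention in which position i can only attend to positions <= i.

    X: (n, d) decoder inputs. Returns the (n, d) outputs and (n, n) weights.
    """
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, weights = masked_self_attention(X)
print(np.round(weights, 2))
```

The weight matrix is lower triangular, so the first position can only look at itself. Encoder-decoder attention uses no such mask. It differs only in where the inputs come from, which is why the attention definition keeps queries separate from keys and values.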

Tips and tricks of the Transformer

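Byte-pair encodings
Checkpoint averaging
Adam optimization with a learning rate that changes over the course of training
Dropout during training at every layer, applied just before adding the residual
Label smoothing
Auto-regressive decoding with beam search and length penalties
Use of Transformers is spreading, but they are hard to optimize, do not work out of the box the way LSTMs do, and do not yet play well with other building blocks on tasks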
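
Of the tricks listed above, label smoothing is the simplest to show in code. This sketch uses the common uniform variant, where a fraction eps of the probability mass is spread evenly over all classes. The value eps = 0.1 and the function name are assumptions for the example, not something stated in these notes.

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Turn integer class targets into smoothed target distributions.

    Each row keeps 1 - eps on the true class and spreads eps
    uniformly over all num_classes classes.
    """
    one_hot = np.eye(num_classes)[targets]
    return one_hot * (1 - eps) + eps / num_classes

targets = np.array([2, 0])  # toy target ids
smoothed = smooth_labels(targets, num_classes=4)
print(smoothed)
```

Training against these softer targets discourages the model from becoming overconfident in a single output token.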

5. BERT: Devlin, Chang, Lee, Toutanova (2018)

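Problem: language models use only left context or only right context, but language understanding is bidirectional
Why are LMs unidirectional?
Reason 1: directionality is needed to generate a well-formed probability distribution
Reason 2: words can "see themselves" in a bidirectional encoder
Solution: mask out k% of the input words, then predict the masked words
This is no longer a conventional language model that computes the probability of a sentence; the objective is to fill in the blanks
- Always use k = 15%
- Too little masking: too expensive to train
- Too much masking: not enough context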
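
The masking step can be sketched as below. This is a simplified illustration: the `[MASK]` placeholder string, the function name, the fixed seed, and the example sentence are choices made here, and every selected position is simply replaced by the placeholder.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, k=0.15, seed=0):
    """Hide about k of the tokens; return the masked sequence and the targets.

    targets maps each masked position to the original word to be predicted.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * k))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets

tokens = ("the man went to the store and bought a gallon of milk "
          "for his two kids on his way home").split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

With 20 tokens and k = 15%, three positions are hidden. The model sees the full left and right context of each blank and is trained to recover only the hidden words.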

Reference: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture14-contextual-representations.pdf
