BERT paper reading notes: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on 11 natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement), and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0%.

1 Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

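We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.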
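
Reading note: the left-to-right restriction can be pictured as a visibility mask over token positions. The sketch below is only an illustration written for these notes, not code from the paper; the function names `causal_mask` and `bidirectional_mask` and the example sentence are assumptions. A 1 at row i, column j means that position i may attend to position j.

```python
def causal_mask(n):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


# Hypothetical example sentence, used only to make the masks readable.
tokens = ["the", "cat", "sat", "down"]
n = len(tokens)

for name, mask in [("left-to-right", causal_mask(n)),
                   ("bidirectional", bidirectional_mask(n))]:
    print(name)
    for tok, row in zip(tokens, mask):
        # Collect the tokens that this position is allowed to see.
        visible = [t for t, m in zip(tokens, row) if m]
        print(f"  {tok:>5} attends to {visible}")
```

With the left-to-right mask, the first token sees only itself, so any evidence that appears later in the sentence is unavailable to it. This is the limitation the paragraph above describes for token-level tasks.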

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraint by proposing a new pre-training objective: the "masked language model" (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a "next sentence prediction" task that jointly pre-trains text-pair representations.
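
Reading note: the masking step described above can be sketched in a few lines. This is a simplified illustration and not the paper's procedure: the helper name `mask_for_mlm`, the `[MASK]` placeholder string, the toy vocabulary, and the masking probability are all assumptions of this sketch, and every selected token is simply replaced by the placeholder.

```python
import random

MASK = "[MASK]"  # placeholder symbol; assumed for this sketch


def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a masked-language-model example.

    labels[i] holds the original vocabulary id at a masked position and
    None elsewhere, so a loss would be computed on masked positions only.
    """
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)        # hide the token from the model
            labels.append(vocab[tok])  # target: the original vocabulary id
        else:
            masked.append(tok)         # token stays visible
            labels.append(None)        # no prediction target here
    return masked, labels


# Hypothetical toy sentence and vocabulary.
tokens = "my dog is hairy and my cat is not".split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

masked, labels = mask_for_mlm(tokens, vocab, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the prediction target at a masked position can draw on visible tokens on both sides, the objective does not force a left-to-right reading order.
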
The contributions of our paper are as follows:

We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.
We show that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.
BERT advances the state-of-the-art for 11 NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.

2 Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension. They propose to extract context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state-of-the-art for several major NLP benchmarks (Peters et al., 2018), including question answering on SQuAD (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Fine-tuning Approaches

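A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).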
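
Reading note: a rough count shows how few parameters a single added output layer contributes. The sketch assumes the added layer is one linear classifier on top of the encoder output; the hidden size of 768, the encoder size of 110 million parameters, and the helper name `new_head_parameters` are illustrative assumptions, not figures taken from this section.

```python
def new_head_parameters(hidden_size, num_labels):
    """Parameters of one linear output layer: weight matrix plus bias vector."""
    return hidden_size * num_labels + num_labels


hidden_size = 768                # assumed encoder width, for illustration
pretrained_params = 110_000_000  # assumed size of the pre-trained encoder

for num_labels in (2, 3):
    new = new_head_parameters(hidden_size, num_labels)
    share = new / pretrained_params
    print(f"{num_labels} labels: {new} parameters learned from scratch "
          f"({share:.6%} of the pre-trained parameters)")
```

Under these assumptions, only the output layer starts from random initialization; everything else begins from the pre-trained values and is merely adjusted during fine-tuning.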

2.3 Transfer Learning from Supervised Data

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

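Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.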

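3 BERT (Bidirectional Encoder Representations from Transformers)

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedure and the fine-tuning procedure are detailed in Sections 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.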