NLP models: BERT architecture explained with attention and transformers (translated from: Deconstructing BERT)

Original article: Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters

The year 2018 marked a turning point for the field of Natural Language Processing, with a series of deep-learning models achieving state-of-the-art results on NLP tasks ranging from question answering to sentiment classification. Most recently, Google’s BERT algorithm has emerged as a sort of “one model to rule them all,” based on its superior performance over a wide variety of tasks.

  • 2018 was a turning point for NLP: deep-learning models reached state-of-the-art results on tasks ranging from question answering to sentiment classification.
  • Most recently, Google's BERT emerged as a sort of "one model to rule them all", based on its strong performance across a wide variety of tasks.

BERT builds on two key ideas that have been responsible for many of the recent advances in NLP: (1) the transformer architecture and (2) unsupervised pre-training. The transformer is a sequence model that forgoes the recurrent structure of RNNs for a fully attention-based approach, as described in the classic Attention Is All You Need. BERT is also pre-trained; its weights are learned in advance through two unsupervised tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another). Thus BERT doesn’t need to be trained from scratch for each new task; rather, its weights are fine-tuned. For more details about BERT, check out The Illustrated BERT.

  • BERT builds on two ideas behind many of the recent advances in NLP:
    • The transformer architecture: a sequence model that drops the recurrent structure of RNNs in favor of a fully attention-based approach; see Attention Is All You Need.
    • Unsupervised pre-training: the weights are learned in advance through two unsupervised tasks, masked language modeling (predicting a missing word from its left and right context) and next sentence prediction (predicting whether one sentence follows another).
  • BERT therefore does not need to be trained from scratch for each new task; instead its weights are fine-tuned. See The Illustrated BERT. (A small masked-LM sketch follows below.)
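As a concrete illustration of the masked language modeling objective, here is a minimal sketch that asks a pre-trained BERT to fill in a masked word. It uses the Hugging Face transformers library rather than the PyTorch port referenced in the original post, so treat the class names (BertTokenizer, BertForMaskedLM) as assumptions about that library, not as part of the article.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the uncased BERT-Base model together with its masked-LM head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "I went to the [MASK] and bought fresh strawberries."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
predicted_id = int(logits[0, mask_pos].argmax(-1))
print(tokenizer.convert_ids_to_tokens(predicted_id))  # the model's best guess for the masked word
```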

 

BERT is a (multi-headed) beast

BERT is not like traditional attention models that use a flat attention structure over the hidden states of an RNN. Instead, BERT uses multiple layers of attention (12 or 24 depending on the model), and also incorporates multiple attention “heads” in every layer (12 or 16). Since model weights are not shared between layers, a single BERT model effectively has up to 24 x 16 = 384 different attention mechanisms.

  • Unlike traditional attention models, which use a flat attention structure over the hidden states of an RNN, BERT uses multiple layers of attention (12 or 24, depending on the model).
  • Each layer also contains multiple attention "heads" (12 or 16).
  • Model weights are not shared between layers, so a single BERT model has up to 24 x 16 = 384 distinct attention mechanisms. (A sketch for inspecting these attention tensors follows below.)
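To get a feel for this multi-headed structure, the sketch below (mine, not from the post; it assumes the Hugging Face transformers API) runs a sentence through BERT-Base and inspects the returned attention tensors: one tensor per layer, each holding one seq-by-seq attention matrix per head.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("I went to the store.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One attention tensor per layer, each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
num_layers, num_heads = len(attentions), attentions[0].shape[1]
print(num_layers, num_heads, num_layers * num_heads)  # 12, 12, 144 for BERT-Base
```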

 

Visualizing BERT

Because of BERT’s complexity, it can be difficult to intuit the meaning of its learned weights. Deep-learning models in general are notoriously opaque, and various visualization tools have been developed to help make sense of them. However, I hadn’t found one that could shed light on the attention patterns that BERT was learning. Fortunately, Tensor2Tensor has an excellent tool for visualizing attention in encoder-decoder transformer models, so I modified this to work with BERT’s architecture, using a PyTorch implementation of BERT. The adapted interface is shown below, and you can run it yourself using the notebooks on Github.

The tool visualizes attention as lines connecting the position being updated (left) with the position being attended to (right). Colors identify the corresponding attention head(s), while line thickness reflects the attention score. At the top of the tool, the user can select the model layer, as well as one or more attention heads (by clicking on the color patches at the top, representing the 12 heads).

  • Because of BERT's complexity, it is hard to intuit the meaning of its learned weights.
  • Deep-learning models are notoriously opaque, and various visualization tools have been developed to help make sense of them.
  • Tensor2Tensor has an excellent tool for visualizing attention in encoder-decoder transformer models, so the author adapted it to BERT's architecture using a PyTorch implementation of BERT (impressive engineering); you can run it yourself with the notebooks on Github.
  • The tool draws attention as lines connecting the position being updated (left) with the position being attended to (right).
  • Colors identify the attention heads, and line thickness reflects the attention score.
  • At the top of the tool you can select the model layer as well as one or more heads. (A minimal static stand-in is sketched below.)
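The interactive tool itself lives in the author's Github notebooks; as a much simpler static stand-in (my own sketch, not the author's tool), one can render a single head's attention matrix as a heatmap. This reuses the tokenizer, inputs and attentions objects from the previous sketch and assumes matplotlib is available.

```python
import matplotlib.pyplot as plt

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 2, 0                         # hypothetical choice; any layer/head pair works
attn = attentions[layer][0, head].numpy()  # (seq_len, seq_len)

# Rows are the positions being updated, columns the positions being attended to.
plt.imshow(attn, cmap="Blues")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.tight_layout()
plt.show()
```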

[Figure: the adapted attention visualization interface]

 

What does BERT actually learn? (the six patterns from the title)

I used the tool to explore the attention patterns of various layers / heads of the pre-trained BERT model (the BERT-Base, uncased version). I experimented with different input values, but for demonstration purposes, I just use the following inputs:

Sentence A: I went to the store.

Sentence B: At the store, I bought fresh strawberries.

BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is: [CLS] i went to the store . [SEP] at the store , i bought fresh straw ##berries . [SEP]

I found some fairly distinctive and surprisingly intuitive attention patterns. Below I identify six key patterns and for each one I show visualizations for a particular layer / head that exhibited the pattern.

  • The tool is used to explore the attention patterns of the various layers and heads of the pre-trained BERT model (BERT-Base, uncased). For demonstration, the following inputs are used:
    • Sentence A: I went to the store.
    • Sentence B: At the store, I bought fresh strawberries.
  • BERT uses WordPiece tokenization and inserts [CLS] and [SEP] tokens, so the actual input sequence is:
    • [CLS] i went to the store . [SEP] at the store , i bought fresh straw ##berries . [SEP]
  • The author found some fairly distinctive and surprisingly intuitive attention patterns; six key patterns are identified below, each visualized with a particular layer/head that exhibits it. (The tokenization can be reproduced with the sketch below.)
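To reproduce that token sequence, run the two sentences through the uncased WordPiece tokenizer. The sketch below assumes the Hugging Face transformers BertTokenizer (the original post used a different PyTorch port); the token list in the comment is the one given in the article.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("I went to the store.",
                "At the store, I bought fresh strawberries.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'i', 'went', 'to', 'the', 'store', '.', '[SEP]',
#  'at', 'the', 'store', ',', 'i', 'bought', 'fresh', 'straw', '##berries', '.', '[SEP]']
```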

Pattern 1: Attention to next word

In this pattern, most of the attention at a particular position is directed to the next token in the sequence. Below we see an example of this for layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) The figure on the left shows the attention for all tokens, while the one on the right shows the attention for one selected token (“i”). In this example, virtually all of the attention is directed to “went,” the next token in the sequence.

  • In this pattern, most of the attention at a given position is directed to the next token in the sequence.
  • The left figure shows the attention for all tokens; the right figure shows the attention for the selected token "i".
  • In this example, virtually all of the attention from "i" goes to "went", the next token in the sequence.

[Figure: Pattern 1, attention to next word (layer 2, head 0)]

On the left, we can see that the [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] rather than the next token. Thus this pattern appears to operate primarily within each sentence.

This pattern is related to the backward RNN, where state updates are made sequentially from right to left. Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN.

  • On the left we can see that [SEP] disrupts the next-token pattern: most of the attention from [SEP] goes to [CLS] rather than to the next token, so the pattern operates mainly within each sentence.
  • This pattern is related to a backward RNN, where state updates are made sequentially from right to left.
  • Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN. (A simple way to quantify it is sketched below.)
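One simple way to quantify Pattern 1 (my own check, not something done in the original post) is to average, for every head in a layer, how much attention each position sends to the token immediately to its right; heads exhibiting the pattern score close to 1. This reuses the attentions tensor from the earlier sketch, and the helper name is hypothetical.

```python
import torch

def next_token_attention(layer_attn):
    """Per-head average attention paid to the next token.

    layer_attn: tensor of shape (num_heads, seq_len, seq_len).
    """
    seq_len = layer_attn.shape[-1]
    rows = torch.arange(seq_len - 1)
    # layer_attn[h, i, i + 1] is position i's attention to the following token.
    return layer_attn[:, rows, rows + 1].mean(dim=-1)

# Scores for the heads of layer 2 (0-indexed); values near 1 indicate Pattern 1.
print(next_token_attention(attentions[2][0]))
```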

Pattern 2: Attention to previous word

In this pattern, much of the attention is directed to the previous token in the sentence. For example, most of the attention for “went” is directed to the previous word “i” in the figure below. The pattern is not as distinct as the last one; some attention is also dispersed to other tokens, especially the [SEP] tokens. Like Pattern 1, this is loosely related to a sequential RNN, in this case the forward RNN.

  • In this pattern, much of the attention is directed to the previous token in the sentence.
  • Most of the attention from "went", for example, goes to the previous word "i".
  • The pattern is not as distinct as Pattern 1; some attention is also dispersed to other tokens, especially the [SEP] tokens.
  • Like Pattern 1, it is loosely related to a sequential RNN, in this case the forward RNN.

[Figure: Pattern 2, attention to previous word]

 

Pattern 3: Attention to identical/related words

In this pattern, attention is paid to identical or related words, including the source word itself. In the example below, most of the attention for the first occurrence of “store” is directed to itself and to the second occurrence of “store”. This pattern is not as distinct as some of the others, with attention dispersed over many different words.

  • In this pattern, attention is paid to identical or related words, including the source word itself.
  • Most of the attention for the first occurrence of "store" goes to itself and to the second occurrence of "store".
  • This pattern is not as distinct as some of the others, with attention dispersed over many different words.

[Figure: Pattern 3, attention to identical/related words]

 

Pattern 4: Attention to identical/related words in other sentence

In this pattern, attention is paid to identical or related words in the other sentence. For example, most of the attention for “store” in the second sentence is directed to “store” in the first sentence. One can imagine this being particularly helpful for the next sentence prediction task (part of BERT’s pre-training), because it helps identify relationships between sentences.

  • In this pattern, attention is paid to identical or related words in the other sentence.
  • Most of the attention for "store" in the second sentence goes to "store" in the first sentence.
  • One can imagine this being particularly helpful for the next sentence prediction task (part of BERT's pre-training), since it helps identify relationships between sentences.

[Figure: Pattern 4, attention to identical/related words in the other sentence]

 

Pattern 5: Attention to other words predictive of word

In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself. In the example below, most of the attention from “straw” is directed to “##berries”, and most of the attention from “##berries” is focused on “straw”.

This pattern isn’t as distinct as some of the others. For example, much of the attention is directed to a delimiter token ([CLS]), which is the defining characteristic of Pattern 6 discussed next.

 

  • In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself.
  • In the figure, most of the attention from "straw" goes to "##berries", and most of the attention from "##berries" goes back to "straw".
  • This pattern is not as distinct as some of the others; much of the attention also falls on the delimiter token [CLS], which is the defining characteristic of Pattern 6.

[Figure: Pattern 5, attention to other words predictive of the word]

 

Pattern 6: Attention to delimiter tokens

In this pattern, most of the attention is directed to the delimiter tokens, either the [CLS] token or the [SEP] tokens. In the example below, most of the attention is directed to the two [SEP] tokens. As discussed in this paper, this pattern serves as a kind of “no-op”: an attention head focuses on the [SEP] tokens when it can’t find anything meaningful in the input sentence to focus on.

  • In this pattern, most of the attention is directed to the delimiter tokens, either [CLS] or [SEP]; in the example, most of it goes to the two [SEP] tokens.
  • The pattern seems to serve as a kind of "no-op": a head focuses on [SEP] when it can't find anything meaningful in the input to attend to. (A quick numerical check is sketched below.)
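As a rough numerical check on Pattern 6 (again my own sketch, not from the post), one can measure what fraction of each position's attention lands on the [CLS] and [SEP] tokens. It reuses tokenizer, inputs and attentions from the earlier sketches; the layer/head choice is arbitrary.

```python
token_ids = inputs["input_ids"][0]
is_delim = (token_ids == tokenizer.cls_token_id) | (token_ids == tokenizer.sep_token_id)

layer, head = 6, 5                 # hypothetical pick; try several layer/head pairs
attn = attentions[layer][0, head]  # (seq_len, seq_len)

# Average, over all positions, of the attention mass sent to delimiter tokens.
delim_share = attn[:, is_delim].sum(dim=-1).mean().item()
print(f"share of attention on [CLS]/[SEP]: {delim_share:.2f}")
```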

[Figure: Pattern 6, attention to delimiter tokens]

 

Notes

It has been said that data visualizations are a bit like Rorschach tests: our interpretations may be colored by our own beliefs and expectations. While some of the patterns above are quite distinct, others are somewhat subjective, so these interpretations should only be taken as preliminary observations.

Also, the above 6 patterns describe the coarse attentional structure of BERT and do not attempt to describe the linguistic patterns that attention may capture. For example, there are many different types of “relatedness” that could manifest in Patterns 3 and 4, e.g., synonymy, coreference, etc. It would be interesting to see if different attention heads specialize in different types of semantic and syntactic relationships.

  • Data visualizations are a bit like Rorschach tests: our interpretations may be colored by our own beliefs and expectations.
  • While some of the patterns above are quite distinct, others are somewhat subjective, so these interpretations should be taken only as preliminary observations.
  • The six patterns describe BERT's coarse attentional structure and do not attempt to describe the linguistic patterns attention may capture (e.g. synonymy or coreference in Patterns 3 and 4); it would be interesting to see whether different heads specialize in different types of semantic and syntactic relationships.

Try it out!

You can check out the visualization tool on Github. Please play with it and share what you find!

For further reading

In Part 2, I extend the visualization tool to show how BERT is able to form its distinctive attention patterns. In my most recent article, I explore OpenAI’s new text generator, GPT-2.

  • Part 2 extends the visualization tool to show how BERT forms these distinctive attention patterns.

 

Source: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77

Next: NLP models: BERT architecture explained with attention and transformers (translated from: DECONSTRUCTING BERT, PART 2)