The Transformer, from a Fledgling's Perspective

0. Summary

  • This post is a small summary built around Jay Alammar's article The Illustrated Transformer. Below, "the post" always refers to that article, and the block quotes are taken from it; my own views and interpretations appear as regular body text.
  • It is written for beginners who have just gotten started (with a little background; if you have none, the post may be half-baked enough to mislead you), but experienced readers are of course welcome to weigh in and point out mistakes. If you have questions, leave them in the comments and I will reply as soon as I can~
  • A beginner's plea: could the experts out there please write a few more NLP articles, and save rookies like me who glaze over reading English articles but still haven't gotten through the door?
  • This post mixes Chinese and English; if that's not your thing, feel free to bail now~

What is the Transformer?

Jay Alammar's description is:

In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained.

The Google blog's description is:

Transformer: A Novel Neural Network Architecture based on self-attention mechanism for Language Understanding

Why was it proposed? & When was it proposed?

The Transformer was proposed because of the shortcomings of RNNs [which shortcomings? and why do RNNs have them?].

The Transformer was proposed in the paper "Attention Is All You Need".

Basic structure

Let's go through the Transformer's structure from top to bottom and from the inside out.

The Transformer consists of an encoding component, a decoding component, and connections between them.

The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers: self-Attention + Feed Forward Neural Network

So what is self-attention, mainly? From the name FFNN one might guess that self-attention is also a kind of neural network, just with some structural changes. My own view: calling self-attention a neural network is not accurate; there aren't that many neurons in it, so it doesn't really qualify.

self-attention

What is it?

First let's discuss what self-attention is, and then look at what the self-attention layer does.

Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

What does it look like?

Here we take scaled dot-product attention as the example:

The process, step by step (a numpy sketch of all six steps follows this list):

  • step1. create three vectors from each of the encoder's input vectors [since we are using the first encoder as the example, the input vectors here are simply the embeddings of the three words]
  • step2. for each word, we create a Query vector, a Key vector, a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process. [When I first studied attention I read many blog posts online and still had no idea where the q, k, v vectors came from; it was maddening. Now I know: we first create three matrices W^Q, W^K, W^V, which can be initialized arbitrarily and are learned during training, and then multiply each word's embedding by them to get that word's query/key/value vectors~]
    Note that these vectors are smaller in dimension than the embedding vector.

This is shown in the figure in the original post.

  • step3. calculate a score.

We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring.

  • step4. divide the scores by some value, then pass the result through a softmax operation
    In practice this "some value" almost always defaults to the square root of the dimension of the key vectors. The reason: this leads to having more stable gradients. There could be other possible values here.

This softmax score determines how much each word will be expressed at this position.

  • step5. multiply each value vector by the softmax score
    The purpose is obvious: keep attending to the highly weighted words and drown out the words with small weights. As the post puts it: keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words.
  • step6. sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
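To make the six steps concrete, here is a minimal numpy sketch. It is a toy, not the real model: the dimensions follow the paper, but W^Q, W^K, W^V are random stand-ins for the matrices that training would learn.

```python
import numpy as np

d_model, d_k = 512, 64            # embedding size and q/k/v size, as in the paper
rng = np.random.default_rng(0)

# steps 1-2: three "trained" projection matrices (random stand-ins here)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))

# embeddings of the three input words, e.g. "Je", "suis", "étudiant"
X = rng.normal(size=(3, d_model))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # one query/key/value vector per word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# steps 3-6, computed for the first word
q1 = Q[0]
scores = K @ q1                   # step 3: dot q1 with every word's key
scores = scores / np.sqrt(d_k)    # step 4a: divide by sqrt(d_k) for more stable gradients
weights = softmax(scores)         # step 4b: softmax, one weight per input word
z1 = weights @ V                  # steps 5-6: weight the value vectors and sum them
print(z1.shape)                   # (64,) -> the self-attention output for word 1
```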

In practice, to speed things up, all of this is done with matrix multiplication. Let's see how the per-word computation is carried out in matrix form.

Matrix Calculation of Self-Attention

  • step1. calculate the Query ,Key, Value matrices
    We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we've trained (W^Q, W^K, W^V).
    To restate each of the quantities above:

Every row in the X matrix corresponds to a word in the input sentence.

  • step2. dealing with matrices
    We compress the individual steps above [dot product of q and k => divide by the square root => softmax => multiply by v, which finally gives z; since we are now computing with matrices, the individual vectors are packed together into matrices] into one formula, as illustrated in the original post.
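A sketch of that condensed formula, Z = softmax(QK^T / sqrt(d_k)) V, again with made-up toy weights:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, for all words of the sentence at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len) scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (seq_len, d_k): one z per word

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))                  # 3 word embeddings packed into X
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)           # (3, 64)
```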
What does it do?

a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.

This is also how the post puts it elsewhere:

self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

Upgrade => multi-head attention

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention.

Advantages:

  • expands the model’s ability to focus on different positions.
    It expands the model's ability to focus on different positions; the next point explains why this works.

  • gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

Following the flow of a single attention head, if we have n sets of W^Q/W^K/W^V (here n = 8), we get n results, as shown in the original post's figure.

Concatenating these eight z matrices (z_0 ... z_7) and multiplying the result by one additional weight matrix W^O gives the final output we need.

Putting all of the matrices above side by side in one labeled overview gives the summary figure in the original post.
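A rough sketch of the multi-headed version under the same toy assumptions: eight independent W^Q/W^K/W^V sets, each producing its own z, concatenated and then projected back to d_model with one extra matrix (W^O in the paper):

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention, as in the previous sketch."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n_heads, d_model, d_k = 8, 512, 64
X = rng.normal(size=(3, d_model))                       # 3 toy word embeddings

# eight randomly initialized sets of W^Q / W^K / W^V, one per head
heads = [[rng.normal(size=(d_model, d_k)) for _ in range(3)] for _ in range(n_heads)]

Zs = [attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]   # z_0 ... z_7

# concatenate the heads, then project with the extra matrix W^O
W_O = rng.normal(size=(n_heads * d_k, d_model))
Z = np.concatenate(Zs, axis=-1) @ W_O                   # (3, 512), fed on to the FFNN
```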

Another upgrade => adding positional vectors to the embeddings

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

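For reference, here is a sketch of the sinusoidal pattern the paper uses for these position vectors (the post notes this is only one possible pattern):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]                    # (1, d_model) dimension index
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))          # embeddings of "Je suis étudiant"
X = X + positional_encoding(3, 512)    # the position vector is simply added to each embedding
```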

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

The dashed lines here are the so-called residual connections. They are not really the heart of the Transformer, just a small trick~ The original post also shows a fully visualized version of this step.
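A small sketch of what "residual connection followed by layer normalization" means around each sub-layer; the two sub-layers below are crude placeholders, not the real attention/FFNN:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """LayerNorm(x + Sublayer(x)): the dashed residual path plus normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 512))
fake_self_attention = lambda h: h @ (rng.normal(size=(512, 512)) * 0.01)      # placeholder
fake_ffnn = lambda h: np.maximum(0.0, h @ (rng.normal(size=(512, 512)) * 0.01))

out = add_and_norm(add_and_norm(x, fake_self_attention), fake_ffnn)           # one encoder layer
```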

The Decoder Side

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The above is the usual way of turning a tensor into a word.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
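A toy sketch of that last step, assuming the 10,000-word vocabulary from the example (the projection matrix here is random, standing in for the trained Linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

decoder_output = rng.normal(size=d_model)                 # the vector of floats from the decoder stack
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.01   # the final Linear layer (toy weights)

logits = decoder_output @ W_vocab                         # 10,000 scores, one cell per vocabulary word
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                               # softmax: all positive, summing to 1.0
next_word_id = int(np.argmax(probs))                      # pick the highest-probability cell
```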

The Loss Function

  • The model produces one output at a time, and keeps only the highest-probability word. This method is called greedy decoding.
  • The other method: keep the two highest-probability words, then compare the loss the model incurs with each of them and keep the one with the smaller loss. This method is called beam search.
    Questions: what exactly does "the model" mean here? How is the loss computed, and how are the two losses compared? (A toy sketch of both decoding strategies follows below.)
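To make the two strategies concrete, here is a toy sketch. The function `model_step` is a made-up, deterministic stand-in that returns a probability distribution over a tiny vocabulary given the words produced so far; it is not the real Transformer.

```python
import numpy as np

vocab = ["i", "am", "a", "student", "<eos>"]

def model_step(prefix):
    """Stand-in for the model: a deterministic fake distribution over the vocabulary."""
    rng = np.random.default_rng(len(prefix) * 31 + sum(prefix))
    logits = rng.normal(size=len(vocab))
    p = np.exp(logits - logits.max())
    return p / p.sum()

def greedy_decode(max_len=4):
    """Keep only the single highest-probability word at every step."""
    prefix = []
    for _ in range(max_len):
        prefix.append(int(np.argmax(model_step(prefix))))
    return [vocab[i] for i in prefix]

def beam_search_decode(beam_size=2, max_len=4):
    """Keep the beam_size best partial outputs (by total log-probability) at every step."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            p = model_step(prefix)
            for w in range(len(vocab)):
                candidates.append((prefix + [w], score + float(np.log(p[w]))))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return [vocab[i] for i in beams[0][0]]

print(greedy_decode(), beam_search_decode())
```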

How it runs [a worked example of the data flow]

First, a summary of what the post says:

  • step 1. Use an embedding technique to turn each word into a vector; every vector is 512-dimensional.
  • step 2. These embeddings only appear in the bottom-most encoder, but every encoder receives a list of vectors, each of size 512. The size of this list is a hyperparameter we can set ourselves; usually it is the length of the longest sentence in the training data. [The reason: each word has one embedding, a sentence contains several words, so different sentences yield lists of different lengths, and the longest is taken as the standard. Does that mean shorter sentences are simply padded with zeros???]
    So does the input end up forming a matrix? For example, what exactly is the input in the post's figure: the three vectors combined into one matrix, or fed in one after another?
    My own understanding: the data is fed in simultaneously, i.e. the embedding of each word in "Je suis étudiant" is stacked into one matrix, which is passed into self-attention all at once (a sketch follows below).
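A small sketch of that understanding, with the zero-padding from step 2 made explicit. The original post does not spell out how shorter sentences are padded, so treat the zero-padding here as my assumption.

```python
import numpy as np

d_model, list_size = 512, 6        # list_size: the hyperparameter, e.g. the longest training sentence

def pack_sentence(word_embeddings, list_size=list_size, d_model=d_model):
    """Stack one 512-dim embedding per word into a (list_size, d_model) matrix, zero-padding the rest."""
    X = np.zeros((list_size, d_model))
    X[: len(word_embeddings)] = word_embeddings
    return X

rng = np.random.default_rng(0)
sentence = [rng.normal(size=d_model) for _ in range(3)]   # "Je", "suis", "étudiant"
X = pack_sentence(sentence)                               # (6, 512): the whole sentence fed in at once
```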

Advantages

1. Parallelization

The biggest benefit, however, comes from how The Transformer lends itself to parallelization.

Now let's explain why a parallel operation is possible here:

… to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
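A tiny illustration of that property: the position-wise FFNN is the same function applied to every position independently, so one matrix multiplication over the whole sentence gives the same result as looping over positions one by one (toy weights again):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def ffnn(x):
    """Position-wise feed-forward network: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(seq_len, d_model))               # self-attention output, one row per position

batched = ffnn(Z)                                     # all positions processed in one shot
looped = np.stack([ffnn(Z[i]) for i in range(seq_len)])
assert np.allclose(batched, looped)                   # no position depends on any other
```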

Disadvantages

Applications

The application Jay's post walks through is translation (seq2seq), but why was that direction chosen for the examples? Was the Transformer proposed for exactly this task? Where does it originate?

Questions

Having read this far, it's time to turn my own brain on and ask some questions.

  • Why does the post use a translation seq2seq task as its running example? (Look back at the earlier figures and examples: they are all about translation.)

  • Are the encoder layers and decoder layers of the Transformer all linear? Why do they work so well?
    It was said before that linear networks are weaker than (nonlinear) neural networks, so why is linearity so effective here?

  • How exactly is the computation parallelized?

The biggest difference between the Transformer and the LSTM is that LSTM training is iterative and serial: the current token must be fully processed before the next one can be. Transformer training is parallel, i.e. all tokens are trained on at the same time, which greatly improves computational efficiency. The Transformer uses Positional Encoding to understand word order, and computes with the Self Attention Mechanism plus fully connected layers.

  • Why is it called self-attention?

  • What are the "query", "key", and "value" vectors? & What are they for?

References

  • https://jalammar.github.io/illustrated-transformer/
  • https://wmathor.com/index.php/archives/1438/ [this article has nice animated demonstrations]
  • https://mp.weixin.qq.com/s/RLxWevVWHXgX-UcoxDS70w