Attention Is All You Need

Abstract

The paper proposes a new, simple network architecture, the Transformer, based solely on attention mechanisms.

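Background

1. When a recurrent model computes the hidden state h_t, it uses the previous hidden state h_{t-1} and the input at position t. This inherently sequential computation prevents parallelization within a training example.
2. Attention mechanisms allow dependencies to be modeled regardless of their distance in the input or output sequences.
3. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence.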

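Model Architecture

1. The encoder maps an input sequence of symbol representations to a sequence of continuous representations z. Given z, the decoder generates an output sequence of symbols one element at a time. Each step is auto-regressive: when generating the next symbol, the model consumes the previously generated symbols as additional input.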
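
The auto-regressive loop described in item 1 can be sketched as follows. This is only a minimal illustration under stated assumptions: greedy token selection, the `next_token` callback, the `bos`/`eos` ids, and the toy `copy_model` are all hypothetical and not taken from the paper, and `memory` merely stands in for the encoder output z.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int], List[int]], int],
    memory: List[int],
    bos: int,
    eos: int,
    max_len: int = 10,
) -> List[int]:
    """Generate symbols one at a time, feeding earlier outputs back in."""
    output = [bos]
    for _ in range(max_len):
        # The model sees the encoder result and only the symbols generated so far.
        token = next_token(memory, output)
        output.append(token)
        if token == eos:
            break
    return output


def copy_model(memory: List[int], generated: List[int]) -> int:
    """Hypothetical stand-in for a trained decoder: copy the source, then emit EOS (0)."""
    position = len(generated) - 1  # number of real symbols produced so far
    return memory[position] if position < len(memory) else 0


result = greedy_decode(copy_model, memory=[5, 6, 7], bos=1, eos=0)
print(result)
```

Each call to `next_token` receives a longer `generated` list than the previous one, which is what makes the procedure sequential at inference time.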
2. Encoder and Decoder Stacks
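(1) Encoder:
N = 6: a stack of 6 identical layers, each with 2 sub-layers. The first is a multi-head self-attention mechanism and the second is a position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)).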
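
The LayerNorm(x + Sublayer(x)) pattern and the stacking of 6 layers can be sketched in NumPy. This is a simplified illustration, not a faithful implementation: the layer normalization here omits the learned gain and bias, and the `identity` and `tanh_ffn` callables are hypothetical placeholders for the real self-attention and feed-forward sub-layers.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector (no learned gain/bias in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def sublayer_connection(x, sublayer):
    """Residual connection around the sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))


def encoder_layer(x, self_attn, feed_forward):
    """One encoder layer: self-attention sub-layer, then feed-forward sub-layer."""
    x = sublayer_connection(x, self_attn)
    return sublayer_connection(x, feed_forward)


def encoder(x, layers):
    """Apply a stack of (self_attn, feed_forward) layers in order."""
    for self_attn, feed_forward in layers:
        x = encoder_layer(x, self_attn, feed_forward)
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 positions, 8 features (toy sizes)
identity = lambda h: h                # placeholder for self-attention
tanh_ffn = lambda h: np.tanh(h)       # placeholder for the feed-forward network
out = encoder(x, [(identity, tanh_ffn)] * 6)
print(out.shape)
```

Because every sub-layer keeps the same output width as its input, the residual addition `x + sublayer(x)` is always well defined.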

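(2) Decoder:
N = 6: a stack of 6 identical layers, each with 3 sub-layers. Compared with an encoder layer, one sub-layer is inserted in the middle; it takes the encoder output together with the output of the preceding sub-layer as its input. All three sub-layers use the same residual connection and layer normalization.
# This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
The decoder's self-attention sub-layer is masked (masked multi-head attention) so that the prediction at position i depends only on outputs at positions less than i. The offset means the decoder input is the output sequence shifted right by one position, so the input at position i is the output symbol from position i-1.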
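
A small sketch of the masking idea, under the assumption that the mask is applied by setting disallowed attention scores to negative infinity before the softmax. The helper names `causal_mask` and `masked_softmax` are made up for this example. Within the shifted decoder input, position i may look at positions j <= i, which corresponds to output symbols at positions less than i.

```python
import numpy as np


def causal_mask(n):
    """Boolean (n, n) matrix: True where query position i may attend to key position j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def masked_softmax(scores, mask):
    """Softmax over the last axis, with masked-out entries receiving zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)  # uniform scores, toy input
print(weights)
```

With uniform scores, row i spreads its weight evenly over positions 0..i and puts exactly zero weight on every later position.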

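3. Attention
An attention function maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.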
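(1) Scaled Dot-Product Attention (dot-product attention scaled by 1/sqrt(d_k))
A set of queries is packed together into a matrix Q, and the keys and values into matrices K and V. Compared with ordinary dot-product attention, the only change is the added scaling factor 1/sqrt(d_k):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V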
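
Scaled dot-product attention over packed Q, K, and V matrices can be written in a few lines of NumPy. This is a minimal sketch with toy shapes chosen for illustration; the function and variable names are assumptions, not code from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # one weight per value, per query
    return weights @ V, weights         # weighted sum of the values


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))  # 5 keys
V = rng.normal(size=(5, 2))  # 5 values, d_v = 2
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1, so each output row is a convex combination of the value vectors.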

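(2) Multi-Head Attention

The queries, keys, and values are linearly projected several times with different learned projections, and attention is computed on each projected version in parallel. The results are concatenated and projected once more to give the final output.
· This allows the model to jointly attend to information from different representation subspaces at different positions.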
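
The project, attend, concatenate, project sequence can be sketched as below. Several choices here are assumptions for the sake of a short example: the per-head projections are stored as column slices of single matrices `W_q`, `W_k`, `W_v`, the weights are random rather than learned, and the sizes (`d_model = 8`, 2 heads) are toy values.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention; works on 2-D inputs or a batch of heads."""
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V


def multi_head_attention(query, key, value, W_q, W_k, W_v, W_o, num_heads):
    def split_heads(x):
        # (n, d_model) -> (num_heads, n, d_head)
        n, d_model = x.shape
        return x.reshape(n, num_heads, d_model // num_heads).transpose(1, 0, 2)

    Q = split_heads(query @ W_q)
    K = split_heads(key @ W_k)
    V = split_heads(value @ W_v)
    heads = attention(Q, K, V)  # all heads computed in parallel
    # (num_heads, n_q, d_head) -> (n_q, num_heads * d_head): concatenate the heads
    concat = heads.transpose(1, 0, 2).reshape(query.shape[0], -1)
    return concat @ W_o         # final linear projection


rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
x = rng.normal(size=(3, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Slicing one large projection matrix per head gives the same result as keeping a separate small matrix for each head, while letting all heads run as one batched matrix product.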

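(3) Applications of Attention in the Model
· In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the keys and values come from the output of the encoder.
# This allows every position in the decoder to attend over all positions in the input sequence
· In the encoder's self-attention layers, the queries, keys, and values all come from the output of the previous encoder layer.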
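
The two uses listed above differ only in where the queries, keys, and values come from. The sketch below makes that explicit; it leaves out the learned projections and multiple heads for brevity, and the array names and sizes are hypothetical.

```python
import numpy as np


def attention(Q, K, V):
    """Scaled dot-product attention returning (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
prev_encoder_layer = rng.normal(size=(6, 8))  # 6 input positions
prev_decoder_layer = rng.normal(size=(3, 8))  # 3 output positions so far

# Encoder self-attention: queries, keys, and values share one source.
self_out, self_w = attention(prev_encoder_layer, prev_encoder_layer, prev_encoder_layer)

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
encoder_output = self_out
cross_out, cross_w = attention(prev_decoder_layer, encoder_output, encoder_output)

print(self_w.shape, cross_w.shape)
```

The `cross_w` matrix has one row per decoder position and one column per input position, which is the sense in which every decoder position attends over the whole input sequence.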
4. Position-wise Feed-Forward Networks
5. Embeddings and Softmax
6. Positional Encoding
