X-Ray Transformer: dive into the transformer’s training and inference computations through a single visual

The transformer architecture has produced a revolution in the NLP field and in deep learning. A multitude of applications are benefiting from the capacity of these models to process sequences in parallel while achieving a deeper understanding of their context through the attention mechanisms they implement. And GPT-3 is a hot topic right now in the deep learning community.

Understanding how the transformer processes sequences can be a bit challenging at first. When tackling a complex model, many people like to study how the computations of the model change the shapes of the tensors that travel through it.

To that end, I created the X-Ray Transformer infographic, which lets you follow the transformer’s computations from beginning to end in both the training and inference phases. Its objective is to achieve a quick and deep understanding of the inner computations of a transformer model through the analysis and exploration of a single visual asset.

A link to download a higher-resolution version of the full infographic is available at the end of this article.

Link to a much higher resolution version of the full infographic is available at the end of this article. Visualization by Javier [email protected]

Color code

When looking at this infographic, the first thing to consider, shown at the bottom right of the graphic, is the colors that denote different important stages.

  • Light Blue denotes the training phase.

  • Light Green denotes the inference phase.

  • Purple indicates the Encoding stage (used in both phases), and within the encoder, purple modules belong to the training phase and green ones to the inference phase.

  • Dark Red indicates the Decoding stage (used in both phases). Within the decoder, purple modules indicate encoder data, dark red indicates decoder data, and green modules, as usual, indicate the inference phase.

Once the color codes are clear, the next thing to notice is the pink circles with numbers inside them. Those help us see the general path of execution, first moving through the encoder and then the decoder.

The two large arrows on either side are a reminder of some of the key stages in the execution of the encoder and decoder phases.

The model

To generate this infographic, I used a small transformer model that implements a chatbot. The chatbot is trained on pairs of questions and answers. This specific model is trained on questions and answers related to movies and series, especially science fiction ones. Examples of questions and answers:

  • “What’s your favourite character in The Expanse series?” : “Naomi Nagata definitely!”

  • “What’s your favourite character in Battlestar Galactica?” : “Kara Thrace, she is great”

Below the title of the infographic, we can review the most important parameters to consider when studying the shapes of the computations.

  • This small model is trained with a batch size of 8.

  • The model has 4 heads in its multi-head attention part.

  • There are 3 encoder layers and 3 decoder layers.

  • The size of the output vocabulary of the model is 950.

  • The embedding size used across the model is 32.

Training phase. A new batch

We begin the journey on the bottom left of the infographic as we begin to train the model.

We obtain a batch from our dataloader. We use a batch size of 8, so each batch contains 8 input sequences and 8 output sequences.

Tokenizing, Numericalizing, Padding and Mask

The 8 input sequences are padded as necessary (adding padding tokens) so that they all have the same length, in the case of this specific batch, 10 (the length of the longest sequence in that batch). The same is done with the output sequences of the batch.

These sequences have been tokenized and numericalized to prepare them to be ingested by the model. By the time the training loop extracts a new batch, the sequences are numericalized and structured in a tensor of dimensions 8x10 (BS x SeqLen).

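As a rough sketch of what this preparation could look like in PyTorch (the token ids, the PAD_IDX value and the helper function are made up for illustration; the model’s actual dataloader may differ):

```python
import torch

PAD_IDX = 0  # hypothetical padding index

def pad_batch(numericalized_seqs, pad_idx=PAD_IDX):
    """Pad a list of numericalized sequences to the length of the longest one."""
    max_len = max(len(s) for s in numericalized_seqs)              # 10 for this batch
    padded = torch.full((len(numericalized_seqs), max_len), pad_idx, dtype=torch.long)
    for i, seq in enumerate(numericalized_seqs):
        padded[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return padded                                                   # BS x SeqLen

# 8 toy sequences of token ids (already tokenized and numericalized)
batch = [[5, 12, 7, 3], [9, 4, 22, 8, 14, 2, 31, 6, 11, 17], [4, 9, 2]] + [[1, 2, 3]] * 5
src = pad_batch(batch)
print(src.shape)  # torch.Size([8, 10])
```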

Masking in the encoder

Next we need to create a mask that will help us ensure that the additional padding elements in the sequence are not taken into account by the attention mechanisms. So we set to False or 0 those positions in the mask belonging to padding tokens in the input sequences.

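A minimal sketch of how such a padding mask could be built (PAD_IDX and the stand-in batch are assumptions of this sketch; the extra unsqueezed dimensions are there so the mask can broadcast over heads and query positions later):

```python
import torch

PAD_IDX = 0
src = torch.randint(1, 100, (8, 10))    # stand-in for the numericalized batch (BS x SeqLen)
src[1, 7:] = PAD_IDX                    # pretend sequence 1 needed 3 padding tokens

# True marks real tokens, False marks padding positions the attention should ignore
src_mask = (src != PAD_IDX).unsqueeze(1).unsqueeze(2)   # BS x 1 x 1 x SeqLen
print(src_mask.shape)                                   # torch.Size([8, 1, 1, 10])
```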

Embeddings and positional encoding

Now we have to create our embeddings, so we send the 8x10 tensor to the embed module and get back a 8x10x32 (BS x SeqLen x EmbedSize) tensor because our embedding size is 32 in this small example (512 is a typical embedding size for transformer models).

To that, we add the result of the positional encoding module, which will help the model take into account the differences in positioning across the sequence.

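A possible sketch of this embedding plus sinusoidal positional encoding step (the input vocabulary size and the multiplication by sqrt(EmbSize) are assumptions of this sketch; the article only fixes the embedding size at 32):

```python
import math
import torch
import torch.nn as nn

BS, SEQ_LEN, EMB = 8, 10, 32
VOCAB = 950                                      # assumed input vocabulary size

embed = nn.Embedding(VOCAB, EMB)

def positional_encoding(seq_len, emb_size):
    """Sinusoidal positional encodings, as in the original transformer paper."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, emb_size, 2).float() * (-math.log(10000.0) / emb_size))
    pe = torch.zeros(seq_len, emb_size)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                    # SeqLen x EmbSize

src = torch.randint(0, VOCAB, (BS, SEQ_LEN))     # numericalized batch
x = embed(src) * math.sqrt(EMB)                  # BS x SeqLen x EmbSize
x = x + positional_encoding(SEQ_LEN, EMB)        # broadcast over the batch dimension
print(x.shape)                                   # torch.Size([8, 10, 32])
```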

The first layer of our encoder is ready to ingest this 8x10x32 tensor.

The encoder

The first thing the encoder does is to create three copies of the 8x10x32 tensor to produce the Q, K and V elements of the model, that is, the query, keys and values.

These 3 tensors are passed through 3 linear modules first (one for each of the tensors). In this small example, these linear modules don’t change the dimensionality (though they could if we wished).

After passing these linear modules, we arrive at the point of having to split the computation across our 4 heads (8 is a typical value for the number of heads; in this small example I use 4). Using 4 heads allows the attention mechanism to interpret the sequences from different perspectives.

Computationally we can prepare this stage in two simple steps.

  • First, we reshape the tensor to split the embedding dimension, 32, into two dimensions, 4 and 8. 4 is the number of heads; 8, in our case, is the embedding size divided by the number of heads (32 / 4 = 8), and we will call it dimK or dK (this dimK value can be calculated in different ways).

  • And now we do a transpose operation to position the heads dimension right after the batch dimension. That produces the new shape: 8 x 4 x 10 x 8 (BS x Heads x SeqLen x dimK). What this shape tells us is: for each element of the batch, we will have 4 heads, and each of those heads will hold a 10 (sequence length) x 8 (dimK) matrix (see the sketch right after this list).

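Here is a minimal sketch of those two steps (the tensor contents are random placeholders; only the shapes matter here):

```python
import torch

BS, HEADS, SEQ_LEN, EMB = 8, 4, 10, 32
DIM_K = EMB // HEADS                    # 32 / 4 = 8

q = torch.randn(BS, SEQ_LEN, EMB)       # output of the query linear module

q = q.view(BS, SEQ_LEN, HEADS, DIM_K)   # step 1: 8 x 10 x 4 x 8, split EmbSize into (Heads, dimK)
q = q.transpose(1, 2)                   # step 2: 8 x 4 x 10 x 8  (BS x Heads x SeqLen x dimK)
print(q.shape)
```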

Self-Attention

Our objective now is to calculate the attention scores. The encoder performs what is called self-attention. Self-attention lets us compare each part of an input sequence with every other part of that same sequence.

Conceptually, we are exploring how much attention each part of our sequence should pay to every other part of itself.

To find this out, we will multiply the Query and Keys tensors.

  • To multiply them we need to transpose the last two dimensions of the K tensor.

  • Once the K tensor has been transposed, we obtain two shapes that we can multiply: 8x4x10x8 * 8x4x8x10.

  • Notice that what we are really multiplying are the two last dimensions: 10x8 * 8x10.

  • This is going to produce the attention scores tensor which will have the shape: 8x4x10x10 (BS x Heads x SeqLen x SeqLen).

  • These are our self-attention scores. For each element of the batch, and for each of the 4 heads, we have a 10x10 matrix, which expresses how much attention each of the parts of our sequence should pay to each of the parts of the same sequence.

The next thing we will do is to apply a mask. This is because, remember, we made sure that all the sequences in the batch would have the very same length. And to do that we had to add padding tokens to the sequences that were shorter than the longest one in the batch.

So we now mask (set to a very large negative value) those parts of the tensor that refer to positions of the sequence holding a padding token. Applying the mask eliminates the influence of the parts of the sequences that correspond to those padding tokens.

Now we will apply a softmax module to the 10x10 matrices of the tensor, so that the numbers in each row sum to 1, converting each row into a probability distribution.

Those are our soft self-attention scores. For each sequence of the batch and within each head, they express how strongly each part of that sequence is connected to every part of itself, with all the influences on each position summing to one.

Now that we have the attention scores, we should apply them to the values, to the V tensor. We want to transform the values of the encoder according to the results of the self-attention computations.

Our attention scores have the shape of 8x4x10x10 (BS x Heads x SeqLen x SeqLen). And our V tensor has the shape of 8x4x10x8 (BS x Heads x SeqLen x dimK). Remember that we are really multiplying the last 2 dimensions, so we are multiplying 10x10 * 10x8. This produces a new tensor of dimension 8x4x10x8 (BS x Heads x SeqLen x dimK).

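Putting the scores, the mask, the softmax and the multiplication by the values together, a sketch of this self-attention computation could look like the following (the division by sqrt(dimK) is the standard scaled dot-product factor, not discussed above; the tensors and the mask are random placeholders):

```python
import math
import torch
import torch.nn.functional as F

BS, HEADS, SEQ_LEN, DIM_K = 8, 4, 10, 8
q = torch.randn(BS, HEADS, SEQ_LEN, DIM_K)
k = torch.randn(BS, HEADS, SEQ_LEN, DIM_K)
v = torch.randn(BS, HEADS, SEQ_LEN, DIM_K)
src_mask = torch.ones(BS, 1, 1, SEQ_LEN, dtype=torch.bool)   # padding mask built earlier
src_mask[1, ..., 7:] = False                                  # pretend sequence 1 has 3 pad tokens

scores = q @ k.transpose(-2, -1) / math.sqrt(DIM_K)           # 8 x 4 x 10 x 10
scores = scores.masked_fill(~src_mask, float("-inf"))         # hide the padding positions
attn = F.softmax(scores, dim=-1)                              # each row now sums to 1
out = attn @ v                                                # 8 x 4 x 10 x 8
print(scores.shape, attn.shape, out.shape)
```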

At this point, we have concluded the self-attention stage. We found the attention scores by multiplying the queries and the keys, and then applied those scores to the values to obtain the final attention tensor.

It’s the moment to unify the 4 heads into one. To do that, we do the inverse of before, combining transposition and reshaping to obtain a new shape of 8x10x32 (BS x SeqLen x EmbSize).

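A minimal sketch of that head-merging step (again with random placeholder values):

```python
import torch

BS, HEADS, SEQ_LEN, DIM_K = 8, 4, 10, 8
out = torch.randn(BS, HEADS, SEQ_LEN, DIM_K)   # per-head attention result

out = out.transpose(1, 2).contiguous()         # 8 x 10 x 4 x 8
out = out.view(BS, SEQ_LEN, HEADS * DIM_K)     # 8 x 10 x 32  (BS x SeqLen x EmbSize)
print(out.shape)
```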

After passing the resulting tensor through a linear module, we arrive at our first skip connection. We will add our current tensor to the original one that entered the encoder layer. And then we will apply layer normalization to keep the data values within a good range.

Next, we pass our 8x10x32 tensor through a feedforward layer and then apply another skip connection, adding the resulting tensor to the one that entered the feed forward layer (and normalizing the result as before).

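A sketch of the skip connections, layer normalizations and feedforward module of one encoder layer (the feedforward hidden size of 128 and the post-norm ordering are assumptions of this sketch; implementations vary on both):

```python
import torch
import torch.nn as nn

BS, SEQ_LEN, EMB, FF = 8, 10, 32, 128      # FF hidden size is an assumption

out_linear = nn.Linear(EMB, EMB)
norm1, norm2 = nn.LayerNorm(EMB), nn.LayerNorm(EMB)
feed_forward = nn.Sequential(nn.Linear(EMB, FF), nn.ReLU(), nn.Linear(FF, EMB))

x = torch.randn(BS, SEQ_LEN, EMB)          # tensor that entered the encoder layer
attn_out = torch.randn(BS, SEQ_LEN, EMB)   # merged multi-head attention output

h = norm1(x + out_linear(attn_out))        # skip connection + layer normalization
y = norm2(h + feed_forward(h))             # feedforward, then another skip connection + norm
print(y.shape)                             # torch.Size([8, 10, 32]): same shape in, same shape out
```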

We can optionally apply a dropout module at different stages of the previous computations, for example when performing the skip connection additions or at the end of the attention phases.

Repeat and rise

Wonderful! That was one layer of the encoder. The very same computations will be applied as many times as there are layers in the encoder.

Notice that the tensor that entered the encoder layer and the one that exits the encoder layer have the very same shape: 8x10x32. That’s why we can chain as many encoder layers as we like one after the other.

Once we arrive at the final encoder layer, we obtain our final 8x10x32 tensor. This encoder output tensor will be used later in the encoder-decoder attention mechanism (present at the decoder layers) to provide the keys and values that will interact with the queries tensor of the decoder.

The decoder

But before we go there, let’s move to the next step. The bottom part of the decoder.

At the bottom of the decoder we have an 8x14 tensor (BS x SeqLen) that contains 8 sequences of reply phrases. As usual, these phrases were tokenized and numericalized when creating the dataset and data loaders (and they contain padding tokens as needed).

Shift by one

Something important to note is that in the decoder, we shift the sequences to the right by one position. So the first token will be a start-of-sentence token rather than the first word of the sentence. Why do we do this?

We do it because we don’t want our model to just copy and paste the decoder’s input into its output. We want it to predict the next word (or character, but in this example we are predicting words). So if we don’t shift everything to the right by one, the prediction for position N will be the current word at position N in the decoder’s input, which we can access directly. To prevent this from happening, we shift the decoder’s input to the right by one position. In this way, at each stage, the decoder has to predict the position N but can only see up to position N-1 of the existing phrase.

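A common way to implement this shift, sketched here with made-up token ids (the special-token values and the toy batch are assumptions):

```python
import torch

SOS, EOS, PAD = 1, 2, 0                # hypothetical special-token ids
# Toy target batch, already numericalized and padded: <sos> w1 ... wN <eos> <pad>..., 8 x 15
target = torch.randint(3, 950, (8, 15))
target[:, 0] = SOS

decoder_input = target[:, :-1]         # 8 x 14: starts with <sos>, last token dropped
labels        = target[:, 1:]          # 8 x 14: shifted left; position N must predict token N+1
print(decoder_input.shape, labels.shape)
```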

Masking in the decoder

We also create the decoder mask, which contains True on and below the diagonal and False above it. This mask helps prevent the decoder from considering parts of the sentence that it hasn’t yet seen.

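A possible sketch of how this decoder mask could be built, combining the padding mask with the triangular no-peek mask (PAD and the stand-in batch are placeholders; exact conventions vary between implementations):

```python
import torch

PAD, SEQ_LEN = 0, 14
decoder_input = torch.randint(1, 950, (8, SEQ_LEN))           # shifted target batch, BS x SeqLen

pad_mask = (decoder_input != PAD).unsqueeze(1).unsqueeze(2)   # 8 x 1 x 1 x 14: True for real tokens
no_peek = torch.tril(torch.ones(SEQ_LEN, SEQ_LEN, dtype=torch.bool))  # True on/below the diagonal

trg_mask = pad_mask & no_peek                                 # 8 x 1 x 14 x 14
print(trg_mask.shape)
```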

Let’s go deeper into this point because it’s crucial. The decoder’s self attention mask ensures that each self-attention vector doesn’t pay attention to positions that are in the future.

So if I am calculating the self-attention scores for the word in position 3 of the sequence, I will mask out all positions after that one. This is necessary because when we are building our output phrase we need to perform our calculations based on the words generated so far, and we shouldn’t be able to know the future words that will come later.

In a way we are preventing the decoder from cheating during the training process. If we want, for example, to predict the second word of a phrase, we should take into consideration only the first position of the output phrase. If we want to predict the fifth word, we should consider just the first, second, third and fourth positions.

Notice that the decoder’s mask is also masking the padding tokens that may exist in the output sequences. So the decoder’s mask adds the masking of the padding tokens to the masking of future positions in the sequence.

As in the encoder, the 8x14 tensor is sent to the embed module, which outputs an 8x14x32 (BS x SeqLen x EmbSize) tensor, because the embedding size is 32. Next, the result of the positional encoding module is added to it.

At this point we arrive at the first decoder layer, which will be repeated as many times as the number of decoder layers we wish to have.

In the decoder layer we enter into two consecutive attention stages.

  • First, we will have a self-attention stage, very similar to the encoder’s one but using the decoder data.

  • And next, we will have an encoder-decoder attention stage, in which the Queries (Q) tensor will come from the decoder, but the Keys (K) and Values (V) tensors will come from the output of the previously executed encoder. You can locate this mixing stage in the infographic as the big arrow that connects the end of the encoder to the Keys and Values of the second attention stage of the decoder layer.

The first self-attention stage of the decoder is identical to the encoder’s one, except for using the output sequences as the data, and using the decoder’s mask.

In the second attention stage, the encoder-decoder attention, a similar process happens with some key differences:

  • The queries tensor Q is formed from the 8x14x32 (BS x SeqLen x EmbSize) decoder tensor.

  • The keys and values tensors, K and V, are formed from two copies of the same 8x10x32 (BS x SeqLen x EmbSize) tensor that comes from the result of the encoder phase.

The mask that we use in this masking stage is the one that was used in the encoder, the input-sequence mask. This way we make sure to only consider the connections between the output sequence and the parts of the input sequence that don’t have padding tokens. The output sequence itself has already been masked by the first stage of the decoder layer.

The attention scores are no longer a square matrix. We obtain a 14x10 matrix within the 8x4x14x10 (BS x Heads x OutSeqLen x InSeqLen) tensor, reflecting that we are obtaining the relationships between the different parts of the output sequences and the different parts of the input sequences.

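A sketch of this encoder-decoder attention step with random placeholder tensors, just to show where the non-square 14x10 score matrix comes from:

```python
import math
import torch
import torch.nn.functional as F

BS, HEADS, OUT_LEN, IN_LEN, DIM_K = 8, 4, 14, 10, 8

q = torch.randn(BS, HEADS, OUT_LEN, DIM_K)                  # queries from the decoder
k = torch.randn(BS, HEADS, IN_LEN, DIM_K)                   # keys from the encoder output
v = torch.randn(BS, HEADS, IN_LEN, DIM_K)                   # values from the encoder output
src_mask = torch.ones(BS, 1, 1, IN_LEN, dtype=torch.bool)   # the encoder (input) padding mask

scores = q @ k.transpose(-2, -1) / math.sqrt(DIM_K)         # 8 x 4 x 14 x 10
scores = scores.masked_fill(~src_mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v                         # 8 x 4 x 14 x 8
print(scores.shape, out.shape)
```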

As usual, after performing the attention computations, we concatenate the results to obtain in this case a 8x14x32 (BS x SeqLen x EmbSize) tensor.

After we perform the self-attention and encoder-decoder attention stages of the decoder layer, we move to a final stage within the same layer. We first pass the 8x14x32 tensor through a feedforward module and then, as we did in the encoder, add the result of that computation to the input of the feedforward module, applying a layer normalization module to the result. (The use of dropout in this process, as well as in the others mentioned in the encoder’s section, is an optional addition.)

This decoder layer process is then repeated over the existing number of decoder layers. As before, the input and output of each decoder layer have identical shapes, 8x14x32 (BS x SeqLen x EmbSize), which makes it easy to chain a few of these layers/processes.

Decoder’s output

Once we have iterated through all the decoder layers, we obtain a final 8x14x32 tensor, which we then pass through a linear layer whose output has the shape 8x14x950 (BS x SeqLen x OutVocabSize), 950 being the vocabulary size of the outputs of the chatbot.

This 8x14x950 tensor contains our predictions for this iteration. For each sequence of the batch, and for each of the 14 parts of each sequence, we obtain 950 values corresponding to the potential 950 words that are candidates for the next position of the output phrase.

It’s time to calculate the loss, the difference between our objectives and our current predictions.

We take that predictions tensor into a cross entropy loss module, which also receives our 8x14 target tensor. The result of that cross entropy module is the loss value of this iteration of the training process.

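A minimal sketch of this loss computation (the random data is a placeholder, and ignoring the padding index is a common practice that I am assuming here, not something stated above):

```python
import torch
import torch.nn.functional as F

BS, SEQ_LEN, VOCAB, PAD = 8, 14, 950, 0

preds = torch.randn(BS, SEQ_LEN, VOCAB, requires_grad=True)   # decoder output after the final linear layer
labels = torch.randint(1, VOCAB, (BS, SEQ_LEN))               # the shifted target tokens

# cross_entropy expects (N, C) logits and (N,) class indices, so flatten batch and sequence
loss = F.cross_entropy(preds.view(-1, VOCAB), labels.view(-1), ignore_index=PAD)
loss.backward()                                               # back-propagate; an optimizer step would follow
print(loss.item())
```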

That loss value is then back-propagated through the network, weights are updated, and the process restarts with the encoder processing a new batch.

We continue the training process for as many epochs as we like until we reach our objective (in terms of accuracy, loss value, validation loss, etc.). Every x iterations or epochs we save the weights of the network, in case we want to resume training some other time or to have the latest trained weights ready to perform inference at any time.

Inference

So that was the training process. Let’s now quickly look at the inference process. Once the transformer is trained, how do we execute it and run it?

For that, we have to focus on the green parts of the infographic.

At the bottom left of the graphic, we see the green Inference column that begins the inference process.

When we run the trained transformer, we will enter a single input phrase, for example: What’s your favourite character in The Expanse series?

That’s why our batch size will be 1 (we still need to specify a batch size because the computations require it). The phrase is tokenized and numericalized, producing a 1x9 (BS x SeqLen) tensor because this example phrase has 9 tokens ([“what’s”, ‘your’, ‘favourite’, ‘character’, ‘in’, ‘the’, ‘expanse’, ‘series’, ‘?’]). Notice that we can tokenize phrases in many different ways, and there are many tokenizers available that you can use. This small example uses a simple way of tokenizing the phrases.

We also create our input mask, which at this inference stage will have True in every position.

Next, we pass that input tensor to the embed module and add to it the output of the positional encoding module to get a 1x9x32 (BS x SeqLen x EmbSize) tensor.

Encoder phase, Inference

The first layer of the encoder begins with similar computations to the ones done during the training iterations, but this time using this 1x9x32 tensor. The encoder layers repeat until we arrive at the final one, where we obtain a 1x9x32 tensor which will be used by the decoder to provide the keys and values of the encoder-decoder attention stage.

We move to the decoder, where things get a little different.

Decoder phase, Inference

  • Our input to the decoder will initially be formed by the start-of-sentence token. (remember that we shift everything one position to the right to prevent the decoder from copying its input into its output).

  • The decoder will then output as its result the next word that we should add to the reply sentence (initially formed by just the start-of-sentence token).

  • We will take the new predicted word and add it to the input of the decoder, repeating the process and generating the next word of the reply sentence.

  • This again gets added to the input of the decoder and we continue like that until the decoder outputs the end-of-sentence token.

So our input to the decoder will initially have a shape of 1x1. In the next iteration it will become 1x2, then 1x3, and so on until it reaches 1xN, with N being the number of iterations of the decoder loop until we obtain the end-of-sentence token.

At each point in the loop, we create a new mask that adapts to each iteration. Initially it has a shape of 1x1x1 (BS x SeqLen x SeqLen). In the next iteration it becomes 1x2x2, then 1x3x3, until it reaches 1xNxN when we reach the end-of-sentence token. As before, this mask helps us prevent the model from paying attention to future positions in the sequence (beyond the current one it is considering) when calculating the attention scores.

We then go through a number of decoder layers, with each doing the same computations we saw before:

  • A self attention stage

  • An encoder-decoder attention stage

  • The feedforward stage.

At the end of the decoder layers we will obtain a 1xNx950 (BS x SeqLen x OutVocabSize) tensor, with N being the current position in the decoder loop. In the first iteration we obtain a 1x1x950 tensor, in the second a 1x2x950 tensor, and so on.

We pass the resulting tensor through a softmax module to obtain a probability distribution. This distribution gives us the probabilities of obtaining each of the elements of the output vocabulary for each word of the output phrase. We will consider the probabilities of the last part of that tensor, the ones belonging to the next word we want to predict.

We can sample in a variety of ways from this probability distribution to obtain a 1x1 tensor that contains the new word that will be added to the end of the current output sentence.

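A rough sketch of this decoding loop, assuming a hypothetical model object whose encoder and decoder callables match the shapes discussed above (the names, signatures and the greedy choice of the next word are illustrative, not the exact implementation behind the infographic):

```python
import torch

SOS, EOS, MAX_LEN = 1, 2, 50               # hypothetical special-token ids and length limit

def greedy_decode(model, src, src_mask, max_len=MAX_LEN):
    """Autoregressive decoding loop; greedy sampling is used here for simplicity."""
    enc_out = model.encoder(src, src_mask)                     # 1 x 9 x 32, computed only once
    ys = torch.tensor([[SOS]])                                 # 1 x 1, grows at each iteration
    for _ in range(max_len):
        n = ys.size(1)
        trg_mask = torch.tril(torch.ones(n, n, dtype=torch.bool)).unsqueeze(0)  # 1 x N x N
        out = model.decoder(ys, enc_out, src_mask, trg_mask)   # 1 x N x 950 after the final linear
        probs = out[:, -1].softmax(dim=-1)                     # distribution for the next word only
        next_word = probs.argmax(dim=-1, keepdim=True)         # greedy; we could sample instead
        ys = torch.cat([ys, next_word], dim=1)                 # append the new word and repeat
        if next_word.item() == EOS:
            break
    return ys
```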

We then continue to loop and add new words to the output sentence until we find the end-of-sentence token.

And that’s it, we have a cool transformer chatbot whose computations have been revealed to us through this x-ray transformer visualization.

You may download a larger version (10488 x 14000 pixels) of the x-ray transformer visualization from the dedicated github repo:

Translated from: https://towardsdatascience.com/x-ray-transformer-dive-into-transformers-training-inference-computations-through-a-single-visual-4e8d50667378
