Paper: Sequence to Sequence Learning with Neural Networks

Original paper: PDF
Citations: 12,780 (as of 2020/11/07)
Venue: NIPS 2014
Authors: Ilya Sutskever, Oriol Vinyals, Quoc V. Le (Google)



Abstract

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.



1 Introduction

Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network’s parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem.


Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.


Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM’s ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).


There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18], who were the first to map the entire input sentence to a vector, and is very similar to Cho et al. [5]. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, although it assumes a monotonic alignment between the inputs and the outputs [11].



Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.


The main result of this work is the following. On the WMT’14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 380M parameters each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of a SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized neural network architecture which has much room for improvement outperforms a mature phrase-based SMT system.


Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous state-of-the-art (which is 37.0 [9]).


Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.


A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings end up close to each other while sentences with different meanings end up far apart. A qualitative evaluation supports this claim, showing that our model is aware of word order and is fairly invariant to the active and passive voice.



2 The model

The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs $(x_1, \dots, x_T)$, a standard RNN computes a sequence of outputs $(y_1, \dots, y_T)$ by iterating the following equation:


$$h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right), \qquad y_t = W^{yh} h_t$$

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.

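A toy NumPy sketch of the recurrence above (our own illustration, not from the paper; the weight names simply follow the equation):

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    # Iterates h_t = sigm(W_hx @ x_t + W_hh @ h_{t-1}) and y_t = W_yh @ h_t
    # over the input sequence xs, returning all outputs and the final state.
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, ys = h0, []
    for x in xs:
        h = sigm(W_hx @ x + W_hh @ h)
        ys.append(W_yh @ h)
    return ys, h
```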

A simple strategy for general sequence learning is to map the input sequence to a fixed-sized vector using one RNN, and then to map the vector to the target sequence with another RNN (this approach has also been taken by Cho et al. [5]). While it could work in principle since the RNN is provided with all the relevant information, it would be difficult to train the RNNs due to the resulting long term dependencies [14, 4] (figure 1) [16, 15]. However, the Long Short-Term Memory (LSTM) [16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed in this setting.


The goal of the LSTM is to estimate the conditional probability $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$, where $(x_1, \dots, x_T)$ is an input sequence and $(y_1, \dots, y_{T'})$ is its corresponding output sequence, whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \dots, x_T)$ given by the last hidden state of the LSTM, and then computing the probability of $y_1, \dots, y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to the representation $v$ of $x_1, \dots, x_T$:


$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$$
In this equation, each $p(y_t \mid v, y_1, \dots, y_{t-1})$ distribution is represented with a softmax over all the words in the vocabulary. We use the LSTM formulation from Graves [10]. Note that we require that each sentence ends with a special end-of-sentence symbol “<EOS>”, which enables the model to define a distribution over sequences of all possible lengths. The overall scheme is outlined in figure 1, where the shown LSTM computes the representation of “A”, “B”, “C”, “<EOS>” and then uses this representation to compute the probability of “W”, “X”, “Y”, “Z”, “<EOS>”.

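The factorization above translates directly into a per-token log-softmax sum. A minimal sketch, assuming PyTorch and a `logits` tensor produced by the decoder (the function name is ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, targets):
    # logits: (T', vocab_size) decoder outputs for one sentence;
    # targets: (T',) gold token ids, ending with the <EOS> id.
    # Returns log p(y_1, ..., y_T' | x) = sum_t log softmax(logits_t)[y_t].
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(1, targets.unsqueeze(1)).sum()
```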

Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence $a, b, c$ to the sentence $\alpha, \beta, \gamma$, the LSTM is asked to map $c, b, a$ to $\alpha, \beta, \gamma$, where $\alpha, \beta, \gamma$ is the translation of $a, b, c$. This way, $a$ is in close proximity to $\alpha$, $b$ is fairly close to $\beta$, and so on, a fact that makes it easy for SGD to “establish communication” between the input and the output. We found this simple data transformation to greatly boost the performance of the LSTM.

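To make the encoder-decoder structure concrete, here is a minimal PyTorch sketch of the architecture described above (our own illustration, not the authors' code; class and argument names are hypothetical, and the hyperparameters follow Section 3.4):

```python
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    def __init__(self, src_vocab=160_000, tgt_vocab=80_000,
                 emb_dim=1000, hidden=1000, layers=4):
        super().__init__()
        # Two separate LSTMs: one reads the (reversed) source sequence,
        # the other decodes the target conditioned on the encoder's final state.
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, layers)
        self.decoder = nn.LSTM(emb_dim, hidden, layers)
        self.out = nn.Linear(hidden, tgt_vocab)  # naive softmax over 80k words

    def forward(self, src, tgt_in):
        # src: (src_len, batch) token ids, already reversed (Sec. 3.3);
        # tgt_in: (tgt_len, batch) target ids shifted right (teacher forcing).
        _, state = self.encoder(self.src_emb(src))        # keep only the final (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                          # logits over the target vocab
```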


3 Experiments

We applied our method to the WMT’14 English to French MT task in two ways. We used it to directly translate the input sentence without using a reference SMT system, and we used it to rescore the n-best lists of an SMT baseline. We report the accuracy of these translation methods, present sample translations, and visualize the resulting sentence representations.


3.1 Dataset details

We used the WMT’14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean “selected” subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT [29].


As typical neural language models rely on a vector representation for each word, we used a fixed vocabulary for both languages. We used 160,000 of the most frequent words for the source language and 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was replaced with a special “UNK” token.

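A small sketch of the vocabulary construction described above (our own illustration; the paper gives no code), assuming tokenized sentences as lists of strings:

```python
from collections import Counter

def build_vocab(sentences, max_size):
    # Keep the max_size most frequent words; everything else maps to "UNK".
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    return [word if word in vocab else "UNK" for word in sentence]

# Usage: 160,000 words for the source language, 80,000 for the target language.
# src_vocab = build_vocab(src_sentences, 160_000)
# tgt_vocab = build_vocab(tgt_sentences, 80_000)
```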

3.2 Decoding and Rescoring

The core of our experiments involved training a large deep LSTM on many sentence pairs. We trained it by maximizing the log probability of a correct translation $T$ given the source sentence $S$, so the training objective is

$$\frac{1}{|\mathcal{S}|} \sum_{(T, S) \in \mathcal{S}} \log p(T \mid S)$$
where $\mathcal{S}$ is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:
$$\hat{T} = \arg\max_{T} p(T \mid S)$$
We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>” symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1).

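A minimal, pure-Python sketch of this beam-search procedure (our own illustration; `step_fn`, which returns next-word log probabilities and a new decoder state given a partial hypothesis, is a hypothetical callable, not part of the paper):

```python
def beam_search(step_fn, init_state, bos_id, eos_id, beam_size=2, max_len=100):
    # step_fn(prefix, state) -> ({token_id: log_prob}, new_state)
    beam = [(0.0, [bos_id], init_state)]        # (log prob, prefix, decoder state)
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, prefix, state in beam:
            log_probs, new_state = step_fn(prefix, state)
            # extend each partial hypothesis with every possible next word
            for tok, lp in log_probs.items():
                candidates.append((score + lp, prefix + [tok], new_state))
        # discard all but the B most likely hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == eos_id:
                complete.append(cand)            # "<EOS>": move to complete set
            else:
                beam.append(cand)
        if not beam:
            break
    complete.extend(beam)                        # fall back if nothing emitted <EOS>
    return max(complete, key=lambda c: c[0])[1]
```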

3.3 Reversing the Source Sentences

While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.

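The transformation itself is trivial; a one-line sketch (our own, assuming sentence pairs given as lists of tokens) that reverses only the source side:

```python
def reverse_source(src_tokens, tgt_tokens):
    # Reverse the source sentence; leave the target sentence untouched.
    return src_tokens[::-1], tgt_tokens
```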

While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.


Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences (see sec. 3.7), which suggests that reversing the input sentences results in LSTMs with better memory utilization.


3.4 Training details

We found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. We found deep LSTMs to significantly outperform shallow LSTMs, where each additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden state. We used a naive softmax over 80,000 words at each output. The resulting LSTM has 380M parameters of which 64M are pure recurrent connections (32M for the “encoder” LSTM and 32M for the “decoder” LSTM). The complete training details are given below:

  • We initialized all of the LSTM’s parameters with the uniform distribution between -0.08 and 0.08.
  • We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we began halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs.
  • We used batches of 128 sequences for the gradient and divided it by the size of the batch (namely, 128).
  • Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, we compute $s = \|g\|_2$, where $g$ is the gradient divided by $128$. If $s > 5$, we set $g = \frac{5g}{s}$ (a short sketch of this clipping rule follows the list).
  • Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences within a minibatch were roughly of the same length, which gave a 2x speedup.
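
A minimal PyTorch sketch of the gradient-clipping rule above (our own illustration, not the authors' code; the function name and signature are hypothetical):

```python
import torch

def clip_gradients_(parameters, threshold=5.0, batch_size=128):
    # g is the minibatch gradient divided by the batch size (128);
    # if s = ||g||_2 exceeds the threshold, rescale g to (threshold * g) / s.
    grads = [p.grad for p in parameters if p.grad is not None]
    for g in grads:
        g.div_(batch_size)
    s = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if s > threshold:
        for g in grads:
            g.mul_(threshold / s)
```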


3.5 Parallelization

A C++ implementation of deep LSTM with the configuration from the previous section on a single GPU processes approximately 1,700 words per second. This was too slow for our purposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU (or layer) as soon as they were computed. Our models have 4 layers of LSTMs, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation achieved a speed of 6,300 (both English and French) words per second with a minibatch size of 128. Training took about ten days with this implementation.


3.6 Experimental Results

We used the cased BLEU score [24] to evaluate the quality of our translations. We computed our BLEU scores using multi-bleu.pl on the tokenized predictions and ground truth. This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29]. However, if we evaluate the state of the art system of [9] (whose predictions can be downloaded from statmt.org\matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by statmt.org\matrix.


The results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs that differ in their random initializations and in the random order of minibatches. While the decoded translations of the LSTM ensemble do not beat the state of the art, it is the first time that a pure neural translation system outperforms a phrase-based SMT baseline on a large MT task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system.

[Table 1 and Table 2 (the BLEU results referenced above) are not reproduced here; see the original paper.]

3.7 Performance on long sentences

We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

[Figure 3 (performance on long sentences) and Table 3 (example long-sentence translations) are not reproduced here; see the original paper.]

3.8 Model Analysis

One of the attractive features of our model is its ability to turn a sequence of words into a vector of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly shows that the representations are sensitive to the order of words, while being fairly insensitive to the replacement of an active voice with a passive voice. The two-dimensional projections are obtained using PCA.

Figure 2 (image not reproduced here): The figure shows a 2-dimensional PCA projection of the LSTM hidden states obtained after processing the phrases in the figure. The phrases are clustered by meaning, which in these examples is primarily a function of word order and would be difficult to capture with a bag-of-words model. Notice that both clusters have similar internal structure.
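
A minimal sketch of how such a projection can be produced (our own illustration, assuming scikit-learn and a matrix of final encoder hidden states, one 1000-dimensional vector per phrase):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_hidden_states(hidden_states):
    # hidden_states: (num_phrases, hidden_dim) array of final encoder states.
    # Returns a (num_phrases, 2) array of coordinates for a Figure 2-style plot.
    return PCA(n_components=2).fit_transform(np.asarray(hidden_states))
```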


4 Related work

There is a large body of work on applications of neural networks to machine translation. So far, the simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality.


More recently, researchers have begun to look into ways of including information about the source language into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder’s alignment information to provide the NNLM with the most useful words in the input sentence. Their approach was highly successful and it achieved large improvements over their baseline.


Our work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et al. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their primary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging results. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et al. [5] by translating pieces of the source sentence in a way that produces smooth translations, which is similar to a phrase-based approach. We suspect that they could achieve similar improvements by simply training their networks on reversed source sentences.


End-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and outputs by feedforward networks, and maps them to similar points in space. However, their approach cannot generate translations directly: to get a translation, they need to do a look-up for the closest vector in the pre-computed database of sentences, or to rescore a sentence.



5 Conclusion

In this work, we showed that a large deep LSTM with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.


We were surprised by the extent of the improvement obtained by reversing the words in the source sentences. We conclude that it is important to find a problem encoding that has the greatest number of short term dependencies, as they make the learning problem much simpler. In particular, while we were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1), we believe that a standard RNN should be easily trainable when the source sentences are reversed (although we did not verify it experimentally).


We were also surprised by the ability of the LSTM to correctly translate very long sentences. We were initially convinced that the LSTM would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26]. And yet, LSTMs trained on the reversed dataset had little difficulty translating long sentences.


Most importantly, we demonstrated that a simple, straightforward and a relatively unoptimized approach can outperform a mature SMT system, so further work will likely lead to even greater translation accuracies. These results suggest that our approach will likely do well on other challenging sequence to sequence problems.
