Paper: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Original paper: PDF
Citations: 11,398 (as of 2020/11/08)
Year: 2014
Authors: Kyunghyun Cho et al.



Abstract

In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.



1 Introduction

Deep neural networks have shown great success in various applications such as object recognition (see, e.g., (Krizhevsky et al., 2012)) and speech recognition (see, e.g., (Dahl et al., 2012)). Furthermore, many recent works showed that neural networks can be successfully used in a number of tasks in natural language processing (NLP). These include, but are not limited to, language modeling (Bengio et al., 2003), paraphrase detection (Socher et al., 2011) and word embedding extraction (Mikolov et al., 2013). In the field of statistical machine translation (SMT), deep neural networks have begun to show promising results. (Schwenk, 2012) summarizes a successful usage of feedforward neural networks in the framework of phrase-based SMT system.


Along this line of research on using neural networks for SMT, this paper focuses on a novel neural network architecture that can be used as a part of the conventional phrase-based SMT system. The proposed neural network architecture, which we will refer to as an RNN Encoder–Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence. The two networks are trained jointly to maximize the conditional probability of the target sequence given a source sequence. Additionally, we propose to use a rather sophisticated hidden unit in order to improve both the memory capacity and the ease of training.


The proposed RNN Encoder–Decoder with a novel hidden unit is empirically evaluated on the task of translating from English to French. We train the model to learn the translation probability of an English phrase to a corresponding French phrase. The model is then used as a part of a standard phrase-based SMT system by scoring each phrase pair in the phrase table. The empirical evaluation reveals that this approach of scoring phrase pairs with an RNN Encoder–Decoder improves the translation performance.


We qualitatively analyze the trained RNN Encoder–Decoder by comparing its phrase scores with those given by the existing translation model. The qualitative analysis shows that the RNN Encoder–Decoder is better at capturing the linguistic regularities in the phrase table, indirectly explaining the quantitative improvements in the overall translation performance. The further analysis of the model reveals that the RNN Encoder–Decoder learns a continuous space representation of a phrase that preserves both the semantic and syntactic structure of the phrase.



2 RNN Encoder–Decoder

2.1 Preliminary: Recurrent Neural Networks

A recurrent neural network (RNN) is a neural network that consists of a hidden state $h$ and an optional output $y$, and which operates on a variable-length sequence $x = (x_1, \ldots, x_T)$. At each time step $t$, the hidden state $h_t$ of the RNN is updated by


$$h_t = f(h_{t-1}, x_t) \tag{1}$$
where f is a non-linear activation function. f may be as simple as an elementwise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).


An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each timestep $t$ is the conditional distribution $p(x_t \mid x_{t-1}, \ldots, x_1)$. For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function


$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp(w_j h_t)}{\sum_{j'=1}^{K} \exp(w_{j'} h_t)}$$
for all possible symbols $j = 1, \ldots, K$, where $w_j$ are the rows of a weight matrix $W$. By combining these probabilities, we can compute the probability of the sequence $x$ using


$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1)$$
From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step.
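To make this concrete, the following minimal NumPy sketch (not from the paper; the sizes, random initialization, and the choice of tanh for $f$ are illustrative assumptions) computes the per-step softmax distribution and the resulting sequence log-probability:

```python
import numpy as np

K, H = 10, 8                              # vocabulary size and hidden-state size (illustrative)
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(H, K))    # input weights for the 1-of-K coded symbol
U = rng.normal(scale=0.1, size=(H, H))    # recurrent weights
W = rng.normal(scale=0.1, size=(K, H))    # output weights; row w_j scores symbol j

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sequence_log_prob(symbols):
    """log p(x) = sum_t log p(x_t | x_{t-1}, ..., x_1)."""
    h = np.zeros(H)                       # hidden state summarizing the symbols read so far
    log_p = 0.0
    for s in symbols:
        p_next = softmax(W @ h)           # distribution over the next symbol given the history
        log_p += np.log(p_next[s])
        x = np.eye(K)[s]                  # 1-of-K coding of the symbol just read
        h = np.tanh(E @ x + U @ h)        # Eq. (1) with f chosen as tanh
    return log_p

print(sequence_log_prob([3, 1, 4, 1, 5]))
```

Sampling a new sequence, as described next, would simply draw each next symbol from `p_next` instead of reading it from a given sequence.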


2.2 RNN Encoder–Decoder

In this paper, we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this new model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where one should note that the input and output sequence lengths $T$ and $T'$ may differ.


The encoder is an RNN that reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.


The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_t$. However, unlike the RNN described in Sec. 2.1, both $y_t$ and $h_t$ are also conditioned on $y_{t-1}$ and on the summary $c$ of the input sequence. Hence, the hidden state of the decoder at time $t$ is computed by,


$$h_t = f(h_{t-1}, y_{t-1}, c)$$
and similarly, the conditional distribution of the next symbol is

$$P(y_t \mid y_{t-1}, \ldots, y_1, c) = g(h_t, y_{t-1}, c)$$
for given activation functions f and g (the latter must produce valid probabilities, e.g. with a softmax).


See Fig. 1 for a graphical depiction of the proposed model architecture.
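As a rough illustration of Fig. 1, the sketch below (parameter names, shapes, and the tanh activations are assumptions for illustration, not the paper's exact parameterization) folds a source sequence into a summary $c$ and then scores a target sequence with a decoder conditioned on the previous target symbol, its own hidden state, and $c$:

```python
import numpy as np

Kx, Ky, H = 12, 12, 16                      # source/target vocabulary sizes, hidden size (illustrative)
rng = np.random.default_rng(1)
init = lambda *shape: rng.normal(scale=0.1, size=shape)
We, Ue = init(H, Kx), init(H, H)            # encoder: input and recurrent weights
Wd, Ud, Cd = init(H, Ky), init(H, H), init(H, H)     # decoder: weights for y_{t-1}, h_{t-1}, c
Wo, Co, Yo = init(Ky, H), init(Ky, H), init(Ky, Ky)  # output layer g(h_t, y_{t-1}, c)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def encode(xs):
    """Read the source symbols one by one; the final state is the summary c."""
    h = np.zeros(H)
    for s in xs:
        h = np.tanh(We @ np.eye(Kx)[s] + Ue @ h)
    return h

def log_p_y_given_x(ys, xs):
    """Score a (source, target) pair: log p(y | x) under the encoder-decoder."""
    c = encode(xs)
    h = np.tanh(c)                          # assumed: decoder state initialized from c
    y_prev = np.zeros(Ky)                   # no previous target symbol at the first step
    total = 0.0
    for s in ys:
        p_t = softmax(Wo @ h + Co @ c + Yo @ y_prev)   # g(h_t, y_{t-1}, c) realized with a softmax
        total += np.log(p_t[s])
        y_prev = np.eye(Ky)[s]
        h = np.tanh(Wd @ y_prev + Ud @ h + Cd @ c)     # recurrence h_t = f(h_{t-1}, y_{t-1}, c)
    return total

print(log_p_y_given_x([2, 5, 7], [1, 3, 3, 9]))
```

The value returned here plays the role of the pairwise score $\log p_\theta(y \mid x)$ used later for rescoring phrase pairs.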

The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$
where $\theta$ is the set of the model parameters and each $(x_n, y_n)$ is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, we can use a gradient-based algorithm to estimate the model parameters.


Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability $p_\theta(y \mid x)$ from Eqs. (3) and (4).
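For the generation mode, here is a small sketch of the sampling loop; the `next_symbol_dist` callable stands in for the decoder's conditional distribution and is a hypothetical interface, as is the toy distribution in the usage line:

```python
import numpy as np

def sample_target(next_symbol_dist, eos_id, max_len=20, seed=0):
    """Draw a target sequence one symbol at a time from p(y_t | y_<t, x)."""
    rng = np.random.default_rng(seed)
    ys = []
    while len(ys) < max_len:
        p_t = next_symbol_dist(ys)               # decoder's distribution given the prefix
        s = int(rng.choice(len(p_t), p=p_t))     # sample the next symbol
        if s == eos_id:                          # stop at the end-of-sequence symbol
            break
        ys.append(s)
    return ys

# Toy usage with a fixed distribution over 4 symbols, where symbol 3 plays the role of <eos>.
print(sample_target(lambda ys: np.array([0.5, 0.2, 0.1, 0.2]), eos_id=3))
```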


2.3 Hidden Unit that Adaptively Remembers and Forgets

In addition to a novel model architecture, we also propose a new type of hidden unit ($f$ in Eq. (1)) that has been motivated by the LSTM unit but is much simpler to compute and implement. Fig. 2 shows the graphical depiction of the proposed hidden unit.


Let us describe how the activation of the $j$-th hidden unit is computed. First, the reset gate $r_j$ is computed by


$$r_j = \sigma\left([W_r x]_j + [U_r h_{t-1}]_j\right)$$
where $\sigma$ is the logistic sigmoid function, and $[\cdot]_j$ denotes the $j$-th element of a vector. $x$ and $h_{t-1}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are weight matrices which are learned.


Similarly, the update gate $z_j$ is computed by

$$z_j = \sigma\left([W_z x]_j + [U_z h_{t-1}]_j\right)$$
The actual activation of the proposed unit $h_j$ is then computed by
$$h_j^{(t)} = z_j h_j^{(t-1)} + (1 - z_j) \tilde{h}_j^{(t)}$$
where
$$\tilde{h}_j^{(t)} = \phi\left([W x]_j + \left[U \left(r \odot h_{t-1}\right)\right]_j\right)$$
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.


On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember longterm information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013).


As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.


In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
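Putting the equations above together, here is a minimal NumPy sketch of one step of the proposed gated unit (essentially what is now called a GRU); the sizes, the random initialization, and the use of tanh for the candidate activation are illustrative assumptions:

```python
import numpy as np

H, D = 6, 4                                    # number of hidden units and input size (illustrative)
rng = np.random.default_rng(2)
init = lambda *s: rng.normal(scale=0.1, size=s)
W_r, U_r = init(H, D), init(H, H)              # reset-gate weights
W_z, U_z = init(H, D), init(H, H)              # update-gate weights
W, U = init(H, D), init(H, H)                  # candidate-activation weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(x, h_prev):
    """One step of the proposed hidden unit with reset and update gates."""
    r = sigmoid(W_r @ x + U_r @ h_prev)          # reset gate: how much of the past to expose
    z = sigmoid(W_z @ x + U_z @ h_prev)          # update gate: how much of the past to keep
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))  # candidate state computed from the reset past
    return z * h_prev + (1.0 - z) * h_tilde      # interpolate between old state and candidate

h = np.zeros(H)
for x in rng.normal(size=(5, D)):              # run the unit over a short random input sequence
    h = gated_step(x, h)
print(h)
```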



3 Statistical Machine Translation

In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation $f$ given a source sentence $e$, which maximizes

$$p(f \mid e) \propto p(e \mid f)\, p(f)$$
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model $\log p(f \mid e)$ as a log-linear model with additional features and corresponding weights:


$$\log p(f \mid e) = \sum_{n=1}^{N} w_n f_n(f, e) + \log Z(e) \tag{9}$$
where $f_n$ and $w_n$ are the $n$-th feature and weight, respectively. $Z(e)$ is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.
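As a toy numeric illustration of Eq. (9) (the feature values and weights below are made up), note that $\log Z(e)$ is shared by every candidate translation of the same source sentence, so ranking candidates only requires the weighted feature sum:

```python
import numpy as np

# Hypothetical feature values f_n(f, e) for three candidate translations of one source
# sentence e: [log translation-model score, log language-model score, word penalty].
features = np.array([
    [-2.1, -5.0, -7.0],
    [-1.7, -6.2, -6.0],
    [-2.5, -4.4, -8.0],
])
weights = np.array([1.0, 0.8, -0.2])      # w_n; in practice tuned to maximize BLEU on a dev set

scores = features @ weights               # sum_n w_n f_n(f, e), up to the shared log Z(e)
print("best candidate:", int(np.argmax(scores)))
```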


In the phrase-based SMT framework introduced in (Koehn et al., 2003) and (Marcu and Wong, 2002), the translation model $\log p(e \mid f)$ is factorized into the translation probabilities of matching phrases in the source and target sentences. These probabilities are once again considered additional features in the log-linear model (see Eq. (9)) and are weighted accordingly to maximize the BLEU score.


Since the neural net language model was proposed in (Bengio et al., 2003), neural networks have been used widely in SMT systems. In many cases, neural networks have been used to rescore translation hypotheses (n-best lists) (see, e.g., (Schwenk et al., 2006)). Recently, however, there has been interest in training neural networks to score the translated sentence (or phrase pairs) using a representation of the source sentence as an additional input. See, e.g., (Schwenk, 2012), (Son et al., 2012) and (Zou et al., 2013).


3.1 Scoring Phrase Pairs with RNN Encoder–Decoder

Here we propose to train the RNN Encoder–Decoder (see Sec. 2.2) on a table of phrase pairs and use its scores as additional features in the log-linear model in Eq. (9) when tuning the SMT decoder.


When we train the RNN Encoder–Decoder, we ignore the (normalized) frequencies of each phrase pair in the original corpora. This measure was taken in order (1) to reduce the computational expense of randomly selecting phrase pairs from a large phrase table according to the normalized frequencies and (2) to ensure that the RNN Encoder–Decoder does not simply learn to rank the phrase pairs according to their numbers of occurrences. One underlying reason for this choice was that the existing translation probability in the phrase table already reflects the frequencies of the phrase pairs in the original corpus. With a fixed capacity of the RNN Encoder–Decoder, we try to ensure that most of the capacity of the model is focused toward learning linguistic regularities, i.e., distinguishing between plausible and implausible translations, or learning the "manifold" (region of probability concentration) of plausible translations.


Once the RNN Encoder–Decoder is trained, we add a new score for each phrase pair to the existing phrase table. This allows the new scores to enter into the existing tuning algorithm with minimal additional overhead in computation.
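A minimal sketch of this rescoring step follows; the phrase-table layout and the `score_pair` callable are assumptions for illustration, not an actual toolkit format:

```python
def rescore_phrase_table(phrase_table, score_pair):
    """Append one extra feature per phrase pair, e.g. the RNN Encoder-Decoder log-probability
    of the target phrase given the source phrase."""
    rescored = []
    for src, tgt, feats in phrase_table:
        rescored.append((src, tgt, feats + [score_pair(src, tgt)]))
    return rescored

# Toy usage with two entries and a dummy scorer standing in for the trained model.
table = [("la maison", "the house", [-0.4, -1.2]),
         ("la maison", "house the", [-0.4, -3.5])]
print(rescore_phrase_table(table, lambda s, t: -0.1 * abs(len(s) - len(t))))
```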


As Schwenk pointed out in (Schwenk, 2012), it is possible to completely replace the existing phrase table with the proposed RNN Encoder–Decoder. In that case, for a given source phrase, the RNN Encoder–Decoder will need to generate a list of (good) target phrases. This requires, however, an expensive sampling procedure to be performed repeatedly. In this paper, thus, we only consider rescoring the phrase pairs in the phrase table.


3.2 Related Approaches: Neural Networks in Machine Translation

Before presenting the empirical results, we discuss a number of recent works that have proposed to use neural networks in the context of SMT.


Schwenk in (Schwenk, 2012) proposed a similar approach of scoring phrase pairs. Instead of the RNN-based neural network, he used a feedforward neural network that has fixed-size inputs (7 words in his case, with zero-padding for shorter phrases) and fixed-size outputs (7 words in the target language). When it is used specifically for scoring phrases for the SMT system, the maximum phrase length is often chosen to be small. However, as the length of phrases increases or as we apply neural networks to other variable-length sequence data, it is important that the neural network can handle variable-length input and output. The proposed RNN Encoder–Decoder is well-suited for these applications.


Similar to (Schwenk, 2012), Devlin et al. (Devlin et al., 2014) proposed to use a feedforward neural network to model a translation model, however, by predicting one word in a target phrase at a time. They reported an impressive improvement, but their approach still requires the maximum length of the input phrase (or context words) to be fixed a priori.


Although it is not exactly a neural network they train, the authors of (Zou et al., 2013) proposed to learn a bilingual embedding of words/phrases. They use the learned embedding to compute the distance between a pair of phrases which is used as an additional score of the phrase pair in an SMT system.


In (Chandar et al., 2014), a feedforward neural network was trained to learn a mapping from a bag-of-words representation of an input phrase to an output phrase. This is closely related to both the proposed RNN Encoder–Decoder and the model proposed in (Schwenk, 2012), except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed in (Gao et al., 2013) as well. Earlier, a similar encoder–decoder model using two recursive neural networks was proposed in (Socher et al., 2011), but their model was restricted to a monolingual setting, i.e. the model reconstructs an input sentence. More recently, another encoder–decoder model using an RNN was proposed in (Auli et al., 2013), where the decoder is conditioned on a representation of either a source sentence or a source context.


One important difference between the proposed RNN Encoder–Decoder and the approaches in (Zou et al., 2013) and (Chandar et al., 2014) is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.


The closest approach related to the proposed RNN Encoder–Decoder is the Recurrent Continuous Translation Model (Model 2) proposed in (Kalchbrenner and Blunsom, 2013). In their paper, they proposed a similar model that consists of an encoder and decoder. The difference with our model is that they used a convolutional n-gram model (CGM) for the encoder and the hybrid of an inverse CGM and a recurrent neural network for the decoder. They, however, evaluated their model on rescoring the n-best list proposed by the conventional SMT system and computing the perplexity of the gold standard translations.



4 Experiments

4.1 Data and Baseline System

4.1.1 RNN Encoder–Decoder


4.1.2 Neural Language Model


4.2 Quantitative Analysis

We tried the following combinations:

  1. Baseline configuration
  2. Baseline + RNN
  3. Baseline + CSLM + RNN
  4. Baseline + CSLM + RNN + Word penalty


4.3 Qualitative Analysis


4.4 Word and Phrase Representations

Since the proposed RNN Encoder–Decoder is not specifically designed only for the task of machine translation, here we briefly look at the properties of the trained model.


It has been known for some time that continuous space language models using neural networks are able to learn semantically meaningful embeddings (See, e.g., (Bengio et al., 2003; Mikolov et al., 2013)). Since the proposed RNN Encoder–Decoder also projects to and maps back from a sequence of words into a continuous space vector, we expect to see a similar property with the proposed model as well.


The left plot in Fig. 4 shows the 2-D embedding of the words using the word embedding matrix learned by the RNN Encoder–Decoder. The projection was done by the recently proposed Barnes-Hut-SNE (van der Maaten, 2013). We can clearly see that semantically similar words are clustered with each other (see the zoomed-in plots in Fig. 4).
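To reproduce this kind of plot, one option is scikit-learn's Barnes-Hut t-SNE; the sketch below uses a random stand-in for the learned word-embedding matrix, so everything here is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned word-embedding matrix (vocabulary size x embedding dimension);
# in the real analysis these rows would come from the trained RNN Encoder-Decoder.
rng = np.random.default_rng(3)
embeddings = rng.normal(size=(200, 100))

# Barnes-Hut approximation of t-SNE, projecting the embeddings to 2-D for plotting.
coords = TSNE(n_components=2, method="barnes_hut", perplexity=30.0,
              init="pca", random_state=0).fit_transform(embeddings)
print(coords.shape)   # (200, 2): one 2-D point per word, ready to scatter-plot
```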


The proposed RNN Encoder–Decoder naturally generates a continuous-space representation of a phrase. The representation ($c$ in Fig. 1) in this case is a 1000-dimensional vector. Similarly to the word representations, we visualize the representations of the phrases that consist of four or more words using the Barnes-Hut-SNE in Fig. 5.


From the visualization, it is clear that the RNN Encoder–Decoder captures both semantic and syntactic structures of the phrases. For instance, in the bottom-left plot, most of the phrases are about the duration of time, while those phrases that are syntactically similar are clustered together. The bottom-right plot shows the cluster of phrases that are semantically similar (countries or regions). On the other hand, the top-right plot shows the phrases that are syntactically similar.



5 Conclusion

In this paper, we proposed a new neural network architecture, called an RNN Encoder–Decoder that is able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length. The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence. Along with the new architecture, we proposed a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence.


We evaluated the proposed model with the task of statistical machine translation, where we used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, we were able to show that the new model is able to capture linguistic regularities in the phrase pairs well and also that the RNN Encoder–Decoder is able to propose well-formed target phrases.


The scores by the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. Also, we found that the contribution by the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that we can improve the performance further by using, for instance, the RNN Encoder–Decoder and the neural net language model together.


Our qualitative analysis of the trained model shows that it indeed captures the linguistic regularities in multiple levels i.e. at the word level as well as phrase level. This suggests that there may be more natural language related applications that may benefit from the proposed RNN Encoder– Decoder.


The proposed architecture has large potential for further improvement and analysis. One approach that was not investigated here is to replace the whole, or a part of the phrase table by letting the RNN Encoder–Decoder propose target phrases. Also, noting that the proposed model is not limited to being used with written language, it will be an important future research to apply the proposed architecture to other applications such as speech transcription.
