XLM: Cross-lingual Language Model

Models like BERT (Devlin et al.) or GPT (Radford et al.) have achieved state-of-the-art results in language understanding. However, these models are pre-trained on only one language. Recently, efforts have been made to move beyond monolingual representations and build universal cross-lingual models capable of encoding any sentence into a shared embedding space.

In this article, we will discuss the paper Cross-lingual Language Model Pretraining, proposed by Facebook AI. The authors propose two approaches to cross-lingual language modeling:

  1. Unsupervised, which relies on monolingual data

  2. Supervised, which relies on parallel data.

Cross-lingual Language Model (XLM)

In this section, we will discuss the approaches proposed for training the XLM.

Shared Sub-Word Vocabulary

The model uses the same shared vocabulary for all languages. This helps establish a common embedding space for tokens from every language. As a result, languages that share a script (alphabet) or have similar words map better into this common embedding space.

For tokenizing the corpora, Byte-Pair Encoding (BPE) is used.

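To make this concrete, here is a minimal sketch using Hugging Face's tokenizer for a pretrained XLM checkpoint (the checkpoint name xlm-mlm-xnli15-1024 and the example sentences are assumptions chosen for illustration): a single BPE vocabulary tokenizes sentences from different languages, so related surface forms end up as overlapping sub-word units.

```python
# A minimal sketch: one shared BPE vocabulary tokenizing several languages.
# Assumes the Hugging Face `transformers` package and the publicly available
# xlm-mlm-xnli15-1024 checkpoint; the sentences are illustrative.
from transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-xnli15-1024")

sentences = {
    "en": "The cat sits on the mat.",
    "fr": "Le chat est assis sur le tapis.",
    "de": "Die Katze sitzt auf der Matte.",
}

for lang, text in sentences.items():
    # Every language goes through the same vocabulary and lookup table.
    print(lang, tokenizer.tokenize(text))
```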

Causal Language Modeling (CLM)

This is the standard language modeling objective, where we maximize the probability of a token x_t appearing at the t-th position of a given sequence, conditioned on all the tokens x_<t that precede it, i.e.

maximize P(x_t | x_<t)

Causal Language Modeling objective, via the XLNet paper

OpenAI’s GPT and GPT-2 are trained on this objective. You can refer to my articles on GPT and GPT-2 if you’re interested in the details of this objective.

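As a rough sketch of this objective (not the paper's code), the CLM loss can be computed by shifting the token stream one position and applying cross-entropy between the model's predictions and the actual next tokens; the toy embedding and output head below stand in for XLM's causal Transformer.

```python
# Minimal sketch of the CLM objective: predict token t from tokens < t.
# The embedding + linear head are a toy stand-in for the Transformer;
# shapes and sizes are illustrative.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
embed = torch.nn.Embedding(vocab_size, hidden)
lm_head = torch.nn.Linear(hidden, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 10))     # batch of 1 sequence, length 10
hidden_states = embed(tokens)                      # stand-in for causal Transformer outputs
logits = lm_head(hidden_states)                    # (1, 10, vocab_size)

# Shift by one: position t is predicted from positions < t.
pred = logits[:, :-1, :].reshape(-1, vocab_size)   # predictions for positions 1..9
target = tokens[:, 1:].reshape(-1)                 # gold next tokens
clm_loss = F.cross_entropy(pred, target)           # maximizes P(x_t | x_<t)
print(clm_loss.item())
```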

Masked Language Modeling (MLM)

This is a type of denoising autoencoding objective, also known as the Cloze task. Here, we maximize the probability of a masked token x_t appearing at the t-th position of a given sequence, conditioned on the corrupted version of that sequence, x_hat, i.e.

maximize P(x_t | x_hat)

Masked Language Modeling objective, via the XLNet paper

BERT and RoBERTa are trained on this objective. You can refer to my articles on BERT and RoBERTa if you’re interested in the details of this objective.

Note that the only difference between BERT's and XLM's approach here is that BERT uses pairs of sentences, whereas XLM uses streams of an arbitrary number of sentences, truncated once the length reaches 256 tokens.

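Here is a minimal sketch of the corruption step, following the commonly used BERT-style recipe (select ~15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged); the exact numbers and the mask id are illustrative rather than taken from the XLM code. The loss is then computed only on the selected positions.

```python
# Minimal sketch of MLM corruption: mask ~15% of a token stream and predict
# those tokens from the corrupted sequence x_hat. Hyperparameters follow the
# usual BERT-style recipe and are illustrative.
import torch

vocab_size, mask_id = 100, 0
tokens = torch.randint(5, vocab_size, (1, 16))     # original stream x
x_hat = tokens.clone()

# Choose ~15% of positions to predict; -100 marks positions ignored by the loss.
is_masked = torch.rand(tokens.shape) < 0.15
labels = torch.where(is_masked, tokens, torch.full_like(tokens, -100))

# Of the chosen positions: 80% -> mask token, 10% -> random token, 10% -> unchanged.
rand = torch.rand(tokens.shape)
x_hat[is_masked & (rand < 0.8)] = mask_id
replace_random = is_masked & (rand >= 0.8) & (rand < 0.9)
x_hat[replace_random] = torch.randint(5, vocab_size, tokens.shape)[replace_random]

print(x_hat)    # corrupted input fed to the model
print(labels)   # loss is computed only where labels != -100
```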

Translation Language Modeling (TLM)

The CLM and MLM tasks work well on monolingual corpora; however, they do not take advantage of the available parallel translation data. Hence, the authors propose a Translation Language Modeling objective, in which we concatenate parallel sentences from the translation data and randomly mask tokens in the source as well as in the target sentence. For example, we might mask words in an English sentence as well as in its French translation. All the words in the sequence contribute to the prediction of a given masked word, hence establishing a cross-lingual mapping among the tokens.

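Conceptually, a TLM training example is just an MLM example built from a parallel pair: the source sentence and its translation are concatenated into one stream, position indices are reset for the target sentence (as described in the paper), each half gets its own language embedding id, and tokens are masked on both sides so the model can attend across languages to recover them. A rough sketch of the input construction, with made-up token ids:

```python
# Rough sketch of a TLM training example: concatenate a parallel pair
# (English + French), reset position indices for the target sentence,
# and mask tokens on both sides. All ids below are made up for illustration.
import torch

en_ids = torch.tensor([11, 12, 13, 14])         # e.g. "the cat is blue"
fr_ids = torch.tensor([21, 22, 23, 24, 25])     # e.g. "le chat est bleu ."

tokens = torch.cat([en_ids, fr_ids])
positions = torch.cat([torch.arange(len(en_ids)),    # 0..3 for the source
                       torch.arange(len(fr_ids))])   # reset to 0..4 for the target
lang_ids = torch.cat([torch.zeros(len(en_ids), dtype=torch.long),   # language embedding ids
                      torch.ones(len(fr_ids), dtype=torch.long)])

# Mask one token in each language; the model can use the other language's
# context to recover it, which encourages cross-lingual alignment.
mask_id = 0
x_hat = tokens.clone()
x_hat[torch.tensor([1, 6])] = mask_id

print(x_hat, positions, lang_ids, sep="\n")
```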

XLM

In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM.

XLM Paper

XLM Pre-training

In this section, we’ll discuss how XLM Pre-training is leveraged for downstream tasks like:

  1. Zero-shot cross-lingual classification

  2. Supervised and unsupervised neural machine translation

  3. Language models for low-resource languages

  4. Unsupervised cross-lingual word embeddings

Zero-shot Cross-lingual Classification

Just like any other Transformer-based monolingual model, XLM is fine-tuned on the XNLI dataset to obtain a cross-lingual classifier.

A classification layer is added on top of XLM and it is trained on the English NLI training dataset. Then the model is evaluated on 15 XNLI languages.

Since the model has not been fine-tuned to classify sentences in any of these languages other than English, this is an example of zero-shot learning.

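A minimal sketch of this setup with Hugging Face's XLMForSequenceClassification (the checkpoint name and the tiny hand-written example are assumptions; a real run would fine-tune on the full English NLI training set and then evaluate on the remaining XNLI languages):

```python
# Sketch: add a 3-way NLI head on top of XLM, train on English pairs only,
# then run inference on another language with no further tuning.
# Checkpoint name and data are illustrative assumptions.
import torch
from transformers import XLMTokenizer, XLMForSequenceClassification

name = "xlm-mlm-xnli15-1024"
tokenizer = XLMTokenizer.from_pretrained(name)
model = XLMForSequenceClassification.from_pretrained(name, num_labels=3)

# One English premise/hypothesis pair (label 0 standing for "entailment").
batch = tokenizer("A man is playing a guitar.", "A person plays music.",
                  return_tensors="pt")
labels = torch.tensor([0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)     # cross-entropy on the English pair
outputs.loss.backward()
optimizer.step()

# Zero-shot evaluation: the same classifier applied to a French pair.
model.eval()
with torch.no_grad():
    fr = tokenizer("Un homme joue de la guitare.", "Une personne joue de la musique.",
                   return_tensors="pt")
    print(model(**fr).logits.argmax(dim=-1))
```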

Unsupervised NMT

For this task, the authors propose pre-training a complete encoder-decoder architecture with a cross-lingual language modeling objective. The model is evaluated on several translation benchmarks including WMT’14 English-French, WMT’16 English-German, and WMT’16 English-Romanian.

Supervised NMT

Here, the encoder and decoder are initialized with pre-trained weights from XLM and then fine-tuned on the supervised translation dataset. This essentially achieves multi-lingual machine translation.

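Sketching the idea in plain PyTorch (an illustration of the warm-start, not the authors' implementation): both sides of a sequence-to-sequence model start from the same pretrained cross-lingual weights before being fine-tuned on parallel data. Here a small randomly initialized module stands in for the pretrained XLM.

```python
# Rough sketch: warm-start a translation model's encoder (and the embeddings on
# both sides) from a "pretrained" cross-lingual model, then fine-tune end to end.
import copy
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64

# Stand-in for a pretrained XLM: a shared embedding plus Transformer layers.
pretrained_embed = nn.Embedding(vocab_size, hidden)
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)

class TranslationModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder and both embedding tables copy the pretrained weights.
        self.src_embed = copy.deepcopy(pretrained_embed)
        self.tgt_embed = copy.deepcopy(pretrained_embed)
        self.encoder = copy.deepcopy(pretrained_encoder)
        # The decoder needs cross-attention, so it is built fresh here;
        # in practice its self-attention blocks can also be warm-started.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        memory = self.encoder(self.src_embed(src))
        return self.out(self.decoder(self.tgt_embed(tgt), memory))

model = TranslationModel()
src = torch.randint(0, vocab_size, (1, 7))
tgt = torch.randint(0, vocab_size, (1, 5))
print(model(src, tgt).shape)    # (1, 5, vocab_size)
```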

For more on multi-lingual NMT, refer to this blog.

Low-resource language modeling

Here’s where “languages with the same script or similar words provide better mapping” comes into the picture. For example, there are 100k sentences written in Nepali on Wikipedia and about 6 times more in Hindi. Moreover, these languages have 80% of tokens in common.

Hence, a cross-lingual language model is clearly beneficial for a Nepali language model, since it is effectively trained on considerably more data from a closely related language.

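The shared-token argument is easy to make concrete: given tokenized corpora for the two languages, the overlap is simply the fraction of one language's token types that also occur in the other. A toy sketch (the token lists are placeholders; in practice they would come from BPE-tokenized Nepali and Hindi Wikipedia dumps):

```python
# Toy sketch: measure how much of a low-resource language's token vocabulary
# is shared with a related high-resource language. The token lists are
# placeholders, not real corpus statistics.
nepali_tokens = ["नेपाल", "को", "राजधानी", "काठमाडौँ", "हो", "।"]
hindi_tokens = ["भारत", "की", "राजधानी", "दिल्ली", "है", "।", "को", "हो"]

nepali_vocab, hindi_vocab = set(nepali_tokens), set(hindi_tokens)
shared = nepali_vocab & hindi_vocab

overlap = len(shared) / len(nepali_vocab)
print(f"{overlap:.0%} of Nepali token types also appear in Hindi")  # toy data
```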

Unsupervised Cross-lingual Word Embeddings

Finally, since we have a shared vocabulary, the lookup table (or embedding matrix) of the XLM model gives us the cross-lingual word embeddings.

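A short sketch of extracting those embeddings with the Hugging Face implementation (checkpoint name assumed as before): the rows of the input embedding matrix, indexed through the shared BPE vocabulary, can be used directly as cross-lingual word vectors.

```python
# Sketch: the shared lookup table of a pretrained XLM model doubles as a set
# of cross-lingual word embeddings. Checkpoint name is an illustrative choice.
import torch
from transformers import XLMTokenizer, XLMModel

name = "xlm-mlm-xnli15-1024"
tokenizer = XLMTokenizer.from_pretrained(name)
model = XLMModel.from_pretrained(name)

embedding_matrix = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)

def word_vector(word: str) -> torch.Tensor:
    # Average the sub-word rows if BPE splits the word into several pieces.
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    return embedding_matrix[ids].mean(dim=0)

# Cosine similarity between an English word and its French translation.
cat, chat = word_vector("cat"), word_vector("chat")
print(torch.cosine_similarity(cat, chat, dim=0).item())
```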

Conclusion

In this article, we discussed how a cross-lingual language model is beneficial not only for obtaining better results on generic downstream tasks, but also for improving model quality for low-resource languages: by training alongside similar high-resource languages, the model is exposed to more relevant data.

Here is a link to the original XLM GitHub repository.

Here is a link to Hugging Face's XLM architecture implementation and pre-trained weights.

Original article: https://towardsdatascience.com/xlm-cross-lingual-language-model-33c1fd1adf82