
A Brief Survey of Multilingual Neural Machine Translation

Raj Dabre∗ (NICT, Kyoto, Japan; [email protected])
Chenhui Chu∗ (Institute for Datability Science, Osaka University, Osaka, Japan; [email protected])
Anoop Kunchukuttan∗ (Microsoft AI & Research, Hyderabad, India; [email protected])

∗ Equal contribution.

arXiv:1905.05395v3 [cs.CL] 4 Jan 2020

Abstract

We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years. MNMT has been useful in improving translation quality as a result of knowledge transfer. MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues. Many approaches have been proposed in order to exploit multilingual parallel corpora for improving translation quality. However, the lack of a comprehensive survey makes it difficult to determine which approaches are promising and hence deserve further exploration. In this paper, we present an in-depth survey of existing literature on MNMT. We categorize various approaches based on the resource scenarios as well as underlying modeling principles. We hope this paper will serve as a starting point for researchers and engineers interested in MNMT.

1 Introduction

Neural machine translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) has become the dominant paradigm for MT in academic research as well as commercial use (Wu et al., 2016). NMT has shown state-of-the-art performance for many language pairs (Bojar et al., 2017, 2018). Its success can be mainly attributed to the use of distributed representations of language, enabling end-to-end training of an MT system. Unlike statistical machine translation (SMT) systems (Koehn et al., 2007), separate lossy components like word aligners, translation rule extractors and other feature extractors are not required. The dominant NMT approach is the Embed-Encode-Attend-Decode paradigm. Recurrent neural network (RNN) (Bahdanau et al., 2015), convolutional neural network (CNN) (Gehring et al., 2017) and self-attention (Vaswani et al., 2017) architectures are popular approaches based on this paradigm. For a more detailed exposition of NMT, we refer readers to some prominent tutorials (Neubig, 2017; Koehn, 2017).

While initial research on NMT started with building translation systems between two languages, researchers discovered that the NMT framework can naturally incorporate multiple languages. Hence, there has been a massive increase in work on MT systems that involve more than two languages (Dong et al., 2015; Firat et al., 2016a; Zoph and Knight, 2016; Cheng et al., 2017; Johnson et al., 2017; Chen et al., 2017, 2018b; Neubig and Hu, 2018), among others. We refer to NMT systems handling translation between more than one language pair as multilingual NMT (MNMT) systems. The ultimate goal of MNMT research is to develop one model for translation between all possible languages by effective use of available linguistic resources.

MNMT systems are desirable because training models with data from many language pairs might help acquire knowledge from multiple sources (Zoph and Knight, 2016). Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality. This particular phenomenon is known as knowledge transfer (Pan and Yang, 2010). Knowledge transfer has been strongly observed for translation between low-resource languages, which have scarce parallel corpora or other linguistic resources but have benefited from data in other languages (Zoph et al., 2016). In addition, MNMT systems will be compact, because a single model handles translations for multiple languages (Johnson et al., 2017). This can reduce the deployment footprint, which is crucial for constrained environments like mobile phones or IoT devices. It can also simplify the large-scale deployment of MT systems. Most importantly, we believe that the biggest benefit of doing MNMT research is getting better insights into and answers to an important question in natural language processing: how do we build distributed representations such that similar text across languages has similar representations?
Figure 1: MNMT research categorized according to resource scenarios and underlying modeling principles. The taxonomy covers multiway NMT (prototypical approaches: complete and minimal sharing; controlling parameter sharing: dynamic parameter generation, universal encoder representation, multiple target languages; training protocols: joint training and knowledge distillation), low or zero-resource NMT (transfer learning: training, language relatedness, lexical transfer, syntactic transfer; pivoting: run-time and pre-training; zero-shot: training, corpus size, language control) and multi-source NMT (multi-source available, missing source sentences, post-editing).

There are multiple MNMT scenarios based on available resources, and studies have been conducted for the following scenarios (Figure 1; please see the supplementary material for papers related to each category):

Multiway Translation. The goal is constructing a single NMT system for one-to-many (Dong et al., 2015), many-to-one (Lee et al., 2017) or many-to-many (Firat et al., 2016a) translation using parallel corpora for more than one language pair.

Low or Zero-Resource Translation. For most of the language pairs in the world, there are small or no parallel corpora, and three main directions have been studied for this scenario. Transfer learning: Transferring translation knowledge from a high-resource language pair to improve the translation of a low-resource language pair (Zoph et al., 2016). Pivot translation: Using a high-resource language (usually English) as a pivot to translate between a language pair (Firat et al., 2016a). Zero-shot translation: Translating between language pairs without parallel corpora (Johnson et al., 2017).

Multi-Source Translation. Documents that have been translated into more than one language might, in the future, be required to be translated into another language. In this scenario, existing multilingual redundancy in the source side can be exploited for multi-source translation (Zoph and Knight, 2016).

Given these benefits, scenarios and the tremendous increase in the work on MNMT in recent years, we undertake this survey paper on MNMT to systematically organize the work in this area. To the best of our knowledge, no such comprehensive survey on MNMT exists. Our goal is to shed light on various MNMT scenarios, fundamental questions in MNMT, basic principles, architectures, and datasets of MNMT systems. The remainder of this paper is structured as follows: We present a systematic categorization of different approaches to MNMT in each of the above mentioned scenarios to help understand the array of design choices available while building MNMT systems (Sections 2, 3, and 4). We put the work in MNMT into a historical perspective with respect to multilingual MT in older MT paradigms (Section 5). We also describe popular multilingual datasets and the shared tasks that focus on multilingualism (Section 6). In addition, we compare MNMT with domain adaptation for NMT, which tackles the problem of improving low-resource in-domain translation (Section 7). Finally, we share our opinions on future research directions in MNMT (Section 8) and conclude this paper (Section 9).

2 Multiway NMT

The goal is learning a single model for l language pairs (s_i, t_i) ∈ L for i = 1 to l, where L ⊂ S × T, and S, T are sets of source and target languages respectively. S and T need not be mutually exclusive.
Parallel corpora are available for these l language pairs. One-to-many, many-to-one and many-to-many NMT models have been explored in this framework. Multiway translation systems follow the standard paradigm of popular NMT systems. However, the architecture is adapted to support multiple languages. The wide range of possible architectural choices is exemplified by two highly contrasting prototypical approaches.

2.1 Prototypical Approaches

Complete Sharing. Johnson et al. (2017) proposed a highly compact model where all languages share the same embeddings, encoder, decoder, and attention mechanism. A common vocabulary, typically subword-level like byte pair encoding (BPE) (Sennrich et al., 2016b), is defined across all languages. The input sequence includes a special token (called the language tag) to indicate the target language. This enables the decoder to correctly generate the target language, though all target languages share the same decoder parameters. The model has minimal parameter size as all languages share the same parameters, and achieves comparable or better results w.r.t. bilingual systems. However, a massively multilingual system can run into capacity bottlenecks (Aharoni et al., 2019). This is a black-box approach, which can use an off-the-shelf NMT system to train a multilingual model. Ha et al. (2016) proposed a similar model, but they maintained different vocabularies for each language.
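To make the language-tag mechanism concrete, the following is a minimal preprocessing sketch of the kind of input a fully shared model is trained on. The "<2xx>" tag format and the helper name are illustrative assumptions, not the exact conventions of any particular system.

```python
def add_target_tag(source_tokens, target_lang):
    # Prepend a target-language tag so the shared model knows which
    # language to generate; the "<2xx>" format is an illustrative choice.
    return ["<2{}>".format(target_lang)] + source_tokens

# One mixed training stream covers all language pairs; the target side
# needs no tag because the source-side tag already selects the output language.
corpus = [
    (["Guten", "Morgen"], "en", ["Good", "morning"]),
    (["Good", "morning"], "fr", ["Bonjour"]),
]
training_pairs = [(add_target_tag(src, tgt_lang), tgt) for src, tgt_lang, tgt in corpus]
print(training_pairs[0][0])  # ['<2en>', 'Guten', 'Morgen']
```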
This architecture is particularly useful for related languages, because they have a high degree of lexical and syntactic similarity (Sachan and Neubig, 2018). Lexical similarity can be further utilized by (a) representing all languages in a common script using script conversion (Dabre et al., 2018; Lee et al., 2017) or transliteration (Nakov and Ng (2009) for multilingual SMT), (b) using a common subword vocabulary across all languages, e.g. characters (Lee et al., 2017) or BPE (Nguyen and Chiang, 2017), or (c) representing words by both character encoding and a latent embedding space shared by all languages (Wang et al., 2019).

Pinnis et al. (2018) and Lakew et al. (2018a) have compared RNN, CNN and self-attention based architectures for MNMT. They show that self-attention based architectures outperform the other architectures in many cases.

Minimal Sharing. On the other hand, Firat et al. (2016a) proposed a model comprised of separate embeddings, encoders and decoders for each language. By sharing attention across languages, they show improvements over bilingual models. However, this model has a large number of parameters. Nevertheless, the number of parameters only grows linearly with the number of languages, while it grows quadratically for bilingual systems spanning all the language pairs in the multiway system.

2.2 Controlling Parameter Sharing

In between the extremities of parameter sharing exemplified by the above mentioned models lies an array of choices. The degree of parameter sharing depends on the divergence between the languages involved (Sachan and Neubig, 2018) and can be controlled at various layers of the MNMT system. Sharing encoders among multiple languages is very effective and is widely used (Lee et al., 2017; Sachan and Neubig, 2018). Blackwood et al. (2018) explored target language, source language and pair specific attention parameters. They showed that target language specific attention performs better than other attention sharing configurations. For self-attention based NMT models, Sachan and Neubig (2018) explored various parameter sharing strategies. They showed that sharing the decoder self-attention and encoder-decoder inter-attention parameters is useful for linguistically dissimilar languages. Zaremoodi et al. (2018) further proposed a routing network to dynamically control parameter sharing learned from the data. Designing the right sharing strategy is important to maintain a balance between model compactness and translation accuracy.

Dynamic Parameter or Representation Generation. Instead of defining the parameter sharing protocol a priori, Platanios et al. (2018) learned the degree of parameter sharing from the data. This is achieved by defining the language specific model parameters as a function of global parameters and language embeddings. This approach also reduces the number of language specific parameters (only language embeddings), while still allowing each language to have its own unique parameters for different network layers. In fact, the number of parameters is only a small multiple of that of the compact model (the multiplication factor accounts for the language embedding size) (Johnson et al., 2017), but the language embeddings can directly impact the model parameters instead of the weak influence that language tags have.
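As a rough illustration of this idea, the sketch below generates a layer's weight matrix from a learned language embedding through a shared generator. It is a simplified stand-in for contextual parameter generation, not the exact formulation of Platanios et al. (2018); module structure and dimensions are assumptions made for illustration, with PyTorch used only as convenient notation.

```python
import torch
import torch.nn as nn

class LanguageConditionedLinear(nn.Module):
    """A linear layer whose weights are generated from a language embedding
    instead of being stored separately for every language."""

    def __init__(self, num_langs, lang_dim, d_in, d_out):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, lang_dim)
        # Shared generator: maps a language embedding to a full weight matrix.
        self.generator = nn.Linear(lang_dim, d_in * d_out)
        self.d_in, self.d_out = d_in, d_out

    def forward(self, x, lang_id):
        weights = self.generator(self.lang_emb(lang_id))
        weight_matrix = weights.view(self.d_out, self.d_in)
        return x @ weight_matrix.t()

layer = LanguageConditionedLinear(num_langs=4, lang_dim=8, d_in=16, d_out=16)
hidden = torch.randn(2, 16)
output = layer(hidden, torch.tensor(2))  # same layer, language-specific weights
```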
Universal Encoder Representation. Ideally, multiway systems should generate encoder representations that are language agnostic. However, the attention mechanism sees a variable number of encoder representations depending on the sentence length (this could vary for translations of the same sentence). To overcome this, an attention bridge network generates a fixed number of contextual representations that are input to the attention network (Lu et al., 2018; Vázquez et al., 2018). Murthy et al. (2018) pointed out that the contextualized embeddings are word order dependent, hence not language agnostic.

Multiple Target Languages. This is a challenging scenario because parameter sharing has to be balanced with the capability to generate sentences in each target language. Blackwood et al. (2018) added the language tag to the beginning as well as the end of the sequence to avoid its attenuation in a left-to-right encoder. Wang et al. (2018) explored multiple methods for supporting target languages: (a) a target language tag at the beginning of the decoder, (b) target language dependent positional embeddings, and (c) dividing the hidden units of each decoder layer into shared and language-dependent ones. Each of these methods provides gains over Johnson et al. (2017), and combining all of them gave the best results.

2.3 Training Protocols

Joint Training. All the available language pairs are trained jointly to minimize the mean negative log-likelihood for each language pair. As some language pairs would have more data than others, the model may be biased. To avoid this, sentence pairs from different language pairs are sampled to maintain a healthy balance. Mini-batches can be comprised of a mix of samples from different language pairs (Johnson et al., 2017), or the training schedule can cycle through mini-batches consisting of a single language pair only (Firat et al., 2016a). For architectures with language specific layers, the latter approach is convenient to implement.
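One common way to implement such balancing, sketched below under the assumption that per-pair corpus sizes are known, is temperature-based sampling: the probability of picking a pair is proportional to its corpus size raised to 1/T, so larger temperatures upsample low-resource pairs. The exact scheme varies between systems and is not prescribed by the papers cited above.

```python
import random

def sampling_weights(corpus_sizes, temperature=5.0):
    """Upsample small language pairs: probability proportional to
    size**(1/T); T=1 recovers proportional sampling, larger T flattens it."""
    scaled = {pair: n ** (1.0 / temperature) for pair, n in corpus_sizes.items()}
    total = sum(scaled.values())
    return {pair: w / total for pair, w in scaled.items()}

corpus_sizes = {("fr", "en"): 2_000_000, ("hi", "en"): 50_000, ("ne", "en"): 5_000}
probs = sampling_weights(corpus_sizes)
pairs, weights = zip(*probs.items())
# Language pairs for the next few mini-batches, drawn with the smoothed weights.
batch_languages = random.choices(pairs, weights=weights, k=8)
```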
Knowledge Distillation. In this approach, suggested by Tan et al. (2019), bilingual models are first trained for all language pairs involved. These bilingual models are used as teacher models to train a single student model for all language pairs. The student model is trained using a linear interpolation of the standard likelihood loss and a distillation loss that captures the distance between the output distributions of the student and teacher models. The distillation loss is applied for a language pair only if the teacher model shows better translation accuracy than the student model on the validation set. This approach shows better results than joint training of a black-box model, but training time increases significantly because the bilingual models also have to be trained.
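A minimal sketch of such an interpolated objective is given below, assuming the student and teacher produce logits over a shared vocabulary; the interpolation weight and the KL-divergence form of the distillation term are illustrative choices rather than the exact formulation of Tan et al. (2019).

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, target_ids, alpha=0.5):
    """Linear interpolation of likelihood loss and distillation loss.
    Shapes: (batch, seq_len, vocab) for logits, (batch, seq_len) for targets."""
    nll = F.cross_entropy(student_logits.transpose(1, 2), target_ids)
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return (1 - alpha) * nll + alpha * distill
```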
3 Low or Zero-Resource MNMT

An important motivation for MNMT is to improve or support translation for language pairs with scarce or no parallel corpora, by utilizing training data from high-resource language pairs. In this section, we will discuss the MNMT approaches that specifically address the low or zero-resource scenario.

3.1 Transfer Learning

Transfer learning (Pan and Yang, 2010) has been widely explored to address low-resource translation, where knowledge learned from a high-resource language pair is used to improve the NMT performance on a low-resource pair.

Training. Most studies have explored the following setting: the high-resource and low-resource language pairs share the same target language. Zoph et al. (2016) first showed that transfer learning can benefit low-resource language pairs. First, they trained a parent model on a high-resource language pair. The child model is initialized with the parent's parameters wherever possible and trained on the small parallel corpus for the low-resource pair. This process is known as fine-tuning. They also studied the effect of fine-tuning only a subset of the child model's parameters (source and target embeddings, RNN layers and attention). The initialization has a strong regularization effect in training the child model. Gu et al. (2018b) used the model-agnostic meta-learning (MAML) framework (Finn et al., 2017) to learn an appropriate parameter initialization from the parent pair(s) by taking the child pair into consideration. Instead of fine-tuning, both language pairs can also be jointly trained (Gu et al., 2018a).
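The following sketch illustrates the parent-to-child initialization, under the assumption that the parent and child models share the same architecture and subword vocabulary so that parameters line up by name; the checkpoint format and the layer-freezing logic are illustrative, not a specific system's implementation.

```python
import torch

def init_child_from_parent(child_model, parent_ckpt_path, freeze_prefixes=()):
    """Copy parent parameters into the child wherever names and shapes match,
    then optionally freeze some of them during fine-tuning."""
    parent_state = torch.load(parent_ckpt_path, map_location="cpu")  # assumed to be a state dict
    child_state = child_model.state_dict()
    for name, tensor in parent_state.items():
        if name in child_state and child_state[name].shape == tensor.shape:
            child_state[name] = tensor
    child_model.load_state_dict(child_state)
    for name, param in child_model.named_parameters():
        if any(name.startswith(prefix) for prefix in freeze_prefixes):
            param.requires_grad = False  # e.g. keep target-side embeddings fixed
    return child_model
```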
Language Relatedness. Zoph et al. (2016) and Dabre et al. (2017b) have empirically shown that language relatedness between the parent and child source languages has a big impact on the possible gains from transfer learning. Kocmi and Bojar (2018) showed that transfer learning improves low-resource language translation even when neither the source nor the target languages are shared between the resource-rich and resource-poor language pairs.
Further investigation is needed to understand the gains in translation quality in this scenario. Neubig and Hu (2018) used language relatedness to prevent overfitting when rapidly adapting a pre-trained MNMT model to low-resource scenarios. Chaudhary et al. (2019) used this approach to translate 1,095 languages to English.

Lexical Transfer. Zoph et al. (2016) randomly initialized the word embeddings of the child source language, because those could not be transferred from the parent. Gu et al. (2018a) improved on this simple initialization by mapping pre-trained monolingual embeddings of the parent and child sources to a common vector space. On the other hand, Nguyen and Chiang (2017) utilized the lexical similarity between related source languages using a small subword vocabulary. Lakew et al. (2018b) dynamically updated the vocabulary of the parent model with the low-resource language pair before transferring parameters.

Syntactic Transfer. Gu et al. (2018a) proposed to encourage better transfer of contextual representations from parents using a mixture of language experts network. Murthy et al. (2018) showed that reducing the word order divergence between source languages via pre-ordering is beneficial in extremely low-resource scenarios.

3.2 Pivoting

Zero-resource NMT was first explored by Firat et al. (2016a), where a multiway NMT model was used to translate from Spanish to French using English as a pivot language. This pivoting was done either at run time or during pre-training.

Run-Time Pivoting. Firat et al. (2016a) involved a pipeline through paths in the multiway model, which first translates from French to English and then from English to Spanish. They also experimented with using the intermediate English translation as an additional source for the second stage.
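A run-time pivot cascade reduces to chaining two decoding calls, as in the sketch below; translate() is a placeholder for whatever decoding interface the underlying bilingual or multiway model exposes, not an actual API.

```python
def pivot_translate(sentence, src, tgt, pivot, translate):
    """Two-step run-time pivoting: src -> pivot -> tgt.
    `translate(text, src_lang, tgt_lang)` is an assumed decoding interface."""
    intermediate = translate(sentence, src, pivot)
    return translate(intermediate, pivot, tgt)

# Example usage (with a hypothetical decoding function my_mnmt_decode):
# pivot_translate("Je suis étudiant", "fr", "es", pivot="en", translate=my_mnmt_decode)
```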
Pivoting during Pre-Training. Firat et al. (2016b) used the MNMT model to first translate the Spanish side of the training corpus to English, which in turn is translated into French. This gives a pseudo-parallel French-Spanish corpus where the source is synthetic and the target is original. The MNMT model is fine-tuned on this synthetic data, and this enables direct French to Spanish translation. Firat et al. (2016b) also showed that a small clean parallel corpus between French and Spanish can be used for fine-tuning and can have the same effect as a pseudo-parallel corpus which is two orders of magnitude larger. Pivoting models can be improved if they are jointly trained, as shown by Cheng et al. (2017). Joint training was achieved by either forcing the pivot language's embeddings to be similar or maximizing the likelihood of the cascaded model on a small source-target parallel corpus. Chen et al. (2017) proposed teacher-student learning for pivoting, where they first trained a pivot-target NMT model and used it as a teacher to guide the behaviour of a source-target NMT model.

3.3 Zero-Shot

The approaches proposed so far involve pivoting or synthetic corpus generation, which is a slow process due to its two-step nature. It is more interesting, and challenging, to enable translation between a zero-resource pair without explicitly involving a pivot language during decoding or for generating pseudo-parallel corpora. This scenario is known as zero-shot NMT. Zero-shot NMT also requires a pivot language, but it is only used during training, without the need to generate pseudo-parallel corpora.

Training. Zero-shot NMT was first demonstrated by Johnson et al. (2017). However, this zero-shot translation method is inferior to pivoting. They showed that the context vectors (from attention) for unseen language pairs differ from those for the seen language pairs, possibly explaining the degradation in translation quality. Lakew et al. (2017) tried to overcome this limitation by augmenting the training data with pseudo-parallel data for the unseen pairs generated by iterative application of the same zero-shot translation. Arivazhagan et al. (2018) included explicit language invariance losses in the optimization function to encourage parallel sentences to have the same representation. Reinforcement learning for zero-shot learning was explored by Sestorain et al. (2018), where the dual learning framework was combined with rewards from language models.
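One simple way to express such a language-invariance objective is to penalize the distance between pooled encoder representations of a sentence pair, as sketched below; mean pooling and cosine distance are illustrative choices and not necessarily those of Arivazhagan et al. (2018).

```python
import torch
import torch.nn.functional as F

def invariance_loss(enc_states_a, enc_states_b):
    """Encourage parallel sentences to have similar (language-agnostic)
    encoder representations. Shapes: (seq_len, hidden) per sentence."""
    pooled_a = enc_states_a.mean(dim=0)
    pooled_b = enc_states_b.mean(dim=0)
    return 1.0 - F.cosine_similarity(pooled_a, pooled_b, dim=0)

# Typically added to the translation loss with a small weight during training.
```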
Corpus Size. Work on translation for Indian languages showed that zero-shot translation works well only when the training corpora are extremely large (Mattoni et al., 2017). As the corpora for most Indian languages contain fewer than 100k sentences, the zero-shot approach is rather infeasible despite linguistic similarity. Lakew et al. (2017) confirmed this in the case of European languages where small training corpora were used.
Mattoni et al. (2017) also showed that zero-shot translation works well only when the training corpora are large, while Aharoni et al. (2019) show that massively multilingual models are beneficial for zero-shot translation.

Language Control. Zero-shot NMT tends to translate into the wrong language at times, and Ha et al. (2017) proposed to filter the output of the softmax so as to force the model to translate into the desired language.

4 Multi-Source NMT

If the same source sentence is available in multiple languages, then these sentences can be used together to improve the translation into the target language. This technique is known as multi-source MT (Och and Ney, 2001). Approaches for multi-source NMT can be extremely useful for creating N-lingual (N > 3) corpora such as Europarl (Koehn, 2005) and UN (Ziemski et al., 2016b). The underlying principle is to leverage redundancy in terms of source side linguistic phenomena expressed in multiple languages.

Multi-Source Available. Most studies assume that the same sentence is available in multiple languages. Zoph and Knight (2016) showed that a multi-source NMT model using separate encoders and attention networks for each source language outperforms single source models. A simpler approach concatenated multiple source sentences and fed them to a standard NMT model (Dabre et al., 2017a), with performance comparable to Zoph and Knight (2016). Interestingly, this model could automatically identify the boundaries between different source languages and simplify the training process for multi-source NMT. Dabre et al. (2017a) also showed that it is better to use linguistically similar source languages, especially in low-resource scenarios. Ensembling of individual source-target models is another beneficial approach, for which Garmash and Monz (2016) proposed several methods with different degrees of parameterization.
ing source sentences in multi-source corpora. bolic, discrete representations making multilin-
Nishimura et al. (2018b) extended (Zoph and gual representation difficult. Moreover, the central
Knight, 2016) by representing each “missing” unit in PBSMT is the phrase, an ordered sequence
source language with a dummy token. Choi et al. of words (not in the linguistic sense). Given its ar-
(2018) and Nishimura et al. (2018a) further pro- bitrary structure, it is not clear how to build a com-
posed to use MT generated synthetic sentences, mon symbolic representation for phrases across
Nevertheless, some shallow forms of multilingualism have been explored in the context of: (a) pivot-based SMT, (b) multi-source PBSMT, and (c) SMT involving related languages.

Pivoting. Popular solutions are: chaining source-pivot and pivot-target systems at decoding time (Utiyama and Isahara, 2007), training a source-target system using synthetic data generated using target-pivot and pivot-source systems (Gispert and Marino, 2006), and phrase-table triangulation pivoting source-pivot and pivot-target phrase tables (Utiyama and Isahara, 2007; Wu and Wang, 2007).

Multi-source. Typical approaches are: re-ranking outputs from independent source-target systems (Och and Ney, 2001), composing a new output from independent source-target outputs (Matusov et al., 2006), and translating a combined input representation of multiple sources using lattice networks over multiple phrase tables (Schroeder et al., 2009).

Related languages. For multilingual translation with multiple related source languages, the typical approaches involved script unification by mapping to a common script such as Devanagari (Banerjee et al., 2018) or transliteration (Nakov and Ng, 2009). Lexical similarity was utilized using subword-level translation models (Vilar et al., 2007; Tiedemann, 2012a; Kunchukuttan and Bhattacharyya, 2016, 2017). Combining subword-level representation and pivoting for translation among related languages has been explored (Henríquez et al., 2011; Tiedemann, 2012a; Kunchukuttan et al., 2017). Most of the above mentioned multilingual systems involved either decoding-time operations, chaining black-box systems or composing new phrase-tables from existing ones.

Comparison with MNMT. While symbolic representations constrain a unified multilingual representation, distributed universal language representation using real-valued vector spaces makes multilingualism easier to implement in NMT. As no language specific feature engineering is required, NMT can scale to multiple languages more easily. Neural networks provide flexibility in experimenting with a wide variety of architectures, while advances in optimization techniques and availability of deep learning toolkits make prototyping faster.

6 Datasets and Resources

MNMT requires parallel corpora in similar domains across multiple languages.

Multiway. Commonly used publicly available multilingual parallel corpora are the TED corpus (Mauro et al., 2012), the UN Corpus (Ziemski et al., 2016a) and those from the European Union like Europarl, JRC-Acquis, DGT-Acquis, DGT-TM, ECDC-TM and EAC-TM (Steinberger et al., 2014). While these sources are primarily comprised of European languages, parallel corpora for some Asian languages are accessible through the WAT shared task (Nakazawa et al., 2018). Only small amounts of parallel corpora are available for many languages, primarily from movie subtitles and software localization strings (Tiedemann, 2012b).

Low or Zero-Resource. For low or zero-resource NMT translation tasks, good test sets are required for evaluating translation quality. The above mentioned multilingual parallel corpora can be a source for such test sets. In addition, there are other small parallel datasets like the FLORES dataset for English-{Nepali, Sinhala} (Guzmán et al., 2019), the XNLI test set spanning 15 languages (Conneau et al., 2018b) and the Indic parallel corpus (Birch et al., 2011). The WMT shared tasks (Bojar et al., 2018) also provide test sets for some low-resource language pairs.

Multi-Source. The corpora for multi-source NMT have to be aligned across languages. Multi-source corpora can be extracted from some of the above mentioned sources. The following are widely used for evaluation in the literature: Europarl (Koehn, 2005), TED (Tiedemann, 2012b) and UN (Ziemski et al., 2016b). The Indian Language Corpora Initiative (ILCI) corpus (Jha, 2010) is an 11-way parallel corpus of Indian languages along with English. The Asian Language Treebank (Thu et al., 2016) is a 9-way parallel corpus of South-East Asian languages along with English, Japanese and Bengali. The MMCR4NLP project (Dabre and Kurohashi, 2017) compiles language family grouped multi-source corpora and provides standard splits.

Shared Tasks. Recently, shared tasks with a focus on multilingual translation have been conducted at IWSLT (Cettolo et al., 2017), WAT (Nakazawa et al., 2018) and WMT (Bojar et al., 2018), so common benchmarks are available.
7 Connections with Domain Adaptation

High quality parallel corpora are limited to specific domains. Both vanilla SMT and NMT perform poorly for domain specific translation in low-resource scenarios (Duh et al., 2013; Koehn and Knowles, 2017). Leveraging out-of-domain parallel corpora and in-domain monolingual corpora for in-domain translation is known as domain adaptation for MT (Chu and Wang, 2018).

As we can treat each domain as a language, there are many similarities and common approaches between MNMT and domain adaptation for NMT. Therefore, similar to MNMT, multi-domain NMT and transfer learning based approaches (Chu et al., 2017) have been proposed for domain adaptation when using out-of-domain parallel corpora. When using in-domain monolingual corpora, a typical way of doing domain adaptation is generating a pseudo-parallel corpus by back-translating the target in-domain monolingual corpora (Sennrich et al., 2016a), which is similar to the pseudo-parallel corpus generation in MNMT (Firat et al., 2016b).
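A sketch of this pseudo-parallel corpus creation is given below; backward_translate() is a placeholder for the decoding function of a target-to-source model, not an actual API of any toolkit.

```python
def build_pseudo_parallel(monolingual_target_sentences, backward_translate):
    """Create synthetic (source, target) pairs by translating in-domain
    target-side monolingual data back into the source language."""
    corpus = []
    for target_sentence in monolingual_target_sentences:
        synthetic_source = backward_translate(target_sentence)  # target -> source model
        corpus.append((synthetic_source, target_sentence))
    return corpus
```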
There are also many differences between MNMT and domain adaptation for NMT. While pivoting is a popular approach for MNMT (Cheng et al., 2017), it is unsuitable for domain adaptation. As there are always vocabulary overlaps between different domains, there are no zero-shot translation (Johnson et al., 2017) settings in domain adaptation. In addition, it is not uncommon to write domain specific sentences in different styles, and so multi-source approaches (Zoph and Knight, 2016) are not applicable either. On the other hand, data selection approaches in domain adaptation that select out-of-domain sentences which are similar to in-domain sentences (Wang et al., 2017a) have not been applied to MNMT. In addition, instance weighting approaches (Wang et al., 2017b) that interpolate in-domain and out-of-domain models have not been studied for MNMT. However, with the development of cross-lingual sentence embeddings, data selection and instance weighting approaches might be applicable for MNMT in the near future.

8 Future Research Directions

While exciting advances have been made in MNMT in recent years, there are still many interesting directions for exploration.

Language Agnostic Representation Learning. A core question that needs further investigation is: how do we build encoder and decoder representations that are language agnostic? In particular, the questions of word-order divergence between the source languages and variable length encoder representations have received little attention.

Multiple Target Language MNMT. Most current efforts address multiple source languages. Multiway systems for multiple low-resource target languages need more attention. The right balance between sharing representations and maintaining the distinctiveness of the target language for generation needs exploring.

Explore Pre-training Models. Pre-training embeddings, encoders and decoders has been shown to be useful for NMT (Ramachandran et al., 2017). How pre-training can be incorporated into different MNMT architectures is an important question as well. Recent advances in cross-lingual word embeddings (Klementiev et al., 2012; Mikolov et al., 2013; Chandar et al., 2014; Artetxe et al., 2016; Conneau et al., 2018a; Jawanpuria et al., 2019) and sentence embeddings (Conneau et al., 2018b; Chen et al., 2018a; Artetxe and Schwenk, 2018) could provide directions for this line of investigation.

Related Languages, Language Registers and Dialects. Translation involving related languages, language registers and dialects can be further explored, given the importance of this use case.

Code-Mixed Language. Addressing intra-sentence multilingualism, i.e., code-mixed input and output, creoles and pidgins, is an interesting research direction. The compact MNMT models can handle code-mixed input, but code-mixed output remains an open problem (Johnson et al., 2017).

Multilingual and Multi-Domain NMT. Jointly tackling multilingual and multi-domain translation is an interesting direction with many practical use cases. When extending an NMT system to a new language, the parallel corpus in the domain of interest may not be available. Transfer learning in this case has to span languages and domains.

9 Conclusion

MNMT has made rapid progress in the recent past. In this survey, we have covered the literature pertaining to the major scenarios we identified for multilingual NMT: multiway, low or zero-resource (transfer learning, pivoting, and zero-shot approaches) and multi-source translation. We have systematically compiled the principal design approaches and their variants, central MNMT issues and their proposed solutions along with their strengths and weaknesses. We have put MNMT in a historical perspective w.r.t. work on multilingual RBMT and SMT systems. We suggest promising and important directions for future work. We hope that this survey paper will significantly promote and accelerate MNMT research.

References

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Computational Linguistics.
and accelerate MNMT research. tion for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann,


Yvette Graham, Barry Haddow, Shujian Huang,
Matthias Huck, Philipp Koehn, Qun Liu, Varvara
Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Logacheva, Christof Monz, Matteo Negri, Matt
Massively Multilingual Neural Machine Transla- Post, Raphael Rubino, Lucia Specia, and Marco
tion. In NAACL (to appear). Turchi. 2017. Findings of the 2017 conference
on machine translation (WMT17). In Proceedings
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee of the Second Conference on Machine Translation,
Aharoni, Melvin Johnson, and Wolfgang Macherey. pages 169–214, Copenhagen, Denmark. Association
2018. The missing ingredient in zero-shot neural for Computational Linguistics.
machine translation.
Mauro Cettolo, Marcello Federico, Luisa Bentivogli,
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Jan Niehues, Sebastian Stker, Katsuhito Sudoh,
Learning principled bilingual mappings of word Koichiro Yoshino, and Christian Federmann. 2017.
embeddings while preserving monolingual invari- Overview of the IWSLT 2017 evaluation campaign.
ance. In Proceedings of the Conference on Empiri- In IWSLT.
cal Methods in Natural Language Processing, pages
2289–2294. Sarath Chandar, Stanislas Lauly, Hugo Larochelle,
Mitesh Khapra, Balaraman Ravindran, Vikas C
Mikel Artetxe and Holger Schwenk. 2018. Mas- Raykar, and Amrita Saha. 2014. An autoencoder
sively multilingual sentence embeddings for zero- approach to learning bilingual word representations.
shot cross-lingual transfer and beyond. CoRR, In Proceedings of the Advances in Neural Informa-
abs/1812.10464. tion Processing Systems, pages 1853–1861.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Rajen Chatterjee, M. Amin Farajian, Matteo Negri,
gio. 2015. Neural machine translation by jointly Marco Turchi, Ankit Srivastava, and Santanu Pal.
learning to align and translate. In In Proceedings of 2017. Multi-source neural automatic post-editing:
the 3rd International Conference on Learning Rep- Fbk’s participation in the WMT 2017 ape shared
resentations (ICLR 2015), San Diego, USA. Interna- task. In Proceedings of the Second Conference on
tional Conference on Learning Representations. Machine Translation, pages 630–638. Association
Laura Banarescu, Claire Bonial, Shu Cai, Madalina for Computational Linguistics.
Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin
Knight, Philipp Koehn, Martha Palmer, and Nathan Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian
Schneider. 2013. Abstract meaning representation Li, Austin Matthews, Aldrian Obaja Muis, Naoki
for sembanking. In Proceedings of the 7th Linguis- Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas,
tic Annotation Workshop and Interoperability with Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting
Discourse, pages 178–186, Sofia, Bulgaria. Associ- Zhou, Peter J. Jansen, Yiming Yang, Lori Levin, Flo-
ation for Computational Linguistics. rian Metze, Teruko Mitamura, David R. Mortensen,
Graham Neubig, Eduard Hovy, Alan W Black,
Tamali Banerjee, Anoop Kunchukuttan, and Pushpak Jaime Carbonell, Graham V. Horwood, Shabnam
Bhattacharyya. 2018. Multilingual Indian Language Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra,
Translation System at WAT 2018: Many-to-one and Kathleen McKeown. 2019. The ARIEL-CMU
Phrase-based SMT. In 5th Workshop on Asian Lan- Systems for LoReHLT18. CoRR, abs/1902.08899.
guage Translation.
Xilun Chen, Ahmed Hassan Awadallah, Hany Hassan,
Lexi Birch, Chris Callison-Burch, Miles Osborne, and Wei Wang, and Claire Cardie. 2018a. Zero-resource
Matt Post. 2011. The indic multi-parallel cor- multilingual model transfer: Learning what to share.
pus. http://homepages.inf.ed.ac.uk/ CoRR, abs/1810.03552.
miles/babel.html.
Yun Chen, Yang Liu, Yong Cheng, and Victor O.K.
Graeme Blackwood, Miguel Ballesteros, and Todd Li. 2017. A teacher-student framework for zero-
Ward. 2018. Multilingual neural machine transla- resource neural machine translation. In Proceed-
tion with task-specific attention. In Proceedings of ings of the 55th Annual Meeting of the Association
the 27th International Conference on Computational for Computational Linguistics (Volume 1: Long Pa-
Linguistics, pages 3112–3122. Association for Com- pers), pages 1925–1935. Association for Computa-
putational Linguistics. tional Linguistics.
Yun Chen, Yang Liu, and Victor O. K. Li. 2018b. Zero- Raj Dabre and Sadao Kurohashi. 2017. Mmcr4nlp:
resource neural machine translation with multi- Multilingual multiway corpora repository for
agent communication game. In AAAI, pages 5086– natural language processing. arXiv preprint
5093. AAAI Press. arXiv:1710.01025.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Raj Dabre, Tetsuji Nakagawa, and Hideto Kazawa.
Wei Xu. 2017. Joint training for pivot-based neural 2017b. An empirical study of language relatedness
machine translation. In Proceedings of the Twenty- for transfer learning in neural machine translation.
Sixth International Joint Conference on Artificial In- In Proceedings of the 31st Pacific Asia Conference
telligence, IJCAI-17, pages 3974–3980. on Language, Information and Computation, pages
282–286. The National University (Phillippines).
KyungHyun Cho, Bart van Merrienboer, Dzmitry Bah-
danau, and Yoshua Bengio. 2014. On the properties Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and
of neural machine translation: Encoder-decoder ap- Haifeng Wang. 2015. Multi-task learning for mul-
proaches. In Eighth Workshop on Syntax, Semantics tiple language translation. In Proceedings of the
and Structure in Statistical Translation. 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint
Gyu Hyeon Choi, Jong Hun Shin, and Young Kil Kim. Conference on Natural Language Processing (Vol-
2018. Improving a multi-source neural machine ume 1: Long Papers), pages 1723–1732. Associa-
translation model with corpus extension for low- tion for Computational Linguistics.
resource languages. In Proceedings of the Eleventh
International Conference on Language Resources Bonnie J. Dorr. 1987. UNITRAN: An Interlingua Ap-
and Evaluation (LREC-2018). European Language proach to Machine Translation. In Proceedings of
Resource Association. the 6th Conference of the American Association of
Artificial Intelligence.
Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017.
An empirical comparison of domain adaptation Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Ha-
methods for neural machine translation. In Pro- jime Tsukada. 2013. Adaptation data selection us-
ceedings of the 55th Annual Meeting of the Associa- ing neural language models: Experiments in ma-
tion for Computational Linguistics (Volume 2: Short chine translation. In Proceedings of the 51st Annual
Papers), pages 385–391. Association for Computa- Meeting of the Association for Computational Lin-
tional Linguistics. guistics (Volume 2: Short Papers), pages 678–683,
Sofia, Bulgaria.
Chenhui Chu and Rui Wang. 2018. A survey of do-
main adaptation for neural machine translation. In T. Mitamura E. H. Nyberg and J. Carbonell. 1997. The
Proceedings of the 27th International Conference on KANT Machine Translation System: From R&D to
Computational Linguistics, pages 1304–1319. Asso- Initial Deployment. In Proceedings of LISA (The Li-
ciation for Computational Linguistics. brary and Information Services in Astronomy) Work-
shop on Integrating Advanced Translation Technol-
Alexis Conneau, Guillaume Lample, Marc’Aurelio ogy.
Ranzato, Ludovic Denoyer, and Hervé Jégou.
2018a. Word translation without parallel data. Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017.
In Proceedings of the International Conference Model-agnostic meta-learning for fast adaptation of
on Learning Representations. URL: https:// deep networks. In Proceedings of the 34th In-
github.com/facebookresearch/MUSE. ternational Conference on Machine Learning, vol-
ume 70 of Proceedings of Machine Learning Re-
Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- search, pages 1126–1135, International Convention
ina Williams, Samuel R. Bowman, Holger Schwenk, Centre, Sydney, Australia. PMLR.
and Veselin Stoyanov. 2018b. XNLI: Evaluating
Cross-lingual Sentence Representations. In Pro- Orhan Firat, Kyunghyun Cho, and Yoshua Bengio.
ceedings of the 2018 Conference on Empirical Meth- 2016a. Multi-way, multilingual neural machine
ods in Natural Language Processing. Association translation with a shared attention mechanism. In
for Computational Linguistics. Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computa-
Raj Dabre, Fabien Cromieres, and Sadao Kurohashi. tional Linguistics: Human Language Technologies,
2017a. Enabling multi-source neural machine trans- pages 866–875. Association for Computational Lin-
lation by concatenating source sentences in multi- guistics.
ple languages. In Proceedings of MT Summit XVI,
vol.1: Research Track, pages 96–106. Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan,
Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b.
Raj Dabre, Anoop Kunchukuttan, Atsushi Fujita, and Zero-resource translation with multi-lingual neural
Eiichiro Sumita. 2018. NICT’s participation in WAT machine translation. In Proceedings of the 2016
2018: Approaches using multilingualism and recur- Conference on Empirical Methods in Natural Lan-
rently stacked layers. In 5th Workshop on Asian guage Processing, pages 268–277. Association for
Language Translation. Computational Linguistics.
Ekaterina Garmash and Christof Monz. 2016. Ensem- Girish Nath Jha. 2010. The TDIL Program and the In-
ble learning for multi-source neural machine transla- dian Langauge Corpora Intitiative (ILCI). In LREC.
tion. In Proceedings of COLING 2016, the 26th In-
ternational Conference on Computational Linguis- Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim
tics: Technical Papers, pages 1409–1418. The COL- Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,
ING 2016 Organizing Committee. Fernanda Viégas, Martin Wattenberg, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2017. Google’s
Jonas Gehring, Michael Auli, David Grangier, Denis multilingual neural machine translation system: En-
Yarats, and Yann N. Dauphin. 2017. Convolutional abling zero-shot translation. Transactions of the As-
sequence to sequence learning. In Proceedings sociation for Computational Linguistics, 5:339–351.
of the 34th International Conference on Machine
Learning, volume 70 of Proceedings of Machine Alexandre Klementiev, Ivan Titov, and Binod Bhat-
Learning Research, pages 1243–1252, International tarai. 2012. Inducing crosslingual distributed rep-
Convention Centre, Sydney, Australia. PMLR. resentations of words. In Proceedings of the Inter-
national Conference on Computational Linguistics:
Adri‘a De Gispert and Jose B Marino. 2006. Catalan- Technical Papers, pages 1459–1474.
English statistical machine translation without par-
allel corpus: bridging through Spanish. In In Proc.
Tom Kocmi and Ondřej Bojar. 2018. Trivial trans-
of 5th International Conference on Language Re-
fer learning for low-resource neural machine trans-
sources and Evaluation (LREC).
lation. In Proceedings of the Third Conference on
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Machine Translation, Volume 1: Research Papers,
Li. 2018a. Universal neural machine translation pages 244–252, Belgium, Brussels. Association for
for extremely low resource languages. In Proceed- Computational Linguistics.
ings of the 2018 Conference of the North American
Chapter of the Association for Computational Lin- Philipp Koehn. 2005. Europarl: A Parallel Corpus for
guistics: Human Language Technologies, Volume Statistical Machine Translation. In Conference Pro-
1 (Long Papers), pages 344–354. Association for ceedings: the tenth Machine Translation Summit,
Computational Linguistics. pages 79–86, Phuket, Thailand. AAMT, AAMT.

Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, Philipp Koehn. 2017. Neural machine translation.
and Kyunghyun Cho. 2018b. Meta-learning for low- CoRR, abs/1709.07809.
resource neural machine translation. In Proceed-
ings of the 2018 Conference on Empirical Methods Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
in Natural Language Processing, pages 3622–3631. Callison-Burch, Marcello Federico, Nicola Bertoldi,
Association for Computational Linguistics. Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Constantin, and Evan Herbst. 2007. Moses: Open
Pino, Guillaume Lample, Philipp Koehn, Vishrav source toolkit for statistical machine translation. In
Chaudhary, and Marc’Aurelio Ranzato. 2019. Two Proceedings of the 45th Annual Meeting of the As-
New Evaluation Datasets for Low-Resource Ma- sociation for Computational Linguistics Companion
chine Translation: Nepali-English and Sinhala- Volume Proceedings of the Demo and Poster Ses-
English. arXiv preprint arXiv:1902.01382. sions, pages 177–180, Prague, Czech Republic. As-
sociation for Computational Linguistics.
Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel.
2016. Toward multilingual neural machine transla- Philipp Koehn and Rebecca Knowles. 2017. Six chal-
tion with universal encoder and decoder. In Pro- lenges for neural machine translation. In Pro-
ceedings of the 13th International Workshop on Spo- ceedings of the First Workshop on Neural Machine
ken Language Translation. Translation, pages 28–39, Vancouver. Association
Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. for Computational Linguistics.
2017. Effective strategies in zero-shot neural ma-
chine translation. In IWSLT. Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In
C. Henrı́quez, M. R. Costa-jussá, R. E. Banchs, Proceedings of the 2003 Conference of the North
L. Formiga, and J. B. Mari no. 2011. Pivot Strate- American Chapter of the Association for Computa-
gies as an Alternative for Statistical Machine Trans- tional Linguistics on Human Language Technology-
lation Tasks Involving Iberian Languages. In Work- Volume 1, pages 48–54. Association for Computa-
shop on ICL NLP Tasks. tional Linguistics.

Pratik Jawanpuria, Arjun Balgovind, Anoop Anoop Kunchukuttan and Pushpak Bhattacharyya.
Kunchukuttan, and Bamdev Mishra. 2019. Learn- 2016. Orthographic Syllable as basic unit for SMT
ing multilingual word embeddings in latent metric between Related Languages. In Proceedings of the
space: a geometric approach. Transaction of the Conference on Empirical Methods in Natural Lan-
Association for Computational Linguistics (TACL). guage Processing.
Anoop Kunchukuttan and Pushpak Bhattacharyya. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013.
2017. Learning variable length units for SMT be- Exploiting similarities among languages for ma-
tween related languages via Byte Pair Encoding. chine translation. Technical report, arXiv preprint
In First Workshop on Subword and Character level arXiv:1309.4168.
models in NLP.
V. Rudra Murthy, Anoop Kunchukuttan, and Push-
Anoop Kunchukuttan, Maulik Shah, Pradyot Prakash, pak Bhattacharyya. 2018. Addressing word-order
and Pushpak Bhattacharyya. 2017. Utilizing lexical divergence in multilingual neural machine transla-
similarity between related, low-resource languages tion for extremely low resource languages. CoRR,
for pivot-based smt. In Proceedings of the Eighth In- abs/1811.00383.
ternational Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 283– Toshiaki Nakazawa, Shohei Higashiyama, Chenchen
289. Asian Federation of Natural Language Process- Ding, Raj Dabre, Anoop Kunchukuttan, Win Pa
ing. Pa, Isao Goto, Hideya Mino, Katsuhito Sudoh, and
Sadao Kurohashi. 2018. Overview of the 5th work-
Surafel Melaku Lakew, Mauro Cettolo, and Marcello shop on asian translation. In Proceedings of the 5th
Federico. 2018a. A comparison of transformer and Workshop on Asian Translation (WAT2018).
recurrent neural networks on multilingual neural
machine translation. In Proceedings of the 27th In- Preslav Nakov and Hwee Tou Ng. 2009. Improved
ternational Conference on Computational Linguis- statistical machine translation for resource-poor lan-
tics, pages 641–652. Association for Computational guages using related resource-rich languages. In
Linguistics. Proceedings of the 2009 Conference on Empirical
Methods in Natural Language Processing.
Surafel Melaku Lakew, Aliia Erofeeva, Matteo Ne-
gri, Marcello Federico, and Marco Turchi. 2018b. Graham Neubig. 2017. Neural machine translation and
Transfer learning in multilingual neural machine sequence-to-sequence models: A tutorial. CoRR,
translation with dynamic vocabulary. In IWSLT. abs/1703.01619.

Surafel Melaku Lakew, Quintino F. Lotito, Matteo Ne- Graham Neubig and Junjie Hu. 2018. Rapid adapta-
gri, Marco Turchi, and Marcello Federico. 2017. tion of neural machine translation to new languages.
Improving zero-shot translation of low-resource lan- In Proceedings of the 2018 Conference on Empiri-
guages. In IWSLT. cal Methods in Natural Language Processing, pages
875–880. Association for Computational Linguis-
Jason Lee, Kyunghyun Cho, and Thomas Hofmann. tics.
2017. Fully character-level neural machine trans-
lation without explicit segmentation. Transactions Toan Q. Nguyen and David Chiang. 2017. Trans-
of the Association for Computational Linguistics, fer learning across low-resource, related languages
5:365–378. for neural machine translation. In Proceedings of
the Eighth International Joint Conference on Natu-
Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhard- ral Language Processing (Volume 2: Short Papers),
waj, Shaonan Zhang, and Jason Sun. 2018. A neu- pages 296–301. Asian Federation of Natural Lan-
ral interlingua for multilingual machine translation. guage Processing.
In Proceedings of the Third Conference on Machine
Translation: Research Papers, pages 84–92. Asso- Yuta Nishimura, Katsuhito Sudoh, Graham Neubig,
ciation for Computational Linguistics. and Satoshi Nakamura. 2018a. Multi-source neural
machine translation with data augmentation. In 15th
Giulia Mattoni, Pat Nagle, Carlos Collantes, and Dim- International Workshop on Spoken Language Trans-
itar Shterionov. 2017. Zero-shot translation for in- lation (IWSLT), Brussels, Belgium.
dian languages with sparse data. In Proceedings
of MT Summit XVI, Vol.2: Users and Translators Yuta Nishimura, Katsuhito Sudoh, Graham Neubig,
Track, pages 1–10. and Satoshi Nakamura. 2018b. Multi-source neural
machine translation with missing data. In Proceed-
Evgeny Matusov, Nicola Ueffing, and Hermann Ney. ings of the 2nd Workshop on Neural Machine Trans-
2006. Computing consensus translation for multiple lation and Generation, pages 92–99. Association for
machine translation systems using enhanced hypoth- Computational Linguistics.
esis alignment. In 11th Conference of the European
Chapter of the Association for Computational Lin- Franz Josef Och and Hermann Ney. 2001. Statisti-
guistics. cal multi-source translation. In Proceedings of MT
Summit, volume 8, pages 253–258.
Cettolo Mauro, Girardi Christian, and Federico Mar-
cello. 2012. Wit3: Web inventory of transcribed and Sinno Jialin Pan and Qiang Yang. 2010. A survey on
translated talks. In Conference of European Associ- transfer learning. IEEE Trans. on Knowl. and Data
ation for Machine Translation, pages 261–268. Eng., 22(10):1345–1359.
Mārcis Pinnis, Matı̄ss Rikters, and Rihards Krišlauks. overview of the European Unions highly multilin-
2018. Training and Adapting Multilingual NMT gual parallel corpora. Language Resources and
for Less-resourced and Morphologically Rich Lan- Evaluation, 48(4):679–707.
guages. In Proceedings of the Eleventh Interna-
tional Conference on Language Resources and Eval- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
uation (LREC 2018), Miyazaki, Japan. European Sequence to sequence learning with neural net-
Language Resources Association (ELRA). works. In Proceedings of the 27th International
Conference on Neural Information Processing Sys-
Emmanouil Antonios Platanios, Mrinmaya Sachan, tems, NIPS’14, pages 3104–3112, Cambridge, MA,
Graham Neubig, and Tom Mitchell. 2018. Contex- USA. MIT Press.
tual parameter generation for universal neural ma-
chine translation. In Proceedings of the 2018 Con- Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu.
ference on Empirical Methods in Natural Language 2019. Multilingual neural machine translation with
Processing, pages 425–435. Association for Com- knowledge distillation. In International Conference
putational Linguistics. on Learning Representations.
Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew M
Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Finch, and Eiichiro Sumita. 2016. Introducing the
2017. Unsupervised pretraining for sequence to se- asian language treebank (ALT). In LREC.
quence learning. In EMNLP.
Jörg Tiedemann. 2012a. Character-based pivot trans-
Devendra Sachan and Graham Neubig. 2018. Parame- lation for under-resourced languages and domains.
ter sharing methods for multilingual self-attentional In Proceedings of the 13th Conference of the Euro-
translation models. In Proceedings of the Third pean Chapter of the Association for Computational
Conference on Machine Translation: Research Pa- Linguistics.
pers, pages 261–271. Association for Computational
Linguistics. Jörg Tiedemann. 2012b. Parallel data, tools and inter-
faces in opus. In Proceedings of the Eight Interna-
Josh Schroeder, Trevor Cohn, and Philipp Koehn. tional Conference on Language Resources and Eval-
2009. Word lattices for multi-source translation. In uation (LREC’12), Istanbul, Turkey. European Lan-
Proceedings of the 12th Conference of the European guage Resources Association (ELRA).
Chapter of the Association for Computational Lin-
guistics, pages 719–727. Association for Computa- H. Uchida. 1996. UNL: Universal Networking Lan-
tional Linguistics. guage An Electronic Language for Communi-
cation, Understanding, and Collaboration. In
Rico Sennrich, Barry Haddow, and Alexandra Birch. UNU/IAS/UNL Center.
2016a. Improving neural machine translation mod-
els with monolingual data. In Proceedings of the Masao Utiyama and Hitoshi Isahara. 2007. A Compar-
54th Annual Meeting of the Association for Compu- ison of Pivot Methods for Phrase-Based Statistical
tational Linguistics (Volume 1: Long Papers), pages Machine Translation. In Conference of the North
86–96, Berlin, Germany. Association for Computa- Americal Chapter of the Association for Computa-
tional Linguistics. tional Linguistics, pages 484–491.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Rico Sennrich, Barry Haddow, and Alexandra Birch.
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
2016b. Neural machine translation of rare words
Kaiser, and Illia Polosukhin. 2017. Attention is all
with subword units. In Proceedings of the 54th An-
you need. In I. Guyon, U. V. Luxburg, S. Bengio,
nual Meeting of the Association for Computational
H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
Linguistics (Volume 1: Long Papers), pages 1715–
nett, editors, Advances in Neural Information Pro-
1725, Berlin, Germany. Association for Computa-
cessing Systems 30, pages 5998–6008. Curran As-
tional Linguistics.
sociates, Inc.
Lierni Sestorain, Massimiliano Ciaramita, Christian Raúl Vázquez, Alessandro Raganato, Jörg Tiedemann,
Buck, and Thomas Hofmann. 2018. Zero-shot dual and Mathias Creutz. 2018. Multilingual NMT with
machine translation. CoRR, abs/1805.10338. a language-independent attention bridge. CoRR,
abs/1811.00498.
Petr Sgall and Jarmila Panevová. 1987. Machine trans-
lation, linguistics, and interlingua. In Proceedings David Vilar, Jan-T Peter, and Hermann Ney. 2007. Can
of the Third Conference on European Chapter of the we translate letters? In Proceedings of the Second
Association for Computational Linguistics, EACL Workshop on Statistical Machine Translation.
’87, pages 99–103, Stroudsburg, PA, USA. Associa-
tion for Computational Linguistics. Rui Wang, Andrew Finch, Masao Utiyama, and Ei-
ichiro Sumita. 2017a. Sentence embedding for neu-
Ralf Steinberger, Mohamed Ebrahim, Alexandros ral machine translation domain adaptation. In Pro-
Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, ceedings of the 55th Annual Meeting of the Associa-
Marek Przybyszewski, and Signe Gilbro. 2014. An tion for Computational Linguistics (Volume 2: Short
Papers), pages 560–566, Vancouver, Canada. Asso- Barret Zoph and Kevin Knight. 2016. Multi-source
ciation for Computational Linguistics. neural translation. In Proceedings of the 2016 Con-
ference of the North American Chapter of the Asso-
Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, ciation for Computational Linguistics: Human Lan-
and Eiichiro Sumita. 2017b. Instance weighting guage Technologies, pages 30–34. Association for
for neural machine translation domain adaptation. Computational Linguistics.
In Proceedings of the 2017 Conference on Empiri-
cal Methods in Natural Language Processing, pages Barret Zoph, Deniz Yuret, Jonathan May, and Kevin
1482–1488, Copenhagen, Denmark. Knight. 2016. Transfer learning for low-resource
neural machine translation. In Proceedings of the
2016 Conference on Empirical Methods in Natural
Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Language Processing, EMNLP 2016, Austin, Texas,
Neubig. 2019. Multilingual neural machine transla- USA, November 1-4, 2016, pages 1568–1575.
tion with soft decoupled encoding. In International
Conference on Learning Representations.

Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu,


and Chengqing Zong. 2018. Three strategies to im-
prove one-to-many multilingual translation. In Pro-
ceedings of the 2018 Conference on Empirical Meth-
ods in Natural Language Processing, pages 2955–
2960. Association for Computational Linguistics.

T. Witkam. 2006. History and Heritage of the DLT


(Distributed Language Translation) project. In
Utrecht, The Netherlands: private publication.

Hua Wu and Haifeng Wang. 2007. Pivot language ap-


proach for phrase-based statistical machine transla-
tion. Machine Translation, 21(3):165–181.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.


Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason
Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144.

Poorya Zaremoodi, Wray Buntine, and Gholamreza


Haffari. 2018. Adaptive knowledge sharing in
multi-task learning: Improving low-resource neural
machine translation. In Proceedings of the 56th An-
nual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 656–
661. Association for Computational Linguistics.

Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno


Pouliquen. 2016a. The united nations parallel cor-
pus v1. 0. In LREC.

Micha Ziemski, Marcin Junczys-Dowmunt, and Bruno


Pouliquen. 2016b. The United Nations Parallel Cor-
pus v1.0. In Proceedings of the Tenth International
Conference on Language Resources and Evaluation
(LREC 2016), Paris, France. European Language
Resources Association (ELRA).
