Toward Multilingual Neural Machine Translation With Universal Encoder and Decoder
Abstract
In this paper, we present our first attempts at building a multilingual Neural Machine Translation framework under a unified approach. We are then able to employ attention-based NMT for many-to-many multilingual translation tasks. Our approach does not require any special treatment of the network architecture, and it allows us to learn a minimal number of free parameters with a standard training procedure. The approach has shown its effectiveness in an under-resourced translation scenario, with considerable improvements of up to 2.6 BLEU points. In addition, it has achieved interesting and promising results when applied to a translation task in which there is no direct parallel corpus between the source and target languages.
1 Introduction
Neural Machine Translation (NMT) has shown its effectiveness in translation tasks, with NMT systems performing best in recent machine translation campaigns (Cettolo et al., 2015; Bojar et al., 2016). Compared to phrase-based Statistical Machine Translation (SMT), which is essentially an ensemble of different features trained and tuned separately, NMT directly models the translation relationship between source and target sentences. Unlike SMT, NMT does not require much linguistic information or large monolingual data to achieve good performance.
An NMT system consists of an encoder, which recursively reads and represents the whole source sentence as a context vector, and a recurrent decoder, which takes the context vector and its previous state to predict the next target word. It is then trained in an end-to-end fashion to learn parameters which maximize the likelihood between the outputs and the references. Recently, attention-based NMT has been featured in most state-of-the-art systems. First introduced by Bahdanau et al. (2014), the attention mechanism is integrated into the decoder side as feedforward layers. It allows the NMT system to decide which source words should take part in predicting the next target word, and it improves NMT performance significantly. Nevertheless, since the attention mechanism is specific to a particular source sentence and the target word under consideration, it is also specific to particular language pairs.
Some recent work has focused on extending the NMT framework to multilingual scenarios. By training such a network using parallel corpora in a number of different languages, NMT could benefit from additional information embedded in a common semantic space across languages. Basically, the proposed multilingual NMT architectures employ multiple encoders or multiple decoders to deal with multilinguality. Furthermore, in order to avoid the tight dependency of the attention mechanism on specific language pairs, they also need to modify their architecture to combine either the encoders or the attention layers. These modifications are specific to the purpose of the task as well. Thus, those multilingual NMT systems are more complicated, have many more free parameters to learn, and are more difficult to train in a standard way compared to the original NMT.
In this paper, we introduce a unified approach to seamlessly extend the original NMT to multilingual settings. Our approach allows us to integrate any language on either side of the encoder-decoder architecture with only one encoder and one decoder for all the languages involved. Moreover, no network modification is necessary to enable the attention mechanism in our NMT systems. We then apply our proposed framework in two demanding scenarios: under-resourced translation and zero-resourced translation. The results show that bringing multilinguality to NMT helps to improve the individual translation directions. With some insightful analyses of the results, we set our goal toward a fully multilingual NMT framework.
The paper starts with a detailed introduction to attention-based NMT. In Section 3.1, related work on multi-task NMT is reviewed. Section 3.2 describes our proposed approach and provides thorough comparisons to the related work. It is followed by a section evaluating our systems in the two aforementioned scenarios, in which different strategies have been employed under a unified approach (Section 4). Finally, the paper ends with conclusions and future work.
\[ z_j = g(z_{j-1}, t_{j-1}, c_j), \qquad t_{j-1} = E_t \cdot y_{j-1} \]
Again, g is the recurrent activation function of the decoder and E_t is the shared word embedding matrix of the target sentences. The context vector c_j is calculated based on the annotation vectors from the encoder. Before feeding the annotation vectors into the decoder, an attention mechanism is set up in between, in order to choose which annotation vectors should contribute to the prediction of the next target word. Intuitively, the relevance between the previous decoder state and the annotation vectors can be used to form such an attention mechanism. There exist several ways to calculate this relevance, as shown in (Luong et al., 2015a), but what we describe here follows the method proposed by (Bahdanau et al., 2014):
\[ \mathrm{rel\_sc}(z_{j-1}, h_i) = v_a \cdot \tanh(W_a \cdot z_{j-1} + U_a \cdot h_i) \]
\[ \alpha_{ij} = \frac{\exp(\mathrm{rel\_sc}(z_{j-1}, h_i))}{\sum_{i'} \exp(\mathrm{rel\_sc}(z_{j-1}, h_{i'}))}, \qquad c_j = \sum_i \alpha_{ij} h_i \]
In (Bahdanau et al., 2014), this attention mechanism, originally called an alignment model, is employed as a simple feedforward network whose first layer is learnable via v_a, W_a and U_a. The relevance scores rel_sc are then normalized into attention weights α_ij, and the context vector c_j is calculated as the weighted sum of all annotation vectors h_i. Depending on how much attention the target word at time j puts on the source states h_i, a soft alignment is learned. Employed this way, word alignment is not a latent variable but a parametrized function, making the alignment model differentiable. Thus, it can be trained together with the whole architecture using backpropagation.
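To make this computation concrete, the following NumPy sketch performs one attention step: it scores each annotation vector h_i against the previous decoder state z_{j-1}, normalizes the scores with a softmax, and forms the context vector c_j as the weighted sum. The function name and the toy dimensions are illustrative assumptions; this is a sketch of the alignment model above, not the exact Nematus implementation.

```python
import numpy as np

def attention_context(z_prev, H, W_a, U_a, v_a):
    """Bahdanau-style attention step: score every annotation vector h_i against
    the previous decoder state z_{j-1} and return the context vector c_j."""
    # rel_sc(z_{j-1}, h_i) = v_a . tanh(W_a z_{j-1} + U_a h_i)
    scores = np.tanh(H @ U_a.T + z_prev @ W_a.T) @ v_a   # shape: (src_len,)
    # softmax over source positions gives the attention weights alpha_ij
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # c_j is the attention-weighted sum of the annotation vectors
    return weights @ H, weights                           # (dim_h,), (src_len,)

# Toy dimensions, for illustration only.
dim_z, dim_h, src_len = 4, 6, 5
rng = np.random.default_rng(0)
z_prev = rng.normal(size=dim_z)           # previous decoder state z_{j-1}
H = rng.normal(size=(src_len, dim_h))     # encoder annotation vectors h_1..h_n
W_a = rng.normal(size=(dim_h, dim_z))     # learnable attention parameters
U_a = rng.normal(size=(dim_h, dim_h))
v_a = rng.normal(size=dim_h)

c_j, alpha = attention_context(z_prev, H, W_a, U_a, v_a)
print(alpha.round(3), c_j.shape)
```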
One of the most severe problems of NMT is the handling of rare words, which are not in the short lists of the vocabularies, i.e. out-of-vocabulary (OOV) words, or do not appear in the training set at all. In (Luong et al., 2015b), the rare target words are copied from their aligned source words after the translation. This heuristic works well for OOV words and named entities, but it is unable to translate unseen words. In (Sennrich et al., 2016b), the proposed NMT models have been shown not only to be effective in reducing vocabulary sizes but also to have the ability to generate unseen words. This is achieved by segmenting the rare words into subword units and translating them. State-of-the-art translation systems essentially employ subword NMT (Sennrich et al., 2016b).
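The core of the BPE procedure of Sennrich et al. (2016b) can be sketched as follows: starting from a character-level representation of the training vocabulary, the most frequent adjacent symbol pair is merged repeatedly, and rare words are later segmented according to the learned merges. The snippet below is only a minimal illustration of this idea on a toy vocabulary; the actual systems in this paper rely on the full BPE implementation.

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary.
    Words are represented as space-separated symbols with an end-of-word marker."""
    vocab = {' '.join(list(w)) + ' </w>': f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # merge the best pair wherever it occurs as two adjacent symbols
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): f for w, f in vocab.items()}
    return merges

# Tiny toy corpus; the systems in this paper learn far more merges on real data.
merges = learn_bpe({'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}, num_merges=10)
print(merges[:5])
```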
Language-specific Coding. When the encoder of an NMT system considers words across languages as different words, with a well-chosen architecture it is expected to learn a good representation of the source words in an embedding space in which words carrying similar meaning are closer to each other than words that are semantically different. This should hold true when the words have the same or similar surface form, such as (@de@Obama; @en@Obama) or (@de@Projektion; @en@projection)3. This should also hold true when the words have the same or similar meaning, whether within one language or across languages, such as (@en@car; @en@automobile) or (@de@Flussufer; @en@bank). Our encoder then acts similarly to the one in the multi-source approach (Zoph and Knight, 2016), collecting additional information from other sources for better translations, but with a much simpler embedding function. Unlike them, we need only one encoder, so we can reduce the number of parameters to learn. Furthermore, we neither need to change the network architecture nor depend on which recurrent unit (GRU, LSTM or simple RNN) is currently used in the encoder.
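Language-specific coding itself is a one-line preprocessing step. A possible sketch is given below; the helper name is ours, and the code format simply follows the examples above.

```python
def code_language(sentence, lang):
    """Prefix every token with its language code, e.g. 'Projektion' -> '@de@Projektion'."""
    return ' '.join('@{}@{}'.format(lang, tok) for tok in sentence.split())

print(code_language('darum geht es in meinem Vortrag', 'de'))
# @de@darum @de@geht @de@es @de@in @de@meinem @de@Vortrag
```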
We can apply the same trick to the target sentences and thus enable the many-to-many translation capability of our NMT system. Similar to multi-target translation (Dong et al., 2015), we further exploit the correlation in semantics of those target sentences across different languages. The main difference between our approach and the work of (Dong et al., 2015) is that we need only one decoder for all target languages. Given one encoder for multiple source languages and one decoder for multiple target languages, it is trivial to incorporate the attention mechanism as in the case of a regular NMT system for single-language translation. In training, the attention layers are directed to learn relevant alignments between words in a specific language pair and forward the produced context vector to the decoder. Now we rely entirely on the network to learn good alignments between the source and target sides. In fact, given more information, our system is able to form good alignments.
In comparison to other research that performs complete multi-task learning, e.g. the work of (Luong et al., 2016) or the approach proposed by (Firat et al., 2016), our method is able to accommodate the attention layers seamlessly and easily. It also draws a clear distinction from those works in terms of the complexity of the whole network: considerably fewer parameters to learn, which reduces overfitting, with a conventional attention mechanism and a standard training procedure.
2 An example taken from the paper is when we want to translate the English word bank into French: it might be easier if we have an additional German sentence containing the word Flussufer (river bank).
3 @lang_code@a_word is a simple scheme that transforms the word a_word into a different surface form associated with its language lang_code. For example, @de@Projektion refers to the word Projektion appearing in a German (de) sentence.
Target Forcing. While language-specific coding allows us to implement a multilingual attention-based NMT, there are two issues we have to consider before training the network. The first is that the number of rare words increases in proportion to the number of languages involved. This can be addressed by applying a rare-word treatment method with appropriate awareness of the vocabulary sizes. The second one is more problematic: the level of ambiguity in the translation process increases due to the additional introduction of words having the same or similar meaning across languages on both the source and target sides. We deal with this problem by explicitly forcing the attention and translation in the direction that we prefer, expecting this information to limit the ambiguity to the scope of one target language instead of all target languages. We realize this idea by adding, at the beginning and at the end of every source sentence, a special symbol indicating the language it should be translated into. For example, in a multilingual NMT system, when a source sentence is German and the target language is English, the original sentence (already language-specific coded) is:
@de@darum @de@geht @de@es @de@in @de@meinem @de@Vortrag
Now when we force it to be translated into English, the target-forced sentence becomes:
<E> @de@darum @de@geht @de@es @de@in @de@meinem @de@Vortrag <E>
Due to the nature of the recurrent units used in the encoder and decoder, in training, those starting symbols4 encourage the network to learn the translation of the following target words in a particular language pair. At test time, the information about the target language we provide helps to limit the translation candidates, hence forming the translation in the desired language.
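Target forcing is again a trivial string operation applied to the already language-specific coded sentence. The sketch below uses a hypothetical helper of our own; only the <E> symbol and the example sentence come from the text above.

```python
def force_target(coded_sentence, target_symbol):
    """Target forcing: add the target-language symbol at the beginning and end of
    the (already language-specific coded) source sentence."""
    return '{0} {1} {0}'.format(target_symbol, coded_sentence)

coded = '@de@darum @de@geht @de@es @de@in @de@meinem @de@Vortrag'
print(force_target(coded, '<E>'))
# <E> @de@darum @de@geht @de@es @de@in @de@meinem @de@Vortrag <E>
```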
Figure 1 illustrates the essence of our approach. With two steps in the preprocessing phase, namely language-specific coding and target forcing, we are able to employ multilingual attention-based NMT without any special treatment in training such a standard architecture. Our encoder and attention-enabled decoder can be seen as a shared encoder and decoder across languages, or a universal encoder and decoder. The flexibility of our approach allows us to integrate any language into the source or target side. As we will see in Section 4, it has proven to be extremely helpful not only in low-resourced scenarios but also in the translation of well-resourced language pairs, as it provides a novel way to make use of large monolingual corpora in NMT.
4 Evaluation
In this section, we describe the evaluation of our proposed approach in comparison with strong NMT baselines in two scenarios: the translation of an under-resourced language pair and the translation of a language pair for which no direct parallel data exists at all.
and the web-crawled parallel data (CommonCrawl). While the number of sentences in the popular TED corpora varies from 13 thousand to 17 thousand, the total number of sentences in those larger corpora is approximately 3 million.
Neural Machine Translation Setup. All experiments have been conducted using the NMT framework Nematus6. Following the work of Sennrich et al. (2016b), subword segmentation is handled in the preprocessing phase using Byte-Pair Encoding (BPE). Except where stated otherwise in some experiments, we set the number of BPE merge operations to 39,500 on the joint source and target data. When training all NMT systems, we remove the sentence pairs exceeding 50 words in length and shuffle the remaining pairs inside every minibatch. Our short-list vocabularies contain the 40,000 most frequent words, while the others are considered rare words and handled by subword translation. We use a 1024-cell GRU layer and 1000-dimensional embeddings, with dropout at every layer: probability 0.2 in the embedding and hidden layers and 0.1 in the input and output layers. We train our systems using gradient-descent optimization with Adadelta (Zeiler, 2012) on minibatches of size 80, and the gradient is rescaled whenever its norm exceeds 1.0. All trainings last approximately seven days if the early-stopping condition is not reached. At regular intervals, an external evaluation script computing BLEU (Papineni et al., 2002) is run on a development set to decide the early-stopping condition. This evaluation script is also used to choose the model achieving the best BLEU on the development set, instead of the one with the maximal log-likelihood between the translations and target sentences during training. In translation, the framework produces n-best candidates, and we use beam search with a beam size of 12 to get the best translation.
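For reference, the hyperparameters described above can be collected in a single configuration. The dictionary below is only a hypothetical summary of the reported values; its key names are illustrative and do not necessarily correspond to Nematus' actual option names.

```python
# Hypothetical configuration summarizing the reported training setup;
# key names are illustrative and need not match Nematus' actual options.
training_config = {
    'bpe_merge_operations': 39500,     # learned jointly on source and target data
    'max_sentence_length': 50,         # longer sentence pairs are removed
    'vocabulary_size': 40000,          # short-list size; the rest are rare words
    'encoder_decoder_unit': 'GRU',
    'hidden_size': 1024,               # cells in the GRU layer
    'embedding_size': 1000,
    'dropout_embedding': 0.2,
    'dropout_hidden': 0.2,
    'dropout_input': 0.1,
    'dropout_output': 0.1,
    'optimizer': 'adadelta',
    'batch_size': 80,
    'gradient_clip_norm': 1.0,
    'early_stopping_metric': 'BLEU',   # external script run on the development set
    'beam_size': 12,
}
```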
2.1 BLEU points on tst2013 and tst2014. Adding French data to the source side and the corresponding German data to the target side in our mix-multi-source system also helps to gain 2.2 and 1.6 BLEU points on tst2013 and tst2014, respectively. We observe a better improvement from our mix-source system compared to our mix-multi-source system. We speculate that the reason is that the mix-source encoder utilizes the same information shared between the two languages, while the mix-multi-source encoder receives and processes similar information in the other language, but not necessarily the same. We might validate this hypothesis by comparing two systems trained on a common English-German-French corpus of TED; we leave this for future work.
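As a rough illustration of how the two mixed training corpora named above could be assembled: under our reading, mix-source (En,De→De,De) adds German→German copies of the target side to the English→German data, and mix-multi-source (En,Fr→De,De) adds French→German data. The helper names, the <D> forcing symbol for German targets, and the decision not to language-code the target side are assumptions of this sketch, not a description of the exact pipeline.

```python
def preprocess(src_sentence, src_lang, target_symbol):
    """Language-specific coding plus target forcing for one source sentence."""
    coded = ' '.join('@{}@{}'.format(src_lang, tok) for tok in src_sentence.split())
    return '{0} {1} {0}'.format(target_symbol, coded)

# mix-source (En,De -> De,De): English->German pairs plus German->German copies;
# '<D>' is a hypothetical forcing symbol for German targets.
def build_mix_source(en_de_pairs):
    data = []
    for en, de in en_de_pairs:
        data.append((preprocess(en, 'en', '<D>'), de))
        data.append((preprocess(de, 'de', '<D>'), de))   # copied German data
    return data

# mix-multi-source (En,Fr -> De,De): English->German plus French->German pairs.
def build_mix_multi_source(en_de_pairs, fr_de_pairs):
    data = [(preprocess(en, 'en', '<D>'), de) for en, de in en_de_pairs]
    data += [(preprocess(fr, 'fr', '<D>'), de) for fr, de in fr_de_pairs]
    return data
```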
As we expected, Figure 3 shows how different words in different languages can be close in the shared space after the model has learned to translate them into a common language. We extract the word embeddings from the encoder of the mix-multi-source system (En,Fr→De,De) after training, remove the language-specific codes (@en@ and @fr@) and project the word vectors to the 2D space using t-SNE8 (Maaten and Hinton, 2008).
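Such a projection can be reproduced with standard tooling. The sketch below uses scikit-learn's t-SNE implementation and matplotlib; the function, the output file name and the way the embedding matrix and vocabulary are passed in are illustrative assumptions, since in practice they are extracted from the trained model.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(words, vectors, out_file='embeddings_tsne.png'):
    """Project word embeddings to 2D with t-SNE and plot them with their labels.
    'words' are the (language-coded) vocabulary entries, 'vectors' the corresponding
    rows of the encoder embedding matrix extracted from the trained model."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
    plt.figure(figsize=(10, 10))
    for (x, y), w in zip(coords, words):
        # strip the language-specific code (@en@, @fr@) before labelling the point
        plt.annotate(w.split('@')[-1], (x, y), fontsize=8)
    plt.scatter(coords[:, 0], coords[:, 1], s=4)
    plt.savefig(out_file, bbox_inches='tight')

# Demo call with random data; in practice words/vectors come from the encoder.
demo_words = ['@en@word{}'.format(i) for i in range(100)]
plot_embeddings(demo_words, np.random.rand(100, 620))
```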
Figure 3: The multilingual word embeddings from the shared representation space of the source.
System                          tst2013              tst2014
                                BLEU    ∆BLEU        BLEU    ∆BLEU
Baseline (En→De)                24.35   –            20.62   –
Mix-source big (En,De→De,De)    25.87   +1.52        21.68   +1.06
the forcing strength might not be enough to guide the decision of the next words. Once the very first word is translated into a word in the wrong language, the following words tend to be translated into that wrong language as well. Table 4 shows some statistics of the translated words and sentences in the wrong language.
Table 4: Percentages of language identification mistakes when applying our translation strategies.
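One possible way to obtain such percentages is to run an off-the-shelf language identifier over the system outputs. The sketch below uses the langid package; this is an assumption about tooling, not necessarily how the numbers in Table 4 were produced.

```python
import langid

def wrong_language_rate(hypotheses, expected_lang):
    """Percentage of output sentences whose detected language differs from the expected one."""
    wrong = sum(1 for sent in hypotheses if langid.classify(sent)[0] != expected_lang)
    return 100.0 * wrong / len(hypotheses)

hyps = ['this is an English sentence .', 'ceci est une phrase française .']
print('{:.1f}% of sentences in wrong language'.format(wrong_language_rate(hyps, 'en')))
```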
Balancing of the training corpus. Although it is not as severe as in the case of the mix-source system with large monolingual data, the limited number of sentences in the target language can affect the training. The difference of 1.07 BLEU points between the bridge and universal systems might support this assumption, as we added more target data (French) in the universal strategy, thus reducing the imbalance in training.
These issues will be addressed in our future work toward multilingual attention-based NMT.
References
[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Transla-
tion by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
[Bojar et al.2016] Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes
Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2016. Findings of the 2016
Conference on Machine Translation (WMT16). In Proceedings of the First Conference on Machine Translation
(WMT16), pages 12–58, Berlin, Germany. Association for Computational Linguistics.
[Cettolo et al.2012] Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.
[Cettolo et al.2015] Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 Evaluation Campaign. In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT 2015), Danang, Vietnam.
[Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), Doha, Qatar, October. Association for Computational Linguistics.
[Dong et al.2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task Learning for Multiple Language Translation. In Proceedings of ACL-IJCNLP 2015, pages 1723–1732, Beijing, China, July. Association for Computational Linguistics.
[Firat et al.2016] Orhan Firat, KyungHyun Cho, and Yoshua Bengio. 2016. Multi-Way, Multilingual Neural Ma-
chine Translation with a Shared Attention Mechanism. CoRR, abs/1601.01073.
[Gülçehre et al.2015] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loı̈c Barrault, Huei-Chi Lin,
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Ma-
chine Translation. CoRR, abs/1503.03535.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory.
Neural Comput., 9(8):1735–1780, November.
[Luong et al.2015a] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.
[Luong et al.2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the Rare Word Problem in Neural Machine Translation. In Proceedings of ACL-IJCNLP 2015, pages 11–19, Beijing, China, July. Association for Computational Linguistics.
[Luong et al.2016] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016.
Multi-task sequence to sequence learning. In International Conference on Learning Representations (ICLR),
San Juan, Puerto Rico, May.
[Maaten and Hinton2008] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318. Association for Computational Linguistics.
[Sennrich et al.2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine
Translation Models with Monolingual Data. In Association for Computational Linguistics (ACL 2016), Berlin,
Germany, August.
[Sennrich et al.2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of
Rare Words with Subword Units. In Association for Computational Linguistics (ACL 2016), Berlin, Germany,
August.
[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.
[Zoph and Knight2016] Barret Zoph and Kevin Knight. 2016. Multi-Source Neural Translation. In The North
American Chapter of the Association for Computational Linguistics (NAACL 2016), San Diego, CA, USA,
June.