Understanding Back-Translation at Scale
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500
Brussels, Belgium, October 31 - November 4, 2018. ©2018 Association for Computational Linguistics
ing only on public WMT bitext as well as 226M monolingual sentences. This outperforms the DeepL system, which is trained on large amounts of high-quality non-benchmark data, by 1.7 BLEU. On WMT’14 English-French we achieve 45.6 BLEU.

2 Related work

This section describes prior work in machine translation with neural networks as well as semi-supervised machine translation.

2.1 Neural machine translation

We build upon recent work on neural machine translation which is typically a neural network with an encoder/decoder architecture. The encoder infers a continuous space representation of the source sentence, while the decoder is a neural language model conditioned on the encoder output. The parameters of both models are learned jointly to maximize the likelihood of the target sentences given the corresponding source sentences from a parallel corpus (Sutskever et al., 2014; Cho et al., 2014). At inference, a target sentence is generated by left-to-right decoding.

Different neural architectures have been proposed with the goal of improving efficiency and/or effectiveness. This includes recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), convolutional networks (Kalchbrenner et al., 2016; Gehring et al., 2017; Kaiser et al., 2017) and transformer networks (Vaswani et al., 2017). Recent work relies on attention mechanisms where the encoder produces a sequence of vectors and, for each target token, the decoder attends to the most relevant part of the source through a context-dependent weighted-sum of the encoder vectors (Bahdanau et al., 2015; Luong et al., 2015). Attention has been refined with multi-hop attention (Gehring et al., 2017), self-attention (Vaswani et al., 2017; Paulus et al., 2018) and multi-head attention (Vaswani et al., 2017). We use a transformer architecture (Vaswani et al., 2017).

2.2 Semi-supervised NMT

Monolingual target data has been used to improve the fluency of machine translations since the early IBM models (Brown et al., 1990). In phrase-based systems, language models (LM) in the target language increase the score of fluent outputs during decoding (Koehn et al., 2003; Brants et al., 2007). A similar strategy can be applied to NMT (He et al., 2016b). Besides improving accuracy during decoding, neural LM and NMT can benefit from deeper integration, e.g. by combining the hidden states of both models (Gulcehre et al., 2017). Neural architecture also allows multi-task learning and parameter sharing between MT and target-side LM (Domhan and Hieber, 2017).

Back-translation (BT) is an alternative way to leverage monolingual data. BT is simple and easy to apply as it does not require modification to the MT training algorithms. It requires training a target-to-source system in order to generate additional synthetic parallel data from the monolingual target data. This data complements human bitext to train the desired source-to-target system. BT has been applied earlier to phrase-based systems (Bojar and Tamchyna, 2011). For these systems, BT has also been successful in leveraging monolingual data for domain adaptation (Bertoldi and Federico, 2009; Lambert et al., 2011). Recently, BT has been shown to be beneficial for NMT (Sennrich et al., 2016a; Poncelas et al., 2018). It has been found to be particularly useful when parallel data is scarce (Karakanta et al., 2017).

Currey et al. (2017) show that low resource language pairs can also be improved with synthetic data where the source is simply a copy of the monolingual target data. Concurrently to our work, Imamura et al. (2018) show that sampling synthetic sources is more effective than beam search. Specifically, they sample multiple sources for each target whereas we draw only a single sample, opting to train on a larger number of target sentences instead. Hoang et al. (2018) and Cotterell and Kreutzer (2018) suggest an iterative procedure which continuously improves the quality of the back-translation and final systems. Niu et al. (2018) experiment with a multilingual model that does both the forward and backward translation which is continuously trained with new synthetic data.

There has also been work using source-side monolingual data (Zhang and Zong, 2016). Furthermore, Cheng et al. (2016); He et al. (2016a); Xia et al. (2017) show how monolingual text from both languages can be leveraged by extending back-translation to dual learning: when training both source-to-target and target-to-source models jointly, one can use back-translation in both directions and perform multiple rounds of BT. A similar idea is applied in unsupervised NMT (Lample et al., 2018a,b). Besides monolingual data, various approaches have been introduced to benefit from parallel data in other language pairs (Johnson et al., 2017; Firat et al., 2016a,b; Ha et al., 2016; Gu et al., 2018).

Data augmentation is an established technique in computer vision where a labeled dataset is supplemented with cropped or rotated input images. Recently, generative adversarial networks (GANs) have been successfully used to the same end (Antoniou et al., 2017; Perez and Wang, 2017) as well as models that learn distributions over image transformations (Hauberg et al., 2016).

3 Generating synthetic sources

Back-translation typically uses beam search (Sennrich et al., 2016a) or just greedy search (Lample et al., 2018a,b) to generate synthetic source sentences. Both are approximate algorithms to identify the maximum a-posteriori (MAP) output, i.e. the sentence with the largest estimated probability given an input. Beam is generally successful in finding high probability outputs (Ott et al., 2018a).

However, MAP prediction can lead to less rich translations (Ott et al., 2018a) since it always favors the most likely alternative in case of ambiguity. This is particularly problematic in tasks where there is a high level of uncertainty such as dialog (Serban et al., 2016) and story generation (Fan et al., 2018). We argue that this is also problematic for a data augmentation scheme such as back-translation. Beam and greedy focus on the head of the model distribution which results in very regular synthetic source sentences that do not properly cover the true data distribution.

As an alternative, we consider sampling from the model distribution as well as adding noise to beam search outputs. First, we explore unrestricted sampling which generates outputs that are very diverse but sometimes highly unlikely. Second, we investigate sampling restricted to the most likely words (Graves, 2013; Ott et al., 2018a; Fan et al., 2018). At each time step, we select the k most likely tokens from the output distribution, re-normalize and then sample from this restricted set. This is a middle ground between MAP and unrestricted sampling.

As a third alternative, we apply noising (Lample et al., 2018a) to beam search outputs. Adding noise to input sentences has been very beneficial for the autoencoder setups of Lample et al. (2018a) and Hill et al. (2016), which are inspired by denoising autoencoders (Vincent et al., 2008). In particular, we transform source sentences with three types of noise: deleting words with probability 0.1, replacing words by a filler token with probability 0.1, and swapping words which is implemented as a random permutation over the tokens, drawn from the uniform distribution but restricted to swapping words no further than three positions apart.

4 Experimental setup

4.1 Datasets

The majority of our experiments are based on data from the WMT’18 English-German news translation task. We train on all available bitext excluding the ParaCrawl corpus and remove sentences longer than 250 words as well as sentence-pairs with a source/target length ratio exceeding 1.5. This results in 5.18M sentence pairs. For the back-translation experiments we use the German monolingual newscrawl data distributed with WMT’18 comprising 226M sentences after removing duplicates. We tokenize all data with the Moses tokenizer (Koehn et al., 2007) and learn a joint source and target Byte-Pair-Encoding (BPE; Sennrich et al., 2016b) with 35K types. We develop on newstest2012 and report final results on newstest2013-2017; additionally we consider a held-out set from the training data of 52K sentence-pairs.

We also experiment on the larger WMT’14 English-French task which we filter in the same way as WMT’18 English-German. This results in 35.7M sentence-pairs for training and we learn a joint BPE vocabulary of 44K types. As monolingual data we use newscrawl2010-2014, comprising 31M sentences after language identification (Lui and Baldwin, 2012). We use newstest2012 as development set and report final results on newstest2013-2015.

The majority of results in this paper are in terms of case-sensitive tokenized BLEU (Papineni et al., 2002) but we also report test accuracy with detokenized BLEU using sacreBLEU (Post, 2018).

4.2 Model and hyperparameters

We re-implemented the Transformer model in PyTorch using the fairseq toolkit (code available at https://github.com/pytorch/fairseq). All experiments
are based on the Big Transformer architecture with 6 blocks in the encoder and decoder. We use the same hyper-parameters for all experiments, i.e., word representations of size 1024, feed-forward layers with inner dimension 4096. Dropout is set to 0.3 for En-De and 0.1 for En-Fr, we use 16 attention heads, and we average the checkpoints of the last ten epochs. Models are optimized with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e−8 and we use the same learning rate schedule as Vaswani et al. (2017). All models use label smoothing with a uniform prior distribution over the vocabulary, ε = 0.1 (Szegedy et al., 2015; Pereyra et al., 2017). We run experiments on DGX-1 machines with 8 Nvidia V100 GPUs and machines are interconnected by Infiniband. Experiments are run on 16 machines and we perform 30K synchronous updates. We also use the NCCL2 library and the torch distributed package for inter-GPU communication. We train models with 16-bit floating point operations, following Ott et al. (2018b). For final evaluation, we generate translations with a beam of size 5 and with no length penalty.

5 Results

Our evaluation first compares the accuracy of back-translation generation methods (§5.1) and analyzes the results (§5.2). Next, we simulate a low-resource setup to experiment further with different generation methods (§5.3). We also compare synthetic bitext to genuine parallel data and examine domain effects arising in back-translation (§5.4). We also measure the effect of upsampling bitext during training (§5.5). Finally, we scale to a very large setup of up to 226M monolingual sentences and compare to previous research (§5.6).

5.1 Synthetic data generation methods

We first investigate different methods to generate synthetic source translations given a back-translation model, i.e., a model trained in the reverse language direction (Section 3). We consider two types of MAP prediction: greedy search (greedy) and beam search with beam size 5 (beam). Non-MAP methods include unrestricted sampling from the model distribution (sampling), restricting sampling to the k highest scoring outputs at every time step with k = 10 (top10) as well as adding noise to the beam outputs (beam+noise). Restricted sampling is a middle-ground between beam search and unrestricted sampling: it is less likely to pick very low scoring outputs but still preserves some randomness. Preliminary experiments with top5, top20, top50 gave similar results to top10.

We also vary the amount of synthetic data and perform 30K updates during training for the bitext only, 50K updates when adding 3M synthetic sentences, 75K updates for 6M and 12M sentences and 100K updates for 24M sentences. For each setting, this corresponds to enough updates to reach convergence in terms of held-out loss. In our 128 GPU setup, training of the final models takes 3h 20min for the bitext only model, 7h 30min for 6M and 12M synthetic sentences, and 10h 15min for 24M sentences. During training we also sample the bitext more frequently than the synthetic data and we analyze the effect of this in more detail in §5.5.

Figure 1 shows that sampling and beam+noise outperform the MAP methods (pure beam search and greedy) by 0.8-1.1 BLEU. Sampling and beam+noise improve over bitext-only (5M) by between 1.7-2 BLEU in the largest data setting. Restricted sampling (top10) performs better than beam and greedy but is not as effective as unrestricted sampling (sampling) or beam+noise.

[Figure 1 plot: BLEU (newstest2012) against total training data (5M, 8M, 11M, 17M, 29M) for greedy, beam, top10, sampling and beam+noise.]
Figure 1: Accuracy of models trained on different amounts of back-translated data obtained with greedy search, beam search (k = 5), randomly sampling from the model distribution, restricting sampling over the ten most likely words (top10), and by adding noise to the beam outputs (beam+noise). Results based on newstest2012 of WMT English-German translation.

Table 1 shows results on a wider range of
               news2013  news2014  news2015  news2016  news2017  Average
bitext            27.84     30.88     31.82     34.98     29.46    31.00
+ beam            27.82     32.33     32.20     35.43     31.11    31.78
+ greedy          27.67     32.55     32.57     35.74     31.25    31.96
+ top10           28.25     33.94     34.00     36.45     32.08    32.94
+ sampling        28.81     34.46     34.87     37.08     32.35    33.51
+ beam+noise      29.28     33.53     33.79     37.89     32.66    33.43
Table 1: Tokenized BLEU on various test sets of WMT English-German when adding 24M synthetic sentence pairs obtained by various generation methods to a 5.2M sentence-pair bitext (cf. Figure 1).
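
The greedy, top10 and sampling rows of Table 1 differ only in how each synthetic source token is chosen at a decoding step. The following Python sketch illustrates that choice for a single step. It is a minimal illustration under stated assumptions, not the fairseq implementation: the function name `select_token`, the plain-list representation of the output distribution and the `rng` parameter are hypothetical, and beam search and noising are not covered.

```python
import random


def select_token(probs, method="sampling", k=10, rng=random):
    """Pick the index of the next token from one step's output distribution.

    probs  : list of non-negative floats summing to 1 (hypothetical model output)
    method : "greedy" (argmax), "topk" (restricted sampling) or "sampling"
    """
    if method == "greedy":
        # MAP choice for a single step: the most likely token.
        return max(range(len(probs)), key=probs.__getitem__)
    if method == "topk":
        # Keep only the k most likely tokens.
        candidates = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    elif method == "sampling":
        # Unrestricted sampling: every token is a candidate.
        candidates = list(range(len(probs)))
    else:
        raise ValueError(f"unknown method: {method}")
    # random.choices treats the weights as relative, which re-normalizes
    # the probabilities over the candidate set.
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Calling `select_token(probs, "topk", k=10)` at every time step would correspond to the top10 setting, while `method="sampling"` can occasionally return very unlikely tokens, which is the behaviour the analysis in Section 5.2 examines.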
test sets (newstest2013-2017). Sampling and beam+noise perform roughly equal and we adopt sampling for the remaining experiments.

5.2 Analysis of generation methods

The previous experiment showed that synthetic source sentences generated via sampling and beam with noise perform significantly better than those obtained by pure MAP methods. Why is this?

Beam search focuses on very likely outputs which reduces the diversity and richness of the generated source translations. Adding noise to beam outputs and sampling do not have this problem: Noisy source sentences make it harder to predict the target translations which may help learning, similar to denoising autoencoders (Vincent et al., 2008). Sampling is known to better approximate the data distribution which is richer than the argmax model outputs (Ott et al., 2018a). Therefore, sampling is also more likely to provide a richer training signal than argmax sequences.

To get a better sense of the training signal provided by each method, we compare the loss on the training data for each method. We report the cross entropy loss averaged over all tokens and separate the loss over the synthetic data and the real bitext data. Specifically, we choose the setup with 24M synthetic sentences. At the end of each epoch we measure the loss over 500K sentence pairs sub-sampled from the synthetic data as well as an equally sized subset of the bitext. For each generation method we choose the same sentences except for the bitext which is disjoint from the synthetic data. This means that losses over the synthetic data are measured over the same target tokens because the generation methods only differ in the source sentences. We found it helpful to upsample the frequency with which we observe the bitext compared to the synthetic data (§5.5) but we do not upsample for this experiment to keep conditions as similar as possible. We assume that when the training loss is low, then the model can easily fit the training data without extracting much learning signal compared to data which is harder to fit.

              Perplexity
human data         75.34
beam               72.42
[Table 2: remaining rows and caption lost in extraction.]

[Figure 2 plot: training perplexity against epoch (1 to 100).]
Figure 2: Training perplexity (PPL) per epoch for different synthetic data. We separately report PPL on the synthetic data and the bitext. Bitext PPL is averaged over all generation methods.

Figure 2 shows that synthetic data based on
source      Diese gegensätzlichen Auffassungen von Fairness liegen nicht nur der politischen Debatte zugrunde.
reference   These competing principles of fairness underlie not only the political debate.
beam        These conflicting interpretations of fairness are not solely based on the political debate.
sample      Mr President, these contradictory interpretations of fairness are not based solely on the political debate.
top10       Those conflicting interpretations of fairness are not solely at the heart of the political debate.
beam+noise  conflicting BLANK interpretations BLANK are of not BLANK based on the political debate.
Table 3: Example where sampling produces inadequate outputs. "Mr President," is not in the source. BLANK means that a word has been replaced by a filler token.
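
The beam+noise row of Table 3 shows the effect of the three noise types described in Section 3: deletion with probability 0.1, replacement by a filler token with probability 0.1, and swaps of at most three positions. A minimal Python sketch of such a noising step could look as follows. The function name `add_noise`, the order in which the three operations are applied, and the sort-by-perturbed-index trick for the restricted permutation are assumptions made for illustration. The latter is one common way to obtain a bounded local shuffle and need not match the code used for the experiments.

```python
import random


def add_noise(tokens, p_drop=0.1, p_blank=0.1, max_dist=3, filler="BLANK", rng=random):
    """Return a noised copy of a tokenized sentence (list of str)."""
    # 1) Delete each word independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2) Replace each remaining word by the filler token with probability p_blank.
    blanked = [filler if rng.random() < p_blank else t for t in kept]
    # 3) Local shuffle: add uniform noise in [0, max_dist + 1) to every position
    #    and sort by the perturbed positions. No word ends up more than
    #    max_dist positions away from where it was after steps 1 and 2.
    keys = [i + rng.random() * (max_dist + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]
```

Note that this sketch can return an empty list for very short inputs if every word happens to be deleted; a practical variant might keep at least one token.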
greedy or beam is much easier to fit compared to data from sampling, top10, beam+noise and the bitext. In fact, the perplexity on beam data falls below 2 after only 5 epochs. Except for sampling, we find that the perplexity on the training data is somewhat correlated to the end-model accuracy (cf. Figure 1) and that all methods except sampling have a lower loss than real bitext.

These results suggest that synthetic data obtained with argmax inference does not provide as rich a training signal as sampling or adding noise. We conjecture that the regularity of synthetic data obtained with argmax inference is not optimal. Sampling and noised argmax both expose the model to a wider range of source sentences which makes the model more robust to reordering and substitutions that happen naturally, even if the model of reordering and substitution through noising is not very realistic.

Next we analyze the richness of synthetic outputs and train a language model on real human text and score synthetic source sentences generated by beam search, sampling, top10 and beam+noise. We hypothesize that data that is very regular should be more predictable by the language model and therefore receive low perplexity. We eliminate a possible domain mismatch effect between the language model training data and the synthetic data by splitting the parallel corpus into three non-overlapping parts:

1. On 640K sentence pairs, we train a back-translation model,

2. On 4.1M sentence pairs, we take the source side and train a 5-gram Kneser-Ney language model (Heafield et al., 2013),

3. On the remaining 450K sentences, we apply the back-translation system using beam, sampling and top10 generation.

For the last set, we have genuine source sentences as well as synthetic sources from different generation techniques. We report the perplexity of our language model on all versions of the source data in Table 2. The results show that beam outputs receive higher probability by the language model compared to sampling, beam+noise and real source sentences. This indicates that beam search outputs are not as rich as sampling outputs or beam+noise. This lack of variability probably explains in part why back-translations from pure beam search provide a weaker training signal than alternatives.

Closer inspection of the synthetic sources (Table 3) reveals that sampled and noised beam outputs are sometimes not very adequate, much more so than MAP outputs, e.g., sampling often introduces target words which have no counterpart in the source. This happens because sampling sometimes picks highly unlikely outputs which are harder to fit (cf. Figure 2).

5.3 Low resource vs. high resource setup

The experiments so far are based on a setup with a large bilingual corpus. However, in resource poor settings the back-translation model is of much lower quality. Are non-MAP methods still more effective in such a setup? To answer this question, we simulate such setups by sub-sampling the training data to either 80K sentence-pairs or 640K sentence-pairs and then add synthetic data from sampling and beam search. We compare these smaller setups to our original 5.2M sentence bitext configuration. The accuracy of the
[Figure: plot with y-axis BLEU (newstest2012); x-axis tick labels and caption lost in extraction.]

The back-translated data is generated via sampling. This setup allows us to compare synthetic data to genuine data since BT-bitext and bitext
[Figure 4 plots: BLEU against amount of data (640K, 1.28M, 2.56M, 5.19M) for bitext, BT-bitext and BT-news; panels (a) newstest2012 and (b) valid-mixed.]
Figure 4: Accuracy on (a) newstest2012 and (b) a mixed domain valid set when growing a 640K bitext corpus with (i) real parallel data (bitext), (ii) a back-translated version of the target side of the bitext (BT-bitext), (iii) or back-translated newscrawl data (BT-news).
            news13  news14  news15
bitext       36.97   42.90   39.92
+sampling    37.85   45.60   43.95
Table 4: Tokenized BLEU on various test sets for WMT English-French translation.

                            En–De  En–Fr
a. Gehring et al. (2017)     25.2   40.5
b. Vaswani et al. (2017)     28.4   41.0
c. Ahmed et al. (2017)       28.9   41.4
d. Shaw et al. (2018)        29.2   41.5
DeepL                        33.3   45.9
Our result                   35.0   45.6
detok. sacreBLEU             33.8   43.8
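
The scores in these tables are corpus-level BLEU (Papineni et al., 2002). As a reminder of what the metric computes, the following Python sketch implements clipped n-gram precision up to 4-grams with a brevity penalty. It is only an illustration under stated assumptions: the function names are hypothetical, inputs are already tokenized, there is a single reference per sentence and no smoothing. It does not reproduce the tokenization or other details of sacreBLEU, so its numbers are not comparable to the reported ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) for tokenized sentences, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100 * brevity_penalty * math.exp(log_precision)
```

Because counts are pooled over the whole test set before the precisions are computed, the result is not the average of per-sentence scores.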
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ryan Cotterell and Julia Kreutzer. 2018. Explaining and generalizing back-translation through wake-sleep. arXiv preprint arXiv:1806.04402.

Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proc. of WMT.

Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural machine translation through multi-task learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Angela Fan, Yann Dauphin, and Mike Lewis. 2018. Hierarchical neural story generation. In Conference of the Association for Computational Linguistics (ACL).

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference of Machine Learning (ICML).

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv, 1308.0850.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. arXiv, 1802.05368.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv, 1503.03535.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into neural machine translation. Computer Speech & Language, 45:137–148.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv, 1611.04798.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. arXiv, 1803.05567.

Soren Hauberg, Oren Freifeld, Anders Boesen Lindbo Larsen, John W. Fisher, and Lars Kai Hansen. 2016. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In AISTATS.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016a. Dual learning for machine translation. In Conference on Advances in Neural Information Processing Systems (NIPS).

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016b. Improved neural machine translation with smt features. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI), pages 151–157.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Conference of the Association for Computational Linguistics (ACL).

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Kenji Imamura, Atsushi Fujita, and Eiichiro Sumita. 2018. Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 55–63.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL), 5:339–351.

Lukasz Kaiser, Aidan N. Gomez, and François Chollet. 2017. Depthwise separable convolutions for neural machine translation. CoRR, abs/1706.03059.
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Alina Karakanta, Jon Dehdari, and Josef van Genabith. 2017. Neural machine translation for low-resource languages without parallel corpora. Machine Translation, pages 1–23.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).

Philipp Koehn. 2010. Statistical machine translation. Cambridge University Press.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo Session.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Workshop on Statistical Machine Translation (WMT).

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. arXiv, 1804.07755.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 system demonstrations, pages 25–30. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural machine translation with synthetic parallel data. arXiv preprint arXiv:1805.11213.

Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018a. Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3956–3965.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018b. Scaling neural machine translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Conference of the Association for Computational Linguistics (ACL).

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In International Conference on Learning Representations (ICLR) Workshop.

Luis Perez and Jason Wang. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv, 1712.04621.

Alberto Poncelas, Dimitar Sht. Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. arXiv, 1804.06189.

Matt Post. 2018. A call for clarity in reporting bleu scores. arXiv, 1804.08771.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Conference of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Conference of the Association for Computational Linguistics (ACL).

Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI).

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proc. of NAACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Conference on Advances in Neural Information Processing Systems (NIPS).
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. arXiv preprint arXiv:1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Advances in Neural Information Processing Systems (NIPS).

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML).

Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. 2017. Dual supervised learning. In International Conference on Machine Learning (ICML).

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).