Understanding Back-Translation at Scale
… PyTorch using the fairseq toolkit.¹ All experiments are based on the Big Transformer architecture with 6 blocks in the encoder and decoder. We use the same hyper-parameters for all experiments, i.e., word representations of size 1024 and feed-forward layers with inner dimension 4096. Dropout is set to 0.3 for En-De and 0.1 for En-Fr; we use 16 attention heads, and we average the checkpoints of the last ten epochs. Models are optimized with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e−8, and we use the same learning rate schedule as Vaswani et al. (2017). All models use label smoothing with a uniform prior distribution over the vocabulary and ε = 0.1 (Szegedy et al., 2015; Pereyra et al., 2017). We run experiments on DGX-1 machines with 8 Nvidia V100 GPUs; machines are interconnected by Infiniband. Experiments are run on 16 machines and we perform 30K synchronous updates. We also use the NCCL2 library and the torch distributed package for inter-GPU communication. We train models with 16-bit floating point operations, following Ott et al. (2018b). For final evaluation, we generate translations with a beam of size 5 and with no length penalty.

¹ Code available at https://github.com/pytorch/fairseq
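For concreteness, the label smoothing used here can be written out. This is the standard formulation of Szegedy et al. (2015), stated here as a reading aid rather than a fairseq-specific detail: with vocabulary size V, smoothing parameter ε = 0.1, gold token y, and model distribution p, the per-position target distribution and loss are

\[
q(k) \;=\; (1-\epsilon)\,\mathbb{1}[k=y] \;+\; \frac{\epsilon}{V},
\qquad
\mathcal{L} \;=\; -\sum_{k=1}^{V} q(k)\,\log p(k),
\]

i.e., cross-entropy against a mixture of the one-hot target and a uniform prior over the vocabulary.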
5 Results

Our evaluation first compares the accuracy of back-translation generation methods (§5.1) and analyzes the results (§5.2). Next, we simulate a low-resource setup to experiment further with different generation methods (§5.3). We also compare synthetic bitext to genuine parallel data and examine domain effects arising in back-translation (§5.4), and we measure the effect of upsampling bitext during training (§5.5). Finally, we scale to a very large setup of up to 226M monolingual sentences and compare to previous research (§5.6).

5.1 Synthetic data generation methods

We first investigate different methods to generate synthetic source translations given a back-translation model, i.e., a model trained in the reverse language direction (Section 3). We consider two types of MAP prediction: greedy search (greedy) and beam search with beam size 5 (beam). Non-MAP methods include unrestricted sampling from the model distribution (sampling), restricting sampling to the k highest scoring outputs at every time step with k = 10 (top10), and adding noise to the beam outputs (beam+noise). Restricted sampling is a middle ground between beam search and unrestricted sampling: it is less likely to pick very low scoring outputs but still preserves some randomness. Preliminary experiments with top5, top20, and top50 gave results similar to top10.
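To make the distinction concrete, below is a minimal sketch of the per-step decision each single-step method makes inside a decoding loop. The function and variable names are ours, not fairseq's:

    import torch

    def next_token(logits: torch.Tensor, method: str = "top10", k: int = 10) -> torch.Tensor:
        """Pick the next token for each sentence in the batch.

        logits: (batch, vocab) unnormalized model scores for the next position.
        Returns: (batch,) chosen token ids.
        """
        if method == "greedy":  # MAP prediction with beam size 1
            return logits.argmax(dim=-1)
        if method == "sampling":  # unrestricted sampling from the model distribution
            probs = torch.softmax(logits, dim=-1)
            return torch.multinomial(probs, num_samples=1).squeeze(-1)
        if method == "top10":  # restricted sampling over the k best tokens
            top_logits, top_idx = logits.topk(k, dim=-1)        # (batch, k)
            probs = torch.softmax(top_logits, dim=-1)           # renormalize over the k best
            choice = torch.multinomial(probs, num_samples=1)    # (batch, 1)
            return top_idx.gather(-1, choice).squeeze(-1)       # map back to vocab ids
        raise ValueError(method)

Beam search and beam+noise operate on whole hypotheses rather than a single step, so they are not shown here; beam+noise additionally perturbs the finished beam output (see the noising sketch after Table 3).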
We also vary the amount of synthetic data and perform 30K updates during training for the bitext only, 50K updates when adding 3M synthetic sentences, 75K updates for 6M and 12M sentences, and 100K updates for 24M sentences. For each setting, this corresponds to enough updates to reach convergence in terms of held-out loss. In our 128 GPU setup, training of the final models takes 3h 20min for the bitext-only model, 7h 30min for 6M and 12M synthetic sentences, and 10h 15min for 24M sentences. During training we also sample the bitext more frequently than the synthetic data, and we analyze the effect of this in more detail in §5.5.

Figure 1: Accuracy of models trained on different amounts of back-translated data (5M-29M total training data; BLEU on newstest2012) obtained with greedy search, beam search (k = 5), randomly sampling from the model distribution, restricting sampling over the ten most likely words (top10), and by adding noise to the beam outputs (beam+noise). Results based on newstest2012 of WMT English-German translation.

Figure 1 shows that sampling and beam+noise outperform the MAP methods (pure beam search and greedy) by 0.8-1.1 BLEU. Sampling and beam+noise improve over bitext-only (5M) by between 1.7-2 BLEU in the largest data setting. Restricted sampling (top10) performs better than beam and greedy but is not as effective as unrestricted sampling (sampling) or beam+noise.

Table 1 shows results on a wider range of test sets (newstest2013-2017). Sampling and beam+noise perform roughly equal and we adopt sampling for the remaining experiments.
news2013 news2014 news2015 news2016 news2017 Average
bitext 27.84 30.88 31.82 34.98 29.46 31.00
+ beam 27.82 32.33 32.20 35.43 31.11 31.78
+ greedy 27.67 32.55 32.57 35.74 31.25 31.96
+ top10 28.25 33.94 34.00 36.45 32.08 32.94
+ sampling 28.81 34.46 34.87 37.08 32.35 33.51
+ beam+noise 29.28 33.53 33.79 37.89 32.66 33.43
Table 1: Tokenized BLEU on various test sets of WMT English-German when adding 24M synthetic
sentence pairs obtained by various generation methods to a 5.2M sentence-pair bitext (cf. Figure 1).
Figure 2: Training perplexity (PPL) per epoch (1-100) for different synthetic data. We separately report PPL on the synthetic data and the bitext. Bitext PPL is averaged over all generation methods.

5.2 Analysis of generation methods

The previous experiment showed that synthetic source sentences generated via sampling and beam with noise perform significantly better than those obtained by pure MAP methods. Why is this?

Beam search focuses on very likely outputs, which reduces the diversity and richness of the generated source translations. Adding noise to beam outputs and sampling do not have this problem: noisy source sentences make it harder to predict the target translations, which may help learning, similar to denoising autoencoders (Vincent et al., 2008). Sampling is known to better approximate the data distribution, which is richer than the argmax model outputs (Ott et al., 2018a). Therefore, sampling is also more likely to provide a richer training signal than argmax sequences.

To get a better sense of the training signal provided by each method, we compare the loss on the training data for each method. We report the cross-entropy loss averaged over all tokens and separate the loss over the synthetic data and the real bitext data. Specifically, we choose the setup with 24M synthetic sentences. At the end of each epoch we measure the loss over 500K sentence pairs sub-sampled from the synthetic data as well as an equally sized subset of the bitext. For each generation method we choose the same sentences, except for the bitext, which is disjoint from the synthetic data. This means that losses over the synthetic data are measured over the same target tokens because the generation methods only differ in the source sentences. We found it helpful to upsample the frequency with which we observe the bitext compared to the synthetic data (§5.5), but we do not upsample for this experiment to keep conditions comparable.
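Measured this way, the comparison reduces to an average of token-level cross-entropy over a fixed sample of sentence pairs. A minimal sketch, assuming a generic seq2seq interface model(src, tgt_in) -> logits rather than fairseq's actual API:

    import torch
    import torch.nn.functional as F

    def avg_token_cross_entropy(model, batches, pad_id: int) -> float:
        """Average per-token cross-entropy over a fixed sample of sentence pairs.

        batches yields (src, tgt_in, tgt_out) tensors, where tgt_out is the
        target shifted by one position and padded with pad_id.
        """
        total_loss, total_tokens = 0.0, 0
        model.eval()
        with torch.no_grad():
            for src, tgt_in, tgt_out in batches:
                logits = model(src, tgt_in)  # (batch, time, vocab)
                loss = F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),
                    tgt_out.reshape(-1),
                    ignore_index=pad_id,     # exclude padding tokens
                    reduction="sum",
                )
                total_loss += loss.item()
                total_tokens += (tgt_out != pad_id).sum().item()
        return total_loss / total_tokens

Running this separately on the 500K-pair synthetic subset and the equally sized bitext subset yields the per-method curves of Figure 2 (perplexity is the exponential of this average).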
source: Diese gegensätzlichen Auffassungen von Fairness liegen nicht nur der politischen Debatte zugrunde.
reference: These competing principles of fairness underlie not only the political debate.
beam: These conflicting interpretations of fairness are not solely based on the political debate.
sample: Mr President, these contradictory interpretations of fairness are not based solely on the political debate.
top10: Those conflicting interpretations of fairness are not solely at the heart of the political debate.
beam+noise: conflicting BLANK interpretations BLANK are of not BLANK based on the political debate.

Table 3: Example where sampling produces inadequate outputs. "Mr President," is not in the source. BLANK means that a word has been replaced by a filler token.
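The beam+noise row above is produced by perturbing the beam output. The exact noising scheme is defined in Section 3, which is not reproduced here; the sketch below assumes the three perturbations commonly used for this kind of noising (random word deletion, replacement with a filler token, and a small local shuffle), with illustrative probabilities:

    import random

    BLANK = "BLANK"  # filler token, named after the placeholder shown in Table 3

    def noise_sentence(tokens, p_drop=0.1, p_blank=0.1, max_shuffle_dist=3, rng=random):
        """Perturb a tokenized sentence: drop words, blank out words, shuffle locally."""
        # 1) randomly delete words
        out = [t for t in tokens if rng.random() >= p_drop]
        # 2) randomly replace words with the filler token
        out = [BLANK if rng.random() < p_blank else t for t in out]
        # 3) local shuffle: jitter each position by a uniform offset and re-sort,
        #    so a word moves at most about max_shuffle_dist positions
        keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(out))]
        return [t for _, t in sorted(zip(keys, out), key=lambda p: p[0])]

    print(" ".join(noise_sentence("these conflicting interpretations of fairness".split())))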
• the back-translated target side of the remaining parallel data (BT-bitext),

• back-translated newscrawl data (BT-news).

The back-translated data is generated via sampling. This setup allows us to compare synthetic data to genuine data since BT-bitext and bitext share the same target side.

[Figure 3 (plot): BLEU (roughly 14-22) versus total training data (160K up to 29M) for beam and sampling back-translation from 80K, 640K, and 5M bitext systems; caption not recovered.]

Figure 4: Accuracy on (a) newstest2012 and (b) a mixed domain valid set (valid-mixed) when growing a 640K bitext corpus with (i) real parallel data (bitext), (ii) a back-translated version of the target side of the bitext (BT-bitext), or (iii) back-translated newscrawl data (BT-news). Both panels plot BLEU against the amount of data (640K-5.19M).
… with SET ∈ {wmt13, wmt14/full, wmt15}

3 sacreBLEU signatures: BLEU+case.mixed+lang.en-LANG+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.7 with LANG ∈ {de,fr}

4 https://www.deepl.com/press.html

5 BLEU+case.lc+lang.en-de+numrefs.1+smooth.exp+test.SET+tok.13a+version.1.2.11 with SET ∈ {wmt17, wmt18}
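The signatures above pin down the sacreBLEU settings (Post, 2018). For readers reproducing the scores, a sketch using the sacrebleu Python API; the keyword arguments match sacreBLEU 1.x, and the file names are placeholders:

    import sacrebleu

    # one detokenized sentence per line (paths are placeholders)
    hyps = [line.rstrip("\n") for line in open("hyps.detok.txt", encoding="utf-8")]
    refs = [line.rstrip("\n") for line in open("refs.detok.txt", encoding="utf-8")]

    # case-mixed BLEU with 13a tokenization, as in footnote 3
    print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a").score)

    # lowercased BLEU, as in footnote 5
    print(sacrebleu.corpus_bleu(hyps, [refs], tokenize="13a", lowercase=True).score)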
7 Conclusions and future work

Back-translation is a very effective data augmentation technique for neural machine translation. Generating synthetic sources by sampling or by adding noise to beam outputs leads to higher accuracy than argmax inference, which is typically used. In particular, sampling and noised beam outperform pure beam by 1.7 BLEU on average on newstest2013-2017 for WMT English-German translation. Both methods provide a richer training signal for all but resource-poor setups. We also find that synthetic data can achieve up to 83% of the performance attainable with real bitext. Finally, we achieve a new state-of-the-art result of 35 BLEU on the WMT'14 English-German test set by using publicly available benchmark data only.

In future work, we would like to investigate an end-to-end approach where the back-translation model is optimized to output synthetic sources that are most helpful to the final forward model.
References

Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. 2017. Weighted transformer network for machine translation. arXiv, 1711.02132.

Antreas Antoniou, Amos J. Storkey, and Harrison Edwards. 2017. Data augmentation generative adversarial networks. arXiv, 1711.04340.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Workshop on Statistical Machine Translation (WMT).

Ondřej Bojar and Aleš Tamchyna. 2011. Improving translation model by monolingual data. In Workshop on Statistical Machine Translation (WMT).

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz Josef Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Conference on Natural Language Learning (CoNLL).

Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16:79–85.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Conference of the Association for Computational Linguistics (ACL).

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ryan Cotterell and Julia Kreutzer. 2018. Explaining and generalizing back-translation through wake-sleep. arXiv, 1806.04402.

Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In Proc. of WMT.

Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural machine translation through multi-task learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Angela Fan, Yann Dauphin, and Mike Lewis. 2018. Hierarchical neural story generation. In Conference of the Association for Computational Linguistics (ACL).

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML).

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv, 1308.0850.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. arXiv, 1802.05368.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv, 1503.03535.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into neural machine translation. Computer Speech & Language, 45:137–148.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv, 1611.04798.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv, 1803.05567.

Soren Hauberg, Oren Freifeld, Anders Boesen Lindbo Larsen, John W. Fisher, and Lars Kai Hansen. 2016. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In AISTATS.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016a. Dual learning for machine translation. In Conference on Advances in Neural Information Processing Systems (NIPS).

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016b. Improved neural machine translation with SMT features. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI), pages 151–157.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Conference of the Association for Computational Linguistics (ACL).

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Kenji Imamura, Atsushi Fujita, and Eiichiro Sumita. 2018. Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 55–63.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL), 5:339–351.

Lukasz Kaiser, Aidan N. Gomez, and François Chollet. 2017. Depthwise separable convolutions for neural machine translation. arXiv, 1706.03059.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv, 1610.10099.

Alina Karakanta, Jon Dehdari, and Josef van Genabith. 2017. Neural machine translation for low-resource languages without parallel corpora. Machine Translation, pages 1–23.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Philipp Koehn. 2010. Statistical machine translation. Cambridge University Press.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo Session.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Workshop on Statistical Machine Translation (WMT).

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. arXiv, 1804.07755.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural machine translation with synthetic parallel data. arXiv, 1805.11213.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018a. Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3956–3965.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018b. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Conference of the Association for Computational Linguistics (ACL).

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In International Conference on Learning Representations (ICLR) Workshop.

Luis Perez and Jason Wang. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv, 1712.04621.

Alberto Poncelas, Dimitar Sht. Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. arXiv, 1804.06189.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv, 1804.08771.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Conference of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Conference of the Association for Computational Linguistics (ACL).

Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI).

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proc. of NAACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Conference on Advances in Neural Information Processing Systems (NIPS).

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception architecture for computer vision. arXiv, 1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Advances in Neural Information Processing Systems (NIPS).

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML).

Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. 2017. Dual supervised learning. In International Conference on Machine Learning (ICML).

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).