
Understanding Back-Translation at Scale

Sergey Edunov△  Myle Ott△  Michael Auli△  David Grangier▽∗

△Facebook AI Research, Menlo Park, CA & New York, NY
▽Google Brain, Mountain View, CA
∗Work done while at Facebook AI Research.

arXiv:1808.09381v2 [cs.CL] 3 Oct 2018

Abstract

An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampled or noised synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare synthetic data to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT'14 English-German test set.
1 Introduction

Machine translation relies on the statistics of large parallel corpora, i.e. datasets of paired sentences in both the source and target language. However, bitext is limited and there is a much larger amount of monolingual data available. Monolingual data has traditionally been used to train language models, which improved the fluency of statistical machine translation (Koehn, 2010).

In the context of neural machine translation (NMT; Bahdanau et al. 2015; Gehring et al. 2017; Vaswani et al. 2017), there has been extensive work to improve models with monolingual data, including language model fusion (Gulcehre et al., 2015, 2017), back-translation (Sennrich et al., 2016a) and dual learning (Cheng et al., 2016; He et al., 2016a). These methods have different advantages and can be combined to reach high accuracy (Hassan et al., 2018).

We focus on back-translation (BT), which operates in a semi-supervised setup where both bilingual and monolingual data in the target language are available. Back-translation first trains an intermediate system on the parallel data, which is then used to translate the target monolingual data into the source language. The result is a parallel corpus where the source side is synthetic machine translation output while the target is genuine text written by humans. The synthetic parallel corpus is simply added to the real bitext in order to train a final system that translates from the source to the target language. Although simple, this method has been shown to be helpful for phrase-based translation (Bojar and Tamchyna, 2011), NMT (Sennrich et al., 2016a; Poncelas et al., 2018) as well as unsupervised MT (Lample et al., 2018a).

In this paper, we investigate back-translation for neural machine translation at a large scale by adding hundreds of millions of back-translated sentences to the bitext. Our experiments are based on strong baseline models trained on the public bitext of the WMT competition. We extend the previous analyses (Sennrich et al., 2016a; Poncelas et al., 2018) of back-translation in several ways. We provide a comprehensive analysis of different methods to generate synthetic source sentences and show that this choice matters: sampling from the model distribution or noising beam outputs outperforms pure beam search, which is typically used, by 1.7 BLEU on average across several test sets. Our analysis shows that synthetic data based on sampling and noised beam search provides a stronger training signal than synthetic data based on argmax inference. We also study how adding synthetic data compares to adding real bitext in a controlled setup, with the surprising finding that synthetic data can sometimes match the accuracy of real bitext. Our best setup achieves 35 BLEU on the WMT'14 English-German test set by relying only on public WMT bitext as well as 226M monolingual sentences. This outperforms the system of DeepL, which trains on large amounts of high-quality non-benchmark data, by 1.7 BLEU. On WMT'14 English-French we achieve 45.6 BLEU.
2 Related work

This section describes prior work in machine translation with neural networks as well as semi-supervised machine translation.

2.1 Neural machine translation

We build upon recent work on neural machine translation, which is typically a neural network with an encoder/decoder architecture. The encoder infers a continuous space representation of the source sentence, while the decoder is a neural language model conditioned on the encoder output. The parameters of both models are learned jointly to maximize the likelihood of the target sentences given the corresponding source sentences from a parallel corpus (Sutskever et al., 2014; Cho et al., 2014). At inference, a target sentence is generated by left-to-right decoding.

Different neural architectures have been proposed with the goal of improving efficiency and/or effectiveness. This includes recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), convolutional networks (Kalchbrenner et al., 2016; Gehring et al., 2017; Kaiser et al., 2017) and transformer networks (Vaswani et al., 2017). Recent work relies on attention mechanisms where the encoder produces a sequence of vectors and, for each target token, the decoder attends to the most relevant part of the source through a context-dependent weighted sum of the encoder vectors (Bahdanau et al., 2015; Luong et al., 2015). Attention has been refined with multi-hop attention (Gehring et al., 2017), self-attention (Vaswani et al., 2017; Paulus et al., 2018) and multi-head attention (Vaswani et al., 2017). We use a transformer architecture (Vaswani et al., 2017).

2.2 Semi-supervised NMT

Monolingual target data has been used to improve the fluency of machine translations since the early IBM models (Brown et al., 1990). In phrase-based systems, language models (LM) in the target language increase the score of fluent outputs during decoding (Koehn et al., 2003; Brants et al., 2007). A similar strategy can be applied to NMT (He et al., 2016b). Besides improving accuracy during decoding, neural LMs and NMT can benefit from deeper integration, e.g. by combining the hidden states of both models (Gulcehre et al., 2017). Neural architectures also allow multi-task learning and parameter sharing between MT and a target-side LM (Domhan and Hieber, 2017).

Back-translation (BT) is an alternative way to leverage monolingual data. BT is simple and easy to apply as it does not require modifications to the MT training algorithms. It requires training a target-to-source system in order to generate additional synthetic parallel data from the monolingual target data. This data complements human bitext to train the desired source-to-target system. BT has been applied earlier to phrase-based systems (Bojar and Tamchyna, 2011). For these systems, BT has also been successful in leveraging monolingual data for domain adaptation (Bertoldi and Federico, 2009; Lambert et al., 2011). Recently, BT has been shown beneficial for NMT (Sennrich et al., 2016a; Poncelas et al., 2018). It has been found to be particularly useful when parallel data is scarce (Karakanta et al., 2017).

Currey et al. (2017) show that low-resource language pairs can also be improved with synthetic data where the source is simply a copy of the monolingual target data. Concurrently to our work, Imamura et al. (2018) show that sampling synthetic sources is more effective than beam search. Specifically, they sample multiple sources for each target whereas we draw only a single sample, opting to train on a larger number of target sentences instead. Hoang et al. (2018) and Cotterell and Kreutzer (2018) suggest an iterative procedure which continuously improves the quality of the back-translation and final systems. Niu et al. (2018) experiment with a multilingual model that does both the forward and backward translation and which is continuously trained with new synthetic data.

There has also been work using source-side monolingual data (Zhang and Zong, 2016). Furthermore, Cheng et al. (2016), He et al. (2016a) and Xia et al. (2017) show how monolingual text from both languages can be leveraged by extending back-translation to dual learning: when training both source-to-target and target-to-source models jointly, one can use back-translation in both directions and perform multiple rounds of BT. A similar idea is applied in unsupervised NMT (Lample et al., 2018a,b). Besides monolingual data, various approaches have been introduced to benefit from parallel data in other language pairs (Johnson et al., 2017; Firat et al., 2016a,b; Ha et al., 2016; Gu et al., 2018).

Data augmentation is an established technique in computer vision where a labeled dataset is supplemented with cropped or rotated input images. Recently, generative adversarial networks (GANs) have been successfully used to the same end (Antoniou et al., 2017; Perez and Wang, 2017), as have models that learn distributions over image transformations (Hauberg et al., 2016).
3 Generating synthetic sources

Back-translation typically uses beam search (Sennrich et al., 2016a) or just greedy search (Lample et al., 2018a,b) to generate synthetic source sentences. Both are approximate algorithms for identifying the maximum a-posteriori (MAP) output, i.e. the sentence with the largest estimated probability given an input. Beam search is generally successful in finding high probability outputs (Ott et al., 2018a).

However, MAP prediction can lead to less rich translations (Ott et al., 2018a) since it always favors the most likely alternative in case of ambiguity. This is particularly problematic in tasks where there is a high level of uncertainty such as dialog (Serban et al., 2016) and story generation (Fan et al., 2018). We argue that this is also problematic for a data augmentation scheme such as back-translation. Beam and greedy search focus on the head of the model distribution, which results in very regular synthetic source sentences that do not properly cover the true data distribution.

As an alternative, we consider sampling from the model distribution as well as adding noise to beam search outputs. First, we explore unrestricted sampling, which generates outputs that are very diverse but sometimes highly unlikely. Second, we investigate sampling restricted to the most likely words (Graves, 2013; Ott et al., 2018a; Fan et al., 2018). At each time step, we select the k most likely tokens from the output distribution, re-normalize and then sample from this restricted set. This is a middle ground between MAP and unrestricted sampling; a sketch of this decoding step is shown below.
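To make the restricted sampling step concrete, here is a minimal PyTorch sketch of one decoding step. This is our illustration rather than the fairseq implementation; `logits` is assumed to hold the decoder scores for the current time step.

    import torch

    def sample_top_k(logits: torch.Tensor, k: int = 10) -> torch.Tensor:
        """Sample the next token from the k most likely candidates.

        logits: (batch, vocab) unnormalized decoder scores for one time step.
        Returns a (batch,) tensor of sampled token indices.
        """
        topk_scores, topk_idx = logits.topk(k, dim=-1)    # keep the k best tokens
        probs = torch.softmax(topk_scores, dim=-1)        # re-normalize over this set
        choice = torch.multinomial(probs, num_samples=1)  # draw one token per sentence
        return topk_idx.gather(-1, choice).squeeze(-1)    # map back to vocabulary ids

Setting k = 1 recovers greedy search, while k equal to the vocabulary size recovers unrestricted sampling.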
As a third alternative, we apply the noising scheme of Lample et al. (2018a) to beam search outputs. Adding noise to input sentences has been very beneficial for the autoencoder setups of Lample et al. (2018a) and Hill et al. (2016), which are inspired by denoising autoencoders (Vincent et al., 2008). In particular, we transform source sentences with three types of noise: deleting words with probability 0.1, replacing words by a filler token with probability 0.1, and swapping words, which is implemented as a random permutation over the tokens, drawn from the uniform distribution but restricted to swapping words no further than three positions apart.
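The following is a minimal sketch of this noising scheme. The helper name and FILLER constant are our own; the swap follows the construction of Lample et al. (2018a), perturbing each position with uniform noise and re-sorting so that no word moves more than max_swap positions.

    import random

    FILLER = "<BLANK>"  # filler token for word replacement

    def add_noise(tokens, p_drop=0.1, p_blank=0.1, max_swap=3):
        """Noise a tokenized source sentence: delete, blank out and swap words."""
        # 1) delete each word with probability p_drop
        tokens = [t for t in tokens if random.random() >= p_drop]
        # 2) replace each word by a filler token with probability p_blank
        tokens = [FILLER if random.random() < p_blank else t for t in tokens]
        # 3) random permutation from perturbed positions; displacement <= max_swap
        keys = [i + random.uniform(0, max_swap) for i in range(len(tokens))]
        return [t for _, t in sorted(zip(keys, tokens), key=lambda p: p[0])]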
4 Experimental setup

4.1 Datasets

The majority of our experiments are based on data from the WMT'18 English-German news translation task. We train on all available bitext excluding the ParaCrawl corpus and remove sentences longer than 250 words as well as sentence-pairs with a source/target length ratio exceeding 1.5 (sketched at the end of this subsection). This results in 5.18M sentence pairs. For the back-translation experiments we use the German monolingual newscrawl data distributed with WMT'18, comprising 226M sentences after removing duplicates. We tokenize all data with the Moses tokenizer (Koehn et al., 2007) and learn a joint source and target Byte-Pair-Encoding (BPE; Sennrich et al., 2016b) with 35K types. We develop on newstest2012 and report final results on newstest2013-2017; additionally we consider a held-out set of 52K sentence-pairs from the training data.

We also experiment on the larger WMT'14 English-French task, which we filter in the same way as WMT'18 English-German. This results in 35.7M sentence-pairs for training and we learn a joint BPE vocabulary of 44K types. As monolingual data we use newscrawl2010-2014, comprising 31M sentences after language identification (Lui and Baldwin, 2012). We use newstest2012 as development set and report final results on newstest2013-2015.

The majority of results in this paper are in terms of case-sensitive tokenized BLEU (Papineni et al., 2002) but we also report test accuracy with detokenized BLEU using sacreBLEU (Post, 2018).
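Concretely, the bitext filtering above amounts to two checks per sentence pair. This is a hedged sketch (helper name ours; we assume whitespace tokenization and that the length cap applies to either side, which may differ from the exact filter used):

    def keep_pair(src: str, tgt: str, max_len: int = 250, max_ratio: float = 1.5) -> bool:
        """Return True if a bitext pair passes the length and ratio filters."""
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            return False
        if n_src > max_len or n_tgt > max_len:
            return False  # drop sentences longer than max_len words
        return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio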
4.2 Model and hyperparameters

We re-implemented the Transformer model in PyTorch using the fairseq toolkit.1 All experiments are based on the Big Transformer architecture with 6 blocks in the encoder and decoder. We use the same hyper-parameters for all experiments, i.e., word representations of size 1024 and feed-forward layers with inner dimension 4096. Dropout is set to 0.3 for En-De and 0.1 for En-Fr, we use 16 attention heads, and we average the checkpoints of the last ten epochs. Models are optimized with Adam (Kingma and Ba, 2015) using β1 = 0.9, β2 = 0.98 and ε = 1e−8, and we use the same learning rate schedule as Vaswani et al. (2017). All models use label smoothing with a uniform prior distribution over the vocabulary and ε = 0.1 (Szegedy et al., 2015; Pereyra et al., 2017). We run experiments on DGX-1 machines with 8 Nvidia V100 GPUs; machines are interconnected by Infiniband. Experiments are run on 16 machines and we perform 30K synchronous updates. We also use the NCCL2 library and the torch.distributed package for inter-GPU communication. We train models with 16-bit floating point operations, following Ott et al. (2018b). For final evaluation, we generate translations with a beam of size 5 and with no length penalty.

1 Code available at https://github.com/pytorch/fairseq
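As an illustration of the label smoothing objective with a uniform prior over the vocabulary (a sketch, not the fairseq implementation):

    import torch
    import torch.nn.functional as F

    def label_smoothed_loss(logits: torch.Tensor, target: torch.Tensor,
                            eps: float = 0.1) -> torch.Tensor:
        """Cross-entropy with eps of the probability mass spread uniformly.

        logits: (batch, vocab) decoder outputs; target: (batch,) gold token ids.
        """
        lprobs = F.log_softmax(logits, dim=-1)
        nll = -lprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # gold token term
        uniform = -lprobs.mean(dim=-1)                              # uniform prior term
        return ((1.0 - eps) * nll + eps * uniform).mean()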
5 Results

Our evaluation first compares the accuracy of back-translation generation methods (§5.1) and analyzes the results (§5.2). Next, we simulate a low-resource setup to experiment further with different generation methods (§5.3). We then compare synthetic bitext to genuine parallel data and examine domain effects arising in back-translation (§5.4), and we measure the effect of upsampling bitext during training (§5.5). Finally, we scale to a very large setup of up to 226M monolingual sentences and compare to previous research (§5.6).

5.1 Synthetic data generation methods

We first investigate different methods to generate synthetic source translations given a back-translation model, i.e., a model trained in the reverse language direction (Section 3). We consider two types of MAP prediction: greedy search (greedy) and beam search with beam size 5 (beam). Non-MAP methods include unrestricted sampling from the model distribution (sampling), restricting sampling to the k highest scoring outputs at every time step with k = 10 (top10), as well as adding noise to the beam outputs (beam+noise). Restricted sampling is a middle ground between beam search and unrestricted sampling: it is less likely to pick very low scoring outputs but still preserves some randomness. Preliminary experiments with top5, top20 and top50 gave similar results to top10.

We also vary the amount of synthetic data and perform 30K updates during training for the bitext only, 50K updates when adding 3M synthetic sentences, 75K updates for 6M and 12M sentences, and 100K updates for 24M sentences. For each setting, this corresponds to enough updates to reach convergence in terms of held-out loss. In our 128 GPU setup, training of the final models takes 3h 20min for the bitext-only model, 7h 30min for 6M and 12M synthetic sentences, and 10h 15min for 24M sentences. During training we also sample the bitext more frequently than the synthetic data; we analyze the effect of this in more detail in §5.5.

[Figure 1: Accuracy of models trained on different amounts of back-translated data obtained with greedy search, beam search (k = 5), randomly sampling from the model distribution, restricting sampling to the ten most likely words (top10), and by adding noise to the beam outputs (beam+noise). Results based on newstest2012 of WMT English-German translation.]

Figure 1 shows that sampling and beam+noise outperform the MAP methods (pure beam search and greedy) by 0.8-1.1 BLEU. Sampling and beam+noise improve over bitext-only (5M) by 1.7-2 BLEU in the largest data setting. Restricted sampling (top10) performs better than beam and greedy but is not as effective as unrestricted sampling (sampling) or beam+noise.

Table 1 shows results on a wider range of test sets (newstest2013-2017). Sampling and beam+noise perform roughly equally and we adopt sampling for the remaining experiments.
              news2013  news2014  news2015  news2016  news2017  Average
bitext           27.84     30.88     31.82     34.98     29.46    31.00
+ beam           27.82     32.33     32.20     35.43     31.11    31.78
+ greedy         27.67     32.55     32.57     35.74     31.25    31.96
+ top10          28.25     33.94     34.00     36.45     32.08    32.94
+ sampling       28.81     34.46     34.87     37.08     32.35    33.51
+ beam+noise     29.28     33.53     33.79     37.89     32.66    33.43

Table 1: Tokenized BLEU on various test sets of WMT English-German when adding 24M synthetic sentence pairs obtained by various generation methods to a 5.2M sentence-pair bitext (cf. Figure 1).

[Figure 2: Training perplexity (PPL) per epoch for different synthetic data. We separately report PPL on the synthetic data and the bitext. Bitext PPL is averaged over all generation methods.]

              Perplexity
human data         75.34
beam               72.42
sampling          500.17
top10              87.15
beam+noise       2823.73

Table 2: Perplexity of source data as assigned by a language model (5-gram Kneser-Ney). Data generated by beam search is most predictable.
5.2 Analysis of generation methods

The previous experiment showed that synthetic source sentences generated via sampling and beam with noise perform significantly better than those obtained by pure MAP methods. Why is this?

Beam search focuses on very likely outputs, which reduces the diversity and richness of the generated source translations. Adding noise to beam outputs and sampling do not have this problem: noisy source sentences make it harder to predict the target translations, which may help learning, similar to denoising autoencoders (Vincent et al., 2008). Sampling is known to better approximate the data distribution, which is richer than the argmax model outputs (Ott et al., 2018a). Therefore, sampling is also more likely to provide a richer training signal than argmax sequences.

To get a better sense of the training signal provided by each method, we compare the loss on the training data for each method. We report the cross-entropy loss averaged over all tokens and separate the loss over the synthetic data from the loss over the real bitext data. Specifically, we choose the setup with 24M synthetic sentences. At the end of each epoch we measure the loss over 500K sentence pairs sub-sampled from the synthetic data as well as over an equally sized subset of the bitext. For each generation method we choose the same sentences, except for the bitext, which is disjoint from the synthetic data. This means that losses over the synthetic data are measured over the same target tokens because the generation methods only differ in the source sentences. We found it helpful to upsample the frequency with which we observe the bitext compared to the synthetic data (§5.5) but we do not upsample for this experiment to keep conditions as similar as possible. We assume that when the training loss is low, the model can easily fit the training data without extracting much learning signal, compared to data which is harder to fit.
source        Diese gegensätzlichen Auffassungen von Fairness liegen nicht nur der politischen Debatte zugrunde.
reference     These competing principles of fairness underlie not only the political debate.
beam          These conflicting interpretations of fairness are not solely based on the political debate.
sample        Mr President, these contradictory interpretations of fairness are not based solely on the political debate.
top10         Those conflicting interpretations of fairness are not solely at the heart of the political debate.
beam+noise    conflicting BLANK interpretations BLANK are of not BLANK based on the political debate.

Table 3: Example where sampling produces inadequate outputs. "Mr President," is not in the source. BLANK means that a word has been replaced by a filler token.

Figure 2 shows that synthetic data based on greedy or beam search is much easier to fit compared to data from sampling, top10, beam+noise and the bitext. In fact, the perplexity on beam data falls below 2 after only 5 epochs. Except for sampling, we find that the perplexity on the training data is somewhat correlated with the end-model accuracy (cf. Figure 1) and that all methods except sampling have a lower loss than the real bitext.

These results suggest that synthetic data obtained with argmax inference does not provide as rich a training signal as sampling or adding noise. We conjecture that the regularity of synthetic data obtained with argmax inference is not optimal. Sampling and noised argmax both expose the model to a wider range of source sentences, which makes the model more robust to reordering and substitutions that happen naturally, even if the model of reordering and substitution through noising is not very realistic.

Next, we analyze the richness of the synthetic outputs: we train a language model on real human text and score synthetic source sentences generated by beam search, sampling, top10 and beam+noise. We hypothesize that data that is very regular should be more predictable by the language model and therefore receive low perplexity. We eliminate a possible domain mismatch effect between the language model training data and the synthetic data by splitting the parallel corpus into three non-overlapping parts:

1. On 640K sentence pairs, we train a back-translation model,

2. On 4.1M sentence pairs, we take the source side and train a 5-gram Kneser-Ney language model (Heafield et al., 2013),

3. On the remaining 450K sentences, we apply the back-translation system using beam, sampling and top10 generation.

For the last set, we have genuine source sentences as well as synthetic sources from the different generation techniques. We report the perplexity of our language model on all versions of the source data in Table 2; a sketch of this scoring setup is shown below. The results show that beam outputs receive higher probability from the language model than sampling, beam+noise and real source sentences. This indicates that beam search outputs are not as rich as sampling outputs or beam+noise. This lack of variability probably explains in part why back-translations from pure beam search provide a weaker training signal than the alternatives.
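This scoring setup can be sketched with the KenLM Python bindings as follows; the model path is hypothetical and the corpus-level aggregation is our own illustration:

    import math
    import kenlm  # Python bindings shipped with KenLM

    lm = kenlm.Model("source.5gram.arpa")  # 5-gram Kneser-Ney LM (hypothetical path)

    def corpus_perplexity(sentences):
        """Corpus-level perplexity of the LM over tokenized sentences."""
        total_log10, n_tokens = 0.0, 0
        for sent in sentences:
            total_log10 += lm.score(sent, bos=True, eos=True)  # log10 probability
            n_tokens += len(sent.split()) + 1                  # +1 for </s>
        return 10.0 ** (-total_log10 / n_tokens)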
Closer inspection of the synthetic sources (Table 3) reveals that sampled and noised beam outputs are sometimes not very adequate, much more so than MAP outputs, e.g., sampling often introduces target words which have no counterpart in the source. This happens because sampling sometimes picks highly unlikely outputs which are harder to fit (cf. Figure 2).

5.3 Low resource vs. high resource setup

The experiments so far are based on a setup with a large bilingual corpus. However, in resource-poor settings the back-translation model is of much lower quality. Are non-MAP methods still more effective in such a setup? To answer this question, we simulate such setups by sub-sampling the training data to either 80K or 640K sentence-pairs and then add synthetic data from sampling and beam search. We compare these smaller setups to our original 5.2M sentence bitext configuration. The accuracy of the German-English back-translation systems steadily increases with more training data: on newstest2012 we measure 13.5 BLEU for 80K bitext, 24.3 BLEU for 640K and 28.3 BLEU for 5M.
[Figure 3: BLEU when adding synthetic data from beam and sampling to bitext systems with 80K, 640K and 5M sentence pairs.]

Figure 3 shows that sampling is more effective than beam for the larger setups (640K and 5.2M bitexts) while the opposite is true for the resource-poor setting (80K bitext). This is likely because the back-translations in the 80K setup are of very poor quality and the noise of sampling and beam+noise is too detrimental for this brittle low-resource setting. When the setup is very small, the very regular MAP outputs still provide a useful training signal while the noise from sampling becomes harmful.

5.4 Domain of synthetic data

Next, we turn to two different questions: How does real human bitext compare to synthetic data in terms of final model accuracy? And how does the domain of the monolingual data affect results?

To answer these questions, we subsample 640K sentence-pairs of the bitext and train a back-translation system on this set. To train a forward model, we consider three alternative types of data to add to this 640K training set. We either add:

• the remaining parallel data (bitext),

• the back-translated target side of the remaining parallel data (BT-bitext),

• back-translated newscrawl data (BT-news).

The back-translated data is generated via sampling. This setup allows us to compare synthetic data to genuine data since BT-bitext and bitext share the same target side. It also allows us to estimate the value of BT data for domain adaptation since the newscrawl corpus (BT-news) is pure news whereas the bitext is a mixture of europarl and commoncrawl with only a small news-commentary portion. To assess domain adaptation effects, we measure accuracy on two held-out sets:

• newstest2012, i.e. pure newswire data.

• a held-out set of the WMT training data (valid-mixed), which is a mixture of europarl, commoncrawl and the small news-commentary portion.

Figure 4 shows the results on both validation sets. Most strikingly, BT-news performs almost as well as bitext on newstest2012 (Figure 4a) and improves the baseline (640K) by 2.6 BLEU. BT-bitext improves by 2.2 BLEU, achieving 83% of the improvement obtained with real bitext. This shows that synthetic data can be nearly as effective as real human-translated data when the domains match.

Figure 4b shows the accuracy on valid-mixed, the mixed-domain validation set. The accuracy of BT-news is not as good as before since the domain of the BT data and the test set do not match. However, BT-news still improves the baseline by up to 1.2 BLEU. On the other hand, BT-bitext matches the domain of valid-mixed and improves by 2.7 BLEU. This trails the real bitext by only 1.3 BLEU and corresponds to 67% of the gain achieved with real human bitext.

In summary, synthetic data performs remarkably well, coming close to the improvements achieved with real bitext for newswire test data, or trailing real bitext by only 1.3 BLEU for valid-mixed. In the absence of a large parallel corpus for news, back-translation therefore offers a simple, yet very effective, domain adaptation technique.
[Figure 4: Accuracy on (a) newstest2012 and (b) a mixed domain valid set when growing a 640K bitext corpus with (i) real parallel data (bitext), (ii) a back-translated version of the target side of the bitext (BT-bitext), (iii) or back-translated newscrawl data (BT-news).]

5.5 Upsampling the bitext

We found it beneficial to adjust the ratio of bitext to synthetic data observed during training. In particular, we tuned the rate at which we sample data from the bitext compared to the synthetic data. For example, in a setup of 5M bitext sentences and 10M synthetic sentences, an upsampling rate of 2 means that we double the frequency at which we visit the bitext, i.e. training batches contain on average an equal amount of bitext and synthetic data, as opposed to 1/3 bitext and 2/3 synthetic data. A minimal sketch of such weighted sampling is shown below.
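The sketch below is our illustration, not the implementation used in our experiments; it samples each example from the two corpora with weights proportional to rate × |bitext| and |synthetic|:

    import random

    def upsampled_stream(bitext, synthetic, rate=2.0):
        """Yield examples so the bitext is visited `rate` times more often
        than its natural share of the combined corpus."""
        w_bi = rate * len(bitext)
        p_bi = w_bi / (w_bi + len(synthetic))
        while True:
            if random.random() < p_bi:
                yield random.choice(bitext)      # real pair, upsampled
            else:
                yield random.choice(synthetic)   # back-translated pair

With 5M bitext and 10M synthetic sentences, rate=2 gives p_bi = 0.5, i.e. batches contain on average equal amounts of real and synthetic data, matching the example above.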
Figure 5 shows the accuracy of various upsampling rates for different generation methods in a setup with 5M bitext sentences and 24M synthetic sentences. Beam and greedy benefit a lot from higher rates, which result in training more on the bitext data. This is likely because synthetic beam and greedy data does not provide as much training signal as the bitext, which has more variation and is harder to fit. On the other hand, sampling and beam+noise require no upsampling of the bitext, which is likely because the synthetic data is already hard enough to fit and thus provides a strong training signal (§5.2).

[Figure 5: Accuracy when changing the rate at which the bitext is upsampled during training. Rates larger than one mean that the bitext is observed more often than actually present in the combined bitext and synthetic training corpus.]
5.6 Large scale results

To confirm our findings we experiment on WMT'14 English-French translation, where we show results on newstest2013-2015. We augment the large bitext of 35.7M sentence pairs with 31M newscrawl sentences generated by sampling. To train this system we perform 300K training updates in 27h 40min on 128 GPUs; we do not upsample the bitext for this experiment. Table 4 shows tokenized BLEU and Table 5 shows detokenized BLEU.2 To our knowledge, our baseline is the best reported result in the literature for newstest2014, and back-translation further improves upon it by 2.6 BLEU (tokenized).

2 sacreBLEU signatures: BLEU+case.mixed+lang.en-fr+numrefs.1+smooth.exp+test.SET+tok.13a+version.1.2.7 with SET ∈ {wmt13, wmt14/full, wmt15}
             news13  news14  news15
bitext        36.97   42.90   39.92
+sampling     37.85   45.60   43.95

Table 4: Tokenized BLEU on various test sets for WMT English-French translation.

             news13  news14  news15
bitext        35.30   41.03   38.31
+sampling     36.13   43.84   40.91

Table 5: De-tokenized BLEU (sacreBLEU) on various test sets for WMT English-French.

Finally, for WMT English-German we train on all 226M available monolingual training sentences and perform 250K updates in 22.5 hours on 128 GPUs. We upsample the bitext with a rate of 16 so that we observe every bitext sentence 16 times more often than each monolingual sentence. This results in a new state of the art of 35 BLEU on newstest2014 by using only WMT benchmark data. For comparison, DeepL, a commercial translation engine relying on high-quality bilingual training data, achieves 33.3 tokenized BLEU.4 Table 6 summarizes our results and compares them to other work in the literature. This shows that back-translation with sampling can result in high-quality translation models based on benchmark data only.

                             En–De   En–Fr
a. Gehring et al. (2017)      25.2    40.5
b. Vaswani et al. (2017)      28.4    41.0
c. Ahmed et al. (2017)        28.9    41.4
d. Shaw et al. (2018)         29.2    41.5
DeepL                         33.3    45.9
Our result                    35.0    45.6
detok. sacreBLEU3             33.8    43.8

Table 6: BLEU on newstest2014 for WMT English-German (En–De) and English-French (En–Fr). The first four results use only WMT bitext (WMT'14, except for b, c, d in En–De which train on WMT'16). DeepL uses proprietary high-quality bitext and our result relies on back-translation with 226M newscrawl sentences for En–De and 31M for En–Fr. We also show detokenized BLEU (sacreBLEU).

3 sacreBLEU signatures: BLEU+case.mixed+lang.en-LANG+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.7 with LANG ∈ {de,fr}
4 https://www.deepl.com/press.html

6 Submission to WMT'18

This section describes our entry to the WMT'18 English-German news translation task, which was ranked #1 in the human evaluation (Bojar et al., 2018). Our entry is based on the WMT English-German models described in the previous section (§5.6). In particular, we ensembled six back-translation models trained on all available bitext plus 226M newscrawl sentences, or 5.8B German tokens. Four models used bitext upsample ratio 16, one model upsample ratio 32, and another one upsample ratio 8. Upsample ratios differed because we reused models previously trained to tune the upsample ratio. We did not use checkpoint averaging. More details of our setup and data are described in §4.

Ott et al. (2018a) showed that beam search sometimes outputs source copies rather than target language translations. We replaced source copies by the output of a model trained only on the news-commentary portion of the WMT'18 task (nc model); this model produced far fewer copies since its training data is less noisy. Outputs are deemed to be a source copy if the Jaccard similarity between the source and the target unigrams exceeds 0.5 (sketched below). About 0.5% of outputs are identified as source copies. We used newstest17 as a development set to fine-tune the ensemble size and model parameters. Table 7 summarizes the effect of back-translation data, ensembling and source copy filtering.5
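The copy test itself can be sketched as follows (our illustration of the unigram Jaccard criterion; function name ours):

    def is_source_copy(source: str, hypothesis: str, threshold: float = 0.5) -> bool:
        """Flag a hypothesis whose unigram Jaccard similarity with the
        source exceeds the threshold."""
        src, hyp = set(source.split()), set(hypothesis.split())
        if not src or not hyp:
            return False
        return len(src & hyp) / len(src | hyp) > threshold

Flagged outputs are then replaced by the corresponding translation from the nc model.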

               news17   news18
baseline        29.36    42.38
+BT             32.66    44.94
+ensemble       33.31    46.39
+filter copies  33.35    46.53

% of source copies: 0.56% (news17), 0.53% (news18)

Table 7: De-tokenized case-insensitive sacreBLEU on WMT English-German newstest17 and newstest18.

5 sacreBLEU signature: BLEU+case.lc+lang.en-de+numrefs.1+smooth.exp+test.SET+tok.13a+version.1.2.11 with SET ∈ {wmt17, wmt18}
7 Conclusions and future work

Back-translation is a very effective data augmentation technique for neural machine translation. Generating synthetic sources by sampling or by adding noise to beam outputs leads to higher accuracy than the argmax inference which is typically used. In particular, sampling and noised beam search outperform pure beam search by 1.7 BLEU on average on newstest2013-2017 for WMT English-German translation. Both methods provide a richer training signal for all but resource-poor setups. We also find that synthetic data can achieve up to 83% of the performance attainable with real bitext. Finally, we achieve a new state-of-the-art result of 35 BLEU on the WMT'14 English-German test set by using publicly available benchmark data only.

In future work, we would like to investigate an end-to-end approach where the back-translation model is optimized to output synthetic sources that are most helpful to the final forward model.
References

Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. 2017. Weighted transformer network for machine translation. arXiv, 1711.02132.

Antreas Antoniou, Amos J. Storkey, and Harrison Edwards. 2017. Data augmentation generative adversarial networks. arXiv, 1711.04340.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Workshop on Statistical Machine Translation (WMT).

Ondrej Bojar and Ales Tamchyna. 2011. Improving translation model by monolingual data. In Workshop on Statistical Machine Translation (WMT).

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz Josef Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Conference on Natural Language Learning (CoNLL).

Peter F. Brown, John Cocke, Stephen Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16:79–85.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Conference of the Association for Computational Linguistics (ACL).

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ryan Cotterell and Julia Kreutzer. 2018. Explaining and generalizing back-translation through wake-sleep. arXiv, 1806.04402.

Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In Proc. of WMT.

Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural machine translation through multi-task learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Angela Fan, Yann Dauphin, and Mike Lewis. 2018. Hierarchical neural story generation. In Conference of the Association for Computational Linguistics (ACL).

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference of Machine Learning (ICML).

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv, 1308.0850.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. arXiv, 1802.05368.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv, 1503.03535.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into neural machine translation. Computer Speech & Language, 45:137–148.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. arXiv, 1611.04798.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv, 1803.05567.

Soren Hauberg, Oren Freifeld, Anders Boesen Lindbo Larsen, John W. Fisher, and Lars Kai Hansen. 2016. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In AISTATS.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016a. Dual learning for machine translation. In Conference on Advances in Neural Information Processing Systems (NIPS).

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016b. Improved neural machine translation with SMT features. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI), pages 151–157.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Conference of the Association for Computational Linguistics (ACL).

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Kenji Imamura, Atsushi Fujita, and Eiichiro Sumita. 2018. Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 55–63.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL), 5:339–351.

Lukasz Kaiser, Aidan N. Gomez, and François Chollet. 2017. Depthwise separable convolutions for neural machine translation. arXiv, 1706.03059.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv, 1610.10099.

Alina Karakanta, Jon Dehdari, and Josef van Genabith. 2017. Neural machine translation for low-resource languages without parallel corpora. Machine Translation, pages 1–23.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Philipp Koehn. 2010. Statistical machine translation. Cambridge University Press.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo Session.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Workshop on Statistical Machine Translation (WMT).

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. arXiv, 1804.07755.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural machine translation with synthetic parallel data. arXiv, 1805.11213.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018a. Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3956–3965.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018b. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Conference of the Association for Computational Linguistics (ACL).

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In International Conference on Learning Representations (ICLR) Workshop.

Luis Perez and Jason Wang. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv, 1712.04621.

Alberto Poncelas, Dimitar Sht. Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating backtranslation in neural machine translation. arXiv, 1804.06189.

Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv, 1804.08771.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Conference of the Association for Computational Linguistics (ACL).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Conference of the Association for Computational Linguistics (ACL).

Iulian Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Conference of the Association for the Advancement of Artificial Intelligence (AAAI).

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proc. of NAACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Conference on Advances in Neural Information Processing Systems (NIPS).

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception architecture for computer vision. arXiv, 1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Advances in Neural Information Processing Systems (NIPS).

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML).

Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. 2017. Dual supervised learning. In International Conference on Machine Learning (ICML).

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
