
Beam Search Strategies for Neural Machine Translation

Markus Freitag and Yaser Al-Onaizan


IBM T.J. Watson Research Center
1101 Kitchawan Rd, Yorktown Heights, NY 10598
{freitagm,onaizan}@us.ibm.com

Abstract

The basic concept in Neural Machine Translation (NMT) is to train a large neural network that maximizes the translation performance on a given parallel corpus. NMT then uses a simple left-to-right beam-search decoder to generate new translations that approximately maximize the trained conditional probability. The current beam search strategy generates the target sentence word by word from left to right while keeping a fixed number of active candidates at each time step. First, this simple search is less adaptive, as it also expands candidates whose scores are much worse than the current best. Secondly, it does not expand hypotheses that fall outside the best scoring candidates, even if their scores are close to the best one. The latter can be avoided by increasing the beam size until no further performance improvement is observed. While this reaches better performance, it has the drawback of a slower decoding speed. In this paper, we concentrate on speeding up the decoder by applying a more flexible beam search strategy whose candidate size may vary at each time step depending on the candidate scores. We speed up the original decoder by up to 43% for the two language pairs German→English and Chinese→English without losing any translation quality.

1 Introduction

Since Neural Machine Translation (NMT) reaches comparable or even better performance than traditional statistical machine translation (SMT) models (Jean et al., 2015; Luong et al., 2015), it has become very popular in recent years (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014). With the recent success of NMT, attention has shifted towards making it more practical. One of the challenges is the search strategy for extracting the best translation for a given source sentence. In NMT, new sentences are translated by a simple beam search decoder that finds a translation that approximately maximizes the conditional probability of a trained NMT model. The beam search strategy generates the translation word by word from left to right while keeping a fixed number (beam) of active candidates at each time step. Increasing the beam size can improve translation performance at the expense of significantly reducing the decoder speed. Typically, there is a saturation point beyond which the translation quality does not improve any more with a larger beam. The motivation of this work is twofold. First, we prune the search graph and thus speed up the decoding process without losing any translation quality. Secondly, we observed that the best scoring candidates often share the same history and often come from the same partial hypothesis. We limit the number of candidates coming from the same partial hypothesis to introduce more diversity, without the slowdown that simply using a higher beam would incur.

2 Related Work

The original beam search for sequence to sequence models was introduced and described by (Graves, 2012; Boulanger-Lewandowski et al., 2013) and by (Sutskever et al., 2014) for neural machine translation. (Hu et al., 2015; Mi et al., 2016) improved the beam search with a constrained softmax function which only considers a limited word set of translation candidates to reduce the computation complexity. This has the advantage that only a small set of candidates needs to be normalized, which improves the decoding speed. (Wu et al., 2016) only consider tokens whose local scores are not more than beamsize below the best token during their search. Further, the authors prune all partial hypotheses whose scores are beamsize lower than the best final hypothesis (if one has already been generated). In this work, we investigate different absolute and relative pruning schemes which have successfully been applied in statistical machine translation, e.g. for phrase table pruning (Zens et al., 2012).

3 Original Beam Search

The original beam-search strategy finds a translation that approximately maximizes the conditional probability given by a specific model. It builds the translation from left to right and keeps a fixed number (beam) of translation candidates with the highest log-probability at each time step. For each end-of-sequence symbol that is selected among the highest scoring candidates, the beam is reduced by one and the translation is stored in a final candidate list. When the beam reaches zero, the search stops and the translation with the highest log-probability (normalized by the number of target words) is picked from the final candidate list.
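
To make the procedure concrete, the following is a minimal Python sketch of this decoder; it is not the paper's actual implementation, and the step_fn callback, the bos/eos token ids, and the maximum length are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

def beam_search(step_fn: Callable[[Sequence[int]], Sequence[Tuple[int, float]]],
                bos: int, eos: int, beam_size: int, max_len: int = 100) -> List[int]:
    """Left-to-right beam search with a shrinking beam.

    step_fn(prefix) is assumed to return (token, log_prob) pairs for the
    next position given a target prefix; bos/eos are boundary token ids.
    """
    # Each hypothesis is (token sequence, accumulated log-probability).
    beam = [([bos], 0.0)]
    finished: List[Tuple[List[int], float]] = []
    width = beam_size

    for _ in range(max_len):
        if width == 0 or not beam:
            break
        expansions = []
        for prefix, score in beam:
            for token, logp in step_fn(prefix):
                expansions.append((prefix + [token], score + logp))
        # Keep the `width` best-scoring candidates at this time step.
        expansions.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for hyp, score in expansions[:width]:
            if hyp[-1] == eos:
                # An end-of-sequence symbol reduces the beam by one and moves
                # the hypothesis into the final candidate list.
                finished.append((hyp, score))
                width -= 1
            else:
                beam.append((hyp, score))

    if not finished:  # fall back to unfinished hypotheses if nothing ended in eos
        finished = beam
    # Pick the translation with the highest length-normalized log-probability.
    best, _ = max(finished, key=lambda h: h[1] / max(len(h[0]) - 1, 1))
    return best
```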
4 Search Strategies

In this section, we describe the different strategies we experimented with. In all our extensions, we first reduce the candidate list to the current beam size and then apply one or several of the following pruning schemes on top (a short code sketch combining all four schemes follows the definitions below).

Relative Threshold Pruning. The relative threshold pruning method discards those candidates that are far worse than the best active candidate. Given a pruning threshold rp and an active candidate list C, a candidate cand ∈ C is discarded if:

    score(cand) ≤ rp ∗ max_{c∈C} score(c)    (1)

Absolute Threshold Pruning. Instead of taking the relative difference of the scores into account, we just discard those candidates that are worse than the best active candidate by a specific threshold. Given a pruning threshold ap and an active candidate list C, a candidate cand ∈ C is discarded if:

    score(cand) ≤ max_{c∈C} score(c) − ap    (2)

Relative Local Threshold Pruning. In this pruning approach, we only consider the score score_w of the last generated word and not the total score, which also includes the scores of the previously generated words. Given a pruning threshold rpl and an active candidate list C, a candidate cand ∈ C is discarded if:

    score_w(cand) ≤ rpl ∗ max_{c∈C} score_w(c)    (3)

Maximum Candidates per Node. We observed that at each time step during the decoding process, most of the partial hypotheses share the same predecessor words. To introduce more diversity, we allow only a fixed number of candidates with the same history at each time step. Given a maximum candidate threshold mc and an active candidate list C, a candidate cand ∈ C is discarded if mc better scoring partial hypotheses with the same history are already in the candidate list.
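
As an illustration of how the four schemes fit together, here is a minimal Python sketch that applies them to the candidate list of one time step, following Eqs. (1)–(3) literally. The Candidate class, the score conventions (probability vs. log-probability domain), and the parameter names are assumptions rather than the authors' actual code; in practice one can also enable only a subset of the schemes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class Candidate:
    """A partial hypothesis at one time step."""
    def __init__(self, tokens: Tuple[int, ...], score: float, word_score: float):
        self.tokens = tokens          # generated target prefix
        self.score = score            # total score of the hypothesis
        self.word_score = word_score  # score of the last generated word only

def prune(cands: List[Candidate], rp: float, ap: float,
          rpl: float, mc: int) -> List[Candidate]:
    """Apply the four pruning schemes to one time step's candidate list."""
    best = max(c.score for c in cands)
    best_word = max(c.word_score for c in cands)

    kept: List[Candidate] = []
    seen: Dict[Tuple[int, ...], int] = defaultdict(int)
    # Visit candidates from best to worst so the per-history counter only
    # counts better scoring hypotheses with the same history.
    for c in sorted(cands, key=lambda x: x.score, reverse=True):
        if c.score <= rp * best:              # relative threshold pruning, Eq. (1)
            continue
        if c.score <= best - ap:              # absolute threshold pruning, Eq. (2)
            continue
        if c.word_score <= rpl * best_word:   # relative local threshold pruning, Eq. (3)
            continue
        history = c.tokens[:-1]
        if seen[history] >= mc:               # maximum candidates per node
            continue
        seen[history] += 1
        kept.append(c)
    return kept
```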

5 Experiments

For the German→English translation task, we train an NMT system based on the WMT 2016 training data (Bojar et al., 2016) (3.9M parallel sentences). For the Chinese→English experiments, we use an NMT system trained on 11 million sentences from the BOLT project.

In all our experiments, we use our in-house attention-based NMT implementation, which is similar to (Bahdanau et al., 2014). For German→English, we use sub-word units extracted by byte pair encoding (Sennrich et al., 2015) instead of words, which shrinks the vocabulary to 40k sub-word symbols for both source and target. For Chinese→English, we limit our vocabularies to the top 300K most frequent words for both source and target language. Words not in these vocabularies are converted into an unknown token. During translation, we use the alignments (from the attention mechanism) to replace the unknown tokens either with potential targets (obtained from an IBM Model-1 trained on the parallel data) or with the source word itself (if no target was found) (Mi et al., 2016). We use an embedding dimension of 620 and fix the RNN GRU layers to 1000 cells each. For the training procedure, we use SGD (Bishop, 1995) to update the model parameters with a mini-batch size of 64. The training data is shuffled after each epoch.
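
As an illustration of the unknown-token replacement step, the sketch below shows one way it could be implemented on top of the attention weights; the attention-matrix layout, the lexicon format, and the function name are assumptions rather than the authors' actual code.

```python
from typing import Dict, List, Sequence

UNK = "<unk>"

def replace_unknowns(source: Sequence[str],
                     target: Sequence[str],
                     attention: Sequence[Sequence[float]],
                     lexicon: Dict[str, str]) -> List[str]:
    """Replace <unk> tokens in the output using attention-based alignments.

    attention[t][s] is assumed to be the attention weight of target position t
    on source position s; lexicon maps a source word to its most likely
    translation (e.g. from an IBM Model-1 table).
    """
    output = []
    for t, word in enumerate(target):
        if word != UNK:
            output.append(word)
            continue
        # Align the unknown token to the source word with the highest
        # attention weight at this time step.
        weights = attention[t]
        src_pos = max(range(len(source)), key=lambda s: weights[s])
        src_word = source[src_pos]
        # Use the lexicon translation if available, otherwise copy the
        # source word itself.
        output.append(lexicon.get(src_word, src_word))
    return output
```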
We measure the decoding speed by two numbers. First, we compare the actual speed relative to the same setup without any pruning. Secondly, we measure the average fan out per time step. For each time step, the fan out is defined as the number of candidates we expand. The fan out has an upper bound of the beam size, but can be decreased either by early stopping (we reduce the beam every time we predict an end-of-sentence symbol) or by the proposed pruning schemes. For each pruning technique, we run the experiments with different pruning thresholds and choose the largest threshold that does not degrade the translation performance based on a selection set.
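
For clarity, a small sketch of how these two numbers could be computed from statistics collected during decoding; the data layout and function names are assumptions for illustration.

```python
from typing import Sequence

def average_fan_out(fan_outs_per_sentence: Sequence[Sequence[int]]) -> float:
    """Average number of expanded candidates per time step, averaged over sentences.

    fan_outs_per_sentence[i][t] is assumed to hold the number of candidates
    expanded at time step t of sentence i.
    """
    per_sentence = [sum(f) / len(f) for f in fan_outs_per_sentence if f]
    return sum(per_sentence) / len(per_sentence)

def speed_up(baseline_seconds: float, pruned_seconds: float) -> float:
    """Relative speed-up of the pruned decoder over the unpruned baseline, in percent."""
    return 100.0 * (baseline_seconds - pruned_seconds) / baseline_seconds
```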
In Figure 1, we show the German→English translation performance and the average fan out per sentence for different beam sizes. Based on this experiment, we decided to run our pruning experiments for beam sizes 5 and 14.

[Figure 1: German→English: Original beam-search strategy with different beam sizes on newstest2014. The plot shows BLEU and the average fan out per sentence as a function of the beam size.]

The German→English results can be found in Table 1. By using the combination of all pruning techniques, we can speed up the decoding process by 13% for beam size 5 and by 43% for beam size 14 without any drop in performance. The relative pruning technique works best for beam size 5, whereas the absolute pruning technique works best for beam size 14. Figure 2 illustrates the decoding speed with different relative pruning thresholds for beam size 5. Setting the threshold higher than 0.6 hurts the translation performance. A nice side effect is that it becomes possible to decode without any fixed beam size when we apply pruning. Nevertheless, the decoding speed then drops while the translation performance does not change. Further, we looked at the number of search errors introduced by our pruning schemes (the number of times we prune the best scoring hypothesis). 5% of the sentences change due to search errors for beam size 5 and 9% of the sentences change for beam size 14 when using all four pruning techniques together.

[Figure 2: German→English: Different values of relative pruning measured on newstest2014. The plot shows BLEU and the average fan out per sentence for beam size 5 as a function of the relative pruning threshold.]
The Chinese→English translation results can be found in Table 2. We can speed up the decoding process by 10% for beam size 5 and by 24% for beam size 14 without loss in translation quality. In addition, we measured the number of search errors introduced by pruning the search. Only 4% of the sentences change for beam size 5, whereas 22% of the sentences change for beam size 14.

6 Conclusion

The original beam search decoder used in Neural Machine Translation is very simple. It generates translations from left to right while looking at a fixed number (beam) of candidates from the last time step only. By setting the beam size large enough, we ensure that the best translation performance can be reached, with the drawback that many candidates whose scores are far away from the best are also explored. In this paper, we introduced several pruning techniques which prune candidates whose scores are far away from the best one. By applying a combination of absolute and relative pruning schemes, we speed up the decoder by up to 43% without losing any translation quality. Putting more diversity into the decoder did not improve the translation quality.

pruning                       beam size   speed up   avg fan out   tot fan out   newstest2014    newstest2015
                                                     per sent      per sent      BLEU    TER     BLEU    TER
no pruning                    1           -          1.00          25            25.5    56.8    26.1    55.4
no pruning                    5           -          4.54          122           27.3    54.6    27.4    53.7
rp=0.6                        5           6%         3.71          109           27.3    54.7    27.3    53.8
ap=2.5                        5           5%         4.11          116           27.3    54.6    27.4    53.7
rpl=0.02                      5           5%         4.25          118           27.3    54.7    27.4    53.8
mc=3                          5           0%         4.54          126           27.4    54.6    27.5    53.8
rp=0.6,ap=2.5,rpl=0.02,mc=3   5           13%        3.64          101           27.3    54.6    27.3    53.8
no pruning                    14          -          12.19         363           27.6    54.3    27.6    53.5
rp=0.3                        14          10%        10.38         315           27.6    54.3    27.6    53.4
ap=2.5                        14          29%        9.49          279           27.6    54.3    27.6    53.5
rpl=0.3                       14          24%        10.27         306           27.6    54.4    27.7    53.4
mc=3                          14          1%         12.21         347           27.6    54.4    27.7    53.4
rp=0.3,ap=2.5,rpl=0.3,mc=3    14          43%        8.44          260           27.6    54.5    27.6    53.4
rp=0.3,ap=2.5,rpl=0.3,mc=3    -           -          28.46         979           27.6    54.4    27.6    53.3

Table 1: Results German→English: relative pruning (rp), absolute pruning (ap), relative local pruning (rpl) and maximum candidates per node (mc). Average fan out is the average number of candidates we keep at each time step during decoding.

pruning                       beam size   speed up   avg fan out   tot fan out   MT08 nw         MT08 wb
                                                     per sent      per sent      BLEU    TER     BLEU    TER
no pruning                    1           -          1.00          29            27.3    61.7    26.0    60.3
no pruning                    5           -          4.36          137           34.4    57.3    30.6    58.2
rp=0.2                        5           1%         4.32          134           34.4    57.3    30.6    58.2
ap=5                          5           4%         4.26          132           34.3    57.3    30.6    58.2
rpl=0.01                      5           1%         4.35          135           34.4    57.5    30.6    58.3
mc=3                          5           0%         4.37          139           34.4    57.4    30.7    58.2
rp=0.2,ap=5,rpl=0.01,mc=3     5           10%        3.92          121           34.3    57.3    30.6    58.2
no pruning                    14          -          11.96         376           35.3    57.1    31.2    57.8
rp=0.2                        14          3%         11.62         362           35.2    57.2    31.2    57.8
ap=2.5                        14          14%        10.15         321           35.2    56.9    31.1    57.9
rpl=0.3                       14          10%        10.93         334           35.3    57.2    31.1    57.9
mc=3                          14          0%         11.98         378           35.3    56.9    31.1    57.8
rp=0.2,ap=2.5,rpl=0.3,mc=3    14          24%        8.62          306           35.3    56.9    31.1    57.8
rp=0.2,ap=2.5,rpl=0.3,mc=3    -           -          38.76         1411          35.2    57.3    31.1    57.9

Table 2: Results Chinese→English: relative pruning (rp), absolute pruning (ap), relative local pruning (rpl) and maximum candidates per node (mc).

References

D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. ArXiv e-prints.

Christopher M. Bishop. 1995. Neural networks for pattern recognition. Oxford University Press.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation (WMT16). Proceedings of WMT.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2013. Audio chord recognition with recurrent neural networks. In ISMIR. Citeseer, pages 335–340.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Xiaoguang Hu, Wei Li, Xiang Lan, Hua Wu, and Haifeng Wang. 2015. Improved beam search with constrained softmax for NMT. Proceedings of MT Summit XV, page 297.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL. Beijing, China, pages 1–10.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL. Beijing, China, pages 11–19.

Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016. Vocabulary manipulation for neural machine translation. arXiv preprint arXiv:1605.03209.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. pages 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Richard Zens, Daisy Stanton, and Peng Xu. 2012. A systematic comparison of phrase table pruning techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 972–983.
