Beam Search Strategies For Neural Machine Translation
Proceedings of the First Workshop on Neural Machine Translation, pages 56–60, Vancouver, Canada, August 4, 2017. © 2017 Association for Computational Linguistics
[...] the computation complexity. This has the advantage that they normalize only a small set of candidates and thus improve the decoding speed. Wu et al. (2016) only consider tokens whose local scores are not more than beamsize below the best token during their search. Further, the authors prune all partial hypotheses whose scores are beamsize lower than the best final hypothesis (if one has already been generated). In this work, we investigate different absolute and relative pruning schemes which have successfully been applied in statistical machine translation, e.g. for phrase table pruning (Zens et al., 2012).
3 Original Beam Search

The original beam-search strategy finds a translation that approximately maximizes the conditional probability given by a specific model. It builds the translation from left to right and keeps a fixed number (beam) of translation candidates with the highest log-probability at each time step. For each end-of-sequence symbol that is selected among the highest scoring candidates, the beam is reduced by one and the translation is stored into a final candidate list. When the beam is zero, it stops the search and picks the translation with the highest log-probability (normalized by the number of target words) out of the final candidate list.
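To make this procedure concrete, the following Python sketch mirrors the description above. It is an illustration rather than the implementation used in this work; in particular, the model.step(prefix) interface (assumed to return (token, log-probability) pairs for the next target position) and the max_len cut-off are assumptions made for the example.

```python
def beam_search(model, beam_size, eos_id, max_len=100):
    """Sketch of the original beam-search strategy described in Section 3."""
    beam = [([], 0.0)]            # active hypotheses: (target prefix, log-probability)
    finished = []                 # completed translations

    for _ in range(max_len):
        if beam_size == 0 or not beam:
            break
        # Expand every active hypothesis by one target token.
        expansions = []
        for prefix, score in beam:
            for token, logp in model.step(prefix):      # assumed model interface
                expansions.append((prefix + [token], score + logp))
        # Keep only the `beam_size` highest-scoring candidates.
        expansions.sort(key=lambda cand: cand[1], reverse=True)
        beam = []
        for prefix, score in expansions[:beam_size]:
            if prefix[-1] == eos_id:
                # An end-of-sequence symbol was selected: shrink the beam by one
                # and store the hypothesis in the final candidate list.
                finished.append((prefix, score))
                beam_size -= 1
            else:
                beam.append((prefix, score))

    finished = finished or beam   # fallback if nothing finished within max_len
    # Pick the translation with the highest length-normalized log-probability.
    return max(finished, key=lambda cand: cand[1] / max(len(cand[0]), 1))[0]
```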
4 Search Strategies

In this section, we describe the different strategies we experimented with. In all our extensions, we first reduce the candidate list to the current beam size and apply, on top of this, one or several of the following pruning schemes.
Relative Threshold Pruning. The relative threshold pruning method discards those candidates that are far worse than the best active candidate. Given a pruning threshold rp and an active candidate list C, a candidate cand ∈ C is discarded if:

score(cand) ≤ rp · max_{c ∈ C} score(c)    (1)
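As an illustration only (not the authors' code), relative threshold pruning could be sketched as follows, assuming candidates are stored as (hypothesis, score) pairs and that, as in Equation 1, a higher score is better:

```python
def relative_threshold_pruning(candidates, rp):
    """Discard candidates whose score is at most rp times the best score (Eq. 1)."""
    best = max(score for _, score in candidates)
    return [(hyp, score) for hyp, score in candidates if score > rp * best]
```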
Absolute Threshold Pruning. Instead of taking the relative difference of the scores into account, we just discard those candidates that are worse by a specific threshold than the best active candidate. Given a pruning threshold ap and an active candidate list C, a candidate cand ∈ C is discarded if:

score(cand) ≤ max_{c ∈ C} score(c) − ap    (2)
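A corresponding sketch for absolute threshold pruning, under the same assumed (hypothesis, score) representation:

```python
def absolute_threshold_pruning(candidates, ap):
    """Discard candidates scoring more than ap below the best candidate (Eq. 2)."""
    best = max(score for _, score in candidates)
    return [(hyp, score) for hyp, score in candidates if score > best - ap]
```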
Relative Local Threshold Pruning. In this pruning approach, we only consider the score score_w of the last generated word and not the total score, which also includes the scores of the previously generated words. Given a pruning threshold rpl and an active candidate list C, a candidate cand ∈ C is discarded if:

score_w(cand) ≤ rpl · max_{c ∈ C} score_w(c)    (3)
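Again as a hedged sketch, this time assuming each candidate additionally carries the score of its last generated word as a third field:

```python
def relative_local_threshold_pruning(candidates, rpl):
    """Discard candidates whose last-word score is at most rpl times the best
    last-word score in the active candidate list (Eq. 3)."""
    best_local = max(last_word_score for _, _, last_word_score in candidates)
    return [cand for cand in candidates if cand[2] > rpl * best_local]
```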
Maximum Candidates per Node. We observed that at each time step during the decoding process, most of the partial hypotheses share the same predecessor words. To introduce more diversity, we allow only a fixed number of candidates with the same history at each time step. Given a maximum candidate threshold mc and an active candidate list C, a candidate cand ∈ C is discarded if already mc better scoring partial hypotheses with the same history are in the candidate list.
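A sketch of the maximum-candidates-per-node constraint, assuming candidates are (prefix, score) pairs and taking the history of a candidate to be its prefix without the most recently added word:

```python
from collections import defaultdict

def max_candidates_per_node(candidates, mc):
    """Keep at most mc candidates that share the same history (predecessor words)."""
    kept = []
    per_history = defaultdict(int)
    # Visit candidates from best to worst so only worse-scoring duplicates are dropped.
    for prefix, score in sorted(candidates, key=lambda cand: cand[1], reverse=True):
        history = tuple(prefix[:-1])
        if per_history[history] < mc:
            kept.append((prefix, score))
            per_history[history] += 1
    return kept
```

In the combined settings reported in Tables 1 and 2, such schemes would simply be applied one after another to the candidate list that has already been reduced to the current beam size, as described at the beginning of this section.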
5 Experiments

For the German→English translation task, we train an NMT system based on the WMT 2016 training data (Bojar et al., 2016) (3.9M parallel sentences). For the Chinese→English experiments, we use an NMT system trained on 11 million sentences from the BOLT project.

In all our experiments, we use our in-house attention-based NMT implementation, which is similar to (Bahdanau et al., 2014). For German→English, we use sub-word units extracted by byte pair encoding (Sennrich et al., 2015) instead of words, which shrinks the vocabulary to 40k sub-word symbols for both source and target. For Chinese→English, we limit our vocabularies to be the top 300K most frequent words for both source and target language. Words not in these vocabularies are converted into an unknown token. During translation, we use the alignments (from the attention mechanism) to replace the unknown tokens either with potential targets (obtained from an IBM Model-1 trained on the parallel data) or with the source word itself (if no target was found) (Mi et al., 2016). We use an embedding dimension of 620 and fix the RNN GRU layers to be of 1000 cells each. For the training procedure, we use SGD (Bishop, 1995) to update model [...]
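The unknown-token replacement described above might look roughly as follows; the lexicon dictionary (standing in for the IBM Model-1 translation table), the attention-matrix layout and the <unk> marker are assumptions made for illustration and are not taken from the paper:

```python
def replace_unknowns(source, target, attention, lexicon, unk="<unk>"):
    """Replace <unk> tokens using the attention weights, falling back to the source word."""
    output = []
    for t, word in enumerate(target):
        if word != unk:
            output.append(word)
            continue
        # Source position that received the most attention at this target step.
        s = max(range(len(source)), key=lambda j: attention[t][j])
        # Prefer a lexical translation, otherwise copy the source word itself.
        output.append(lexicon.get(source[s], source[s]))
    return output
```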
[Figure 1: German→English: Original beam-search strategy with different beam sizes on newstest2014 (BLEU and average fan out vs. beam size).]

[Figure 2: BLEU and average fan out per sentence for different relative pruning thresholds at beam size 5.]

German→English results can be found in Table 1. By using the combination of all pruning techniques, we can speed up the decoding process by [...] The relative pruning technique is the best working one for beam size 5, whereas the absolute pruning technique works best for beam size 14. In Figure 2, the decoding speed with different relative pruning thresholds for beam size 5 is illustrated. Setting the threshold higher than 0.6 hurts the translation performance. A nice side effect is that it has become possible to decode without any fixed beam size when we apply pruning. Nevertheless, the decoding speed drops while the translation performance does not change. Further, we looked at the number of search errors introduced by our pruning schemes (the number of times we prune the best scoring hypothesis). 5% of the sentences change due to search errors for beam size 5 and 9% of the [...]
pruning                       | beam size | speed up | avg fan out | tot fan out | newstest2014 | newstest2015
                              |           |          | per sent    | per sent    | BLEU   TER   | BLEU   TER
no pruning                    | 1         | -        | 1.00        | 25          | 25.5   56.8  | 26.1   55.4
no pruning                    | 5         | -        | 4.54        | 122         | 27.3   54.6  | 27.4   53.7
rp=0.6                        | 5         | 6%       | 3.71        | 109         | 27.3   54.7  | 27.3   53.8
ap=2.5                        | 5         | 5%       | 4.11        | 116         | 27.3   54.6  | 27.4   53.7
rpl=0.02                      | 5         | 5%       | 4.25        | 118         | 27.3   54.7  | 27.4   53.8
mc=3                          | 5         | 0%       | 4.54        | 126         | 27.4   54.6  | 27.5   53.8
rp=0.6,ap=2.5,rpl=0.02,mc=3   | 5         | 13%      | 3.64        | 101         | 27.3   54.6  | 27.3   53.8
no pruning                    | 14        | -        | 12.19       | 363         | 27.6   54.3  | 27.6   53.5
rp=0.3                        | 14        | 10%      | 10.38       | 315         | 27.6   54.3  | 27.6   53.4
ap=2.5                        | 14        | 29%      | 9.49        | 279         | 27.6   54.3  | 27.6   53.5
rpl=0.3                       | 14        | 24%      | 10.27       | 306         | 27.6   54.4  | 27.7   53.4
mc=3                          | 14        | 1%       | 12.21       | 347         | 27.6   54.4  | 27.7   53.4
rp=0.3,ap=2.5,rpl=0.3,mc=3    | 14        | 43%      | 8.44        | 260         | 27.6   54.5  | 27.6   53.4
rp=0.3,ap=2.5,rpl=0.3,mc=3    | -         | -        | 28.46       | 979         | 27.6   54.4  | 27.6   53.3

Table 1: Results German→English: relative pruning (rp), absolute pruning (ap), relative local pruning (rpl) and maximum candidates per node (mc). Average fan out is the average number of candidates we keep at each time step during decoding.
pruning                       | beam size | speed up | avg fan out | tot fan out | MT08 nw      | MT08 wb
                              |           |          | per sent    | per sent    | BLEU   TER   | BLEU   TER
no pruning                    | 1         | -        | 1.00        | 29          | 27.3   61.7  | 26.0   60.3
no pruning                    | 5         | -        | 4.36        | 137         | 34.4   57.3  | 30.6   58.2
rp=0.2                        | 5         | 1%       | 4.32        | 134         | 34.4   57.3  | 30.6   58.2
ap=5                          | 5         | 4%       | 4.26        | 132         | 34.3   57.3  | 30.6   58.2
rpl=0.01                      | 5         | 1%       | 4.35        | 135         | 34.4   57.5  | 30.6   58.3
mc=3                          | 5         | 0%       | 4.37        | 139         | 34.4   57.4  | 30.7   58.2
rp=0.2,ap=5,rpl=0.01,mc=3     | 5         | 10%      | 3.92        | 121         | 34.3   57.3  | 30.6   58.2
no pruning                    | 14        | -        | 11.96       | 376         | 35.3   57.1  | 31.2   57.8
rp=0.2                        | 14        | 3%       | 11.62       | 362         | 35.2   57.2  | 31.2   57.8
ap=2.5                        | 14        | 14%      | 10.15       | 321         | 35.2   56.9  | 31.1   57.9
rpl=0.3                       | 14        | 10%      | 10.93       | 334         | 35.3   57.2  | 31.1   57.9
mc=3                          | 14        | 0%       | 11.98       | 378         | 35.3   56.9  | 31.1   57.8
rp=0.2,ap=2.5,rpl=0.3,mc=3    | 14        | 24%      | 8.62        | 306         | 35.3   56.9  | 31.1   57.8
rp=0.2,ap=2.5,rpl=0.3,mc=3    | -         | -        | 38.76       | 1411        | 35.2   57.3  | 31.1   57.9

Table 2: Results Chinese→English: relative pruning (rp), absolute pruning (ap), relative local pruning (rpl) and maximum candidates per node (mc).
Haifeng Wang. 2015. Improved beam search with constrained softmax for NMT. In Proceedings of MT Summit XV, page 297.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of ACL, Beijing, China, pages 1–10.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL, Beijing, China, pages 11–19.

Haitao Mi, Zhiguo Wang, and Abe Ittycheriah. 2016. Vocabulary manipulation for neural machine translation. arXiv preprint arXiv:1605.03209.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Quebec, Canada, pages 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.