Effective Approaches To Attention-Based Neural Machine Translation
p(yt | y<t, x) = softmax(Ws h̃t)    (6)

We now detail how each model type computes the source-side context vector ct.

3.1 Global Attention

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector ct. In this model type, a variable-length alignment vector at, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state ht with each source hidden state h̄s:

at(s) = align(ht, h̄s) = exp(score(ht, h̄s)) / Σs′ exp(score(ht, h̄s′))    (7)

Here, score is referred to as a content-based function for which we consider three different alternatives:

score(ht, h̄s) = ht⊤ h̄s (dot), ht⊤ Wa h̄s (general), or va⊤ tanh(Wa [ht; h̄s]) (concat)    (8)

Besides, in our early attempts to build attention-based models, we used a location-based function in which the alignment scores are computed solely from the target hidden state ht: at = softmax(Wa ht) (location).

Given the alignment vector as weights, the context vector ct is computed as the weighted average over all the source hidden states.6

Comparison to (Bahdanau et al., 2015) – While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model. First, we simply use hidden states at the top LSTM layers in both the encoder and decoder, as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking uni-directional decoder. Second, our computation path is simpler: we go from ht → at → ct → h̃t and then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time t, Bahdanau et al. (2015) build from the previous hidden state ht−1 → at → ct → ht, which, in turn, goes through a deep-output and a maxout layer before making predictions.7 Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better.

Figure 3: Local attention model – the model first predicts a single aligned position pt for the current target word. A window centered around the source position pt is then used to compute a context vector ct, a weighted average of the source hidden states in the window. The weights at are inferred from the current target state ht and those source states h̄s in the window.

3.2 Local Attention

The global attention has a drawback: it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical for translating longer sequences, e.g., paragraphs or documents. To address this deficiency, we propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word.

This model takes inspiration from the tradeoff between the soft and hard attentional models proposed by Xu et al. (2015) to tackle the image caption generation task. In their work, soft attention refers to the global attention approach in which weights are placed “softly” over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.

Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position pt for each target word at time t. The context vector ct is then derived as a weighted average over the set of source hidden states within the window [pt − D, pt + D]; D is empirically selected.8 Unlike the global approach, the local alignment vector at is now fixed-dimensional, i.e., ∈ R^(2D+1). We consider two variants of the model as below.

Monotonic alignment (local-m) – we simply set pt = t, assuming that source and target sequences are roughly monotonically aligned. The alignment vector at is defined according to Eq. (7).9

Predictive alignment (local-p) – instead of assuming monotonic alignments, our model predicts an aligned position as follows:

pt = S · sigmoid(vp⊤ tanh(Wp ht))    (9)

Wp and vp are the model parameters which will be learned to predict positions. S is the source sentence length. As a result of the sigmoid, pt ∈ [0, S]. To favor alignment points near pt, we place a Gaussian distribution centered around pt. Specifically, our alignment weights are now defined as:

at(s) = align(ht, h̄s) · exp(−(s − pt)² / (2σ²))    (10)

We use the same align function as in Eq. (7), and the standard deviation is empirically set as σ = D/2. Note that pt is a real number, whereas s is an integer within the window centered at pt.10

6 Eq. (8) implies that all alignment vectors at are of the same length. For short sentences, we only use the top part of at, and for long sentences, we ignore words near the end.
7 We will refer to this difference again in Section 3.3.
8 If the window crosses the sentence boundaries, we simply ignore the outside part and consider only words in the window.
9 local-m is the same as the global model except that the vector at is fixed-length and shorter.
10 local-p is similar to the local-m model except that we dynamically compute pt and use a truncated Gaussian distribution to modify the original alignment weights align(ht, h̄s), as shown in Eq. (10). By utilizing pt to derive at, we can compute backprop gradients for Wp and vp. This model is differentiable almost everywhere.
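To make the global and local-p computations above concrete, here is a minimal NumPy sketch (our own illustration, not the paper's MATLAB implementation; the dot score is used for align, and the shapes and names H_src, W_p, v_p are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, H_src):
    """Eq. (7) with the dot score: a_t(s) ∝ exp(h_t · h̄_s);
    c_t is the weighted average of all source hidden states."""
    scores = H_src @ h_t               # (S,): one score per source position
    a_t = softmax(scores)              # variable-length alignment vector
    c_t = a_t @ H_src                  # context vector (weighted average)
    return a_t, c_t

def local_p_attention(h_t, H_src, W_p, v_p, D=10):
    """local-p: predict p_t = S * sigmoid(v_p^T tanh(W_p h_t)) (Eq. 9),
    then reweight window alignments with a Gaussian of std D/2 (Eq. 10)."""
    S = H_src.shape[0]
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # real-valued, in [0, S]
    # Window [p_t - D, p_t + D]; parts outside the sentence are ignored.
    lo = max(0, int(np.floor(p_t)) - D)
    hi = min(S, int(np.floor(p_t)) + D + 1)
    window = H_src[lo:hi]
    a_t = softmax(window @ h_t)                 # align(h_t, h̄_s) on the window
    s = np.arange(lo, hi)
    sigma = D / 2.0
    a_t = a_t * np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))  # Eq. (10)
    c_t = a_t @ window
    return p_t, a_t, c_t
```

Used as `a_t, c_t = global_attention(h_t, H_src)`, the returned c_t is the source-side context vector that feeds into h̃t.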
Figure 4: Input-feeding approach – attentional vectors h̃t are fed as inputs to the next time steps to inform the model about past alignment decisions.

Comparison to (Gregor et al., 2015) – Gregor et al. (2015) have proposed a selective attention mechanism, very similar to our local attention, for the image generation task. Their approach allows the model to select an image patch of varying location and zoom. We, instead, use the same “zoom” for all target positions, which greatly simplifies the formulation and still achieves good performance.

3.3 Input-feeding Approach

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. In standard MT, by contrast, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly, taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors h̃t are concatenated with inputs at the next time steps, as illustrated in Figure 4.11 The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.

Comparison to other work – Bahdanau et al. (2015) use context vectors, similar to our ct, in building subsequent hidden states, which can also achieve the “coverage” effect. However, there has not been any analysis of whether such connections are useful, as done in this work. Also, our approach is more general; as illustrated in Figure 4, it can be applied to general stacking recurrent architectures, including non-attentional models.

Xu et al. (2015) propose a doubly attentional approach with an additional constraint added to the training objective to make sure the model pays equal attention to all parts of the image during the caption generation process. Such a constraint can also be useful to capture the coverage set effect in NMT that we mentioned earlier. However, we chose to use the input-feeding approach since it provides flexibility for the model to decide on any attentional constraints it deems suitable.

4 Experiments

We evaluate the effectiveness of our models on the WMT translation tasks between English and German in both directions. newstest2013 (3000 sentences) is used as a development set to select our hyperparameters. Translation performances are reported in case-sensitive BLEU (Papineni et al., 2002) on newstest2014 (2737 sentences) and newstest2015 (2169 sentences). Following (Luong et al., 2015), we report translation quality using two types of BLEU: (a) tokenized12 BLEU, to be comparable with existing NMT work, and (b) NIST13 BLEU, to be comparable with WMT results.

4.1 Training Details

All our models are trained on the WMT’14 training data consisting of 4.5M sentence pairs (116M English words, 110M German words). Similar to (Jean et al., 2015), we limit our vocabularies to be the top 50K most frequent words for both languages. Words not in these shortlisted vocabularies are converted into a universal token <unk>.

When training our NMT systems, following (Bahdanau et al., 2015; Jean et al., 2015), we filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed. Our stacking LSTM models have 4 layers, each with 1000 cells, and 1000-dimensional embeddings. We follow (Sutskever et al., 2014; Luong et al., 2015) in training NMT with similar settings: (a) our parameters are uniformly initialized in [−0.1, 0.1], (b) we train for 10 epochs using plain SGD, (c) a simple learning rate schedule is employed – we start with a learning rate of 1; after 5 epochs, we begin to halve the learning rate every epoch, (d) our mini-batch size is 128, and (e) the normalized gradient is rescaled whenever its norm exceeds 5. Additionally, we also use dropout with probability 0.2 for our LSTMs, as suggested by (Zaremba et al., 2015). For dropout models, we train for 12 epochs and start halving the learning rate after 8 epochs. For local attention models, we empirically set the window size D = 10.

Our code is implemented in MATLAB. When running on a single GPU device, a Tesla K40, we achieve a speed of 1K target words per second. It takes 7–10 days to completely train a model.

11 If n is the number of LSTM cells, the input size of the first LSTM layer is 2n; those of subsequent layers are n.
12 All texts are tokenized with tokenizer.perl and BLEU scores are computed with multi-bleu.perl.
13 With the mteval-v13a script, as per the WMT guideline.

System | Ppl | BLEU
Winning WMT’14 system – phrase-based + large LM (Buck et al., 2014) | | 20.7
Existing NMT systems
RNNsearch (Jean et al., 2015) | | 16.5
RNNsearch + unk replace (Jean et al., 2015) | | 19.0
RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al., 2015) | | 21.6
Our NMT systems
Base | 10.6 | 11.3
Base + reverse | 9.9 | 12.6 (+1.3)
Base + reverse + dropout | 8.1 | 14.0 (+1.4)
Base + reverse + dropout + global attention (location) | 7.3 | 16.8 (+2.8)
Base + reverse + dropout + global attention (location) + feed input | 6.4 | 18.1 (+1.3)
Base + reverse + dropout + local-p attention (general) + feed input | 5.9 | 19.0 (+0.9)
Base + reverse + dropout + local-p attention (general) + feed input + unk replace | | 20.9 (+1.9)
Ensemble 8 models + unk replace | | 23.0 (+2.1)

Table 1: WMT’14 English-German results – shown are the perplexities (ppl) and the tokenized BLEU scores of various systems on newstest2014. We highlight the best system in bold and give progressive improvements in italic between consecutive systems. local-p refers to the local attention with predictive alignments. We indicate for each attention model the alignment score function used in parentheses.

4.2 English-German Results

We compare our NMT systems in the English-German task with various other systems. These include the winning system in WMT’14 (Buck et al., 2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. For end-to-end NMT systems, to the best of our knowledge, (Jean et al., 2015) is the only work experimenting with this language pair and is currently the SOTA system. We only present results for some of our attention models and will analyze the rest later in Section 5.

As shown in Table 1, we achieve progressive improvements when (a) reversing the source sentence, +1.3 BLEU, as proposed in (Sutskever et al., 2014) and (b) using dropout, +1.4 BLEU. On top of that, (c) the global attention approach gives a significant boost of +2.8 BLEU, making our model slightly better than the base attentional system of Bahdanau et al. (2015) (row RNNsearch). When (d) using the input-feeding approach, we seize another notable gain of +1.3 BLEU and outperform their system. The local attention model with predictive alignments (row local-p) proves to be even better, giving us a further improvement of +0.9 BLEU on top of the global attention model. It is interesting to observe the trend previously reported in (Luong et al., 2015) that perplexity strongly correlates with translation quality. In total, we achieve a significant gain of 5.0 BLEU points over the non-attentional baseline, which already includes known techniques such as source reversing and dropout.

The unknown replacement technique proposed in (Luong et al., 2015; Jean et al., 2015) yields another nice gain of +1.9 BLEU, demonstrating that our attentional models do learn useful alignments for unknown words. Finally, by ensembling 8 different models of various settings, e.g., using different attention approaches, with and without dropout, etc., we were able to achieve a new SOTA result of 23.0 BLEU, outperforming the existing
best system (Jean et al., 2015) by +1.4 BLEU.

System | BLEU
Top – NMT + 5-gram rerank (Montreal) | 24.9
Our ensemble 8 models + unk replace | 25.9

Table 2: WMT’15 English-German results – NIST BLEU scores of the winning entry in WMT’15 and our best one on newstest2015.

Latest results in WMT’15 – despite the fact that our models were trained on WMT’14 with slightly less data, we test them on newstest2015 to demonstrate that they can generalize well to different test sets. As shown in Table 2, our best system establishes a new SOTA performance of 25.9 BLEU, outperforming the existing best system, backed by NMT and a 5-gram LM reranker, by +1.0 BLEU.

4.3 German-English Results

We carry out a similar set of experiments for the German-English direction, with large and progressive gains in terms of BLEU as illustrated in Table 3. The attentional mechanism gives us a +2.2 BLEU gain and, on top of that, we obtain another boost of up to +1.0 BLEU from the input-feeding approach. Using a better alignment function, the content-based dot product one, together with dropout yields another gain of +2.7 BLEU. Lastly, when applying the unknown word replacement technique, we seize an additional +2.1 BLEU, demonstrating the usefulness of attention in aligning rare words.

System | Ppl. | BLEU
WMT’15 systems
SOTA – phrase-based (Edinburgh) | | 29.2
NMT + 5-gram rerank (MILA) | | 27.6
Our NMT systems
Base (reverse) | 14.3 | 16.9
+ global (location) | 12.7 | 19.1 (+2.2)
+ global (location) + feed | 10.9 | 20.1 (+1.0)
+ global (dot) + drop + feed | 9.7 | 22.8 (+2.7)
+ global (dot) + drop + feed + unk | | 24.9 (+2.1)

Table 3: WMT’15 German-English results – performances of various systems (similar to Table 1). The base system already includes source reversing, on which we add global attention, dropout, input feeding, and unk replacement.

5 Analysis

We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.

5.1 Learning curves

Figure 5: Learning curves – test cost (ln perplexity) on newstest2014 for English-German NMTs as training progresses. Systems compared: basic, basic+reverse, basic+reverse+dropout, basic+reverse+dropout+globalAttn, basic+reverse+dropout+globalAttn+feedInput, and basic+reverse+dropout+pLocalAttn+feedInput.

In Figure 5, the dropout model (the + curve) learns slower than other non-dropout models, but, as time goes by, it becomes more robust in terms of minimizing test errors.

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sentences of similar lengths together and compute a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets.
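The per-length-bucket evaluation above can be sketched as follows (our own illustration; the bucket edges mirror the figure's x-axis, and a toy unigram-precision scorer stands in for real BLEU, which the paper computes with multi-bleu.perl):

```python
from collections import defaultdict

def bucket_by_length(triples, edges=(10, 20, 30, 40, 50, 60, 70)):
    """Group (source, hypothesis, reference) triples by source length:
    each sentence goes into the smallest bucket edge that covers it;
    anything longer than the last edge is lumped into the last bucket."""
    buckets = defaultdict(list)
    for src, hyp, ref in triples:
        n = len(src.split())
        key = next((e for e in edges if n <= e), edges[-1])
        buckets[key].append((hyp, ref))
    return buckets

def unigram_precision(items):
    """Toy stand-in for BLEU: fraction of hypothesis tokens that also
    appear in the reference, pooled over the whole bucket."""
    match = total = 0
    for hyp, ref in items:
        ref_tokens = set(ref.split())
        hyp_tokens = hyp.split()
        match += sum(t in ref_tokens for t in hyp_tokens)
        total += len(hyp_tokens)
    return match / total if total else 0.0

def score_per_bucket(triples):
    """One quality score per length bucket, as plotted in the length analysis."""
    return {k: unigram_precision(v)
            for k, v in sorted(bucket_by_length(triples).items())}
```

Plotting the returned dictionary (bucket edge on the x-axis, score on the y-axis) reproduces the shape of the length-analysis figure for any scorer plugged in place of the toy one.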
Figure 6: Length Analysis – translation qualities of different systems as sentences become longer (BLEU plotted against sentence-length buckets from 10 to 70).

5.3 Choices of Attentional Architectures

System | Ppl | BLEU (before) | BLEU (after unk)
global (location) | 6.4 | 18.1 | 19.3 (+1.2)
global (dot) | 6.1 | 18.6 | 20.5 (+1.9)
global (general) | 6.1 | 17.3 | 19.1 (+1.8)
local-m (dot) | >7.0 | x | x
local-m (general) | 6.2 | 18.6 | 20.4 (+1.8)
local-p (dot) | 6.6 | 18.0 | 19.6 (+1.9)
local-p (general) | 5.9 | 19.0 | 20.9 (+1.9)

Table 4: Attentional Architectures – performances of different attentional models. We trained two local-m (dot) models; both have ppl > 7.0.

The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions.14 For content-based functions, our implementation of concat does not yield good performances, and more analysis should be done to understand the reason.15 It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.

14 There is a subtle difference in how we retrieve alignments for the different alignment functions. At time step t, in which we receive yt−1 as input and then compute ht, at, ct, and h̃t before predicting yt, the alignment vector at is used as alignment weights for (a) the predicted word yt in the location-based alignment functions and (b) the input word yt−1 in the content-based functions.
15 With concat, the perplexities achieved by different models are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such high perplexities could be due to the fact that we simplify the matrix Wa to set the part that corresponds to h̄s to identity.

5.4 Alignment Quality

A by-product of attentional models are word alignments. While (Bahdanau et al., 2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.

Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we “force” decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).16

Table 6: AER scores – results of various models on the RWTH English-German alignment data.

We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, suggesting the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.

16 We concatenate the 508 sentence pairs with 1M sentence pairs from WMT and run the Berkeley aligner.

5.5 Sample Translations

We show in Table 5 sample translations in both directions. It is appealing to observe the effect of attentional models in correctly translating names such as “Miranda Kerr” and “Roger Dow”. Non-attentional models, while producing sensible names from a language model perspective, lack the direct connections from the source side to make correct translations. We also observed an interesting case in the second example, which requires translating the doubly-negated phrase, “not incompatible”. The attentional model correctly produces “nicht . . . unvereinbar”; whereas the non-attentional model generates “nicht verein-
English-German translations
src Orlando Bloom and Miranda Kerr still love each other
ref Orlando Bloom und Miranda Kerr lieben sich noch immer
best Orlando Bloom und Miranda Kerr lieben einander noch immer .
base Orlando Bloom und Lucas Miranda lieben einander noch immer .
src ′′ We ′ re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible
with safety and security , ′′ said Roger Dow , CEO of the U.S. Travel Association .
ref “ Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Wider-
spruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association .
best ′′ Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und
Sicherheit unvereinbar ist ′′ , sagte Roger Dow , CEO der US - die .
base ′′ Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit
Sicherheit und Sicherheit ′′ , sagte Roger Cameron , CEO der US - <unk> .
German-English translations
src In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
ref However , in an interview , Bloom has said that he and Kerr still love each other .
best In an interview , however , Bloom said that he and Kerr still love .
base However , in an interview , Bloom said that he and Tina were still <unk> .
src Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in
Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhal-
ten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt
Europa sei zu weit gegangen
ref The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket
imposed on national economies through adherence to the common currency , has led many people
to think Project Europe has gone too far .
best Because of the strict austerity measures imposed by Berlin and the European Central Bank in
connection with the straitjacket in which the respective national economy is forced to adhere to
the common currency , many people believe that the European project has gone too far .
base Because of the pressure imposed by the European Central Bank and the Federal Central Bank
with the strict austerity imposed on the national economy in the face of the single currency ,
many people believe that the European project has gone too far .
Table 5: Sample translations – for each example, we show the source (src), the human translation (ref),
the translation from our best model (best), and the translation of a non-attentional model (base). We
italicize some correct translation segments and highlight a few wrong ones in bold.
bar”, meaning “not compatible”.17 The attentional model also demonstrates its superiority in translating long sentences, as in the last example.

6 Conclusion

In this paper, we propose two simple and effective attentional mechanisms for neural machine translation: the global approach, which always looks at all source positions, and the local one, which only attends to a subset of source positions at a time. We test the effectiveness of our models in the WMT translation tasks between English and German in both directions. Our local attention yields large gains of up to 5.0 BLEU over non-attentional models which already incorporate known techniques such as dropout. For the English-to-German translation direction, our ensemble model has established new state-of-the-art results for both WMT’14 and WMT’15, outperforming existing best systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU.

We have compared various alignment functions and shed light on which functions are best for which attentional models. Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences.

17 The reference uses a fancier translation of “incompatible”, namely “im Widerspruch zu etwas stehen”. Both models, however, failed to translate “passenger experience”.

Acknowledgment

We gratefully acknowledge support from a gift from Bloomberg L.P. and the support of NVIDIA
Corporation with the donation of Tesla K40 GPUs. We thank Andrew Ng and his group, as well as the Stanford Research Computing team, for letting us use their computing resources. We thank Russell Stewart for helpful discussions on the models. Lastly, we thank Quoc Le, Ilya Sutskever, Oriol Vinyals, Richard Socher, Michael Kayser, Jiwei Li, Panupong Pasupat, Kelvin Guu, members of the Stanford NLP Group, and the anonymous reviewers for their valuable comments and feedback.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
Figure 7: Alignment visualizations – shown are images of the attention weights learned by various models: (top left) global, (top right) local-m, and (bottom left) local-p. The gold alignments are displayed at the bottom right corner. (Example sentence pair: “They do not understand why Europe exists in theory but not in reality.” / “Sie verstehen nicht, warum Europa theoretisch zwar existiert, aber nicht in Wirklichkeit.”)
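The one-to-one alignment extraction and AER scoring described in Section 5.4 can be sketched as follows (our own illustration; the AER formula with sure links S and possible links P follows the standard definition, which the paper does not spell out):

```python
import numpy as np

def extract_one_to_one(attn):
    """Given a matrix attn[t, s] of attention weights (target word t over
    source word s), keep one link per target word: the source position
    with the highest weight, as done for the AER evaluation."""
    return {(t, int(attn[t].argmax())) for t in range(attn.shape[0])}

def aer(hyp, sure, possible):
    """Alignment error rate: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|),
    where A is the hypothesized link set, S the sure gold links, and P
    the possible gold links (assumed to include S). Lower is better."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```

On a perfect weight matrix the extracted links match the sure set exactly and the AER is 0; every wrong argmax both adds a spurious link and misses a sure one, pushing the score toward 1.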