
Effective Approaches to Attention-based Neural Machine Translation

Minh-Thang Luong Hieu Pham Christopher D. Manning


Computer Science Department, Stanford University, Stanford, CA 94305
{lmthang,hyhieu,manning}@stanford.edu

arXiv:1508.04025v5 [cs.CL] 20 Sep 2015

Abstract

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.1

Figure 1: Neural machine translation – a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.

1 Introduction

Neural Machine Translation (NMT) achieved state-of-the-art performances in large-scale translation tasks such as from English to French (Luong et al., 2015) and English to German (Jean et al., 2015). NMT is appealing since it requires minimal domain knowledge and is conceptually simple. The model by Luong et al. (2015) reads through all the source words until the end-of-sentence symbol <eos> is reached. It then starts emitting one target word at a time, as illustrated in Figure 1. NMT is often a large neural network that is trained in an end-to-end fashion and has the ability to generalize well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in the case of standard MT; hence, NMT has a small memory footprint. Lastly, implementing NMT decoders is easy unlike the highly intricate decoders in standard MT (Koehn et al., 2003).

In parallel, the concept of "attention" has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task (?), or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

1 All our code and models are publicly available at http://nlp.stanford.edu/projects/nmt.
In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time. The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed in (Xu et al., 2015): it is computationally less expensive than the global model or the soft attention; at the same time, unlike the hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train.2 Besides, we also examine various alignment functions for our attention-based models.

Experimentally, we demonstrate that both of our approaches are effective in the WMT translation tasks between English and German in both directions. Our attentional models yield a boost of up to 5.0 BLEU over non-attentional systems which already incorporate known techniques such as dropout. For English to German translation, we achieve new state-of-the-art (SOTA) results for both WMT'14 and WMT'15, outperforming previous SOTA systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU. We conduct extensive analysis to evaluate our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, alignment quality, and translation outputs.

2 Neural Machine Translation

A neural machine translation system is a neural network that directly models the conditional probability p(y|x) of translating a source sentence, x_1, ..., x_n, to a target sentence, y_1, ..., y_m.3 A basic form of NMT consists of two components: (a) an encoder which computes a representation s for each source sentence and (b) a decoder which generates one target word at a time and hence decomposes the conditional probability as:

\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)    (1)

A natural choice to model such a decomposition in the decoder is to use a recurrent neural network (RNN) architecture, which most of the recent NMT work such as (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015) have in common. They, however, differ in terms of which RNN architectures are used for the decoder and how the encoder computes the source sentence representation s.

Kalchbrenner and Blunsom (2013) used an RNN with the standard hidden unit for the decoder and a convolutional neural network for encoding the source sentence representation. On the other hand, both Sutskever et al. (2014) and Luong et al. (2015) stacked multiple layers of an RNN with a Long Short-Term Memory (LSTM) hidden unit for both the encoder and the decoder. Cho et al. (2014), Bahdanau et al. (2015), and Jean et al. (2015) all adopted a different version of the RNN with an LSTM-inspired hidden unit, the gated recurrent unit (GRU), for both components.4

In more detail, one can parameterize the probability of decoding each word y_j as:

p(y_j \mid y_{<j}, s) = \mathrm{softmax}(g(h_j))    (2)

with g being the transformation function that outputs a vocabulary-sized vector.5 Here, h_j is the RNN hidden unit, abstractly computed as:

h_j = f(h_{j-1}, s),    (3)

where f computes the current hidden state given the previous hidden state and can be either a vanilla RNN unit, a GRU, or an LSTM unit. In (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015), the source representation s is only used once to initialize the decoder hidden state. On the other hand, in (Bahdanau et al., 2015; Jean et al., 2015) and this work, s, in fact, implies a set of source hidden states which are consulted throughout the entire course of the translation process. Such an approach is referred to as an attention mechanism, which we will discuss next.

In this work, following (Sutskever et al., 2014; Luong et al., 2015), we use the stacking LSTM architecture for our NMT systems, as illustrated in Figure 1. We use the LSTM unit defined in (Zaremba et al., 2015). Our training objective is formulated as follows:

J_t = \sum_{(x,y) \in D} -\log p(y|x)    (4)

with D being our parallel training corpus.

2 There is a recent work by Gregor et al. (2015), which is very similar to our local attention and applied to the image generation task. However, as we detail later, our model is much simpler and can achieve good performance for NMT.
3 All sentences are assumed to terminate with a special "end-of-sentence" token <eos>.
4 They all used a single RNN layer except for the latter two works which utilized a bidirectional RNN for the encoder.
5 One can provide g with other inputs such as the currently predicted word y_j as in (Bahdanau et al., 2015).
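To ground Eqs. (1)–(4), the following is a minimal NumPy sketch (not the authors' MATLAB code) of how a non-attentional decoder scores one target sentence; `f_step` and the parameters are placeholder assumptions standing in for whatever recurrent unit and output transformation a concrete system uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sentence_log_prob(target_ids, s, f_step, W_g):
    """Sum of log p(y_j | y_<j, s), following Eqs. (1)-(3), as a sketch.

    target_ids : list of target word indices y_1 .. y_m
    s          : source representation, here only used to initialize the decoder
    f_step     : placeholder recurrent unit (vanilla RNN, GRU, or LSTM step)
    W_g        : (V, d) output transformation g producing a vocabulary-sized vector
    """
    h = s                               # non-attentional case: s initializes the state
    log_prob = 0.0
    for y_j in target_ids:
        h = f_step(h)                   # h_j = f(h_{j-1}, s)        (Eq. 3)
        p = softmax(W_g @ h)            # p(y_j | y_<j, s)           (Eq. 2)
        log_prob += np.log(p[y_j])      # accumulates the sum in Eq. (1)
    return log_prob
```

The training objective in Eq. (4) is then simply the negative of this quantity summed over all sentence pairs in the parallel corpus D.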
3 Attention-based Models

Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the "attention" is placed on all source positions or on only a few source positions. We illustrate these two model types in Figure 2 and 3 respectively.

Common to these two types of models is the fact that at each time step t in the decoding phase, both approaches first take as input the hidden state ht at the top layer of a stacking LSTM. The goal is then to derive a context vector ct that captures relevant source-side information to help predict the current target word yt. While these models differ in how the context vector ct is derived, they share the same subsequent steps.

Specifically, given the target hidden state ht and the source-side context vector ct, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows:

\tilde{h}_t = \tanh(W_c [c_t; h_t])    (5)

The attentional vector h̃t is then fed through the softmax layer to produce the predictive distribution formulated as:

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t)    (6)

We now detail how each model type computes the source-side context vector ct.

3.1 Global Attention

Figure 2: Global attentional model – at each time step t, the model infers a variable-length alignment weight vector at based on the current target state ht and all source states h̄s. A global context vector ct is then computed as the weighted average, according to at, over all the source states.

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector ct. In this model type, a variable-length alignment vector at, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state ht with each source hidden state h̄s:

a_t(s) = \mathrm{align}(h_t, \bar{h}_s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}    (7)

Here, score is referred to as a content-based function for which we consider three different alternatives:

\mathrm{score}(h_t, \bar{h}_s) =
  \begin{cases}
    h_t^\top \bar{h}_s & \text{dot} \\
    h_t^\top W_a \bar{h}_s & \text{general} \\
    v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & \text{concat}
  \end{cases}

Besides, in our early attempts to build attention-based models, we use a location-based function in which the alignment scores are computed from solely the target hidden state ht as follows:

a_t = \mathrm{softmax}(W_a h_t)    \text{(location)}    (8)

Given the alignment vector as weights, the context vector ct is computed as the weighted average over all the source hidden states.6

6 Eq. (8) implies that all alignment vectors at are of the same length. For short sentences, we only use the top part of at and for long sentences, we ignore words near the end.
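To make the computations in Eqs. (5)–(8) concrete, here is a minimal NumPy sketch of one global-attention decoding step. It is illustrative only (the paper's system is implemented in MATLAB); the parameter shapes are assumptions, and the concat and location variants are omitted for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_attention_step(h_t, H_src, W_a, W_c, W_s, score="general"):
    """One global-attention decoding step (Eqs. 5-7), written as a sketch.

    h_t   : (d,)     top-layer decoder hidden state at time t
    H_src : (S, d)   top-layer encoder states h_bar_s for all S source words
    W_a   : (d, d)   parameters of the 'general' score; unused for 'dot'
    W_c   : (d, 2d)  concatenation layer; W_s : (V, d) output layer
    """
    if score == "dot":                    # h_t^T h_bar_s
        scores = H_src @ h_t
    else:                                 # 'general': h_t^T W_a h_bar_s
        scores = H_src @ (W_a @ h_t)
    a_t = softmax(scores)                 # alignment weights, one per source word (Eq. 7)
    c_t = a_t @ H_src                     # context vector = weighted average of sources
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional state (Eq. 5)
    p_t = softmax(W_s @ h_tilde)                           # predictive distribution (Eq. 6)
    return a_t, c_t, h_tilde, p_t
```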
Comparison to (Bahdanau et al., 2015) – While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model. First, we simply use hidden states at the top LSTM layers in both the encoder and decoder as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking uni-directional decoder. Second, our computation path is simpler; we go from ht → at → ct → h̃t then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time t, Bahdanau et al. (2015) build from the previous hidden state ht−1 → at → ct → ht, which, in turn, goes through a deep-output and a maxout layer before making predictions.7 Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better.

3.2 Local Attention

Figure 3: Local attention model – the model first predicts a single aligned position pt for the current target word. A window centered around the source position pt is then used to compute a context vector ct, a weighted average of the source hidden states in the window. The weights at are inferred from the current target state ht and those source states h̄s in the window.

The global attention has a drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical to translate longer sequences, e.g., paragraphs or documents. To address this deficiency, we propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word.

This model takes inspiration from the tradeoff between the soft and hard attentional models proposed by Xu et al. (2015) to tackle the image caption generation task. In their work, soft attention refers to the global attention approach in which weights are placed "softly" over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.

Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has an advantage of avoiding the expensive computation incurred in the soft attention and at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position pt for each target word at time t. The context vector ct is then derived as a weighted average over the set of source hidden states within the window [pt − D, pt + D]; D is empirically selected.8 Unlike the global approach, the local alignment vector at is now fixed-dimensional, i.e., ∈ R^{2D+1}. We consider two variants of the model as below.

Monotonic alignment (local-m) – we simply set pt = t assuming that source and target sequences are roughly monotonically aligned. The alignment vector at is defined according to Eq. (7).9

Predictive alignment (local-p) – instead of assuming monotonic alignments, our model predicts an aligned position as follows:

p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t)),    (9)

Wp and vp are the model parameters which will be learned to predict positions. S is the source sentence length. As a result of sigmoid, pt ∈ [0, S]. To favor alignment points near pt, we place a Gaussian distribution centered around pt. Specifically, our alignment weights are now defined as:

a_t(s) = \mathrm{align}(h_t, \bar{h}_s) \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)    (10)

We use the same align function as in Eq. (7) and the standard deviation is empirically set as σ = D/2. Note that pt is a real number; whereas s is an integer within the window centered at pt.10

Comparison to (Gregor et al., 2015) – Gregor et al. (2015) have proposed a selective attention mechanism, very similar to our local attention, for the image generation task. Their approach allows the model to select an image patch of varying location and zoom. We, instead, use the same "zoom" for all target positions, which greatly simplifies the formulation and still achieves good performance.

7 We will refer to this difference again in Section 3.3.
8 If the window crosses the sentence boundaries, we simply ignore the outside part and consider words in the window.
9 local-m is the same as the global model except that the vector at is fixed-length and shorter.
10 local-p is similar to the local-m model except that we dynamically compute pt and use a truncated Gaussian distribution to modify the original alignment weights align(ht, h̄s) as shown in Eq. (10). By utilizing pt to derive at, we can compute backprop gradients for Wp and vp. This model is differentiable almost everywhere.
3.3 Input-feeding Approach

Figure 4: Input-feeding approach – Attentional vectors h̃t are fed as inputs to the next time steps to inform the model about past alignment decisions.

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. Whereas, in standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors h̃t are concatenated with inputs at the next time steps as illustrated in Figure 4.11 The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.

Comparison to other work – Bahdanau et al. (2015) use context vectors, similar to our ct, in building subsequent hidden states, which can also achieve the "coverage" effect. However, there has not been any analysis of whether such connections are useful as done in this work. Also, our approach is more general; as illustrated in Figure 4, it can be applied to general stacking recurrent architectures, including non-attentional models.

Xu et al. (2015) propose a doubly attentional approach with an additional constraint added to the training objective to make sure the model pays equal attention to all parts of the image during the caption generation process. Such a constraint can also be useful to capture the coverage set effect in NMT that we mentioned earlier. However, we chose to use the input-feeding approach since it provides flexibility for the model to decide on any attentional constraints it deems suitable.

11 If n is the number of LSTM cells, the input size of the first LSTM layer is 2n; those of subsequent layers are n.
4 Experiments

We evaluate the effectiveness of our models on the WMT translation tasks between English and German in both directions. newstest2013 (3000 sentences) is used as a development set to select our hyperparameters. Translation performances are reported in case-sensitive BLEU (Papineni et al., 2002) on newstest2014 (2737 sentences) and newstest2015 (2169 sentences). Following (Luong et al., 2015), we report translation quality using two types of BLEU: (a) tokenized12 BLEU to be comparable with existing NMT work and (b) NIST13 BLEU to be comparable with WMT results.

12 All texts are tokenized with tokenizer.perl and BLEU scores are computed with multi-bleu.perl.
13 With the mteval-v13a script as per WMT guideline.

4.1 Training Details

All our models are trained on the WMT'14 training data consisting of 4.5M sentence pairs (116M English words, 110M German words). Similar to (Jean et al., 2015), we limit our vocabularies to be the top 50K most frequent words for both languages. Words not in these shortlisted vocabularies are converted into a universal token <unk>.

When training our NMT systems, following (Bahdanau et al., 2015; Jean et al., 2015), we filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed. Our stacking LSTM models have 4 layers, each with 1000 cells, and 1000-dimensional embeddings. We follow (Sutskever et al., 2014; Luong et al., 2015) in training NMT with similar settings: (a) our parameters are uniformly initialized in [−0.1, 0.1], (b) we train for 10 epochs using plain SGD, (c) a simple learning rate schedule is employed – we start with a learning rate of 1; after 5 epochs, we begin to halve the learning rate every epoch, (d) our mini-batch size is 128, and (e) the normalized gradient is rescaled whenever its norm exceeds 5. Additionally, we also use dropout with probability 0.2 for our LSTMs as suggested by (Zaremba et al., 2015). For dropout models, we train for 12 epochs and start halving the learning rate after 8 epochs. For local attention models, we empirically set the window size D = 10.
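As a compact illustration of settings (b)–(e) above, the loop below sketches the SGD schedule and gradient rescaling. It is a sketch only: `compute_loss_and_grads` and the parameter dictionary are placeholders, not the authors' MATLAB implementation.

```python
import numpy as np

def train(params, batches, compute_loss_and_grads,
          epochs=10, lr=1.0, halve_after=5, max_norm=5.0):
    """Plain SGD with epoch-wise learning-rate halving and gradient rescaling.

    params : dict of NumPy parameter arrays
    compute_loss_and_grads : placeholder for the model's forward/backward pass
    """
    for epoch in range(1, epochs + 1):
        for batch in batches:                    # mini-batch size 128 in the paper
            loss, grads = compute_loss_and_grads(params, batch)
            # Rescale the gradient whenever its norm exceeds max_norm.
            norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
            scale = max_norm / norm if norm > max_norm else 1.0
            for name in params:
                params[name] -= lr * scale * grads[name]
        if epoch >= halve_after:                 # after epoch 5, halve the rate every epoch
            lr *= 0.5
    return params
```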
Our code is implemented in MATLAB. When running on a single GPU device Tesla K40, we achieve a speed of 1K target words per second. It takes 7–10 days to completely train a model.

4.2 English-German Results

We compare our NMT systems in the English-German task with various other systems. These include the winning system in WMT'14 (Buck et al., 2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. For end-to-end NMT systems, to the best of our knowledge, (Jean et al., 2015) is the only work experimenting with this language pair and currently the SOTA system. We only present results for some of our attention models and will later analyze the rest in Section 5.

System                                                                              Ppl    BLEU
Winning WMT'14 system – phrase-based + large LM (Buck et al., 2014)                        20.7
Existing NMT systems
RNNsearch (Jean et al., 2015)                                                              16.5
RNNsearch + unk replace (Jean et al., 2015)                                                19.0
RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al., 2015)              21.6
Our NMT systems
Base                                                                                10.6   11.3
Base + reverse                                                                       9.9   12.6 (+1.3)
Base + reverse + dropout                                                             8.1   14.0 (+1.4)
Base + reverse + dropout + global attention (location)                               7.3   16.8 (+2.8)
Base + reverse + dropout + global attention (location) + feed input                  6.4   18.1 (+1.3)
Base + reverse + dropout + local-p attention (general) + feed input                  5.9   19.0 (+0.9)
Base + reverse + dropout + local-p attention (general) + feed input + unk replace           20.9 (+1.9)
Ensemble 8 models + unk replace                                                             23.0 (+2.1)

Table 1: WMT'14 English-German results – shown are the perplexities (ppl) and the tokenized BLEU scores of various systems on newstest2014. We highlight the best system in bold and give progressive improvements in italics between consecutive systems. local-p refers to the local attention with predictive alignments. We indicate for each attention model the alignment score function used in parentheses.

As shown in Table 1, we achieve progressive improvements when (a) reversing the source sentence, +1.3 BLEU, as proposed in (Sutskever et al., 2014) and (b) using dropout, +1.4 BLEU. On top of that, (c) the global attention approach gives a significant boost of +2.8 BLEU, making our model slightly better than the base attentional system of Bahdanau et al. (2015) (row RNNsearch). When (d) using the input-feeding approach, we seize another notable gain of +1.3 BLEU and outperform their system. The local attention model with predictive alignments (row local-p) proves to be even better, giving us a further improvement of +0.9 BLEU on top of the global attention model. It is interesting to observe the trend previously reported in (Luong et al., 2015) that perplexity strongly correlates with translation quality. In total, we achieve a significant gain of 5.0 BLEU points over the non-attentional baseline, which already includes known techniques such as source reversing and dropout.

The unknown replacement technique proposed in (Luong et al., 2015; Jean et al., 2015) yields another nice gain of +1.9 BLEU, demonstrating that our attentional models do learn useful alignments for unknown words. Finally, by ensembling 8 different models of various settings, e.g., using different attention approaches, with and without dropout etc., we were able to achieve a new SOTA result of 23.0 BLEU, outperforming the existing best system (Jean et al., 2015) by +1.4 BLEU.
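The unknown-replacement post-processing can be sketched as follows: each <unk> in the output is replaced by the source word that receives the highest attention weight at that step, optionally looked up in a bilingual dictionary. This is a simplified sketch of the general technique of Luong et al. (2015) and Jean et al. (2015), not their exact procedure; `dictionary` is a hypothetical word-translation table.

```python
def replace_unks(target_words, source_words, attn_weights, dictionary=None):
    """Post-process a translation by replacing <unk> tokens.

    target_words : list of output tokens, possibly containing '<unk>'
    source_words : list of source tokens
    attn_weights : one weight vector per output token (length = len(source_words)),
                   e.g. the alignment vectors a_t from the attention sketches above
    dictionary   : optional {source word: translation} table
    """
    out = []
    for word, weights in zip(target_words, attn_weights):
        if word == "<unk>":
            aligned = source_words[max(range(len(weights)), key=lambda s: weights[s])]
            word = (dictionary or {}).get(aligned, aligned)   # translate or copy
        out.append(word)
    return out
```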
System                                  BLEU
Top – NMT + 5-gram rerank (Montreal)    24.9
Our ensemble 8 models + unk replace     25.9

Table 2: WMT'15 English-German results – NIST BLEU scores of the winning entry in WMT'15 and our best one on newstest2015.

Latest results in WMT'15 – despite the fact that our models were trained on WMT'14 with slightly less data, we test them on newstest2015 to demonstrate that they can generalize well to different test sets. As shown in Table 2, our best system establishes a new SOTA performance of 25.9 BLEU, outperforming the existing best system backed by NMT and a 5-gram LM reranker by +1.0 BLEU.

4.3 German-English Results

System                                Ppl.   BLEU
WMT'15 systems
SOTA – phrase-based (Edinburgh)              29.2
NMT + 5-gram rerank (MILA)                   27.6
Our NMT systems
Base (reverse)                        14.3   16.9
+ global (location)                   12.7   19.1 (+2.2)
+ global (location) + feed            10.9   20.1 (+1.0)
+ global (dot) + drop + feed           9.7   22.8 (+2.7)
+ global (dot) + drop + feed + unk           24.9 (+2.1)

Table 3: WMT'15 German-English results – performances of various systems (similar to Table 1). The base system already includes source reversing on which we add global attention, dropout, input feeding, and unk replacement.

We carry out a similar set of experiments for the WMT'15 translation task from German to English. While our systems have not yet matched the performance of the SOTA system, we nevertheless show the effectiveness of our approaches with large and progressive gains in terms of BLEU as illustrated in Table 3. The attentional mechanism gives us +2.2 BLEU gain and on top of that, we obtain another boost of up to +1.0 BLEU from the input-feeding approach. Using a better alignment function, the content-based dot product one, together with dropout yields another gain of +2.7 BLEU. Lastly, when applying the unknown word replacement technique, we seize an additional +2.1 BLEU, demonstrating the usefulness of attention in aligning rare words.

5 Analysis

We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.

5.1 Learning curves

Figure 5: Learning curves – test cost (ln perplexity) on newstest2014 for English-German NMTs as training progresses. [The plotted systems are basic, basic+reverse, basic+reverse+dropout, basic+reverse+dropout+globalAttn, basic+reverse+dropout+globalAttn+feedInput, and basic+reverse+dropout+pLocalAttn+feedInput.]

We compare models built on top of one another as listed in Table 1. It is pleasant to observe in Figure 5 a clear separation between non-attentional and attentional models. The input-feeding approach and the local attention model also demonstrate their abilities in driving the test costs lower. The non-attentional model with dropout (the blue + curve) learns slower than other non-dropout models, but as time goes by, it becomes more robust in terms of minimizing test errors.

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sentences of similar lengths together and compute a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets.

Figure 6: Length Analysis – translation qualities of different systems as sentences become longer. [The compared systems are ours without attention (BLEU 13.9), ours with local-p attention (BLEU 20.9), our best system (BLEU 23.0), the WMT'14 best entry (BLEU 20.7), and Jean et al. (2015) (BLEU 21.6).]
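The length analysis of Section 5.2 can be reproduced with a simple bucketing loop like the one below. This is a sketch: `corpus_bleu` stands in for any BLEU implementation (e.g. a wrapper around multi-bleu.perl or an equivalent library), and the bucket width is an assumption.

```python
from collections import defaultdict

def bleu_by_length(sources, references, hypotheses, corpus_bleu, bucket=10):
    """Group test sentences by source length and score each group separately.

    corpus_bleu : placeholder callable (refs, hyps) -> BLEU score
    bucket      : bucket width in words, e.g. 10 gives groups <10, 10-19, ...
    """
    groups = defaultdict(lambda: ([], []))
    for src, ref, hyp in zip(sources, references, hypotheses):
        refs, hyps = groups[len(src.split()) // bucket]
        refs.append(ref)
        hyps.append(hyp)
    return {b * bucket: corpus_bleu(refs, hyps)
            for b, (refs, hyps) in sorted(groups.items())}
```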
5.3 Choices of Attentional Architectures

We examine different attention models (global, local-m, local-p) and different alignment functions (location, dot, general, concat) as described in Section 3. Due to limited resources, we cannot run all the possible combinations. However, results in Table 4 do give us some idea about different choices.

                          BLEU
System             Ppl    Before   After unk
global (location)   6.4   18.1     19.3 (+1.2)
global (dot)        6.1   18.6     20.5 (+1.9)
global (general)    6.1   17.3     19.1 (+1.8)
local-m (dot)      >7.0   x        x
local-m (general)   6.2   18.6     20.4 (+1.8)
local-p (dot)       6.6   18.0     19.6 (+1.9)
local-p (general)   5.9   19.0     20.9 (+1.9)

Table 4: Attentional Architectures – performances of different attentional models. We trained two local-m (dot) models; both have ppl > 7.0.

The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions.14 For content-based functions, our implementation concat does not yield good performances and more analysis should be done to understand the reason.15 It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.

14 There is a subtle difference in how we retrieve alignments for the different alignment functions. At time step t in which we receive yt−1 as input and then compute ht, at, ct, and h̃t before predicting yt, the alignment vector at is used as alignment weights for (a) the predicted word yt in the location-based alignment functions and (b) the input word yt−1 in the content-based functions.
15 With concat, the perplexities achieved by different models are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such high perplexities could be due to the fact that we simplify the matrix Wa to set the part that corresponds to h̄s to identity.

5.4 Alignment Quality

A by-product of attentional models are word alignments. While (Bahdanau et al., 2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.

Method              AER
global (location)   0.39
local-m (general)   0.34
local-p (general)   0.36
ensemble            0.34
Berkeley Aligner    0.32

Table 6: AER scores – results of various models on the RWTH English-German alignment data.

Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we "force" decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).16

We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, suggesting the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.

16 We concatenate the 508 sentence pairs with 1M sentence pairs from WMT and run the Berkeley aligner.
5.5 Sample Translations

English-German translations
src   Orlando Bloom and Miranda Kerr still love each other
ref   Orlando Bloom und Miranda Kerr lieben sich noch immer
best  Orlando Bloom und Miranda Kerr lieben einander noch immer .
base  Orlando Bloom und Lucas Miranda lieben einander noch immer .
src   " We 're pleased the FAA recognizes that an enjoyable passenger experience is not incompatible with safety and security , " said Roger Dow , CEO of the U.S. Travel Association .
ref   " Wir freuen uns , dass die FAA erkennt , dass ein angenehmes Passagiererlebnis nicht im Widerspruch zur Sicherheit steht " , sagte Roger Dow , CEO der U.S. Travel Association .
best  " Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die .
base  " Wir freuen uns über die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - <unk> .
German-English translations
src   In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
ref   However , in an interview , Bloom has said that he and Kerr still love each other .
best  In an interview , however , Bloom said that he and Kerr still love .
base  However , in an interview , Bloom said that he and Tina were still <unk> .
src   Wegen der von Berlin und der Europäischen Zentralbank verhängten strengen Sparpolitik in Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhalten an der gemeinsamen Währung genötigt wird , sind viele Menschen der Ansicht , das Projekt Europa sei zu weit gegangen
ref   The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket imposed on national economies through adherence to the common currency , has led many people to think Project Europe has gone too far .
best  Because of the strict austerity measures imposed by Berlin and the European Central Bank in connection with the straitjacket in which the respective national economy is forced to adhere to the common currency , many people believe that the European project has gone too far .
base  Because of the pressure imposed by the European Central Bank and the Federal Central Bank with the strict austerity imposed on the national economy in the face of the single currency , many people believe that the European project has gone too far .

Table 5: Sample translations – for each example, we show the source (src), the human translation (ref), the translation from our best model (best), and the translation of a non-attentional model (base). We italicize some correct translation segments and highlight a few wrong ones in bold.

We show in Table 5 sample translations in both directions. It is appealing to observe the effect of attentional models in correctly translating names such as "Miranda Kerr" and "Roger Dow". Non-attentional models, while producing sensible names from a language model perspective, lack the direct connections from the source side to make correct translations. We also observed an interesting case in the second example, which requires translating the doubly-negated phrase, "not incompatible". The attentional model correctly produces "nicht ... unvereinbar"; whereas the non-attentional model generates "nicht vereinbar", meaning "not compatible".17 The attentional model also demonstrates its superiority in translating long sentences as in the last example.

17 The reference uses a more fancy translation of "incompatible", which is "im Widerspruch zu etwas stehen". Both models, however, failed to translate "passenger experience".

6 Conclusion

In this paper, we propose two simple and effective attentional mechanisms for neural machine translation: the global approach which always looks at all source positions and the local one that only attends to a subset of source positions at a time. We test the effectiveness of our models in the WMT translation tasks between English and German in both directions. Our local attention yields large gains of up to 5.0 BLEU over non-attentional models which already incorporate known techniques such as dropout. For the English to German translation direction, our ensemble model has established new state-of-the-art results for both WMT'14 and WMT'15, outperforming existing best systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU.

We have compared various alignment functions and shed light on which functions are best for which attentional models. Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences.
Acknowledgment

We gratefully acknowledge support from a gift from Bloomberg L.P. and the support of NVIDIA Corporation with the donation of Tesla K40 GPUs. We thank Andrew Ng and his group as well as the Stanford Research Computing for letting us use their computing resources. We thank Russell Stewart for helpful discussions on the models. Lastly, we thank Quoc Le, Ilya Sutskever, Oriol Vinyals, Richard Socher, Michael Kayser, Jiwei Li, Panupong Pasupat, Kelvin Guu, members of the Stanford NLP Group and the anonymous reviewers for their valuable comments and feedback.

References

[Bahdanau et al.2015] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[Buck et al.2014] Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. N-gram counts and language models from the common crawl. In LREC.
[Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[Fraser and Marcu2007] Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.
[Gregor et al.2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICML.
[Jean et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.
[Kalchbrenner and Blunsom2013] N. Kalchbrenner and P. Blunsom. 2013. Recurrent continuous translation models. In EMNLP.
[Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL.
[Liang et al.2006] P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In NAACL.
[Luong et al.2015] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.
[Mnih et al.2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.
[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
[Sutskever et al.2014] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[Zaremba et al.2015] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2015. Recurrent neural network regularization. In ICLR.

A Alignment Visualization

We visualize the alignment weights produced by our different attention models in Figure 7. The visualization of the local attention model is much sharper than that of the global one. This contrast matches our expectation that local attention is designed to only focus on a subset of words each time. Also, since we translate from English to German and reverse the source English sentence, the white strides at the words "reality" and "." in the global attention model reveal an interesting access pattern: it tends to refer back to the beginning of the source sequence.

Compared to the alignment visualizations in (Bahdanau et al., 2015), our alignment patterns are not as sharp as theirs. Such difference could possibly be due to the fact that translating from English to German is harder than translating into French as done in (Bahdanau et al., 2015), which is an interesting point to examine in future work.
[Alignment-matrix images omitted; the German side of the example sentence shown in Figure 7 reads "Sie verstehen nicht , warum Europa theoretisch zwar existiert , aber nicht in Wirklichkeit ."]
Figure 7: Alignment visualizations – shown are images of the attention weights learned by various
models: (top left) global, (top right) local-m, and (bottom left) local-p. The gold alignments are displayed
at the bottom right corner.
