Neural Machine Translation With Deep Attention.
Abstract—Deepening neural models has been proven very successful in improving the model’s capacity when solving complex
learning tasks, such as machine translation. Previous efforts on deep neural machine translation mainly focus on the encoder
and the decoder, while paying little attention to the attention mechanism. However, the attention mechanism is of vital importance for inducing the
translation correspondence between different languages, a task for which shallow attention networks are relatively insufficient, especially when the
encoder and decoder are deep. In this paper, we propose a deep attention model (DeepAtt). Based on the low-level attention
information, DeepAtt is capable of automatically determining what should be passed or suppressed from the corresponding encoder
layer so as to make the distributed representation appropriate for high-level attention and translation. We conduct experiments on NIST
Chinese-English, WMT English-German, and WMT English-French translation tasks, where, with five attention layers, DeepAtt yields
very competitive performance against the state-of-the-art results. We empirically find that with an adequate increase of attention layers,
DeepAtt tends to produce more accurate attention weights. An in-depth analysis on the translation of important context words further
reveals that DeepAtt significantly improves the faithfulness of system translations.
Index Terms—Deep attention network, neural machine translation (NMT), attention-based sequence-to-sequence learning,
natural language processing
1 INTRODUCTION
Fig. 1. Illustration of the proposed DeepAtt. We use blue and red to indicate the source and target side, respectively. Yellow and gray denote the information flow for target word prediction and attention, respectively. Notice that we draw the encoder on the right and the decoder on the left for clarity.
results show that the translation quality of important context words (e.g., noun, verb, adjective) is indeed improved. On the WMT14 English-German and English-French (using the 12M corpus only) tasks, our single model, with 5 attention layers, achieves a BLEU score of 24.73 and 38.56 respectively, both comparable to the state-of-the-art.

Our main contributions are summarized as follows:

- We propose a novel deep attention mechanism which operates in a hierarchical manner and allows the encoder to interact with the decoder layer by layer. The hierarchical architecture ensures that source-side semantics can be fully utilized to generate the next target word. The layer-wise interaction, on the other hand, enables the decoder to selectively pick essential untranslated source words for the prediction of the next target word.

- We develop a novel deep encoding schema which alternates forward and backward RNNs with skip connections to the source input at each layer. The alternation helps capture more accurate source semantics via integrating both history and future source-side information. The skip connection, on the other hand, makes the gradient propagation more fluent so as to enable feasible model optimization.

- We conducted a series of experiments on NIST Chinese-English, WMT14 English-German and WMT14 English-French translation tasks. The proposed model yields consistent and significant improvements over several strong baselines, and achieves results comparable to various state-of-the-art NMT systems.

2 RELATED WORK

Our work is closely related with two lines of research: the attention mechanism and deep NMT.

Observing that the use of a fixed-length vector is insufficient in summarizing source-side semantics, Bahdanau et al. [2] propose the attention mechanism. Luong et al. [7] explore several efficient architectures for this mechanism, introducing the global and local attention models. Zhang et al. [8] propose that the recurrent neural network can be used as an alternative to the attention network. Recently, Zhang et al. [9] introduce a GRU gate to the attention model so as to improve the discriminative ability of the learned attention vectors. Vaswani et al. [10] propose a multi-head attention network with the scaled-dot operation, expecting each head to capture a particular aspect of the source-target interaction. Zhang et al. [11] develop an average attention model which greatly simplifies the decoder-side self-attention mechanism using solely a cumulative average operation. Rather than developing more flexible attention models, we treat these existing models as our basic unit. Although we employ the model of Bahdanau et al. [2] in our experiments, our DeepAtt can be easily adapted to other attention models.

Based on the success achieved in computer vision [12], [13], deep neural networks have attracted intense interest in the NMT community, such as [3], [4], [5]. These studies differ significantly from ours in the following two aspects. First, their major focus is to enable flexible optimization, since training a deep neural network is very difficult. To this end, Wu et al. [3] leverage the residual connection; Zhou et al. [4] propose a fast-forward connection, while Wang et al. [5] introduce a linear associative unit. Second, their deep architecture lies in the encoder and the decoder, ignoring the single-layered attention network, which is still shallow. In contrast, our model is also deep in the attention network, making the deep encoder and deep decoder couple more tightly and the training more flexible.

Our work brings these two lines of research together. In this respect, Yang et al. [14] propose stacked attention networks to learn to answer natural language questions from images.
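To make the layer-wise encoder-decoder interaction described above concrete, the sketch below gives one minimal reading of a DeepAtt-style step in NumPy: decoder layer k attends, with Bahdanau-style additive attention [2], to the output of encoder layer k, and the resulting context is available to the higher layers. The tensor names and shapes are illustrative assumptions only; the gating that decides what is passed or suppressed, the alternating forward/backward encoder with skip connections, and all training details are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style additive attention [2] for a single query vector.
    keys: array of shape (src_len, d_enc); returns (context, weights)."""
    scores = np.array([v @ np.tanh(W_q @ query + W_k @ key) for key in keys])
    weights = softmax(scores)        # attention distribution over source positions
    context = weights @ keys         # weighted source summary
    return context, weights

def deep_attention_step(dec_states, enc_layers, attn_params):
    """One target-position step of a K-layer, layer-wise attention decoder (sketch).

    dec_states  : list of K decoder hidden states (one per decoder layer)
    enc_layers  : list of K encoder output sequences; decoder layer k
                  attends only to enc_layers[k] (the layer-wise coupling)
    attn_params : list of K (W_q, W_k, v) parameter triples
    """
    contexts = []
    for s_k, H_k, (W_q, W_k, v) in zip(dec_states, enc_layers, attn_params):
        c_k, _ = additive_attention(s_k, H_k, W_q, W_k, v)
        contexts.append(c_k)   # low-level contexts inform the higher layers
    return contexts            # the top-level context would feed the output softmax
```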
TABLE 1
Case-Insensitive BLEU Scores on the Chinese-English Translation Task

System          MT05   MT02   MT03   MT04   MT06   MT08   ALL
Moses           31.70  33.61  32.63  34.36  31.00  23.96  31.03
RNNSearch [2]   34.72  37.95  35.23  37.32  33.56  26.12  34.06
DeepAtt-1       36.44  40.12  37.63  39.83  35.44  27.34  36.12*++
DeepAtt-2       36.90  39.71  37.79  39.93  35.95  27.87  36.34*++
DeepAtt-3       36.75  40.53  38.12  40.14  36.14  28.12  36.65*++
DeepAtt-4       37.87  40.99  39.10  40.77  37.14  28.44  37.34*++
DeepAtt-5       38.82  41.00  39.07  41.09  37.37  28.52  37.50*++
DeepAtt-6       38.29  41.40  39.23  40.66  37.20  28.99  37.51*++

DeepAtt-k: the proposed model using "k" attention layers, "k" encoder layers, and "k" decoder layers (i.e., K = "k"). RNNSearch: a vanilla NMT system using a 1-layer encoder and a 1-layer decoder with 1-layer attention. ALL = total BLEU score on all test sets. We highlight the best results in bold for each test set. All neural models were trained with the Adadelta optimizer. "↑/*": significantly better than Moses (p < 0.05 / p < 0.01); "+/++": significantly better than RNNSearch (p < 0.05 / p < 0.01).

TABLE 2
AER Scores of Word Alignments

System      Layer-1  Layer-2  Layer-3  Layer-4  Layer-5  Layer-6  ALL
RNNSearch   50.83    -        -        -        -        -        50.83
DeepAtt-1   45.25    -        -        -        -        -        45.25
DeepAtt-2   54.43    52.53    -        -        -        -        48.09
DeepAtt-3   67.27    43.99    96.28    -        -        -        47.23
DeepAtt-4   77.64    52.16    76.86    53.30    -        -        45.38
DeepAtt-5   60.71    49.22    64.39    78.86    96.44    -        44.69
DeepAtt-6   60.42    54.45    53.91    73.18    97.50    95.65    46.01

The lower the score, the better the alignment quality. ALL = overall AER scores on all attention layers.
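For reference, the alignment error rate (AER) reported in Table 2 follows the standard definition of Och and Ney [23]: given the sure links S and possible links P (with S ⊆ P) of a gold alignment, and the predicted links A,

```latex
\mathrm{AER}(S, P; A) \;=\; 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|},
```

so that lower values indicate better alignments, in line with the note under Table 2.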
For the English-German task, we used the same subset of the WMT 2014 training corpus as in [3], [4], [5], [7]. This training data consists of 4.5M sentence pairs with 116M English words and 110M German words respectively.1 We used news-test 2013 as the development set, and news-test 2014 as the test set.

For the English-French task, we also used the WMT 2014 training data. The whole training corpus consists of around 36M sentence pairs, from which we selected 12M sentence pairs for training so as to meet our computational capability. The selection algorithm strictly follows the previous work.2 Finally, our training data contain 304M English words and 348M French words. We used the combination of news-test 2012 and news-test 2013 as the development set, and news-test 2014 as the test set.

Evaluation. We used the case-insensitive and case-sensitive BLEU-4 metric [18] to evaluate the translation quality of the Chinese-English task and the English-German/English-French tasks, respectively. For all tasks, we tokenized the reference and evaluated the performance using multi-bleu.perl.3 We performed paired bootstrap sampling [19] for the significance test.

4.2 Model Settings

We adopted similar settings as Bahdanau et al. [2]. For the Chinese-English task, we extracted the most frequent 30K words from the two corpora as the source and target vocabulary, covering approximately 97.7 and 99.3 percent of each corpus respectively. With respect to Moses, we used all the 1.25M sentence pairs (without length limitation). We trained a 4-gram language model on the target portion of the training data using the SRILM4 toolkit with modified Kneser-Ney smoothing. We word-aligned the training corpus using the GIZA++5 toolkit with the option "grow-diag-final-and". We employed the default lexical reordering model with the type "wbe-msd-bidirectional-fe-allff". All other parameters were kept as the default settings.

For the English-German task, we applied the byte pair encoding compression algorithm [20] to reduce the vocabulary size as well as to deal with rich morphology. For both languages, we preserved 16K sub-words as the vocabulary. We also tested a big setting with 30K sub-words extracted as the vocabulary. Similarly, for the English-French task, we preserved 40K sub-words in the source and target vocabulary, respectively.

We used dw = 620 dimensional word embeddings and dh = 1000 dimensional hidden states for both the source and target languages. All non-recurrent parts were randomly initialized with zero mean and a standard deviation of 0.01, except the recurrent parameters, which were initialized with random orthogonal matrices. During decoding, we used the beam-search algorithm and set the beam size to 10.

The model is trained through the standard SGD algorithm with a mini-batch size of 80 sentences. We updated the learning rate using the Adadelta algorithm [21] (ε = 10^-6 and ρ = 0.95). We clipped the norm of the model gradient to no more than 5.0 so as to avoid the gradient explosion issue. Dropout was also applied on the output layer to avoid over-fitting. We set the dropout rate to 0 for the Chinese-English task, and 0.2 for the English-German and English-French tasks. In addition, following recent advances in the deep learning community [3], [10], we employed the Adam algorithm [22] (ε = 10^-8, β1 = 0.9 and β2 = 0.999) in some cases. Unless stated otherwise, we used Adadelta in our experiments.

We implemented all our models based on the open-sourced dl4mt system.6 All NMT systems were trained on a GeForce GTX 1080 based on the computational framework Theano, where up to 6 attention layers were tested due to the physical memory limit of our GPU. The training of our DeepAtt costs around 5 days on the Chinese-English task, around 3 weeks on the English-German task and around 6 weeks on the English-French task when K is set to 5.
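For illustration only, the following framework-free sketch paraphrases the optimization recipe described above (global gradient-norm clipping at 5.0, Adadelta with ρ = 0.95 and ε = 10^-6). It restates the standard Adadelta rule [21] and is not the authors' dl4mt/Theano code; all function and variable names are ours.

```python
import numpy as np

RHO, EPS, CLIP_NORM = 0.95, 1e-6, 5.0  # values reported in Section 4.2

def clip_by_global_norm(grads, max_norm=CLIP_NORM):
    """Rescale all gradient tensors if their global L2 norm exceeds max_norm."""
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

def adadelta_update(param, grad, acc_grad, acc_delta):
    """One Adadelta [21] step for a single parameter tensor.

    acc_grad  : running average of squared gradients
    acc_delta : running average of squared parameter updates
    """
    acc_grad = RHO * acc_grad + (1 - RHO) * grad ** 2
    delta = -np.sqrt(acc_delta + EPS) / np.sqrt(acc_grad + EPS) * grad
    acc_delta = RHO * acc_delta + (1 - RHO) * delta ** 2
    return param + delta, acc_grad, acc_delta
```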
1. The preprocessed data can be found and downloaded from http://nlp.stanford.edu/projects/nmt/.
2. http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/
3. https://github.com/moses-smt/mosesdecoder/tree/master/scripts/generic/multi-bleu.perl
4. http://www.speech.sri.com/projects/srilm/download.html
5. http://www.fjoch.com/GIZA++.html
6. https://github.com/nyu-dl/dl4mt-tutorial/blob/master/session3/nmt.py

4.3 Results on Chinese-English Translation

Table 1 shows the translation results of different systems. No matter how many attention layers are used, DeepAtt always significantly outperforms both Moses and RNNSearch, with gains of up to 6.48 and 3.45 BLEU points respectively. Besides, as the number of attention layers increases, the overall translation performance is improved.
TABLE 3
Case-Insensitive BLEU Scores on Specific Context Words

         RNNSearch                            DeepAtt
Metric   NN     VB     NN&VB  NN&VB&JJ        NN     VB     NN&VB  NN&VB&JJ
BLEU-1   61.97  55.75  59.18  60.66           65.38  58.87  62.52  64.07
BLEU-2   44.33  28.62  29.80  33.15           45.84  33.24  33.09  36.45
BLEU-3   35.66  15.01  15.01  17.92           36.98  21.21  17.41  20.38
BLEU-4   11.40  -      7.71   9.68            17.07  -      9.27   11.50

NN = noun, VB = verb, JJ = adjective, and NN&VB = noun and verb. "-" indicates the BLEU score is zero.
Specifically, when there are 6 attention layers, DeepAtt achieves a 37.51 BLEU score on all test sets. This suggests that the deep attention architecture benefits the NMT system.

We also observed that compared to DeepAtt-6, DeepAtt-5 requires less training time with no significant performance degradation. Thus, we conducted the following experiments using 5 attention layers by default, unless mentioned otherwise.7

4.4 Analysis on Chinese-English Translation

The major difference between DeepAtt and other deep NMTs lies in the multiple attention layers. Therefore, we first quantitatively evaluated the quality of the learned attention weights at different layers. To this end, we employed the alignment error rate (AER) metric [23] and used the evaluation dataset from Liu and Sun [24], which contains 900 manually aligned Chinese-English sentence pairs [6]. Table 2 summarizes the results. With respect to the overall AER score, we observed that there are no consistent improvements as the attention layers deepen. However, all DeepAtt models achieve lower (better) AER scores than RNNSearch, especially DeepAtt-5, which yields the lowest AER score of 44.69. This indicates that deepening the attention layers can help improve the alignment quality, which typically contributes significantly to the translation performance.

With respect to the AER score across different attention layers (taking DeepAtt-5 as an example), we find that the score decreases at first and then rises sharply (60.71 → 49.22 → 96.77). This suggests that DeepAtt first seeks the translation-relevant source words, and then pays more attention to the other words. We argue that this phenomenon, to some extent, is consistent with the human procedure of translation. That is, a human needs to determine which source word to translate first, then checks the broader context to confirm its meaning, and finally finds adequate target translations.

High-quality word alignments play an important role in the translation of significant context words (e.g., noun, verb, adjective). As DeepAtt produces better attention weights, we dug into the translations and investigated whether the translation quality of context words can be improved. We assigned parts of speech to each word in the references and translations using the Stanford POS Tagger,8 and evaluated the translation quality of nouns (NN), verbs (VB) and adjectives (JJ) alone. We report BLEU from 1-gram to 4-gram and show the results in Table 3. Obviously, DeepAtt leads to remarkable improvements over RNNSearch on all context words. Specifically, on "NN", DeepAtt outperforms RNNSearch by 5.67 BLEU-4 points, while on "VB", DeepAtt achieves a gain of 6.2 BLEU-3 points. These significant improvements strongly indicate that DeepAtt connects the encoder and decoder more tightly so as to make the translations more faithful.

A common challenge for NMT systems is the translation of long source sentences. The above analysis reveals that DeepAtt generates more faithful translations. We further verified this point on long sentences. Following Bahdanau et al. [2], we grouped sentences of similar lengths together and computed the BLEU score and average length of translations in each group.9 Fig. 2 shows the overall results. We observe that the performance of RNNSearch drops sharply when the length of the source sentence exceeds 50. Compared with RNNSearch, DeepAtt yields consistent and significant improvements on all groups. Specifically, DeepAtt obtains a gain of up to 4 BLEU points on the longest group. Surprisingly, the translation length of DeepAtt is almost the same as that of RNNSearch. This suggests that DeepAtt achieves much better translation performance without changing the length of the translation, demonstrating the ability of DeepAtt to deal with long-range dependencies as well as to generate faithful translations.

4.5 Translation Analysis

Following the above analysis, we further provide some translation examples to verify whether our model indeed generates more fluent and faithful translations. We show the instances in Table 4.

As a traditional statistical system that relies heavily on large-scale phrase pairs, Moses succeeds in generating faithful translations, which, however, tend to lack fluency. For example, the sentence "zimbabwean president mugabe 9 to 11 march in the presidential election again elected" suffers from a serious disorder problem as well as a missing-predicate problem. In contrast, the translations of all NMT systems exhibit incredible fluency. Nevertheless, different NMT systems are faced with different challenges.

On one hand, these models sometimes prefer to avoid translating some important source clauses, which is the well-known under-translation problem [6].

7. We also examined the effect of beam size with a range from 10 to 50. Unfortunately, however, we do not observe significant changes of BLEU scores as the beam varies.
8. https://nlp.stanford.edu/software/tagger.shtml
9. We divide our test sets into six disjoint groups according to the length of source sentences ((0, 10), [10, 20), [20, 30), [30, 40), [40, 50), [50, ∞)), each of which has 680, 1923, 1839, 1189, 597 and 378 sentences respectively.
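Returning to the POS-restricted scores of Table 3: the sketch below gives one plausible reading of that evaluation (tag both references and translations, keep only tokens of the chosen POS, then compute n-gram BLEU over the filtered sequences). It is our own illustration, not the authors' evaluation script, and uses NLTK in place of the Stanford tagger for brevity.

```python
import nltk
from nltk.translate.bleu_score import corpus_bleu

def filter_by_pos(tokens, prefixes=("NN",)):
    """Keep only tokens whose POS tag starts with one of the given prefixes."""
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(prefixes)]

def pos_restricted_bleu(references, hypotheses, prefixes=("NN",), order=1):
    """BLEU computed over POS-filtered token sequences (e.g., nouns only).

    references, hypotheses : lists of tokenized sentences (one reference each).
    In practice, empty filtered sequences should be handled explicitly.
    """
    refs = [[filter_by_pos(r, prefixes)] for r in references]
    hyps = [filter_by_pos(h, prefixes) for h in hypotheses]
    weights = tuple([1.0 / order] * order)   # uniform weights up to `order`-grams
    return corpus_bleu(refs, hyps, weights=weights)
```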
Fig. 2. BLEU score and translation length on different length groups of source sentences.
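The length-based grouping behind Fig. 2 (see footnote 9) can be reproduced along these lines. The bin boundaries follow the footnote; the BLEU computation itself (e.g., multi-bleu.perl per group) is left abstract, and the function name is ours.

```python
def group_by_source_length(src_sents, hyps, refs):
    """Split (source, hypothesis, reference) triples into the six length
    bins used in Fig. 2 / footnote 9: (0,10), [10,20), ..., [50, inf)."""
    bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, float("inf"))]
    groups = {b: [] for b in bins}
    for src, hyp, ref in zip(src_sents, hyps, refs):
        n = len(src)                      # source length in tokens
        for lo, hi in bins:
            if lo <= n < hi:
                groups[(lo, hi)].append((hyp, ref))
                break
    return groups
```

Per-group BLEU and average translation length can then be computed with the same evaluation tools as in Section 4.1.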
TABLE 4
Examples Generated by Different Systems
The translation of DeepAtt is more accurate in expressing the meanings of source sentences. Important phrases are highlighted in red color.
For example, RNNSearch fails to translate the source clause "但(but) 西方(western) 国家(countries) 指责(alleging) 选举(the election) 存在(in) 严重(serious) 作弊(cheating) 行为(behavior), 缺乏(lack of) 公正性(fairness) 和(and) 自由性(freedom), 因此(thus) 拒绝(refused to) 承认(recognize) 选举(the election) 结果(results), 并(and) 扬言(threatened) 将(will) 对(against) 津巴布韦(zimbabwe) 进一步(further) 实施(carry out) 制裁(sanctions)。".10 It seems that the shallow model has difficulties in extracting and transforming the source semantics, which is also reflected in its poor alignment quality. Deepening the model is a promising way to alleviate this problem, as we observe that all deep models can recover more source meaning into the translations. However, except for our model, other deep models still neglect several source clauses during transformation.

On the other hand, some common source clauses can be translated repeatedly, which is the well-known over-translation problem [6]. This is because if the model fails to capture the source semantics, it may try to translate the recognized part over and over. As an example, the sub-translation "9-11 march" appears several times in all the NMT systems except ours. Additionally, both DeepNMT and WideNMT mistakenly produce "zimbabwe 's president mugabe" rather than "western countries" as the subject that "refused to acknowledge the election results and threatened to further impose sanctions against zimbabwe." All these strongly demonstrate that deepening the model alone is not sufficient to correctly convey the meaning of the source sentences.

Our DeepAtt, although its generated translations are not perfect either, handles these problems much better. We attribute this to the proposed attention architecture, which is more capable of dealing with the underlying semantics of source sentences.

4.6 More Comparisons on Chinese-English Translation

In addition to Moses and RNNSearch, we compare with the following existing systems:

10. Words in brackets are word-by-word English translations.
TABLE 5
Case-Insensitive BLEU Scores of Advanced Systems on the Chinese-English Translation Task

System                                    #Enc  #Dec  #Att  MT05   MT02   MT03   MT04   MT06   MT08   ALL
Existing End-to-End NMT Systems
Coverage [5]                               1     1     1    34.91  -      34.49  38.34  34.25  -      -
MemDec [5]                                 1     1     1    35.91  -      36.16  39.81  35.98  -      -
DeepLAU [5]                                4     4     1    38.07  -      39.35  41.15  37.29  -      -
VRNMT [25]                                 1     1     1    36.82  -      38.08  41.07  36.72  -      -
ABDNMT [26]                                1     1     1    38.84  -      40.02  42.32  38.38  -      -
Our Implementation of Comparable NMT Systems
RNNSearch                                  1     1     1    34.72  37.95  35.23  37.32  33.56  26.12  34.06
DeepNMT                                    5     5     5    36.44  39.29  37.89  39.65  35.37  27.63  36.02++
VDeepAtt                                   5     5     5    37.15  39.71  38.36  40.48  36.29  28.00  36.69++
WideAtt                                    5     1     1    35.66  38.60  37.01  38.49  34.66  26.06  35.00++
MHeadNMT                                   1     1     1    36.23  39.61  35.89  39.45  35.70  27.80  36.09++
UDeepAtt                                   5     5     5    36.71  39.93  37.78  39.38  36.03  28.25  36.30++
Our End-to-End NMT System
DeepAtt                                    5     5     5    38.82  41.00  39.07  41.09  37.37  28.52  37.50++
DeepAtt + LN                               5     5     5    40.19  42.20  40.24  42.13  38.59  30.15  38.78++
DeepAtt + LN + Adam                        5     5     5    44.16  45.70  44.17  46.82  43.12  34.16  43.08++
DeepAtt + LN + Adam (4 model ensemble)     5     5     5    46.17  47.61  47.30  49.14  45.94  36.64  45.58++

"#Enc" = number of encoder layers, "#Dec" = number of decoder layers, and "#Att" = number of attention layers. "-" indicates no result is provided in [5]. "Adam" = model is optimized with the Adam optimizer, if specified. "LN" = length normalization during decoding.
Fig. 3. BLEU score of different systems on all test sets under different numbers of layers.
proposed attention architecture, DeepAtt obtains another gain of 0.81 BLEU points over VDeepAtt, which suggests both the effectiveness and efficiency of our DeepAtt architecture, considering that DeepAtt has a more compact structure than VDeepAtt, enabling more efficient gradient propagation inside the decoder.

Compared with UDeepAtt, DeepAtt achieves a clear improvement of 1.2 BLEU points. The only difference between these two models is that we alternate the encoding direction between consecutive encoder layers. A benefit of this alternation is that future information can be fully mixed with history information, thus enabling DeepAtt to produce more accurate source representations. We also compared our model with the multi-head attention network [10]. Results show that MHeadNMT yields a gain of 2.03 BLEU points over RNNSearch, indicating that capturing different aspects of the source-target interaction is beneficial for translation. Nevertheless, deepening the attention network with compact structures as in our DeepAtt can reach better performance, achieving a gain of 1.41 BLEU points over MHeadNMT.

In order to have a fair comparison with the existing systems, we apply length normalization during translation.11 To the best of our knowledge, DeepLAU [5] reported the best BLEU scores using the 1.25M training data. However, our DeepAtt outperforms all these systems significantly. Enhanced with the Adam optimizer, our model reaches an overall BLEU score of 43.08, a strong improvement over the one trained with Adadelta by a great margin of 4.3 BLEU points. We further performed model ensemble. Using 4 well-trained models under different random seeds, our model resets the state-of-the-art results on this task, where the overall BLEU score increases to 45.58. Besides, in addition to the 5-layer setting, we also compared different models under other numbers of layers, which is shown in Fig. 3. We observe that with the increase of layers, NMT models produce better results, and no matter how many layers are used, DeepAtt always outperforms other related models and achieves the best result. All these demonstrate the modeling power of our deep attention architecture.

4.7 Results on English-German Translation

Table 6 shows the results on English-German translation. We also show existing systems comparable to ours, including the winning system in WMT14 [27], a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. Obviously, current WMT14 performance is led by deep NMT systems. For example, Wu et al. [3] reported a 24.61 BLEU score with 8 LSTM layers, and Wang et al. [5] generated a 23.80 BLEU score with 4 GRU+LAU layers. Very recently, the state-of-the-art was refreshed by Gehring et al. [15] using 15 CNN layers and becomes 25.12, which is further surpassed by Vaswani et al. [10] and reaches 27.30.

Our model achieves a 24.73 BLEU score, a very competitive result against the RNN-based and CNN-based systems above. Under similar model settings, GNMT [3] yields a 24.36 BLEU score (0.37 BLEU points lower than our model) with various non-trivial tricks such as coverage penalty, specific length normalization, fine-tuning and the RL-refined model. Although Gehring et al. [15] achieved 25.16, they used 40K sub-words and 15 layers, several times larger than those of our model. We also performed model ensemble to enhance the translation performance. By initializing with different random seeds, we trained 4 different models whose ensemble pushed the BLEU score to 26.45, making our model outperform GNMT [3], LAU-NMT [5] and CNN-NMT [15].

4.8 Results on English-French Translation

Table 7 summarizes the translation performance of different NMT systems. Unlike the above translation tasks, this task provides a training corpus of 12M sentence pairs, around three times and ten times larger than that of the English-German and Chinese-English translation tasks respectively. Overall, our single model achieves a BLEU score of 38.56, and its ensemble using 4 well-trained models improves the score to 39.88. Both results are competitive against both RNN-based and CNN-based systems.

Among systems trained with 12M sentence pairs, our model is the best, outperforming the previous best model, i.e., Wang et al. [5] (35.10), by a great margin of 3.46 BLEU points. When using the full 36M sentence pairs, GNMT [3] yields a BLEU score of 38.95, Transformer [10] achieves 38.10, and CNN-NMT [15] reaches 40.15. By contrast, our model, using only 12M training data, is able to generate translations with a BLEU score of 38.56, demonstrating our model's excellent capability in translation modeling.

11. Even with length normalization, the comparison is not completely fair. Although all systems use the same training data, the existing systems are tuned on NIST 02, while ours is tuned on NIST 05. However, we believe this is not the key.
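Section 4.6 applies length normalization during decoding without spelling out its exact form. A common choice, which we state here only as an assumption and not necessarily the precise variant used in these experiments, rescores each beam candidate by its per-token log-probability:

```latex
\mathrm{score}(Y \mid X) \;=\; \frac{1}{|Y|} \sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, X).
```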
TABLE 6
Case-Sensitive BLEU Scores on WMT14 English-German Translation Task
"unk replace" and "PosUnk" denote the approaches to handling rare words in Jean et al. [28] and Luong et al. [7], respectively. "RL" and "WPM" represent the reinforcement learning optimization and wordpiece model used in Wu et al. [3], respectively. "LAU" and "MRT" denote the linear associative unit and the minimum risk training proposed by Wang et al. [5] and Shen et al. [29], respectively. "BPE" denotes the byte pair encoding algorithm in Sennrich et al. [20]. "SR" indicates the weakly-recurrent model proposed by Gangi et al. [16].
TABLE 7
Case-Sensitive BLEU Scores on WMT14 English-French Translation Task

System                                                             Data  Vocab  BLEU
this work: DeepAtt with 5 layers + BPE + Adam                      12M   40K    38.56
this work: DeepAtt with 5 layers + BPE + Adam (4 model ensemble)   12M   40K    39.88

"unk replace" and "PosUnk" denote the approaches to handling rare words in Jean et al. [28] and Luong et al. [7], respectively. "RL" and "WPM" represent the reinforcement learning optimization and wordpiece model used in Wu et al. [3], respectively. "LAU" and "MRT" denote the linear associative unit and the minimum risk training proposed by Wang et al. [5] and Shen et al. [29], respectively. "BPE" denotes the byte pair encoding algorithm in Sennrich et al. [20].
5 CONCLUSION AND FUTURE WORK

In this article, we have presented a deep attention model (DeepAtt) for NMT systems. Through multiple stacked attention layers, with each layer paying attention to a corresponding encoder layer, DeepAtt enables low-level attention information to guide what should be passed or suppressed from the encoder layer so as to make the learned distributed representations appropriate for high-level translation tasks. Our model is simple to implement and flexible to train. Experiments on NIST Chinese-English, WMT14 English-German and English-French translation tasks demonstrated the effectiveness of our model in improving both the translation and alignment quality.

In the future, we want to test DeepAtt on other tasks, e.g., summarization. Additionally, our model is not limited to the attention unit that we have used in this article. As mentioned in Section 2, we are also interested in adapting DeepAtt to other more complex attention models.

ACKNOWLEDGMENTS

The authors were supported by the National Natural Science Foundation of China (Nos. 61672440 and 61622209),
the Fundamental Research Funds for the Central Universities (Grant No. ZK1024), and the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49). Biao Zhang greatly acknowledges the support of the Baidu Scholarship. The authors also thank the reviewers for their insightful comments.

REFERENCES

[1] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. 27th Int. Conf. Neural Inf. Process. Syst., Vol. 2, Montreal, Canada, 2014, pp. 3104–3112.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learn. Representations, 2015, https://arxiv.org/abs/1409.0473.
[3] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, vol. abs/1609.08144, 2016.
[4] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, "Deep recurrent models with fast-forward connections for neural machine translation," Trans. Assoc. Comput. Linguistics, vol. 4, pp. 371–383, 2016.
[5] M. Wang, Z. Lu, J. Zhou, and Q. Liu, "Deep neural machine translation with linear associative unit," in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, Vancouver, Canada, 2017, pp. 136–145, doi: 10.18653/v1/P17-1013.
[6] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, "Modeling coverage for neural machine translation," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (Vol. 1: Long Papers), Berlin, Germany, 2016, pp. 76–85, doi: 10.18653/v1/P16-1008.
[7] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. Conf. Empirical Methods Natural Language Process., Sep. 2015, pp. 1412–1421.
[8] B. Zhang, D. Xiong, and J. Su, "Recurrent neural machine translation," arXiv preprint arXiv:1607.08725, 2016.
[9] B. Zhang, D. Xiong, and J. Su, "A GRU-gated attention model for neural machine translation," arXiv preprint arXiv:1704.08430, 2017.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. U. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[11] B. Zhang, D. Xiong, and J. Su, "Accelerating neural transformer via an average attention network," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2018, pp. 1789–1798.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 777–778, doi: 10.1109/CVPR.2016.90.
[13] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2377–2385.
[14] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 27–30, 2016, pp. 21–29, doi: 10.1109/CVPR.2016.10.
[15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proc. 34th Int. Conf. Mach. Learn., Sydney, NSW, Australia, 6–11 Aug. 2017, pp. 1243–1252.
[16] M. Antonino Di Gangi and M. Federico, "Deep neural machine translation with weakly-recurrent units," in Proc. 21st Annu. Conf. Eur. Assoc. Mach. Transl., Alicante, Spain, 2018, pp. 119–128.
[17] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv e-prints, vol. abs/1412.3555, 2014.
[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[19] P. Koehn, "Statistical significance tests for machine translation evaluation," in Proc. Conf. Empirical Methods Natural Language Process., Barcelona, Spain, 25–26 Jul. 2004, pp. 388–395.
[20] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, Aug. 2016, pp. 1715–1725.
[21] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Representations, San Diego, 2015.
[23] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Comput. Linguist., vol. 29, no. 1, pp. 19–51, Mar. 2003.
[24] Y. Liu and M. Sun, "Contrastive unsupervised word alignment with non-local features," in Proc. 29th AAAI Conf. Artif. Intell., Austin, Texas, 2015, pp. 2295–2301.
[25] J. Su, S. Wu, D. Xiong, Y. Lu, X. Han, and B. Zhang, "Variational recurrent neural machine translation," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5488–5495.
[26] X. Zhang, J. Su, Y. Qin, Y. Liu, R. Ji, and H. Wang, "Asynchronous bidirectional decoding for neural machine translation," CoRR, vol. abs/1801.05122, 2018. [Online]. Available: http://arxiv.org/abs/1801.05122
[27] C. Buck, K. Heafield, and B. van Ooyen, "N-gram counts and language models from the common crawl," in Proc. Language Resources Eval. Conf., May 2014, pp. 3579–3584.
[28] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics 7th Int. Joint Conf. Natural Language Process., Jul. 2015, pp. 1–10.
[29] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu, "Minimum risk training for neural machine translation," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Berlin, Germany, 2016, pp. 1683–1692, doi: 10.18653/v1/P16-1159.
[30] Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu, "Towards robust neural machine translation," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Melbourne, Australia, 2018, pp. 1756–1766.
[31] M. Wang, Z. Lu, H. Li, and Q. Liu, "Memory-enhanced decoder for neural machine translation," in Proc. Conf. Empirical Methods Natural Language Process., Nov. 2016, pp. 278–286.
[32] T. Luong, I. Sutskever, Q. Le, O. Vinyals, and W. Zaremba, "Addressing the rare word problem in neural machine translation," in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics 7th Int. Joint Conf. Natural Language Process., Jul. 2015, pp. 11–19.

Biao Zhang received the master's degree in computer science from the School of Software, Xiamen University, China, in 2018. He is a first-year research PhD student at the Institute for Language, Cognition and Computation, University of Edinburgh, under the supervision of Dr. Rico Sennrich. His research interests are natural language processing and deep learning, particularly neural machine translation.

Deyi Xiong received the PhD degree from the Institute of Computing Technology, Beijing, China, in 2007. He is a professor at Tianjin University. Previously, he was a professor at Soochow University from 2013-2018 and a research scientist at the Institute for Infocomm Research of Singapore from 2007-2013. His primary research interests are in the area of natural language processing, especially machine translation, dialogue, and natural language generation.

Jinsong Su received the PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is now an associate professor in the Software School of Xiamen University. His research interests include natural language processing and deep learning.