Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

154 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO.

1, JANUARY 2020

Neural Machine Translation with Deep Attention


Biao Zhang , Deyi Xiong, and Jinsong Su

Abstract—Deepening neural models has been proven very successful in improving the model’s capacity when solving complex
learning tasks, such as the machine translation task. Previous efforts on deep neural machine translation mainly focus on the encoder
and the decoder, while little on the attention mechanism. However, the attention mechanism is of vital importance to induce the
translation correspondence between different languages where shallow neural networks are relatively insufficient, especially when the
encoder and decoder are deep. In this paper, we propose a deep attention model (DeepAtt). Based on the low-level attention
information, DeepAtt is capable of automatically determining what should be passed or suppressed from the corresponding encoder
layer so as to make the distributed representation appropriate for high-level attention and translation. We conduct experiments on NIST
Chinese-English, WMT English-German, and WMT English-French translation tasks, where, with five attention layers, DeepAtt yields
very competitive performance against the state-of-the-art results. We empirically find that with an adequate increase of attention layers,
DeepAtt tends to produce more accurate attention weights. An in-depth analysis on the translation of important context words further
reveals that DeepAtt significantly improves the faithfulness of system translations.

Index Terms—Deep attention network, neural machine translation (NMT), attention-based sequence-to-sequence learning,
natural language processing

1 INTRODUCTION

R ECENT advances in deep learning enable the end-to-end


neural machine translation (NMT) system to achieve
very promising results on various language pairs [1], [2],
The attention mechanism [2] aims to dynamically
detect translation-relevant source words for predicting
next target word according to the partial translation. It
[3], [4], [5]. Unlike conventional statistical machine transla- acts as the translation model in SMT, bridging the gap
tion (SMT), NMT is a single, large neural network which between encoder and decoder, which requires complex
heavily relies on an encoder-decoder framework, where the reasoning ability and is very crucial to the faithfulness of
encoder transforms source sentence into distributed vectors translation. To improve its capacity, we propose a deep
from which the decoder generates the corresponding target attention model (DeepAtt). Fig. 1 shows the overall archi-
sentence word by word [2]. However, to achieve state-of- tecture. With one more encoding layer, DeepAtt stacks
the-art performance, researchers often resort to deep NMT one more attention layer. In this way, the low-level atten-
so as to enhance its capacity in capturing source and target tion layer is able to provide translation-aware information
semantics [3], [4], [5]. to the high-level attention layer. This enables the higher
Previous efforts on deep NMT mainly focus on the layer to automatically determine what should be passed
encoder and the decoder. For example, Wu et al. [3] use resid- or suppressed from the corresponding encoder layer
ual connections to train 8 encoder and 8 decoder layers; Zhou which in turn, makes the learned distributed representa-
et al. [4] propose fast-forward connections and train an NMT tion more appropriate for the next target word prediction.
system with a depth of 16 layers; Wang et al. [5] propose Besides, DeepAtt retains the hierarchy of the encoder, and
linear associative units and apply them on both the 4-layer sets up the layer-wise interaction between the encoder
encoder and decoder. Intuitively, the deep encoder is able to and the decoder. This can help the decoder to capture
summarize and represent source semantics more accurately, more accurate source semantics since only one attention
and the deep decoder can memorize much longer translation layer often induces inadequate attentions [6].
history and dependency. Although these deep models bene- DeepAtt is deep on not only the encoder and the decoder,
fit NMT significantly, they all build only upon a single- but also the attention mechanism. This deep attention
layered attention network, which might be insufficient in architecture significantly improves the connection strength
modeling translation correspondence between different lan- between the encoder and decoder, enabling more complex
guages thus hindering the performance of NMT systems. reasoning operations during translation. To testify its effec-
tiveness, we conduct a series of experiments on machine
translation tasks. On NIST Chinese-English task, our model
 B. Zhang and J. Su are with the Software School, Xiamen University, Xiamen,
Fujian 361005, China. E-mail: [email protected], [email protected].
achieves the best performance in terms of BLEU score com-
 D. Xiong is with the College of Intelligence and Computing, Tianjin Uni- pared with existing work using the same training data. We
versity, Tianjin 300350, China. E-mail: [email protected]. quantitatively analyze the attention weights of each attention
Manuscript received 19 Aug. 2017; revised 1 July 2018; accepted 1 Oct. 2018. layer in terms of alignment error rate, and find that with an
Date of publication 16 Oct. 2018; date of current version 3 Dec. 2019. adequate increase of attention layers, DeepAtt produces
(Corresponding author: Jinsong Su.)
more accurate attention weights. We further check whether
Recommended for acceptance by V. Sindhwani.
Digital Object Identifier no. 10.1109/TPAMI.2018.2876404 our model yields more adequate translation. Experiment
0162-8828 ß 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: NEURAL MACHINE TRANSLATION WITH DEEP ATTENTION 155

Fig. 1. Illustration of the proposed DeepAtt. We use blue and red color to indicate the source and target side, respectively. The yellow and gray color
denotes the information flow for target word prediction and attention respectively. Notice that we draw the encoder on the right and the decoder on
the left for clarity.

results show that the translation quality of important context Observing that the use of a fixed-length vector is insuffi-
words (e.g., noun, verb, adjective) is indeed improved. On cient in summarizing source-side semantics, Bahdanau
the WMT14 English-German and English-French (using 12M et al. [2] propose the attention mechanism. Luong et al. [7]
corpus only) task, our single model, with 5 attention layers, explore several efficient architectures for this mechanism,
achieves a BLEU score of 24.73 and 38.56 respectively, both introducing the global and local attention models. Zhang
comparable to the state-of-the-art. et al. [8] propose that the recurrent neural network can be
Our main contributions are summarized as follows: used as an alternative to the attention network. Recently,
Zhang et al. [9] introduce a GRU gate to the attention model so
 We propose a novel deep attention mechanism
as to improve the discriminative ability of the learned atten-
which operates in a hierarchical manner and allows
tion vectors. Vaswani et al. [10] propose a multi-head attention
the encoder to interact with the decoder layer
network with scaled-dot operation, expecting each head to
by layer. The hierarchy architecture ensures that
capture a particular aspect of the source-target interaction.
source-side semantics can be fully utilized to gener-
Zhang et al. [11] develop an average attention model which
ate the next target word. The layer-wise interaction,
greatly simplifies the decoder-part self-attention mechanism
on the other hand, enables the decoder to selectively
using solely cumulative average operation. Rather than devel-
pick essential untranslated source words for the pre-
oping more flexible attention models, we treat these exiting
diction of the next target word.
models as our basic unit. Although we employ the model of
 We develop a novel deep encoding schema which
Bahdanau et al. [2] in our experiments, our DeepAtt can be
alternates forward and backward RNNs with skip
easily adapted to other attention models.
connections to the source input at each layer. The alter-
Based on the success achieved in computer vision [12],
nation helps capture more accurate source semantics
[13], deep neural networks have become a pretty hot spot in
via integrating both history and future source-side
the NMT community, such as [3], [4], [5]. These studies dif-
information. The skip connection, on the other hand,
fer significantly from ours in the following two aspects.
makes the gradient propagation more fluent so as to
First, their major focus is to enable flexible optimization
enable feasible model optimization.
since training a deep neural network is very difficult. To
 We conducted a series of experiments on NIST Chi-
this end, Wu et al. [3] leverage the residual connection;
nese-English, WMT14 English-German and WMT14
Zhou et al. [4] propose a fast-forward connection, while
English-French translation tasks. The proposed
Wang et al. [5] introduce a linear associative unit. Second,
model yields consistent and significant impro-
their deep architecture lies in the encoder and the decoder,
vements over several strong baselines, and achieves
ignoring the single-layered attention network which is still
results comparable to various state-of-the-art NMT
shallow. In contrast, our model is also deep in the attention
systems.
network, making the deep encoder and deep decoder cou-
ple more tightly and the training more flexible.
2 RELATED WORK Our work brings these two lines of research together. In
Our work is closely related with two lines of research: the this respect, Yang et al. [14] propose stacked attention net-
attention mechanism and the deep NMT. works to learn to answer natural language questions from
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
156 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 1, JANUARY 2020

images. The difference is that they apply the attention


Y
m
only on a single encoding layer and compose the attended pðyjxÞ ¼ pðyj jx; y < j Þ
vectors using the adding operation. Although their model j¼1
works fine on CNN-based image question answering task,   (4)
Ym
this simple architecture and operation is relatively insufficient ¼ softmax gðEyj1 ; sj ; fakj gK
k¼1 Þ ;
for machine translation. Very recently, Gehring et al. [15] and j¼1
Gangi et al. [16], independent of our work, propose a multi-
where y < j ¼ fy1 ; . . . ; yj1 g denotes a partial translation.
step attention. In comparison, our DeepAtt has more compact
network structure, and is more feasible for optimization. Eyj1 2 Rdw is the embedding of previously generated target
Through experiments, we observe that our model is superior word yj1 , akj 2 Rdh is the translation-sensitive attention vec-
to their stacked multi-layered attention-based decoder. tor produced by the kth attention layer and gðÞ is a highly
non-linear function. sj 2 Rdh is the jth target-side hidden
3 THE MODEL state, which is usually calculated in a recurrent manner:

Unlike SMT, NMT models the translation procedure by


sj ¼ fdec ðsj1 ; Eyj1 ; fHk gK
k¼1 Þ: (5)
directly mapping the source sentence x ¼ fx1 ; . . . ; xn g to its
target translation y ¼ fy1 ; . . . ; ym g. As shown in Fig. 1, this is As shown in Fig. 1, DeepAtt is highly coupled with this
achieved via three components: encoder, decoder and attention decoding function. Formally, we decompose fdec as follows:
mechanism. We describe these components in succession.
sj ¼ esK
j (6)
The encoder aims at summarizing and representing
source semantics such that the decoder can recover them e
skj ¼ GRUðe
sk1
j ; aj Þ
k
(7)
using the target language. Given a source sentence x, we
design our encoder as follows (see the blue color in Fig. 1): s0j ¼ GRUðsj1 ; Eyj1 Þ; akj ¼ Attðesjk1 ; Hk Þ:
where e (8)
8 k
<! !k
h i ¼ fenc ð h i1 ; cki ; Exi Þ if k is even Notice that ak relies on esk1 , while esk relies on both esk1 and
hi ¼
k
(1) ak . In this way, the low-level attention information esk1 can
: k k
h i ¼ fenc ð h iþ1 ; cki ; Exi Þ otherwise; help to determine what should be extracted from the corre-
sponding encoding layer Hk , and the extracted information
where hki 2 Rdh is the hidden representation of source word
ak , in turn, amends the expressibility of esk in translation
xi in the kth encoder layer, and Exi 2 Rdw is the source word
embedding. cki 2 Rdh denotes the context representation in correspondence between source and target sentences for
source position i which tells the encoder the unobserved predicting the next target word.
information in the future. Formally, We use AttðÞ to denote the attention mechanism, which
extracts a fixed-length vector ak from varied-length source
representations Hk . Currently, there are several alterna-
cki ¼ hk1
i
(2) tives [2], [7], [8], [9], [11], among which we employed the most
!0 !0
where h0i ¼ h i ¼ GRUð h i1 ; Exi Þ: widely-used one [2]. This can be summarized as follows:
X
With such a serpentine manner, our encoder is aware of akj ¼ akji hki
!k k i
what has been read so far h i1 = h iþ1 , what will be read    (9)
next cki , and what is the current input word Exi so that the where akji ¼ softmax exp vTa tanhðWaesk1 j þ U a hk
i Þ :
future and history information can be fully incorporated
into the learned source word representations. To enable this Our model is deep not only on the encoder and decoder, but
full integration, we design the encoding function fenc using also on the attention mechanism. To optimize our model,
two-level hierarchy (take the first case in Eq. (1) as example): we used the most popular maximum likelihood objective
via stochastic gradient descent algorithm.
! e i ; ci Þ
h i ¼ GRUhigher ðh 4 EXPERIMENTS
! (3)
ei ¼ GRUlower ð h i1 ; Ex Þ:
h We evaluated DeepAtt mainly on the NIST Chinese-English
i
task. Besides, we also provided results on the WMT14
Intuitively, the low-level GRU model [17] provides a special English-German and English-French task.
short-cut connection to the encoding function such that our
deep encoder can be feasibly optimized. After encoding, the 4.1 Setup
source sentence is converted into real-valued hidden repre- Datasets. The training data for NIST Chinese-English task con-
sentations: Hk ¼ fhk1 ; . . . ; hkn g (where 1  k  K, K denotes sists of 1.25M sentence pairs, with 27.9M Chinese words and
the number of encoder layers). The higher the encoder layer 34.5M English words respectively. This data is a combination
is, the more abstract meanings Hk represents. of LDC2002E18, LDC2003E07, LDC2003E14, Hansards por-
The decoder aims at leveraging these encoded source tion of LDC2004T07, LDC2004T08 and LDC2005T06. We
semantics fHk gK k¼1 to generate not only faithful but also flu- chose the NIST 2005 dataset as the development set, and the
ent translation. Generally, it is a conditional recurrent neural NIST 2002, 2003, 2004, 2006 and 2008 datasets as our test sets.
language model which models the translation probability There are 878, 919, 1788, 1082 and 1664 sentences in NIST
based on the following chain rule: 2002, 2003, 2004, 2005, 2006, 2008 dataset respectively.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: NEURAL MACHINE TRANSLATION WITH DEEP ATTENTION 157

TABLE 1 TABLE 2
Case-Insensitive BLEU Scores on the AER Scores of Word Alignments
Chinese-English Translation Task
System Layer-1 Layer-2 Layer-3 Layer-4 Layer-5 Layer-6 ALL
System MT05 MT02 MT03 MT04 MT06 MT08 ALL
RNNSearch 50.83 - - - - - 50.83
Moses 31.70 33.61 32.63 34.36 31.00 23.96 31.03 DeepAtt-1 45.25 - - - - - 45.25
RNNSearch [2] 34.72 37.95 35.23 37.32 33.56 26.12 34.06 DeepAtt-2 54.43 52.53 - - - - 48.09
DeepAtt-3 67.27 43.99 96.28 - - - 47.23
DeepAtt-1 36.44 40.12 37.63 39.83 35.44 27.34 36.12 *þþ
DeepAtt-4 77.64 52.16 76.86 53.30 - - 45.38
DeepAtt-2 36.90 39.71 37.79 39.93 35.95 27.87 36.34 *þþ
DeepAtt-5 60.71 49.22 64.39 78.86 96.44 - 44.69
DeepAtt-3 36.75 40.53 38.12 40.14 36.14 28.12 36.65 *þþ
DeepAtt-6 60.42 54.45 53.91 73.18 97.50 95.65 46.01
DeepAtt-4 37.87 40.99 39.10 40.77 37.14 28.44 37.34 *þþ
DeepAtt-5 38.82 41.00 39.07 41.09 37.37 28.52 37.50 *þþ
The lower the score, the better the alignment quality. ALL = overall AER
DeepAtt-6 38.29 41.40 39.23 40.66 37.20 28.99 37.51*þþ scores on all attention layers.
DeepAtt-k: the proposed model using “k” attention layers, “k” encoder layers, type “wbe-msd-bidirectional-fe-allff”. All other parameters
and “k” decoder layers (i.e., K ¼“k”). RNNSearch: a vanilla NMT system
using 1-layer encoder and 1-layer decoder with 1-layer attention. ALL = total were kept as the default settings.
BLEU score on all test sets. We highlight the best results in bold for each test For English-German task, we applied the byte pair
set. All neural models were trained with Adadelta optimizer. “"/*”: signifi- encoding compression algorithm [20] to reduce the vocabu-
cantly better than Moses (p < 0.05/p < 0.01); “þ/þþ”: significantly better
than RNNSearch (p < 0.05/p < 0.01).
lary size as well as to deal with rich morphology. For both
languages, we preserved 16K sub-words as the vocabulary.
We also tested a big setting with 30K sub-words extracted
For English-German task, we used the same subset of the as the vocabulary. Similarly, for English-French task, we
WMT 2014 training corpus as in [3], [4], [5], [7]. This training preserved 40K sub-words in the source and target vocabu-
data consists of 4.5M sentence pairs with 116M English lary, respectively.
words and 110M German words respectively.1 We used the We used dw ¼ 620 dimensional word embeddings and
news-test 2013 as the development set, and the news-test dh ¼ 1000 dimensional hidden states for both the source
2014 as the test set. and target languages. All non-recurrent parts were ran-
For English-French task, we also used the WMT 2014 domly initialized with zero mean and standard deviation of
training data. The whole training corpus consists of around 0.01, except the recurrent parameters which were initialized
36M sentence pairs, from which we selected 12M sentence with random orthogonal matrices. During decoding, we
pairs for training so as to meet our computational capabil- used the beam-search algorithm, and set the beam size to 10.
ity. The selection algorithm strictly follows the previous The model is trained through standard SGD algorithm
work2. Finally, our training data contain 304M English with a mini-batch size of 80 sentences. We updated the learn-
words and 348M French words. We used the combination ing rate using the Adadelta algorithm [21] ( ¼ 106 and
of news-test 2012 and news-test 2013 as the development r ¼ 0:95). We clipped the norm of model gradient to make it
set, and the news-test 2014 as the test set. no more than 5.0 so as to avoid gradient explosion issue.
Evaluation. We used the case-insensitive and case-sensitive Dropout was also applied on the output layer to avoid over-
BLEU-4 metric [18] to evaluate translation quality of Chinese- fitting. We set the dropout rate to be 0 for Chinese-English
English and English-German, English-French task respec- task, and 0.2 for English-German and English-French task. In
tively. For all tasks, we tokenized the reference and evaluated addition, following recent advances in deep learning commu-
the performance using multi-bleu.perl.3 We performed paired nity [3], [10], we employed Adam algorithm [22] ( ¼ 108
bootstrap sampling [19] for significance test. and b1 ¼ 0:9; b2 ¼ 0:999) in some cases. Without clear decla-
ration, we used the Adadelta for experiment.
4.2 Model Settings We implemented all our models based on the open-
We adopted similar settings as Bahdanau et al. [2]. For sourced dl4mt system.6 All NMT systems were trained on a
Chinese-English task, we extracted the most frequent 30K GeForce GTX 1080 based on the computational framwork
words from two corpora as the source and target vocabu- Theano where up to 6 attention layers were tested due to
lary, covering approximately 97.7 and 99.3 percent of each the physical memory limit of our GPU. The training of our
corpora respectively. With respect to Moses, we used all the DeepAtt costs around 5 days on the Chinese-English task,
1.25M sentence pairs (without length limitation). We trained around 3 weeks on the English-German task and around 6
a 4-gram language model on the target portion of training weeks on the English-French task when K is set to be 5.
data using the SRILM4 toolkit with modified Kneser-Ney
smoothing. We word-aligned the training corpus using 4.3 Results on Chinese-English Translation
GIZA++5 toolkit with the option “grow-diag-final-and”. We Table 1 shows the translation results of different systems. No
employed the default lexical reordering model with the matter how many attention layers are used, DeepAtt always
significantly outperforms both Moses and RNNSearch, with
1. The preprocessed data can be found and downloaded from
gains of up to 6.48 and 3.45 BLEU points respectively.
http://nlp.stanford.edu/projects/nmt/. Besides, as the number of attention layers increases, the over-
̃
2. http://www-lium.univ-lemans.fr/schwenk/cslm_joint_paper/ all translation performance is improved. Specifically, when
3. https://github.com/moses-smt/mosesdecoder/tree/master/
scripts/generic/multi-bleu.perl
4. http://www.speech.sri.com/projects/srilm/download.html 6. https://github.com/nyu-dl/dl4mt-tutorial/blob/master/ses-
5. http://www.fjoch.com/GIZA++.html sion3/nmt.py.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
158 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 1, JANUARY 2020

TABLE 3
Case-Insensitive BLEU Scores on Specific Context Words

RNNSearch DeepAtt
Metric
NN VB NN&VB NN&VB&JJ NN VB NN&VB NN&VB&JJ
BLEU-1 61.97 55.75 59.18 60.66 65.38 58.87 62.52 64.07
BLEU-2 44.33 28.62 29.80 33.15 45.84 33.24 33.09 36.45
BLEU-3 35.66 15.01 15.01 17.92 36.98 21.21 17.41 20.38
BLEU-4 11.40 - 7.71 9.68 17.07 - 9.27 11.50

NN = noun, VB = verb, JJ = adjective, and NN&VB = noun and verb. “” indicates the BLEU score is zero.

there are 6 attention layers, DeepAtt achieves 37.51 BLEU improvements over RNNSearch on all context words. Specif-
score on all test sets. This suggests that the deep attention ically, on “NN”, DeepAtt outperforms RNNSearch by 5.67
architecture benefits the NMT system. BLEU-4 points, while on “VB”, DeepAtt achieves a gain of
We also observed that compared to DeepAtt-6, DeepAtt- 6.2 BLEU-3 points. These significant improvements strongly
5 requires less training time with no significant performance indicate that DeepAtt connects the encoder and decoder
degradation. Thus, we conducted the following experi- more tightly so as to enable the translations more faithful.
ments using 5 attention layers by default, unless mentioned A common challenge for NMT system is the translation
otherwise.7 of long source sentences. The above analysis reveals that
DeepAtt generates more faithful translation. We further ver-
4.4 Analysis on Chinese-English Translation ified this point on long sentences. Following Bahdanau
The major difference between DeepAtt and other deep NMTs et al. [2], we grouped sentences of similar lengths together
lies in the multiple attention layers. Therefore, we first quanti- and computed BLEU score and average length of transla-
tatively evaluated the quality of the learned attention weights tions in each group.9 Fig. 2 shows the overall results. We
at different layers. To this end, we employed the alignment observe that the performance of RNNSearch drops sharply
error rate (AER) metric [23] and used the evaluation dataset when the length of source sentence exceeds 50. Compared
from Liu and Sun [24], which contains 900 manually aligned with RNNSearch, DeepAtt yields consistent and significant
Chinese-English sentence pairs [6]. Table 2 summarizes the improvements on all groups. Specifically, DeepAtt obtains a
results. With respect to the overall AER score, we observed gain of up to 4 BLEU points on the longest group. Surpris-
that there are no consistent improvements as the attention ingly, the translation length of DeepAtt is almost the same
layers deepen. However, all DeepAtt models achieve lower as that of RNNSearch. This suggests that DeepAtt achieves
(better) AER scores than the RNNSearch, especially the Deep- much better translation performance without changing the
Att-5 which yields the lowest 44.69 AER score. This indicates length of translation, demonstrating the ability of DeepAtt
that deepening the attention layers can help improve the in dealing with long-range dependencies as well as generat-
alignment quality, which typically contributes significantly to ing faithful translations.
the translation performance.
With respect to the AER score across different attention 4.5 Translation Analysis
layers (take DeepAtt-5 as example), we find that the score Following the above analysis, we further provide some
decreases at first, then raises sharply (60:71 ! 49:22 ! translation examples to verify whether our model indeed
96:77). This suggests that DeepAtt first seeks the translation- generates more fluent and faithful translation. We show the
relevant source words, and then pay more attention to the instances in Table 4.
other words. We argue that this phenomenon, to some extent, As a traditional statistical system that relies heavily on
is consistent with human’s procedure of translation. That is, a large-scale phrase pairs, the Moses succeeds in generating
human needs to determine which source word to translate at faithful translations, which, however, tend to lack of
first, then checks broader context to confirm its meaning, and fluency. For example, the sentence “zimbabwean president
finally finds out adequate target translations. mugabe 9 to 11 march in the presidential election again elected”
High-quality word alignments play an important role in suffers from serious disorder problem as well as missing-
the translation of significant context words (e.g., noun, verb, predicate problem. In contrast, the translations of all NMT
adjective). As DeepAtt produces better attention weights, systems exhibit incredible fluency. Nevertheless, different
we dug into the translations and investigated whether the NMT systems are faced with different challenges.
translation quality of context words can be improved. We On one hand, these models sometimes prefer to avoid
assigned parts of speech to each word in the references and translating some important source clauses, which is a
translations using the Stanford POS Tagger,8 and evaluated well-known under-translation problem [6]. For example,
the translation quality of noun (NN), verb (VB) and adjective RNNSearch fails to translate the source clause “但(but) 西方
(JJ) alone. We report BLEU from 1 to 4-gram, and show the (western) 国家(countries) 指责(alleging) 选举(the election) 存在
results in Table 3. Obviously, DeepAtt leads to remarkable (in) 严重(serious) 作弊(cheating) 行为(behavior), 缺乏(lack

7. We also examined the effect of beam size with a range from 10 to 9. We divide our test sets into six disjoint groups according to the
50. Unfortunately, however, we do not observe significant change of length of source sentences ((0, 10), [10, 20), [20, 30), [30, 40), [40, 50), [50,
BLEU scores as the beam varies. -)), each of which has 680, 1923, 1839, 1189, 597 and 378 sentences
8. https://nlp.stanford.edu/software/tagger.shtml respectively.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: NEURAL MACHINE TRANSLATION WITH DEEP ATTENTION 159

Fig. 2. BLEU score and translation length on different length groups of source sentences.

TABLE 4
Examples Generated by Different Systems

The translation of DeepAtt is more accurate in expressing the meanings of source sentences. Important phrases are highlighted in red color.

of) 公正性(fairness) 和(and) 自由性(freedom), 因此(thus) 拒绝 sub-translation “9-11 march” appears several times in all
(refused to) 承认(recognize) 选举(the election) 结果(results) , 并 the NMT systems except ours. Additionally, both
(and) 扬言(threatened) 将(will) 对(against) 津巴布韦(zim- DeepNMT and WideNMT mistakenly produce “zimbabwe
babwe) 进一步(further) 实施(carry out) 制裁(sanctions)。”.10 It ’s president mugabe” rather than “western countries” as the
seems that the shallow model has difficulties in extracting subject that “refused to acknowledge the election results and
and transforming the source semantics, which can also be threatened to further impose sanctions against zimbabwe.” All
reflected on its poor alignment quality. Deepening the these strongly demonstrate that deepening the model
model is a promising way to alleviate this problem, as alone is not sufficient enough to correctly convey the
we observe that all deep models can recover more source meaning of the source sentences.
meaning into the translations. However, except our model, Our DeepAtt, although its generated translations are not
other deep models still neglect several source clauses perfect either, handles these problems much better. We con-
during transformation. tribute this to the proposed attention architecture that is
On the other hand, some common source clauses can more capable of dealing with the underlying semantics of
be translated repeatedly, which is a well-known over- source sentences.
translation problem [6]. This is because if the model fails
to capture the source semantics, it may try to translate
the recognized part over and over. As an example, the 4.6 More Comparisons on Chinese-English
Translation
Except for the Moses and RNNSearch, we provide the fol-
10. Words in bracket are word-by-word English translations. lowing existing systems:
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
160 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 1, JANUARY 2020

TABLE 5
Case-Insensitive BLEU Scores of Advanced Systems on the Chinese-English Translation Task

System #Enc #Dec #Att MT05 MT02 MT03 MT04 MT06 MT08 ALL
Existing End-to-End NMT Systems
Coverage [5] 1 1 1 34.91 - 34.49 38.34 34.25 - -
MemDec [5] 1 1 1 35.91 - 36.16 39.81 35.98 - -
DeepLAU [5] 4 4 1 38.07 - 39.35 41.15 37.29 - -
VRNMT [25] 1 1 1 36.82 - 38.08 41.07 36.72 - -
ABDNMT [26] 1 1 1 38.84 - 40.02 42.32 38.38 - -
Our Implementation of Comparable NMT Systems
RNNSearch 1 1 1 34.72 37.95 35.23 37.32 33.56 26.12 34.06
DeepNMT 5 5 5 36.44 39.29 37.89 39.65 35.37 27.63 36.02 þþ
VDeepAtt 5 5 5 37.15 39.71 38.36 40.48 36.29 28.00 36.69 þþ
WideAtt 5 1 1 35.66 38.60 37.01 38.49 34.66 26.06 35.00 þþ
MHeadNMT 1 1 1 36.23 39.61 35.89 39.45 35.70 27.80 36.09 þþ
UDeepAtt 5 5 5 36.71 39.93 37.78 39.38 36.03 28.25 36.30 þþ
Our End-to-End NMT System
DeepAtt 5 5 5 38.82 41.00 39.07 41.09 37.37 28.52 37.50 þþ
DeepAtt + LN 5 5 5 40.19 42.20 40.24 42.13 38.59 30.15 38.78 þþ
DeepAtt + LN + Adam 5 5 5 44.16 45.70 44.17 46.82 43.12 34.16 43.08 þþ
DeepAtt + LN + Adam (4 model ensemble) 5 5 5 46.17 47.61 47.30 49.14 45.94 36.64 45.58þþ

“#Enc” = number of encoder layers, “#Dec” = number of decoder layers, and “#Att” = number of attention layers. “” indicates no result is provided in [5].
“Adam” = model is optimized with Adam optimizer, if specified. “LN” = length normalization during decoding.

 Coverage [6]: A RNNSearch with a coverage vector to sk1


j and the source representation Hk . The start
keep track of the translated and un-translated source point for the hidden representation s0j ¼ Eyj1 .
words.  WideAtt: Rather than stacking multiple attention
 MemDec [31]: A RNNSearch whose decoder is layers, WideAtt concatenates the multiple encoded
enhanced with an external memory. source representations and attends to it using only
 DeepLAU [5]: A deep RNNSearch with linear associa- one decoder layer. In summary, WideAtt uses 5
tive units to reduce the gradient propagation length encoder layers and 1 decoder layer with 1 attention
inside the recurrent unit. layer. The source representation applied for decod-
 VRNMT [25]: A RNNSearch equipped with recur- ing is calculated as follows:
rent latent variable to capture semantic variance dur-  
ing decoding. H ¼ concatð H1 ; . . . ; HK Þ; (13)
 ABDNMT [26]: A RNNSearch enhanced with a bidi-
where Hk is defined as in Eq. (1).
rectional decoding procedure. This model uses a
 MHeadNMT: A vanilla RNNSearch system, which
two-stage translation.
utilizes a multi-head attention network described
Besides, we also implement several closely-related models:
in [10] rather than the vanilla attention mecha-
 DeepNMT: A vanilla deep NMT model with 5 nism [2]. We used 8 heads for experiment.
encoder and 5 decoder layers, but only one attention  UDeepAtt: The same model as DeepAtt-5, except that
layer. In practice, we used the same encoder as our each encoder layer follows the same direction, rather
DeepAtt. than the alternative forward and backward architec-
 VDeepAtt: A vanilla design of DeepAtt-5. The differ- ture. Formally, the encoder of UDeepAtt operates as
ence lies in the decoder, where VDeepAtt simply follows:
stacks multiple attention-based decoder layers [15] !k !k
rather than coupling these attention layers into one hki ¼ h i ¼ fenc ð h i1 ; cki ; Exi Þ: (14)
recurrent unit as in DeepAtt. Formally, VDeepAtt Table 5 shows the results. For our NMT systems, all deep
employs 5 stacked conditional recurrent decoders to models outperform RNNSearch significantly, demonstrat-
predict the next target word: ing the modeling capacity of deep neural networks as well
e
skj ¼ GRUðskj1 ; sk1 as the solidness of this line of research. Among these sys-
j Þ (10)
tems, WideAtt yields the worst performance. This indicates
that concatenating the multi-layered source representations
skj ¼ GRUðeskj ; akj Þ (11) makes the NMT shallow, and finally results in the loss of
valuable capacity in modeling translation. Compared with
akj ¼ Attðe
skj ; Hk Þ; (12) DeepNMT, VDeepAtt achieves better performance with a
gain of 0.67 BLEU points. Since the main difference between
where the jth target hidden state in the kth layer skj DeepNMT and VDeepAtt lies in that VDeepAtt applies
depends on the previous hidden state in the same multiple attention layers, we believe that deep attention is a
layer skj1 , the current hidden state in the last layer feasible and effective direction. Enhanced with our
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: NEURAL MACHINE TRANSLATION WITH DEEP ATTENTION 161

Fig. 3. BLEU score of different systems on all test sets under different numbers of layers.

proposed attention architecture, DeepAtt obtains another 4.7 Results on English-German Translation
gain of 0.81 BLEU points over VDeepAtt, which suggests Table 6 shows the results on English-German translation.
both the effectiveness and efficiency of our DeepAtt archi- We also show existing systems comparable to ours includ-
tecture, considering that DeepAtt has more compact struc- ing the winning system in WMT14 [27], a phrase-based sys-
ture than VDeepAtt, enabling much efficient gradient tem whose language models were trained on a huge
propagation inside the decoder. monolingual text, the Common Crawl corpus. Obviously,
Compared with UDeepAtt, DeepAtt achieves a clear current WMT14 performance is led by deep NMT systems.
improvement of 1.2 BLEU points. The only difference For example, Wu et al. [3] reported 24.61 BLEU score with 8
between these two models is that we alternate the encoding LSTM layers, and Wang et al. [5] generated 23.80 BLEU
direction between consecutive encoder layers. A benefit score with 4 GRU+LAU layers. Very recently, the state-of-
from this alternation is that future information can be fully the-art is refreshed by Gehring et al. [15] using 15 CNN
mixed with history information, thus enabling DeepAtt to layers and becomes 25.12, which is further broken through
produce more accurate source representations. We also by Vaswani et al. [10] and reaches 27.30.
compared our model with the multi-head attention net- Our model achieves 24.73 BLEU score, a very competitive
work [10]. Results show that MHeadNMT yields a gain of result against the RNN-based and CNN-based systems above.
2.03 BLEU points over the RNNSearch, indicating that cap- Under similar model settings, the GNMT [3] yields 24.36
turing different aspects of the source-target interaction is BLEU score (0.37 BLEU points lower than our model) with
beneficial for translation. Nevertheless, deepening the atten- various non-trivial tricks such as coverage penalty, specific
length normalization, fine-tunning and the RL-refined model.
tion network with compact structures as in our DeepAtt can
Although Gehring et al. [15] achieved 25.16, they used 40K
reach better performance, achieving a gain of 1.41 BLEU
sub-words and 15 layers, several times larger than those of
points over MHeadNMT.
our model. We also performed model ensemble to enhance
In order to have a fair comparison with the existing
the translation performance. By initializing with different ran-
systems, we apply the length normalization during
dom seeds, we trained 4 different models whose ensemble
translation.11 To the best of our knowledge, DeepLAU [5]
pushed the BLEU score to 26.45, making our model outper-
reported the best BLEU scores using the 1.25M training
form both GNMT [3], LAU-NMT [5] and CNN-NMT [15].
data. However, our DeepAtt outperforms all these systems
significantly. Enhanced with the Adam optimizer, our model 4.8 Results on English-French Translation
reaches an overall BLEU score of 43.08, a strong improve- Table 7 summarizes the translation performance of different
ment over the one trained with Adadelta by a great margin NMT systems. Unlike the above translation tasks, this task
of 4.3 BLEU points. We further performed model ensemble. provides a training corpus of 12M sentence pairs, around
Using 4 well-trained model under different random seeds, three times and ten times larger than that of English-
our model resets the state-of-the-art results on this task, German and Chinese-English translation task respectively.
where the overall BLEU score increases to 45.58. Besides, Overall, our single model achieves a BLEU score of 38.56,
except for NMT with 5 layers, we also compared different and its ensemble using 4 well-trained models improves the
models under other numbers of layers, which is shown in score to 39.88. Both results are competitive against both
Fig. 3. We observe that with the increase of layers, NMT RNN-based and CNN-based systems.
models produce better results, and no matter how many Among systems trained with 12M sentence pairs, our
layers are used, DeepAtt always outperforms other related model is the best, outperforming the previous best model, i.e.,
models and achieves the best result. All these demonstrate Wang et al. [5] (35.10), by a great margin of 3.46 BLEU points.
the modeling power of our deep attention architecture. When using the full 36M sentence pairs, GNMT [3] yeilds a
BLEU score of 38.95, Transformer [10] achieves 38.10, and
CNN-NMT [15] reaches 40.15. By contrast, our model, using
11. Even with length normalization, the comparison is not
completely fair. Although all systems use the same training data, the
only 12M training data, is able to generate translations with a
existing systems are tuned on NIST 02, while ours is tuned on NIST 05. BLEU score of 38.56, demonstrating our model’s excellent
However, we believe this is not the key. capability in translation modeling.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
162 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 1, JANUARY 2020

TABLE 6
Case-Sensitive BLEU Scores on WMT14 English-German Translation Task

System Architecture Vocab BLEU


Buck et al. [27] Winning WMT14 system phrase-based + large LM - 20.7
Existing end-to-end NMT systems
Jean et al. [28] RNNSearch + unk replace + large vocab 500K 19.40
Luong et al. [7] LSTM with 4 layers + dropout + local att. + unk replace 50K 20.90
Shen et al. [29] RNNSearch (GroundHog) + MRT + PosUnk 50K 20.45
Zhou et al. [4] LSTM with 16 layers + Fast-Forward connections 80K 20.60
Wu et al. [3] LSTM with 8 layers + Word 80K 23.10
Wu et al. [3] LSTM with 8 layers + RL-refined WPM 32K 24.60
Wang et al. [5] RNNSearch with 4 layers + LAU 80K 22.10
Wang et al. [5] RNNSearch with 4 layers + LAU + PosUnk 80K 23.80
Gehring et al. [15] CNN with 15 layers + Multi-step Attention + BPE 40K 25.16
Cheng et al. [30] RNN with 2 layers + adversarial stability training + BPE 30K 25.26
Gangi et al. [16] RNN with 10 layers + SR + BPE 32K 24.98
Vaswani et al. [10] Attention with 6 layers + WPM + base 32K 27.30
Wang et al. [5] RNNSearch with 4 layers + LAU + PosUnk (8 model ensemble) 80K 26.10
Wu et al. [3] LSTM with 8 layers + RL-refined WPM (8 model ensemble) 32K 26.20
Gehring et al. [15] CNN with 15 layers + Multi-step Attention + BPE (8 model ensemble) 40K 26.43
Our end-to-end NMT systems

this work DeepAtt with 5 layers + BPE 16K 24.22


DeepAtt with 5 layers + BPE + Adam 30K 24.73
DeepAtt with 5 layers + BPE + Adam (4 model ensemble) 30K 26.45

“unk replace” and “PosUnk” denotes the approach of handling rare words in Jean et al. [28] and Luong et al. [7], respectively. “RL” and “WPM” represent the
reinforcement learning optimization and wordpiece model used in Wu et al. [3], respectively. “LAU” and “MRT” denote the linear associative unit and the mini-
mum risk training proposed by Wang et al. [5] and Shen et al. [29], respectively. “BPE” denotes the byte pair encoding algorithm in Sennrich et al. [20]. “SR”
indicates the weakly-recurrent model proposed by Gangi et al. [16]

TABLE 7
Case-sensitive BLEU Scores on WMT14 English-French Translation Task

System Architecture Data Vocab BLEU


Existing end-to-end NMT systems
Jean et al. [28] RNNSearch + unk replace + large vocab 12M 500K 34.11
Luong et al. [32] LSTM with 6 layers + PosUnk 12M 40K 32.70
Shen et al. [29] RNNSearch + MRT + PosUnk 12M 30K 34.23
Zhou et al. [4] LSTM with 16 layers + Fast-Forward connections 36M 80K 37.70
Wu et al. [3] LSTM with 8 layers + WPM 36M 32K 38.95
Wang et al. [5] RNNSearch with 4 layers + LAU + PosUnk 12M 30K 35.10
Gehring et al. [15] CNN with 15 layers + Multi-step Attention + BPE 36M 40K 40.51
Vaswani et al. [10] Attention with 6 layers + WPM + base 36M 32K 38.10
Wu et al. [3] LSTM with 8 layers + WPM (8 model ensemble) 36M 32K 40.35
Gehring et al. [15] CNN with 15 layers + Multi-step Attention + BPE (8 model ensemble) 36M 40K 41.44
Our end-to-end NMT systems

this work DeepAtt with 5 layers + BPE + Adam 12M 40K 38.56
DeepAtt with 5 layers + BPE + Adam (4 model ensemble) 12M 40K 39.88

“unk replace” and “PosUnk” denotes the approach of handling rare words in Jean et al. [28] and Luong et al. [7], respectively. “RL” and “WPM” represent the
reinforcement learning optimization and wordpiece model used in Wu et al. [3], respectively. “LAU” and “MRT” denote the linear associative unit and the mini-
mum risk training proposed by Wang et al. [5] and Shen et al. [29], respectively. “BPE” denotes the byte pair encoding algorithm in Sennrich et al. [20].

5 CONCLUSION AND FUTURE WORK the effectiveness of our model in improving both the transla-
tion and alignment quality.
In this article, we have presented a deep attention model
In the future, we want to testify DeepAtt on other tasks,
(DeepAtt) for NMT systems. Through multiple stacked atten-
e.g., summarization. Additionally, our model is not limited
tion layers with each layer paying attention to a corresp-
to the attention unit that we have used in this article. As
onding encoder layer, DeepAtt enables low-level attention
mentioned in Section 2, we are also interested in adapting
information to guide what should be passed or suppressed
DeepAtt to other more complex attention models.
from the encoder layer so as to make the learned distributed
representations appropriate for high-level translation tasks.
Our model is simple to implement and flexible to train. ACKNOWLEDGMENTS
Experiments on both NIST Chinese-English, WMT14 English- The authors were supported by the National Natural Sci-
German and English-French translation tasks demonstrated ence Foundation of China (Nos. 61672440 and 61622209),
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: NEURAL MACHINE TRANSLATION WITH DEEP ATTENTION 163

the Fundamental Research Funds for the Central Universi- [21] M. D. Zeiler, “ADADELTA: An adaptive learning rate method,”
arXiv preprint arXiv:1212.5701, 2012.
ties (Grant No. ZK1024), and Scientific Research Project of [22] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
National Language Committee of China (Grant No. YB135- mization,” in Proc. 3rd Int. Conf. Learn. Representations, San Diego,
49). Biao Zhang greatly acknowledges the support of the 2015.
Baidu Scholarship. The authors also thank the reviewers for [23] F. J. Och and H. Ney, “A systematic comparison of various statisti-
cal alignment models,” Comput. Linguist., vol. 29, no. 1, pp. 19–51,
their insightful comments. Mar. 2003.
[24] Y. Liu and M. Sun, “Contrastive unsupervised word alignment
REFERENCES with non-local features,” in Proc. 29th AAAI Conf. Artificial Intell.,
Austin, Texas, 2015, pp. 2295–2301.
[1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence [25] J. Su, S. Wu, D. Xiong, Y. Lu, X. Han, and B. Zhang, “Variational
learning with neural networks,” in Proc. 27th Int. Conf. Neural Inf. recurrent neural machine translation,” in Proc. 32nd AAAI Conf.
Proc. Syst.—Vol. 2, Montreal, Canada, 2014, pp. 3104–3112. Artif. Intell., 2018, pp. 5488-–5495.
[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation [26] X. Zhang, J. Su, Y. Qin, Y. Liu, R. Ji, and H. Wang, “Asynchronous
by jointly learning to align and translate,” ICLR, 2015, https:// bidirectional decoding for neural machine translation,” CoRR,
arxiv.org/abs/1409.0473 vol. abs/1801.05122, 2018. [Online]. Available: http://arxiv.org/
[3] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, abs/1801.05122
M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, [27] C. Buck, K. Heafield, and B. van Ooyen, “N-gram counts and lan-
M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, guage models from the common crawl,” in Proc. Language Resour-
H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, ces Eval. Conf., May 2014, pp. 3579–3584.
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, [28] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, “On using very large
and J. Dean, “Google’s neural machine translation system: Bridg- target vocabulary for neural machine translation,” in Proc. 53rd
ing the gap between human and machine translation,” CoRR, Annu. Meeting Assoc. Comput. Linguistics 7th Int. Joint Conf. Natural
vol. abs/1609.08144, 2016. Language Process., Jul. 2015, pp. 1–10.
[4] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, “Deep recurrent mod- [29] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu,
els with fast-forward connections for neural machine translation,” “Minimum risk training for neural machine translation,” in Proc.
Trans. Assoc. Comput. Linguistics, vol. 4, pp. 371–383, 2016. 54th Ann. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers),
[5] M. Wang, Z. Lu, J. Zhou, and Q. Liu, “Deep Neural Machine Berlin, Germany, 2016, pp. 1683–1692, doi: 10.18653/v1/P16-1159.
Translation with Linear Associative Unit,” in Proc. 55th Ann. [30] Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu, “Towards Robust
Meeting Assoc. Comput. Linguistics, Vancouver, Canada, 2017, Neural Machine Translation,” in Proc. 56th Ann. Meeting Assoc.
pp. 136–145, doi: 10.18653/v1/P17-1013. Comput. Linguistics (Volume 1: Long Papers), Melbourne, Australia,
[6] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, “Modeling coverage for 2018, pp. 1756–1766.
neural machine translation,” in Proc. 54th Ann. Meeting Ass. Com- [31] M. Wang, Z. Lu, H. Li, and Q. Liu, “Memory-enhanced decoder
put. Linguistics (Vol. 1: Long Papers), Berlin, Germany, pp. 76–85, for neural machine translation,” in Proc. Conf. Empirical Methods
2016, doi: 10.18653/v1/P16-1008. Natural Language Process., Nov. 2016, pp. 278–286.
[7] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to [32] T. Luong, I. Sutskever, Q. Le, O. Vinyals, and W. Zaremba,
attention-based neural machine translation,” in Proc. Conf. Empiri- “Addressing the rare word problem in neural machine trans-
cal Methods Natural Language Process., Sep. 2015, pp. 1412–1421. lation,” in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics 7th
[8] B. Zhang, D. Xiong, and J. Su, “Recurrent neural machine trans- Int. Joint Conf. Natural Language Process., Jul. 2015, pp. 11–19.
lation,” arXiv preprint arXiv:1607.08725, 2016.
[9] B. Zhang, D. Xiong, and J. Su, “A GRU-gated attention model for Biao Zhang received the master’s degree in
neural machine translation,” arXiv preprint arXiv:1704.08430, 2017. computer science from the School of Software,
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, Xiamen University, China, in 2018. He is a first-
A. N. Gomez, L. U. Kaiser, and I. Polosukhin, “Attention is all you year research PhD student at the Institute for
need,” in Proc. Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008. Language, Cognition and Computation, Univer-
[11] B. Zhang, D. Xiong, and J. Su, “Accelerating neural transformer sity of Edinburgh, under Lecturer Dr. Rico Senn-
via an average attention network,” in Proc. 56th Annu. Meeting rich. His research interests are natural language
Assoc. Comput. Linguistics, Jul. 2018, pp. 1789–1798. processing and deep learning, particularly the
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for neural machine translation.
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
nit., Jun. 2016, pp. 777–778, doi: 10.1109/CVPR.2016.90.
[13] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very Deyi Xiong received the PhD degree from the
deep networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., Institute of Computing Technology, Beijing,
2015, pp. 2377–2385. China, in 2007. He is a professor at Tianjin Uni-
[14] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, “Stacked atten- versity. Previously, he was a professor at Soo-
tion networks for image question answering,” in Proc. IEEE Conf. chow University from 2013-2018 and a research
Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 27–30, scientist at the Institute for Infocomm Research
2016, pp. 21–29, doi: 10.1109/CVPR.2016.10. of Singapore from 2007-2013. His primary
[15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, research interests are in the area of natural lan-
“Convolutional sequence to sequence learning,” in Proc. 34th Int. guage processing especially in machine transla-
Conf. Mach. Learn., Sydney, NSW, Australia, 6–11 Aug. 2017, tion, dialogue, and natural language generation.
pp. 1243–1252.
[16] M. Antonino Di Gangi and M. Federico, “Deep Neural Machine
Translation with Weakly-Recurrent Units,” in Proc. 21st Ann. Jinsong Su received the PhD degree from
Conf. Eur. Assoc. Mach. Transl., Alicante, Spain, 2018, pp. 119–128. the Institute of Computing Technology, Chinese
[17] J. Chung, Ç. G€ ulçehre, K. Cho, and Y. Bengio, “Empirical evalua- Academy of Sciences, Beijing, China, in 2011.
tion of gated recurrent neural networks on sequence modeling,” He is now an associate professor in the Software
arXiv e-prints, vol. abs/1412.3555, 2014. School of Xiamen University. His research inter-
[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method ests include natural language processing and
for automatic evaluation of machine translation,” in Proc. 40th deep learning.
Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[19] P. Koehn, “Statistical significance tests for machine translation
evaluation,” in Proc. Conf. Empirical Methods Natural Language Pro-
cess., Barcelona, Spain, 25-26 Jul. 2004, pp. 388–395.
[20] R. Sennrich, B. Haddow, and A. Birch, “Neural machine transla- " For more information on this or any other computing topic,
tion of rare words with subword units,” in Proc. 54th Annu. Meet- please visit our Digital Library at www.computer.org/csdl.
ing Assoc. Comput. Linguistics, Aug. 2016, pp. 1715–1725.
Authorized licensed use limited to: University of Pennsylvania. Downloaded on April 28,2024 at 15:08:10 UTC from IEEE Xplore. Restrictions apply.

You might also like