
AI Open 1 (2020) 5–21

Contents lists available at ScienceDirect

AI Open
journal homepage: www.keaipublishing.com/en/journals/ai-open

Neural machine translation: A review of methods, resources, and tools


Zhixing Tan a, c, d, Shuo Wang a, c, d, Zonghan Yang a, c, d, Gang Chen a, c, d, Xuancheng Huang a, c, d,
Maosong Sun a, c, d, e, Yang Liu a, b, c, d, e, *
a Department of Computer Science and Technology, Tsinghua University, China
b Institute for AI Industry Research, Tsinghua University, China
c Institute for Artificial Intelligence, China
d Beijing National Research Center for Information Science and Technology, China
e Beijing Academy of Artificial Intelligence, China

ARTICLE INFO

Keywords: Neural machine translation, Attention mechanism, Deep learning, Natural language processing

ABSTRACT

Machine translation (MT) is an important sub-field of natural language processing that aims to translate natural languages using computers. In recent years, end-to-end neural machine translation (NMT) has achieved great success and has become the new mainstream method in practical MT systems. In this article, we first provide a broad review of the methods for NMT and focus on methods relating to architectures, decoding, and data augmentation. Then we summarize the resources and tools that are useful for researchers. Finally, we conclude with a discussion of possible future research directions.

1. Introduction

Machine Translation (MT) is an important task that aims to translate natural language sentences using computers. Early approaches to machine translation relied heavily on hand-crafted translation rules and linguistic knowledge. As natural languages are inherently complex, it is difficult to cover all language irregularities with manual translation rules. With the availability of large-scale parallel corpora, data-driven approaches that learn linguistic information from data have gained increasing attention. Unlike rule-based machine translation, Statistical Machine Translation (SMT) (Brown et al., 1990; Koehn et al., 2003) learns latent structures such as word alignments or phrases directly from parallel corpora. Incapable of modeling long-distance dependencies between words, the translation quality of SMT is far from satisfactory. With the breakthrough of deep learning, Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Bahdanau et al., 2015) has emerged as a new paradigm and quickly replaced SMT as the mainstream approach to MT.

Neural machine translation is a radical departure from previous machine translation approaches. On the one hand, NMT employs continuous representations instead of the discrete symbolic representations of SMT. On the other hand, NMT uses a single large neural network to model the entire translation process, freeing it from the need for excessive feature engineering. The training of NMT is end-to-end, as opposed to the separately tuned components of SMT. Besides its simplicity, NMT has achieved state-of-the-art performance on various language pairs (Junczys-Dowmunt et al., 2016). In practice, NMT has also become the key technology behind many commercial MT systems (Wu et al., 2016; Hassan et al., 2018).

As neural machine translation attracts much research interest and grows into an area with many research directions, we believe it is necessary to conduct a comprehensive review of NMT. In this work, we will give an overview of the key ideas and innovations behind NMT. We also summarize the resources and tools that are useful and easily accessible. We hope that by tracing the origins and evolution of NMT, we can stand on the shoulders of past studies and gain insights into the future of NMT.

The remainder of this article is organized as follows: Section 2 will review the methods of NMT. We first introduce the basics of NMT, and then we selectively describe the recent progress of NMT. We focus on methods related to architectures, decoding, and data augmentation. Section 3 will summarize the resources such as parallel or monolingual corpora that are publicly available to researchers. Section 4 will describe tools that are useful for training and evaluating NMT models. Finally, we conclude and discuss future directions in Section 5.

* Corresponding author. Department of Computer Science and Technology, Tsinghua University, China
E-mail address: [email protected] (Y. Liu).

https://doi.org/10.1016/j.aiopen.2020.11.001
Received 10 October 2020; Accepted 20 November 2020
Available online 4 March 2021
2666-6510/© 2020 The Author(s). Published by Elsevier B.V. on behalf of KeAi Communications Co., Ltd. This is an open access article under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

2. Methods

As a data-driven approach to machine translation, NMT also embraces the probabilistic framework. Mathematically speaking, the goal of NMT is to estimate an unknown conditional distribution P(y|x) given the dataset D, where x and y are random variables representing the source input and target output, respectively. We strive to answer three basic questions of NMT:

- Modeling. How to design neural networks to model the conditional distribution?
- Inference. Given a source input, how to generate a translation sentence from the NMT model?
- Learning. How to effectively learn the parameters of NMT from data?

In 2.1, we first describe the basic methods of NMT for addressing the above three questions. We then dive into the details of NMT architectures in 2.2. We introduce non-autoregressive NMT and bidirectional inference in 2.3, and discuss alternative training objectives and the use of monolingual data in 2.4 and 2.5.

Despite the great success, NMT is far from perfect. There are several theoretical and practical challenges faced by NMT, and we survey the research progress of some important directions. We describe methods for open vocabulary in 2.6, prior knowledge integration in 2.7, and interpretability and robustness in 2.8.

2.1. Overview of NMT

2.1.1. Modeling
Translation can be modeled at different levels, such as the document, paragraph, or sentence level. In this article, we focus on sentence-level translation. Besides, we also assume the input and output sentences are sequences. Thus the NMT model can be viewed as a sequence-to-sequence model. Assume we are given a source sentence x = {x_1, ..., x_S} and a target sentence y = {y_1, ..., y_T}. By using the chain rule, the conditional distribution can be factorized from left to right (L2R) as:

P(y|x) = ∏_{t=1}^{T} P(y_t | y_0, ..., y_{t-1}, x).  (1)

NMT models that conform to Eq. (1) are referred to as L2R autoregressive NMT (Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Bahdanau et al., 2015), for the prediction at time-step t is taken as an input at time-step t+1.

Almost all neural machine translation models employ the encoder-decoder framework (Cho et al., 2014a). The encoder-decoder framework consists of four basic components: the embedding layers, the encoder and decoder networks, and the classification layer. Fig. 1 shows a typical autoregressive NMT model using the encoder-decoder framework, which we shall use as an example. "<bos>" and "<eos>" are special symbols that mark the beginning and ending of a sentence, respectively.

Fig. 1. An overview of the NMT architecture, which consists of embedding layers, a classification layer, an encoder network, and a decoder network. We use different colors to distinguish different languages.

The embedding layer embodies the concept of continuous representation. It maps a discrete symbol x_t into a continuous vector x_t ∈ R^d, where d denotes the dimension of the vector. The embeddings are then fed into later layers for finer-grained feature extraction.

The encoder network maps the source embeddings into hidden continuous representations. To learn expressive representations, the encoder must be able to model the ordering and complex dependencies that exist in the source language. Recurrent neural networks (RNNs) are a suitable choice for modeling variable-length sequences. With RNNs, the computation involved in the encoder can be described as:

h_t = RNN_ENC(x_t, h_{t-1}).  (2)

By iteratively applying the state transition function RNN_ENC over the input sequence, we can use the final state h_S as the representation of the entire source sentence and then feed it to the decoder.

The decoder can be viewed as a language model conditioned on h_S. The decoder network extracts the necessary information from the encoder output and also models the long-distance dependencies between target words. Given the start symbol y_0 = <bos> and the initial state s_0 = h_S, the RNN decoder compresses the decoding history {y_0, ..., y_{t-1}} into a state vector s_t ∈ R^d:

s_t = RNN_DEC(y_{t-1}, s_{t-1}).  (3)

The classification layer predicts the distribution of target tokens. The classification layer is typically a linear layer with a softmax activation function. Assume the vocabulary of the target language is V and |V| is the size of the vocabulary. Given a decoder output s_t ∈ R^d, the classification layer first maps it to a vector z in the vocabulary space R^{|V|} with the linear map. Then the softmax function is used to ensure the output vector is a valid probability distribution:

softmax(z) = exp(z) / ∑_{i=1}^{|V|} exp(z_[i]),  (4)

where we use z_[i] to denote the i-th component of z.
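To make the modeling concrete, the following is a minimal sketch (in PyTorch, with illustrative vocabulary sizes, dimensions, and a single greedy decoding step) of how the embedding layers, RNN encoder, RNN decoder, and classification layer of Eqs. (2)-(4) fit together; it is not the implementation of any particular system.

# A minimal sketch of the encoder-decoder computation in Eqs. (2)-(4),
# using PyTorch GRU cells as RNN_ENC and RNN_DEC. Sizes are illustrative.
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)   # source embedding layer
        self.tgt_emb = nn.Embedding(tgt_vocab, d)   # target embedding layer
        self.rnn_enc = nn.GRUCell(d, d)             # RNN_ENC in Eq. (2)
        self.rnn_dec = nn.GRUCell(d, d)             # RNN_DEC in Eq. (3)
        self.classifier = nn.Linear(d, tgt_vocab)   # linear map to R^{|V|}

    def encode(self, src_ids):
        h = torch.zeros(src_ids.size(0), self.rnn_enc.hidden_size)
        for t in range(src_ids.size(1)):            # h_t = RNN_ENC(x_t, h_{t-1})
            h = self.rnn_enc(self.src_emb(src_ids[:, t]), h)
        return h                                     # h_S: sentence representation

    def decode_step(self, prev_ids, s):
        s = self.rnn_dec(self.tgt_emb(prev_ids), s)  # s_t = RNN_DEC(y_{t-1}, s_{t-1})
        probs = torch.softmax(self.classifier(s), -1)  # Eq. (4)
        return probs, s

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))          # a batch of 2 source sentences
state = model.encode(src)
probs, state = model.decode_step(torch.zeros(2, dtype=torch.long), state)
print(probs.shape)  # (2, 1000): a distribution over the target vocabulary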


2.1.2. Inference
Given an NMT model and a source sentence x, how to generate a translation from the model is an important problem. Ideally, we would like to find the target sentence y that maximizes the model prediction P(y|x; θ) as the translation.
However, due to the intractably large search space, it is impractical to find the translation with the highest probability. Therefore, NMT typically uses local search algorithms such as greedy search or beam search to find a locally best translation.

Beam search is a classic local search algorithm which has been widely used in NMT. Previously, beam search was successfully applied in SMT. The beam search algorithm keeps track of k states during the inference stage. Each state is a tuple ⟨y_0 ... y_t, v⟩, where y_0 ... y_t is a candidate translation and v is the log-probability of the candidate. At each step, all the successors of all k states are generated, but only the top-k successors are selected. The algorithm usually terminates when the step exceeds a pre-defined value or k full translations are found. It should be noted that beam search degrades into greedy search if k = 1. The pseudo-code of the beam search algorithm is given in Algorithm 1, and a running example of the algorithm is shown in Fig. 2.

Algorithm 1. The beam search algorithm
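Since the Algorithm 1 box itself is a figure, the following Python sketch illustrates the procedure described above; the step_fn callback that returns candidate next tokens with their log-probabilities is an assumed stand-in for a real NMT model.

# A minimal sketch of beam search: keep k states, expand all successors,
# retain the top-k, and stop when k finished translations are collected.
import math

def beam_search(step_fn, bos, eos, k=4, max_len=50):
    beams = [([bos], 0.0)]                      # k states: (candidate, log-prob v)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in step_fn(prefix):  # all successors of each state
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:    # keep only the top-k successors
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                           # all surviving beams have finished
            break
    return max(finished + beams, key=lambda c: c[1])[0]

# Toy usage: a fake model that always prefers token 7, then eos (id 2).
def toy_step(prefix):
    return [(7, math.log(0.6)), (2, math.log(0.3)), (5, math.log(0.1))]

print(beam_search(toy_step, bos=1, eos=2, k=2, max_len=5))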


2.1.3. Training of NMT models
NMT typically uses maximum log-likelihood estimation (MLE) as the training objective, which is a commonly used method of estimating the parameters of a probability distribution. Formally, given the training set D = {⟨x^(s), y^(s)⟩}_{s=1}^{S}, the goal of training is to find a set of model parameters that maximizes the log-likelihood on the training set:

θ̂_MLE = argmax_θ {L(θ)},  (5)

where the log-likelihood is defined as

L(θ) = ∑_{s=1}^{S} log P(y^(s) | x^(s); θ).  (6)

By virtue of the back-propagation algorithm, we can efficiently compute the gradient of L with respect to θ. The training of NMT models usually adopts the stochastic gradient descent (SGD) algorithm. Instead of computing gradients on the full training set, SGD computes the loss function and gradients on a minibatch of the training set. The plain SGD optimizer updates the parameters of an NMT model with the following rule:

θ ← θ − α ∇L(θ),  (7)

where α is the learning rate. With a well-chosen learning rate, the parameters of NMT are guaranteed to converge to a local optimum. In practice, instead of the plain SGD optimizer, adaptive learning rate optimizers such as Adam (Kingma and Ba, 2014) are found to greatly reduce the training time.
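The following sketch shows one minibatch MLE training step corresponding to Eqs. (5)-(7); the model that returns per-token logits is an assumed stand-in, and both the plain SGD update and the Adam alternative are indicated.

# One MLE training step on a minibatch: negative log-likelihood loss,
# back-propagation, and a parameter update (theta <- theta - alpha * grad).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src, tgt):
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]            # shift: predict y_1..y_T
    logits = model(src, tgt_in)                          # (batch, T, |V|), assumed interface
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1))          # -log P(y|x; theta) per token
    optimizer.zero_grad()
    loss.backward()                                      # gradients via back-propagation
    optimizer.step()                                     # apply the update rule
    return loss.item()

# Plain SGD with learning rate alpha, or the adaptive Adam optimizer:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)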

Fig. 2. A running example of the beam-search algorithm.

2.2. Architectures

2.2.1. Evolution of NMT architectures
Since 2013, there have been attempts to build purely neural MT systems. Early NMT architectures such as RCTM (Kalchbrenner and Blunsom, 2013), RNNEncdec (Cho et al., 2014a), and Seq2Seq (Sutskever et al., 2014) adopt a fixed-length approach, where the size of the source representation is fixed regardless of the length of the source sentence. These works typically use recurrent neural networks (RNNs) as the decoder network for generating variable-length translations.

However, it is found that the performance of this approach degrades as the length of the input sentence increases (Cho et al., 2014b). Two explanations can account for this phenomenon:

1. The fixed-length representation becomes a bottleneck during the encoding process for long sentences (Cho et al., 2014a). As the encoder is forced to compress the entire source sentence into a set of fixed-length vectors, some important information may be lost in this process.
2. The longest path between the source words and target words is O(S + T), and it is challenging for neural networks to learn long-term dependencies (Bengio et al., 1994). Sutskever et al. (2014) found that reversing the source sentence can significantly improve the performance of the fixed-length approach. By reversing the source sentence, the paths between the beginning words of the source and target sentences are reduced, and thus the optimization problem becomes easier.

Due to these limitations, later NMT architectures switched to variable-length source representations, where the length of the source representation depends on the length of the source sentence. The RNNsearch architecture (Bahdanau et al., 2015) introduces the attention mechanism, which is an important approach to implementing variable-length representations. Fig. 3 shows the comparison between the fixed-length and variable-length approaches. By using the attention mechanism, the paths between any source and target words are within a constant length. As a result, the attention mechanism eases the optimization difficulty.

Fig. 3. At each decoding step, the attention mechanism dynamically generates a context vector based on the most relevant source representations for predicting the next target word.

With the breakthrough of deep learning, NMT with deep neural networks has attracted much research interest. Seq2Seq (Sutskever et al., 2014) is the first architecture to demonstrate the potential of deep NMT. Later architectures such as GNMT (Wu et al., 2016), ByteNet (Kalchbrenner et al., 2016), ConvSeq2Seq (Gehring et al., 2017), and Transformer (Vaswani et al., 2017) all use multi-layered neural networks. ByteNet and ConvSeq2Seq replaced RNNs with convolutional neural networks (CNNs) in their architectures, while Transformer relies entirely on self-attention networks (SANs). Both CNNs and SANs can reduce the sequential operations involved in RNNs and benefit from the parallel computation provided by modern devices such as GPUs or TPUs. Importantly, SANs can further reduce the longest path between two target tokens.

2.2.2. Attention mechanism
The introduction of the attention mechanism (Bahdanau et al., 2015) is a milestone in NMT architecture research. The attention network computes the relevance of each value vector based on queries and keys. This can also be interpreted as a content-based addressing scheme (Graves et al., 2014). Formally, given a set of m query vectors Q ∈ R^{m×d}, a set of n key vectors K ∈ R^{n×d} and associated value vectors V ∈ R^{n×d}, the computation of the attention network involves two steps. The first step is to compute the relevance between queries and keys, which is formally described as:

R = score(Q, K),  (8)

where score is a scoring function with several alternatives, and R ∈ R^{m×n} is a matrix storing the relevance score between each query and key. The next step is to compute the output vectors. For each query vector, the corresponding output vector is expressed as a weighted sum of value vectors:

Attention(Q, K, V) = softmax(R) · V.  (9)

Fig. 4 depicts the two steps involved in the computation of the attention mechanism.

Depending on the scoring function, attention networks can be roughly classified into two categories: additive attention (Bahdanau et al., 2015) and dot-product attention (Luong et al., 2015). The additive attention models the score through a feed-forward neural network:

R_[i,j] = v^T tanh(W_s Q_[i] + U_s K_[j]),  (10)

where W_s ∈ R^{d×d}, U_s ∈ R^{d×d}, and v ∈ R^{d×1} are learnable parameters. On the other hand, the dot-product attention uses the dot product to compute the matching score:

R_[i,j] = Q_[i]^T K_[j].  (11)

In practice, the dot-product attention is much faster than the additive attention. However, the dot-product attention is found to be less stable than the additive attention when d is large (Vaswani et al., 2017). Vaswani et al. (2017) suspect that the dot-products grow large in magnitude for large values of d, which may result in extremely small gradients caused by the softmax function. To remedy this issue, they propose to scale the dot-products by 1/√d.

The attention mechanism is usually used as a part of the decoder network. Another type of attention network, called the self-attention network, is widely used in both the encoder and decoder of NMT. We shall describe self-attention and other variants of the attention network later.
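A NumPy sketch of the two scoring functions and the weighted sum in Eqs. (9)-(11) is given below; the parameter matrices W_s, U_s and vector v are random stand-ins, and the 1/√d scaling of the dot-product variant is included as an option.

# Additive vs. dot-product attention scoring, followed by softmax(R) V.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_score(Q, K, W_s, U_s, v):
    # R[i, j] = v^T tanh(W_s Q[i] + U_s K[j])
    q = Q @ W_s.T                                        # (m, d)
    k = K @ U_s.T                                        # (n, d)
    return np.tanh(q[:, None, :] + k[None, :, :]) @ v    # (m, n)

def dot_product_score(Q, K, scale=True):
    # R[i, j] = Q[i]^T K[j], optionally scaled by 1/sqrt(d)
    d = Q.shape[-1]
    return (Q @ K.T) / (np.sqrt(d) if scale else 1.0)

def attention(R, V):
    return softmax(R, axis=-1) @ V                       # Eq. (9): weighted sum of values

m, n, d = 3, 5, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
W_s, U_s, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
print(attention(additive_score(Q, K, W_s, U_s, v), V).shape)   # (3, 8)
print(attention(dot_product_score(Q, K), V).shape)             # (3, 8)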
2.2.3. RNNs, CNNs, and SANs
The encoder and decoder are key components of NMT architectures. There are many methods to build powerful encoders and decoders, which can be roughly divided into three categories: recurrent neural network (RNN) based methods, convolutional neural network (CNN) based methods, and self-attention network (SAN) based methods. There are several aspects we need to take into consideration when building an encoder and decoder:

1. Receptive field. We hope each output produced by the encoder and decoder can potentially encode arbitrary information in the input sequence.
2. Computational complexity. It is desirable to use a network with lower computational complexity.
3. Sequential operations. Too many sequential operations preclude parallel computation within the sequence.
4. Position awareness. The network should distinguish the ordering present in the sequence.

Table 1 summarizes the computation as well as the above-mentioned aspects of typical RNN, CNN, and SAN layers (see Table 2).

Fig. 5 gives an overview of the ways RNN, CNN, and SAN encode sequences, respectively. As we can see in Fig. 5(a), RNNs are a family of sequential models that repeatedly apply the same state transition function to sequences.

In theory, RNNs are among the most powerful families of neural networks (Siegelmann and Sontag, 1995). However, they suffer from severe vanishing and exploding gradient problems (Bengio et al., 1994) in practice. RNNs with gates, such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014a), have been proposed to alleviate this problem. Another way to stabilize training is to incorporate normalization layers, such as layer normalization (Ba et al., 2016).

Table 1. Comparisons between different neural network layers. We use R.F. to denote the receptive field, S.O. to denote the number of sequential operations, and P.A. to denote the position awareness of the layer. t is the position in the sequence, l is the layer number. For CNN, k is the filter width and W^(i) is the weight of the i-th filter.
Layer | Computation | R.F. | Complexity | S.O. | P.A.
RNN | h_{l,t} = W h_{l-1,t} + U h_{l,t-1} | ∞ | O(n · d^2) | O(n) | Yes
CNN | h_{l,t} = ∑_{i=1}^{k} W^(i) h_{l-1, t+i-(k+1)/2} | k | O(k · n · d^2) | O(1) | Yes
SAN | h_{l,t} = ∑_{i=1}^{n} α_{l,i} h_{l-1,i} | ∞ | O(n^2 · d) | O(1) | No

Table 2. Comparison of fundamental architectures. V.R. denotes whether the architecture employs variable-length representations. Path_E denotes the longest path between the source and target tokens. Path_D denotes the longest path between two target tokens.
Model | Encoder | Decoder | Complexity | V.R. | Path_E | Path_D
RCTM 1 (Kalchbrenner and Blunsom, 2013) | CNN | RNN | O(S^2 + T) | No | S | T
RCTM 2 (Kalchbrenner and Blunsom, 2013) | CNN | RNN | O(S^2 + T) | Yes | S | T
RNNEncdec/Seq2Seq (Cho et al., 2014a; Sutskever et al., 2014) | RNN | RNN | O(S + T) | No | S + T | T
RNNsearch (Bahdanau et al., 2015) | RNN | RNN | O(ST) | Yes | 1 | T
ByteNet (Kalchbrenner et al., 2016) | CNN | CNN | O(S + T) | Yes | c | c
ConvSeq2Seq (Gehring et al., 2017) | CNN | CNN | O(ST) | Yes | 1 | c
Transformer (Vaswani et al., 2017) | SAN | SAN | O(S^2 + ST + T^2) | Yes | 1 | 1

Fig. 4. Detailed computations involved in the attention mechanism.

In order to keep the auto-regressive property of the NMT decoder during training, CNN and SAN further need additional padding and masking to prevent the network from seeing future words. Fig. 6 shows the padding and masking used in CNN and SAN.

Fig. 7 shows three extensions to RNNs that are widely used in the NMT literature. Deep RNNs are one important way to increase the expressive power of RNNs. However, training deep neural networks is challenging because it also faces the vanishing and exploding gradient problem. There are many ways to construct deep RNNs, and the most popular one is to stack multiple RNNs with residual connections (He et al., 2016a). The residual connection is an important method for constructing deep neural networks. Residual connections use identity mappings as the skip connections, which is formally described as:

y = x + f(x),  (12)

where x and y are the input and output, respectively, and f is the neural network. By using identity mappings, the gradient signal can directly propagate into lower layers. Bidirectional RNNs (Bahdanau et al., 2015) use two RNNs to process the same sequence in opposite directions and concatenate the results of both RNNs as the final output. In this way, each output of a bidirectional RNN encodes all the tokens in the sequence. An alternative to bidirectional RNNs is alternating RNNs (Zhou and Xu, 2015), which consist of RNNs in opposite directions in adjacent layers.
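As a small illustration of Eq. (12), the sketch below wraps an arbitrary sub-layer f with an identity skip connection; the two-layer feed-forward network used for f is only an example.

# y = x + f(x): the identity skip path lets gradients reach lower layers directly.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.f(x)     # Eq. (12)

x = torch.randn(4, 256)
print(ResidualBlock()(x).shape)  # torch.Size([4, 256])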
Besides the difficulty of training RNNs, another major drawback of RNNs is that they are sequential models in nature, which cannot benefit from the parallel computation provided by modern GPUs. CNNs and SANs, however, which fully exploit the parallel computation within sequences, are widely used in newer NMT architectures.

Table 3. Domain and language pairs provided by WMT20, IWSLT20, WAT20.
Workshop | Domain | Language Pair
WMT20 | News | zh-en, cz-en, fr-de, de-en, iu-en, km-en, ja-en, ps-en, pl-en, ru-en, ta-en
WMT20 | Biomedical | en-eu, en-zh, en-fr, en-de, en-it, en-pt, en-ru, en-es
WMT20 | Chat | en-de
IWSLT20 | TED Talks | en-de
IWSLT20 | e-Commerce | zh-en, en-ru
IWSLT20 | Open Domain | zh-ja
WAT20 | Scientific Paper | en-ja, zh-ja
WAT20 | Business Scene Dialogue | en-ja
WAT20 | Patent | zh-ja, ko-ja, en-ja
WAT20 | News | ja-en, ja-ru
WAT20 | IT and Wikinews | hi-en, th-en, ms-en, id-en

Table 4. Number of sentences available at OPUS for major languages paired with English.
Source | Fr-En | Es-En | De-En | Pt-En | Ru-En | Ar-En | Zh-En | Ja-En | Hi-En
OPUS (Tiedemann, 2016) | 200.6M | 172.0M | 93.3M | 77.7M | 75.5M | 69.2M | 31.2M | 6.2M | 1.7M

Table 5. Popular open-source NMT toolkits on GitHub; the ordering is determined by the number of stars as of December 2020.
Name | Language | Framework | Status
TENSOR2TENSOR | Python | TensorFlow | Deprecated
FAIRSEQ | Python | PyTorch | Active
NMT | Python | TensorFlow | Deprecated
OPENNMT | Python/C++ | PyTorch/TensorFlow | Active
SOCKEYE | Python | MXNet | Active
NEMATUS | Python | TensorFlow | Active
MARIAN | C++ | – | Active
THUMT | Python | PyTorch/TensorFlow | Active
NMT-KERAS | Python | Keras | Active
NEURAL MONKEY | Python | TensorFlow | Active

Fig. 5. Overview of the computation diagrams of RNN, CNN, and SAN. For clarity, we use a node to denote the input or output vector of a specific layer.

Fig. 6. The computation of CNN and SAN during decoding.

Fig. 7. Three extensions to RNNs.

Convolutional neural networks (CNNs) were first introduced into NMT in 2013 (Kalchbrenner and Blunsom, 2013). However, they were not as successful as RNNs until 2017 (Gehring et al., 2017). The main obstacle to applying CNNs is their limited receptive field. Stacking L CNN layers with kernel width k can increase the receptive field from k to L·(k−1)+1. The network needs to go deeper with a large L and adopt a large kernel size k to model long sentences. However, learning deep CNNs is challenging, and using a large kernel size k may significantly increase the complexity and number of parameters involved in CNNs.

One solution to increase the receptive field without using a large k is through dilation (Kalchbrenner et al., 2016). Fig. 8 shows the comparison between plain CNN and dilated CNN. Plain CNN can be viewed as a special case of dilated CNN with a dilation rate r = 1. The computation of dilated CNN is mathematically formulated as:

h_{l,t} = ∑_{i=1}^{k} W^(i) h_{l-1, t+(i−(k+1)/2)·r}.  (13)

Stacking L dilated CNN layers whereby the dilation rates are doubled every layer, the receptive field increases to (2^L − 1)·(k−1)+1. As a result, the receptive field grows exponentially with L, as opposed to linearly with L in plain CNN.
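The following short calculation contrasts the two receptive-field formulas above for a stack of L layers with kernel width k; the numbers are purely illustrative.

# Receptive field: L*(k-1)+1 for plain stacks vs. (2^L - 1)*(k-1)+1 for
# dilated stacks whose dilation rate doubles at every layer.
def plain_receptive_field(L, k):
    return L * (k - 1) + 1

def dilated_receptive_field(L, k):
    return (2 ** L - 1) * (k - 1) + 1

for L in (1, 2, 4, 8):
    print(L, plain_receptive_field(L, k=3), dilated_receptive_field(L, k=3))
# With k = 3, eight plain layers cover 17 positions, while eight dilated
# layers cover 511: exponential rather than linear growth with depth.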
Another solution is to reduce the computation involved in CNNs. Depthwise convolution (Kaiser et al., 2017) reduces the complexity from O(kd^2) to O(kd) by performing convolution independently over channels. Fig. 9 depicts the comparison between CNN and depthwise CNN. The output of the depthwise convolution layer is defined as:

h_{l,t} = ∑_{i=1}^{k} w^(i) ⊙ h_{l-1, t+i−(k+1)/2},  (14)

where w^(i) is the i-th column of the weight matrix W ∈ R^{d×k}. Lightweight convolution (Wu et al., 2019a) further reduces the number of parameters of depthwise convolution through weight sharing.

Fig. 8. Comparison between CNN and dilated CNN. (a) Two-layer CNN with filter width k = 3 for each layer. (b) Dilated CNN with filter width k = 3 for all layers, dilation rate r = 1 in layer 1 and r = 2 in layer 2.

Fig. 9. Comparison between CNN and depthwise CNN. Each node in the graph represents a neuron instead of a vector. (a) Plain CNN. We highlight the computation of the first neuron in the output vector. (b) Depthwise CNN. Note that the connections are significantly reduced compared with plain CNN.

The self-attention network (SAN) (Vaswani et al., 2017) is a special case of attention network where the queries, keys, and values come from the same sequence. Similar to CNN, SAN is trivial to parallelize. Furthermore, each output in SAN also has an infinite receptive field, which is the same as RNN. In SAN, the queries, keys, and values are typically obtained through a linear map of the input representations. The scaled dot-product self-attention mechanism can be formally described as:

Attention(Q, K, V) = softmax(QK^T / √d) V.  (15)

Multi-head attention (Vaswani et al., 2017) is an extended attention network with multiple parallel heads. Each head attends to information from a different subspace across the value vectors. As a result, multi-head attention can perform more flexible transformations than single-head attention. We give an illustration of multi-head attention in Fig. 10.

Fig. 10. An illustration of multi-head attention.
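Below is a sketch of scaled dot-product self-attention (Eq. (15)) with multiple heads; the use of a single fused linear map for Q, K, and V and the particular head count are implementation choices assumed for illustration, not a prescription from the cited paper.

# Multi-head scaled dot-product self-attention over one input sequence.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d=512, n_heads=8):
        super().__init__()
        assert d % n_heads == 0
        self.h, self.dk = n_heads, d // n_heads
        self.qkv = nn.Linear(d, 3 * d)     # linear maps producing Q, K, V
        self.out = nn.Linear(d, d)

    def forward(self, x):                  # x: (batch, n, d)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, n, d_k)
        q, k, v = (t.view(b, n, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5   # softmax(QK^T / sqrt(d))
        ctx = torch.softmax(scores, dim=-1) @ v
        return self.out(ctx.transpose(1, 2).reshape(b, n, self.h * self.dk))

x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 10, 512])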
The major disadvantage of the SAN network is that it ignores the ordering of words in the sequence. To remedy this, SAN needs additional position encoding to differentiate orders. Vaswani et al. (2017) proposed a sinusoid-style position encoding, which is formulated as:

timing(t, 2i) = sin(t / 10000^{2i/d}),  (16)

timing(t, 2i+1) = cos(t / 10000^{2i/d}),  (17)

where t is the position and i is the dimension index. Another popular way of position encoding is to learn an additional position embedding. Finally, the position encoding is added to each word representation, so the same words at different positions can have different representations.
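A NumPy sketch of the sinusoidal encoding of Eqs. (16)-(17), added to a toy embedding matrix, is shown below.

# Sinusoidal position encoding: sine on even dimensions, cosine on odd ones,
# added to the word embeddings so identical words at different positions differ.
import numpy as np

def sinusoid_position_encoding(max_len, d):
    pe = np.zeros((max_len, d))
    positions = np.arange(max_len)[:, None]            # t
    div = 10000.0 ** (np.arange(0, d, 2) / d)           # 10000^{2i/d}
    pe[:, 0::2] = np.sin(positions / div)                # timing(t, 2i)
    pe[:, 1::2] = np.cos(positions / div)                # timing(t, 2i+1)
    return pe

embeddings = np.random.randn(20, 512)          # 20 word embeddings of size 512
inputs = embeddings + sinusoid_position_encoding(20, 512)
print(inputs.shape)                            # (20, 512)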

2.2.4. Comparison of fundamental architectures
We take the state-of-the-art Transformer architecture (Vaswani et al., 2017) as an example to put all things together. Fig. 11 shows the architecture of Transformer. The Transformer model relies solely on attention networks, with additional sinusoid-style position encoding added to the input embedding. The Transformer network consists of a stack of 6 encoder layers and 6 decoder layers. Each encoder layer contains two sub-layers, whereas each decoder layer contains three sub-layers. To stabilize optimization, Transformer uses a residual connection and layer normalization in each sub-layer.

We summarize the comparison of fundamental NMT architectures in Table 2, highlighting several important aspects of these fundamental architectures.

Fig. 11. The Transformer architecture.

2.3. Bidirectional inference and non-autoregressive NMT

The dominant approach to NMT factorizes the conditional probability P(y|x) from left to right (L2R) auto-regressively. However, the factorization of the distribution is not unique. Researchers (Liu et al., 2016; Hoang et al., 2017; Zhang et al., 2018; Zhou et al., 2019a) have found that models with right-to-left (R2L) factorization are complementary to L2R models. Bidirectional inference is an approach to simultaneously generating translations with both L2R and R2L decoders. In contrast to auto-regressive approaches, where each output word is conditioned on previously generated outputs, non-autoregressive NMT (Gu et al., 2018) avoids this auto-regressive property and produces outputs in parallel, allowing much lower latency during inference.

2.3.1. Bidirectional inference
Ignoring the future context is another obvious weakness of autoregressive (AR) decoding. Thus, a natural idea is that the quality of translation will be improved if autoregressive models can "know" the future information. From this perspective, many approaches have been proposed to improve translation performance by exploring the future context. Some researchers proposed to model both past and future context (Zheng et al., 2018, 2019; Zhang et al., 2019a), and some others also found that L2R and R2L autoregressive models can generate complementary translations (Liu et al., 2016; Hoang et al., 2017; Zhang et al., 2018; Zhou et al., 2019a).

For instance, Zhou et al. (2019a) analyzed the translation accuracy of the first and last 4 tokens for L2R and R2L models, respectively. The statistical results show that, in Chinese-English translation, L2R performs better on the first 4 tokens while R2L translates better on the last 4 tokens.

Based on the findings mentioned above, a number of methods have been proposed to combine the advantages of L2R and R2L decoding. These approaches are collectively referred to as bidirectional decoding. Bidirectional decoding based methods mainly fall into four categories (Zhang and Zong, 2020): (1) agreement between L2R and R2L (Liu et al., 2016; Yang et al., 2018a; Zhang et al., 2019b), (2) rescoring with bidirectional decoding (Liu et al., 2016; Sennrich et al., 2016a), (3) asynchronous bidirectional decoding (Zhang et al., 2018; Su et al., 2019), and (4) synchronous bidirectional decoding (Zhou et al., 2019a, 2019b; Zhang et al., 2020a).

Mathematically, the L2R translation order is rather arbitrary, and other arrangements such as R2L factorization are equally correct:

P(y|x) = ∏_{t=1}^{T} P(y_t | y_{<t}, x) = ∏_{t=1}^{T} P(y_t | y_{>t}, x),  (18)

where the first factorization corresponds to the L2R model and the second to the R2L model. Based on this theoretical observation, Liu et al. (2016), Yang et al. (2018a), and Zhang et al. (2019b) proposed joint training schemes in which each direction is used as a regularizer for the other direction. Empirical results show that these methods can lead to significant improvements compared with standard L2R and R2L models.

Another common scheme to combine L2R and R2L translations is rescoring (also known as reranking). A strong L2R model first produces an n-best list of translations, and then an R2L model rescores each translation in the n-best list (Liu et al., 2016; Sennrich et al., 2016a, 2017). As the scores from the L2R and R2L directions are based on complementary models, the quality of translation can be improved by rescoring.
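A minimal sketch of such rescoring is shown below; the two scoring callbacks and the interpolation weight are assumptions rather than the exact setup of the cited works.

# Rerank an n-best list from an L2R model with an interpolated L2R + R2L score.
def rescore_nbest(nbest, l2r_score, r2l_score, weight=0.5):
    # nbest: list of candidate translations (token lists) from the L2R model
    def combined(candidate):
        return (1 - weight) * l2r_score(candidate) + weight * r2l_score(candidate)
    return max(nbest, key=combined)

# Toy usage with fake length-normalized log-probability scores.
l2r = lambda c: -0.1 * len(c)
r2l = lambda c: -0.2 * len(c) + (0.5 if c[-1] == "." else 0.0)
print(rescore_nbest([["good", "."], ["good"], ["fine", "!"]], l2r, r2l))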
Recently, Zhang et al. (2018) introduced a new strategy to exploit both L2R and R2L models. They named this method asynchronous bidirectional decoding (ASBD), which first produces outputs (hidden states) with an R2L model and then uses these outputs to optimize the L2R model. ASBD can be done in three steps: the first step is to train an R2L model with bilingual corpora; the second step is to obtain outputs for each given source sentence using the trained R2L model; finally, the outputs of the R2L model are used as additional context with the training data to train the L2R model. Thanks to incorporating the future information from the R2L model, the performance of the L2R model can be substantially improved.

Although ASBD improves the quality of translation, it also incurs other problems. The L2R and R2L models are trained separately so that they have no chance to interact with each other. Besides, the L2R model translates source sentences based on the outputs of an R2L model, which degrades the efficiency of inference. To address these problems, Zhou et al. (2019a) further proposed a synchronous bidirectional decoding (SBD) method which generates translations using both L2R and R2L inference synchronously and interactively. Specifically, SBD uses a new synchronous attention model to allow the L2R and R2L models to "communicate" with each other. As shown in Fig. 12(a), the dotted arrows illustrate interactions between L2R and R2L decoding. Zhou et al. (2019a) also designed a variant of the standard beam search algorithm to perform L2R and R2L decoding concurrently. The idea behind this algorithm is to maintain that each half of the beam contains L2R and R2L predictions, respectively. Empirical results show that SBD can significantly improve performance with a slight cost in decoding speed.

Mehri and Sigal (2018) proposed a novel middle-out decoder architecture that begins from an initial middle word and simultaneously expands the sequence in both L2R and R2L directions. Zhou et al. (2019b) also proposed a similar method that allows L2R and R2L inference to start concurrently from the left and right sides, respectively. Both L2R and R2L inference terminate at the middle position. Extensive experiments demonstrate that this method can improve not only the accuracy of translation but also decoding efficiency.

Fig. 12. Comparisons of different decoding strategies. (a) Bidirectional decoding: generates a sentence in both left-to-right (L2R) and right-to-left (R2L) directions; (b) Non-autoregressive (NAR) decoding: generates a sentence at one time.

2.3.2. Non-autoregressive NMTs
To reduce the latency during inference, Gu et al. (2018) first proposed non-autoregressive NMT (NAT) to generate the target words in parallel. Formally, given the source sentence x, the probability of the target sentence y is modeled as follows:

P_NA(y|x; θ) = P_L(T|x; θ) · ∏_{t=1}^{T} P(y_t | x; θ),  (19)

where P_NA(y|x; θ) is the NAT model, P_L(T|x; θ) is a length sub-model that determines the length of the target sentence, and θ denotes the set of model parameters.

How to predict the length of the target sentence (i.e., P_L(T|x; θ) in Eq. (19)) is critical for NAT. Gu et al. (2018) proposed a fertility predictor to predict the length of the translation. The fertility of a word on the source side determines how many target words it is aligned to. The fertility predictor can be denoted as

P_F(f|x; θ) = ∏_{s=1}^{S} P(f_s | x; θ),  (20)

where f = {f_1, ..., f_S} is the fertility of the source sentence that consists of S words, and θ is the set of parameters. In the training phase, the gold fertility of each sentence pair in the training data can be obtained with a word alignment system. In the inference phase, the length of the target sentence can be determined by the fertility predictor:

T̂ = ∑_{s=1}^{S} f̂_s,  (21)

f̂_s = argmax_{f_s} P(f_s | x; θ̂),  (22)

where T̂ is the number of words in the translation of the source sentence x, and θ̂ is the set of learned parameters.

Different from autoregressive NMT models that take the previous words (i.e., y_{<t}) as the input to predict the next target word y_t, NAT lacks such history information. Gu et al. (2018) also noticed that missing the input of the decoder can greatly impair translation quality. Thus, the authors proposed to copy each source token to the decoder, where the number of times each input token is copied is its fertility. Gu et al. (2018) also used knowledge distillation (Kim and Rush, 2016), which employs strong autoregressive models as the "teachers", to improve the performance. Knowledge distillation has proven necessary for non-autoregressive translation (Zhou et al., 2019c; Gu et al., 2018; Lee et al., 2018; Libovickỳ and Helcl, 2018; Ghazvininejad et al., 2019).
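The following schematic sketch, with assumed toy inputs and a fake token predictor, illustrates the NAT inference procedure described above: fertilities determine the target length (Eqs. (21)-(22)), source tokens are copied as decoder inputs, and all target positions are predicted in parallel. It is an illustration of the idea, not the original implementation.

# NAT-style parallel decoding with a fertility-based length prediction.
import numpy as np

def nat_decode(src_tokens, fertility_probs, token_probs_fn):
    # Eq. (22): f_hat_s = argmax P(f_s | x); Eq. (21): T_hat = sum_s f_hat_s
    fertilities = fertility_probs.argmax(axis=-1)
    decoder_input = [tok for tok, f in zip(src_tokens, fertilities) for _ in range(f)]
    # one prediction per target position, with no left-to-right dependency
    probs = token_probs_fn(decoder_input)          # (T_hat, |V|)
    return probs.argmax(axis=-1)

rng = np.random.default_rng(0)
src = ["wir", "mögen", "musik"]
fert = rng.random((3, 4))                          # fake P(f_s | x) over fertilities 0..3
fake_model = lambda dec_in: rng.random((len(dec_in), 100))
print(nat_decode(src, fert, fake_model))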
Despite the promising success of NAT, which can boost decoding efficiency by about a 15 times speedup compared with the vanilla Transformer, NAT suffers from considerable quality degradation. Recently, many methods have been proposed to narrow the performance gap between non-autoregressive NMT and autoregressive NMT (Lee et al., 2018; Wang et al., 2018a, 2019a; Guo et al., 2019; Shao et al., 2019; Stern et al., 2019; Wei et al., 2019; Akoury et al., 2019; Ghazvininejad et al., 2019; Gu et al., 2019).

To take advantage of both autoregressive NMT and non-autoregressive NMT, Wang et al. (2018a) designed a semi-autoregressive Transformer (SAT) model. SAT keeps the autoregressive property globally but performs non-autoregressive translation locally. Specifically, SAT produces K sequential words per time-step independently of the others. Consequently, SAT can balance autoregressive NMT (K = 1) and non-autoregressive NMT (K = T) by adjusting the value of K. Akoury et al. (2019) moved a step further to propose a syntactically supervised Transformer (SynST), which first autoregressively predicts a chunked parse tree and then generates all words in one shot conditioned on the predicted parse.

A critical issue of NAT is that it copies the source words as the input of the decoder while ignoring the difference between the source and target semantics. To address this problem, Guo et al. (2019) proposed to use a phrase table to convert source words to target words. They adopt a maximum match algorithm to greedily segment the source sentence into several phrases and then map these source phrases into target phrases by retrieving a pre-defined phrase table. Thanks to the enhanced decoder input, translation quality is significantly improved.

Inspired by the mask-predict task proposed by Devlin et al. (2019), Ghazvininejad et al. (2019) introduced a conditional masked language model (CMLM) to generate translations by iterative refinement. CMLM trains the conditional language model in a mask-predict manner and produces target sentences by iterative decoding during inference. Specifically, in the training phase, CMLM first randomly masks the words in the target sentence and then predicts these masked words. In inference, CMLM generates the entire target sentence in a preset number of decoding iterations N. At iteration n ∈ [1, N], the decoder input is the entire target sentence with T·(N−n+1)/N words masked. The decoding process thus starts with a fully-masked target sentence, and the words with the lowest prediction probabilities will be masked. With a proper number of decoding iterations, CMLM can effectively close the gap with fully autoregressive models while maintaining decoding efficiency.
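The sketch below illustrates a mask-predict decoding loop of the kind described above, with a linearly decaying number of masks and an assumed predict_fn that fills all masked positions in parallel; it is a schematic illustration under these assumptions, not the reference CMLM implementation.

# Iterative mask-predict decoding: start fully masked, then repeatedly
# re-mask the lowest-confidence positions and re-predict them.
import numpy as np

def mask_predict(predict_fn, T, N, mask_id=0):
    tokens = np.full(T, mask_id)                    # iteration 1: all T words masked
    for n in range(1, N + 1):
        tokens, probs = predict_fn(tokens)          # fill every masked position in parallel
        if n == N:
            break
        num_mask = int(T * (N - n) / N)             # fewer masks as n grows
        worst = np.argsort(probs)[:num_mask]        # lowest-confidence positions
        tokens = tokens.copy()
        tokens[worst] = mask_id
    return tokens

rng = np.random.default_rng(0)
fake_predict = lambda toks: (rng.integers(3, 50, size=toks.shape), rng.random(toks.shape))
print(mask_predict(fake_predict, T=8, N=4))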
2.4. Alternative training objectives

NMT trained with maximum likelihood estimation (MLE) has achieved state-of-the-art results on various language pairs (Junczys-Dowmunt et al., 2016). Despite the remarkable success, Ranzato et al. (2015) indicate two drawbacks of MLE for NMT. First, NMT models are not exposed to their own errors during training, which is referred to as the exposure bias problem. Second, MLE is defined at the word level rather than the sentence level. Due to these limitations, researchers have investigated several alternative objectives.

Shen et al. (2016) proposed minimum risk training (MRT) to alleviate the problem. In MRT, the risk is defined as the expected loss with respect to the posterior distribution:

L(θ) = ∑_{s=1}^{S} E_{y|x^(s);θ} [Δ(y, y^(s))]  (23)
     = ∑_{s=1}^{S} ∑_{y ∈ Y(x^(s))} P(y|x^(s); θ) Δ(y, y^(s)),  (24)

where Y(x^(s)) is the set of all possible candidate translations for x^(s), and Δ(y, y^(s)) measures the difference between the model prediction and the gold standard. Shen et al. (2016) indicate three advantages of MRT over MLE. First, MRT directly optimizes NMT with respect to evaluation metrics. Second, MRT can incorporate arbitrary loss functions. Finally, MRT is transparent to architectures and can be applied to any end-to-end NMT system. MRT achieves significant performance improvements over MLE training for RNNsearch. Reinforcement learning adopts a similar approach to MRT, which is comprehensively studied in Ranzato et al. (2016) and Wu et al. (2018). However, recent literature (Choshen et al., 2020) has also pointed out the weaknesses of reinforcement learning for NMT, including discussion of its authentic optimization goals and difficulty in convergence.
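The following sketch illustrates the risk of Eqs. (23)-(24) computed over a sampled subset of candidates with renormalized probabilities, which is a common practical approximation; the loss values Δ are toy numbers, and the sketch is not the exact procedure of Shen et al. (2016).

# Expected loss over sampled candidate translations for one source sentence.
import torch

def mrt_risk(candidate_logps, deltas):
    # candidate_logps: log P(y | x; theta) for sampled candidates
    # deltas: Delta(y, y_ref) for the same candidates (e.g., 1 - sentence BLEU)
    q = torch.softmax(candidate_logps, dim=-1)     # renormalized posterior over samples
    return (q * deltas).sum()                      # expected loss under the model

logps = torch.tensor([-2.3, -1.2, -4.0], requires_grad=True)
deltas = torch.tensor([0.4, 0.1, 0.9])
risk = mrt_risk(logps, deltas)
risk.backward()                                    # gradients push mass toward low-risk candidates
print(float(risk), logps.grad)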
Efforts on improving training objectives reveal the art of translating motivation into functions and rewriting the conventional loss function with them or integrating them into it as regularizers. A collection of classical structured prediction losses are reviewed and compared in Edunov et al. (2018a), including MLE, sequence-level MLE, MRT, and max-margin learning. Yang et al. (2019a) leveraged the idea of max-margin learning to reduce word omission errors in NMT. They artificially constructed negative examples by omitting words in target reference sentences, forcing the NMT model to assign a higher probability to a ground-truth translation and a lower probability to an erroneous translation. Wieting et al. (2019) aimed at improving the semantic similarity between ground-truth references and translation outputs from NMT systems. They proposed to use a margin-based loss as an alternative reward function, encouraging NMT models to output semantically correct hypotheses even if they mismatch the reference in the lexicon. Chen et al. (2019) aimed at improving the model's capability of capturing long-range semantic structure. They proposed to explicitly model the source-target alignment with optimal transport (OT) and couple the OT loss with the MLE loss function. Kumar and Tsvetkov (2019) aimed at improving model efficiency and reducing the memory footprint of NMT models. Observing that the softmax layer usually takes considerable memory and the longest computation time, they proposed to replace the softmax layer with a continuous embedding layer, using the von Mises-Fisher distribution to implement soft ranking as the softmax layer does. As a result, the novel probabilistic loss enables NMT models to train much faster and handle very large vocabularies.

2.5. Using monolingual data and unsupervised NMT

The amount of parallel data significantly affects the training of parameters, as NMT is found to be data-hungry (Zoph et al., 2016). Unfortunately, large-scale parallel corpora are not available for the vast majority of language pairs. In contrast, monolingual corpora are abundant and much easier to obtain. As a result, it is important to augment the training set with monolingual data.

2.5.1. Using monolingual data
As NMT is trained in an end-to-end way, it is difficult to take advantage of monolingual data. In the past few years researchers have proposed various methods to make use of source- and target-side monolingual data in neural machine translation.

For target-side monolingual data, early attempts tried to incorporate a language model trained on large-scale monolingual data into NMT. Gulcehre et al. (2017) proposed two ways to integrate a language model. One way is called shallow fusion, which uses a language model during decoding to rescore the n-best list. Another way is called deep fusion, which combines the decoder and language model with a controller mechanism. However, the improvements of these approaches are limited.

Another way to use target-side monolingual data is called back-translation (BT) (Sennrich et al., 2016b). BT can make use of target-side monolingual data without changing the architecture of NMT. In Sennrich et al. (2016b), they first trained a target-to-source translation model using the parallel corpus. Then, the target-side monolingual data are used to build a synthetic parallel corpus, whose source sides are generated by the target-to-source translation model. Finally, the concatenation of the parallel corpus and the synthetic parallel corpus is used to learn a source-to-target translation model. Although the architecture and decoding algorithm are kept unchanged, the monolingual data is fully utilized to improve the translation quality. The authors attributed the effectiveness of using monolingual data to domain adaptation effects, reductions of overfitting, and improved fluency. BT has been shown to be the most simple and effective method to leverage target-side monolingual data (Sennrich et al., 2016b; Poncelas et al., 2018). It is especially useful when only a small amount of parallel data is available (Karakanta et al., 2018). Imamura et al. (2018) found that the diversity of source sentences affects the performance of BT. In the meantime, Edunov et al. (2018b) analyzed BT extensively and showed that noised-BT, which builds a synthetic corpus from sampled source sentences or noised outputs of beam search, leads to higher accuracy. Caswell et al. (2019) investigated the role of noise in noised-BT. They revealed that the noise works by enabling the model to distinguish synthetic data from genuine data, so that the model can take advantage of helpful signals and ignore harmful ones. As a result, they proposed a simple method called tagged-BT, which appends a preceding tag (e.g., <BT>) to every synthetic source sentence. Wang et al. (2019b) proposed to consider uncertainty-based confidence to help NMT models distinguish synthetic data from authentic data.
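The steps of the back-translation recipe can be summarized in the following sketch; the training and translation helpers are toy stand-ins for a real NMT toolkit, and the optional <BT> tag illustrates the tagged-BT variant.

# Back-translation: train target-to-source, synthesize sources for target-side
# monolingual data, then train source-to-target on authentic + synthetic pairs.
def back_translation(parallel, tgt_mono, train_fn, translate_fn, tagged=False):
    # 1. Train a target-to-source model on the authentic parallel data.
    tgt2src = train_fn([(y, x) for (x, y) in parallel])
    # 2. Translate target-side monolingual sentences into synthetic sources.
    synthetic = [(translate_fn(tgt2src, y), y) for y in tgt_mono]
    if tagged:
        synthetic = [("<BT> " + x, y) for (x, y) in synthetic]
    # 3. Train the source-to-target model on the concatenated corpus.
    return train_fn(parallel + synthetic)

# Toy stand-ins so the sketch runs end-to-end.
train_fn = lambda pairs: dict(pairs)                       # "model" = memorized pairs
translate_fn = lambda model, y: model.get(y, "<unk>")
parallel = [("ich bin müde", "i am tired")]
tgt_mono = ["i am tired", "music is great"]
print(back_translation(parallel, tgt_mono, train_fn, translate_fn, tagged=True))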
Besides target-side monolingual data, source-side monolingual data are also an important resource for improving the translation quality of semi-supervised neural machine translation. Zhang and Zong (2016) explored two ways to leverage source-side monolingual data. The former is knowledge distillation (also called self-training), which utilizes the source-to-target translation model to build a synthetic parallel corpus. The latter is multi-task learning that simultaneously learns translation and source sentence reordering tasks.

There are many works that make use of both source- and target-side monolingual data. Hoang et al. (2018) found that the translation quality of the target-to-source model in BT matters and then proposed iterative back-translation, making the source-to-target and target-to-source models enhance each other iteratively. Cheng et al. (2016) presented an approach to train a bidirectional neural machine translation model, which introduces autoencoders on the monolingual corpora with source-to-target and target-to-source translation models as encoders and decoders, by appending a reconstruction term to the training objective. He et al. (2016b) proposed a dual-learning mechanism, which utilizes reinforcement learning to make the source-to-target and target-to-source models teach each other with the help of source- and target-side language models. Zheng et al. (2020) proposed a mirror-generative NMT model to integrate source-to-target and target-to-source NMT models and both-side language models, which can learn from monolingual data naturally.

Pre-training is an alternative way to utilize monolingual data for NMT, which is shown to be beneficial when further combined with back-translation in the supervised and unsupervised NMT scenarios (Song et al., 2019; Liu et al., 2020). Recently, pre-training has attracted tremendous attention because of its effectiveness on low-resource language understanding and language generation tasks (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2019). Researchers found that models trained on large-scale monolingual data can learn linguistic knowledge (Clark et al., 2019). This knowledge can be transferred into downstream tasks by initializing the task-oriented models with the pre-trained weights. Language modeling is a commonly used pre-training method. The drawback of standard language modeling is that it is unidirectional, which may be sub-optimal as a pre-training technique. Devlin et al. (2019) proposed a masked language model (MLM) pre-training objective, which allows the model to make full use of context at the price of losing the ability to generate sequences. Combining language modeling and masking with sequence-to-sequence models, however, does not suffer from these limitations (Song et al., 2019; Lewis et al., 2019; Liu et al., 2020). Fig. 13 summarizes the three commonly used ways for pre-training.

Fig. 13. Three commonly used ways for pre-training.

Edunov et al. (2019) fed the output representations of ELMo (Peters et al., 2018) to the encoder of NMT. Zhu et al. (2020) proposed to fuse extracted representations into each layer of the encoder and decoder through the attention mechanism. Song et al. (2019) proposed to pre-train a sequence-to-sequence model first and then finetune the pre-trained model on the translation task directly. BART (Lewis et al., 2019) uses various noising methods to pre-train a denoising sequence-to-sequence model and then finetunes the model with an additional encoder that replaces the word embeddings of the pre-trained encoder. Liu et al. (2019a) proposed mBART, which is trained by applying BART to large-scale monolingual data across many languages.

2.5.2. Unsupervised NMT
Due to insufficient parallel corpora, it is not feasible to use supervised methods to train an NMT model on many language pairs. Unsupervised neural machine translation aims to obtain a translation model without using parallel data. Apparently, unsupervised machine translation is much more difficult than the supervised and semi-supervised settings.

Unsupervised neural machine translation is composed of three parts. First, by virtue of recent advances in unsupervised cross-lingual embeddings (Zhang et al., 2017a; Artetxe et al., 2017a) and word-by-word translation systems (Conneau et al., 2017), the unsupervised translation models can be initialized by weak translation models with fundamental cross-lingual information. Second, denoising autoencoders (Vincent et al., 2008) are used to embed the sentences into dense latent representations. The sentences of different languages are assumed to be embedded into the same latent space so that the latent representations of source sentences can be decoded into the target language. Third, iterative back-translation is used to strengthen the source-to-target and target-to-source translation models. Lample et al. (2017) and Artetxe et al. (2017b) first successfully built unsupervised NMT systems as described above. Specifically, Lample et al. (2017) utilizes a discriminator to force the encoder to embed sentences of each language into the same latent space.

While Lample et al. (2017) used a shared encoder and a shared decoder, Artetxe et al. (2017b) adopt a shared encoder but two separate decoders. Yang et al. (2018b) conjectured that sharing the encoder and decoder between two languages may lose their language characteristics. Therefore they proposed leveraging two separate encoders with some shared layers and using two different GANs to restrict the latent representations. Artetxe et al. (2018) and Lample et al. (2018) found that an unsupervised statistical machine translation system with iterative back-translation can easily outperform the unsupervised NMT counterpart. Lample et al. (2018) summarized that initialization, language modeling, and iterative back-translation are three principles of fully unsupervised MT, and they further found that combining unsupervised SMT and unsupervised NMT can reach better performance.

Ren et al. (2019) suggested that the noise and errors existing in pseudo-data can accumulate and hinder improvements during iterative back-translation. Therefore, they proposed to use SMT, which is less sensitive to noise, as posterior regularization for unsupervised NMT. As unsupervised NMT is usually initialized by unsupervised bilingual word embeddings (UBWE), Sun et al. (2019) proposed to utilize UBWE agreement to enhance unsupervised NMT. Wu et al. (2019b) considered that pseudo sentences predicted by weak unsupervised MT systems are usually of low quality. To alleviate this issue, they proposed an extract-edit approach, which is an alternative to back-translation. First, they extracted the most relevant target sentences from target monolingual data given the source sentence. Then, the extracted target sentences were edited to be aligned with the source sentences. This method makes it possible to use real sentence pairs to train the unsupervised NMT system. Ren et al. (2020) also proposed a similar retrieve-and-rewrite method to initialize an unsupervised SMT system. Artetxe et al. (2019) improved unsupervised SMT by exploiting subword information, developing a theoretically well-founded unsupervised tuning method, and incorporating a joint refinement procedure. Finally, they utilized the improved unsupervised SMT to initialize an NMT model and obtained state-of-the-art results. As a unique method of utilizing monolingual data, cross-lingual pre-trained models are used by Lample and Conneau (2019) to initialize unsupervised MT systems.

2.6. Open vocabulary

NMT typically operates with a fixed vocabulary. Due to practical reasons such as computational concerns and memory constraints, the vocabulary size of NMT models often ranges from 30k to 50k. For word-level NMT, the limited size of the vocabulary results in a large number of unknown words. Therefore, word-level NMT is unable to translate these words and performs poorly in open-vocabulary settings (Sutskever et al., 2014; Bahdanau et al., 2015).

Although word-level NMT is unable to translate out-of-vocabulary words, character-level NMT does not have this problem. By splitting words into characters, the vocabulary size is much smaller and every rare word can be represented. Chung et al. (2016) found that an NMT model with a subword-level encoder and a character-level decoder can also work well.

2.7. Prior knowledge integration

Prior knowledge such as parse trees has been shown to be effective in improving SMT (Liu et al., 2006). Several works find that integrating prior knowledge can also improve the translation performance of NMT models.

One line of studies focuses on inducing lexical knowledge into NMT models. Zhang et al. (2017b) proposed a general framework that can integrate prior knowledge into NMT models through posterior regularization and found that a bilingual dictionary is useful to improve NMT models. Morishita et al. (2018) found that feeding hierarchical subword units to different modules of NMT models can also improve the translation quality. Liu et al. (2019b) proposed a novel shared-private word embedding to capture the relationship of different words for NMT models. Chen et al. (2020) distinguished content words and functional words based on the term frequency-inverse document frequency (i.e., TF-IDF) and then added an additional encoder and an additional loss for content words. Weller-Di Marco and Fraser (2020) studied strategies to model word formation in NMT to explicitly model fusional morphology.

Modeling the source-side syntactic structure has also drawn a lot of attention. Eriguchi et al. (2016) extended NMT models to an end-to-end syntactic model, where the decoder is softly aligned with phrases at the source side when generating a target word. Sennrich and Haddow (2016) explored external linguistic information such as lemmas, morphological features, POS tags, and dependency labels to improve translation quality. Hao et al. (2019) presented a multi-granularity self-attention mechanism to model phrases that are extracted from syntactic trees. Bugliarello and Okazaki (2020) proposed Parent-Scaled Self-Attention to incorporate dependency trees and capture the syntactic knowledge of the source sentence. There are also some works that use multi-task training to learn source-side syntactic knowledge, in which the encoder of an NMT model is trained to perform POS tagging or syntactic parsing (Eriguchi et al., 2017; Baniata et al., 2018); a sketch of such a multi-task objective is given at the end of this subsection.

Another line of studies directly models the target-side syntactic structures (Gu et al., 2018; Wang et al., 2018b; Wu et al., 2017; Aharoni and Goldberg, 2017; Bastings et al., 2017; Li et al., 2018; Yang et al., 2019b, 2020). Aharoni and Goldberg (2017) trained an end-to-end model to directly translate source sentences into constituency trees. Similar approaches are proposed to use two neural models to generate the target sentence and its corresponding tree structure (Wang et al., 2018b; Wu et al., 2017). Gu et al. (2018) proposed to use a single model to perform translation and parsing at the same time. Yang et al. (2019b) introduced a latent variable model to capture the co-dependence between syntax and semantics. Yang et al. (2020) trained a neural model to predict the soft template of the target sentence conditioned only on the source sentence and then incorporated the predicted template into the NMT model via a separate template encoder.
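As an illustration of the multi-task training mentioned above, the following PyTorch-style sketch combines the translation loss with an auxiliary POS-tagging loss computed on the shared encoder states. The module names, batch fields, and weighting coefficient are hypothetical; this is a simplified sketch of the general idea, not the setup of any particular cited work.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(encoder, decoder, pos_classifier, batch, aux_weight=0.1):
    """Joint objective: translation loss plus an auxiliary POS-tagging loss on the encoder.

    `encoder`, `decoder`, and `pos_classifier` are hypothetical modules; `batch` is assumed
    to provide source/target token ids and source-side POS tags.
    """
    # Shared encoder states are used both for translation and for POS tagging.
    enc_states = encoder(batch['src_tokens'])                # (B, S, H)
    trans_logits = decoder(batch['tgt_tokens'], enc_states)  # (B, T, V)
    pos_logits = pos_classifier(enc_states)                  # (B, S, num_tags)

    trans_loss = F.cross_entropy(
        trans_logits.view(-1, trans_logits.size(-1)),
        batch['tgt_labels'].view(-1),
        ignore_index=0,  # padding index (assumed)
    )
    pos_loss = F.cross_entropy(
        pos_logits.view(-1, pos_logits.size(-1)),
        batch['src_pos_tags'].view(-1),
        ignore_index=0,
    )
    # The auxiliary term encourages the encoder to capture source-side syntax.
    return trans_loss + aux_weight * pos_loss
```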
2.8. Interpretability and robustness

Despite the remarkable progress, it is hard to interpret the internal workings of NMT models. All internal information in NMT is represented as high-dimensional real-valued vectors or matrices. Therefore, it is challenging to associate these hidden states with language structures. The lack of interpretability has made it very difficult for researchers to understand the translation process of NMT models.

In addition to interpretability, the lack of robustness is a severe challenge for NMT systems as well. With small perturbations in source inputs (also referred to as adversarial examples), the translations of NMT models may undergo significant erroneous changes (Belinkov and Bisk, 2018; Cheng et al., 2019). The lack of robustness limits the application of NMT to tasks that require robust performance on noisy inputs. Therefore, improving the robustness of NMT has gained increasing attention in the NMT community.

2.8.1. Interpretability
Efforts have been devoted to improving the interpretability of NMT systems in recent works. Ding et al. (2017) proposed to visualize the internal workings of the RNNSearch (Bahdanau et al., 2015) architecture. With layer-wise relevance propagation (Bach et al., 2015), they computed and visualized the contribution of each contextual word to arbitrary hidden states in RNNSearch. Bau et al. (2019) share similar motivations with Ding et al. (2017). Their basic assumption is that the same neuron in different NMT models captures similar syntactic and semantic information. They proposed to use several types of correlation coefficients to measure the importance of each neuron. As a result, by identifying important neurons and controlling their activation, the translation process of NMT systems can be controlled. Strobelt et al. (2019) also put effort into visualizing the working process of RNNSearch. The highlight of their work lies in the utilization of training data. When an NMT system decodes some words, their visualization system provides the most relevant training corpora by using nearest neighbor search. In case of translation errors, the system can locate the erroneous outputs directly in the training set and show their origin. As a result, this function provides better assistance and makes it easy for developers to adjust the model and the data.

With the tremendous success of the Transformer architecture (Vaswani et al., 2017), the NMT community has shown increasing interest in understanding and interpreting Transformer. He et al. (2019) generalized the idea of layer-wise relevance to word importance by attributing the NMT output to every input word through a gradient-based method. The calculated word importance illustrates the influence of each source word, which also serves as an indication of under-translation errors. Raganato and Tiedemann (2018) analyzed the internal representations of the Transformer encoder. Utilizing the attention weights in each layer, they extracted relations among the words in the source sentence. They designed four types of probing tasks to analyze the syntactic and semantic information encoded by each layer representation and to test their transferability. Voita et al. (2019) also proposed to analyze the bottom-up evolution of representations in Transformer with canonical correlation analysis (CCA). By estimating mutual information, they studied how information flows in Transformer. Stahlberg et al. (2018) proposed an operation sequence model to interpret NMT. Based on the translation output by the Transformer system, they proposed explicit modeling of the word reordering process and provided explicit word alignment between the reordered target-side sentence and the source sentence. As a result, one can track the reordering process of each word as it is explicitly aligned with the source side. Recent work (Yun et al., 2020) also provided a theoretical understanding of Transformer by proving that Transformer networks are universal approximators of sequence-to-sequence functions.
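The gradient-based word importance used by He et al. (2019) can be approximated with a few lines of PyTorch: take the gradient of the target log-likelihood with respect to the source embeddings and reduce it to one score per source position. The sketch below is our simplified illustration rather than the authors' implementation, and the model interface it assumes is hypothetical.

```python
import torch

def word_importance(model, src_embeds, tgt_tokens):
    """Attribute the target log-likelihood to each source position via input gradients.

    `model` is a hypothetical NMT model that maps source embeddings and target tokens
    to a scalar log-likelihood; `src_embeds` has shape (S, H).
    """
    src_embeds = src_embeds.clone().detach().requires_grad_(True)
    log_likelihood = model(src_embeds, tgt_tokens)  # scalar log P(y | x)
    log_likelihood.backward()
    # The norm of the gradient at each source position serves as its importance score.
    scores = src_embeds.grad.norm(dim=-1)           # (S,)
    return scores / scores.sum()                    # normalize for comparison across sentences
```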
2.8.2. Robustness
Belinkov and Bisk (2018) first investigated the robustness of NMT. They pointed out that both synthetic and natural noise can severely harm the performance of NMT models. They experimented with four types of synthetic noise and leveraged structure-invariant representations and adversarial training to improve the robustness of NMT. Similarly, Zhao et al. (2018) proposed to map the input sentence to a latent space with generative adversarial networks (GANs) and search for adversarial examples in that space. Their approach can produce semantically and syntactically coherent sentences that have negative impacts on the performance of NMT models.

Ribeiro et al. (2018) proposed semantic-preserving adversarial rules to explicitly induce adversarial examples. This approach provides a better guarantee that the adversarial examples satisfy the semantic equivalence property. Cheng et al. (2018) proposed two types of approaches to generating adversarial examples by perturbing the source sentence or the internal representation of the encoder. By integrating the effect of adversarial examples into the loss function, the robustness of neural machine translation is improved through adversarial training.

Ebrahimi et al. (2018) proposed a character-level white-box attack for character-level NMT. They proposed to model the operations of character insertion, deletion, and swapping with vector computations so that the generation of adversarial examples can be formulated with differentiable string-edit operations. Liu et al. (2019a) proposed to jointly utilize textual and phonetic embeddings in NMT to improve robustness. They found that to train a more robust model, more weight should be put on the phonetic rather than the textual information. Cheng et al. (2019) proposed doubly adversarial inputs to improve the robustness of NMT. Concretely, they proposed to both attack the translation model with adversarial source examples and defend the translation model with adversarial target inputs. Zou et al. (2020) utilized reinforcement learning to generate adversarial examples, producing stable attacks with semantic-preserving adversarial examples. Cheng et al. (2020) proposed a novel adversarial augmentation method that minimizes the vicinal risk over virtual sentences sampled from a smoothly interpolated embedding space around the observed training sentence pairs. This adversarial data augmentation method substantially outperforms other data augmentation methods and achieves significant improvements in translation quality and robustness. For better exploration of robust NMT, Michel and Neubig (2018) proposed the MTNT dataset, whose source sentences are collected from Reddit discussions and contain several types of noise. The target reference translations for each source sentence, in contrast, are free of noise. Experiments showed that current NMT models perform badly on the MTNT dataset. As a result, this dataset can serve as a testbed for NMT robustness analysis.
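Synthetic noise of the kind studied by Belinkov and Bisk (2018) is easy to simulate when probing a model's robustness. The following sketch applies simple character-level perturbations (swaps and deletions) to source sentences; the noise types and probabilities are illustrative choices of ours, not the exact settings used in the cited work.

```python
import random

def add_character_noise(sentence, swap_prob=0.1, drop_prob=0.05, seed=0):
    """Perturb a sentence with random character swaps and deletions.

    Two adjacent word-internal characters are swapped with probability `swap_prob`
    per word, and individual characters are dropped with probability `drop_prob`.
    """
    rng = random.Random(seed)
    noisy_words = []
    for word in sentence.split():
        chars = list(word)
        # Swap two adjacent word-internal characters, keeping first and last characters intact.
        if len(chars) > 3 and rng.random() < swap_prob:
            i = rng.randint(1, len(chars) - 3)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        # Randomly drop characters.
        chars = [c for c in chars if rng.random() >= drop_prob]
        noisy_words.append(''.join(chars))
    return ' '.join(noisy_words)

print(add_character_noise('neural machine translation is sensitive to noisy inputs'))
```

Evaluating translation quality on such perturbed inputs, or mixing them into the training data as in noise-based or adversarial training, gives a quick picture of how brittle a model is.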

3. Resources

3.1. Parallel data

Bilingual parallel corpora are the most important resources for NMT. There are several publicly available corpora, such as the datasets provided by WMT,1 IWSLT,2 and WAT.3 Table 3 lists the available domains and language pairs in these workshops.

Besides the aforementioned machine translation workshops, we also recommend OPUS4 for searching training resources for NMT models; it gathers parallel data for a large number of language pairs. We list the number of sentence pairs available between major languages and English in Table 4. OPUS also provides the OPUS-100 corpus for multilingual machine translation research (Zhang et al., 2020b), which is an English-centric multilingual corpus covering over 100 languages.

3.2. Monolingual data

Monolingual data are also valuable resources for NMT. The Common Crawl Foundation5 provides open access to high-quality crawled data for over 40 languages. The CCNET toolkit6 (Wenzek et al., 2020) can be used to download and clean Common Crawl texts. Wikipedia provides database dumps7 that can be used to extract monolingual data, which can be downloaded using WIKIEXTRACTOR.8 WMT 2020 also provides several monolingual training datasets, which consist of data collected from NewsCrawl, NewsDiscussions, Europarl, NewsCommentary, CommonCrawl, and WikiDumps.

1 http://www.statmt.org/wmt20/index.html.
2 http://iwslt.org/doku.php.
3 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2020/index.html.
4 http://opus.nlpl.eu.
5 https://commoncrawl.org/.
6 https://github.com/facebookresearch/cc_net.
7 https://dumps.wikimedia.org.
8 https://github.com/attardi/wikiextractor.
4. Tools

With the rapid advances of deep learning, many open-source deep learning frameworks have emerged, with TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019) as representative examples. At the same time, we have also witnessed the rapid development of open-source NMT toolkits, which has significantly boosted the research progress of NMT. In this section, we give a summary of popular open-source NMT toolkits. Besides, we also introduce tools that are useful for evaluation, analysis, and data pre-processing.

4.1. Open-source NMT toolkits

We summarize some popular open-source NMT toolkits on GitHub in Table 5. Users can get the source code of these toolkits directly from GitHub. We give a brief description of these projects below.

Tensor2Tensor. TENSOR2TENSOR (Vaswani et al., 2018) is a library of deep learning models and datasets based on TensorFlow (Abadi et al., 2016). The library was mainly developed by the Google Brain team. TENSOR2TENSOR provides implementations of several NMT architectures (e.g., Transformer) for the translation task. Users can run TENSOR2TENSOR easily on CPU, GPU, and TPU, either locally or on Cloud.

FairSeq. FAIRSEQ (Ott et al., 2019) is a sequence modeling toolkit developed by Facebook AI Research. The toolkit is based on PyTorch (Paszke et al., 2019) and allows users to train custom models for the translation task. FAIRSEQ implements traditional RNN-based models and Transformer models. Besides, it also includes CNN-based translation models (e.g., LightConv and DynamicConv).

Nmt. NMT (Luong et al., 2017) is a toolkit developed by Google Research. The toolkit implements the GNMT architecture (Wu et al., 2016). Besides, the NMT project also provides a nice tutorial for building a competitive NMT model from scratch. The codebase of NMT is high-quality and lightweight, which makes it easy for users to add customized models.
OpenNMT. OPENNMT is an open-source NMT toolkit developed by the collaboration of Harvard University and SYSTRAN. The toolkit currently maintains two implementations: OPENNMT-PY and OPENNMT-TF. OPENNMT is proven to be research-friendly and production-ready. The OpenNMT project also provides CTRANSLATE2 as a fast inference engine that supports both CPU and GPU.

Sockeye. SOCKEYE (Hieber et al., 2017) is a versatile sequence-to-sequence toolkit that is based on MXNet (Chen et al., 2016). SOCKEYE is maintained by Amazon and powers machine translation services such as Amazon Translate. The toolkit features state-of-the-art machine translation models and fast CPU inference, which is useful for both research and production.

Nematus. NEMATUS is an NMT toolkit developed by the NLP Group at the University of Edinburgh. The toolkit is based on TensorFlow and supports RNN-based NMT architectures as well as the TRANSFORMER architecture. In addition to the toolkit, NEMATUS has also released high-performing NMT models covering 13 translation directions.

Marian. MARIAN (Junczys-Dowmunt et al., 2018) is an efficient and self-contained NMT framework currently being developed by the Microsoft Translator team. The framework is written entirely in C++ with minimal dependencies. Marian is widely deployed by many companies and organizations. For example, Microsoft Translator currently adopts Marian as its neural machine translation engine.

THUMT. THUMT (Zhang et al., 2017c) is an open-source toolkit for neural machine translation developed by the NLP Group at Tsinghua University. The toolkit includes Theano (Team et al., 2016), TensorFlow, and PyTorch implementations. It supports vanilla RNN-based and Transformer models and makes it easy for users to build new models. Furthermore, THUMT provides visualization analysis using layer-wise relevance propagation (Ding et al., 2017).

NMT-Keras. NMT-KERAS (Peris and Casacuberta, 2018) is a flexible toolkit for neural machine translation developed by the Pattern Recognition and Human Language Technology Research Center at the Polytechnic University of Valencia. The toolkit is based on Keras, which uses Theano or TensorFlow as the backend. NMT-KERAS emphasizes the development of advanced applications for NMT systems, such as interactive NMT and online learning. It has also been extended to other tasks including image and video captioning, sentence classification, and visual question answering.

Neural Monkey. NEURAL MONKEY is an open-source neural machine translation and general sequence-to-sequence learning system. The toolkit is built on the TensorFlow library and provides a high-level API tailored for fast prototyping of complex architectures.

4.2. Tools for evaluation and analysis

Manual evaluation of MT outputs is not only expensive but also impractical to scale to many language pairs. On the contrary, automatic MT evaluation is inexpensive and language-independent, with BLEU (Papineni et al., 2002) as the representative automatic evaluation metric. Besides evaluation, there is also a need for analyzing MT outputs. We recommend the following tools for evaluating and analyzing MT output.
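For reference, the BLEU metric mentioned above combines modified n-gram precisions $p_n$ (usually up to $N = 4$, with uniform weights $w_n = 1/N$) with a brevity penalty (Papineni et al., 2002):

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r, \\ e^{1 - r/c} & \text{if } c \leq r, \end{cases}$$

where $c$ is the length of the candidate translation and $r$ is the effective reference length.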
SACREBLEU. SACREBLEU9 (Post, 2018) is a toolkit to compute shareable, comparable, and reproducible BLEU scores. SACREBLEU computes BLEU scores on detokenized outputs, using the WMT standard tokenization. As a result, the scores are not affected by different pre-processing tools. Besides, it can produce a short version string that facilitates cross-paper comparisons.
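As a minimal usage sketch, corpus-level BLEU can be computed with the SACREBLEU Python API roughly as follows; the hypothesis and reference sentences are placeholders.

```python
import sacrebleu

# Hypotheses are detokenized system outputs; references are given as one or more
# parallel reference streams (here a single stream with one reference per sentence).
hypotheses = ['The cat sat on the mat.', 'There is a book on the table.']
references = [['The cat is sitting on the mat.', 'A book lies on the table.']]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU score
```

The command-line interface produces the same scores together with the short signature string mentioned above for reporting.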
COMPARE-MT. COMPARE-MT10 (Neubig et al., 2019) is a program to compare the outputs of multiple systems for language generation. In order to provide a high-level analysis of outputs, it enables analysis of the accuracy of generation of particular types of words, bucketed histograms of sentence accuracies or counts based on salient characteristics, and so on.

MT-COMPAREVAL. MT-COMPAREVAL11 is also a tool for the comparison and evaluation of machine translations. It allows users to compare translations according to automatic metrics or to compare quality from the perspective of n-grams.

4.3. Other tools

Aside from the above-mentioned tools, we find the following toolkits very useful for NMT research and deployment.

MOSES. MOSES12 (Koehn et al., 2007) is a self-contained statistical machine translation toolkit. Besides SMT-related components, MOSES provides a large number of tools to clean and pre-process texts, which are also useful for training NMT models. MOSES also contains several easy-to-use scripts to analyze and evaluate MT outputs.

SUBWORD-NMT. SUBWORD-NMT13 is an open-source toolkit for unsupervised word segmentation for neural machine translation and text generation. It adopts the Byte-Pair Encoding (BPE) algorithm proposed by Sennrich et al. (2016c) and the BPE-dropout method proposed by Provilkov et al. (2019). It is the most commonly used toolkit to alleviate the out-of-vocabulary problem in NMT.

SENTENCEPIECE. SENTENCEPIECE14 is a powerful unsupervised text segmentation toolkit. SENTENCEPIECE is written in C++ and provides APIs for other languages such as Python. SENTENCEPIECE implements the BPE algorithm (Sennrich et al., 2016c) and the unigram language model (Kudo, 2018). Unlike SUBWORD-NMT, SENTENCEPIECE can learn to segment raw texts without additional pre-processing. As a result, SENTENCEPIECE is a suitable choice for segmenting multilingual texts.
https://github.com/mjpost/sacrebleu. https://github.com/rsennrich/subword-nmt.
10 14
https://github.com/neulab/compare-mt. https://github.com/google/sentencepiece.

18
Z. Tan et al. AI Open 1 (2020) 5–21

Declaration of competing interest Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
Bengio, Y., 2014a. Learning phrase representations using rnn encoder–decoder for
statistical machine translation. In: Proceedings of EMNLP, pp. 1724–1734.
The authors declare that they have no known competing financial Cho, K., Van Merri€enboer, B., Bahdanau, D., Bengio, Y., 2014b. On the Properties of
interests or personal relationships that could have appeared to influence Neural Machine Translation: Encoder-Decoder Approaches arXiv preprint arXiv:
the work reported in this paper. 1409.1259.
Choshen, L., Fox, L., Aizenbud, Z., Abend, O., 2020. On the weaknesses of reinforcement
learning for neural machine translation. In: Proceedings of ICLR.
Acknowledgements Chung, J., Cho, K., Bengio, Y., 2016. A Character-Level Decoder without Explicit
Segmentation for Neural Machine Translation arXiv preprint arXiv:1603.06147.
Clark, K., Khandelwal, U., Levy, O., Manning, C.D., 2019. What does bert look at? an
This work was supported by the National Key R&D Program of China analysis of bert's attention. In: Proceedings of the 2019 ACL Workshop BlackboxNLP:
(No. 2017YFB0 202204), National Natural Science Foundation of China Analyzing and Interpreting Neural Networks for NLP, pp. 276–286.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jegou, H., 2017. Word Translation
(No. 61925601, No. 61761166 008, No. 61772302), Beijing Academy of
without Parallel Data arXiv preprint arXiv:1710.04087.
Artificial Intelligence, Huawei Noah's Ark Lab, and the NExTþþ project Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. Bert: pre-training of deep
supported by the National Research Foundation, Prime Ministers Office, bidirectional transformers for language understanding. In: Proceedings of NAACL-
Singapore under its IRC@Singapore Funding Initiative. HLT, pp. 4171–4186.
Ding, Y., Liu, Y., Luan, H., Sun, M., 2017. Visualizing and understanding neural machine
translation. In: Proceedings of ACL, pp. 1150–1159.
References Ebrahimi, J., Lowd, D., Dou, D., 2018. On adversarial examples for character-level neural
machine translation. In: Proceedings of COLING, pp. 653–663.
Edunov, S., Ott, M., Auli, M., Grangier, D., Ranzato, M., 2018. Classical structured
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S.,
prediction losses for sequence to sequence learning. In: Proceedings of NAACL-HLT,
Irving, G., Isard, M., et al., 2016. Tensorflow: a system for large-scale machine
pp. 355–364.
learning. In: 12th {USENIX} Symposium on Operating Systems Design and
Edunov, S., Ott, M., Auli, M., Grangier, D., 2018. Understanding Back-Translation at Scale
Implementation ({OSDI} 16), pp. 265–283.
arXiv preprint arXiv:1808.09381.
Aharoni, R., Goldberg, Y., 2017. Towards string-to-tree neural machine translation. In:
Edunov, S., Baevski, A., Auli, M., 2019. Pre-trained Language Model Representations for
Proceedings of ACL, pp. 132–140.
Language Generation arXiv preprint arXiv:1903.09722.
Akoury, N., Krishna, K., Iyyer, M., 2019. Syntactically supervised transformers for faster
Eriguchi, A., Hashimoto, K., Tsuruoka, Y., 2016. Tree-to-sequence attentional neural
neural machine translation. In: Proceedings of ACL, pp. 1269–1281.
machine translation. In: Proceedings of ACL, pp. 823–833.
Artetxe, M., Labaka, G., Agirre, E., 2017a. Learning bilingual word embeddings with
Eriguchi, A., Tsuruoka, Y., Cho, K., 2017. Learning to parse and translate improves neural
(almost) no bilingual data. In: Proceedings of ACL, pp. 451–462.
machine translation. In: Proceedings of ACL, pp. 72–78.
Artetxe, M., Labaka, G., Agirre, E., Cho, K., 2017b. Unsupervised Neural Machine
Gao, Y., Nikolov, N.I., Hu, Y., Hahnloser, R.H., 2020. Character-level Translation with
Translation arXiv preprint arXiv:1710.11041.
Self-Attention arXiv preprint arXiv:2004.14788.
Artetxe, M., Labaka, G., Agirre, E., 2018. Unsupervised Statistical Machine Translation
Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N., 2017. Convolutional
arXiv preprint arXiv:1809.01272.
Sequence to Sequence Learning arXiv preprint arXiv:1705.03122.
Artetxe, M., Labaka, G., Agirre, E., 2019. An Effective Approach to Unsupervised Machine
Ghazvininejad, M., Levy, O., Liu, Y., Zettlemoyer, L., 2019. Mask-predict: parallel
Translation arXiv preprint arXiv:1902.01313.
decoding of conditional masked language models. In: Proceedings of EMNLP-IJCNLP,
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer Normalization arXiv preprint arXiv:
pp. 6114–6123.
1607.06450.
Graves, A., Wayne, G., Danihelka, I., 2014. Neural Turing Machines arXiv preprint arXiv:
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., 2015. On
1410.5401.
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
Gu, J., Bradbury, J., Xiong, C., Li, V.O., Socher, R., 2018. Non-autoregressive neural
propagation. PloS One 10, e0130140.
machine translation. In: Proceedings of ICLR.
Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to
Gu, J., Liu, Q., Cho, K., 2019. Insertion-based decoding with automatically inferred
align and translate. In: Proceedings of ICLR.
generation order. TACL 7, 661–676.
Baniata, L.H., Park, S., Park, S.-B., 2018. A multitask-based neural machine translation
Gulcehre, C., Firat, O., Xu, K., Cho, K., Bengio, Y., 2017. On integrating a language model
model with part-of-speech tags integration for Arabic dialects. Appl. Sci. 8, 2502.
into neural machine translation. Comput. Speech Lang 45, 137–148.
Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Sima’an, K., 2017. Graph convolutional
Guo, J., Tan, X., He, D., Qin, T., Xu, L., Liu, T.-Y., 2019. Non-autoregressive neural
encoders for syntax-aware neural machine translation. In: Proceedings of EMNLP,
machine translation with enhanced decoder input. In: Proceedings of AAAI, vol. 33,
pp. 1957–1967.
pp. 3723–3730.
Bau, A., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F., Glass, J., 2019. Identifying and
G
u, J., Shavarani, H.S., Sarkar, A., 2018. Top-down tree structured decoding with
controlling important neurons in neural machine translation. Proc. ICLR.
syntactic connections for neural machine translation and parsing. In: Proceedings of
Belinkov, Y., Bisk, Y., 2018. Synthetic and natural noise both break neural machine
EMNLP, pp. 401–413.
translation. In: Proceedings of ICLR.
Hao, J., Wang, X., Shi, S., Zhang, J., Tu, Z., 2019. Multi-granularity self-attention for
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient
neural machine translation. In: Proceedings of EMNLP-IJCNLP, pp. 886–896.
descent is difficult. IEEE Trans. Neural Network. 5, 157–166.
Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X.,
Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Lafferty, J.,
Junczys-Dowmunt, M., Lewis, W., Li, M., et al., 2018. Achieving Human Parity on
Mercer, R.L., Roossin, P.S., 1990. A statistical approach to machine translation.
Automatic Chinese to English News Translation arXiv preprint arXiv:1803.05567.
Comput. Ling. 16, 79–85.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition.
Bugliarello, E., Okazaki, N., 2020. Enhancing machine translation with dependency-
Proc. CVPR 770–778.
aware self-attention. In: Proceedings of ACL, pp. 1618–1627.
He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T.-Y., Ma, W.-Y., 2016b. Dual learning for
Caswell, I., Chelba, C., Grangier, D., 2019. Tagged back-translation. WMT 2019, 53.
machine translation. In: Advances in NeurIPS, pp. 820–828.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.,
He, S., Tu, Z., Wang, X., Wang, L., Lyu, M., Shi, S., 2019. Towards understanding neural
2016. Mxnet: a flexible and efficient machine learning library for heterogeneous
machine translation with word importance. In: Proceedings of EMNLP-IJCNLP,
distributed systems. In: Proceedings of NeurIPS, Workshop.
pp. 953–962.
Chen, H., Huang, S., Chiang, D., Dai, X., Chen, J., 2018. Combining character and word
Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A., Post, M., 2017.
information in neural machine translation using a multi-level attention. In:
Sockeye: A Toolkit for Neural Machine Translation arXiv preprint arXiv:1712.05690.
Proceedings of the 2018 Conference of the North American Chapter of the
Hoang, C.D.V., Haffari, G., Cohn, T., 2017. Towards decoding as continuous optimisation
Association for Computational Linguistics: Human Language Technologies, vol. 1,
in neural machine translation. In: Proceedings of EMNLP, pp. 146–156.
pp. 1284–1293 (Long Papers).
Hoang, V.C.D., Koehn, P., Haffari, G., Cohn, T., 2018. Iterative back-translation for neural
Chen, L., Zhang, Y., Zhang, R., Tao, C., Gan, Z., Zhang, H., Li, B., Shen, D., Chen, C.,
machine translation. In: Proceedings of the 2nd Workshop on Neural Machine
Carin, L., 2019. Improving sequence-to-sequence learning via optimal transport. In:
Translation and Generation, pp. 18–24.
Proceedings of ICLR.
Hochreiter, S., Schmidhuber, J., 1997. Long Short-Term Memory, Neural Computation.
Chen, K., Wang, R., Utiyama, M., Sumita, E., 2020. Content word aware neural machine
Imamura, K., Fujita, A., Sumita, E., 2018. Enhancement of encoder and attention using
translation. In: Proceedings of ACL, pp. 358–364.
target monolingual corpora in neural machine translation. In: Proceedings of the 2nd
Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., Liu, Y., 2016. Semi-supervised learning
Workshop on Neural Machine Translation and Generation, pp. 55–63.
for neural machine translation. In: Proceedings of ACL, pp. 1965–1974.
Junczys-Dowmunt, M., Dwojak, T., Hoang, H., 2016. Is Neural Machine Translation
Cheng, Y., Tu, Z., Meng, F., Zhai, J., Liu, Y., 2018. Towards robust neural machine
Ready for Deployment? a Case Study on 30 Translation Directions arXiv preprint
translation. In: Proceedings of ACL, pp. 1756–1766.
arXiv:1610.01108.
Cheng, Y., Jiang, L., Macherey, W., 2019. Robust neural machine translation with doubly
Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K.,
adversarial inputs. In: Proceedings of ACL, pp. 4324–4333.
Neckermann, T., Seide, F., Germann, U., Aji, A.F., Bogoychev, N., Martins, A.F.T.,
Cheng, Y., Jiang, L., Macherey, W., Eisenstein, J., AdvAug, 2020. Robust adversarial
Birch, A., 2018. Marian: fast neural machine translation in Cþþ. In: Proceedings of
augmentation for neural machine translation. In: Proceedings of ACL, pp. 5961–5970.
ACL. System Demonstrations, pp. 116–121.
Cherry, C., Foster, G., Bapna, A., Firat, O., Macherey, W., 2018. Revisiting Character-
Kaiser, L., Gomez, A.N., Chollet, F., 2017. Depthwise Separable Convolutions for Neural
Based Neural Machine Translation with Capacity and Compression arXiv preprint
Machine Translation arXiv preprint arXiv:1706.03059.
arXiv:1808.09943.


Kalchbrenner, N., Blunsom, P., 2013. Recurrent continuous translation models. In: Poncelas, A., Shterionov, D., Way, A., de Buy Wenniger, G., Passban, P., 2018.
Proceedings of EMNLP, pp. 1700–1709. Investigating Backtranslation in Neural Machine Translation arXiv preprint arXiv:
Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v. d., Graves, A., Kavukcuoglu, K., 1804.06189.
2016. Neural Machine Translation in Linear Time arXiv preprint arXiv:1610.10099. Post, M., 2018. A Call for Clarity in Reporting Bleu Scores arXiv preprint arXiv:
Karakanta, A., Dehdari, J., van Genabith, J., 2018. Neural machine translation for low- 1804.08771.
resource languages without parallel corpora. Mach. Translat. 32, 167–189. Provilkov, I., Emelianenko, D., Voita, E., 2019. Bpe-dropout: Simple and Effective
Kim, Y., Rush, A.M., 2016. Sequence-level knowledge distillation. In: Proceedings of Subword Regularization arXiv preprint arXiv:1910.13267.
EMNLP, pp. 1317–1327. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models
Kingma, D., Ba, J., Adam, 2014. A Method for Stochastic Optimization arXiv preprint are unsupervised multitask learners. OpenAI Blog 1, 9.
arXiv:1412.6980. Raganato, A., Tiedemann, J., 2018. An analysis of encoder representations in transformer-
Koehn, P., Och, F.J., Marcu, D., 2003. Statistical Phrase-Based Translation, Technical based machine translation. In: Proceedings of EMNLP Workshop, pp. 287–297.
Report. UNIVERSITY OF SOUTHERN CALIFORNIA MARINA DEL REY Ranzato, M., Chopra, S., Auli, M., Zaremba, W., 2015. Sequence Level Training with
INFORMATION SCIENCES INST. Recurrent Neural Networks arXiv preprint arXiv:1511.06732.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Ranzato, M., Chopra, S., Auli, M., Zaremba, W., 2016. Sequence level training with
Shen, W., Moran, C., Zens, R., et al., 2007. Moses: open source toolkit for statistical recurrent neural networks. In: Proceedings of ICLR.
machine translation. In: Proceedings of ACL on Interactive Poster and Demonstration Ren, S., Zhang, Z., Liu, S., Zhou, M., Ma, S., 2019. Unsupervised neural machine
Sessions, pp. 177–180. translation with smt as posterior regularization. In: Proceedings of the AAAI, vol. 33,
Kudo, T., 2018. Subword Regularization: Improving Neural Network Translation Models pp. 241–248.
with Multiple Subword Candidates arXiv preprint arXiv:1804.10959. Ren, S., Wu, Y., Liu, S., Zhou, M., Ma, S., 2020. A retrieve-and-rewrite initialization
Kumar, S., Tsvetkov, Y., 2019. Von mises-Fisher loss for training sequence to sequence method for unsupervised machine translation. In: Proceedings of ACL,
models with continuous outputs. In: Proceedings of ICLR. pp. 3498–3504.
Lample, G., Conneau, A., 2019. Cross-lingual Language Model Pretraining arXiv preprint Ribeiro, M.T., Singh, S., Guestrin, C., 2018. Semantically equivalent adversarial rules for
arXiv:1901.07291. debugging nlp models. In: Proceedings of ACL, pp. 856–865.
Lample, G., Conneau, A., Denoyer, L., Ranzato, M., 2017. Unsupervised Machine Sennrich, R., Haddow, B., 2016. Linguistic input features improve neural machine
Translation Using Monolingual Corpora Only arXiv preprint arXiv:1711.00043. translation. In: Proceedings of WMT, pp. 83–91.
Lample, G., Ott, M., Conneau, A., Denoyer, L., Ranzato, M., 2018. Phrase-based & Neural Sennrich, R., Haddow, B., Birch, A., 2016a. Edinburgh neural machine translation systems
Unsupervised Machine Translation arXiv preprint arXiv:1804.07755. for wmt 16. In: Proceedings of WMT, pp. 371–376.
Lee, J., Cho, K., Hofmann, T., 2017. Fully character-level neural machine translation Sennrich, R., Haddow, B., Birch, A., 2016b. Improving neural machine translation models
without explicit segmentation. Trans. Assoc. Comput. Ling. 5, 365–378. with monolingual data. In: Proceedings of ACL, pp. 86–96.
Lee, J., Mansimov, E., Cho, K., 2018. Deterministic non-autoregressive neural sequence Sennrich, R., Haddow, B., Birch, A., 2016c. Neural machine translation of rare words with
modeling by iterative refinement. In: Proceedings of EMNLP, pp. 1173–1182. subword units. In: Proceedings of ACL.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Sennrich, R., Birch, A., Currey, A., Germann, U., Haddow, B., Heafield, K., Barone, A.V.M.,
Zettlemoyer, L., 2019. Bart: Denoising Sequence-To-Sequence Pre-training for Williams, P., 2017. The university of edinburgh's neural mt systems for wmt17. In:
Natural Language Generation, Translation, and Comprehension arXiv preprint arXiv: Proceedings of WMT, pp. 389–399.
1910.13461. Shao, C., Feng, Y., Zhang, J., Meng, F., Chen, X., Zhou, J., 2019. Retrieving sequential
Li, X., Liu, L., Tu, Z., Shi, S., Meng, M., 2018. Target foresight based attention for neural information for non-autoregressive neural machine translation. In: Proceedings of
machine translation. In: Proceedings of NAACL-HLT, pp. 1380–1390. ACL, pp. 3013–3024.
Libovickỳ, J., Helcl, J., 2018. End-to-end non-autoregressive neural machine translation Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., Liu, Y., 2016. Minimum risk training
with connectionist temporal classification. In: Proceedings of EMNLP, for neural machine translation. In: Proceedings of ACL, pp. 1683–1692.
pp. 3016–3021. Siegelmann, H.T., Sontag, E.D., 1995. On the computational power of neural nets.
Liu, Y., Liu, Q., Lin, S., 2006. Tree-to-string alignment template for statistical machine J. Comput. Syst. Sci. 50, 132–150.
translation. In: Proceedings of the 21st International Conference on Computational Song, K., Tan, X., Qin, T., Lu, J., Liu, T.-Y., 2019. Mass: Masked Sequence to Sequence Pre-
Linguistics and 44th Annual Meeting of the Association for Computational training for Language Generation arXiv preprint arXiv:1905.02450.
Linguistics. Association for Computational Linguistics, Sydney, Australia, Stahlberg, F., Saunders, D., Byrne, B., 2018. An operation sequence model for explainable
pp. 609–616. https://doi.org/10.3115/1220175.1220252. URL: https://www. neural machine translation. In: Proceedings of EMNLP Workshop, pp. 175–186.
aclweb.org/anthology/P06-1077. Stern, M., Chan, W., Kiros, J., Uszkoreit, J., 2019. Insertion transformer: flexible sequence
Liu, L., Utiyama, M., Finch, A., Sumita, E., 2016. Agreement on target-bidirectional neural generation via insertion operations. In: Proceedings of ICML, pp. 5976–5985.
machine translation. In: Proceedings of NAACL-HLT, pp. 411–416. Strobelt, H., Gehrmann, S., Behrisch, M., Perer, A., Pfister, H., Rush, A.M., 2019. Seq2seq-
Liu, H., Ma, M., Huang, L., Xiong, H., He, Z., 2019a. Robust neural machine translation vis: a visual debugging tool for sequence-to-sequence models. IEEE Trans. Visual.
with joint textual and phonetic embedding. In: Proceedings of ACL, pp. 3044–3049. Comput. Graph. 25, 353–363.
Liu, X., Wong, D.F., Liu, Y., Chao, L.S., Xiao, T., Zhu, J., 2019b. Shared-private bilingual Su, J., Zhang, X., Lin, Q., Qin, Y., Yao, J., Liu, Y., 2019. Exploiting reverse target-side
word embeddings for neural machine translation. In: Proceedings of ACL, contexts for neural machine translation via asynchronous bidirectional decoding.
pp. 3613–3622. Artif. Intell. 277, 103168.
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L., Sun, H., Wang, R., Chen, K., Utiyama, M., Sumita, E., Zhao, T., 2019. Unsupervised
2020. Multilingual Denoising Pre-training for Neural Machine Translation arXiv bilingual word embedding agreement for unsupervised neural machine translation.
preprint arXiv:2001.08210. In: Proceedings of ACL, pp. 1235–1245.
Luong, M.-T., Manning, C.D., 2016. Achieving Open Vocabulary Neural Machine Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural
Translation with Hybrid Word-Character Models arXiv preprint arXiv:1604.00788. networks. In: Proceedings of NeurIPS, pp. 3104–3112.
Luong, M.-T., Pham, H., Manning, C.D., 2015. Effective Approaches to Attention-Based Team, T.T.D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D.,
Neural Machine Translation arXiv preprint arXiv:1508.04025. Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al., 2016. Theano: A python
Luong, M., Brevdo, E., Zhao, R., 2017. Neural machine translation (seq2seq) tutorial. http Framework for Fast Computation of Mathematical Expressions arXiv preprint arXiv:
s://github.com/tensorflow/nmt. 1605.02688.
Mehri, S., Sigal, L., 2018. Middle-out decoding. In: Advances in NeurIPS, pp. 5518–5529. Tiedemann, J., 2016. Opus–parallel corpora for everyone. Baltic J. Mod. Comput. 384.
Michel, P., Neubig, G., 2018. Mtnt: a testbed for machine translation of noisy text. In: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Proceedings of EMNLP, pp. 543–553. Polosukhin, I., 2017. Attention is all you need. In: Proceedings of NeurIPS,
Morishita, M., Suzuki, J., Nagata, M., 2018. Improving neural machine translation by pp. 5998–6008.
incorporating hierarchical subword features. In: Proceedings of COLING, Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A., Gouws, S., Jones, L., Kaiser, Ł.,
pp. 618–629. Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., Uszkoreit, J., 2018.
Neubig, G., Dou, Z.-Y., Hu, J., Michel, P., Pruthi, D., Wang, X., 2019. compare-mt: a tool Tensor2Tensor for neural machine translation. In: Proceedings of AMTA,
for holistic comparison of language generation systems. In: Proceedings of NAACL- pp. 193–199.
HLT (Demonstrations), pp. 35–41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A., 2008. Extracting and composing
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., Auli, M., 2019. robust features with denoising autoencoders. In: Proceedings of ICML,
fairseq: a fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL- pp. 1096–1103.
HLT (Demonstrations), pp. 48–53. Voita, E., Sennrich, R., Titov, I., 2019. The bottom-up evolution of representations in the
Papineni, K., Roukos, S., Ward, T., Zhu, W., 2002. Bleu: a method for automatic transformer: a study with machine translation and language modeling objectives. In:
evaluation of machine translation. In: Proceedings of ACL. Proceedings of EMNLP-IJCNLP, pp. 4396–4406.
Passban, P., Liu, Q., Way, A., 2018. Improving Character-Based Decoding Using Target- Wang, C., Zhang, J., Chen, H., 2018a. Semi-autoregressive neural machine translation. In:
Side Morphological Information for Neural Machine Translation arXiv preprint arXiv: Proceedings of EMNLP, pp. 479–488.
1804.06506. Wang, X., Pham, H., Yin, P., Neubig, G., 2018b. A tree-based decoder for neural machine
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., translation. In: Proceedings of EMNLP, pp. 4772–4777.
Gimelshein, N., Antiga, L., et al., 2019. Pytorch: an imperative style, high- Wang, Y., Tian, F., He, D., Qin, T., Zhai, C., Liu, T.-Y., 2019a. Non-autoregressive machine
performance deep learning library. In: Advances in NeurIPS, pp. 8026–8037. translation with auxiliary regularization. In: Proceedings of AAAI, vol. 33,
Peris, A., Casacuberta, F., 2018. Nmt-keras: a very flexible toolkit with a focus on pp. 5377–5384.
interactive nmt and online learning. Prague Bull. Math. Linguist. 111, 113–124. Wang, S., Liu, Y., Wang, C., Luan, H., Sun, M., 2019b. Improving Back-Translation with
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., Uncertainty-Based Confidence Estimation arXiv preprint arXiv:1909.00157.
2018. Deep Contextualized Word Representations arXiv preprint arXiv:1802.05365.


Wang, C., Cho, K., Gu, J., 2020. Neural machine translation with byte-level subwords. In: Zhang, J., Zong, C., 2020. Neural Machine Translation: Challenges, Progress and Future
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, arXiv preprint arXiv:2004.05809.
pp. 9154–9160. Zhang, M., Liu, Y., Luan, H., Sun, M., 2017a. Adversarial training for unsupervised
Wei, B., Wang, M., Zhou, H., Lin, J., Sun, X., 2019. Imitation learning for non- bilingual lexicon induction. In: Proceedings of ACL, pp. 1959–1970.
autoregressive neural machine translation. In: Proceedings of ACL, pp. 1304–1312. Zhang, J., Liu, Y., Luan, H., Xu, J., Sun, M., 2017b. Prior knowledge integration for neural
Weller-Di Marco, M., Fraser, A., 2020. Modeling word formation in English–German machine translation using posterior regularization. In: Proceedings of ACL,
neural machine translation. In: Proceedings of ACL, pp. 4227–4232. pp. 1514–1523.
Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzman, F., Joulin, A., Zhang, J., Ding, Y., Shen, S., Cheng, Y., Sun, M., Luan, H., Liu, Y., 2017c. Thumt: an Open
 Ccnet, 2020. Extracting high quality monolingual datasets from web crawl
Grave, E., Source Toolkit for Neural Machine Translation arXiv preprint arXiv:1706.06415.
data. In: Proceedings of the 12th Language Resources and Evaluation Conference, Zhang, X., Su, J., Qin, Y., Liu, Y., Ji, R., Wang, H., 2018. Asynchronous bidirectional
pp. 4003–4012. decoding for neural machine translation. In: Proceedings of AAAI, pp. 5698–5705.
Wieting, J., Berg-Kirkpatrick, T., Gimpel, K., Neubig, G., 2019. Beyond BLEU:training Zhang, B., Xiong, D., Su, J., Luo, J., 2019a. Future-aware knowledge distillation for neural
neural machine translation with semantic similarity. In: Proceedings of ACL, machine translation. IEEE/ACM Trans. Audio Speech Lang. Proc. 27, 2278–2287.
pp. 4344–4355. Zhang, Z., Wu, S., Liu, S., Li, M., Zhou, M., Xu, T., 2019b. Regularizing neural machine
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., et al., 2016. Google's Neural Machine Translation translation by target-bidirectional agreement. Proc. AAAI 33, 443–450.
System: Bridging the Gap between Human and Machine Translation arXiv preprint Zhang, J., Zhou, L., Zhao, Y., Zong, C., 2020a. Synchronous bidirectional inference for
arXiv:1609.08144. neural sequence generation. Artif. Intell. 281, 103234.
Wu, S., Zhang, D., Yang, N., Li, M., Zhou, M., 2017. Sequence-to-dependency neural Zhang, B., Williams, P., Titov, I., Sennrich, R., 2020b. Improving Massively Multilingual
machine translation. In: Proceedings of ACL, pp. 698–707. Neural Machine Translation and Zero-Shot Translation arXiv preprint arXiv:
Wu, L., Tian, F., Qin, T., Lai, J., Liu, T.-Y., 2018. A study of reinforcement learning for 2004.11867.
neural machine translation. In: Proceedings of EMNLP, pp. 3612–3621. Zhao, Z., Dua, D., Singh, S., 2018. Generating natural adversarial examples. In:
Wu, F., Fan, A., Baevski, A., Dauphin, Y.N., Auli, M., 2019a. Pay Less Attention with Proceedings of ICLR.
Lightweight and Dynamic Convolutions arXiv preprint arXiv:1901.10430. Zheng, Z., Zhou, H., Huang, S., Mou, L., Dai, X., Chen, J., Tu, Z., 2018. Modeling past and
Wu, J., Wang, X., Wang, W.Y., 2019b. Extract and Edit: an Alternative to Back-Translation future for neural machine translation. Trans. Assoc. Comput. Ling. 6, 145–157.
for Unsupervised Neural Machine Translation arXiv preprint arXiv:1904.02331. Zheng, Z., Huang, S., Tu, Z., Dai, X.-Y., Jiajun, C., 2019. Dynamic past and future for
Yang, Z., Chen, L., Le Nguyen, M., 2018a. Regularizing forward and backward decoding neural machine translation. In: Proceedings of EMNLP-IJCNLP, pp. 930–940.
to improve neural machine translation. In: Proceedings of International Conference Zheng, Z., Zhou, H., Huang, S., Li, L., Dai, X.-Y., Chen, J., 2020. Mirror-generative neural
on Knowledge and Systems Engineering. KSE, pp. 73–78. machine translation. In: Proceedings of ICLR.
Yang, Z., Chen, W., Wang, F., Xu, B., 2018b. Unsupervised Neural Machine Translation Zhou, J., Xu, W., 2015. End-to-end learning of semantic role labeling using recurrent
with Weight Sharing arXiv preprint arXiv:1804.09057. neural networks. In: Proceedings of ACL, pp. 1127–1137.
Yang, Z., Cheng, Y., Liu, Y., Sun, M., 2019a. Reducing word omission errors in neural Zhou, L., Zhang, J., Zong, C., 2019a. Synchronous Bidirectional Neural Machine
machine translation: a contrastive learning approach. In: Proceedings of ACL, Translation. TACL.
pp. 6191–6196. Zhou, L., Zhang, J., Zong, C., Yu, H., 2019b. Sequence generation: from both sides to the
Yang, X., Liu, Y., Xie, D., Wang, X., Balasubramanian, N., 2019b. Latent part-of-speech middle. In: Proceedings of IJCAI, pp. 5471–5477.
sequences for neural machine translation. In: Proceedings of EMNLP-IJCNLP, Zhou, C., Gu, J., Neubig, G., 2019c. Understanding knowledge distillation in non-
pp. 780–790. autoregressive machine translation. In: Proceedings of ICLR.
Yang, J., Ma, S., Zhang, D., Li, Z., Zhou, M., 2020. Improving neural machine translation Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., Liu, T.-Y., 2020. Incorporating
with soft template prediction. In: Proceedings of WMT, pp. 5979–5989. Bert into Neural Machine Translation arXiv preprint arXiv:2002.06823.
Yun, C., Bhojanapalli, S., Rawat, A.S., Reddi, S., Kumar, S., 2020. Are transformers Zoph, B., Yuret, D., May, J., Knight, K., 2016. Transfer Learning for Low-Resource Neural
universal approximators of sequence-to-sequence functions?. In: Proceedings of ICLR. Machine Translation arXiv preprint arXiv:1604.02201.
Zhang, J., Zong, C., 2016. Exploiting source-side monolingual data in neural machine Zou, W., Huang, S., Xie, J., Dai, X., Chen, J., 2020. A reinforced generation of adversarial
translation. In: Proceedings of EMNLP, pp. 1535–1545. examples for neural machine translation. In: Proceedings of ACL, pp. 3486–3497.
