
A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation

Junyoung Chung Kyunghyun Cho Yoshua Bengio


Université de Montréal New York University Université de Montréal
[email protected] CIFAR Senior Fellow

arXiv:1603.06147v4 [cs.CL] 21 Jun 2016

Abstract

The existing machine translation systems, whether phrase-based or neural, have relied almost exclusively on word-level modelling with explicit segmentation. In this paper, we ask a fundamental question: can neural machine translation generate a character sequence without any explicit segmentation? To answer this question, we evaluate an attention-based encoder–decoder with a subword-level encoder and a character-level decoder on four language pairs–En-Cs, En-De, En-Ru and En-Fi–using the parallel corpora from WMT'15. Our experiments show that the models with a character-level decoder outperform the ones with a subword-level decoder on all of the four language pairs. Furthermore, the ensembles of neural models with a character-level decoder outperform the state-of-the-art non-neural machine translation systems on En-Cs, En-De and En-Fi and perform comparably on En-Ru.

1 Introduction

The existing machine translation systems have relied almost exclusively on word-level modelling with explicit segmentation. This is mainly due to the issue of data sparsity, which becomes much more severe, especially for n-grams, when a sentence is represented as a sequence of characters rather than words, as the length of the sequence grows significantly. In addition to data sparsity, we often have an a priori belief that a word, or its segmented-out lexeme, is a basic unit of meaning, making it natural to approach translation as mapping from a sequence of source-language words to a sequence of target-language words.

This has continued with the more recently proposed paradigm of neural machine translation, although neural networks do not suffer from character-level modelling and rather suffer from issues specific to word-level modelling, such as the increased computational complexity from a very large target vocabulary (Jean et al., 2015; Luong et al., 2015b). Therefore, in this paper, we address the question of whether neural machine translation can be done directly on a sequence of characters without any explicit word segmentation.

To answer this question, we focus on representing the target side as a character sequence. We evaluate neural machine translation models with a character-level decoder on four language pairs from WMT'15 to make our evaluation as convincing as possible. We represent the source side as a sequence of subwords extracted using byte-pair encoding from Sennrich et al. (2015), and vary the target side to be either a sequence of subwords or characters. On the target side, we further design a novel recurrent neural network (RNN), called a bi-scale recurrent network, that better handles multiple timescales in a sequence, and test it in addition to a naive, stacked recurrent neural network.

On all of the four language pairs–En-Cs, En-De, En-Ru and En-Fi–the models with a character-level decoder outperformed the ones with a subword-level decoder. We observed a similar trend with the ensemble of each of these configurations, outperforming both the previous best neural and non-neural translation systems on En-Cs, En-De and En-Fi, while achieving a comparable result on En-Ru. We find these results to be strong evidence that neural machine translation can indeed learn to translate at the character level and that, in fact, it benefits from doing so.

2 Neural Machine Translation

Neural machine translation refers to a recently proposed approach to machine translation (Forcada and Ñeco, 1997; Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014).
This approach aims at building an end-to-end neural network that takes as input a source sentence X = (x_1, ..., x_{T_x}) and outputs its translation Y = (y_1, ..., y_{T_y}), where x_t and y_{t'} are respectively source and target symbols. This neural network is constructed as a composite of an encoder network and a decoder network.

The encoder network encodes the input sentence X into its continuous representation. In this paper, we closely follow the neural translation model proposed in Bahdanau et al. (2015) and use a bidirectional recurrent neural network, which consists of two recurrent neural networks. The forward network reads the input sentence in a forward direction:

  →z_t = φ(e_x(x_t), →z_{t-1}),

where e_x(x_t) is a continuous embedding of the t-th input symbol, and φ is a recurrent activation function. Similarly, the reverse network reads the sentence in a reverse direction (right to left):

  ←z_t = φ(e_x(x_t), ←z_{t+1}).

At each location in the input sentence, we concatenate the hidden states from the forward and reverse RNNs to form a context set C = {z_1, ..., z_{T_x}}, where z_t = [→z_t; ←z_t].

Then the decoder computes the conditional distribution over all possible translations based on this context set. This is done by first rewriting the conditional probability of a translation:

  log p(Y | X) = Σ_{t'=1}^{T_y} log p(y_{t'} | y_{<t'}, X).

For each conditional term in the summation, the decoder RNN updates its hidden state by

  h_{t'} = φ(e_y(y_{t'-1}), h_{t'-1}, c_{t'}),    (1)

where e_y is the continuous embedding of a target symbol. c_{t'} is a context vector computed by a soft-alignment mechanism:

  c_{t'} = f_align(e_y(y_{t'-1}), h_{t'-1}, C).    (2)

The soft-alignment mechanism f_align weights each vector in the context set C according to its relevance given what has been translated. The weight of each vector z_t is computed by

  α_{t,t'} = (1/Z) exp( f_score(e_y(y_{t'-1}), h_{t'-1}, z_t) ),    (3)

where f_score is a parametric function returning an unnormalized score for z_t given h_{t'-1} and y_{t'-1}. We use a feedforward network with a single hidden layer in this paper (for other possible implementations, see Luong et al. (2015a)). Z is a normalization constant:

  Z = Σ_{k=1}^{T_x} exp( f_score(e_y(y_{t'-1}), h_{t'-1}, z_k) ).

This procedure can be understood as computing the alignment probability between the t'-th target symbol and the t-th source symbol.

The hidden state h_{t'}, together with the previous target symbol y_{t'-1} and the context vector c_{t'}, is fed into a feedforward neural network to result in the conditional distribution:

  p(y_{t'} | y_{<t'}, X) ∝ exp( f_out^{y_{t'}}(e_y(y_{t'-1}), h_{t'}, c_{t'}) ).    (4)

The whole model, consisting of the encoder, decoder and soft-alignment mechanism, is then tuned end-to-end to minimize the negative log-likelihood using stochastic gradient descent.
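To make the decoding procedure concrete, the following is a minimal numpy sketch of a single decoder step corresponding to Eqs. (1)–(4). The single-hidden-layer scorer, the plain tanh recurrence standing in for φ, and the parameter names in `params` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def decoder_step(prev_emb, h_prev, ctx_set, params):
    """One illustrative step of the attention-based decoder (Eqs. 1-4).

    prev_emb : e_y(y_{t'-1}), embedding of the previous target symbol, shape (d_emb,)
    h_prev   : decoder hidden state h_{t'-1}, shape (d_h,)
    ctx_set  : context set C = {z_1, ..., z_Tx}, shape (Tx, d_z)
    params   : dict of weight matrices (hypothetical parameterization)
    """
    # Soft-alignment weights alpha_{t,t'} (Eq. 3): a single-hidden-layer
    # feedforward scorer over [e_y(y_{t'-1}); h_{t'-1}; z_t].
    scores = []
    for z_t in ctx_set:
        u = np.concatenate([prev_emb, h_prev, z_t])
        scores.append(params["v_a"] @ np.tanh(params["W_a"] @ u))
    alpha = softmax(np.array(scores))

    # Context vector c_{t'} (Eq. 2): expectation of the context set
    # under the alignment weights.
    c_t = alpha @ ctx_set

    # Hidden-state update h_{t'} (Eq. 1); a plain tanh RNN is used here
    # instead of the GRU employed in the paper.
    h_t = np.tanh(params["W_h"] @ np.concatenate([prev_emb, h_prev, c_t]))

    # Output distribution p(y_{t'} | y_{<t'}, X) (Eq. 4).
    logits = params["W_o"] @ np.concatenate([prev_emb, h_t, c_t])
    return h_t, softmax(logits), alpha
```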


3 Towards Character-Level Translation

3.1 Motivation

Let us revisit how the source and target sentences (X and Y) are represented in neural machine translation. For the source side of any given training corpus, we scan through the whole corpus to build a vocabulary V_x of unique tokens to which we assign integer indices. A source sentence X is then built as a sequence of the indices of such tokens belonging to the sentence, i.e., X = (x_1, ..., x_{T_x}), where x_t ∈ {1, 2, ..., |V_x|}. The target sentence is similarly transformed into a target sequence of integer indices.

Each token, or its index, is then transformed into a so-called one-hot vector of dimensionality |V_x|. All but one element of this vector are set to 0. The only element whose index corresponds to the token's index is set to 1. This one-hot vector is the one which any neural machine translation model sees. The embedding function, e_x or e_y, is simply the result of applying a linear transformation (the embedding matrix) to this one-hot vector.

The important property of this approach based on one-hot vectors is that the neural network is oblivious to the underlying semantics of the tokens. To the neural network, each and every token in the vocabulary is equally distant from every other token. The semantics of those tokens are simply learned (into the embeddings) to maximize the translation quality, or the log-likelihood of the model.

This property allows us great freedom in the choice of the tokens' unit.
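As a concrete illustration of the one-hot view, the following short sketch (with an invented three-word vocabulary) shows that applying the embedding matrix to a one-hot vector is the same as looking up a row of that matrix.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}      # toy vocabulary V_x, for illustration
E = np.random.randn(len(vocab), 4)          # embedding matrix, |V_x| x d

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

x_t = vocab["cat"]
# e_x(x_t) as a linear map applied to the one-hot vector ...
emb_via_onehot = one_hot(x_t, len(vocab)) @ E
# ... which is exactly a row lookup in the embedding matrix.
emb_via_lookup = E[x_t]
assert np.allclose(emb_via_onehot, emb_via_lookup)
```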
Neural networks have been shown to work well with word tokens (Bengio et al., 2001; Schwenk, 2007; Mikolov et al., 2010), but also with finer units, such as subwords (Sennrich et al., 2015; Botha and Blunsom, 2014; Luong et al., 2013), as well as symbols resulting from compression/encoding (Chitnis and DeNero, 2015). Although a number of previous studies have reported the use of neural networks with characters (see, e.g., Mikolov et al. (2012) and Santos and Zadrozny (2014)), the dominant approach has been to preprocess the text into a sequence of symbols, each associated with a sequence of characters, after which the neural network is presented with those symbols rather than with characters.

More recently, in the context of neural machine translation, two research groups have proposed to directly use characters. Kim et al. (2015) proposed to represent each word not as a single integer index as before, but as a sequence of characters, and to use a convolutional network followed by a highway network (Srivastava et al., 2015) to extract a continuous representation of the word. This approach, which effectively replaces the embedding function e_x, was adopted by Costa-Jussà and Fonollosa (2016) for neural machine translation. Similarly, Ling et al. (2015b) use a bidirectional recurrent neural network to replace the embedding functions e_x and e_y, respectively encoding a character sequence to and from the corresponding continuous word representation. A similar, but slightly different, approach was proposed by Lee et al. (2015), where they explicitly mark each character with its relative location in a word (e.g., "B"eginning and "I"ntermediate).

Despite the fact that these recent approaches work at the level of characters, it is less satisfying that they all rely on knowing how to segment characters into words. Although this is generally easy for languages like English, it is not always the case. This word segmentation procedure can be as simple as tokenization followed by some punctuation normalization, but it can also be as complicated as morpheme segmentation requiring a separate model to be trained in advance (Creutz and Lagus, 2005; Huang and Zhao, 2007). Furthermore, these segmentation steps are often tuned or designed separately from the ultimate objective of translation quality, potentially contributing to suboptimal quality. (From here on, the term segmentation broadly refers to any method that splits a given character sequence into a sequence of subword symbols.)

Based on this observation and analysis, in this paper we ask ourselves and the readers a question which should have been asked much earlier: Is it possible to do character-level translation without any explicit segmentation?

3.2 Why Word-Level Translation?

(1) Word as a Basic Unit of Meaning A word can be understood in two different senses. In the abstract sense, a word is a basic unit of meaning (a lexeme); in the other sense, it can be understood as a "concrete word as used in a sentence" (Booij, 2012). A word in the former sense turns into one in the latter sense via a process of morphology, including inflection, compounding and derivation. These three processes do alter the meaning of the lexeme, but often it stays close to the original meaning. Because of this view of words as basic units of meaning (either in the form of lexemes or derived forms) from linguistics, much of previous work in natural language processing has focused on using words as the basic units of which a sentence is encoded as a sequence. Also, the potential difficulty of finding a mapping between a word's character sequence and its meaning (for instance, "quit", "quite" and "quiet" are one edit distance away from each other but have distinct meanings) has likely contributed to this trend toward word-level modelling.

(2) Data Sparsity There is a further technical reason why much of previous research on machine translation has considered words as a basic unit. This is mainly due to the fact that major components in the existing translation systems, such as language models and phrase tables, are count-based estimators of probabilities. In other words, the probability of a subsequence of symbols, or of pairs of symbols, is estimated by counting the number of its occurrences in a training corpus. This approach severely suffers from the issue of data sparsity, which is due to a large state space which grows exponentially w.r.t. the length of subsequences while growing only linearly w.r.t. the corpus size. This poses a great challenge to character-level modelling, as any subsequence will be on average 4–5 times longer when characters, instead of words, are used. Indeed, Vilar et al. (2007) reported worse performance when the character sequence was directly used by a phrase-based machine translation system. More recently, Neubig et al. (2013) proposed a method to improve character-level translation with phrase-based translation systems, however with only limited success.
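The sparsity argument can be made concrete with a toy corpus (invented here purely for illustration): the same sentences are several times longer as character sequences, and a fixed-order count-based model has to estimate many more distinct n-gram types.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]   # toy corpus

def ngram_types(tokens, n):
    """Count distinct n-gram types in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

for sent in corpus:
    print(len(sent.split()), "word tokens vs", len(sent), "character tokens")

# With the same order n, a count-based model over characters covers far less
# context per n-gram and accumulates more distinct types to estimate.
words = " ".join(corpus).split()
chars = list(" ".join(corpus))
print(len(ngram_types(words, 3)), "distinct word trigrams")
print(len(ngram_types(chars, 3)), "distinct character trigrams")
```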
(3) Vanishing Gradient Specifically to neural machine translation, a major reason behind the wide adoption of word-level modelling is the difficulty of modelling long-term dependencies with recurrent neural networks (Bengio et al., 1994; Hochreiter, 1998). As the lengths of the sentences on both sides grow when they are represented in characters, it is easy to believe that there will be more long-term dependencies that must be captured by the recurrent neural network for successful translation.

3.3 Why Character-Level Translation?

Why not Word-Level Translation? The most pressing issue with word-level processing is that we do not have a perfect word segmentation algorithm for any one language. A perfect segmentation algorithm needs to be able to segment any given sentence into a sequence of lexemes and morphemes. This is however a difficult problem on its own and often requires decades of research (see, e.g., Creutz and Lagus (2005) for Finnish and other morphologically rich languages and Huang and Zhao (2007) for Chinese). Therefore, many opt to use either a rule-based tokenization approach or a suboptimal, but still available, learning-based segmentation algorithm.

The outcome of this naive, suboptimal segmentation is that the vocabulary is often filled with many similar words that share a lexeme but have different morphology. For instance, if we apply a simple tokenization script to an English corpus, "run", "runs", "ran" and "running" are all separate entries in the vocabulary, while they clearly share the same lexeme "run". This prevents any machine translation system, in particular neural machine translation, from modelling these morphological variants efficiently.

More specifically, in the case of neural machine translation, each of these morphological variants–"run", "runs", "ran" and "running"–will be assigned a d-dimensional word vector, leading to four independent vectors, while it is clear that if we can segment those variants into a lexeme and other morphemes, we can model them more efficiently. For instance, we can have a d-dimensional vector for the lexeme "run" and much smaller vectors for "s" and "ing". Each of those variants will then be a composite of the lexeme vector (shared across these variants) and morpheme vectors (shared across words sharing the same suffix, for example) (Botha and Blunsom, 2014). This makes use of distributed representations, which generally yield better generalization, but it seems to require an optimal segmentation, which is unfortunately almost never available.
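A minimal sketch of this kind of factorized representation, in the spirit of Botha and Blunsom (2014): surface forms are composed from a shared lexeme vector and shared morpheme vectors. The additive composition, the toy segmentation and the dimensionalities are illustrative assumptions, not a particular published model.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
lexeme_vec = {"run": rng.normal(size=d)}            # shared across variants
morpheme_vec = {"": np.zeros(d),                    # bare form
                "-s": rng.normal(size=d),
                "-ing": rng.normal(size=d)}

def word_vector(lexeme, suffix=""):
    # Additive composition of a lexeme vector and a morpheme vector;
    # other composition functions are possible.
    return lexeme_vec[lexeme] + morpheme_vec[suffix]

run = word_vector("run")
runs = word_vector("run", "-s")
running = word_vector("run", "-ing")
# The surface forms now share parameters through the lexeme vector,
# instead of occupying unrelated rows of a word-embedding matrix.
```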
In addition to inefficiency in modelling, there are two additional negative consequences of using (unsegmented) words. First, the translation system cannot generalize well to novel words, which are often mapped to a token reserved for an unknown word. This effectively ignores any meaning or structure of the word that could be incorporated when translating. Second, even when a lexeme is common and frequently observed in the training corpus, its morphological variant may not be. This implies that the model sees this specific, rare morphological variant much less often and will not be able to translate it well. However, if this rare morphological variant shares a large part of its spelling with other, more common words, it is desirable for a machine translation system to exploit those common words when translating those rare variants.

Why Character-Level Translation? All of these issues can be addressed to a certain extent by directly modelling characters. Although the issue of data sparsity arises in character-level translation, it is elegantly addressed by using a parametric approach based on recurrent neural networks instead of a non-parametric count-based approach. Furthermore, in recent years, we have learned how to build and train a recurrent neural network that can well capture long-term dependencies by using more sophisticated activation functions, such as long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) and gated recurrent units (Cho et al., 2014).

Kim et al. (2015) and Ling et al. (2015a) recently showed that by having a neural network that converts a character sequence into a word vector, we avoid the issues arising from having many morphological variants appearing as separate entities in a vocabulary. This is made possible by sharing the character-to-word neural network across all the unique tokens. A similar approach was applied to machine translation by Ling et al. (2015b).
These recent approaches, however, still rely on the availability of a good, if not optimal, segmentation algorithm. Ling et al. (2015b) indeed state that "[m]uch of the prior information regarding morphology, cognates and rare word translation among others, should be incorporated".

It however becomes unnecessary to consider this prior information if we use a neural network, be it recurrent, convolutional or their combination, directly on the unsegmented character sequence. The possibility of using a sequence of unsegmented characters has been studied for many years in the field of deep learning. For instance, Mikolov et al. (2012) and Sutskever et al. (2011) trained a recurrent neural network language model (RNN-LM) on character sequences. The latter showed that it is possible to generate sensible text sequences by simply sampling a character at a time from this model. More recently, Zhang et al. (2015) and Xiao and Cho (2016) successfully applied a convolutional net and a convolutional-recurrent net, respectively, to character-level document classification without any explicit segmentation. Gillick et al. (2015) further showed that it is possible to train a recurrent neural network on unicode bytes, instead of characters or words, to perform part-of-speech tagging and named entity recognition.

These previous works suggest the possibility of applying neural networks to the task of machine translation, which is often considered a substantially more difficult problem compared to document classification and language modelling.

Figure 1: Bi-scale recurrent neural network. (a) Gating units. (b) One-step processing.

3.4 Challenges and Questions

There are two overlapping sets of challenges for the source and target sides. On the source side, it is unclear how to build a neural network that learns a highly nonlinear mapping from a spelling to the meaning of a sentence.

On the target side, there are two challenges. The first challenge is the same one from the source side, as the decoder neural network needs to summarize what has been translated. In addition to this, character-level modelling on the target side is more challenging, as the decoder network must be able to generate a long, coherent sequence of characters. This is a great challenge, as the size of the state space grows exponentially w.r.t. the number of symbols, and in the case of characters, the sequence is often 300–1000 symbols long.

All these challenges should first be framed as questions: whether the current recurrent neural networks, which are already widely used in neural machine translation, are able to address these challenges as they are. In this paper, we aim at answering these questions empirically and focus on the challenges on the target side (as the target side exhibits both of the challenges).

4 Character-Level Translation

In this paper, we try to answer the questions posed earlier by testing two different types of recurrent neural networks on the target side (decoder).

First, we test an existing recurrent neural network with gated recurrent units (GRUs). We call this decoder a base decoder.

Second, we build a novel two-layer recurrent neural network, inspired by the gated-feedback network from Chung et al. (2015), called a bi-scale recurrent neural network. We design this network to facilitate capturing two timescales, motivated by the fact that characters and words may work at two separate timescales.

We choose to test these two alternatives for the following purposes. Experiments with the base decoder will clearly answer whether the existing neural network is enough to handle character-level decoding, which has not been properly answered in the context of machine translation. The alternative, the bi-scale decoder, is tested in order to see whether it is possible to design a better decoder, if the answer to the first question is positive.

4.1 Bi-Scale Recurrent Neural Network

In the proposed bi-scale recurrent neural network, there are two sets of hidden units, h1 and h2. They contain the same number of units, i.e., dim(h1) = dim(h2). The first set h1 models a fast-changing timescale (thereby, a faster layer), and h2 a slower timescale (thereby, a slower layer). For each hidden unit, there is an associated gating unit, to which we refer by g1 and g2.
For the description below, we use y_{t'-1} and c_{t'} for the previous target symbol and the context vector (see Eq. (2)), respectively.

Let us start with the faster layer. The faster layer outputs two sets of activations, a normal output h1_{t'} and its gated version ȟ1_{t'}. The activation of the faster layer is computed by

  h1_{t'} = tanh( W_h^1 [ e_y(y_{t'-1}); ȟ1_{t'-1}; ĥ2_{t'-1}; c_{t'} ] ),

where ȟ1_{t'-1} and ĥ2_{t'-1} are the gated activations of the faster and slower layers, respectively. These gated activations are computed by

  ȟ1_{t'} = (1 − g1_{t'}) ⊙ h1_{t'},   ĥ2_{t'} = g1_{t'} ⊙ h2_{t'}.

In other words, the faster layer's activation is based on the adaptive combination of the faster and slower layers' activations from the previous time step. Whenever the faster layer determines that it needs to reset, i.e., g1_{t'-1} ≈ 1, the next activation will be determined based more on the slower layer's activation.

The faster layer's gating unit is computed by

  g1_{t'} = σ( W_g^1 [ e_y(y_{t'-1}); ȟ1_{t'-1}; ĥ2_{t'-1}; c_{t'} ] ),

where σ is a sigmoid function.

The slower layer also outputs two sets of activations, a normal output h2_{t'} and its gated version ȟ2_{t'}. These activations are computed as follows:

  h2_{t'} = (1 − g1_{t'}) ⊙ h2_{t'-1} + g1_{t'} ⊙ h̃2_{t'},
  ȟ2_{t'} = (1 − g2_{t'}) ⊙ h2_{t'},

where h̃2_{t'} is a candidate activation. The slower layer's gating unit g2_{t'} is computed by

  g2_{t'} = σ( W_g^2 [ (g1_{t'} ⊙ h1_{t'}); ȟ2_{t'-1}; c_{t'} ] ).

This adaptive leaky integration based on the gating unit from the faster layer has the consequence that the slower layer updates its activation only when the faster layer resets. This puts a soft constraint that the faster layer runs at a faster rate, by preventing the slower layer from updating while the faster layer is processing the current chunk.

The candidate activation is then computed by

  h̃2_{t'} = tanh( W_h^2 [ (g1_{t'} ⊙ h1_{t'}); ȟ2_{t'-1}; c_{t'} ] ).    (5)

ȟ2_{t'-1} indicates the reset activation from the previous time step, similarly to what happens in the faster layer, and c_{t'} is the input from the context.

According to g1_{t'} ⊙ h1_{t'} in Eq. (5), the faster layer influences the slower layer only when the faster layer has finished processing the current chunk and is about to reset itself (g1_{t'} ≈ 1). In other words, the slower layer does not receive any input from the faster layer until the faster layer has quickly processed the current chunk, and the slower layer thereby runs at a slower rate than the faster layer does.

At each time step, the final output of the proposed bi-scale recurrent neural network is the concatenation of the output vectors of the faster and slower layers, i.e., [h1; h2]. This concatenated vector is used to compute the probability distribution over all the symbols in the vocabulary, as in Eq. (4). See Fig. 1 for a graphical illustration.
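The following numpy sketch implements one step of the bi-scale update as written in the equations above; the dictionary-based state, the weight names and the unspecified shapes are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def biscale_step(prev_emb, c_t, state, p):
    """One step of the bi-scale decoder (notation follows the equations above).

    state carries ȟ1_{t'-1} ("h1_gated"), ĥ2_{t'-1} ("h2_gated_for_fast"),
    h2_{t'-1} ("h2") and ȟ2_{t'-1} ("h2_gated") from the previous step.
    """
    # Faster layer: reads e_y(y_{t'-1}), its own gated activation, the gated
    # slower activation, and the context vector c_{t'}.
    u1 = np.concatenate([prev_emb, state["h1_gated"], state["h2_gated_for_fast"], c_t])
    h1 = np.tanh(p["W_h1"] @ u1)                   # h1_{t'}
    g1 = sigmoid(p["W_g1"] @ u1)                   # g1_{t'}

    # Slower layer: driven by the faster layer only through g1_{t'} ⊙ h1_{t'}.
    u2 = np.concatenate([g1 * h1, state["h2_gated"], c_t])
    h2_cand = np.tanh(p["W_h2"] @ u2)              # candidate h̃2_{t'}  (Eq. 5)
    g2 = sigmoid(p["W_g2"] @ u2)                   # g2_{t'}
    h2 = (1.0 - g1) * state["h2"] + g1 * h2_cand   # leaky update of h2_{t'}

    new_state = {
        "h1_gated": (1.0 - g1) * h1,               # ȟ1_{t'}
        "h2_gated_for_fast": g1 * h2,              # ĥ2_{t'}
        "h2": h2,
        "h2_gated": (1.0 - g2) * h2,               # ȟ2_{t'}
    }
    # The decoder output at this step is the concatenation [h1; h2],
    # which feeds the output layer of Eq. (4).
    return np.concatenate([h1, h2]), new_state
```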
Figure 2: (left) The BLEU scores on En-Cs w.r.t. the length of source sentences. (right) The difference of word negative log-probabilities between the subword-level decoder and either of the character-level base or bi-scale decoders.

5 Experiment Settings

For evaluation, we represent a source sentence as a sequence of subword symbols extracted by byte-pair encoding (BPE, Sennrich et al. (2015)) and a target sentence either as a sequence of BPE-based symbols or as a sequence of characters.

Corpora and Preprocessing We use all available parallel corpora for four language pairs from WMT'15: En-Cs, En-De, En-Ru and En-Fi. They consist of 12.1M, 4.5M, 2.3M and 2M sentence pairs, respectively. We tokenize each corpus using the tokenization script included in Moses. (Although tokenization is not necessary for character-level modelling, we tokenize all the target-side corpora to make comparison against word-level modelling easier.) We only use sentence pairs in which the source side is up to 50 subword symbols long and the target side is up to either 100 subword symbols or 500 characters. We do not use any monolingual corpus.
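For reference, a toy version of the BPE merge-learning loop is sketched below; it is only an illustration of the idea, not the implementation of Sennrich et al. (2015) referenced above.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy byte-pair-encoding merge learner (illustrative only)."""
    # Represent each word as a tuple of symbols, with a word-final marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "lower", "lowest", "newest", "widest"], 5))
```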
| Pair | Row | Src | Trgt | D | Model | Dev (Single) | Dev (Ens) | Test1 (Single) | Test1 (Ens) | Test2 (Single) | Test2 (Ens) |
|------|-----|-----|------|---|-------|--------------|-----------|----------------|-------------|----------------|-------------|
| En-De | (a) | BPE | BPE | 1 | Base | 20.78 | – | 19.98 | – | 21.72 | – |
| | (b) | BPE | BPE | 2 | Base | 21.26 (20.62–21.45) | 23.49 | 20.47 (19.30–20.88) | 23.10 | 22.02 (21.35–22.21) | 24.83 |
| | (c) | BPE | Char | 2 | Base | 21.57 (20.88–21.88) | 23.14 | 21.33 (19.82–21.56) | 23.11 | 23.45 (21.72–23.91) | 25.24 |
| | (d) | BPE | Char | 2 | Base | 20.31 | – | 19.70 | – | 21.30 | – |
| | (e) | BPE | Char | 2 | Base | 21.29 (21.13–21.43) | 23.05 | 21.25 (20.62–21.47) | 23.04 | 23.06 (22.85–23.47) | 25.44 |
| | (f) | BPE | Char | 2 | Bi-S | 20.78 | – | 20.19 | – | 22.26 | – |
| | (g) | BPE | Char | 2 | Bi-S | 20.08 | – | 19.39 | – | 20.94 | – |
| | | | | | State-of-the-art Non-Neural Approach* | – | | 20.60 (1) | | 24.00 (2) | |
| En-Cs | (h) | BPE | BPE | 2 | Base | 16.12 (15.96–16.96) | 19.21 | 17.16 (16.38–17.68) | 20.79 | 14.63 (14.26–15.09) | 17.61 |
| | (i) | BPE | Char | 2 | Base | 17.68 (17.39–17.78) | 19.52 | 19.25 (18.89–19.55) | 21.95 | 16.98 (16.81–17.17) | 18.92 |
| | (j) | BPE | Char | 2 | Bi-S | 17.62 (17.43–17.93) | 19.83 | 19.27 (19.15–19.53) | 22.15 | 16.86 (16.68–17.10) | 18.93 |
| | | | | | State-of-the-art Non-Neural Approach* | – | | 21.00 (3) | | 18.20 (4) | |
| En-Ru | (k) | BPE | BPE | 2 | Base | 18.56 (18.26–18.70) | 21.17 | 25.30 (24.95–25.40) | 29.26 | 19.72 (19.02–20.29) | 22.96 |
| | (l) | BPE | Char | 2 | Base | 18.56 (18.39–18.87) | 20.53 | 26.00 (25.04–26.07) | 29.37 | 21.10 (20.14–21.24) | 23.51 |
| | (m) | BPE | Char | 2 | Bi-S | 18.30 (17.88–18.54) | 20.53 | 25.59 (24.57–25.76) | 29.26 | 20.73 (19.97–21.02) | 23.75 |
| | | | | | State-of-the-art Non-Neural Approach* | – | | 28.70 (5) | | 24.30 (6) | |
| En-Fi | (n) | BPE | BPE | 2 | Base | 9.61 (9.24–10.02) | 11.92 | – | – | 8.97 (8.88–9.17) | 11.73 |
| | (o) | BPE | Char | 2 | Base | 11.19 (11.09–11.55) | 13.72 | – | – | 10.93 (10.11–11.56) | 13.48 |
| | (p) | BPE | Char | 2 | Bi-S | 10.73 (10.40–11.04) | 13.39 | – | – | 10.24 (9.71–10.63) | 13.32 |
| | | | | | State-of-the-art Non-Neural Approach* | – | | – | | 12.70 (7) | |

Table 1: BLEU scores of the subword-level, character-level base (Base) and character-level bi-scale (Bi-S) decoders for both single models and ensembles. D denotes the decoder depth. When available, single-model scores are reported as the median, with the minimum and maximum values in parentheses. The En-De rows that share the same target unit and model, i.e., (c)–(e) and (f)–(g), differ in which decoder layer(s) are used by the soft-alignment mechanism (see Section 6). (*) http://matrix.statmt.org/ as of 11 March 2016 (constrained only). (1) Freitag et al. (2014). (2, 6) Williams et al. (2015). (3, 5) Durrani et al. (2014). (4) Haddow et al. (2015). (7) Rubino et al. (2015).

For all the pairs other than En-Fi, we use newstest-2013 as a development set, and newstest-2014 (Test1) and newstest-2015 (Test2) as test sets. For En-Fi, we use newsdev-2015 and newstest-2015 as development and test sets, respectively.

Models and Training We test three model settings: (1) BPE→BPE, (2) BPE→Char (base) and (3) BPE→Char (bi-scale). The latter two differ by the type of recurrent neural network used for the decoder. We use GRUs for the encoder in all the settings. GRUs are used for the decoders in the first two settings, (1) and (2), while the proposed bi-scale recurrent network is used in the last setting, (3). The encoder has 512 hidden units for each direction (forward and reverse), and the decoder has 1024 hidden units per layer.

We train each model using stochastic gradient descent with Adam (Kingma and Ba, 2014). Each update is computed using a minibatch of 128 sentence pairs. The norm of the gradient is clipped with a threshold of 1 (Pascanu et al., 2013).

Decoding and Evaluation We use beam search to approximately find the most likely translation given a source sentence. The beam widths are 5 and 15, respectively, for the subword-level and character-level decoders. They were chosen based on the translation quality on the development set. The translations are evaluated using BLEU (with the multi-bleu.perl script from Moses).

Multilayer Decoder and Soft-Alignment Mechanism When the decoder is a multilayer recurrent neural network (including a stacked network as well as the proposed bi-scale network), the decoder outputs multiple hidden vectors–{h^1, ..., h^L} for L layers–at each time step. This allows an extra degree of freedom in the soft-alignment mechanism (f_score in Eq. (3)). We evaluate alternatives, including (1) using only h^L (the slower layer) and (2) using all of them (concatenated).

Ensembles We also evaluate an ensemble of neural machine translation models and compare its performance against the state-of-the-art phrase-based translation systems on all four language pairs. We decode from an ensemble by taking the average of the output probabilities at each step.
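Below is a sketch of ensemble decoding as described above, averaging the per-step output probabilities of several models. The `init_state`/`step` interface is assumed for illustration, and a greedy loop stands in for the beam search (width 5 or 15) used in the actual experiments.

```python
import numpy as np

def ensemble_greedy_decode(models, source, bos_id, eos_id, max_len=500):
    """Decode from an ensemble by averaging per-step output probabilities.

    `models` is a list of decoder objects with an assumed interface:
        state = m.init_state(source)
        probs, state = m.step(prev_symbol, state)  # distribution over target symbols
    """
    states = [m.init_state(source) for m in models]
    prev, output = bos_id, []
    for _ in range(max_len):
        step_probs = []
        for i, m in enumerate(models):
            p, states[i] = m.step(prev, states[i])
            step_probs.append(p)
        avg = np.mean(step_probs, axis=0)      # average of the output probabilities
        prev = int(np.argmax(avg))             # greedy choice; beam search in practice
        if prev == eos_id:
            break
        output.append(prev)
    return output
```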
Figure 3: Alignment matrix of a test example from En-De using the BPE→Char (bi-scale) model.

6 Quantitative Analysis

Slower Layer for Alignment On En-De, we test which layer of the decoder should be used for computing soft-alignments. In the case of the subword-level decoder, we observed no difference between choosing either of the two layers of the decoder and using the concatenation of all the layers (Table 1 (a–b)). On the other hand, with the character-level decoder, we noticed an improvement when only the slower layer (h2) was used for the soft-alignment mechanism (Table 1 (c–g)). This suggests that the soft-alignment mechanism benefits from aligning a larger chunk in the target with a subword unit in the source, and we use only the slower layer for all the other language pairs.

Single Models In Table 1, we present a comprehensive report of the translation qualities of (1) the subword-level decoder, (2) the character-level base decoder and (3) the character-level bi-scale decoder, for all the language pairs. We see that both types of character-level decoder outperform the subword-level decoder for En-Cs and En-Fi quite significantly. On En-De, the character-level base decoder outperforms both the subword-level decoder and the character-level bi-scale decoder, validating the effectiveness of character-level modelling. On En-Ru, among the single models, the character-level decoders outperform the subword-level decoder, but in general, we observe that all three alternatives work comparably to each other.

These results clearly suggest that it is indeed possible to do character-level translation without explicit segmentation. In fact, what we observed is that character-level translation often surpasses the translation quality of word-level translation. Of course, we note once again that our experiment is restricted to using an unsegmented character sequence at the decoder only, and a further exploration toward replacing the source sentence with an unsegmented character sequence is needed.

Ensembles Each ensemble was built using eight independent models. The first observation we make is that in all the language pairs, neural machine translation performs comparably to, or often better than, the state-of-the-art non-neural translation system. Furthermore, the character-level decoders outperform the subword-level decoder in all the cases.

7 Qualitative Analysis

(1) Can the character-level decoder generate a long, coherent sentence? The translation in characters is dramatically longer than that in words, likely making it more difficult for a recurrent neural network to generate a coherent sentence in characters. This belief turned out to be false. As shown in Fig. 2 (left), there is no significant difference between the subword-level and character-level decoders, even though the lengths of the generated translations are generally 5–10 times longer in characters.

(2) Does the character-level decoder help with rare words? One advantage of character-level modelling is that it can model the composition of any character sequence, thereby better modelling rare morphological variants. We empirically confirm this by observing the growing gap in the average negative log-probability of words between the subword-level and character-level decoders as the frequency of the words decreases. This is shown in Fig. 2 (right) and explains one potential cause behind the success of character-level decoding in our experiments (we define diff(x, y) = x − y).

(3) Can the character-level decoder soft-align between a source word and a target character? In Fig. 3 (left), we show an example soft-alignment for the source sentence "Two sets of light so close to one another". It is clear that the character-level translation model well captured the alignment between the source subwords and target characters.
We observe that the character-level decoder correctly aligns to "lights" and "sets of" when generating the German compound word "Lichtersets" (see Fig. 3 (right) for the zoomed-in version). This type of behaviour happens similarly between "one another" and "einander". Of course, this does not mean that there exists an alignment between a source word and a target character. Rather, it suggests that the internal state of the character-level decoder, be it the base or the bi-scale, well captures the meaningful chunk of characters, allowing the model to map it to a larger chunk (subword) in the source.

(4) How fast is the decoding speed of the character-level decoder? We evaluate the decoding speed of the subword-level base, character-level base and character-level bi-scale decoders on the newstest-2013 corpus (En-De) with a single Titan X GPU. The subword-level base decoder generates 31.9 words per second, while the character-level base decoder and the character-level bi-scale decoder generate 27.5 words per second and 25.6 words per second, respectively. Note that this is evaluated in an online setting, performing consecutive translation, where only one sentence is translated at a time. Translating in a batch setting could differ from these results.

8 Conclusion

In this paper, we addressed a fundamental question of whether a recently proposed neural machine translation system can directly handle translation at the level of characters without any word segmentation. We focused on the target side, in which a decoder was asked to generate one character at a time, while soft-aligning between a target character and a source subword. Our extensive experiments on four language pairs–En-Cs, En-De, En-Ru and En-Fi–strongly suggest that it is indeed possible for neural machine translation to translate at the level of characters, and that it actually benefits from doing so.

Our result has one limitation: we used subword symbols on the source side. This has allowed us a more fine-grained analysis, but in the future, a setting where the source side is also represented as a character sequence must be investigated.

Acknowledgments

The authors would like to thank the developers of Theano (Team et al., 2016). We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs, CIFAR and Samsung. KC thanks the support by Facebook, Google (Google Faculty Award 2016) and NVIDIA (GPU Center of Excellence 2015-2016). JC thanks Orhan Firat for his constructive feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In Advances in Neural Information Processing Systems, pages 932–938.

Geert Booij. 2012. The Grammar of Words: An Introduction to Linguistic Morphology. Oxford University Press.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In ICML 2014.

Rohan Chitnis and John DeNero. 2015. Variable-length word encodings for neural translation models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2088–2093.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), October.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In Proceedings of the 32nd International Conference on Machine Learning.

Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. arXiv preprint arXiv:1603.00810.
Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology.

Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. 2014. Edinburgh's phrase-based machine translation systems for WMT-14. In Proceedings of the ACL 2014 Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, pages 97–104.

Mikel L. Forcada and Ramón P. Ñeco. 1997. Recursive hetero-associative memories for translation. In International Work-Conference on Artificial Neural Networks, pages 453–462. Springer.

Markus Freitag, Stephan Peitz, Joern Wuebker, Hermann Ney, Matthias Huck, Rico Sennrich, Nadir Durrani, Maria Nadejde, Philip Williams, Philipp Koehn, et al. 2014. EU-BRIDGE MT: Combined machine translation.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

Changning Huang and Hai Zhao. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing, 21(3):8–20.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, volume 3, page 413.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. arXiv preprint arXiv:1508.06615.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hyoung-Gyu Lee, JaeSong Lee, Jun-Seok Kim, and Chang-Ki Lee. 2015. NAVER machine translation system for WAT 2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 69–73.

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015a. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015b. Character-based neural machine translation. arXiv preprint arXiv:1511.04586.

Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3.

Tomas Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J. Cernocky. 2012. Subword language modeling with neural networks. Preprint.

Graham Neubig, Taro Watanabe, Shinsuke Mori, and Tatsuya Kawahara. 2013. Substring-based machine translation. Machine Translation, 27(2):139–166.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.

Raphael Rubino, Tommi Pirinen, Miquel Espla-Gomis, N. Ljubešić, Sergio Ortiz Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, and Antonio Toral. 2015. Abu-MaTran at WMT 2015 translation task: Morphological segmentation and web crawling. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 184–191.

Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
Holger Schwenk. 2007. Continuous space language models. Computer Speech & Language, 21(3):492–518.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2368–2376.

Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pages 1017–1024.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

David Vilar, Jan-T. Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33–39. Association for Computational Linguistics.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. 2015. Edinburgh's syntax-based systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 199–209.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
