Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Revisiting Character-Based Neural Machine Translation

with Capacity and Compression

Colin Cherry∗, George Foster∗, Ankur Bapna, Orhan Firat, Wolfgang Macherey
Google AI
colincherry,fosterg,ankurbpn,orhanf,[email protected]

Abstract that should ideally be tuned for each language pair


and corpus, an expensive step that is frequently
Translating characters instead of words or omitted. Even when properly tuned, the repre-
word-fragments has the potential to simplify
sentation of the corpus generated by pipelined
the processing pipeline for neural machine
translation (NMT), and improve results by external processing is likely to be sub-optimal.
eliminating hyper-parameters and manual fea- For instance, it is easy to find examples of word
ture engineering. However, it results in longer fragmentations, such as fling → fl + ing, that
sequences in which each symbol contains less are linguistically implausible. NMT systems are
information, creating both modeling and com- generally robust to such infelicities—and can be
putational challenges. In this paper, we show made more robust through subword regularization
that the modeling problem can be solved by
(Kudo, 2018)—but their effect on performance has
standard sequence-to-sequence architectures
of sufficient depth, and that deep models op-
not been carefully studied. The problem of find-
erating at the character level outperform iden- ing optimal segmentations becomes more complex
tical models operating over word fragments. when an NMT system must handle multiple source
This result implies that alternative architec- and target languages, as in multilingual translation
tures for handling character input are bet- or zero-shot approaches (Johnson et al., 2017).
ter viewed as methods for reducing compu- Translating characters instead of word frag-
tation time than as improved ways of model- ments avoids these problems, and gives the system
ing longer sequences. From this perspective,
access to all available information about source
we evaluate several techniques for character-
level NMT, verify that they do not match the and target sequences. However, it presents sig-
performance of our deep character baseline nificant modeling and computational challenges.
model, and evaluate the performance versus Longer sequences incur linear per-layer cost and
computation time tradeoffs they offer. Within quadratic attention cost, and require information
this framework, we also perform the first eval- to be retained over longer temporal spans. Finer
uation for NMT of conditional computation temporal granularity also creates the potential for
over time, in which the model learns which
attention jitter (Gulcehre et al., 2017). Perhaps
timesteps can be skipped, rather than having
them be dictated by a fixed schedule specified most significantly, since the meaning of a word
before training begins. is not a compositional function of its characters,
the system must learn to memorize many character
1 Introduction sequences, a different task from the (mostly) com-
positional operations it performs at higher levels
Neural Machine Translation (NMT) has largely re- of linguistic abstraction.
placed the complex pipeline of Phrase-Based MT In this paper, we show that a standard LSTM
with a single model that is trained end-to-end. sequence-to-sequence model works very well for
However, NMT systems still typically rely on pre- characters, and given sufficient depth, consistently
and post-processing operations such as tokeniza- outperforms identical models operating over word
tion and word fragmentation through byte-pair en- fragments. This result suggests that a produc-
coding (BPE; Sennrich et al., 2016). Although tive line of research on character-level models is
these are effective, they involve hyperparameters to seek architectures that approximate standard

*Equal contributions sequence-to-sequence models while being compu-

4295
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4295–4305
Brussels, Belgium, October 31 - November 4, 2018. c 2018 Association for Computational Linguistics
tationally cheaper. One approach to this problem of whether their specific encoder produces a de-
is temporal compression: reducing the number of sirable speed-quality tradeoff in the context of a
state vectors required to represent input or output much stronger baseline translation system. We
sequences. We evaluate various approaches for draw inspiration from their pooling solution for re-
performing temporal compression, both accord- ducing sequence length, along with similar ideas
ing to a fixed schedule; and, more ambitiously, from the speech community (Chan et al., 2016),
learning compression decisions with a Hierarchi- when devising fixed-schedule reduction strategies
cal Multiscale architecture (Chung et al., 2017). in Section 3.3.
Following recent work by Lee et al. (2017), we fo- One of our primary contributions is an ex-
cus on compressing the encoder. tensive invesigation of the efficacy of a typical
Our contributions are as follows: LSTM-based NMT system when operating at the
character-level. The vast majority of existing stud-
• The first large-scale empirical investigation ies compare a specialized character-level architec-
of the translation quality of standard LSTM ture to a distinct word-level one. To the best of
sequence-to-sequence architectures operat- our knowledge, only a small number of papers
ing at the character level, demonstrating im- have explored running NMT unmodified on char-
provements in translation quality over word acter sequences; these include: Luong and Man-
fragments, and quantifying the effect of cor- ning (2016) on WMT’15 English-Czech, Wu et al.
pus size and model capacity. (2016) on WMT’14 English-German, and Brad-
• A comparison of techniques to compress bury et al. (2016) on IWSLT German-English. All
character sequences, assessing their ability to report scores that either trail behind or reach par-
trade translation quality for increased speed. ity with word-level models. Only Wu et al. (2016)
• A first attempt to learn how to compress the compare to word fragment models, which they
source sequence during NMT training by us- show to outperform characters by a sizeable mar-
ing the Hierarchical Multiscale LSTM to dy- gin. We revisit the question of character- versus
namically shorten the source sequence as it fragment-level NMT here, and reach quite differ-
passes through the encoder. ent conclusions.

2 Related Work 3 Methods

Early work on modeling characters in NMT fo- 3.1 Baseline Sequence-to-Sequence Model
cused on solving the out-of-vocabulary and soft- We adopt a simplified version of the LSTM archi-
max bottleneck problems associated with word- tecture of Chen et al. (2018) that achieves state-of-
level models (Ling et al., 2015; Costa-jussà and the-art performance on the competitive WMT14
Fonollosa, 2016; Luong and Manning, 2016). English-French and English-German benchmarks.
These took the form of word-boundary-aware hi- This incorporates bidirectional LSTM (BiLSTM)
erarchical models, with word-level models dele- layers in the encoder, concatenating the output
gating to character-level models to generate repre- from forward and backward directions before
sentations in the encoder and words in the decoder. feeding the next layer. Output from the top en-
Our work will not assume fixed word boundaries coder layer is projected down to the decoder di-
are given in advance. mension and used in an additive attention mech-
With the advent of word-fragment approaches, anism computed over the bottom decoder layer.
interest in character-level processing fell off, but The decoder consists of unidirectional layers, all
has recently been reignited with the work of of which use the encoder context vectors com-
Lee et al. (2017). They propose a specialized puted from attention weights over the bottom
character-level encoder, connected to an unmod- layer. For both encoder and decoder we use layer
ified character-level RNN decoder. They address normalization (Ba et al., 2016) and residual con-
the modeling and efficiency challenges of long nections beginning at the third layer. We do not
character sequences using a convolutional layer, apply a non-linearity to LSTM output. We regu-
max-pooling over time, and highway layers. We larize with dropout applied to embeddings and to
agree with their conclusion that character-level the output of each LSTM layer.
translation is effective, but revisit the question In the interests of simplicity and reproducibil-

4296
ity, we depart from Chen et al. (2018) in several As mentioned earlier, Lee et al. (2017) pro-
ways: we do not use multi-headed attention, feed pose a specialized character encoder that com-
encoder context vectors to the softmax, regularize bines convolutional layers to accumulate local
with label smoothing or weight decay, nor apply context, max-pooling layers to reduce sequence
dropout to the attention mechanism. lengths, highway layers to increase network ca-
Our baseline character models and BPE mod- pacity, followed by bidirectional GRU layers to
els both use this architecture, differing only in generate globally aware contextual source repre-
whether the source and target languages are tok- sentations. This strategy is particularly efficient
enized into sequences of characters or BPE word because all reductions happen before the first re-
fragments. We describe BPE briefly below. current layer. We re-implement their approach
faithfully, with the exceptions of using LSTMs in
3.2 Byte-Pair Encoding place of GRUs,1 and modifying the batch sizes to
accomodate our multi-GPU training scheme.
Byte-Pair Encoding (BPE) offers a simple inter-
While pooling based approaches are typically
polation between word- and character-level rep-
employed in association with convolutional lay-
resentations (Sennrich et al., 2016). It creates a
ers, we can also intersperse pooling layers into our
vocabulary of frequent words and word fragments
high capacity baseline encoder. This means that
in an iterative greedy merging process that begins
after each BiLSTM layer, we have the option to in-
with characters and ends when a desired vocab-
clude a fixed-stride pooling layer to compress the
ulary size is reached. The source and target lan-
sequence before it is processed by the next BiL-
guage are typically processed together in order
STM layer. This is similar to the pyramidal LSTM
to exploit lexical similarities. Given a vocabu-
encoders used for neural speech recognition (Chan
lary, BPE re-tokenizes the corpus into word frag-
et al., 2016). This general strategy affords consid-
ments in a greedy left-to-right fashion, selecting
erable flexibility to the network designer, leaving
the longest possible vocabulary match, and back-
the type of pooling (concatenation, max, mean),
ing off to characters when necessary.
and the strides with which to pool as design deci-
Since each BPE token consists of one or
sions that can be tuned to fit the task.
more characters, BPE-tokenized sequences will be
shorter than character sequences. Viewed as a 3.4 Learned Temporal Compression
mechanism to reduce sequence length, BPE differs
It is unsatisfying to compress a sequence on a fixed
from the solutions we will discuss subsequently
schedule; after all, the characters in a sentence do
in that it increases the vocabulary size, delegat-
not each carry an identical amount of information.
ing the task of creating representations for word
The goal of this section is to explore data-driven
fragments to the embedding table. Also, despite
reduction methods that are optimized to the NMT
being data-driven, its segmentation decisions are
system’s objective, and which learn to compress
fixed before NMT training begins.
as a part of training.
Any strategy for performing temporal com-
3.3 Fixed stride Temporal Pooling
pression will necessarily make discrete decisions,
We explore using fixed stride temporal pooling since sentence length is discrete. Examples of
within the encoder to compress the source char- such strategies include sparse attention (Raffel
acter sequence. These solutions are characterized et al., 2017) and discrete auto-encoders (Kaiser
by pooling the contents of two or more contigu- et al., 2018). For our initial exploration, we chose
ous timesteps to create a single vector that sum- the hierarchical multiscale (HM) architecture of
marizes them, and will replace them to shorten Chung et al. (2017), which we briefly describe.
the sequence in the next layer. These approaches
can learn to interpret the raw character sequence in 3.4.1 Hierarchical Multiscale LSTM
service to their translation objective, but any such The HM is a bottom-up temporal subsampling ap-
interepretation must fit into the pooling schedule proach, with each layer selecting the timesteps that
that was specified during network construction. will survive to the layer above. At a given timestep
We evaluate two methods in this family: a re- t and layer `, the network makes a binary decision,
implementation of Lee et al. (2017), and a version 1
Development experiments indicated that using LSTMs
of our baseline with interspersed pooling layers. over GRUs resulted in a slight improvement.

4297
zt` , to determine whether or not it should send its in which upper layers never increase the amount
output up to layer ` + 1. The preactivation for this of computation.
decision, z̃t` , is a function of the current node’s We also found that the flush component of the
inputs from below and from the previous hidden original architecture, which modifies the LSTM
state, similar to an LSTM gate. However, zt` ’s ac- update at (t+1, `) to discard the LSTM’s inter-
tivation is a binary step function in the forward nal cell, provided too much incentive to leave zt`
pass, to enable discrete decisions, and a hard sig- at 0, resulting in degenerate configurations which
moid in the backward pass, to allow gradients to collapsed to having very few tokens in their up-
flow through the decision point.2 The zt` decision per layers. We addressed this by removing the no-
affects both the layer above, and the next timestep tion of a flush from our architecture. The node to
of the current layer: the right (t+1, `) always performs a normal LSTM
update, regardless of zt` . This modification is sim-
• zt` = 1, flow up: the node above (t, `+1) per- ilar to one proposed independently by Kádár et al.
forms a normal LSTM update; the node to (2018), who simplified the flush operation by re-
the right (t+1, `) performs a modified update moving the connection to (t, ` + 1).
called a flush, which ignores the LSTM inter- We found it useful to change the initial value of
nal cell at (t, `), and redirects the incoming the bias term used in the calculation of z̃t` , which
LSTM hidden state from (t, `) to (t, ` + 1). we refer to as the z-bias. Setting z-bias to 1, which
• zt` = 0, flow right: the node above (t, `+1) is the saturation point for the hard sigmoid with
simply copies the cell and hidden state values slope 1, improves training stability by encourag-
from (t−1, `+1); the node to the right (t+1, `) ing the encoder to explore configurations where
performs a normal LSTM update. most timesteps survive through all layers, before
starting to discard them.
Conceptually, when zt` = 0, the node above it be- Even with these modifications, we observed de-
comes a placeholder and is effectively removed generate behavior in some settings. To discour-
from the sequence for that layer. Shorter upper age this, we added a compression loss component
layers save computation and facilitate the left-to- similar to that of Ke et al. (2018) to penalize z
right flow of information for the surviving nodes. activation rates outside a specified range α1 , α2 :
Typically, one uses the top hidden state hL t from Lc = l max(0, Z l − α1 T, α2 T − Z l ), where
P
a stack of L RNNs to provide the representation T is source sequence length and Z l = Tt=1 ztl .
P
for a timestep t. But for the HM, the top layer To incorporate the HM into our NMT encoder,
may be updated much less frequently than the lay- we replace the lowest BiLSTM layer with unidi-
ers below it. To enable tasks that need a distinct rectional HM layers.3 We adapt any remaining
representation for each timestep, such as language BiLSTM layers to copy or update according to the
modeling, the HM employs a gated output module z-values calculated by the top HM layer.
to mix hidden states across layers. This learned
module combines the states h1t , h2t , . . ., hL
t using 4 Experimental Design
scaling and projection operators to produce a sin-
4.1 Corpora
gle output ht .
We adopt the corpora used by Lee et al (2017),
3.4.2 Modifying the HM for NMT with the exception of WMT15 Russian-English.4
We would like sequences to become progressively To measure performance on an “easy” language
shorter as we move upward through the layers. As pair, and to calibrate our results against recent
originally specified, the HM calculates zt` indepen- benchmarks, we also included WMT14 English-
dently for every t and `, including copied nodes, French. Table 1 gives details of the corpora used.
meaning that a “removed” timestep could reappear All corpora are preprocessed using Moses tools.5
in a higher layer when a copied node (t, `) sets 3
The flush operation makes the original HM inherently
zt` = 1. This is easily addressed by locking zt` = 0 left-to-right. Since we have dropped flushes from our current
for copied nodes, creating a hierarchical structure version, it should be straightforward to devise a bidirectional
variant, which we leave to future work.
2 4
This disconnect between forward and backward activa- Due to licence restrictions.
5
tions is known as a straight-through estimator (Bengio et al., Scripts and arguments:
2013). remove-non-printing-char.perl

4298
corpus train dev test Tokenized BLEU SacreBLEU
WMT15 Finnish-En 2.1M 1500 1370 Language BPE Char Delta Char
WMT15 German-En 4.5M 3003 2169 EnFr 38.8 39.2 0.4 38.1
WMT15 Czech-En 14.8M 3003 2056 CsEn 24.8 25.9 1.1 25.6
WMT14 En-French 39.9M 3000 3003 DeEn 29.7 31.6 1.9 31.6
FiEn 17.5 19.3 1.8 19.5
Table 1: Corpora, with linecounts. Test sets are
WMT14-15 newstest. Dev sets are newsdev 2015 (Fi) Table 2: Character versus BPE translation.
and newstest 2013 (De, Fr), and 2014 (Cs).
Comparison Point Ref Ours
Chen et al. (2018) BPE EnFr 41.0
Dev and test corpora are tokenized, but not filtered 38.8
Wu et al. (2016) BPE EnFr 39.0
or cleaned. Our character models use only the Lee et al. (2017) Char CsEn 22.5 25.9
most frequent 496 characters across both source DeEn 25.8 31.6
and target languages; similarly, BPE is run across FiEn 13.1 19.3
both languages, with a vocabulary size of 32k.
Table 3: Comparisons with some recent points in the
4.2 Model sizes, training, and inference literature. Scores are tokenized BLEU.
Except where noted below, we used 6 bidirectional
layers in the encoder, and 8 unidirectional layers in we also report SacreBLEU scores (Post, 2018) for
the decoder. All vector dimensions were 512. key results, using the Moses detokenizer.
Models were trained using sentence-level cross-
entropy loss. Batch sizes are capped at 16,384 to- 5 Results
kens, and each batch is divided among 16 NVIDIA 5.1 Character-level translation
P100s running synchronously.
Parameters were initialized with a uniform We begin with experiments to compare the stan-
(0.04) distribution. We use the Adam optimizer, dard RNN architecture from Section 3.1 at the
with β1 = 0.9, β2 = 0.999, and  = 10−6 (Kingma character and BPE levels, using our full-scale
and Ba, 2014). Gradient norm is clipped to 5.0. model with 6 bidirectional encoder layers and 8
The initial learning rate is 0.0004, and we halve decoder layers. The primary results of our experi-
it whenever dev set perplexity has not decreased ments are presented in Table 2, while Table 3 posi-
for 2k batches, with at least 2k batches between tions the same results with respect to recent points
successive halvings. Training stops when dev set from the literature.
perplexity has not decreased for 8k batches. There are a number of observations we can draw
from this data. First, from the EnFr results in Ta-
Inference uses beam search with 8 hypothe-
ble 3, we are in line with GNMT (Wu et al., 2016),
ses, coverage penalty of 0.2 (Tu et al., 2016), and
and within 2 BLEU points of the RNN and Trans-
length normalization of 0.2 (Wu et al., 2016).
former models investigated by Chen et al. (2018).
4.3 Tuning and Evalution So, while we are not working at the exact state-of-
the-art, we are definitely in a range that should be
When comparing character-level and BPE models, relevant to most practitioners.
we tuned dropout independently for each setting, Also from Table 3, we compare quite favorably
greedily exploring increments of 0.1 in the range with Lee et al. (2017), exceeding their reported
0.1–0.5, and selecting based on dev-set BLEU. scores by 3-6 points, which we attribute to hav-
This expensive strategy is crucial to obtaining ing employed much higher model capacity, as they
valid conclusions, since optimal dropout values use a single bidirectional layer in the encoder and
tend to be lower for character models. a two-layer decoder. We investigate the impact of
Our main evaluation metric is Moses-tokenized model capacity in Section 5.1.1.
case-sensitive BLEU score. We report test-set Finally, Table 2 clearly shows the character-
scores on the checkpoints having highest dev-set level systems outperforming BPE for all language
BLEU. To facilitate comparison with future work pairs. The dominance of character-level methods
tokenize.perl in Table 2 indicates that RNN-based NMT archi-
clean-corpus-n.perl -ratio 9 1 100 tectures are not only capable of translating charac-

4299
ter sequences, but actually benefit from them. This point to be much higher for more morphologically
is in direct contradiction to the few previously re- complex languages. It is also important to re-
ported results on this matter, which can in most call that relatively few language-pairs can assem-
cases be explained by our increased model capac- ble parallel corpora of this size.
ity. The exception is GNMT (Wu et al., 2016),
which had similar depth. In this case, possible 5.1.3 Speed
explanations for the discrepancy include our use The performance advantage of working with char-
of a fully bidirectional encoder, our translating acters comes at a significant computational cost.
into English instead of German, and our model- With our full-sized architecture, character models
specific tuning of dropout. trained roughly 8x more slowly than BPE mod-
els.6 Figure 3 shows that training time grows lin-
5.1.1 Effect of model capacity early with number of layers in the model, and that
Character-level NMT systems have a more diffi- character models have a much higher per-layer
cult sequence-modeling task, as they need to infer cost: roughly 0.38 msec/sentence versus 0.04 for
the meaning of words from their constituent char- BPE. We did not directly measure the difference
acters, where models with larger tokens instead in attention cost, but it cannot be greater than the
delegate this task to the embedding table. There- difference in total cost for the smallest number
fore, we hypothesize that increasing the model’s of layers. Therefore, we can infer from Figure 3
capacity by adding layers will have a greater im- that processing 5 layers in a character model in-
pact on character-level models. Figure 1 tests curs roughly the same time cost as attention. This
this hypothesis by measuring the impact of three is surprising given the quadratic cost of attention,
model sizes on test BLEU score. For each of and indicates that efforts to speed up character
our four language pairs, the word-fragment model models cannot focus exclusively on attention.
starts out ahead, and quickly loses ground as
architecture size increases. For the languages 5.1.4 Qualitative comparison
with greater morphological complexity—German, To make a qualitative comparison between word
Czech and Finnish—the slope of the character fragments (BPE) and characters for NMT, we ex-
model’s curve is notably steeper than that of the amined 100 randomly selected sentence pairs from
BPE system, indicating that these systems could the DeEn test set. One author examined the sen-
benefit from yet more modeling capacity. tences, using a display that showed the source7 and
the reference, along with the output of BPE and
5.1.2 Effect of corpus size character models. Any differences between the
One of the most compelling arguments for work- two outputs were highlighted. They then assigned
ing with characters (and to a lesser extent, word- tags to both system outputs indicating broad er-
fragments) is improved generalization. Through ror categories, such as lexical choice, word order
morphological generalizations, the system can and German compound handling.8 Tags were re-
better handle low-frequency and previously un- stricted to cases where one system made a mistake
seen words. It stands to reason that as the train- that the other did not.
ing corpus increases in size, the importance of Of the 100 sentences, 47 were annotated as be-
these generalization capabilities will decrease. We ing identical or of roughly the same quality. The
test this hypothesis by holding the language pair remaining 53 exhibited a large variety of differ-
constant, and varying the training corpus size by ences. Table 4 summarizes the errors that were
downsampling the full training corpus. We choose most easily characterized. BPE and character sys-
EnFr because it has by far the most available data. 6
Recall that we use batches containing 16,384 tokens—
We compare four sizes: 2M, 4M, 14M and 40M. corresponding to a fixed memory budget—for both character
The results are shown in Figure 2. As expected, and BPE models. Thus character models are slowed not only
the gap between character and word-fragment by having longer sentences, but also by parallelizing across
fewer sentences in each batch.
modeling decreases as corpus size increases. From 7
The annotating author does not speak German.
the slopes of the curves, we can infer that the ad- 8
Our annotator also looked specifically for agreement and
vantage of character-level modeling will disappear negation errors, as studied by Sennrich (2017) for English-to-
German character-level NMT. However, neither system ex-
completely as we reach 60-70M sentence pairs. hibited these error types with sufficient frequency to draw
However, there is reason to expect this break-even meaningful conclusions.

4300
Figure 1: Test BLEU for character and BPE translation as architectures scale from 1 BiLSTM encoder layer and 2
LSTM decoder layers (1×2+2) to our standard 6×2+8. The y-axis spans 6 BLEU points for each language pair.

Error Type BPE Char


Lexical Choice 19 8
Compounds 13 1
Proper Names 2 1
Morphological 2 2
Other lexical 2 4
Dropped Content 7 0
Table 4: Error counts out of 100 randomly sampled ex-
Figure 2: BLEU versus training corpus size in millions amples from the DeEn test set.
of sentence pairs, for the EnFr language-pair.

ing one where the character system is a strict im-


provement, translating Bunsenbrenner into bunsen
burner instead of bullets. The second error follows
another common pattern, where both systems mis-
handle the German compound (Chemiestunden /
chemistry lessons), but the character system fails
in a more useful way.
We also found that both systems occasionally
mistranslate proper names. Both fail by attempt-
Figure 3: Training time per sentence versus total num- ing to translate when they should copy over, but
ber of layers (encoder plus decoder) in the model. the BPE system’s errors are harder to understand
as they involve semantic translation, rendering
Britta Hermann as Sir Leon, and Esme Nussbaum
tems differ most in the number of lexical choice
as smiling walnut.9 The character system’s one
errors, and in the extent to which they drop con-
observed error in this category was phonetic rather
tent. The latter is surprising, and appears to be a
than semantic, rendering Schotten as Scottland.
side-effect of a general tendency of the character
models to be more faithful to the source, verging Interestingly, we also observed several in-
on being overly literal. An example of dropped stances where the model correctly translates the
content is shown in Table 5 (top). German 24-hour clock into the English 12-hour
clock; for example, 19.30 becomes 7:30 p.m..
Regarding lexical choice, the two systems dif-
This deterministic transformation is potentially in
fer not only in the number of errors, but in the
reach for both models, but we observed it only for
nature of those errors. In particular, the BPE
the character system in this sample.
model had more trouble handling German com-
pound nouns. Table 5 (bottom) shows an exam- 9
The BPE segmentations for these names were: _Britt
ple which exhibits two compound errors, includ- a _Herr mann and _Es me _N uss baum

4301
Src Für diejenigen, die in ländlichen und abgelegenen Regionen des Staates lebten, . . .
Ref Those living in regional and remote areas of the state . . .
BPE For those who lived in rural and remote regions, . . .
Char For those who lived in rural and remote regions of the state, . . .
Src Überall im Land, in Tausenden von Chemiestunden, haben Schüler ihre Bunsenbrenner
auf Asbestmatten abgestellt.
Ref Up and down the country, in myriad chemistry lessons, pupils have perched their Bunsen
burners on asbestos mats.
BPE Across the country, thousands of chemists have turned their bullets on asbestos mats.
Char Everywhere in the country, in thousands of chemical hours, students have parked their
bunsen burners on asbestos mats.

Table 5: Examples of BPE and character outputs for two sentences from the DeEn test set, demonstrating dropped
content (top) and errors with German compounds (bottom).

5.2 Compressing the Source Sequence Encoder BPE Size BLEU Comp.
BiLSTM Char 31.6 1.00
At this point we have established that character- BiLSTM 1k 30.5 0.44
level NMT benefits translation quality, but incurs a BiLSTM 2k 30.4 0.35
large computational cost. In this section, we eval- BiLSTM 4k 30.0 0.29
uate the speed-quality tradeoffs of various tech- BiLSTM 8k 29.6 0.25
niques for reducing the number of state vectors re- BiLSTM 16k 30.0 0.22
quired to represent the source sentence. All exper- BiLSTM 32k 29.7 0.20
iments are conducted on our DeEn language pair, Lee et. al. reimpl Char 28.0 0.20
chosen for having a good balance of morphologi- BiLSTM + pooling Char 30.0 0.47
cal complexity and training corpus size. HM, 3-layer Char 31.2 0.77
HM, 2-layer Char 30.9 0.89
5.2.1 Optimizing the BPE vocabulary
Table 6: Compression results on WMT15 DeEn. The
Recall that BPE interpolates between word- and Comp. column shows the ratio of total computations
carried out in the encoder.
character-level processing by tokenizing consecu-
tive characters into word fragments; larger BPE
vocabulary sizes result in larger fragments and 5.2.2 Fixed Stride Compression
shorter sequences. If character-level models out- The goal of these experiments is to determine
perform BPE with a vocabulary size of 32k, then is whether using fixed schedule compression is a
there a smaller BPE vocabulary size that reaps the feasible alternative to BPE. We evaluate our re-
benefits of character-level processing, while still implementation of the pooling model of Lee et al.
substantially reducing the sequence length? (2017) and our pooled BiLSTM encoder, both de-
To answer this question, we test a number of scribed in Section 3.3. For the pooled BiLSTM
BPE vocabularies, as shown in Table 6. For encoder, development experiments led us to intro-
each vocabulary, we measure BLEU and sequence duce two mean-pooling layers, a stride 3 layer af-
compression rate, defined as the average size of ter the second BiLSTM, and a stride 2 layer after
the source sequence in characters divided by its the third. Therefore, the final output of the encoder
size in word fragments (the ratio for the target se- is compressed by a factor of 6.
quence was similar). Unfortunately, even at just 1k The results are also shown in Table 6. Note that
vocabulary items, BPE has already lost a BLEU for the pooled BiLSTM, different encoder layers
point with respect to the character model. When have different lengths: 2 full length layers, fol-
comparing these results to the other methods in lowed by 1 at 13 length and 3 at 16 length. There-
this section, it is important to recall that BPE is fore, we report the average compression across
compressing both the source and target sequence layers here and for the HM in Section 5.2.3.
(by approximately the same amount), doubling its Our implementation of Lee et al. (2017) outper-
effective compression rate. forms the original results by more than 2 BLEU

4302
Model BLEU Comp. tion by roughly 60% relative to the layer below.10
LSTM 28.9 1.00 For full-scale experiments, we stacked 5 BiL-
HM, no-fl 27.3 0.63 STM layers on top of 2 or 3 HM layers, as de-
HM, no-fl, hier 28.5 0.65 scribed in section 3.4.1, using only the top HM
HM, no-fl, hier, zb1, anneal 28.8 0.65 layer (rather than the gated output module) as in-
put to the lowest BiLSTM layer. To stabilize the 3-
Table 7: HM small-scale results on WMT15 DeEn.
The Comp. column is the proportion of layer-wise
HM configuration we used a compression penalty
computation relative to the full LSTM. with a weight of 2, and α1 and α2 of 0.1 and 0.9.
Given the tendency of HM layers to reduce com-
putation by a roughly constant proportion, we ex-
points. We suspect most of these gains result from pect fewer z-gates to be open in the 3-HM con-
better optimization of the model with large batch figuration, but this is achieved at the cost of one
training. However, our attempts to scale this en- extra layer relative to our standard 12-layer en-
coder to larger depths, and therfore to the level of coder. As shown in table 6, the 3-HM configura-
performance exhibited by our other systems, did tion achieves much better compression even when
not result in any significant improvements. This this is accounted for, and also gives slightly better
is possibly due to difficulties with optimizing a performance than 2-HM. In general, HM gating
deeper stack of diverse layers. results in less compression but better performance
Comparing the performance of our Pooled BiL- than the fixed-stride techniques.
STM model against BPE, we notice that for a com- Although these preliminary results are promis-
parable level of compression (BPE size of 1k), ing, it should be emphasized that the speed gains
BPE out-performs the pooled model by around 0.5 they demonstrate are conceptual, and that realizing
BLEU points. At a similar level of performance them in practice comes with significant engineer-
(BPE size of 4k), BPE has significantly shorter se- ing challenges.
quences. Although fixed-stride pooling does not
yet match the performance of BPE, we remain op- 6 Conclusion
timistic about its potential. The appeal of these
We have demonstrated the translation quality
models derives from their simplicity; they are easy
of standard NMT architectures operating at the
to optimize, perform reasonably well, and remove
character-level. Our experiments show the sur-
the complication of BPE preprocessing.
prising result that character NMT can substan-
tially out-perform BPE tokenization for all but the
5.2.3 Hierarchical Multiscale Compression
largest training corpora sizes, and the less surpris-
We experimented with using the Hierarchical Mul- ing result that doing so incurs a large computa-
tiscale (HM; Section 3.4.1) architecture to learn tional cost. To address this cost, we have ex-
compression decisions for the encoder. plored a number of methods for source-sequence
For initial exploration, we used a scaled-down compression, including the first application of the
architecture consisting of 3 unidirectional HM en- Hierarchical Multiscale LSTM to NMT, which
coder layers and 2 LSTM decoder layers, attend- allows us to learn to dynamically compress the
ing over the HM’s gated output module. Com- source sequence.
parisons to an equivalent LSTM are shown in ta- We intend this paper as a call to action.
ble 7. The first two HM lines justify the no-flush Character-level translation is well worth doing, but
and hierarchical modifications described in Sec- we do not yet have the necessary techniques to
tion 3.4.1, yielding incremental gains of 27.3 (the benefit from this quality boost without suffering a
flush variant failed to converge), and 1.2 respec- disproportionate reduction in speed. We hope that
tively. Initializing z-bias to 1 and annealing the these results will spur others to revisit the question
slope of the hard binarizer from 1.0 to 5.0 over 80k of character-level translation as an interesting test-
minibatches gave further small gains, bringing the bed for methods that can learn to process, summa-
HM to parity with the LSTM while saving approx- rize or compress long sequences.
imately 35% of layer-wise computations. Interest- 10
For instance, the 2nd and 3rd layer of the best configu-
ingly, we found that, over a wide range of training ration shown had on average 60% and 36% of z gates open,
conditions, each layer tended to reduce computa- yielding the computation ratio of (1 + 0.6 + 0.36)/3 = 0.65.

4303
References Nan Rosemary Ke, Konrad Zolna, Alessandro Sor-
doni, Zhouhan Lin, Adam Trischler, Yoshua Ben-
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- gio, Joelle Pineau, Laurent Charlin, and Chris
ton. 2016. Layer normalization. arXiv preprint Pal. 2018. Focused hierarchical RNNs for con-
arXiv:1607.06450. ditional sequence processing. arXiv preprint
arXiv:1806.04342.
Yoshua Bengio, Nicholas Léonard, and Aaron
Courville. 2013. Estimating or propagating gradi- Diederik P Kingma and Jimmy Lei Ba. 2014. Adam: A
ents through stochastic neurons for conditional com- method for stochastic optimization. arXiv preprint
putation. arXiv preprint arXiv:1308.3432. arXiv:1412.6980.
James Bradbury, Stephen Merity, Caiming Xiong, and Taku Kudo. 2018. Subword regularization: Improv-
Richard Socher. 2016. Quasi-recurrent neural net- ing neural network translation models with multiple
works. arXiv preprint arXiv:1611.01576. subword candidates. In Proceedings of the 56th An-
nual Meeting of the Association for Computational
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Linguistics (To Appear).
Vinyals. 2016. Listen, attend and spell: A neural
network for large vocabulary conversational speech Ákos Kádár, Marc-Alexandre Côté, Grzegorz
recognition. In Acoustics, Speech and Signal Pro- Chrupała, and Afra Alishahi. 2018. Revisiting
cessing (ICASSP), 2016 IEEE International Confer- the hierarchical multiscale LSTM. In COLING.
ence on, pages 4960–4964. IEEE.
Jason Lee, Kyunghyun Cho, and Thomas Hofmann.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin 2017. Fully character-level neural machine trans-
Johnson, Wolfgang Macherey, George Foster, Llion lation without explicit segmentation. Transactions
Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, of the Association for Computational Linguistics,
Yonghui Wu, and Macduff Hughes. 2018. The best 5:365–378.
of both worlds: Combining recent advances in neu-
ral machine translation. In Proceedings of the 56th Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W
Annual Meeting of the Association for Computa- Black. 2015. Character-based neural machine trans-
tional Linguistics (To Appear). lation. arXiv preprint arXiv:1511.04586.
Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Minh-Thang Luong and Christopher D. Manning.
2017. Hierarchical multiscale recurrent neural net- 2016. Achieving open vocabulary neural machine
works. In International Conference on Learning translation with hybrid word-character models. In
Representations, Toulon, France. Proceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Long Papers), pages 1054–1063, Berlin, Germany.
Character-based neural machine translation. In Pro- Association for Computational Linguistics.
ceedings of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2: Short Matt Post. 2018. A call for clarity in reporting BLEU
Papers), pages 357–361, Berlin, Germany. Associa- scores. arXiv preprint arXiv:1804.08771v1.
tion for Computational Linguistics.
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J.
Caglar Gulcehre, Francis Dutil, Adam Trischler, and Weiss, and Douglas Eck. 2017. Online and linear-
Yoshua Bengio. 2017. Plan, attend, generate: time attention by enforcing monotonic alignments.
Character-level neural machine translation with In International Conference on Machine Learning,
planning. In Proceedings of the 2nd Workshop on Sydney, Australia.
Representation Learning for NLP, pages 228–234,
Vancouver, Canada. Association for Computational Rico Sennrich. 2017. How Grammatical is Character-
Linguistics. level Neural Machine Translation? Assessing MT
Quality with Contrastive Translation Pairs. In
Melvin Johnson, Mike Schuster, Quoc Le, Maxim EACL, pages 376–382, Valencia, Spain.
Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Tho-
rat, Fernand a Viégas, Martin Wattenberg, Greg Rico Sennrich, Barry Haddow, and Alexandra Birch.
Corrado, Macduff Hughes, and Jeffrey Dean. 2017. 2016. Neural machine translation of rare words
Google’s multilingual neural machine translation with subword units. In Proceedings of the 54th An-
system: Enabling zero-shot translation. Transac- nual Meeting of the Association for Computational
tions of the Association for Computational Linguis- Linguistics (Volume 1: Long Papers), pages 1715–
tics, 5:339–351. 1725, Berlin, Germany. Association for Computa-
tional Linguistics.
Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Par-
mar, Samy Bengio, Jakob Uszkoreit, and Noam Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua
Shazeer. 2018. Fast decoding in sequence mod- Liu, and Hang Li. 2016. Modeling coverage
els using discrete latent variables. arXiv preprint for neural machine translation. arXiv preprint
arXiv:1803.03382. arXiv:1601.04811.

4304
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kaza-
waand Keith Stevens, George Kurian, Nishant Patil,
Wei Wang, Cliff Young, Jason Smith, Jason Riesa,
Alex Rudnick, Oriol Vinyals, Greg Corrado, Mac-
duff Hughes, and Jeffrey Dean. 2016. Google’s neu-
ral machine translation system: Bridging the gap
between human and machine translation. arXiv
preprint arXiv:1609.08144.

4305

You might also like