A Character-Level Decoder Without Explicit Segmentation For Neural Machine Translation
Abstract

The existing machine translation systems, whether phrase-based or neural, have relied on word-level modelling with explicit segmentation, although neural networks do not suffer from character-level modelling and rather suffer from the issues specific to word-level modelling.
En-De

Src   Trgt                 Attention    Development             Test1                   Test2
                           (h1, h2)     Single           Ens    Single           Ens    Single           Ens
(a)   BPE   BPE            h1           20.78            –      19.98            –      21.72            –
(b)   BPE   Char (base)    h1, h2       21.26 (20.62–21.45)  23.49   20.47 (19.30–20.88)  23.10   22.02 (21.35–22.21)  24.83
(c)   BPE   Char (bi-scale) h1, h2      21.57 (20.88–21.88)  23.14   21.33 (19.82–21.56)  23.11   23.45 (21.72–23.91)  25.24
Table 1: BLEU scores of the subword-level, character-level base and character-level bi-scale decoders for both single models and ensembles. The best scores among the single models per language pair are bold-faced, and those among the ensembles are underlined. When available, we report the median value, with the minimum and maximum values in parentheses. (*) http://matrix.statmt.org/ as of 11 March 2016 (constrained only). (1) Freitag et al. (2014). (2, 6) Williams et al. (2015). (3, 5) Durrani et al. (2014). (4) Haddow et al. (2015). (7) Rubino et al. (2015).
For all the pairs other than En-Fi, we use newstest-2013 as a development set, and newstest-2014 (Test1) and newstest-2015 (Test2) as test sets. For En-Fi, we use newsdev-2015 and newstest-2015 as development and test sets, respectively.
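For quick reference, the split described above can be summarized as a small configuration sketch; the dictionary and helper names below are purely illustrative and not part of the paper.

```python
# Illustrative summary of the development/test sets described above.
# Only En-Fi deviates from the default assignment.
DEFAULT_SPLIT = {
    "development": "newstest-2013",
    "Test1": "newstest-2014",
    "Test2": "newstest-2015",
}
EN_FI_SPLIT = {
    "development": "newsdev-2015",
    "test": "newstest-2015",
}

def evaluation_sets(pair: str) -> dict:
    """Return the evaluation sets used for a given language pair (hypothetical helper)."""
    return EN_FI_SPLIT if pair == "En-Fi" else DEFAULT_SPLIT
```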
Models and Training  We test three model settings: (1) BPE→BPE, (2) BPE→Char (base) and (3) BPE→Char (bi-scale). The latter two differ by the type of recurrent neural network used as the decoder. We use GRUs for the encoder in all the settings, and GRUs for the decoders in the first two settings, (1) and (2), while the proposed bi-scale recurrent network is used in the last setting, (3). The encoder has 512 hidden units for each direction (forward and reverse), and the decoder has 1024 hidden units per layer.
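A minimal sketch of the stated layer sizes follows. PyTorch is an arbitrary choice for illustration only; the embedding dimension and the decoder input size are assumptions not given in this section, and the bi-scale decoder of setting (3) is not reproduced.

```python
import torch.nn as nn

EMB_DIM = 512        # assumed embedding dimension (not stated in this section)
ENC_HIDDEN = 512     # encoder hidden units per direction (forward and reverse)
DEC_HIDDEN = 1024    # decoder hidden units per layer

# Bidirectional GRU encoder: 512 units in each direction.
encoder = nn.GRU(EMB_DIM, ENC_HIDDEN, bidirectional=True, batch_first=True)

# GRU decoder as in settings (1) and (2); its input size (embedding plus a
# 2*512-dimensional context vector) is an assumption for illustration.
decoder = nn.GRU(EMB_DIM + 2 * ENC_HIDDEN, DEC_HIDDEN, num_layers=2, batch_first=True)
```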
We train each model using stochastic gradient descent with Adam (Kingma and Ba, 2014). Each update is computed using a minibatch of 128 sentence pairs. The norm of the gradient is clipped with a threshold of 1 (Pascanu et al., 2013).
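A rough sketch of this optimization setup is given below (PyTorch again for illustration only; the model, data, loss function, and learning rate are placeholders, since none are specified here).

```python
import torch

def train_step(model, source, target, loss_fn, optimizer):
    """One update on a minibatch (128 sentence pairs in the setup described above)."""
    optimizer.zero_grad()
    loss = loss_fn(model(source), target)
    loss.backward()
    # Clip the norm of the gradient with a threshold of 1 (Pascanu et al., 2013).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# Adam (Kingma and Ba, 2014); the learning rate below is an assumed value.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```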
Decoding and Evaluation  We use beamsearch to approximately find the most likely translation given a source sentence. The beam widths are 5 and 15 respectively for the subword-level and character-level decoders. They were chosen based on the translation quality on the development set. The translations are evaluated using BLEU (computed with the multi-bleu.perl script from Moses).

Multilayer Decoder and Soft-Alignment Mechanism  When the decoder is a multilayer recurrent neural network (including a stacked network as well as the proposed bi-scale network), it outputs multiple hidden vectors, h1, ..., hL for the L layers, at each time step. This allows an extra degree of freedom in the soft-alignment mechanism (f_score in Eq. (3)). We evaluate two alternatives: (1) using only hL (the slower layer) and (2) using all of them, concatenated.
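The sketch below illustrates the two attention variants, with a generic scoring function standing in for f_score of Eq. (3), which is defined earlier in the paper and not reproduced here; all names and shapes are illustrative assumptions.

```python
import torch

def alignment_weights(enc_states, dec_states, score_fn, use_all_layers=True):
    """Sketch of the two variants: attend with only hL, or with all layers concatenated.

    enc_states : iterable of encoder annotation vectors, one per source position
    dec_states : list [h1, ..., hL] of decoder hidden vectors at the current step
    score_fn   : callable standing in for f_score in Eq. (3); assumed to map an
                 (annotation, query) pair to a scalar score
    """
    if use_all_layers:
        query = torch.cat(dec_states, dim=-1)   # alternative (2): all layers, concatenated
    else:
        query = dec_states[-1]                  # alternative (1): only hL, the slower layer
    scores = torch.stack([score_fn(h_src, query) for h_src in enc_states])
    return torch.softmax(scores, dim=0)         # one alignment weight per source position
```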
Ensembles  We also evaluate an ensemble of neural machine translation models and compare its performance against the state-of-the-art phrase-based translation systems on all four language pairs. We decode from an ensemble by taking the average of the output probabilities at each step.
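A minimal sketch of this ensemble decoding rule, assuming each model exposes a (hypothetical) method that returns a normalized distribution over the next target symbol:

```python
import numpy as np

def ensemble_next_distribution(models, source, prefix):
    """Average the per-model output distributions for the next target symbol.

    model.predict_proba(source, prefix) is a hypothetical interface assumed to
    return a normalized distribution over the target vocabulary (e.g. characters).
    """
    probs = [m.predict_proba(source, prefix) for m in models]
    return np.mean(probs, axis=0)
```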
Figure 3: Alignment matrix of a test example from En-De using the BPE→Char (bi-scale) model.
Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU phrase-based machine translation systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

Changning Huang and Hai Zhao. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing, 21(3):8–20.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, volume 3, page 413.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2015. Character-aware neural language models. arXiv preprint arXiv:1508.06615.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015a. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3.

Tomas Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J Cernocky. 2012. Subword language modeling with neural networks. Preprint.

Graham Neubig, Taro Watanabe, Shinsuke Mori, and Tatsuya Kawahara. 2013. Substring-based machine translation. Machine Translation, 27(2):139–166.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.

Raphael Rubino, Tommi Pirinen, Miquel Espla-Gomis, N Ljubešic, Sergio Ortiz Rojas, Vassilis Papavassiliou, Prokopis Prokopidis, and Antonio Toral. 2015. Abu-MaTran at WMT 2015 translation task: Morphological segmentation and web crawling. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 184–191.

Cicero D Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
Holger Schwenk. 2007. Continuous space language models. Computer Speech & Language, 21(3):492–518.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2368–2376.

Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pages 1017–1024.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688.

David Vilar, Jan-T Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33–39. Association for Computational Linguistics.

Philip Williams, Rico Sennrich, Maria Nadejde, Matthias Huck, and Philipp Koehn. 2015. Edinburgh's syntax-based systems at WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 199–209.