
LSTM Neural Reordering Feature for Statistical Machine Translation

Yiming Cui, Shijin Wang and Jianfeng Li


iFLYTEK Research, Beijing, China
{ymcui,sjwang3,jfli3}@iflytek.com
arXiv:1512.00177v3 [cs.CL] 16 Jun 2016

Abstract

Artificial neural networks are powerful models which have been widely applied to many aspects of machine translation, such as language modeling and translation modeling. Although notable improvements have been made in these areas, the reordering problem still remains a challenge in statistical machine translation. In this paper, we present a novel neural reordering model that directly models word pairs and their alignment. Furthermore, by utilizing an LSTM recurrent neural network, much longer context can be learned for reordering prediction. Experimental results on the NIST OpenMT12 Arabic-English and Chinese-English 1000-best rescoring tasks show that our LSTM neural reordering feature is robust and achieves significant improvements over various baseline systems.

1 Introduction

In statistical machine translation, the language model, translation model, and reordering model are the three most important components. Among these models, the reordering model plays an important role in phrase-based machine translation (Koehn et al., 2004), and it still remains a major challenge in current research.

In recent years, various phrase reordering methods have been proposed for phrase-based SMT systems, which can be classified into two broad categories:

(1) Distance-based RM: penalizes phrase displacements with respect to the degree of non-monotonicity (Koehn et al., 2004).

(2) Lexicalized RM: conditions reordering probabilities on the current phrase pair. According to the orientation determinants, lexicalized reordering models can be further classified into word-based RM (Tillman, 2004), phrase-based RM (Koehn et al., 2007), and hierarchical phrase-based RM (Galley and Manning, 2008).

Furthermore, some researchers have proposed a reordering model that conditions on both the current and previous phrase pairs by utilizing recursive auto-encoders (Li et al., 2014).

In this paper, we propose a novel neural reordering feature that includes longer context for predicting orientations. We utilize a long short-term memory recurrent neural network (LSTM-RNN) (Graves, 1997) and directly model word pairs to predict their most probable orientation. Experimental results on NIST OpenMT12 Arabic-English and Chinese-English translation show that our neural reordering model achieves significant improvements over various baselines in a 1000-best rescoring task.

2 Related Work

Recently, various neural network models have been applied to machine translation.

The feed-forward neural language model was first proposed by Bengio et al. (2003), which was a breakthrough in language modeling. Mikolov et al. (2011) proposed to use a recurrent neural network for language modeling, which can include much longer context history for predicting the next word. Experimental results show that RNN-based language models significantly outperform standard feed-forward language models.
Devlin et al. (2014) proposed a neural network joint model (NNJM) that conditions on both source and target language context for target word prediction. Although the network architecture is a simple feed-forward neural network, the results have shown significant improvements over state-of-the-art baselines.

Sundermeyer et al. (2014) also put forward a neural translation model, utilizing an LSTM-based RNN and a bidirectional RNN. By introducing bidirectional RNNs, the target word is conditioned not only on the history but also on future source context, which forms a full source sentence for predicting target words.

Li et al. (2013) proposed to use a recursive auto-encoder (RAE) to map each phrase pair into a continuous vector, and to handle reordering with a classifier. They also suggested that including both the current and previous phrase pairs to determine the phrase orientation could achieve further improvements in reordering accuracy (Li et al., 2014).

To the best of our knowledge, this is the first time an LSTM-RNN has been used in a reordering model. With the RNN architecture, we can include much longer context information to determine phrase orientations. Furthermore, by utilizing LSTM units, the network is able to capture much longer-range dependencies than standard RNNs.

Because a fixed length of history information would have to be recorded in the SMT decoding step, we only utilize our LSTM-RNN reordering model as a feature in a 1000-best rescoring step. As word alignments are known after generating the n-best list, it is possible to use the LSTM-RNN reordering model to score each hypothesis.

3 Lexicalized Reordering Model

In traditional statistical machine translation, lexicalized reordering models have been widely used (Koehn et al., 2007). They consider the alignments of the current and previous phrase pairs to determine the orientation.

Formally, given a source language sentence f = {f_1, ..., f_n}, a target language sentence e = {e_1, ..., e_n}, and a phrase alignment a = {a_1, ..., a_n}, the lexicalized reordering model can be illustrated by Equation 1, which conditions only on a_{i-1} and a_i, i.e. the previous and current alignment.

    p(o|e, f) = \prod_{i=1}^{n} p(o_i | e_i, f_{a_i}, a_{i-1}, a_i)    (1)

In Equation 1, o_i represents the phrase orientation. For example, in the most commonly used MSD-based orientation type, o_i takes three values: M stands for monotone, S for swap, and D for discontinuous. The definition of the MSD-based orientation is shown in Equation 2.

    o_i = \begin{cases} M, & a_i - a_{i-1} = 1 \\ S, & a_i - a_{i-1} = -1 \\ D, & |a_i - a_{i-1}| \neq 1 \end{cases}    (2)

Other orientation types, such as LR and MSLR, are also widely used; their definitions can be found on the Moses official website.^1

^1 http://www.statmt.org/moses/

Recent studies on reordering models suggest that also conditioning on previous phrase pairs can improve context sensitivity and reduce reordering ambiguity.
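To make the orientation definitions concrete, the sketch below (ours, not from the paper) computes MSD labels exactly as in Equation 2, plus a simplified two-way LR labelling. The function names and the LR rule are our own assumptions; the authoritative LR/MSLR definitions are the ones on the Moses website.

```python
def msd_orientation(prev_a, cur_a):
    """MSD orientation from consecutive alignment positions (Equation 2)."""
    if cur_a - prev_a == 1:
        return "M"   # monotone
    if cur_a - prev_a == -1:
        return "S"   # swap
    return "D"       # discontinuous

def lr_orientation(prev_a, cur_a):
    """Simplified LR variant: right if the current source position does not
    precede the previous one, left otherwise (our assumption)."""
    return "R" if cur_a >= prev_a else "L"

def orientations(alignment, kind="MSD"):
    """Label each position i >= 2 of an alignment sequence a_1..a_n."""
    label = msd_orientation if kind == "MSD" else lr_orientation
    return [label(prev, cur) for prev, cur in zip(alignment, alignment[1:])]

# Example: a = [1, 2, 4, 3] -> MSD: ['M', 'D', 'S'], LR: ['R', 'R', 'L']
print(orientations([1, 2, 4, 3], "MSD"))
print(orientations([1, 2, 4, 3], "LR"))
```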
4 LSTM Neural Reordering Model

In order to include more context information for determining reordering, we propose to use a recurrent neural network, which has been shown to perform considerably better than standard feed-forward architectures in sequence prediction (Mikolov et al., 2011). However, RNNs with conventional backpropagation training suffer from the vanishing gradient problem (Bengio et al., 1994).

Later, long short-term memory was proposed to solve the vanishing gradient problem, and it can capture longer context than standard RNNs with sigmoid activation functions. In this paper, we adopt the LSTM architecture for training the neural reordering model.

4.1 Training Data Processing

To reduce model complexity and ease implementation, our neural reordering model is purely lexicalized and trained at the word level.

We take the LR orientation type for explanation, while other orientation types (MSD, MSLR) can be induced similarly. Given a sentence pair and its alignment information, we can induce the word-based reordering information by the following steps (a code sketch of this procedure follows Figure 1). Note that we always evaluate the model in the order of the target sentence.

(1) If the current target word has a one-to-one alignment, we can directly induce its orientation, i.e. <left> or <right>.

(2) If the current source/target word has a one-to-many alignment, we judge its orientation by considering its first aligned target/source word, and the other aligned target/source words are annotated with the <follow> reordering type, which means these word pairs inherit the orientation of the previous word pair.

(3) If the current source/target word is not aligned to any target/source word, we introduce a <null> token on the opposite side and annotate this word pair with the <follow> reordering type.

Figure 1 shows an example of data processing.

[Figure 1: Illustration of data processing on the sentence pair "dengdao zhengfu de pizhun" / "wait for approval of the government". (a) Original reordering, with orientations R R L (alignment inside each phrase is omitted); (b) processed reordering, with orientations R F R L L F, where all alignments are regularized to the word level. R-right, L-left, F-follow.]
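As a concrete illustration of steps (1)-(3), the following sketch (our own code, not released by the authors) regularizes a word alignment to the word level and assigns LR-style labels. The data structures, names, and exact tie-breaking are assumptions, not the paper's implementation.

```python
def word_level_orientations(n_src, n_tgt, links):
    """Regularize a word alignment to word level and assign LR-style labels.

    links: set of (src_idx, tgt_idx) pairs, 0-based.
    Returns (src_index_or_None, tgt_index_or_None, label) triples in target
    order, where None stands for a <null> token and labels are 'R' (right),
    'L' (left), or 'F' (follow).
    """
    # First aligned source position for every target word; remaining links of
    # one-to-many target words are kept aside (step 2).
    first_src, extra_pairs = {}, []
    for s, t in sorted(links, key=lambda p: (p[1], p[0])):
        if t in first_src:
            extra_pairs.append((s, t))
        else:
            first_src[t] = s

    out, prev_src, seen_src = [], -1, set()
    for t in range(n_tgt):
        if t in first_src:
            s = first_src[t]
            if s in seen_src:
                label = 'F'   # step 2: a source word aligned to several targets
            else:
                label = 'R' if s >= prev_src else 'L'   # step 1
                prev_src = s
            seen_src.add(s)
            out.append((s, t, label))
            out.extend((s2, t2, 'F') for s2, t2 in extra_pairs if t2 == t)
        else:
            out.append((None, t, 'F'))   # step 3: unaligned target -> <null> source
    # Step 3 for unaligned source words: pair them with a <null> target token.
    aligned_src = {s for s, _ in links}
    out.extend((s, None, 'F') for s in range(n_src) if s not in aligned_src)
    return out

# Toy example with a 3-word source and 4-word target:
print(word_level_orientations(3, 4, {(0, 0), (2, 1), (1, 2), (1, 3)}))
# -> [(0, 0, 'R'), (2, 1, 'R'), (1, 2, 'L'), (1, 3, 'F')]
```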

4.2 LSTM Network Architecture

After processing the training data, we can directly use the word pairs and their orientations to train a neural reordering model.

Given a word pair and its orientation, the neural reordering model can be illustrated by Equation 3,

    p(o|e, f) = \prod_{i=1}^{n} p(o_i | e_1^i, f_1^{a_i}, a_{i-1}, a_i)    (3)

where e_1^i = {e_1, ..., e_i} and f_1^{a_i} = {f_1, ..., f_{a_i}}. The inclusion of history word pairs is handled by the recurrent neural network, which is known for its capability of learning history information.

The architecture of the LSTM-RNN reordering model is depicted in Figure 2, and the corresponding equations are shown in Equations 4 to 6.

    y_i = W_1 * f_{a_i} + W_2 * e_i    (4)
    z_i = LSTM(y_i, W_3, y_1^{i-1})    (5)
    p(o_i | e_1^i, f_1^{a_i}, a_{i-1}, a_i) = softmax(W_4 * z_i)    (6)

The input layer consists of both the source and the target language word, each in a one-hot representation. We then perform a linear transformation of the input layer to a projection layer, which is also called the embedding layer. We adopt an extended LSTM as our hidden layer implementation, which consists of three gating units, i.e. input, forget, and output gates. We omit the rather extensive LSTM equations here; they can be found in (Graves and Schmidhuber, 2005). The output layer is composed of the orientation types. For example, in the LR condition, the output layer contains two units: the <left> and <right> orientations. Finally, we apply the softmax function to obtain normalized probabilities of each orientation.

[Figure 2: Architecture of the LSTM neural reordering model: an input layer taking f_{a_i} and e_i, a projection layer, an LSTM layer, and an output layer producing p(o_i | e_1^i, f_1^{a_i}, a_{i-1}, a_i).]
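The paper does not release code, but Equations 4-6 map naturally onto a small recurrent classifier. The following PyTorch sketch is our own rough re-implementation, not the authors' system: the class name and framework are assumptions, the layer sizes follow Section 5.1, and the softmax of Equation 6 is folded into the loss.

```python
import torch
import torch.nn as nn

class LSTMReorderingModel(nn.Module):
    """Sketch of Equations 4-6: embed a (source, target) word pair, run an
    LSTM over the word-pair sequence, and predict the orientation label."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=100, hidden_dim=100, n_orient=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # plays the role of W_1 * f_{a_i}
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)   # plays the role of W_2 * e_i
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_orient)        # W_4

    def forward(self, src_ids, tgt_ids):
        # src_ids, tgt_ids: (batch, seq_len) word-pair sequences in target order
        y = self.src_emb(src_ids) + self.tgt_emb(tgt_ids)  # Eq. 4
        z, _ = self.lstm(y)                                 # Eq. 5
        return self.out(z)                                  # Eq. 6 (softmax applied in the loss)

# Toy usage: 2 orientations (LR), a batch of one sentence with 4 word pairs.
model = LSTMReorderingModel(src_vocab=100000, tgt_vocab=50000)
logits = model(torch.randint(0, 100000, (1, 4)), torch.randint(0, 50000, (1, 4)))
loss = nn.CrossEntropyLoss()(logits.view(-1, 2), torch.tensor([0, 1, 0, 1]))
```

In this sketch the two embedding tables stand in for W_1 and W_2, the LSTM realizes Equation 5, and the linear layer plus softmax realizes Equation 6.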
5 Experiments

5.1 Setups

We mainly tested our approach on Arabic-English and Chinese-English translation. The training corpus contains 7M words for Arabic and 4M words for Chinese, selected from the NIST OpenMT12 parallel dataset. We use the SAMA tokenizer^2 for Arabic word tokenization, and an in-house segmenter for Chinese words. The English part of the parallel data is tokenized and lowercased. All development and test sets have 4 references for each segment. The statistics of the development and test sets are shown in Table 1.

^2 https://catalog.ldc.upenn.edu/LDC2010L01

System    Dev                 Test1              Test2
Ar-En     MT04-05-06 (3795)   MT08 (1360)        MT09 (1313)
Zh-En     MT05-08 (2439)      MT08.prog (1370)   MT12.rd (820)

Table 1: Statistics of the development and test sets. The number of segments is indicated in brackets.

The baseline systems are built with the open-source phrase-based SMT toolkit Moses (Koehn et al., 2007). Word alignment and phrase extraction are done by GIZA++ (Och and Ney, 2000) with L0-normalization (Vaswani et al., 2012) and the grow-diag-final refinement rule (Koehn et al., 2004). The monolingual part of the training data is used to train a 5-gram language model using SRILM (Stolcke, 2002). Parameter tuning is done by K-best MIRA (Cherry and Foster, 2012). To guarantee result stability, we tune every system 5 times independently and take the average BLEU score (Clark et al., 2011). Translation quality is evaluated with the case-insensitive BLEU-4 metric (Papineni et al., 2002). Statistical significance testing is carried out with the paired bootstrap resampling method at the p < 0.001 level (Koehn, 2004). Our models are evaluated in a 1000-best rescoring step, and all features in the 1000-best list, as well as the LSTM-RNN reordering feature, are retuned via the K-best MIRA algorithm.

For neural network training, we use all parallel text in the baseline training. As a trade-off between computational cost and performance, the projection layer and hidden layer sizes are set to 100, which is enough for our task (we have not seen significant gains when increasing the dimensions beyond 100). We use an initial learning rate of 0.01 with standard SGD optimization without momentum. We train the model for a total of 10 epochs with the cross-entropy criterion. The input and output vocabularies are set to 100K and 50K respectively, and all out-of-vocabulary words are mapped to a <unk> token.
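As a concrete summary of these training settings, the sketch below is our own (it reuses the hypothetical LSTMReorderingModel class from the Section 4.2 sketch and random toy tensors in place of real word-pair data): plain SGD with the cross-entropy criterion, plus one way a summed log-probability could serve as the rescoring feature for a single hypothesis.

```python
import torch
import torch.nn as nn

# Hyperparameters taken from the description above; everything else is assumed.
EPOCHS, LR = 10, 0.01

model = LSTMReorderingModel(src_vocab=100_000, tgt_vocab=50_000,
                            emb_dim=100, hidden_dim=100)
optimizer = torch.optim.SGD(model.parameters(), lr=LR)   # plain SGD, no momentum
criterion = nn.CrossEntropyLoss()

# A single toy batch stands in for the real word-pair training data.
src_ids = torch.randint(0, 100_000, (8, 20))
tgt_ids = torch.randint(0, 50_000, (8, 20))
labels = torch.randint(0, 2, (8, 20))                     # LR orientations

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    logits = model(src_ids, tgt_ids)                      # (batch, seq, n_orient)
    loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
    loss.backward()
    optimizer.step()

# Rescoring use (assumption): the feature value for one n-best hypothesis is
# the sum of log-probabilities of its observed orientation labels.
with torch.no_grad():
    log_probs = torch.log_softmax(model(src_ids[:1], tgt_ids[:1]), dim=-1)
    feature = log_probs.gather(-1, labels[:1].unsqueeze(-1)).sum().item()
```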
5.2 Results on Different Orientation Types

First, we test our neural reordering model (NRM) on the baseline that contains a word-based reordering model with the LR orientation. The results are shown in Tables 2 and 3.

As we can see, among the various orientation types (LR, MSD, MSLR), our model gives consistent improvements over the baseline system. The overall BLEU improvements range from 0.42 to 0.79 for the Arabic-English systems, and from 0.31 to 0.72 for the Chinese-English systems. All neural results are significantly better than the baselines (p < 0.001 level).

In the meantime, we also find that "Left-Right" based orientation methods, such as LR and MSLR, consistently outperform MSD-based orientations. This may be caused by the non-separability problem, which means that MSD-based methods are vulnerable to changes of context and weak in resolving reordering ambiguities. A similar conclusion can be found in Li et al. (2014).

Ar-En System    Dev     Test1   Test2
Baseline        43.87   39.84   42.05
+NRM LR         44.43   40.53   42.84
+NRM MSD        44.29   40.41   42.62
+NRM MSLR       44.52   40.59   42.78

Table 2: LSTM reordering model with different orientation types for the Arabic-English system.

Zh-En System    Dev     Test1   Test2
Baseline        27.18   26.17   24.04
+NRM LR         27.90   26.58   24.70
+NRM MSD        27.49   26.51   24.39
+NRM MSLR       27.82   26.78   24.53

Table 3: LSTM reordering model with different orientation types for the Chinese-English system.

5.3 Results on Different Reordering Baselines

We also test our approach on various baselines, which contain either a word-based, phrase-based, or hierarchical phrase-based reordering model. We only show the results of the MSLR orientation, which is relatively superior to the others according to the results in Section 5.2.
Ar-En System      Dev     Test1   Test2
Baseline wbe      43.87   39.84   42.05
+NRM MSLR         44.52   40.59   42.78
Baseline phr      44.11   40.09   42.21
+NRM MSLR         44.52   40.73   42.89
Baseline hier     44.30   40.23   42.38
+NRM MSLR         44.61   40.82   42.86

Zh-En System      Dev     Test1   Test2
Baseline wbe      27.18   26.17   24.04
+NRM MSLR         27.90   26.58   24.70
Baseline phr      27.33   26.05   24.13
+NRM MSLR         27.86   26.46   24.73
Baseline hier     27.56   26.29   24.38
+NRM MSLR         28.02   26.49   24.67

Table 4: Results on various baselines for the Arabic-English and Chinese-English systems. "wbe": word-based; "phr": phrase-based; "hier": hierarchical phrase-based reordering model. All NRM results are significantly better than the baselines (p < 0.001 level).

In Table 4, we can see that even when we add a strong hierarchical phrase-based reordering model to the baseline, our model can still bring a maximum gain of 0.59 BLEU, which suggests that our model is applicable and robust in various circumstances. However, we notice that the gains in the Arabic-English systems are relatively greater than those in the Chinese-English systems. This is probably because hierarchical reordering features tend to work better for Chinese, and thus our model provides little additional remedy on top of that baseline.

6 Conclusions

We present a novel reordering model built with an LSTM-RNN, which is sensitive to changes of context and introduces rich context information for reordering prediction. Furthermore, the proposed model is purely lexicalized and straightforward, which makes it easy to implement. Experimental results on 1000-best rescoring show that our neural reordering feature is robust and gives consistent improvements over various baseline systems.

In the future, we plan to extend our word-based LSTM reordering model to a phrase-based reordering model, in order to resolve more ambiguities and improve reordering accuracy. Furthermore, we are also going to integrate our neural reordering model into neural machine translation systems.

Acknowledgments

We sincerely thank the anonymous reviewers for their thoughtful comments on our work.

References

[Bengio et al.1994] Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.

[Bengio et al.2003] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(6):1137-1155.

[Cherry and Foster2012] Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427-436, Montréal, Canada, June. Association for Computational Linguistics.

[Clark et al.2011] Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176-181, Portland, Oregon, USA, June. Association for Computational Linguistics.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370-1380, Baltimore, Maryland, June. Association for Computational Linguistics.

[Galley and Manning2008] Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 848-856, Honolulu, Hawaii, October. Association for Computational Linguistics.

[Graves and Schmidhuber2005] A. Graves and J. Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 4, pages 2047-2052.

[Graves1997] Alex Graves. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

[Koehn et al.2004] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2004. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 127-133.

[Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June. Association for Computational Linguistics.

[Koehn2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388-395, Barcelona, Spain, July. Association for Computational Linguistics.

[Li et al.2013] Peng Li, Yang Liu, and Maosong Sun. 2013. Recursive autoencoders for ITG-based translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 567-577, Seattle, Washington, USA, October. Association for Computational Linguistics.

[Li et al.2014] Peng Li, Yang Liu, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. 2014. A neural reordering model for phrase-based translation. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1897-1907, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

[Mikolov et al.2011] T. Mikolov, S. Kombrink, L. Burget, and J. H. Cernocky. 2011. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5528-5531.

[Och and Ney2000] Franz Josef Och and Hermann Ney. 2000. A comparison of alignment models for statistical machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 1086-1090.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.

[Stolcke2002] Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 901-904.

[Sundermeyer et al.2014] Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. 2014. Translation modeling with bidirectional recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 14-25, Doha, Qatar, October. Association for Computational Linguistics.

[Tillman2004] Christoph Tillman. 2004. A unigram orientation model for statistical machine translation. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 101-104, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

[Vaswani et al.2012] Ashish Vaswani, Liang Huang, and David Chiang. 2012. Smaller alignment models for better translations: Unsupervised word alignment with l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 311-319, Jeju Island, Korea, July. Association for Computational Linguistics.
