A Teacher-Student Framework For Zero-Resource Neural Machine Translation
Figure 1: (a) The pivot-based approach and (b) the teacher-student approach to zero-resource neural machine translation. X, Y, and Z denote source, target, and pivot languages, respectively. We use a dashed line to denote that there is a parallel corpus available for the connected language pair. Solid lines with arrows represent translation directions. The pivot-based approach leverages a pivot to achieve indirect source-to-target translation: it first translates x into z, which is then translated into y. Our training algorithm is based on the translation equivalence assumption: if x is a translation of z, then P(y|x; θ_{x→y}) should be close to P(y|z; θ_{z→y}). Our approach directly trains the intended source-to-target model P(y|x; θ_{x→y}) ("student") on a source-pivot parallel corpus, with the guidance of an existing pivot-to-target model P(y|z; θ̂_{z→y}) ("teacher").
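For concreteness, the two-step decoding of the pivot-based approach in Figure 1(a) might be sketched as follows; the model objects and their `translate` method are hypothetical placeholders, not an interface defined in this paper.

```python
def pivot_translate(src_to_pivot, pivot_to_tgt, x):
    """Two-step pivot decoding: first x -> z, then z -> y.

    `src_to_pivot` and `pivot_to_tgt` are assumed to be trained NMT models
    exposing a `translate` method that returns the 1-best hypothesis;
    both names are illustrative.
    """
    z = src_to_pivot.translate(x)    # source -> pivot
    y = pivot_to_tgt.translate(z)    # pivot -> target
    return y
```

Besides doubling decoding time, this pipeline propagates errors made in the first step into the second, which is what motivates training a direct source-to-target model instead.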
proach is the intractability in calculating the gradients because of the exponential search space of target sentences. To address this problem, it is possible to construct a sub-space by either sampling (Shen et al., 2016), generating a k-best list (Cheng et al., 2016b), or mode approximation (Kim and Rush, 2016). Then, standard stochastic gradient descent algorithms can be used to optimize model parameters.
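As a minimal sketch of the sampling variant, assuming seq2seq models that expose `sample` and `log_prob` methods (illustrative names, not a fixed API), one training step might look like:

```python
import torch

def sentence_level_step(student, teacher, x, z, optimizer, k=5):
    """One SGD step of sentence-level teaching with a sampled sub-space.

    `teacher.sample(z, k)` is assumed to return k target sentences drawn
    from P(y | z; theta_hat_z->y); `model.log_prob(src, y)` is assumed to
    return a differentiable sentence log-likelihood. Both interfaces are
    assumptions for illustration.
    """
    with torch.no_grad():
        samples = teacher.sample(z, k)  # k samples approximate the search space
    # maximize the student's likelihood of teacher-proposed translations
    loss = -sum(student.log_prob(x, y) for y in samples) / k
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```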
3.3 Word-Level Teaching

Instead of minimizing the KL divergence between the teacher and student models at the sentence level, we further define a training objective at the word level based on Assumption 2:

$$J_{\mathrm{WORD}}(\theta_{x\to y}) = -\sum_{\langle \mathbf{x},\mathbf{z}\rangle \in D} \mathbb{E}_{\mathbf{y}|\mathbf{z};\hat{\theta}_{z\to y}}\Bigg[\sum_{j=1}^{|\mathbf{y}|}\sum_{y\in\mathcal{V}_y} P(y\,|\,\mathbf{z},\mathbf{y}_{<j};\hat{\theta}_{z\to y})\,\log P(y\,|\,\mathbf{x},\mathbf{y}_{<j};\theta_{x\to y})\Bigg]. \tag{12}$$

Therefore, our goal is to find a set of source-to-target model parameters that minimizes the training objective:

$$\hat{\theta}_{x\to y} = \mathop{\mathrm{argmin}}_{\theta_{x\to y}}\Big\{J_{\mathrm{WORD}}(\theta_{x\to y})\Big\}. \tag{13}$$

We use similar approaches as described in Section 3.2 for approximating the full search space with sentence-level teaching. After obtaining θ̂_{x→y}, the same decision rule as shown in Equation (1) can be utilized to find the most probable target sentence ŷ for a source sentence x.
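To make the word-level objective concrete, here is a minimal PyTorch sketch of its trainable part, assuming (batch, length, vocabulary) logit tensors and a padding mask (our assumptions, not stated in the paper); it computes the teacher-weighted cross-entropy, which equals the per-position KL divergence up to a constant in the student parameters.

```python
import torch
import torch.nn.functional as F

def word_level_loss(student_logits, teacher_logits, mask):
    """Cross-entropy part of the word-level KL in Equation (12).

    student_logits: (batch, len, vocab) scores for P(y | x, y_<j; theta_x->y)
    teacher_logits: (batch, len, vocab) scores for P(y | z, y_<j; theta_z->y)
    mask: (batch, len) float tensor, 1.0 on real tokens, 0.0 on padding.
    Both models are run over the same target prefixes y_<j, obtained from
    the teacher by greedy search, beam search, or sampling (Section 3.2).
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)   # teacher stays fixed
    student_logp = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student) = const - sum_y P_teacher(y) log P_student(y),
    # so minimizing this cross-entropy minimizes the KL w.r.t. the student.
    xent = -(teacher_probs * student_logp).sum(dim=-1)  # (batch, len)
    return (xent * mask).sum() / mask.sum()
```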
4 Experiments
Table 3: Comparison with previous work on Spanish-French and German-French translation tasks from
the Europarl corpus. English is treated as the pivot language. The likelihood method uses 100K parallel
source-target sentences, which are not available for other methods.
Figure 2: Validation loss (left) and BLEU (right) across training iterations (×10⁴) of our proposed methods: sent-greedy, sent-beam, word-greedy, word-beam, and word-sampling.
                                     -------- Training --------    ---------- BLEU ----------
Method                               Es→En     En→Fr     Es→Fr     Newstest2012  Newstest2013

Existing zero-resource NMT systems
Cheng et al. (2016a)† pivot          6.78M     9.29M     -         24.60         -
Cheng et al. (2016a)† likelihood     6.78M     9.29M     100K      25.78         -
Firat et al. (2016b) one-to-one      34.71M    65.77M    -         17.59         17.61
Firat et al. (2016b)† many-to-one    34.71M    65.77M    -         21.33         21.19

Our zero-resource NMT system
word-sampling                        6.78M     9.29M     -         28.06         27.03

Table 5: Comparison with previous work on Spanish-French translation in a zero-resource scenario over the WMT corpus. The BLEU scores are case-sensitive. †: the method depends on two-step decoding.
time complexity grows linearly with the number of beams k, the better performance is achieved at the expense of search time.

For word-level experiments, we observe that the word-sampling method performs much better than the other two methods: +1.94 BLEU points on Spanish-French translation and +1.88 BLEU points on German-French translation over the word-greedy method; +2.65 BLEU points on Spanish-French translation and +2.84 BLEU points on German-French translation over the word-beam method. Although Table 2 shows that the word-level KL divergence approximated by sampling is larger than that approximated by greedy or beam search, the sampling approximation introduces more data diversity for training, which outweighs the difference in KL divergence.

We plot validation loss⁴ and BLEU scores over iterations on the German-French translation task in Figure 2. We observe that word-level models tend to have lower validation loss than sentence-level methods. Generally, models with lower validation loss are expected to achieve higher BLEU, but our results indicate that this is not necessarily the case: the sent-beam method converges to a BLEU score 0.31 points higher on the validation set despite a validation loss 13 points higher than that of the word-beam method. Kim and Rush (2016) report a similar observation in data distillation for NMT and explain it by noting that student distributions are more peaked for sentence-level methods. This is indeed the case in our results: on the German-French translation task, the argmax for the sent-beam student model accounts for approximately 3.49% of the total probability mass on average, while the corresponding number is 1.25% for the word-beam student model and 2.60% for the teacher model.

⁴Validation loss: the average negative log-likelihood of sentence pairs on the validation set.
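The peakedness statistics above can be computed directly from model outputs; a small sketch of one way to measure the average argmax probability mass, assuming (batch, length, vocabulary) logits and a padding mask:

```python
import torch
import torch.nn.functional as F

def mean_argmax_mass(logits, mask):
    """Average probability mass a model assigns to its argmax word,
    a simple proxy for how peaked its output distributions are.

    logits: (batch, len, vocab); mask: (batch, len), 1.0 on real tokens.
    """
    probs = F.softmax(logits, dim=-1)
    top = probs.max(dim=-1).values        # probability of the argmax word
    return ((top * mask).sum() / mask.sum()).item()
```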
4.4 Results on the WMT Corpus

The word-sampling method obtains the best performance among our five proposed approaches according to experiments on the Europarl corpus. To further verify this approach, we conduct experiments on the WMT corpus.
source                   Os sentáis al volante en la costa oeste , en San Francisco , y vuestra misión es llegar los primeros a Nueva York .
groundtruth    pivot     You get in the car on the west coast , in San Francisco , and your task is to be the first one to reach New York .
               target    Vous vous asseyez derrière le volant sur la côte ouest à San Francisco et votre mission est d' arriver le premier à New York .
pivot          pivot     You 'll feel at the west coast in San Francisco , and your mission is to get the first to New York . [BLEU: 33.93]
               target    Vous vous sentirez comme chez vous à San Francisco , et votre mission est d' obtenir le premier à New York . [BLEU: 44.52]
likelihood     pivot     You feel at the west coast , in San Francisco , and your mission is to reach the first to New York . [BLEU: 47.22]
               target    Vous vous sentez à la côte ouest , à San Francisco , et votre mission est d' atteindre le premier à New York . [BLEU: 49.44]
word-sampling  target    Vous vous sentez au volant sur la côte ouest , à San Francisco et votre mission est d' arriver le premier à New York . [BLEU: 78.78]
Table 6: Examples and corresponding sentence BLEU scores of translations using the pivot and likelihood methods in (Cheng et al., 2016a) and the proposed word-sampling method. We observe that our approach generates better translations than the methods in (Cheng et al., 2016a). We italicize correct translation segments that are no shorter than 2-grams.
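Sentence-level BLEU scores such as those in Table 6 can be reproduced with standard toolkits; below is a minimal sketch using sacrebleu (our choice of library, not tooling stated in this paper), so the exact smoothing and tokenization, and hence the scores, may differ from those reported.

```python
import sacrebleu

# word-sampling hypothesis and ground-truth target from Table 6
hypothesis = ("Vous vous sentez au volant sur la côte ouest , à San Francisco "
              "et votre mission est d' arriver le premier à New York .")
reference = ("Vous vous asseyez derrière le volant sur la côte ouest à San "
             "Francisco et votre mission est d' arriver le premier à New York .")

# sentence_bleu smooths n-gram counts so short sentences get usable scores;
# .score is on the familiar 0-100 BLEU scale
print(sacrebleu.sentence_bleu(hypothesis, [reference]).score)
```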
6 Conclusion

In this paper, we propose a novel framework that trains the student model, for which no direct parallel corpus is available, on a source-pivot parallel corpus under the guidance of a pre-trained teacher model. We introduce sentence-level and word-level teaching approaches for this framework.
References

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016a. Neural machine translation with pivot languages. CoRR abs/1611.04928.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016b. Semi-supervised learning for neural machine translation.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In ACL.

Adrià de Gispert and José B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus: bridging through Spanish. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 65–68.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In HLT-NAACL.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In EMNLP.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR abs/1611.04798.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In NIPS.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

Ahmed El Kholy, Nizar Habash, Gregor Leusch, Evgeny Matusov, and Hassan Sawaf. 2013. Language independent connectivity strength features for phrase pivot statistical machine translation.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.

Philipp Koehn. 2005. Europarl: a parallel corpus for statistical machine translation.

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. Learning small-size DNN with output-distribution-based criteria. In INTERSPEECH.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

Hideki Nakayama and Noriki Nishida. 2016. Zero-resource machine translation by multimodal encoder-decoder network with multimedia pivot. CoRR abs/1611.04503.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In HLT-NAACL.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Machine Translation 21:165–181.

Hua Wu and Haifeng Wang. 2009. Revisiting pivot language approach for machine translation. In ACL/IJCNLP.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.

Samira Tofighi Zahabi, Somayeh Bakhshaei, and Shahram Khadivi. 2013. Using context vectors in improving a machine translation system with bridge language. In ACL.

Xiaoning Zhu, Zhongjun He, Hua Wu, Haifeng Wang, Conghui Zhu, and Tiejun Zhao. 2013. Improving pivot-based statistical machine translation using random walk. In EMNLP.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.