
A Teacher-Student Framework for

Zero-Resource Neural Machine Translation


Yun Chen†, Yang Liu‡, Yong Cheng+, Victor O.K. Li†

†Department of Electrical and Electronic Engineering, The University of Hong Kong
‡State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China; Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
+Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
[email protected]; [email protected];
[email protected]; [email protected]
Abstract

While end-to-end neural machine translation (NMT) has made remarkable progress recently, it still suffers from the data scarcity problem for low-resource language pairs and domains. In this paper, we propose a method for zero-resource NMT by assuming that parallel sentences have close probabilities of generating a sentence in a third language. Based on this assumption, our method is able to train a source-to-target NMT model (“student”) without parallel corpora available, guided by an existing pivot-to-target NMT model (“teacher”) on a source-pivot parallel corpus. Experimental results show that the proposed method significantly improves over a baseline pivot-based model by +3.0 BLEU points across various language pairs.

1 Introduction

Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015), which directly models the translation process in an end-to-end way, has attracted intensive attention from the community. Although NMT has achieved state-of-the-art translation performance on resource-rich language pairs such as English-French and German-English (Luong et al., 2015; Jean et al., 2015; Wu et al., 2016; Johnson et al., 2016), it still suffers from the unavailability of large-scale parallel corpora for translating low-resource languages. Due to the large parameter space, neural models usually learn poorly from low-count events, which makes NMT a poor choice for low-resource language pairs. Zoph et al. (2016) indicate that NMT obtains much worse translation quality than a statistical machine translation (SMT) system on low-resource languages.

As a result, a number of authors have endeavored to explore methods for translating language pairs without parallel corpora available. These methods can be roughly divided into two broad categories: multilingual and pivot-based. Firat et al. (2016b) present a multi-way, multilingual model with shared attention to achieve zero-resource translation. They fine-tune the attention part using pseudo bilingual sentences for the zero-resource language pair. Another direction is to develop a universal NMT model in multilingual scenarios (Johnson et al., 2016; Ha et al., 2016). They use parallel corpora of multiple languages to train one single model, which is then able to translate a language pair without parallel corpora available. Although these approaches prove to be effective, the combination of multiple languages in modeling and training leads to increased complexity compared with standard NMT.

Another direction is to achieve source-to-target NMT without parallel data via a pivot, which is either text (Cheng et al., 2016a) or image (Nakayama and Nishida, 2016). Cheng et al. (2016a) propose a pivot-based method for zero-resource NMT: it first translates the source language into a pivot language, which is then translated into the target language. Nakayama and Nishida (2016) show that using multimedia information as a pivot also benefits zero-resource translation. However, pivot-based approaches usually need to divide the decoding process into two steps, which is not only more computationally expensive, but also potentially suffers from the error propagation problem (Zhu et al., 2013).

In this paper, we propose a new method for zero-resource neural machine translation. Our method assumes that parallel sentences should have close probabilities of generating a sentence in a third language. To train a source-to-target NMT model without parallel corpora available (“student”), we leverage an existing pivot-to-target NMT model (“teacher”) to guide the learning process of the student model on a source-pivot parallel corpus. Compared with pivot-based approaches (Cheng et al., 2016a), our method allows direct parameter estimation of the intended NMT model, without the need to divide decoding into two steps. This strategy not only improves efficiency but also avoids error propagation in decoding. Experiments on the Europarl and WMT datasets show that our approach achieves significant improvements in terms of both translation quality and decoding efficiency over a baseline pivot-based approach to zero-resource NMT on Spanish-French and German-French translation tasks.
Figure 1: (a) The pivot-based approach and (b) the teacher-student approach to zero-resource neural machine translation. X, Y, and Z denote the source, target, and pivot languages, respectively. We use a dashed line to denote that there is a parallel corpus available for the connected language pair. Solid lines with arrows represent translation directions. The pivot-based approach leverages a pivot to achieve indirect source-to-target translation: it first translates x into z, which is then translated into y. Our training algorithm is based on the translation equivalence assumption: if x is a translation of z, then P(y|x; θ_{x→y}) should be close to P(y|z; θ_{z→y}). Our approach directly trains the intended source-to-target model P(y|x; θ_{x→y}) (“student”) on a source-pivot parallel corpus, with the guidance of an existing pivot-to-target model P(y|z; θ̂_{z→y}) (“teacher”).

2 Background

Neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) advocates the use of neural networks to model the translation process in an end-to-end manner. As a data-driven approach, NMT treats parallel corpora as the major source for acquiring translation knowledge.

Let x be a source-language sentence and y be a target-language sentence. We use P(y|x; θ_{x→y}) to denote a source-to-target neural translation model, where θ_{x→y} is a set of model parameters. Given a source-target parallel corpus D_{x,y}, which is a set of parallel source-target sentences, the model parameters can be learned by maximizing the log-likelihood of the parallel corpus:

\hat{\theta}_{x\to y} = \mathop{\arg\max}_{\theta_{x\to y}} \Big\{ \sum_{\langle x,y \rangle \in D_{x,y}} \log P(y|x; \theta_{x\to y}) \Big\}.

Given learned model parameters θ̂_{x→y}, the decision rule for finding the translation with the highest probability for a source sentence x is given by

\hat{y} = \mathop{\arg\max}_{y} \Big\{ P(y|x; \hat{\theta}_{x\to y}) \Big\}.   (1)
on Spanish-French and German-French transla-
tion tasks. As a data-driven approach, NMT heavily relies
on the availability of large-scale parallel corpora
2 Background to deliver state-of-the-art translation performance
(Wu et al., 2016; Johnson et al., 2016). Zoph et
Neural machine translation (Sutskever et al., 2014; al. (2016) report that NMT obtains much lower
Bahdanau et al., 2015) advocates the use of neu- BLEU scores than SMT if only small-scale par-
ral networks to model the translation process in allel corpora are available. Therefore, the heavy
an end-to-end manner. As a data-driven approach, dependence on the quantity of training data poses
NMT treats parallel corpora as the major source a severe challenge for NMT to translate zero-
for acquiring translation knowledge. resource language pairs.
Let x be a source-language sentence and y be a Simple and easy-to-implement, pivot-based
target-language sentence. We use P (y|x; θx→y ) methods have been widely used in SMT for
translating zero-resource language pairs (de Gis- 3 Approach
pert and Mariño, 2006; Cohn and Lapata, 2007;
Utiyama and Isahara, 2007; Wu and Wang, 2007; 3.1 Assumptions
Bertoldi et al., 2008; Wu and Wang, 2009; Za- In this work, we propose to directly model the in-
habi et al., 2013; Kholy et al., 2013). As pivot- tended source-to-target neural translation based on
based methods are agnostic to model structures, a teacher-student framework. The basic idea is to
they have been adapted to NMT recently (Cheng use a pre-trained pivot-to-target model (“teacher”)
et al., 2016a; Johnson et al., 2016). to guide the learning process of a source-to-target
Figure 1(a) illustrates the basic idea of pivot- model (“student”) without training data available
based approaches to zero-resource NMT (Cheng on a source-pivot parallel corpus. One advantage
et al., 2016a). Let X, Y, and Z denote source, tar- of our approach is that Equation (1) can be used as
get, and pivot languages. We use dashed lines to the decision rule for decoding, which avoids the
denote language pairs with parallel corpora avail- error propagation problem faced by two-step de-
able and solid lines with arrows to denote transla- coding in pivot-based approaches.
tion directions. As shown in Figure 1(b), we still assume
Intuitively, the source-to-target translation can that a source-pivot parallel corpus Dx,z and
be indirectly modeled by bridging two NMT mod- a pivot-target parallel corpus Dz,y are avail-
els via a pivot: able. Unlike pivot-based approaches, we first
P (y|x; θx→z , θz→y ) use the pivot-target parallel corpus Dz,y to ob-
X tain a teacher model P (y|z; θ̂z→y ), where θ̂z→y
= P (z|x; θx→z )P (y|z; θz→y ). (2) is a set of learned model parameters. Then,
z
the teacher model “teaches” the student model
As shown in Figure 1(a), pivot-based ap- P (y|x; θx→y ) on the source-pivot parallel corpus
proaches assume that the source-pivot parallel cor- Dx,z based on the following assumptions.
pus Dx,z and the pivot-target parallel corpus Dz,y
are available. As it is impractical to enumerate all Assumption 1 If a source sentence x is a transla-
possible pivot sentences, the two NMT models are tion of a pivot sentence z, then the probability of
trained separately in practice: generating a target sentence y from x should be
( ) close to that from its counterpart z.
X
θ̂x→z = argmax log P (z|x; θx→z ) , We can further introduce a word-level assump-
θx→z
hx,zi∈Dx,z tion:
( )
X
θ̂z→y = argmax log P (y|z; θz→y ) . Assumption 2 If a source sentence x is a transla-
θz→y
hz,yi∈Dz,y tion of a pivot sentence z, then the probability of
generating a target word y from x should be close
Due to the exponential search space of pivot
to that from its counterpart z, given the already
sentences, the decoding process of translating an
obtained partial translation y<j .
unseen source sentence x has to be divided into
two steps: The two assumptions are empirically verified in
n o our experiments (see Table 2). In the following
ẑ = argmax P (z|x; θ̂x→z ) , (3) subsections, we will introduce two approaches to
z
n o zero-resource neural machine translation based on
ŷ = argmax P (y|ẑ; θ̂z→y ) . (4) the two assumptions.
y

The above two-step decoding process potentially 3.2 Sentence-Level Teaching


suffers from the error propagation problem (Zhu
et al., 2013): the translation errors made in the Given a source-pivot parallel corpus Dx,z , our
first step (i.e., source-to-pivot translation) will af- training objective based on Assumption 1 is de-
fect the second step (i.e., pivot-to-target transla- fined as follows:
tion).
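The two-step decision rule in Equations (3) and (4) can be sketched as follows. The lookup-table “models” are hypothetical placeholders for beam search with trained source-to-pivot and pivot-to-target NMT models; the sketch only illustrates how a mistake made in the first step is inherited by the second.

# Toy lookup tables standing in for beam-search decoding with trained NMT models.
def make_toy_model(table):
    def translate(sentence):
        return tuple(table.get(token, token) for token in sentence)
    return translate

src2piv = make_toy_model({"casa": "house"})    # hypothetical Spanish -> English model
piv2tgt = make_toy_model({"house": "maison"})  # hypothetical English -> French model

def pivot_decode(source_sentence):
    pivot_hyp = src2piv(source_sentence)   # step 1, Equation (3)
    target_hyp = piv2tgt(pivot_hyp)        # step 2, Equation (4): errors from step 1 propagate here
    return target_hyp

print(pivot_decode(("casa",)))  # ('maison',)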
Therefore, it is necessary to explore methods that directly model source-to-target translation without parallel corpora available.

3 Approach

3.1 Assumptions

In this work, we propose to directly model the intended source-to-target neural translation based on a teacher-student framework. The basic idea is to use a pre-trained pivot-to-target model (“teacher”) to guide the learning process of a source-to-target model (“student”), for which no parallel training data is available, on a source-pivot parallel corpus. One advantage of our approach is that Equation (1) can be used as the decision rule for decoding, which avoids the error propagation problem faced by two-step decoding in pivot-based approaches.

As shown in Figure 1(b), we still assume that a source-pivot parallel corpus D_{x,z} and a pivot-target parallel corpus D_{z,y} are available. Unlike pivot-based approaches, we first use the pivot-target parallel corpus D_{z,y} to obtain a teacher model P(y|z; θ̂_{z→y}), where θ̂_{z→y} is a set of learned model parameters. Then, the teacher model “teaches” the student model P(y|x; θ_{x→y}) on the source-pivot parallel corpus D_{x,z} based on the following assumptions.

Assumption 1 If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target sentence y from x should be close to that from its counterpart z.

We can further introduce a word-level assumption:

Assumption 2 If a source sentence x is a translation of a pivot sentence z, then the probability of generating a target word y from x should be close to that from its counterpart z, given the already obtained partial translation y_{<j}.

The two assumptions are empirically verified in our experiments (see Table 2). In the following subsections, we introduce two approaches to zero-resource neural machine translation based on the two assumptions.

3.2 Sentence-Level Teaching

Given a source-pivot parallel corpus D_{x,z}, our training objective based on Assumption 1 is defined as follows:

J_{\mathrm{SENT}}(\theta_{x\to y}) = \sum_{\langle x,z \rangle \in D_{x,z}} \mathrm{KL}\big( P(y|z; \hat{\theta}_{z\to y}) \,\big\|\, P(y|x; \theta_{x\to y}) \big),   (5)
where the KL divergence sums over all possible target sentences:

\mathrm{KL}\big( P(y|z; \hat{\theta}_{z\to y}) \,\big\|\, P(y|x; \theta_{x\to y}) \big) = \sum_{y} P(y|z; \hat{\theta}_{z\to y}) \log \frac{P(y|z; \hat{\theta}_{z\to y})}{P(y|x; \theta_{x\to y})}.   (6)

As the teacher model parameters are fixed, the training objective can be equivalently written as

J_{\mathrm{SENT}}(\theta_{x\to y}) = -\sum_{\langle x,z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z\to y}}\big[ \log P(y|x; \theta_{x\to y}) \big].   (7)

In training, our goal is to find a set of source-to-target model parameters that minimizes the training objective:

\hat{\theta}_{x\to y} = \mathop{\arg\min}_{\theta_{x\to y}} \big\{ J_{\mathrm{SENT}}(\theta_{x\to y}) \big\}.   (8)

With learned source-to-target model parameters θ̂_{x→y}, we use the standard decision rule in Equation (1) to find the translation ŷ for a source sentence x.

However, a major difficulty faced by our approach is the intractability of calculating the gradients because of the exponential search space of target sentences. To address this problem, it is possible to construct a sub-space by sampling (Shen et al., 2016), by generating a k-best list (Cheng et al., 2016b), or by mode approximation (Kim and Rush, 2016). Then, standard stochastic gradient descent algorithms can be used to optimize the model parameters.
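As a rough illustration, the following sketch shows how sentence-level teaching with the mode approximation turns Equation (7) into an ordinary training signal: the teacher translates the pivot side, and the student is trained to assign high probability to that single translation given the source side. The functions teacher_decode and student_neg_log_prob are hypothetical stand-ins, not the dl4mt implementation used in our experiments.

# Hypothetical stand-ins: in practice, teacher_decode runs beam search with the
# pivot-to-target teacher (k = 1 gives sent-greedy, k = 5 gives sent-beam), and
# student_neg_log_prob is -log P(y | x; theta_{x->y}) from the student decoder.
def teacher_decode(pivot_sentence, beam_size=5):
    return ("the", "mode", "translation", "</s>")

def student_neg_log_prob(source_sentence, target_sentence):
    return 0.0

def sentence_level_loss(source_pivot_pairs, beam_size=5):
    """Mode approximation of J_SENT in Equation (7) on a batch of <x, z> pairs."""
    loss = 0.0
    for x, z in source_pivot_pairs:
        y_mode = teacher_decode(z, beam_size)    # approximate E_{y|z} by the teacher's mode
        loss += student_neg_log_prob(x, y_mode)  # -log P(y_mode | x; theta_{x->y})
    return loss

print(sentence_level_loss([(("la", "casa"), ("the", "house"))]))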
3.3 Word-Level Teaching

Instead of minimizing the KL divergence between the teacher and student models at the sentence level, we further define a training objective at the word level based on Assumption 2:

J_{\mathrm{WORD}}(\theta_{x\to y}) = \sum_{\langle x,z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z\to y}}\big[ J(x, y, z, \hat{\theta}_{z\to y}, \theta_{x\to y}) \big],   (9)

where

J(x, y, z, \hat{\theta}_{z\to y}, \theta_{x\to y}) = \sum_{j=1}^{|y|} \mathrm{KL}\big( P(y|z, y_{<j}; \hat{\theta}_{z\to y}) \,\big\|\, P(y|x, y_{<j}; \theta_{x\to y}) \big).   (10)

Equation (9) suggests that the teacher model P(y|z, y_{<j}; θ̂_{z→y}) “teaches” the student model P(y|x, y_{<j}; θ_{x→y}) in a word-by-word way. Note that the KL divergence between the two models is defined at the word level:

\mathrm{KL}\big( P(y|z, y_{<j}; \hat{\theta}_{z\to y}) \,\big\|\, P(y|x, y_{<j}; \theta_{x\to y}) \big) = \sum_{y \in \mathcal{V}_y} P(y|z, y_{<j}; \hat{\theta}_{z\to y}) \log \frac{P(y|z, y_{<j}; \hat{\theta}_{z\to y})}{P(y|x, y_{<j}; \theta_{x\to y})},

where \mathcal{V}_y is the target vocabulary. As the parameters of the teacher model are fixed, the training objective can be equivalently written as

J_{\mathrm{WORD}}(\theta_{x\to y}) = -\sum_{\langle x,z \rangle \in D_{x,z}} \mathbb{E}_{y|z; \hat{\theta}_{z\to y}}\big[ S(x, y, z, \hat{\theta}_{z\to y}, \theta_{x\to y}) \big],   (11)

where

S(x, y, z, \hat{\theta}_{z\to y}, \theta_{x\to y}) = \sum_{j=1}^{|y|} \sum_{y \in \mathcal{V}_y} P(y|z, y_{<j}; \hat{\theta}_{z\to y}) \log P(y|x, y_{<j}; \theta_{x\to y}).   (12)

Therefore, our goal is to find a set of source-to-target model parameters that minimizes the training objective:

\hat{\theta}_{x\to y} = \mathop{\arg\min}_{\theta_{x\to y}} \big\{ J_{\mathrm{WORD}}(\theta_{x\to y}) \big\}.   (13)

We use similar approaches to those described in Section 3.2 for approximating the full search space in sentence-level teaching. After obtaining θ̂_{x→y}, the same decision rule as in Equation (1) can be utilized to find the most probable target sentence ŷ for a source sentence x.
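To show how Equation (12) amounts to a teacher-weighted cross-entropy, the following numerical sketch compares toy teacher and student distributions at each position of a teacher-generated target; the arrays are random stand-ins for the per-position softmax outputs of the two models, not values produced by our systems.

import numpy as np

rng = np.random.default_rng(0)
target_length, vocab_size = 3, 5

def normalize(a):
    return a / a.sum(axis=-1, keepdims=True)

teacher = normalize(rng.random((target_length, vocab_size)))  # P(y | z, y_<j) from the teacher
student = normalize(rng.random((target_length, vocab_size)))  # P(y | x, y_<j) from the student

# Negative of S(...) in Equation (12): teacher-weighted cross-entropy summed over
# positions j and vocabulary entries; minimizing it is the word-level objective.
word_level_loss = -np.sum(teacher * np.log(student))

# The word-level KL in Equation (10) differs from this loss only by the teacher
# entropy, which is constant with respect to the student parameters.
kl_per_position = np.sum(teacher * np.log(teacher / student), axis=-1)
print(word_level_loss, kl_per_position)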
4 Experiments

4.1 Setup

We evaluate our approach on the Europarl (Koehn, 2005) and WMT corpora. To compare with pivot-based methods, we use the same dataset as (Cheng et al., 2016a). All the sentences are tokenized by the tokenize.perl script. All the experiments treat English as the pivot language and French as the target language.

For the Europarl corpus, we evaluate our proposed methods on Spanish-French (Es-Fr) and German-French (De-Fr) translation tasks in a zero-resource scenario.
To avoid the source-pivot and pivot-target corpora forming a trilingual corpus, we split the overlapping pivot sentences of the original source-pivot and pivot-target corpora into two equal parts and merge them separately with the non-overlapping parts for each language pair. The development and test sets are from the WMT 2006 shared task (http://www.statmt.org/wmt07/shared-task.html). The evaluation metric is case-insensitive BLEU (Papineni et al., 2002) as calculated by the multi-bleu.perl script. To deal with out-of-vocabulary words, we adopt byte pair encoding (BPE) (Sennrich et al., 2016) to split words into sub-words. The size of the sub-word vocabulary is set to 30K for each language.

For the WMT corpus, we evaluate our approach on a Spanish-French (Es-Fr) translation task in a zero-resource setting. We combine the following corpora to form the Es-En and En-Fr parallel corpora: Common Crawl, News Commentary, Europarl v7, and UN. All the sentences are tokenized by the tokenize.perl script. Newstest2011 serves as the development set, and Newstest2012 and Newstest2013 serve as test sets. We use case-sensitive BLEU to evaluate translation results. BPE is also used to reduce the vocabulary size. The sizes of the sub-word vocabularies are set to 43K, 33K, and 43K for Spanish, English, and French, respectively. See Table 1 for detailed statistics of the Europarl and WMT corpora.

Corpus     Direction   Train    Dev.    Test
Europarl   Es→En       850K     2,000   2,000
           De→En       840K     2,000   2,000
           En→Fr       900K     2,000   2,000
WMT        Es→En       6.78M    3,003   3,003
           En→Fr       9.29M    3,003   3,003

Table 1: Data statistics. For the Europarl corpus, we evaluate our approach on Spanish-French (Es-Fr) and German-French (De-Fr) translation tasks. For the WMT corpus, we evaluate our approach on the Spanish-French (Es-Fr) translation task. English is used as a pivot language in all experiments.

We leverage dl4mt, an open-source NMT toolkit implemented in Theano (dl4mt-tutorial: https://github.com/nyu-dl), for all the experiments, and compare our approach with state-of-the-art multilingual methods (Firat et al., 2016b) and pivot-based methods (Cheng et al., 2016a). Two variations of our framework are used in the experiments:

1. Sentence-Level Teaching: for simplicity, we use the mode, as suggested in (Kim and Rush, 2016), to approximate the target sentence space when calculating the gradients of the expectation in Equation (7). We run beam search on the pivot sentence with the teacher model and choose the highest-scoring target sentence as the mode. Beam sizes of k = 1 (greedy decoding) and k = 5 are investigated in our experiments, denoted as sent-greedy and sent-beam, respectively (see the note after this list).

2. Word-Level Teaching: we use the same mode approximation approach as in sentence-level teaching to approximate the expectation in Equation (12), denoted as word-greedy (beam search with k = 1) and word-beam (beam search with k = 5), respectively. Besides, Monte Carlo estimation by sampling from the teacher model is also investigated, since it introduces more diverse data; this variant is denoted as word-sampling.

Note on sentence-level approximation: we can also adopt sampling or a k-best list for approximation. Random sampling brings a large variance (Sutskever et al., 2014; Ranzato et al., 2015; He et al., 2016) for sentence-level teaching. For a k-best list, we renormalize the probabilities

P(y|z; \hat{\theta}_{z\to y}) \sim \frac{P(y|z; \hat{\theta}_{z\to y})^{\alpha}}{\sum_{y \in \mathcal{Y}_k} P(y|z; \hat{\theta}_{z\to y})^{\alpha}},

where \mathcal{Y}_k is the k-best list from beam search with the teacher model and α is a hyperparameter controlling the sharpness of the distribution (Och, 2003). We set k = 5 and α = 5 × 10⁻³. The results on the test set for the Europarl corpus are 32.24 BLEU on Spanish-French translation and 24.91 BLEU on German-French translation, which are slightly better than the sent-beam method. However, considering the training time and the memory consumption, we believe mode approximation is already a good way to approximate the target sentence space for sentence-level teaching.

4.2 Assumptions Verification

To verify the assumptions in Section 3.1, we train a source-to-target translation model P(y|x; θ_{x→y}) and a pivot-to-target translation model P(y|z; θ_{z→y}) using the trilingual Europarl corpus. Then, we measure the sentence-level and word-level KL divergence from the source-to-target model P(y|x; θ_{x→y}) at different iterations to the trained pivot-to-target model P(y|z; θ̂_{z→y}) by calculating J_SENT (Equation (5)) and J_WORD (Equation (9)) on 2,000 parallel source-pivot sentences from the development set of the WMT 2006 shared task.
Objective   Approx.     Iterations: 0     2w     4w     6w     8w
J_SENT      greedy             313.0    73.1   61.5   56.8   55.1
J_SENT      beam               323.5    73.1   60.7   55.4   54.0
J_WORD      greedy             274.0    51.5   43.1   39.4   38.8
J_WORD      beam               288.7    52.7   43.3   39.2   38.4
J_WORD      sampling           268.6    53.8   46.6   42.8   42.4

Table 2: Verification of the sentence-level and word-level assumptions by evaluating the approximated KL divergence from the source-to-target model to the pivot-to-target model over training iterations of the source-to-target model. The pivot-to-target model is trained and kept fixed.

Method                                Es→Fr   De→Fr
Cheng et al. (2016a)   pivot          29.79   23.70
                       hard           29.93   23.88
                       soft           30.57   23.79
                       likelihood     32.59   25.93
Ours                   sent-beam      31.64   24.39
                       word-sampling  33.86   27.03

Table 3: Comparison with previous work on Spanish-French and German-French translation tasks from the Europarl corpus. English is treated as the pivot language. The likelihood method uses 100K parallel source-target sentences, which are not available to the other methods.

Table 2 shows the results. The source-to-target model is randomly initialized at iteration 0. We find that J_SENT and J_WORD decrease over time, suggesting that the source-to-target and pivot-to-target models do have a small KL divergence at both the sentence and word levels.

4.3 Results on the Europarl Corpus

Table 3 gives BLEU scores on the Europarl corpus of our best-performing sentence-level method (sent-beam) and word-level method (word-sampling) compared with the pivot-based methods of Cheng et al. (2016a). We use the same data preprocessing as in (Cheng et al., 2016a). We find that both the sent-beam and word-sampling methods outperform the pivot-based approaches in a zero-resource scenario across language pairs. Our word-sampling method improves over the best-performing zero-resource pivot-based method (soft) by +3.29 BLEU points on Spanish-French translation and by +3.24 BLEU points on German-French translation. In addition, the word-sampling method surprisingly obtains an improvement over the likelihood method, which leverages a source-target parallel corpus. The significant improvements can be explained by the error propagation problem of pivot-based methods, which propagates translation errors from the source-to-pivot translation process to the pivot-to-target translation process.

Method          Es→Fr (dev / test)    De→Fr (dev / test)
sent-greedy       31.00 / 31.05         22.34 / 21.88
sent-beam         31.57 / 31.64         24.95 / 24.39
word-greedy       31.37 / 31.92         24.72 / 25.15
word-beam         30.81 / 31.21         24.64 / 24.19
word-sampling     33.65 / 33.86         26.99 / 27.03

Table 4: Comparison of our proposed methods on Spanish-French and German-French translation tasks from the Europarl corpus. English is treated as the pivot language.

Table 4 shows BLEU scores on the Europarl corpus for our five proposed methods. For sentence-level approaches, the sent-beam method outperforms the sent-greedy method by +0.59 BLEU points on Spanish-French translation and +2.51 BLEU points on German-French translation on the test set. The results are in line with our observation in Table 2 that the sentence-level KL divergence under beam approximation is smaller than that under greedy approximation. However, as the time complexity grows linearly with the number of beams k, the better performance is achieved at the expense of search time.
Figure 2: Validation loss (left) and BLEU (right) across training iterations (×10⁴) of our proposed methods (sent-greedy, sent-beam, word-greedy, word-beam, and word-sampling).

                                     Training                     BLEU
Method                             Es→En    En→Fr    Es→Fr   Newstest2012  Newstest2013
Existing zero-resource NMT systems
Cheng et al. (2016a)† pivot        6.78M    9.29M      -         24.60           -
Cheng et al. (2016a)† likelihood   6.78M    9.29M     100K       25.78           -
Firat et al. (2016b) one-to-one    34.71M   65.77M     -         17.59         17.61
Firat et al. (2016b)† many-to-one  34.71M   65.77M     -         21.33         21.19
Our zero-resource NMT system
word-sampling                      6.78M    9.29M      -         28.06         27.03

Table 5: Comparison with previous work on Spanish-French translation in a zero-resource scenario over the WMT corpus. The BLEU scores are case-sensitive. †: the method depends on two-step decoding.

For word-level experiments, we observe that the word-sampling method performs much better than the other two methods: +1.94 BLEU points on Spanish-French translation and +1.88 BLEU points on German-French translation over the word-greedy method; +2.65 BLEU points on Spanish-French translation and +2.84 BLEU points on German-French translation over the word-beam method. Although Table 2 shows that the word-level KL divergence approximated by sampling is larger than that approximated by greedy or beam search, sampling introduces more data diversity for training, which dominates the effect of the KL divergence difference.

We plot validation loss (the average negative log-likelihood of sentence pairs on the validation set) and BLEU scores over iterations on the German-French translation task in Figure 2. We observe that word-level models tend to have lower validation loss than sentence-level methods. Generally, models with lower validation loss tend to have higher BLEU. Our results indicate that this is not necessarily the case: the sent-beam method converges to +0.31 BLEU points on the validation set with +13 validation loss compared with the word-beam method. Kim and Rush (2016) report a similar observation in data distillation for NMT and provide an explanation that student distributions are more peaked for sentence-level methods. This is indeed the case in our results: on the German-French translation task, the argmax for the sent-beam student model on average accounts for approximately 3.49% of the total probability mass, while the corresponding number is 1.25% for the word-beam student model and 2.60% for the teacher model.
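For reference, the peakedness statistic quoted above can be computed as the average probability mass assigned to the argmax token of the output distribution, as in the following sketch; the toy array stands in for per-position softmax outputs collected while a model translates a held-out set.

import numpy as np

def average_argmax_mass(probs):
    """probs: array of shape (num_positions, vocab_size), rows summing to 1."""
    return float(np.mean(np.max(probs, axis=-1)))

rng = np.random.default_rng(0)
toy = rng.random((1000, 30000))          # hypothetical collected softmax outputs
toy /= toy.sum(axis=-1, keepdims=True)
print(average_argmax_mass(toy))          # near-uniform rows give a tiny argmax mass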
4.4 Results on the WMT Corpus

The word-sampling method obtains the best performance among our five proposed approaches in the experiments on the Europarl corpus. To further verify this approach, we conduct experiments on the large-scale WMT corpus for Spanish-French translation.
source                  Os sentáis al volante en la costa oeste , en San Francisco , y vuestra misión es llegar los primeros a Nueva York .
groundtruth    pivot    You get in the car on the west coast , in San Francisco , and your task is to be the first one to reach New York .
               target   Vous vous asseyez derrière le volant sur la côte ouest à San Francisco et votre mission est d' arriver le premier à New York .
pivot          pivot    You 'll feel at the west coast in San Francisco , and your mission is to get the first to New York . [BLEU: 33.93]
               target   Vous vous sentirez comme chez vous à San Francisco , et votre mission est d' obtenir le premier à New York . [BLEU: 44.52]
likelihood     pivot    You feel at the west coast , in San Francisco , and your mission is to reach the first to New York . [BLEU: 47.22]
               target   Vous vous sentez à la côte ouest , à San Francisco , et votre mission est d' atteindre le premier à New York . [BLEU: 49.44]
word-sampling  target   Vous vous sentez au volant sur la côte ouest , à San Francisco et votre mission est d' arriver le premier à New York . [BLEU: 78.78]

Table 6: Examples and corresponding sentence-level BLEU scores of translations using the pivot and likelihood methods in (Cheng et al., 2016a) and the proposed word-sampling method. We observe that our approach generates better translations than the methods in (Cheng et al., 2016a). We italicize correct translation segments that are no shorter than 2-grams.

Table 5 shows the results of our word-sampling method in comparison with other state-of-the-art baselines. Cheng et al. (2016a) use the same datasets and the same preprocessing as ours. Firat et al. (2016b) utilize a much larger training set (their training set does not include the Common Crawl corpus). Our method obtains a significant improvement of +3.46 BLEU points over the pivot baseline on Newstest2012 and of +5.84 BLEU points over many-to-one on Newstest2013. Note that both of these methods depend on a source-pivot-target decoding path. Table 6 shows translation examples of the pivot and likelihood methods proposed in (Cheng et al., 2016a) and of our proposed word-sampling method. For the pivot and likelihood methods, the Spanish sentence segment “sentáis al volante” is lost when translated to English. Therefore, both methods miss this information in the translated French sentence. However, the word-sampling method generates “volant sur”, which partially translates “sentáis al volante”, resulting in improved translation quality of the target-language sentence.

4.5 Results with Small Source-Pivot Data

The word-sampling method can also be applied to zero-resource NMT with a small source-pivot corpus. Specifically, the size of the source-pivot corpus is orders of magnitude smaller than that of the pivot-target corpus. This setting makes sense in applications. For example, there are significantly fewer Urdu-English corpora available than English-French corpora.

To fulfill this task, we combine our best performing word-sampling method with the initialization and parameter freezing strategy proposed in (Zoph et al., 2016). The Europarl corpus is used in the experiments. We set the size of the German-English training data to 100K and use the same teacher model trained with 900K English-French sentences.
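A possible way to combine the two ingredients is sketched below in PyTorch; the toy model and the choice of which parameters to copy and freeze are illustrative assumptions, not details taken from our implementation or from Zoph et al. (2016).

import torch
import torch.nn as nn

# Toy stand-in for a sequence-to-sequence NMT model; a real system would have an
# encoder, attention, and a much larger decoder.
class ToySeq2Seq(nn.Module):
    def __init__(self, vocab=100, dim=8):
        super().__init__()
        self.src_embed = nn.Embedding(vocab, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.output = nn.Linear(dim, vocab)

teacher = ToySeq2Seq()  # trained on the large pivot-target (En-Fr) corpus
student = ToySeq2Seq()  # to be trained on the small source-pivot (De-En) data

# 1. Initialize the student with the teacher's target-side parameters (assumed choice).
student.decoder.load_state_dict(teacher.decoder.state_dict())
student.output.load_state_dict(teacher.output.state_dict())

# 2. Freeze the copied parameters; only the source-side parameters are updated.
for module in (student.decoder, student.output):
    for param in module.parameters():
        param.requires_grad_(False)

trainable = [p for p in student.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable parameters")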
Method     Corpus: De-En   De-Fr   En-Fr    BLEU
MLE                  ×       √       ×      19.30
transfer             ×       √       √      22.39
pivot                √       ×       √      17.32
Ours                 √       ×       √      22.95

Table 7: Comparison on the German-French translation task from the Europarl corpus with 100K German-English sentences. English is regarded as the pivot language. Transfer represents the transfer learning method in (Zoph et al., 2016). 100K parallel German-French sentences are used for the MLE and transfer methods.

Table 7 gives the BLEU score of our method on German-French translation compared with three other methods. Note that our task is much harder than transfer learning (Zoph et al., 2016), since the latter depends on a parallel German-French corpus. Surprisingly, our method outperforms all other methods. We significantly improve over the baseline pivot method by +5.63 BLEU points and over the state-of-the-art transfer learning method by +0.56 BLEU points.
5 Related Work

Training NMT models in a zero-resource scenario by leveraging other languages has attracted intensive attention in recent years. Firat et al. (2016b) proposed an approach that applies the multi-way, multilingual NMT model of Firat et al. (2016a) to zero-resource translation. They used the multi-way NMT model trained on other language pairs to generate a pseudo parallel corpus and fine-tuned the attention mechanism of the multi-way NMT model to enable zero-resource translation. Several authors proposed a universal encoder-decoder network in multilingual scenarios to perform zero-shot learning (Johnson et al., 2016; Ha et al., 2016). This universal model extracts translation knowledge from multiple different languages, making zero-resource translation feasible without direct training.

Besides multilingual NMT, another important line of research attempts to bridge source and target languages via a pivot language. This idea is widely used in SMT (de Gispert and Mariño, 2006; Cohn and Lapata, 2007; Utiyama and Isahara, 2007; Wu and Wang, 2007; Bertoldi et al., 2008; Wu and Wang, 2009; Zahabi et al., 2013; Kholy et al., 2013). Cheng et al. (2016a) propose pivot-based NMT that simultaneously improves source-to-pivot and pivot-to-target translation quality in order to improve source-to-target translation quality. Nakayama and Nishida (2016) achieve zero-resource machine translation by utilizing images as a pivot and training multimodal encoders to share a common semantic representation.

Our work is also related to knowledge distillation, which trains a compact model to approximate the function learned by a larger, more complex model or an ensemble of models (Bucila et al., 2006; Ba and Caurana, 2014; Li et al., 2014; Hinton et al., 2015). Kim and Rush (2016) first introduced knowledge distillation in neural machine translation. They suggest generating a pseudo corpus to train the student network. Compared with their work, we focus on zero-resource learning instead of model compression.

6 Conclusion

In this paper, we propose a novel framework for training a source-to-target student model without parallel corpora available, under the guidance of a pre-trained pivot-to-target teacher model, on a source-pivot parallel corpus. We introduce sentence-level and word-level teaching to guide the learning process of the student model. Experiments on the Europarl and WMT corpora across languages show that our proposed word-level sampling method significantly outperforms state-of-the-art pivot-based methods and multilingual methods in terms of both translation quality and decoding efficiency.

We also analyze zero-resource translation with small source-pivot data, and combine our word-level sampling method with the initialization and parameter freezing strategy suggested by Zoph et al. (2016). The experiments on the Europarl corpus show that our approach obtains a significant improvement over the pivot-based baseline.

In the future, we plan to test our approach on more diverse language pairs, e.g., zero-resource Uyghur-English translation using Chinese as a pivot. It is also interesting to extend the teacher-student framework to other cross-lingual NLP applications, as our method is transparent to architectures.

Acknowledgments

This work was done while Yun Chen was visiting Tsinghua University. This work is partially supported by the National Natural Science Foundation of China (No. 61522204, No. 61331013) and the 863 Program (2015AA015407).
References

Jimmy Ba and Rich Caurana. 2014. Do deep nets really need to be deep? In NIPS.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008. Phrase-based statistical machine translation with pivot languages. In IWSLT.

Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In KDD.

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016a. Neural machine translation with pivot languages. CoRR abs/1611.04928.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016b. Semi-supervised learning for neural machine translation.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In ACL.

Adrià de Gispert and José B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus: bridging through Spanish. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 65–68.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In HLT-NAACL.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman-Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In EMNLP.

Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. CoRR abs/1611.04798.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In NIPS.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs/1503.02531.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

Ahmed El Kholy, Nizar Habash, Gregor Leusch, Evgeny Matusov, and Hassan Sawaf. 2013. Language independent connectivity strength features for phrase pivot statistical machine translation.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.

Philipp Koehn. 2005. Europarl: a parallel corpus for statistical machine translation.

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. Learning small-size DNN with output-distribution-based criteria. In INTERSPEECH.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

Hideki Nakayama and Noriki Nishida. 2016. Zero-resource machine translation by multimodal encoder-decoder network with multimedia pivot. CoRR abs/1611.04503.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In HLT-NAACL.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Machine Translation 21:165–181.

Hua Wu and Haifeng Wang. 2009. Revisiting pivot language approach for machine translation. In ACL/IJCNLP.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.

Samira Tofighi Zahabi, Somayeh Bakhshaei, and Shahram Khadivi. 2013. Using context vectors in improving a machine translation system with bridge language. In ACL.

Xiaoning Zhu, Zhongjun He, Hua Wu, Haifeng Wang, Conghui Zhu, and Tiejun Zhao. 2013. Improving pivot-based statistical machine translation using random walk. In EMNLP.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In EMNLP.
