
Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets

Zhen Yang1,2, Wei Chen1*, Feng Wang1,2, Bo Xu1
1 Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
{yangzhen2014, wei.chen.media, feng.wang, xubo}@ia.ac.cn

arXiv:1703.04887v2 [cs.CL] 17 Apr 2017

Abstract

This paper proposes a new route for applying generative adversarial nets (GANs) to NLP tasks (taking neural machine translation as an instance) and shows that the widespread perception that GANs cannot work well in the NLP area is unfounded. In this work, we build a conditional sequence generative adversarial net which comprises two adversarial sub-models: a generative model (generator), which translates the source sentence into the target sentence as traditional neural machine translation (NMT) models do, and a discriminative model (discriminator), which discriminates the machine-translated target sentence from the human-translated one. From the perspective of the Turing test, the proposed model is trained to generate translations that are indistinguishable from human-translated sentences. Experiments show that the proposed model achieves significant improvements over a strong baseline model; in Chinese-English translation tasks, we obtain up to +2.5 BLEU points of improvement. To the best of our knowledge, this is the first time that quantitative results about the application of GANs to a traditional NLP task are reported. Meanwhile, we present detailed strategies for GAN training. In addition, we find that the discriminator of the proposed model performs well in data cleaning.

1 Introduction

Machine translation is one of the traditional NLP tasks; it aims to translate a source-language sentence into the corresponding target-language sentence automatically. Recently, with the rapid development of deep neural networks, neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014; Ranzato et al., 2015), which leverages a single neural network to directly transform the source sentence into the target sentence, has obtained state-of-the-art performance for several language pairs (Wu et al., 2016; Johnson et al., 2016; Bradbury and Socher, 2016). This end-to-end NMT typically consists of two recurrent sub-networks. The encoder network reads and encodes the source sentence into a context vector representation, and the decoder network generates the target sentence word by word based on the context vector. To dynamically generate a context vector for the target word being generated, the attention mechanism is usually deployed. Optimization of this NMT model directly maximizes the likelihood of the training data: at each decoding step, the model is optimized to maximize the likelihood estimation (MLE) of the ground-truth word at the current step. Ranzato et al. (2015) point out that the MLE loss function is only defined at the word level instead of the sentence level, so the NMT model may generate the best candidate word for the current time step yet a bad component of the whole sentence in the long run. Shen et al. (2015) give a solution by introducing minimum risk training from statistical machine translation (SMT): they incorporate sentence-level BLEU (Chen and Cherry, 2014) into the loss function, so the NMT model is optimized to generate sentences with higher BLEU scores. Since the BLEU score is computed as the geometric mean of the modified n-gram precisions (Papineni et al., 2002), we conclude that almost all prior NMT models are trained to cover more n-grams of the ground-truth target sentence (MLE can be viewed as training the NMT model to cover more 1-grams of the target sentence).
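To make the n-gram coverage point concrete, here is a small sketch (our illustration, not part of the paper) of how a smoothed sentence-level BLEU score aggregates modified n-gram precisions through a geometric mean, which is why a word-level MLE objective roughly corresponds to optimizing 1-gram coverage only:

```python
from collections import Counter
import math

def sentence_bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty (illustrative only)."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # modified precision: clip candidate counts by reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec_sum += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(log_prec_sum / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * geo_mean

print(sentence_bleu("the cat sat on the mat".split(),
                    "the cat is on the mat".split()))
```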
However, it is widely acknowledged that higher n-gram precisions do not ensure a better sentence (Callison-Burch and Osborne, 2006; Chatterjee et al., 2007). Additionally, such a manually defined loss function is unable to cover all crucial aspects, and the NMT model may be trained to deviate from the data distribution and generate suboptimal sentences. Intuitively, the model should be trained to directly generate a human-like translation instead of covering human-designed approximation features. From the Turing test perspective, we should make the model aware of what a human-generated sentence looks like and train it to generate sentences that are indistinguishable from human-generated ones. Based on the analysis above, we propose that a good training objective for NMT should satisfy: 1) the loss function is defined at the sentence level rather than the word level; 2) no manually defined approximation feature is used to guide the NMT model; 3) the NMT model is directly exposed to the true data distribution. Specifically, the model should be trained to directly output translations indistinguishable from human-generated translations, and if a poor sentence is generated, the model should be penalized according to how far the poor sentence is from the human-generated one.

Borrowing the idea of generative adversarial training in computer vision (Goodfellow et al., 2014; Denton et al., 2015), we build a conditional sequence generative adversarial net (CSGAN) which implements the training objective mentioned above. In the proposed CSGAN-NMT, we jointly train two models: a generator (implemented as a traditional NMT model), which generates the target-language sentence based on the input source-language sentence, and a discriminator, which, conditioned on the source-language sentence, predicts the probability of the target-language sentence being a human-generated one. During the training process, the generator aims to fool the discriminator into believing that its output is a human-generated sentence, and the discriminator tries not to be fooled by improving its ability to distinguish the machine-generated sentence from the human-generated one. This kind of adversarial training achieves a win-win situation when the generator and the discriminator reach a Nash equilibrium (Zhao et al., 2016).

In summary, we mainly make the following contributions:

• To the best of our knowledge, we are the first to introduce generative adversarial training into NMT, training the NMT model to generate sentences which are indistinguishable from human-generated sentences. We build a conditional generative adversarial net which can be applied to any end-to-end NMT system, since we do not assume a specific architecture of the NMT model.

• Extensive experiments on Chinese-to-English translation tasks show that the proposed CSGAN-NMT significantly outperforms the strong attention-based NMT model which serves as the baseline. We present detailed, quantitative results to demonstrate the effectiveness of the proposed CSGAN-NMT. This indicates the feasibility of applying GANs to traditional NLP tasks.

• We successfully leverage the discriminator to clean the training data for NMT.

• We test different architectures for the discriminator: a convolutional neural network (CNN) based one and a recurrent neural network (RNN) based one. We find that RNNs are not suitable for the discriminator in this setting.

• We report our specific training strategies for the proposed CSGAN-NMT. This provides a new, reliable route for applying generative adversarial nets to other NLP tasks.

2 Related work

2.1 Neural machine translation

This subsection briefly describes the attention-based NMT model, which simultaneously conducts dynamic alignment and generation of the target sentence. The NMT model produces the translation by generating one target word at every time step. Given an input sequence x = (x_1, ..., x_{T_x}) and previously translated words (y_1, ..., y_{i-1}), the probability of the next word y_i is:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)    (1)
where s_i is the decoder hidden state at time step i, which is computed as:

s_i = f(s_{i-1}, y_{i-1}, c_i)    (2)

Here f and g are nonlinear transform functions, which can be implemented as a long short-term memory network (LSTM) or a gated recurrent unit (GRU), and c_i is a distinct context vector at time step i, calculated as a weighted sum of the input annotations h_j:

c_i = \sum_{j=1}^{T_x} a_{i,j} h_j    (3)

where h_j is the annotation of x_j from a bidirectional RNN. The weight a_{i,j} for h_j is calculated as:

a_{i,j} = exp(e_{i,j}) / \sum_{t=1}^{T_x} exp(e_{i,t})    (4)

where

e_{i,j} = v_a tanh(W s_{i-1} + U h_j)    (5)

where v_a is a weight vector and W and U are weight matrices. All parameters of the NMT model are optimized to maximize the following conditional log-likelihood of the M sentence-aligned bilingual samples:

\ell(\theta) = (1/M) \sum_{m=1}^{M} \sum_{i=1}^{T_y} log p(y_i^m | y_{<i}^m, x^m, \theta)    (6)
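To make Equations (1)-(5) concrete, the NumPy sketch below implements a single attention-based decoding step. It is a minimal illustration, not the dl4mt implementation: the GRU-style state update f is collapsed into a single tanh layer, and all weight names (W_att, U_att, v_a, W_s, W_out) are placeholders introduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_prev, y_prev_emb, H, W_att, U_att, v_a, W_s, W_out):
    """One decoding step of an attention-based NMT model (Eqs. 1-5).

    s_prev: previous decoder state, shape (d,)
    y_prev_emb: embedding of the previously generated word, shape (k,)
    H: encoder annotations h_1..h_Tx, shape (Tx, d)
    """
    # Eq. (5): alignment scores e_{i,j} = v_a tanh(W s_{i-1} + U h_j)
    scores = np.tanh(s_prev @ W_att + H @ U_att) @ v_a          # (Tx,)
    # Eq. (4): normalized attention weights a_{i,j}
    a = softmax(scores)                                          # (Tx,)
    # Eq. (3): context vector c_i as a weighted sum of annotations
    c = a @ H                                                     # (d,)
    # Eq. (2): new decoder state s_i = f(s_{i-1}, y_{i-1}, c_i);
    # the real model uses a GRU, a single tanh layer stands in for f here
    s = np.tanh(np.concatenate([s_prev, y_prev_emb, c]) @ W_s)   # (d,)
    # Eq. (1): distribution over the target vocabulary, g(y_{i-1}, s_i, c_i)
    p_y = softmax(np.concatenate([y_prev_emb, s, c]) @ W_out)    # (V,)
    return p_y, s, c
```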
2.2 Generative adversarial net

The generative adversarial network, in which a generative model is trained to generate outputs that fool a discriminator, has enjoyed great success in computer vision and has been widely applied to image generation. Conditional generative adversarial nets extend the generative adversarial network to a conditional setting, which enables the networks to condition on some arbitrary external data.

However, to the best of our knowledge, this idea has not been applied to traditional NLP tasks with comparable success, and few quantitative experimental results have been reported. Some recent works have begun to apply generative adversarial training in the NLP area: Chen et al. (2016) apply the idea of generative adversarial training to sentiment analysis, and Zhang et al. (2017) use the idea for domain adaptation tasks. For sequence generation problems, Yu et al. (2016) leverage policy gradient reinforcement learning to back-propagate the reward from the discriminator, showing presentable results for poem generation, speech language generation and music generation. Similarly, Zhang et al. (2016) generate text from random noise via adversarial training. A striking difference from the works mentioned above is that our work is in a conditional setting, where the target-language sentence is generated conditioned on the source-language one.

In parallel to our work, Li et al. (2017) propose a similar conditional sequence generative adversarial training approach for dialogue generation. They use a hierarchical LSTM architecture for the discriminator. In contrast to their approach, we apply a CNN-based discriminator to the machine translation task. Furthermore, we present detailed training strategies for the proposed model, and extensive quantitative results are reported.

3 The CSGAN-NMT

In this section, we describe in detail the CSGAN-NMT, which consists of a generator G that generates the target-language sentence based on the source-language sentence and a discriminator D that distinguishes the machine-generated sentence from the human-generated one. The sentence generation process is viewed as a sequence of actions taken according to a policy regulated by the generator. In this work, we adopt the same policy gradient training strategies as Yu et al. (2016).

3.1 Generator

Resembling the traditional NMT model, the generator G generates the target-language sentence conditioned on the input source-language sentence. It defines the policy that generates the target sentence y given the source sentence x. The generator takes exactly the same architecture as the traditional NMT model; note that we do not assume a specific architecture for the generator. Here, we adopt the strong attention-based NMT model implemented in the open-source system dl4mt (https://github.com/nyu-dl/dl4mt-tutorial) as the generator.

3.2 Discriminator

Recently, deep discriminative models such as the CNN and the RNN have shown high performance in complicated sequence classification tasks.
Figure 1: The CNN-based architecture for the discriminator. Note that only the source-side CNN is depicted.

Figure 2: The BiLSTM-based architecture for the discriminator. Only the source-side BiLSTM is depicted.

To test the efficacy of the discriminator, we propose two different architectures: a CNN-based and an RNN-based discriminator.

CNN-based. Since the sentences generated by the generator have variable lengths, CNN padding is used to transform each sentence into a sequence of fixed length T, the maximum input length set by the user beforehand. Given the source-language sequence x_1, ..., x_T and the target-language sequence y_1, ..., y_T, we build the source matrix X_{1:T} and the target matrix Y_{1:T} respectively as:

X_{1:T} = x_1; x_2; ...; x_T    (7)

and

Y_{1:T} = y_1; y_2; ...; y_T    (8)

where x_t, y_t ∈ R^k are k-dimensional word embeddings and the semicolon is the concatenation operator. For the source matrix X_{1:T}, a kernel w_j ∈ R^{l×k} applies a convolutional operation to a window of l words to produce a series of feature maps:

c_i^j = \rho(BN(w_j ⊗ X_{i:i+l-1} + b))    (9)

where the ⊗ operator is the summation of element-wise products and b is a bias term. \rho is a nonlinear activation function, implemented as ReLU in this paper. Note that batch normalization (Ioffe and Szegedy, 2015), which accelerates the training significantly, is applied to the input of the activation function (BN in Equation 9). To get the final feature with respect to kernel w_j, a max-over-time pooling operation is applied over the feature maps:

\tilde{c}^j = max{c_1^j, ..., c_{T-l+1}^j}    (10)

We use various numbers of kernels with different window sizes to extract different features, which are finally concatenated to form the source-language sentence representation c_x. Identically, the target-language sentence representation c_y is extracted from the target matrix Y_{1:T}. Finally, given the source-language sentence, the probability that the target-language sentence is real is computed as:

p = \sigma(V[c_x; c_y])    (11)

where V is the transform matrix which maps the concatenation of c_x and c_y into a 2-dimensional embedding and \sigma is the logistic function. The CNN-based discriminator is depicted in Figure 1.

RNN-based. The recurrent neural network has several variants, such as the LSTM, the GRU and the simple recurrent neural network (simple-RNN). This paper takes the LSTM as an instance. Given the source-language sequence x_1, ..., x_s, an LSTM maps the input sequence into a sequence of hidden states h_1, ..., h_s by applying the LSTM cell update recursively:

h_t = lstm(h_{t-1}, x_t)    (12)

The vector representation c_x of the source-language sentence is computed as the average of the hidden states, and the target sentence vector c_y is computed in the same way. Finally, the probability that the target-language sentence is real is computed as in Equation 11. We also consider the bidirectional LSTM as an alternative to the LSTM. The graphical illustration of the BiLSTM-based discriminator is shown in Figure 2.
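As a concrete illustration of Equations (7)-(11), the following NumPy sketch computes the CNN-based discriminator's output for one padded sentence pair. It is a simplified reading of the description: batch normalization is omitted, the source and target sides share one kernel set, and reading the "real" class from the 2-dimensional output is one interpretation of Equation (11). All names are ours, not from the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_max_features(E, kernels, biases):
    """Eqs. (9)-(10): convolve each kernel over the embedding matrix and
    max-pool over time (batch normalization omitted for brevity).

    E: padded sentence matrix, shape (T, k)  -- Eq. (7)/(8)
    kernels: list of arrays, each of shape (l, k)
    """
    feats = []
    for w, b in zip(kernels, biases):
        l, T = w.shape[0], E.shape[0]
        # c_i = ReLU(sum of element-wise products over an l-word window + b)
        c = np.array([np.maximum(0.0, np.sum(w * E[i:i + l]) + b)
                      for i in range(T - l + 1)])
        feats.append(c.max())            # max-over-time pooling
    return np.array(feats)

def discriminator_prob(X_emb, Y_emb, kernels, biases, V):
    """Eq. (11): probability that the target sentence is human-generated,
    given the source sentence."""
    c_x = conv_max_features(X_emb, kernels, biases)
    c_y = conv_max_features(Y_emb, kernels, biases)
    logits = V @ np.concatenate([c_x, c_y])   # V maps to a 2-dim embedding
    return sigmoid(logits)[1]                 # read out the "real" component
```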
3.3 Policy gradient training

Following Yu et al. (2016), the objective of the generator G is to generate a sequence from the start state that maximizes its expected end reward. Formally, the objective function is computed as:

J(\theta) = \sum_{Y_{1:T-1}} G_\theta(Y_{1:T} | X) \cdot R_D^{G_\theta}((Y_{1:T-1}, X), y_T)    (13)

where Y_{1:T} = y_1, ..., y_T is the generated target sequence and R_D^{G_\theta} is the action-value function of a target-language sentence given the source sentence X, i.e. the expected accumulative reward starting from the state (Y_{1:T-1}, X), taking action y_T, and then following the policy G_\theta. To estimate the action-value function, we take the probability of being real estimated by the discriminator D as the reward:

R_D^{G_\theta}((Y_{1:T-1}, X), y_T) = D(X, Y_{1:T}) - b(X, Y_{1:T})    (14)

where b(X, Y) denotes a baseline value used to reduce the variance of the reward; practically, we take b(X, Y) as a constant, 0.5, for simplicity. The problem is that, given the source sequence, the discriminator D only provides a reward value for a finished target sequence: if Y_{1:T} is not a finished target sequence, the value of D(X, Y_{1:T}) makes no sense, so we cannot get the action-value for an intermediate state directly. To evaluate the action-value for an intermediate state, a Monte Carlo search under the policy G is applied to sample the unknown tokens. Each search ends when the end-of-sentence token is sampled or the sampled sentence reaches the maximum length. To obtain a more stable reward and reduce the variance, we run an N-time Monte Carlo search, represented as:

{Y_{1:T_1}^1, ..., Y_{1:T_N}^N} = MC^{G_\theta}((Y_{1:t}, X), N)    (15)

where T_i is the length of the sentence sampled by the i-th Monte Carlo search, (Y_{1:t}, X) = (y_1, ..., y_t, X) is the current state, and Y_{t+1:T_n}^n is sampled based on the policy G. The discriminator provides N rewards for the N sampled sentences respectively, and the final reward for the intermediate state is calculated as the average of the N rewards. Hence, for a target sentence of length T, we compute the sentence-level reward for y_t as:

R_D^{G_\theta}((Y_{1:t-1}, X), y_t) =
  (1/N) \sum_{n=1}^{N} [D(X, Y_{1:T_n}^n) - b(X, Y_{1:T_n}^n)], with Y_{1:T_n}^n ∈ MC^{G_\theta}((Y_{1:t}, X), N),  if t < T
  D(X, Y_{1:t}) - b(X, Y_{1:t}),  if t = T    (16)

Using the discriminator as a reward function allows the generator to be improved iteratively by dynamically updating the discriminator. Once we obtain more realistic generated sequences, we re-train the discriminator as:

min  -E_{X,Y ∈ P_data}[log D(X, Y)] - E_{X,Y ∈ G}[log(1 - D(X, Y))]    (17)

After updating the discriminator, we are ready to re-train the generator. The gradient of the objective function J(\theta) with respect to the generator's parameters \theta is calculated as:

\nabla J(\theta) = (1/T) \sum_{t=1}^{T} \sum_{y_t} R_D^{G_\theta}((Y_{1:t-1}, X), y_t) \cdot \nabla_\theta G_\theta(y_t | Y_{1:t-1}, X)
               = (1/T) \sum_{t=1}^{T} E_{y_t ∈ G_\theta}[R_D^{G_\theta}((Y_{1:t-1}, X), y_t) \cdot \nabla_\theta log p(y_t | Y_{1:t-1}, X)]    (18)
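The sketch below illustrates Equations (14)-(16) and the REINFORCE-style update of Equation (18): Monte Carlo rollouts complete a partial translation, the discriminator scores the completions, and the averaged, baseline-subtracted score becomes the reward for the prefix. It is a schematic under stated assumptions, not the authors' code; the callables sample_continuation, prob_real and grad_log_prob are placeholders for the generator's sampler, the discriminator's output and the generator's score-function gradient.

```python
def rollout_reward(x, prefix, T, sample_continuation, prob_real, N=20, baseline=0.5):
    """Eq. (16): sentence-level reward for a partial translation `prefix`.

    sample_continuation(x, prefix) -> a completed sentence (one Monte Carlo rollout)
    prob_real(x, y) -> discriminator probability that y is human-translated
    """
    if len(prefix) == T:                 # finished sentence: Eq. (14)
        return prob_real(x, prefix) - baseline
    rewards = []
    for _ in range(N):                   # Eq. (15): N-time Monte Carlo search
        full = sample_continuation(x, prefix)
        rewards.append(prob_real(x, full) - baseline)
    return sum(rewards) / N

def policy_gradient(x, y_sampled, sample_continuation, prob_real, grad_log_prob, N=20):
    """Eq. (18): accumulate REINFORCE gradients over one sampled translation.

    grad_log_prob(x, prefix, y_t) -> gradient of log p(y_t | prefix, x) w.r.t. theta
    """
    T = len(y_sampled)
    grad = 0.0
    for t in range(1, T + 1):
        r = rollout_reward(x, y_sampled[:t], T, sample_continuation, prob_real, N)
        grad += r * grad_log_prob(x, y_sampled[:t - 1], y_sampled[t - 1])
    return grad / T
```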
4 Training strategies

It is hard to train generative adversarial networks, since the generator and the discriminator need to be carefully synchronized. To make this work easier to reproduce, we give detailed strategies for training the CSGAN-NMT model.

First, we use maximum likelihood estimation to pre-train the generator on the parallel training set s until the best translation performance is achieved.

Then, we obtain the machine-generated sentences by using the generator to decode the training data. We simply use greedy sampling instead of beam search for decoding, so it is very fast to decode the whole training set.

Next, we pre-train the discriminator on the combination of the true parallel data and the machine-generated data until its classification accuracy reaches f.

Finally, we jointly train the generator and the discriminator. The generator is trained with the policy gradient method. We randomly sample a batch of source sentences from s as the training examples for the generator; the batch size is βg. Note that the target sentences are not used while the generator undergoes policy gradient training. However, in our practice, we find that updating the generator with the simple policy gradient training alone leads to unstable training: the translation performance drops sharply after a few updates.
We conjecture that this is because the generator can only access the golden target sentence indirectly, through the reward passed back from the discriminator, and this reward is used only to promote or discourage the machine-generated sentences. To alleviate this issue, we adopt a teacher forcing approach similar to Lamb et al. (2016) and Li et al. (2017): we directly make the discriminator assign a reward of 1 to the golden target-language sentence, and the generator uses this reward to update itself on the true parallel examples. We run teacher forcing once each time the generator is updated by policy gradient training. After the generator gets updated, we use the new, stronger generator to generate η more realistic sentences, which are then used to re-train the discriminator. The batch size for training the discriminator is referred to as βd. Following Arjovsky et al. (2017), we clamp the weights of the discriminator to a fixed box [−ε, ε] after each gradient update. We perform one optimization step for the discriminator for each step of the generator.

In our practice, we set f to 0.82, βg to 100, βd to 64, η to 5000, ε to 1, and the N for the Monte Carlo search to 20. We apply the Adam optimization method with an initial learning rate of 0.001 for pre-training the generator and the discriminator. During generative adversarial training, the RMSProp optimization method with an initial learning rate of 0.0001 is used for both the generator and the discriminator.
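A compact sketch of the overall schedule described in this section is given below: pre-train G with MLE, decode the training data, pre-train D, then alternate policy-gradient updates, teacher forcing and discriminator updates with weight clamping. The hyper-parameter values follow the paper, but every method name on the generator and discriminator objects is a placeholder of ours, not a dl4mt API.

```python
import random

# Hyper-parameter values from Section 4.
F_ACC, BATCH_G, BATCH_D, ETA, EPS, MC_N = 0.82, 100, 64, 5000, 1.0, 20

def train_csgan_nmt(generator, discriminator, parallel_data, num_steps=10000):
    """Schematic driver; `generator`/`discriminator` are assumed to expose
    the listed methods (placeholders, not real library calls)."""
    # 1) Pre-train the generator with MLE until translation quality plateaus.
    generator.pretrain_mle(parallel_data)
    # 2) Decode the training data with greedy sampling to obtain machine-generated targets.
    fake_data = [(x, generator.greedy_decode(x)) for x, _ in parallel_data]
    # 3) Pre-train the discriminator until its accuracy reaches f = 0.82.
    discriminator.pretrain(real=parallel_data, fake=fake_data, target_acc=F_ACC)
    # 4) Joint adversarial training.
    for _ in range(num_steps):
        batch = random.sample(parallel_data, BATCH_G)
        # Policy-gradient update (Eqs. 16 and 18) with N = 20 Monte Carlo rollouts.
        generator.policy_gradient_step(batch, discriminator, n_rollouts=MC_N)
        # One round of teacher forcing: reward 1 for the golden target sentences.
        generator.teacher_forcing_step(batch)
        # Generate eta fresher machine-generated pairs with the updated generator,
        # take one discriminator step, and clamp its weights to [-eps, eps].
        fresh = [(x, generator.greedy_decode(x))
                 for x, _ in random.sample(parallel_data, ETA)]
        discriminator.train_step(real=random.sample(parallel_data, BATCH_D),
                                 fake=random.sample(fresh, BATCH_D))
        discriminator.clamp_weights(-EPS, EPS)
```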
Models            NIST02(Dev)  NIST03  NIST04  NIST05  Ave
1 baseline-small  32.12        28.31   30.21   27.13   29.44
2 baseline-large  34.12        31.05   32.04   29.24   31.61
3 (1)+sgd         32.96        29.09   30.82   27.72   30.14
4 (1)+CSGAN-NMT   33.87        30.26   31.52   28.85   31.12
5 (3)+CSGAN-NMT   34.38        30.56   31.76   29.23   31.48
6 (2)+CSGAN-NMT   36.30        33.11   33.85   31.78   33.76

Table 1: BLEU scores on Chinese-English translation tasks. "+sgd" means using SGD to finetune the model, and "baseline-small" indicates that the model is trained on the small data set.

5 Experiments and Results

In this section, we detail our experiments and results with the CSGAN-NMT model on Chinese-English translation tasks. The open-source NMT system dl4mt, which has been used to build top-performing submissions to shared translation tasks at WMT and IWSLT (Sennrich et al., 2017), is used as the baseline model. Note that the generator of the CSGAN-NMT is implemented identically to the baseline model.

5.1 Setup

We perform two Chinese-English translation tasks: one on a small training set (1.0M sentence pairs) and the other on a large-scale training set (1.6M sentence pairs). The training sets are randomly extracted from LDC corpora (LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2004T08, LDC2004E12, LDC2005T10). The large training set is only used to test the feasibility of the proposed model in settings where a great amount of training data is available. Unless otherwise specified, the following experiments are run on the small training set. We choose NIST02 as the development set and use the NIST03, NIST04 and NIST05 data sets for testing. We apply word-level translation in our experiments, and the Chinese sentences are segmented beforehand. To speed up the training procedure, sentences of more than 50 words are removed. We limit the vocabulary in both Chinese and English to the most frequent 30K words and replace out-of-vocabulary words with UNK. The word embedding dimension is set to 512 and the size of the hidden layer to 1024. The other hyper-parameters are set according to Section 4. We use case-insensitive 4-gram BLEU as the evaluation metric. We train the NMT models on 4 K80 GPUs; it takes about 30 hours to train the baseline model on the small training set, and the training time for the proposed CSGAN-NMT model (pre-training included) is about 3 days.
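As an illustration of the preprocessing just described (length filtering, a 30K-word vocabulary, UNK replacement), here is a small sketch; it is our own simplification, not the dl4mt preprocessing scripts.

```python
from collections import Counter

MAX_LEN, VOCAB_SIZE, UNK = 50, 30000, "UNK"

def build_vocab(sentences, size=VOCAB_SIZE):
    """Keep the `size` most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def preprocess(pairs):
    """pairs: list of (source_tokens, target_tokens)."""
    # Remove sentence pairs where either side is longer than 50 tokens.
    pairs = [(s, t) for s, t in pairs if len(s) <= MAX_LEN and len(t) <= MAX_LEN]
    src_vocab = build_vocab(s for s, _ in pairs)
    tgt_vocab = build_vocab(t for _, t in pairs)
    # Replace out-of-vocabulary words with UNK on both sides.
    return [([w if w in src_vocab else UNK for w in s],
             [w if w in tgt_vocab else UNK for w in t]) for s, t in pairs]
```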
5.2 CNN or RNN for discriminator

We test different architectures for the discriminator: CNN-based and RNN-based (with the LSTM as an instance). Figure 3 shows the BLEU scores on the development set tested at different time steps. We can see that the performance of the RNN-based (LSTM and BiLSTM) discriminators deteriorates rapidly as training goes on. Even more strikingly, the performance with the BiLSTM-based discriminator collapses sharply after several time steps; training with the RNN-based discriminators is not stable. On the contrary, the CNN-based discriminator performs very well. Empirically, after a few updates the classification accuracy of the RNN-based discriminators easily reaches as high as 0.9, which is too strong for the generator: the sentences generated by the generator are easily discriminated by such a strong discriminator, so the generator is discouraged all the time. We conjecture that this is why the RNN-based discriminators work badly in our CSGAN-NMT. Unless otherwise specified, we use the CNN-based discriminator in the following experiments.

Figure 3: BLEU score on the development set for the CSGAN-NMT with different discriminators. For the RNN-based architecture, we test the LSTM and the bidirectional LSTM.

Figure 4: BLEU score on the development set for the CSGAN-NMT where the discriminators have different initial accuracies. "0.6-acc" means the initial accuracy is 0.6.

5.3 Results on Chinese-English translation

Table 1 shows the BLEU scores on the Chinese-English test sets. Compared to the baseline model, the proposed CSGAN-NMT model leads to an improvement of up to +1.7 BLEU points on average when trained on the small data set (see (1) and (4) in Table 1). Naturally, one may suspect that this improvement owes much to the small learning rate of the optimization method rather than to the CSGAN-NMT itself. To dispel this doubt and verify the efficacy of the proposed model, we use stochastic gradient descent with a learning rate of 0.0001, i.e., the learning rate used in CSGAN-NMT, to finetune the baseline model, and we only get a +0.7 BLEU point improvement (see (3) in Table 1). There is still a gap as large as 1.0 BLEU points on average (30.14 vs. 31.12) between the finetuned baseline model and the CSGAN-NMT model (see (3) and (4) in Table 1). Additionally, we get another +0.3 BLEU points of improvement when running the CSGAN-NMT on the basis of the finetuned baseline model (see (3) and (5) in Table 1). On the large training set, the CSGAN-NMT model leads to improvements of up to +2.5 BLEU points on NIST05 and +2.1 BLEU points on average (see (2) and (6) in Table 1). To conclude, these experiments show that NMT can be greatly improved by generative adversarial training and that the proposed CSGAN-NMT model achieves consistent improvement even when trained on the large data set.

5.4 Initial accuracy of the discriminator

The initial accuracy f of the discriminator, which can be viewed as a hyper-parameter, can be controlled carefully during pre-training. A natural question is when we should end the pre-training: do we need to pre-train the discriminator until its accuracy is as high as possible? To test the impact of the initial accuracy, we pre-train five discriminators whose accuracies are 0.6, 0.7, 0.8, 0.9 and 0.95 respectively. With these five discriminators, we train five different CSGAN-NMT models and test their translation performance on the development set at regular intervals. Figure 4 reports the results: the initial accuracy of the discriminator has a great impact on the translation performance of the proposed model and needs to be set carefully; whether it is set too high (0.9 and 0.95) or too low (0.6 and 0.7), the CSGAN-NMT performs badly. Empirically, we pre-train the discriminator until its accuracy reaches 0.82.
5.5 Sample times for Monte Carlo search

We are also curious about the number of samples N for the Monte Carlo search. If N is set to a small number, the intermediate reward computed in Equation 16 may be inaccurate; otherwise, the computation becomes very time consuming. There is a trade-off between accuracy and computational complexity. Table 2 presents the translation performance of the CSGAN-NMT on the test sets when N is set from 5 to 30 with an interval of 5. From Table 2, the proposed CSGAN-NMT model achieves no improvement over the baseline when N is set to less than 15. With N set to 30, we get little improvement over the model with N set to 20, but the training time exceeds our tolerance.

N   NIST02  NIST03  NIST04  NIST05
5   -       -       -       -
10  -       -       -       -
15  33.02   29.81   30.89   28.45
20  33.87   30.25   31.41   28.66
25  33.65   30.24   31.52   28.81
30  33.74   30.21   31.54   28.76

Table 2: The translation performance of the CSGAN-NMT model with different N for the Monte Carlo search. "-" means that the proposed model shows no improvement over the baseline model or cannot be trained stably.

5.6 Discriminator for data cleaning

The discriminator of the CSGAN-NMT directly outputs the probability that, given the source-language sentence, the target-language sentence is human-generated. This motivates us to test the feasibility of applying the discriminator to data cleaning. When we finished training the CSGAN-NMT model, the accuracy of the discriminator was near 0.6, which is a little weak for handling the data cleaning task; hence, we continued training the discriminator for 4 epochs until its accuracy reached 0.95. Then, by feeding the parallel training data into the discriminator, we get a probability of being human-translated for each sentence pair. We select a set of examples (s1) from the training data by this probability in descending order. Additionally, we randomly choose another set s2 which has the same number of examples as s1. Two traditional NMT models with the same configuration are trained on the data sets s1 and s2 respectively. In our practice, we sample the set s2 five times with different random seeds and report the average translation performance of the five NMT models trained on the five s2 sets. The results are reported in Table 3. We find that the models trained on the cleaned data achieve better translation performance than their counterparts trained on the randomly sampled data. Furthermore, the model trained on the 60w cleaned sentence pairs achieves translation performance comparable to the model trained on 80w noisy pairs on NIST02 and NIST05. This indicates that the discriminator is capable of cleaning the training data for NMT.

#     NIST02          NIST04          NIST05
      s1      s2      s1      s2      s1      s2
60w   30.07   29.12   27.98   27.58   26.32   25.48
80w   31.11   30.32   29.31   28.61   26.97   26.31

Table 3: The translation performance of the NMT models on the test sets when the data size of the sub-set is 60w and 80w. The result for s2 is the average translation performance.
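A minimal sketch of this data selection step, under the assumption that the trained discriminator exposes a scoring function prob_real(x, y) (our placeholder name): rank the parallel corpus by the discriminator's probability and keep the top pairs as s1, with a randomly drawn s2 of the same size as the control.

```python
import random

def select_clean_subset(pairs, prob_real, size):
    """Set s1: sentence pairs ranked by the discriminator score, top `size` kept."""
    scored = sorted(pairs, key=lambda p: prob_real(p[0], p[1]), reverse=True)
    return scored[:size]

def random_subset(pairs, size, seed=0):
    """Set s2: the same number of randomly chosen pairs (one random seed)."""
    rng = random.Random(seed)
    return rng.sample(pairs, size)
```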
6 Conclusions and Future work

In this work, we propose the CSGAN-NMT, which leverages generative adversarial training to improve neural machine translation. We test different architectures (RNN-based and CNN-based) for the discriminator of the CSGAN-NMT. Experimental results show that our proposed model significantly outperforms the strong attention-based NMT baseline, and even on the large-scale training set the model achieves consistent improvement. We find that the CNN-based discriminator performs better than the RNN-based one and give some explanations for this. Additionally, we provide detailed training strategies for the CSGAN-NMT model. We also demonstrate that the discriminator in the CSGAN-NMT has great capability in data cleaning.

In the future, we would like to try a multi-adversarial framework which consists of multiple discriminators and generators for generative adversarial training. Additionally, we plan to test our method on other NLP tasks, such as dialogue systems and question answering. Since a higher BLEU score does not ensure a better sentence, another interesting direction is to apply the discriminator to measure translation performance more fairly. We also believe that reducing the number of hyper-parameters deserves more effort in future work.
References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

James Bradbury and Richard Socher. 2016. MetaMind neural machine translation system for WMT 2016. In Proceedings of the First Conference on Machine Translation, Berlin, Germany. Association for Computational Linguistics.

Chris Callison-Burch and Miles Osborne. 2006. Re-evaluating the role of BLEU in machine translation research.

Niladri Chatterjee, Anish Johnson, and Madhav Krishna. 2007. Some improvements over the BLEU metric for measuring translation quality for Hindi. In Computing: Theory and Applications, 2007. ICCTA'07. International Conference on. IEEE, pages 485-490.

Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. ACL 2014, page 362.

Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2016. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Emily L. Denton, Soumith Chintala, Rob Fergus, et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486-1494.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. EMNLP, pages 1700-1709.

Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. Advances in Neural Information Processing Systems, pages 4601-4609.

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Association for Computational Linguistics, pages 311-318.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, and Jozef Mokry. 2017. Nematus: a toolkit for neural machine translation.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, pages 3104-3112.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. The Association for the Advancement of Artificial Intelligence 2017.

Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. NIPS.

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188.

Junbo Zhao, Michael Mathieu, and Yann LeCun. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.
