
Unsupervised Neural Machine Translation with Weight Sharing

Zhen Yang1,2 , Wei Chen1 , Feng Wang1,2∗, Bo Xu1


1 Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
{yangzhen2014, wei.chen.media, feng.wang, xubo}@ia.ac.cn

arXiv:1804.09057v1 [cs.CL] 24 Apr 2018

Abstract

Unsupervised neural machine translation (NMT) is a recently proposed approach for machine translation which aims to train the model without using any labeled data. The models proposed for unsupervised NMT often use only one shared encoder to map the pairs of sentences from different languages to a shared-latent space, which is weak in keeping the unique and internal characteristics of each language, such as the style, terminology, and sentence structure. To address this issue, we introduce an extension by utilizing two independent encoders but sharing some partial weights which are responsible for extracting high-level representations of the input sentences. Besides, two different generative adversarial networks (GANs), namely the local GAN and global GAN, are proposed to enhance the cross-language translation. With this new approach, we achieve significant improvements on English-German, English-French and Chinese-to-English translation tasks.

1 Introduction

Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014), which directly applies a single neural network to transform the source sentence into the target sentence, has now reached impressive performance (Shen et al., 2015; Wu et al., 2016; Johnson et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). An NMT system typically consists of two sub-networks: the encoder network reads and encodes the source sentence into a context vector, and the decoder network generates the target sentence iteratively based on the context vector. NMT can be studied in supervised and unsupervised learning settings. In the supervised setting, bilingual corpora are available for training the NMT model. In the unsupervised setting, we only have two independent monolingual corpora, one for each language, and there is no bilingual training example to provide alignment information for the two languages. Due to the lack of alignment information, unsupervised NMT is considered more challenging. However, this task is very promising, since monolingual corpora are usually easy to collect.

Motivated by recent success in unsupervised cross-lingual embeddings (Artetxe et al., 2016; Zhang et al., 2017b; Conneau et al., 2017), the models proposed for unsupervised NMT often assume that a pair of sentences from two different languages can be mapped to the same latent representation in a shared-latent space (Lample et al., 2017; Artetxe et al., 2017b). Following this assumption, Lample et al. (2017) use a single encoder and a single decoder for both the source and target languages. The encoder and decoder, acting as a standard auto-encoder (AE), are trained to reconstruct the inputs. Artetxe et al. (2017b) utilize a shared encoder but two independent decoders. Despite their good performance, both approaches share a glaring defect, i.e., only one encoder is shared by the source and target languages. Although the shared encoder is vital for mapping sentences from different languages into the shared-latent space, it is weak in keeping the uniqueness and internal characteristics of each language, such as the style, terminology and sentence structure. Since each language has its own characteristics, the source and target languages should be encoded and learned independently. Therefore, we conjecture that the shared encoder may be a factor limiting the potential translation performance.

In order to address this issue, we extend the encoder-shared model, i.e., the model with one shared encoder, by leveraging two independent encoders, one for each language. Similarly, two independent decoders are utilized. For each language, the encoder and its corresponding decoder form an AE, where the encoder generates the latent representations from the perturbed input sentences and the decoder reconstructs the sentences from the latent representations. To map the latent representations from different languages to a shared-latent space, we propose a weight-sharing constraint on the two AEs. Specifically, we share the weights of the last few layers of the two encoders, which are responsible for extracting high-level representations of the input sentences. Similarly, we share the weights of the first few layers of the two decoders. To enforce the shared-latent space, the word embeddings are used as a reinforced encoding component in our encoders. For cross-language translation, we utilize back-translation following (Lample et al., 2017). Additionally, two different generative adversarial networks (GANs) (Yang et al., 2017), namely the local and global GAN, are proposed to further improve the cross-language translation. We utilize the local GAN to constrain the source and target latent representations to have the same distribution, whereby the encoder tries to fool a local discriminator which is simultaneously trained to distinguish the language of a given latent representation. We apply the global GAN to fine-tune the corresponding generator, i.e., the composition of the encoder and the decoder of the other language, where a global discriminator is leveraged to guide the training of the generator by assessing how far the generated sentence is from the true data distribution [1]. In summary, we mainly make the following contributions:

• We propose the weight-sharing constraint for unsupervised NMT, enabling the model to utilize an independent encoder for each language. To enforce the shared-latent space, we also propose the embedding-reinforced encoders and two different GANs for our model.

• We conduct extensive experiments on English-German, English-French and Chinese-to-English translation tasks. Experimental results show that the proposed approach consistently achieves significant improvements.

• Last but not least, we introduce the directional self-attention to model temporal order information for the proposed model. Experimental results reveal that the temporal order information within the self-attention layers of NMT deserves further investigation.

2 Related Work

Several approaches have been proposed to train NMT models without direct parallel corpora. The scenario that has been widely investigated is one where two languages have little parallel data between them but are well connected by one pivot language. The most typical approach in this scenario is to independently translate from the source language to the pivot language and from the pivot language to the target language (Saha et al., 2016; Cheng et al., 2017). To improve the translation performance, Johnson et al. (2016) propose a multilingual extension of a standard NMT model and achieve substantial improvements for language pairs without direct parallel training data.

Recently, motivated by the success of cross-lingual embeddings, researchers have begun to explore the more ambitious scenario where an NMT model is trained from monolingual corpora only. Lample et al. (2017) and Artetxe et al. (2017b) simultaneously propose approaches for this scenario, which are based on pre-trained cross-lingual embeddings. Lample et al. (2017) utilize a single encoder and a single decoder for both languages. The entire system is trained to reconstruct its perturbed input. For cross-lingual translation, they incorporate back-translation into the training procedure. Different from (Lample et al., 2017), Artetxe et al. (2017b) use two independent decoders, one for each language. The two works mentioned above both use a single shared encoder to guarantee the shared latent space. However, a concomitant defect is that the shared encoder is weak in keeping the uniqueness of each language. Our work also belongs to this more ambitious scenario, and to the best of our knowledge, we are among the first endeavors to investigate how to train an NMT model with monolingual corpora only.

* Feng Wang is the corresponding author of this paper.
[1] The code that we utilized to train and evaluate our models can be found at https://github.com/ZhenYangIACAS/unsupervised-NMT
[Figure 1]

Figure 1: The architecture of the proposed model. We implement the shared-latent space assumption using a weight-sharing constraint where the connection of the last few layers in Encs and Enct are tied (illustrated with dashed lines) and the connection of the first few layers in Decs and Dect are tied. x̃_s^(Encs-Decs) and x̃_t^(Enct-Dect) are the self-reconstructed sentences in each language. x̃_s^(Encs-Dect) is the translated sentence from source to target and x̃_t^(Enct-Decs) is the translation in the reversed direction. Dl is utilized to assess whether the hidden representation of the encoder is from the source or target language. Dg1 and Dg2 are used to evaluate whether the translated sentences are realistic for each language respectively. Z represents the shared-latent space.

3 The Approach

3.1 Model Architecture

The model architecture, as illustrated in figure 1, is based on the AE and GAN. It consists of seven sub-networks: two encoders Encs and Enct, two decoders Decs and Dect, the local discriminator Dl, and the global discriminators Dg1 and Dg2. For the encoder and decoder, we follow the newly emerged Transformer (Vaswani et al., 2017). Specifically, the encoder is composed of a stack of four identical layers [2]. Each layer consists of a multi-head self-attention and a simple position-wise fully connected feed-forward network. The decoder is also composed of four identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. For more details about the multi-head self-attention layer, we refer the reader to (Vaswani et al., 2017). We implement the local discriminator as a multi-layer perceptron and the global discriminator based on a convolutional neural network (CNN). The roles of the sub-networks are summarised in table 1. The proposed system has several striking components, which are critical either for the system to be trained in an unsupervised manner or for improving the translation performance.

[2] The layer number is selected according to our preliminary experiment, which is presented in appendix A.

Networks                Roles
{Encs, Decs}            AE for source language
{Enct, Dect}            AE for target language
{Encs, Dect}            translation source → target
{Enct, Decs}            translation target → source
{Encs, Dl}              1st local GAN (GANl1)
{Enct, Dl}              2nd local GAN (GANl2)
{Enct, Decs, Dg1}       1st global GAN (GANg1)
{Encs, Dect, Dg2}       2nd global GAN (GANg2)

Table 1: Interpretation of the roles for the sub-networks in the proposed system.
Directional self-attention  Compared to a recurrent neural network, a disadvantage of the simple self-attention mechanism is that the temporal order information is lost. Although the Transformer applies positional encoding to the sequence before it is processed by the self-attention, how to model temporal order information within an attention layer is still an open question. Following (Shen et al., 2017), we build the encoders in our model on the directional self-attention, which utilizes positional masks to encode temporal order information into the attention output. More concretely, two positional masks, namely the forward mask M^f and the backward mask M^b, are calculated as:

    M^f_ij = 0 if i < j, −∞ otherwise    (1)

    M^b_ij = 0 if i > j, −∞ otherwise    (2)

With the forward mask M^f, a later token only makes attention connections to the earlier tokens in the sequence, and vice versa with the backward mask. Similar to (Zhou et al., 2016; Wang et al., 2017), we utilize a self-attention network to process the input sequence in the forward direction. The output of this layer is taken by an upper self-attention network as input and processed in the reverse direction.
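As a concrete illustration, the two masks in equations (1) and (2) can be built directly and added to the pre-softmax attention logits. The following minimal NumPy sketch is ours (the function name and usage are illustrative assumptions, not the released implementation):

    import numpy as np

    def directional_masks(n):
        # Eq. (1): M^f[i, j] = 0 if i < j, -inf otherwise
        # Eq. (2): M^b[i, j] = 0 if i > j, -inf otherwise
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        m_fwd = np.where(i < j, 0.0, -np.inf)
        m_bwd = np.where(i > j, 0.0, -np.inf)
        return m_fwd, m_bwd

    # The mask is added to the attention logits before the softmax; under the
    # paper's index convention (i = attended position, j = querying position),
    # M^f lets a later token attend only to earlier tokens, and M^b the reverse.
    m_fwd, m_bwd = directional_masks(5)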
Weight sharing  Based on the shared-latent space assumption, we apply the weight sharing constraint to relate the two AEs. Specifically, we share the weights of the last few layers of the Encs and Enct, which are responsible for extracting high-level representations of the input sentences. Similarly, we also share the first few layers of the Decs and Dect, which are expected to decode high-level representations that are vital for reconstructing the input sentences. Compared to (Cheng et al., 2016; Saha et al., 2016) which use the fully shared encoder, we only share partial weights for the encoders and decoders. In the proposed model, the independent weights of the two encoders are expected to learn and encode the hidden features about the internal characteristics of each language, such as the terminology, style, and sentence structure. The shared weights are utilized to map the hidden features extracted by the independent weights to the shared-latent space.
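In implementation terms, tying the last encoder layer (and the first decoder layer) simply means both languages reuse the same parameter set for that layer. The toy sketch below uses our own placeholder classes to make the idea explicit; it is not the paper's TensorFlow code:

    class Layer:
        """Stand-in for one Transformer layer (self-attention + feed-forward)."""
        def __init__(self, name):
            self.name = name
        def __call__(self, x):
            return x  # real computation omitted

    class Encoder:
        def __init__(self, private_layers, shared_layers):
            self.layers = private_layers + shared_layers
        def __call__(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    # Four layers per encoder; the top layer is the *same object* for both
    # languages, so its weights are tied, while the lower three stay private.
    shared_top = [Layer("shared_top")]
    enc_s = Encoder([Layer(f"src_{i}") for i in range(3)], shared_top)
    enc_t = Encoder([Layer(f"tgt_{i}") for i in range(3)], shared_top)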
Embedding-reinforced encoder  We use pre-trained cross-lingual embeddings in the encoders, and they are kept fixed during training. These fixed embeddings are used as a reinforced encoding component in our encoders. Formally, given the input sequence embedding vectors E = {e1, . . . , et} and the initial output sequence of the encoder stack H = {h1, . . . , ht}, we compute Hr as:

    Hr = g H + (1 − g) E    (3)

where Hr is the final output sequence of the encoder which will be attended by the decoder (in the standard Transformer, H is the final output of the encoder), and g is a gate unit computed as:

    g = σ(W1 E + W2 H + b)    (4)

where W1, W2 and b are trainable parameters shared by the two encoders. The motivation behind this design is twofold. Firstly, taking the fixed cross-lingual embeddings as the other encoding component helps to reinforce the shared-latent space. Additionally, from the viewpoint of multi-channel encoders (Xiong et al., 2017), providing encoding components with different levels of composition enables the decoder to take pieces of the source sentence at varying composition levels suiting its own linguistic structure.
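Written out, the gate in equations (3) and (4) is a per-position, element-wise interpolation between the fixed embeddings and the encoder output. The NumPy sketch below assumes E and H share the model dimension d and uses our own shapes and names:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def reinforced_output(E, H, W1, W2, b):
        # E, H: (seq_len, d); W1, W2: (d, d); b: (d,)
        g = sigmoid(E @ W1 + H @ W2 + b)   # Eq. (4), element-wise gate
        return g * H + (1.0 - g) * E       # Eq. (3)

    rng = np.random.default_rng(0)
    E, H = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
    Hr = reinforced_output(E, H, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), np.zeros(4))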
model with monolingual data and has been widely global GANs are utilized to update the whole pa-
investigated by (Sennrich et al., 2015a; Zhang and rameters of the proposed model, including the
Zong, 2016). In our approach, given an input parameters of encoders and decoders. The pro-
sentence in a given language, we apply the cor- posed model has two global GANs: GANg1 and
responding encoder and the decoder of the other GANg2 . In GANg1 , the Enct and Decs act as
language to translate it to the other language 3 . the generator, which generates the sentence x̃t 4
By combining the translation with its original sen- from xt . The Dg1 , implemented based on CNN,
tence, we get a pseudo-parallel corpus which is assesses whether the generated sentence x̃t is the
utilized to train the model to reconstruct the origi- true target-language sentence or the generated sen-
nal sentence from its translation. tence. The global discriminator aims to distin-
Local GAN Although the weight sharing con- guish among the true sentences and generated sen-
straint is vital for the shared-latent space assump- tences, and it is trained to minimize its classifi-
tion, it alone does not guarantee that the corre- cation error rate. During training, the Dg1 feeds
sponding sentences in two languages will have the back its assessment to finetune the encoder Enct
same or similar latent code. To further enforce and decoder Decs . Since the machine transla-
the shared-latent space, we train a discriminative tion is a sequence generation problem, following
neural network, referred to as the local discrimi- (Yang et al., 2017), we leverage policy gradient re-
nator, to classify between the encoding of source inforcement training to back-propagate the assess-
sentences and the encoding of target sentences. ment. We apply a similar processing to GANg2
The local discriminator, implemented as a multi- (The details about the architecture of the global
layer perceptron with two hidden layers of size discriminator and the training procedure of the
256, takes the output of the encoder, i.e., Hr calcu- global GANs can be seen in appendix B and C).
lated as equation 3, as input, and produces a binary There are two stages in the proposed unsuper-
prediction about the language of the input sen- vised training. In the first stage, we train the pro-
tence. The local discriminator is trained to predict posed model with denoising auto-encoding, back-
the language by minimizing the following cross- translation and the local GANs, until no improve-
entropy loss: ment is achieved on the development set. Specif-
ically, we perform one batch of denoising auto-
LDl (θDl ) = encoding for the source and target languages, one
− Ex∈xs [log p(f = s|Encs (x))] (6) batch of back-translation for the two languages,
− Ex∈xt [log p(f = t|Enct (x))] and another batch of local GAN for the two lan-
guages. In the second stage, we fine tune the pro-
where θDl represents the parameters of the local posed model with the global GANs.
discriminator and f ∈ {s, t}. The encoders are
trained to fool the local discriminator: 4 Experiments and Results
LEncs (θEncs ) =
(7) We evaluate the proposed approach on English-
− Ex∈xs [log p(f = t|Encs (x))] German, English-French and Chinese-to-English
translation tasks 5 . We firstly describe the datasets,
LEnct (θEnct ) = pre-processing and model hyper-parameters we
(8)
− Ex∈xt [log p(f = s|Enct (x))] used, then we introduce the baseline systems, and
where θEncs and θEnct are the parameters of the finally we present our experimental results.
two encoders.
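Since the prediction is binary, equations (6)-(8) reduce to standard cross-entropy terms on the discriminator's probability of the "source" label. A NumPy sketch under our own naming (p_src_* denotes that probability for batches of source-side and target-side encodings):

    import numpy as np

    def local_discriminator_loss(p_src_on_src, p_src_on_tgt, eps=1e-9):
        # Eq. (6): D_l should say "s" for source encodings and "t" for target ones
        return -(np.mean(np.log(p_src_on_src + eps)) +
                 np.mean(np.log(1.0 - p_src_on_tgt + eps)))

    def encoder_adversarial_losses(p_src_on_src, p_src_on_tgt, eps=1e-9):
        # Eq. (7): Enc_s wants D_l to predict "t";  Eq. (8): Enc_t wants "s"
        loss_enc_s = -np.mean(np.log(1.0 - p_src_on_src + eps))
        loss_enc_t = -np.mean(np.log(p_src_on_tgt + eps))
        return loss_enc_s, loss_enc_t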
Global GAN  We apply the global GANs to fine-tune the whole model so that the model is able to generate sentences indistinguishable from the true data, i.e., sentences in the training corpus. Different from the local GANs, which update the parameters of the encoders locally, the global GANs are utilized to update the whole parameters of the proposed model, including the parameters of the encoders and decoders. The proposed model has two global GANs: GANg1 and GANg2. In GANg1, Enct and Decs act as the generator, which generates the sentence x̃t [4] from xt. Dg1, implemented based on a CNN, assesses whether the generated sentence x̃t is the true target-language sentence or the generated sentence. The global discriminator aims to distinguish between the true sentences and generated sentences, and it is trained to minimize its classification error rate. During training, Dg1 feeds back its assessment to fine-tune the encoder Enct and decoder Decs. Since machine translation is a sequence generation problem, following (Yang et al., 2017), we leverage policy gradient reinforcement training to back-propagate the assessment. We apply a similar process to GANg2 (the details about the architecture of the global discriminator and the training procedure of the global GANs can be seen in appendices B and C).

There are two stages in the proposed unsupervised training. In the first stage, we train the proposed model with denoising auto-encoding, back-translation and the local GANs, until no improvement is achieved on the development set. Specifically, we perform one batch of denoising auto-encoding for the source and target languages, one batch of back-translation for the two languages, and another batch of local GAN for the two languages. In the second stage, we fine-tune the proposed model with the global GANs.

[4] The x̃t is x̃_t^(Enct−Decs) in figure 1. We omit the superscript for simplicity.
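The training schedule above can be summarized as a simple driver loop; the skeleton below is schematic (the model and data interfaces are our placeholders, not the released code):

    def train(model, data, stage1_steps, stage2_steps):
        for step in range(stage1_steps):      # stage 1: AE + back-translation + local GANs
            model.denoising_step(data.source_batch(), data.target_batch())
            model.back_translation_step(data.source_batch(), data.target_batch())
            model.local_gan_step(data.source_batch(), data.target_batch())
            if data.dev_score_stopped_improving():
                break
        for step in range(stage2_steps):      # stage 2: fine-tune with the global GANs
            model.global_gan_step(data.source_batch(), data.target_batch())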
4 Experiments and Results

We evaluate the proposed approach on English-German, English-French and Chinese-to-English translation tasks [5]. We firstly describe the datasets, pre-processing and model hyper-parameters we used, then we introduce the baseline systems, and finally we present our experimental results.

4.1 Data Sets and Preprocessing

In English-German and English-French translation, we make our experiments comparable with previous work by using the datasets from the WMT 2016 and WMT 2014 shared tasks respectively. For Chinese-to-English translation, we use the datasets from LDC, which have been widely utilized by previous works (Tu et al., 2017; Zhang et al., 2017a).

WMT14 English-French  Similar to (Lample et al., 2017), we use the full training set of 36M sentence pairs, lower-case them and remove sentences longer than 50 words, resulting in a parallel corpus of about 30M pairs of sentences. To guarantee no exact correspondence between the source and target monolingual sets, we build monolingual corpora by selecting the English sentences from 15M random pairs, and selecting the French sentences from the complementary set. Sentences are encoded with byte-pair encoding (Sennrich et al., 2015b), which yields an English vocabulary of about 32000 tokens and a French vocabulary of about 33000 tokens. We report results on newstest2014.

WMT16 English-German  We follow the same procedure mentioned above to create monolingual training corpora for English-German translation, and we get two monolingual training sets of 1.8M sentences each. The two languages share a vocabulary of about 32000 tokens. We report results on newstest2016.

LDC Chinese-English  For Chinese-to-English translation, our training data consists of 1.6M sentence pairs randomly extracted from LDC corpora [6]. Since the data set is not big enough, we simply build the monolingual data sets by randomly shuffling the Chinese and English sentences respectively. In spite of the fact that some correspondence between examples in these two monolingual sets may exist, we never utilize this alignment information in our training procedure (see Section 3.2). Both the Chinese and English sentences are encoded with byte-pair encoding. We get an English vocabulary of about 34000 tokens and a Chinese vocabulary of about 38000 tokens. The results are reported on NIST02.

Since the proposed system relies on pre-trained cross-lingual embeddings, we utilize the monolingual corpora described above to train the embeddings for each language independently by using word2vec (Mikolov et al., 2013). We then apply the public implementation [7] of the method proposed by (Artetxe et al., 2017a) to map these embeddings to a shared-latent space [8].

4.2 Model Hyper-parameters and Evaluation

Following the base model in (Vaswani et al., 2017), we set the dimension of the word embedding to 512, the dropout rate to 0.1 and the number of heads to 8. We use beam search with a beam size of 4 and length penalty α = 0.6. The model is implemented in TensorFlow (Abadi et al., 2015) and trained on up to four K80 GPUs synchronously in a multi-GPU setup on a single machine.

For model selection, we stop training when the model achieves no improvement for the tenth evaluation on the development set, which is comprised of 3000 source and target sentences extracted randomly from the monolingual training corpora. Following (Lample et al., 2017), we translate the source sentences to the target language, and then translate the resulting sentences back to the source language. The quality of the model is then evaluated by computing the BLEU score over the original inputs and their reconstructions via this two-step translation process. The performance is finally averaged over the two directions, i.e., from source to target and from target to source. BLEU (Papineni et al., 2002) is utilized as the evaluation metric. For Chinese-to-English, we apply the script mteval-v11b.pl to evaluate the translation performance. For English-German and English-French, we evaluate the translation performance with the script multi-bleu.pl [9].

[5] The reason that we do not conduct experiments on English-to-Chinese translation is that we do not have public test sets for English-to-Chinese.
[6] LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2004T08, LDC2004E12, LDC2005T10.
[7] https://github.com/artetxem/vecmap
[8] The configuration we used to run these open-source toolkits can be found in appendix D.
[9] https://github.com/moses-smt/mosesdecoder/blob/617e8c8/scripts/generic/multi-bleu.perl; mteval-v11b.pl
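The round-trip criterion used for model selection can be expressed compactly; the helper below is a sketch with assumed interfaces (bleu, translate_s2t and translate_t2s are placeholders standing in for the evaluation script and the two translation directions):

    def round_trip_bleu(dev_src, dev_tgt, translate_s2t, translate_t2s, bleu):
        # Translate out and back, then score the reconstructions against the originals.
        src_reconstructed = [translate_t2s(translate_s2t(x)) for x in dev_src]
        tgt_reconstructed = [translate_s2t(translate_t2s(y)) for y in dev_tgt]
        score_src = bleu(src_reconstructed, dev_src)
        score_tgt = bleu(tgt_reconstructed, dev_tgt)
        return 0.5 * (score_src + score_tgt)   # averaged over the two directions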
4.3 Baseline Systems

Word-by-word translation (WBW)  The first baseline we consider is a system that performs word-by-word translations using the inferred bilingual dictionary. Specifically, it translates a sentence word-by-word, replacing each word with its nearest neighbor in the other language.
Lample et al. (2017)  The second baseline is a previous work that uses the same training and testing sets as this paper. Their model belongs to the standard attention-based encoder-decoder framework, which implements the encoder using a bidirectional long short-term memory network (LSTM) and implements the decoder using a simple forward LSTM. They apply one single encoder and decoder for the source and target languages.

Supervised training  We finally consider exactly the same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences. This model can be viewed as an upper bound for the proposed unsupervised model.

                        en-de   de-en   en-fr   fr-en   zh-en
Supervised              24.07   26.99   30.50   30.21   40.02
Word-by-word             5.85    9.34    3.60    6.80    5.09
Lample et al. (2017)     9.64   13.33   15.05   14.31      -
The proposed approach   10.86   14.62   16.97   15.58   14.52

Table 2: The translation performance on English-German, English-French and Chinese-to-English test sets. The results of (Lample et al., 2017) are copied directly from their paper. We do not present the results of (Artetxe et al., 2017b) since we use different training sets.

4.4 Results and Analysis

4.4.1 Number of weight-sharing layers

We firstly investigate how the number of weight-sharing layers affects the translation performance. In this experiment, we vary the number of weight-sharing layers in the AEs from 0 to 4. Sharing one layer in the AEs means sharing one layer for the encoders and, at the same time, sharing one layer for the decoders. The BLEU scores of the English-to-German, English-to-French and Chinese-to-English translation tasks are reported in figure 2. Each curve corresponds to a different translation task and the x-axis denotes the number of weight-sharing layers for the AEs. We find that the number of weight-sharing layers has a considerable effect on the translation performance, and the best translation performance is achieved when only one layer is shared in our system. When all of the four layers are shared, i.e., only one shared encoder is utilized, we get poor translation performance in all of the three translation tasks. This verifies our conjecture that the shared encoder is detrimental to the performance of unsupervised NMT, especially for translation tasks on distant language pairs. More concretely, for the related language pair, i.e., English-to-French, the encoder-shared model shows a decline of 0.53 BLEU points compared to the best model where only one layer is shared. For the more distant language pair English-to-German, the encoder-shared model shows a more significant decline of 0.85 BLEU points. And for the most distant language pair, Chinese-to-English, the decline is as large as 1.66 BLEU points. We explain this by noting that the more distant the language pair is, the more their characteristics differ, and the shared encoder is weak in keeping the unique characteristics of each language. Additionally, we also notice that using two completely independent encoders, i.e., setting the number of weight-sharing layers to 0, results in poor translation performance too. This confirms our intuition that the shared layers are vital to map the source and target latent representations to a shared-latent space. In the rest of our experiments, we set the number of weight-sharing layers to 1.

[Figure 2]

Figure 2: The effects of the weight-sharing layer number on English-to-German, English-to-French and Chinese-to-English translation tasks.

4.4.2 Translation results

Table 2 shows the BLEU scores on the English-German, English-French and Chinese-to-English test sets. As can be seen, the proposed approach obtains significant improvements over the word-by-word baseline system, with at least +5.01 BLEU points in English-to-German translation and up to +13.37 BLEU points in English-to-French translation. This shows that the proposed model, trained only with monolingual data, effectively learns to use the context information and the internal structure of each language. Compared to the work of (Lample et al., 2017), our model also achieves up to +1.92 BLEU points improvement on the English-to-French translation task. We believe that unsupervised NMT is very promising. However, there is still much room for improvement compared to the supervised upper bound: the gap between the supervised and unsupervised models is as large as 12.3-25.5 BLEU points depending on the language pair and translation direction.
                                        en-de   de-en   en-fr   fr-en   zh-en
Without weight sharing                  10.23   13.84   16.02   14.82   13.75
Without embedding-reinforced encoder    10.45   14.17   16.55   15.27   14.10
Without directional self-attention      10.60   14.21   16.82   15.30   14.29
Without local GANs                      10.51   14.35   16.40   15.07   14.12
Without global GANs                     10.34   14.05   16.19   15.21   14.09
Full model                              10.86   14.62   16.97   15.58   14.52

Table 3: Ablation study on English-German, English-French and Chinese-to-English translation tasks. "Without weight sharing" means no layers are shared in the two AEs.

4.4.3 Ablation study

To understand the importance of the different components of the proposed system, we perform an ablation study by training multiple versions of our model with some components removed: the local GANs, the global GANs, the directional self-attention, the weight sharing, the embedding-reinforced encoders, etc. Results are reported in table 3. We do not test the importance of the auto-encoding, back-translation and pre-trained embeddings because they have been widely tested in (Lample et al., 2017; Artetxe et al., 2017b). Table 3 shows that the best performance is obtained with the simultaneous use of all the tested elements. The most critical component is the weight-sharing constraint, which is vital to map sentences of different languages to the shared-latent space. The embedding-reinforced encoder also brings some improvement on all of the translation tasks. When we remove the directional self-attention, we get up to a 0.3 BLEU point decline. This indicates that it deserves more effort to investigate the temporal order information in the self-attention mechanism. The GANs also significantly improve the translation performance of our system. Specifically, the global GANs achieve an improvement of up to +0.78 BLEU points on English-to-French translation and the local GANs also obtain an improvement of up to +0.57 BLEU points on English-to-French translation. This reveals that the proposed model benefits a lot from the cross-domain loss defined by the GANs.

5 Conclusion and Future work

The models proposed recently for unsupervised NMT use a single encoder to map sentences from different languages to a shared-latent space. We conjecture that the shared encoder is problematic for keeping the unique and inherent characteristics of each language. In this paper, we propose the weight-sharing constraint in unsupervised NMT to address this issue. To enhance the cross-language translation performance, we also introduce the embedding-reinforced encoders, the local GAN and the global GAN into the proposed system. Additionally, the directional self-attention is introduced to model the temporal order information for our system.

We test the proposed model on English-German, English-French and Chinese-to-English translation tasks. The experimental results reveal that our approach achieves significant improvements and verify our conjecture that the shared encoder is really a bottleneck for improving unsupervised NMT. The ablation study shows that each component of our system contributes to the final translation performance.

Unsupervised NMT opens exciting opportunities for future research. However, there is still much room for improvement compared to supervised NMT. In the future, we would like to investigate how to utilize the monolingual data more effectively, such as incorporating the language model and syntactic information into unsupervised NMT.
Besides, we plan to make more efforts to explore how to reinforce the temporal order information for the proposed model.

Acknowledgements

This work is supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002102, and the Beijing Engineering Research Center under Grant No. Z171100002217015. We would like to thank Xu Shuang for preparing the data used in this work. Additionally, we also want to thank Jiaming Xu, Suncong Zheng and Wenfu Wang for their invaluable discussions on this work.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Conference on Empirical Methods in Natural Language Processing. pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017a. Learning bilingual word embeddings with (almost) no bilingual data. In Meeting of the Association for Computational Linguistics. pages 451–462.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017b. Unsupervised neural machine translation.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016. Neural machine translation with pivot languages. arXiv preprint arXiv:1611.04928.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine translation. In Twenty-Sixth International Joint Conference on Artificial Intelligence. pages 3974–3980.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. TACL.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. EMNLP pages 1700–1709.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. Association for Computational Linguistics pages 311–318.

Amrita Saha, Mitesh M. Khapra, Sarath Chandar, Janarthanan Rajendran, and Kyunghyun Cho. 2016. A correlational encoder decoder architecture for pivot based sequence generation. arXiv preprint arXiv:1606.04754.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. Computer Science.
Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2017. DiSAN: Directional self-attention network for RNN/CNN-free language understanding.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems pages 3104–3112.

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2017. Learning to remember translation history with a continuous cache.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning. ACM, pages 1096–1103.

Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017. Deep neural machine translation with linear associative unit. arXiv preprint arXiv:1705.00861.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Hao Xiong, Zhongjun He, Xiaoguang Hu, and Hua Wu. 2017. Multi-channel encoder for neural machine translation. arXiv preprint arXiv:1712.02109.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets.

Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017a. Prior knowledge integration for neural machine translation using posterior regularization. In Meeting of the Association for Computational Linguistics. pages 1514–1523.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Conference on Empirical Methods in Natural Language Processing. pages 1535–1545.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Adversarial training for unsupervised bilingual lexicon induction. In Meeting of the Association for Computational Linguistics. pages 1959–1970.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.

A Experiments on the layer number for encoders and decoders

To determine the number of layers for the encoders and decoders in our system beforehand, we conduct experiments on the English-German translation tasks to test how the number of layers in the encoders and decoders affects the translation performance. We vary the number of layers from 2 to 6 and the results are reported in table 4. We find that the translation performance improves substantially as the layer number increases from 2 to 4. However, with the layer number set larger than 4, we get little improvement. To make a trade-off between the translation performance and the computational complexity, we set the layer number to 4 for our encoders and decoders.

layer num   en-de   de-en
2           11.57   14.01
3           12.43   14.99
4           12.86   15.62
5           12.91   15.83
6           12.95   15.79

Table 4: The experiments on the number of layers for encoders and decoders.

B The architecture of the global discriminator

The global discriminator is applied to distinguish the true sentences in the source or target language from the generated sentences. Following (Yang et al., 2017), we implement the global discriminator based on a CNN. Since sentences generated by the generator (the composition of the encoder and decoder) have variable lengths, CNN padding is used to transform the sentences to sequences with a fixed length T, which is the maximum length set for the output of the generator. Given the generated sequence x1, . . . , xT, we build the matrix X1:T as:

    X1:T = x1; x2; . . . ; xT    (9)

where xt ∈ R^k is the k-dimensional word embedding and the semicolon is the concatenation operator. For the matrix X1:T, a kernel wj ∈ R^(l×k) applies a convolutional operation to a window of l words to produce a series of feature maps:

    c_i^j = ρ(BN(wj ⊗ X_{i:i+l−1} + b))    (10)

where the ⊗ operator is the summation of element-wise products and b is a bias term. ρ is a non-linear activation function, implemented as ReLU in this paper. To get the final feature with respect to kernel wj, a max-over-time pooling operation is applied over the feature maps:

    c̃^j = max{c_1^j, . . . , c_{T−l+1}^j}    (11)

We use various numbers of kernels with different window sizes to extract different features, which are then concatenated to form the final sentence representation xc. Finally, we pass xc through a fully connected layer and a softmax layer to generate the probability p(fg|x1, . . . , xT) as:

    p(fg|x1, . . . , xT) = softmax(V ∗ xc)    (12)

where V is the transformation matrix and fg ∈ {true, generated}.
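For illustration, the forward pass of this discriminator can be sketched in NumPy as below (batch normalization is omitted for brevity, and all shapes and names are our own assumptions rather than the paper's implementation):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def global_discriminator(X, kernels, biases, V):
        # X: (T, k) padded word embeddings; kernels: list of (l, k) arrays;
        # biases: list of scalars; V: (2, n_kernels) output projection.
        features = []
        for w, b in zip(kernels, biases):
            l, T = w.shape[0], X.shape[0]
            # Eq. (10): windowed sum of element-wise products, plus bias, then ReLU
            c = np.array([relu(np.sum(w * X[i:i + l]) + b) for i in range(T - l + 1)])
            features.append(c.max())          # Eq. (11): max-over-time pooling
        x_c = np.array(features)              # concatenated sentence representation
        return softmax(V @ x_c)               # Eq. (12): p(true) vs p(generated)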
C The training procedure of the global GAN

We apply the global GANs to fine-tune the whole model. Here, we provide detailed strategies for training the global GANs. Firstly, we generate the machine-generated source-language sentences by using Enct and Decs to decode the monolingual data in the target language. Similarly, we get the generated sentences in the target language with Encs and Dect by decoding the source-language monolingual data. We simply use greedy sampling instead of beam search for decoding. Next, we pre-train Dg1 on the combination of the true monolingual data and the generated data in the source language. Similarly, we also pre-train Dg2 on the combination of the true monolingual data and the generated data in the target language. Finally, we jointly train the generators and discriminators. The generators are trained with policy gradient training methods. For the details about the policy gradient training, we refer the reader to (Yang et al., 2017).

D The configurations for the open-source toolkits

We train the word embeddings using the following script:

./word2vec -train text -output embedding.txt -cbow 0 -size 512 -window 10 -negative 10 -hs 0 -sample 1e- -threads 50 -binary 0 -min-count 5 -iter 10

After we get the embeddings for both the source and target languages, we use the open-source VecMap [10] to map these embeddings to a shared-latent space with the following scripts:

python3 normalize_embeddings.py unit center -i s_embedding.txt -o s_embedding.normalized.txt
python3 normalize_embeddings.py unit center -i t_embedding.txt -o t_embedding.normalized.txt
python3 map_embeddings.py --orthogonal s_embedding.normalized.txt t_embedding.normalized.txt s_embedding.mapped.txt t_embedding.mapped.txt --numerals --self_learning -v

[10] https://github.com/artetxem/vecmap
