Unsupervised Neural Machine Translation With Weight Sharing
[Figure 1 diagram: inputs x_s and x_t, encoders Enc_s and Enc_t, decoders Dec_s and Dec_t, shared-latent space Z, and discriminators D_l, D_g1, D_g2; see caption below.]
Figure 1: The architecture of the proposed model. We implement the shared-latent space assumption using a weight-sharing constraint, where the connections of the last few layers in Enc_s and Enc_t are tied (illustrated with dashed lines) and the connections of the first few layers in Dec_s and Dec_t are tied. x̃^{Enc_s-Dec_s} and x̃^{Enc_t-Dec_t} are the self-reconstructed sentences in each language, x̃^{Enc_s-Dec_t} is the translated sentence from source to target, and x̃^{Enc_t-Dec_s} is the translation in the reverse direction. D_l is utilized to assess whether the hidden representation of the encoder comes from the source or the target language. D_g1 and D_g2 are used to evaluate whether the translated sentences are realistic for each language, respectively. Z represents the shared-latent space.
simple forward LSTM. They apply one single encoder and decoder for the source and target languages.

Supervised training: We finally consider exactly the same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences. This model can be viewed as an upper bound for the proposed unsupervised model.

4.4 Results and Analysis

4.4.1 Number of weight-sharing layers

We first investigate how the number of weight-sharing layers affects the translation performance. In this experiment, we vary the number of weight-sharing layers in the AEs from 0 to 4; sharing one layer in the AEs means sharing one layer for the encoders and, at the same time, one layer for the decoders. The BLEU scores of the English-to-German, English-to-French and Chinese-to-English translation tasks are reported in the figure below.

[Figure: BLEU scores (roughly 14 to 18) plotted against the number of weight-sharing layers; declines of 0.53 and 1.66 BLEU points are annotated.]

For the language pair Chinese-to-English, the decline is as large as -1.66 BLEU points. We explain this by the fact that the more distant the language pair is, the more different characteristics the two languages have, and a fully shared encoder is weak in keeping the unique characteristics of each language. Additionally, we also notice that using two completely independent encoders, i.e., setting the number of weight-sharing layers to 0, results in poor translation performance as well. This confirms our intuition that the shared layers are vital for mapping the source and target latent representations to a shared-latent space. In the rest of our experiments, we set the number of weight-sharing layers to 1.
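To make the setup of this experiment concrete, here is a small hypothetical sketch (the helper names make_layer and build_encoder_pair and the plain Transformer block are our assumptions, not the paper's code) of how a pair of 4-layer encoders sharing their top k layers could be built; the same idea applies symmetrically to the first k decoder layers.

```python
import torch.nn as nn

DIM = 512

def make_layer():
    # Stand-in for one encoder block; the paper's blocks use directional
    # self-attention, a plain Transformer layer is used here only for illustration.
    return nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True)

def build_encoder_pair(num_shared, depth=4):
    """Return (enc_s_layers, enc_t_layers): two `depth`-layer stacks whose top
    `num_shared` layers are the same module objects, i.e. tied weights.

    num_shared = 0     -> two completely independent encoders
    num_shared = depth -> a single fully shared encoder
    """
    assert 0 <= num_shared <= depth
    shared_top = [make_layer() for _ in range(num_shared)]
    enc_s = nn.ModuleList([make_layer() for _ in range(depth - num_shared)] + shared_top)
    enc_t = nn.ModuleList([make_layer() for _ in range(depth - num_shared)] + shared_top)
    return enc_s, enc_t

# Sweeping num_shared over 0..4 and measuring BLEU mirrors the shape of this
# experiment; num_shared = 1 is the setting kept for the remaining experiments.
enc_s, enc_t = build_encoder_pair(num_shared=1)
```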
model only trained with monolingual data effectively learns to use the context information and the internal structure of each language. Compared to the work of (Lample et al., 2017), our model also achieves up to +1.92 BLEU points improvement on the English-to-French translation task. We believe that unsupervised NMT is very promising. However, there is still a large room for improvement compared to the supervised upper bound: the gap between the supervised and unsupervised models is as large as 12.3-25.5 BLEU points, depending on the language pair and translation direction.

4.4.3 Ablation study

To understand the importance of the different components of the proposed system, we perform an ablation study by training multiple versions of our model with some components removed: the local GANs, the global GANs, the directional self-attention, the weight sharing, the embedding-reinforced encoders, etc. Results are reported in Table 3. We do not test the importance of the auto-encoding, back-translation and the pre-trained embeddings because they have been widely tested in (Lample et al., 2017; Artetxe et al., 2017b). Table 3 shows that the best performance is obtained with the simultaneous use of all the tested elements. The most critical component is the weight-sharing constraint, which is vital for mapping sentences of different languages to the shared-latent space. The embedding-reinforced encoder also brings some improvement on all of the translation tasks. When we remove the directional self-attention, we observe a decline of up to -0.3 BLEU points, which indicates that the temporal order information in the self-attention mechanism deserves further investigation. The GANs also significantly improve the translation performance of our system. Specifically, the global GANs achieve an improvement of up to +0.78 BLEU points on English-to-French translation, and the local GANs also obtain an improvement of up to +0.57 BLEU points on English-to-French translation. This reveals that the proposed model benefits greatly from the cross-domain loss defined by the GANs.

5 Conclusion and Future work

The models proposed recently for unsupervised NMT use a single encoder to map sentences from different languages to a shared-latent space. We conjecture that the shared encoder is problematic for keeping the unique and inherent characteristics of each language. In this paper, we propose a weight-sharing constraint in unsupervised NMT to address this issue. To enhance the cross-language translation performance, we also introduce the embedding-reinforced encoders, a local GAN and global GANs into the proposed system. Additionally, directional self-attention is introduced to model temporal order information for our system.

We test the proposed model on English-German, English-French and Chinese-to-English translation tasks. The experimental results reveal that our approach achieves significant improvements and verify our conjecture that the shared encoder is indeed a bottleneck for improving unsupervised NMT. The ablation study shows that each component of our system contributes to the final translation performance.

Unsupervised NMT opens exciting opportunities for future research. However, there is still a large room for improvement compared to supervised NMT. In the future, we would like to investigate how to utilize the monolingual data more effectively, such as by incorporating a language model and syntactic information into unsupervised NMT. Besides, we plan to make more efforts to explore how to reinforce the temporal order information for the proposed model.
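To illustrate the cross-domain adversarial losses discussed in the ablation study (Section 4.4.3), i.e., the role of D_l in Figure 1, the following is a hypothetical PyTorch sketch of a local GAN objective on the latent space. The discriminator architecture, the sentence-level pooling, and the loss formulation are our assumptions for illustration, not the paper's exact formulation; the global GANs (D_g1, D_g2) would apply the same pattern to decoded sentences in each language.

```python
# Hypothetical sketch of the local adversarial loss on the shared latent space Z.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512
d_l = nn.Sequential(nn.Linear(DIM, 256), nn.ReLU(), nn.Linear(256, 1))  # assumed MLP discriminator

def local_gan_losses(z_src, z_tgt):
    """z_src / z_tgt: pooled latent vectors from Enc_s / Enc_t, shape (batch, DIM)."""
    src_logits, tgt_logits = d_l(z_src), d_l(z_tgt)
    ones, zeros = torch.ones_like(src_logits), torch.zeros_like(tgt_logits)
    # Discriminator objective: tell source latents (label 1) from target latents (label 0).
    d_loss = F.binary_cross_entropy_with_logits(src_logits, ones) + \
             F.binary_cross_entropy_with_logits(tgt_logits, zeros)
    # Encoder objective: fool D_l so the two languages become indistinguishable in Z.
    g_loss = F.binary_cross_entropy_with_logits(src_logits, zeros) + \
             F.binary_cross_entropy_with_logits(tgt_logits, ones)
    return d_loss, g_loss

# Toy usage with random latents standing in for encoder outputs.
d_loss, g_loss = local_gan_losses(torch.randn(4, DIM), torch.randn(4, DIM))
```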
layer num   en-de   de-en
2           11.57   14.01
3           12.43   14.99
4           12.86   15.62
5           12.91   15.83
6           12.95   15.79

Table 4: BLEU scores of the experiments on the number of layers for the encoders and decoders (en-de: English-to-German, de-en: German-to-English).