How Do Source-Side Monolingual Word Embeddings Impact Neural Machine Translation?
Leveraging the information encoded in pre-trained monolingual word embeddings (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) is a common practice in various natural language processing tasks, for example parsing (Dyer et al., 2015; Kiperwasser and Goldberg, 2016), relation extraction (Peng et al., 2017), natural language understanding (Cheng et al., 2016a), and sequence labeling (Ma and Hovy, 2016). While research in neural machine translation (NMT) seeks to make use of monolingual data through techniques such as back-translation (Sennrich et al., 2016a), the use of pre-trained monolingual word embeddings does not seem to be standard practice.

In this paper, we study the interaction between the source-side monolingual word embeddings

Various works have explored the use of monolingual data in NMT. On using target-side monolingual data, Gulcehre et al. (2015) incorporated a pre-trained RNN language model, while Sennrich et al. (2016a) trained an auxiliary model with the reverse translation direction to construct pseudo-parallel data. Currey et al. (2017) copied target monolingual data to the source side and mixed it with real parallel data to train a unified encoder-decoder model. It should be noted that although some of these techniques are very popular in practice, they are only capable of incorporating target-side monolingual data, and hence they are outside the immediate scope of this paper.

On using source-side monolingual data, Cheng et al. (2016b) and Zhang and Zong (2016) proposed self-learning for NMT, which learns the translation by reconstruction. Several works have also shown promising results in incorporating pre-trained source-side word embeddings into NMT systems. Rios Gonzales et al. (2017) used sense embeddings to improve word sense disambiguation

[Figure: model architecture diagram; labels recovered from the extraction: "Attention Mechanism"; "Context Vector (input to the decoder)"]
[Figure 2: four line plots of the objective function (log-likelihood, y-axis) for the baseline, update, and dual systems. Panels: (a) Training Curve on 100k Sample and (b) Training Curve on Unsampled Data, plotted against batch updates; (c) Development Curve on 100k Sample and (d) Development Curve on Unsampled Data, plotted against epochs.]

Figure 2: Plot of objective function vs. batch updates / epochs for Chinese-English training data
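For concreteness, the three incorporation strategies compared in these curves (fixed, update, and dual) can be sketched as embedding lookups. This is an illustrative reconstruction rather than the authors' code; in particular, whether the dual model sums or concatenates its two parts is not fully specified in this excerpt, so the sketch assumes a sum, matching the "correction term" description in the quantitative analysis.

```python
# Illustrative sketch of the three embedding incorporation strategies
# (not the authors' code). Vectors are plain lists of floats; the
# "trainable" parts are what a real NMT system would update by
# backpropagation.

def fixed_lookup(word, pretrained):
    # 'fixed': use the pre-trained vector as-is; it is never updated.
    return pretrained[word]

def update_lookup(word, finetuned):
    # 'update': initialize from the pre-trained vectors, then let NMT
    # training modify them freely (finetuned starts as a copy).
    return finetuned[word]

def dual_lookup(word, pretrained, correction):
    # 'dual': a frozen pre-trained part plus a trainable part. The text
    # describes the trainable part as a learned correction term, so the
    # sketch sums the two (an assumption on our part).
    return [p + c for p, c in zip(pretrained[word], correction[word])]

pretrained = {"cat": [1.0, 2.0]}
correction = {"cat": [0.1, -0.2]}   # starts near zero, learned in training
vec = dual_lookup("cat", pretrained, correction)
```

Under this reading, freezing the pre-trained part bounds how far the effective embedding can drift from its initialization, which is consistent with the lower update norms reported for the dual strategy later in the paper.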
over the baseline sometimes. This strengthens our hypothesis above. On the other hand, if we allow the source-side word embedding to be updated, we can observe improvements over the baseline more often. We hence conclude that pre-trained source-side monolingual word embeddings cannot directly benefit NMT performance, and adjustment by NMT training is necessary for them to benefit the NMT system.

Comparing the Chinese-English and German-English experiments, we notice that incorporating embeddings for German-English yields significant improvements over the baselines more often, mainly because of the improvements obtained by extended fixed embedding incorporation. We attribute this to the fact that the German-English test set generally has a higher pre-BPE OOV rate across the different sample sizes of the parallel training data.

Comparing the results under different monolingual data scales, for Chinese-English experiments the extended embedding always performs better than the small embedding, while for German-English experiments we obtained mixed results when comparing the two kinds of embeddings. This can be explained by the fact that the parallel training data has a minor domain mismatch with the monolingual training data (parliament proceedings vs. news). In terms of incorporation strategy, it can be observed that while the dual strategy is not very helpful with small word embeddings, improvements over the update strategy can almost always be obtained with extended embeddings. We can also notice that under the German-English setting with extended embeddings, the difference between the update and dual embedding methods is more often significant than in the Chinese-English experiments, signaling that the dual embedding method is more robust to domain variance between the monolingual data and the bi-text.

Comparing the results obtained under different parallel training data scales, it can be observed that the benefit of source-side monolingual word embeddings (compared to the baseline) seems to decrease as the amount of data increases, verifying the intuition that extra monolingual information is most useful under low-resource settings. On the other hand, the dual embedding model is able to obtain the largest performance gains over update initialization both for the Chinese-English and the German-English training
                              small embedding               extended embedding
Sentence Pairs   Baseline     fixed    update   dual        fixed    update   dual
100,000          15.99        15.36†   16.44    16.89†      15.90    17.55†   17.56†
250,000          21.72        22.05    22.16    22.08       22.05    23.04†   23.38†
500,000          26.28        25.48†   26.51    26.29       26.13    27.09†   26.76†
1,000,000        29.75        29.54    29.59    29.83       29.39    30.60†   30.89†
2,013,142        32.44        31.40†   33.21†   33.21       32.61    33.26†   34.03*†

Table 2: Chinese-English experiment results with different data sizes. Asterisks are appended when the difference between the update and dual method translation output is significant, and daggers are appended when the difference between any system output and the baseline is significant. The significance level is p < 0.05.
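The significance markers in these captions can be computed with paired bootstrap resampling over per-sentence scores, in the spirit of Koehn (2004) and the MultEval toolkit (Clark et al., 2011) cited later. The sketch below is a simplified illustration over a generic per-sentence metric, not the paper's exact procedure:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Simplified paired bootstrap test for system A vs. system B.

    scores_a / scores_b: per-sentence metric scores of the two systems
    on the same test set (paired by sentence index). Returns the
    fraction of bootstrap resamples in which system B's mean matches
    or exceeds system A's, an estimate of the p-value for A > B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins_b = 0
    for _ in range(n_resamples):
        # resample the test set with replacement, keeping pairs aligned
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b >= mean_a:
            wins_b += 1
    return wins_b / n_resamples

# Toy example: system A is better on every sentence by a small margin.
a = [0.32, 0.41, 0.38, 0.45, 0.36] * 40
b = [0.30, 0.39, 0.37, 0.42, 0.35] * 40
p = paired_bootstrap(a, b)
# A difference is called significant when p < 0.05.
```

Because resampling keeps the sentence pairing intact, per-sentence difficulty cancels out, which is what makes the test sensitive to small but consistent system differences.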
Table 3: German-English experiment Results with Different Data Sizes. Asterisks are appended when the
difference between update and dual method translation output is significant, and daggers are appended
when difference between any system output and the baseline is significant. The significance level is
p < 0.05.
data with extended embeddings. This indicates that the dual embedding model is able to get the best of both worlds as expected: it leans more on the initialized word embeddings in low-resource settings, but is able to learn useful supplementary features when a relatively large amount of parallel training data is available.

4.2 Analysis

Qualitative Analysis

Translations of the test sets were examined manually to evaluate the qualitative improvements obtained by incorporating pre-trained word embeddings within NMT. In the case of the small embedding, we have learned from the significance tests that many of these improvements are not statistically significant. But even for the significant ones, we did not observe very specific patterns of qualitative improvement.

In the case of the extended embedding, however, we observed a specific improvement in translation adequacy for rare words (mainly named entities) in the training data, and the use of the dual embedding model often brings further improvements in that respect. Such improvement is evident in Table 4 when comparing the translation results from the systems trained with unsampled parallel data (sentences 6-8 and 14-16).

To further verify the observation above, we performed a simple human evaluation of the translation output from the Chinese-English systems trained on unsampled parallel data with extended embedding incorporation. We first took the singleton words (before BPE) in the unsampled parallel data and filtered the test sentences containing these singleton words. We then manually read the filtered and shuffled test sentences and answered the yes/no question: are the singleton words appearing in the sentence translated correctly in the test output? We chose to analyze singleton words rather than OOV words because (1) the translation of singleton words is less noisy, and (2) if a word occurs in both the training set and the test set, it is more likely to occur in the monolingual data, and hence embedding incorporation will add extra information to translate these words. We found 226k singleton words in the training data and 134 occurrences of these words in the test data. The results are shown in Table 5, and we can see that both the update and dual incorporation methods
1 Source 1: Len@@ ovo 即将 要 在 7月 17日 ( 美国 时间 ) 推出 的 Think@@ Pad T@@ 61 p , 也是
T 系列 最@@ 高级 的 笔记本 电脑 , 给 我们 一 个 相当 不错 的 惊喜 : U@@ W@@ B ( Wis@@ air
Ul@@ tra Wi@@ deb@@ and ) .
2 Reference 1: The ThinkPad T61p that Lenovo is about to introduce on July 17 ( U.S. time ) ,
also the most advanced notebook computer of the T series , has given us a very pleasant surprise :
UWB ( Wisair Ultra Wideband ) .
3 Baseline 100k: On July 17 ( US time ) launched , the number of SWB’s highest notebook notebook computers ,
which is also a good source of notebook notebook computers . : UWB .
4 Update 100k: Stewart is scheduled to be launched on July 17 ( the US time ) to launch a very good surprise of the
top - - - - - - - - - - - - - - - - - - - - - - - - - - - - ranking computer computer to us .
5 Dual 100k: Founded to July 17 , Inc will be launched to be launched on the current model notebook , which is
also a well - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6 Baseline Unsampled: It will be launched on July 17 ( USA time ) as a top-class notebook computer and gives us
a rather good surprise: UWB Ultra Witra Wireless .
7 Update Unsampled: It is about to be launched on July 17 ( US time ) , as well as the T ’s most advanced
notebooks of notebooks, to us: UWB (Wisair Ultra Wireless and) .
8 Dual Unsampled: The Dell Pad T6p , which is about to be launched on July 17 ( US time ) , is also the T-series ’s
most advanced notebooks , giving us a rather good surprise: UWB ( Wisair Ultra Wideband ) .
9 Source 2: 由 王@@ 兵@@ 兵黑 砖@@ 窑 案 引发 的 山西 黑 砖@@ 窑 奴@@ 工 事件 , 曾 国内外 一
度 引起 关注 , 中央 高层 批示 要求 严查 .
10 Reference 2: The Shanxi black brick kiln slave labor incident touched off by the black brick kiln case of
Wang Bingbing once attracted attention from inside the country and abroad . The top leadership of the cen-
tral government had given directive demanding stern prosecution .
11 Baseline 100k: From the incident caused by Wang Wei , the Shanxi case caused by Wang Wei , and from home
and abroad , and his attention to the high - level instructions of the central authorities .
12 Update 100k: In the case of Wang Shanxi , the Shaanxi Shuan case of Shanxi Province in Shanxi Province ,
has attracted great attention to the party and abroad . .
13 Dual 100k: In the case of the Shanxi , the government of Shanxi’s Shanxi Shengsheng incident , has once again
attracted attention from home and abroad .
14 Baseline Unsampled: At a time , the central authorities ' instructions have aroused concern at home and
abroad , and the central authorities have issued instructions to investigate the incident .
15 Update Unsampled: At one point at home and abroad , there was a great deal of attention at home and abroad ,
and the central authorities demanded a strict investigation .
16 Dual Unsampled: The incident , which was triggered by the black brick kiln of Wang Wei - bing , has once
aroused concern at home and abroad , and the central authorities ’ high - level instructions have been set for
investigation .
Table 4: Two translation snippets from the 100k-sample and unsampled Chinese-English experiments, with several named entities highlighted in corresponding colors. All the embeddings incorporated are extended embeddings. The @@ symbol is the token-breaking symbol produced by BPE processing.
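As a small utility for reading these snippets, the @@ markers can be undone by merging marked tokens back into words. This follows the common subword-nmt convention, where @@ at the end of a token indicates that the next token continues the same word; it is our own helper, not code from the paper:

```python
def undo_bpe(tokens):
    """Merge BPE subword tokens back into words.

    Follows the subword-nmt '@@' convention used in Table 4: a token
    ending in '@@' continues into the next token."""
    words, buffer = [], ""
    for tok in tokens:
        if tok.endswith("@@"):
            buffer += tok[:-2]          # strip the marker, keep accumulating
        else:
            words.append(buffer + tok)  # this token closes the current word
            buffer = ""
    if buffer:                          # guard against a dangling continuation
        words.append(buffer)
    return words

merged = undo_bpe("Think@@ Pad T@@ 61 p".split())
# ['ThinkPad', 'T61', 'p']
```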
system     accuracy
baseline   29.10%
update     32.09%
dual       33.58%

Table 5: Human-evaluated singleton word translation accuracies on the Chinese-English test set.

improve the singleton translation accuracy. This agrees with our observations as presented in Table 4.

On the other hand, while the BLEU scores improve significantly over the baselines under the low-resource settings with extended embedding incorporation, we did notice that the systems with embeddings incorporated tend to produce repetitive output more frequently (e.g. sentences 4 and 5), for which we do not have a very good explanation. We conjecture that this problem could be remedied by coverage modeling techniques such as Tu et al. (2016) and Wu et al. (2016), but leave its verification as future work. We also acknowledge that the improvement in rare word translation is not obvious under low-resource settings (e.g. sentences 11-13), because the translation outputs are often too noisy to reveal much of a useful qualitative trend.
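The degenerate repetitive outputs noted above (e.g. the hyphen runs in sentences 4 and 5 of Table 4) can also be flagged automatically with a repeated-n-gram heuristic. This diagnostic is our own sketch, not part of the paper's evaluation:

```python
def repetition_ratio(tokens, n=2):
    """Fraction of n-grams in a token sequence that are repeats.

    A rough heuristic for flagging degenerate repetitive output such
    as sentences 4 and 5 in Table 4 (our own diagnostic, not from
    the paper). Returns 0.0 for sequences too short to form n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    # 0.0 means every n-gram is distinct; values near 1.0 mean the
    # output is dominated by a small set of repeated n-grams.
    return 1.0 - len(set(ngrams)) / len(ngrams)

r = repetition_ratio("notebook notebook notebook computers".split())
```

A threshold on this ratio, calibrated against reference translations, would let one count repetition failures per system instead of eyeballing snippets.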
[Figure 3: four scatter plots of System BLEU (y-axis) vs. Baseline BLEU (x-axis), both ranging 0-100. Panels: (a) ZH-EN, Extended Update Embedding; (b) ZH-EN, Extended Dual Embedding; (c) DE-EN, Extended Update Embedding; (d) DE-EN, Extended Dual Embedding.]

Figure 3: Sentence-level BLEU scatter plots between the baseline and embedding-incorporated systems. The dots on the upper-left side of the red line correspond to system output sentences that are better than the baseline, and vice versa.
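The per-sentence scores behind Figure 3 can be approximated with a smoothed sentence-level BLEU. The sketch below uses add-1 smoothing on the n-gram precisions and stands in for the MultEval computation cited in the text; the toolkit's exact smoothing scheme may differ:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level BLEU (add-1 smoothing on n-gram
    precisions), an approximation of the per-sentence scores
    plotted in Figure 3."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped n-gram matches, as in corpus BLEU
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        # add-1 smoothing keeps short sentences from scoring exactly zero
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short hypotheses
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)

perfect = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")
partial = sentence_bleu("the cat sat", "the cat sat on the mat")
```

Scoring each test sentence for two systems this way yields exactly the paired (baseline, system) points that the scatter plots visualize.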
[Figure 4: two bar plots of average norm of update (y-axis) against word-frequency buckets (x-axis: 1-5, 5-25, 25-100, 100-500, 500-2.5k, 2.5k-10k, 10k-50k, 50k+). Panels: (a) ZH-EN; (b) DE-EN.]

Figure 4: Average Norm of Update on Word Embeddings Grouped by Word Frequency in Training Data
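The quantity plotted in Figure 4 is straightforward to compute: the L2 norm of each word's embedding change over training, averaged within frequency buckets. A minimal sketch (our own reconstruction of the measurement, with hypothetical toy inputs):

```python
import math
from collections import defaultdict

def update_norms_by_bucket(emb_before, emb_after, freq, buckets):
    """Average L2 norm of the embedding change per frequency bucket.

    emb_before / emb_after: dicts mapping word -> vector (list of floats),
    taken before and after NMT training.
    freq: dict mapping word -> frequency in the training data.
    buckets: list of (low, high) half-open frequency ranges.
    Mirrors the quantity plotted in Figure 4."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for word, before in emb_before.items():
        after = emb_after[word]
        norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
        for lo, hi in buckets:
            if lo <= freq[word] < hi:
                sums[(lo, hi)] += norm
                counts[(lo, hi)] += 1
                break
    return {b: sums[b] / counts[b] for b in sums}

# Hypothetical two-word example: 'cat' is rare, 'the' is very frequent.
emb0 = {"cat": [0.0, 0.0], "the": [1.0, 0.0]}
emb1 = {"cat": [3.0, 4.0], "the": [1.0, 1.0]}
norms = update_norms_by_bucket(emb0, emb1, {"cat": 3, "the": 50000},
                               buckets=[(1, 5), (5, 25), (25, 10**9)])
```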
Quantitative Analysis

For quantitative analysis, we focus on the small-scale experiments (trained with the 100k sample of parallel data) with extended embedding incorporation, as they seem to show the most interesting improvements in terms of BLEU scores.

We started by computing sentence-level BLEU with the MultEval toolkit (Clark et al., 2011) and generating scatter plots of the sentence-level BLEU scores of the update and dual embedding systems against the baseline system for each output sentence decoded on the test set, as shown in Figure 3. The purpose of this analysis is to examine the variance of the output sentences before and after embedding incorporation. It can first be noticed that across all embedding incorporation methods, the dots are shifted to the upper-left side of the red line, which means the sentence-level BLEU score tends to increase after incorporation. This agrees with the increase of the corpus-level BLEU scores in Table 2 and Table 3. On the other hand, all embedding incorporation methods incur a similar amount of variance in the translation outputs, even on different language pairs. In terms of comparison across incorporation strategies, the update strategy seems to incur slightly more drastic BLEU score changes (dots close to the right and upper parts of the horizontal and vertical axes, respectively), but the difference is not pronounced enough to make a strong argument.

Another question we are interested in is the norm of the update on the word embeddings during the NMT training process. More specifically, for each word in the dictionary, we take its word embedding before and after the training process and compute the norm of the difference. Figure 4 shows the norm of update grouped by word frequency in the training data. It should be noted that the norm of update increases roughly linearly with word frequency up to 50,000. This implies that in each iteration, unless the word has been seen extremely frequently, the norm of the update performed on the word embedding is about the same on average. We also see that the norm of update under the dual incorporation strategy is consistently lower than under the update incorporation strategy. Because the pre-trained part of the embedding is fixed, and the dual strategy is essentially learning a correction term over the pre-trained word embedding rather than rewriting the pre-trained value completely, we conjecture that the fixed part of the dual embedding prevents the updated part from performing too much correction over its pre-trained value. This conservativeness in performing updates may account for the extra robustness of dual embedding incorporation we observed in the qualitative analysis.

5 Conclusion

Our analysis of using source-side monolingual word embeddings in NMT indicates that (1) the source-side embeddings should be updated during NMT training; (2) the source-side embeddings are more effective when bilingual training data is limited, especially when OOV rates are high (moreover, source-side embedding incorporation is also useful under some high-resource settings when incorporated properly); and (3) the effect of source-side word embeddings strengthens when extra monolingual data is provided for training, and the domain of the monolingual data also seems to matter.

We recommend that incorporating pre-trained embeddings as input become standard practice for NMT when bilingual training data is scarce, especially when extra source-side monolingual data is available. While incorporating pre-trained embeddings in high-resource settings may also be helpful, we advise that extra caution be used to ensure the monolingual data is in-domain, and that an appropriate incorporation strategy be selected.

References

Mostafa Abdou, Vladan Gloncak, and Ondřej Bojar. 2017. Variable mini-batch sizing and pre-trained embeddings. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 680–686. http://aclweb.org/anthology/W17-4780.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146. http://aclweb.org/anthology/Q17-1010.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). Pages 169–214. http://aclweb.org/anthology/W17-4717.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016a. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 551–561. http://aclweb.org/anthology/D/D16/D16-1053.pdf.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016b. Semi-supervised learning for neural machine translation. Pages 1965–1974. https://doi.org/10.18653/v1/P16-1185.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pages 176–181.

Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 148–156. http://aclweb.org/anthology/W17-4715.

Mattia Antonino Di Gangi and Marcello Federico. 2017. Monolingual embeddings for low resourced neural machine translation. In Proceedings of the 14th International Workshop on Spoken Language Translation, pages 97–104.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 334–343. https://doi.org/10.3115/v1/P15-1033.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327. http://www.aclweb.org/anthology/Q16-1023.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. Pages 67–72. http://www.aclweb.org/anthology/P17-4012.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume: Proceedings of the Demo and Poster Sessions. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180. http://www.aclweb.org/anthology/P07-2045.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. http://aclweb.org/anthology/P/P16/P16-1101.pdf.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5:101–115. https://transacl.org/ojs/index.php/tacl/article/view/1028.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. 2017. Improving word sense disambiguation in neural machine translation with sense embeddings. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 11–19. http://aclweb.org/anthology/W17-4702.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. Pages 86–96. http://www.aclweb.org/anthology/P16-1009.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. Pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Coverage-based neural machine translation. arXiv.org.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1535–1545. https://aclweb.org/anthology/D16-1160.