How Do Source-Side Monolingual Word Embeddings Impact Neural Machine Translation?
Leveraging the information encoded in pre-trained monolingual word embeddings (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) is a common practice in various natural language processing tasks, for example parsing (Dyer et al., 2015; Kiperwasser and Goldberg, 2016), relation extraction (Peng et al., 2017), natural language understanding (Cheng et al., 2016a), and sequence labeling (Ma and Hovy, 2016). While research in neural machine translation (NMT) seeks to make use of monolingual data through techniques such as back-translation (Sennrich et al., 2016a), the use of pre-trained monolingual word embeddings does not seem to be standard practice.

In this paper, we study the interaction between the source-side monolingual word embeddings

Various works have explored the use of monolingual data in NMT. On using target-side monolingual data, Gulcehre et al. (2015) incorporated a pre-trained RNN language model, while Sennrich et al. (2016a) trained an auxiliary model with the reverse translation direction to construct pseudo-parallel data. Currey et al. (2017) copied target monolingual data to the source side and mixed it with real parallel data to train a unified encoder-decoder model. It should be noted that although some of these techniques are very popular in practice, they are only capable of incorporating target-side monolingual data, and hence they are outside the immediate scope of this paper.

On using source-side monolingual data, Cheng et al. (2016b) and Zhang and Zong (2016) proposed self-learning for NMT, which learns the translation by reconstruction. Several works have also shown promising results in incorporating pre-trained source-side word embeddings into NMT systems. Rios Gonzales et al. (2017) used sense embeddings to improve word sense disambiguation

[Figure: model architecture diagram; labels recovered from the extraction: "Attention Mechanism"; "Context Vector (input to the decoder)"]
[Figure 2: four line plots of the objective function (log-likelihood, y-axis) for the baseline, update, and dual systems. Panels: (a) Training Curve on 100k Sample and (b) Training Curve on Unsampled Data, plotted against batch updates; (c) Development Curve on 100k Sample and (d) Development Curve on Unsampled Data, plotted against epochs.]

Figure 2: Plot of objective function vs. batch updates / epochs for Chinese-English training data
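For concreteness, the three incorporation strategies compared in these curves (fixed, update, and dual) can be sketched as embedding lookups. This is an illustrative reconstruction rather than the authors' code; in particular, whether the dual model sums or concatenates its two parts is not fully specified in this excerpt, so the sketch assumes a sum, matching the "correction term" description in the quantitative analysis.

```python
# Illustrative sketch of the three embedding incorporation strategies
# (not the authors' code). Vectors are plain lists of floats; the
# "trainable" parts are what a real NMT system would update by
# backpropagation.

def fixed_lookup(word, pretrained):
    # 'fixed': use the pre-trained vector as-is; it is never updated.
    return pretrained[word]

def update_lookup(word, finetuned):
    # 'update': initialize from the pre-trained vectors, then let NMT
    # training modify them freely (finetuned starts as a copy).
    return finetuned[word]

def dual_lookup(word, pretrained, correction):
    # 'dual': a frozen pre-trained part plus a trainable part. The text
    # describes the trainable part as a learned correction term, so the
    # sketch sums the two (an assumption on our part).
    return [p + c for p, c in zip(pretrained[word], correction[word])]

pretrained = {"cat": [1.0, 2.0]}
correction = {"cat": [0.1, -0.2]}   # starts near zero, learned in training
vec = dual_lookup("cat", pretrained, correction)
```

Under this reading, freezing the pre-trained part bounds how far the effective embedding can drift from its initialization, which is consistent with the lower update norms reported for the dual strategy later in the paper.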
over the baseline sometimes. This strengthens our hypothesis above. On the other hand, if we allow the source-side word embedding to be updated, we can observe improvements over the baseline more often. We hence conclude that pre-trained source-side monolingual word embeddings cannot directly benefit NMT performance, and adjustment by NMT training is necessary for them to benefit the NMT system.

Comparing the Chinese-English and German-English experiments, we notice that incorporating embeddings for German-English yields significant improvements over the baselines more often, mainly because of the improvements obtained by extended fixed embedding incorporation. We attribute this to the fact that the German-English test set generally has a higher pre-BPE OOV rate across the different sample sizes of the parallel training data.

Comparing the results under different monolingual data scales, for Chinese-English experiments the extended embedding always performs better than the small embedding, while for German-English experiments we obtained mixed results when comparing the two kinds of embeddings. This can be explained by the fact that the parallel training data has a minor domain mismatch with the monolingual training data (parliament proceedings vs. news). In terms of incorporation strategy, it can be observed that while the dual strategy is not very helpful with small word embeddings, improvements over the update strategy can almost always be obtained with extended embeddings. We can also notice that under the German-English setting with extended embeddings, the difference between the update and dual embedding methods is more often significant than in the Chinese-English experiments, signaling that the dual embedding method is more robust to domain variance between the monolingual data and the bi-text.

Comparing the results obtained under different parallel training data scales, it can be observed that the benefit of source-side monolingual word embeddings (compared to the baseline) seems to decrease as the amount of data increases, verifying the intuition that extra monolingual information is most useful under low-resource settings. On the other hand, the dual embedding model is able to obtain the largest performance gains over update initialization both for the Chinese-English and the German-English training
                              small embedding               extended embedding
Sentence Pairs   Baseline     fixed    update   dual        fixed    update   dual
100,000          15.99        15.36†   16.44    16.89†      15.90    17.55†   17.56†
250,000          21.72        22.05    22.16    22.08       22.05    23.04†   23.38†
500,000          26.28        25.48†   26.51    26.29       26.13    27.09†   26.76†
1,000,000        29.75        29.54    29.59    29.83       29.39    30.60†   30.89†
2,013,142        32.44        31.40†   33.21†   33.21       32.61    33.26†   34.03*†

Table 2: Chinese-English experiment results with different data sizes. Asterisks are appended when the difference between the update and dual method translation output is significant, and daggers are appended when the difference between any system output and the baseline is significant. The significance level is p < 0.05.
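The significance markers in these captions can be computed with paired bootstrap resampling over per-sentence scores, in the spirit of Koehn (2004) and the MultEval toolkit (Clark et al., 2011) cited later. The sketch below is a simplified illustration over a generic per-sentence metric, not the paper's exact procedure:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Simplified paired bootstrap test for system A vs. system B.

    scores_a / scores_b: per-sentence metric scores of the two systems
    on the same test set (paired by sentence index). Returns the
    fraction of bootstrap resamples in which system B's mean matches
    or exceeds system A's, an estimate of the p-value for A > B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins_b = 0
    for _ in range(n_resamples):
        # resample the test set with replacement, keeping pairs aligned
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b >= mean_a:
            wins_b += 1
    return wins_b / n_resamples

# Toy example: system A is better on every sentence by a small margin.
a = [0.32, 0.41, 0.38, 0.45, 0.36] * 40
b = [0.30, 0.39, 0.37, 0.42, 0.35] * 40
p = paired_bootstrap(a, b)
# A difference is called significant when p < 0.05.
```

Because resampling keeps the sentence pairing intact, per-sentence difficulty cancels out, which is what makes the test sensitive to small but consistent system differences.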
Table 3: German-English experiment Results with Different Data Sizes. Asterisks are appended when the
difference between update and dual method translation output is significant, and daggers are appended
when difference between any system output and the baseline is significant. The significance level is
p < 0.05.
data with extended embeddings. This indicates that the dual embedding model is able to get the best of both worlds as expected: it leans more on the initialized word embeddings in low-resource settings, but is able to learn useful supplementary features when a relatively large amount of parallel training data is available.

4.2 Analysis

Qualitative Analysis

Translations of the test sets were examined manually to evaluate the qualitative improvements obtained by incorporating pre-trained word embeddings within NMT. In the case of the small embedding, we have learned from the significance tests that many of these improvements are not statistically significant. But even for the significant ones, we did not observe very specific patterns of qualitative improvement.

In the case of the extended embedding, however, we observed a specific improvement in translation adequacy for rare words (mainly named entities) in the training data, and the use of the dual embedding model often brings further improvements in that respect. Such improvement is evident in Table 4 when comparing the translation results from the systems trained with unsampled parallel data (sentences 6-8 and 14-16).

To further verify the observation above, we performed a simple human evaluation of the translation output from the Chinese-English systems trained on unsampled parallel data with extended embedding incorporation. We first took the singleton words (before BPE) in the unsampled parallel data and filtered the test sentences containing these singleton words. We then manually read the filtered and shuffled test sentences and answered the yes/no question: are the singleton words appearing in the sentence translated correctly in the test output? We chose to analyze singleton words rather than OOV words because (1) the translation of singleton words is less noisy, and (2) if a word occurs in both the training set and the test set, it is more likely to occur in the monolingual data, and hence embedding incorporation will add extra information to translate these words. We found 226k singleton words in the training data and 134 occurrences of these words in the test data. The results are shown in Table 5, and we can see that both the update and dual incorporation methods
1 Source 1: Len@@ ovo 即将 要 在 7月 17日 ( 美国 时间 ) 推出 的 Think@@ Pad T@@ 61 p , 也是
T 系列 最@@ 高级 的 笔记本 电脑 , 给 我们 一 个 相当 不错 的 惊喜 : U@@ W@@ B ( Wis@@ air
Ul@@ tra Wi@@ deb@@ and ) .
2 Reference 1: The ThinkPad T61p that Lenovo is about to introduce on July 17 ( U.S. time ) ,
also the most advanced notebook computer of the T series , has given us a very pleasant surprise :
UWB ( Wisair Ultra Wideband ) .
3 Baseline 100k: On July 17 ( US time ) launched , the number of SWB’s highest notebook notebook computers ,
which is also a good source of notebook notebook computers . : UWB .
4 Update 100k: Stewart is scheduled to be launched on July 17 ( the US time ) to launch a very good surprise of the
top - - - - - - - - - - - - - - - - - - - - - - - - - - - - ranking computer computer to us .
5 Dual 100k: Founded to July 17 , Inc will be launched to be launched on the current model notebook , which is
also a well - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6 Baseline Unsampled: It will be launched on July 17 ( USA time ) as a top-class notebook computer and gives us
a rather good surprise: UWB Ultra Witra Wireless .
7 Update Unsampled: It is about to be launched on July 17 ( US time ) , as well as the T ’s most advanced
notebooks of notebooks, to us: UWB (Wisair Ultra Wireless and) .
8 Dual Unsampled: The Dell Pad T6p , which is about to be launched on July 17 ( US time ) , is also the T-series ’s
most advanced notebooks , giving us a rather good surprise: UWB ( Wisair Ultra Wideband ) .
9 Source 2: 由 王@@ 兵@@ 兵黑 砖@@ 窑 案 引发 的 山西 黑 砖@@ 窑 奴@@ 工 事件 , 曾 国内外 一
度 引起 关注 , 中央 高层 批示 要求 严查 .
10 Reference 2: The Shanxi black brick kiln slave labor incident touched off by the black brick kiln case of
Wang Bingbing once attracted attention from inside the country and abroad . The top leadership of the cen-
tral government had given directive demanding stern prosecution .
11 Baseline 100k: From the incident caused by Wang Wei , the Shanxi case caused by Wang Wei , and from home
and abroad , and his attention to the high - level instructions of the central authorities .
12 Update 100k: In the case of Wang Shanxi , the Shaanxi Shuan case of Shanxi Province in Shanxi Province ,
has attracted great attention to the party and abroad . .
13 Dual 100k: In the case of the Shanxi , the government of Shanxi’s Shanxi Shengsheng incident , has once again
attracted attention from home and abroad .
14 Baseline Unsampled: At a time , the central authorities ' instructions have aroused concern at home and
abroad , and the central authorities have issued instructions to investigate the incident .
15 Update Unsampled: At one point at home and abroad , there was a great deal of attention at home and abroad ,
and the central authorities demanded a strict investigation .
16 Dual Unsampled: The incident , which was triggered by the black brick kiln of Wang Wei - bing , has once
aroused concern at home and abroad , and the central authorities ’ high - level instructions have been set for
investigation .
Table 4: Two translation snippets from the 100k-sample and unsampled Chinese-English experiments, with several named entities highlighted in corresponding colors. All the embeddings incorporated are extended embeddings. The @@ symbol is the token-breaking symbol produced by BPE processing.
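As a small utility for reading these snippets, the @@ markers can be undone by merging marked tokens back into words. This follows the common subword-nmt convention, where @@ at the end of a token indicates that the next token continues the same word; it is our own helper, not code from the paper:

```python
def undo_bpe(tokens):
    """Merge BPE subword tokens back into words.

    Follows the subword-nmt '@@' convention used in Table 4: a token
    ending in '@@' continues into the next token."""
    words, buffer = [], ""
    for tok in tokens:
        if tok.endswith("@@"):
            buffer += tok[:-2]          # strip the marker, keep accumulating
        else:
            words.append(buffer + tok)  # this token closes the current word
            buffer = ""
    if buffer:                          # guard against a dangling continuation
        words.append(buffer)
    return words

merged = undo_bpe("Think@@ Pad T@@ 61 p".split())
# ['ThinkPad', 'T61', 'p']
```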
system     accuracy
baseline   29.10%
update     32.09%
dual       33.58%

Table 5: Human-evaluated singleton word translation accuracies on the Chinese-English test set.

improve the singleton translation accuracy. This agrees with our observations as presented in Table 4.

On the other hand, while the BLEU scores improve significantly over the baselines under the low-resource settings with extended embedding incorporation, we did notice that the systems with embeddings incorporated tend to produce repetitive output more frequently (e.g. sentences 4 and 5), for which we do not have a very good explanation. We conjecture that this problem could be remedied by coverage modeling techniques such as Tu et al. (2016) and Wu et al. (2016), but leave its verification as future work. We also acknowledge that the improvement in rare word translation is not obvious under low-resource settings (e.g. sentences 11-13), because the translation outputs are often too noisy to reveal much of a useful qualitative trend.
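The degenerate repetitive outputs noted above (e.g. the hyphen runs in sentences 4 and 5 of Table 4) can also be flagged automatically with a repeated-n-gram heuristic. This diagnostic is our own sketch, not part of the paper's evaluation:

```python
def repetition_ratio(tokens, n=2):
    """Fraction of n-grams in a token sequence that are repeats.

    A rough heuristic for flagging degenerate repetitive output such
    as sentences 4 and 5 in Table 4 (our own diagnostic, not from
    the paper). Returns 0.0 for sequences too short to form n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    # 0.0 means every n-gram is distinct; values near 1.0 mean the
    # output is dominated by a small set of repeated n-grams.
    return 1.0 - len(set(ngrams)) / len(ngrams)

r = repetition_ratio("notebook notebook notebook computers".split())
```

A threshold on this ratio, calibrated against reference translations, would let one count repetition failures per system instead of eyeballing snippets.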
[Figure 3: four scatter plots of System BLEU (y-axis) vs. Baseline BLEU (x-axis), both ranging 0-100. Panels: (a) ZH-EN, Extended Update Embedding; (b) ZH-EN, Extended Dual Embedding; (c) DE-EN, Extended Update Embedding; (d) DE-EN, Extended Dual Embedding.]

Figure 3: Sentence-level BLEU scatter plots between the baseline and embedding-incorporated systems. The dots on the upper-left side of the red line correspond to system output sentences that are better than the baseline, and vice versa.
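The per-sentence scores behind Figure 3 can be approximated with a smoothed sentence-level BLEU. The sketch below uses add-1 smoothing on the n-gram precisions and stands in for the MultEval computation cited in the text; the toolkit's exact smoothing scheme may differ:

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level BLEU (add-1 smoothing on n-gram
    precisions), an approximation of the per-sentence scores
    plotted in Figure 3."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped n-gram matches, as in corpus BLEU
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        # add-1 smoothing keeps short sentences from scoring exactly zero
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short hypotheses
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec / max_n)

perfect = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")
partial = sentence_bleu("the cat sat", "the cat sat on the mat")
```

Scoring each test sentence for two systems this way yields exactly the paired (baseline, system) points that the scatter plots visualize.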
[Figure 4: two bar plots of average norm of update (y-axis) against word-frequency buckets (x-axis: 1-5, 5-25, 25-100, 100-500, 500-2.5k, 2.5k-10k, 10k-50k, 50k+). Panels: (a) ZH-EN; (b) DE-EN.]

Figure 4: Average Norm of Update on Word Embeddings Grouped by Word Frequency in Training Data
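The quantity plotted in Figure 4 is straightforward to compute: the L2 norm of each word's embedding change over training, averaged within frequency buckets. A minimal sketch (our own reconstruction of the measurement, with hypothetical toy inputs):

```python
import math
from collections import defaultdict

def update_norms_by_bucket(emb_before, emb_after, freq, buckets):
    """Average L2 norm of the embedding change per frequency bucket.

    emb_before / emb_after: dicts mapping word -> vector (list of floats),
    taken before and after NMT training.
    freq: dict mapping word -> frequency in the training data.
    buckets: list of (low, high) half-open frequency ranges.
    Mirrors the quantity plotted in Figure 4."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for word, before in emb_before.items():
        after = emb_after[word]
        norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
        for lo, hi in buckets:
            if lo <= freq[word] < hi:
                sums[(lo, hi)] += norm
                counts[(lo, hi)] += 1
                break
    return {b: sums[b] / counts[b] for b in sums}

# Hypothetical two-word example: 'cat' is rare, 'the' is very frequent.
emb0 = {"cat": [0.0, 0.0], "the": [1.0, 0.0]}
emb1 = {"cat": [3.0, 4.0], "the": [1.0, 1.0]}
norms = update_norms_by_bucket(emb0, emb1, {"cat": 3, "the": 50000},
                               buckets=[(1, 5), (5, 25), (25, 10**9)])
```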
Quantitative Analysis

For quantitative analysis, we focus on the small-scale experiments (trained with the 100k sample of parallel data) with extended embedding incorporation, as they seem to show the most interesting improvements in terms of BLEU scores.

We started by computing sentence-level BLEU with the MultEval toolkit (Clark et al., 2011) and generating scatter plots of the sentence-level BLEU scores of the update and dual embedding systems against the baseline system for each output sentence decoded on the test set, as shown in Figure 3. The purpose of this analysis is to examine the variance of the output sentences before and after embedding incorporation. It can first be noticed that across all embedding incorporation methods, the dots are shifted to the upper-left side of the red line, which means the sentence-level BLEU score tends to increase after incorporation. This agrees with the increase of the corpus-level BLEU scores in Table 2 and Table 3. On the other hand, all embedding incorporation methods incur a similar amount of variance in the translation outputs, even on different language pairs. In terms of comparison across incorporation strategies, the update strategy seems to incur slightly more drastic BLEU score changes (dots close to the right and upper parts of the horizontal and vertical axes, respectively), but the difference is not pronounced enough to make a strong argument.

Another question we are interested in is the norm of the update on the word embeddings during the NMT training process. More specifically, for each word in the dictionary, we take its word embedding before and after the training process and compute the norm of the difference. Figure 4 shows the norm of update grouped by word frequency in the training data. It should be noted that the norm of update increases roughly linearly with word frequency up to 50,000. This implies that in each iteration, unless the word has been seen extremely frequently, the norm of the update performed on the word embedding is about the same on average. We also see that the norm of update under the dual incorporation strategy is consistently lower than under the update incorporation strategy. Because the pre-trained part of the embedding is fixed, and the dual strategy is essentially learning a correction term over the pre-trained word embedding rather than rewriting the pre-trained value completely, we conjecture that the fixed part of the dual embedding prevents the updated part from performing too much correction over its pre-trained value. This conservativeness in performing updates may account for the extra robustness of dual embedding incorporation we observed in the qualitative analysis.

5 Conclusion

Our analysis of using source-side monolingual word embeddings in NMT indicates that (1) the source-side embeddings should be updated during NMT training; (2) the source-side embeddings are more effective when bilingual training data is limited, especially when OOV rates are high (moreover, source-side embedding incorporation is also useful under some high-resource settings when incorporated properly); and (3) the effect of source-side word embeddings strengthens when extra monolingual data is provided for training, and the domain of the monolingual data also seems to matter.

We recommend that incorporating pre-trained embeddings as input become standard practice for NMT when bilingual training data is scarce, especially when extra source-side monolingual data is available. While incorporating pre-trained embeddings in high-resource settings may also be helpful, we advise that extra caution be used to ensure the monolingual data is in-domain, and that an appropriate incorporation strategy be selected.

References

Mostafa Abdou, Vladan Gloncak, and Ondřej Bojar. 2017. Variable mini-batch sizing and pre-trained embeddings. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 680–686. http://aclweb.org/anthology/W17-4780.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146. http://aclweb.org/anthology/Q17-1010.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). Pages 169–214. http://aclweb.org/anthology/W17-4717.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016a. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 551–561. http://aclweb.org/anthology/D/D16/D16-1053.pdf.

Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016b. Semi-supervised learning for neural machine translation. Pages 1965–1974. https://doi.org/10.18653/v1/P16-1185.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pages 176–181.

Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 148–156. http://aclweb.org/anthology/W17-4715.

Mattia Antonino Di Gangi and Marcello Federico. 2017. Monolingual embeddings for low resourced neural machine translation. In Proceedings of the 14th International Workshop on Spoken Language Translation, pages 97–104.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 334–343. https://doi.org/10.3115/v1/P15-1033.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327. http://www.aclweb.org/anthology/Q16-1023.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. Pages 67–72. http://www.aclweb.org/anthology/P17-4012.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume: Proceedings of the Demo and Poster Sessions. Association for Computational Linguistics, Prague, Czech Republic, pages 177–180. http://www.aclweb.org/anthology/P07-2045.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. http://aclweb.org/anthology/P/P16/P16-1101.pdf.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5:101–115. https://transacl.org/ojs/index.php/tacl/article/view/1028.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. 2017. Improving word sense disambiguation in neural machine translation with sense embeddings. In Proceedings of the Second Conference on Machine Translation. Association for Computational Linguistics, pages 11–19. http://aclweb.org/anthology/W17-4702.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. Pages 86–96. http://www.aclweb.org/anthology/P16-1009.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. Pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Coverage-based neural machine translation. arXiv.org.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1535–1545. https://aclweb.org/anthology/D16-1160.