Recurrent Dropout Without Memory Loss
Abstract

This paper presents a novel approach to recurrent neural network (RNN) regularization. Unlike the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections, in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as regular feed-forward dropout, and we demonstrate its effectiveness for the Long Short-Term Memory network, the most popular type of RNN cell. Our experiments on NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.

1 Introduction

Recurrent Neural Networks (RNNs), LSTMs in particular, have recently become a popular tool among NLP researchers for their superior ability to model and learn from sequential data. These models have shown state-of-the-art results on various public benchmarks, ranging from sentence classification (Wang et al., 2015; Irsoy and Cardie, 2014; Liu et al., 2015) and various tagging problems (Dyer et al., 2015) to language modelling (Kim et al., 2015; Zhang et al., 2015), text generation (Zhang and Lapata, 2014) and sequence-to-sequence prediction tasks (Sutskever et al., 2014).

Despite their excellent ability to capture and learn complex linguistic phenomena, RNN architectures are prone to overfitting. Among the most widely used techniques to avoid overfitting in neural networks is dropout regularization (Hinton et al., 2012). Since its introduction it has become a standard method for neural network regularization. While showing significant improvements when used in feed-forward architectures, e.g., Convolutional Neural Networks (Krizhevsky et al., 2012), the application of dropout in RNNs has been somewhat limited. Indeed, so far dropout in RNNs has been applied in the same fashion as in feed-forward architectures: it is typically injected in input-to-hidden and hidden-to-output connections, i.e., along the input axis, but not between the recurrent connections (time axis). Given that RNNs are mainly used to model sequential data with the goal of capturing short- and long-term interactions, it seems natural to also regularize the recurrent weights. This observation has led us and other researchers (Moon et al., 2015; Gal, 2015) to the idea of applying dropout to the recurrent connections in RNNs.

In this paper we propose a novel recurrent dropout technique and demonstrate that our method is superior to the other recurrent dropout methods recently proposed in (Moon et al., 2015; Gal, 2015). Additionally, we answer the following questions, which help to understand how best to apply recurrent dropout: (i) how to apply dropout in the recurrent connections of the LSTM architecture in a way that prevents possible corruption of the long-term memory; (ii) what is the relationship between our recurrent dropout and the widely adopted dropout in input-to-hidden and hidden-to-output connections; (iii) how the dropout mask in RNNs should be sampled: once per step or once per sequence. The latter question of mask sampling appears to be crucial in some cases for making recurrent dropout work at all and, to the best of our knowledge, has received very little attention in the literature. Our work is the first to provide an empirical evaluation of the differences between these two sampling approaches.

Regarding empirical evaluation, we first highlight the problem of information loss in the memory cells of LSTMs when applying recurrent dropout. We demonstrate that previous approaches, which drop hidden state vectors, cause loss of memory, while our proposed method of applying the dropout mask to the hidden state update vectors does not suffer from this problem. We experiment on three widely adopted NLP tasks: word- and character-level Language Modeling and Named Entity Recognition. The results demonstrate that our recurrent dropout helps to achieve better regularization and yields improvements across all the tasks, even when combined with conventional feed-forward dropout. Furthermore, we compare our dropout scheme with the recently proposed alternative recurrent dropout methods and show that our technique is superior in almost all cases.

2 Related Work

Gal (2015) uses insights from variational Bayesian inference to propose a variant of LSTM with dropout that achieves consistent improvements over a baseline architecture without dropout.

The main contribution of this paper is a new recurrent dropout technique, which is most useful in gated recurrent architectures such as LSTMs and GRUs. We demonstrate that applying dropout to arbitrary vectors in LSTM cells may lead to loss of memory, thus hindering the ability of the network to encode long-term information. In other words, our technique allows adding a strong regularizer on the model weights responsible for learning short- and long-term dependencies without affecting the ability to capture long-term relationships, which are especially important to model when dealing with natural language. Finally, we compare our method with the alternative recurrent dropout methods recently introduced in (Moon et al., 2015; Gal, 2015) and demonstrate that our method achieves better results.
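To see concretely why masking the memory cell itself is harmful, note that if a dropout mask with keep probability p is applied to the cell state at every time step, a unit's accumulated value survives T steps intact only with probability p^T. A small illustrative calculation (not part of our experimental code):

    # Probability that a memory-cell value survives T steps when the cell
    # state itself is masked at every step with keep probability p_keep.
    p_keep = 0.75                # i.e., a dropout rate of 0.25
    for T in (10, 35, 100):
        print(T, p_keep ** T)    # ~0.056, ~4.2e-05, ~3.2e-13

Even at a moderate dropout rate, virtually no long-term information survives a long sequence, which is exactly what masking the hidden state update instead avoids.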
Figure 1: Illustration of the three types of dropout in recurrent connections of LSTM networks. Dashed
arrows refer to dropped connections. Input connections are omitted for clarity.
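To make the three options of Figure 1 concrete, the following minimal numpy sketch (illustrative code, not the implementation used in our experiments) shows one LSTM step with the recurrent dropout mask applied at each of the three candidate sites: the previous hidden state (Gal, 2015), the memory cell (Moon et al., 2015), and the hidden state update proposed here.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b, rate=0.25, scheme="update", rng=None):
        # One LSTM step with recurrent dropout applied at one of three sites:
        #   scheme == "gal":    mask the previous hidden state h_{t-1} (Gal, 2015)
        #   scheme == "moon":   mask the memory cell c_t (Moon et al., 2015)
        #   scheme == "update": mask the hidden state update g_t (this paper)
        rng = rng if rng is not None else np.random.default_rng(0)
        keep = 1.0 - rate
        mask = lambda v: v * rng.binomial(1, keep, size=v.shape) / keep
        if scheme == "gal":
            h_prev = mask(h_prev)
        n = h_prev.shape[0]
        z = W @ x + U @ h_prev + b      # stacked gate pre-activations, shape (4n,)
        i = sigmoid(z[:n])              # input gate
        f = sigmoid(z[n:2 * n])         # forget gate
        o = sigmoid(z[2 * n:3 * n])     # output gate
        g = np.tanh(z[3 * n:])          # candidate hidden state update
        if scheme == "update":
            g = mask(g)                 # the memory cell is never zeroed directly
        c = f * c_prev + i * g          # additive long-term memory path
        if scheme == "moon":
            c = mask(c)                 # zeroing here erases accumulated memory
        h = o * np.tanh(c)
        return h, c

Only the "update" variant leaves the additive path through the memory cell c untouched, so previously stored information can never be erased by the mask.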
Table 2: Perplexity scores of the LSTM network on the word level Language Modeling task (lower is better). The upper and lower parts of the table report results without and with forward dropout, respectively. Networks with forward dropout use 0.2 and 0.5 dropout rates in input and output connections, respectively. Values in bold show the best results for each of the recurrent dropout schemes, with and without forward dropout.
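The per-step and per-sequence rows of the tables differ only in how often the dropout mask is redrawn. The two sampling strategies can be sketched as follows (illustrative sizes, not our experimental configuration):

    import numpy as np

    rng = np.random.default_rng(0)
    rate, n_steps, n_units = 0.25, 35, 100   # illustrative values

    def draw_mask():
        # Inverted dropout: zero with probability `rate`, rescale survivors.
        return rng.binomial(1, 1.0 - rate, size=n_units) / (1.0 - rate)

    # Per-sequence sampling: one mask drawn up front and reused at every
    # time step, so the same units stay dropped for the whole sequence.
    per_sequence_masks = [draw_mask()] * n_steps

    # Per-step sampling: an independent mask drawn at each time step.
    per_step_masks = [draw_mask() for _ in range(n_steps)]

Either list of masks is then applied inside the recurrent cell, e.g., to the hidden state update vector in the sketch above.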
trained with the dropout scheme of (Gal, 2015). We make the following observations: (i) our approach shows better results than the alternatives; (ii) per-step mask sampling is better when dropping the hidden state directly; (iii) on this task our method using per-step sampling seems to yield results similar to per-sequence sampling; (iv) in this case forward dropout yields better results than any of the three recurrent dropouts; and finally (v) both our approach and that of (Gal, 2015) are effective when combined with forward dropout, though ours is more effective.

We make the following observations: (i) dropping hidden state updates yields better results than dropping hidden states; (ii) per-step mask sampling is better when dropping the hidden state directly; (iii) contrary to our expectations, when we apply dropout to hidden state updates, per-step sampling seems to yield results similar to per-sequence sampling; (iv) applying dropout to hidden state updates rather than hidden states in some cases leads to a perplexity decrease of more than 30 points; and finally (v) our approach is effective even when combined with forward dropout: for LSTMs we are able to bring down perplexity on the validation set from 130 to 91.6.

To demonstrate the effect of our approach on the learning process, we also present learning curves of LSTM networks trained with and without recurrent dropout (Fig. 2). Models trained using our recurrent dropout scheme converge more slowly than models without dropout and usually have larger training error and lower validation error. This behaviour is consistent with what is expected from a regularizer and is similar to the effect of feed-forward dropout applied to non-recurrent networks (Hinton et al., 2012).

4.3 Character Level Language Modeling

Data. We train our networks on the dataset described in the previous section. It contains approximately 6 million characters and a vocabulary of 50 characters. We use the provided train, validation and test partitions.

Setup. We use networks with 1024 units to solve the character level LM task. The characters are …
Dropout rate  Sampling       Moon et al. (2015)   Gal (2015)      Ours
                             Valid    Test        Valid   Test    Valid   Test
0.0           –              1.460    1.457       1.460   1.457   1.460   1.457
0.25          per-step       1.435    1.394       1.345   1.308   1.338   1.301
0.5           per-step       1.610    1.561       1.387   1.348   1.355   1.316
0.25          per-sequence   1.433    1.390       1.341   1.304   1.356   1.319
0.5           per-sequence   1.691    1.647       1.408   1.369   1.496   1.450
0.0           –              1.362    1.326       1.362   1.326   1.362   1.326
0.25          per-step       1.471    1.428       1.381   1.344   1.358   1.321
0.5           per-step       1.668    1.622       1.463   1.425   1.422   1.380
0.25          per-sequence   1.455    1.413       1.387   1.348   1.403   1.363
0.5           per-sequence   1.681    1.637       1.477   1.435   1.567   1.522
Table 3: Bits-per-character scores of the LSTM network on the character level Language Modelling task (lower is better). The upper and lower parts of the table report results without and with forward dropout, respectively. Networks with forward dropout use 0.2 and 0.5 dropout rates in input and output connections, respectively. Values in bold show the best results for each of the recurrent dropout schemes, with and without forward dropout.
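For orientation, bits-per-character is the base-2 cross-entropy per character, so it converts directly into a per-character perplexity (a conversion note for the reader, independent of our experiments):

    import math

    bpc = 1.301                   # best test score in Table 3
    print(2 ** bpc)               # per-character perplexity, ~2.46
    print(math.log2(2 ** bpc))    # back to bits-per-character, 1.301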
pairs their ability to remember even further. The best result over most of the datasets is achieved by the GRU network with recurrent dropout. The only exception is the Twitter2015 dataset, where the LSTM network shows better results.

5 Conclusions

This paper presents a novel recurrent dropout method specifically tailored to gated recurrent neural networks. Our approach is easy to implement and is even more effective when combined with conventional forward dropout. We have shown that for LSTMs and GRUs applying dropout to arbitrary cell vectors results in suboptimal performance. We discuss in detail the cause of this effect and propose a simple solution to overcome it. The effectiveness of our approach is verified on three different public NLP benchmarks.
Our findings, along with our empirical results, allow us to answer the questions posed in Section 1: (i) while it is straightforward to use dropout in vanilla RNNs due to their strong similarity with feed-forward architectures, its application to LSTM networks is not so straightforward; we demonstrate that recurrent dropout is most effective when applied to hidden state update vectors in LSTMs rather than to hidden states; (ii) we observe an improvement in the network's performance when our recurrent dropout is coupled with the standard forward dropout, though the extent of this improvement depends on the values of the dropout rates; (iii) contrary to our expectations, networks trained with per-step and per-sequence mask sampling produce similar results when using our recurrent dropout method, both being better than the dropout scheme proposed by Moon et al. (2015).

While our experimental results show that applying the recurrent dropout method leads to significant improvements across various NLP benchmarks (especially when combined with conventional forward dropout), its benefits for other tasks, e.g., sequence-to-sequence prediction, or other domains, e.g., Speech Recognition, remain unexplored. We leave this as future work.

Acknowledgments

This project has received funding from the European Union's Framework Programme for Research and Innovation HORIZON 2020 (2014-2020) under the Marie Skłodowska-Curie Agreement No. 641805. Stanislau Semeniuta thanks the support from Pattern Recognition Company GmbH. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.

References

[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

[Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099.

[Bluche et al.2015] Theodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2015. Where to apply dropout in recurrent neural networks for handwriting recognition? In 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Tunis, Tunisia, August 23-26, 2015, pages 681–685.

[Cho et al.2014] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

[Dauphin et al.2015] Yann N. Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. 2015. RMSProp and equilibrated adaptive learning rates for non-convex optimization. CoRR, abs/1502.04390.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In ACL, pages 334–343. Association for Computational Linguistics.

[Gal2015] Yarin Gal. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.

[Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728. Association for Computational Linguistics.

[Karpathy et al.2015] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078.

[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.

[Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

[Le et al.2015] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941.

[Liu et al.2015] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. 2015. Multi-timescale long short-term memory neural network for modelling sentences and documents. In ACL. Association for Computational Linguistics.

[Mikolov et al.2011] T. Mikolov, S. Kombrink, L. Burget, J.H. Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531, May.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

[Mikolov et al.2014] Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. CoRR, abs/1412.7753.

[Moon et al.2015] Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song. 2015. RnnDrop: A novel dropout for RNNs in ASR. In Automatic Speech Recognition and Understanding (ASRU).

[Ng et al.2015] Joe Yue-Hei Ng, Matthew J. Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. CoRR, abs/1503.08909.

[Pascanu et al.2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1310–1318.

[Pham et al.2013] Vu Pham, Christopher Kermorvant, and Jérôme Louradour. 2013. Dropout improves recurrent neural networks for handwriting recognition. CoRR, abs/1312.4569.

[Rosenthal et al.2015] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 451–463, Denver, Colorado, June. Association for Computational Linguistics.

[Severyn and Moschitti2015] Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pages 959–962.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Wang et al.2015] Xin Wang, Yuanchao Liu, Chengjie Sun, Baoxun Wang, and Xiaolong Wang. 2015. Predicting polarities of tweets by composing word embeddings with long short-term memory. In ACL, pages 1343–1353. Association for Computational Linguistics.

[Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.

[Zhang and Lapata2014] Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680. Association for Computational Linguistics.

[Zhang et al.2015] Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Tree recurrent neural networks with application to language modeling. CoRR, abs/1511.00060.