
Recurrent Dropout without Memory Loss

Stanislau Semeniuta1 Aliaksei Severyn2 Erhardt Barth1


1 Universität zu Lübeck, Institut für Neuro- und Bioinformatik
{stas,barth}@inb.uni-luebeck.de
2 Google Research
[email protected]

arXiv:1603.05118v2 [cs.CL] 5 Aug 2016

Abstract

This paper presents a novel approach to recurrent neural network (RNN) regularization. Differently from the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for Long Short-Term Memory networks, the most popular type of RNN cell. Our experiments on NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.
1 Introduction
Recurrent Neural Networks, LSTMs in particular, have recently become a popular tool among NLP researchers for their superior ability to model and learn from sequential data. These models have shown state-of-the-art results on various public benchmarks ranging from sentence classification (Wang et al., 2015; Irsoy and Cardie, 2014; Liu et al., 2015) and various tagging problems (Dyer et al., 2015) to language modelling (Kim et al., 2015; Zhang et al., 2015), text generation (Zhang and Lapata, 2014) and sequence-to-sequence prediction tasks (Sutskever et al., 2014).

Having shown excellent ability to capture and learn complex linguistic phenomena, RNN architectures are prone to overfitting. Among the most widely used techniques to avoid overfitting in neural networks is the dropout regularization (Hinton et al., 2012). Since its introduction it has become, together with L2 weight decay, the standard method for neural network regularization. While showing significant improvements when used in feed-forward architectures, e.g., Convolutional Neural Networks (Krizhevsky et al., 2012), the application of dropout in RNNs has been somewhat limited. Indeed, so far dropout in RNNs has been applied in the same fashion as in feed-forward architectures: it is typically injected in input-to-hidden and hidden-to-output connections, i.e., along the input axis, but not between the recurrent connections (time axis). Given that RNNs are mainly used to model sequential data with the goal of capturing short- and long-term interactions, it seems natural to also regularize the recurrent weights. This observation has led us and other researchers (Moon et al., 2015; Gal, 2015) to the idea of applying dropout to the recurrent connections in RNNs.

In this paper we propose a novel recurrent dropout technique and demonstrate how our method is superior to other recurrent dropout methods recently proposed in (Moon et al., 2015; Gal, 2015). Additionally, we answer the following questions, which help to understand how to best apply recurrent dropout: (i) how to apply dropout in the recurrent connections of the LSTM architecture in a way that prevents possible corruption of the long-term memory; (ii) what is the relationship between our recurrent dropout and the widely adopted dropout in input-to-hidden and hidden-to-output connections; (iii) how the dropout mask in RNNs should be sampled: once per step or once per sequence. The latter question of sampling the mask appears to be crucial in some cases to make recurrent dropout work and, to the best of our knowledge, has received very little attention in the literature. Our work is the first one to provide an empirical evaluation of the differences between these two sampling approaches.
Regarding empirical evaluation, we first highlight the problem of information loss in memory cells of LSTMs when applying recurrent dropout. We demonstrate that previous approaches of dropping hidden state vectors cause loss of memory, while our proposed method of applying the dropout mask to hidden state update vectors does not suffer from this problem. We experiment on three widely adopted NLP tasks: word- and character-level Language Modeling and Named Entity Recognition. The results demonstrate that our recurrent dropout helps to achieve better regularization and yields improvements across all the tasks, even when combined with the conventional feed-forward dropout. Furthermore, we compare our dropout scheme with the recently proposed alternative recurrent dropout methods and show that our technique is superior in almost all cases.

2 Related Work

Neural Network models often suffer from overfitting, especially when the number of network parameters is large and the amount of training data is small. This has led to a lot of research directed towards improving their generalization ability. Below we primarily discuss some of the methods aimed at improving regularization of RNNs.

Pham et al. (2013) and Zaremba et al. (2014) have shown that LSTMs can be effectively regularized by using dropout in forward connections. While this already allows for effective regularization of recurrent networks, it is intuitive that introducing dropout also in the hidden state may force it to create more robust representations. Indeed, Moon et al. (2015) have extended the idea of dropping neurons in the forward direction and proposed to drop cell states as well, showing good results on a Speech Recognition task. Bluche et al. (2015) carry out a study to find where dropout is most effective, e.g., in input-to-hidden or hidden-to-output connections. The authors conclude that it is more beneficial to use it once in the correct spot, rather than to put it everywhere. Bengio et al. (2015) have proposed an algorithm called scheduled sampling to improve performance of recurrent networks on sequence-to-sequence labeling tasks. A disadvantage of this work is that the scheduled sampling is specifically tailored to this kind of task, which makes it impossible to use in, for example, sequence-to-label tasks. Gal (2015) uses insights from variational Bayesian inference to propose a variant of LSTM with dropout that achieves consistent improvements over a baseline architecture without dropout.

The main contribution of this paper is a new recurrent dropout technique, which is most useful in gated recurrent architectures such as LSTMs and GRUs. We demonstrate that applying dropout to arbitrary vectors in LSTM cells may lead to loss of memory, thus hindering the ability of the network to encode long-term information. In other words, our technique allows for adding a strong regularizer on the model weights responsible for learning short- and long-term dependencies without affecting the ability to capture long-term relationships, which are especially important to model when dealing with natural language. Finally, we compare our method with alternative recurrent dropout methods recently introduced in (Moon et al., 2015; Gal, 2015) and demonstrate that our method allows to achieve better results.

3 Recurrent Dropout

In this section we first show how the idea of feed-forward dropout (Hinton et al., 2012) can be applied to recurrent connections in vanilla RNNs. We then introduce our recurrent dropout method specifically tailored for gated architectures such as LSTMs and GRUs. We draw parallels and contrast our approach with alternative recurrent dropout techniques recently proposed in (Moon et al., 2015; Gal, 2015), showing that our method is favorable when considering potential memory loss issues in long short-term architectures.

3.1 Dropout in vanilla RNNs

Vanilla RNNs process the input sequences as follows:

  h_t = f(W_h [x_t, h_{t-1}] + b_h),    (1)

where x_t is the input at time step t; h_t and h_{t-1} are hidden vectors that encode the current and previous states of the network; W_h is a parameter matrix that models input-to-hidden and hidden-to-hidden (recurrent) connections; b_h is a vector of bias terms, and f is the activation function.

As RNNs model sequential data with a fully-connected layer, dropout can be applied by simply dropping the previous hidden state of a network.
Specifically, we modify Equation 1 in the following way:

  h_t = f(W_h [x_t, d(h_{t-1})] + b_h),    (2)

where d is the dropout function defined as follows:

  d(x) = mask ∗ x,   if train phase
         (1 − p) x,  otherwise,    (3)

where p is the dropout rate and mask is a vector, sampled from the Bernoulli distribution with success probability 1 − p.
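To make the sampling concrete, the dropout function d of Eq. 3 can be written in a few lines of NumPy. This is a minimal illustrative sketch rather than our released implementation; the helper sample_masks and its arguments are hypothetical names used only to show the per-step versus per-sequence sampling choice raised in the introduction.

```python
import numpy as np

def dropout(x, p, train):
    """Dropout as in Eq. 3: Bernoulli(1 - p) mask during training,
    scaling by (1 - p) at test time (p is the dropout rate)."""
    if train:
        mask = np.random.binomial(n=1, p=1.0 - p, size=x.shape).astype(x.dtype)
        return mask * x
    return (1.0 - p) * x

def sample_masks(hidden_size, num_steps, p, per_step=True):
    """Per-sequence sampling reuses one mask for all time steps of a sequence;
    per-step sampling draws a fresh mask at every step (see Section 4)."""
    if per_step:
        return np.random.binomial(1, 1.0 - p, size=(num_steps, hidden_size))
    mask = np.random.binomial(1, 1.0 - p, size=(1, hidden_size))
    return np.repeat(mask, num_steps, axis=0)
```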
3.2 Dropout in LSTM networks

Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997) have introduced the concept of gated inputs in RNNs, which effectively allow the network to preserve its memory over a larger number of time steps during both forward and backward passes, thus alleviating the problem of vanishing gradients (Bengio et al., 1994). Formally, this is expressed with the following equations:

  i_t = σ(W_i [x_t, h_{t-1}] + b_i)
  f_t = σ(W_f [x_t, h_{t-1}] + b_f)
  o_t = σ(W_o [x_t, h_{t-1}] + b_o)    (4)
  g_t = f(W_g [x_t, h_{t-1}] + b_g)

  c_t = f_t ∗ c_{t-1} + i_t ∗ g_t    (5)

  h_t = o_t ∗ f(c_t),    (6)

where i_t, f_t, o_t are the input, forget and output gates at step t; g_t is the vector of cell updates and c_t is the updated cell vector used to update the hidden state h_t; σ is the sigmoid function and ∗ is the element-wise multiplication.

Gal (2015) proposes to drop the previous hidden state when computing values of gates and updates of the current step, where he samples the dropout mask once for every sequence:

  i_t = σ(W_i [x_t, d(h_{t-1})] + b_i)
  f_t = σ(W_f [x_t, d(h_{t-1})] + b_f)
  o_t = σ(W_o [x_t, d(h_{t-1})] + b_o)    (7)
  g_t = f(W_g [x_t, d(h_{t-1})] + b_g)

Moon et al. (2015) propose to apply dropout directly to the cell values and use per-sequence sampling as well:

  c_t = d(f_t ∗ c_{t-1} + i_t ∗ g_t)    (8)

In contrast to the dropout techniques proposed by Gal (2015) and Moon et al. (2015), we propose to apply dropout to the cell update vector g_t as follows:

  c_t = f_t ∗ c_{t-1} + i_t ∗ d(g_t)    (9)

Different from the methods of (Moon et al., 2015; Gal, 2015), our approach does not require sampling of the dropout masks once for every training sequence. On the contrary, as we will show in Section 4, networks trained with a dropout mask sampled per step achieve results that are at least as good and often better than per-sequence sampling. Figure 1 shows the differences between the approaches to dropout.
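The scheme of Eq. 9 amounts to a one-line change inside the LSTM step: the dropout mask touches only the cell update vector g_t, never c_{t-1} or h_{t-1}. The sketch below is an illustration under our own naming and weight layout (a single weight matrix holding all four gate blocks), reusing NumPy and the dropout helper from the previous listing; it is not our released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_with_recurrent_dropout(x_t, h_prev, c_prev, W, b, p, train):
    """One LSTM step with dropout on the cell update vector g_t (Eq. 9).
    W has shape (input_dim + hidden_dim, 4 * hidden_dim); b has shape (4 * hidden_dim,)."""
    hidden_dim = h_prev.shape[-1]
    z = np.concatenate([x_t, h_prev], axis=-1) @ W + b       # joint affine map for all gates
    i = sigmoid(z[..., 0 * hidden_dim:1 * hidden_dim])        # input gate
    f = sigmoid(z[..., 1 * hidden_dim:2 * hidden_dim])        # forget gate
    o = sigmoid(z[..., 2 * hidden_dim:3 * hidden_dim])        # output gate
    g = np.tanh(z[..., 3 * hidden_dim:4 * hidden_dim])        # cell update vector
    c_t = f * c_prev + i * dropout(g, p, train)               # Eq. 9: drop only the update
    h_t = o * np.tanh(c_t)                                    # Eq. 6
    return h_t, c_t
```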
The approach of (Gal, 2015) differs from ours in the overall strategy: they consider the network's hidden state as input to subnetworks that compute gate values and cell updates, and the purpose of dropout is to regularize these subnetworks. Our approach considers the architecture as a whole, with the hidden state as its key part, and regularizes the whole network. The approach of (Moon et al., 2015), on the other hand, is seemingly similar to ours. Below we argue that our method is a more principled way to drop recurrent connections in gated architectures.

It should be noted that, while being different, the three discussed dropout schemes are not mutually exclusive. It is in general possible to combine our approach and the other two. We expect the combination of our scheme and that of (Gal, 2015) to hold the biggest potential. The relations between recurrent dropout schemes are however out of the scope of this paper, and we rather focus on studying the relationships of the different dropout approaches with the conventional forward dropout.

Gated Recurrent Unit (GRU) networks are a recently introduced variant of a recurrent network with the hidden state protected by gates (Cho et al., 2014). Different from LSTMs, GRU networks use only two gates, r_t and z_t, to update the cell's hidden state h_t:

  z_t = σ(W_z [x_t, h_{t-1}] + b_z)    (10)
  r_t = σ(W_r [x_t, h_{t-1}] + b_r)

  g_t = f(W_g [x_t, r_t ∗ h_{t-1}] + b_g)    (11)
[Figure 1: diagrams of dropout applied within an LSTM cell — (a) Moon et al., 2015; (b) Gal, 2015; (c) Ours]

Figure 1: Illustration of the three types of dropout in recurrent connections of LSTM networks. Dashed
arrows refer to dropped connections. Input connections are omitted for clarity.

  h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ g_t    (12)

Similarly to the LSTMs, we propose to apply dropout to the hidden state update vector g_t:

  h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ d(g_t)    (13)
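Analogously, a hedged sketch of one GRU step with Eq. 13, reusing the sigmoid and dropout helpers defined in the previous sketches; the per-gate weight matrices and their shapes are an assumption made only for this illustration.

```python
import numpy as np

def gru_step_with_recurrent_dropout(x_t, h_prev, Wz, Wr, Wg, bz, br, bg, p, train):
    """One GRU step with dropout applied to the hidden state update vector g_t (Eq. 13).
    Weight matrices have shape (input_dim + hidden_dim, hidden_dim)."""
    zr_in = np.concatenate([x_t, h_prev], axis=-1)
    z = sigmoid(zr_in @ Wz + bz)                                        # update gate, Eq. 10
    r = sigmoid(zr_in @ Wr + br)                                        # reset gate, Eq. 10
    g = np.tanh(np.concatenate([x_t, r * h_prev], axis=-1) @ Wg + bg)   # Eq. 11
    return (1.0 - z) * h_prev + z * dropout(g, p, train)                # Eq. 13
```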
We found that the intuitive idea to drop previous hidden states directly, as proposed in Moon et al. (2015), produces mixed results. We have observed that it helps the network to generalize better when not coupled with the forward dropout, but is usually no longer beneficial when used together with a regular forward dropout.

The problem is caused by the scaling of neuron activations during inference. Consider the hidden state update rule in the test phase of an LSTM network. For clarity, we assume every gate to be equal to 1:

  h_t = (h_{t-1} + g_t) p,    (14)

where g_t are update vectors computed by Eq. 4 and p is the probability of not dropping a neuron. As h_{t-1} was, in turn, computed using the same rule, we can rewrite this equation as:

  h_t = ((h_{t-2} + g_{t-1}) p + g_t) p    (15)

Recursively expanding h for every timestep results in the following equation:

  h_t = ((((h_0 + g_0) p + g_1) p + ...) p + g_t) p    (16)

Pushing p inside the parentheses, Eq. 16 can be written as:

  h_t = p^{t+1} h_0 + \sum_{i=0}^{t} p^{t-i+1} g_i    (17)

Since p is a value between zero and one, sum components that are far away in the past are multiplied by a very low value and are effectively removed from the summation. Thus, even though the network is able to learn long-term dependencies, it is not capable of exploiting them during the test phase. Note that our assumption of all gates being equal to 1 helps the network to preserve the hidden state, since in a real network gate values lie within the (0, 1) interval. In practice trained networks tend to saturate gate values (Karpathy et al., 2015), which makes gates behave as binary switches. The fact that Moon et al. (2015) have achieved an improvement can be explained by the experimentation domain. Le et al. (2015) have proposed a simple yet effective way to initialize vanilla RNNs and reported that they have achieved a good result in the Speech Recognition domain while having an effect similar to the one caused by Eq. 17. One can reduce the influence of this effect by selecting a low dropout rate. This solution however is partial, since it only increases the number of steps required to completely forget past history and does not remove the problem completely.

One important note is that the dropout function from Eq. 3 can be implemented as:

  d(x) = mask ∗ x / p,  if train phase
         x,             otherwise    (18)

In this case the above argument holds as well, but instead of observing exponentially decreasing hidden states during testing, we will observe exponentially increasing values of hidden states during training.

Our approach addresses the problem discussed previously by dropping the update vectors g. Since we drop only candidates, we do not scale the hidden state directly. This allows for solving the scaling issue, as Eq. 17 becomes:

  h_t = p h_0 + \sum_{i=0}^{t} p g_i = p h_0 + p \sum_{i=0}^{t} g_i    (19)

Moreover, since we only drop differences that are added to the network's hidden state at each timestep, this dropout scheme allows us to use per-step
mask sampling while still being able to learn long-term dependencies. Thus, our approach allows us to freely apply dropout in the recurrent connections of a gated network without hindering its ability to process long-term relationships.

We note that the discussed problem does not affect vanilla RNNs because they overwrite their hidden state at every timestep. Lastly, the approach of Gal (2015) is not affected by the issue either.
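The scaling argument can also be checked numerically. The snippet below is purely illustrative, with arbitrary example values for p and t: it compares the weight that the test-time network assigns to an update g_i injected i steps in the past, i.e., the geometrically decaying factor p^{t-i+1} of Eq. 17 against the constant factor p of Eq. 19.

```python
import numpy as np

p, t = 0.75, 20                       # keep probability and sequence length (example values)
steps = np.arange(t + 1)

moon_weights = p ** (t - steps + 1)   # Eq. 17: weight of g_i when hidden states are dropped
ours_weights = np.full(t + 1, p)      # Eq. 19: weight of g_i when only updates are dropped

print(moon_weights[:3])               # early updates are almost erased, e.g. ~0.002 for g_0
print(ours_weights[:3])               # every update keeps the same weight p
```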
4 Experiments

First, we empirically demonstrate the issues linked to memory loss when using various dropout techniques in recurrent nets (see Sec. 3.2). For this purpose we experiment with training LSTM networks on one of the synthetic tasks from (Hochreiter and Schmidhuber, 1997), specifically the Temporal Order task. We then validate the effectiveness of our recurrent dropout when applied to vanilla RNNs, LSTMs and GRUs on three diverse public benchmarks: Language Modelling, Named Entity Recognition, and Twitter Sentiment classification.
4.1 Synthetic Task

Data. In this task the input sequences are generated as follows: all but two elements in a sequence are drawn randomly from {C, D} and the remaining two symbols from {A, B}. Symbols from {A, B} can appear at any position in the sequence. The task is to classify a sequence into one of four classes ({AA, AB, BA, BB}) based on the order of the symbols. We generate the data so that every sequence is split into three parts of the same size and emit one meaningful symbol in the first and second parts of a sequence. The prediction is taken after the full sequence has been processed. We use two modes in our experiments: Short, with sequences of length 15, and Medium, with sequences of length 30.
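A possible generator for this task, consistent with the description above, is sketched below; the exact script in our released code may differ (e.g., in how symbols are encoded), so the function and its signature should be treated as illustrative only.

```python
import numpy as np

def generate_temporal_order_example(length, rng):
    """One Temporal Order sequence: fillers from {C, D}, two markers from {A, B},
    one in the first third and one in the second third; the class is the ordered marker pair."""
    seq = rng.choice(list("CD"), size=length)
    third = length // 3
    pos1 = rng.integers(0, third)                 # marker in the first part
    pos2 = rng.integers(third, 2 * third)         # marker in the second part
    m1, m2 = rng.choice(list("AB")), rng.choice(list("AB"))
    seq[pos1], seq[pos2] = m1, m2
    label = {"AA": 0, "AB": 1, "BA": 2, "BB": 3}[m1 + m2]
    return "".join(seq), label

rng = np.random.default_rng(0)
print(generate_temporal_order_example(15, rng))   # Short mode; use length 30 for Medium
```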
Setup. We use an LSTM with one layer that contains 256 hidden units and recurrent dropout with 0.5 strength. The network is trained by SGD with a learning rate of 0.1 for 5k epochs. The networks are trained on 200 mini-batches with 32 sequences and tested on 10k sequences.

Results. Table 1 reports the results on the Temporal Order task when recurrent dropout is applied using our method and the methods from (Moon et al., 2015) and (Gal, 2015). Using dropout from (Moon et al., 2015) with per-sequence sampling, networks are able to discover the long-term dependency, but fail to use it on the test set due to the scaling issue. Interestingly, in the Medium case results on the test set are worse than random. Networks trained with per-step sampling exhibit different behaviour: in the Short case they are capable of capturing the temporal dependency and generalizing to the test set, but require 10-20 times more iterations to do so. In the Medium case these networks do not fit into the allocated number of iterations. This suggests that applying dropout to hidden states as suggested in (Moon et al., 2015) corrupts memory cells, hindering the long-term memory capacity of LSTMs.

In contrast, using our recurrent dropout method, networks are able to solve the problem in all cases. We have also run the same experiments for longer sequences, but found that the results are equivalent to the Medium case. We also note that the approach of (Gal, 2015) does not seem to exhibit the memory loss problem.

4.2 Word Level Language Modeling

Data. Following Mikolov et al. (2011) we use the Penn Treebank Corpus to train our Language Modeling (LM) models. The dataset contains approximately 1 million words and comes with predefined training, validation and test splits, and a vocabulary of 10k words.

Setup. In our LM experiments we use recurrent networks with a single layer with 256 cells. Network parameters were initialized uniformly in [-0.05, 0.05]. For training, we use plain SGD with batch size 32 and maximum norm gradient clipping (Pascanu et al., 2013). The learning rate, clipping threshold and number of Backpropagation Through Time (BPTT) steps were set to 1, 10 and 35 respectively. For the learning rate decay we use the following strategy: if the validation error does not decrease after an epoch, we divide the learning rate by 1.5. The aforementioned choices were largely guided by the work of Mikolov et al. (2014). To ease reproducibility of our results on the LM and synthetic tasks, we have released the source code of our experiments.¹
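For clarity, the gradient clipping and learning rate schedule described above can be expressed as two small helpers. These are illustrative sketches under the stated hyper-parameters, not the released training script.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=10.0):
    """Maximum norm gradient clipping (Pascanu et al., 2013): rescale all gradients
    if their global L2 norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

def decayed_learning_rate(lr, prev_valid_error, valid_error, factor=1.5):
    """Divide the learning rate by `factor` whenever the validation error stops decreasing."""
    return lr / factor if valid_error >= prev_valid_error else lr
```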
Results. Table 2 reports the results for LSTM networks. We also present results when the dropout is applied directly to hidden states as in (Moon et al., 2015) and results of networks trained with the dropout scheme of (Gal, 2015).

¹ https://github.com/stas-semeniuta/drop-rnn
                     Moon et al. (2015)                      Gal (2015); Ours
Sampling        short sequences    medium sequences     short sequences    medium sequences
                Train     Test     Train     Test       Train     Test     Train     Test
per-step        100%      100%     25%       25%        100%      100%     100%      100%
per-sequence    100%      25%      100%      <25%       100%      100%     100%      100%

Table 1: Accuracies on the Temporal Order task.

Dropout rate    Sampling        Moon et al. (2015)     Gal (2015)        Ours
                                Valid     Test         Valid    Test     Valid    Test
0.0             –               130.0     125.2        130.0    125.2    130.0    125.2
0.25            per-step        113.0     108.7        119.8    114.2    106.1    100.0
0.5             per-step        124.0     116.5        118.3    112.5    102.8    98.0
0.25            per-sequence    121.0     113.0        120.5    114.0    106.3    100.7
0.5             per-sequence    137.7     126.2        125.2    117.9    103.2    96.8
0.0             –               94.1      89.5         94.1     89.5     94.1     89.5
0.25            per-step        113.5     105.8        92.9     88.4     91.6     87.0
0.5             per-step        140.6     130.1        98.6     92.5     100.6    95.5
0.25            per-sequence    105.7     99.9         94.5     89.7     92.4     87.6
0.5             per-sequence    125.4     117.4        98.4     92.5     107.8    101.8

Table 2: Perplexity scores of the LSTM network on word level Language Modeling task (lower is better).
Upper and lower parts of the table report results without and with forward dropout respectively. Networks
with forward dropout use 0.2 and 0.5 dropout rates in input and output connections respectively. Values
in bold show best results for each of the recurrent dropout schemes with and without forward dropout.

We make the following observations: (i) dropping hidden state updates yields better results than dropping hidden states, i.e., our approach shows better results than the alternatives; (ii) per-step mask sampling is better when dropping hidden states directly; (iii) contrary to our expectations, when we apply dropout to hidden state updates, per-step sampling yields results similar to per-sequence sampling; (iv) applying dropout to hidden state updates rather than hidden states in some cases leads to a perplexity decrease by more than 30 points; (v) in this case forward dropout alone yields better results than any of the three recurrent dropout schemes; and finally (vi) both our approach and that of (Gal, 2015) are effective when combined with the forward dropout, though ours is more effective – for LSTMs we are able to bring down the perplexity on the validation set from 130 to 91.6.

To demonstrate the effect of our approach on the learning process, we also present learning curves of LSTM networks trained with and without recurrent dropout (Fig. 2). Models trained using our recurrent dropout scheme have slower convergence than models without dropout and usually have larger training error and lower validation errors. This behaviour is consistent with what is expected from a regularizer and is similar to the effect of the feed-forward dropout applied to non-recurrent networks (Hinton et al., 2012).

4.3 Character Level Language Modeling

Data. We train our networks on the dataset described in the previous section. It contains approximately 6 million characters and a vocabulary of 50 characters. We use the provided train, validation and test partitions.
Dropout rate    Sampling        Moon et al. (2015)     Gal (2015)        Ours
                                Valid     Test         Valid    Test     Valid    Test
0.0             –               1.460     1.457        1.460    1.457    1.460    1.457
0.25            per-step        1.435     1.394        1.345    1.308    1.338    1.301
0.5             per-step        1.610     1.561        1.387    1.348    1.355    1.316
0.25            per-sequence    1.433     1.390        1.341    1.304    1.356    1.319
0.5             per-sequence    1.691     1.647        1.408    1.369    1.496    1.450
0.0             –               1.362     1.326        1.362    1.326    1.362    1.326
0.25            per-step        1.471     1.428        1.381    1.344    1.358    1.321
0.5             per-step        1.668     1.622        1.463    1.425    1.422    1.380
0.25            per-sequence    1.455     1.413        1.387    1.348    1.403    1.363
0.5             per-sequence    1.681     1.637        1.477    1.435    1.567    1.522

Table 3: Bit-per-character scores of the LSTM network on character level Language Modelling task
(lower is better). Upper and lower parts of the table report results without and with forward dropout re-
spectively. Networks with forward dropout use 0.2 and 0.5 dropout rates in input and output connections
respectively. Values in bold show best results for each of the recurrent dropout schemes with and without
forward dropout.

[Figure 2 plot: training and validation error versus training iterations, with and without dropout]

Figure 2: Learning curves of LSTM networks when training without and with 0.25 per-step recurrent dropout. Solid and dashed lines show training and validation errors respectively. Best viewed in color.

Setup. We use networks with 1024 units to solve the character level LM task. The characters are embedded into a 256-dimensional space before being processed by the LSTM. All parameters of the networks are initialized uniformly in [-0.01, 0.01]. We train our networks on non-overlapping sequences of 100 characters. The networks are trained with the Adam algorithm (Kingma and Ba, 2014) with an initial learning rate of 0.001 for 50 epochs. We decrease the learning rate by 0.97 after every epoch starting from epoch 10. To avoid exploding gradients, we use MaxNorm gradient clipping with the threshold set to 10.

Results. Results of our experiments are given in Table 3. Note that on this task regularizing only the recurrent connections is more beneficial than regularizing only the forward ones. In particular, LSTM networks trained with our approach and the approach of (Gal, 2015) yield a lower bit-per-character (bpc) score than those trained with forward dropout only. We attribute this to pronounced long-term dependencies. In addition, our approach is the only one that improves over the baseline LSTM with forward dropout. The overall best result is achieved by a network trained with our dropout with 0.25 dropout rate and per-step sampling, closely followed by the network with Gal (2015) dropout.
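For reference, the bit-per-character scores reported in Table 3 correspond to the average negative base-2 log-probability assigned to the correct next character. A minimal sketch of such an evaluation helper is given below; it assumes the model exposes a probability for each target character and is not taken from our released code.

```python
import numpy as np

def bits_per_character(target_probs):
    """Average negative log2 probability of the correct next character over a test stream."""
    target_probs = np.asarray(target_probs, dtype=np.float64)
    return float(np.mean(-np.log2(target_probs)))

print(bits_per_character([0.5, 0.25, 0.5]))  # 1 bit for p=0.5, 2 bits for p=0.25 -> mean ~1.33
```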

4.4 Named Entity Recognition


Data. To assess our recurrent Named Entity Recognition (NER) taggers when using recurrent dropout we use a public benchmark from CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003). The dataset contains approximately 300k words split into train, validation and test partitions. Each word is labeled with either a named entity class it belongs to, such as Location or Person, or as being not named. The majority of words are labeled as not named entities. The vocabulary size is about 22k words.

Dropout rate       RNN      LSTM     GRU
5 word long sequences
0.0                85.60    85.32    86.00
0.25               86.48    86.42    86.70
15 word long sequences
0.0                86.12    86.12    86.44
0.25               86.95    86.88    87.10

Table 4: F1 scores (higher is better) on the NER task.
Setup. Previous state-of-the-art NER systems have shown the importance of using word context features around entities. Hence, we slightly modify the architecture of our recurrent networks to consume the context around the target word by simply concatenating their embeddings. The size of the context window is fixed to 5 words (the word to be labeled, two words before and two words after). The recurrent layer size is 1024 units. The network inputs include only word embeddings (initialized with pretrained word2vec embeddings (Mikolov et al., 2013) and kept static) and capitalization features. For training we use the RMSProp algorithm (Dauphin et al., 2015) with ρ fixed at 0.9 and a learning rate of 0.01, and multiply the learning rate by 0.99 after every epoch. We also combine our recurrent dropout (with per-sequence mask sampling) with the conventional forward dropout with a rate of 0.2 in input and 0.5 in output connections. Lastly, we found that using the relu(x) = max(x, 0) nonlinearity resulted in higher performance than tanh(x).
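The context-window input described above can be illustrated as follows; the helper name, padding token and example sentence are hypothetical, and the actual feature pipeline of our taggers may differ in details.

```python
def context_window(tokens, index, size=5, pad="<PAD>"):
    """Return the labeled word with two words of context on each side (window of 5),
    padding at sentence boundaries; the embeddings of these tokens are then concatenated."""
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return padded[index:index + size]

print(context_window(["EU", "rejects", "German", "call"], 0))
# ['<PAD>', '<PAD>', 'EU', 'rejects', 'German']
```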
To speed up the training we use the length expansion approach described in (Ng et al., 2015), where training is performed in two stages: (i) we first sample short 5-word input sequences with their contexts and train for 25 epochs; (ii) we fine-tune the network on 15-word input sequences for 10 epochs. We found that further fine-tuning on longer sequences yielded negligible improvements. Such a strategy allows us to significantly speed up training when compared to training from scratch on full-length input sentences. We use full sentences for testing.

Results. F1 scores of our taggers are reported in Table 4 when trained on short 5-word and longer 15-word input sequences. We note that the gap between networks trained with and without our dropout scheme is larger for networks trained on shorter sequences. This suggests that dropout in recurrent connections might have an impact on how well a network generalizes to sequences that are longer than the ones used during training. The gain from using recurrent dropout is larger for the LSTM network. We have experimented with higher recurrent dropout rates, but found that they led to excessive regularization.

4.5 Twitter Sentiment Analysis

Data. We use the Twitter sentiment corpus from SemEval-2015 Task 10 (subtask B) (Rosenthal et al., 2015). It contains 15k labeled tweets split into training and validation partitions. The total number of words is approximately 330k and the vocabulary size is 22k. The task consists of classifying a tweet into three classes: positive, neutral, and negative. Performance of a classifier is measured by the average of the F1 scores of the positive and negative classes. We evaluate our models on a number of datasets that were used for benchmarking during the last years.
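The evaluation metric mentioned above — the mean of the F1 scores of the positive and negative classes, with neutral ignored in the average — can be computed as in the sketch below. This is an illustration of the metric, not the official SemEval scorer.

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def semeval_score(gold, predicted):
    """Average of the F1 scores of the positive and negative classes."""
    scores = []
    for cls in ("positive", "negative"):
        tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
        fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
        fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2.0
```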
Setup. We use recurrent networks in the standard sequence labeling manner: we input words to a network one by one and take the label at the last step. Similarly to (Severyn and Moschitti, 2015), we use 1 million weakly labeled tweets to pre-train our networks. We use networks composed of 500 neurons in all cases. Our models are trained with the RMSProp algorithm with a learning rate of 0.001. We use our recurrent dropout regularization with per-step mask sampling. All the other settings are equivalent to the ones used in the NER task.

Results. The results of these experiments are presented in Table 5. Note that in this case our algorithm decreases the performance of the vanilla RNNs, while this is not the case for LSTM and GRU networks. This is due to the nature of the problem: differently from the LM and NER tasks, a network needs to aggregate information over a long sequence. Vanilla RNNs notoriously have difficulties with this, and our dropout scheme impairs their ability to remember even further.
Dropout rate    Twitter13    LiveJournal14    Twitter15    Twitter14    SMS13    Sarcasm14
RNN
0               67.54        71.20            59.35        68.90        64.51    53.58
0.25            66.35        67.70            59.91        67.76        64.46    48.74
LSTM
0               67.97        69.82            57.84        67.95        61.47    53.49
0.25            69.11        71.39            61.35        68.08        65.45    53.80
GRU
0               67.09        70.80            59.07        67.02        67.11    51.01
0.25            69.04        72.10            60.34        69.65        65.73    54.77

Table 5: F1 scores (higher is better) on the Sentiment Evaluation task.

The best result over most of the datasets is achieved by the GRU network with recurrent dropout. The only exception is the Twitter2015 dataset, where the LSTM network shows better results.

5 Conclusions

This paper presents a novel recurrent dropout method specifically tailored to gated recurrent neural networks. Our approach is easy to implement and is even more effective when combined with conventional forward dropout. We have shown that for LSTMs and GRUs applying dropout to arbitrary cell vectors results in suboptimal performance. We discuss in detail the cause of this effect and propose a simple solution to overcome it. The effectiveness of our approach is verified on three different public NLP benchmarks.

Our findings along with our empirical results allow us to answer the questions posed in Section 1: (i) while it is straightforward to use dropout in vanilla RNNs due to their strong similarity with the feed-forward architectures, its application to LSTM networks is not so straightforward; we demonstrate that recurrent dropout is most effective when applied to hidden state update vectors in LSTMs rather than to hidden states; (ii) we observe an improvement in the network's performance when our recurrent dropout is coupled with the standard forward dropout, though the extent of this improvement depends on the values of the dropout rates; (iii) contrary to our expectations, networks trained with per-step and per-sequence mask sampling produce similar results when using our recurrent dropout method, both being better than the dropout scheme proposed by Moon et al. (2015).

While our experimental results show that applying our recurrent dropout method leads to significant improvements across various NLP benchmarks (especially when combined with conventional forward dropout), its benefits for other tasks, e.g., sequence-to-sequence prediction, or other domains, e.g., Speech Recognition, remain unexplored. We leave this as future work.

Acknowledgments

This project has received funding from the European Union's Framework Programme for Research and Innovation HORIZON 2020 (2014-2020) under the Marie Skłodowska-Curie Agreement No. 641805. Stanislau Semeniuta thanks the support from Pattern Recognition Company GmbH. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.

References

[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

[Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099.

[Bluche et al.2015] Theodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2015. Where to apply dropout in recurrent neural networks for handwriting recognition? In 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Tunis, Tunisia, August 23-26, 2015, pages 681–685.
[Cho et al.2014] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

[Dauphin et al.2015] Yann N. Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. 2015. RMSProp and equilibrated adaptive learning rates for non-convex optimization. CoRR, abs/1502.04390.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In ACL, pages 334–343. Association for Computational Linguistics.

[Gal2015] Yarin Gal. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November.

[Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728. Association for Computational Linguistics.

[Karpathy et al.2015] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078.

[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.

[Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

[Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

[Le et al.2015] Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941.

[Liu et al.2015] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. 2015. Multi-timescale long short-term memory neural network for modelling sentences and documents. In ACL. Association for Computational Linguistics.

[Mikolov et al.2011] T. Mikolov, S. Kombrink, L. Burget, J.H. Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531, May.

[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

[Mikolov et al.2014] Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. CoRR, abs/1412.7753.

[Moon et al.2015] Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song. 2015. RnnDrop: A novel dropout for RNNs in ASR. Automatic Speech Recognition and Understanding (ASRU).

[Ng et al.2015] Joe Yue-Hei Ng, Matthew J. Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. CoRR, abs/1503.08909.

[Pascanu et al.2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1310–1318.

[Pham et al.2013] Vu Pham, Christopher Kermorvant, and Jérôme Louradour. 2013. Dropout improves recurrent neural networks for handwriting recognition. CoRR, abs/1312.4569.

[Rosenthal et al.2015] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 451–463, Denver, Colorado, June. Association for Computational Linguistics.

[Severyn and Moschitti2015] Aliaksei Severyn and Alessandro Moschitti. 2015. Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pages 959–962.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Wang et al.2015] Xin Wang, Yuanchao Liu, Chengjie Sun, Baoxun Wang, and Xiaolong Wang. 2015. Predicting polarities of tweets by composing word embeddings with long short-term memory. In ACL, pages 1343–1353. Association for Computational Linguistics.

[Zaremba et al.2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.

[Zhang and Lapata2014] Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680. Association for Computational Linguistics.

[Zhang et al.2015] Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Tree recurrent neural networks with application to language modeling. CoRR, abs/1511.00060.
