Convolutional Sequence to Sequence Learning
Jonas Gehring
Michael Auli
David Grangier
Denis Yarats
Yann N. Dauphin
Facebook AI Research
Architectures which are partially convolutional have shown strong performance on larger tasks but their decoder is still recurrent (Gehring et al., 2016).

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a). We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead. The combination of these choices enables us to tackle large scale problems (§3).

We evaluate our approach on several large datasets for machine translation as well as summarization and compare to the current best architectures reported in the literature. On WMT'16 English-Romanian translation we achieve a new state of the art, outperforming the previous best result by 1.9 BLEU. On WMT'14 English-German we outperform the strong LSTM setup of Wu et al. (2016) by 0.5 BLEU and on WMT'14 English-French we outperform the likelihood trained system of Wu et al. (2016) by 1.6 BLEU. Furthermore, our model can translate unseen sentences at an order of magnitude faster speed than Wu et al. (2016) on GPU and CPU hardware (§4, §5).

2. Recurrent Sequence to Sequence Learning

Sequence to sequence modeling has been synonymous with recurrent neural network based encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014). The encoder RNN processes an input sequence x = (x_1, ..., x_m) of m elements and returns state representations z = (z_1, ..., z_m). The decoder RNN takes z and generates the output sequence y = (y_1, ..., y_n) left to right, one element at a time. To generate output y_{i+1}, the decoder computes a new hidden state h_{i+1} based on the previous state h_i, an embedding g_i of the previous target language word y_i, as well as a conditional input c_i derived from the encoder output z. Based on this generic formulation, various encoder-decoder architectures have been proposed, which differ mainly in the conditional input and the type of RNN.

Models without attention consider only the final encoder state z_m by setting c_i = z_m for all i (Cho et al., 2014), or simply initialize the first decoder state with z_m (Sutskever et al., 2014), in which case c_i is not used. Architectures with attention (Bahdanau et al., 2014; Luong et al., 2015) compute c_i as a weighted sum of (z_1, ..., z_m) at each time step. The weights of the sum are referred to as attention scores and allow the network to focus on different parts of the input sequence as it generates the output sequence. Attention scores are computed by essentially comparing each encoder state z_j to a combination of the previous decoder state h_i and the last prediction y_i; the result is normalized to be a distribution over input elements.

Popular choices for recurrent networks in encoder-decoder models are long short term memory networks (LSTM; Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014). Both extend Elman RNNs (Elman, 1990) with a gating mechanism that allows the memorization of information from previous time steps in order to model long-term dependencies. Most recent approaches also rely on bi-directional encoders to build representations of both past and future contexts (Bahdanau et al., 2014; Zhou et al., 2016; Wu et al., 2016). Models with many layers often rely on shortcut or residual connections (He et al., 2015a; Zhou et al., 2016; Wu et al., 2016).

3. A Convolutional Architecture

Next we introduce a fully convolutional architecture for sequence to sequence modeling. Instead of relying on RNNs to compute intermediate encoder states z and decoder states h we use convolutional neural networks (CNN).

3.1. Position Embeddings

First, we embed input elements x = (x_1, ..., x_m) in distributional space as w = (w_1, ..., w_m), where w_j ∈ R^f is a column in an embedding matrix D ∈ R^{V×f}. We also equip our model with a sense of order by embedding the absolute position of input elements p = (p_1, ..., p_m) where p_j ∈ R^f. Both are combined to obtain input element representations e = (w_1 + p_1, ..., w_m + p_m). We proceed similarly for output elements that were already generated by the decoder network to yield output element representations that are being fed back into the decoder network, g = (g_1, ..., g_n). Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with (§5.4).

3.2. Convolutional Block Structure

Both encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input elements. We denote the output of the l-th block as h^l = (h^l_1, ..., h^l_n) for the decoder network, and z^l = (z^l_1, ..., z^l_m) for the encoder network; we refer to blocks and layers interchangeably. Each block contains a one dimensional convolution followed by a non-linearity. For a decoder network with a single block and kernel width k, each resulting state h^1_i contains information over k input elements. Stacking several blocks on top of each other increases the number of input elements represented in a state. For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs.
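To make this concrete, the following Python sketch (ours, not from the paper; the vocabulary size, dimensions, and maximum length are illustrative) builds the input representations of §3.1 and verifies the receptive-field arithmetic above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input representations (Section 3.1): token embedding plus absolute
# position embedding. V, f, and the maximum length 512 are toy choices.
V, f, m = 1000, 8, 5                 # vocabulary size, embedding dim, input length
D = rng.normal(0.0, 0.1, (V, f))     # token embedding matrix, one entry per type
P = rng.normal(0.0, 0.1, (512, f))   # absolute position embeddings
x = rng.integers(0, V, size=m)       # a toy input sentence as token ids
e = D[x] + P[np.arange(m)]           # e_j = w_j + p_j, shape (m, f)

# Receptive field (Section 3.2): each kernel of width k adds k - 1 context
# positions, so stacking L blocks covers 1 + L * (k - 1) input elements.
def receptive_field(num_blocks: int, k: int) -> int:
    return 1 + num_blocks * (k - 1)

print(e.shape)                 # (5, 8)
print(receptive_field(6, 5))   # 25, matching the example in the text
```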
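The block itself pairs a one-dimensional convolution with a GLU non-linearity and a residual connection. A minimal PyTorch sketch follows (our illustration; the paper's own implementation is in Torch/Lua, and the causal left-padding and the √0.5 residual scaling anticipate §3.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUBlock(nn.Module):
    """One decoder block: causal 1-D convolution + GLU + scaled residual."""

    def __init__(self, d: int = 512, k: int = 3):
        super().__init__()
        # The convolution doubles the channels; the GLU halves them again,
        # using one half as values and the other half as gates.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d, time)
        y = self.conv(F.pad(x, (self.k - 1, 0)))  # left-pad: no access to future steps
        y = F.glu(y, dim=1)                       # a * sigmoid(b) over the channel split
        return (x + y) * (0.5 ** 0.5)             # residual sum scaled by sqrt(0.5)

h = torch.randn(2, 512, 10)
print(ConvGLUBlock()(h).shape)  # torch.Size([2, 512, 10])
```

Stacking six such blocks with k = 5 yields the 25-element input field computed above.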
… only. We found adding e_j to be beneficial and it resembles key-value memory networks where the keys are the z^u_j and the values are the z^u_j + e_j (Miller et al., 2016). Encoder outputs z^u_j represent potentially large input contexts and e_j provides point information about a specific input element that is useful when making a prediction. Once c^l_i has been computed, it is simply added to the output of the corresponding decoder layer h^l_i.

This can be seen as attention with multiple hops (Sukhbaatar et al., 2015) compared to single step attention (Bahdanau et al., 2014; Luong et al., 2015; Zhou et al., 2016; Wu et al., 2016). In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention, etc. The decoder also has immediate access to the attention history of the k − 1 previous time steps because the conditional inputs c^{l−1}_{i−k}, ..., c^{l−1}_i are part of h^{l−1}_{i−k}, ..., h^{l−1}_i which are input to h^l_i. This makes it easier for the model to take into account which previous inputs have been attended to already compared to recurrent nets where this information is in the recurrent state and needs to survive several non-linearities. Overall, our attention mechanism considers which words we previously attended to (Yang et al., 2016) and performs multiple attention hops per time step. In Appendix C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to.

Our convolutional architecture also allows us to batch the attention computation across all elements of a sequence compared to RNNs (Figure 1, middle). We batch the computations of each decoder layer individually.

3.4. Normalization Strategy

We stabilize learning through careful weight initialization (§3.5) and by scaling parts of the network to ensure that the variance throughout the network does not change dramatically. In particular, we scale the output of residual blocks as well as the attention to preserve the variance of activations. We multiply the sum of the input and output of a residual block by √0.5 to halve the variance of the sum. This assumes that both summands have the same variance, which is not always true but effective in practice.

The conditional input c^l_i generated by the attention is a weighted sum of m vectors (§2) and we counteract a change in variance through scaling by m√(1/m); we multiply by m to scale up the inputs to their original size, assuming the attention scores are uniformly distributed. This is generally not the case but we found it to work well in practice.

For convolutional decoders with multiple attention, we scale the gradients for the encoder layers by the number of attention mechanisms we use; we exclude source word embeddings. We found this to stabilize learning since the encoder received too much gradient otherwise.

3.5. Initialization

Normalizing activations when adding the output of different layers, e.g. residual connections, requires careful weight initialization. The motivation for our initialization is the same as for the normalization: maintain the variance of activations throughout the forward and backward passes. All embeddings are initialized from a normal distribution with mean 0 and standard deviation 0.1. For layers whose output is not directly fed to a gated linear unit, we initialize weights from N(0, √(1/n_l)) where n_l is the number of input connections to each neuron. This ensures that the variance of a normally distributed input is retained.

For layers which are followed by a GLU activation, we propose a weight initialization scheme by adapting the derivations in (He et al., 2015b; Glorot & Bengio, 2010; Appendix A). If the GLU inputs are distributed with mean 0 and have sufficiently small variance, then we can approximate the output variance with 1/4 of the input variance (Appendix A.1). Hence, we initialize the weights so that the input to the GLU activations has 4 times the variance of the layer input. This is achieved by drawing their initial values from N(0, √(4/n_l)). Biases are uniformly set to zero when the network is constructed.

We apply dropout to the input of some layers so that inputs are retained with a probability of p. This can be seen as multiplication with a Bernoulli random variable taking value 1/p with probability p and 0 otherwise (Srivastava et al., 2014). The application of dropout will then cause the variance to be scaled by 1/p. We aim to restore the incoming variance by initializing the respective layers with larger weights. Specifically, we use N(0, √(4p/n_l)) for layers whose output is subject to a GLU and N(0, √(p/n_l)) otherwise (Appendix A.3).

4. Experimental Setup

4.1. Datasets

We consider three major WMT translation tasks as well as a text summarization task.

WMT'16 English-Romanian. We use the same data and pre-processing as Sennrich et al. (2016b) but remove sentences with more than 175 words. This results in 2.8M sentence pairs for training and we evaluate on newstest2016.²

² We followed the pre-processing of https://github.com/rsennrich/wmt16-scripts/blob/80e21e5/sample/preprocess.sh and added the back-translated data from http://data.statmt.org/rsennrich/wmt16_
We experiment with word-based models using a source vocabulary of 200K types and a target vocabulary of 80K types. We also consider a joint source and target byte-pair encoding (BPE) with 40K types (Sennrich et al., 2016a;b).

WMT'14 English-German. We use the same setup as Luong et al. (2015) which comprises 4.5M sentence pairs for training and we test on newstest2014.³ As vocabulary we use 40K sub-word types based on BPE.

WMT'14 English-French. We use the full training set of 36M sentence pairs, and remove sentences longer than 175 words as well as pairs with a source/target length ratio exceeding 1.5. This results in 35.5M sentence pairs for training. Results are reported on newstest2014. We use a source and target vocabulary with 40K BPE types.

In all setups a small subset of the training data serves as validation set (about 0.5-1% for each dataset) for early stopping and learning rate annealing.

Abstractive summarization. We train on the Gigaword corpus (Graff et al., 2003) and pre-process it identically to Rush et al. (2015) resulting in 3.8M training examples and 190K for validation. We evaluate on the DUC-2004 test data comprising 500 article-title pairs (Over et al., 2007) and report three variants of recall-based ROUGE (Lin, 2004), namely, ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest-common substring). We also evaluate on a Gigaword test set of 2000 pairs which is identical to the one used by Rush et al. (2015) and we report F1 ROUGE similar to prior work. Similar to Shen et al. (2016) we use a source and target vocabulary of 30K words and require outputs to be at least 14 words long.

4.2. Model Parameters and Optimization

We use 512 hidden units for both encoders and decoders, unless otherwise stated. All embeddings, including the output produced by the decoder before the final linear layer, have dimensionality 512; we use the same dimensionalities for linear layers mapping between the hidden and embedding sizes (§3.2).

We train our convolutional models with Nesterov's accelerated gradient method (Sutskever et al., 2013) using a momentum value of 0.99 and renormalize gradients if their norm exceeds 0.1 (Pascanu et al., 2013). We use a learning rate of 0.25 and once the validation perplexity stops improving, we reduce the learning rate by an order of magnitude after each epoch until it falls below 10⁻⁴.

We restrict the maximum number of tokens per mini-batch so that batches with long sentences still fit in GPU memory. If the threshold is exceeded, we simply split the batch until the threshold is met and process the parts separately. Gradients are normalized by the number of non-padding tokens per mini-batch. We also use weight normalization for all layers except for lookup tables (Salimans & Kingma, 2016).

Besides dropout on the embeddings and the decoder output, we also apply dropout to the input of the convolutional blocks (Srivastava et al., 2014). All models are implemented in Torch (Collobert et al., 2011) and trained on a single Nvidia M40 GPU except for WMT'14 English-French for which we use a multi-GPU setup on a single machine. We train on up to eight GPUs synchronously by maintaining copies of the model on each card and split the batch so that each worker computes 1/8-th of the gradients; at the end we sum the gradients via Nvidia NCCL.

4.3. Evaluation

We report average results over three runs of each model, where each differs only in the initial random seed. Translations are generated by beam search and we normalize log-likelihood scores by sentence length. We use a beam of width 5. We divide the log-likelihoods of the final hypothesis in beam search by their length |y|. For WMT'14 English-German we tune a length normalization constant on a separate development set (newstest2015) and we normalize log-likelihoods by |y|^α (Wu et al., 2016). On other datasets we did not find any benefit with length normalization.

For word-based models, we perform unknown word replacement based on attention scores after generation (Jean et al., 2015). Unknown words are replaced by looking up the source word with the maximum attention score in a pre-computed dictionary. If the dictionary contains no translation, then we simply copy the source word. Dictionaries were extracted from the word-aligned training data that we obtained with fast_align (Dyer et al., 2013). Each source word is mapped to the target word it is most frequently aligned to. In our multi-step attention (§3.3) we simply average the attention scores over all layers. Finally, we compute case-sensitive tokenized BLEU, except for WMT'16 English-Romanian where we use detokenized BLEU to be comparable with Sennrich et al. (2016b).⁴

⁴ https://github.com/moses-smt/mosesdecoder/blob/617e8c8/scripts/generic/{multi-bleu.perl,mteval-v13a.pl}
Table 2. Accuracy of ensembles with eight models. We show both likelihood and Reinforce (RL) results for GNMT; Zhou et al. (2016) and ConvS2S use simple likelihood training.

Table 3. CPU and GPU generation speed in seconds on the development set of WMT'14 English-French. We show results for different beam sizes b. GNMT figures are taken from Wu et al. (2016). CPU speeds are not directly comparable because Wu et al. (2016) use an 88-core machine versus our 48-core setup.

                                DUC-2004                      Gigaword
                                RG-1 (R)  RG-2 (R)  RG-L (R)  RG-1 (F)  RG-2 (F)  RG-L (F)
RNN MLE (Shen et al., 2016)     24.92     8.60      22.25     32.67     15.23     30.56
RNN MRT (Shen et al., 2016)     30.41     10.87     26.79     36.54     16.59     33.44
WFE (Suzuki & Nagata, 2017)     32.28     10.54     27.80     36.30     17.31     33.88
ConvS2S                         30.44     10.84     26.90     35.88     17.48     33.29

Table 6. Accuracy on two summarization tasks in terms of Rouge-1 (RG-1), Rouge-2 (RG-2), and Rouge-L (RG-L).
Kernel width    Encoder layers
                5        9        13
3               20.61    21.17    21.63
5               20.80    21.02    21.42
7               20.81    21.30    21.09

Table 7. Encoder with different kernel width in terms of BLEU.

Kernel width    Decoder layers
                3        5        7
3               21.10    21.71    21.62
5               21.09    21.63    21.24
7               21.40    21.31    21.33

Table 8. Decoder with different kernel width in terms of BLEU.

Aside from increasing the depth of the networks, we can also change the kernel width. Table 7 shows that encoders with narrow kernels and many layers perform better than wider kernels. These networks can also be faster since the amount of work to compute a kernel operating over 3 input elements is less than half compared to kernels over 7 elements. We see a similar picture for decoder networks with large kernel sizes (Table 8). Dauphin et al. (2016) show that context sizes of 20 words are often sufficient to achieve very good accuracy on language modeling for English.

5.7. Summarization

Finally, we evaluate our model on abstractive sentence summarization which takes a long sentence as input and outputs a shortened version. The current best models on this task are recurrent neural networks which either optimize the evaluation metric (Shen et al., 2016) or address specific problems of summarization such as avoiding repeated generations (Suzuki & Nagata, 2017). We use standard likelihood training for our model and a simple model with six layers in the encoder and decoder each, hidden size 256, batch size 128, and we trained on a single GPU in one night. Table 6 shows that our likelihood trained model outperforms the likelihood trained model (RNN MLE) of Shen et al. (2016) and is not far behind the best models on this task which benefit from task-specific optimization and model structure. We expect our model to benefit from these improvements as well.

6. Conclusion and Future Work

We introduce the first fully convolutional model for sequence to sequence learning that outperforms strong recurrent models on very large benchmark datasets at an order of magnitude faster speed. Compared to recurrent networks, our convolutional approach allows us to discover compositional structure in the sequences more easily since representations are built hierarchically. Our model relies on gating and performs multiple attention steps.

We achieve a new state of the art on several public translation benchmark data sets. On the WMT'16 English-Romanian task we outperform the previous best result by 1.9 BLEU, on WMT'14 English-French translation we improve over the LSTM model of Wu et al. (2016) by 1.6 BLEU in a comparable setting, and on WMT'14 English-German translation we outperform the same model by 0.5 BLEU. In future work, we would like to apply convolutional architectures to other sequence to sequence learning problems which may benefit from learning hierarchical representations as well.

Acknowledgements

We thank Benjamin Graham for providing a fast 1-D convolution, and Ronan Collobert as well as Yann LeCun for helpful discussions related to this work.

References

Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bojar, Ondřej, Chatterjee, Rajen, Federmann, Christian, Graham, Yvette, Haddow, Barry, Huck, Matthias,
Jimeno-Yepes, Antonio, Koehn, Philipp, Logacheva, Varvara, Monz, Christof, Negri, Matteo, Névéol, Aurélie, Neves, Mariana L., Popel, Martin, Post, Matt, Rubino, Raphael, Scarton, Carolina, Specia, Lucia, Turchi, Marco, Verspoor, Karin M., and Zampieri, Marcos. Findings of the 2016 conference on machine translation. In Proc. of WMT, 2016.

Bradbury, James, Merity, Stephen, Xiong, Caiming, and Socher, Richard. Quasi-Recurrent Neural Networks. arXiv preprint arXiv:1611.01576, 2016.

Cho, Kyunghyun, Van Merrienboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of EMNLP, 2014.

Chorowski, Jan K, Bahdanau, Dzmitry, Serdyuk, Dmitriy, Cho, Kyunghyun, and Bengio, Yoshua. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577-585, 2015.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop, 2011. URL http://torch.ch.

Dauphin, Yann N., Fan, Angela, Auli, Michael, and Grangier, David. Language modeling with gated linear units. arXiv preprint arXiv:1612.08083, 2016.

Dyer, Chris, Chahuneau, Victor, and Smith, Noah A. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of ACL, 2013.

Elman, Jeffrey L. Finding Structure in Time. Cognitive Science, 14:179-211, 1990.

Gehring, Jonas, Auli, Michael, Grangier, David, and Dauphin, Yann N. A Convolutional Encoder Model for Neural Machine Translation. arXiv preprint arXiv:1611.02344, 2016.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proc. of AISTATS, 2010.

Graff, David, Kong, Junbo, Chen, Ke, and Maeda, Kazuaki. English gigaword. Linguistic Data Consortium, Philadelphia, 2003.

Ha, David, Dai, Andrew, and Le, Quoc V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep Residual Learning for Image Recognition. In Proc. of CVPR, 2015a.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015b.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448-456, 2015.

Jean, Sébastien, Firat, Orhan, Cho, Kyunghyun, Memisevic, Roland, and Bengio, Yoshua. Montreal Neural Machine Translation systems for WMT'15. In Proc. of WMT, pp. 134-140, 2015.

Kalchbrenner, Nal, Espeholt, Lasse, Simonyan, Karen, van den Oord, Aaron, Graves, Alex, and Kavukcuoglu, Koray. Neural Machine Translation in Linear Time. arXiv, 2016.

LeCun, Yann and Bengio, Yoshua. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

L'Hostis, Gurvan, Grangier, David, and Auli, Michael. Vocabulary Selection Strategies for Neural Machine Translation. arXiv preprint arXiv:1610.00072, 2016.

Lin, Chin-Yew. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74-81, 2004.

Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, 2015.

Meng, Fandong, Lu, Zhengdong, Wang, Mingxuan, Li, Hang, Jiang, Wenbin, and Liu, Qun. Encoding Source Language with Convolutional Neural Network for Machine Translation. In Proc. of ACL, 2015.

Mi, Haitao, Wang, Zhiguo, and Ittycheriah, Abe. Vocabulary Manipulation for Neural Machine Translation. In Proc. of ACL, 2016.

Miller, Alexander H., Fisch, Adam, Dodge, Jesse, Karimi, Amir-Hossein, Bordes, Antoine, and Weston, Jason. Key-value memory networks for directly reading documents. In Proc. of EMNLP, 2016.

Nallapati, Ramesh, Zhou, Bowen, Gulcehre, Caglar, Xiang, Bing, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proc. of EMNLP, 2016.
Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.

Oord, Aaron van den, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328, 2016b.

Over, Paul, Dang, Hoa, and Harman, Donna. Duc in context. Information Processing & Management, 43(6):1506-1520, 2007.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of The 30th International Conference on Machine Learning, pp. 1310-1318, 2013.

Rush, Alexander M, Chopra, Sumit, and Weston, Jason. A neural attention model for abstractive sentence summarization. In Proc. of EMNLP, 2015.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

Schuster, Mike and Nakajima, Kaisuke. Japanese and korean voice search. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 5149-5152. IEEE, 2012.

Sennrich, Rico, Haddow, Barry, and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. In Proc. of ACL, 2016a.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML, 2013.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS, pp. 3104-3112, 2014.

Suzuki, Jun and Nagata, Masaaki. Cutting-off redundant repeating generations for neural abstractive summarization. arXiv preprint arXiv:1701.00138, 2017.

Waibel, Alex, Hanazawa, Toshiyuki, Hinton, Geoffrey, Shikano, Kiyohiro, and Lang, Kevin J. Phoneme Recognition using Time-delay Neural Networks. IEEE transactions on acoustics, speech, and signal processing, 37(3):328-339, 1989.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V, Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.

Yang, Zichao, Hu, Zhiting, Deng, Yuntian, Dyer, Chris, and Smola, Alex. Neural Machine Translation with Recurrent Attention Modeling. arXiv preprint arXiv:1607.05108, 2016.

Zhou, Jie, Cao, Ying, Wang, Xuguang, Li, Peng, and Xu, Wei. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation. arXiv preprint arXiv:1606.04199, 2016.
x_l is the result of the GLU activation function y^a_{l−1} ⊗ σ(y^b_{l−1}) with y_{l−1} = (y^a_{l−1}, y^b_{l−1}), and y^a_{l−1}, y^b_{l−1} i.i.d. Next, we formulate upper and lower bounds in order to approximate Var[x_l]. If y_{l−1} follows a symmetric distribution with mean 0, then

    Var[x_l] = Var[y^a_{l−1} σ(y^b_{l−1})]                                  (5)
             = E[(y^a_{l−1})² σ²(y^b_{l−1})] − E²[y^a_{l−1} σ(y^b_{l−1})]   (6)
             = Var[y^a_{l−1}] E[σ²(y^b_{l−1})].                             (7)

A lower bound is given by (1/4) Var[y^a_{l−1}] when expanding (6) with E²[σ(y^b_{l−1})] = 1/4.

Following (He et al., 2015b), we aim to satisfy the condition

    (1/4) n_l Var[w_l] = 1, ∀l                                              (18)

so that the activations in a network are neither exponentially magnified nor reduced. This is achieved by initializing W_l from N(0, √(4/n_l)).

A.2. Backward Pass

The gradient of a convolutional layer is computed via back-propagation as ∇x_l = W_l^⊤ ∇y_l. Considering separate gradients ∇y^a_l and ∇y^b_l for GLU, the gradient of x is given by
The approximation for the forward pass can be used for Var[∇y^a_l], and for estimating Var[∇y^b_l] we assume an upper bound on E[σ′(y^b_l)²] of 1/16 since σ′(y^b_l) ∈ [0, 1/4]. Hence,

    Var[∇y^a_l] ≤ (1/4) Var[∇x_{l+1}] + (1/16) Var[∇x_{l+1}] Var[y^b_l]     (23)
    Var[∇y^b_l] ≤ (1/16) Var[∇x_{l+1}] Var[y^a_l]                           (24)

We observe relatively small gradients in our network, typically around 0.001 at the start of training. Therefore, we approximate by discarding the quadratic terms above, i.e.

    Var[∇y^a_l] ≈ (1/4) Var[∇x_{l+1}]                                       (25)
    Var[∇y^b_l] ≈ 0                                                         (26)
    Var[∇x_l] ≈ (1/4) n̂_l Var[w^a_l] Var[∇x_{l+1}]                          (27)

As for the forward pass, the above result can be generalized to back-propagation through many successive layers, resulting in

    Var[∇x_2] ≈ Var[∇x_{L+1}] ∏_{l=2}^{L} (1/4) n̂_l Var[w^a_l]              (28)

and a similar condition, i.e. (1/4) n̂_l Var[w^a_l] = 1. In the networks we consider, successions of convolutional layers usually operate on the same number of inputs so that in most cases n_l = n̂_l. Note that W^b_l is discarded in the approximation; however, for the sake of consistency we use the same initialization for W^a_l and W^b_l.

… of r and E[x] = 0, the variance after dropout is

    Var[xr] = E[r]² Var[x] + Var[r] Var[x]                                  (29)
            = (1 + (1 − p)/p) Var[x]                                        (30)
            = (1/p) Var[x]                                                  (31)

Assuming that the input of a convolutional layer has been subject to dropout with a retain probability p, the variances of the forward and backward activations from A.1 and A.2 can now be approximated with

    Var[x_{l+1}] ≈ (1/(4p)) n_l Var[w_l] Var[x_l]   and                     (32)
    Var[∇x_l] ≈ (1/(4p)) n̂_l Var[w^a_l] Var[∇x_{l+1}].                      (33)

This amounts to a modified initialization of W_l from a normal distribution with zero mean and a standard deviation of √(4p/n). For layers without a succeeding GLU activation function, we initialize weights from N(0, √(p/n)) to calibrate for any immediately preceding dropout application.

B. Upper Bound on Squared Sigmoid

The sigmoid function σ(x) can be expressed as a hyperbolic tangent by using the identity tanh(x) = 2σ(2x) − 1. The derivative of tanh is tanh′(x) = 1 − tanh²(x), and with tanh(x) ∈ [0, 1], x ≥ 0 it holds that

    tanh′(x) ≤ 1, x ≥ 0                                                     (34)
    ∫₀ˣ tanh′(x) dx ≤ ∫₀ˣ 1 dx                                              (35)
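The dropout identity in (29)-(31) is straightforward to confirm by simulation. A quick Monte-Carlo check (ours, not from the paper), using inverted dropout exactly as described:

```python
import numpy as np

# With retain probability p, inverted dropout multiplies inputs by 1/p
# with probability p (and by 0 otherwise), scaling the variance by 1/p.
rng = np.random.default_rng(0)
p = 0.8
x = rng.normal(0.0, 1.0, size=1_000_000)
r = (rng.random(x.shape) < p) / p     # Bernoulli(p) scaled by 1/p, so E[r] = 1
print(np.var(x * r), np.var(x) / p)   # both ~1.25, i.e. Var[x] / p
```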
Figure 3 (panels for Layer 7 and Layer 8). Attention scores for different decoder layers for a sentence translated from English (y-axis) to German (x-axis). This model uses 8 decoder layers and an 80k BPE vocabulary.