Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case,
Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen,
Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang,
Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Vaino Johannes,
Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li,
Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian,
Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta,
Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang,
Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan,
Jun Zhan, Zhenyao Zhu
Baidu Silicon Valley AI Lab, 1195 Bordeaux Avenue, Sunnyvale CA 94086 USA
Baidu Speech Technology Group, No. 10 Xibeiwang East Street, Ke Ji Yuan, Haidian District, Beijing 100193 CHINA
with almost all state-of-the-art speech work containing some form of deep neural network (Mohamed et al., 2011; Hinton et al., 2012; Dahl et al., 2011; N. Jaitly & Vanhoucke, 2012; Seide et al., 2011). Convolutional networks have also been found beneficial for acoustic models (Abdel-Hamid et al., 2012; Sainath et al., 2013). Recurrent neural networks are beginning to be deployed in state-of-the-art recognizers (Graves et al., 2013; H. Sak et al., 2014) and work well with convolutional layers for the feature extraction (Sainath et al., 2015).

End-to-end speech recognition is an active area of research, showing compelling results when used to rescore the outputs of a DNN-HMM (Graves & Jaitly, 2014a) and standalone (Hannun et al., 2014a). The RNN encoder-decoder with attention performs well in predicting phonemes (Chorowski et al., 2015) or graphemes (Bahdanau et al., 2015; Chan et al., 2015). The CTC loss function (Graves et al., 2006) coupled with an RNN to model temporal information also performs well in end-to-end speech recognition with character outputs (Graves & Jaitly, 2014a; Hannun et al., 2014b;a; Maas et al., 2015). The CTC-RNN model also works well in predicting phonemes (Miao et al., 2015; Sak et al., 2015), though a lexicon is still needed in this case.

Exploiting scale in deep learning has been central to the success of the field thus far (Krizhevsky et al., 2012; Le et al., 2012). Training on a single GPU resulted in substantial performance gains (Raina et al., 2009), which were subsequently scaled linearly to two (Krizhevsky et al., 2012) or more GPUs (Coates et al., 2013). We take advantage of work in increasing individual GPU efficiency for low-level deep learning primitives (Chetlur et al.). We build on past work in using model-parallelism (Coates et al., 2013), data-parallelism (Dean et al., 2012), or a combination of the two (Szegedy et al., 2014; Hannun et al., 2014a) to create a fast and highly scalable system for training deep RNNs in speech recognition.

Data has also been central to the success of end-to-end speech recognition, with over 7000 hours of labeled speech used in (Hannun et al., 2014a). Data augmentation has been highly effective in improving the performance of deep learning in computer vision (LeCun et al., 2004; Sapp et al., 2008; Coates et al., 2011) and speech recognition (Gales et al., 2009; Hannun et al., 2014a). Existing speech systems can also be used to bootstrap new data collection. For example, an existing speech engine can be used to align and filter thousands of hours of audiobooks (Panayotov et al., 2015). We draw inspiration from these past approaches in bootstrapping larger datasets and data augmentation to increase the effective amount of labeled data for our system.

Figure 1: Architecture of the deep RNN used in both English and Mandarin speech. [Diagram: a spectrogram input feeds one or more 1D or 2D invariant convolution layers, followed by the recurrent stack.]

3. Model Architecture

Figure 1 shows the wireframe of our architecture, and lays out the swappable components which we explore in detail in this paper. Our system (similar at its core to the one in (Hannun et al., 2014a)) is a recurrent neural network (RNN) with one or more convolutional input layers, followed by multiple recurrent (uni- or bidirectional) layers and one fully connected layer before a softmax layer. The network is trained end-to-end using the CTC loss function (Graves et al., 2006), which allows us to directly predict the sequences of characters from input audio.²

The inputs to the network are a sequence of log-spectrograms of power-normalized audio clips, calculated on 20ms windows. The outputs are the alphabet of each language. At each output time-step t, the RNN makes a prediction, p(ℓ_t|x), where ℓ_t is either a character in the alphabet or the blank symbol. In English we have ℓ_t ∈ {a, b, c, ..., z, space, apostrophe, blank}, where we have added the space symbol to denote word boundaries. For the Mandarin system the network outputs simplified Chinese characters.

² Most of our experiments use bidirectional recurrent layers with clipped rectified-linear units (ReLU) σ(x) = min{max{x, 0}, 20} as the activation function.

9-layer, 7 RNN, no SortaGrad    11.96    9.78

Table 1: Comparison of WER on a development set as we vary depth of RNN, application of BatchNorm and SortaGrad, and type of recurrent hidden unit. All networks have 38M parameters; as depth increases, the number of hidden units per layer decreases. The last two columns compare the performance of the model on the dev set as we change the type of the recurrent hidden unit. [Only the row above survives in this copy.]

Figure 2: Training curves of two models trained with and without BatchNorm (BN). We see a wider gap in performance on the deeper 9-7 network (which has 9 layers in total, 7 of which are vanilla bidirectional RNNs) than the shallower 5-1 network (in which only 1 of the 5 layers is a bidirectional RNN). We start the plot after the first epoch of training as the curve is more difficult to interpret due to the SortaGrad curriculum method mentioned in Section 3.2. [Axes: cost versus iteration (×10³).]
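The featurization described above (log-spectrograms of power-normalized audio computed on 20 ms windows) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 10 ms hop, the Hann window, and the per-utterance RMS normalization are our own assumptions, as the paper only specifies the 20 ms window.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=16000, window_ms=20, hop_ms=10, eps=1e-10):
    """Log power spectrogram of a mono waveform (illustrative sketch)."""
    # Power-normalize the clip so recordings at different levels are comparable.
    audio = audio / (np.sqrt(np.mean(audio ** 2)) + eps)
    win = int(sample_rate * window_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    window = np.hanning(win)
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        frame = audio[start:start + win] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + eps))
    # Shape: (time_steps, frequency_bins)
    return np.array(frames)

# 1 second of a 440 Hz tone -> one frame every 10 ms, 161 frequency bins.
t = np.arange(16000) / 16000.0
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

With a 50 Hz bin width (16 kHz / 320 samples), the energy of the tone concentrates near bin 9, which is a quick sanity check on the featurization.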
At inference time, CTC models are paired with a language model trained on a bigger corpus of text. We use a specialized beam search (Hannun et al., 2014b) to find the transcription y that maximizes

    Q(y) = log(p_RNN(y|x)) + α log(p_LM(y)) + β wc(y)    (1)

where wc(y) is the number of words (English) or characters (Chinese) in the transcription y. The weight α controls the relative contributions of the language model and the CTC network. The weight β encourages more words in the transcription. These parameters are tuned on a held-out development set.

3.1. Batch Normalization for Deep RNNs

To efficiently absorb data as we scale the training set, we increase the depth of the networks by adding more recurrent layers. However, it becomes more challenging to train networks using gradient descent as the size and depth increase. We have experimented with the Batch Normalization (BatchNorm) method to train deeper nets faster (Ioffe & Szegedy, 2015). Recent research has shown that BatchNorm can speed convergence of RNN training, though it does not always improve generalization error (Laurent et al., 2015). In contrast, we find that when applied to very deep networks of RNNs on large data sets, the variant of BatchNorm we use substantially improves final generalization error in addition to accelerating training.

As in (Laurent et al., 2015), there are two ways of applying BatchNorm to the recurrent operation. A natural extension is to insert a BatchNorm transformation, B(·), immediately before every non-linearity as follows:

    h_t^l = f(B(W^l h_t^{l-1} + U^l h_{t-1}^l)).    (3)

In this case the mean and variance statistics are accumulated over a single time-step of the minibatch. We did not find this to be effective.

An alternative (sequence-wise normalization) is to batch normalize only the vertical connections. The recurrent computation is given by

    h_t^l = f(B(W^l h_t^{l-1}) + U^l h_{t-1}^l).    (4)

For each hidden unit we compute the mean and variance statistics over all items in the minibatch over the length of the sequence. Figure 2 shows that deep networks converge faster with sequence-wise normalization. Table 1 shows that the performance improvement from sequence-wise normalization increases with the depth of the network, with a 12% performance difference for the deepest network. We store a running average of the mean and variance for the neuron collected during training, and use these for evaluation (Ioffe & Szegedy, 2015).
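Sequence-wise normalization, as in equation (4), can be sketched in NumPy. This is our own minimal illustration of the statistic-pooling step, not the authors' implementation; the learned scale and shift parameters are the usual BatchNorm gamma and beta.

```python
import numpy as np

def sequence_batchnorm(x, gamma, beta, eps=1e-5):
    """Sequence-wise BatchNorm: normalize each hidden unit using mean and
    variance pooled over both the minibatch and time axes.

    x: pre-activations of shape (batch, time, hidden) -- the W^l h^{l-1} term.
    gamma, beta: learned scale/shift, shape (hidden,).
    """
    # Statistics per hidden unit, over all items and all time-steps.
    mean = x.mean(axis=(0, 1))
    var = x.var(axis=(0, 1))
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: after normalization each hidden unit is ~zero mean, unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 50, 8))
y = sequence_batchnorm(x, gamma=np.ones(8), beta=np.zeros(8))
```

Pooling over the time axis is exactly what distinguishes this variant from equation (3), where the statistics would come from a single time-step of the minibatch.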
We experiment with adding between one and three layers of convolution. These are both in the time-and-frequency domain (2D) and in the time-only domain (1D). In all cases we use a "same" convolution. In some cases we specify a stride (subsampling) across either dimension, which reduces the size of the output.

With future context of size τ, the lookahead convolution computes

    r_{t,i} = Σ_{j=1}^{τ+1} W_{i,j} h_{t+j-1,i},  for 1 ≤ i ≤ d.    (5)

We place the lookahead convolution above all recurrent layers. This allows us to stream all computation below the lookahead convolution on a finer granularity.
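Equation (5) amounts to a per-feature 1D convolution over a small window of future time-steps. The sketch below is our own illustration under the assumption that activations past the end of the utterance are treated as zero; it is not the authors' released code.

```python
import numpy as np

def lookahead_convolution(h, W):
    """Lookahead ("row") convolution as in equation (5):
    r[t, i] = sum_{j=1..tau+1} W[i, j] * h[t + j - 1, i].

    h: recurrent activations, shape (T, d).
    W: per-feature filter, shape (d, tau + 1).
    Steps past the end of the sequence contribute zero (assumed padding).
    """
    T, d = h.shape
    context = W.shape[1]                       # tau + 1 taps (incl. current step)
    padded = np.vstack([h, np.zeros((context - 1, d))])
    r = np.zeros_like(h)
    for j in range(context):
        # The j-th tap mixes in the activation j steps ahead, feature-wise.
        r += W[:, j] * padded[j:j + T, :]
    return r

# With a single tap of ones, the layer reduces to the identity.
h = np.arange(12, dtype=float).reshape(6, 2)
r = lookahead_convolution(h, np.ones((2, 1)))
```

Because each output only needs τ future steps rather than the whole utterance, a unidirectional model with this layer can stream: everything below the lookahead convolution runs with at most τ steps of delay.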
Table 2: Comparison of WER for different configurations of convolutional layers. In all cases, the convolutions are
followed by 7 recurrent layers and 1 fully connected layer. For 2D convolutions the first dimension is frequency and the
second dimension is time. Each model is trained with BatchNorm, SortaGrad, and has 35M parameters.
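For concreteness, a "same" convolution with striding over a (frequency, time) spectrogram can be sketched as below. The kernel size, stride, and single-channel setup are illustrative assumptions, not the exact configurations compared in Table 2.

```python
import numpy as np

def same_conv2d_strided(x, kernel, stride=(1, 1)):
    """ "Same" 2D convolution over a (freq, time) array with optional
    striding (subsampling) in either dimension. One in/out channel,
    zero padding; illustrative only."""
    kf, kt = kernel.shape
    sf, st = stride
    # Pad so an unstrided output would match the input size ("same").
    pf, pt = kf // 2, kt // 2
    xp = np.pad(x, ((pf, pf), (pt, pt)))
    F, T = x.shape
    out_f = (F + sf - 1) // sf                 # ceil(F / sf)
    out_t = (T + st - 1) // st                 # ceil(T / st)
    out = np.zeros((out_f, out_t))
    for i in range(out_f):
        for j in range(out_t):
            patch = xp[i * sf:i * sf + kf, j * st:j * st + kt]
            out[i, j] = np.sum(patch * kernel)
    return out

# A stride of 2 in time halves the number of time-steps the RNN must process.
spec = np.random.default_rng(1).normal(size=(41, 100))
y = same_conv2d_strided(spec, np.ones((5, 5)) / 25.0, stride=(2, 2))
```

Striding in time is what reduces the sequence length seen by the recurrent layers, which is the main computational benefit of subsampling in the convolutional front-end.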
3.6. Adaptation to Mandarin

To port a traditional speech recognition pipeline to another language typically requires a significant amount of new language-specific development. For example, one often needs to hand-engineer a pronunciation model (Shan et al., 2010). We may also need to explicitly model language-specific pronunciation features, such as tones in Mandarin (Shan et al., 2010; Niu et al., 2013). Since our end-to-end system directly predicts characters, these time-consuming efforts are no longer needed. This has enabled us to quickly create an end-to-end Mandarin speech recognition system (that outputs Chinese characters) using the approach described above with only a few changes.

The only architectural changes we make to our networks are due to the characteristics of the Chinese character set. The network outputs probabilities for about 6000 characters, which includes the Roman alphabet, since hybrid Chinese-English transcripts are common. We incur an out-of-vocabulary error at evaluation time if a character is not contained in this set. This is not a major concern, as our test set has only 0.74% out-of-vocabulary characters.

We use a character-level language model in Mandarin, as words are not usually segmented in text. In Section 6.2 we show that our Mandarin speech models show roughly the same improvements from architectural changes as our English speech models, suggesting that modeling knowledge from development in one language transfers well to others.

4. System Optimizations

Our networks have tens of millions of parameters, and a training experiment involves tens of single-precision exaFLOPs. Since our ability to evaluate hypotheses about our data and models depends on training speed, we created a highly optimized training system based on high performance computing (HPC) infrastructure.³ Although many frameworks exist for training deep networks on parallel machines, we have found that our ability to scale well is often bottlenecked by unoptimized routines that are taken for granted. Therefore, we focus on careful optimization of the most important routines used for training. Specifically, we created customized All-Reduce code for OpenMPI to sum gradients across GPUs on multiple nodes, developed a fast implementation of CTC for GPUs, and use custom memory allocators. Taken together, these techniques enable us to sustain overall 45% of theoretical peak performance on each node.

Our training distributes work over multiple GPUs in a data-parallel fashion with synchronous SGD, where each GPU uses a local copy of the model to work on a portion of the current minibatch and then exchanges computed gradients with all other GPUs. We prefer synchronous SGD because it is reproducible, which facilitates discovering and fixing regressions. In this setup, however, the GPUs must communicate quickly (using an "All-Reduce" operation) at each iteration in order to avoid wasting computational cycles. Prior work has used asynchronous updates to mitigate this issue (Dean et al., 2012; Recht et al., 2011). We instead focused on optimizing the All-Reduce operation itself, achieving a 4x-21x speedup using techniques to reduce CPU-GPU communication for our specific workloads. Similarly, to enhance overall computation, we have used highly-optimized kernels from Nervana Systems and NVIDIA that are tuned for our deep learning applications. We similarly discovered that custom memory allocation routines were crucial to maximizing performance as they reduce the number of synchronizations between GPU and CPU.

We also found that the CTC cost computation accounted for a significant fraction of running time. Since no public well-optimized code for CTC existed, we developed a fast GPU implementation that reduced overall training time by 10-20%.⁴

³ Our software runs on dense compute nodes with 8 NVIDIA Titan X GPUs per node with a theoretical peak throughput of 48 single-precision TFLOP/s.
⁴ Details of our CTC implementation will be made available along with open source code.
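The data-parallel synchronous SGD scheme above can be illustrated with a toy in-process simulation. The "All-Reduce" here is a plain sum over simulated replicas standing in for the authors' OpenMPI/GPU implementation, and the least-squares model is only a stand-in for the RNN; everything here is an assumption for illustration.

```python
import numpy as np

def allreduce_mean(grads):
    """Toy All-Reduce: sum gradients from every replica, then average.
    Stands in for an MPI/NCCL all-reduce across GPUs."""
    return np.sum(grads, axis=0) / len(grads)

def sync_sgd_step(weights, batch_x, batch_y, lr=0.1, n_replicas=4):
    """One synchronous SGD step on a least-squares loss, simulated over
    n_replicas workers that each see one shard of the minibatch."""
    shards_x = np.array_split(batch_x, n_replicas)
    shards_y = np.array_split(batch_y, n_replicas)
    # Each replica computes a local gradient on its shard...
    local_grads = [
        2 * xs.T @ (xs @ weights - ys) / len(xs)
        for xs, ys in zip(shards_x, shards_y)
    ]
    # ...then all replicas exchange gradients and apply the same update,
    # so every replica's weights stay bit-identical across steps.
    return weights - lr * allreduce_mean(local_grads)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
targets = X @ true_w
w = np.zeros(3)
for _ in range(200):
    w = sync_sgd_step(w, X, targets)
```

With equal shard sizes, the averaged gradient equals the full-minibatch gradient, which is why synchronous SGD is reproducible: every step is deterministic given the minibatch.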
Test set                          Ours     Human
Read
  WSJ eval'92                     3.10     5.03
  WSJ eval'93                     4.42     8.08
  LibriSpeech test-clean          5.15     5.83
  LibriSpeech test-other          12.73    12.69
Accented
  VoxForge American-Canadian      7.94     4.85
  VoxForge Commonwealth           14.85    8.15
  VoxForge European               18.44    12.76
  VoxForge Indian                 22.89    22.15
Noisy
  CHiME eval real                 21.59    11.84

…from the WSJ test set collected in real noisy environments and with artificially added noise. Using all 6 channels of the CHiME audio can provide substantial performance improvements (Yoshioka et al., 2015). We use a single channel for all our models, since access to multi-channel audio is not yet pervasive. The gap between our system and human-level performance is larger when the data comes from a real noisy environment instead of synthetically adding noise to clean speech.

6.2. Mandarin
…low-latency transcription, we built a batching scheduler called Batch Dispatch that assembles streams of data from user requests into batches before performing RNN forward propagation on these batches. With this scheduler, we can trade increased batch size, and consequently improved efficiency, for increased latency.

We use an eager batching scheme that processes each batch as soon as the previous batch is completed, regardless of how much work is ready by that point. This scheduling algorithm balances efficiency and latency, achieving relatively small dynamic batch sizes of up to 10 samples per batch, with median batch size proportional to server load.

Load          Median    98%ile
10 streams    44        70
20 streams    48        86
30 streams    67        114

Table 7: Latency distribution (ms) versus load

We see in Table 7 that our system achieves a median latency of 44 ms, and a 98th percentile latency of 70 ms, when loaded with 10 concurrent streams. This server uses one NVIDIA Quadro K1200 GPU for RNN evaluation. As designed, Batch Dispatch shifts work to larger batches as server load grows, keeping latency low.

Our deployment system evaluates RNNs in half-precision arithmetic, which has no measurable accuracy impact but significantly improves efficiency. We wrote our own 16-bit matrix-matrix multiply routines for this task, substantially improving throughput for our relatively small batches.

Performing the beam search involves repeated lookups in the n-gram language model, most of which translate to uncached reads from memory. To reduce the cost of these lookups, we employ a heuristic: only consider the fewest number of characters whose cumulative probability is at least p. In practice, we find that p = 0.99 works well, and additionally we limit the search to 40 characters. This speeds up the cumulative Mandarin language model lookup time by a factor of 150x, and has a negligible effect on CER (0.1-0.3% relative).

7.1. Deep Speech in a production environment

Deep Speech has been integrated with a state-of-the-art production speech pipeline for user applications. We have found several key challenges that affect the deployment of end-to-end deep learning methods like ours. First, we have found that even modest amounts of application-specific training data are invaluable despite the large quantities of general speech data used for training. For example, while we are able to train on more than 10,000 hours of Mandarin speech, we find that the addition of just 500 hours of application-specific data can significantly enhance performance for the application. Similarly, application-specific language models are important for achieving top accuracy, and we leverage strong existing n-gram models with our Deep Speech system. Finally, we note that since our system is trained from a wide range of labeled training data to output characters directly, there are idiosyncratic conventions for transcriptions in each application that must be handled in post-processing (such as the formatting of digits). Thus, while our model has removed many complexities, more flexibility and application-awareness for end-to-end deep learning methods are open areas for further research.

8. Conclusion

End-to-end deep learning presents the exciting opportunity to improve speech recognition systems continually with increases in data and computation. Since the approach is highly generic, we have shown that it can quickly be applied to new languages. Creating high-performing recognizers for two very different languages, English and Mandarin, required essentially no expert knowledge of the languages. Finally, we have also shown that this approach can be efficiently deployed by batching user requests together on a GPU server, paving the way to deliver end-to-end deep learning technologies to users.

To achieve these results, we have explored various network architectures, finding several effective techniques: enhancements to numerical optimization through SortaGrad and Batch Normalization, and lookahead convolution for unidirectional models. This exploration was powered by a well-optimized training system, inspired by high performance computing, that allows us to train full-scale models on our large datasets in just a few days.

Overall, we believe our results confirm and exemplify the value of end-to-end deep learning methods for speech recognition in several settings. We believe these techniques will continue to scale.

References

Abdel-Hamid, Ossama, Mohamed, Abdel-rahman, Jiang, Hui, and Penn, Gerald. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.

Bahdanau, Dzmitry, Chorowski, Jan, Serdyuk, Dmitriy, Brakel, Philemon, and Bengio, Yoshua. End-to-end attention-based large vocabulary speech recognition. abs/1508.04395, 2015. http://arxiv.org/abs/1508.04395.

Barker, Jon, Marxer, Ricard, Vincent, Emmanuel, and Watanabe, Shinji. The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. 2015. Submitted to IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU).
Gales, M. J. F., Ragni, A., Aldamarki, H., and Gautier, C. Support vector machines for noise robust ASR. In ASRU, pp. 205–210, 2009.

Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014a.

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376. ACM, 2006.

Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014b.

Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

Lippmann, Richard P. Speech recognition by machines and humans. Speech Communication, 22(1):1–15, 1997.

Maas, Andrew, Xie, Ziang, Jurafsky, Daniel, and Ng, Andrew. Lexicon-free conversational speech recognition with neural networks. In NAACL, 2015.

Miao, Yajie, Gowayyed, Mohammad, and Metze, Florian. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In ASRU, 2015.

Mohamed, A., Dahl, G.E., and Hinton, G.E. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, (99), 2011. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5704567.

N. Jaitly, P. Nguyen, A. Senior, and Vanhoucke, V. Application of pretrained deep neural networks to large vocabulary speech recognition. In Interspeech, 2012.
Niu, Jianwei, Xie, Lei, Jia, Lei, and Hu, Na. Context-dependent deep neural networks for commercial Mandarin speech recognition applications. In APSIPA, 2013.

Panayotov, Vassil, Chen, Guoguo, Povey, Daniel, and Khudanpur, Sanjeev. LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, 2015.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. abs/1211.5063, 2012. http://arxiv.org/abs/1211.5063.

Raina, R., Madhavan, A., and Ng, A.Y. Large-scale deep unsupervised learning using graphics processors. In 26th International Conference on Machine Learning, 2009.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. Connectionist probability estimators in HMM speech recognition. IEEE Transactions on Speech and Audio Processing, 2(1):161–174, 1994.

Robinson, Tony, Hochberg, Mike, and Renals, Steve. The use of recurrent neural networks in continuous speech recognition. pp. 253–258, 1996.

Sainath, Tara, Vinyals, Oriol, Senior, Andrew, and Sak, Hasim. Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP, 2015.

Sainath, Tara N., Mohamed, Abdel-rahman, Kingsbury, Brian, and Ramabhadran, Bhuvana. Deep convolutional neural networks for LVCSR. In ICASSP, 2013.

Sak, Hasim, Senior, Andrew, Rao, Kanishka, and Beaufays, Francoise. Fast and accurate recurrent neural network acoustic models for speech recognition. abs/1507.06947, 2015. http://arxiv.org/abs/1507.06947.

Sapp, Benjamin, Saxena, Ashutosh, and Ng, Andrew. A fast data collection and augmentation procedure for object recognition. In AAAI Twenty-Third Conference on Artificial Intelligence, 2008.

Seide, Frank, Li, Gang, and Yu, Dong. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pp. 437–440, 2011.

Shan, Jiulong, Wu, Genqing, Hu, Zhihong, Tang, Xiliu, Jansche, Martin, and Moreno, Pedro. Search by voice in Mandarin Chinese. In Interspeech, 2010.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of momentum and initialization in deep learning. In 30th International Conference on Machine Learning, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. 2014.

Waibel, Alexander, Hanazawa, Toshiyuki, Hinton, Geoffrey, Shikano, Kiyohiro, and Lang, Kevin. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339, 1989.

Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Yu, M. F. C., Fabian, W. J., Espi, M., Higuchi, T., Araki, S., and Nakatani, T. The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices. In IEEE ASRU, 2015.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. abs/1410.4615, 2014. http://arxiv.org/abs/1410.4615.