
International Journal of Scientific Research in Engineering and Management (IJSREM)

Volume: 07 Issue: 12 | December - 2023 SJIF Rating: 8.176 ISSN: 2582-3930

AUTOMATIC SPEECH RECOGNITION USING DEEP NEURAL NETWORKS

1st Shivam Kushwaha, Computer Science Engineering, Chandigarh University, Mohali, India, [email protected]
2nd Piyush Deep, Computer Science Engineering, Chandigarh University, Mohali, India, [email protected]
3rd Mohd Muaz, Computer Science Engineering, Chandigarh University, Mohali, India, [email protected]
4th Er. Shafalii Sharma, Computer Science Engineering, Chandigarh University, Mohali, India, [email protected]

Abstract - This research paper reviews the evolution and current state of Automatic Speech Recognition (ASR) systems based on Deep Neural Networks (DNNs). It covers the architectures, training techniques, evaluation of model performance, and emerging trends specific to deep neural networks in ASR models. It also discusses the challenges faced while building and deploying speech recognition models, including limited data availability and adaptability. We examine how deep neural networks have transformed automatic speech recognition and offer insights for improving speech recognition technology across applications ranging from healthcare to smart devices.

INDEX TERMS - Automatic Speech Recognition, Deep Neural Networks, Language Modeling, Robustness to Noise, Speech Modelling

I. INTRODUCTION

A. Significance of the model
Automatic Speech Recognition (ASR) models play a vital role in converting spoken language into text, which enables seamless interaction between humans and machines. They have changed communication systems by powering voice-controlled devices, language translation, and accessibility tools for the hearing impaired. ASR models have a large number of applications, including healthcare, entertainment, education, productivity, and customer service.

B. Objectives of the research
The primary aim of research in Automatic Speech Recognition with Deep Neural Networks is to improve the precision, effectiveness, and reliability of speech recognition models. Deep neural network structures help to capture complex speech patterns, context, and nuances, and to cope with noise. This allows speech recognition models to handle different languages, accents, and environmental factors with higher accuracy. Performance is further improved through innovative neural network designs and data augmentation approaches.

II. LITERATURE REVIEW

A considerable body of research on automatic speech recognition using deep neural networks has resulted in significant advancements in communication and human-machine interaction. We reviewed a number of literature surveys before conducting this research, covering the evolution, methodologies, challenges, and future directions of this field. The history of ASR shows a significant shift from rule-based models to statistical models and then to neural networks. The ability of Deep Neural Networks to model complex patterns has become of great significance in ASR research. They can handle large datasets and
© 2023, IJSREM | www.ijsrem.com DOI: 10.55041/IJSREM27292 | Page 1

their hierarchical representations.

Model accuracy, measured by metrics such as the word error rate and character error rate, also remains vital. End-to-end frameworks have been introduced for automatic speech recognition, particularly those based on recurrent neural networks and attention mechanisms. The Listen, Attend and Spell (LAS) model also served as a source of inspiration for speech recognition models. Advances in natural language processing and related fields have also contributed greatly to machine translation. ASR models have been further enhanced with the help of convolutional neural networks. A further objective in this line of work is an ASR model that integrates acoustic modelling and language modelling into a single recurrent neural network architecture.

III. CHARACTERISTICS OF ASR

Automatic speech recognition models can be characterized along three main dimensions: speaker dependence, vocabulary semantics, and speech continuity. A speech recognition model can be speaker-dependent, in which case the system has to be trained for every single speaker, or speaker-independent, in which case the training database contains speech examples from many different speakers, which helps the system to recognize a new speaker. According to speech continuity, there are four types of systems: isolated word recognition systems, connected word recognition systems, continuous speech recognition systems, and word spotting systems. Vocabularies are used to train the models, and they can be small or large. Several kinds of errors can occur in automatic speech recognition models: insertion errors, which occur when the system perceives noise as a speech unit; substitution errors, which occur when the recognizer incorrectly identifies an utterance; and deletion errors, which occur when the model ignores an utterance. The errors can be direct, intent, and indirect.

IV. RESEARCH METHODOLOGIES

A mathematical model of a DNN-based Automatic Speech Recognition system is useful for demonstrating how the model works. ASR systems consist of three components: an acoustic model, a language model, and a decoding algorithm. Here, we illustrate the mathematical model, covering all the steps necessary for the working of the machine learning model.

A. Input Representation:

Figure: Structure of ASR system

Let $X = \{x_1, x_2, \dots, x_T\}$ be the input speech signal, where $x_t$ is the feature vector at time $t$.

B. Feature Extraction:

Extract the features from the raw signal, e.g., using Mel-Frequency Cepstral Coefficients.
Let $F(X)$ be the feature sequence.
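
As a rough illustration of the input representation and feature extraction steps, the sketch below splits a waveform into overlapping frames and computes one small feature vector per frame, giving a sequence with one vector per time step. It is a simplified stand-in and not an MFCC implementation: the log-energy and zero-crossing features, the function names, and the frame sizes (400 samples with a hop of 160) are assumptions chosen for brevity.

```python
import math

def frame_signal(signal, frame_len=400, hop=160):
    """Split a list of samples into overlapping frames (a trailing partial frame is dropped)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def frame_features(frame):
    """Return a toy feature vector [log energy, zero-crossing rate] for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small constant avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def extract_features(signal, frame_len=400, hop=160):
    """Map a raw signal to a feature sequence, one vector per frame."""
    return [frame_features(f) for f in frame_signal(signal, frame_len, hop)]

# Synthetic example: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = extract_features(tone)
```

In practice a dedicated audio library would usually compute MFCCs; the point here is only the shape of the data, a list of per-frame feature vectors.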


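C. Neural Network Architecture:

Define a deep neural network with $L$ layers. Let $W^{(l)}$ and $b^{(l)}$ represent the weights and biases of layer $l$, respectively.

The output of layer $l$ is given by

$$h^{(l)} = g\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$$

where $g$ is the activation function and $h^{(l-1)}$ is the output of the previous layer.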
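
The layer-by-layer computation can be sketched in a few lines: each layer applies its weights and biases to the previous layer's output and then passes the result through the activation function. This is a minimal illustration only. The choice of ReLU for g, the helper names, and the hand-picked weights are assumptions and are not taken from the paper.

```python
def relu(v):
    """Element-wise ReLU, used here as an example activation g."""
    return [max(0.0, x) for x in v]

def affine(W, h, b):
    """Compute W h + b for a weight matrix W (list of rows), input h and bias b."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def forward(layers, x, g=relu):
    """Apply h = g(W h_prev + b) for each (W, b) pair, in order."""
    h = x
    for W, b in layers:
        h = g(affine(W, h, b))
    return h

# Two tiny layers with hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
    ([[2.0, 0.0]], [-1.0]),
]
output = forward(layers, [3.0, 1.0])  # [3.0]
```

A real acoustic model would use a numerical library and learned weights; the loop structure is the same.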

D. Input Layer:

The input layer takes the feature sequence $F(X)$ as its input.

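E. Output Layer:

The output layer produces posterior probabilities for each phoneme or subword unit.
If there are $N$ units, the output is

$$Y = [y_1, y_2, \dots, y_N]$$

where $y_i$ is the probability of unit $i$.

F. Training:

Define a training dataset $\{(X^{(i)}, t^{(i)})\}_{i=1}^{M}$ of input sequences and their target labels.

Minimize the cross-entropy loss function:

$$\mathcal{L} = -\sum_{i=1}^{M} \sum_{j=1}^{N} t_j^{(i)} \log y_j^{(i)}$$

where $y_j^{(i)}$ is the predicted probability of unit $j$ for training example $i$, and $t_j^{(i)}$ is the corresponding target (1 for the correct unit, 0 otherwise).

G. Inference:

Given a new input sequence $X$, the ASR system predicts the sequence of units using the trained neural network.

H. Decoding:

A decoding algorithm is used to map the sequence of predicted units to the final recognized text.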
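
The output, training, and decoding steps can be tied together in a small sketch: raw scores become posterior probabilities, the cross-entropy loss penalizes low probability on the correct unit, and a decoder maps per-frame predictions to a unit sequence. Several parts are assumptions for illustration: softmax as the output normalization, the three-unit inventory, the made-up scores, and the greedy decoder that merges consecutive repeats. The paper does not specify its decoding algorithm, and practical decoders typically also consult a language model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(predicted, target_index):
    """Negative log-probability assigned to the correct unit."""
    return -math.log(predicted[target_index] + 1e-12)

def greedy_decode(frame_probs, units):
    """Pick the most probable unit per frame and merge consecutive repeats."""
    decoded = []
    for probs in frame_probs:
        best = units[max(range(len(probs)), key=probs.__getitem__)]
        if not decoded or decoded[-1] != best:
            decoded.append(best)
    return decoded

units = ["k", "ae", "t"]  # hypothetical unit inventory
frame_scores = [[4.0, 0.5, 0.1], [3.5, 1.0, 0.2], [0.2, 5.0, 0.3], [0.1, 0.4, 6.0]]
frame_probs = [softmax(s) for s in frame_scores]
targets = [0, 0, 1, 2]  # index of the correct unit for each frame
loss = sum(cross_entropy(p, t) for p, t in zip(frame_probs, targets))
decoded = greedy_decode(frame_probs, units)  # ["k", "ae", "t"]
```

Merging repeats is a simplification: it would also collapse a unit that is genuinely spoken twice in a row, which is one reason real systems use more careful decoding schemes.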


V. APPLICATIONS OF ASR

A. Voice assistants:
Voice assistants use automatic speech recognition technology to convert spoken language into text, which lets users interact with machines through voice commands. Deep neural networks improve voice recognition accuracy and the understanding of natural language. The assistants also produce realistic speech using DNNs, which enhances the user experience.

Figure: Working of voice assistants

B. Call centres:
Automatic speech recognition is also used in call centres to improve customer service and operational efficiency. It automates the transcription of conversations between customers and agents, facilitating real-time services.

C. Language translations:
An ASR model converts spoken language into text, and deep neural network architectures improve this text for translation. Machine translation models are then used to translate the speech into different languages. The synergy between ASR and DNNs helps to translate in real time.

D. Security and Authentication:
Speech recognition and deep neural networks play a vital role in voice-based authentication. ASR verifies the user's spoken language and a DNN verifies the voice patterns for biometric authentication. This combination helps to verify the user in applications such as phone locks, home-control systems, and voice assistants.

VI. CHALLENGES FACED

A. Data scarcity:
The data available for training automatic speech recognition models is limited in both amount and diversity. This reduces the precision with which the system converts speech to text across different accents, languages, and speaking styles, which ultimately degrades the performance of speech recognition models.

B. Robustness to noise:
In speech processing, robustness to noise means that the system sustains accurate performance despite the presence of background noise or disturbances. Methods such as noise reduction and resilient feature extraction can be used to improve the precision and accuracy of speech recognition in noisy settings.

C. Model complexity:
The size and complexity of neural network architectures complicate model design and deployment. The more complex the model, the more parameters and detailed aspects it includes. Such models may require larger amounts of data and computational resources.

D. Scalability:
Scaling automatic speech recognition systems based on deep neural networks may face hurdles when they are deployed across hardware and must manage increased data sizes. Problems may include optimization requirements, distributed computing resources, and maintaining real-time efficiency.

VII. RESULTS OF THE RESEARCH

Research in the field of automatic speech recognition has shown significant advancements in the accuracy and robustness of DNN-based ASR models. The DNN-based ASR model has achieved a lower word error rate and character error rate, which made it


outperform the traditional systems. It has also shown robustness to various background noises and other environmental factors.

Figure 7.1 The STT demonstration

The deep neural network architectures have been optimized. They can handle large datasets and complex tasks, which improves the scalability of the system. Enhancements have also been made in DNN architectures and methodologies. DNN-based ASR models have found a number of applications in industries such as healthcare, education, and smart devices.

Although a number of improvements have been made, achieving robustness to all types of noise and environmental factors is still an ongoing focus area. Work on supporting multilingual language models is also still ongoing. Overall, the research outcomes in automatic speech recognition models have shown remarkable progress in practical deployment, robustness, and accuracy.

VIII. FUTURE DIRECTIONS

There is still a lot to explore in the field of automatic speech recognition using deep neural networks. Data augmentation is very important in low-resource ASR scenarios, where only a limited data set is available for training the model. Techniques like adding noise, altering the speed, or generating synthetic data are very helpful for ASR models.

Furthermore, robustness and noise handling are very important. Robust ASR models can transcribe speech accurately even in noisy environments. Noise reduction techniques and acoustic modelling help to develop more reliable models. As work on neural networks deepens, it will help to improve the customization and personalization of ASR.

IX. CONCLUSION

To conclude, this research paper has provided an overall exploration of deep neural network-based automatic speech recognition systems. It describes the shift from traditional ASR methodologies to DNN-based automatic speech recognition models, which have revolutionized the field of speech recognition. Dissection of the architectural paradigms and training strategies of speech recognition models has shown remarkable improvements in the accuracy and versatility of ASR models. The field also faces a number of challenges, including scarcity of data, noise resilience, and speaker variability. ASR continues to evolve, making the future of ASR systems promising. This paper covers not only the present state of the field but also its progress and future innovations.
