Jabrenee Hussie Honors Thesis
A THESIS
May 2023
Abstract

Our society is designed in a way that rewards those who are comfortable in social situations and have strong interpersonal skills, while those who are not as strong in this area develop coping mechanisms to better navigate the world. The development of these skills comes from the ability to recognize the emotions of others and act accordingly. The inability to recognize emotions affects around 13% of the population. Doctors recommend different forms of therapy to cope, while researchers recommend understanding one's own physical, emotional, and biological responses in order to understand others. While these are good tools to help one learn about emotional responses, everyone reacts to situations differently, and learning how one internally reacts to a situation will not necessarily help them identify that reaction in others. Speech-based emotion recognition (SER) research aims to provide aid by creating deep learning models that analyze speech to identify the emotions of the speaker. Previous SER studies largely represent English speakers only. In this research, deep learning techniques are used to analyze speech sound signals in different languages to develop a model that predicts emotions from the speech it is given. This study analyzes 5 different languages and 7 basic sentiments. The primary model, which uses Convolutional Neural Networks (CNN), achieves 67% accuracy and a 78% F1 score; a secondary model using a Multilayer Perceptron (MLP) achieves 85% accuracy.
Introduction

The backbone of human life is our relationships with others. A part of the ability to maintain those connections is the ability to recognize the emotions of yourself and the people around you. Our society is designed in a way that those who find themselves comfortable and confident in social situations, with strong interpersonal skills, excel professionally and socially, while those who are not as strong in this area develop masks and coping mechanisms to better navigate the world.
Problem Statement
The inability to recognize emotions generally affects 13% of the population (Lo, 2021). This condition can develop from a host of diagnoses that can arise at any time in someone's life. It has the greatest impact on the social, economic, and mental well-being of those affected, and its effects extend to society more broadly: platonic, professional, and romantic relationships are all affected, because the foundation of maintaining these relationships lies in the ability to interact with, understand, and empathize with others. People whose brains develop or work differently can struggle professionally and socially, and employers suffer as well, because studies show that neurodiverse teams are 30% more productive than neurotypical ones and make fewer errors (MyDisabilityJobs, 2022). In the workforce as well as in social settings, gaps in pay, responsibility, and treatment can already be seen within many marginalized communities. The intersectionality between these communities can create social and economic disparities that put an immense amount of pressure on those affected. Researchers recommend that people living with this condition understand their own responses, such as their heart rate and its fluctuations, as well as journaling the physical and emotional responses that one experiences (Cherney, 2021). While these are good tools to help one learn about emotional responses, everyone reacts to situations differently, and learning how one internally reacts to a situation will not necessarily help them identify that reaction in others. This research aims to bridge the gap for those who live with these circumstances by analyzing speech to identify the emotions of the speaker and creating models that will aid in the ability to recognize emotions through speech.
Significance of Work
Artificial Intelligence
This year, technology news is starting to show a different focus when it comes to artificial intelligence, and there are currently many discussions around AI and emotion recognition. The market for emotion recognition-based AI is projected to grow substantially from now to 2029 (TheExpressWire, 2022). The projected usage lies in consumer and producer relations. For example, companies want to better understand customer behavior to know how to better sell their products. Many industries are branching out into computer vision and speech recognition to make their products more human-like (nishi, 2022). The appeal of emotional AI is that it allows markets to better understand people and provide more useful services (nishi, 2022). It has even been proposed that, in the era of online learning, the technology can aid teachers: they can collect and analyze students' reactions to help alleviate communication perils between teachers and students, which can lead to curriculum evolution that better serves students (nishi, 2022).
Psychology
With many mental conditions and disorders affecting one's ability to recognize emotion, we see a lot of emotion recognition training (Andersen, 2022), which is generally done by studying facial expressions and attaching them to emotions to help alleviate some of the pressures of context blindness (Andersen, 2022). For example, what does it look like when someone is anxious or uninterested? Could you denote that from a simple facial expression? There is a general understanding when it comes to the intersection between facial expressions, body language, and emotions, but what about when you do not have the ability to see those facial expressions? Conditions such as prosopagnosia, also known as face blindness, make other forms of emotion recognition very important (NHS, 2019). Living with any mental disorder, diagnosed or undiagnosed, can be alienating when no one understands what that experience is like and everyone expects a form of normalcy that you just cannot give them. Trying to cope with these expectations can be really damaging to one's mental health. For example, people with autism are six times more likely to attempt death by suicide and seven times more likely to die by suicide in comparison to those who are not autistic (Jachyra et al., 2022). Those disparities should not be taken lightly, and if speech-based emotion recognition research can help in any way, it is worth exploring.
Methodology
In layman's terms, the plan for this study is to use English phrases and their corresponding translations in French, Spanish, Japanese, Mandarin Chinese, and Korean to develop a system of speech-based emotion recognition. This format was chosen after carefully scouring related works and noticing the imbalance of representation. Many of the datasets used in related works are English-only and are biased toward mostly male speakers, so the creation of an alternative dataset helps give the study a more balanced and worldly view. This is an important aspect of the study because the inability to recognize emotions does not only affect English speakers.
Data collection is the first step. Collection consists of capturing clips ranging from 10 to 30 seconds, in all of the languages, by screen recording two Netflix series, Julie and The Phantoms and Heartstopper. The TESS dataset was also used in combination with these recorded clips. After the data is collected, one must classify the clips by emotion. The emotions focused on in this study are calm, happy, sad, angry, fearful, disgust, and surprised, labeled from 1-7 respectively. A Jupyter notebook is utilized for the analysis in this study. The data is composed in a spreadsheet, which is then imported into the notebook, and the dataset that was created is visualized using pie charts, spectrograms, and waveform plots. A minimal sketch of this loading and labeling step is shown below.
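As an illustration, the loading and labeling step might look like the following; the file name, column names, and the use of a CSV export of the spreadsheet are assumptions for illustration, not confirmed details of the study.

    import pandas as pd

    # Hypothetical spreadsheet export with one row per clip: the path to the
    # audio file, the language, and the emotion label as an integer from 1 to 7.
    df = pd.read_csv("clips.csv")  # assumed file name and export format

    # The seven emotions in this study, labeled 1-7 respectively.
    EMOTIONS = {1: "calm", 2: "happy", 3: "sad", 4: "angry",
                5: "fearful", 6: "disgust", 7: "surprised"}

    # Add a readable emotion column for visualization (e.g., pie charts).
    df["emotion"] = df["label"].map(EMOTIONS)
    print(df["emotion"].value_counts())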
Before one starts modeling and doing any kind of analysis, the data has to be prepared. We begin with data augmentation, and then we preprocess the data. Data augmentation is used to create new data samples and widen the threshold for the data we use. Adding these small changes helps account for changes that occur naturally, which helps build a general model. One makes changes to the pitch and length of a clip and injects noise, as in the sketch below.
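A minimal sketch of these three augmentations, using the librosa library, is shown here; the parameter values (noise level, stretch rate, number of pitch steps) are illustrative assumptions rather than the exact values used in the study.

    import numpy as np
    import librosa

    def add_noise(y, noise_level=0.005):
        """Inject random Gaussian noise into the signal."""
        return y + noise_level * np.random.randn(len(y))

    def stretch(y, rate=0.9):
        """Change the length (speed) of the clip without changing pitch."""
        return librosa.effects.time_stretch(y, rate=rate)

    def shift_pitch(y, sr, n_steps=2):
        """Shift the pitch by a number of semitones without changing length."""
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # Load one clip (assumed file name) and create three augmented variants.
    y, sr = librosa.load("clip.wav")
    augmented = [add_noise(y), stretch(y), shift_pitch(y, sr)]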
In data preprocessing, one does standardization, label encoding, and feature extraction. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, to shift the distribution to have a mean of zero and a standard deviation of one; it is used to improve the stability and performance of many learning algorithms. Label encoding converts categorical data into numerical features of a data set; the purpose of encoding in this manner is that machine learning algorithms assume, and require, data to be numeric. Feature extraction helps the algorithms get a better grasp of the important parts of the data that they should pay attention to and can easily consume. In this study we use the most common features for audio analysis, which are Zero Crossing Rate, Mel Frequency Cepstral Coefficients, and Root Mean Square (RMS) energy. A minimal sketch of this preprocessing is shown below.
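The sketch below assumes librosa for feature extraction and scikit-learn for standardization and label encoding; averaging the frame-level features into one fixed-length vector per clip is a common simplification and an assumption here, not a confirmed detail of the study.

    import numpy as np
    import librosa
    from sklearn.preprocessing import StandardScaler, LabelEncoder

    def extract_features(path):
        """Return one fixed-length feature vector for an audio clip."""
        y, sr = librosa.load(path)
        zcr = librosa.feature.zero_crossing_rate(y)          # Zero Crossing Rate per frame
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel Frequency Cepstral Coefficients
        rms = librosa.feature.rms(y=y)                       # Root Mean Square energy per frame
        # Average each feature over time so every clip yields the same vector length.
        return np.hstack([zcr.mean(axis=1), mfcc.mean(axis=1), rms.mean(axis=1)])

    # Hypothetical lists of clip paths and emotion names.
    paths = ["clip1.wav", "clip2.wav"]
    emotions = ["happy", "sad"]

    X = np.array([extract_features(p) for p in paths])

    # Standardization: zero mean, unit standard deviation per feature.
    X = StandardScaler().fit_transform(X)

    # Label encoding: emotion names to integer class labels.
    y = LabelEncoder().fit_transform(emotions)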
The model built in this study contains two convolutional blocks, each containing three convolutional layers, and the blocks are separated by max pooling layers. The outputs are then passed through a global average pooling layer, into 6 fully connected layers, and then into a classification output layer. To overcome overfitting, we added a dropout layer to remove some of the connections between the layers; by lowering the complexity of the model in this way, we can prevent the model from overfitting, resulting in much better accuracy. The model is trained using Python code together with the PyTorch deep learning framework. The data is split into train and test subsets using scikit-learn's train_test_split with a random 80/20 split. The loss function we settled on using is the categorical cross entropy loss. A sketch of this architecture is shown below.
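A minimal PyTorch sketch of an architecture matching this description follows: two blocks of three convolutional layers separated by max pooling, global average pooling, six fully connected layers with dropout, and a 7-class output. All channel widths and layer sizes are illustrative assumptions, since the thesis does not specify them.

    import torch
    import torch.nn as nn

    class EmotionCNN(nn.Module):
        def __init__(self, n_classes=7):
            super().__init__()

            def block(c_in, c_out):
                # One convolutional block: three convolutional layers with ReLU.
                return nn.Sequential(
                    nn.Conv1d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv1d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv1d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                )

            self.features = nn.Sequential(
                block(1, 32),
                nn.MaxPool1d(2),          # max pooling separates the two blocks
                block(32, 64),
                nn.MaxPool1d(2),
                nn.AdaptiveAvgPool1d(1),  # global average pooling
            )
            self.classifier = nn.Sequential(
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Dropout(0.3),          # dropout layer to reduce overfitting
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, 16), nn.ReLU(),   # six fully connected layers...
                nn.Linear(16, n_classes),       # ...then the classification output
            )

        def forward(self, x):
            # x has shape (batch, 1, n_features)
            x = self.features(x).squeeze(-1)
            return self.classifier(x)

    model = EmotionCNN()
    criterion = nn.CrossEntropyLoss()   # categorical cross entropy loss

    # Smoke test with a dummy batch: 8 clips, 1 channel, 40 features each.
    out = model(torch.randn(8, 1, 40))
    print(out.shape)  # torch.Size([8, 7])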
Results
We applied the categorical cross entropy loss function for optimization and assessed the generated loss curves against training time using the ADAM optimizer. After training and validating the proposed CNN model, we achieved a Mean Squared Error value of 0.6 and an accuracy of 0.67. An F1 score of 78% was also achieved; the F1 score is a machine learning evaluation metric used to assess a model's accuracy. However, with the MLP classifier from scikit-learn we obtained an accuracy of 85%, which is higher still; this is attributed to the size of the dataset for the multi-labeled classification. A sketch of this MLP baseline follows the figure below.
Figure 3: Graphical depiction of the testing and training accuracy for the CNN model.
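As an illustration of the secondary model, scikit-learn's MLPClassifier could be trained on the extracted features as sketched below; the hidden layer sizes and iteration count are assumptions, and stand-in random data replaces the real features so the sketch runs on its own.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, f1_score

    # Stand-in data for illustration; in the study, X holds the extracted audio
    # features and y the seven integer emotion labels.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 15))
    y = rng.integers(0, 7, size=200)

    # Random 80/20 train/test split, as in the methodology.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Hidden layer sizes and iteration count are assumptions.
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)
    mlp.fit(X_train, y_train)

    pred = mlp.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))
    print("macro F1:", f1_score(y_test, pred, average="macro"))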
Future Advancements
In the future, this study would benefit from expansion. We want to expand the number of clips used in the various languages. We could also add more variety in language to account for linguistic structure and dialect nuances. The accuracy can be improved by running multiple experiments to let time and repetition help the model learn more. Lastly, one can compare the multi-label margin loss and multi-label soft margin loss functions and test which yields the best results.
Conclusion
This study built a speech-based emotion recognition model using the TESS dataset and a few data samples that we generated from movies available on the internet. The architecture of the model is composed of linked layers, and we applied Python code together with the PyTorch deep learning framework and tools. The categorical cross entropy loss function was applied for optimization, and the generated optimization curves were assessed against training time with the ADAM optimizer. After training and validating the proposed model, we achieved a Mean Squared Error value of 0.6, which is still high; this is due to the size of the dataset for the multi-label classification.
References
Andersen, R. (2022, August 12). How to help your autistic child with context blindness. Autism
Parenting Magazine. https://www.autismparentingmagazine.com/autism-context-blindness/
Biswal, A. (2022, September 9). Top 10 deep learning algorithms you should know in 2022.
Simplilearn. https://www.simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm
Cherney, K. (2021, September 9). Alexithymia: Causes, symptoms, and treatments. Healthline.
https://www.healthline.com/health/autism/alexithymia#tips-to-cope
Jachyra, P., Rodgers, J., & Cassidy, S. (2022, July 11). Autistic people are six times more likely
to attempt suicide – poor mental health support may be to blame. The Conversation.
times-more-likely-to-attempt-suicide-poor-mental-health-support-may-be-to-blame-180266
Lo, I. (2021, February 6). Alexithymia: Do you know what you feel? Psychology Today. Retrieved
intensity/202102/alexithymia-do-you-know-what-you-feel
MyDisabilityJobs. (2022, August 25). Neurodiversity in the workplace: Statistics: Update 2022.
https://mydisabilityjobs.com/statistics/neurodiversity-in-the-workplace/
NHS. (2019). Prosopagnosia (face blindness). NHS choices. Retrieved September 11, 2022, from
https://www.nhs.uk/conditions/face-blindness/#:~:text=Prosopagnosia%2C%20also%20known%20as%20face,severe%20impact%20on%20everyday%20life
nishi. (2022, September 1). The future of online learning is being shaped by emotional AI after
https://www.inventiva.co.in/trends/the-future-of-online-learning-is-being/
TheExpressWire. (2022). Artificial intelligence emotion recognition market: Insight, manufacturers
analysis, revenue, COVID-19 impact, supply, growth, upcoming demand, regional outlook
till 2029. Digital Journal. Retrieved September 11, 2022, from
https://www.digitaljournal.com/pr/artificial-intelligence-emotion-recognition-market-insight-manufacturers-analysis-revenue-covid-19-impact-supply-growth-upcoming-demand-regional-outlook-till-2029