Jabrenee Hussie Honors Thesis
A THESIS
May 2023
Abstract

Our society is designed in a way that rewards those who are comfortable in social situations and have strong interpersonal skills, while those who are not as strong in this area develop coping mechanisms to better navigate the world. The development of these skills comes from the ability to recognize the emotions of others and act accordingly. The inability to recognize emotions affects around 13% of the population. Doctors recommend different forms of therapy to cope, while researchers recommend understanding one's own physical, emotional, and biological responses in order to understand others. While these are good tools to help one learn about emotional responses, everyone reacts to situations differently, and learning how one internally reacts to a situation will not necessarily help them identify that reaction in others. Speech-based emotion recognition (SER) research aims to provide aid by creating deep learning models that analyze speech to identify the emotions of the speaker. Previous SER studies largely represent English speakers only. In this research, deep learning techniques are used to analyze speech sound signals in different languages to develop a model that predicts emotions from the speech it is given. This study analyzes 5 different languages and 7 basic sentiments. The primary model, which uses Convolutional Neural Networks (CNN), achieves 67% accuracy and a 78% F1 score; a secondary model using a Multilayer Perceptron (MLP) achieves 85% accuracy.
Introduction

The backbone of human life is our relationships with others. A part of the ability to maintain those connections is the ability to recognize the emotions of yourself and the people around you. Our society is designed in a way that those who find themselves comfortable and confident in social situations, with strong interpersonal skills, excel professionally and socially, while those who are not as strong in this area develop masks and coping mechanisms to better navigate the world.
Problem Statement
The inability to recognize emotions generally affects 13% of the population (Lo, 2021). This condition can develop from a host of diagnoses that can arise at any time in someone's life. It has the greatest impact on the social, economic, and mental well-being of those affected, and its effects extend to society more broadly: platonic, professional, and romantic relationships are all affected, because the foundation of maintaining these relationships lies in the ability to interact with, understand, and empathize with others. People whose brains develop or work differently can struggle professionally and socially, and employers suffer as well, because studies show that neurodiverse teams are 30% more productive than neurotypical ones and make fewer errors (MyDisabilityJobs, 2022). In the workforce as well as in social settings, gaps in pay, responsibility, and treatment can already be seen within many marginalized communities. The intersectionality between these communities can create social and economic disparities that put an immense amount of pressure on those affected. Researchers recommend that people living with this condition understand their own responses, such as their heart rate and its fluctuations, as well as journaling the physical and emotional responses that one experiences (Cherney, 2021). While these are good tools to help one learn about emotional responses, everyone reacts to situations differently, and learning how one internally reacts to a situation will not necessarily help them identify that reaction in others. This research aims to bridge the gap for those who live with these circumstances by analyzing speech to identify the emotions of the speaker and creating models that will aid in the ability to recognize emotions through speech.
Significance of Work
Artificial Intelligence
This year, technology news is starting to show a different focus when it comes to artificial intelligence, and there are currently many discussions around AI and emotion recognition. The market for emotion recognition-based AI is projected to grow substantially from now to 2029 (TheExpressWire, 2022). The projected usage lies in consumer and producer relations. For example, companies want to better understand customer behavior to know how to better sell their products. Many industries are branching out into computer vision and speech recognition to make their products more human-like (nishi, 2022). The appeal of emotional AI is that it allows markets to better understand people and provide more useful services (nishi, 2022). It has even been proposed that, in the era of online learning, the technology can aid teachers: they can collect and analyze students' reactions to help alleviate communication perils between teachers and students, which can lead to curriculum evolution that better serves students (nishi, 2022).
Psychology
With many mental conditions and disorders affecting one's ability to recognize emotion, we see a lot of emotion recognition training (Andersen, 2022), which is generally done by studying facial expressions and attaching them to emotions to help alleviate some of the pressures of context blindness (Andersen, 2022). For example, what does it look like when someone is anxious or uninterested? Could you denote that from a simple facial expression? There is a general understanding when it comes to the intersection between facial expressions, body language, and emotions, but what about when you do not have the ability to see those facial expressions? Conditions such as prosopagnosia, also known as face blindness, make other forms of emotion recognition very important (NHS, 2019). Living with any mental disorder, diagnosed or undiagnosed, can be alienating when no one understands what that experience is like and everyone expects a form of normalcy that you just cannot give them. Trying to cope with these expectations can be really damaging to one's mental health. For example, people with autism are six times more likely to attempt death by suicide and seven times more likely to die by suicide in comparison to those who are not autistic (Jachyra et al., 2022). Those disparities should not be taken lightly, and if speech-based emotion recognition research can help in any way, it is worth exploring.
Methodology
In layman's terms, the plan for this study is to use English phrases and their corresponding translations in French, Spanish, Japanese, Mandarin Chinese, and Korean to develop a system of speech-based emotion recognition. This format was chosen after carefully scouring related works and noticing the imbalance of representation. Many of the datasets used in related works are English-only and are biased toward mostly male speakers, so the creation of an alternative dataset helps give the study a more balanced and worldly view. This is an important aspect of the study because the inability to recognize emotions does not only affect English speakers.
Data collection is the first step. Collection consists of capturing clips ranging from 10 to 30 seconds, in all of the languages, by screen recording two Netflix series, Julie and The Phantoms and Heartstopper. The TESS dataset was also used in combination with these recorded clips. After the data is collected, one must classify the clips by emotion. The emotions focused on in this study are calm, happy, sad, angry, fearful, disgust, and surprised, labeled from 1-7 respectively. A Jupyter notebook is utilized for the analysis in this study. The data is composed in a spreadsheet, which is then imported into the notebook, and the dataset that was created is visualized using pie charts, spectrograms, and waveform plots. A minimal sketch of this loading and labeling step is shown below.
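As an illustration, the loading and labeling step might look like the following; the file name, column names, and the use of a CSV export of the spreadsheet are assumptions for illustration, not confirmed details of the study.

    import pandas as pd

    # Hypothetical spreadsheet export with one row per clip: the path to the
    # audio file, the language, and the emotion label as an integer from 1 to 7.
    df = pd.read_csv("clips.csv")  # assumed file name and export format

    # The seven emotions in this study, labeled 1-7 respectively.
    EMOTIONS = {1: "calm", 2: "happy", 3: "sad", 4: "angry",
                5: "fearful", 6: "disgust", 7: "surprised"}

    # Add a readable emotion column for visualization (e.g., pie charts).
    df["emotion"] = df["label"].map(EMOTIONS)
    print(df["emotion"].value_counts())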
Before one starts modeling and doing any kind of analysis, the data has to be prepared. We begin with data augmentation, and then we preprocess the data. Data augmentation is used to create new data samples and widen the threshold for the data we use. Adding these small changes helps account for changes that occur naturally, which helps build a general model. One makes changes to the pitch and length of a clip and injects noise, as in the sketch below.
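A minimal sketch of these three augmentations, using the librosa library, is shown here; the parameter values (noise level, stretch rate, number of pitch steps) are illustrative assumptions rather than the exact values used in the study.

    import numpy as np
    import librosa

    def add_noise(y, noise_level=0.005):
        """Inject random Gaussian noise into the signal."""
        return y + noise_level * np.random.randn(len(y))

    def stretch(y, rate=0.9):
        """Change the length (speed) of the clip without changing pitch."""
        return librosa.effects.time_stretch(y, rate=rate)

    def shift_pitch(y, sr, n_steps=2):
        """Shift the pitch by a number of semitones without changing length."""
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # Load one clip (assumed file name) and create three augmented variants.
    y, sr = librosa.load("clip.wav")
    augmented = [add_noise(y), stretch(y), shift_pitch(y, sr)]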
In data preprocessing, one does standardization, label encoding, and feature extraction. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, to shift the distribution to have a mean of zero and a standard deviation of one; it is used to improve the stability and performance of many learning algorithms. Label encoding converts categorical data into numerical features of a data set; the purpose of encoding in this manner is that machine learning algorithms assume, and require, data to be numeric. Feature extraction helps the algorithms get a better grasp of the important parts of the data that they should pay attention to and can easily consume. In this study we use the most common features for audio analysis, which are Zero Crossing Rate, Mel Frequency Cepstral Coefficients, and Root Mean Square (RMS) energy. A minimal sketch of this preprocessing is shown below.
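The sketch below assumes librosa for feature extraction and scikit-learn for standardization and label encoding; averaging the frame-level features into one fixed-length vector per clip is a common simplification and an assumption here, not a confirmed detail of the study.

    import numpy as np
    import librosa
    from sklearn.preprocessing import StandardScaler, LabelEncoder

    def extract_features(path):
        """Return one fixed-length feature vector for an audio clip."""
        y, sr = librosa.load(path)
        zcr = librosa.feature.zero_crossing_rate(y)          # Zero Crossing Rate per frame
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel Frequency Cepstral Coefficients
        rms = librosa.feature.rms(y=y)                       # Root Mean Square energy per frame
        # Average each feature over time so every clip yields the same vector length.
        return np.hstack([zcr.mean(axis=1), mfcc.mean(axis=1), rms.mean(axis=1)])

    # Hypothetical lists of clip paths and emotion names.
    paths = ["clip1.wav", "clip2.wav"]
    emotions = ["happy", "sad"]

    X = np.array([extract_features(p) for p in paths])

    # Standardization: zero mean, unit standard deviation per feature.
    X = StandardScaler().fit_transform(X)

    # Label encoding: emotion names to integer class labels.
    y = LabelEncoder().fit_transform(emotions)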
The model built in this study contains two convolutional blocks, each containing three convolutional layers, and the blocks are separated by max pooling layers. The outputs are then passed through a global average pooling layer, into 6 fully connected layers, and then into a classification output layer. To overcome overfitting, we added a dropout layer to remove some of the connections between the layers; by lowering the complexity of the model in this way, we can prevent the model from overfitting, resulting in much better accuracy. The model is trained using Python code together with the PyTorch deep learning framework. The data is split into train and test subsets using scikit-learn's train_test_split with a random 80/20 split. The loss function we settled on using is the categorical cross entropy loss. A sketch of this architecture is shown below.
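A minimal PyTorch sketch of an architecture matching this description follows: two blocks of three convolutional layers separated by max pooling, global average pooling, six fully connected layers with dropout, and a 7-class output. All channel widths and layer sizes are illustrative assumptions, since the thesis does not specify them.

    import torch
    import torch.nn as nn

    class EmotionCNN(nn.Module):
        def __init__(self, n_classes=7):
            super().__init__()

            def block(c_in, c_out):
                # One convolutional block: three convolutional layers with ReLU.
                return nn.Sequential(
                    nn.Conv1d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv1d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv1d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                )

            self.features = nn.Sequential(
                block(1, 32),
                nn.MaxPool1d(2),          # max pooling separates the two blocks
                block(32, 64),
                nn.MaxPool1d(2),
                nn.AdaptiveAvgPool1d(1),  # global average pooling
            )
            self.classifier = nn.Sequential(
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Dropout(0.3),          # dropout layer to reduce overfitting
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, 16), nn.ReLU(),   # six fully connected layers...
                nn.Linear(16, n_classes),       # ...then the classification output
            )

        def forward(self, x):
            # x has shape (batch, 1, n_features)
            x = self.features(x).squeeze(-1)
            return self.classifier(x)

    model = EmotionCNN()
    criterion = nn.CrossEntropyLoss()   # categorical cross entropy loss

    # Smoke test with a dummy batch: 8 clips, 1 channel, 40 features each.
    out = model(torch.randn(8, 1, 40))
    print(out.shape)  # torch.Size([8, 7])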
Results
We applied the categorical cross entropy loss function for optimization and assessed the generated loss curves against training time using the ADAM optimizer. After training and validating the proposed CNN model, we achieved a Mean Squared Error value of 0.6 and an accuracy of 0.67. An F1 score of 78% was also achieved; the F1 score is a machine learning evaluation metric used to assess a model's accuracy. However, with the MLP classifier from scikit-learn we obtained an accuracy of 85%, which is higher still; this is attributed to the size of the dataset for the multi-labeled classification. A sketch of this MLP baseline follows the figure below.
Figure 3: Graphical depiction of the testing and training accuracy for the CNN model.
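As an illustration of the secondary model, scikit-learn's MLPClassifier could be trained on the extracted features as sketched below; the hidden layer sizes and iteration count are assumptions, and stand-in random data replaces the real features so the sketch runs on its own.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, f1_score

    # Stand-in data for illustration; in the study, X holds the extracted audio
    # features and y the seven integer emotion labels.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 15))
    y = rng.integers(0, 7, size=200)

    # Random 80/20 train/test split, as in the methodology.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Hidden layer sizes and iteration count are assumptions.
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=42)
    mlp.fit(X_train, y_train)

    pred = mlp.predict(X_test)
    print("accuracy:", accuracy_score(y_test, pred))
    print("macro F1:", f1_score(y_test, pred, average="macro"))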
Future Advancements
In the future, this study would benefit from expansion. We want to expand the number of clips used in the various languages. We could also add more variety in language to account for linguistic structure and dialect nuances. The accuracy can be improved by running multiple experiments to let time and repetition help the model learn more. Lastly, one can compare the multi-label margin loss and multi-label soft margin loss functions and test which yields the best results.
Conclusion
This study built a speech-based emotion recognition model using the TESS dataset and a few data samples that we generated from movies available on the internet. The architecture of the model is composed of linked layers, and we applied Python code together with the PyTorch deep learning framework and tools. The categorical cross entropy loss function was applied for optimization, and the generated optimization curves were assessed against training time with the ADAM optimizer. After training and validating the proposed model, we achieved a Mean Squared Error value of 0.6, which is still high; this is due to the size of the dataset for the multi-label classification.
References
Andersen, R. (2022, August 12). How to help your autistic child with context blindness. Autism
Parenting Magazine. https://www.autismparentingmagazine.com/autism-context-blindness/
Biswal, A. (2022, September 9). Top 10 deep learning algorithms you should know in 2022.
Simplilearn. https://www.simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm
Cherney, K. (2021, September 9). Alexithymia: Causes, symptoms, and treatments. Healthline.
https://www.healthline.com/health/autism/alexithymia#tips-to-cope
Jachyra, P., Rodgers, J., & Cassidy, S. (2022, July 11). Autistic people are six times more likely
to attempt suicide – poor mental health support may be to blame. The Conversation.
times-more-likely-to-attempt-suicide-poor-mental-health-support-may-be-to-blame-180266
Lo, I. (2021, February 6). Alexithymia: Do you know what you feel? Psychology Today. Retrieved
intensity/202102/alexithymia-do-you-know-what-you-feel
MyDisabilityJobs. (2022, August 25). Neurodiversity in the workplace: Statistics: Update 2022.
https://mydisabilityjobs.com/statistics/neurodiversity-in-the-workplace/
NHS. (2019). Prosopagnosia (face blindness). NHS choices. Retrieved September 11, 2022, from
https://www.nhs.uk/conditions/face-blindness/#:~:text=Prosopagnosia%2C%20also%20known%20as%20face,severe%20impact%20on%20everyday%20life
nishi. (2022, September 1). The future of online learning is being shaped by emotional AI after
https://www.inventiva.co.in/trends/the-future-of-online-learning-is-being/
TheExpressWire. (2022). Artificial intelligence emotion recognition market: Insight, manufacturers
analysis, revenue, COVID-19 impact, supply, growth, upcoming demand, regional outlook
till 2029. Digital Journal. Retrieved September 11, 2022, from
https://www.digitaljournal.com/pr/artificial-intelligence-emotion-recognition-market-insight-manufacturers-analysis-revenue-covid-19-impact-supply-growth-upcoming-demand-regional-outlook-till-2029