Volume 7, Issue 4, April – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Detecting Mental Distress through User's Social Media Activity
Isha Raina, B. Indra Thannaya
IGDTUW, Delhi

Abstract:- Users of social networking sites can approach interested friends and express their thoughts, feelings, and sentiments through ideas, photographs, and videos. This opens the door to studying online information for user emotions and feelings in order to gain a better understanding of users' emotions and attitudes when they use these online platforms.

Depression may be dangerous to one's health, particularly if it is recurring and of moderate or severe degree. It can make the individual suffer a lot and perform poorly at work, at school, and at home. Suicide is a possibility when depression is severe; it is one of the leading causes of death among those between 15 and 29 years of age, and around 700,000 people kill themselves each year.

Machine learning algorithms and Natural Language Processing will be employed in the proposed problem statement to detect if a person is going through mental distress. The main aim is to discover the commonality within tweets that can help in identifying whether an individual is on the edge of mental distress, so that there is no delay in reaching out and helping the person who is suffering.

I. INTRODUCTION

Depression, one of the most common mental illnesses, impacts about 300 million individuals throughout the world. Early identification is crucial for prompt action, which can help to prevent the illness from worsening. As per WHO statistics, around 280 million people in the whole world are suffering from mental distress. In many nations, depression is still underdiagnosed and untreated, resulting in negative self-perception and, in the worst-case scenario, suicide [1]. The need to detect mental distress in individuals is pressing. Hence, this project is aimed at detecting if a person is depressed by analyzing their social network posts and tweets.

As the Internet has grown in popularity, people have begun to express their experiences and struggles with mental health illnesses via online forums, microblogs, and tweets. This online activity has prompted many researchers to develop new health-care solutions and approaches for early depression detection systems. They attempted to improve performance by employing several Natural Language Processing (NLP) methodologies and text categorization methods [2].

In the present era, unstructured data from numerous businesses, social media, organizations, banking, and so on includes hidden patterns that, when studied thoroughly, can open new areas of research and development. However, reading all this vital material and coming to a decision is not an easy task. Text mining, opinion mining, text recognition, and so on all come into play here.

Natural Language Processing (NLP)

It is the practice of using software to automatically recognize and manipulate natural language such as speech and text.

The two primary components of NLP are defined and described below:
• Natural Language Generation (NLG): The use of artificial intelligence (AI) programming to generate written or spoken narratives from a data collection is known as natural language generation. NLG incorporates computational linguistics, NLP, and NLU, as well as human-to-machine and machine-to-human interaction.
• Natural Language Understanding (NLU): It primarily entails the following two major tasks:
  • Mapping the provided natural language input to meaningful representations.
  • Recognizing different patterns in natural language.

NLP Pipeline

The aim is to break the problem down into small chunks and then use machine learning to address each one individually. Intricate tasks can then be performed by chaining together several machine learning models that feed into each other.

The steps to build an NLP pipeline are:
• Sentence Segmentation: The first stage in creating a natural language processing pipeline is breaking up the content into discrete sentences.
• Word Tokenization: After the text has been split into sentences, each sentence is broken down into words so that the semantic meaning of each word can be defined and understood independently.
• Parts of Speech Predictions for Each Token: This step finds the part of speech of each word, now that the words have been converted into tokens.
• Text Lemmatization: The major goal of this stage is to figure out the basic form of each word, so that it can be recognized whether different sentences are talking about the same entity. This technique, determining the most fundamental form or lemma of each word in the sentence, is known as lemmatization.
• Identifying Stop Words: Before undertaking any statistical analysis, filtering out terms called stop words is important. In

IJISRT22APR1384 www.ijisrt.com 950


this step, stop words are identified and eliminated: we remove all the stop words present in our data corpus.
• Dependency Parsing: This stage determines how the words in a sentence are related to one another. The aim is to create a tree that gives each word in the text a single parent word. The main verb in the sentence will be the tree's root.
• Named Entity Recognition (NER): The goal of Named Entity Recognition is to identify the nouns in the text and assign them to the real-world concepts they reflect.

Machine learning (ML) is the study of computer algorithms that can learn and improve on their own through experience and data. It is a component of artificial intelligence. Machine learning algorithms build a model from training data in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide range of applications, such as medicine, email filtering, speech recognition, and computer vision, where conventional algorithms are difficult or infeasible to construct.

The algorithms used in this project are Logistic Regression, Support Vector Machine (SVM), and Random Forest.

Logistic Regression: This is a linear classification technique used to estimate the probability of a binary outcome based on one or more predictors [3][4]. In general, logistic regression is the method of modelling the probability of a discrete outcome given an input variable. The most common logistic regression models have a binary outcome, which might be true or false, yes or no, and so forth. Multinomial logistic regression can be used to model situations with more than two distinct outcomes. In classification tasks, logistic regression is a useful analytical method for assessing which category a new instance fits best. Because many problems, such as threat detection in cyber security, are classification problems, logistic regression is a valuable analytic tool.

Support Vector Machine (SVM): Support-vector machines are supervised learning models that analyse data for classification and regression analysis. The SVM model represents examples as points in a high-dimensional space, with the points of the various categories separated by as wide a gap as possible. New examples are then mapped into the same space and classified according to which side of the gap they fall on [5]. By implicitly mapping their inputs into a high-dimensional vector space, the kernel technique allows SVMs to perform non-linear classification successfully. The purpose of the support vector technique is to find a hyperplane that separates the data points in an N-dimensional space (N = the number of features).

Random Forest: Random Forest (RF) is a set of decision tree classifiers trained using the bagging approach, in which a number of different learning models are combined to improve the overall output [6].

Implementation steps are given below:
• Data pre-processing
• Fitting the Random Forest algorithm to the training set
• Predicting the test result
• Testing the accuracy of the result obtained
• Visualizing the test set result

II. BACKGROUND LITERATURE

Improved feature selection and combination help in improving classifier performance and accuracy. Raza Ul Mustafa, Noman Ashraf, Fahad Shabbir Ahmed, Javed Ferzund, Basit Shahzad, and Alexander Gelbukh (2020), in their paper "A Multiclass Depression Detection in Social Media Based on Sentiment Analysis" [7], achieved 91% accuracy using a Neural Network, SVM, RF, and 1D Convolutional Neural Networks. In a similar fashion, Michael M. Tadesse, Hongfei Lin, Bo Xu, and Liang Yang, in their paper "Detection of Depression-Related Posts in Reddit Social Media Forum" (2019) [2], obtained an accuracy of 80% using Random Forest, SVM, Decision Tree, Logistic Regression, Adaptive Boosting, and Multilayer Perceptron, and stated that the model's accuracy is excellent. Machine learning and deep learning algorithms can be used to improve the model in further study. Akshi Kumar, Aditi Sharma, and Anshika Arora found a way to improve the models and get a better result using Boosting, Random Forest, and Multinomial Naive Bayes [9]; their research was implemented on Twitter data. The algorithm in "Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution" [10] provided depression behavior discovery; the Multimodal Depressive Dictionary Learning (MDL) method achieved the best performance, with 85% in F1-measure, with NB and MSNL algorithms (2017).

K-Nearest Neighbours, Decision Tree, SVM, and Ensemble algorithms, when put together, prove to be sufficient and efficient in the detection of depression, with accuracy between 60 and 80% [11], as seen in the paper "Depression detection from social network data using machine learning techniques", where Facebook data was used.

On the other hand, Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz (2013) used PCA and a Support Vector Machine (SVM) classifier and concluded that the findings and methods used in their research are useful in developing tools for identifying the onset of major depression, for use by healthcare agencies [12].

Using a Convolutional Neural Network, the accuracy was found to be 82% on Twitter data in "Multimodal mental health analysis in social media" [13]. Deep learning and neural networks were the major components of this method. Sharath Chandra Guntuku, David B Yaden, Margaret L Kern, Lyle H Ungar, and Johannes C Eichstaedt [14] compared the results with algorithms like SVM
Margaret L Kern, Lyle H Ungar, and Johannes C
Eichstaedt[14] compared the results with algorithms like SVM

and Linear Regression and obtained an accuracy of ~70% in the paper "Detecting depression and mental illness on social media: an integrative review".

Hoyun Song, Jinseon You, Jin-Woo Chung, and Jong C. Park [15] used a Feature Attention Network (FAN), Multilayer Perceptron (MLP), GloVe, GRU (one of the Recurrent Neural Network (RNN) variants), L2 regularization, the Adam optimizer, and Convolutional Neural Networks (CNN-E, CNN-R). They showed that FAN considers only four features, which are not sufficient on their own to detect depression. It was observed that FAN outperforms all the models except the CNN-R model and shows a similar F1-score to the baseline methodologies. In 2020, Zhenpeng Chen, Yanbin Cao, and Huihan Yao developed the DeepMoji and SEntiMoji models [16]. SEntiMoji was beneficial for tasks that mainly depend on emotion identification. Because the method of constructing datasets can differ, the main objective behind their research was that performance should be analysed rationally.

Stankevich et al. [17] used MyStem, UDPipe, the Linis-Crowd sentiment dictionary, Random Forest, and Support Vector Machine; their SVM+PM-r model achieved a 75.1% ROC-AUC score on a Vkontakte dataset. They discovered that the n-gram and tf-idf based features did not perform as expected over the dataset.

In [18], Random Forest, Logistic Regression, Naive Bayes, CNN, GloVe W+N, and other algorithms were applied to Reddit data with an F1 score of 0.51, but this did not turn out to be a suitable metric for the task.

Kali Cornn, in her research paper "Identifying Depression on Social Media" [19], used Logistic Regression, Support Vector Machine, a BERT-based model, a character-based CNN model without embeddings, and a character-based CNN with pre-trained word embeddings, obtaining a BERT accuracy of 85.7% and a CNN accuracy of 92.5%. The use of word embeddings proved to be a disadvantage. The major issue with the CNN model is the large increase in training time.

Random Forest with two threshold functions and two independent RF classifiers was used in "Early detection of depression: social network analysis and random forest techniques" [20]. The authors concluded that the time-based approach is effective and that different model combinations can be compared and studied in future for better results and performance. Again, in "Study of depression analysis using machine learning techniques" by Devakunchari Ramalingam, Vaibhav Sharma, and Priyanka Zar [21], SVM and RF were applied to Weibo and Twitter datasets, giving an accuracy of 82%, but the lack of a perfectly accurate model was a big disadvantage.

Hatoon S. AlSagri and Mourad Ykhlef [22] used SVM, Naive Bayes, and Decision Tree in "Machine learning-based approach for depression detection in twitter using content and activity features", where they concluded that the Decision Tree is more comprehensive and evaluative but failed to detect depression with good accuracy. The best results were shown by SVM, with a precision of 73.6% and an accuracy of 77.5%.

Some of the challenges with current models:
• There is a need to find more features that relate to human behavior and help in the detection of depression.
• The n-gram and tf-idf based features did not perform as expected over the dataset.
• There are several improvements to be made for better optimization.
• The use of word embeddings proved to be a disadvantage, and the main issue with the CNN model is the large increase in training time.
• Fine-grained emotion analysis can be done for the purpose of anxiety detection.
• There is a requirement to work on the ethical aspects and terms in order to extend this form of study (i.e. depression detection).
• There is a need to build a smart AI system that can analyze the symptoms from tweets accurately. The lack of a perfectly accurate model is a big disadvantage.

III. DATASET DESCRIPTION

The dataset used is the Sentiment140 [23] dataset with 1.6 million tweets. The data consists of 4 columns, namely Target, User_Name, ID, and Tweet_Text. We have combined part of Sentiment140 with scraped depressive tweets to form a new dataset.

Column     | Description                                     | Data Type
Target     | Polarity of tweet (4 - depressed, 0 - positive) | Int
User_Name  | The user that tweeted (armotley)                | Int
ID         | The id of the tweet (2087)                      | Object
Tweet_Text | The text of the tweet (about to file taxes)     | Object
Table 1: Dataset description with column definitions

According to the creators of the dataset, "Our method was unusual in that our training data was generated automatically rather than by people manually annotating tweets. We took the view that any tweet containing positive emoticons, such as :), was positive, and any tweet with negative emoticons, such as :(, was negative. We gathered these tweets using the Twitter Search API and a keyword search."

IV. METHODOLOGY

The proposed method is based on machine learning algorithms such as Support Vector Machine, Random Forest, and Logistic Regression; for text preprocessing we will be using NLP.

The methodology follows these steps:
• Data Collection: In this step the data is collected based on the problem statement. The result of this phase is often a data representation that will be used for training.

• Data Preparation: This is a crucial part of the process, and people generally spend up to 80% of their time here. Having a clean data collection improves the accuracy of the model in the long run. The obtained data is cleaned up by deleting any redundant, undesired, or null variables that might impair the accuracy of the model or algorithm.
• Model Selection: In ML, there are a variety of algorithms to choose from. The need is to figure out which algorithm is the best of all the options. In this project SVM, Logistic Regression, and Random Forest will be used.
• Model Training: In this step the data set is linked to an algorithm, which learns and develops predictions using advanced mathematical modelling. These algorithms are usually classified into one of three groups:
  • Binary - Divide into two groups.
  • Classification - Sort into a variety of categories.
  • Regression - Predict a numeric value.
• Model Evaluation: Model evaluation means computing accuracy, precision, and recall. These are the three basic measures used to evaluate a classification model.
• Parameter Tuning: Tuning is the process of enhancing a model's performance while avoiding overfitting or excessive variance. The model is fine-tuned by altering the learning rates or the values of the test and train datasets. In machine learning this is done by picking appropriate "hyperparameters."
• Predictions: Data is used in machine learning to answer problems. Inference, or prediction, is the point at which we get to answer certain questions. This is the culmination of all of the efforts, and it is here that the benefit of machine learning is apparent.

V. IMPLEMENTATION STEPS

• Deleting columns that are not important
• Data Cleaning: Removing any links, Twitter handles (usernames), punctuation, and special characters present in the data
• Text Processing: Removing stopwords by importing the nltk library
• Text Tokenization

VI. RESULT

The accuracy of the model with different algorithms is given in the table below:

Metrics   | Logistic Regression | SVM        | Random Forest
Accuracy  | 0.72373279          | 0.72630636 | 0.70706053
Precision | 0.72401699          | 0.72946045 | 0.70738364
Recall    | 0.72358226          | 0.72586216 | 0.70718665
F1 Score  | 0.72354649          | 0.72509602 | 0.70701695

VII. CONCLUSION AND FUTURE WORK

If we rely entirely on the findings of our model and algorithm, we may conclude that there is a clear need to realize that individuals are going through a variety of different situations, which can be observed in their statements.

Future work in this area might include developing a model for sending alerts to individuals, as well as to the government and medical authorities, in order to reduce depression instances in the country and worldwide.

REFERENCES

[1.] M. J. Friedrich, "Depression is the leading cause of disability around the world," JAMA, vol. 317, no. 15, p. 1517, Apr. 2017.
[2.] Tadesse, Michael M., et al. "Detection of depression-related posts in reddit social media forum." IEEE Access 7 (2019): 44883-44893.
[3.] Gortmaker, Steven L. "Theory and methods--Applied Logistic Regression by David W. Hosmer Jr and Stanley Lemeshow." Contemporary sociology 23.1 (1994): 159.
[4.] Agresti, Alan. An introduction to categorical data analysis. John Wiley & Sons, 2018.
[5.] Noble, William S. "What is a support vector machine?." Nature biotechnology 24.12 (2006): 1565-1567.
[6.] Xu, Baoxun, Yunming Ye, and Lei Nie. "An improved random forest classifier for image classification." 2012 IEEE International Conference on Information and Automation. IEEE, 2012.
[7.] Mustafa, Raza Ul, et al. "A multiclass depression detection in social media based on sentiment analysis." Proceedings of the 17th IEEE International Conference on Information Technology-New Generations. Springer, 2020.
[8.] Stephen, Jini Jojo, and P. Prabu. "Detecting the magnitude of depression in Twitter users using sentiment analysis." International Journal of Electrical and Computer Engineering 9.4 (2019): 3247.
[9.] Kumar, Akshi, Aditi Sharma, and Anshika Arora. "Anxious depression prediction in real-time social data." International Conference on Advances in Engineering Science Management & Technology (ICAESMT)-2019, Uttaranchal University, Dehradun, India. 2019.
[10.] Shen, Guangyao, et al. "Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution." IJCAI. 2017.
[11.] Islam, Md Rafiqul, et al. "Depression detection from social network data using machine learning techniques." Health information science and systems 6.1 (2018): 1-12.
[12.] De Choudhury, Munmun, et al. "Predicting depression via social media." Seventh international AAAI conference on weblogs and social media. 2013.
[13.] Yazdavar, Amir Hossein, et al. "Multimodal mental health analysis in social media." Plos one 15.4 (2020): e0226248.
[14.] Guntuku, Sharath Chandra, et al. "Detecting depression and mental illness on social media: an integrative review." Current Opinion in Behavioral Sciences 18 (2017): 43-49.
[15.] Song, Hoyun, et al. "Feature attention network: interpretable depression detection from social media."

Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation. 2018.
[16.] Chen, Zhenpeng, et al. "Emoji-powered sentiment and emotion detection from software developers' communication data." ACM Transactions on Software Engineering and Methodology (TOSEM) 30.2 (2021): 1-48.
[17.] Stankevich, Maxim, et al. "Depression detection from social media texts." Elizarov, A., Novikov, B., Stupnikov, S. (eds.) Data Analytics and Management in Data Intensive Domains: XXI International Conference DAMDID/RCDL. 2019.
[18.] Trotzek, Marcel, Sven Koitka, and Christoph M. Friedrich. "Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences." IEEE Transactions on Knowledge and Data Engineering 32.3 (2018): 588-601.
[19.] Cornn, Kali. "Identifying depression on social media." Department of Statistics, Stanford University, Stanford, CA 94305 (2020).
[20.] Cacheda, Fidel, et al. "Early detection of depression: social network analysis and random forest techniques." Journal of medical Internet research 21.6 (2019): e12554.
[21.] Ramalingam, Devakunchari, Vaibhav Sharma, and Priyanka Zar. "Study of depression analysis using machine learning techniques." Int. J. Innov. Technol. Explor. Eng 8.7C2 (2019): 187-191.
[22.] AlSagri, Hatoon S., and Mourad Ykhlef. "Machine learning-based approach for depression detection in twitter using content and activity features." IEICE Transactions on Information and Systems 103.8 (2020): 1825-1832.
[23.] https://www.kaggle.com/kazanova/sentiment140 - Dataset
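
The data cleaning, stop-word removal, and tokenization steps listed under V. IMPLEMENTATION STEPS can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the regular expressions, the lower-casing step, and the short stop-word set are all assumptions made for the example (the paper takes its stop words from the nltk library).

```python
import re

# Illustrative subset only (assumption); the paper uses nltk's stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of",
              "and", "i", "my", "so", "at", "in", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
HANDLE_RE = re.compile(r"@\w+")                 # Twitter handles
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # punctuation, digits, special characters


def clean_tweet(text):
    """Lower-case a tweet and strip links, handles, punctuation and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def tokenize(cleaned_text):
    """Split cleaned text on whitespace and drop stop words."""
    return [tok for tok in cleaned_text.split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    raw = "@armotley I am so tired of everything... http://t.co/abc #cantsleep"
    print(tokenize(clean_tweet(raw)))  # ['tired', 'everything', 'cantsleep']
```

Links and handles are removed before punctuation so that their fragments do not survive as stray tokens; a real run would likely swap the hard-coded set for the full nltk list.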

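
The evaluation measures named under Model Evaluation and reported in VI. RESULT (accuracy, precision, recall, F1 score) can be computed from true and predicted labels as in the sketch below. It assumes the label convention of Table 1 (4 = depressed, 0 = positive) and treats "depressed" as the positive class; the function names are hypothetical, and the paper does not state whether its reported scores are per-class or averaged, so this is one plausible reading rather than the authors' procedure.

```python
def confusion_counts(y_true, y_pred, positive=4):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=4):
    """Return accuracy, precision, recall and F1 for the given positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    y_true = [4, 4, 4, 0, 0, 0, 0, 4]
    y_pred = [4, 4, 0, 0, 0, 4, 0, 4]
    print(classification_metrics(y_true, y_pred))
```

Guarding each division keeps the function defined when a classifier never predicts the positive class, which can happen on small or imbalanced samples.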