Suicidal Ideation Detection On Social Media
state, including the subjective nature of ideation and the stigma surrounding
mental health. Next, the abstract explores different techniques used for suicidal
processing, and sentiment analysis. It discusses the use of various data sources
privacy concerns and the potential for false positives or negatives. It emphasizes
the importance of balancing the benefits of early intervention with the need to
including improved mental health care delivery, targeted interventions, and
prevention strategies.
TABLE OF CONTENTS
5.1 INTRODUCTION
5.2 HARDWARE REQUIREMENTS
5.3 SOFTWARE REQUIREMENTS
6. SYSTEM IMPLEMENTATION
6.1 MODULE DESCRIPTION
6.1.1 DATA COLLECTION
6.1.2 DATA PREPROCESSING
6.1.3 FEATURE EXTRACTION
6.1.4 SPLITTING THE DATASET
6.1.5 MODEL TRAINING
6.1.6 MODEL EVALUATION
6.1.7 DEPLOYMENT AND MONITORING
6.2 ALGORITHMS USED
6.2.1 SUPPORT VECTOR MACHINE
6.3 LIBRARIES USED
7. PROJECT OUTPUT
8. CONCLUSION AND FUTURE WORK
ANNEXURE
REFERENCES
LIST OF FIGURES
3 4.3.3 ER DIAGRAM 16
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
reporting can be time-consuming, resource-intensive, and may not capture real-
time changes in ideation.
In recent years, researchers have explored the use of technology, data analytics,
social media posts, online forums, and electronic health records, for identifying
these techniques can potentially detect patterns and trends that may go
socio-economic status, along with psychosocial factors and clinical history, can
accuracy.
However, ethical considerations are crucial in the development and deployment
intervention and follow-up are essential to protect the well-being and rights of
individuals.
leveraging technology and data analytics. These methods offer the potential for
CHAPTER 2
LITERATURE SURVEY
textual features, the CNN–BiLSTM model outperformed the XGBoost
model, achieving 95% suicidal ideation detection accuracy, compared with
the latter’s 91.5% accuracy. Conversely, when using LIWC features,
XGBoost showed better performance than CNN–BiLSTM.
Title: "Predicting Suicidal Ideation in Online Forums"
Online social networking platforms such as Twitter, Reddit and Facebook are
becoming a new way for people to express themselves freely without
worrying about social stigma. This paper presents a methodology and
experiments that use social media as a tool for analysing suicidal ideation,
with the aim of helping to prevent suicide. The data is collected from
Twitter, one of the popular Social Networking Sites (SNS). The tweets are then
pre-processed and annotated manually. Finally, various machine learning and
ensemble methods are used to automatically distinguish suicidal from
non-suicidal tweets. This experimental study helps researchers understand how
people use SNS to express distress-related feelings and emotions.
The study further confirmed that it is possible to analyse and differentiate
these tweets using human coding and then replicate that accuracy with machine
classification. However, the predictive power for detecting genuine
suicidality is not yet confirmed, and the study does not directly communicate
or intervene with people exhibiting suicidal behaviour.
Suicidal ideation is one of the most severe mental health issues faced by
people all over the world. Various risk factors can lead to suicide. The most
common and critical among them are depression, anxiety, social isolation and
hopelessness. Early detection of these risk factors can help prevent or
reduce suicidal behaviour.
CHAPTER 3
EXISTING SYSTEM
Imbalanced Data and Rare Events: Suicidal ideation is a rare event in
social media data, which often leads to imbalanced datasets. Naive Bayes
classifiers can struggle with imbalanced data because they tend to bias
predictions towards the majority class. Special techniques such as
oversampling, undersampling, or evaluation metrics that account for the
imbalance are needed to address this issue.
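As an illustration of the oversampling idea mentioned above, the following is a minimal sketch of random oversampling in plain Python. The function name `random_oversample`, the toy posts, and the label encoding (1 = suicidal ideation, 0 = not) are assumptions made for this example and are not taken from the project.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) only for under-represented classes.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data: 1 = suicidal ideation, 0 = not.
texts = ["post a", "post b", "post c", "post d", "post e"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Oversampling should normally be applied to the training split only, so that duplicated posts do not leak into the data used for evaluation.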
3.3 PROPOSED SYSTEM
The proposed system for detecting suicidal ideation in social media posts
utilizes Support Vector Machines (SVM) as its core algorithm. This system
begins with the collection of a diverse dataset containing social media posts,
each labeled to distinguish between those indicating suicidal ideation and those
that do not. Following data collection, preprocessing techniques are applied to
cleanse the text data, including noise removal, tokenization, stemming or
lemmatization, and elimination of stopwords. Next, feature extraction methods
such as TF-IDF or word embeddings are employed to represent each post as a
numerical vector, capturing semantic relationships and word importance. The
SVM algorithm is then trained on this labeled dataset, aiming to find the
optimal hyperplane that separates suicidal and non-suicidal content in the
feature space. Model evaluation is conducted using metrics like accuracy,
precision, recall, and F1-score, ensuring robustness and generalization.
Subsequently, the trained SVM classifier is applied to new, unseen social media
posts to predict the presence of suicidal ideation, enabling appropriate
interventions such as providing mental health resources or alerting moderators.
Continuous monitoring and refinement of the system's performance are
emphasized, along with ethical considerations regarding user privacy,
algorithmic bias, and human oversight throughout the process. Through this
approach, the proposed system aims to contribute to early intervention and
support for individuals expressing signs of distress on social media platforms.
High Accuracy: SVMs are known for their ability to handle high-
dimensional data and find complex decision boundaries. They can
effectively separate data points into different classes, leading to high
classification accuracy.
Versatility: SVMs can handle both linear and non-linear data through the
use of different kernel functions. This flexibility allows them to capture
intricate relationships in textual data, making them suitable for detecting
nuanced expressions of suicidal ideation.
Effective with Small Datasets: SVMs can perform well even with small
training datasets, making them suitable for tasks where labeled data may
be limited, such as detecting suicidal ideation.
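The stages of the proposed system (vectorization, SVM training, evaluation and prediction on unseen posts) could be wired together roughly as follows. This is a sketch only: it assumes scikit-learn is available, uses a linear SVM (`LinearSVC`) with TF-IDF features, and the eight toy posts and their labels are invented for illustration rather than drawn from the project's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy posts; 1 = suicidal ideation, 0 = not.
posts = [
    "i feel hopeless and cannot go on",
    "nobody would miss me if i was gone",
    "i see no reason to keep living",
    "everything feels pointless and i want it to end",
    "had a great day at the beach with friends",
    "looking forward to the concert this weekend",
    "just finished a good book and loved it",
    "cooking dinner for my family tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF features feeding a linear SVM; class_weight="balanced"
# reweights classes inversely to their frequency.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(class_weight="balanced"),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))
```

With so few examples the reported scores are meaningless; the sketch only shows how the stages connect. A non-linear kernel could be tried by swapping `LinearSVC` for `sklearn.svm.SVC`.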
CHAPTER 4
SYSTEM ARCHITECTURE
SVM Algorithm
4.2 DATA FLOW DIAGRAM
LEVEL 0 DFD
LEVEL 1 DFD
LEVEL 2 DFD
4.3.1 CLASS DIAGRAM
Fig 4.3.2.1 Use case Diagram
4.3.3 ER DIAGRAM
Fig 4.3.3.1 ER Diagram
Fig 4.3.4.1 Sequence Diagram
and communication patterns between objects, providing a dynamic view of how
objects collaborate to achieve a particular functionality.
4.3.6 ACTIVITY DIAGRAM
An activity diagram is a type of UML diagram that models the flow of
activities, actions, and decisions within a system or process. It provides a visual
representation of the workflow or behavior of a system, illustrating the sequence
of actions and the conditions or decisions that control the flow.
CHAPTER 5
SYSTEM REQUIREMENTS
5.1 INTRODUCTION
CHAPTER 6
SYSTEM IMPLEMENTATION
● Stop Word Removal: Stop words are words that do not contribute to
the meaning of a sentence, such as "a", "an", "the", "and", and "but".
Removing stop words can reduce the size of the vocabulary and
improve the efficiency of the model. However, it is important to note
that stop words can sometimes carry important information, and their
removal may not always be appropriate.
suffixes of words, while lemmatization involves converting words to their
base form based on their part of speech.
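The stop word removal step described above can be sketched in a few lines of Python. The stop list here contains only the five example words from the text, and the function name `remove_stop_words` is an assumption for illustration; a real list, such as the English list shipped with NLTK, is much longer.

```python
# Illustrative stop list built from the examples in the text.
STOP_WORDS = {"a", "an", "the", "and", "but"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "I am tired and I do not see the point".split()
filtered = remove_stop_words(tokens)
print(filtered)
# prints ['I', 'am', 'tired', 'I', 'do', 'not', 'see', 'point']
```

The caveat about losing information applies directly here: NLTK's English stop list also contains negations such as "not" and "no", and dropping them from a post like the one above would change its meaning.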
measures the occurrence frequency of a word in a document, while
IDF penalizes common words and rewards rare words. The TF-IDF
representation helps to capture the relevance of words in a document
collection.
Word Embeddings: Word embeddings are dense vector
representations of words that capture semantic and syntactic
relationships. Pre-trained word embeddings such as Word2Vec,
GloVe, or FastText can be used to obtain word vectors. These
embeddings encode semantic information, allowing models to capture
the meaning and context of words.
Character-level Features: In some cases, character-level features can
be used alongside word-level features. For example, character n-grams
or character-level embeddings can capture subword information,
useful for morphologically rich languages or out-of-vocabulary words.
Named Entity Recognition (NER) Features: NER features identify and
classify named entities such as names, organizations, locations, and
dates in the text. These features can be used to provide additional
information about the entities present in the text.
Part-of-Speech (POS) Tags: POS tags represent the grammatical
category of words in a sentence. Including POS tags as features can
provide information about the syntactic structure of the text, which
can be useful in various NLP tasks such as parsing or sentiment
analysis.
Dependency Parsing: Dependency parsing represents the grammatical
relationships between words in a sentence using a parse tree.
Extracting dependency-based features can capture the syntactic
structure and relationships between words.
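Of the feature types listed above, TF-IDF is the simplest to show by hand. The sketch below computes one common textbook variant, tf = count / document length and idf = log(N / df), in plain Python. The function name `tf_idf` and the three toy documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "i feel so alone".split(),
    "i feel great today".split(),
    "i went running today".split(),
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document ("i") receives a weight of zero, while a term unique to one document ("alone") receives the highest weight in that document, which is the penalize-common, reward-rare behaviour described above.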
Once satisfied with the model's performance, deploy it into production.
Continuously monitor the model's performance over time and retrain it
periodically with updated data to maintain its effectiveness.
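One way to make the monitoring step concrete is to score the deployed model on each freshly labelled batch and flag it for retraining when recall on the suicidal class drops. The sketch below is an assumption about how this might be done, not the project's procedure: the helper names, the 0.8 recall threshold and the toy labels are all hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def needs_retraining(y_true, y_pred, min_recall=0.8):
    """Flag the model when recall on the positive class falls below a threshold."""
    _, recall, _ = precision_recall_f1(y_true, y_pred)
    return recall < min_recall


# Hypothetical labelled batch collected after deployment (1 = suicidal).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1, needs_retraining(y_true, y_pred))
```

Recall is used as the trigger here because a missed post (false negative) is usually the costlier error in this setting; a different metric or threshold may suit other deployments.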
6.2 ALGORITHMS USED
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we can
easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.
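To make the idea of a separating hyperplane concrete, the following sketch trains a linear SVM on six two-dimensional points using full-batch subgradient descent on the regularized hinge loss. It is an illustrative toy, not the solver used by scikit-learn; the function names, hyperparameters and data points (chosen symmetric about the origin for simplicity) are assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b by subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))), with y in {-1, +1}."""
    dim = len(points[0])
    n = len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy data: two linearly separable clusters.
points = [(-1.0, -1.0), (-1.5, -0.5), (-2.0, -1.5),
          (1.0, 1.0), (1.5, 0.5), (2.0, 1.5)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
predicted = [predict(w, b, p) for p in points]
print(w, b, predicted)
```

Only points whose margin is below 1 contribute to the update, which is why the final boundary is determined by the points closest to it, the support vectors.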
Fig. 6.2.1.1 SVM Diagram
EXAMPLE
SVM can be understood through a simple example. Suppose we see a strange cat
that also has some features of a dog, and we want a model that can accurately
identify whether it is a cat or a dog. Such a model can be created using the
SVM algorithm. We first train the model with many images of cats and dogs so
that it learns the distinguishing features of each, and then we test it on the
strange animal. The SVM draws a decision boundary between the two classes (cat
and dog) based on the extreme cases of each class, which are called support
vectors. On the basis of these support vectors, it classifies the animal as a
cat. Consider the diagram below.
Fig 6.2.1.2 Example Diagram
PANDAS
NLTK
o Stemming and Lemmatization: NLTK supports stemming,
which reduces words to their base or root form by removing
suffixes. It also offers lemmatization, a similar process that
reduces words to their canonical form (lemma) based on their
dictionary definitions. Both techniques are used to normalize
text and improve the performance of text analysis tasks.
o Named Entity Recognition (NER): NLTK provides tools
for identifying and classifying named entities (e.g., persons,
organizations, locations) in text. NER is essential for
extracting structured information from unstructured text and
is widely used in information extraction, question answering,
and entity linking applications.
o WordNet: NLTK provides an interface to WordNet, a lexical
database of English words and their semantic relationships.
WordNet is widely used in natural language processing for
tasks like synonymy detection, semantic similarity
measurement, and word sense disambiguation.
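As a small illustration of the stemming support described above, the sketch below uses NLTK's `PorterStemmer`, which needs no corpus download. The word list is an assumption chosen for this example. Lemmatization with `WordNetLemmatizer` works similarly but requires the WordNet data to be downloaded first, so it is not shown here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; each is reduced to its stem by suffix stripping.
words = ["running", "cats", "feelings", "hopelessness"]
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(word, "->", stem)
```

Stems are not always dictionary words, which is one reason lemmatization is sometimes preferred when the output must stay readable.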
SKLEARN
CHAPTER 7
PROJECT OUTPUTS
CHAPTER 8
CONCLUSION
This project uses Support Vector Machines (SVM) for the detection
of suicidal ideation, an approach that combines robustness, adaptability,
and interpretability. SVMs cope well with high-dimensional feature spaces,
a crucial asset when analyzing textual data, where features such as words
or n-grams are numerous. Their capacity to model complex relationships
between these features and indicators of suicidal ideation helps capture the
underlying data patterns. Moreover, the flexibility of SVMs, particularly in
the choice of kernel function, allows the model to capture nonlinear
associations. This adaptability can improve predictive accuracy across the
diverse ways suicidal ideation is expressed in different linguistic
contexts. With a linear kernel, SVMs also offer interpretable decision
boundaries: by examining the support vectors and the learned feature
weights, researchers and clinicians can identify the most influential
features, aiding both model refinement and clinical interpretation.
Furthermore, SVMs tend to generalize well, which is crucial for reliable
detection across diverse populations and real-world scenarios.
FUTURE WORK
ANNEXURE 1
SOURCE CODE
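import re  # imported in the original listing; its use did not survive

import joblib
import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# NLTK data needed by the tokenizer, stop word list and lemmatizer.
# (Assumption: the original listing does not show any download calls.)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def preprocess_text(text):
    """Lowercase, tokenize, drop stop words and lemmatize one post."""
    text = text.lower()
    tokens = word_tokenize(text)
    # Assumed reconstruction: the filtering lines are missing from the listing.
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    preprocessed_text = " ".join(tokens)
    return preprocessed_text


# Load the trained SVM model and the fitted TF-IDF vectorizer.
model_filename = 'svm_model.pkl'
vectorizer_filename = 'vectorizer.pkl'
loaded_model = joblib.load(model_filename)
loaded_vectorizer = joblib.load(vectorizer_filename)

# Read one post per line, skipping blank lines.
with open("newfile.txt", "r") as f:
    input_text = [line.strip() for line in f if line.strip()]
print(input_text)

# Preprocess every post, vectorize, and predict.
processed_input = [preprocess_text(text) for text in input_text]
preprocessed_input_tfidf = loaded_vectorizer.transform(processed_input)
prediction = loaded_model.predict(preprocessed_input_tfidf)
print(prediction)

# Split the original posts by predicted label (1 = suicidal).
suicidal = []
Non_suicidal = []
for text, label in zip(input_text, prediction):
    if label == 1:
        suicidal.append(text)
    else:
        Non_suicidal.append(text)
print(suicidal)
print(Non_suicidal)

# Assumed plot: the original plotting call was lost; a bar chart of counts.
plt.bar(['Suicidal', 'Non-suicidal'], [len(suicidal), len(Non_suicidal)])
plt.title('Prediction Results')
plt.show()

# Save each group to its own CSV file.
df = pd.DataFrame(suicidal)
df1 = pd.DataFrame(Non_suicidal)
df.to_csv('suicidal.csv', index=False)
df1.to_csv('Non_suicidal.csv', index=False)

df = pd.read_csv("Non_suicidal.csv")
print(df)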
REFERENCES
1. O'Dea, B., Wan, S., Batterham, P. J., Calear, A. L., Paris, C., &
Christensen, H. (2015). Detecting suicidality on Twitter. Internet
Interventions, 2(2), 183-188.
2. De Choudhury, M., Kiciman, E., Dredze, M., Coppersmith, G., &
Kumar, M. (2016). Discovering shifts to suicidal ideation from mental
health content in social media. Proceedings of the 2016 CHI Conference
on Human Factors in Computing Systems, 2098-2110.
3. Cheng, Q., Li, T. M., Kwok, C. L., Zhu, T., Yip, P. S., & Li, S. (2017).
Predicting suicidal behaviors using the Chinese microblog suicidal
expression detection model: Model development and validation. Journal
of Medical Internet Research, 19(7), e243.
4. Coppersmith, G., Dredze, M., Harman, C., & Hollingshead, K. (2018).
From ADHD to SAD: Analyzing the language of mental health on
Twitter through self-reported diagnoses. Proceedings of the Third
Workshop on Computational Linguistics and Clinical Psychology, 1-10.
5. Saleh, M., & Shih, P. C. (2019). An exploration of Twitter use for suicide
prevention. Proceedings of the 2019 CHI Conference on Human Factors
in Computing Systems, Paper No. 481.
6. Homan, C. M., Johar, R., Liu, A., & Lytle, M. C. (2019). Analyzing
suicide warning signs in Twitter. Proceedings of the International AAAI
Conference on Web and Social Media 679-670
7. Burnap, P., Colombo, G., Scourfield, J., & Williams, M. L. (2020).
Automated machine learning for the detection of suicidal behavior. JMIR
Mental Health, 7(3), e15924.
8. Du, J., Zhang, L., Yu, Y., & Zhou, G. (2020). Suicide detection on social
media using textual, structural, and visual information. Journal of
Medical Internet Research, 22(7), e17517.
9. Pesce, A., Laconi, S., & Tagliagambe, S. (2020). Textual analysis of
suicide notes for prevention. Journal of Medical Internet Research,
22(12), e22241.