
ABSTRACT

Suicidal ideation is a significant public health concern, and the early detection of individuals at risk is crucial for effective intervention and prevention. This abstract summarizes the current state of research in suicidal ideation detection and explores the approaches and methodologies employed in this domain.

The abstract begins by discussing the importance of identifying suicidal ideation and its impact on individuals and society. It highlights the challenges associated with detecting this complex psychological state, including the subjective nature of ideation and the stigma surrounding mental health. Next, it explores the techniques used for suicidal ideation detection, including machine learning algorithms, natural language processing, and sentiment analysis, and discusses the use of data sources such as social media posts, online forums, and electronic health records. It also examines the integration of contextual information, including demographic factors, psychosocial variables, and clinical history, to enhance the accuracy of detection models.

Furthermore, the abstract addresses the ethical considerations associated with suicidal ideation detection, including privacy concerns and the potential for false positives or negatives. It emphasizes the importance of balancing the benefits of early intervention with the need to protect individuals' rights and well-being. The abstract concludes by highlighting the potential impact of accurate suicidal ideation detection, including improved mental health care delivery, targeted interventions, and prevention strategies.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT
LIST OF TABLES ii
LIST OF FIGURES iii
LIST OF ABBREVIATIONS iv
1. INTRODUCTION 1
1.1 BACKGROUND UNDERSTANDING 1
1.2 INTRODUCTION TO CLASSIFIER 3
2. LITERATURE SURVEY 4
3. EXISTING SYSTEM
3.1 OVERVIEW OF EXISTING SYSTEM 9
3.2 LIMITATIONS OF EXISTING SYSTEM 9
3.3 PROPOSED SYSTEM 10
3.4 ADVANTAGES OF PROPOSED SYSTEM 10
4. SYSTEM ARCHITECTURE 12
4.1 SYSTEM ARCHITECTURE 12
4.2 DATA FLOW DIAGRAM 13
4.3 UML DIAGRAM 14
4.3.1 CLASS DIAGRAM 15
4.3.2 USE CASE DIAGRAM 15
4.3.3 ER DIAGRAM 16
4.3.4 SEQUENCE DIAGRAM 17
4.3.5 COLLABORATION DIAGRAM 18
4.3.6 ACTIVITY DIAGRAM 20
5. SYSTEM REQUIREMENTS 21
5.1 INTRODUCTION 21
5.2 HARDWARE REQUIREMENTS 21
5.3 SOFTWARE REQUIREMENTS 21
6. SYSTEM IMPLEMENTATION 22
6.1 MODULE DESCRIPTION 22
6.1.1 DATA COLLECTION 22
6.1.2 DATA PREPROCESSING 22
6.1.3 FEATURE EXTRACTION 23
6.1.4 SPLITTING THE DATASET 25
6.1.5 MODEL TRAINING 25
6.1.6 MODEL EVALUATION 25
6.1.7 DEPLOYMENT AND MONITORING 25
6.2 ALGORITHMS USED 26
6.2.1 SUPPORT VECTOR MACHINE 26
6.3 LIBRARIES USED 28
7. PROJECT OUTPUT 31
8. CONCLUSION AND FUTURE WORK 32
ANNEXURE 33
REFERENCES 37

LIST OF FIGURES

S.NO TITLE PAGE NO

1 4.3.1 CLASS DIAGRAM 15
2 4.3.2 USE CASE DIAGRAM 15
3 4.3.3 ER DIAGRAM 16
4 4.3.4 SEQUENCE DIAGRAM 17
5 4.3.5 COLLABORATION DIAGRAM 18
6 4.3.6 ACTIVITY DIAGRAM 20
7 6.1.1 DATA COLLECTION 22
8 6.1.2 DATA PRE-PROCESSING 22
9 6.1.3 FEATURE EXTRACTION 23
10 6.1.4 SPLITTING THE DATASET 25
11 6.1.5 MODEL TRAINING 25
12 6.1.6 MODEL EVALUATION 25
13 6.1.7 DEPLOYMENT AND MONITORING 25
14 6.2.1 SUPPORT VECTOR MACHINE ALGORITHM 26

LIST OF ABBREVIATIONS

RNN – Recurrent Neural Network


LSTM – Long Short-Term Memory
LR – Logistic Regression
CNN – Convolutional Neural Network
SVM – Support Vector Machine
K-NN – K-Nearest Neighbours
RF – Random Forest
DT – Decision Tree
BERT – Bidirectional Encoder Representations from Transformers
NB – Naïve Bayes
OS – Operating System
iNLTK– Indic Natural Language Tool Kit
GNU – GNU’s Not Unix
HD1 – Hate Speech Dataset 1
HD2 – Hate Speech Dataset 2
GCR-NN - Graph Convolutional Recurrent Neural Network
UML – Unified Modelling Language
NLP – Natural Language Processing
ML – Machine Learning
AI – Artificial Intelligence

GUI – Graphical User Interface


UI – User Interface
DFD – Data Flow Diagram

CHAPTER 1
INTRODUCTION

Before turning to technical and application details, a background understanding of the problem domain is important. If the background of the problem is understood well, the project description is easier to follow. This chapter describes the existing state of society, the relevant statistical information, and how this issue will affect our future generation.

1.1 BACKGROUND UNDERSTANDING


Millions of young people spend their time on social networks, sharing information online. Social networks make it possible to communicate and share information with anyone, at any time, and with any number of people simultaneously. There are over 3 billion social media users around the world. According to the National Crime Prevention Council (NCPC), cyberbullying occurs online when mobile phones, video game apps, or any other means are used to send text, photos, or videos intended to deliberately injure or embarrass another person. Suicidal ideation refers to thoughts, fantasies, or preoccupations with self-harm or taking one's own life. It is a critical indicator of mental distress and is often a precursor to suicide attempts. Detecting and identifying individuals experiencing suicidal ideation is essential for timely intervention and prevention of suicide.

Traditionally, the detection of suicidal ideation has relied on self-reporting through interviews, questionnaires, or clinical assessments. However, these methods have limitations, as individuals may be hesitant to disclose their thoughts due to stigma, fear of judgment, or lack of insight into their own mental state. Moreover, relying solely on self-reporting can be time-consuming, resource-intensive, and may not capture real-time changes in ideation.
In recent years, researchers have explored the use of technology, data analytics,

and computational methods to improve the detection of suicidal ideation.

Machine learning algorithms, natural language processing (NLP), and sentiment

analysis techniques have been employed to analyze large datasets, including

social media posts, online forums, and electronic health records, for identifying

linguistic and emotional markers associated with suicidal ideation. These

computational approaches leverage patterns and signals in language use,

sentiment, and contextual information to identify individuals at risk. They aim

to uncover subtle linguistic cues, such as increased negative emotions,

expressions of hopelessness, or specific keywords related to suicide, that may

indicate the presence of suicidal ideation. By analyzing a large volume of data,

these techniques can potentially detect patterns and trends that may go

unnoticed through traditional methods. Additionally, researchers have explored

the integration of contextual factors to enhance the accuracy of suicidal ideation

detection models. Demographic variables, such as age, gender, and

socio-economic status, along with psychosocial factors and clinical history, can

provide valuable insights into an individual's risk profile. Incorporating this

contextual information into detection algorithms can help tailor interventions

and support strategies to specific populations and improve overall detection

accuracy.

However, ethical considerations are crucial in the development and deployment

of suicidal ideation detection systems. Privacy concerns, data security, and

potential harm from false positives or negatives must be carefully addressed.

Ensuring informed consent, data anonymization, and appropriate protocols for

intervention and follow-up are essential to protect the well-being and rights of

individuals.

In conclusion, the field of suicidal ideation detection has evolved from

traditional self-reporting methods to include computational approaches

leveraging technology and data analytics. These methods offer the potential for

early identification and intervention, contributing to improved mental health

care delivery and prevention strategies. However, ongoing research is necessary

to refine detection models, validate their effectiveness across diverse

populations, and address ethical considerations in order to maximize the impact

of suicidal ideation detection efforts.

1.2 INTRODUCTION TO CLASSIFIER


We used the Support Vector Machine (SVM), a well-known and efficient binary classifier, to train our model. In the SVM algorithm, training data are used to learn a classification function that can assign new, previously unseen data to one of two categories. It separates the training data into two categories using a maximum-margin hyperplane.
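As an illustration, a linear SVM text classifier can be trained in a few lines. The sketch below assumes scikit-learn; the toy posts and labels are invented for illustration and are not the project's dataset.

```python
# Minimal sketch of a binary SVM text classifier (assumes scikit-learn).
# The toy posts and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

posts = [
    "I feel hopeless and want to disappear",
    "I cannot see a way out anymore",
    "Had a great day hiking with friends",
    "Excited about the new movie tonight",
]
labels = [1, 1, 0, 0]  # 1 = ideation-related, 0 = neutral (toy labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)   # text -> numerical feature vectors

clf = LinearSVC()                     # learns a separating hyperplane
clf.fit(X, labels)

pred = clf.predict(vectorizer.transform(["I feel hopeless"]))
```

New posts are vectorized with the same fitted vectorizer before prediction, so train and inference share one vocabulary.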

CHAPTER 2
LITERATURE SURVEY

Title: “Detection of Suicidal Ideation from Social Media Data Using Machine Learning Techniques”
Authors: Coppersmith et al. (2018)
Published in: Proceedings of the 26th International Conference on World Wide Web
Individuals who suffer from suicidal ideation frequently express their
views and ideas on social media. Thus, several studies found that people
who are contemplating suicide can be identified by analyzing social media
posts. However, finding and comprehending patterns of suicidal ideation
represent a challenging task. Therefore, it is essential to develop a machine
learning system for automated early detection of suicidal ideation or any
abrupt changes in a user’s behavior by analyzing his or her posts on social
media. In this paper, we propose a methodology based on experimental
research for building a suicidal ideation detection system using publicly
available Reddit datasets, word-embedding approaches, such as TF-IDF
and Word2Vec, for text representation, and hybrid deep learning and
machine learning algorithms for classification. A convolutional neural
network and Bidirectional long short-term memory (CNN–BiLSTM)
model and the machine learning XGBoost model were used to classify
social posts as suicidal or non-suicidal using textual and LIWC-22-based
features by conducting two experiments. To assess the models’
performance, we used the standard metrics of accuracy, precision, recall,
and F1-scores. A comparison of the test results showed that when using

textual features, the CNN–BiLSTM model outperformed the XGBoost
model, achieving 95% suicidal ideation detection accuracy, compared with
the latter’s 91.5% accuracy. Conversely, when using LIWC features,
XGBoost showed better performance than CNN–BiLSTM.

Title: “Predicting Suicidal Ideation through Reddit Posts”

Authors: Cheng et al. (2017)

Published in: Proceedings of the Fifth International Conference on


Computational Social Science
We aimed to predict an individual's suicide risk level from longitudinal posts on Reddit discussion forums. Through participating in a shared task competition hosted by CLPsych 2019, we received two annotated datasets: a training dataset with 496 users (31,553 posts) and a test dataset with 125 users (9,610 posts). We submitted results from our three best-performing machine-learning models: SVM, Naïve Bayes, and an ensemble model. Each model assigned a user's suicide risk level to one of four categories, i.e., no risk, low risk, moderate risk, and severe risk. Among the three models, the ensemble model had the best macro-averaged F1 score of 0.379 when tested on the holdout test dataset. The NB model had the best performance in two additional binary-classification tasks, i.e., no risk vs. flagged risk (any risk level other than no risk), with an F1 score of 0.836, and no or low risk vs. urgent risk (moderate or severe risk), with an F1 score of 0.736. We conclude that the NB model may serve as a tool for identifying users with flagged or urgent suicide risk based on longitudinal posts on Reddit discussion forums.

Title: "Predicting Suicidal Ideation in Online Forums"

Authors: Huang et al. (2020)

Suicide ideation expressed in social media has an impact on language usage.


Many at-risk individuals use social forum platforms to discuss their problems or
get access to information on similar tasks. The key objective of our study is to
present ongoing work on automatic recognition of suicidal posts. We address
the early detection of suicide ideation through deep learning and machine
learning-based classification approaches applied to Reddit social media. For
such purpose, we employ an LSTM-CNN combined model to evaluate and
compare to other classification models. Our experiment shows the combined
neural network architecture with word embedding techniques can achieve the
best relevance classification results. Additionally, our results support the
strength and ability of deep learning architectures to build an effective model
for a suicide risk assessment in various text classification tasks.

Title: "Predicting Suicidal Ideation among Veterans


using Electronic Health Records"

Authors: Simon et al. (2018)

The purpose of this article was to determine whether longitudinal historical


data, commonly available in electronic health record (EHR) systems, can be
used to predict patients' future risk of suicidal behavior. Method: Bayesian
models were developed using a retrospective cohort approach. EHR data from a
large health care database spanning 15 years (1998-2012) of inpatient and
outpatient visits were used to predict future documented suicidal behavior (i.e.,
suicide attempt or death). Patients with three or more visits (N=1,728,549) were
included. ICD-9-based case definition for suicidal behavior was derived by
expert clinician consensus review of 2,700 narrative EHR notes (from 520
patients), supplemented by state death certificates. Model performance was
evaluated retrospectively using an independent testing set. Results: Among the
study population, 1.2% (N=20,246) met the case definition for suicidal
behavior. The model achieved sensitive (33%-45% sensitivity), specific (90%-
95% specificity), and early (3-4 years in advance on average) prediction of
patients' future suicidal behavior. The strongest predictors identified by the
model included both well-known (e.g., substance abuse and psychiatric
disorders) and less conventional (e.g., certain injuries and chronic conditions)
risk factors, indicating that a data-driven approach can yield more
comprehensive risk profiles. Conclusions: Longitudinal EHR data, commonly
available in clinical settings, can be useful for predicting future risk of suicidal
behavior. This modeling approach could serve as an early warning system to
help clinicians identify high-risk patients for further screening. By analyzing the
full phenotypic breadth of the EHR, computerized risk screening approaches
may enhance prediction beyond what is feasible for individual clinicians.

Title: "Detecting Suicidal Ideation on Twitter: A Comparative


Analysis of Machine Learning Approaches"

Authors: O'Dea et al. (2018)

Online social networking platforms like Twitter, Reddit and Facebook are
becoming a new way for the people to express themselves freely without
worrying about social stigma. This paper presents a methodology and
experimentation using social media as a tool to analyse the suicidal ideation in a
better way, thus helping in preventing the chances of being the victim of this
unfortunate mental disorder. The data is collected from Twitter, one of the
popular Social Networking Sites (SNS). The Tweets are then pre-processed and
annotated manually. Finally, various machine learning and ensemble methods
are used to automatically distinguish Suicidal and Non-Suicidal tweets. This
experimental study will help the researchers to know and understand how SNS
are used by the people to express their distress related feelings and emotions.
The study further confirmed that it is possible to analyse and differentiate these
tweets using human coding and then replicate the accuracy by machine
classification. However, the power of prediction for detecting genuine suicidality is not yet confirmed, and this study does not directly communicate with or intervene with people exhibiting suicidal behaviour. Suicidal ideation is one of the most severe mental health issues faced by people all over the world. There are various risk factors that can lead to suicide; the most common and critical among them are depression, anxiety, social isolation, and hopelessness. Early detection of these risk factors can help prevent or reduce suicidal behaviour.

CHAPTER 3

EXISTING SYSTEM

3.1 OVERVIEW OF EXISTING SYSTEM

The existing system of detecting suicidal ideation through social media


posts using Naive Bayes involves several steps. Initially, a dataset of posts is
collected and labeled to distinguish between those indicating suicidal ideation
and those that don't. Following this, the text data undergoes preprocessing,
including cleaning and tokenization, to prepare it for analysis. Features are then
extracted from the text, typically using techniques like Bag-of-Words or TF-
IDF. These features are fed into a Naive Bayes classifier during training, where
probabilities of features given each class are calculated. The trained model is
evaluated using various metrics to assess its performance. Once validated, the
classifier is applied to new text data to predict the presence of suicidal ideation.
Based on these predictions, appropriate interventions are implemented, such as
providing mental health resources or alerting moderators.
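The Naive Bayes workflow described above can be sketched end to end with scikit-learn. This is a hedged illustration, not the project's actual code; the example posts and labels are invented placeholders.

```python
# Sketch of the existing pipeline: labeled posts -> tokenization and
# Bag-of-Words features (CountVectorizer) -> MultinomialNB, which estimates
# the probability of each feature given each class during training.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = [
    "nothing matters anymore",
    "I want it all to end",
    "lovely weather for a walk",
    "enjoyed dinner with family",
]
labels = [1, 1, 0, 0]  # 1 = indicates ideation, 0 = does not (toy labels)

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(posts, labels)

proba = nb.predict_proba(["I want it to end"])[0]  # [P(class 0), P(class 1)]
```

A validated classifier of this shape is then applied to new posts, and predictions drive the interventions described above.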

3.2 LIMITATIONS OF EXISTING SYSTEM

 Assumption of Feature Independence: Naive Bayes classifiers assume


that features (words in this case) are independent of each other given the
class label. However, in natural language, words often have complex
relationships and dependencies, especially when dealing with nuanced
topics like suicidal ideation. For example, certain combinations of words
or phrases might indicate a higher risk of suicidal ideation, and Naive
Bayes would struggle to capture these dependencies.

 Limited Contextual Understanding: Naive Bayes models lack the


ability to capture the contextual nuances of language. Detecting suicidal
ideation requires understanding the context in which certain words or
phrases are used, as well as considering broader linguistic and behavioral
patterns. Naive Bayes, being a simple probabilistic model, might not be
able to handle this level of complexity effectively.

 Imbalanced Data and Rare Events: Suicidal ideation detection is a rare
event in social media data, which often leads to imbalanced datasets.
Naive Bayes classifiers might struggle with imbalanced data because they
tend to bias predictions towards the majority class. Special techniques
such as oversampling, undersampling, or using different evaluation
metrics would be necessary to address this issue.
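One common remedy for the imbalance issue above is random oversampling of the minority class. The sketch below uses scikit-learn's resample on synthetic data purely for illustration.

```python
# Random oversampling of a rare positive class (assumes scikit-learn;
# the data here are synthetic placeholders).
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(-1, 1)   # 20 toy samples
y = np.array([0] * 18 + [1] * 2)   # class 1 ("ideation") is rare

X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=18, random_state=42)  # duplicate up to 18

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
# The classes are now balanced: 18 negatives and 18 positives.
```

Undersampling the majority class, or passing class_weight="balanced" to the classifier, are alternatives with different trade-offs.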

3.3 PROPOSED SYSTEM

The proposed system for detecting suicidal ideation in social media posts
utilizes Support Vector Machines (SVM) as its core algorithm. This system
begins with the collection of a diverse dataset containing social media posts,
each labeled to distinguish between those indicating suicidal ideation and those
that do not. Following data collection, preprocessing techniques are applied to
cleanse the text data, including noise removal, tokenization, stemming or
lemmatization, and elimination of stopwords. Next, feature extraction methods
such as TF-IDF or word embeddings are employed to represent each post as a
numerical vector, capturing semantic relationships and word importance. The
SVM algorithm is then trained on this labeled dataset, aiming to find the
optimal hyperplane that separates suicidal and non-suicidal content in the
feature space. Model evaluation is conducted using metrics like accuracy,
precision, recall, and F1-score, ensuring robustness and generalization.
Subsequently, the trained SVM classifier is applied to new, unseen social media
posts to predict the presence of suicidal ideation, enabling appropriate
interventions such as providing mental health resources or alerting moderators.
Continuous monitoring and refinement of the system's performance are
emphasized, along with ethical considerations regarding user privacy,
algorithmic bias, and human oversight throughout the process. Through this
approach, the proposed system aims to contribute to early intervention and
support for individuals expressing signs of distress on social media platforms.
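The proposed pipeline can be sketched end to end: TF-IDF vectorization feeding an SVM, then prediction on unseen posts. This is a minimal illustration assuming scikit-learn; the posts and labels are invented placeholders, not the project's dataset.

```python
# Hedged sketch of the proposed system: TF-IDF features -> linear-kernel SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_posts = [
    "I feel hopeless and empty",
    "no reason to keep going",
    "great workout this morning",
    "loved the concert last night",
]
train_labels = [1, 1, 0, 0]  # toy labels

model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # preprocessing + features
    SVC(kernel="linear"),                                    # maximum-margin hyperplane
)
model.fit(train_posts, train_labels)

preds = model.predict(["feeling hopeless today", "what a great morning"])
```

In practice the fitted pipeline would be evaluated with accuracy, precision, recall, and F1 before any predictions are acted upon.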

3.4 ADVANTAGES OF PROPOSED SYSTEM

 High Accuracy: SVMs are known for their ability to handle high-
dimensional data and find complex decision boundaries. They can
effectively separate data points into different classes, leading to high
classification accuracy.

 Versatility: SVMs can handle both linear and non-linear data through the
use of different kernel functions. This flexibility allows them to capture
intricate relationships in textual data, making them suitable for detecting
nuanced expressions of suicidal ideation.

 Robustness to Overfitting: SVMs have regularization parameters that


help prevent overfitting, ensuring that the model generalizes well to
unseen data. This robustness is particularly important in detecting
suicidal ideation, where the dataset may be imbalanced and noisy.

 Efficiency: Despite their complexity, SVMs are computationally


efficient, especially when using sparse data representations like TF-IDF
or word embeddings. This efficiency makes them scalable to large
datasets typically found in social media platforms.

 Interpretability: SVMs provide clear decision boundaries, making their


predictions interpretable. Mental health professionals and moderators can
understand the rationale behind the model's classifications, enabling
informed decision-making regarding interventions and support.

 Effective with Small Datasets: SVMs can perform well even with small
training datasets, making them suitable for tasks where labeled data may
be limited, such as detecting suicidal ideation.

 Well-Studied and Established: SVMs have been extensively studied


and widely used in various domains, including text classification. Their
theoretical foundations are well-understood, providing confidence in their
reliability and effectiveness.

CHAPTER 4
SYSTEM ARCHITECTURE

4.1 SYSTEM ARCHITECTURE

System architecture refers to the high-level structure and organization of a


system. It encompasses the design principles, components, modules,
relationships, and interactions that define the system's overall behavior and
functionality. System architecture provides a blueprint for developing,
deploying, and maintaining a system, ensuring that it meets the desired
requirements and objectives.


Fig 4.1 System Architecture

4.2 DATA FLOW DIAGRAM

A data flow diagram (DFD) is a visual representation of how data flows


within a system or process. It illustrates the movement of data from its sources,
through various processes, to its destinations. DFDs are commonly used in
system analysis and design to understand and communicate the data
requirements and data transformations within a system.

LEVEL 0 DFD

Fig 4.2.1 Level 0 DFD

LEVEL 1 DFD

Fig 4.2.2 Level 1 DFD

LEVEL 2 DFD

Fig 4.2.3 Level 2 DFD

4.3 UML DIAGRAM

UML, or Unified Modeling Language, is a standardized graphical


notation used for modeling software systems. It provides a set of diagrams and
symbols that allow software developers, designers, and stakeholders to visually
represent various aspects of a system's structure, behavior, and interactions.

UML diagrams help in understanding, designing, documenting, and


communicating different aspects of a software system. They capture different
viewpoints and perspectives of a system, allowing stakeholders to gain a
common understanding and effectively communicate ideas.

4.3.1 CLASS DIAGRAM

A class diagram is a type of UML diagram that represents the static


structure and relationships of classes in a system. It provides an overview of the
classes, their attributes, methods, and associations with other classes.

Fig 4.3.1.1 Class Diagram

4.3.2 USE CASE DIAGRAM

A use case diagram is a type of UML diagram that represents the functional behavior of a system from the user's point of view. It shows the actors that interact with the system, the use cases (functions or services) the system provides, and the relationships between them, giving stakeholders a high-level view of what the system does.

Fig 4.3.2.1 Use case Diagram

4.3.3 ER DIAGRAM

An entity-relationship (ER) diagram is a visual representation of the


entities (objects or concepts) within a system or domain, their attributes, and the
relationships between them. It is a popular modeling technique used in database
design to describe the structure and organization of data.

ER diagrams help to visualize the structure of a database and its relationships,


enabling effective database design and development. They assist in identifying
entities, their attributes, and the relationships between them, which are essential
for constructing an accurate and efficient database schema.

Fig 4.3.3.1 ER Diagram

4.3.4 SEQUENCE DIAGRAM


A sequence diagram is a type of UML diagram that illustrates the
interactions and order of messages exchanged between objects or components
within a system over a specific time period. It shows the dynamic behavior of a
system and how different objects collaborate to accomplish a specific
functionality or scenario.

Fig 4.3.4.1 Sequence Diagram

4.3.5 COLLABORATION DIAGRAM


Collaboration diagrams, also known as communication diagrams, are a type
of UML diagram that visualize the interactions and relationships between
objects or components in a system. They focus on the structural organization

and communication patterns between objects, providing a dynamic view of how
objects collaborate to achieve a particular functionality.

Fig 4.3.5.1 Collaboration Diagram

4.3.6 ACTIVITY DIAGRAM
An activity diagram is a type of UML diagram that models the flow of
activities, actions, and decisions within a system or process. It provides a visual
representation of the workflow or behavior of a system, illustrating the sequence
of actions and the conditions or decisions that control the flow.

Fig 4.3.6.1 Activity Diagram

CHAPTER 5

SYSTEM REQUIREMENTS

5.1 INTRODUCTION

For the development of the system, knowledge of the hardware and software configurations is required. In this chapter, the hardware requirements such as CPU, RAM, GPU, and storage, and the software requirements such as the operating system, programming language, IDE, machine learning and deep learning frameworks, web hosting framework, and the libraries used for system development are discussed.

5.2 HARDWARE REQUIREMENTS

● CPU: Laptop or PC with an Intel Core i5 6th generation processor or higher, with a clock speed of 2.5 GHz or above. Equivalent AMD processors are also suitable.
● RAM: Minimum 4 GB of RAM is required.
● GPU: NVIDIA GeForce GTX 960 or higher.
● Storage: An SSD is recommended for faster pre-processing of data than an HDD.

5.3 SOFTWARE REQUIREMENTS

● OS: Windows 7 or a higher version (Windows 10 recommended), or Ubuntu 16.04 or later.
● Python: Programming language used for machine learning and deep learning.
● Jupyter Notebook: Development environment.

CHAPTER 6
SYSTEM IMPLEMENTATION

6.1 MODULE DESCRIPTION

6.1.1 DATA COLLECTION


- Gathering tweet data.
- Website: http://www.kaggle.com

6.1.2 DATA PRE-PROCESSING

● Tokenization: Tokenization is the process of breaking down the raw


text into individual tokens or words. This is done to enable the
computer to understand the structure of the text. Tokenization can be
done at different levels, such as word level, character level, or subword
level.

● Stop Word Removal: Stop words are words that do not contribute to
the meaning of a sentence, such as "a", "an", "the", "and", and "but".
Removing stop words can reduce the size of the vocabulary and
improve the efficiency of the model. However, it is important to note
that stop words can sometimes carry important information, and their
removal may not always be appropriate.

● Stemming and Lemmatization: Stemming and lemmatization are techniques used to reduce the inflectional and derivational forms of words to their base form or lemma. This helps to reduce the vocabulary size and improve the accuracy of the model. Stemming involves removing the suffixes of words, while lemmatization involves converting words to their base form based on their part of speech.

● Text Normalization: Text normalization involves standardizing the


text data by converting uppercase letters to lowercase, removing
punctuation and special characters, and correcting spelling and
grammatical errors. This helps to ensure that the text data is consistent
and uniform, which can improve the accuracy of the model.

● Feature Engineering: Feature engineering involves selecting and


transforming the most relevant features from the text data to be used in
the model. This may involve techniques such as n-gram generation,
which involves grouping adjacent words together to capture context, or
word embedding, which involves representing words as vectors in a
high-dimensional space.
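The steps above (normalization, tokenization, stop word removal, and stemming) can be sketched in plain Python. The stop word list and suffix rules below are deliberately tiny illustrations, not a production resource; real code would use a library stemmer such as NLTK's PorterStemmer.

```python
import re

# A deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "but", "i", "am", "is", "to"}

def preprocess(text):
    text = text.lower()                      # text normalization: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation and digits
    tokens = text.split()                    # word-level tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    # Crude suffix-stripping "stemming" (a real system would use a
    # proper stemmer or lemmatizer).
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

preprocess("The user posted: 'I am feeling hopeless!'")
# -> ['user', 'post', 'feel', 'hopeles']
```

Note how naive suffix stripping mangles "hopeless" to "hopeles", which is why library stemmers or lemmatizers are preferred in practice.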

6.1.3 FEATURE EXTRACTION

 Bag-of-Words (BoW): In BoW representation, the text is treated as a


collection of words, disregarding grammar and word order. Each
document is represented by a vector where each element corresponds
to the count or presence of a specific word in the document. BoW is a
simple and effective way to represent text, but it doesn't capture the
sequential information.

 TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF


calculates the importance of a word in a document by multiplying the
term frequency (TF) and inverse document frequency (IDF). TF

measures the occurrence frequency of a word in a document, while
IDF penalizes common words and rewards rare words. The TF-IDF
representation helps to capture the relevance of words in a document
collection.
 Word Embeddings: Word embeddings are dense vector
representations of words that capture semantic and syntactic
relationships. Pre-trained word embeddings such as Word2Vec,
GloVe, or FastText can be used to obtain word vectors. These
embeddings encode semantic information, allowing models to capture
the meaning and context of words.
 Character-level Features: In some cases, character-level features can
be used alongside word-level features. For example, character n-grams
or character-level embeddings can capture subword information,
useful for morphologically rich languages or out-of-vocabulary words.
 Named Entity Recognition (NER) Features: NER features identify and
classify named entities such as names, organizations, locations, and
dates in the text. These features can be used to provide additional
information about the entities present in the text.
 Part-of-Speech (POS) Tags: POS tags represent the grammatical
category of words in a sentence. Including POS tags as features can
provide information about the syntactic structure of the text, which
can be useful in various NLP tasks such as parsing or sentiment
analysis.
 Dependency Parsing: Dependency parsing represents the grammatical
relationships between words in a sentence using a parse tree.

Extracting dependency-based features can capture the syntactic
structure and relationships between words.
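The first two representations above can be compared directly with scikit-learn's CountVectorizer (Bag-of-Words) and TfidfVectorizer; the three documents are invented examples.

```python
# BoW vs. TF-IDF on the same toy corpus (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "life feels empty",
    "life is good",
    "everything feels so empty inside",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(docs)       # raw word counts per document

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)   # counts reweighted by rarity (IDF)
```

Both produce a documents-by-vocabulary matrix of the same shape; only the weighting differs, with TF-IDF downweighting words that occur in many documents.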

6.1.4 SPLITTING THE DATASET


 Divide the dataset into training, validation, and test sets. The training set
is used to train the SVM model, the validation set helps tune
hyperparameters, and the test set evaluates the model's performance on
unseen data.
6.1.5 MODEL TRAINING
 Train the SVM model using the training data. Tune hyperparameters
such as the choice of kernel (linear, polynomial, or radial basis function)
and regularization parameter (C) using cross-validation on the validation
set. Grid search or random search can be employed for hyperparameter
tuning.
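The kernel and C search described above maps directly onto scikit-learn's GridSearchCV; the synthetic dataset and candidate values below are placeholders for illustration.

```python
# Hyperparameter tuning for an SVM with grid search and cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized training data.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

param_grid = {
    "kernel": ["linear", "rbf"],  # candidate kernels
    "C": [0.1, 1, 10],            # regularization strengths
}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1")
search.fit(X, y)

best = search.best_params_  # the best (kernel, C) combination found
```

RandomizedSearchCV is the drop-in alternative when the grid is too large to search exhaustively.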
6.1.6 MODEL EVALUATION
 Evaluate the trained SVM model on the test set using appropriate
metrics such as accuracy, precision, recall, F1-score, and ROC-AUC
(Receiver Operating Characteristic - Area Under the Curve).
Additionally, consider other metrics like confusion matrix and
classification report to understand the model's performance across
different classes.
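All of these metrics are available in sklearn.metrics; a minimal sketch with made-up labels and predictions:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Invented true labels and predictions for eight examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 0.75 (6 of 8 correct)
print(f1_score(y_true, y_pred))          # 0.75
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted
print(classification_report(y_true, y_pred))
```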
6.1.7 DEPLOYMENT AND MONITORING
 Once satisfied with the model's performance, deploy it into production.
Continuously monitor the model's performance over time and retrain it
periodically with updated data to maintain its effectiveness.
6.2 ALGORITHMS USED

6.2.1 SUPPORT VECTOR MACHINE ALGORITHM

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for both classification and regression problems. In machine learning, however, it is primarily used for classification.

The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we can
easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane.
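A minimal sketch of this idea with scikit-learn's SVC: two invented 2-D clusters, a linear hyperplane between them, and the support vectors that define it.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters in 2-D (invented example data).
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel finds the separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)  # the extreme points that define the margin
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```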
Fig. 6.2.1.1 SVM Diagram

EXAMPLE

SVM can be understood with the example we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train the model with many images of cats and dogs so that it learns the different features of each, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each. On the basis of the support vectors, it classifies the creature as a cat. Consider the diagram below.
Fig 6.2.1.2 Example Diagram

6.3 LIBRARIES USED

PANDAS

● The pandas library is an open-source Python library that provides


high-performance, easy-to-use data structures and data analysis tools. It's
particularly designed for working with structured or tabular data,
commonly found in data science, machine learning, finance, and other
domains.
● Key features:
o DataFrame: A two-dimensional labeled data structure with
columns of potentially different types. It's akin to a
spreadsheet or SQL table, making it easy to work with
structured data. Each column in a DataFrame is a Series,
allowing for vectorized operations and intuitive data
manipulation.
o Series: A one-dimensional labeled array capable of holding
data of any type (integer, float, string, etc.). Series are the
building blocks of DataFrames and provide powerful
indexing and selection capabilities.
o Data Input/Output: Pandas supports reading and writing
data from various file formats, including CSV, Excel, SQL
databases, JSON, HTML, and more. This makes it convenient
for loading data from different sources into pandas
DataFrames for analysis.
o Data Manipulation: Pandas offers a wide range of functions
for manipulating data, including indexing and selection,
filtering, sorting, grouping, joining, reshaping, and handling
missing data. These operations enable users to clean,
transform, and analyze datasets efficiently.
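A small sketch of these DataFrame operations; the toy texts and labels are hypothetical.

```python
import pandas as pd

# A toy labeled dataset (1 = flagged text, 0 = not flagged).
df = pd.DataFrame({
    "text": ["i feel fine", "no one cares", "great day"],
    "label": [0, 1, 0],
})

# Boolean indexing selects the flagged rows.
print(df[df["label"] == 1]["text"].tolist())  # ['no one cares']

# value_counts summarizes the class balance.
print(df["label"].value_counts().to_dict())   # {0: 2, 1: 1}
```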

NLTK

● NLTK (Natural Language Toolkit) is a leading platform for building


Python programs to work with human language data. It provides easy-
to-use interfaces to over 50 corpora and lexical resources, such as
WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic
reasoning.
● Key features:
o Corpora and Lexical Resources: NLTK offers access to a
vast collection of language datasets and resources, including
corpora for various languages, WordNet (a lexical database
for English), PropBank, Penn Treebank, and more. These
resources are invaluable for tasks like text analysis,
information retrieval, and natural language understanding.
o Tokenization: NLTK provides tools for breaking text into
words, sentences, or other linguistic units, a process known
as tokenization. This functionality is crucial for preprocessing
text data before further analysis or modeling.
o Part-of-Speech Tagging (POS): NLTK includes pre-trained
models for tagging each word in a sentence with its
corresponding part of speech (e.g., noun, verb, adjective).
POS tagging is essential for many natural language
processing tasks, such as parsing, named entity recognition,
and sentiment analysis.
o Stemming and Lemmatization: NLTK supports stemming,
which reduces words to their base or root form by removing
suffixes. It also offers lemmatization, a similar process that
reduces words to their canonical form (lemma) based on their
dictionary definitions. Both techniques are used to normalize
text and improve the performance of text analysis tasks.
o Named Entity Recognition (NER): NLTK provides tools
for identifying and classifying named entities (e.g., persons,
organizations, locations) in text. NER is essential for
extracting structured information from unstructured text and
is widely used in information extraction, question answering,
and entity linking applications.
o WordNet: NLTK provides an interface to WordNet, a lexical
database of English words and their semantic relationships.
WordNet is widely used in natural language processing for
tasks like synonymy detection, semantic similarity
measurement, and word sense disambiguation.

SKLEARN

● Scikit-learn, often abbreviated as sklearn, is one of the most popular


machine learning libraries in Python. It provides simple and efficient
tools for data mining and data analysis, built on top of NumPy, SciPy,
and matplotlib.
● Key features:
o Wide Range of Machine Learning Algorithms: Scikit-learn
includes implementations of a diverse set of supervised and
unsupervised learning algorithms, including:
 Supervised Learning: Classification (e.g., SVM, k-NN,
Decision Trees), Regression (e.g., Linear Regression,
Random Forests), and Ensemble methods.
 Unsupervised Learning: Clustering (e.g., KMeans,
DBSCAN), Dimensionality Reduction (e.g., PCA, t-
SNE), and Anomaly Detection.
o Data Preprocessing and Feature Engineering: Scikit-learn
provides tools for preprocessing data before feeding it into
machine learning models. This includes scaling features,
handling missing values, encoding categorical variables, and
generating polynomial features. These preprocessing techniques
are essential for improving model performance and
generalization.
o Model Evaluation and Selection: Scikit-learn offers functions
for evaluating the performance of machine learning models
using various metrics, such as accuracy, precision, recall, F1-
score, ROC-AUC, and more. It also provides tools for cross-
validation, hyperparameter tuning, and model selection to
ensure robustness and generalization.
o Integration with NumPy and SciPy: Scikit-learn seamlessly
integrates with NumPy and SciPy, leveraging their efficient
numerical computations and mathematical functions.
o Pipeline for Streamlined Workflows: Scikit-learn allows users
to create data processing pipelines that automate the sequence
of operations, including data preprocessing, feature extraction,
model training, and prediction. Pipelines simplify code
organization, improve reproducibility, and facilitate model
deployment.
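The Pipeline feature can be sketched by chaining the TF-IDF vectorizer and SVM used in this project; the four training texts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Vectorizer and classifier chained so the same preprocessing
# is applied automatically at both fit and predict time.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])

texts = [
    "i want to disappear",
    "lovely weather today",
    "nobody would miss me",
    "excited for the weekend",
]
labels = [1, 0, 1, 0]

pipeline.fit(texts, labels)
print(pipeline.predict(["i want to disappear"]))
```

Because the vectorizer is part of the pipeline, deployment only needs to persist one object, and there is no risk of transforming new text with a differently fitted vocabulary.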

CHAPTER 7

PROJECT OUTPUTS
CHAPTER 8

CONCLUSION

This project uses Support Vector Machines (SVM) for the detection of suicidal ideation, an approach that offers a combination of resilience, adaptability, and interpretability. SVMs excel in navigating high-dimensional feature spaces, a crucial asset when analyzing textual data, where numerous features, such as words or n-grams, abound. Their capacity to discern complex relationships between these features and indicators of suicidal ideation ensures a comprehensive understanding of the underlying data patterns. Moreover, the flexibility inherent in SVMs, particularly in the selection of kernel functions, empowers the model to capture nonlinear associations effectively. This adaptability enhances predictive accuracy, accommodating the diverse manifestations of suicidal ideation across various linguistic contexts. Importantly, SVMs offer interpretable decision boundaries, facilitating insight into the factors influencing suicidal ideation detection. Through examination of support vectors and associated weights, researchers and clinicians can glean valuable understanding of the most influential features, aiding in both model refinement and clinical interpretation. Furthermore, SVMs demonstrate strong generalization capabilities, crucial for reliable detection across diverse populations and real-world scenarios.

FUTURE WORK

In future research on suicidal ideation detection using SVMs, several promising avenues warrant exploration. Firstly, there's scope for refining the
model through further optimization of SVM parameters and the exploration
of advanced kernel functions. Additionally, there's a need to delve into novel
approaches for feature engineering, including the integration of deep
learning-based embeddings and context-aware representations. This could
enable the extraction of richer semantic information from textual data,
potentially leading to more nuanced and accurate detection outcomes.
Finally, the integration of multimodal data sources, such as text, images, and
social network interactions, holds promise for enhancing the
comprehensiveness and accuracy of suicidal ideation detection models,
paving the way for more holistic approaches to mental health assessment
and intervention.

ANNEXURE 1

SOURCE CODE

import re

import joblib
import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer


def preprocess_text(text):
    # Lowercase, strip non-letters, tokenize, drop stopwords, lemmatize.
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(tokens)


# Load the trained SVM model and the fitted TF-IDF vectorizer.
model_filename = 'svm_model.pkl'
vectorizer_filename = 'vectorizer.pkl'
loaded_model = joblib.load(model_filename)
loaded_vectorizer = joblib.load(vectorizer_filename)

# Read the input texts from file, one per line.
with open("newfile.txt", "r") as f:
    input_text = [line for line in f]
print(input_text)

# Preprocess each text, vectorize, and predict.
processed_input = [preprocess_text(text) for text in input_text]
preprocessed_input_tfidf = loaded_vectorizer.transform(processed_input)
prediction = loaded_model.predict(preprocessed_input_tfidf)
print(prediction)

# Separate the original texts by predicted class (1 = suicidal).
suicidal = []
Non_suicidal = []
for count, label in enumerate(prediction):
    if label == 1:
        suicidal.append(input_text[count])
    else:
        Non_suicidal.append(input_text[count])
print(suicidal)
print(Non_suicidal)

# Plot the class distribution as a pie chart.
labels = ['Suicidal', 'Non-Suicidal']
sizes = [len(suicidal), len(Non_suicidal)]
colors = ['red', 'green']
explode = (0.1, 0)  # explode 1st slice
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Prediction Results')
plt.show()

# Save each group to its own CSV file.
df = pd.DataFrame(suicidal)
df1 = pd.DataFrame(Non_suicidal)
df.to_csv('suicidal.csv', index=False)
df1.to_csv('Non_suicidal.csv', index=False)

df = pd.read_csv("Non_suicidal.csv")
print(df)

REFERENCES

1. O'Dea, B., Wan, S., Batterham, P. J., Calear, A. L., Paris, C., &
Christensen, H. (2015). Detecting suicidality on Twitter. Internet
Interventions, 2(2), 183-188.
2. De Choudhury, M., Kiciman, E., Dredze, M., Coppersmith, G., &
Kumar, M. (2016). Discovering shifts to suicidal ideation from mental
health content in social media. Proceedings of the 2016 CHI Conference
on Human Factors in Computing Systems, 2098-2110.
3. Cheng, Q., Li, T. M., Kwok, C. L., Zhu, T., Yip, P. S., & Li, S. (2017).
Predicting suicidal behaviors using the Chinese microblog suicidal
expression detection model: Model development and validation. Journal
of Medical Internet Research, 19(7), e243.
4. Coppersmith, G., Dredze, M., Harman, C., & Hollingshead, K. (2018).
From ADHD to SAD: Analyzing the language of mental health on
Twitter through self-reported diagnoses. Proceedings of the Third
Workshop on Computational Linguistics and Clinical Psychology, 1-10.

5. Saleh, M., & Shih, P. C. (2019). An exploration of Twitter use for suicide
prevention. Proceedings of the 2019 CHI Conference on Human Factors
in Computing Systems, Paper No. 481.
6. Homan, C. M., Johar, R., Liu, A., & Lytle, M. C. (2019). Analyzing
suicide warning signs in Twitter. Proceedings of the International AAAI
Conference on Web and Social Media, 679-670.
7. Burnap, P., Colombo, G., Scourfield, J., & Williams, M. L. (2020).
Automated machine learning for the detection of suicidal behavior. JMIR
Mental Health, 7(3), e15924.
8. Du, J., Zhang, L., Yu, Y., & Zhou, G. (2020). Suicide detection on social
media using textual, structural, and visual information. Journal of
Medical Internet Research, 22(7), e17517.
9. Pesce, A., Laconi, S., & Tagliagambe, S. (2020). Textual analysis of
suicide notes for prevention. Journal of Medical Internet Research,
22(12), e22241.