Globally, cardiovascular diseases (CVDs) are the primary cause of morbidity and
mortality, accounting for more than 70% of all fatalities. According to the 2017 Global
Burden of Disease study, cardiovascular disease is responsible for about 43% of
all fatalities [1,2]. Common risk factors for heart disease in high-income nations
include poor diet, cigarette use, excessive sugar consumption, and obesity or excess
body fat [3,4]. However, low- and middle-income nations are also seeing a rise in the
prevalence of chronic illness [5]. Between 2010 and 2015, the global economic burden
of cardiovascular diseases was expected to reach roughly USD 3.7 trillion [6,7]
(Mozaffarian et al., 2015; Maiga et al., 2019).

In addition, technologies such as electrocardiograms and CT scans, which are critical
for diagnosing coronary heart disease, are sometimes too costly and impractical for
consumers. This alone has resulted in the deaths of 17 million people [5]. Between
25% and 30% of firms’ annual medical expenses were attributable to employees with
cardiovascular disease [8]. Early detection of heart disease is therefore essential to
lessen its physical and monetary cost to people and institutions. According to a WHO
estimate, the overall number of deaths from CVDs will rise to 23.6 million by 2030,
with heart disease and stroke being the leading causes [9]. To save lives and decrease
the cost burden on society, it is vital to apply data mining and machine learning
methods to anticipate the chance of having heart disease.

Machine learning plays a crucial role in the medical field, where it can be used to
diagnose, detect, and predict various diseases. Recently, there has been growing
interest in using data mining and machine learning techniques to predict the
likelihood of developing certain diseases, and existing work already applies data
mining techniques to disease prediction. Although some studies have attempted to
predict the future risk of disease progression, they have yet to produce accurate
results [12]. The main goal of this paper is to accurately predict the possibility of
heart disease in a patient.

In this research, we investigate the effectiveness of various machine learning
algorithms in predicting heart disease. To this end, we built predictive models using
random forest [13], a decision tree classifier, a multilayer perceptron, and
XGBoost [14]. To improve the convergence of the models, we applied k-modes clustering
to preprocess and scale the dataset. The dataset used in this study is publicly
available on Kaggle. All computation, preprocessing, and visualization were conducted
on Google Colab using Python. Previous studies have reported accuracy rates of up to
94% [15] using machine learning techniques for heart disease prediction. However,
these studies have often used small sample sizes, so their results may not generalize
to larger populations. Our study addresses this limitation by using a larger and more
diverse dataset, which is expected to increase the generalizability of the results.
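
To make the k-modes step concrete, the following is a minimal pure-Python sketch of
the general technique: records are assigned to the nearest mode under Hamming
(mismatch-count) distance, and each mode is then updated to the most frequent category
per column. It is not the pipeline used in this study. The toy records, their
categorical fields, the random initialisation, and the function names are all
assumptions made for illustration.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent category in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(records, k, max_iter=20, seed=0):
    """Cluster categorical records (tuples); returns (labels, modes)."""
    rng = random.Random(seed)
    # Assumption: initialise from k distinct records chosen at random.
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode by Hamming distance.
        new_labels = [min(range(k), key=lambda j: hamming(r, modes[j]))
                      for r in records]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: recompute each cluster's mode column by column.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes

# Hypothetical toy records: (age band, cholesterol level, smoker).
records = [
    ("50-59", "high", "yes"),
    ("50-59", "high", "yes"),
    ("50-59", "high", "no"),
    ("30-39", "normal", "no"),
    ("30-39", "normal", "no"),
    ("30-39", "normal", "yes"),
]
labels, modes = k_modes(records, k=2)
```

On this toy input the first three records end up in one cluster and the last three in
the other. A real application would first bin continuous attributes into categories
and would choose k and the initialisation with more care.
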

In recent years, the healthcare industry has seen significant advances in data mining
and machine learning. These techniques have been widely adopted and have demonstrated
efficacy in various healthcare applications, particularly in cardiology. The rapid
accumulation of medical data has presented researchers with an unprecedented
opportunity to develop and test new algorithms in this field. Heart disease remains a
leading cause of mortality in developing nations [12,13,14,15,16], and identifying
risk factors and early signs of the disease has become an important area of research.
Data mining and machine learning techniques can potentially aid in the early detection
and prevention of heart disease.

Narain et al. (2016) [17] set out to create a machine-learning-based cardiovascular
disease (CVD) prediction system that improves on the precision of the widely used
Framingham risk score (FRS). The proposed system, which uses a quantum neural network
to learn and recognize patterns of CVD, was experimentally validated and compared with
the FRS using data from 689 individuals who had symptoms of CVD and a validation
dataset from the Framingham study. Its accuracy in forecasting CVD risk was reported
as 98.57%, far higher than the FRS’s accuracy of 19.22% and that of other existing
techniques. According to the study’s findings, the approach could be a useful tool for
doctors in forecasting CVD risk, assisting in the creation of better treatment plans,
and facilitating early diagnosis.

Shah et al. (2020) [18] aimed to develop a model for predicting cardiovascular disease
using machine learning techniques. The data were obtained from the Cleveland heart
disease dataset, which consists of 303 instances and 17 attributes and was sourced
from the UCI machine learning repository. The authors employed a variety of supervised
classification methods, including naive Bayes, decision tree, random forest, and
k-nearest neighbor (KNN). The KNN model exhibited the highest accuracy, at 90.8%. The
study highlights the potential utility of machine learning techniques in predicting
cardiovascular disease and emphasizes the importance of selecting appropriate models
and techniques to achieve optimal results.

In a study by Drod et al. (2022) [2], the objective was to use machine learning (ML)
techniques to identify the most significant risk variables for cardiovascular disease
(CVD) in patients with metabolic-associated fatty liver disease (MAFLD). Blood
biochemical analysis and subclinical atherosclerosis assessment were performed on 191
MAFLD patients. A model to identify those with the highest risk of CVD was built using
ML approaches such as a multiple logistic regression classifier, univariate feature
ranking, and principal component analysis (PCA). According to the study,
hypercholesterolemia, plaque scores, and duration of diabetes were the most important
clinical characteristics. The model performed well, correctly identifying 40/47
(85.11%) high-risk patients and 114/144 (79.17%) low-risk patients, with an AUC of
0.87. According to the study’s findings, an ML method is useful for detecting MAFLD
patients with widespread CVD based on simple patient criteria.
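
The two percentages quoted for the MAFLD study correspond to sensitivity (high-risk
patients correctly identified) and specificity (low-risk patients correctly
identified). The short sketch below reproduces them from the reported counts. The
overall accuracy it prints is derived here from those same counts and is not a figure
reported in the source; the helper name `percent` is an illustrative choice.

```python
def percent(correct, total):
    """Share of correctly identified cases as a percentage, rounded to 2 decimals."""
    return round(100.0 * correct / total, 2)

# Counts as quoted in the text for the 191 MAFLD patients.
high_risk_correct, high_risk_total = 40, 47
low_risk_correct, low_risk_total = 114, 144

sensitivity = percent(high_risk_correct, high_risk_total)
specificity = percent(low_risk_correct, low_risk_total)

# Derived, not reported: all correct calls over all patients.
overall_accuracy = percent(high_risk_correct + low_risk_correct,
                           high_risk_total + low_risk_total)

print(f"sensitivity={sensitivity}%  specificity={specificity}%  "
      f"overall accuracy={overall_accuracy}%")
```

The high-risk and low-risk totals (47 + 144) add up to the 191 patients in the study,
which is a quick consistency check on the quoted figures.
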

Alotalibi (2019) [19] investigated the utility of machine learning (ML) techniques for
predicting heart failure. The study used a dataset from the Cleveland Clinic
Foundation and implemented various ML algorithms, including decision tree, logistic
regression, random forest, naive Bayes, and support vector machine (SVM), to develop
prediction models. A 10-fold cross-validation approach was employed during model
development. The decision tree algorithm achieved the highest accuracy in predicting
heart disease, at 93.19%, followed by the SVM algorithm at 92.30%. This study provides
insight into the potential of ML techniques as an effective tool for predicting heart
failure and highlights the decision tree algorithm as a candidate for future research.
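
As an illustration of how 10-fold cross-validation partitions a dataset, the sketch
below splits sample indices into ten contiguous folds, each used once as the test set
while the remaining nine form the training set. It is a generic sketch, not the cited
study's code. The sample count of 303 is an assumption borrowed from the instance
count quoted earlier for the Cleveland dataset, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

n_samples = 303  # assumed for illustration only
folds = list(k_fold_indices(n_samples, k=10))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

In practice the indices are usually shuffled, and often stratified by class label,
before splitting; that step is omitted here to keep the partitioning itself visible.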

Hasan and Bao (2020) [20] compared multiple algorithms with the main objective of
identifying the most efficient feature selection approach for predicting
cardiovascular illness. Three well-known families of feature selection methods
(filter, wrapper, and embedded) were considered first, and a feature subset was then
derived from the three using a Boolean process based on a common “True” condition, so
that feature subsets were retrieved in two stages. Several models, including random
forest, support vector classifier (SVC), k-nearest neighbors, naive Bayes, and
XGBoost, were evaluated to compare accuracy and identify the best predictor. An
artificial neural network (ANN) trained on all features served as the baseline for
comparison. The most accurate predictions for cardiovascular illness were provided by
the XGBoost classifier coupled with the wrapper technique. XGBoost delivered an
accuracy of 73.74%, followed by ANN with 73.20% and SVC with 73.18%.
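
One plausible reading of the Boolean common “True” condition is that a feature is
retained only if all three selection methods mark it as selected. The sketch below
illustrates that reading with a logical AND over three Boolean masks. The feature
names and mask values are hypothetical, and the cited study's exact procedure may
differ.

```python
def common_true(*masks):
    """Keep a feature only where every selection mask is True."""
    return [all(flags) for flags in zip(*masks)]

# Hypothetical features and per-method selection masks.
features      = ["age", "cholesterol", "glucose", "smoking", "height"]
filter_mask   = [True,  True,  False, True,  False]
wrapper_mask  = [True,  True,  True,  False, False]
embedded_mask = [True,  True,  True,  True,  False]

combined = common_true(filter_mask, wrapper_mask, embedded_mask)
selected = [name for name, keep in zip(features, combined) if keep]
print(selected)
```

Here only the features chosen by all three methods survive, which makes the combined
subset at most as large as the smallest individual subset.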

The primary drawback of the prior research is its limited dataset size, which results
in a high risk of overfitting, so the models developed may not be appropriate for
large datasets. In contrast, we used a cardiovascular disease dataset consisting of
70,000 patients and 11 features, thereby reducing the chance of overfitting. Table 1
presents a concise review of cardiovascular disease prediction studies performed on
large datasets, further reinforcing the effectiveness of using a substantial dataset.

