HDPM: An Effective Heart Disease Prediction Model For A Clinical Decision Support System

Received July 7, 2020, accepted July 12, 2020, date of publication July 20, 2020, date of current version
July 30, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3010511
NORMA LATIF FITRIYANI 1, MUHAMMAD SYAFRUDIN 1, GANJAR ALFIAN 2, (Member, IEEE), AND JONGTAE RHEE 1
1 Department of Industrial and Systems Engineering, Dongguk University, Seoul 04620, South Korea
2 Industrial AI Research Center, Nano Information Technology Academy, Dongguk University, Seoul 04626, South Korea
Corresponding authors: Muhammad Syafrudin and Jongtae Rhee
This work was supported by the Dongguk University Research Fund, in 2019, under Grant S-2019-G0041-00035.
ABSTRACT Heart disease, one of the major causes of mortality worldwide, can be mitigated by early
heart disease diagnosis. A clinical decision support system (CDSS) can be used to diagnose the subjects’
heart disease status earlier. This study proposes an effective heart disease prediction model (HDPM) for
a CDSS which consists of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to
detect and eliminate the outliers, a hybrid Synthetic Minority Over-sampling Technique-Edited Nearest
Neighbor (SMOTE-ENN) to balance the training data distribution and XGBoost to predict heart disease. Two
publicly available datasets (Statlog and Cleveland) were used to build the model and compare the results with
those of other models (naive bayes (NB), logistic regression (LR), multilayer perceptron (MLP), support
vector machine (SVM), decision tree (DT), and random forest (RF)) and of previous study results. The
results revealed that the proposed model outperformed other models and previous study results by achieving
accuracies of 95.90% and 98.40% for Statlog and Cleveland datasets, respectively. In addition, we designed
and developed the prototype of the Heart Disease CDSS (HDCDSS) to help doctors/clinicians diagnose the
patients’/subjects’ heart disease status based on their current condition. Therefore, early treatment could be
conducted to prevent the deaths caused by late heart disease diagnosis.
INDEX TERMS Heart disease, disease prediction model, clinical decision support system, outlier data,
imbalanced data, machine learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeswari Sundararajan.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
133034 VOLUME 8, 2020
N. L. Fitriyani et al.: HDPM: Effective HDPM for a CDSS

I. INTRODUCTION
Heart disease is a cardiovascular disease (CVD) that remains the number one cause of death globally and contributes to approximately 30% of all global deaths [1]. If unmitigated, the total number of deaths globally is projected to increase to around 22 million in 2030. The American Heart Association reported that nearly half of American adults are affected by CVDs, equating to nearly 121.5 million adults [2]. In Korea, heart disease is among the top three leading causes of death and contributed to nearly 45% of total deaths in 2018 [3]. Heart disease is a condition in which plaque on arterial walls can block the flow of blood and cause a heart attack or stroke. Several risk factors that can lead to heart disease include unhealthy diet, physical inactivity, and excessive use of tobacco and alcohol. These risk factors can be minimized by practicing a good daily lifestyle, such as reducing salt in the diet, consuming fruits and vegetables, doing regular physical activity, and discontinuing use of tobacco and alcohol, which eventually could help to reduce the risk of heart disease [4].
The early identification of individuals at high risk of heart disease and improved diagnosis using a prediction model have generally been recommended to reduce the fatality rate and improve decision-making for further prevention and treatment [5]–[7]. A prediction model that is implemented in the clinical decision support system (CDSS) can be used to help clinicians assess the risk of heart disease and provide appropriate treatments to manage the risk further [8]. In addition, numerous studies have also reported that the implementation of CDSS can improve preventive care, clinical decision making and decision quality [9]–[12].
Machine learning-based clinical decision making has recently been applied in the healthcare area. Previous studies have shown that machine learning algorithms (MLAs) such as chaos firefly algorithm [13], backpropagation neural network (BPNN) [14], multilayer perceptron (MLP) [15],
logistic regression (LR) [16], support vector machine (SVM) [17], and random forest (RF) [18] have been successfully used as decision-making tools for heart disease prediction based on individual data. Several studies have also revealed the advantage of hybrid models, which achieved good performance in predicting heart disease, such as majority voting of naïve bayes (NB), bayes net (BN), RF, and MLP [19], two stacked SVMs [20], and RF with a linear model [21]. However, in the machine learning field, outlier and imbalanced data may arise and affect the performance of the prediction model. Previous studies have reported that by incorporating Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to detect and eliminate the outlier data [22]–[24], and by balancing the distribution of data using a hybrid Synthetic Minority Over-sampling Technique-Edited Nearest Neighbor (SMOTE-ENN) [25]–[28], the prediction models’ performances were significantly enhanced.
To the best of our knowledge, no study has investigated a heart disease prediction model (HDPM) that utilizes DBSCAN, SMOTE-ENN and XGBoost machine learning. Therefore, we propose an effective HDPM for a CDSS which consists of DBSCAN to detect and eliminate the outliers, SMOTE-ENN to balance the training data distribution and XGBoost to predict heart disease. Our challenge is to detect and remove the outlier data and to balance the distribution of the training dataset to improve the performance of the HDPM. Two publicly available datasets (Statlog [29] and Cleveland [30]) were used to build the model and to evaluate its performance compared with that of other models (NB, LR, MLP, SVM, decision tree (DT), and RF) and of previous study results. In addition, we ensured the applicability of the proposed model by designing and implementing the model into a Heart Disease CDSS (HDCDSS) to diagnose the subjects based on their current condition. The developed HDCDSS is expected to help clinicians diagnose the patients effectively and efficiently, thereby improving heart disease clinical decision making. Therefore, early treatment could be conducted to prevent the deaths caused by late heart disease diagnosis. Contributions of our study can be summarized as follows.
• Improving accuracy of heart disease prediction model. We proposed HDPM by integrating DBSCAN outlier detection, SMOTE-ENN, and XGBoost to improve prediction accuracy. The HDPM learned from two public datasets and the trained model was utilized to predict the subjects’ heart disease status based on their current condition.
• Performance analysis and comparison with state-of-the-art models. The proposed HDPM was evaluated with other classification models and compared with the results from previous studies. In addition, we presented the statistical evaluation to confirm the significance of our model as compared to other models.
• Real case system development. We designed and developed the prototype of the system to show the feasibility and applicability of our proposed model for real-world case study. It is expected that the developed system can be used as a practical guideline for the healthcare practitioners.
The remainder of this study is organized as follows. Section II summarizes the literature review. Section III presents the proposed HDPM including datasets description, overall design, and modules of the proposed model as well as performance evaluation metrics. Section IV discusses the performance evaluation of the proposed model, including the statistical test and comparison with previous studies. Section V presents the practical applications of the proposed model in the real case scenario. Finally, the concluding remarks and future research directions are presented in Section VI.

II. LITERATURE REVIEW
Several studies have reported the development of heart disease diagnosis based on machine learning models with the aim of providing an HDPM with enhanced performance. Two publicly available heart disease datasets, namely Statlog and Cleveland, have been widely used to compare the performance of prediction models among researchers. For the Statlog dataset, a heart disease clinical decision support system based on chaos firefly algorithm and rough sets-based attribute reduction (CFARS-AR) was developed by Long et al. (2015) [13]. The rough sets were used to reduce the number of attributes while the chaos firefly algorithm was used to classify the disease. The developed model was then compared with other models such as NB, SVM and ANN. The results revealed that the proposed model achieved the highest performance among all the models with accuracy, sensitivity, and specificity of 88.3%, 84.9%, and 93.3%, respectively. The combination of rough sets-based attributes selection and BPNN (RS-BPNN) was proposed by Nahato et al. (2015) [14]. With the selected attributes, the proposed RS-BPNN achieved accuracy of up to 90.4%. Dwivedi (2018) [31] compared six machine learning models (ANN, SVM, LR, k-nearest neighbor (kNN), classification tree and NB) with various performance metrics. The results showed that LR performed better than the other models by achieving up to 85%, 89%, 81%, and 85% for the accuracy, sensitivity, specificity, and precision, respectively. Amin et al. (2019) [32] performed comparison analysis by identifying significant attributes and applying machine learning models (k-NN, DT, NB, LR, SVM, Neural Network (NN) and a hybrid (voting with NB and LR)). The experiment results revealed that the hybrid model (voting with NB and LR) with selected attributes achieved the highest accuracy (87.41%).
The Cleveland heart disease dataset has been widely used by researchers to generate predictive models. Verma et al. (2016) [15] developed a hybrid prediction model based on correlation feature subset (CFS), particle swarm optimization (PSO), K-means clustering and MLP. The results showed that the proposed hybrid model achieved accuracy of up to 90.28%. Haq et al. (2018) [16] performed a comparative study on a hybrid model based on various

feature selection techniques (relief, minimal-redundancy-maximal-relevance (mRMR), least absolute shrinkage and selection operator (LASSO)) and machine learning models (LR, kNN, ANN, SVM, DT, NB, and RF). Their study revealed that feature reduction affects the performance of the models. The study concluded that a combination of Relief-based feature selection and LR-based machine learning algorithm (MLA) provides higher accuracy (up to 89%) as compared with other combinations used in the study. Saqlain et al. (2019) [17] proposed a technique based on mean Fisher score feature selection algorithm (MFSFSA) and SVM classification model. The features whose Fisher score is higher than the mean score are selected. Then, SVM used the selected feature subset to learn and calculate the Matthews correlation coefficient (MCC) through a validation process. The study revealed that the combination of MFSFSA and SVM generates accuracy, sensitivity, and specificity of up to 81.19%, 72.92%, and 88.68%, respectively. Latha and Jeeva (2019) [19] proposed a hybrid model with majority voting of NB, BN, RF, and MLP. The proposed model achieved an accuracy of up to 85.48%.
Ali et al. (2019) [20] proposed two stacked SVMs to improve the diagnosis process. The first SVM was used to remove the non-relevant features and the second to predict heart disease. The results revealed that the proposed model achieved better performance than other models and previous study results. Mohan et al. (2019) [21] introduced a hybrid RF with a linear model (HRFLM) to enhance the performance of the HDPM. They found that the proposed method achieved accuracy, precision, sensitivity, f-measure and specificity of up to 88.4%, 90.1%, 92.8%, 90%, and 82.6%, respectively. Recently, Gupta et al. (2020) [18] developed a machine intelligence framework consisting of factor analysis of mixed data (FAMD) and RF-based MLA. The FAMD was used to find the relevant features and the RF to predict the disease. The experimental results showed that the proposed method outperformed other models and previous study results by achieving accuracy, sensitivity, and specificity of up to 93.44%, 89.28%, and 96.96%, respectively.
None of the aforementioned previous studies have applied outlier detection and data balancing methods to improve the accuracy of the classification model, especially for the case of heart disease datasets. Thus, in this study we used outlier detection and data balancing methods to improve the model performance. In addition, the XGBoost classifier is then used to learn and generate the prediction model. We expect that our proposed model will achieve higher performance than that of state-of-the-art models and previous study results. Finally, we also design and develop the HDCDSS to help doctors/clinicians diagnose the patients’/subjects’ heart disease status based on their current condition. Thus, early treatment could be conducted to prevent the risks further.

FIGURE 1. The proposed Heart Disease Prediction Model (HDPM) for the Heart Disease Clinical Decision Support System (HDCDSS).

III. MATERIALS AND METHODS
The proposed HDPM was developed to provide high-performance prediction of the presence or absence of heart disease given the current condition of the subjects. The flow-chart in Figure 1 shows how the proposed HDPM is developed. First, the heart disease datasets are collected. Second, the data pre-processing for data transformation and feature selection is conducted. Third, the DBSCAN-based outlier detection method is applied to find the outlier data given the optimal parameter. Fourth, the detected outlier data are then removed from the training dataset. Fifth, the data balancing based on the SMOTE-ENN method is used to balance the training dataset. Sixth, the XGBoost-based MLA is used to learn from the training dataset and generate the HDPM. Finally, the performance metrics are presented to evaluate the performance of the proposed model and the generated HDPM is then implemented within the CDSS. In our study, we utilized the 10-fold cross-validation method to avoid overfitting. Cross-validation allows the models to learn from different sets of training data by repeated sampling, hence maximizing the data used for validation and possibly helping to prevent overfitting. Previous studies have demonstrated that 10-fold cross-validation can be used to maintain the bias-variance trade-off, which eventually provides a generalized model and protects against overfitting [33], [34].
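The evaluation loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `k_fold_indices`, `cross_validate`, the `prepare_train` hook, and the toy `fit_threshold` learner are hypothetical names introduced here. The hook stands in for DBSCAN outlier removal followed by SMOTE-ENN balancing, and the sketch assumes both are applied to the training part of each fold only, as the references to "the training dataset" in the steps above suggest. The toy learner merely takes the place of XGBoost so that the example runs without third-party packages.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]           # (feature vector, 0/1 label)
Predictor = Callable[[Sequence[float]], int]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 once and deal the indices into k near-equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(
    data: Sequence[Sample],
    fit: Callable[[List[Sample]], Predictor],
    prepare_train: Optional[Callable[[List[Sample]], List[Sample]]] = None,
    k: int = 10,
    seed: int = 0,
) -> float:
    """Return the mean accuracy over k folds.

    prepare_train is a hypothetical hook standing in for outlier removal
    and class balancing; it only ever sees the training part of a fold.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        if prepare_train is not None:
            train = prepare_train(train)
        predict = fit(train)
        hits = sum(1 for x, y in test if predict(x) == y)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


def fit_threshold(train: List[Sample]) -> Predictor:
    """Toy stand-in for XGBoost: cut feature 0 midway between class means."""
    pos = [x[0] for x, y in train if y == 1]
    neg = [x[0] for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else 0


# Two well-separated toy classes (made-up values, not heart disease data).
toy = [((float(v),), 0) for v in range(50)]
toy += [((float(v),), 1) for v in range(100, 150)]
print(cross_validate(toy, fit_threshold))
```

Keeping the preparation step inside the loop means that held-out subjects are never removed as outliers or mixed with synthetic samples, so the reported accuracy is measured on unmodified data.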

TABLE 1. The detailed dataset attributes description and distribution (mean and standard deviation (STD)) for dataset I (Statlog).
The detailed steps, including datasets and modules descriptions, and the performance metrics are presented in the following subsections. In addition, the performance of the proposed model is compared with that of state-of-the-art models and the results are presented in the results and discussion section. Finally, we ensure the applicability of the proposed model by embedding the HDPM into the HDCDSS to diagnose the subjects’ heart disease status based on their current condition.

A. HEART DISEASE DATASET
We used two heart disease datasets (Statlog and Cleveland; termed datasets I and II, respectively) to investigate how heart disease can be identified by applying the machine learning model. The proposed model is then applied to those two datasets with the expectation of providing a general and robust HDPM.
The University of California Irvine (UCI) Repository Statlog Heart Disease database website presents dataset I to investigate heart disease [29]. The original dataset consists of 270 subjects, 13 attributes and one output class (120 and 150 subjects are labelled with the presence (positive class) and absence (negative class) of heart disease, respectively). There are no missing values in dataset I. A detailed attributes description (including data type and range) and distribution (mean and standard deviation (STD)) for dataset I are given in Table 1.
Dr. Robert Detrano, M.D., provided dataset II (Cleveland Heart Disease dataset) to investigate heart disease that was collected from the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation in California, United States [30]. The original dataset comprises 303 subjects and 79 raw attributes, although only 13 attributes are used, and one attribute as an output class. We removed 6 subjects’ data due to missing values and used the remaining 297 data in the pre-processing stage. The original class value is a multi-class variable with the value range from 0 to 4. The 0 value is used to represent the absence of heart disease while the values from 1 to 4 are used to represent the presence of heart disease with its stage condition. In this study, we followed previous studies [16]–[21], [32] in converting the class value from a multi-class variable to a binary-class variable. The final class variable is set to 0 if heart disease is not present in the subject and to 1 for all the subjects who have been diagnosed as having heart disease. We pre-processed the data by applying the previous rule to the records. Finally, after data pre-processing, the final dataset II consists of 297 subjects with 137 and 160 subjects being labelled with the presence (positive class) and absence (negative class) of heart disease, respectively. A detailed attributes description (including data type and range) and distribution (mean and STD) for dataset II is given in Table 2.
For both datasets, the absence and presence of heart disease are treated as negative (0) and positive (1), respectively. The correlation between attributes can affect the performance of the machine learning model. Pearson’s Correlation Coefficient (PCC) can be used as a calculation tool to determine the relationship between attributes. PCC varies from −1 to +1, with values close to +1 and −1 indicating a highly positive and highly negative correlation between the variables, respectively, and a value close to zero indicating a low correlation between them. The heatmaps of the correlation between attributes for datasets I and II are given in Figure 2(a) and 2(b), respectively. The gray color

indicates that the correlation is close to 0, while the red and blue colors indicate that the correlation between variables is close to +1 and −1, respectively. The attributes chol and fbs are seen to have a correlation that is close to 0 toward the attribute class, which suggests that both have only a small or even no correlation with the attribute class. Thus, we could possibly remove these features to improve the performance of our proposed model.

TABLE 2. The detailed dataset attributes description and distribution (mean and standard deviation (STD)) for dataset II (Cleveland).
FIGURE 2. Heatmap of attributes correlation for (a) dataset I (Statlog) and (b) dataset II (Cleveland).

In addition, we applied attribute selection by using the Information Gain (IG) method [35] in Weka V3.8 [36] to select the most important attributes to improve the model performance for the two datasets [37], [38]. Figure 3(a) and 3(b) show the attribute significance score based on the IG method for datasets I and II, respectively. In this case, both datasets have the same lowest-scoring attributes (chol, trestbps, and fbs), which we therefore removed from both datasets, and used the remaining attributes (age, sex, cp, restecg, thalach, exang, oldpeak, slope, ca, and thal) for further analysis. We expect that by using the two datasets and the selected attributes, our proposed model will be

sufficiently robust for predicting heart disease with high performance.

FIGURE 3. Attribute significance score provided by the Information Gain (IG) method for (a) dataset I (Statlog) and (b) dataset II (Cleveland).
FIGURE 4. An illustration of (a) eps, core, border and outlier point and (b) DBSCAN cluster model with MinPts = 5.

B. DBSCAN-BASED OUTLIER DATA DETECTION AND REMOVAL
In this study, we utilized DBSCAN [39] to cluster and detect the outliers from both training datasets. The goal of DBSCAN is to find the dense regions, which can be identified by the number of objects that are close to a specific point (core point); the points that are outside the regions are treated as outliers. In general, two parameters need to be determined for DBSCAN: epsilon (eps) and minimum points (MinPts). The eps is defined as the neighborhood radius around a point x (ε-neighborhood) while the MinPts is defined as the minimum number of neighboring data points within the eps. Three types of points are used to distinguish normal from outlier data: core points, border points, and outlier points. A “core point” x is any point that has a number of neighboring data points greater than or equal to MinPts. A “border point” y is a point that has fewer than MinPts neighboring data points but lies within the neighborhood of a core point x. Finally, an “outlier point” z is a point that is neither a core point nor a border point. Figure 4(a) illustrates eps, core point x, border point y, and outlier point z using MinPts = 5. As can be seen in Figure 4(b), the points B and C are border points, A is a core point, and N is a noise point. Arrows indicate direct density reachability. Points B and C are density connected, because both are density reachable from point A. N is not density reachable and does not belong to any cluster (with MinPts = 5), and is thus considered to be a noise point or outlier. First, the algorithm checks whether a specific point (any point) is a core point. A point is a core point if at least MinPts points are within the eps of it. The border points are the points that can be reached from a core point (within distance eps from the core point). Next, the core and border points form a cluster and are marked as visited points by the algorithm. Finally, the algorithm keeps iterating to check other unvisited point

(to be considered as core point) to find the unvisited border points. The points that do not belong to the clusters are considered as outliers. The detailed pseudocode for DBSCAN is presented in Algorithm 1.
The optimal eps value is calculated by averaging the distance of every point to its kNN. The value of k corresponds to the MinPts value, which is defined by the user. In this study, we followed previous studies [40]–[43] to utilize 5-nearest neighbors (5-NN) to find the optimal eps value. Most of the previous studies utilized MinPts = 5 and optimized their eps value based on MinPts. Finally, according to Ester et al. (1996) [39], the eps can be obtained by presenting a k-dist graph. First, k-distances are visualized as a k-dist graph and shown in ascending order to find the “knee” value, where a sharp change appears along the k-distance curve, for the optimal eps value estimation. We implemented the calculation of kNNs and DBSCAN in R programming V3.5.1 and used R packages such as fpc V2.2-2 and DBSCAN V1.1-3.

FIGURE 5. Optimal eps value using 5-NN and DBSCAN outlier detection result for datasets I (Statlog) (a), (b) and II (Cleveland) (c), (d), respectively.

Figure 5(a) and (c) show the sorted 5-NN distribution graph and optimal eps value for datasets I and II, respectively. We found that the “knee” appears at around the distance of 9 and 8 for datasets I and II, respectively. Furthermore, we applied the DBSCAN method by using MinPts = 5, eps = 9 and MinPts = 5, eps = 8 for datasets I and II, respectively. Figure 5(b) and (d) show the results of DBSCAN implementation for datasets I and II visualized in two-dimensional graphs. The results showed that in both datasets, DBSCAN clustered the data into a single cluster (cluster 1) and the un-clustered data (with x symbol) are treated as outliers (see Figure 5(b) and (d)). The optimal parameters and the final outlier data for both datasets are presented in Table 3. Finally, we removed all the detected outlier data in each training dataset and used the remaining normal data for further analysis. In addition, we performed experimental analysis to find the impact of outlier removal on the performance of the model. Figure 6 shows the impact of outlier data elimination based on DBSCAN as compared to original data. Outlier removal based on DBSCAN significantly improved the model accuracy for all datasets, from accuracy 80.74%, 80.03% to 85.41%, 85.26% for dataset I, and II,

respectively, with average improvement as much as 4.95%. Furthermore, previous studies [22]–[24] also revealed that removing outlier data improved the prediction accuracy.

TABLE 3. The parameters and result of DBSCAN-based outlier detection.
FIGURE 6. Impact of DBSCAN-based outlier elimination on model accuracy.

Algorithm 1 DBSCAN Pseudocode
Input: dataset, D; minimum point, minPts; radius, eps
Output: clustered C and un-clustered data UC
for each sample point SP in dataset D do
    if SP is not visited then
        mark SP as visited
        neighborPts ← sample points in ε-neighborhood of SP
        if sizeof(neighborPts) < minPts then
            mark SP as UC
        end
        else
            add SP to new cluster C
            for each sample point SP’ in neighborPts do
                if SP’ is not visited then
                    mark SP’ as visited
                    neighborPts’ ← sample points in ε-neighborhood of SP’
                    if sizeof(neighborPts’) ≥ minPts then
                        neighborPts ← neighborPts + neighborPts’
                    end
                end
                if SP’ is not a member of any cluster then
                    add SP’ to cluster C
                end
            end
        end
    end
end

C. SMOTE-ENN-BASED DATA BALANCING
Data sampling or data balancing is a common method comprised of three subcategories, over-sampling, under-sampling, and hybrid method, and is used in machine learning to deal with imbalanced data. Figure 7 illustrates the three subcategories of data balancing methods. The over-sampling method balances the training data by generating data samples for the minority class while the under-sampling achieves that goal by eliminating the data samples in the majority class. Meanwhile, the hybrid method achieves the balanced data by combining the over-sampling and under-sampling methods.
We used a hybrid SMOTE-ENN [25] method to balance the imbalanced heart disease training datasets. In general, SMOTE is used to over-sample the minority class until the training dataset is balanced, then the Edited Nearest Neighbor (ENN) is used to eliminate the unwanted overlapping samples between two classes while maintaining the balanced distributions. The pseudocode of SMOTE-ENN is explained in Algorithm 2. Previous studies have shown that the combination of SMOTE and ENN (SMOTE-ENN) provides better performance than either alone [25], [26]. For all datasets, the minority and majority classes are the subjects who were diagnosed with the presence (positive class) and absence (negative class) of heart disease, respectively. The original percentages of the minority class over the total number of subjects for datasets I and II are 44.19% and 46.05%, respectively. The SMOTE technique was applied to increase the number of minority class samples by randomly generating new samples from the NNs of each minority class sample. Then the ENN was used to remove the unwanted overlapping samples. After SMOTE-ENN implementation, the total number of minority class samples increases, and the updated percentage of minority class for datasets I and II becomes more balanced, at 50.79% and 49.5%, respectively. We utilized Python V3.6.5 and the Imbalanced-learn python library V0.4.3 [44] to implement SMOTE-ENN, producing evenly balanced class distributions (see Table 4).

TABLE 4. SMOTE-ENN data balancing results.

The SMOTE-ENN ensures that when creating the new artificial samples and eliminating the overlapped samples, it will follow the distribution pattern from the original samples. Figure 8 shows the data distribution of attributes “age” and “thalach” before and after SMOTE-ENN implementation for all training datasets. For each dataset, the distributions of attributes “age” and “thalach” follow the normal distribution pattern. The SMOTE-ENN implementation keeps the original data distribution pattern of dataset I, as shown

in Figure 8(b), such that the updated dataset I retains a similar pattern of data distribution (normal distribution). Dataset II exhibited a similar distribution pattern in the original dataset (Figure 8(c)) and in the updated dataset after SMOTE-ENN implementation (see Figure 8(d)). In general, the purpose of the HDPM is to minimize the errors during learning; thus, we expect that the HDPM performance can be enhanced from the balanced training datasets.

Algorithm 2 SMOTE-ENN Pseudocode
Input: data, D
Output: balanced data, BD
1: foreach data point mp_i in the minority class of data D do
2:     Compute the k-nearest neighbors K_mpi of mp_i
3:     Generate a new synthetic data point mp_new = mp_i + (m̂p_i − mp_i) × δ, where m̂p_i is a neighbor drawn from K_mpi and δ is a random number in [0, 1]
4:     Add mp_new to D with the class of mp_i
5: end for
6: foreach data point p_i in data D do
7:     if class of p_i ≠ majority class of its k-nearest neighbors then
8:         Remove p_i from D
9:     end if
10: end for
11: return BD

FIGURE 7. Impact of DBSCAN-based outlier elimination on model accuracy.

D. XGBOOST-BASED MACHINE LEARNING ALGORITHM (MLA) AND EVALUATION METRICS
After we balanced the training datasets, the MLA is used to learn and generate the HDPM. We used the extreme gradient boosting (XGBoost) algorithm to detect the presence or absence of heart disease. XGBoost is a type of supervised machine learning used for classification and regression modelling [45]. XGBoost is an enhanced algorithm based on the implementation of gradient boosting DTs with several modifications in terms of regularization, loss function and column sampling. Gradient boosting is a technique in which new models are created and used to predict the error or residuals, after which the scores are summed to get the final prediction result. The gradient descent method is used to minimize the loss score when new models are created. The objective function used to measure the model performance consists of two parts: training loss and regularization. The regularization term penalizes the complexity of the model and prevents overfitting. The objective function (loss function and regularization) can be presented as follows.

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \text{where} \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \qquad (1)$$

The term l here is the differentiable convex loss function that calculates the difference between the prediction ŷ_i and the target y_i. The regularization term Ω penalizes the complexity of the model, and T represents the number of leaves in the tree. Furthermore, each f_k corresponds to an independent tree structure q and leaf weight w. Finally, the term γ corresponds to the threshold used for pre-pruning during optimization to limit the growth of the tree, and λ is used to smooth the final learned weights to prevent overfitting.
We implemented XGBoost using the XGBoost V0.81 python library. The outlier data from heart disease training datasets are eliminated by using the DBSCAN method, and SMOTE-ENN is used to balance the training dataset. Finally, XGBoost is used to learn from the training dataset and generate the HDPM. We measured five performance metrics to compare the performance of the proposed model with that of state-of-the-art models and previous study results. In addition, we ensured the applicability of the proposed model by implementing the model into the HDCDSS to diagnose the subjects based on their current condition.
We used five performance metrics to evaluate the performance of the proposed model. A confusion matrix was used to measure four different potential outputs from the model: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP and TN outputs are defined as the number of subjects correctly classified as “positive” (presence of heart disease) and “negative” (healthy/absence of heart disease), respectively, and FP and FN outputs as the number of subjects incorrectly classified as “positive” (presence of heart disease) when they are actually “negative” (healthy/absence of heart disease) and incorrectly classified as “negative” (healthy/absence of heart disease) when they are actually “positive” (presence of heart disease), respectively. We employed 10-fold cross validation to generate the models for all classification models, with
133042 VOLUME 8, 2020

FIGURE 8. Data distribution of attributes "age" and "thalach" before and after SMOTE-ENN implementation for dataset I (Statlog) (a), (b) and dataset II (Cleveland) (c), (d), respectively.
the final performance metric being the average. We implemented all the classification models in Python V3.6.5 using three libraries: sklearn V0.20.2, imbalanced-learn V0.4.3, and XGBoost V0.81. We performed the experiments on a computer with an Intel Core i7-4790 (3.60 GHz × 8 cores) and 16 GB RAM running Windows 10 Pro 64-bit. The sklearn library is an open-source Python tool for machine learning, the imbalanced-learn library is an open-source Python toolbox that provides several methods for dealing with imbalanced data, and the XGBoost library is an open-source tool that implements the XGBoost algorithm in several programming languages, including Python. To simplify the implementation of the experiments, we used the default parameters provided by sklearn, imbalanced-learn, and XGBoost. In addition, the following five performance metrics are measured. Accuracy (acc) is calculated as

$$\mathrm{acc} = \frac{TP + TN}{TP + FN + FP + TN}, \tag{2}$$

precision (pre) is calculated as

$$\mathrm{pre} = \frac{TP}{TP + FP}, \tag{3}$$

recall/sensitivity/true positive rate (rec/sen/TPR) is calculated as

$$\mathrm{rec} = \frac{TP}{TP + FN}, \tag{4}$$
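Eqs. (2)-(4) can be checked numerically from the four confusion-matrix counts. The following is a minimal sketch; the function name and the example counts are hypothetical and are not taken from the reported experiments.

```python
def basic_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)          # Eq. (2)
    pre = tp / (tp + fp) if (tp + fp) else 0.0     # Eq. (3), guarded against 0/0
    rec = tp / (tp + fn) if (tp + fn) else 0.0     # Eq. (4), guarded against 0/0
    return acc, pre, rec

# Hypothetical counts for illustration only.
acc, pre, rec = basic_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"acc={acc:.4f} pre={pre:.4f} rec={rec:.4f}")
```

The zero-denominator guards are a design choice of this sketch: they return 0.0 when a model predicts no positives or when the data contain no positive subjects, instead of raising an error.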

TABLE 5. Performance evaluation for dataset I (Statlog).
F-measure (f) is calculated as

$$f = \frac{2pr}{p + r}, \tag{5}$$

where p and r denote the precision and recall, and MCC is calculated as

$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \tag{6}$$

The value of MCC ranges from −1 to +1 and represents the performance of the classification model. The best model is achieved when the value of MCC is close or equal to +1, while the worst model has a value close or equal to −1. In addition, we also used the value of the area under the receiver operating characteristic curve (AUC) to compare the performance of the proposed model with that of other existing models. For the given k training data, the AUC can be calculated as [46], [47]

$$\mathrm{AUC}(x^{+}, x^{-}) = \frac{1}{k^{+} k^{-}} \sum_{i=1}^{k^{+}} \sum_{j=1}^{k^{-}} \mathbf{1}_{h(x_i^{+}) > h(x_j^{-})}, \tag{7}$$

where the term $\mathbf{1}_{h(x_i^{+}) > h(x_j^{-})}$ corresponds to a '1' when $h(x_i^{+}) > h(x_j^{-})$, $\forall i = 1, 2, \ldots, k^{+}$, $\forall j = 1, 2, \ldots, k^{-}$, and '0' otherwise. The best model is achieved when the value of AUC is close or equal to 1. Additionally, we presented several additional metrics to measure the performance of the model, namely the false positive rate (FPR), false negative rate (FNR), and true negative rate (TNR). The FPR represents the false alarm rate, in which a positive prediction result (presence of heart disease) is given when the actual value is negative (absence of heart disease). The FPR can be calculated as

$$\mathrm{FPR} = \frac{FP}{FP + TN}. \tag{8}$$

We used the FNR to represent the miss rate, which is the probability that an actual positive case will be missed by the test. The FNR can be calculated as

$$\mathrm{FNR} = \frac{FN}{FN + TP}. \tag{9}$$

Finally, the TNR or specificity is used to show the probability that the actual negative subjects will test negative. The TNR can be calculated as

$$\mathrm{TNR} = \frac{TN}{TN + FP}. \tag{10}$$

IV. RESULTS AND DISCUSSIONS
A. PERFORMANCE EVALUATION OF PROPOSED HDPM
The proposed HDPM was applied to both datasets and showed positive results, increasing the prediction accuracy as compared to other models. For comparison, we selected six state-of-the-art MLAs (NB, LR, MLP, SVM, DT, and RF) that have been widely used in the research community and have a proven track record for accuracy and efficiency. We performed 10-fold cross-validation for all models and collected eight performance metrics: accuracy (acc), precision (pre), recall/sensitivity/true positive rate (rec/sen/TPR), f-measure (f), MCC, false positive rate (FPR), false negative rate (FNR), and true negative rate (TNR). The findings revealed that the proposed model outperformed other models by achieving acc, pre, rec/sen, and f of up to 95.90%, 97.14%, 94.67%, and 95.35% for dataset I and 98.40%, 98.57%, 98.33%, and 98.32% for dataset II, respectively. In terms of MCC, the proposed HDPM achieved the highest MCC values of up to 0.92 and 0.97 for datasets I and II, respectively, which confirms the superiority of our proposed model relative to other models. In addition, in terms of false positive rate (FPR) and true negative rate (TNR), the results revealed that the proposed model achieved the lowest FPR and the highest TNR as compared with other models. The proposed model achieved an FPR and TNR of 4.52% and 95.48% for dataset I and 1.67% and 98.33% for dataset II, respectively. The low FPR and high TNR values of the proposed model represent the capability of the HDPM to minimize false alarms and optimize prediction accuracy for both negative and positive subjects. The detailed performance results are presented in Tables 5 and 6 for datasets I and II, respectively.

We further investigated the performance of the proposed HDPM using a receiver operating characteristic (ROC) curve visualization, since a previous study [48] has used it to evaluate and illustrate the diagnostic capability as its threshold is changed. The ROC curve consists of the TP rate as the y-axis and FP rate as the x-axis with the area under the ROC curve (AUC) being calculated to show the performance of the

TABLE 6. Performance evaluation for dataset II (Cleveland).
FIGURE 9. ROC curve visualization to compare the proposed model with other models for datasets (a) I (Statlog) and (b) II (Cleveland).
model. The best model is achieved when the value of AUC is close or equal to 1. Figure 9 shows that the proposed HDPM achieved a higher AUC score than the other models, of up to 1.00 and 1.00 for datasets I and II, respectively, which confirmed that the proposed model outperformed other state-of-the-art models.

In addition, we followed a previous study [49] to evaluate the performance of the model using statistical significance testing to establish the significance of our proposed HDPM as compared with other state-of-the-art models. The paired t-test [50], [51] was applied to statistically test the significance of the difference between the proposed HDPM and other state-of-the-art models. We defined h = 0, i.e., the null hypothesis, as there being no significant difference between the proposed HDPM and other existing models. We performed 10-fold cross validation to collect ten accuracy values for each model in Python V3.6.5 and applied the paired t-test using the Scipy V1.2.0 library. We defined the significance level = 0.05, t (tabulated) = 2.78 and collected the h, p-value, and t (calculated) values for all datasets. The null hypothesis is accepted when the paired t-test returns h = 0 and rejected if h = 1, which indicates a significant difference between the proposed HDPM and the existing model. Rejection is supported when the p-value is less than the significance level (0.05) and t (calculated) is greater than t (tabulated). Table 7 shows the paired t-test results for both datasets: the proposed HDPM is significantly different from the other models since, for all datasets, h = 1, p-value < significance level, and t (calculated) > t (tabulated). Therefore, the proposed model differs significantly from the other state-of-the-art models.

B. BENCHMARK WITH PREVIOUS STUDY RESULTS
In this section, we performed a comparison study of our proposed HDPM with the results from previous studies. It should

TABLE 7. The results of paired t-test for datasets I (Statlog) and II (Cleveland).
TABLE 8. Benchmark with previous study results for dataset I (Statlog).
TABLE 9. Benchmark with previous study results for dataset II (Cleveland).
be noted that since we utilized the same datasets, we directly took the results from previous studies without implementing their techniques. The detailed comparison results with previous studies for datasets I and II are given in Tables 8 and 9, respectively.

Previous studies have utilized the Statlog dataset to generate machine learning models for diagnosing heart disease. Long et al. (2015) [13] proposed the CFARS-AR and achieved acc = 88.3% and rec/sen = 84.9%. Nahato et al. (2015) [14] used the rough set method with RS-BPNN and achieved acc = 90.40%, rec/sen = 94.67%, and AUC = 0.92. Dwivedi (2018) [31] used LR and achieved acc = 85%, pre = 85%, rec/sen = 89%, and f = 87%. Amin et al. (2019) [32] utilized the voting method with NB and LR and achieved acc = 87.41%. The proposed HDPM achieved acc = 95.90%, pre = 97.14%, rec/sen = 94.67%, f = 95.35%, MCC = 0.92, and AUC = 1.00. In terms of accuracy, the proposed HDPM achieved the highest accuracy with an average improvement of 8.12% as compared with previous study results. Overall, we can conclude that our proposed method outperformed all the previous study results in terms of accuracy, f-measure, MCC, and AUC.

In addition, several researchers have also used the Cleveland dataset to predict heart disease. Verma et al. (2016) [15] developed a hybrid model with CFS selection, PSO, K-means clustering, and MLP and achieved acc = 90.28%. Haq et al. (2018) [16] proposed a hybrid system using Relief-based feature selection and LR, and achieved acc = 89%, rec/sen = 77%, MCC = 0.89, and AUC = 0.88. Saqlain et al. (2019) [17] used MFSFSA and SVM and achieved acc = 81.19%, rec/sen = 72.92%, and MCC = 0.85. Latha and Jeeva (2019) [19] used majority voting with NB, BN, RF, and MLP and achieved acc = 85.48%. Ali et al. (2019) [20]

FIGURE 10. Heart Disease Clinical Decision Support System (HDCDSS) (a) architecture framework, (b) diagnosis form, and (c) diagnosis result.
used stacked SVMs and achieved acc = 92.22%, rec/sen = 82.92%, and MCC = 0.85. Mohan et al. (2019) [21] developed HRFLM and achieved acc = 88.4%, pre = 90.1%, rec/sen = 92.8%, and f = 90%. Gupta et al. (2020) [18] utilized FAMD-based feature extraction and the RF algorithm and achieved acc = 93.44%, rec/sen = 89.28%, f = 92.59%, MCC = 0.87, and AUC = 0.93. Finally, the proposed HDPM achieved acc = 98.40%, pre = 98.57%, rec/sen = 98.33%, f = 98.32%, MCC = 0.97, and AUC = 1.00. In terms of accuracy, the proposed HDPM achieved the highest accuracy with an average improvement of 9.83% as compared with previous study results. Overall, we can conclude that our proposed method outperformed all the previous study results in all six performance metrics (acc, pre, rec/sen, f, MCC, and AUC).

It should be noted that a direct comparison of the presented results is not fair since they have been derived by different data pre-processing and training/testing approaches. In addition, the prediction model performance depends on several factors such as feature selection, data type and size, noise filtering, hyperparameters, data sampling, model selection, etc. Therefore, the comparisons presented in Tables 8 and 9 cannot be used as the main evidence to conclude the performance of the given prediction models; they serve only as a general comparison between the proposed HDPM and previous studies.

V. APPLICATION FOR THE HEART DISEASE CLINICAL DECISION SUPPORT SYSTEM (HDCDSS)
The prototype of the web-based Heart Disease Clinical Decision Support System (HDCDSS) was developed to provide a simple and convenient way for medical clinicians to diagnose subjects/patients based on their current condition.
The HDCDSS was developed in Python V3.6.5 by utilizing Flask V1.0.2 as a Python Web Server Gateway Interface (WSGI) web framework with Bootstrap V3.3.7 for data representation, while the proposed HDPM was loaded using Joblib V0.14.1 and XGBoost V0.81. The patients' data and the prediction results were stored in MongoDB by using Pymongo V3.7.1. MongoDB was selected since it has been widely adopted in the healthcare field [52], [53]. As illustrated in Figure 10(a), clinicians can access the HDCDSS through their web browser in the same local network since the medical data are confidential information and cannot be stored in the cloud. The personal data such as patient id (id), age, and gender are then combined with the diagnosis data, such as resting electrocardiographic result (restecg), maximum heart rate (thalach), exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (0-3) colored by fluoroscopy (ca), and defect type (thal), and then transmitted to a secure web server through an application programming interface (API) and stored in a database. The proposed HDPM generated from datasets I (Statlog) and II (Cleveland) is then used to predict the subjects' heart disease status based on the inputted data, and the prediction result is then sent back to the HDCDSS's diagnosis result interface.

Figure 10(b) shows the HDCDSS diagnosis form in which clinicians can fill out the patients' information, including their current condition. Once all the input fields are filled, the user can press the "diagnose" button to send all the data to the secure web server, which loads the trained proposed HDPM to diagnose the subjects' heart disease status. Figure 10(c) shows the diagnosis result interface after sending the data to the web server. The result includes the previously submitted data and the status (presence or absence) of heart disease. The developed HDCDSS is expected to help clinicians diagnose patients and improve heart disease clinical decision making effectively and efficiently. Therefore, early treatment could be conducted to prevent the deaths caused by late heart disease diagnosis. This prototype/demonstration is limited to the specific datasets; therefore, the trained prediction model cannot be applied to patients/subjects from other demographics. Once we have collected more complex datasets, the predictive performance for wider demographic patients/subjects could be improved. In addition, we have not applied the developed model in a clinical trial due to the limitation of the dataset. In our case, we have used datasets based on a specific patient demographic (USA). A clinical trial could be applied to our model once we gather data from another patient demographic (for example in Korea), which is beyond the scope of our current study.

VI. CONCLUSION
We proposed an effective heart disease prediction model (HDPM) for heart disease diagnosis by integrating DBSCAN, SMOTE-ENN, and XGBoost-based MLA to improve prediction accuracy. DBSCAN was applied to detect and remove the outlier data, SMOTE-ENN was used to balance the unbalanced training dataset, and the XGBoost MLA was adopted to learn and generate the prediction model. Two publicly available heart disease datasets were utilized to produce the generalized prediction model. We performed an evaluation analysis of our proposed model against other classification models and the results from previous studies. In addition, we presented a statistical evaluation to confirm the significance of our model as compared to other models. The experimental results confirmed that the proposed model achieved better performance than state-of-the-art models and previous study results, achieving an accuracy of up to 95.90% and 98.40% for datasets I and II, respectively. In addition, the statistical analysis result also showed a significant improvement for the proposed model as compared with the other models.

Furthermore, we also designed and developed the proposed HDPM into the Heart Disease Clinical Decision Support System (HDCDSS) to diagnose the subjects'/patients' heart disease status effectively and efficiently. The HDCDSS gathered the patient data combined with other diagnosis data and transmitted them to a secure web server. All the transmitted diagnosis data were then stored in MongoDB, which can effectively provide timely responses with rapidly increasing medical data. The proposed HDPM was then loaded to diagnose the patients' current heart disease status, which was later sent back to the HDCDSS's diagnosis result interface. Thus, the developed HDCDSS is expected to help clinicians diagnose patients and improve heart disease clinical decision making effectively and efficiently. Finally, the overall HDCDSS designed and developed in this study can be used as a practical guideline for healthcare practitioners.

In the future, we will consider the comparison of other data sampling methods with the model hyper-parameters and broader medical datasets. In addition, a comparison and analysis study with different outlier detection methods could be further investigated. Furthermore, with the increasing concerns about privacy, security, and time-sensitive applications, edge computing and edge device concepts could be further studied with the goal of improving the medical clinical decision support system. In this study, we have not yet obtained any feedback from heart specialists. In the future, once a specific demographic dataset (from Korea) is collected, the comments from local heart specialists verifying the dataset and prediction model could be presented.

ACKNOWLEDGMENT
This article is a tribute made with deep respect to a wonderful person, friend, advisor, and supervisor, Yong-Han Lee (1965–2017).

REFERENCES
[1] World Health Organization. (2017). Cardiovascular Diseases (CVDs). [Online]. Available: https://www.who.int/health-topics/cardiovascular-diseases/
[2] E. J. Benjamin et al., "Heart disease and stroke statistics—2019 update: A report from the American heart association," Circulation, vol. 139, no. 10, pp. e56–e528, Mar. 2019, doi: 10.1161/CIR.0000000000000659.

[3] Statistics Korea. (2018). Causes of Death Statistics in 2018. [Online]. Available: http://kostat.go.kr/portal/eng/pressReleases/8/10/index.board?bmode=read&bSeq=&aSeq=378787
[4] World Health Organization. (2017). Cardiovascular Diseases (CVDs). [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[5] P. Greenland, J. S. Alpert, G. A. Beller, E. J. Benjamin, M. J. Budoff, Z. A. Fayad, E. Foster, M. A. Hlatky, J. M. Hodgson, F. G. Kushner, M. S. Lauer, L. J. Shaw, S. C. Smith, A. J. Taylor, W. S. Weintraub, and N. K. Wenger, "2010 ACCF/AHA guideline for assessment of cardiovascular risk in asymptomatic adults: A report of the American college of cardiology foundation/American heart association task force on practice guidelines," Circulation, vol. 122, no. 25, pp. e584–e636, Dec. 2010, doi: 10.1161/CIR.0b013e3182051b4c.
[6] J. Perk et al., "European guidelines on cardiovascular disease prevention in clinical practice (version 2012): The fifth joint task force of the European society of cardiology and other societies on cardiovascular disease prevention in clinical practice (constituted by representatives of nine societies and by invited experts) ∗ developed with the special contribution of the European association for cardiovascular prevention & rehabilitation (EACPR)," Eur. Heart J., vol. 33, no. 13, pp. 1635–1701, Jul. 2012, doi: 10.1093/eurheartj/ehs092.
[7] G.-M. Park and Y.-H. Kim, "Model for predicting cardiovascular disease: Insights from a Korean cardiovascular risk model," Pulse, vol. 3, no. 2, pp. 153–157, 2015, doi: 10.1159/000438683.
[8] G. J. Njie, K. K. Proia, A. B. Thota, R. K. C. Finnie, D. P. Hopkins, S. M. Banks, D. B. Callahan, N. P. Pronk, K. J. Rask, D. T. Lackland, and T. E. Kottke, "Clinical decision support systems and prevention," Amer. J. Preventive Med., vol. 49, no. 5, pp. 784–795, Nov. 2015, doi: 10.1016/j.amepre.2015.04.006.
[9] V. Sintchenko, E. Coiera, J. R. Iredell, and G. L. Gilbert, "Comparative impact of guidelines, clinical data, and decision support on prescribing decisions: An interactive Web experiment with simulated cases," J. Amer. Med. Inform. Assoc., vol. 11, no. 1, pp. 71–77, Jan. 2004, doi: 10.1197/jamia.M1166.
[10] K. Kawamoto, C. A. Houlihan, E. A. Balas, and D. F. Lobach, "Improving clinical practice using clinical decision support systems: A systematic review of trials to identify features critical to success," BMJ, vol. 330, no. 7494, p. 765, Apr. 2005, doi: 10.1136/bmj.38398.500764.8F.
[11] H. B. Bosworth, M. K. Olsen, T. Dudley, M. Orr, M. K. Goldstein, S. K. Datta, F. McCant, P. Gentry, D. L. Simel, and E. Z. Oddone, "Patient education and provider decision support to control blood pressure in primary care: A cluster randomized trial," Amer. Heart J., vol. 157, no. 3, pp. 450–456, Mar. 2009, doi: 10.1016/j.ahj.2008.11.003.
[12] J. Hunt, J. Siemienczuk, W. Gillanders, B. LeBlanc, Y. Rozenfeld, K. Bonin, and G. Pape, "The impact of a physician-directed health information technology system on diabetes outcomes in primary care: A pre- and post-implementation study," J. Innov. Health Inform., vol. 17, no. 3, pp. 165–174, Sep. 2009, doi: 10.14236/jhi.v17i3.731.
[13] N. C. Long, P. Meesad, and H. Unger, "A highly accurate firefly based algorithm for heart disease prediction," Expert Syst. Appl., vol. 42, no. 21, pp. 8221–8231, Nov. 2015, doi: 10.1016/j.eswa.2015.06.024.
[14] K. B. Nahato, K. N. Harichandran, and K. Arputharaj, "Knowledge mining from clinical datasets using rough sets and backpropagation neural network," Comput. Math. Methods Med., vol. 2015, pp. 1–13, Mar. 2015, doi: 10.1155/2015/460189.
[15] L. Verma, S. Srivastava, and P. C. Negi, "A hybrid data mining model to predict coronary artery disease cases using non-invasive clinical data," J. Med. Syst., vol. 40, no. 7, p. 178, Jul. 2016, doi: 10.1007/s10916-016-0536-z.
[16] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, "A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms," Mobile Inf. Syst., vol. 2018, pp. 1–21, Dec. 2018, doi: 10.1155/2018/3860146.
[17] S. M. Saqlain, M. Sher, F. A. Shah, I. Khan, M. U. Ashraf, M. Awais, and A. Ghani, "Fisher score and matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines," Knowl. Inf. Syst., vol. 58, no. 1, pp. 139–167, Jan. 2019, doi: 10.1007/s10115-018-1185-y.
[18] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, "MIFH: A machine intelligence framework for heart disease diagnosis," IEEE Access, vol. 8, pp. 14659–14674, 2020, doi: 10.1109/ACCESS.2019.2962755.
[19] C. B. C. Latha and S. C. Jeeva, "Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques," Inform. Med. Unlocked, vol. 16, Jan. 2019, Art. no. 100203, doi: 10.1016/j.imu.2019.100203.
[20] L. Ali, A. Niamat, J. A. Khan, N. A. Golilarz, X. Xingzhong, A. Noor, R. Nour, and S. A. C. Bukhari, "An optimized stacked support vector machines based expert system for the effective prediction of heart failure," IEEE Access, vol. 7, pp. 54007–54014, 2019, doi: 10.1109/ACCESS.2019.2909969.
[21] S. Mohan, C. Thirumalai, and G. Srivastava, "Effective heart disease prediction using hybrid machine learning techniques," IEEE Access, vol. 7, pp. 81542–81554, 2019, doi: 10.1109/ACCESS.2019.2923707.
[22] X. Liu, Q. Yang, and L. He, "A novel DBSCAN with entropy and probability for mixed data," Cluster Comput., vol. 20, no. 2, pp. 1313–1323, Jun. 2017, doi: 10.1007/s10586-017-0818-3.
[23] C.-H. Lin, K.-C. Hsu, K. R. Johnson, M. Luby, and Y. C. Fann, "Applying density-based outlier identifications using multiple datasets for validation of stroke clinical outcomes," Int. J. Med. Inform., vol. 132, Dec. 2019, Art. no. 103988, doi: 10.1016/j.ijmedinf.2019.103988.
[24] Z. H. Ismail, A. K. K. Chun, and M. I. S. Razak, "Efficient herd-outlier detection in livestock monitoring system based on density-based spatial clustering," IEEE Access, vol. 7, pp. 175062–175070, 2019, doi: 10.1109/ACCESS.2019.2952912.
[25] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 20–29, Jun. 2004, doi: 10.1145/1007730.1007735.
[26] T. Le, M. Lee, J. Park, and S. Baik, "Oversampling techniques for bankruptcy prediction: Novel features from a transaction dataset," Symmetry, vol. 10, no. 4, p. 79, Mar. 2018, doi: 10.3390/sym10040079.
[27] T. Le and S. Baik, "A robust framework for self-care problem identification for children with disability," Symmetry, vol. 11, no. 1, p. 89, Jan. 2019, doi: 10.3390/sym11010089.
[28] T. Le, M. T. Vo, B. Vo, M. Y. Lee, and S. W. Baik, "A hybrid approach using oversampling technique and cost-sensitive learning for bankruptcy prediction," Complexity, vol. 2019, pp. 1–12, Aug. 2019, doi: 10.1155/2019/8460934.
[29] Statlog (Heart) Data Set. Accessed: Oct. 2, 2019. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)
[30] Heart Disease Data Set. Accessed: Oct. 2, 2019. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[31] A. K. Dwivedi, "Performance evaluation of different machine learning techniques for prediction of heart disease," Neural Comput. Appl., vol. 29, no. 10, pp. 685–693, May 2018, doi: 10.1007/s00521-016-2604-1.
[32] M. S. Amin, Y. K. Chiam, and K. D. Varathan, "Identification of significant features and data mining techniques in predicting heart disease," Telematics Inform., vol. 36, pp. 82–93, Mar. 2019, doi: 10.1016/j.tele.2018.11.007.
[33] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. 14th Int. Joint Conf. Artif. Intell. (IJCAI), Montreal, QC, Canada, vol. 2, Aug. 1995, pp. 1137–1145. [Online]. Available: http://ijcai.org/Proceedings/95-2/Papers/016.pdf
[34] R. Kohavi and D. Wolpert, "Bias plus variance decomposition for zero-one loss functions," in Proc. 13th Int. Conf. Mach. Learn., San Francisco, CA, USA, 1996, pp. 275–283.
[35] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Diego, CA, USA: Elsevier, 2012.
[36] Weka 3: Data Mining Software in Java. Accessed: Oct. 19, 2019. [Online]. Available: https://www.cs.waikato.ac.nz/ml/weka/
[37] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artif. Intell., vol. 97, pp. 245–271, Dec. 1997, doi: 10.1016/S0004-3702(97)00063-5.
[38] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[39] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, Portland, OR, USA, 1996, pp. 226–231.
[40] M. Ijaz, G. Alfian, M. Syafrudin, and J. Rhee, "Hybrid prediction model for type 2 diabetes and hypertension using DBSCAN-based outlier detection, synthetic minority over sampling technique (SMOTE), and random forest," Appl. Sci., vol. 8, no. 8, p. 1325, Aug. 2018, doi: 10.3390/app8081325.
[41] G. Alfian, M. Syafrudin, and J. Rhee, "Real-time monitoring system using smartphone-based sensors and NoSQL database for perishable supply chain," Sustainability, vol. 9, no. 11, p. 2073, Nov. 2017, doi: 10.3390/su9112073.

[42] M. Syafrudin, G. Alfian, N. Fitriyani, and J. Rhee, "Performance analysis of IoT-based sensor, big data processing, and machine learning model for real-time monitoring system in automotive manufacturing," Sensors, vol. 18, no. 9, p. 2946, Sep. 2018, doi: 10.3390/s18092946.
[43] M. Syafrudin, N. Fitriyani, G. Alfian, and J. Rhee, "An affordable fast early warning system for edge computing in assembly line," Appl. Sci., vol. 9, no. 1, p. 84, Dec. 2018, doi: 10.3390/app9010084.
[44] G. Lemaitre, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning," J. Mach. Learn. Res., vol. 18, no. 1, pp. 559–563, Jan. 2017.
[45] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, San Francisco, CA, USA, Aug. 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
[46] C. Marrocco, R. P. W. Duin, and F. Tortorella, "Maximizing the area under the ROC curve by pairwise feature combination," Pattern Recognit., vol. 41, no. 6, pp. 1961–1974, Jun. 2008, doi: 10.1016/j.patcog.2007.11.017.
[47] K.-A. Toh, J. Kim, and S. Lee, "Maximizing area under ROC curve for biometric scores fusion," Pattern Recognit., vol. 41, no. 11, pp. 3373–3392, Nov. 2008, doi: 10.1016/j.patcog.2008.04.002.
[48] S. H. Jee et al., "A coronary heart disease prediction model: The korean heart study," BMJ Open, vol. 4, no. 5, May 2014, Art. no. e005025, doi: 10.1136/bmjopen-2014-005025.
[49] G. T. Reddy, M. P. K. Reddy, K. Lakshmanna, D. S. Rajput, R. Kaluri, and G. Srivastava, "Hybrid genetic algorithm and a fuzzy logic classifier for heart disease diagnosis," Evol. Intell., vol. 13, no. 2, pp. 185–196, Nov. 2019, doi: 10.1007/s12065-019-00327-1.
[50] B. R. Kirkwood and J. A. C. Sterne, Essential Medical Statistics, 2nd ed. Malden, MA, USA: Blackwell Science, 2003.
[51] M. Xu, D. Fralick, J. Z. Zheng, B. Wang, X. M. Tu, and C. Feng, "The differences and similarities between two-sample T-test and paired T-test," Shanghai Arch. Psychiatry, vol. 29, no. 3, pp. 184–188, Jun. 2017, doi: 10.11919/j.issn.1002-0829.217070.
[52] G. Alfian, M. Syafrudin, M. Ijaz, M. Syaekhoni, N. Fitriyani, and J. Rhee, "A personalized healthcare monitoring system for diabetic patients by utilizing BLE-based sensors and real-time data processing," Sensors, vol. 18, no. 7, p. 2183, Jul. 2018, doi: 10.3390/s18072183.
[53] N. L. Fitriyani, M. Syafrudin, G. Alfian, and J. Rhee, "Development of disease prediction model based on ensemble learning approach for diabetes and hypertension," IEEE Access, vol. 7, pp. 144777–144789, 2019, doi: 10.1109/ACCESS.2019.2945129.

industrial applications. He was selected and invited to participate in the World Class Scholar Symposium (SCKD) Event hosted by the Ministry of Research, Technology, and Higher Education, Indonesia, in August 2019, to make contributions on accelerating the Indonesian national development. He was recognized for Excellence in research, teaching, and outreach. He has published numerous research articles in several international peer-reviewed journals, including IEEE ACCESS, Food Control, Sensors, Applied Sciences, Asia Pacific Journal of Marketing and Logistics, and Sustainability. His research interests include industrial artificial intelligence, machine learning, information systems, edge-computing, the Internet of Things, big data, health informatics, and smart factory ranging from theory to design and implementation. He serves as a Reviewer Board Member for Sensors and Algorithms (MDPI) and a Review Editor for the IoT and Sensor Networks (Frontiers in Communications and Networks).

GANJAR ALFIAN (Member, IEEE) received the B.Eng. degree from the Department of Informatics Engineering, Universitas Islam Negeri Sunan Kalijaga, Yogyakarta, Indonesia, in 2009, and the M.Eng. and Dr.Eng. degrees from the Department of Industrial and Systems Engineering, Dongguk University, Seoul, South Korea, in 2012 and 2016, respectively. He has been an Assistant Professor with the Industrial AI Research Center, Nano Information Technology Academy, Dongguk University, since 2016. In July 2017, he was a short-term Visiting Researcher with the VSB-Technical University of Ostrava, Czech Republic. He was recognized for Excellence in research, teaching, and outreach. He has published numerous research articles in several international peer-reviewed journals, including Computers and Industrial Engineering, Journal of Food Engineering, Journal of Public Transportation, IEEE ACCESS, Food Control, Sensors, Applied Sciences, Asia Pacific Journal of Marketing and Logistics, and Sustainability. His research interests include machine learning, deep learning, RFID, the Internet of Things, big data, health informatics, and simulation and car sharing service. He was a recipient of the International Conference on Science and Technology (ICST) Best Paper Award, in 2019.
NORMA LATIF FITRIYANI received the bache-

lor’s degree from Universitas Islam Negeri Sunan
Kalijaga, Yogyakarta, Indonesia, and the mas-
ter’s degree from the National Taiwan University
of Science and Technology, Taipei, Taiwan. She JONGTAE RHEE received the B.S. degree
is currently pursuing the Ph.D. degree with the in industrial engineering from Seoul National
Department of Industrial and Systems Engineer- University, the M.S. degree in industrial engi-
ing, Dongguk University, Seoul, South Korea. She neering from the Korea Advanced Institute of
has published numerous research articles in sev- Science and Technology, and the Ph.D. degree
eral international peer-reviewed journals, includ- in industrial engineering from the University of
ing IEEE ACCESS, Food Control, Sensors, Applied Sciences, Asia Pacific California, Berkeley. He is currently a Professor
Journal of Marketing and Logistics, and Sustainability. Her research interests with the Department of Industrial and Systems
include health informatics, machine learning, the Internet of Things, sensors, Engineering and the Director of the Industrial
and image processing. Artificial Intelligence Research Center, Dongguk
University, Seoul, South Korea. He has been leading researches and projects
related to practical artificial neural network models, including for production
MUHAMMAD SYAFRUDIN received the bache- and operation planning, personalized healthcare, and smart factory. He was
lor’s degree from Universitas Islam Negeri Sunan recognized for Excellence in research, teaching, and outreach. He has
Kalijaga, Yogyakarta, Indonesia, and the Ph.D. published numerous research articles in several international peer-reviewed
degree from Dongguk University, Seoul, South journals, including the IEEE SENSORS JOURNAL, Expert Systems with Appli-
Korea. He is currently an Assistant Professor cations, Computers and Industrial Engineering, Journal of Food Engineer-
with the Department of Industrial and Systems ing, the International Journal of Production Research, Journal of Public
Engineering, Dongguk University. He is also an Transportation, Journal of Food Agriculture and Environment, IEEE ACCESS,
Instructor with popular practical course on under- Food Control, Sensors, Applied Sciences, Asia Pacific Journal of Marketing
graduate topics in programming languages and and Logistics, and Sustainability. His research interests include industrial
database systems. He has collaborated actively artificial intelligence, machine learning, optimization, the Internet of Things,
with researchers in several other disciplines of engineering, particularly big data, and sensors.
information processing and machine learning on problems at the real-world