
Bank Loan Default Capstone Project Notes-2

1) Model Building - CART, Logistic Regression and Naïve Bayes models developed after PCA.
2) Ensemble Methods - Bagging and XGBoost models developed.

# Applying PCA to continuous variables before building models

Bartlett's test is used to check whether the data are suitable for PCA.
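A minimal sketch of this step with the psych package, assuming the continuous variables sit in a data frame contvars (a hypothetical name):

library(psych)
# Bartlett's test of sphericity on the correlation matrix of the
# continuous variables; a significant result means the variables are
# correlated enough for PCA
corrmatrix1 <- cor(contvars)
cortest.bartlett(corrmatrix1, n = nrow(contvars))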

$chisq
[1] 5402.731

$p.value
[1] 0

$df
[1] 190

The p-value is less than 0.05, so the data is suitable for PCA.

KMO Test
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corrmatrix1)
Overall MSA = 0.78
MSA for each item =
loan_amnt funded_amnt funded_amnt_inv int_rate installment
annual_inc
1.00 0.77 0.77 0.69 0.97
0.88
dti delinq_2yrs inq_last_6mths open_acc revol_bal
revol_util
0.64 0.53 0.48 0.67 0.88
0.60
total_acc out_prncp out_prncp_inv total_pymnt total_pymnt_inv
total_rec_prncp
0.71 0.59 0.59 0.76 0.77
0.76
total_rec_int last_pymnt_amnt
0.62 0.93

As the overall MSA is greater than 0.5, the data set has adequate sampling to
perform PCA.
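For reference, the statistic above comes from a call of this shape (corrmatrix1 as defined in the Bartlett sketch earlier):

library(psych)
# Kaiser-Meyer-Olkin measure of sampling adequacy; an overall MSA above
# 0.5 indicates the data are adequate for PCA
KMO(r = corrmatrix1)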
Scree plot to determine the number of components to retain.

As per the scree plot, we retain the components with eigenvalues greater than one. From the above
plot it is clearly evident that we have 7 such components.

We are able to allocate each variable to a component without applying any rotation.
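A sketch of the scree plot and the unrotated seven-component solution, assuming the psych package and the contvars frame from the Bartlett sketch:

library(psych)
# Scree plot of the principal-component eigenvalues; components with
# eigenvalues above 1 are retained
scree(corrmatrix1, factors = FALSE, pc = TRUE)

# Unrotated PCA with 7 components, fitted on the raw continuous
# variables so that component scores are also produced
pca7 <- principal(contvars, nfactors = 7, rotate = "none")
print(pca7$loadings, cutoff = 0.5)   # yields a loadings table like the one below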

Loadings:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
loan_amnt 0.969
funded_amnt 0.971
funded_amnt_inv 0.969
int_rate 0.631
installment 0.940
annual_inc 0.681
dti 0.556
delinq_2yrs 0.847
inq_last_6mths 0.660
open_acc 0.818
revol_bal -0.511
revol_util 0.643
total_acc 0.812
out_prncp 0.959
out_prncp_inv 0.959
total_pymnt 0.958
total_pymnt_inv 0.957
total_rec_prncp 0.938
total_rec_int 0.685
last_pymnt_amnt 0.700

PC1 PC2 PC3 PC4 PC5 PC6 PC7


SS loadings 7.918 2.529 1.832 1.478 1.210 1.073 0.983
Proportion Var 0.396 0.126 0.092 0.074 0.061 0.054 0.049
Cumulative Var 0.396 0.522 0.614 0.688 0.748 0.802 0.851
Structure of the Data after PCA
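A sketch of how the post-PCA modelling frame can be assembled, assuming loandata is the cleaned source frame (a hypothetical name) and pca7 is the fit from the sketch above; the factor names are taken from the str() output below:

# Seven component scores plus the retained categorical predictors and target
catvars  <- loandata[, c("term", "grade", "emp_length", "home_ownership",
                         "verification_status", "purpose", "addr_state", "Default")]
loan_pca <- cbind(as.data.frame(pca7$scores), catvars)
str(loan_pca)   # produces a structure like the one shown below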

'data.frame': 226786 obs. of 15 variables:


$ PC1 : num -1.093 -1.372 -0.395 -1.142 -1.25 ...
$ PC2 : num -0.0861 -0.1612 -0.1204 -0.4262 -0.1412 ...
$ PC3 : num -0.929 -1.535 0.598 -0.083 -1.625 ...
$ PC4 : num 1.212 1.256 -0.156 -0.672 1.072 ...
$ PC5 : num -0.952 0.546 0.552 1.108 0.809 ...
$ PC6 : num -0.115 1.004 -0.793 0.177 1.504 ...
$ PC7 : num -0.243 -0.693 -0.389 -1.384 -0.869 ...
$ term : Factor w/ 2 levels "36 months","60 months": 1 1 1 1 1 2 1 1 1 1 ...
$ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 3 2 2 4 3 ...
$ emp_length : Factor w/ 12 levels "< 1 year","1 year",..: 3 3 3 5 11 7 3 5 1
6 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 6 6 5 5 6 6 6 ...
$ verification_status: Factor w/ 3 levels "Not Verified",..: 3 1 2 2 2 1 2 2 1 1 ...
$ purpose : Factor w/ 14 levels "car","credit_card",..: 2 12 10 14 1 3 3 2 3
5 ...
$ addr_state : Factor w/ 51 levels "AK","AL","AR",..: 4 15 5 4 5 4 5 15 25 5 ...
$ Default : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
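Note that the model outputs that follow refer to the components by descriptive names rather than PC1-PC7. A renaming of roughly this shape is assumed; the exact PC-to-name mapping shown here is illustrative only:

# Hypothetical mapping of the seven components to descriptive names
names(loan_pca)[1:7] <- c("LoanProfile", "PrincipleBal", "CreditLines",
                          "DebtBurden", "RevBalInq", "AnnlIncome", "Delinq2Years")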

Building of CART Model


Decision Tree
> rpart.rules(train_tree)
Decision Tree Rules
Default
0.00 when PrincipleBal < -0.016
0.00 when PrincipleBal is 0.123 to 0.319 & DebtBurden >= 0.39 & CreditLines >=
-0.419 & LoanProfile >= -1.4
0.01 when PrincipleBal is -0.016 to 0.123 & DebtBurden >= -0.52
0.01 when PrincipleBal is 0.123 to 0.319 & DebtBurden >= 0.93 & CreditLines is
-0.930 to -0.419 & LoanProfile >= -1.4
0.02 when PrincipleBal is 0.319 to 0.576 & DebtBurden >= 0.31 & CreditLines >=
0.487 & LoanProfile >= -1.3
0.02 when PrincipleBal is 0.123 to 0.319 & DebtBurden is -0.19 to 0.39 & CreditLines >=
0.394 & LoanProfile >= -1.4
0.02 when PrincipleBal is -0.016 to 0.123 & DebtBurden < -0.52 & CreditLines >=
0.260
0.03 when PrincipleBal is 0.123 to 0.319 & DebtBurden < -0.19 & CreditLines >=
1.011
0.03 when PrincipleBal is 0.319 to 0.576 & DebtBurden >= 1.31 & CreditLines is
-0.042 to 0.487 & LoanProfile >= -1.3
0.08 when PrincipleBal is 0.576 to 0.985 & CreditLines >=
2.163
0.09 when PrincipleBal is 0.123 to 0.319 & DebtBurden >= 0.89 & CreditLines <
-0.930 & LoanProfile >= -1.4
0.11 when PrincipleBal is 0.319 to 0.576 & DebtBurden < 0.31 & CreditLines >=
1.279
0.12 when PrincipleBal is 0.576 to 0.868 & DebtBurden >= 2.00 & CreditLines <
2.163
0.13 when PrincipleBal is 0.319 to 0.576 & DebtBurden >= 1.55 & CreditLines <
-0.042
0.36 when PrincipleBal is 0.576 to 0.662 & DebtBurden < 2.00 & CreditLines is
1.095 to 2.163 & Delinq2Years < 7
0.45 when PrincipleBal is 0.319 to 0.576 & DebtBurden < 1.55 & CreditLines <
-0.042 & LoanProfile >= 0.3
0.49 when PrincipleBal is 0.123 to 0.319 & DebtBurden >= -0.19
& LoanProfile < -1.4
0.52 when PrincipleBal is 0.123 to 0.319 & DebtBurden is -0.19 to 0.39 & CreditLines is
-0.930 to 0.394 & LoanProfile >= -1.4
0.54 when PrincipleBal >= 0.576 & DebtBurden < 2.00 & CreditLines <
2.163 & Delinq2Years >= 7
0.55 when PrincipleBal is 0.319 to 0.576 & DebtBurden >= 0.31 & CreditLines >=
-0.042 & LoanProfile < -1.3
0.57 when PrincipleBal is 0.123 to 0.319 & DebtBurden is 0.39 to 0.93 & CreditLines is
-0.930 to -0.419 & LoanProfile >= -1.4
0.58 when PrincipleBal is 0.319 to 0.576 & DebtBurden is 0.31 to 1.31 & CreditLines is
-0.042 to 0.487 & LoanProfile >= -1.3
0.79 when PrincipleBal is -0.016 to 0.123 & DebtBurden < -0.52 & CreditLines <
0.260
0.87 when PrincipleBal is 0.123 to 0.319 & DebtBurden < -0.19 & CreditLines <
1.011
0.90 when PrincipleBal is 0.123 to 0.319 & DebtBurden is -0.19 to 0.89 & CreditLines <
-0.930 & LoanProfile >= -1.4
0.94 when PrincipleBal is 0.319 to 0.576 & DebtBurden < 0.31 & CreditLines is
-0.042 to 1.279
0.94 when PrincipleBal >= 0.868 & DebtBurden >= 2.00 & CreditLines <
2.163
0.95 when PrincipleBal >= 0.985 & CreditLines >=
2.163
0.98 when PrincipleBal is 0.576 to 0.662 & DebtBurden < 2.00 & CreditLines <
1.095 & Delinq2Years < 7
0.99 when PrincipleBal is 0.319 to 0.576 & DebtBurden < 1.55 & CreditLines <
-0.042 & LoanProfile < 0.3
1.00 when PrincipleBal >= 0.662 & DebtBurden < 2.00 & CreditLines <
2.163 & Delinq2Years < 7

The tree is essentially built on outstanding principal, loan profile (meaning loan amount &
repayment history), debt burden (which includes rate of interest & DTI), credit lines opened, and
delinquency status.
Observations

1) A person with a higher outstanding principal is more likely to default.
2) The higher the loan amount, the higher the chances of default.
3) If a person does not have a good loan repayment history, there is a high chance of default.
4) If a person is borrowing at a higher rate of interest and has a high DTI, there is a chance of
default.
5) If a person has a higher number of credit lines opened, there is a higher chance of default.
6) If a person has a continuous delinquency record during the last 2 years, there is a higher chance of
default.

The bank should focus on persons who apply for a higher loan amount, apply at a higher ROI, have a
high DTI, do not have a good repayment record, and have a larger number of credit lines opened, and
should also monitor delinquency status.

CP Table:

Classification tree:
rpart(formula = Default ~ ., data = train_data, method = "class",
minsplit = 100, minbucket = 33, cp = 0)

Variables actually used in tree construction:


[1] CreditLines DebtBurden Delinq2Years LoanProfile PrincipleBal

Root node error: 13344/158750 = 0.084057

n= 158750

CP nsplit rel error xerror xstd


1 0.88189448 0 1.000000 1.000000 0.0082850
2 0.00843076 1 0.118106 0.119230 0.0029741
3 0.00749400 4 0.091202 0.096748 0.0026817
4 0.00367206 6 0.076214 0.084233 0.0025035
5 0.00262290 8 0.068870 0.075764 0.0023752
6 0.00251049 11 0.061001 0.069694 0.0022787
7 0.00079936 13 0.055980 0.064224 0.0021879
8 0.00074940 16 0.053582 0.062800 0.0021636
9 0.00037470 19 0.051334 0.062575 0.0021598
10 0.00032474 20 0.050959 0.061975 0.0021495
11 0.00029976 23 0.049985 0.062500 0.0021585
12 0.00014988 26 0.049086 0.060776 0.0021287
13 0.00000000 30 0.048486 0.059727 0.0021103

The lowest xerror occurs at the final split, which means pruning of the CART model is not required.
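For completeness, had an interior CP row shown the lowest xerror, the tree could have been pruned back to it; a sketch with rpart:

# Pick the CP with the minimum cross-validated error and prune to it
bestcp      <- train_tree$cptable[which.min(train_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(train_tree, cp = bestcp)

Here the minimum xerror (0.059727) sits in the last row, so pruning would leave the tree unchanged.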

Prediction using the CART model on both the training and test data sets.
CP Plot

Confusion Matrix for Prediction on Train Data.
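The training-set predictions feeding this confusion matrix come from a call of this shape, mirroring the test-data call shown later:

# Class predictions of the fitted tree on the training data
CART_train_predict <- predict(train_tree, newdata = train_data, type = "class")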

> confusionMatrix(CART_train_predict,train_data$Default)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 145219 460
1 187 12884

Accuracy : 0.9959
95% CI : (0.9956, 0.9962)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9733

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9987
Specificity : 0.9655
Pos Pred Value : 0.9968
Neg Pred Value : 0.9857
Prevalence : 0.9159
Detection Rate : 0.9148
Detection Prevalence : 0.9177
Balanced Accuracy : 0.9821

'Positive' Class : 0
# Prediction on Test data.

> CART_test_predict<-predict(train_tree,newdata = test_data,type="class")

> confusionMatrix(CART_test_predict,test_data$Default)

Confusion Matrix and Statistics

Reference
Prediction 0 1
0 62233 220
1 84 5499

Accuracy : 0.9955
95% CI : (0.995, 0.996)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9707

Mcnemar's Test P-Value : 9.727e-15

Sensitivity : 0.9987
Specificity : 0.9615
Pos Pred Value : 0.9965
Neg Pred Value : 0.9850
Prevalence : 0.9159
Detection Rate : 0.9147
Detection Prevalence : 0.9179
Balanced Accuracy : 0.9801

'Positive' Class : 0

# Area under the Curve on training data set
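The performance object below comes from ROCR; a sketch, assuming the tree's predicted probability of class "1" (the object name train.roc.cart is hypothetical):

library(ROCR)
# Probability of default from the tree, wrapped in an ROCR prediction object
CART_train_prob <- predict(train_tree, newdata = train_data, type = "prob")[, "1"]
train.roc.cart  <- prediction(CART_train_prob, train_data$Default)
performance(train.roc.cart, "auc")   # prints the object shown below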


An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"
Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9939686

Slot "alpha.values":
list()
ROC Curve on Training Data Set

AUC for Test data

An object of class "performance"


Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9917786

Slot "alpha.values":
list()
ROC Curve for Test data.
Building of Logistic Regression Model
Call:
glm(formula = Default ~ ., family = "binomial", data = train_glm[,
-c(11, 13)])

Deviance Residuals:
Min 1Q Median 3Q Max
-3.9643 -0.0365 -0.0103 -0.0009 6.7656

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.98977 0.97992 -1.010 0.312470
LoanProfile -1.78072 0.11259 -15.816 < 2e-16 ***
PrincipleBal 26.36783 0.64769 40.710 < 2e-16 ***
CreditLines -3.91103 0.12789 -30.582 < 2e-16 ***
DebtBurden -4.41904 0.14386 -30.718 < 2e-16 ***
RevBalInq 0.62871 0.08009 7.850 4.16e-15 ***
AnnlIncome -0.96195 0.04989 -19.280 < 2e-16 ***
Delinq2Years -0.77095 0.05518 -13.971 < 2e-16 ***
term60 months -1.43550 0.34107 -4.209 2.57e-05 ***
gradeB -0.63446 0.29829 -2.127 0.033423 *
gradeC -1.77531 0.34393 -5.162 2.45e-07 ***
gradeD -2.77377 0.39693 -6.988 2.79e-12 ***
gradeE -3.38198 0.47332 -7.145 8.98e-13 ***
gradeF -4.56814 0.63313 -7.215 5.39e-13 ***
gradeG -3.95425 1.13365 -3.488 0.000487 ***
emplength1 year -0.44368 0.32801 -1.353 0.176161
emplength10+ years -0.23010 0.24026 -0.958 0.338220
emplength2 years -0.94844 0.32202 -2.945 0.003227 **
emplength3 years -0.18276 0.29258 -0.625 0.532194
emplength4 years -0.49274 0.33281 -1.481 0.138730
emplength5 years -0.47538 0.32595 -1.458 0.144721
emplength6 years -0.25602 0.33402 -0.766 0.443397
emplength7 years 0.01780 0.32996 0.054 0.956990
emplength8 years -0.15582 0.36468 -0.427 0.669165
emplength9 years -0.47669 0.45134 -1.056 0.290888
emplengthn/a 0.18745 0.33982 0.552 0.581216
verifstatusSource Verified -0.57146 0.18561 -3.079 0.002078 **
verifstatusVerified -0.43758 0.17022 -2.571 0.010151 *
AddrstateAL -1.09489 1.09977 -0.996 0.319464
AddrstateAR -0.89128 1.32607 -0.672 0.501506
AddrstateAZ -0.93340 1.05562 -0.884 0.376578
AddrstateCA -0.31088 0.93133 -0.334 0.738525
AddrstateCO -0.84412 1.05379 -0.801 0.423116
AddrstateCT 0.36271 1.01333 0.358 0.720392
AddrstateDC -0.12580 1.35121 -0.093 0.925821
AddrstateDE 0.14886 1.36774 0.109 0.913332
AddrstateFL -0.72136 0.95624 -0.754 0.450628
AddrstateGA -0.81139 1.01475 -0.800 0.423946
AddrstateHI -4.30599 1.99911 -2.154 0.031244 *
AddrstateIA -9.08019 703.35365 -0.013 0.989700
AddrstateID -8.57882 807.60575 -0.011 0.991525
AddrstateIL -0.35229 0.97830 -0.360 0.718769
AddrstateIN 0.01799 1.13778 0.016 0.987384
AddrstateKS -1.79813 1.43052 -1.257 0.208762
AddrstateKY -1.12512 1.29792 -0.867 0.386017
AddrstateLA -1.39298 1.19343 -1.167 0.243126
AddrstateMA -0.73322 1.11967 -0.655 0.512562
AddrstateMD -0.35605 1.00736 -0.353 0.723754
AddrstateME -7.90760 395.78763 -0.020 0.984060
AddrstateMI -0.25938 0.99960 -0.259 0.795262
AddrstateMN -0.60712 1.07424 -0.565 0.571961
AddrstateMO -2.28416 1.38567 -1.648 0.099267 .
AddrstateMS -3.88952 2.64692 -1.469 0.141711
AddrstateMT -3.60489 4.84782 -0.744 0.457112
AddrstateNC -0.01165 0.96958 -0.012 0.990414
AddrstateND -9.88009 707.38497 -0.014 0.988856
AddrstateNE -11.84881 438.60916 -0.027 0.978448
AddrstateNH -0.82337 1.45929 -0.564 0.572600
AddrstateNJ -0.17403 0.97002 -0.179 0.857614
AddrstateNM -1.36861 1.53001 -0.895 0.371048
AddrstateNV 0.07293 1.01150 0.072 0.942523
AddrstateNY -0.55970 0.94880 -0.590 0.555254
AddrstateOH -0.31488 0.98030 -0.321 0.748049
AddrstateOK -1.93305 1.30506 -1.481 0.138553
AddrstateOR -0.42781 1.08921 -0.393 0.694488
AddrstatePA -0.99815 1.02293 -0.976 0.329178
AddrstateRI -0.53969 1.38520 -0.390 0.696825
AddrstateSC -0.12273 1.07585 -0.114 0.909178
AddrstateSD -4.10813 6.81218 -0.603 0.546471
AddrstateTN -1.18543 1.34118 -0.884 0.376765
AddrstateTX -0.23481 0.94520 -0.248 0.803809
AddrstateUT -0.77069 1.28298 -0.601 0.548038
AddrstateVA -0.24881 0.98286 -0.253 0.800153
AddrstateVT -0.80716 1.56293 -0.516 0.605547
AddrstateWA -0.43907 1.01179 -0.434 0.664322
AddrstateWI -0.89688 1.25158 -0.717 0.473619
AddrstateWV -0.69732 1.36134 -0.512 0.608491
AddrstateWY -1.56135 2.15259 -0.725 0.468245
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 91620.1 on 158749 degrees of freedom


Residual deviance: 2449.3 on 158672 degrees of freedom
AIC: 2605.3

Number of Fisher Scoring iterations: 14

From the above logistic regression model it is clearly evident that variables like loan profile (loan
amount & loan repayment status), outstanding principal, credit lines, debt burden (ROI and DTI),
annual income, revolving credit utilization, grade, the 60-month term and delinquency during the
last 2 years are the most significant variables for predicting the probability of default of a customer.
Logistic Regression Prediction on Train Data
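A sketch of the train-set scoring at the 0.5 cutoff, mirroring the test-data code shown later:

# Fitted default probabilities on the training data, cut at 0.5;
# confusionMatrix() from caret, as used elsewhere in these notes
pred_glm_train        <- predict(glmmodel_train, newdata = train_glm, type = "response")
glm_pred_matrix_train <- as.factor(ifelse(pred_glm_train > 0.5, 1, 0))
confusionMatrix(glm_pred_matrix_train, train_glm$Default, positive = "1")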
Confusion matrix for prediction on train data
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 145359 211
1 47 13133

Accuracy : 0.9984
95% CI : (0.9982, 0.9986)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9894

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.98419
Specificity : 0.99968
Pos Pred Value : 0.99643
Neg Pred Value : 0.99855
Prevalence : 0.08406
Detection Rate : 0.08273
Detection Prevalence : 0.08302
Balanced Accuracy : 0.99193

'Positive' Class : 1

ROC Curve for Train data
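The train.roc.glm object used in the AUC, KS and Gini calculations below can be built as follows (a sketch, assuming pred_glm_train from the scoring step above):

library(ROCR)
# ROCR prediction object and ROC curve for the training data
train.roc.glm <- prediction(pred_glm_train, train_glm$Default)
plot(performance(train.roc.glm, "tpr", "fpr"))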

AUC Value for Train data

> train.auc.glm = performance(train.roc.glm, "auc")


> train.area.glm = as.numeric(slot(train.auc.glm, "y.values"))
> train.area.glm

[1] 0.9983739
# KS and Gini Values on Train data
> ks.train.glm <- performance(train.roc.glm, "tpr", "fpr")
> train.ks.glm <- max(attr(ks.train.glm, "y.values")[[1]] - (attr(ks.train.glm, "x.values")[[1]]))
> train.ks.glm
[1] 0.9907343

> train.gini.glm = (2 * train.area.glm) - 1


> train.gini.glm
[1] 0.9967478

Logistic Regression Prediction on Test data


> pred_glm_test<-predict(glmmodel_train,newdata=test_glm,type="response")
> glm_pred_matrix_test <- ifelse(pred_glm_test>0.5,1,0)
>
> glm_pred_matrix_test <- as.factor(glm_pred_matrix_test)
>
> confusionMatrix(glm_pred_matrix_test, test_glm$Default, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 62298 108
1 19 5611

Accuracy : 0.9981
95% CI : (0.9978, 0.9984)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9878

Mcnemar's Test P-Value : 5.776e-15

Sensitivity : 0.98112
Specificity : 0.99970
Pos Pred Value : 0.99663
Neg Pred Value : 0.99827
Prevalence : 0.08406
Detection Rate : 0.08247
Detection Prevalence : 0.08275
Balanced Accuracy : 0.99041

'Positive' Class : 1

ROC Curve on Test Data


AUC value on test data
> test.auc.glm = performance(test.roc.glm, "auc")
> test.area.glm = as.numeric(slot(test.auc.glm, "y.values"))
> test.area.glm
[1] 0.9982265

KS and Gini Values on Test data

> ks.test.glm <- performance(test.roc.glm, "tpr", "fpr")


> test.ks.glm <- max(attr(ks.test.glm, "y.values")[[1]] - (attr(ks.test.glm, "x.values")[[1]]))
> test.ks.glm
[1] 0.990809
>
>
> test.gini.glm = (2 * test.area.glm) - 1
> test.gini.glm
[1] 0.996453

Note: I was not able to build the random forest model due to a RAM memory
issue on my system, so I am building a Naïve Bayes model instead.

# Building of Naïve Bayes Model.
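The classifier summarised below can be fitted with e1071; a minimal sketch (train_nb is the PCA-reduced training frame used in the prediction step later):

library(e1071)
# Naive Bayes with the seven components plus the categorical predictors
nb_train <- naiveBayes(Default ~ ., data = train_nb)
nb_train   # prints the a-priori and conditional probabilities shown below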


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
0 1
0.91594331 0.08405669

Conditional probabilities:
LoanProfile
Y [,1] [,2]
0 0.01504695 1.0129202
1 -0.19902396 0.8010333

PrincipleBal
Y [,1] [,2]
0 -0.2448386 0.2817802
1 2.6740564 1.8088103

CreditLines
Y [,1] [,2]
0 0.02991229 0.9857868
1 -0.32436093 1.0909482

DebtBurden
Y [,1] [,2]
0 0.04074323 0.9703017
1 -0.43759842 1.2151635

RevBalInq
Y [,1] [,2]
0 -0.002209475 0.9901906
1 0.014145492 1.0946685

AnnlIncome
Y [,1] [,2]
0 0.01173404 0.9835334
1 -0.12651457 1.2619096

Delinq2Years
Y [,1] [,2]
0 0.004783965 0.9689868
1 -0.037316490 1.2580965

term
Y 36 months 60 months
0 0.8070575 0.1929425
1 0.6145084 0.3854916

grade
Y A B C D E F G
0 0.191065018 0.320255010 0.254136693 0.144718925 0.062074467 0.022241173 0.005508714
1 0.047811751 0.178806954 0.293989808 0.243705036 0.152502998 0.063998801 0.019184652

emplength
Y < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
0 0.08207364 0.06675791 0.30644540 0.09415017 0.08102142 0.06469472 0.07139320 0.05752858
1 0.08505695 0.06864508 0.30987710 0.08820444 0.08258393 0.05912770 0.06040168 0.05110911
emplength
Y 7 years 8 years 9 years n/a
0 0.05568546 0.04657304 0.03796267 0.03571379
1 0.05065947 0.04946043 0.03844424 0.05642986

homeown
Y ANY MORTGAGE NONE OTHER OWN RENT
0 6.877295e-06 5.055706e-01 1.650551e-04 5.914474e-04 8.611061e-02 4.075554e-01
1 0.000000e+00 4.386990e-01 0.000000e+00 0.000000e+00 1.034922e-01 4.578088e-01

verifstatus
Y Not Verified Source Verified Verified
0 0.3562164 0.2903457 0.3534380
1 0.2162020 0.4089478 0.3748501

purpose
Y car credit_card debt_consolidation educational home_improvement house
0 0.0154257734 0.2037742597 0.5812346121 0.0012722996 0.0603826527 0.0066090808
1 0.0066696643 0.1838279376 0.6356414868 0.0000000000 0.0587529976 0.0056954436
purpose
Y major_purchase medical moving other renewable_energy small_business
0 0.0261062129 0.0110862000 0.0077850983 0.0545850928 0.0009490668 0.0161822758
1 0.0183603118 0.0092176259 0.0075689448 0.0520833333 0.0009742206 0.0152877698
purpose
Y vacation wedding
0 0.0064302711 0.0081771041
1 0.0054706235 0.0004496403

Addrstate
Y AK AL AR AZ CA CO
CT
0 2.840323e-03 1.206278e-02 6.684731e-03 2.449005e-02 1.708939e-01 2.374730e-02
1.477243e-02
1 2.323141e-03 1.551259e-02 8.018585e-03 2.278177e-02 1.435102e-01 1.663669e-02
1.408873e-02
Addrstate
Y DC DE FL GA HI IA
ID
0 3.576194e-03 2.737164e-03 6.718430e-02 3.203444e-02 5.694401e-03 2.750918e-05
2.063189e-05
1 1.423861e-03 3.447242e-03 7.441547e-02 3.050060e-02 7.119305e-03 0.000000e+00
0.000000e+00
Addrstate
Y IL IN KS KY LA MA
MD
0 3.669725e-02 1.045349e-02 8.294018e-03 8.899220e-03 1.144382e-02 2.475139e-02
2.378856e-02
1 3.207434e-02 1.513789e-02 5.845324e-03 6.669664e-03 1.371403e-02 2.263189e-02
2.585432e-02
Addrstate
Y ME MI MN MO MS MT
NC
0 6.189566e-05 2.353410e-02 1.756461e-02 1.519195e-02 1.602410e-03 3.136047e-03
2.715156e-02
1 0.000000e+00 2.517986e-02 1.671163e-02 1.266487e-02 5.320743e-03 2.847722e-03
3.095024e-02
Addrstate
Y ND NE NH NJ NM NV
NY
0 3.438648e-05 1.444232e-04 4.814107e-03 3.718554e-02 5.364290e-03 1.440106e-02
8.343535e-02
1 7.494005e-05 9.742206e-04 3.597122e-03 3.829436e-02 6.294964e-03 1.708633e-02
9.854616e-02
Addrstate
Y OH OK OR PA RI SC
SD
0 3.028073e-02 8.438441e-03 1.352764e-02 3.249522e-02 4.325819e-03 1.139568e-02
2.193857e-03
1 2.997602e-02 1.214029e-02 1.034173e-02 3.672062e-02 4.721223e-03 1.101619e-02
2.847722e-03
Addrstate
Y TN TX UT VA VT WA
WI
0 8.816693e-03 7.855247e-02 8.500337e-03 3.151864e-02 1.726201e-03 2.368541e-02
1.242040e-02
1 1.746103e-02 7.868705e-02 7.044365e-03 3.237410e-02 1.049161e-03 1.828537e-02
1.109113e-02
Addrstate
Y WV WY
0 4.814107e-03 2.592740e-03
1 4.571343e-03 1.423861e-03

# Naïve Bayes Prediction on Training Data Set


> nb_pred_train<-predict(nb_train,newdata = train_nb)

> confusionMatrix(nb_pred_train,train_nb$Default,positive = "1")


Confusion Matrix and Statistics

Reference
Prediction 0 1
0 144004 1528
1 1402 11816

Accuracy : 0.9815
95% CI : (0.9809, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2e-16

Kappa : 0.8796
Mcnemar's Test P-Value : 0.02093

Sensitivity : 0.88549
Specificity : 0.99036
Pos Pred Value : 0.89393
Neg Pred Value : 0.98950
Prevalence : 0.08406
Detection Rate : 0.07443
Detection Prevalence : 0.08326
Balanced Accuracy : 0.93792

'Positive' Class : 1

# Naïve Bayes Prediction on Test data

> nb_pred_test<-predict(nb_train,newdata = test_nb)

> confusionMatrix(nb_pred_test,test_nb$Default,positive = "1")


Confusion Matrix and Statistics

Reference
Prediction 0 1
0 61703 666
1 614 5053

Accuracy : 0.9812
95% CI : (0.9801, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : <2e-16

Kappa : 0.8773

Mcnemar's Test P-Value : 0.154

Sensitivity : 0.88355
Specificity : 0.99015
Pos Pred Value : 0.89165
Neg Pred Value : 0.98932
Prevalence : 0.08406
Detection Rate : 0.07427
Detection Prevalence : 0.08329
Balanced Accuracy : 0.93685

'Positive' Class : 1

Comparison of Performance Measures for the CART, Logistic Regression and Naïve Bayes Models

Algorithm            Data Set    Accuracy  Sensitivity  Specificity    AUC     KS   GINI
CART                 Train data     99.59        99.87        96.55  99.40  97.04  90.49
                     Test data      99.55        99.87        96.15  99.18  96.52  90.52
Logistic Regression  Train data     99.84        98.42        99.97  99.84  99.07  99.67
                     Test data      99.81        98.11        99.97  99.83  99.08  99.65
Naive Bayes          Train data     98.15        88.55        99.04
                     Test data      98.12        88.35        99.02

After comparing the performance measures for all three models, all measures are better for logistic
regression except sensitivity, which is better in the CART model and which may be further improved in
logistic regression as well.

Insights from Logistic Regression Model

Loan profile (loan amount & loan repayment details), annual income, debt
burden (rate of interest & debt-to-income ratio), credit lines, outstanding principal,
the 60-month loan term, grade, delinquency status and revolving balances are the
most significant drivers in predicting the probability of default.
1) Banks should take care while giving loans to lower-income customers.
2) Banks should look at a customer's current debt burden before sanctioning a loan.
3) Banks should take more care while giving loans to customers who accept a higher rate of
interest.
4) Banks should review applications for higher loan amounts.
5) Banks should take care while giving loans to lower-grade customers.
6) From the credit score, banks should track the repayment status of credit lines.
7) Banks should take care while sanctioning loans to people who have a high number of credit lines
opened.

Ensemble Methods
Bagging Method:

> train_bagging<-bagging(as.numeric(Default)
~.,data=train_ensmb,control=rpart.control(maxdepth=5, minsplit=4))
> bagging_pred_train<-predict(train_bagging,train_ensmb)
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.5)
> tabbag_train
TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.6)
> tabbag_train

TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.7)
> tabbag_train

TRUE
0 145406
1 13344

The bagging model is highly overfit, because it is also fitting noise in the data. Note also that because
as.numeric(Default) maps the factor levels to 1 and 2, every fitted value exceeds the 0.5, 0.6 and 0.7
cutoffs, which is why all 158,750 observations fall in the TRUE column above.
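One way to obtain directly interpretable class predictions here is to keep Default as a factor, so that bagging() (assumed to come from the ipred package) fits classification trees; a sketch:

library(ipred)
library(rpart)
# With a factor response, bagging() performs classification and
# predict() returns class labels instead of scores near 1 and 2
train_bagging_cls <- bagging(Default ~ ., data = train_ensmb,
                             control = rpart.control(maxdepth = 5, minsplit = 4))
bagging_cls_pred  <- predict(train_bagging_cls, newdata = train_ensmb)
table(train_ensmb$Default, bagging_cls_pred)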

Building of XGBoost Method
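xgboost() expects a numeric feature matrix and a 0/1 label vector; one sketch of the data preparation assumed before the call below (the exact encoding used in these notes is not shown):

# One-hot encode the predictors and recode the factor target to 0/1
features_train <- model.matrix(Default ~ . - 1, data = train_xgb)
label_train    <- as.numeric(as.character(train_xgb$Default))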

> xgb_model1 <- xgboost(


+ data = as.matrix(features_train),
+ label = label_train,
+ eta = 0.001,
+ max_depth = 5,
+ min_child_weight = 3,
+ nrounds = 100,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 10)

Prediction on Training data


> xgb_pred_train <- predict(xgb_model1, newdata = features_train)
> xgb_matrix_train <- ifelse(xgb_pred_train>0.5,1,0)
>
> xgb_matrix_train <- as.factor(xgb_matrix_train)
>
> confusionMatrix(xgb_matrix_train, train_xgb$Default, positive = "1")
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 145303 597
1 103 12747
Accuracy : 0.9956
95% CI : (0.9953, 0.9959)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9709

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.95526
Specificity : 0.99929
Pos Pred Value : 0.99198
Neg Pred Value : 0.99591
Prevalence : 0.08406
Detection Rate : 0.08030
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97728

'Positive' Class : 1

# Prediction on Test data


> xgb_pred_test <- predict(xgb_model1, newdata = features_test)
>
>
> xgb_matrix_test <- ifelse(xgb_pred_test>0.5,1,0)
>
> xgb_matrix_test <- as.factor(xgb_matrix_test)
>
> confusionMatrix(xgb_matrix_test, test_xgb$Default, positive = "1")
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 62257 272
1 60 5447

Accuracy : 0.9951
95% CI : (0.9946, 0.9956)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9678

Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.95244
Specificity : 0.99904
Pos Pred Value : 0.98910
Neg Pred Value : 0.99565
Prevalence : 0.08406
Detection Rate : 0.08006
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97574

'Positive' Class : 1

# Tuning of XGBoost model

Tuning for Best nrounds Value (1000)

> tp_xgb <- vector()


> lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
> md <- c(1, 3, 5, 7, 9, 15)
> nr <- c(2, 50, 100, 1000, 10000)
>
> for(i in nr) {
+ xgb.fit <- xgboost(
+ data = features_train,
+ label = label_train,
+ eta = 0.001,
+ max_depth = 5,
+ nrounds = i,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 0)
+
+ train_xgb.pred.class <- predict(xgb.fit, features_train)
+
+ tp_xgb <- cbind(tp_xgb, sum(train_xgb$Default == 1 & train_xgb.pred.class > 0.5))
+ }
>
> tp_xgb
[,1] [,2] [,3] [,4] [,5]
[1,] 12749 12749 12749 12749 12749
Best max_depth Value (15)
> tp_xgb <- vector()
> lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
> md <- c(1, 3, 5, 7, 9, 15)
> nr <- c(2, 50, 100, 1000, 10000)
>
> for(i in md) {
+ xgb.fit <- xgboost(
+ data = features_train,
+ label = label_train,
+ eta = 0.001,
+ max_depth = i,
+ nrounds = 100,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 0)
+
+ train_xgb.pred.class <- predict(xgb.fit, features_train)
+
+ tp_xgb <- cbind(tp_xgb, sum(train_xgb$Default == 1 & train_xgb.pred.class > 0.5))
+ }
>
> tp_xgb
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 12144 12405 12749 12891 12968 12974

Best eta Value (1)
> tp_xgb <- vector()
> lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
> md <- c(1, 3, 5, 7, 9, 15)
> nr <- c(2, 50, 100, 1000, 10000)
>
> for(i in lr) {
+ xgb.fit <- xgboost(
+ data = features_train,
+ label = label_train,
+ eta = i,
+ max_depth = 6,
+ nrounds = 5,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 0)
+
+ train_xgb.pred.class <- predict(xgb.fit, features_train)
+
+ tp_xgb <- cbind(tp_xgb, sum(train_xgb$Default == 1 & train_xgb.pred.class > 0.5))
+ }
>
> tp_xgb
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 12781 12804 12831 12918 12993 13031 13099

Final Best XGB Model

> xgb_final <- xgboost(


+ data = features_train,
+ label = label_train,
+ eta = 1,
+ max_depth = 15,
+ nrounds = 10000,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 10)

Prediction on Train data with Tuned XGB Model


> xgb_pred_train_f <- predict(xgb_final, newdata = features_train)
>
>
> xgb_matrix_train_f <- ifelse(xgb_pred_train_f>0.5,1,0)
>
> xgb_matrix_train_f <- as.factor(xgb_matrix_train_f)
>
> confusionMatrix(xgb_matrix_train_f, train_xgb$Default, positive = "1")
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 145406 0
1 0 13344

Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 1

Mcnemar's Test P-Value : NA

Sensitivity : 1.00000
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 1.00000
Prevalence : 0.08406
Detection Rate : 0.08406
Detection Prevalence : 0.08406
Balanced Accuracy : 1.00000

'Positive' Class : 1

Prediction on Test data with Tuned XGB Model


> xgb_pred_test_f <- predict(xgb_final, newdata = features_test)
>
>
> xgb_matrix_test_f <- ifelse(xgb_pred_test_f>0.5,1,0)
>
> xgb_matrix_test_f <- as.factor(xgb_matrix_test_f)
>
> confusionMatrix(xgb_matrix_test_f, test_xgb$Default, positive = "1")
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 62285 109
1 32 5610

Accuracy : 0.9979
95% CI : (0.9976, 0.9983)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.9865

Mcnemar's Test P-Value : 1.55e-10

Sensitivity : 0.98094
Specificity : 0.99949
Pos Pred Value : 0.99433
Neg Pred Value : 0.99825
Prevalence : 0.08406
Detection Rate : 0.08246
Detection Prevalence : 0.08293
Balanced Accuracy : 0.99021

'Positive' Class : 1

Algorithm              Data Set    Accuracy  Sensitivity  Specificity    AUC     KS   GINI
CART                   Train data     99.59        99.87        96.55  99.40  97.04  90.49
                       Test data      99.55        99.87        96.15  99.18  96.52  90.52
Logistic Regression    Train data     99.84        98.42        99.97  99.84  99.07  99.67
                       Test data      99.81        98.11        99.97  99.83  99.08  99.65
Naive Bayes            Train data     98.15        88.55        99.04
                       Test data      98.12        88.35        99.02
XGBoost Before Tuning  Train data     99.56        95.53        99.93
                       Test data      99.51        95.24        99.90
XGBoost After Tuning   Train data    100.00       100.00       100.00
                       Test data      99.79        98.09        99.95

I have tuned the XGBoost model based on the train data, which is why we are getting 100% performance
measures for the train data set in the tuned XGBoost model. After tuning, the performance measures
for the test data also improved.
