Bartlett Test To Check Whether The Data Is Suitable For PCA
1) Model Building: CART, Logistic Regression and Naïve Bayes models developed after PCA.
2) Ensemble Methods: Bagging and XGBoost models developed.
$chisq
[1] 5402.731
$p.value
[1] 0
$df
[1] 190
The p-value is less than 0.05, so the data is suitable for PCA.
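The chi-square statistic above can be reproduced directly in base R. A minimal sketch, using a small synthetic data set as a stand-in for the project's correlation matrix (the stand-in data is an assumption for illustration):

```r
# Bartlett's test of sphericity: H0 says the correlation matrix is an
# identity matrix, i.e. the variables are uncorrelated and PCA is pointless.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, p.value = pchisq(chisq, df, lower.tail = FALSE), df = df)
}

# Synthetic stand-in for the loan data: 5 variables, two of them correlated.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)
res <- bartlett_sphericity(cor(x), n = 200)
res$p.value < 0.05  # TRUE: correlation is present, so PCA is worthwhile
```

The degrees of freedom are p(p − 1)/2 pairs of variables, which matches the df = 190 reported above for the 20 loan variables.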
KMO Test
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corrmatrix1)
Overall MSA = 0.78
MSA for each item =
loan_amnt funded_amnt funded_amnt_inv int_rate installment
annual_inc
1.00 0.77 0.77 0.69 0.97
0.88
dti delinq_2yrs inq_last_6mths open_acc revol_bal
revol_util
0.64 0.53 0.48 0.67 0.88
0.60
total_acc out_prncp out_prncp_inv total_pymnt total_pymnt_inv
total_rec_prncp
0.71 0.59 0.59 0.76 0.77
0.76
total_rec_int last_pymnt_amnt
0.62 0.93
As the overall MSA (0.78) is more than 0.5, the data set has an adequate sample to
perform PCA.
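The overall MSA can also be computed by hand from the anti-image (partial) correlations, which is what KMO() does internally. A minimal base-R sketch on the same kind of synthetic stand-in data (the stand-in is an assumption):

```r
# KMO overall MSA: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal anti-image values).
kmo_overall <- function(R) {
  Ri <- solve(R)                                # inverse correlation matrix
  Q  <- -Ri / sqrt(outer(diag(Ri), diag(Ri)))   # anti-image partial correlations
  diag(R) <- 0
  diag(Q) <- 0
  sum(R^2) / (sum(R^2) + sum(Q^2))
}

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)
msa <- kmo_overall(cor(x))
```

Values closer to 1 mean the correlations are largely explained by shared factors rather than by pairwise partial relationships, which is why 0.5 is the usual lower bound for PCA.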
Scree Plot to know the number of components to be considered.
As per the scree plot, we consider the components with eigenvalues greater than one. From the above
plot it is clearly evident that we have 7 components.
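The eigenvalue-greater-than-one rule (the Kaiser criterion) behind the scree plot can be sketched as follows, again on synthetic stand-in data:

```r
# Eigenvalues of the correlation matrix: their sum equals the number of
# variables, and a component with eigenvalue > 1 explains more variance
# than any single standardised variable does on its own.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)

ev <- eigen(cor(x))$values
n_components <- sum(ev > 1)

plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser cut-off
```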
Loadings:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
loan_amnt 0.969
funded_amnt 0.971
funded_amnt_inv 0.969
int_rate 0.631
installment 0.940
annual_inc 0.681
dti 0.556
delinq_2yrs 0.847
inq_last_6mths 0.660
open_acc 0.818
revol_bal -0.511
revol_util 0.643
total_acc 0.812
out_prncp 0.959
out_prncp_inv 0.959
total_pymnt 0.958
total_pymnt_inv 0.957
total_rec_prncp 0.938
total_rec_int 0.685
last_pymnt_amnt 0.700
The tree is formed primarily on the basis of outstanding principal, the loan profile (loan amount &
repayment history), the debt burden (rate of interest & DTI), the credit lines opened, and the
delinquency status.
Observations
1) If a person has a higher outstanding principal, that person is more likely to default.
2) The higher the loan amount, the higher the chances of default.
3) If a person does not have a good loan repayment history, there is a high chance of default.
4) If a person is borrowing at a higher rate of interest and has a high DTI, there is a chance of
default.
5) If a person has a higher number of credit lines opened, there is a higher chance of default.
6) If a person has a continuous delinquency record during the last 2 years, there is a higher chance of
default.
The bank should focus on persons who apply for higher loan amounts at a higher ROI, who have a high
DTI, who do not have a good repayment record, and who have a higher number of credit lines opened,
and it should also watch the delinquency status.
CP Table:
Classification tree:
rpart(formula = Default ~ ., data = train_data, method = "class",
minsplit = 100, minbucket = 33, cp = 0)
n= 158750
The lowest xerror occurs at the final node, which means pruning of the CART model is not required.
Prediction using the CART model on both the training data set and the test data set.
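The rpart call above, with its minsplit/minbucket/cp controls, can be sketched end-to-end on stand-in data (iris with a synthetic binary Default label is an assumption; the real model uses the PCA components):

```r
library(rpart)

# Synthetic stand-in: treat 'setosa' as the defaulting class.
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# cp = 0 grows the full tree, exactly as in the report's rpart call;
# the CP table then guides whether pruning is needed.
fit <- rpart(Default ~ ., data = df, method = "class",
             control = rpart.control(minsplit = 20, cp = 0))
pred <- predict(fit, df, type = "class")
mean(pred == df$Default)  # in-sample accuracy
```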
CP Plot
> confusionMatrix(CART_train_predict,train_data$Default)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 145219 460
1 187 12884
Accuracy : 0.9959
95% CI : (0.9956, 0.9962)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9733
Sensitivity : 0.9987
Specificity : 0.9655
Pos Pred Value : 0.9968
Neg Pred Value : 0.9857
Prevalence : 0.9159
Detection Rate : 0.9148
Detection Prevalence : 0.9177
Balanced Accuracy : 0.9821
'Positive' Class : 0
# Prediction on Test data.
> confusionMatrix(CART_test_predict,test_data$Default)
Reference
Prediction 0 1
0 62233 220
1 84 5499
Accuracy : 0.9955
95% CI : (0.995, 0.996)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9707
Sensitivity : 0.9987
Specificity : 0.9615
Pos Pred Value : 0.9965
Neg Pred Value : 0.9850
Prevalence : 0.9159
Detection Rate : 0.9147
Detection Prevalence : 0.9179
Balanced Accuracy : 0.9801
'Positive' Class : 0
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9939686
Slot "alpha.values":
list()
ROC Curve on Training Data Set
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9917786
Slot "alpha.values":
list()
ROC Curve for Test data.
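The AUC values reported in the ROCR slots above can be reproduced without ROCR via the rank (Mann–Whitney) formulation; a minimal base-R sketch:

```r
# AUC as the probability that a randomly chosen positive case is scored
# above a randomly chosen negative case (Mann-Whitney on score ranks).
auc <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))  # perfectly separated -> 1
```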
Building of Logistic Regression Model
Call:
glm(formula = Default ~ ., family = "binomial", data = train_glm[,
-c(11, 13)])
Deviance Residuals:
Min 1Q Median 3Q Max
-3.9643 -0.0365 -0.0103 -0.0009 6.7656
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.98977 0.97992 -1.010 0.312470
LoanProfile -1.78072 0.11259 -15.816 < 2e-16 ***
PrincipleBal 26.36783 0.64769 40.710 < 2e-16 ***
CreditLines -3.91103 0.12789 -30.582 < 2e-16 ***
DebtBurden -4.41904 0.14386 -30.718 < 2e-16 ***
RevBalInq 0.62871 0.08009 7.850 4.16e-15 ***
AnnlIncome -0.96195 0.04989 -19.280 < 2e-16 ***
Delinq2Years -0.77095 0.05518 -13.971 < 2e-16 ***
term60 months -1.43550 0.34107 -4.209 2.57e-05 ***
gradeB -0.63446 0.29829 -2.127 0.033423 *
gradeC -1.77531 0.34393 -5.162 2.45e-07 ***
gradeD -2.77377 0.39693 -6.988 2.79e-12 ***
gradeE -3.38198 0.47332 -7.145 8.98e-13 ***
gradeF -4.56814 0.63313 -7.215 5.39e-13 ***
gradeG -3.95425 1.13365 -3.488 0.000487 ***
emplength1 year -0.44368 0.32801 -1.353 0.176161
emplength10+ years -0.23010 0.24026 -0.958 0.338220
emplength2 years -0.94844 0.32202 -2.945 0.003227 **
emplength3 years -0.18276 0.29258 -0.625 0.532194
emplength4 years -0.49274 0.33281 -1.481 0.138730
emplength5 years -0.47538 0.32595 -1.458 0.144721
emplength6 years -0.25602 0.33402 -0.766 0.443397
emplength7 years 0.01780 0.32996 0.054 0.956990
emplength8 years -0.15582 0.36468 -0.427 0.669165
emplength9 years -0.47669 0.45134 -1.056 0.290888
emplengthn/a 0.18745 0.33982 0.552 0.581216
verifstatusSource Verified -0.57146 0.18561 -3.079 0.002078 **
verifstatusVerified -0.43758 0.17022 -2.571 0.010151 *
AddrstateAL -1.09489 1.09977 -0.996 0.319464
AddrstateAR -0.89128 1.32607 -0.672 0.501506
AddrstateAZ -0.93340 1.05562 -0.884 0.376578
AddrstateCA -0.31088 0.93133 -0.334 0.738525
AddrstateCO -0.84412 1.05379 -0.801 0.423116
AddrstateCT 0.36271 1.01333 0.358 0.720392
AddrstateDC -0.12580 1.35121 -0.093 0.925821
AddrstateDE 0.14886 1.36774 0.109 0.913332
AddrstateFL -0.72136 0.95624 -0.754 0.450628
AddrstateGA -0.81139 1.01475 -0.800 0.423946
AddrstateHI -4.30599 1.99911 -2.154 0.031244 *
AddrstateIA -9.08019 703.35365 -0.013 0.989700
AddrstateID -8.57882 807.60575 -0.011 0.991525
AddrstateIL -0.35229 0.97830 -0.360 0.718769
AddrstateIN 0.01799 1.13778 0.016 0.987384
AddrstateKS -1.79813 1.43052 -1.257 0.208762
AddrstateKY -1.12512 1.29792 -0.867 0.386017
AddrstateLA -1.39298 1.19343 -1.167 0.243126
AddrstateMA -0.73322 1.11967 -0.655 0.512562
AddrstateMD -0.35605 1.00736 -0.353 0.723754
AddrstateME -7.90760 395.78763 -0.020 0.984060
AddrstateMI -0.25938 0.99960 -0.259 0.795262
AddrstateMN -0.60712 1.07424 -0.565 0.571961
AddrstateMO -2.28416 1.38567 -1.648 0.099267 .
AddrstateMS -3.88952 2.64692 -1.469 0.141711
AddrstateMT -3.60489 4.84782 -0.744 0.457112
AddrstateNC -0.01165 0.96958 -0.012 0.990414
AddrstateND -9.88009 707.38497 -0.014 0.988856
AddrstateNE -11.84881 438.60916 -0.027 0.978448
AddrstateNH -0.82337 1.45929 -0.564 0.572600
AddrstateNJ -0.17403 0.97002 -0.179 0.857614
AddrstateNM -1.36861 1.53001 -0.895 0.371048
AddrstateNV 0.07293 1.01150 0.072 0.942523
AddrstateNY -0.55970 0.94880 -0.590 0.555254
AddrstateOH -0.31488 0.98030 -0.321 0.748049
AddrstateOK -1.93305 1.30506 -1.481 0.138553
AddrstateOR -0.42781 1.08921 -0.393 0.694488
AddrstatePA -0.99815 1.02293 -0.976 0.329178
AddrstateRI -0.53969 1.38520 -0.390 0.696825
AddrstateSC -0.12273 1.07585 -0.114 0.909178
AddrstateSD -4.10813 6.81218 -0.603 0.546471
AddrstateTN -1.18543 1.34118 -0.884 0.376765
AddrstateTX -0.23481 0.94520 -0.248 0.803809
AddrstateUT -0.77069 1.28298 -0.601 0.548038
AddrstateVA -0.24881 0.98286 -0.253 0.800153
AddrstateVT -0.80716 1.56293 -0.516 0.605547
AddrstateWA -0.43907 1.01179 -0.434 0.664322
AddrstateWI -0.89688 1.25158 -0.717 0.473619
AddrstateWV -0.69732 1.36134 -0.512 0.608491
AddrstateWY -1.56135 2.15259 -0.725 0.468245
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above logistic regression model it is clearly evident that variables such as the loan profile
(loan amount & repayment status), outstanding principal, credit lines, debt burden (ROI and DTI),
annual income, revolving credit utilization, grade, the 60-month term and delinquency during the last
2 years are the most significant variables for predicting the probability of default of a customer.
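Exponentiating the logistic coefficients turns them into odds ratios, which is often the easier way to read the direction and strength of each effect. A sketch on the built-in mtcars data (a stand-in, since the loan data is not reproduced here):

```r
# Logistic regression on built-in data; exp(coef) gives odds ratios:
# values below 1 reduce the odds of the outcome, values above 1 raise them.
fit <- glm(am ~ wt + hp, family = "binomial", data = mtcars)
odds_ratios <- exp(coef(fit))
odds_ratios
```

Read against the table above, a strongly positive coefficient such as PrincipleBal corresponds to an odds ratio far above 1, i.e. higher outstanding principal sharply raises the odds of default.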
Logistic Regression Prediction on Train Data
Confusion matrix for prediction train data
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 145359 211
1 47 13133
Accuracy : 0.9984
95% CI : (0.9982, 0.9986)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9894
Sensitivity : 0.98419
Specificity : 0.99968
Pos Pred Value : 0.99643
Neg Pred Value : 0.99855
Prevalence : 0.08406
Detection Rate : 0.08273
Detection Prevalence : 0.08302
Balanced Accuracy : 0.99193
'Positive' Class : 1
[1] 0.9983739
# KS and Gini Values on Train data
> ks.train.glm <- performance(train.roc.glm, "tpr", "fpr")
> train.ks.glm <- max(attr(ks.train.glm, "y.values")[[1]] - (attr(ks.train.glm, "x.values")[[1]]))
> train.ks.glm
[1] 0.9907343
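The KS statistic computed above is the maximum separation between the cumulative TPR and FPR, and the Gini coefficient follows directly from the AUC as 2·AUC − 1. A minimal base-R sketch of both:

```r
ks_gini <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  lab <- labels[ord]
  tpr <- cumsum(lab == 1) / sum(lab == 1)   # cumulative true positive rate
  fpr <- cumsum(lab == 0) / sum(lab == 0)   # cumulative false positive rate

  # AUC via the rank (Mann-Whitney) formulation
  r   <- rank(scores)
  n1  <- sum(labels == 1)
  n0  <- sum(labels == 0)
  auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  c(ks = max(tpr - fpr), gini = 2 * auc - 1)
}

ks_gini(c(0.9, 0.8, 0.2, 0.1), c(1, 1, 0, 0))  # perfect model: ks = 1, gini = 1
```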
Accuracy : 0.9981
95% CI : (0.9978, 0.9984)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9878
Sensitivity : 0.98112
Specificity : 0.99970
Pos Pred Value : 0.99663
Neg Pred Value : 0.99827
Prevalence : 0.08406
Detection Rate : 0.08247
Detection Prevalence : 0.08275
Balanced Accuracy : 0.99041
'Positive' Class : 1
KS and Gini Values on Test data
Note: I was not able to build the random forest model due to a RAM memory
issue in my system, so I am building a Naïve Bayes model instead.
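The mechanics behind the Naïve Bayes output below (the "a-priori probabilities" are the class priors, and each "conditional probabilities" block holds the per-class mean and sd of a numeric feature) can be sketched by hand in base R. Iris stands in for the loan data here, which is an assumption:

```r
# Fit: per-class mean and sd of one numeric feature.
gnb_fit <- function(x, y) {
  lapply(split(x, y), function(v) c(mean = mean(v), sd = sd(v)))
}

# Predict: posterior ~ prior * Gaussian likelihood; pick the largest.
gnb_predict <- function(model, prior, xnew) {
  post <- sapply(names(model), function(k)
    prior[[k]] * dnorm(xnew, model[[k]]["mean"], model[[k]]["sd"]))
  names(model)[which.max(post)]
}

y     <- factor(ifelse(iris$Species == "setosa", "1", "0"))
model <- gnb_fit(iris$Petal.Length, y)
prior <- prop.table(table(y))   # class priors, like the a-priori table below
gnb_predict(model, prior, 1.4)  # short petal -> class "1"
```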
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.91594331 0.08405669
Conditional probabilities:
LoanProfile
Y [,1] [,2]
0 0.01504695 1.0129202
1 -0.19902396 0.8010333
PrincipleBal
Y [,1] [,2]
0 -0.2448386 0.2817802
1 2.6740564 1.8088103
CreditLines
Y [,1] [,2]
0 0.02991229 0.9857868
1 -0.32436093 1.0909482
DebtBurden
Y [,1] [,2]
0 0.04074323 0.9703017
1 -0.43759842 1.2151635
RevBalInq
Y [,1] [,2]
0 -0.002209475 0.9901906
1 0.014145492 1.0946685
AnnlIncome
Y [,1] [,2]
0 0.01173404 0.9835334
1 -0.12651457 1.2619096
Delinq2Years
Y [,1] [,2]
0 0.004783965 0.9689868
1 -0.037316490 1.2580965
term
Y 36 months 60 months
0 0.8070575 0.1929425
1 0.6145084 0.3854916
grade
Y A B C D E F G
0 0.191065018 0.320255010 0.254136693 0.144718925 0.062074467 0.022241173 0.005508714
1 0.047811751 0.178806954 0.293989808 0.243705036 0.152502998 0.063998801 0.019184652
emplength
Y < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
0 0.08207364 0.06675791 0.30644540 0.09415017 0.08102142 0.06469472 0.07139320 0.05752858
1 0.08505695 0.06864508 0.30987710 0.08820444 0.08258393 0.05912770 0.06040168 0.05110911
emplength
Y 7 years 8 years 9 years n/a
0 0.05568546 0.04657304 0.03796267 0.03571379
1 0.05065947 0.04946043 0.03844424 0.05642986
homeown
Y ANY MORTGAGE NONE OTHER OWN RENT
0 6.877295e-06 5.055706e-01 1.650551e-04 5.914474e-04 8.611061e-02 4.075554e-01
1 0.000000e+00 4.386990e-01 0.000000e+00 0.000000e+00 1.034922e-01 4.578088e-01
verifstatus
Y Not Verified Source Verified Verified
0 0.3562164 0.2903457 0.3534380
1 0.2162020 0.4089478 0.3748501
purpose
Y car credit_card debt_consolidation educational home_improvement house
0 0.0154257734 0.2037742597 0.5812346121 0.0012722996 0.0603826527 0.0066090808
1 0.0066696643 0.1838279376 0.6356414868 0.0000000000 0.0587529976 0.0056954436
purpose
Y major_purchase medical moving other renewable_energy small_business
0 0.0261062129 0.0110862000 0.0077850983 0.0545850928 0.0009490668 0.0161822758
1 0.0183603118 0.0092176259 0.0075689448 0.0520833333 0.0009742206 0.0152877698
purpose
Y vacation wedding
0 0.0064302711 0.0081771041
1 0.0054706235 0.0004496403
Addrstate
Y           AK           AL           AR           AZ           CA           CO           CT
0 2.840323e-03 1.206278e-02 6.684731e-03 2.449005e-02 1.708939e-01 2.374730e-02 1.477243e-02
1 2.323141e-03 1.551259e-02 8.018585e-03 2.278177e-02 1.435102e-01 1.663669e-02 1.408873e-02
Addrstate
Y           DC           DE           FL           GA           HI           IA           ID
0 3.576194e-03 2.737164e-03 6.718430e-02 3.203444e-02 5.694401e-03 2.750918e-05 2.063189e-05
1 1.423861e-03 3.447242e-03 7.441547e-02 3.050060e-02 7.119305e-03 0.000000e+00 0.000000e+00
Addrstate
Y           IL           IN           KS           KY           LA           MA           MD
0 3.669725e-02 1.045349e-02 8.294018e-03 8.899220e-03 1.144382e-02 2.475139e-02 2.378856e-02
1 3.207434e-02 1.513789e-02 5.845324e-03 6.669664e-03 1.371403e-02 2.263189e-02 2.585432e-02
Addrstate
Y           ME           MI           MN           MO           MS           MT           NC
0 6.189566e-05 2.353410e-02 1.756461e-02 1.519195e-02 1.602410e-03 3.136047e-03 2.715156e-02
1 0.000000e+00 2.517986e-02 1.671163e-02 1.266487e-02 5.320743e-03 2.847722e-03 3.095024e-02
Addrstate
Y           ND           NE           NH           NJ           NM           NV           NY
0 3.438648e-05 1.444232e-04 4.814107e-03 3.718554e-02 5.364290e-03 1.440106e-02 8.343535e-02
1 7.494005e-05 9.742206e-04 3.597122e-03 3.829436e-02 6.294964e-03 1.708633e-02 9.854616e-02
Addrstate
Y           OH           OK           OR           PA           RI           SC           SD
0 3.028073e-02 8.438441e-03 1.352764e-02 3.249522e-02 4.325819e-03 1.139568e-02 2.193857e-03
1 2.997602e-02 1.214029e-02 1.034173e-02 3.672062e-02 4.721223e-03 1.101619e-02 2.847722e-03
Addrstate
Y           TN           TX           UT           VA           VT           WA           WI
0 8.816693e-03 7.855247e-02 8.500337e-03 3.151864e-02 1.726201e-03 2.368541e-02 1.242040e-02
1 1.746103e-02 7.868705e-02 7.044365e-03 3.237410e-02 1.049161e-03 1.828537e-02 1.109113e-02
Addrstate
Y           WV           WY
0 4.814107e-03 2.592740e-03
1 4.571343e-03 1.423861e-03
Reference
Prediction 0 1
0 144004 1528
1 1402 11816
Accuracy : 0.9815
95% CI : (0.9809, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8796
Mcnemar's Test P-Value : 0.02093
Sensitivity : 0.88549
Specificity : 0.99036
Pos Pred Value : 0.89393
Neg Pred Value : 0.98950
Prevalence : 0.08406
Detection Rate : 0.07443
Detection Prevalence : 0.08326
Balanced Accuracy : 0.93792
'Positive' Class : 1
Reference
Prediction 0 1
0 61703 666
1 614 5053
Accuracy : 0.9812
95% CI : (0.9801, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8773
Sensitivity : 0.88355
Specificity : 0.99015
Pos Pred Value : 0.89165
Neg Pred Value : 0.98932
Prevalence : 0.08406
Detection Rate : 0.07427
Detection Prevalence : 0.08329
Balanced Accuracy : 0.93685
'Positive' Class : 1
Comparison of Performance measures for CART, Logistic and Naïve Bayes Model
After comparing the performance measures for all three models, all measures are better for logistic
regression except sensitivity, which is better in the CART model and may be further improved in
logistic regression as well.
Loan profile (loan amount & loan repayment details), annual income, debt
burden (rate of interest & debt-to-income ratio), credit lines, outstanding principal,
a loan term of 60 months, grade, delinquency status and revolving balances are the
most significant drivers in predicting the probability of default.
1) Banks should take care while giving loans to lower-income customers.
2) Banks should look at the customer's current debt burden before sanctioning loans.
3) Banks should take more care while giving loans to customers who accept a higher rate of
interest.
4) Banks should review applications for higher loan amounts.
5) Banks should take care while giving loans to lower-grade customers.
6) From the credit score, banks should track the repayment status of credit lines.
7) Banks should take care while sanctioning loans to people who have a high number of credit
lines opened.
Ensemble Methods
Bagging Method:
> train_bagging<-bagging(as.numeric(Default)
~.,data=train_ensmb,control=rpart.control(maxdepth=5, minsplit=4))
> bagging_pred_train<-predict(train_bagging,train_ensmb)
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.5)
> tabbag_train
TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.6)
> tabbag_train
TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.7)
> tabbag_train
TRUE
0 145406
1 13344
Bagging is a highly overfit model here, because it is also fitting some of the noise in the data.
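Bagging itself is just bootstrap resampling plus averaging of tree predictions, which the transcript above delegates to the bagging() function. A minimal hand-rolled sketch with rpart base learners (iris with a synthetic Default label stands in for the loan data, which is an assumption):

```r
library(rpart)

set.seed(42)
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# Fit 25 trees on bootstrap samples and average their class-1 probabilities.
probs <- replicate(25, {
  boot <- df[sample(nrow(df), replace = TRUE), ]
  fit  <- rpart(Default ~ ., data = boot, method = "class")
  predict(fit, df)[, "1"]
})
bag_prob <- rowMeans(probs)

# Threshold the averaged probability, as with bagging_pred_train > 0.5 above.
table(df$Default, bag_prob > 0.5)
```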
Reference
Prediction 0 1
0 145303 597
1 103 12747
Accuracy : 0.9956
95% CI : (0.9953, 0.9959)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9709
Sensitivity : 0.95526
Specificity : 0.99929
Pos Pred Value : 0.99198
Neg Pred Value : 0.99591
Prevalence : 0.08406
Detection Rate : 0.08030
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97728
'Positive' Class : 1
Reference
Prediction 0 1
0 62257 272
1 60 5447
Accuracy : 0.9951
95% CI : (0.9946, 0.9956)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9678
Sensitivity : 0.95244
Specificity : 0.99904
Pos Pred Value : 0.98910
Neg Pred Value : 0.99565
Prevalence : 0.08406
Detection Rate : 0.08006
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97574
'Positive' Class : 1
Best eta = 1
> tp_xgb <- vector()
> lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
> md <- c(1, 3, 5, 7, 9, 15)
> nr <- c(2, 50, 100, 1000, 10000)
>
> for(i in lr) {
+ xgb.fit <- xgboost(
+ data = features_train,
+ label = label_train,
+ eta = i,
+ max_depth = 6,
+ nrounds = 5,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 0)
+
+ train_xgb.pred.class <- predict(xgb.fit, features_train)
+
+ tp_xgb <- cbind(tp_xgb, sum(train_xgb$Default == 1 & train_xgb.pred.class > 0.5))
+ }
>
> tp_xgb
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 12781 12804 12831 12918 12993 13031 13099
Reference
Prediction 0 1
0 145406 0
1 0 13344
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Sensitivity : 1.00000
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 1.00000
Prevalence : 0.08406
Detection Rate : 0.08406
Detection Prevalence : 0.08406
Balanced Accuracy : 1.00000
'Positive' Class : 1
Reference
Prediction 0 1
0 62285 109
1 32 5610
Accuracy : 0.9979
95% CI : (0.9976, 0.9983)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9865
Sensitivity : 0.98094
Specificity : 0.99949
Pos Pred Value : 0.99433
Neg Pred Value : 0.99825
Prevalence : 0.08406
Detection Rate : 0.08246
Detection Prevalence : 0.08293
Balanced Accuracy : 0.99021
'Positive' Class : 1
I have tuned the XGBoost model based on the training data, which is why we are getting 100%
performance measures for the training data set in the XGBoost model after tuning. After tuning, the
performance measures for the test data also improved.
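To avoid the 100%-on-training artefact, hyperparameters should be chosen on a held-out validation split (or via cross-validation, e.g. xgb.cv) rather than on the training data itself. A sketch of the idea using an rpart complexity-parameter search on stand-in data (iris with a synthetic Default label is an assumption):

```r
library(rpart)

set.seed(42)
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# Split once: tune on 'valid', never on 'train' itself.
idx   <- sample(nrow(df), size = round(0.7 * nrow(df)))
train <- df[idx, ]
valid <- df[-idx, ]

# Pick the hyperparameter that maximises *validation* accuracy.
cps <- c(0.001, 0.01, 0.1)
acc <- sapply(cps, function(cp) {
  fit <- rpart(Default ~ ., data = train, method = "class",
               control = rpart.control(cp = cp))
  mean(predict(fit, valid, type = "class") == valid$Default)
})
best_cp <- cps[which.max(acc)]
```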