Bartlett Test To Check Whether The Data Is Suitable For PCA
1) Model Building: CART, Logistic Regression and Naïve Bayes models developed after PCA.
2) Ensemble Methods: Bagging and XGBoost models developed.
$chisq
[1] 5402.731
$p.value
[1] 0
$df
[1] 190
The p-value is less than 0.05, so the data is suitable for PCA.
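The chi-square statistic above can be reproduced directly in base R. A minimal sketch, using a small synthetic data set as a stand-in for the project's correlation matrix (the stand-in data is an assumption for illustration):

```r
# Bartlett's test of sphericity: H0 says the correlation matrix is an
# identity matrix, i.e. the variables are uncorrelated and PCA is pointless.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, p.value = pchisq(chisq, df, lower.tail = FALSE), df = df)
}

# Synthetic stand-in for the loan data: 5 variables, two of them correlated.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)
res <- bartlett_sphericity(cor(x), n = 200)
res$p.value < 0.05  # TRUE: correlation is present, so PCA is worthwhile
```

The degrees of freedom are p(p − 1)/2 pairs of variables, which matches the df = 190 reported above for the 20 loan variables.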
KMO Test
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corrmatrix1)
Overall MSA = 0.78
MSA for each item =
loan_amnt funded_amnt funded_amnt_inv int_rate installment
annual_inc
1.00 0.77 0.77 0.69 0.97
0.88
dti delinq_2yrs inq_last_6mths open_acc revol_bal
revol_util
0.64 0.53 0.48 0.67 0.88
0.60
total_acc out_prncp out_prncp_inv total_pymnt total_pymnt_inv
total_rec_prncp
0.71 0.59 0.59 0.76 0.77
0.76
total_rec_int last_pymnt_amnt
0.62 0.93
As the overall MSA (0.78) is more than 0.5, the data set has an adequate sample to
perform PCA.
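The overall MSA can also be computed by hand from the anti-image (partial) correlations, which is what KMO() does internally. A minimal base-R sketch on the same kind of synthetic stand-in data (the stand-in is an assumption):

```r
# KMO overall MSA: squared correlations relative to squared correlations
# plus squared partial correlations (off-diagonal anti-image values).
kmo_overall <- function(R) {
  Ri <- solve(R)                                # inverse correlation matrix
  Q  <- -Ri / sqrt(outer(diag(Ri), diag(Ri)))   # anti-image partial correlations
  diag(R) <- 0
  diag(Q) <- 0
  sum(R^2) / (sum(R^2) + sum(Q^2))
}

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)
msa <- kmo_overall(cor(x))
```

Values closer to 1 mean the correlations are largely explained by shared factors rather than by pairwise partial relationships, which is why 0.5 is the usual lower bound for PCA.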
Scree Plot to know the number of components to be considered.
As per the scree plot, we consider the components with eigenvalues greater than one. From the above
plot it is clearly evident that we have 7 components.
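The eigenvalue-greater-than-one rule (the Kaiser criterion) behind the scree plot can be sketched as follows, again on synthetic stand-in data:

```r
# Eigenvalues of the correlation matrix: their sum equals the number of
# variables, and a component with eigenvalue > 1 explains more variance
# than any single standardised variable does on its own.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)

ev <- eigen(cor(x))$values
n_components <- sum(ev > 1)

plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser cut-off
```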
Loadings:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
loan_amnt 0.969
funded_amnt 0.971
funded_amnt_inv 0.969
int_rate 0.631
installment 0.940
annual_inc 0.681
dti 0.556
delinq_2yrs 0.847
inq_last_6mths 0.660
open_acc 0.818
revol_bal -0.511
revol_util 0.643
total_acc 0.812
out_prncp 0.959
out_prncp_inv 0.959
total_pymnt 0.958
total_pymnt_inv 0.957
total_rec_prncp 0.938
total_rec_int 0.685
last_pymnt_amnt 0.700
The tree is formed primarily on the basis of outstanding principal, the loan profile (loan amount &
repayment history), the debt burden (rate of interest & DTI), the credit lines opened, and the
delinquency status.
Observations
1) If a person has a higher outstanding principal, that person is more likely to default.
2) The higher the loan amount, the higher the chances of default.
3) If a person does not have a good loan repayment history, there is a high chance of default.
4) If a person is borrowing at a higher rate of interest and has a high DTI, there is a chance of
default.
5) If a person has a higher number of credit lines opened, there is a higher chance of default.
6) If a person has a continuous delinquency record during the last 2 years, there is a higher chance of
default.
The bank should focus on persons who apply for higher loan amounts at a higher ROI, who have a high
DTI, who do not have a good repayment record, and who have a higher number of credit lines opened,
and it should also watch the delinquency status.
CP Table:
Classification tree:
rpart(formula = Default ~ ., data = train_data, method = "class",
minsplit = 100, minbucket = 33, cp = 0)
n= 158750
The lowest xerror occurs at the final node, which means pruning of the CART model is not required.
Prediction using the CART model on both the training data set and the test data set.
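The rpart call above, with its minsplit/minbucket/cp controls, can be sketched end-to-end on stand-in data (iris with a synthetic binary Default label is an assumption; the real model uses the PCA components):

```r
library(rpart)

# Synthetic stand-in: treat 'setosa' as the defaulting class.
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# cp = 0 grows the full tree, exactly as in the report's rpart call;
# the CP table then guides whether pruning is needed.
fit <- rpart(Default ~ ., data = df, method = "class",
             control = rpart.control(minsplit = 20, cp = 0))
pred <- predict(fit, df, type = "class")
mean(pred == df$Default)  # in-sample accuracy
```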
CP Plot
> confusionMatrix(CART_train_predict,train_data$Default)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 145219 460
1 187 12884
Accuracy : 0.9959
95% CI : (0.9956, 0.9962)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9733
Sensitivity : 0.9987
Specificity : 0.9655
Pos Pred Value : 0.9968
Neg Pred Value : 0.9857
Prevalence : 0.9159
Detection Rate : 0.9148
Detection Prevalence : 0.9177
Balanced Accuracy : 0.9821
'Positive' Class : 0
# Prediction on Test data.
> confusionMatrix(CART_test_predict,test_data$Default)
Reference
Prediction 0 1
0 62233 220
1 84 5499
Accuracy : 0.9955
95% CI : (0.995, 0.996)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9707
Sensitivity : 0.9987
Specificity : 0.9615
Pos Pred Value : 0.9965
Neg Pred Value : 0.9850
Prevalence : 0.9159
Detection Rate : 0.9147
Detection Prevalence : 0.9179
Balanced Accuracy : 0.9801
'Positive' Class : 0
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9939686
Slot "alpha.values":
list()
ROC Curve on Training Data Set
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9917786
Slot "alpha.values":
list()
ROC Curve for Test data.
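The AUC values reported in the ROCR slots above can be reproduced without ROCR via the rank (Mann–Whitney) formulation; a minimal base-R sketch:

```r
# AUC as the probability that a randomly chosen positive case is scored
# above a randomly chosen negative case (Mann-Whitney on score ranks).
auc <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))  # perfectly separated -> 1
```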
Building of Logistic Regression Model
Call:
glm(formula = Default ~ ., family = "binomial", data = train_glm[,
-c(11, 13)])
Deviance Residuals:
Min 1Q Median 3Q Max
-3.9643 -0.0365 -0.0103 -0.0009 6.7656
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.98977 0.97992 -1.010 0.312470
LoanProfile -1.78072 0.11259 -15.816 < 2e-16 ***
PrincipleBal 26.36783 0.64769 40.710 < 2e-16 ***
CreditLines -3.91103 0.12789 -30.582 < 2e-16 ***
DebtBurden -4.41904 0.14386 -30.718 < 2e-16 ***
RevBalInq 0.62871 0.08009 7.850 4.16e-15 ***
AnnlIncome -0.96195 0.04989 -19.280 < 2e-16 ***
Delinq2Years -0.77095 0.05518 -13.971 < 2e-16 ***
term60 months -1.43550 0.34107 -4.209 2.57e-05 ***
gradeB -0.63446 0.29829 -2.127 0.033423 *
gradeC -1.77531 0.34393 -5.162 2.45e-07 ***
gradeD -2.77377 0.39693 -6.988 2.79e-12 ***
gradeE -3.38198 0.47332 -7.145 8.98e-13 ***
gradeF -4.56814 0.63313 -7.215 5.39e-13 ***
gradeG -3.95425 1.13365 -3.488 0.000487 ***
emplength1 year -0.44368 0.32801 -1.353 0.176161
emplength10+ years -0.23010 0.24026 -0.958 0.338220
emplength2 years -0.94844 0.32202 -2.945 0.003227 **
emplength3 years -0.18276 0.29258 -0.625 0.532194
emplength4 years -0.49274 0.33281 -1.481 0.138730
emplength5 years -0.47538 0.32595 -1.458 0.144721
emplength6 years -0.25602 0.33402 -0.766 0.443397
emplength7 years 0.01780 0.32996 0.054 0.956990
emplength8 years -0.15582 0.36468 -0.427 0.669165
emplength9 years -0.47669 0.45134 -1.056 0.290888
emplengthn/a 0.18745 0.33982 0.552 0.581216
verifstatusSource Verified -0.57146 0.18561 -3.079 0.002078 **
verifstatusVerified -0.43758 0.17022 -2.571 0.010151 *
AddrstateAL -1.09489 1.09977 -0.996 0.319464
AddrstateAR -0.89128 1.32607 -0.672 0.501506
AddrstateAZ -0.93340 1.05562 -0.884 0.376578
AddrstateCA -0.31088 0.93133 -0.334 0.738525
AddrstateCO -0.84412 1.05379 -0.801 0.423116
AddrstateCT 0.36271 1.01333 0.358 0.720392
AddrstateDC -0.12580 1.35121 -0.093 0.925821
AddrstateDE 0.14886 1.36774 0.109 0.913332
AddrstateFL -0.72136 0.95624 -0.754 0.450628
AddrstateGA -0.81139 1.01475 -0.800 0.423946
AddrstateHI -4.30599 1.99911 -2.154 0.031244 *
AddrstateIA -9.08019 703.35365 -0.013 0.989700
AddrstateID -8.57882 807.60575 -0.011 0.991525
AddrstateIL -0.35229 0.97830 -0.360 0.718769
AddrstateIN 0.01799 1.13778 0.016 0.987384
AddrstateKS -1.79813 1.43052 -1.257 0.208762
AddrstateKY -1.12512 1.29792 -0.867 0.386017
AddrstateLA -1.39298 1.19343 -1.167 0.243126
AddrstateMA -0.73322 1.11967 -0.655 0.512562
AddrstateMD -0.35605 1.00736 -0.353 0.723754
AddrstateME -7.90760 395.78763 -0.020 0.984060
AddrstateMI -0.25938 0.99960 -0.259 0.795262
AddrstateMN -0.60712 1.07424 -0.565 0.571961
AddrstateMO -2.28416 1.38567 -1.648 0.099267 .
AddrstateMS -3.88952 2.64692 -1.469 0.141711
AddrstateMT -3.60489 4.84782 -0.744 0.457112
AddrstateNC -0.01165 0.96958 -0.012 0.990414
AddrstateND -9.88009 707.38497 -0.014 0.988856
AddrstateNE -11.84881 438.60916 -0.027 0.978448
AddrstateNH -0.82337 1.45929 -0.564 0.572600
AddrstateNJ -0.17403 0.97002 -0.179 0.857614
AddrstateNM -1.36861 1.53001 -0.895 0.371048
AddrstateNV 0.07293 1.01150 0.072 0.942523
AddrstateNY -0.55970 0.94880 -0.590 0.555254
AddrstateOH -0.31488 0.98030 -0.321 0.748049
AddrstateOK -1.93305 1.30506 -1.481 0.138553
AddrstateOR -0.42781 1.08921 -0.393 0.694488
AddrstatePA -0.99815 1.02293 -0.976 0.329178
AddrstateRI -0.53969 1.38520 -0.390 0.696825
AddrstateSC -0.12273 1.07585 -0.114 0.909178
AddrstateSD -4.10813 6.81218 -0.603 0.546471
AddrstateTN -1.18543 1.34118 -0.884 0.376765
AddrstateTX -0.23481 0.94520 -0.248 0.803809
AddrstateUT -0.77069 1.28298 -0.601 0.548038
AddrstateVA -0.24881 0.98286 -0.253 0.800153
AddrstateVT -0.80716 1.56293 -0.516 0.605547
AddrstateWA -0.43907 1.01179 -0.434 0.664322
AddrstateWI -0.89688 1.25158 -0.717 0.473619
AddrstateWV -0.69732 1.36134 -0.512 0.608491
AddrstateWY -1.56135 2.15259 -0.725 0.468245
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above logistic regression model it is clearly evident that variables such as the loan profile
(loan amount & repayment status), outstanding principal, credit lines, debt burden (ROI and DTI),
annual income, revolving credit utilization, grade, the 60-month term and delinquency during the last
2 years are the most significant variables for predicting the probability of default of a customer.
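Exponentiating the logistic coefficients turns them into odds ratios, which is often the easier way to read the direction and strength of each effect. A sketch on the built-in mtcars data (a stand-in, since the loan data is not reproduced here):

```r
# Logistic regression on built-in data; exp(coef) gives odds ratios:
# values below 1 reduce the odds of the outcome, values above 1 raise them.
fit <- glm(am ~ wt + hp, family = "binomial", data = mtcars)
odds_ratios <- exp(coef(fit))
odds_ratios
```

Read against the table above, a strongly positive coefficient such as PrincipleBal corresponds to an odds ratio far above 1, i.e. higher outstanding principal sharply raises the odds of default.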
Logistic Regression Prediction on Train Data
Confusion matrix for prediction train data
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 145359 211
1 47 13133
Accuracy : 0.9984
95% CI : (0.9982, 0.9986)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9894
Sensitivity : 0.98419
Specificity : 0.99968
Pos Pred Value : 0.99643
Neg Pred Value : 0.99855
Prevalence : 0.08406
Detection Rate : 0.08273
Detection Prevalence : 0.08302
Balanced Accuracy : 0.99193
'Positive' Class : 1
[1] 0.9983739
# KS and Gini Values on Train data
> ks.train.glm <- performance(train.roc.glm, "tpr", "fpr")
> train.ks.glm <- max(attr(ks.train.glm, "y.values")[[1]] - (attr(ks.train.glm, "x.values")[[1]]))
> train.ks.glm
[1] 0.9907343
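The KS statistic computed above is the maximum separation between the cumulative TPR and FPR, and the Gini coefficient follows directly from the AUC as 2·AUC − 1. A minimal base-R sketch of both:

```r
ks_gini <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  lab <- labels[ord]
  tpr <- cumsum(lab == 1) / sum(lab == 1)   # cumulative true positive rate
  fpr <- cumsum(lab == 0) / sum(lab == 0)   # cumulative false positive rate

  # AUC via the rank (Mann-Whitney) formulation
  r   <- rank(scores)
  n1  <- sum(labels == 1)
  n0  <- sum(labels == 0)
  auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  c(ks = max(tpr - fpr), gini = 2 * auc - 1)
}

ks_gini(c(0.9, 0.8, 0.2, 0.1), c(1, 1, 0, 0))  # perfect model: ks = 1, gini = 1
```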
Accuracy : 0.9981
95% CI : (0.9978, 0.9984)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9878
Sensitivity : 0.98112
Specificity : 0.99970
Pos Pred Value : 0.99663
Neg Pred Value : 0.99827
Prevalence : 0.08406
Detection Rate : 0.08247
Detection Prevalence : 0.08275
Balanced Accuracy : 0.99041
'Positive' Class : 1
KS and Gini Values on Test data
Note: I was not able to build the random forest model due to a RAM memory
issue in my system, so I am building a Naïve Bayes model instead.
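The mechanics behind the Naïve Bayes output below (the "a-priori probabilities" are the class priors, and each "conditional probabilities" block holds the per-class mean and sd of a numeric feature) can be sketched by hand in base R. Iris stands in for the loan data here, which is an assumption:

```r
# Fit: per-class mean and sd of one numeric feature.
gnb_fit <- function(x, y) {
  lapply(split(x, y), function(v) c(mean = mean(v), sd = sd(v)))
}

# Predict: posterior ~ prior * Gaussian likelihood; pick the largest.
gnb_predict <- function(model, prior, xnew) {
  post <- sapply(names(model), function(k)
    prior[[k]] * dnorm(xnew, model[[k]]["mean"], model[[k]]["sd"]))
  names(model)[which.max(post)]
}

y     <- factor(ifelse(iris$Species == "setosa", "1", "0"))
model <- gnb_fit(iris$Petal.Length, y)
prior <- prop.table(table(y))   # class priors, like the a-priori table below
gnb_predict(model, prior, 1.4)  # short petal -> class "1"
```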
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.91594331 0.08405669
Conditional probabilities:
LoanProfile
Y [,1] [,2]
0 0.01504695 1.0129202
1 -0.19902396 0.8010333
PrincipleBal
Y [,1] [,2]
0 -0.2448386 0.2817802
1 2.6740564 1.8088103
CreditLines
Y [,1] [,2]
0 0.02991229 0.9857868
1 -0.32436093 1.0909482
DebtBurden
Y [,1] [,2]
0 0.04074323 0.9703017
1 -0.43759842 1.2151635
RevBalInq
Y [,1] [,2]
0 -0.002209475 0.9901906
1 0.014145492 1.0946685
AnnlIncome
Y [,1] [,2]
0 0.01173404 0.9835334
1 -0.12651457 1.2619096
Delinq2Years
Y [,1] [,2]
0 0.004783965 0.9689868
1 -0.037316490 1.2580965
term
Y 36 months 60 months
0 0.8070575 0.1929425
1 0.6145084 0.3854916
grade
Y A B C D E F G
0 0.191065018 0.320255010 0.254136693 0.144718925 0.062074467 0.022241173 0.005508714
1 0.047811751 0.178806954 0.293989808 0.243705036 0.152502998 0.063998801 0.019184652
emplength
Y < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
0 0.08207364 0.06675791 0.30644540 0.09415017 0.08102142 0.06469472 0.07139320 0.05752858
1 0.08505695 0.06864508 0.30987710 0.08820444 0.08258393 0.05912770 0.06040168 0.05110911
emplength
Y 7 years 8 years 9 years n/a
0 0.05568546 0.04657304 0.03796267 0.03571379
1 0.05065947 0.04946043 0.03844424 0.05642986
homeown
Y ANY MORTGAGE NONE OTHER OWN RENT
0 6.877295e-06 5.055706e-01 1.650551e-04 5.914474e-04 8.611061e-02 4.075554e-01
1 0.000000e+00 4.386990e-01 0.000000e+00 0.000000e+00 1.034922e-01 4.578088e-01
verifstatus
Y Not Verified Source Verified Verified
0 0.3562164 0.2903457 0.3534380
1 0.2162020 0.4089478 0.3748501
purpose
Y car credit_card debt_consolidation educational home_improvement house
0 0.0154257734 0.2037742597 0.5812346121 0.0012722996 0.0603826527 0.0066090808
1 0.0066696643 0.1838279376 0.6356414868 0.0000000000 0.0587529976 0.0056954436
purpose
Y major_purchase medical moving other renewable_energy small_business
0 0.0261062129 0.0110862000 0.0077850983 0.0545850928 0.0009490668 0.0161822758
1 0.0183603118 0.0092176259 0.0075689448 0.0520833333 0.0009742206 0.0152877698
purpose
Y vacation wedding
0 0.0064302711 0.0081771041
1 0.0054706235 0.0004496403
Addrstate
Y           AK           AL           AR           AZ           CA           CO           CT
0 2.840323e-03 1.206278e-02 6.684731e-03 2.449005e-02 1.708939e-01 2.374730e-02 1.477243e-02
1 2.323141e-03 1.551259e-02 8.018585e-03 2.278177e-02 1.435102e-01 1.663669e-02 1.408873e-02
Addrstate
Y           DC           DE           FL           GA           HI           IA           ID
0 3.576194e-03 2.737164e-03 6.718430e-02 3.203444e-02 5.694401e-03 2.750918e-05 2.063189e-05
1 1.423861e-03 3.447242e-03 7.441547e-02 3.050060e-02 7.119305e-03 0.000000e+00 0.000000e+00
Addrstate
Y           IL           IN           KS           KY           LA           MA           MD
0 3.669725e-02 1.045349e-02 8.294018e-03 8.899220e-03 1.144382e-02 2.475139e-02 2.378856e-02
1 3.207434e-02 1.513789e-02 5.845324e-03 6.669664e-03 1.371403e-02 2.263189e-02 2.585432e-02
Addrstate
Y           ME           MI           MN           MO           MS           MT           NC
0 6.189566e-05 2.353410e-02 1.756461e-02 1.519195e-02 1.602410e-03 3.136047e-03 2.715156e-02
1 0.000000e+00 2.517986e-02 1.671163e-02 1.266487e-02 5.320743e-03 2.847722e-03 3.095024e-02
Addrstate
Y           ND           NE           NH           NJ           NM           NV           NY
0 3.438648e-05 1.444232e-04 4.814107e-03 3.718554e-02 5.364290e-03 1.440106e-02 8.343535e-02
1 7.494005e-05 9.742206e-04 3.597122e-03 3.829436e-02 6.294964e-03 1.708633e-02 9.854616e-02
Addrstate
Y           OH           OK           OR           PA           RI           SC           SD
0 3.028073e-02 8.438441e-03 1.352764e-02 3.249522e-02 4.325819e-03 1.139568e-02 2.193857e-03
1 2.997602e-02 1.214029e-02 1.034173e-02 3.672062e-02 4.721223e-03 1.101619e-02 2.847722e-03
Addrstate
Y           TN           TX           UT           VA           VT           WA           WI
0 8.816693e-03 7.855247e-02 8.500337e-03 3.151864e-02 1.726201e-03 2.368541e-02 1.242040e-02
1 1.746103e-02 7.868705e-02 7.044365e-03 3.237410e-02 1.049161e-03 1.828537e-02 1.109113e-02
Addrstate
Y           WV           WY
0 4.814107e-03 2.592740e-03
1 4.571343e-03 1.423861e-03
Reference
Prediction 0 1
0 144004 1528
1 1402 11816
Accuracy : 0.9815
95% CI : (0.9809, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8796
Mcnemar's Test P-Value : 0.02093
Sensitivity : 0.88549
Specificity : 0.99036
Pos Pred Value : 0.89393
Neg Pred Value : 0.98950
Prevalence : 0.08406
Detection Rate : 0.07443
Detection Prevalence : 0.08326
Balanced Accuracy : 0.93792
'Positive' Class : 1
Reference
Prediction 0 1
0 61703 666
1 614 5053
Accuracy : 0.9812
95% CI : (0.9801, 0.9822)
No Information Rate : 0.9159
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8773
Sensitivity : 0.88355
Specificity : 0.99015
Pos Pred Value : 0.89165
Neg Pred Value : 0.98932
Prevalence : 0.08406
Detection Rate : 0.07427
Detection Prevalence : 0.08329
Balanced Accuracy : 0.93685
'Positive' Class : 1
Comparison of Performance measures for CART, Logistic and Naïve Bayes Model
After comparing the performance measures for all three models, all measures are better for logistic
regression except sensitivity, which is better in the CART model and may be further improved in
logistic regression as well.
Loan profile (loan amount & loan repayment details), annual income, debt
burden (rate of interest & debt-to-income ratio), credit lines, outstanding principal,
a loan term of 60 months, grade, delinquency status and revolving balances are the
most significant drivers in predicting the probability of default.
1) Banks should take care while giving loans to lower-income customers.
2) Banks should look at the customer's current debt burden before sanctioning loans.
3) Banks should take more care while giving loans to customers who accept a higher rate of
interest.
4) Banks should review applications for higher loan amounts.
5) Banks should take care while giving loans to lower-grade customers.
6) From the credit score, banks should track the repayment status of credit lines.
7) Banks should take care while sanctioning loans to people who have a high number of credit
lines opened.
Ensemble Methods
Bagging Method:
> train_bagging<-bagging(as.numeric(Default)
~.,data=train_ensmb,control=rpart.control(maxdepth=5, minsplit=4))
> bagging_pred_train<-predict(train_bagging,train_ensmb)
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.5)
> tabbag_train
TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.6)
> tabbag_train
TRUE
0 145406
1 13344
> tabbag_train <- table(train_ensmb$Default, bagging_pred_train>0.7)
> tabbag_train
TRUE
0 145406
1 13344
Bagging is a highly overfit model here, because it is also fitting some of the noise in the data.
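Bagging itself is just bootstrap resampling plus averaging of tree predictions, which the transcript above delegates to the bagging() function. A minimal hand-rolled sketch with rpart base learners (iris with a synthetic Default label stands in for the loan data, which is an assumption):

```r
library(rpart)

set.seed(42)
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# Fit 25 trees on bootstrap samples and average their class-1 probabilities.
probs <- replicate(25, {
  boot <- df[sample(nrow(df), replace = TRUE), ]
  fit  <- rpart(Default ~ ., data = boot, method = "class")
  predict(fit, df)[, "1"]
})
bag_prob <- rowMeans(probs)

# Threshold the averaged probability, as with bagging_pred_train > 0.5 above.
table(df$Default, bag_prob > 0.5)
```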
Reference
Prediction 0 1
0 145303 597
1 103 12747
Accuracy : 0.9956
95% CI : (0.9953, 0.9959)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9709
Sensitivity : 0.95526
Specificity : 0.99929
Pos Pred Value : 0.99198
Neg Pred Value : 0.99591
Prevalence : 0.08406
Detection Rate : 0.08030
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97728
'Positive' Class : 1
Reference
Prediction 0 1
0 62257 272
1 60 5447
Accuracy : 0.9951
95% CI : (0.9946, 0.9956)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9678
Sensitivity : 0.95244
Specificity : 0.99904
Pos Pred Value : 0.98910
Neg Pred Value : 0.99565
Prevalence : 0.08406
Detection Rate : 0.08006
Detection Prevalence : 0.08094
Balanced Accuracy : 0.97574
'Positive' Class : 1
Best eta = 1
> tp_xgb <- vector()
> lr <- c(0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 1)
> md <- c(1, 3, 5, 7, 9, 15)
> nr <- c(2, 50, 100, 1000, 10000)
>
> for(i in lr) {
+ xgb.fit <- xgboost(
+ data = features_train,
+ label = label_train,
+ eta = i,
+ max_depth = 6,
+ nrounds = 5,
+ nfold = 10,
+ objective = "binary:logistic",
+ verbose = 0,
+ early_stopping_rounds = 0)
+
+ train_xgb.pred.class <- predict(xgb.fit, features_train)
+
+ tp_xgb <- cbind(tp_xgb, sum(train_xgb$Default == 1 & train_xgb.pred.class > 0.5))
+ }
>
> tp_xgb
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 12781 12804 12831 12918 12993 13031 13099
Reference
Prediction 0 1
0 145406 0
1 0 13344
Accuracy : 1
95% CI : (1, 1)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Sensitivity : 1.00000
Specificity : 1.00000
Pos Pred Value : 1.00000
Neg Pred Value : 1.00000
Prevalence : 0.08406
Detection Rate : 0.08406
Detection Prevalence : 0.08406
Balanced Accuracy : 1.00000
'Positive' Class : 1
Reference
Prediction 0 1
0 62285 109
1 32 5610
Accuracy : 0.9979
95% CI : (0.9976, 0.9983)
No Information Rate : 0.9159
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9865
Sensitivity : 0.98094
Specificity : 0.99949
Pos Pred Value : 0.99433
Neg Pred Value : 0.99825
Prevalence : 0.08406
Detection Rate : 0.08246
Detection Prevalence : 0.08293
Balanced Accuracy : 0.99021
'Positive' Class : 1
I have tuned the XGBoost model based on the training data, which is why we are getting 100%
performance measures for the training data set in the XGBoost model after tuning. After tuning, the
performance measures for the test data also improved.
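To avoid the 100%-on-training artefact, hyperparameters should be chosen on a held-out validation split (or via cross-validation, e.g. xgb.cv) rather than on the training data itself. A sketch of the idea using an rpart complexity-parameter search on stand-in data (iris with a synthetic Default label is an assumption):

```r
library(rpart)

set.seed(42)
df <- iris
df$Default <- factor(ifelse(df$Species == "setosa", 1, 0))
df$Species <- NULL

# Split once: tune on 'valid', never on 'train' itself.
idx   <- sample(nrow(df), size = round(0.7 * nrow(df)))
train <- df[idx, ]
valid <- df[-idx, ]

# Pick the hyperparameter that maximises *validation* accuracy.
cps <- c(0.001, 0.01, 0.1)
acc <- sapply(cps, function(cp) {
  fit <- rpart(Default ~ ., data = train, method = "class",
               control = rpart.control(cp = cp))
  mean(predict(fit, valid, type = "class") == valid$Default)
})
best_cp <- cps[which.max(acc)]
```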