
Problem Statement

Based on the given loan data, can we identify the major factors or characteristics of a borrower that lead them to become delinquent?

• Delinquency is a key metric in assessing risk: as more and more customers become delinquent, the risk of customers defaulting also increases.

• The main objective is to minimize this risk. To do so, you need to build a decision tree model using the CART technique that identifies the risk and non-risk attributes of borrowers with respect to reaching the delinquent stage.

Importing libraries and Loading data

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

ld_df = pd.read_csv("Loan Delinquent Dataset.csv")

Checking the data

ld_df.head()

Dropping unwanted variables

The 'delinquent' and 'Sdelinquent' columns carry the same target information, so one of them is redundant; here we drop 'delinquent' ('Sdelinquent' could be dropped instead).
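Before dropping, a quick hypothetical check (assuming 'delinquent' is simply the string version of 'Sdelinquent') can confirm the two columns carry the same information:

# If the two columns agree, all counts fall on the diagonal of this crosstab
print(pd.crosstab(ld_df['delinquent'], ld_df['Sdelinquent']))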

ld_df=ld_df.drop(["ID","delinquent"],axis=1) 

ld_df.head()

ld_df.shape

ld_df.info() 

Many columns are of type object, i.e. strings. These need to be converted to numeric codes before modelling.

Getting unique counts of all object columns

print('term \n',ld_df.term.value_counts())
print('\n')
print('gender \n',ld_df.gender.value_counts())
print('\n')
print('purpose \n',ld_df.purpose.value_counts())
print('\n')
print('home_ownership \n',ld_df.home_ownership.value_counts())
print('\n')
print('age \n',ld_df.age.value_counts())
print('\n')
print('FICO \n',ld_df.FICO.value_counts())

Note:
A decision tree in scikit-learn can only take numerical / categorical columns. It cannot take string / object types.

The following code loops through each column and checks whether its type is object; if so, it converts the column to a categorical type, with each distinct value becoming a category, and then replaces the values with their integer codes.

for feature in ld_df.columns: 
    if ld_df[feature].dtype == 'object': 
        print('\n')
        print('feature:',feature)
        print(pd.Categorical(ld_df[feature].unique()))
        print(pd.Categorical(ld_df[feature].unique()).codes)
        ld_df[feature] = pd.Categorical(ld_df[feature]).codes

For each feature in the output above, the line of unique values and the line of codes directly below it give the value-to-code mapping. Ignore the line starting with 'Categories'.
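If you want the mappings recorded explicitly rather than read off the printed output, a sketch like the following (an illustrative alternative to the loop above, run on the raw object columns) stores a code-to-category dictionary per feature:

# Illustrative alternative: encode and keep the code -> category mapping
mappings = {}
for feature in ld_df.columns:
    if ld_df[feature].dtype == 'object':
        cat = pd.Categorical(ld_df[feature])
        mappings[feature] = dict(enumerate(cat.categories))
        ld_df[feature] = cat.codes
print(mappings)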

Comparing the unique counts from above

print('term \n',ld_df.term.value_counts())
print('\n')
print('gender \n',ld_df.gender.value_counts())
print('\n')
print('purpose \n',ld_df.purpose.value_counts())
print('\n')
print('home_ownership \n',ld_df.home_ownership.value_counts())
print('\n')
print('age \n',ld_df.age.value_counts())
print('\n')
print('FICO \n',ld_df.FICO.value_counts())

ld_df.info()

ld_df.head()

Label encoding is complete: all columns have been converted to numbers.
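An equivalent route uses scikit-learn's LabelEncoder; the sketch below assumes a hypothetical raw_df still holding the original string columns. Both LabelEncoder and pd.Categorical assign codes to the sorted unique values, so the resulting encodings agree:

from sklearn.preprocessing import LabelEncoder

# raw_df is a hypothetical copy of the data before encoding
encoded_df = raw_df.copy()
le = LabelEncoder()
for col in encoded_df.select_dtypes(include='object').columns:
    encoded_df[col] = le.fit_transform(encoded_df[col])
encoded_df.head()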

Proportion of 1s and 0s

ld_df.Sdelinquent.value_counts(normalize=True)

print(ld_df.Sdelinquent.value_counts())
print('% of 1s:', 7721/(7721+3827))
print('% of 0s:', 3827/(7721+3827))

Extracting the target column into a separate vector

# drop() returns a copy, so ld_df still contains 'Sdelinquent' here
X = ld_df.drop("Sdelinquent", axis=1)

# pop() removes 'Sdelinquent' from ld_df and returns it as a Series
y = ld_df.pop("Sdelinquent")

X.head()

Splitting data into training and test set

from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30, random_state=1)

Checking the dimensions of the training and test data

print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)
print('Total Obs',8083+3465)

Building a Decision Tree Classifier

# Initialise a Decision Tree Classifier and fit the model
# (settings assumed here: an unconstrained CART tree using the Gini
# criterion, matching the regularised model built later)
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, train_labels)

from sklearn import tree

train_char_label = ['No', 'Yes']
ld_Tree_File = open('ld_Tree_File.dot','w')
dot_data = tree.export_graphviz(dt_model, 
                                out_file=ld_Tree_File, 
                                feature_names = list(X_train), 
                                class_names = list(train_char_label))

ld_Tree_File.close()

The above code saves a .dot file in your working directory.
WebGraphviz is Graphviz in the browser: copy and paste the contents of the file at the link below to get the visualization.
http://webgraphviz.com/
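If you would rather stay inside the notebook, recent versions of scikit-learn (0.21+) can render the same tree directly with matplotlib; a minimal sketch:

import matplotlib.pyplot as plt
from sklearn import tree

# Render the fitted tree inline (no Graphviz installation needed)
plt.figure(figsize=(20, 10))
tree.plot_tree(dt_model,
               feature_names=list(X_train),
               class_names=train_char_label,
               filled=True)
plt.show()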

Variable Importance

print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values('Imp',ascending=False))

Predicting Test Data

# Predict on the test data with the unregularised tree
y_predict = dt_model.predict(X_test)
y_predict.shape

Regularising the Decision Tree

Adding Tuning Parameters

reg_dt_model = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=100, min_samples_split=1000, random_state=1)
reg_dt_model.fit(X_train, train_labels)

Here min_samples_leaf=100 forces every leaf to hold at least 100 observations, and min_samples_split=1000 prevents any node with fewer than 1000 observations from being split; together with max_depth, these constraints stop the tree from growing too deep and overfitting.

Generating New Tree

ld_tree_regularized = open('ld_tree_regularized.dot','w')
dot_data = tree.export_graphviz(reg_dt_model, out_file=ld_tree_regularized, feature_names=list(X_train), class_names=list(train_char_label))

ld_tree_regularized.close()
dot_data

Variable Importance
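The code is omitted at this point in the notebook; presumably it mirrors the earlier variable-importance call, now applied to the regularised model:

print(pd.DataFrame(reg_dt_model.feature_importances_,
                   columns=["Imp"],
                   index=X_train.columns).sort_values('Imp', ascending=False))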

Predicting on Training and Test dataset

# Predict with the regularised model on the training and test sets
# (completing the exercise cell)
ytrain_predict = reg_dt_model.predict(X_train)
ytest_predict = reg_dt_model.predict(X_test)
print('ytrain_predict',ytrain_predict.shape)
print('ytest_predict',ytest_predict.shape)

Getting the Predicted Classes

ytest_predict

Getting the Predicted Probabilities

ytest_predict_prob=reg_dt_model.predict_proba(X_test)
ytest_predict_prob

pd.DataFrame(ytest_predict_prob).head()
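The column order of predict_proba follows reg_dt_model.classes_; a small illustrative sketch labels the columns explicitly so the two probabilities are unambiguous:

# Each column corresponds to a class in reg_dt_model.classes_ (0 and 1 here)
prob_df = pd.DataFrame(ytest_predict_prob, columns=reg_dt_model.classes_)
prob_df.head()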

Model Evaluation

Measuring AUC-ROC Curve

import matplotlib.pyplot as plt

AUC and ROC for the training data

# predict probabilities
probs = reg_dt_model.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(train_labels, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_labels, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

AUC and ROC for the test data

# predict probabilities
probs = reg_dt_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(test_labels, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(test_labels, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

Confusion Matrix for the training data

from sklearn.metrics import classification_report, confusion_matrix

confusion_matrix(train_labels, ytrain_predict)

# Train Data Accuracy
reg_dt_model.score(X_train, train_labels)

# Accuracy by hand from the confusion matrix: (TN + TP) / total
print((1985+4742)/(1985+650+706+4742))

print(classification_report(train_labels, ytrain_predict))

Confusion Matrix for test data

confusion_matrix(test_labels, ytest_predict)

#Test Data Accuracy
reg_dt_model.score(X_test,test_labels)

# Accuracy by hand from the confusion matrix: (TN + TP) / total
print((922+1941)/(922+270+332+1941))

print(classification_report(test_labels, ytest_predict))

Conclusion

Accuracy on the Training Data: 83%


Accuracy on the Test Data: 82%

AUC on the Training Data: 87.9%


AUC on the Test: 88.1%

Accuracy, AUC, precision and recall on the test data are almost in line with the training data, indicating that the model is neither overfitting nor underfitting and is, overall, a good classification model.

Also, recall is the more important metric to analyse here: we do not want to miss customers who are likely to become delinquent, and the predictive power to catch delinquencies helps banks be more proactive in their approach. From the confusion matrix on the test data, we can see that the model misclassified 332 customers (false negatives) as non-delinquent when they are in fact delinquent.
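As a quick check, the recall quoted for the delinquent class can be computed straight from the test confusion matrix:

# Recall for class 1 = TP / (TP + FN) = 1941 / (1941 + 332)
print(1941 / (1941 + 332))  # ~0.854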

FICO, term and gender (in that order of importance) are the most important variables in determining whether a borrower will reach the delinquent stage.
