
Finance and Risk Analytics Project

India Credit Risk model

Prepared by: Deepak Batabyal


Table of Contents
1. Phase I – Discovery

   1.1 Problem Objective

2. Phase II – Data Preparation

   2.1 Data Exploration

3. Phase III – Model Planning and Building

   3.1 Logistic Regression

1. Phase I – Discovery

1.1 Problem Objective

In this problem we explore the datasets ‘raw-data.csv’ and ‘validation_data.csv’ to build an India credit risk model in R.

We are required to build this model using logistic regression. The following tasks shall be performed on the dataset:
 EDA (Exploratory Data Analysis)
 Modelling on the dataset
 Evaluation of model performance

Let’s get started.

Set the working directory and load the data files.
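For example (the folder path below is illustrative, not part of the original project):

# Point R at the project folder and confirm the input files are present
setwd("~/credit-risk-project")
list.files(pattern = "\\.csv$")   # should include raw-data.csv and validation_data.csv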

Inference from the dataset:

The given dataset has 52 columns which capture various financial parameters.
Looking at the dataset, the initial inferences are:
 The “net worth for next year” column needs to be converted into binary form
so that logistic regression can be modelled on the dataset.

 The dataset is highly skewed, which needs to be addressed before modelling.
The outliers in the dataset also need treatment so that the model gives
better output; a quick skewness check is sketched below.
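For example, once data_raw has been read in (Section 2.1), the per-column skewness can be checked with the e1071 package (assumed installed; this check is illustrative and not part of the original workflow):

# Compute skewness for each numeric column and list the ten most skewed variables
library(e1071)
num_cols <- sapply(data_raw, is.numeric)
skew_vals <- sapply(data_raw[, num_cols], skewness, na.rm = TRUE)
sort(skew_vals, decreasing = TRUE)[1:10]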
2. Phase II – Data Preparation

2.1 Data Exploration

From the section above, the following steps are done to get the target into binary form:
 If the “Networth Next Year” value is negative (the company is making losses),
the Default flag is set to 1; if it is zero or positive (the company is making
profit), Default is set to 0.

 With this we find that 234 observations are defaults (Default = 1). The
default rate comes out to be 234/3541 ≈ 6.6%.

Let’s start the data exploration with the following commands in R:

## Read Input data from raw-data and validation_data


data_raw = read.csv("raw-data.csv", header = TRUE)
data_validation = read.csv("validation_data.csv", header = TRUE)

## View column names


names(data_raw)
names(data_validation)

## View Structure of Input data


str(data_raw)
str(data_validation)
## View Summary of Input data
summary(data_raw)
summary(data_validation)

data_training = data_raw[ !(data_raw$Num %in% data_validation$Num), ]


dim(data_validation)
dim(data_training)

## Add a column of Default based on the value of NetWorthNextYear.


# 0 if NetWorthNextYear is positive,
#1 if NetWorthNextYear is negative.
#Then remove the NetWorthNextYear column

data_training$Default <- ifelse(data_training$`Networth Next Year`>=0 ,0,1)


data_training = data_training[,c(1,3,4:53)]

## Reorder the columns to make the details easier to find


data_training = data_training[,order(names(data_training))]
data_validation = data_validation[,order(names(data_validation))]
data_training = data_training[,c(29,1:28,30:52)]
data_validation = data_validation[,c(29,1:28,30:52)]
attach(data_training)

##Default Rate for the data set


default_rate = (sum(data_training$Default)/(nrow(data_training)))*100
paste("Default Rate for the dataset is ", default_rate, "%")
Inference (Missing value treatment):

The outliers have been treated with a capping logic: the highest values are
capped at the 99th percentile and the lowest at the 1st percentile, so that the
data shrinks and extreme outliers are removed, while keeping the skewness
factor in mind. The missing values have been treated with the mean value of the
respective attribute. For the missing value treatment, a missing-vs-observed
map was produced, which showed that about 7% of the values are not available in
the given dataset.

## For finding the missing values:
library(Amelia)   # provides missmap()

missing_values = lapply(data_training, function(x) sum(is.na(x)))
missing_values
missmap(data_training, main = "Missing values vs observed")
Now we need to remove the attributes where more than 50% of the values are missing.

# Remove columns, for both training and validation data, for which more than
# 50% of the values are missing (i.e. more than 1416 missing values)
data_training <- subset(data_training,
                        select = -c(`PE on BSE`,
                                    `Deposits (accepted by commercial banks)`,
                                    Investments))
data_validation <- subset(data_validation,
                          select = -c(`PE on BSE`,
                                      `Deposits (accepted by commercial banks)`,
                                      Investments))
## Now replace the remaining NAs with the mean value of the respective attribute:
mean_values = array()

## Replace the rest of the missing NAs with mean values in the TRAINING data
for (i in 1:length(data_training)) {
  mean_values[i] = mean(data_training[,i], na.rm = T)
  data_training[,i][is.na(data_training[,i])] <- mean_values[i]
}

# Replace the rest of the missing NAs with mean values in the VALIDATION data
data_validation_non_missing = data_validation
for (i in 1:length(data_validation_non_missing)) {
  data_validation_non_missing[,i][is.na(data_validation_non_missing[,i])] <-
    mean(data_validation_non_missing[,i], na.rm = T)
}
Code for capping at the 99% and 1% values to reduce the number of outliers:

## Remove Outliers
data_summary = summary(data_training)
data_summary
col_length = ncol(data_summary)

View(data_training)

data_training_pruned = data_training
View(data_training_pruned)

# Remove anything above the 99th percentile and anything below the 1st percentile
for (i in 2:col_length) {
  if (colnames(data_training_pruned)[i] != "Default") {
    topmost = quantile(data_training[,i], 0.99)
    bottommost = quantile(data_training[,i], 0.01)
    data_training_pruned = data_training_pruned[data_training_pruned[,i] < topmost,]
    data_training_pruned = data_training_pruned[data_training_pruned[,i] > bottommost,]
  }
}
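The loop above removes rows that fall outside the 1st–99th percentile band. The capping described in the inference (replacing extreme values with the percentile values themselves) could alternatively be sketched as below; this is an illustrative variant using a hypothetical data_training_capped object, not the code used for the results that follow.

# Alternative: cap (winsorise) each numeric predictor at its 1st and 99th
# percentiles instead of dropping the rows
data_training_capped <- data_training
for (i in 2:ncol(data_training_capped)) {
  if (colnames(data_training_capped)[i] != "Default" && is.numeric(data_training_capped[,i])) {
    topmost    <- quantile(data_training_capped[,i], 0.99, na.rm = TRUE)
    bottommost <- quantile(data_training_capped[,i], 0.01, na.rm = TRUE)
    data_training_capped[,i] <- pmin(pmax(data_training_capped[,i], bottommost), topmost)
  }
}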

Inference:

With this, the data preparation is complete and we can go ahead with building
the logistic model on the prepared dataset.
3. Phase III – Model Planning and Building

3.1 Logistic Regression

library(ISLR)
library(dplyr)
library(ggvis)
library(boot)
library(rJava)

# 'data' and 'val' are assumed to hold the prepared training and validation
# data frames from Phase II
str(data)
summary(data)
dim(data)
## [1] 3541 52

# Remove the blank variable from the file
train <- data[,2:52]
val <- val[,2:52]
str(train)
dim(train)
## [1] 3541 51

# Let's run the first iteration
my_logit <- glm(Default ~ ., family = "binomial", data = train)
summary(my_logit)
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 1723.67 on 3540 degrees of freedom
## Residual deviance: 875.68 on 3492 degrees of freedom
## AIC: 973.68
## Number of Fisher Scoring iterations: 14
# Now it is time to implement and check the performance of the Boruta
# package. The syntax of Boruta is almost similar to the regression (lm)
# method.
# install.packages("Boruta")
library(Boruta)
set.seed(123)
boruta.train <- Boruta(Default~., data = train, doTrace = 2)

# Plot variable importance from the Boruta run
plot(boruta.train, xlab = "", xaxt = "n")
lz <- lapply(1:ncol(boruta.train$ImpHistory), function(i)
  boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]), i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)

final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)


## Boruta performed 99 iterations in 6.128146 mins.
## Tentatives roughfixed over the last 99 iterations.
## 42 attributes confirmed important: Adjusted.EPS, Borrowings,
## Capital.employed, Cash.profit, Cash.profit.as...of.total.income and 37 more;
## 7 attributes confirmed unimportant:
## Cash.to.average.cost.of.sales.per.day,
## Cash.to.current.liabilities..times., Creditors.turnover,
## Current.ratio..times., Equity.face.value and 2 more;
train <- train[, getSelectedAttributes(final.boruta, withTentative = F)]
train <- cbind(data[2], train)
dim(train)
## [1] 3541 43
View(train)
colnames(train)[1] <- "Default"
boruta.df <- attStats(final.boruta)
class(boruta.df)
print(boruta.df)
## Total.assets 0.97979798 Confirmed
## Net.worth 1.00000000 Confirmed
## Total.income 0.98989899 Confirmed
## Change.in.stock 1.00000000 Confirmed
## Total.expenses 0.98989899 Confirmed
## Profit.after.tax 1.00000000 Confirmed
## PBDITA 1.00000000 Confirmed
## PBT 1.00000000 Confirmed
## Cash.profit 1.00000000 Confirmed
## PBDITA.as...of.total.income 1.00000000 Confirmed
## PBT.as...of.total.income 1.00000000 Confirmed
## PAT.as...of.total.income 1.00000000 Confirmed
## Cash.profit.as...of.total.income 1.00000000 Confirmed
## PAT.as...of.net.worth 1.00000000 Confirmed
## Sales 1.00000000 Confirmed
## Income.from.financial.services 0.73737374 Confirmed
## Other.income 0.91919192 Confirmed
## Total.capital 1.00000000 Confirmed
## Reserves.and.funds 1.00000000 Confirmed
## Borrowings 1.00000000 Confirmed
## Current.liabilities...provisions 1.00000000 Confirmed
## Deferred.tax.liability 0.95959596 Confirmed
## Shareholders.funds 1.00000000 Confirmed
## Cumulative.retained.profits 1.00000000 Confirmed
## Capital.employed 1.00000000 Confirmed
## TOL.TNW 1.00000000 Confirmed
## Total.term.liabilities...tangible.net.worth 1.00000000 Confirmed
## Contingent.liabilities...Net.worth.... 1.00000000 Confirmed
## Contingent.liabilities 0.97979798 Confirmed
## Net.fixed.assets 1.00000000 Confirmed
## Investments 0.22222222 Rejected
## Current.assets 0.98989899 Confirmed
## Net.working.capital 1.00000000 Confirmed
## Quick.ratio..times. 0.00000000 Rejected
## Current.ratio..times. 0.00000000 Rejected
## Debt.to.equity.ratio..times. 1.00000000 Confirmed
## Cash.to.current.liabilities..times. 0.00000000 Rejected
## Cash.to.average.cost.of.sales.per.day 0.01010101 Rejected
## Creditors.turnover 0.60606061 Rejected
## Debtors.turnover 0.83838384 Confirmed
## Finished.goods.turnover 0.57575758 Confirmed
## WIP.turnover 0.82828283 Confirmed
## Raw.material.turnover 0.93939394 Confirmed
## Shares.outstanding 0.38383838 Confirmed
## Equity.face.value 0.01010101 Rejected
## EPS 1.00000000 Confirmed
## Adjusted.EPS 1.00000000 Confirmed
## Total.liabilities 0.96969697 Confirmed
## PE.on.BSE 1.00000000 Confirmed
my_logit1 <- glm(Default~., family = "binomial", data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
my_logit1
## Degrees of Freedom: 3540 Total (i.e. Null);  3499 Residual
## Null Deviance:      1724
## Residual Deviance: 891.8    AIC: 975.8
anova(my_logit1, test = "Chisq")
summary(my_logit1)
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1723.67 on 3540 degrees of freedom
## Residual deviance: 891.84 on 3499 degrees of freedom
## AIC: 975.84
##
## Number of Fisher Scoring iterations: 15
toselect.x <- summary(my_logit1)$coeff[-1,4] < 0.05
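A minimal sketch of how these significant coefficients can be turned into a reduced formula and used to fit my_logit2, consistent with the glm call shown in the output below (which uses the default gaussian family):

# Build a reduced formula from the predictors significant at the 5% level
# and refit the model on the training data
relevant.x <- names(toselect.x)[toselect.x == TRUE]
sig.formula <- as.formula(paste("Default ~", paste(relevant.x, collapse = " + ")))
my_logit2 <- glm(formula = sig.formula, data = train)
my_logit2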
## Call: glm(formula = sig.formula, data = train)
## Degrees of Freedom: 3540 Total (i.e. Null); 3527 Residual
## Null Deviance: 218.5
## Residual Deviance: 147.2    AIC: -1183
anova(my_logit2, test = "Chisq")
summary(my_logit2)
## Call:
## glm(formula = sig.formula, data = train)
## (Dispersion parameter for gaussian family taken to be 0.04172674)
##
## Null deviance: 218.54 on 3540 degrees of freedom
## Residual deviance: 147.17 on 3527 degrees of freedom
## AIC: -1183.5
##
## Number of Fisher Scoring iterations: 2

# Refit the reduced model on the validation data
my_logit2_Val <- glm(formula = sig.formula, data = val)
my_logit2_Val
## Call:
## glm(formula = sig.formula, data = val)
## Degrees of Freedom: 396 Total (i.e. Null);  383 Residual
##   (318 observations deleted due to missingness)
## Null Deviance:      15.36
## Residual Deviance: 10.85    AIC: -272.6
anova(my_logit2_Val, test = "Chisq")

summary(my_logit2_Val)
## Call:
## glm(formula = sig.formula, data = val)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.49273 -0.06993 -0.03881 0.00430 0.91737
## (Dispersion parameter for gaussian family taken to be 0.0283165)
##
## Null deviance: 15.355 on 396 degrees of freedom
## Residual deviance: 10.845 on 383 degrees of freedom
## (318 observations deleted due to missingness)
## AIC: -572.65
##
## Number of Fisher Scoring iterations: 2

The AIC scores on the training and validation datasets match reasonably well.
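To take the evaluation a step further, the reduced model can be used to score the validation data and compare predicted against actual defaults. A minimal sketch, assuming val carries a Default column coded 0/1 the same way as the training data, and using an illustrative 0.5 cut-off:

# Score the validation data with the reduced model and cross-tabulate
# predicted vs. actual defaults (the 0.5 cut-off is an assumption)
val_pred  <- predict(my_logit2, newdata = val, type = "response")
val_class <- ifelse(val_pred >= 0.5, 1, 0)
table(Actual = val$Default, Predicted = val_class, useNA = "ifany")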
