
FINANCE RISK ANALYTICS

CREDIT RISK MODEL

BY :
PRANAV. V
1. Introduction
This mini project is part of the Finance and Risk Analytics module of the
Post Graduate Program in Business Analytics and Business Intelligence
offered by Great Learning and the McCombs School of Business,
University of Texas at Austin.
Credit risk is a critical parameter for any financial institution, lending
firm, investment firm or venture capitalist. It is also a good indicator of
how well a company is doing now and how it is likely to fare in the near
future.
Credit risk is the probability that a borrower will be unable to make their
payments on time and will default on their debt. It refers to the risk that
a financial or lending institution may not receive the interest payments
or the principal on time, which creates serious problems for the lender.
Cash flow shortages and higher collection costs are typical consequences.
In extreme cases, part of the loan, or even the entire loan, may have to be
written off, resulting in a large loss for the institution.

1.1 Factors Affecting Credit Risk Modelling


The risk for the lender is of several kinds, ranging from disrupted cash
flows and increased collection costs to loss of interest and principal.
Several major factors must be considered when determining credit risk,
from the financial health of the borrower and the consequences of default
for both borrower and creditor to a variety of macro-economic
considerations. Three major factors affecting the credit risk of a
borrower are listed below:
A) Probability of Default (PD)
This is one of the most important parameters in credit risk. It is the
likelihood that a borrower will default on their loans. For individuals,
this score is mainly based on parameters such as current income, current
debt, current assets and their ratios. For institutions, there are several
standard approaches from rating agencies and frameworks such as FICO,
Moody's, Standard & Poor's and the Sick Industrial Companies Act. The
detailed parameters are discussed below and throughout the project. The
current project is based on this factor, and the models developed are
solely for predicting the probability of default of companies based on
their past financial history.

B) Loss Given Default (LGD)

This refers to the total loss the lender incurs if the debt is not repaid.
Two companies with similar credit scores can have very different risk
profiles if their borrowed amounts differ: the lender bears a bigger loss
if the default is on the larger debt. LGD therefore plays a big role in
determining interest rates and down payments. It is usually expressed as a
percentage, reflecting how much of the money is expected to be recovered.
C) Exposure at Default (EAD)
This is a measure of the total exposure a lender faces at any given point
in time. It also affects credit risk because it is an indicator of the
lender's risk appetite.
2. Project Objective
The objective of this project is to build a model that can identify the
set of business firms that are predicted to default, i.e. a model
predicting their probability of default. The dataset contains 3,541
observations of various companies' financial parameters. A separate
validation dataset is provided to test the model. For both datasets, the
default variable is also given (or needs to be calculated from
Networth.Next.Year). This will be the dependent variable for the model.
3. Data Preparation
3.1 Initial Hypothesis
Null hypothesis (H0): None of the independent variables is significant in
predicting default.
Alternate hypothesis (HA): At least one independent variable is
significant in predicting default.

3.2 Dataset and its variables


The dataset contains 3,541 observations of companies' parameters across 50
variables. (One variable, Deposits (accepted by commercial banks), has no
values at all and has therefore been omitted.) The variables are listed
below:

Variable Name | Description | Type
Networth Next Year | Net worth of the customer in the next year (will be converted to Default) | Continuous, Dependent
Total assets | Total assets of the customer | Continuous, Predictor
Net worth | Net worth of the customer in the present year | Continuous, Predictor
Total income | Total income of the customer | Continuous, Predictor
Change in stock | Difference between the value of current stock and the value of stock on the last trading day | Continuous, Predictor
Total expenses | Total expenses of the customer | Continuous, Predictor
Profit after tax | Profit after tax deduction | Continuous, Predictor
PBDITA | Profit before depreciation, income tax and amortization | Continuous, Predictor
PBT | Profit before tax deduction | Continuous, Predictor
Cash profit | Total cash profit | Continuous, Predictor
PBDITA as % of total income | PBDITA / Total income | Ratio, Predictor
PBT as % of total income | PBT / Total income | Ratio, Predictor
PAT as % of total income | PAT / Total income | Ratio, Predictor
Cash profit as % of total income | Cash profit / Total income | Ratio, Predictor
PAT as % of net worth | PAT / Net worth | Ratio, Predictor
Sales | Sales made by the customer | Continuous, Predictor
Income from financial services | Income from financial services | Continuous, Predictor
Other income | Income from other sources | Continuous, Predictor
Total capital | Total capital of the customer | Continuous, Predictor
Reserves and funds | Total reserves and funds of the customer | Continuous, Predictor
Deposits (accepted by commercial banks) | All values blank (variable dropped) | Continuous, Predictor
Borrowings | Total amount borrowed by the customer | Continuous, Predictor
Current liabilities & provisions | Current liabilities of the customer | Continuous, Predictor
Deferred tax liability | Future income tax the customer will pay because of current transactions | Continuous, Predictor
Shareholders funds | Amount of equity in the company belonging to shareholders | Continuous, Predictor
Cumulative retained profits | Total cumulative profit retained by the customer | Continuous, Predictor
Capital employed | Current assets minus current liabilities | Continuous, Predictor
TOL/TNW | Total liabilities of the customer divided by total net worth | Ratio, Predictor
Total term liabilities / tangible net worth | Short- and long-term liabilities divided by tangible net worth | Ratio, Predictor
Contingent liabilities / Net worth (%) | Contingent liabilities / Net worth | Ratio, Predictor
Contingent liabilities | Liabilities arising from uncertain events | Continuous, Predictor
Net fixed assets | Purchase price of all fixed assets | Continuous, Predictor
Investments | Total invested amount | Continuous, Predictor
Current assets | Assets expected to be converted to cash within a year | Continuous, Predictor
Cash to current liabilities (times) | Total liquid cash divided by current liabilities | Ratio, Predictor
Cash to average cost of sales per day | Total cash divided by average cost of sales per day | Ratio, Predictor
Creditors turnover | Net credit purchases divided by average trade creditors | Ratio, Predictor
Debtors turnover | Net credit sales divided by average accounts receivable | Ratio, Predictor
Finished goods turnover | Annual sales divided by average inventory | Ratio, Predictor
WIP turnover | Cost of goods sold for a period divided by the average inventory for that period | Ratio, Predictor
Raw material turnover | Cost of goods sold divided by the average inventory for the same period | Ratio, Predictor
Shares outstanding | Number of issued shares minus the number of shares held in the company | Continuous, Predictor
Equity face value | Cost of the equity at the time of issue | Continuous, Predictor
EPS | Net income divided by the total number of outstanding shares | Ratio, Predictor
Adjusted EPS | Adjusted net earnings divided by the weighted average number of common shares outstanding on a diluted basis during the plan year | Ratio, Predictor
Total liabilities | Sum of all types of liabilities | Continuous, Predictor
PE on BSE | Current stock price divided by earnings per share | Ratio, Predictor
Net working capital | Difference between current assets and current liabilities | Continuous, Predictor
Quick ratio (times) | Total cash divided by current liabilities | Ratio, Predictor
Current ratio (times) | Current assets divided by current liabilities | Ratio, Predictor
Debt to equity ratio (times) | Total liabilities divided by shareholder equity | Ratio, Predictor

The dependent variable here is Networth Next Year. As per the problem
statement, it is converted into a new categorical dependent variable:

# Introducing the new variable Default: 1 = default (net worth next year <= 0)
Default <- ifelse(Networth.Next.Year > 0, 0, 1)

The new dependent variable, Default, is the one used in the models that
follow.
3.3 Setting up the Environment, Importing Libraries and the Dataset
This step ensures that:
 The environment is set up with a working path.
 The libraries needed for certain functions are imported.
 The dataset is loaded for exploration.

A) Setting up the environment: Before exploring the given dataset, we
first set up the environment, i.e. where we want to save and read the
dataset from. This is done with setwd(), which sets the working
directory; getwd() returns the directory currently set.

FIG 1: Setting up environment

Importing libraries: This step imports the libraries that are essential
for data processing. Libraries provide built-in functions for the various
manipulation and statistical tasks used below.
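A minimal sketch of this setup, assuming the project files sit in a local folder; the path and file name below are placeholders, not the author's actual directory:

# Set and confirm the working directory (placeholder path)
setwd("~/credit-risk-project")
getwd()

# Libraries used later in this report
library(DMwR)      # knnImputation() for missing value treatment
library(outliers)  # used in the outlier capping step
library(blorr)     # stepwise selection for the logistic model

# Load the dataset (assumed file name)
raw.data <- read.csv("raw-data.csv", header = TRUE)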
4. Exploratory Data Analysis – Step by step approach:
In statistics, exploratory data analysis (EDA) is an approach
to analyzing data sets to summarize their main characteristics,
often with visual methods.
The objectives of EDA are to:
 Suggest hypotheses about the causes of observed phenomena.
 Assess assumptions on which statistical inference will be based.
 Support the selection of appropriate statistical tools and
techniques.
 Provide a basis for further data collection through surveys or
experiments.

NOTE:
As there are more than 50 variables, the first steps in EDA here are to clean up
the missing values, treat the outliers, and identify the significant variables
by means of correlation and VIF checks.

4.1 Dataset Cleansing


This dataset (raw-data) was loaded into R. It was observed that one
variable, Deposits (accepted by commercial banks), had all values blank;
to avoid issues downstream, it was deleted, as discussed before. While
visually exploring the data in Excel, two types of missing values were
found, i.e. blanks and "NA"s. All the NAs were replaced by blanks before
carrying out the missing value treatment.
After this, we identified how many NA values were present.
Fig 3: Missing values
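A hedged sketch of this check, assuming the data frame is named raw.data:

# Count and rank missing values per variable
colSums(is.na(raw.data))
sort(colMeans(is.na(raw.data)) * 100, decreasing = TRUE)  # % missing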

MISSING VALUE TREATMENT:


First, let's check the percentage of missing values before imputation.
Fig 4: Missing values (percentage)

A cut-off of 20% was used for variable selection, i.e. if a variable has
more than 20% of its values missing, it is excluded from the model.

Fig 4: Variables removed

Now, an imputation algorithm is applied to impute the missing values. Here
we use kNN imputation, which fills each missing value based on the nearest
neighbouring observations.
KNN Imputation:
kNN imputation uses the k-Nearest Neighbours approach to impute missing
values. For every observation to be imputed, it identifies the 'k' closest
observations based on Euclidean distance and computes their weighted
average (weights based on distance). This is done using the "DMwR"
library.

Fig 5: Knn Imputation
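A minimal sketch of this step, continuing from raw.data above; k = 5 is an illustrative choice, not one stated in the report:

# Impute all remaining missing values using k-Nearest Neighbours (DMwR)
clean.data <- knnImputation(raw.data, k = 5)
sum(is.na(clean.data))  # should now be 0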

4.2 Outlier Treatment:


First, let's check the extent of outliers in the various variables by
plotting multivariate boxplots of the variables against Default.

Fig 6: Outliers Plot-1


Fig 7: Outliers Plot-2

Almost all the variables are very highly skewed, and the plots indicate
outliers at the extremes. This makes the distribution of the data very
uneven, so there is considerable scope for outlier treatment and feature
re-engineering in this dataset.
For better plots, we applied a sample outlier treatment wherever a
variable shows its largest gap between adjacent percentiles at the
extremes. For example, if the largest gap on the upper side is between the
95th and 96th percentiles, all higher values are capped at the 95th
percentile; similarly, if the largest gap on the lower side is between the
4th and 5th percentiles, all lower values are capped at the 5th
percentile. Different variables were treated accordingly.

An outlier is capped if its value is below the first quartile - 1.5*IQR or
above the third quartile + 1.5*IQR.
The following code snippet was used for treating the outliers:
library(outliers)

# Cap low outliers at the 5th percentile and high outliers at the 95th,
# using the 1.5 * IQR rule to decide what counts as an outlier
outlier_capping <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)  # quartiles
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # capping values
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  return(x)
}
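A hedged usage sketch, applying the capping function to every numeric predictor; the column selection is illustrative:

# Cap outliers in all numeric columns, leaving the target untouched
num_cols <- sapply(clean.data, is.numeric)
num_cols["Default"] <- FALSE
clean.data[num_cols] <- lapply(clean.data[num_cols], outlier_capping)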
Other outlier treatment methods include imputation with the
mean/mode/median and prediction-based replacement (the outliers are
replaced with missing values (NA) and then predicted by treating them as a
response variable). Simple imputation is too basic for a financial project
like this, and the prediction-based method is beyond the scope of this
project. Hence the method used here is capping by quantiles.

4.3 New Variable Creation:


First, the Default variable is created from Networth.Next.Year:
Default <- ifelse(Networth.Next.Year > 0, 0, 1)
Based on the existing set of variables, it is proposed to create new
variables to make the modelling more accurate. As per the project
statement, one set of variables is expected from each of the following
domains:
 Profitability
 Leverage
 Liquidity
 Company Size
Profitability:
Profitability Ratio
This is the ratio of profit to sales, indicating how much profit the
company makes per unit of sales: ratio = (Profit after Tax) / (Sales).

Profitability in terms of Assets

This is the ratio of Profit Before Tax to Total Assets, one of the main
parameters in the Altman Z-score. The corresponding ratio in the India
Z-score model is (Profit After Tax and Depreciation) divided by
Total.Assets; that version might not be required separately, as it would
only increase the multicollinearity.
Leverage
Total Equity
This helps us identify total shareholder equity. It is obtained as the
ratio of (Total.liabilities) to (Debt.to.equity.ratio..times.).
Equity Multiplier
The equity multiplier is a financial leverage ratio that measures how much
of a firm's assets are financed by its shareholders, by comparing total
assets with total shareholders' equity. In other words, it shows the
percentage of assets that are financed or owned by the shareholders. It is
the ratio of (Total.assets) to (TotalEquity).
Borrowing Ratio
This is the ratio of (Total.Borrowings) to (Total.assets), one of the main
parameters in the India Z-score model.
Liquidity
Liquidity Ratio
Liquidity ratios measure an organization's ability to pay off its
short-term obligations. This is obtained as the ratio of
(Net.working.capital) to (Total.assets).
Asset Turnover Ratio
This is the ratio of (Sales) to (Total.assets), one of the main parameters
in the Altman Z-score.

Company size
Company Size
This indicates how the company is doing in terms of its net worth. It is
obtained as the ratio of (Net.worth) to (Total.assets).
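A hedged sketch of these derived variables; the column names follow the report's R naming style and may differ slightly from the actual dataset:

# Create the new Profitability, Leverage, Liquidity and Size variables
clean.data <- within(clean.data, {
  Profitability_ratio        <- Profit.after.tax / Sales
  Profitability_assets_ratio <- PBT / Total.assets
  TotalEquity                <- Total.liabilities / Debt.to.equity.ratio..times.
  Equity_multiplier          <- Total.assets / TotalEquity
  Borrowing_ratio            <- Borrowings / Total.assets
  Liquidity_ratio            <- Net.working.capital / Total.assets
  Turnover.ratio             <- Sales / Total.assets
  CompanySize                <- Net.worth / Total.assets
})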

4.4 Checking for Correlation and Multi Collinearity:


As there are many variables, univariate and bivariate analysis of every
variable would not be practical. It is better first to check the
correlation between the variables and then remove those that are highly
correlated. The full correlation matrix cannot be shown here as it is a
51x51 matrix; a part of it is shown below:
Fig 8: Correlation

Fig 8: Correlation-2

The highlighted fields have high correlation, hence one of each pair will
be removed. This matrix is further analysed in Excel, and all variables
with a correlation greater than 0.9 are grouped together and re-analysed.
A model is then created to check the VIF of the variables.

Fig 9: VIF
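A hedged sketch of these checks, using the caret and car packages as assumed helpers; the 0.9 correlation cut-off follows the report:

# Flag variables with pairwise correlation above 0.9
library(caret)
num_cols <- sapply(clean.data, is.numeric)
cor_mat  <- cor(clean.data[num_cols], use = "pairwise.complete.obs")
names(clean.data[num_cols])[findCorrelation(cor_mat, cutoff = 0.9)]

# Check VIF on a preliminary logistic model
library(car)
vif_model <- glm(Default ~ ., data = clean.data, family = binomial, maxit = 100)
vif(vif_model)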

Based on this, together with the removal of high-VIF variables, one final
set of variables is identified for the univariate and bivariate analysis.
As in most credit scoring methods, it is mainly the ratios that are
actually relevant, so here too we mainly take the ratios when generating
the model. Only these variables are used for the univariate and bivariate
analysis.
When we run the logistic model through the blorr package, stepwise
selection picks the adequate variables for the prediction and gives us the
following output.
Fig 10: Significant variables

For better understanding and to avoid clutter, the univariate and
bivariate analyses are based on the above variables only. Most of the
variables obtained are ratio variables, which makes sense, as other
standard credit scores such as the Altman Z and India Z scores also take
ratio variables as inputs.
4.5 Univariate Analysis
Though there are 16 variables in the selected set, we are concerned with
only some of them as per the problem statement, since most of the
variables are not significant in predicting Default. We start with a quick
statistical summary of all the variables:

The Default variable is the only categorical variable here, with the
following distribution:
0 (non-default): 3,298; 1 (default): 243
Fig 11: Summary of variables

Pie chart of the dependent variable:

Fig 12: pie Chart


Fig 13: Univariate variables
Observations from the Univariate Analysis
Default: This is the dependent variable. Almost 93% of the customers are
non-defaulters while the remaining 7% are defaulters.
Cash.profit: The cash profits are mostly in the range of -120 to 1000.
There are a few outliers, and these are capped at 5000.
PAT.as...of.net.worth: Values are mostly in the range of 0 to 20. There
are a few outliers, and these are capped at -138 and 97.
TOL.TNW: Values are mostly in the range of 0 to 3. There are a few
outliers, and these are capped at -2 and 55.
Cash.profit.as...of.total.income: Values are mostly in the range of 1 to
10.5. There are a few outliers, and these are capped at -163 and 56.
CompanySize: Values are in the range of 0 to 1, with the median at 0.35.
Turnover.ratio: Values are mostly in the range of 0 to 11. There are a few
outliers, and these are capped at 0 and 96.
Cash.to.average.cost.of.sales.per.day: Values are mostly in the range of 0
to 1.75. There are a few outliers, and these are capped at 0 and 37.
Debt.to.equity.ratio..times.: Values are mostly in the range of 0 to 21.
There are a few outliers, and these are capped at 0 and 1284.
Cash.to.current.liabilities..times.: Values are mostly in the range of 0
to 21. There are a few outliers, and these are capped at 0 and 1284.
Debtors.turnover: Values are mostly in the range of 0 to 11. There are a
few outliers, and these are capped at 0 and 207.
WIP.turnover: Values are mostly in the range of 0 to 20. There are a few
outliers, and these are capped at -0.18 and 328.
Raw.material.turnover: Values are mostly in the range of 2 to 12. There
are a few outliers, and these are capped at -2 and 100.
EPS: Values are mostly in the range of 0 to 10. There are a few outliers,
and these are capped at -60 and 896.
Profitability: Values are mostly in the range of 0 to 0.06. There are a
few outliers, and these are capped at 0 and 0.5.
4.6 Multivariate Analysis

Fig 14: Bivariate

Fig 15: Bivariate


Observations from the Bivariate and Multivariate Analysis
Cash.profit: The median cash profit is lower for defaulters than for
non-defaulters.
Cash.profit.as...of.total.income: For defaulters, most of the values lie
in the negative region and the median is below zero; for non-defaulters,
most of the values are positive and the median is positive.
PAT.as...of.net.worth: For defaulters, most of the values lie in the
negative region and the median is below zero; for non-defaulters, most of
the values are positive and the median is positive.
Current ratio (times): The defaulters' median is lower than the
non-defaulters'. This implies that defaulters' current assets are low
relative to their current liabilities.
Debt to equity ratio (times): The defaulters' values are much higher than
the non-defaulters'. This implies that their liabilities are much greater
than their shareholder equity.
Company Size: The defaulters' net worth is much lower than their total
assets; the median ratio is below 0.1, which is alarming. For the
non-defaulters the median ratio is around 0.35.

5. Logistic Regression
5.1 Methodology
As there are almost 51 variables, it is not practical to check the Wald
test and significance of every variable individually. We run the logistic
regression model using all the variables, with the option maxit = 100 to
make sure the logistic regression algorithm converges. As discussed
before, the main variables are the ratios, used alongside some of the
absolute variables.
Fig 16: Logistic regression

The model is run and the values are predicted. If the predicted score is
higher than 0.5, the predicted class is taken as 1.
Next, we check the VIF to further eliminate variables and create a final
model.
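A minimal sketch of this model, assuming the treated data has been split into train.data and test.data:

# Full logistic regression with maxit = 100 to ensure convergence
full_model <- glm(Default ~ ., data = train.data,
                  family = binomial(link = "logit"), maxit = 100)
summary(full_model)

# Predicted probabilities and the 0.5 classification cut-off
pred_prob  <- predict(full_model, newdata = train.data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)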

Fig 17: VIF


From the figure above we can see that company_size,
Debt_to_equity_ratio, Pat_as_of_total_income, profitability_ratio,
total_expenses and total_income have high VIF.
Now, when we run the logistic model through the blorr package, stepwise
selection picks the adequate variables for the prediction and gives us the
following output.
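A hedged sketch of this selection step; blr_step_aic_both() from the blorr package performs combined forward/backward AIC-based stepwise selection:

# Stepwise variable selection on the full logistic model
library(blorr)
step_sel <- blr_step_aic_both(full_model)
step_sel  # prints the variables retained at each step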

Finally, we create another model with the significant variables.

Fig 18: LR-Updated


Fig 19: LR-Updated

5.2 Analysing Coefficients and Signs

Fig 20: probability

Variable | Probability
Total.income | 1.000e+00
Intercept | 1.044e-01
Cumulative.retained.profits | 9.9634e-01
Cash.profit | 9.96e-01
Profitability_ratio | 1.89e-01
Profitability_assets_ratio | 4.806e-08
Company_size | 3.73e-02
Current_ratio | 9.233e-01
Cash.to.average.cost.of.sales.per.day | 1.008e+00
Debt.to.equity.ratio | 1.12e+00
Borrowing_ratio | 3.733e+00

We can understand that most of the coefficients are negative; that is, the
lower the values of these variables, the higher the chance of default.
(The values above appear to be exponentiated coefficients, so entries
below 1 correspond to negative coefficients.) This can be seen mainly in
the following variables: Profitability.assets.ratio, CompanySize,
Profitability, Current.ratio..times. and PAT.as...of.net.worth. Hence, the
larger the decrease in these variables, the higher the chance of default.

5.3 Results and Model Validation (without SMOTE)

Train Dataset

Accuracy = 95.92%
Sensitivity = 48.14%
Specificity = 98.99%

The accuracy is 95.92%, the sensitivity is 48.14% and the specificity is
98.99%. This implies the model identifies non-defaulters very reliably but
catches only about half of the actual defaulters. It is still a fairly
decent model to proceed with.
Test Dataset

Accuracy = 92.58%
Sensitivity = 89%
Specificity = 98.59%

The accuracy is 92.58%, the sensitivity is 89% and the specificity is
98.59%. This implies a fairly good prediction.
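A hedged sketch of how these metrics can be computed, assuming test.data holds the validation set:

# Confusion matrix on the validation set, treating 1 (default) as positive
library(caret)
test_prob  <- predict(full_model, newdata = test.data, type = "response")
test_class <- ifelse(test_prob > 0.5, 1, 0)
confusionMatrix(factor(test_class, levels = c(0, 1)),
                factor(test.data$Default, levels = c(0, 1)),
                positive = "1")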

OTHER EVALUATION METRICS:

The figure above shows the area under the ROC curve, which is close to 1.
This implies the model is a good one: the larger the area under the curve
(AUC), the better the model.

The AUC value of 0.70 indicates a reasonably good prediction.

The model was also validated using the Gini coefficient.

The Gini score of 0.97 is very high, which implies the model performs well
on this metric too.
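A minimal sketch of the AUC and Gini computation, using the ROCR package as an assumed helper:

# ROC curve, AUC and the Gini coefficient derived from it
library(ROCR)
pred_obj <- prediction(pred_prob, train.data$Default)
plot(performance(pred_obj, "tpr", "fpr"))    # ROC curve
auc  <- performance(pred_obj, "auc")@y.values[[1]]
gini <- 2 * auc - 1                          # Gini = 2*AUC - 1
c(AUC = auc, Gini = gini)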

6. Probability of Default and Deciles

On analysing the deciles further, the 1st decile (highest predicted risk)
contains a large number of defaulters while the 10th decile contains a
large number of non-defaulters, at rates of 85% and 91% respectively.
Both of these are very good measures for the model, so it can be said to
be a very good model. In all the remaining deciles there are relatively
few defaulters. This implies that most of the defaulters lie in the top
deciles, and loans can be granted accordingly.
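A hedged sketch of the decile analysis, ranking observations by predicted probability of default; on real data the quantile breaks may contain ties and need de-duplication:

# Decile 1 = highest predicted risk; tabulate defaults per decile
decile <- cut(pred_prob,
              breaks = quantile(pred_prob, probs = seq(0, 1, 0.1)),
              labels = 10:1, include.lowest = TRUE)
table(decile, train.data$Default)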

7. Model Performance Summary and Conclusions

The following are the observations and conclusions:
 Around 51 variables were available, and new ratios were introduced. On
checking significance, the ratios turned out to be more significant than
the absolute values, and in the final equation too most of the variables
were ratios. This is consistent with standard credit scores such as the
Altman Z and India Z scores, which are built entirely from ratios.
 Logistic regression was used to build the predictive model. The
accuracy is 95% on the train dataset and 92% on the validation dataset,
while the sensitivity of the validation dataset is much higher, i.e.
better prediction of defaulters, at 89% (v/s 48% for the train dataset).
 The decile summary based on the probability of default also shows that
the model predicts very well.
 On analysing the coefficients: most of them are negative, i.e. the
lower the values of these variables, the higher the chance of default.
This is seen mainly in Profitability.assets.ratio, CompanySize,
Profitability, Current.ratio..times. and PAT.as...of.net.worth; the larger
the decrease in these variables, the higher the chance of default. On the
positive side, higher values of TOL.TNW, Turnover.ratio,
Debt.to.equity.ratio..times. and Cash.to.current.liabilities..times.
indicate a higher chance of default.
