
FRA Milestone-1 - Report

Surabhi Kulkarni

PGP-DSBA Online

TABLE OF CONTENTS

1. Problem Statement

2. Summary of Data

3. Outlier Treatments

4. Missing Value Treatment

5. Transform Target Variable to 0 and 1

6. Multivariate Analysis

7. Train Test Split

8. Logistic Regression Models

9. Performance Metrics of All Models & Interpretations


List of Figures

Fig – Outlier
Fig – Boxplot
Fig – Heatmap
Fig – Distplot
Fig – Countplot
Fig – Scatterplot

Problem Statement
Businesses or companies can fall prey to default if they are not
able to keep up with their debt obligations. Defaults lead to a
lower credit rating for the company, which in turn reduces its
chances of getting credit in the future, and the company may have
to pay higher interest on existing debts as well as on any new
obligations. From an investor's point of view, one would want to
invest in a company that is capable of handling its financial
obligations, can grow quickly, and is able to manage that growth
at scale.
A balance sheet is a financial statement of a company that
provides a snapshot of what a company owns, what it owes, and the
amount invested by the shareholders. It is therefore an important
tool for evaluating the performance of a business.
The available data includes information from the companies'
financial statements for the previous year (2015). Information
about the net worth of each company in the following year (2016)
is also provided, which can be used to derive the labeled field.
Importing Libraries.
Importing Data.

Checking the type of the dataset.

Checking the shape of the dataset: (3586, 67)

Getting the data types column-wise with info():


dtypes: float64(63), int64(3), object(1)
memory usage: 1.8+ MB
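
A minimal sketch of the import-and-load step (the file name used below is an assumption; the report does not show it):

import pandas as pd
import numpy as np

# Load the balance-sheet data (file name assumed)
df = pd.read_csv('fra_data.csv')

print(df.shape)   # (3586, 67)
df.info()         # dtypes: float64(63), int64(3), object(1)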

Observation-1:

The dataset contains 3586 rows and 67 columns.

In the given dataset there are 3 integer-type features, 63 float-type
features, and 1 object-type feature.

Performing EDA

EDA-Step 1: Checking for duplicate records in the data

Number of duplicate rows = 0
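
This check is a single pandas call:

# Count fully duplicated rows in the dataset
print('Number of duplicate rows =', df.duplicated().sum())   # 0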

Target Variable –

- We create a target variable - ‘default’

- Where, if Net-worth next year is zero or positive —> default = 0

- If Net-worth next year is negative —> default = 1


1.1 Outlier Treatment

Outlier count per column (the number of values flagged as outliers in each of the 67 columns):

Co_Code 291
Networth_Next_Year 676
Equity_Paid_Up 448
Networth 650
Capital_Employed 596
...
Creditors_Velocity_Days 391
Inventory_Velocity_Days 262
Value_of_OutputtoTotal_Assets 150
Value_of_OutputtoGross_Block 481
default 388
Length: 67, dtype: int64

The total number of missing values after replacing outliers with NaN is 42828.

The dataset still contains 3586 rows and 67 columns.

Since this is financial data, captured for small, medium, and large
companies alike, the outliers may very well reflect genuine information.
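
The detection rule is not shown in the report; a plausible sketch uses the common 1.5 * IQR fences on every numeric column and replaces the flagged values with NaN (the per-column counts above, including the one for 'default', suggest the rule was applied to all numeric columns):

num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values outside the fences become NaN, to be imputed later
    df[col] = df[col].where(df[col].between(lower, upper))

print('Missing values after replacement:', df.isnull().sum().sum())   # 42828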

1.2 Missing Value Treatment

Visualizing Missing Values:

The presence of missing values in some variables can be observed. In the
heatmap, blue indicates occupied cells, while red indicates missing
values. The missing values were then imputed, as sketched below.
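
The imputation method is not stated in the report; a minimal sketch assuming median imputation, which is robust to the skew seen in this data:

# Fill remaining NaNs in each numeric column with that column's median
df = df.fillna(df.median(numeric_only=True))
print(df.isnull().sum().sum())   # 0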

No more missing values were present after treatment.


Q1.3. : Transform Target Variable into 0 and 1 :

A new dependent variable named 'default' was created based on the
criteria given in the project notes:

1 - if the Net Worth Next Year is negative for the company
0 - if the Net Worth Next Year is positive (or zero) for the company

We made use of the np.where function to create this binary target
variable from 'Networth_Next_Year'.
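
Since np.where is confirmed in the report, the transformation reduces to one line:

# default = 1 when next year's net worth is negative, else 0
df['default'] = np.where(df['Networth_Next_Year'] < 0, 1, 0)
print(df['default'].value_counts())   # 0: 3271, 1: 315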

After generating the dependent column, we checked the split of the data
based on this variable; a bar plot (countplot) shows the same.

Distinct values of the dependent variable - 0 and 1:

0 3271
1 315

Q1.4. : Univariate & Bivariate Analysis with proper interpretation :
(only the variables that were significant in the model building are
included)

All the important features contributing to the model appear to have a
lot of outliers.

Most of the variables take values in both the positive and the negative
range.

Univariate Analysis:

Box plots have been created for the numerical variables that are
important as features in the dataset.

Distribution of each column with distplot and box plot:
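
A sketch of the per-column plots, assuming seaborn (sns.distplot is deprecated in current seaborn; sns.histplot is the modern equivalent):

import matplotlib.pyplot as plt
import seaborn as sns

def plot_distribution(data, col):
    # Distribution plot and box plot side by side for one numeric column
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data[col], kde=True, ax=ax1)
    sns.boxplot(x=data[col], ax=ax2)
    fig.suptitle(col)
    plt.show()

plot_distribution(df, 'Networth')   # repeat for the other important features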
Bivariate Analysis

Gross Sales vs Net Sales:

There is a linear relationship between these two important variables.

Networth vs Capital Employed:

As capital employed increases, net worth also increases, but in some
cases capital appears to be deployed even where net worth is lower.

Networth vs Cost of Production:
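
The bivariate views are plain scatter plots; a sketch (the sales column names are assumptions based on the labels above):

# Gross Sales vs Net Sales: expect a near-linear relationship
sns.scatterplot(x='Gross_Sales', y='Net_Sales', data=df)
plt.show()

# Capital Employed vs Networth
sns.scatterplot(x='Capital_Employed', y='Networth', data=df)
plt.show()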
Multi-variate Analysis:

 We also performed multivariate analysis on the data to see whether
any correlations can be observed within it.
 The correlation function (corr) was used, and a seaborn clustermap
was used to plot the correlations and make better sense of the data.
 We observed that Networth and Networth_Next_Year are highly
correlated. Apart from this, we also found that the various Rate of
Growth variables are highly correlated with one another.
 This analysis tells us that there is a problem of collinearity in
this dataset.

The heatmap has been plotted as follows:
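
A sketch of the correlation study described above:

# Pairwise correlations, clustered so that groups of highly
# correlated variables (e.g. the Rate of Growth fields) sit together
corr = df.corr(numeric_only=True)
sns.clustermap(corr, cmap='vlag', figsize=(14, 14))
plt.show()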


Q1.5. : Train Test Split :

 We split the dataset into df_1 (the independent variables) and df_2
(the target variable).
 We split the data into training and testing sets in the ratio 67:33,
fit the model on the training set, and evaluate its performance on
both sets.
 A seed value (random_state) of 42 was used; see the sketch below.
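
A minimal sketch of the split (dropping 'Networth_Next_Year' from the predictors is an assumption, made here because the target is derived from it):

from sklearn.model_selection import train_test_split

X = df.drop(columns=['default', 'Networth_Next_Year'])   # df_1: independent variables
y = df['default']                                        # df_2: target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)   # 2402 training rows, 1184 test rows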
Q1.6. : Build Logistic Regression Model (using the statsmodels library)
on the most important variables in the Train Dataset and choose the
optimum cutoff. Also showcase your model-building approach.

For model building, we use recursive feature elimination (RFE),
selecting the top 15 features that contribute most to the model.

Each variable is assigned a weight, and rankings are provided based on
these weights.

For modeling we use Logistic Regression with recursive feature
elimination.
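
A sketch of the selection step, assuming sklearn's RFE wrapper (the outputs shown below are from sklearn's LogisticRegression):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Rank all features and keep the top 15
rfe = RFE(estimator=LogisticRegression(max_iter=10000), n_features_to_select=15)
rfe.fit(X_train, y_train)

selected = X_train.columns[rfe.support_]
print(list(selected))   # the 15 retained features
print(rfe.ranking_)     # rank 1 = selected; larger ranks were eliminated earlier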

Applying GridSearchCV for Logistic Regression:

grid_search.best_params_ and grid_search.best_estimator_ are as follows:

{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
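
A sketch of the search, continuing from the RFE step above; the exact grid is not shown in the report, so the candidate values are assumptions chosen to cover the reported best parameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'penalty': ['l2', 'none'],   # 'none' is the older sklearn spelling for no regularisation
    'solver': ['lbfgs'],
    'tol': [1e-4, 1e-3],
}
grid_search = GridSearchCV(
    LogisticRegression(max_iter=10000, n_jobs=2), param_grid, cv=5)
grid_search.fit(X_train[selected], y_train)

print(grid_search.best_params_)   # {'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}
best_model = grid_search.best_estimator_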
Q1.7. : Validate the Model on the Test Dataset and state the
performance metrics. Also state the interpretation from the model.

We train the model and then validate it on both the training and the
testing sets.

Predicted class probabilities for the first five training records,
columns P(default = 0) and P(default = 1):

0 1.00 0.00
1 0.97 0.03
2 0.99 0.01
3 0.73 0.27
4 1.00 0.00

We plot the confusion matrix and classification report for both sets.

We see high precision and accuracy, but recall is low on the training
data. We need to improve recall, since recall measures how many true
positives we capture: a higher recall means we correctly identify the
defaulters. If we miss a defaulter, credit keeps flowing to a company
that cannot service its debts, and the lender's cash flow is not
regularized.
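
The probabilities above and the reports below can be reproduced with sklearn's metrics (a sketch, continuing from the fitted model):

from sklearn.metrics import confusion_matrix, classification_report

# Predicted class probabilities for the first five training records
print(best_model.predict_proba(X_train[selected])[:5].round(2))

for X_, y_, name in [(X_train[selected], y_train, 'training'),
                     (X_test[selected], y_test, 'test')]:
    pred = best_model.predict(X_)
    print(name, 'set')
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))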

Confusion matrix and Classification Report for the training set:

[[2165   26]
 [  86  125]]

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      2191
           1       0.83      0.59      0.69       211

    accuracy                           0.95      2402
   macro avg       0.89      0.79      0.83      2402
weighted avg       0.95      0.95      0.95      2402

Confusion matrix and Classification Report for the test set :

We again see high precision and accuracy, but recall is similarly low
on the test set.

[[1062   18]
 [  43   61]]

              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1080
           1       0.77      0.59      0.67       104

    accuracy                           0.95      1184
   macro avg       0.87      0.78      0.82      1184
weighted avg       0.94      0.95      0.95      1184

Finally, we are able to achieve a decent recall value without
overfitting. Considering challenges such as outliers, missing values,
and correlated features, this is a fairly good model. It could be
improved with better-quality data in which the features explaining
default are not missing to this extent. We could also try techniques
that are less sensitive to missing values and outliers.
