
FRA Milestone-1 - Report

Surabhi Kulkarni

PGP-DSBA Online

TABLE OF CONTENTS

1. Problem Statement

2. Summary of Data

3. Outlier Treatments

4. Missing Value Treatment

5. Transform Target Variable to 0 and 1

6. Multivariate Analysis

7. Train Test Split

8. Logistic Regression Models

9. Performance Metrics of All Models & Interpretations


List of Figures

Fig – Outlier
Fig – Boxplot
Fig – Heatmap
Fig – Distplot
Fig – Countplot
Fig – Scatterplot

Problem Statement
Businesses or companies can fall prey to default if they are not
able to keep up with their debt obligations. Defaults lead to a
lower credit rating for the company, which in turn reduces its
chances of getting credit in the future, and the company may have
to pay higher interest on existing debts as well as on any new
obligations. From an investor's point of view, one would want to
invest in a company that is capable of handling its financial
obligations, can grow quickly, and is able to manage that growth
at scale.
A balance sheet is a financial statement of a company that
provides a snapshot of what a company owns, what it owes, and the
amount invested by the shareholders. It is therefore an important
tool for evaluating the performance of a business.
The available data includes information from the companies'
financial statements for the previous year (2015). Information
about the net worth of each company in the following year (2016)
is also provided, which can be used to derive the labeled field.
Importing Libraries.
Importing Data.

Checking the type of the dataset.

Checking the shape of the dataset: (3586, 67)

Getting the data types column-wise with info():


dtypes: float64(63), int64(3), object(1)
memory usage: 1.8+ MB
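
A minimal sketch of the import-and-load step (the file name used below is an assumption; the report does not show it):

import pandas as pd
import numpy as np

# Load the balance-sheet data (file name assumed)
df = pd.read_csv('fra_data.csv')

print(df.shape)   # (3586, 67)
df.info()         # dtypes: float64(63), int64(3), object(1)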

Observation-1:

The dataset contains 3586 rows and 67 columns.

In the given dataset there are 3 integer-type features, 63 float-type
features, and 1 object-type feature.

Performing EDA

EDA-Step 1: Checking for duplicate records in the data

Number of duplicate rows = 0
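
This check is a single pandas call:

# Count fully duplicated rows in the dataset
print('Number of duplicate rows =', df.duplicated().sum())   # 0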

Target Variable –

- We create a target variable - ‘default’

- Where, if Net-worth next year is zero or positive —> default = 0

- If Net-worth next year is negative —> default = 1


1.1 Outlier Treatment

Outlier count per column (the number of values flagged as outliers in each of the 67 columns):

Co_Code 291
Networth_Next_Year 676
Equity_Paid_Up 448
Networth 650
Capital_Employed 596
...
Creditors_Velocity_Days 391
Inventory_Velocity_Days 262
Value_of_OutputtoTotal_Assets 150
Value_of_OutputtoGross_Block 481
default 388
Length: 67, dtype: int64

The total number of missing values after replacing outliers with NaN is 42828.

The dataset still contains 3586 rows and 67 columns.

Since this is financial data, captured for small, medium, and large
companies alike, the outliers may very well reflect genuine information.
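
The detection rule is not shown in the report; a plausible sketch uses the common 1.5 * IQR fences on every numeric column and replaces the flagged values with NaN (the per-column counts above, including the one for 'default', suggest the rule was applied to all numeric columns):

num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values outside the fences become NaN, to be imputed later
    df[col] = df[col].where(df[col].between(lower, upper))

print('Missing values after replacement:', df.isnull().sum().sum())   # 42828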

1.2 Missing Value Treatment

Visualizing Missing Values:

The presence of missing values in some variables can be observed. In the
heatmap, blue indicates occupied cells, while red indicates missing
values. The missing values were then imputed, as sketched below.
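
The imputation method is not stated in the report; a minimal sketch assuming median imputation, which is robust to the skew seen in this data:

# Fill remaining NaNs in each numeric column with that column's median
df = df.fillna(df.median(numeric_only=True))
print(df.isnull().sum().sum())   # 0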

No more missing values were present after treatment.


Q1.3. : Transform Target Variable into 0 and 1 :

A new dependent variable named 'default' was created based on the
criteria given in the project notes:

1 - if the Net Worth Next Year is negative for the company
0 - if the Net Worth Next Year is positive (or zero) for the company

We made use of the np.where function to create this binary target
variable from 'Networth_Next_Year'.
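
Since np.where is confirmed in the report, the transformation reduces to one line:

# default = 1 when next year's net worth is negative, else 0
df['default'] = np.where(df['Networth_Next_Year'] < 0, 1, 0)
print(df['default'].value_counts())   # 0: 3271, 1: 315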

After generating the dependent column, we checked the split of the data
based on this variable; a bar plot (countplot) shows the same.

Distinct values of the dependent variable - 0 and 1:

0 3271
1 315

Q1.4. : Univariate & Bivariate Analysis with proper interpretation :
(only the variables that were significant in the model building are
included)

All the important features contributing to the model appear to have a
lot of outliers.

Most of the variables take values in both the positive and the negative
range.

Univariate Analysis:

Box plots have been created for the numerical variables that are
important as features in the dataset.

Distribution of each column with distplot and box plot:
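
A sketch of the per-column plots, assuming seaborn (sns.distplot is deprecated in current seaborn; sns.histplot is the modern equivalent):

import matplotlib.pyplot as plt
import seaborn as sns

def plot_distribution(data, col):
    # Distribution plot and box plot side by side for one numeric column
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data[col], kde=True, ax=ax1)
    sns.boxplot(x=data[col], ax=ax2)
    fig.suptitle(col)
    plt.show()

plot_distribution(df, 'Networth')   # repeat for the other important features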
Bivariate Analysis

Gross Sales vs Net Sales:

There is a linear relationship between these two important variables.

Networth vs Capital Employed:

As capital employed increases, net worth also increases, but in some
cases capital appears to be deployed even where net worth is lower.

Networth vs Cost of Production:
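
The bivariate views are plain scatter plots; a sketch (the sales column names are assumptions based on the labels above):

# Gross Sales vs Net Sales: expect a near-linear relationship
sns.scatterplot(x='Gross_Sales', y='Net_Sales', data=df)
plt.show()

# Capital Employed vs Networth
sns.scatterplot(x='Capital_Employed', y='Networth', data=df)
plt.show()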
Multi-variate Analysis:

 We also performed multivariate analysis on the data to see whether
any correlations can be observed within it.
 The correlation function (corr) was used, and a seaborn clustermap
was used to plot the correlations and make better sense of the data.
 We observed that Networth and Networth_Next_Year are highly
correlated. Apart from this, we also found that the various Rate of
Growth variables are highly correlated with one another.
 This analysis tells us that there is a problem of collinearity in
this dataset.

The heatmap has been plotted as follows:
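
A sketch of the correlation study described above:

# Pairwise correlations, clustered so that groups of highly
# correlated variables (e.g. the Rate of Growth fields) sit together
corr = df.corr(numeric_only=True)
sns.clustermap(corr, cmap='vlag', figsize=(14, 14))
plt.show()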


Q1.5. : Train Test Split :

 We split the dataset into df_1 (the independent variables) and df_2
(the target variable).
 We split the data into training and testing sets in the ratio 67:33,
fit the model on the training set, and evaluate its performance on
both sets.
 A seed value (random_state) of 42 was used; see the sketch below.
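
A minimal sketch of the split (dropping 'Networth_Next_Year' from the predictors is an assumption, made here because the target is derived from it):

from sklearn.model_selection import train_test_split

X = df.drop(columns=['default', 'Networth_Next_Year'])   # df_1: independent variables
y = df['default']                                        # df_2: target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)   # 2402 training rows, 1184 test rows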
Q1.6. : Build Logistic Regression Model (using the statsmodels library)
on the most important variables in the Train Dataset and choose the
optimum cutoff. Also showcase your model-building approach.

For model building, we use recursive feature elimination (RFE),
selecting the top 15 features that contribute most to the model.

Each variable is assigned a weight, and rankings are provided based on
these weights.

For modeling we use Logistic Regression with recursive feature
elimination.
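
A sketch of the selection step, assuming sklearn's RFE wrapper (the outputs shown below are from sklearn's LogisticRegression):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Rank all features and keep the top 15
rfe = RFE(estimator=LogisticRegression(max_iter=10000), n_features_to_select=15)
rfe.fit(X_train, y_train)

selected = X_train.columns[rfe.support_]
print(list(selected))   # the 15 retained features
print(rfe.ranking_)     # rank 1 = selected; larger ranks were eliminated earlier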

Applying GridSearchCV for Logistic Regression:

grid_search.best_params_ and grid_search.best_estimator_ are as follows:

{'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2, penalty='none')
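
A sketch of the search, continuing from the RFE step above; the exact grid is not shown in the report, so the candidate values are assumptions chosen to cover the reported best parameters:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'penalty': ['l2', 'none'],   # 'none' is the older sklearn spelling for no regularisation
    'solver': ['lbfgs'],
    'tol': [1e-4, 1e-3],
}
grid_search = GridSearchCV(
    LogisticRegression(max_iter=10000, n_jobs=2), param_grid, cv=5)
grid_search.fit(X_train[selected], y_train)

print(grid_search.best_params_)   # {'penalty': 'none', 'solver': 'lbfgs', 'tol': 0.0001}
best_model = grid_search.best_estimator_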
Q1.7. : Validate the Model on the Test Dataset and state the
performance metrics. Also state the interpretation from the model.

We train the model and then validate it on both the training and the
testing sets.

Predicted class probabilities for the first five training records,
columns P(default = 0) and P(default = 1):

0 1.00 0.00
1 0.97 0.03
2 0.99 0.01
3 0.73 0.27
4 1.00 0.00

We plot the confusion matrix and classification report for both sets.

We see high precision and accuracy, but recall is low on the training
data. We need to improve recall, since recall measures how many true
positives we capture: a higher recall means we correctly identify the
defaulters. If we miss a defaulter, credit keeps flowing to a company
that cannot service its debts, and the lender's cash flow is not
regularized.
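
The probabilities above and the reports below can be reproduced with sklearn's metrics (a sketch, continuing from the fitted model):

from sklearn.metrics import confusion_matrix, classification_report

# Predicted class probabilities for the first five training records
print(best_model.predict_proba(X_train[selected])[:5].round(2))

for X_, y_, name in [(X_train[selected], y_train, 'training'),
                     (X_test[selected], y_test, 'test')]:
    pred = best_model.predict(X_)
    print(name, 'set')
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))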

Confusion matrix and Classification Report for the training set:

[[2165   26]
 [  86  125]]

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      2191
           1       0.83      0.59      0.69       211

    accuracy                           0.95      2402
   macro avg       0.89      0.79      0.83      2402
weighted avg       0.95      0.95      0.95      2402

Confusion matrix and Classification Report for the test set :

We again see high precision and accuracy, but recall is similarly low
on the test set.

[[1062   18]
 [  43   61]]

              precision    recall  f1-score   support

           0       0.96      0.98      0.97      1080
           1       0.77      0.59      0.67       104

    accuracy                           0.95      1184
   macro avg       0.87      0.78      0.82      1184
weighted avg       0.94      0.95      0.95      1184

Finally, we are able to achieve a decent recall value without
overfitting. Considering challenges such as outliers, missing values,
and correlated features, this is a fairly good model. It could be
improved with better-quality data in which the features explaining
default are not missing to this extent. We could also try techniques
that are less sensitive to missing values and outliers.
