
Exploratory Data Analysis

Name: Mallireddy Srinitha
Problem-Statement

This case study aims to give an idea of applying EDA in a real
business scenario. In this case study, apart from applying the
techniques that you have learnt in the EDA module, we will
also develop a basic understanding of risk analytics in banking
and financial services and understand how data is used to
minimise the risk of losing money while lending to customers.
Business-Understanding

Loan providers find it hard to lend to people with insufficient or non-existent
credit history. Some consumers take advantage of this by defaulting. Suppose you
work for a consumer finance company which specialises in lending various types of
loans to urban customers. You have to use EDA to analyse the patterns present in
the data. This will ensure that applicants who are capable of repaying the loan are
not rejected.
When the company receives a loan application, it has to decide whether to approve
the loan based on the applicant's profile. Two types of risk are associated with the
bank's decision:
● If the customer is likely to repay the loan, then not approving the loan results in a
loss of business to the company.
● If the customer is not likely to repay the loan, i.e. he/she is likely to default, then
approving the loan may lead to a financial loss for the company.
Business-Objectives

This case study aims to identify patterns which indicate whether a client
has difficulty paying their instalments. These patterns may be used for
taking actions such as denying the loan, reducing the amount of the loan,
lending (to risky applicants) at a higher interest rate, etc. This will
ensure that consumers capable of repaying the loan are not rejected.
Identification of such applicants using EDA is the aim of this case study.
Specifications of Application_Data
● Shape: (307511, 122)
● It is a combination of numerical and categorical columns
● Described the dataset and found the mean, standard
deviation, minimum value, maximum value, and 25%, 50% and
75% values of each numerical column
Missing Values in Application_Data
● Checked the percentage of missing values in each column of
this data frame
● Dropped columns with 45% or more missing values
After dropping these columns we are left with 73 columns.
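The two steps above can be sketched in pandas as follows. This is a minimal illustration on a toy frame, not the code used in the case study; the helper name `drop_sparse_columns` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 45.0) -> pd.DataFrame:
    """Return df without the columns whose missing percentage is >= threshold."""
    # isna().mean() gives the fraction of missing values per column
    missing_pct = df.isna().mean() * 100
    keep = missing_pct[missing_pct < threshold].index
    return df[keep]


# Toy data standing in for application_data (hypothetical values)
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],          # 0% missing
    "B": [np.nan, np.nan, 1.0, 2.0],    # 50% missing, dropped
    "C": ["x", None, "y", "z"],         # 25% missing, kept
})

reduced = drop_sparse_columns(toy)
print(reduced.columns.tolist())
```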
Impute missing values for Numerical Variables

Checking "AMT_REQ_CREDIT_BUREAU_DAY"
From this boxplot we can see that there is no outlier and the values seem to
be continuous, so we can impute NaN values with the mean, which is "0.007000".
Checking "AMT_REQ_CREDIT_BUREAU_WEEK"
From this boxplot we can see that there is no outlier and the values seem to
be continuous, so we can impute NaN values with the mean, which is "0.034362".
Checking "AMT_REQ_CREDIT_BUREAU_QRT"
From this boxplot we can see that there are outliers at "19.0" and "261.0",
while most values are at "0.0", so we can impute NaN values with the
median.
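The imputation rule described above (mean where the boxplot shows no outliers, median where it does) might look like the sketch below. The values are toy data chosen for illustration, and the `mean_cols` / `median_cols` lists are assumed names.

```python
import numpy as np
import pandas as pd

# Toy data; the real columns come from application_data
toy = pd.DataFrame({
    "AMT_REQ_CREDIT_BUREAU_DAY": [0.0, 0.0, 0.0, 1.0, np.nan],
    "AMT_REQ_CREDIT_BUREAU_QRT": [0.0, 0.0, 0.0, 261.0, np.nan],
})

mean_cols = ["AMT_REQ_CREDIT_BUREAU_DAY"]     # no outliers: use the mean
median_cols = ["AMT_REQ_CREDIT_BUREAU_QRT"]   # outliers present: use the median

for col in mean_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

for col in median_cols:
    # the median is not pulled upwards by the extreme value 261.0
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The median is preferred for the skewed column because a single extreme value shifts the mean but leaves the median unchanged.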
Changing the Data Types
Changed the datatype of column "DAYS_REGISTRATION" from Float to Int as
"DAYS_REGISTRATION" contains a number of days, which should be an integer.
Changed the datatype of column "AMT_REQ_CREDIT_BUREAU_HOUR" from Float to
Int as "AMT_REQ_CREDIT_BUREAU_HOUR" contains the number of enquiries to Credit
Bureau about the client one hour before application, which should be an integer value.
Changed the datatype of "AMT_REQ_CREDIT_BUREAU_DAY" from Float to Int as
"AMT_REQ_CREDIT_BUREAU_DAY" contains the number of enquiries to Credit Bureau
about the client one day before application, which should be an integer value.
Changed the datatype of "AMT_REQ_CREDIT_BUREAU_MON" from Float to Int as
"AMT_REQ_CREDIT_BUREAU_MON" contains the number of enquiries to Credit Bureau
about the client one month before application, which should be an integer value.
In the same way, I changed the datatype of up to 10 columns.
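A float-to-integer conversion of this kind can be sketched as below. The toy values and the `count_cols` list are assumptions for illustration. Note that `astype("int64")` fails on NaN, so the columns must be imputed first, and that casting truncates fractional values (a mean-imputed 0.007 becomes 0).

```python
import pandas as pd

# Toy data; the real columns come from application_data
toy = pd.DataFrame({
    "DAYS_REGISTRATION": [-3648.0, -1186.0, -4260.0],
    "AMT_REQ_CREDIT_BUREAU_HOUR": [0.0, 1.0, 0.0],
})

count_cols = ["DAYS_REGISTRATION", "AMT_REQ_CREDIT_BUREAU_HOUR"]

# Guard: integer dtypes cannot hold NaN, so check before casting
if toy[count_cols].isna().any().any():
    raise ValueError("impute missing values before casting to int")

# Cast every listed column to a 64-bit integer
toy = toy.astype({col: "int64" for col in count_cols})
print(toy.dtypes)
```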
Identifying Outliers

From this figure we can see that, except for AMT_REQ_CREDIT_BUREAU_QRT,
the columns have continuous values, so their extreme values cannot be
considered outliers.
From this boxplot we can clearly see that, except for
OBS_30_CNT_SOCIAL_CIRCLE and OBS_60_CNT_SOCIAL_CIRCLE, the
columns have continuous values, so their extreme values cannot be
considered outliers.
Outlier analysis for AMT_INCOME_TOTAL variable

There is no huge difference between the median and the mean.
There is a difference between the max and the 99th and 95th percentiles, so the variable has outliers.
From the boxplot shown beside, we see that there is an outlier.
Outlier analysis for CNT_CHILDREN variable
There is no huge difference between the median and the mean.
There is a huge difference between the max and the 99th and 95th percentiles: at the 99th percentile a
person has 3 children, while the max is 19, which is not plausible, so this is an outlier.
From the boxplot shown beside, we can see that there is an outlier.
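The percentile check used in the two outlier analyses above can be sketched as follows. The series is toy data constructed so that the maximum is 19, as in the CNT_CHILDREN discussion; it is not the real column.

```python
import pandas as pd

# Toy stand-in for CNT_CHILDREN: mostly 0, a few small values, one extreme value
children = pd.Series([0] * 90 + [1] * 5 + [2] * 3 + [3] + [19], name="CNT_CHILDREN")

mean_value = children.mean()
median_value = children.median()
p95 = children.quantile(0.95)
p99 = children.quantile(0.99)
max_value = children.max()

# A max far above the 99th percentile points to an outlier
print(f"mean={mean_value:.2f}, median={median_value}, "
      f"p95={p95:.2f}, p99={p99:.2f}, max={max_value}")
```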
Treatment of Outliers By IQR method
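The slide does not say how the outliers were treated, so the following is only one common form of the IQR method: compute the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR and cap values at them. The helper `iqr_bounds`, the choice of capping rather than removing rows, and the toy values are all assumptions.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) IQR fences of a numeric series."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values with one extreme observation
income = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 5000.0])

lower, upper = iqr_bounds(income)

# Flag the points outside the fences, then cap them at the fences
is_outlier = (income < lower) | (income > upper)
capped = income.clip(lower=lower, upper=upper)

print(lower, upper, int(is_outlier.sum()))
```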
Binning the Continuous variables

Binning for DAYS_BIRTH
Here we are dividing the DAYS_BIRTH column into different categories of
Young, Teenage, Old.
We find below observations:
Young: 225337
Teenage: 52806
Old: 29368

Binning for CNT_FAM_MEMBERS
Here we are dividing the CNT_FAM_MEMBERS column into different categories of
small, medium, large, veryLarge families.
We find below observations:
Small: 80776
Medium: 518
Large: 7
Very Large: 4
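Binning of this kind is usually done with `pd.cut`. The sketch below bins a toy CNT_FAM_MEMBERS series; the bin edges are hypothetical, since the slides do not state the cut-offs that were used, and only the category labels are taken from the text.

```python
import pandas as pd

# Toy stand-in for CNT_FAM_MEMBERS
fam = pd.Series([1, 2, 2, 3, 5, 9, 14], name="CNT_FAM_MEMBERS")

# Hypothetical edges: (0, 4], (4, 8], (8, 12], (12, inf)
bins = [0, 4, 8, 12, float("inf")]
labels = ["Small", "Medium", "Large", "Very Large"]

fam_size = pd.cut(fam, bins=bins, labels=labels)
counts = fam_size.value_counts()
print(counts)
```

The same pattern applies to DAYS_BIRTH after converting the day counts to ages.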
Analysis Of TARGET Variables

From the pie chart below we can see the imbalance between target types
1 and 0:
type 1 accounts for 8.07% of the records and type 0 for 91.92%.
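The class shares behind such a pie chart can be computed with `value_counts(normalize=True)`. The series below is toy data with a similar imbalance, not the real TARGET column.

```python
import pandas as pd

# Toy stand-in for TARGET: 23 non-defaulters (0) and 2 defaulters (1)
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Percentage share of each class
share = target.value_counts(normalize=True) * 100

# How many non-defaulters there are per defaulter
imbalance_ratio = share.loc[0] / share.loc[1]

print(share)
print(round(imbalance_ratio, 2))
```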
I have merged the application_data and previous_data datasets and
created a new dataset, final_data, whose shape is (1413701, 110).
I then divided the dataset into 2 subsets based on the TARGET variable:
TARGET=0 and TARGET=1.
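The merge and split might be written as in the sketch below. The join key `SK_ID_CURR`, the inner join, and the toy columns `AMT_CREDIT` and `PREV_STATUS` are assumptions for illustration; the slides do not state how the two datasets were joined.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical columns and values)
application_data = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "AMT_CREDIT": [100.0, 250.0, 80.0],
})
previous_data = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "PREV_STATUS": ["Approved", "Refused", "Approved", "Approved"],
})

# One row per (current application, previous application) pair
final_data = application_data.merge(previous_data, on="SK_ID_CURR", how="inner")

# Split into non-default and default subsets
target0 = final_data[final_data["TARGET"] == 0]
target1 = final_data[final_data["TARGET"] == 1]

print(final_data.shape, len(target0), len(target1))
```

Because a client can have several previous applications, the merged frame has more rows than application_data, which is consistent with the larger row count reported for final_data.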
Looking at the percentage of defaulted credits, females have a higher chance of not
repaying their loans.
Doing Univariate Analysis on Combined data for default customers.
Doing Univariate Analysis on Combined data for non default customers
Doing Bivariate Analysis on Combined data for default customers.
Doing Bivariate Analysis on Combined data for non default customers
Final Conclusion

Most of the male loan applicants are drivers and laborers, and have
a higher credit amount and lower income, so we should limit the
credit amount for these kinds of clients.
Most of the female loan applicants are sales staff and laborers, and
have a higher credit amount and lower income, so we should limit
the credit amount for these kinds of clients.
