
MAY 2021

DATA MINING
BUSINESS REPORT

THAKUR ARUN SINGH

This Business Report provides a detailed explanation of how we approached each problem given in the
assignment, along with the corresponding resolution and explanation for each problem.
CONTENTS
Problem 1: Clustering
    Problem 1.1
    Problem 1.2
    Problem 1.3
    Problem 1.4
    Problem 1.5
Problem 2:
    Problem 2.1
    Problem 2.2
        Model 2
        Model 3
    Problem 2.3
    Problem 2.4
    Problem 2.5

Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.

PROBLEM 1.1
Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).

Resolution:
Describing the data:
 First we import all the necessary libraries in Python and then import the data file
'bank_marketing_part1_Data'. Once the file is imported, we confirm whether the data has been
loaded correctly using the 'head' function, which lets us view the data and check that the columns
and headers align correctly. A sketch of these initial steps is shown after the summary below.

 Then, using the 'shape' function, we can see how many rows and columns there are in our
data set.

 To check the data types of all the columns and also to check for null values, the 'info' function
has been used.

 To see a detailed description of the data (count, mean, median, min, max, standard
deviation, etc.), the 'describe' function has been used:

                              count  mean      std       min     25%      50%      75%       max
spending                      210    14.84752  2.909699  10.59   12.27    14.355   17.305    21.18
advance_payments              210    14.55929  1.305959  12.41   13.45    14.32    15.715    17.25
probability_of_full_payment   210    0.870999  0.023629  0.8081  0.8569   0.87345  0.887775  0.9183
current_balance               210    5.628533  0.443063  4.899   5.26225  5.5235   5.97975   6.675
credit_limit                  210    3.258605  0.377714  2.63    2.944    3.237    3.56175   4.033
min_payment_amt               210    3.700201  1.503557  0.7651  2.5615   3.599    4.76875   8.456
max_spent_in_single_shopping  210    5.408071  0.49148   4.519   5.045    5.223    5.877     6.55

 Using the ‘isnull’ function, one can understand if there are any null values in the data set. And
we do not have any null values in the existing data set.

 Using the ‘dups’ function we check for the duplicates and there were no duplicate values.
After reviewing the data thoroughly, and based on the above analysis we can say that, we have seven
variables, Mean and Median values are almost equal, and Standard deviation for ‘Spending’ is higher
than other variables. There are no duplicates in the data set.
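A minimal sketch of these initial steps in Python is shown below (assuming the data file is available as a CSV named 'bank_marketing_part1_Data.csv'; the exact file name and format may differ):

    import numpy as np
    import pandas as pd

    # Load the dataset (file name assumed; adjust the path/format as needed)
    df = pd.read_csv("bank_marketing_part1_Data.csv")

    df.head()              # preview the first rows and check columns/headers
    df.shape               # number of rows and columns
    df.info()              # data types and non-null counts per column
    df.describe().T        # count, mean, std, min, quartiles, max per variable

    df.isnull().sum()      # null values per column (none here)
    df.duplicated().sum()  # duplicate rows (none here)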
Exploratory data analysis
Univariate and multivariate analysis

The dist plot for 'spending' shows the distribution of data from 10 to 22 and is positively skewed. The
boxplot shows that there are no outliers.

The dist plot for 'advance_payments' shows the distribution of data from 12 to 17 and is positively
skewed. The boxplot shows that there are no outliers.

The dist plot for 'probability_of_full_payment' shows the distribution of data from 0.80 to 0.92 and is
negatively skewed. The boxplot shows that there are a few outliers. The probability values are mostly
above 0.80, which is good.

The dist plot for 'current_balance' shows the distribution of data from 5.0 to 6.5 and is positively
skewed. The boxplot shows that there are no outliers.

The dist plot for 'credit_limit' shows the distribution of data from 2.5 to 4.0 and is positively skewed.
The boxplot shows that there are no outliers.

The dist plot for 'min_payment_amt' shows the distribution of data from 1 to 8 and is positively skewed.
The boxplot shows that there are a few outliers.

The dist plot for 'max_spent_in_single_shopping' shows the distribution of data from 4.5 to 6.5 and is
positively skewed. The boxplot shows that there are a few outliers.

Outliers will not be treated, as only 3 to 4 outlying values were observed in the data set.

Multivariate Analysis

From both analyses we can see that there is a strong positive correlation between all the variables
except 'min_payment_amt'.

PROBLEM 1.2
Do you think scaling is necessary for clustering in this case? Justify

Resolution:
As the data is unscaled, it is imperative that we perform scaling, because clustering works on
distance-based computations and the values of the variables are on different scales.
For example, 'spending' and 'advance_payments' take values in different ranges, and variables with
larger values may get more weight. Scaling brings all values down to a comparable range. Using
StandardScaler (as sketched after the table below), this is how the data looks after scaling.

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping
0  1.754355  1.811968          0.17823                       2.367533         1.338579     -0.29881          2.328998
1  0.393582  0.25384           1.501773                     -0.600744         0.858236     -0.24281         -0.538582
2  1.4133    1.428192          0.504874                      1.401485         1.317348     -0.22147          1.509107
3 -1.38403  -1.22753          -2.591878                     -0.793049        -1.63902       0.987884        -0.454961
4  1.082581  0.998364          1.19634                       0.591544         1.155464     -1.08815          0.874813
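A minimal sketch of the scaling step (assuming df holds the seven numeric variables):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Standardize each variable to zero mean and unit variance
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled, columns=df.columns)
    scaled_df.head()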

PROBLEM 1.3

Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram
and briefly describe them

Resolution:

For Hierarchical clustering we can use ward’s method for scaled data
From the above dendrogram, we can see that all the data points have been grouped into different
clusters.

To achieve the business objective and to obtain the optimal number of clusters, we can truncate the
dendrogram using truncate_mode='lastp', with p = 10 as per common practice.

Here we can see that the data has been clustered into three clusters.

Using fcluster with the criterion 'maxclust', we can map these clusters back onto the data set, as
sketched below.
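A sketch of these hierarchical clustering steps (assuming scaled_df is the scaled data from above):

    from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
    import matplotlib.pyplot as plt

    # Ward's linkage on the scaled data
    wardlink = linkage(scaled_df, method="ward")

    # Truncated dendrogram showing only the last p merged clusters
    dendrogram(wardlink, truncate_mode="lastp", p=10)
    plt.show()

    # Cut the tree into 3 clusters and attach the labels to the original data
    df["fcluster"] = fcluster(wardlink, 3, criterion="maxclust")
    df["fcluster"].value_counts()   # cluster frequency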

   spending  advance_payments  probability_of_full_payment  current_balance  credit_limit  min_payment_amt  max_spent_in_single_shopping  fcluster
0  19.94     16.92             0.8752                        6.675            3.763         3.252            6.55                          1
1  15.99     14.89             0.9064                        5.363            3.582         3.336            5.144                         3
2  18.95     16.42             0.8829                        6.248            3.755         3.368            6.148                         1
3  10.83     12.96             0.8099                        5.278            2.641         5.182            5.185                         2
4  17.99     15.86             0.8992                        5.89             3.694         2.068            5.837                         1

We can now look at the cluster frequency in our data set


We can also apply the average linkage method to the scaled data.

From the above dendrogram, we can see that all the data points have been grouped into different
clusters by the average linkage method.

To achieve the business objective and to obtain the optimal number of clusters, we can again truncate
the dendrogram using truncate_mode='lastp' with p = 10.

Here we can see that the data has been clustered into three clusters.

Using fcluster with the criterion 'maxclust', we can map these clusters onto the data set.

Analysis – From the above analysis we can say that there is not much variation between the two
methods. Both methods produce similar cluster means, with only minor, predictable variation.

Clustering – Based on the above dendrograms, 3 to 4 cluster groups look reasonable. For the data set
we have, we can go with 3 clusters. This gives us a pattern based on high/medium/low spending along
with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments
made).

PROBLEM 1.4

Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters

Resolution:

For K-Means clustering, we can initially set n_clusters = 3 and look at the distribution of clusters for
that choice.

We apply the K-Means technique to the scaled data and obtain a cluster label for every observation in
the dataset.

Now that we have 3 clusters (0, 1, 2), to find the optimal number of clusters we can use the elbow
method.

To obtain the inertia values for cluster counts from 1 to 11, we can use a 'for loop' (as sketched below)
and then pick the optimal number of clusters from the elbow curve.
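A sketch of the elbow-curve loop and the silhouette score computation (assuming scaled_df is the scaled data from Problem 1.2):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import matplotlib.pyplot as plt

    # Inertia (within-cluster sum of squares) for k = 1..11 to draw the elbow curve
    wss = []
    for k in range(1, 12):
        km = KMeans(n_clusters=k, random_state=1)
        km.fit(scaled_df)
        wss.append(km.inertia_)

    plt.plot(range(1, 12), wss, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Inertia")
    plt.show()

    # Silhouette score for the chosen k = 3
    km3 = KMeans(n_clusters=3, random_state=1)
    labels = km3.fit_predict(scaled_df)
    silhouette_score(scaled_df, labels)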

The silhouette score for 3 clusters is good: approximately 0.40 (0.4007270552751299).


From the above elbow curve we can see that after 3 clusters there is no large drop in the inertia values,
so we select 3 clusters.

We then add the cluster results to our dataset to address the business objective.

Observations – With the K-Means method we arrive at 3 clusters. We consider this optimal because
there is no large drop in inertia values beyond that point, and the elbow curve shows the same pattern.

The minimum silhouette width for the K-Means clusters is also only a small value, which indicates that
the data points are properly clustered and there is no mismatch in the cluster assignments.

Based on the above dendrograms, 3 to 4 cluster groups look reasonable. For the data set we have, we
can go with 3 clusters.
This gives us a pattern based on high/medium/low spending along with max_spent_in_single_shopping
(high-value items) and probability_of_full_payment (payments made).
PROBLEM 1.5

Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.

Resolution:

Based on the above analysis, we can divide the cluster profiles into three groups:

High Spending, Medium Spending, and Low Spending. A short profiling sketch is shown below.
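A minimal sketch of how the cluster profiles can be derived (assuming labels holds the final K-Means cluster assignments from the earlier sketch and df is the original data):

    # Attach the cluster labels and profile each cluster by its mean values
    df["km_cluster"] = labels
    df.groupby("km_cluster").mean()   # average spending, credit limit, etc. per cluster
    df["km_cluster"].value_counts()   # cluster sizes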


Group 1 – High Spending:

 Offering reward points might increase their purchases / spend.

 max_spent_in_single_shopping is high for this group, so we can offer a discount / offer on the
next transaction upon full payment.
 Increase their credit limit, which in turn can increase their spending.
 We can offer loans against the credit card, as they are high-value customers with a good
repayment record.
 Tie up with luxury brands, which will drive more one-time maximum spending.

Group 2 – Medium Spending:

 They are potential target customers who are paying their bills, making purchases and
maintaining a comparatively good credit score.
 We can increase their credit limit or lower the interest rate.
 Promote premium cards / loyalty cards to increase transactions.
 Increase their spending by tying up with premium e-commerce sites, travel portals and
airlines/hotels, as this will encourage them to spend more.

Group 3 – Low Spending:

 These customers should be given reminders for payments.

 Offers can be provided on early payments to improve their payment rate.
 Increase their spending by tying up with grocery stores and utilities (electricity, phone, gas,
and others).

*****************************************^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^*************************************

Problem 2:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides
to collect data from the past few years. You are assigned the task to make a model which predicts the
claim status and provide recommendations to management. Use CART, RF & ANN and compare the
models' performances in train and test sets.

PROBLEM 2.1

Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

Resolution:

 First we import all the necessary libraries in Python and then import the data file
'insurance_part2_data'. Once the file is imported, we confirm whether the data has been loaded
correctly using the 'head' function, which lets us view the data and check that the columns and
headers align correctly.
 Then using the ‘shape’ function we can understand how many row and columns are there in our
data set.

 To check the data type of all the columns and also to check the null values, ‘info’ function. Has
been used.

 To see the detail description of the data such as, Count, Mean, Median, Min, Max, Standard
Deviations etc,

              count  unique  top              freq  mean     std       min  25%  50%   75%     max
Age           3000   NaN     NaN              NaN   38.091   10.4635   8    32   36    42      84
Agency_Code   3000   4       EPX              1365  NaN      NaN       NaN  NaN  NaN   NaN     NaN
Type          3000   2       Travel Agency    1837  NaN      NaN       NaN  NaN  NaN   NaN     NaN
Claimed       3000   2       No               2076  NaN      NaN       NaN  NaN  NaN   NaN     NaN
Commission    3000   NaN     NaN              NaN   14.5292  25.4815   0    0    4.63  17.235  210.21
Channel       3000   2       Online           2954  NaN      NaN       NaN  NaN  NaN   NaN     NaN
Duration      3000   NaN     NaN              NaN   70.0013  134.053  -1    11   26.5  63      4580
Sales         3000   NaN     NaN              NaN   60.2499  70.734    0    20   33    69      539
Product Name  3000   5       Customised Plan  1136  NaN      NaN       NaN  NaN  NaN   NaN     NaN
Destination   3000   3       ASIA             2465  NaN      NaN       NaN  NaN  NaN   NaN     NaN

 Using the ‘isnull’ function, one can understand if there are any null values in the data set. And
we do not have any null values in the existing data set.

 Using the ‘dups’ function we check for the duplicates and there were few duplicate values which
are noted.

 Using the ‘drop_duplicates’ function, we can exclude the duplicate values. Then check for the
data.
Checking for Outliers

As there is no unique identifier I’m not dropping the duplicates it may be different customer’s data.

Outliers exist in almost all the numeric variables.

We can treat the outliers when building the Random Forest classifier (Model 2).

Checking the pairplot distribution of the continuous variables

Checking for Correlations

AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
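The categorical frequencies listed above can be produced with a small loop like the following (assuming the insurance data has been loaded into a DataFrame named ins_df):

    # Number of levels and frequency of each level for every categorical column
    for col in ins_df.select_dtypes(include="object").columns:
        print(col.upper(), ":", ins_df[col].nunique())
        print(ins_df[col].value_counts().sort_values(), "\n")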
Univariate

The above dist plot shows the distribution of data from 20 to 80 and is positively skewed. The boxplot
shows that there are no outliers. The majority of the distribution lies in the range 30 to 40.

The above dist plot shows the distribution of data from 0 to 30 and is positively skewed. The boxplot
shows that there are a few outliers.

The above dist plot shows the distribution of data from 0 to 100 and is positively skewed. The boxplot
shows that there are a few outliers.

The above dist plot shows the distribution of data from 30 to 300 and is positively skewed. The boxplot
shows that there are a few outliers.

Categorical variables

The distribution of the agency code shows that EPX has the maximum frequency.

The box plot shows the split of sales across the different agency codes, with the 'Claimed' column as
hue.

It seems that C2B has more claims than the other agencies.

The box plot shows the split of sales across the agency types, with the 'Claimed' column as hue.

We can see that the Airlines type has more claims.

The majority of customers have used the online channel; very few used the offline channel.

The box plot shows the split of sales across the channels, with the 'Claimed' column as hue.

The Customised Plan seems to be the plan most preferred by customers compared to all the other
plans.

The box plot shows the split of sales across the product names, with the 'Claimed' column as hue.

Asia is the destination that most customers choose compared with the other destinations.

The box plot shows the split of sales across the destinations, with the 'Claimed' column as hue.

PROBLEM 2.2

Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network

Resolution:

For training and testing purposes we split the dataset into train and test data in a 70:30 ratio, as
sketched below.

We have divided the dataset into train and test sets.
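A minimal sketch of the split (assuming the insurance data is in a DataFrame named ins_df; converting the categorical columns to integer codes is one possible encoding approach):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Convert object columns to integer codes so the models can use them
    for col in ins_df.select_dtypes(include="object").columns:
        ins_df[col] = pd.Categorical(ins_df[col]).codes

    X = ins_df.drop("Claimed", axis=1)   # predictors
    y = ins_df["Claimed"]                # target: claim status

    # 70:30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1)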


MODEL 1

Building A Decision Tree Classifier

Checking the features
Regularizing the Decision Tree

Adding Tuning Parameters

Variable Importance
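A sketch of the decision tree steps above (the tuning parameter values shown are illustrative, not necessarily the exact ones used):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Regularized decision tree with tuning parameters
    dt_model = DecisionTreeClassifier(criterion="gini", max_depth=5,
                                      min_samples_leaf=10, min_samples_split=30,
                                      random_state=1)
    dt_model.fit(X_train, y_train)

    # Variable importance
    print(pd.DataFrame(dt_model.feature_importances_,
                       index=X_train.columns,
                       columns=["Imp"]).sort_values("Imp", ascending=False))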

MODEL 2:

Building an Ensemble Random Forest Classifier

Treating outliers before building the Random Forest


Boxplot to check the outliers

Random Forest Classifier

Finding the optimal parameters using grid search


Fitting the RFCL model with the optimal values obtained by the grid search method

Best grid values

Predicting on the training dataset for the Random Forest
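A sketch of the Random Forest steps above (the parameter grid values are illustrative, not necessarily the exact ones used):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative parameter grid for the grid search
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [5, 7, 10],
        "min_samples_leaf": [10, 25],
        "max_features": [3, 5],
    }
    rfcl = RandomForestClassifier(random_state=1)
    grid_search = GridSearchCV(rfcl, param_grid, cv=3)
    grid_search.fit(X_train, y_train)

    best_rf = grid_search.best_estimator_   # model refit with the best grid values
    print(grid_search.best_params_)

    # Predictions on the training dataset
    ytrain_predict_rf = best_rf.predict(X_train)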

MODEL 3

Building a neural network classifier

Before building the model, we scale the values to a standard range using MinMaxScaler.


After fitting the scaler on the training data, we apply the same transformation to the test data.

MLP classifier

Training the model

Grid Search

Fitting the model using the optimal values from grid search
Best grid values
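A sketch of the neural network steps above (the parameter grid values are illustrative, not necessarily the exact ones used):

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import GridSearchCV

    # Scale features to the 0-1 range; fit on train, then apply the same scaler to test
    mm_scaler = MinMaxScaler()
    X_train_s = mm_scaler.fit_transform(X_train)
    X_test_s = mm_scaler.transform(X_test)

    # Illustrative parameter grid for the MLP classifier
    param_grid = {
        "hidden_layer_sizes": [(50,), (100,)],
        "activation": ["relu", "tanh"],
        "max_iter": [2500],
        "tol": [0.01],
    }
    nncl = MLPClassifier(random_state=1)
    grid_search_nn = GridSearchCV(nncl, param_grid, cv=3)
    grid_search_nn.fit(X_train_s, y_train)

    best_nn = grid_search_nn.best_estimator_   # model refit with the best grid values
    print(grid_search_nn.best_params_)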

PROBLEM 2.3

Performance Metrics: Comment on and check the performance of predictions on the train and test sets
using Accuracy, Confusion Matrix, and AUC/ROC.

Resolution:

Decision tree prediction

Accuracy

Confusion Matrix
Model Evaluation for Decision Tree
AUC and ROC for the training data for Decision Tree
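A sketch of how these performance metrics can be computed (shown for the decision tree; the same calls apply to the Random Forest and neural network models, using the scaled features for the latter):

    import matplotlib.pyplot as plt
    from sklearn import metrics

    # Accuracy and confusion matrix on train and test data
    ytrain_pred = dt_model.predict(X_train)
    ytest_pred = dt_model.predict(X_test)
    print("Train accuracy:", metrics.accuracy_score(y_train, ytrain_pred))
    print("Test accuracy:", metrics.accuracy_score(y_test, ytest_pred))
    print(metrics.confusion_matrix(y_test, ytest_pred))
    print(metrics.classification_report(y_test, ytest_pred))

    # ROC curve and AUC using the predicted probability of a claim
    probs = dt_model.predict_proba(X_train)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_train, probs)
    print("Train AUC:", metrics.auc(fpr, tpr))
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()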

Model 2 prediction – Random Forest

Accuracy

Confusion Matrix

Model Evaluation for Random Forest
AUC and ROC for the training data for Random Forest

Accuracy

Confusion Matrix

MODEL 3

CONFUSION MATRIX
ACCURACY

Model evaluation for neural network classifier

Accuracy

Confusion Matrix
PROBLEM 2.4

Final Model: Compare all the models and write an inference about which model is best/optimized.

Resolution:
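A sketch of how the headline train/test metrics of the three models can be collected into a single comparison table (assuming the fitted models and the scaled/unscaled feature sets from the earlier sketches are available):

    import pandas as pd
    from sklearn import metrics

    models = {"CART": dt_model, "Random Forest": best_rf, "Neural Network": best_nn}

    rows = {}
    for name, model in models.items():
        # the neural network was fitted on scaled features, the other two on unscaled
        Xtr, Xte = (X_train_s, X_test_s) if name == "Neural Network" else (X_train, X_test)
        rows[name] = [
            metrics.accuracy_score(y_train, model.predict(Xtr)),
            metrics.accuracy_score(y_test, model.predict(Xte)),
            metrics.roc_auc_score(y_test, model.predict_proba(Xte)[:, 1]),
        ]

    comparison = pd.DataFrame(rows, index=["Train Accuracy", "Test Accuracy", "Test AUC"])
    print(comparison)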
CONCLUSION:

Here we select the RF model, as its accuracy, precision, recall, and F1 score are better than those of
the other two models (CART and NN).
PROBLEM 2.5

Inference: Based on the whole Analysis, what are the business insights and recommendations?

Resolution:

After thoroughly analyzing the models, we note that more data would help us understand customer
behaviour and predict claims better.

Streamlining online experiences has benefitted customers, which led to an increase in conversions and
subsequently raised profits.

• 90% of the insurance is sold through the online channel.

• Almost all the offline business has a claim associated with it.

• The JZI agency resources need to be trained to pick up sales, as they are at the bottom; we need to
run a promotional marketing campaign or evaluate whether to tie up with an alternate agency.

• Based on the model we are getting about 80% accuracy, so when a customer books airline tickets or
plans a trip, we can cross-sell the insurance based on the claim data pattern.

• Another interesting fact is that more sales happen via the Agency channel than via Airlines, yet the
trend shows that claims are processed more for Airlines. So we may need to perform a deep-dive
analysis of the process to understand the workflow and the reasons behind it.

Key performance indicators (KPIs) of insurance claims are:

• Increase customer satisfaction, which in turn will generate more revenue

• Combat fraudulent transactions; deploy measures to detect and avoid fraudulent transactions at the earliest

• Optimize claims recovery method

• Reduce claim handling costs.

The End

Thakur Arun Singh

*****************************^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^**************************
