Data Mining Business Report
DATA MINING
This report is an explanation of how we approached each assignment. It provides the resolution of, and explanation for, each of the problems.
CONTENTS
Problem 1: Clustering
    Problem 1.1
    Problem 1.2
    Problem 1.3
    Problem 1.4
    Problem 1.5
Problem 2: CART, Random Forest, and ANN
    Problem 2.1
    Problem 2.2
        Model 2
        Model 3
    Problem 2.3
    Problem 2.4
    Problem 2.5
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of users during the past few months. You are
given the task to identify the segments based on credit card usage.
PROBLEM 1.1
Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Resolution:
Describing the data:
First we import all the necessary libraries in Python, and then load the data file, ‘bank_marketing_part1_Data’. Once the file is imported, we confirm whether the data has been loaded correctly using the ‘head’ function, which lets us view the data and verify that the columns and headers align correctly.
Then, using the ‘shape’ function, we can see how many rows and columns there are in our data set.
The ‘info’ function has been used to check the data type of each column and to look for null values.
The ‘describe’ function gives a detailed description of the data, such as count, mean, median, min, max, and standard deviation.
Using the ‘isnull’ function, one can check whether there are any null values in the data set; we do not have any null values in the existing data set.
Using the ‘duplicated’ function, we checked for duplicates, and there were no duplicate rows.
After reviewing the data thoroughly, and based on the above analysis, we can say that we have seven variables, the mean and median values are almost equal, and the standard deviation for ‘Spending’ is higher than for the other variables. There are no duplicates in the data set.
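These initial checks can be sketched as follows. The small frame below is a synthetic stand-in for ‘bank_marketing_part1_Data’ (hypothetical values; the real file name and full column list are as in the report), since only the sequence of calls matters here:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank data; in the report the file is read
# from 'bank_marketing_part1_Data' (exact read call assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'spending': rng.normal(14, 3, 50),
    'advance_payments': rng.normal(14, 1, 50),
    'probability_of_full_payment': rng.uniform(0.80, 0.92, 50),
})

print(df.head())               # view first rows: do columns/headers align?
print(df.shape)                # (number of rows, number of columns)
df.info()                      # data types and non-null counts per column
print(df.describe().T)         # count, mean, std, min, quartiles, max
print(df.isnull().sum())       # null values per column (all zero here)
print(df.duplicated().sum())   # number of duplicate rows (zero here)
```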
Exploratory data analysis
Univariate and multivariate analysis
The above dist plot shows the distribution of data from 10 – 22 and is positively skewed. Boxplot shows
that there are no outliers.
The above dist plot shows the distribution of data from 12 – 17 and is positively skewed. Boxplot shows that there are no outliers.
The above dist plot shows the distribution of data from 0.80 – 0.92 and is negatively skewed. Boxplot shows that there are a few outliers. The probability values are mostly above 80%, which is good.
The above dist plot shows the distribution of data from 5.0 – 6.5 and is positively skewed. Boxplot
shows that there are no outliers.
The above dist plot shows the distribution of data from 2.5 – 4.0 and is positively skewed. Boxplot
shows that there are no outliers.
The above dist plot shows the distribution of data from 1 – 8 and is positively skewed. Boxplot shows
that there are a few outliers.
The above dist plot shows the distribution of data from 4.5 – 6.5 and is positively skewed. Boxplot
shows that there are a few outliers.
Outliers will not be treated, as only 3 to 4 outlying values were observed in the data set.
Multivariate Analysis
From both analyses we can see that there is a strong positive correlation between all the variables except ‘min_payment_amt’.
PROBLEM 1.2
Do you think scaling is necessary for clustering in this case? Justify
Resolution:
As the data is unscaled, it is imperative that we perform scaling, because clustering works on distance-based computations and the variables are on different scales. For example, ‘spending’ and ‘advance_payments’ take values in different ranges, so variables with larger values may get more weight. Scaling brings all the values down to relatively the same range. Below is how the data looks after scaling with the standard scaler.
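A minimal sketch of the scaling step, on a few made-up rows (the values are illustrative, not the actual data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative rows on very different scales, mimicking 'spending'
# versus the 0-1 ratio 'probability_of_full_payment'.
df = pd.DataFrame({
    'spending': [10.6, 14.9, 19.1, 12.3, 16.4],
    'probability_of_full_payment': [0.86, 0.88, 0.91, 0.84, 0.90],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# After scaling, each column has mean ~0 and unit variance, so no single
# variable dominates the distance computation.
print(scaled.mean().round(2))
print(scaled.std(ddof=0).round(2))
```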
PROBLEM 1.3
Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram
and briefly describe them
Resolution:
For hierarchical clustering, we can use Ward’s method on the scaled data.
From the above dendrogram, we can see that all the data points have been grouped into different clusters.
To achieve the business objective and obtain the optimal number of clusters, we can truncate the dendrogram using ‘truncate_mode = lastp’.
Here we can see that the data has been grouped into three clusters.
Using ‘fcluster’ with the criterion ‘maxclust’, we can map these clusters back to the data set.
From the above dendrogram, we can see that all the data points have been grouped into different clusters by the average method.
To achieve the business objective and obtain the optimal number of clusters, we can again truncate the dendrogram using ‘truncate_mode = lastp’.
Here too we can see that the data has been grouped into three clusters.
Using ‘fcluster’ with the criterion ‘maxclust’, we can map these clusters back to the data set.
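The hierarchical-clustering steps above (Ward linkage, truncated dendrogram, and mapping with ‘fcluster’) can be sketched as follows; the matrix X is a synthetic stand-in for the scaled bank data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for the scaled data: three well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, (30, 3)) for loc in (-3, 0, 3)])

# Ward's method on the scaled data
wardlink = linkage(X, method='ward')

# Truncated dendrogram showing only the last 10 merges (needs matplotlib):
# dendrogram(wardlink, truncate_mode='lastp', p=10)

# Map each row to one of 3 clusters with criterion='maxclust'
clusters = fcluster(wardlink, 3, criterion='maxclust')
print(np.bincount(clusters)[1:])   # size of each cluster
```

The `clusters` array can then be attached as a new column of the data set, as done in the report.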
Clustering analysis – From the above analysis we can say that there is not much variation between the two methods. Both methods have similar means, with minor variation, which is expected.
Clustering – Based on the above dendrograms, 3 to 4 cluster groups look appropriate. Given the dataset we have, we can go with 3 clusters. This gives us a pattern based on high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
PROBLEM 1.4
Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
Resolution:
For K-Means clustering, we first arbitrarily set n_clusters = 3 and look at the distribution of clusters.
Now that we have 3 clusters (0, 1, 2), to find the optimal number of clusters we can use the elbow method.
To obtain the inertia value for each number of clusters from 1 to 11, we can use a ‘for’ loop and look for the optimal number of clusters.
We then add the cluster results to our dataset to address the business objective.
Observations - With the K-Means method we arrive at 3 clusters. We find this optimal because beyond this point there is no large drop in inertia values, and the elbow curve shows similar results.
The silhouette widths for K-Means also indicate that the data points are properly clustered: the minimum silhouette width, though small, is not negative, so no data point appears to be mismatched to the wrong cluster.
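A sketch of the K-Means steps above (elbow loop over inertia, then silhouette check), again on a synthetic stand-in for the scaled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic scaled matrix with three well-separated groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.4, (40, 3)) for loc in (-3, 0, 3)])

# Elbow method: inertia (within-cluster sum of squares) for k = 1..10
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# Plot k vs inertia and look for the k after which the drop flattens.

# Final model with the chosen k = 3
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km3.labels_
print(silhouette_score(X, labels))   # closer to 1 = well-separated clusters
```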
Based on the above results, 3 to 4 cluster groups look appropriate. Given the dataset we have, we can go with 3 clusters.
This gives us a pattern based on high/medium/low spending together with max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments made).
PROBLEM 1.5
Describe cluster profiles for the clusters defined. Recommend different promotional strategies for
different clusters.
Resolution:
Based on the above analysis, we can divide the cluster profiles into three groups.
They are potential target customers who are paying their bills, making purchases, and maintaining a comparatively good credit score. For them we can:
• Increase the credit limit or lower the interest rate.
• Promote premium cards/loyalty cards to increase transactions.
• Increase spending habits by tying up with premium e-commerce sites, travel portals, and airlines/hotels, as this will encourage them to spend more.
Problem 2: CART, Random Forest, and ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides
to collect data from the past few years. You are assigned the task to make a model which predicts the
claim status and provide recommendations to management. Use CART, RF & ANN and compare the
models' performances in train and test sets.
PROBLEM 2.1
Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).
Resolution:
First we import all the necessary libraries in Python, and then load the data file, ‘insurance_part2_data’. Once the file is imported, we confirm whether the data has been loaded correctly using the ‘head’ function, which lets us view the data and verify that the columns and headers align correctly.
Then, using the ‘shape’ function, we can see how many rows and columns there are in our data set.
The ‘info’ function has been used to check the data type of each column and to look for null values.
The ‘describe’ function gives a detailed description of the data (count, mean, std, min, quartiles, and max for the numeric columns; count, unique, top, and freq for the categorical columns):
              count  unique              top  freq     mean      std  min  25%   50%     75%     max
Age            3000     NaN              NaN   NaN   38.091  10.4635    8   32    36      42      84
Agency_Code    3000       4              EPX  1365      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Type           3000       2    Travel Agency  1837      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Claimed        3000       2               No  2076      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Commission     3000     NaN              NaN   NaN  14.5292  25.4815    0    0  4.63  17.235  210.21
Channel        3000       2           Online  2954      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Duration       3000     NaN              NaN   NaN  70.0013  134.053   -1   11  26.5      63    4580
Sales          3000     NaN              NaN   NaN  60.2499   70.734    0   20    33      69     539
Product Name   3000       5  Customised Plan  1136      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Destination    3000       3             ASIA  2465      NaN      NaN  NaN  NaN   NaN     NaN     NaN
Using the ‘isnull’ function, one can check whether there are any null values in the data set; we do not have any null values in the existing data set.
Checking for duplicates, we found a few duplicate rows. Although the ‘drop_duplicates’ function could exclude them, there is no unique identifier in the data, so we are not dropping the duplicates; they may belong to different customers.
Checking for Outliers
Outliers exist in almost all the numeric variables.
AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
Univariate
The above dist plot shows the distribution of data from 20 – 80 and is positively skewed. Boxplot shows
that there are no outliers. Majority of the distribution lies in the range of 30 – 40.
The above dist plot shows the distribution of data from 0 – 30 and is positively skewed. Boxplot shows that there are a few outliers.
The above dist plot shows the distribution of data from 0 – 100 and is positively skewed. Boxplot shows
that there are a few outliers.
The above dist plot shows the distribution of data from 30 – 300 and is positively skewed. Boxplot
shows that there are a few outliers.
Categorical variables
The distribution of the agency codes shows that EPX has the maximum frequency.
The box plot shows the split of sales across agency codes, with the ‘Claimed’ column as hue. It seems that C2B has more claims than the other agencies.
The box plot shows the split of sales across agency types, again with the ‘Claimed’ column as hue.
The box plot shows the split of sales across channels, again with the ‘Claimed’ column as hue.
The customised plan seems to be the plan most liked by customers compared to all the other plans.
The box plot shows the split of sales across product names, again with the ‘Claimed’ column as hue.
Asia is the destination customers choose most often compared with the other destinations.
The box plot shows the split of sales across destinations, again with the ‘Claimed’ column as hue.
PROBLEM 2.2
Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial
Neural Network
Resolution:
For training and testing purposes, we split the dataset into train and test data in the ratio 70:30.
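The split can be sketched as follows; the feature matrix below is a synthetic, already-encoded stand-in for the insurance data (the column names come from the data description above, the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded features and the 'Claimed' target (0/1).
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(3000, 5)),
                 columns=['Age', 'Agency_Code', 'Commission',
                          'Duration', 'Sales'])
y = pd.Series(rng.integers(0, 2, 3000), name='Claimed')

# 70:30 split; stratify keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)   # (2100, 5) (900, 5)
```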
Regularizing the Decision Tree
Variable Importance
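A sketch of both steps, on illustrative data: regularizing the decision tree by constraining its growth, then reading off variable importance. The specific hyperparameter values here are assumptions, not necessarily the report's chosen values:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; the report fits the tree on the encoded insurance data.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

# Regularize the tree so it does not overfit: limit the depth and require
# a minimum number of samples before a node may split or become a leaf.
dtc = DecisionTreeClassifier(max_depth=5, min_samples_split=30,
                             min_samples_leaf=10, random_state=0)
dtc.fit(X, y)

# Variable importance: how much each feature reduces impurity overall
# (the values sum to 1).
print(dtc.feature_importances_)
```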
MODEL 2: Random Forest
Search Method
MODEL 3: Artificial Neural Network
MLP classifier
Training the model with Grid Search
Fitting the model using the optimal values from grid search
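The grid-search step can be sketched as below; the parameter grid values are assumptions for illustration, not the report's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative data standing in for the encoded, scaled training set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate hyperparameters for the MLP classifier (assumed values).
param_grid = {
    'hidden_layer_sizes': [(50,), (100,)],
    'activation': ['relu', 'tanh'],
}
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)            # the best grid values
best_model = grid.best_estimator_   # model refit with the optimal values
```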
Best grid values:
PROBLEM 2.3
Performance Metrics: Comment on and check the performance of predictions on the train and test sets using Accuracy, Confusion Matrix, and AUC and ROC.
Resolution:
Accuracy
Confusion Matrix
Model Evaluation for Decision Tree
AUC and ROC for the training data for Decision Tree
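The evaluation calls can be sketched as follows on illustrative data; the report applies the same calls to the fitted CART, RF, and ANN models, on both the train and the test sets:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model standing in for the fitted CART model.
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)              # class predictions
prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(accuracy_score(y_test, pred))       # overall accuracy
print(confusion_matrix(y_test, pred))     # [[TN, FP], [FN, TP]]
print(roc_auc_score(y_test, prob))        # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, prob)     # points for plotting the ROC curve
```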
Model 2 Prediction: Random Forest
Accuracy
Confusion Matrix
Model Evaluation for Random Forest
AUC and ROC for the training data for Random Forest
Accuracy
Confusion Matrix
MODEL 3: Artificial Neural Network
Confusion Matrix
Accuracy
Accuracy
Confusion Matrix
PROBLEM 2.4
Final Model: Compare all the models and write an inference which model is best/optimized.
Resolution:
CONCLUSION:
Here we select the RF model, as it has better accuracy, precision, recall, and F1 score than the other two models, CART and NN.
PROBLEM 2.5
Inference: Based on the whole Analysis, what are the business insights and recommendations?
Resolution:
After thoroughly analyzing the models, we find that more data would help us understand the problem and build better-predicting models. Streamlining the online experience benefitted customers, which led to an increase in conversions and subsequently raised profits.
• We need to train the JZI agency resources to pick up sales, as they are at the bottom; we should run a promotional marketing campaign or evaluate whether we need to tie up with an alternate agency.
• Based on the model we are getting around 80% accuracy, so when a customer books airline tickets or plans, we can cross-sell the insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via the Agency than via the Airlines, while the trend shows more claims are processed on the Airline side. So we may need to perform a deep-dive analysis of the process to understand the workflow and the reasons behind it.
The End