
DATA MINING

BUSINESS REPORT

HANSRAJ YADAV
PGPDSBA JAN’2020 BATCH

CONTENTS

1. Objective
2. Problem 1: Clustering
   a) Assumptions
   b) Importing Packages
   c) Solution 1.1
   d) Solution 1.2
   e) Solution 1.3
   f) Solution 1.4
   g) Solution 1.5
3. Problem 2: CART, RF, ANN
   a) Assumptions
   b) Importing Packages
   c) Solution 2.1
   d) Solution 2.2
   e) Solution 2.3
   f) Solution 2.4
   g) Solution 2.5

PROJECT OBJECTIVE

Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past
few months. You are given the task to identify the segments based on credit card usage.
1.1. To perform Exploratory Data Analysis on the dataset and describe it briefly.
1.2. To provide justification whether scaling is necessary for clustering in this case.
1.3. To perform hierarchical clustering on scaled data and identify the optimum number of
clusters using a Dendrogram and briefly describe them.
1.4. To perform K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and silhouette score.
1.5. To describe cluster profiles for the clusters defined and recommend different
promotional strategies for different clusters.

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task to
make a model which predicts the claim status and provide recommendations to
management. Use CART, RF & ANN and compare the models' performances in train and test
sets.
2.1. To read the dataset and perform the descriptive statistics and do null value
condition check and write an inference on it.
2.2. To split the data into test and train, build classification model CART, Random Forest
and Artificial Neural Network.
2.3. To check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
2.4. To compare all the models and write an inference which model is best/optimized.
2.5. To provide business insights and recommendations.

PROBLEM 1: CLUSTERING

ASSUMPTIONS
The dataset provided to us is stored as “bank_marketing_part1_Data.csv” which contains
data of 210 customers and 7 variables namely:

spending                      Amount spent by the customer per month (in 1000s)
advance_payments              Amount paid by the customer in advance by cash (in 100s)
probability_of_full_payment   Probability of payment made in full by the customer to the bank
current_balance               Balance amount left in the account to make purchases (in 1000s)
credit_limit                  Limit of the amount on the credit card (in 10000s)
min_payment_amt               Minimum amount paid by the customer while making monthly payments for purchases (in 100s)
max_spent_in_single_shopping  Maximum amount spent in one purchase (in 1000s)

IMPORTING PACKAGES
To import the dataset and perform Exploratory Data Analysis on it, we imported the following packages:
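The exact imports used in the notebook are not reproduced here; a minimal sketch of the packages this analysis assumes might look as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples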

SOLUTIONS
1.1 To perform Exploratory Data Analysis on the dataset and describe it
briefly.
Importing the Dataset

The dataset in question is imported into the Jupyter notebook using the pd.read_csv() function and stored as "bank_df". The top 5 rows of the dataset are viewed using the .head() method.
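An illustrative sketch of this loading and inspection step in a Jupyter notebook (the file and DataFrame names follow the report; the exact notebook code may differ):

bank_df = pd.read_csv("bank_marketing_part1_Data.csv")

bank_df.head()           # first five rows
bank_df.shape            # dimensions of the dataset
bank_df.info()           # datatypes and non-null counts
bank_df.describe()       # summary statistics
bank_df.isnull().sum()   # missing-value count per column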

Dimension of the Dataset

Structure of the Dataset

The structure of the dataset can be inspected using the .info() method.

Summary of the Dataset


The summary of the dataset can be computed using the .describe() method.

Checking for Missing Values

Missing values or "NA" values need to be checked and dropped from the dataset for ease of evaluation, as null values can cause errors or disparities in the results. Missing values can be computed using the .isnull().sum() function.

As computed from the above command, the dataset does not have any null or NA values.

Univariate Analysis

Histograms are plotted for all the numerical variables using the sns.displot() function from the seaborn package.
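A sketch of a loop for these univariate plots (also covering the boxplots used for the outlier check in the next subsection); the figure styling is an assumption:

for col in bank_df.columns:
    sns.displot(bank_df[col], kde=True)       # histogram with a density estimate
    plt.title("Distribution of " + col)
    plt.show()

    sns.boxplot(x=bank_df[col])               # boxplot to spot outliers
    plt.title("Boxplot of " + col)
    plt.show()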

Boxplots of Variables to check for Outliers

Inference: After plotting the boxplots for all the variables, we can conclude that a few outliers are present in one variable, namely min_payment_amt, which means that there are only a few customers whose minimum payment amount falls on the higher side on average. Since only one of the seven variables has a small number of outlier values, there is no need to treat the outliers; these few values will not make any difference to our analysis.

We can conclude from the above graphs that most of the customers in our data have a higher spending capacity and a high current balance in their accounts, and that these customers spend a higher amount during a single shopping event. The majority of the customers have a high probability of making full payment to the bank.

Multivariate Analysis

Heat Map (Relationship Analysis)

We will now plot a Heat Map or Correlation Matrix to evaluate the relationship between different
variables in our dataset. This graph can help us to check for any correlations between different
variables.
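A sketch of how such a heat map can be produced with seaborn (the colour map and annotation settings are assumptions):

plt.figure(figsize=(8, 6))
sns.heatmap(bank_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()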

Inference: As per the Heat Map, we can conclude that the following variables are highly correlated:

• Spending and advance_payments, spending and current_balance, spending and credit_limit
• Advance_payments and current_balance, advance_payments and credit_limit
• Current_balance and max_spent_in_single_shopping

From this we can conclude that the customers who spend very heavily have a higher current balance and a high credit limit. Advance payments and the maximum amount spent in a single shopping event are highest for those customers who have a high current balance in their bank accounts.

The probability of full payment is higher for those customers who have a higher credit limit.

The minimum payment amount is not correlated with any of the other variables; hence, it is not affected by changes in the customers' current balance or credit limit.

Pair Plot for all the variables

With the help of the above pair plot we can understand the Univariate and Bivariate trends for all
the variables in the dataset.
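A short sketch of the pair plot above (the diag_kind setting is an assumption):

sns.pairplot(bank_df, diag_kind="kde")
plt.show()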

1.2 To provide justification whether scaling is necessary for clustering in this case

Feature scaling or standardisation is a pre-processing technique for machine learning algorithms. It is applied to the independent variables and normalises the data to a particular range. If feature scaling is not done, a machine learning algorithm tends to give greater weight to larger values and treat smaller values as less important, regardless of the unit in which the values are expressed.

For the data given to us, scaling is required as the variables are expressed in different units, such as spending in 1000s, advance payments in 100s and credit limit in 10000s, whereas probability is expressed as a fraction or decimal value. Since the variables expressed in larger units would outweigh the probabilities and could distort the results, it is important to scale the data using StandardScaler and thereby normalise the values so that each variable has a mean of 0 and a standard deviation of 1.

Scaling of the data is done by importing the StandardScaler class from sklearn.preprocessing. For further clustering of the dataset, we will be using the scaled data, "scaled_bank_df".
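A minimal sketch of this scaling step, assuming the scaled result is kept as a DataFrame:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_bank_df = pd.DataFrame(scaler.fit_transform(bank_df),
                              columns=bank_df.columns)
scaled_bank_df.head()    # means ~0, standard deviations ~1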

1.3 To perform hierarchical clustering on scaled data and identify the optimum
number of clusters using a Dendrogram and briefly describe them.

Cluster analysis or clustering is a widely accepted unsupervised learning technique in machine learning. Clustering can be divided into two main categories, namely hierarchical and K-means clustering.

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar
objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct
from each other cluster, and the objects within each cluster are broadly similar to each other. There
are two types of hierarchical clustering, Divisive and Agglomerative.

For the dataset in question we will be using the agglomerative hierarchical clustering method to create the optimum clusters and categorise the dataset on the basis of these clusters.

To create a dendrogram using our scaled data, we first imported the dendrogram and linkage functions from scipy.cluster.hierarchy. Using these, we created a dendrogram which shows two clusters very clearly. Next, we check the make-up of these two clusters using the 'maxclust' and 'distance' criteria. As can be seen from the above, we will take 2 clusters for our further analysis.
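A sketch of this hierarchical clustering step; the Ward linkage method and the distance threshold below are assumptions, since the report does not state the exact settings:

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

wardlink = linkage(scaled_bank_df, method="ward")

plt.figure(figsize=(10, 5))
dendrogram(wardlink, truncate_mode="lastp", p=10)   # show the last 10 merges only
plt.show()

# Cut the tree into 2 clusters in two ways.
clusters_max = fcluster(wardlink, 2, criterion="maxclust")
clusters_dist = fcluster(wardlink, 20, criterion="distance")  # threshold is illustrative

bank_df["H_clusters"] = clusters_max    # attach the cluster labels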

The above graph shows the last 10 links in the dendrogram.

The above two methods show the designated cluster assigned to each of the customers. We have segregated the two clusters using the two criteria available in the fcluster function.

1.4 To perform K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and silhouette score.
K-means clustering is one of the unsupervised machine learning algorithms. The K-means
algorithm identifies k centroids and then allocates every data point to the nearest cluster,
while keeping the clusters as compact as possible.

For the dataset, we will apply K-means clustering to the scaled data, identify the clusters formed and use them further to devise tools to target each group separately.

First, we scaled the dataset using the StandardScaler class from sklearn.preprocessing. Using scaled_bank_df, we now plot two curves to determine the optimal number of clusters (k) to use for our clustering. The two methods are the within-cluster sum of squares (WSS) method and the average silhouette score method.
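A sketch of how both curves can be generated (the range of k and the random_state are assumptions):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

wss, sil = [], []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=1)
    labels = km.fit_predict(scaled_bank_df)
    wss.append(km.inertia_)                               # within-cluster sum of squares
    sil.append(silhouette_score(scaled_bank_df, labels))  # average silhouette score

plt.plot(list(k_values), wss, marker="o")
plt.title("Elbow curve (WSS)")
plt.show()

plt.plot(list(k_values), sil, marker="o")
plt.title("Average silhouette score")
plt.show()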

As per the above plot, i.e. the within-cluster sum of squares (WSS) method, we can conclude that the optimal number of clusters for K-means clustering is 3, since the elbow in the curve shows that after 3 the curve flattens out.

As per the plot of average silhouette scores, the highest average score corresponds to k = 3. Hence, as per both methods, i.e. the within-cluster sum of squares and the silhouette method, we can conclude that the optimal number of clusters (k) for K-means clustering is 3.

The silhouette scores and silhouette widths are calculated using the silhouette_samples and silhouette_score functions from sklearn.metrics. The average silhouette score comes to 0.400 and the minimum silhouette score is 0.002. The silhouette score ranges from -1 to +1, and the higher the silhouette score, the better the clustering.

1.5. To describe cluster profiles for the clusters defined and recommend
different promotional strategies for different clusters.

Now, the final step is to examine the clusters we have created using hierarchical clustering and K-means clustering for our market segmentation analysis and devise promotional strategies for the different clusters. From the above analysis we have identified 2 clusters from hierarchical clustering and 3 optimal clusters from K-means clustering. We will now further analyse and determine the clustering approach that is most helpful for the market segmentation problem at hand. We will first plot and map the clusters from both methods.
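A sketch of how the final labels can be attached and the cluster averages in the table below computed (names follow the earlier sketches):

# Final 3-cluster K-means model on the scaled data.
km_final = KMeans(n_clusters=3, random_state=1)
bank_df["KM_clusters"] = km_final.fit_predict(scaled_bank_df)

# Cluster profiles: average of every variable per cluster, for both methods.
bank_df.drop(columns="KM_clusters").groupby("H_clusters").mean()
bank_df.drop(columns="H_clusters").groupby("KM_clusters").mean()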

HIERARCHICAL CLUSTERING

K-MEANS CLUSTERING

Now, in the below table we have tabulated the averages of all the variables for the five clusters created above using the hierarchical and K-means methods. As per these values we can segment the customers into two clusters for hierarchical clustering and three for K-means.

Segments

Hierarchical Cluster 1: This segment has higher spending per month and a high current balance and credit limit. This is the prosperous or upper class, with mostly higher incomes. This segment can be targeted with offers such as cards with rewards and loyalty points for every amount spent.

Hierarchical Cluster 2: This segment has much lower spending per month, with a low current balance and a lower credit limit. This is the middle class, with lower incomes. This segment can be targeted with cards that have lower interest rates, so as to encourage more spending.

K-means Cluster 0: This segment has the lowest spending per month and the lowest current balance and credit limit. This is the financially stressed class, with very low income on average. This segment can be targeted with cards that have offers such as zero annual charges, luring customers with benefits such as free coupons or tickets and waivers at a variety of places.

K-means Cluster 1: This segment has higher spending per month and a high current balance and credit limit. This is the prosperous or upper class, with mostly higher incomes. This segment can be targeted with offers such as cards with rewards and loyalty points for every amount spent.

K-means Cluster 2: This segment has much lower spending per month, with a low current balance and a lower credit limit. This is the middle class, with lower incomes. This segment can be targeted with cards that have lower interest rates, so as to encourage more spending.

VARIABLES                  Spending  Advance   Probability of  Current  Credit  Min Payment  Max spent in
                                     Payments  full payment    balance  Limit   Amt          single shopping
Hierarchical (Cluster 1)   18.62     16.26     0.88            6.19     3.71    3.66         6.06
Hierarchical (Cluster 2)   13.23     13.83     0.87            5.39     3.07    3.71         5.13
K-means (Cluster 0)        11.86     13.25     0.85            5.23     2.85    4.74         5.1
K-means (Cluster 1)        18.5      16.2      0.88            6.18     3.7     3.63         6.04
K-means (Cluster 2)        14.43     14.33     0.88            5.51     3.26    2.7          5.12

PROBLEM 2: CART-RF-ANN
ASSUMPTIONS
The dataset provided to us is stored as “insurance_part2_data.csv” which contains data of
3000 customers and 10 variables namely:

Age Age of insured

Agency_Code Code of tour firm
Type Type of tour insurance firms
Claimed Target: Claim Status
Commission The commission received for tour insurance firm
Channel Distribution channel of tour insurance agencies
Duration Duration of the tour
Sales Amount of sales of tour insurance policies
Product Name Name of the tour insurance products
Destination Destination of the tour

IMPORTING PACKAGES
To import the dataset and perform Exploratory Data Analysis on it, we imported the following packages:

SOLUTIONS
2.1 To read the dataset and perform the descriptive statistics and do null
value condition check and write an inference on it.

Importing the Dataset

The dataset in question is imported into the Jupyter notebook using the pd.read_csv() function and stored as "claim_df". The top 5 rows of the dataset are viewed using the .head() method.

Dimension of the Dataset

Structure of the Dataset

The structure of the dataset can be inspected using the .info() method.

Summary of the Dataset

The summary of the dataset can be computed using the .describe() method.

Checking for Missing Values

Missing values or "NA" values need to be checked and dropped from the dataset for ease of evaluation, as null values can cause errors or disparities in the results. Missing values can be computed using the .isnull().sum() function.

As computed from the above command, the dataset does not have any null or NA values.

Dropping the non-important columns

In this dataset, "Agency_Code" is a column which cannot be used for our analysis. Hence, we will drop this column using the .drop() function.

Univariate Analysis

Histograms are plotted for all the numerical variables using the sns.displot() function from the seaborn package.

Bar plots are plotted for all the categorical variables using the sns.countplot() function from the seaborn package.

Boxplots of Variables to check for Outliers

Dropping the Outliers from the Dataset

Inference: After plotting the boxplots for all the numerical variables, we can conclude that a very high number of outliers are present in the variables Age, Commision, Duration and Sales, which means that we need to treat these outlier values before proceeding with our model building and analysis, as they can introduce errors and cause deviations from the actual results.

We can conclude from the above graphs that the majority of the customers making a claim in our data belong to the age group of 25-40, with the type of tour agency firm being Travel Agency, the channel being Online, the product name being Customised Plan and the destination being Asia.

Multivariate Analysis

Heat Map (Relationship Analysis)

We will now plot a Heat Map or Correlation Matrix to evaluate the relationship between different
variables in our dataset. This graph can help us to check for any correlations between different
variables.

As interpreted from the above heat map, there is no or extremely low correlation between the
variables given in the dataset.

2.2. To split the data into test and train, build classification model CART,
Random Forest and Artificial Neural Network.

Converting ‘object’ datatype to ‘int’


For our analysis and for building the Decision Tree and Random Forest models, we have to convert the variables which have the 'object' datatype into integers.
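A sketch of one possible way to do this conversion, using pandas categorical codes (the report does not state which encoding method was used); the drop of Agency_Code from the earlier step is included:

claim_df = claim_df.drop("Agency_Code", axis=1)   # dropped earlier as non-important

# Convert every remaining 'object' column into integer codes.
for col in claim_df.select_dtypes(include="object").columns:
    claim_df[col] = pd.Categorical(claim_df[col]).codes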

Splitting the Dataset into Train and Test Data (70:30)
For building the models we now split the dataset into training and testing data in the ratio 70:30. The two predictor sets are stored in X_train and X_test, with their corresponding dimensions as follows
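A sketch of the 70:30 split (the target column name "Claimed" is taken from the data dictionary; the label variable names and random_state are assumptions):

from sklearn.model_selection import train_test_split

X = claim_df.drop("Claimed", axis=1)   # predictors
y = claim_df["Claimed"]                # target: claim status

X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)

print(X_train.shape, X_test.shape)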

CART Model
Classification and Regression Trees (CART) is a type of decision tree used in data mining. It is a supervised learning technique in which the predicted outcome is either a discrete class (classification) or a continuous, numerical value (regression).

Using the train dataset (X_train) we will create a CART model and then test the model on the test dataset (X_test).

For creating the CART model, two imports were used, namely "DecisionTreeClassifier" from sklearn.tree and the "tree" module from sklearn.

With the help of DecisionTreeClassifier we create a decision tree model, dt_model, and fit the training data to it using the "gini" criterion. After this, using the tree package, we create a dot file, claim_tree.dot, to help visualise the tree.

Below are the variable importance (feature importance) values for the tree.

Using the GridSearchCV class from sklearn.model_selection we identify the best parameters for building a regularised decision tree. After a few iterations over the parameter values, we obtained the following best parameters for the decision tree:

These best grid parameters are henceforth used to build the regularised or pruned Decision tree.

The regularised decision tree was built using the best grid parameters computed above and fitted to the training dataset with the "gini" criterion. The regularised tree is stored as a dot file, claim_tree_regularised.dot, and can be viewed using webgraphviz in the browser.
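A sketch of the CART workflow described above; the parameter grid is illustrative, not the grid actually used in the report:

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV

# Base tree with the gini criterion, plus feature importances.
dt_model = DecisionTreeClassifier(criterion="gini", random_state=1)
dt_model.fit(X_train, train_labels)
print(dict(zip(X_train.columns, dt_model.feature_importances_)))

# Grid search for the regularised (pruned) tree.
param_grid = {
    "max_depth": [4, 5, 6, 7],
    "min_samples_leaf": [20, 30, 50],
    "min_samples_split": [50, 100, 150],
}
grid_search = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=1),
                           param_grid, cv=5)
grid_search.fit(X_train, train_labels)
best_tree = grid_search.best_estimator_

# Export the pruned tree for viewing in webgraphviz.
export_graphviz(best_tree, out_file="claim_tree_regularised.dot",
                feature_names=list(X_train.columns), filled=True)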

Random Forest

Random Forest is another supervised learning technique used in machine learning. It consists of many decision trees; predictions are made by the individual trees and combined to select the best output.

Using the train dataset (X_train) we will create a Random Forest model and then test the model on the test dataset (X_test).

For creating the Random Forest, the "RandomForestClassifier" class is imported from sklearn.ensemble.

Using the GridSearchCV class from sklearn.model_selection we identify the best parameters for building the Random Forest, rfcl. After a few iterations over the parameter values, we obtained the following best parameters for the RF model:

Using these best parameters evaluated with GridSearchCV, a Random Forest model is created, which is then used for model performance evaluation.
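A sketch of the Random Forest grid search; the grid values are illustrative assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 7, 9],
    "min_samples_leaf": [10, 20],
    "max_features": [3, 4, 5],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=5)
rf_search.fit(X_train, train_labels)
rfcl = rf_search.best_estimator_     # Random Forest built from the best parameters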

Artificial Neural Network (ANN)

Artificial Neural Network(ANN) is a computational model that consists of several processing
elements that receive inputs and deliver outputs based on their predefined activation functions.

Using the train dataset (X_train) and test dataset (X_test), we will create a neural network using MLPClassifier from sklearn.neural_network.

First, we have to scale the two datasets using the StandardScaler package.

Using the GridSearchCV class from sklearn.model_selection we identify the best parameters for building the Artificial Neural Network model, mlp. After a few iterations over the parameter values, we obtained the following best parameters for the ANN model:

Using these best parameters evaluated with GridSearchCV, an Artificial Neural Network model is created, which is then used for model performance evaluation.
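A sketch of the ANN workflow; the hidden-layer sizes and other grid values are illustrative assumptions:

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Neural networks are sensitive to feature scale, so scale both sets first.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

mlp_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "tanh"],
    "max_iter": [2500],
    "tol": [0.01],
}
mlp_search = GridSearchCV(MLPClassifier(random_state=1), mlp_grid, cv=5)
mlp_search.fit(X_train_scaled, train_labels)
mlp = mlp_search.best_estimator_     # ANN built from the best parameters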

2.3.  To check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model.

To check the performance of the three models created above, certain model evaluators are used, i.e. the classification report, confusion matrix, ROC_AUC score and ROC plot. They are calculated first for the train data and then for the test data.
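A sketch of a helper that produces these evaluators for any of the three models (names follow the earlier sketches; the CART example at the end is illustrative, and for the ANN the scaled X_train_scaled / X_test_scaled would be passed instead):

import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

def evaluate(model, X, y, label):
    # Classification report, confusion matrix, ROC_AUC score and ROC curve.
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]      # probability of the 'Claimed' class
    print(label)
    print(classification_report(y, pred))
    print(confusion_matrix(y, pred))
    print("ROC_AUC:", roc_auc_score(y, prob))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=label)

evaluate(best_tree, X_train, train_labels, "CART - train")
evaluate(best_tree, X_test, test_labels, "CART - test")
plt.plot([0, 1], [0, 1], "k--")
plt.legend()
plt.show()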

CART Model
Classification Report

Confusion Matrix

ROC_AUC Score and ROC Curve

Model Score

Random Forest Model


Classification Report

Confusion Matrix

ROC_AUC Score and ROC Curve

Artificial Neural Network Model
Classification Report

Confusion Matrix

ROC_AUC Score and ROC Curve

2.4. To compare all the models and write an inference which model is
best/optimized.

A comparison of the performance evaluators for the three models is given in the following table. We use Precision, F1 Score and AUC Score for our evaluation.

Model                         Precision  F1 Score  AUC Score
CART - Train Data             0.67       0.82      0.84
CART - Test Data              0.61       0.77      0.76
Random Forest - Train Data    0.71       0.82      0.84
Random Forest - Test Data     0.65       0.77      0.80
Neural Network - Train Data   0.68       0.82      0.84
Neural Network - Test Data    0.60       0.76      0.79

Insights:
From the above table, comparing the model performance evaluators for the three models, it is clear that the Random Forest model performs better than the other two: it has higher precision for both the training and testing data, and although the AUC score is the same for all three models on the training data, it is the highest for the Random Forest model on the testing data. Choosing the Random Forest model is the best option in this case, as it exhibits much lower variance than a single decision tree or a multi-layered neural network.

2.5. To provide business insights and recommendations.

For the business problem of an insurance firm providing tour insurance, we have built several data models to predict the probability of a claim. The models attempted are CART (Classification and Regression Trees), Random Forest and an Artificial Neural Network (MLP). The three models were then evaluated on the training and testing datasets and their model performance scores calculated.

The accuracy, precision and F1 score are computed using the classification report. The confusion matrix, ROC_AUC scores and ROC plot are computed for each model separately and compared. All three models have performed well, but to increase our accuracy in determining the claims made by customers we can choose the Random Forest model. Instead of creating a single decision tree, it creates multiple decision trees and hence can provide the best prediction of claim status from the data.

As seen from the above model performance measures, all the models, i.e. CART, Random Forest and ANN, have performed exceptionally well. Hence, we could choose any of the models, but the Random Forest model is a great option: even though the models exhibit similar accuracy, Random Forest is preferable to the CART model as it has much less variance than a single decision tree.

