
Data Mining - Assignment

Girish Nayak
[email protected]
Contents
1. Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Check the performance of predictions on Train and Test sets using Accuracy, Confusion Matrix, ROC curve, ROC_AUC score, and classification reports for each model
2.4 Final Model: Compare all the models and write an inference which model is best/optimized
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations
1. Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate analysis)

Sample of the dataset –

Types of variables and their missing values –

From the above results we can see that there are no missing values present in the dataset.

Correlation Plot – To show how the different attributes in the bank data are correlated to each other.

Pairplot:

Pairplot shows the relationships between the variables in the form of scatterplots and the distribution of each variable in the form of histograms. From the graph, we can see that there is a positive linear relationship between variables like spending and advance_payments.
To check whether the data has outliers: as evident from the figure below, only min_payment_amt has outliers; the rest are fine.
Data values: The data values for spending, current_balance, and max_spent_in_single_shopping are in 1000s, whereas the others are in 100s or 10000s. So we will bring them to the same scale in the next question.

1.2 Do you think scaling is necessary for clustering in this case? Justify

Yes, scaling is necessary for this dataset because, as described in the data dictionary, the continuous variables are recorded on different scales –

- spending, current_balance, max_spent_in_single_shopping are in 1000s
- advance_payments, min_payment_amt are in 100s
- credit_limit is in 10000s

Because of this, the same 'actual' value has a different numeric impact depending on the attribute. For example, an actual value of 10,000 would be recorded as 100 in advance_payments (which is in 100s) but as 10 in spending (which is in 1000s). Since clustering techniques use Euclidean distance to form the cohorts, it is of utmost importance to scale the data first.

In the table below, we can see the differences in std, min, max, etc. across the different features.

To successfully scale this data – we can use StandardScaler from sklearn.preprocessing.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples.
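
As a minimal sketch, assuming the seven numeric features are in a DataFrame named df (a hypothetical name), the scaling can be done as follows:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# df holds the numeric features to be clustered (assumed name)
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# After scaling, every feature has mean ~0 and standard deviation ~1
print(scaled_df.describe().loc[["mean", "std"]])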

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them

The optimum number of clusters can be read off a dendrogram by cutting the tree where the vertical gap between successive merges is largest. Looking at the dendrogram below, and keeping in mind the number of records in the dataset, I am going ahead with 3 clusters.
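
A minimal sketch of producing the dendrogram and cutting it into 3 clusters is given below; the linkage method (Ward) and the variable name scaled_df are assumptions, not taken from the report:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Ward linkage on the scaled data (linkage method assumed)
wardlink = linkage(scaled_df, method="ward")

dendrogram(wardlink)
plt.title("Dendrogram")
plt.show()

# Cut the tree into 3 clusters; labels returned are 1, 2, 3
clusters = fcluster(wardlink, t=3, criterion="maxclust")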
The 3 clusters being –

1. Big spenders (Cluster 1)
2. Low spenders (Cluster 2)
3. Medium spenders (Cluster 3)

Below are some plots showing the clusters w.r.t some features
1.4 Apply K-Means clustering on scaled data and determine
optimum clusters. Apply elbow curve and silhouette score. Explain
the results properly. Interpret and write inferences on the finalized
clusters.

As per the below plot of the within-cluster sum of squares, it is evident that the change is not drastic beyond 3 clusters, hence the optimum number of clusters = 3.
The silhouette score for the 3 clusters = 0.401 and the minimum silhouette width is 0.0027. Since even the minimum width is positive, no observation appears to be assigned to the wrong cluster. This also means the clusters are well apart from each other and clearly distinguished.

Below is a plot of the elbow curve using distortions, where the optimum number of clusters also appears to be 3.
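
A minimal sketch of how such an elbow curve can be produced, assuming the scaled data is in scaled_df (hypothetical name) and an arbitrary random_state:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=1)  # random_state assumed
    km.fit(scaled_df)
    wss.append(km.inertia_)  # within-cluster sum of squares

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.show()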

The silhouette scores for different numbers of clusters are:

2 : 0.46577247686580914
3 : 0.40072705527512986
4 : 0.3291966792017613
5 : 0.28722184455759475
6 : 0.29127768970444345
7 : 0.2796045365286959
8 : 0.2554830824906814
9 : 0.2539488265085003

However, the minimum silhouette widths for different numbers of clusters are given below. As we can see, apart from the case of 3 clusters, there is interference among the clusters, evident from the negative silhouette widths.
2 : -0.0061712389274612344
3 : 0.002713089347678376
4 : -0.053840826993600814
5 : -0.08545150449435547
6 : -0.06438844854564076
7 : -0.11950241847834445
8 : -0.048959286018583625
9 : -0.11950241847834445
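
The two lists above can be reproduced with a sketch along the following lines (scaled_df and random_state are assumptions):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    avg_width = silhouette_score(scaled_df, labels)           # average silhouette width
    min_width = silhouette_samples(scaled_df, labels).min()   # worst-placed sample
    print(k, ":", avg_width, min_width)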

The cluster numbers are assigned to the original data frame. A sample is provided below; the new columns are as follows –

Cluster – Cluster number assigned by hierarchical clustering

Clus_kmeans – Cluster number assigned by k-means clustering.

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.

The clusters defined by KMeans are considered for this. Below are the profiles at a high level:

Cluster 0: These are the groups of low spenders. Below is an insight of the data in cluster 0.
Cluster 1: These are the groups of medium spenders. Below is an insight of the data in cluster 1.

Cluster 2: These are the groups of high spenders. Below is an insight of the data in cluster 2.

From the above 3 groups, we can deduce that the spread of the data is almost equal in all 3 clusters.
A few recommendations are provided below basis the data analysis:

1. The low-spender group of customers can be approached with marketing schemes that encourage more spending – like a bonus for a higher number of transactions within a month, or a gift when total spends reach X amounts.
2. The probability of full payment is lower among low spenders as compared to medium and high spenders. Since the bank earns when customers carry revolving credit, it may want to look into this.
3. The minimum payment amount is higher for the low spenders and lower for clusters 1 and 2. These customers can therefore be encouraged to make use of their credit limit and, if their transaction history supports it, have their credit limits enhanced.
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate analysis).

Sample of the data

Types of variables and their missing values –

From the above results we can see that there are no missing values present in the dataset.

Correlation Plot – To show how the different attributes in the data are correlated to each other.

Pairplot:
Pairplot shows the relationships between the variables in the form of scatterplots and the distribution of each variable in the form of histograms.

To check whether the data has outliers: as evident from the figure below, there are many outliers.
After removing the one outlier with duration > 4000:

Data values

Below are a few insights on the categorical features:

Count of each agency and avg commissions:

Avg sales from each agency, w.r.t destination


Claims from each agency:

2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network

The data is split into testing and training components using a random_state = 7. The random_state =
7 is used so that the same split can be repeated, if required. The split is as follows:
X_train (2100, 9)
X_test (900, 9)
train_labels (2100,)
test_labels (900,)
Total Obs 3000
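
A minimal sketch of the split, assuming the predictors are in X and the target in y (hypothetical names); test_size=0.30 is inferred from the 2100/900 shapes above:

from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=7)

print(X_train.shape, X_test.shape)  # (2100, 9) (900, 9)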

The pruned CART Model:


The importance of the features in the CART model:

Feature         Imp
Duration        0.259584
Sales           0.197865
Agency_Code     0.182647
Age             0.167921
Commision       0.123896
Product Name    0.045807
Destination     0.015892
Channel         0.005258
Type            0.001131
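
These importances can be read off the fitted tree as sketched below; best_tree is a hypothetical name for the pruned DecisionTreeClassifier:

import pandas as pd

# best_tree: the fitted, pruned DecisionTreeClassifier (assumed name)
imp = pd.DataFrame(best_tree.feature_importances_,
                   index=X_train.columns, columns=["Imp"])
print(imp.sort_values(by="Imp", ascending=False))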

For Random Forest, the same split is used and, using Grid Search, the following is the best parameter combination:
{'max_depth': 7,
'max_features': 4,
'min_samples_leaf': 8,
'min_samples_split': 36}
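
A minimal sketch of such a grid search is given below; the candidate grids are assumptions, and only the best values above come from the actual run:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {                      # candidate values assumed for illustration
    "max_depth": [5, 7, 9],
    "max_features": [3, 4, 5],
    "min_samples_leaf": [8, 10],
    "min_samples_split": [30, 36, 42],
}
grid = GridSearchCV(RandomForestClassifier(random_state=7),  # random_state assumed
                    param_grid, cv=3)
grid.fit(X_train, train_labels)
print(grid.best_params_)
best_rf = grid.best_estimator_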

For the Random Forest with the above parameters, the feature importance is:

Feature         Imp
Agency_Code     0.307525
Product Name    0.184714
Sales           0.180748
Commision       0.124665
Duration        0.090187
Age             0.065879
Type            0.035887
Destination     0.008151
Channel         0.002243
For ANN, the same split is used and, using Grid Search, the following is the best parameter combination:

{'activation': 'relu',
'hidden_layer_sizes': (100, 100, 100),
'max_iter': 10000,
'solver': 'adam',
'tol': 0.1}
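
With the best parameters above baked in, the ANN can be fitted as sketched below (random_state is an assumption):

from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(activation="relu",
                    hidden_layer_sizes=(100, 100, 100),
                    max_iter=10000,
                    solver="adam",
                    tol=0.1,
                    random_state=7)  # random_state assumed for repeatability
ann.fit(X_train, train_labels)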

2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
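
The same evaluation routine is applied to each of the three models. A minimal sketch, assuming a fitted classifier model and a 0/1-encoded target:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

def evaluate(model, X, y, label):
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    print(label, "AUC:", roc_auc_score(y, prob))
    fpr, tpr, _ = roc_curve(y, prob)
    plt.plot(fpr, tpr, label=label)

evaluate(model, X_train, train_labels, "Train")
evaluate(model, X_test, test_labels, "Test")
plt.legend()
plt.show()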

CART Model:
AUC and ROC for Training Data:

AUC and ROC for Test Data:


Training data accuracy: 0.8080952380952381
Test data accuracy: 0.7777777777777778

Classification report for training data:

Classification report for test data:

RF Model:
AUC and ROC for Training Data:
AUC and ROC for Test Data:

Training data accuracy: 0.8057142857142857


Test data accuracy: 0.7844444444444445

Classification report for training data:

Classification report for test data:


ANN Model:

Training data accuracy: 0.78


Test data accuracy: 0.79

Classification report for training data:

Classification report for test data:

2.4 Final Model: Compare all the models and write an inference
which model is best/optimized.

As per the details mentioned in the answer above, Random Forest is the final model.

2.5 Inference: Based on the whole Analysis, what are the business
insights and recommendations

Recommendations:
1. The claims from C2B are very high. Since C2B's only destination is ASIA, the company can look into these.
2. The lowest claims are from JZI, where most trips are to Europe, which seems to be a profitable destination for the company.
