Customer Churn - E-Commerce: Capstone Project Report

Project Notes - I
Customer Churn - E-Commerce

Capstone Project Report
Submitted by:
Karthiheswar M
Mentor guidance: Abhay Poddar
Batch: PGP-DSBA Sep’19
Date of Submission: 20 Sep’2020
Group - 2 1
Table of Contents
1 Introduction
1.1 Problem Statement
1.2 Need of the Project study
1.3 Business/Social Opportunity
2 Data Report
2.1 Collection of Data
2.2 Visual Inspection of Data
2.3 Understanding of Attributes
3 Exploratory Data Analysis
3.1 Univariate Analysis
3.2 Bivariate Analysis
3.3 Removal of Unwanted Variables
3.4 Missing Value Treatment
3.5 Outliers Treatment
3.6 Variable Transformation
4 Business insights from EDA
4.1 Checking whether the data is balanced
4.2 Clustering
4.3 Other Business Insights
5 Model Building and Interpretation
5.1 Building Various Models
5.2 Performance Metrics
5.3 Interpretation of Models
6 Model Tuning
6.1 Ensemble Modeling
6.2 Model Tuning Measures
6.3 Interpretation of Ensemble Model
6.4 Interpretation of Optimum Model
6.5 Implication on the Business
7 Appendix
1 Introduction
E-Commerce (Electronic Commerce) is the buying and selling of goods, products and online
services over the internet. It also includes the sending and receiving of funds, inventory
management and internet marketing. Business-to-Consumer (B2C) and Business-to-Business
(B2B) are two of the important types of transaction that can occur. E-Commerce is one of
the most active businesses across many industries, such as electronics, fashion, grocery,
furniture, medical supplies and food.
1.1 Problem Statement

E-Commerce is an important and active business, but it faces several problems. Among
them, customer churn is one of the most important factors to consider, because a major
part of sales and profit depends on it. This project therefore takes the steps needed to
predict customer churn.
1.2 Need of the Project study

Predicting customer churn helps the company decide how to proceed, because it can
evaluate customer feedback against past churn data. It also helps identify the reasons
why customers churn and the indications that a customer may churn.
1.3 Business/Social Opportunity

To predict churn for the E-Commerce company, the company’s dataset first has to be
explored to find insights that help in predicting customer churn. To do so, the raw data
has to be pre-processed with the required techniques. Ultimately, this project predicts
customer churn so that the company can come up with promotional offers for its customers
and plan its marketing strategies accordingly.
2 Data Report
This report covers data from an E-Commerce company. The data are analyzed and explored,
and the insights are described with the plots needed to visualize the data. These insights
support the prediction of customer churn.
2.1 Collection of Data

The data is collected from an E-Commerce company that sells electronics, grocery, fashion
and a few other categories online. This is the data source for this project. The maximum
customer tenure in the data is 61 months, and the minimum value of days since last order
is 0, which means the customer ordered on the most recent day.
2.2 Visual Inspection of Data
The data has 5630 observations of 20 variables, and some of the variables contain a few
null values. The independent variables include both numerical and categorical data, and
the Churn variable is the target variable. The table below describes the numerical data:
Variable name Count Mean Std Min 25% 50% 75% Max
Churn 5630.0 0.168384 0.374240 0.0 0.00 0.00 0.0000 1.00
Tenure 5366.0 10.189899 8.557241 0.0 2.00 9.00 16.0000 61.00
CityTier 5630.0 1.654707 0.915389 1.0 1.00 1.00 3.0000 3.00
WarehouseToHome 5379.0 15.639896 8.531475 5.0 9.00 14.00 20.0000 127.00
HourSpendOnApp 5375.0 2.931535 0.721926 0.0 2.00 3.00 3.0000 5.00
NumberOfDeviceRegistered 5630.0 3.688988 1.023999 1.0 3.00 4.00 4.0000 6.00
SatisfactionScore 5630.0 3.066785 1.380194 1.0 2.00 3.00 4.0000 5.00
NumberOfAddress 5630.0 4.214032 2.583586 1.0 2.00 3.00 6.0000 22.00
Complain 5630.0 0.284902 0.451408 0.0 0.00 0.00 1.0000 1.00
OrderAmountHikeFromlastYear 5365.0 15.707922 3.675485 11.0 13.00 15.00 18.0000 26.00
CouponUsed 5374.0 1.751023 1.894621 0.0 1.00 1.00 2.0000 16.00
OrderCount 5372.0 3.008004 2.939680 1.0 1.00 2.00 3.0000 16.00
DaySinceLastOrder 5323.0 4.543491 3.654433 0.0 2.00 3.00 7.0000 46.00
CashbackAmount 5630.0 177.223030 49.207036 0.0 145.77 163.28 196.3925 324.99
2.3 Understanding of Attributes

The data holds details of each customer’s transaction history. On general observation,
Tenure, Complain and DaySinceLastOrder are some of the important independent variables on
which the dependent variable usually depends in such cases. Under some of the variables,
the same entity appears under different names, so those entries have to be merged as
follows:
PreferredLoginDevice: Mobile Phone and Phone are the same entity, so both are merged as
Phone.
PreferredPaymentMode: Cash on Delivery and COD are the same entity, so both are merged as
COD. CC and Credit Card are also the same entity, so both are merged as Credit Card.
PreferedOrderCat: Mobile and Mobile Phone are the same entity, so both are merged as
Mobile.
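The merges listed above can be sketched as follows. This is a minimal illustration that assumes the data is held in a pandas DataFrame; the three toy rows are made up for the example, while the column names and labels are the ones named in the text.

```python
import pandas as pd

# Toy rows standing in for the raw data (illustrative values only).
df = pd.DataFrame({
    "PreferredLoginDevice": ["Mobile Phone", "Phone", "Computer"],
    "PreferredPaymentMode": ["Cash on Delivery", "CC", "UPI"],
    "PreferedOrderCat": ["Mobile Phone", "Mobile", "Grocery"],
})

# Map each duplicate label to the single name chosen in the report.
merge_map = {
    "PreferredLoginDevice": {"Mobile Phone": "Phone"},
    "PreferredPaymentMode": {"Cash on Delivery": "COD", "CC": "Credit Card"},
    "PreferedOrderCat": {"Mobile Phone": "Mobile"},
}
for column, mapping in merge_map.items():
    df[column] = df[column].replace(mapping)

print(df)
```

Keeping the mapping in one dictionary makes it easy to review which labels were merged into which.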
There are 15 continuous variables and 5 categorical variables in the raw data, and there
are no duplicates. Some skewness is present in all the continuous variables, as follows:
Variable name Skewness

Churn 1.772843
Tenure 0.736513
CityTier 0.735326
WarehouseToHome 1.619154
HourSpendOnApp -0.027213
NumberOfDeviceRegistered -0.396969
SatisfactionScore -0.142626
NumberOfAddress 1.088639
Complain 0.953347
OrderAmountHikeFromlastYear 0.790785
CouponUsed 2.545653
OrderCount 2.196414
DaySinceLastOrder 1.191000
CashbackAmount 1.149846
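A skewness table like the one above can be produced along these lines. This is only a sketch under the assumption that pandas is used; the report does not say which skewness estimator produced its figures, and the values below are invented to show a right-skewed and a symmetric column.

```python
import pandas as pd

# Illustrative values only: one column with a long right tail, one symmetric.
df = pd.DataFrame({
    "CouponUsed": [0, 1, 1, 1, 2, 2, 3, 16],
    "HourSpendOnApp": [2, 3, 3, 3, 3, 3, 3, 4],
}, dtype=float)

# DataFrame.skew() returns one sample-skewness value per numeric column.
skewness = df.skew()
print(skewness)
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.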
3 Exploratory Data Analysis

Exploratory data analysis is an important step in finding insights in the data. It can be
done with several methods, as follows:
3.1 Univariate Analysis

Univariate analysis is a graphical representation of how a single variable is distributed.
It is done separately on the numerical data and the categorical data. The plots below show
the distribution of the data within the respective variables.
Dependent Variable:
Insights
• The value 0 represents customers who have not churned and the value 1 represents
churned customers.
• Far fewer customers have churned than have not churned.
Independent Variables:
3.1.1 Numeric Variables:
Insights
• Most of the customers are from tier-1 cities, and tier-2 cities have the fewest customers.
• Many customers spend about 3 hours on the company’s app, and a few customers spend
5 hours, which is the maximum.
• Comparatively few complaints were raised in the last month.
• The data are widely distributed in Satisfaction score, Order amount hike from last
year, Days since last order and Cashback amount.
Tenure:
Insights
• The distribution of the variable is right-skewed.
• The average customer tenure is around 10 months.
• Most of the customers are new: their tenure is around 1 month, which means they
joined recently.
• The maximum customer tenure is 61 months.
3.1.2 Categorical Variables:
Insights
• Customers prefer to log in on a Phone rather than a Computer, possibly because they
find it easier to access.
• There are more male customers than female customers.
• We can see that frequency of post publishing increases daily from Monday, reaches
its maximum point on Wednesday and then gradually declines.
• The base time frequency is showing similar patterns, it is maximum on Thursdays
and then declining further.
• So, this may be an inference for business to think some other way to engage people
during weekends rather than Facebook promotions.
3.2 Bivariate Analysis

This method of analysis describes the relationship between the variables.
Description
• The darker the color, the higher the number of complaints.
• The bigger the size, the higher the satisfaction score.
Insights
• In the plot above, customers from tier-1 cities spend the most hours on the company’s
app, followed by customers from tier-3 cities.
• Customers from tier-1 cities have also raised the most complaints but given the
highest satisfaction scores, followed by tier-3 cities.
• Customers from tier-1 cities are the most active and customers from tier-2 cities are
the least active, as their number of complaints raised, satisfaction score and hours
spent on the app are very low.
• Customers who file more complaints tend to churn.
• Married customers spend more hours on the app and have raised more complaints and
given higher satisfaction scores; the same holds for male customers.
Description
• The darker the color, the higher the Order amount hike from last year.
• The bigger the size, the higher the number of Coupons used.
Insights
• Laptop and accessories is the most preferred order category, followed by Mobile
Phones, while Groceries and Others are the least preferred order categories.
• The most Coupons are also used on Laptop and accessories, followed by Mobile
Phones, and the fewest Coupons are used on Groceries and Others.
• Since Laptop and accessories is the most preferred order category, its Cashback
amount is the largest, followed by Mobile Phones.
Description
• The darker the color, the higher the Number of devices registered.
• The bigger the size, the higher the number of Days since last order.
Insights
• Customers with a higher Number of address tend to register more devices; these
customers also have more Days since last order and tend to pay by Debit card.
• Customers with fewer Days since last order tend to pay through UPI, and they have
fewer devices registered and a lower Number of address.
Description
• Orange indicates Mobile phone and blue indicates Computer as the Preferred login
device.
Insights
• Laptop and accessories has the highest values on almost all variables, such as Order
count, Satisfaction score, Complain, Order amount hike from last year and Cashback
amount, followed by Mobile Phones, while Groceries and Others have the lowest values.
Insights
• Married customers from tier-1 cities have the highest Number of address, Number of
device registered, Tenure, distance from Warehouse to home and Days since last order,
followed by customers from tier-3 cities.
Target variable vs Numerical variables:
Insights
• The hours customers spend on the app do not determine churn, and Order count also
shows little difference between churned and not churned customers.
• Customers who raised more complaints tend to churn, and customers with a short
Tenure churn a lot; on the other hand, the churned customers have a high Satisfaction
score.
• Customers with a comparatively low Cashback amount, a greater distance from
Warehouse to home and, surprisingly, a higher Satisfaction score and fewer Days since
last order tend to churn.
Target variable vs Categorical variables:
Insights
• Single customers churn the most, followed by divorced customers.
• Customers from tier-1 cities churn comparatively less, and they pay through Cash on
Delivery.
Correlation plot:
Insights
• Churn is most strongly correlated with Tenure and Complain and least correlated with
Coupon used.
• Churn is also moderately correlated with Days since last order, Cashback amount,
Satisfaction score and Number of device registered.
• Some independent variables are highly correlated with each other.
• Order count is highly correlated with Coupon used and Days since last order.
• Cashback amount has a moderate positive correlation with Tenure, Coupon used, Order
count and Days since last order, which means that the higher these variables are, the
higher the Cashback amount.
3.3 Removal of Unwanted Variables

The variable CouponUsed is well correlated with OrderCount and only weakly correlated
with the dependent variable, so CouponUsed can be dropped from the dataset. The
CustomerID variable also has to be dropped for outlier treatment and at the model
building stage.
3.4 Missing Value Treatment

Missing values (null values) are a common occurrence in a dataset. They can have a
significant effect and are a barrier to building a good model, so they have to be treated
accordingly. There are 1600 missing values in the dataset. The following table lists the
number of missing values for each variable:
Variable name Missing values

Tenure 264
WarehouseToHome 251
HourSpendOnApp 255
OrderAmountHikeFromlastYear 265
OrderCount 258
DaySinceLastOrder 307
These missing values are replaced with the median of the respective variable, as all the
null values occur in continuous variables.
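Median imputation as described above can be sketched as follows. The example assumes a pandas DataFrame and uses a few made-up rows; only the column names come from the report.

```python
import numpy as np
import pandas as pd

# Illustrative rows with gaps (NaN marks a missing value).
df = pd.DataFrame({
    "Tenure": [0.0, 2.0, np.nan, 9.0, 61.0],
    "OrderCount": [1.0, np.nan, 2.0, 3.0, 16.0],
})

# Count the missing values per column before treatment.
missing_before = df.isna().sum()

# Replace every missing value with the median of its own column.
df = df.fillna(df.median(numeric_only=True))

print(missing_before)
print(df)
```

The median is less sensitive to extreme values than the mean, which suits right-skewed variables such as the ones in this dataset.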
3.5 Outliers Treatment

Outliers are values that lie far from the remaining observations, and they can make a
significant difference at model building. These outliers therefore have to be treated to
build a better model.
The above plot clearly indicates that there are many outliers in the dataset and that
they occur in almost every continuous variable. Removing the outliers would cause a large
loss of data, so the outliers are imputed instead.
The outliers are treated by computing the Inter Quartile Range (IQR) and, from it, an
upper range value and a lower range value for every continuous variable. Observations
above the upper range are replaced with the upper range value, and observations below the
lower range are replaced with the lower range value. The plot below shows the data after
outlier treatment:
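The capping rule described above can be sketched as follows. The report does not state the multiplier applied to the IQR, so the conventional 1.5 used here is an assumption, and `cap_outliers_iqr` is a hypothetical helper name introduced for the example.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed default."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Values outside the range are replaced by the nearest range limit.
    return series.clip(lower=lower, upper=upper)

# Illustrative values; 127 plays the role of an extreme observation.
warehouse_to_home = pd.Series([5, 9, 14, 20, 127], dtype=float)
capped = cap_outliers_iqr(warehouse_to_home)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matches the stated aim of avoiding the data loss that removal would cause.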
3.6 Variable Transformation

The dataset does not require scaling or normalization, as many of the variables are
categorical and the continuous variables are of nearly the same magnitude.
The CityTier variable is stored as a numerical variable, but it should be converted to a
categorical variable because it describes the type of city.
One-hot encoding is not used because it creates many more variables. Instead, all the
categorical variables are label encoded, which means that each category is replaced with
a numeric code so that the variable can be used for model building. The table below shows
the first 5 observations of the dataset after label encoding:
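Label encoding of the categorical variables can be sketched as follows. This is one possible approach using pandas category codes; the report does not say which encoder was used, and the rows below are made up for illustration.

```python
import pandas as pd

# Illustrative rows; CityTier is treated as categorical, as the text suggests.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "CityTier": [1, 3, 2],
})

categorical_columns = ["Gender", "MaritalStatus", "CityTier"]
encoded = df.copy()
for column in categorical_columns:
    # Each distinct category gets an integer code, assigned in sorted order.
    encoded[column] = encoded[column].astype("category").cat.codes

print(encoded.head())
```

Unlike one-hot encoding, this keeps one column per variable, at the cost of implying an order between categories that may not exist.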
4 Business insights from EDA

Of all the analysis done on the dataset, the ideas and information that drive the business
are the insights.
4.1 Checking whether the data is balanced

In the given dataset, 4682 customers have not churned and 948 customers have churned, so
16.838% of the customers have churned. The SMOTE technique is therefore not considered
necessary.
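The churn share quoted above follows directly from the two class counts, as this short calculation shows. The imbalance ratio is an extra figure derived from the same counts, not one stated in the report.

```python
# Class counts taken from the report.
not_churned = 4682
churned = 948

total = not_churned + churned
churn_rate = churned / total
imbalance_ratio = not_churned / churned

print(f"Total customers: {total}")
print(f"Churned share: {churn_rate * 100:.3f}%")
print(f"Not churned per churned customer: {imbalance_ratio:.2f}")
```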
4.2 Clustering
Based on their behaviour, the customers are divided into 5 groups, as in the following table:
Counts and means per cluster:
Variable name                       cluster_0    cluster_1    cluster_2    cluster_3    cluster_4
CustomerID (Count)                  1313         1059         1913         646          699
Tenure (Mean)                       9.713632902  6.780925401  7.792472556  19.87925697  13.27753934
WarehouseToHome (Mean)              16.12947449  14.43862134  15.6589127   14.78328173  16.43347639
HourSpendOnApp (Mean)               3.135186596  2.617563739  2.948248824  2.925696594  3.009298999
NumberOfDeviceRegistered (Mean)     3.891850724  3.260623229  3.705697857  3.748452012  3.908440629
SatisfactionScore (Mean)            3.038080731  3.058545798  3.07997909   3.080495356  3.084406295
NumberOfAddress (Mean)              4.4843869    3.248347498  4.11134344   4.944272446  4.726752504
Complain (Mean)                     0.275704494  0.282341832  0.286983795  0.275541796  0.309012876
OrderAmountHikeFromlastYear (Mean)  15.81188119  15.37110482  15.75352849  15.37616099  15.91273247
OrderCount (Mean)                   2.753236862  1.634560907  2.349189754  3.383900929  3.097281831
DaySinceLastOrder (Mean)            5.17136329   2.551463645  4.029534762  6.723684211  4.852646638
CashbackAmount (Mean)               179.6992003  126.3645881  152.1284527  268.3160043  218.821731

Category counts per cluster (Unique concatenate with count):
Churn
cluster_0: 1(162), 0(1151)
cluster_1: 1(288), 0(771)
cluster_2: 1(360), 0(1553)
cluster_3: 1(32), 0(614)
cluster_4: 1(106), 0(593)
PreferredLoginDevice
cluster_0: Phone(927), Computer(386)
cluster_1: Phone(749), Computer(310)
cluster_2: Phone(1328), Computer(585)
cluster_3: Phone(496), Computer(150)
cluster_4: Phone(496), Computer(203)
CityTier
cluster_0: 3(601), 1(678), 2(34)
cluster_1: 1(825), 3(171), 2(63)
cluster_2: 3(513), 1(1311), 2(89)
cluster_3: 1(452), 3(156), 2(38)
cluster_4: 1(400), 3(281), 2(18)
PreferredPaymentMode
cluster_0: Credit Card(365), E wallet(216), Debit Card(537), COD(108), UPI(87)
cluster_1: UPI(86), Debit Card(454), Credit Card(352), COD(125), E wallet(42)
cluster_2: Debit Card(782), E wallet(177), Credit Card(617), COD(174), UPI(163)
cluster_3: COD(37), Debit Card(267), Credit Card(231), E wallet(62), UPI(49)
cluster_4: Credit Card(209), Debit Card(274), E wallet(117), UPI(29), COD(70)
Gender
cluster_0: Male(761), Female(552)
cluster_1: Male(674), Female(385)
cluster_2: Female(744), Male(1169)
cluster_3: Female(275), Male(371)
cluster_4: Female(290), Male(409)
PreferedOrderCat
cluster_0: Fashion(235), Laptop & Accessory(1021), Mobile(57)
cluster_1: Mobile(992), Laptop & Accessory(65), Grocery(2)
cluster_2: Laptop & Accessory(877), Mobile(1031), Fashion(5)
cluster_3: Others(264), Grocery(334), Fashion(48)
cluster_4: Fashion(538), Grocery(74), Laptop & Accessory(87)
MaritalStatus
cluster_0: Single(427), Divorced(187), Married(699)
cluster_1: Single(396), Divorced(148), Married(515)
cluster_2: Single(635), Divorced(277), Married(1001)
cluster_3: Divorced(124), Single(154), Married(368)
cluster_4: Divorced(112), Single(184), Married(403)
The following table groups the customers by Churn and shows how the churned and not
churned customers behave on every variable:
Counts and means per Churn group:
Variable name                       Churn_0      Churn_1
CustomerID (Count)                  4682         948
Tenure (Mean)                       11.38530543  3.859704641
WarehouseToHome (Mean)              15.26719351  16.85654008
HourSpendOnApp (Mean)               2.928662965  2.964135021
NumberOfDeviceRegistered (Mean)     3.650683469  3.916666667
SatisfactionScore (Mean)            3.001281504  3.390295359
NumberOfAddress (Mean)              4.15890645   4.450421941
Complain (Mean)                     0.234087997  0.535864979
OrderAmountHikeFromlastYear (Mean)  15.68272106  15.61708861
OrderCount (Mean)                   2.543571123  2.407172996
DaySinceLastOrder (Mean)            4.68058522   3.187236287
CashbackAmount (Mean)               178.5006413  159.6365032

Category counts per Churn group (Unique concatenate with count):
PreferredLoginDevice
Churn_0: Phone(3372), Computer(1310)
Churn_1: Phone(624), Computer(324)
CityTier
Churn_0: 3(1354), 1(3134), 2(194)
Churn_1: 3(368), 1(532), 2(48)
PreferredPaymentMode
Churn_0: E wallet(474), Debit Card(1958), COD(386), Credit Card(1522), UPI(342)
Churn_1: Debit Card(356), UPI(72), Credit Card(252), COD(128), E wallet(140)
Gender
Churn_0: Male(2784), Female(1898)
Churn_1: Female(348), Male(600)
PreferedOrderCat
Churn_0: Fashion(698), Laptop & Accessory(1840), Mobile(1510), Others(244), Grocery(390)
Churn_1: Laptop & Accessory(210), Mobile(570), Others(20), Fashion(128), Grocery(20)
MaritalStatus
Churn_0: Divorced(724), Married(2642), Single(1316)
Churn_1: Single(480), Divorced(124), Married(344)
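A per-group profile of this kind can be built with a group-by: means for the numerical variables and value counts for the categorical ones. The sketch below assumes pandas and uses a handful of invented rows; how the report's own table was generated is not stated.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "Churn": [0, 0, 0, 1, 1],
    "Tenure": [12.0, 9.0, 15.0, 1.0, 3.0],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

# Mean of a numerical variable within each Churn group.
numeric_profile = df.groupby("Churn")["Tenure"].mean()

# Count of each category within each Churn group.
category_profile = df.groupby("Churn")["Gender"].value_counts()

print(numeric_profile)
print(category_profile)
```

The same pattern applies to the cluster table if the grouping column holds the cluster label instead of Churn.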
4.3 Other Business Insights

• From all the above analysis, it is clear that customers with a short Tenure churn more.
• The majority of the customers are male, and in general the majority of the customers
are married.
• From the cluster table, Satisfaction score does not vary much between churned and not
churned customers.
• Most customers prefer to shop for Mobile Phones and Laptop and accessories, while
Groceries and Others sell the least.
• Tier-2 cities have the fewest customers, so the company does not have enough reach in
tier-2 cities.
5 Model Building and Interpretation

Once the insights are derived from Exploratory Data Analysis, the next step is to build
models for Churn prediction from the given dataset. The model results are then
interpreted to find the best suited model.
5.1 Building Various Models

Various models are built using various machine learning algorithms.
5.1.1 Data Split

Initially, the CustomerID variable is dropped and the processed dataset is split into the
following 2 subsets:
• Training data: This subset has 70% of the data and is used to build the models with
the machine learning algorithms.
• Testing data: This subset has the remaining 30% of the data and is used to test the
built models.
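The 70/30 split can be sketched as follows, assuming scikit-learn's `train_test_split`. The random seed and the synthetic rows are assumptions made for the example; the report does not say whether the split was seeded or stratified.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset (100 illustrative rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CustomerID": np.arange(100),
    "Tenure": rng.integers(0, 62, size=100),
    "Complain": rng.integers(0, 2, size=100),
    "Churn": rng.integers(0, 2, size=100),
})

# Drop the identifier and separate the features from the target.
X = df.drop(columns=["CustomerID", "Churn"])
y = df["Churn"]

# 70% of the rows for training, 30% for testing (seed chosen arbitrarily).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(len(X_train), len(X_test))
```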
5.1.2 Machine Learning Algorithms

The target variable is Churn, and since it is a binary variable, several classification
algorithms are considered for building the models. The algorithms are as follows:
1. Logistic Regression: This is the basic classification algorithm in machine learning.
It uses a regression technique to establish the relation between the independent
variables and the dependent variable.
2. Linear Discriminant Analysis: LDA uses linear combinations of the independent
variables to predict the class of the response variable for a given observation.
3. K-Nearest Neighbors: KNN works on feature similarity. It calculates the similarity
or distance of a test query from each point in the training subset.
4. Naive Bayes: It is based on the principle of conditional probability, where the
probability of an event is based on prior knowledge of the event. It assumes that
the input features are independent of each other.
5. Support Vector Machine: The principle of SVM is to find a hyperplane that can
classify the training data points into labeled categories. SVM takes the training
data as input and uses these training points to predict the class of a test point.
6. Artificial Neural Network: ANN works with the number of neurons and the number of
hidden layers assigned. It learns the weights of the independent variables across
the assigned neurons and hidden layers.
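The six algorithms listed above can be fitted side by side as in the sketch below. It assumes scikit-learn; the synthetic data, the hyperparameters (for example `max_iter` and the single 16-neuron hidden layer) and the seeds are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic imbalanced binary data standing in for the churn dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.83, 0.17], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# One estimator per algorithm named in the list above (settings are assumed).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Artificial Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=500, random_state=1
    ),
}

# Fit each model on the training data and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, accuracy in test_accuracy.items():
    print(f"{name}: {accuracy:.4f}")
```

Holding the estimators in a dictionary keeps the fit and evaluate loop identical for every algorithm, which makes the later comparison straightforward.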
5.2 Performance Metrics

The predictive models are built from the training data by applying the various machine
learning techniques. These models are then tested against the testing data, and their
performance is measured with several metrics.
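The metrics reported in the following subsections are all derived from the confusion matrix. As a small worked example, the function below computes them from the four cell counts. The counts passed in are hypothetical, since the confusion matrices themselves are not reproduced in this text, and `classification_metrics` is a helper name introduced for the example.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive (churned) class."""
    # Guard against empty denominators, e.g. a model that never predicts churn.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 80 actual churners, of whom 60 are caught.
metrics = classification_metrics(tp=60, fp=40, fn=20, tn=380)
print(metrics)

# A model that predicts "not churned" for everyone can still look accurate.
all_negative = classification_metrics(tp=0, fp=0, fn=80, tn=420)
print(all_negative)
```

The second call shows why accuracy alone can mislead on imbalanced data: accuracy stays high while precision, recall and F1 for the churned class are zero.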
5.2.1 Logistic Regression:

1. Accuracy of train data: 0.7749
2. Accuracy of test data: 0.7655
3. Confusion matrix on train data:
4. Confusion matrix on test data:
5. Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.95       0.77    0.85      3269
1                   0.42       0.82    0.55      672
6. Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.95       0.76    0.84      1413
1                   0.39       0.81    0.53      276
7. AUC on train data:
8. AUC on test data:
5.2.2 Linear Discriminant Analysis:

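Accuracy of train data: 0.8751
Accuracy of test data: 0.8815
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.89       0.97    0.93      3269
1                   0.73       0.42    0.53      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.89       0.98    0.93      1413
1                   0.76       0.40    0.53      276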
5.2.3 K-Nearest Neighbours:

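Accuracy of train data: 0.9023
Accuracy of test data: 0.8644
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.92       0.97    0.94      3269
1                   0.80       0.56    0.66      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.90       0.95    0.92      1413
1                   0.62       0.43    0.51      276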
5.2.4 Naive Bayes:

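Accuracy of train data: 0.8513
Accuracy of test data: 0.8703
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.91       0.91    0.91      3269
1                   0.56       0.57    0.57      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.93       0.92    0.92      1413
1                   0.60       0.62    0.61      276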
5.2.5 Support Vector Machine:
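Accuracy of train data: 0.8756
Accuracy of test data: 0.8833
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.89       0.97    0.93      3269
1                   0.75       0.40    0.53      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.89       0.98    0.93      1413
1                   0.77       0.41    0.53      276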
5.2.6 Artificial Neural Network:

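Accuracy of train data: 0.9373
Accuracy of test data: 0.9153
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.86       0.98    0.92      3269
1                   0.70       0.26    0.38      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.88       0.98    0.92      1413
1                   0.72       0.29    0.42      276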
5.3 Interpretation of Models:

Once the models are built using the training data, their performance is determined by
using the testing data to predict the target variable.
1. Logistic Regression:
This model has an accuracy of 0.7749 on train data and 0.7655 on test data, which is a
decent performance with no underfitting or overfitting. The model has good Recall, but
Precision and F1 score are very low on churned customers for both train and test data.
2. Linear Discriminant Analysis:
This model has an accuracy of 0.8751 on train data and 0.8815 on test data, which is a
good performance. The model has good Precision, but Recall and F1 score are low on
churned customers for both train and test data.
3. K-Nearest Neighbors:
The model’s accuracy is 0.9023 on train data and 0.8644 on test data, which is a good
performance overall, but Precision, Recall and F1 score are comparatively low on churned
customers for both train and test data.
4. Naive Bayes:
The model’s accuracy is 0.8513 on train data and 0.8703 on test data, which is a good
performance overall, but Precision, Recall and F1 score are very low on churned
customers for both train and test data.
5. Support Vector Machine:
This model performs poorly even though its accuracy is 0.8294 on train data and 0.8365
on test data. Precision, Recall and F1 score are zero on churned customers for both train
and test data.
6. Artificial Neural Network:
This model has an accuracy of 0.9373 on train data and 0.9153 on test data, which is a
good performance. The model has decent Precision, but Recall and F1 score are poor on
churned customers for both train and test data.
6 Model Tuning
The models built so far can be fine-tuned to improve their performance, and they can be
validated using several techniques.
6.1 Ensemble Modeling

An ensemble technique builds many models and combines them to produce one good model.
The following ensemble techniques are used.
6.1.1 Grid Search CV and Random Forest

Grid search is a process of hyperparameter tuning that determines the optimal values for
a given model. This is significant because the performance of the entire model depends on
the hyperparameter values specified.
A random forest is an ensemble of decision trees, usually trained with the “bagging”
method. The general idea of the bagging method is that a combination of learning models
improves the overall result.
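Combining the two ideas, a grid search over random forest hyperparameters can be sketched as follows. It assumes scikit-learn's `GridSearchCV` and `RandomForestClassifier`; the parameter grid, the number of folds, the seeds and the synthetic data are all illustrative assumptions, as the grid actually searched in this project is not given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced binary data standing in for the churn dataset.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.83, 0.17], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,                 # 3-fold cross-validation inside the search (assumed)
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 4))
```

After fitting, `search` behaves like the best estimator refitted on the whole training set, so it can be scored on the test data directly.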
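Accuracy of train data: 0.9358
Accuracy of test data: 0.9171
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.94       0.98    0.96      3269
1                   0.90       0.70    0.79      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.93       0.97    0.95      1413
1                   0.81       0.65    0.72      276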
6.1.2 Ada Boost

Ada Boost aims to convert a set of weak classifiers into a strong one.
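Accuracy of train data: 0.9015
Accuracy of test data: 0.8910
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.92       0.96    0.94      3269
1                   0.77       0.61    0.68      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.92       0.95    0.94      1413
1                   0.70       0.58    0.63      276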
6.1.3 XG Boost
XG Boost is an implementation of gradient boosted decision trees designed for speed and
performance.
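Accuracy of train data: 0.9302
Accuracy of test data: 0.9100
Classification report on train data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.93       0.99    0.96      3269
1                   0.93       0.64    0.76      672
Classification report on test data:
Dependent variable  Precision  Recall  F1-score  Support
0                   0.92       0.98    0.95      1413
1                   0.82       0.58    0.68      276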
6.2 Model Tuning Measures

Cross validation is an important technique for verifying the built models. Here the data
is split into 10 folds, and each model is evaluated on every fold.
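A 10-fold cross-validation run can be sketched as follows, assuming scikit-learn's `cross_val_score`. The estimator, the accuracy scoring and the synthetic data are assumptions made for illustration; the same call applies to any of the models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for one of the report's subsets.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

model = LogisticRegression(max_iter=1000)

# cv=10 splits the data into 10 folds; each fold is held out once for scoring.
fold_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

print(fold_scores.round(4))
print("Mean:", fold_scores.mean().round(4), "Std:", fold_scores.std().round(4))
```

Scores that stay close to each other across the folds suggest that the model's performance does not depend heavily on which rows it was trained on.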
6.2.1 Logistic Regression

1. Train data:
0.7645 0.7538 0.7741 0.8071 0.7639
0.7157 0.7741 0.7487 0.8147 0.7944
2. Test data:
0.8047 0.7514 0.7692 0.7869 0.8284
0.8106 0.7633 0.7455 0.8106 0.7619
6.2.2 Linear Discriminant Analysis

1. Train data:
0.8506 0.8883 0.8578 0.8781 0.8502
0.8578 0.8680 0.8857 0.8857 0.8832
2. Test data:
0.8698 0.8698 0.8816 0.8934 0.8757
0.8934 0.8698 0.8698 0.8639 0.8809
6.2.3 K-Nearest Neighbour

1. Train data:
0.8683 0.8629 0.8426 0.8604 0.8502
0.8527 0.8629 0.8730 0.8629 0.8705
2. Test data:
0.8875 0.8461 0.8639 0.8224 0.8639
0.8875 0.8402 0.7988 0.8284 0.8392
6.2.4 Naive Bayes

1. Train data:
0.8075 0.8502 0.8502 0.8730 0.8477
0.8147 0.8553 0.8477 0.8883 0.8654
2. Test data:
0.8461 0.8461 0.8639 0.8698 0.8934
0.8520 0.8343 0.8402 0.8698 0.8392
6.2.5 Support Vector Machine
1. Train data:
0.8278 0.8299 0.8299 0.8299 0.8299
0.8299 0.8299 0.8299 0.8299 0.8274
2. Test data:
0.8402 0.8402 0.8402 0.8343 0.8343
0.8343 0.8343 0.8343 0.8343 0.8392
6.2.6 Artificial Neural Network

1. Train data:
0.8531 0.8527 0.8477 0.8553 0.8401
0.8274 0.8934 0.8680 0.8502 0.8527
2. Test data:
0.8757 0.8698 0.8520 0.8461 0.8402
0.8343 0.8343 0.8402 0.8402 0.8511
6.2.7 Random Forest

1. Train data:
0.8962 0.9111 0.9111 0.9162 0.8908
0.8883 0.9213 0.9137 0.9162 0.9035
2. Test data:
0.9112 0.8875 0.8875 0.8579 0.9289
0.9171 0.8934 0.8816 0.9053 0.9226
6.2.8 Ada Boost

1. Train data:
0.8860 0.8984 0.8934 0.9010 0.8832
0.8857 0.9035 0.9137 0.9035 0.8908
2. Test data:
0.8934 0.9053 0.8875 0.8875 0.8934
0.8994 0.8875 0.8875 0.9171 0.8869
6.2.9 XG Boost
1. Train data:
0.9012 0.9035 0.9060 0.9238 0.8807
0.8705 0.9213 0.9086 0.8934 0.8959
2. Test data:
0.9112 0.9230 0.8816 0.8639 0.9171
0.9171 0.8994 0.8816 0.9112 0.8988
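
To compare models on these fold scores, it helps to reduce each list of ten scores to a mean
and a spread. The sketch below does this for the Random Forest scores copied from section
6.2.7. The `summarize` helper is illustrative only and is not part of the project code.

```python
from statistics import mean, stdev

# Fold scores for Random Forest copied from section 6.2.7.
random_forest_scores = {
    "train": [0.8962, 0.9111, 0.9111, 0.9162, 0.8908,
              0.8883, 0.9213, 0.9137, 0.9162, 0.9035],
    "test": [0.9112, 0.8875, 0.8875, 0.8579, 0.9289,
             0.9171, 0.8934, 0.8816, 0.9053, 0.9226],
}


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of fold scores."""
    return mean(scores), stdev(scores)


summary = {split: summarize(scores) for split, scores in random_forest_scores.items()}
for split, (avg, spread) in summary.items():
    print(f"{split}: mean={avg:.4f}, std={spread:.4f}")
```

The mean fold score is about 0.907 on train data and 0.899 on test data. The same helper can
be applied to the score lists of the other models.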
6.3 Interpretation of Ensemble Model

Once the models are built on the training data, their performance is determined by using
the testing data to predict the target variable.
1. Grid Search CV and Random Forest:
The model reaches an accuracy of 0.9358 on train data and 0.9171 on test data, which is a
good performance with no sign of underfitting or overfitting. The model also has good
Precision, but Recall and F1 score are comparatively low on Churned customers in both
train and test data.
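2. ADA Boost:
The model's accuracy is about 0.90 on train data and 0.89 on test data (rounded values from
the comparison table in section 6.4), which is overall a good performance. Here, the model
has comparatively low Precision, and Recall and F1 score are low on Churned customers in
both train and test data.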
3. XG Boost:
The model's accuracy is 0.9302 on train data and 0.9100 on test data, which is overall a
good performance. The model also has good Precision, but Recall and F1 score are low on
Churned customers in both train and test data.
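
The underfit/overfit judgement above rests on the gap between train and test accuracy. A
minimal sketch of that check follows, using the accuracies quoted in this section. The 0.05
threshold is an assumption chosen for illustration, not a value from the project.

```python
# Train and test accuracies quoted in this section.
accuracies = {
    "Random Forest (Grid Search CV)": (0.9358, 0.9171),
    "XG Boost": (0.9302, 0.9100),
}

# Assumed tolerance: a train-test gap above this is flagged as possible overfitting.
MAX_GAP = 0.05


def accuracy_gap(train_acc: float, test_acc: float) -> float:
    """Difference between train and test accuracy, rounded to 4 decimals."""
    return round(train_acc - test_acc, 4)


gaps = {name: accuracy_gap(train, test) for name, (train, test) in accuracies.items()}
flagged = [name for name, gap in gaps.items() if gap > MAX_GAP]
print(gaps, flagged)
```

Both gaps are about 0.02, so neither model is flagged under this assumed threshold.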
6.4 Interpretation of Optimum Model
Model                Accuracy    Precision   Precision   Recall      Recall      F1-score    F1-score    AUC value
                                 Not churned Churned     Not churned Churned     Not churned Churned
                     Train Test  Train Test  Train Test  Train Test  Train Test  Train Test  Train Test  Train Test
Logistic Regression  0.77  0.76  0.95  0.95  0.42  0.39  0.77  0.76  0.82  0.81  0.85  0.84  0.55  0.53  0.86  0.86
LDA                  0.87  0.88  0.89  0.89  0.73  0.76  0.97  0.98  0.42  0.40  0.93  0.93  0.53  0.53  0.85  0.86
KNN                  0.90  0.86  0.92  0.90  0.80  0.62  0.97  0.95  0.56  0.43  0.94  0.92  0.66  0.51  0.95  0.85
Naive Bayes          0.85  0.87  0.91  0.93  0.56  0.60  0.91  0.92  0.57  0.62  0.91  0.92  0.57  0.61  0.82  0.84
SVM                  0.87  0.88  0.89  0.89  0.75  0.77  0.97  0.98  0.40  0.41  0.93  0.93  0.53  0.53  0.86  0.87
ANN                  0.93  0.91  0.86  0.88  0.70  0.72  0.98  0.98  0.26  0.29  0.92  0.92  0.38  0.42  0.83  0.83
Random Forest        0.93  0.91  0.94  0.93  0.90  0.81  0.98  0.97  0.70  0.65  0.96  0.95  0.79  0.72  0.98  0.94
Ada Boost            0.90  0.89  0.92  0.92  0.77  0.70  0.96  0.95  0.61  0.58  0.94  0.94  0.68  0.63  0.93  0.91
XG Boost             0.93  0.91  0.93  0.92  0.93  0.82  0.99  0.98  0.64  0.58  0.96  0.95  0.76  0.68  0.94  0.91
Note: The Out of Bag score of Random forest is 0.9063

From the above table the models are compared and interpreted as follows:
• Random Forest using Grid Search CV turned out to be the best model overall: it has the
joint-highest accuracy and the highest F1-score on Churned customers and AUC value on both
train and test data, although XG Boost has slightly higher precision and Logistic
Regression higher recall on Churned customers.
• Generally, Ensemble techniques perform better than the other classification techniques.
• Apart from Ensemble techniques, SVM and LDA perform well but with low recall on Churned
customers.
• After Random Forest, XG Boost is the next best model.
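
One way to make the model comparison in section 6.4 concrete is to rank the models by a
single metric. The sketch below ranks them by test F1-score on Churned customers, with the
values copied from the table. Choosing this one metric is an assumption made for
illustration; the report weighs several metrics together.

```python
# Test-set F1-score on Churned customers, copied from the table in section 6.4.
churned_test_f1 = {
    "Logistic Regression": 0.53,
    "LDA": 0.53,
    "KNN": 0.51,
    "Naive Bayes": 0.61,
    "SVM": 0.53,
    "ANN": 0.42,
    "Random Forest": 0.72,
    "Ada Boost": 0.63,
    "XG Boost": 0.68,
}

# Sort model names from highest to lowest F1-score (ties keep table order).
ranking = sorted(churned_test_f1, key=churned_test_f1.get, reverse=True)

for position, model in enumerate(ranking[:3], start=1):
    print(position, model, churned_test_f1[model])
```

On this metric Random Forest ranks first and XG Boost second, in line with the
interpretation above.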
6.5 Implication on the Business
[Figure: variable importance plots for Logistic Regression, LDA, Random Forest, Ada Boost
and XG Boost]
• In the Logistic Regression and LDA models, Complain is the most important variable,
followed by Number of Device Registered; Cashback is the least important.
• In Random Forest and XG Boost, Tenure is the most important variable, followed by
Complain.
6.5.1 Overall Observations
• Tenure and Complain are the most important factors for Churn, whereas Order Amount Hike
from Last Year and Gender appear to be the least important factors in model building.
• Other variables such as City Tier, Days since Last Order and Satisfaction Score are also
important factors in model building.
• Random Forest using Grid Search CV performs best among all the models, so it is a good
choice to implement, but it takes a long time to build. If build time is a concern, the
next best model to implement is XG Boost.
• The company should pay attention to the important factors above to reduce the Churn
rate. The model built predicts customer Churn from the given features.
7 Appendix
• Tools used: Python, Tableau, Knime
• Python code file, Tableau public link and Knime file are attached here for reference:
Final Project_Karthiheswar.ipynb
https://public.tableau.com/profile/karthiheswar#!/vizhome/Projectnote-1/ChurnvsCitytier
Project Note-1.knwf