Customer Churn - E-Commerce: Capstone Project Report
Submitted by:
Karthiheswar M
Mentor guidance: Abhay Poddar
Group - 2 1
Project Notes - I
Table of Contents
1 Introduction
1.1 Problem Statement
1.2 Need of the Project study
1.3 Business/Social Opportunity
2 Data Report
2.1 Collection of Data
2.2 Visual Inspection of Data
2.3 Understanding of Attributes
3 Exploratory Data Analysis
3.1 Univariate Analysis
3.2 Bivariate Analysis
3.3 Removal of Unwanted Variables
3.4 Missing Value Treatment
3.5 Outliers Treatment
3.6 Variable Transformation
4 Business insights from EDA
4.1 Checking whether the data is balanced
4.2 Clustering
4.3 Other Business Insights
5 Model Building and Interpretation
5.1 Building Various Models
5.2 Performance Metrics
5.3 Interpretation of Models
6 Model Tuning
6.1 Ensemble Modeling
6.2 Model Tuning Measures
6.3 Interpretation of Ensemble Model
6.4 Interpretation of Optimum Model
6.5 Implication on the Business
7 Appendix
1 Introduction
E-Commerce (Electronic Commerce) is the buying and selling of goods, products and online
services over the internet. It also includes the sending and receiving of funds, inventory
management and internet marketing. Business-to-Consumer (B2C) and Business-to-Business
(B2B) are two of the important types of business transaction that can occur. E-Commerce is
one of the most popular businesses across many industries, such as electronics, fashion,
grocery, furniture, medical supplies and food.
2 Data Report
This report covers data from an E-Commerce company. The data are analyzed and explored,
and the insights are described with the plots needed to visualize them. These insights
support the prediction of the customer Churn rate.
categorical data where the Churn variable is considered as the target variable. The table
below describes the numerical data with a few key statistics:
Variable name Count Mean Std Min 25% 50% 75% Max
Churn 5630.0 0.168384 0.374240 0.0 0.00 0.00 0.0000 1.00
There are 15 continuous variables and 5 categorical variables in the raw data. There are
no duplicates in the data. Skewness is also present in the continuous variables, as
follows:
Tenure 0.736513
CityTier 0.735326
WarehouseToHome 1.619154
HourSpendOnApp -0.027213
NumberOfDeviceRegistered -0.396969
SatisfactionScore -0.142626
NumberOfAddress 1.088639
Complain 0.953347
OrderAmountHikeFromlastYear 0.790785
CouponUsed 2.545653
OrderCount 2.196414
DaySinceLastOrder 1.191000
CashbackAmount 1.149846
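As a rough illustration of how such skewness values can be obtained, the sketch below computes the bias-corrected (adjusted Fisher-Pearson) sample skewness using only the Python standard library. The function name and the sample values are hypothetical and are not taken from the project notebook; positive results indicate a right-skewed variable and negative results a left-skewed one.

```python
import math


def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) sample skewness; needs n >= 3."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    # Second and third central moments (population form)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # a constant column has no skew
    g1 = m3 / m2 ** 1.5
    # Small-sample correction factor
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical values: a long right tail versus a symmetric sample
right_tailed = [1, 1, 1, 2, 3, 5, 8, 13, 30, 61]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed))  # positive: right-skewed
print(sample_skewness(symmetric))     # zero: symmetric
```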
Insights
The value 0 represents customers who have not churned and the value 1 represents
churned customers.
Churned customers are far fewer than customers who have not churned.
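The size of this imbalance can be worked out from the summary statistics reported for Churn (count 5630, mean 0.168384). Because Churn is a 0/1 flag, its mean is the churned fraction. The short calculation below is only an arithmetic check; the variable names are illustrative.

```python
# Values taken from the summary table for the Churn variable
total_customers = 5630
churn_mean = 0.168384  # mean of a 0/1 flag = fraction of churned customers

churned = round(total_customers * churn_mean)
not_churned = total_customers - churned
ratio = not_churned / churned

print(f"churned={churned}, not churned={not_churned}, ratio={ratio:.2f} to 1")
```

This gives 948 churned and 4682 retained customers, roughly five retained customers for every churned one.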
Independent Variables:
Insights
Most of the customers are from tier-1 cities, while tier-2 cities have the fewest customers.
Many customers spend about 3 hours on the company’s app, and a few customers spend
5 hours, which is the maximum observed.
Comparatively few complaints were raised in the last month.
The data are widely distributed in Satisfaction score, Order amount hike from last
year, Days since last order and Cashback amount.
Tenure:
Insights
The distribution of the variable is right-skewed.
The average tenure of a customer is around 10 months.
Most of the customers are new: their tenure is around 1 month, which means they joined
recently.
The maximum tenure of any customer is 61 months.
3.1.2 Categorical Variables:
Insights
Customers prefer to log in by Phone rather than Computer, possibly because it is easier to
access.
There are more male customers than female customers.
We can see that the frequency of post publishing increases daily from Monday, reaches
its maximum on Wednesday and then gradually declines.
The base time frequency shows a similar pattern: it peaks on Thursdays and then
declines.
So, this may suggest that the business should find other ways to engage people
during weekends, rather than Facebook promotions.
Description
The darker the color, the higher the number of complaints.
The bigger the size, the higher the satisfaction score.
Insights
In the plot above, customers from tier-1 cities spend the most hours on the company’s app,
followed by customers from tier-3 cities.
Customers from tier-1 cities have also raised the most complaints but given the highest
satisfaction scores, followed by customers from tier-3 cities.
Customers from tier-1 cities are the most active, whereas customers from tier-2 cities
are the least active: their complaints raised, satisfaction scores and hours spent on
the app are all very low.
Customers who file more complaints tend to churn.
Married customers spend more hours on the app, raise more complaints and give higher
satisfaction scores; the same holds for male customers.
Description
The darker the color, the higher the Order amount hike from last year.
The bigger the size, the greater the number of Coupons used.
Insights
Laptop and accessories is the most Preferred order category, followed by Mobile
Phones, while groceries and others are the least Preferred order categories.
The most Coupons are also used on Laptop and accessories, followed by Mobile
Phones, while the fewest Coupons are used on groceries and others.
Since Laptop and accessories is the most Preferred order category, its Cashback
amount is the largest, followed by Mobile Phones.
Description
The darker the color, the higher the Number of devices registered.
The bigger the size, the greater the number of Days since last order.
Insights
Customers with a higher Number of address tend to register more devices; these
customers also have a higher number of Days since last order and tend to pay by
Debit card.
Customers with fewer Days since last order tend to pay through UPI; they also have
fewer devices registered and a lower Number of address.
Description
Orange indicates Mobile phone and blue indicates Computer as the Preferred login
device.
Insights
Laptop and accessories has the highest values on almost all variables, such as Order
count, Satisfaction score, Complain, Order amount hike from last year and Cashback
amount, followed by Mobile Phones, while groceries and others have the lowest values.
Insights
Married customers from tier-1 cities have the highest Number of address, Number of
device registered, Tenure, distance from Warehouse to home and Days since last order,
followed by customers from tier-3 cities.
Target variable vs Numerical variables:
Insights
The hours customers spend on the app do not determine the Churn rate, and Order count
also shows little difference in Churn rate.
Customers who raise more complaints tend to churn, and customers with short Tenure
churn a lot; on the other hand, the churned customers have high Satisfaction
scores.
Customers with a comparatively low Cashback amount, a greater distance from
Warehouse to home and, surprisingly, a higher Satisfaction score and fewer Days
since last order tend to Churn.
Target variable vs Categorical variables:
Insights
Single customers churn the most, followed by divorced customers.
Customers from tier-1 cities churn comparatively less, and they pay through
Cash on Delivery.
Correlation plot:
Insights
The Churn variable is most strongly correlated with the Tenure and Complain variables
and least correlated with Coupon used.
Churn is also moderately correlated with Days since last order, Cashback amount,
Satisfaction score and Number of device registered.
Some independent variables are highly correlated with each other.
Order count is highly correlated with Coupon used and Days since last order.
Cashback amount has a moderate positive correlation with Tenure, Coupon used, Order
count and Days since last order: the higher these variables, the higher the
Cashback amount.
WarehouseToHome 251
HourSpendOnApp 255
OrderAmountHikeFromlastYear 265
OrderCount 258
DaySinceLastOrder 307
These missing values are imputed with the median of each variable, as all null values
occur in continuous variables.
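Median imputation as described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's own code: the function name is hypothetical, missing entries are represented as `None`, and the sample column is made up.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing values
hours_on_app = [3, None, 2, 4, None, 3, 1]
print(impute_median(hours_on_app))
```

The median is less sensitive than the mean to the outliers and skewness noted earlier, which is a common reason for preferring it on continuous variables.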
The plot above clearly indicates that the dataset contains many outliers and that they
occur in almost every continuous variable. Removing the outliers would cause a large loss
of data, so they are capped instead.
The outliers are treated using the Inter Quartile Range (IQR): an upper range value and a
lower range value are computed for every continuous variable. Observations above the upper
range are replaced with the upper range value, and observations below the lower range are
replaced with the lower range value. The plot below shows the data after outlier treatment:
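The capping step can be illustrated as follows. The report does not state how the upper and lower range values are derived from the IQR, so this sketch assumes the common convention of Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; the function name, the multiplier `k` and the sample data are all assumptions made for the example.

```python
from statistics import quantiles


def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, lower, upper


# Hypothetical distances with one extreme value at the end
distances = [5, 7, 8, 9, 10, 11, 12, 14, 15, 127]
capped, lower, upper = cap_outliers_iqr(distances)
print(lower, upper)
print(capped)  # only the extreme value is replaced by the upper range value
```

Capping keeps every row in the dataset, which matches the stated goal of avoiding the data loss that removing outliers would cause.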
made into numerical variables for model building. The table below shows the top 5
observations of the dataset after label encoding:
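As an illustration of the label encoding step itself (not the report's own code), the sketch below maps each category to an integer code, assigning codes in alphabetical order. The function name and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical categorical column
gender = ["Male", "Female", "Male", "Female", "Male"]
codes, mapping = label_encode(gender)
print(mapping)  # {'Female': 0, 'Male': 1}
print(codes)    # [1, 0, 1, 0, 1]
```

Note that label encoding imposes an arbitrary numeric order on the categories, which tree-based models tolerate better than distance-based or linear models.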
4.2 Clustering
The customers are divided into 5 groups based on their behaviour, as shown in the following table:
Variable (aggregation) | Group 1 | Group 2 | Group 3 | Group 4 | Group 5
Churn (Unique concatenate with count) | 1(162), 0(1151) | 1(288), 0(771) | 1(360), 0(1553) | 1(32), 0(614) | 1(106), 0(593)
PreferredLoginDevice (Unique concatenate with count) | Phone(927), Computer(386) | Phone(749), Computer(310) | Phone(1328), Computer(585) | Phone(496), Computer(150) | Phone(496), Computer(203)
CityTier (Unique concatenate with count) | 3(601), 1(678), 2(34) | 1(825), 3(171), 2(63) | 3(513), 1(1311), 2(89) | 1(452), 3(156), 2(38) | 1(400), 3(281), 2(18)
WarehouseToHome (Mean) | 16.12947449 | 14.43862134 | 15.6589127 | 14.78328173 | 16.43347639
PreferredPaymentMode (Unique concatenate with count) | Credit Card(365), E wallet(216), Debit Card(537), COD(108), UPI(87) | UPI(86), Debit Card(454), Credit Card(352), COD(125), E wallet(42) | Debit Card(782), E wallet(177), Credit Card(617), COD(174), UPI(163) | COD(37), Debit Card(267), Credit Card(231), E wallet(62), UPI(49) | Credit Card(209), Debit Card(274), E wallet(117), UPI(29), COD(70)
HourSpendOnApp (Mean) | 3.135186596 | 2.617563739 | 2.948248824 | 2.925696594 | 3.009298999
NumberOfDeviceRegistered (Mean) | 3.891850724 | 3.260623229 | 3.705697857 | 3.748452012 | 3.908440629
NumberOfAddress (Mean) | 4.4843869 | 3.248347498 | 4.11134344 | 4.944272446 | 4.726752504
OrderAmountHikeFromlastYear (Mean) | 15.81188119 | 15.37110482 | 15.75352849 | 15.37616099 | 15.91273247
DaySinceLastOrder (Mean) | 5.17136329 | 2.551463645 | 4.029534762 | 6.723684211 | 4.852646638
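Per-group summaries like the ones in the table above (value counts for categorical variables, means for numerical ones) can be produced with a simple group-by. The sketch below is a standard-library illustration under assumed names: the `cluster` key, the helper function and the six sample rows are hypothetical and do not come from the project data.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarise_groups(rows, group_key, categorical, numeric):
    """Per group: 'value(count)' list for one categorical and mean of one numeric column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)

    summary = {}
    for group, members in sorted(grouped.items()):
        counts = Counter(m[categorical] for m in members)
        summary[group] = {
            categorical: ", ".join(f"{value}({n})" for value, n in counts.most_common()),
            numeric: mean(m[numeric] for m in members),
        }
    return summary


# Hypothetical rows with an assumed 'cluster' assignment
rows = [
    {"cluster": 1, "Churn": 0, "HourSpendOnApp": 3},
    {"cluster": 1, "Churn": 0, "HourSpendOnApp": 4},
    {"cluster": 1, "Churn": 1, "HourSpendOnApp": 2},
    {"cluster": 2, "Churn": 1, "HourSpendOnApp": 2},
    {"cluster": 2, "Churn": 0, "HourSpendOnApp": 3},
    {"cluster": 2, "Churn": 1, "HourSpendOnApp": 1},
]
summary = summarise_groups(rows, "cluster", "Churn", "HourSpendOnApp")
for group, stats in summary.items():
    print(group, stats)
```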
The following table groups the customers by Churn and shows their behaviour on every
variable:
Variable | Churn = 0 | Churn = 1
Tenure (Mean) | 11.38530543 | 3.859704641
CityTier (Unique concatenate with count) | 3(1354), 1(3134), 2(194) | 3(368), 1(532), 2(48)
0.7749
2. Accuracy of test data:
0.7655
3. Confusion matrix on train data:
6. Classification report on test data:
Dependent variable Precision Recall F1-score Support
0 0.95 0.76 0.84 1413
1 0.39 0.81 0.53 276
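The F1-score column in this report can be checked from the precision and recall columns, since F1 is their harmonic mean: F1 = 2 × P × R / (P + R). The short check below uses the rounded values from the report above; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, taken from the classification report on test data
report = {0: (0.95, 0.76), 1: (0.39, 0.81)}

for label, (p, r) in report.items():
    print(label, round(f1_score(p, r), 2))  # 0 -> 0.84, 1 -> 0.53
```

For the churned class, the high recall (0.81) is offset by low precision (0.39), which is why its F1-score is only 0.53 even though most churned customers are detected.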
0.8751
2. Accuracy of test data:
0.8815
3. Confusion matrix on train data:
0.9023
2. Accuracy of test data:
0.8644
3. Confusion matrix on train data:
4. Confusion matrix on test data:
0.8513
2. Accuracy of test data:
0.8703
3. Confusion matrix on train data:
5. Classification report on train data:
Dependent variable Precision Recall F1-score Support
0 0.91 0.91 0.91 3269
1 0.56 0.57 0.57 672
5.2.5 Support Vector Machine:
1. Accuracy of train data:
0.8756
2. Accuracy of test data:
0.8833
3. Confusion matrix on train data:
7. AUC on train data:
0.9373
2. Accuracy of test data:
0.9153
3. Confusion matrix on train data:
4. Confusion matrix on test data:
6. Artificial Neural Network:
This model has an accuracy of 0.9373 on train data and 0.9153 on test data, which indicates
good overall performance. The model has decent Precision, but the Recall and F1 score are
poor for Churned customers on both train and test data.
6 Model Tuning
However, the models that have been built can be fine-tuned to improve their performance,
and they can be validated using some techniques.
0.9358
2. Accuracy of test data:
0.9171
3. Confusion matrix on train data:
4. Confusion matrix on test data:
0.9015
2. Accuracy of test data:
0.8910
3. Confusion matrix on train data:
5. Classification report on train data:
Dependent variable Precision Recall F1-score Support
0 0.92 0.96 0.94 3269
1 0.77 0.61 0.68 672
6.1.3 XG Boost
XG Boost is an implementation of gradient boosted decision trees designed for speed and
performance.
1. Accuracy of train data:
0.9302
2. Accuracy of test data:
0.9100
3. Confusion matrix on train data:
6. Classification report on test data:
Dependent variable Precision Recall F1-score Support
0 0.92 0.98 0.95 1413
1 0.82 0.58 0.68 276
2. Test data:
0.8047 0.7514 0.7692 0.7869 0.8284
0.8106 0.7633 0.7455 0.8106 0.7619
2. Test data:
0.8698 0.8698 0.8816 0.8934 0.8757
0.8934 0.8698 0.8698 0.8639 0.8809
2. Test data:
0.8875 0.8461 0.8639 0.8224 0.8639
0.8875 0.8402 0.7988 0.8284 0.8392
2. Test data:
0.8461 0.8461 0.8639 0.8698 0.8934
0.8520 0.8343 0.8402 0.8698 0.8392
6.2.5 Support Vector Machine
1. Train data:
0.8278 0.8299 0.8299 0.8299 0.8299
0.8299 0.8299 0.8299 0.8299 0.8274
2. Test data:
0.8402 0.8402 0.8402 0.8343 0.8343
0.8343 0.8343 0.8343 0.8343 0.8392
2. Test data:
0.8757 0.8698 0.8520 0.8461 0.8402
0.8343 0.8343 0.8402 0.8402 0.8511
2. Test data:
0.9112 0.8875 0.8875 0.8579 0.9289
0.9171 0.8934 0.8816 0.9053 0.9226
0.8857 0.9035 0.9137 0.9035 0.8908
2. Test data:
0.8934 0.9053 0.8875 0.8875 0.8934
0.8994 0.8875 0.8875 0.9171 0.8869
6.2.9 XG Boost
1. Train data:
0.9012 0.9035 0.9060 0.9238 0.8807
0.8705 0.9213 0.9086 0.8934 0.8959
2. Test data:
0.9112 0.9230 0.8816 0.8639 0.9171
0.9171 0.8994 0.8816 0.9112 0.8988
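Lists of scores like these are usually easier to compare through their mean and spread. The sketch below summarises the ten XG Boost values listed above, assuming they are per-fold accuracies from a 10-fold cross-validation (the report does not state this explicitly).

```python
from statistics import mean, stdev

# XG Boost scores as listed above (assumed to be 10-fold cross-validation accuracies)
train_scores = [0.9012, 0.9035, 0.9060, 0.9238, 0.8807,
                0.8705, 0.9213, 0.9086, 0.8934, 0.8959]
test_scores = [0.9112, 0.9230, 0.8816, 0.8639, 0.9171,
               0.9171, 0.8994, 0.8816, 0.9112, 0.8988]

for name, scores in (("train", train_scores), ("test", test_scores)):
    print(f"{name}: mean={mean(scores):.4f}, std={stdev(scores):.4f}")
```

Both sets average roughly 0.90. A small standard deviation across folds suggests that the model's accuracy does not depend heavily on which rows fall into which fold.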
Each cell gives the Train / Test values.
Model | Accuracy | Precision of not Churned | Precision of Churned | Recall of not Churned | Recall of Churned | F1-score of not Churned | F1-score of Churned | AUC Value
Logistic Regression | 0.77 / 0.76 | 0.95 / 0.95 | 0.42 / 0.39 | 0.77 / 0.76 | 0.82 / 0.81 | 0.85 / 0.84 | 0.55 / 0.53 | 0.86 / 0.86
LDA | 0.87 / 0.88 | 0.89 / 0.89 | 0.73 / 0.76 | 0.97 / 0.98 | 0.42 / 0.40 | 0.93 / 0.93 | 0.53 / 0.53 | 0.85 / 0.86
KNN | 0.90 / 0.86 | 0.92 / 0.90 | 0.80 / 0.62 | 0.97 / 0.95 | 0.56 / 0.43 | 0.94 / 0.92 | 0.66 / 0.51 | 0.95 / 0.85
Naive Bayes | 0.85 / 0.87 | 0.91 / 0.93 | 0.56 / 0.60 | 0.91 / 0.92 | 0.57 / 0.62 | 0.91 / 0.92 | 0.57 / 0.61 | 0.82 / 0.84
SVM | 0.87 / 0.88 | 0.89 / 0.89 | 0.75 / 0.77 | 0.97 / 0.98 | 0.40 / 0.41 | 0.93 / 0.93 | 0.53 / 0.53 | 0.86 / 0.87
ANN | 0.93 / 0.91 | 0.86 / 0.88 | 0.70 / 0.72 | 0.98 / 0.98 | 0.26 / 0.29 | 0.92 / 0.92 | 0.38 / 0.42 | 0.83 / 0.83
Random Forest | 0.93 / 0.91 | 0.94 / 0.93 | 0.90 / 0.81 | 0.98 / 0.97 | 0.70 / 0.65 | 0.96 / 0.95 | 0.79 / 0.72 | 0.98 / 0.94
Ada Boost | 0.90 / 0.89 | 0.92 / 0.92 | 0.77 / 0.70 | 0.96 / 0.95 | 0.61 / 0.58 | 0.94 / 0.94 | 0.68 / 0.63 | 0.93 / 0.91
XG Boost | 0.93 / 0.91 | 0.93 / 0.92 | 0.93 / 0.82 | 0.99 / 0.98 | 0.64 / 0.58 | 0.96 / 0.95 | 0.76 / 0.68 | 0.94 / 0.91
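Because the data are imbalanced, accuracy alone can hide weak performance on churned customers. One way to compare the models in the table above is to rank them by test F1-score for the churned class. The sketch below does this with the values copied from the table; the choice of ranking metric is an assumption made for illustration, not a criterion stated in the report.

```python
# Test F1-score of the churned class, copied from the comparison table above
test_f1_churned = {
    "Logistic Regression": 0.53,
    "LDA": 0.53,
    "KNN": 0.51,
    "Naive Bayes": 0.61,
    "SVM": 0.53,
    "ANN": 0.42,
    "Random Forest": 0.72,
    "Ada Boost": 0.63,
    "XG Boost": 0.68,
}

# Sort models from best to worst on this metric
ranking = sorted(test_f1_churned.items(), key=lambda item: item[1], reverse=True)
for model, score in ranking:
    print(f"{model:<20} {score:.2f}")
```

On this metric Random Forest ranks first and XG Boost second, while ANN ranks last despite its high accuracy.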
XG Boost
Complain plays the most important role, followed by Number of Device Registered,
in the Logistic Regression and LDA models, and Cashback has the least importance.
Tenure plays the most important role, followed by Complain, in Random Forest and
XG Boost.
6.5.1 Overall Observations
Tenure and Complain are important factors for Churn, whereas Order Amount Hike
from Last year and Gender seem to be the least important factors in model
building.
Other variables such as City Tier, Days since Last Order and Satisfaction Score are
also important factors in model building.
The Random Forest model tuned with Grid Search CV performs best among all models,
so it is a good choice to implement, but it takes a long time to build; if time is a
concern, the next best model to implement is XG Boost.
The company should pay attention to the important factors above to reduce the Churn
rate; the model built predicts customer Churn from the given features.
7 Appendix
Tools used: Python, Tableau, Knime
The Python code file, Tableau Public link and Knime file are attached here for reference:
Final Project_Karthiheswar.ipynb
https://public.tableau.com/profile/karthiheswar#!/vizhome/Projectnote-1/ChurnvsCitytier
Project Note-1.knwf