
Predictive Modeling - Project 4

Telecom Customer Churn Prediction

Presented by: Sanan Sahadevan Olachery.

Submission Date: March 1st 2020.

Content

• Problem Statement
• EDA
• Build Models and compare them to get to the best one
• Actionable Insights

NOTE: As part of the assessment requirement, R code snapshots, pivot table workings and appropriate
images are affixed along with the solutions.

Sr. No   R Packages Installed
1        caTools
2        car
3        fmsb
4        ROCR
5        ggplot2
6        corrplot

Problem Statement:
Customer churn is a burning problem for telecom companies. In this project, we simulate one
such case of customer churn, working on data for postpaid customers with a contract.
The data has information about the customer usage behavior, contract details and the payment
details. The data also indicates which were the customers who canceled their service. Based on
this past data, we need to build a model which can predict whether a customer will cancel their
service in the future or not.

Q1) EDA
Solution: For the above problem statement we are provided with a dataset.
We first conduct a detailed analysis of the dataset to understand the elements in it.

• There are 3333 observations of 11 variables in total.
• There are no visible outliers or missing values in the dataset (based on observation of the
Excel data provided).

The description and type of each variable are provided with the dataset and are listed in
the table below.

Variable          Description                                        Type
Churn             1 if customer cancelled service, 0 if not          Categorical
AccountWeeks      number of weeks customer has had active account    Continuous
ContractRenewal   1 if customer recently renewed contract, 0 if not  Categorical
DataPlan          1 if customer has data plan, 0 if not              Categorical
DataUsage         gigabytes of monthly data usage                    Continuous
CustServCalls     number of calls into customer service              Continuous
DayMins           average daytime minutes per month                  Continuous
DayCalls          average number of daytime calls                    Continuous
MonthlyCharge     average monthly bill                               Continuous
OverageFee        largest overage fee in last 12 months              Continuous
RoamMins          average number of roaming minutes                  Continuous

The dataset is further analyzed in RStudio, and with the help of graphical representation we observe
and interpret the correlations between the variables.

The correlations between the variables in the dataset are summarized in the table below, with reference to
FIG:1.

Sr. No   Variable                                                  Correlation
1        Data Plan with Data Usage                                 Highly correlated
2        Monthly Charge with Data Plan and Data Usage;             Positively correlated
         Cust. Service Calls, Daily Minutes and Roaming Minutes
3        Contract Renewal with Churn                               Negatively correlated

FIG:1

In the dataset we have the following binary variables, which need to be converted into factor variables.

• Churn
• DataPlan
• ContractRenewal
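The conversion of the three binary columns can be sketched as follows. This is an illustrative Python sketch, not the R code used in the report: the two sample rows and their DayMins values are hypothetical, and the helper name `to_categorical` is an assumption made for this example.

```python
# Hypothetical rows for illustration only; the real data comes from the provided file.
rows = [
    {"Churn": 0, "DataPlan": 1, "ContractRenewal": 1, "DayMins": 265.1},
    {"Churn": 1, "DataPlan": 0, "ContractRenewal": 0, "DayMins": 129.1},
]

BINARY_COLUMNS = ("Churn", "DataPlan", "ContractRenewal")
LEVELS = {0: "0", 1: "1"}  # the two category (factor) levels, stored as labels


def to_categorical(rows, columns=BINARY_COLUMNS):
    """Return copies of the rows with the binary columns stored as category labels."""
    converted = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are left untouched
        for col in columns:
            if row[col] not in LEVELS:
                raise ValueError(f"{col} has non-binary value {row[col]!r}")
            new_row[col] = LEVELS[row[col]]
        converted.append(new_row)
    return converted


categorical_rows = to_categorical(rows)
print(categorical_rows[0])
```

Storing the values as labels rather than numbers keeps a model from treating 0 and 1 as quantities on a continuous scale.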

Further analysis of the variables in the dataset is interpreted as follows:

Variable 1 - AccountWeeks:

AccountWeeks varies from a minimum of 1 to a maximum of 243, with a standard deviation of 6.94.
No trend is visible, so we cannot conclude that the churn rate decreases as AccountWeeks
increases.

Variable 2 - ContractRenewal:

Running a pivot on the dataset gives the frequency of contract renewals against churn. A Churn value
of 0 means the user has not cancelled, and a ContractRenewal value of 0 means the user has not renewed
the connection. We can therefore state that the probability of an account churning is higher if the
contract is not renewed: 137 of 323 non-renewing customers churned (42.4%), against 346 of 3010
renewing customers (11.5%).

Count of ContractRenewal (0 = 323, 1 = 3010)

Churn         ContractRenewal = 0   ContractRenewal = 1   Grand Total
0             186                   2664                  2850
1             137                   346                   483
Grand Total   323                   3010                  3333

Variable 3 - DataPlan:

From the table below we can interpret that the churning probability is higher if the account has not
subscribed to a DataPlan: 403 of 2411 customers without a plan churned (16.7%), against 80 of 922
customers with a plan (8.7%).
Count of DataPlan (0 = 2411, 1 = 922)

Churn         DataPlan = 0   DataPlan = 1   Grand Total
0             2008           842            2850
1             403            80             483
Grand Total   2411           922            3333
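The churn rate within each group of the two pivot tables above can be checked with a few lines of arithmetic. The sketch below is in Python for illustration (the report itself used R and Excel pivots). The counts are copied from the tables, and the function name `churn_rate` is an assumption made for this example.

```python
# Counts copied from the two pivot tables: {group value: (stayed, churned)}
contract_renewal = {0: (186, 137), 1: (2664, 346)}
data_plan = {0: (2008, 403), 1: (842, 80)}


def churn_rate(stayed, churned):
    """Share of customers in a group who cancelled their service."""
    return churned / (stayed + churned)


for name, table in (("ContractRenewal", contract_renewal), ("DataPlan", data_plan)):
    for value, (stayed, churned) in table.items():
        print(f"{name}={value}: churn rate {churn_rate(stayed, churned):.1%}")
```

Customers who did not renew churn at roughly 42% against roughly 11% for those who renewed. Customers without a data plan churn at roughly 17% against roughly 9% for those with one.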

Variable 4 - Data Usage:

Data usage ranges from 0 to 5.40. We interpret that churning is at its maximum in this
category.

Variable 5 - CustServCalls:

As the name CustServCalls suggests, this variable gives the number of times a customer has
contacted customer care/service. It ranges from 0.00 to a maximum of 9.00. We can conclude that the
churn rate is high if a customer makes more than 4 calls.

Variable 6 - DayMins:

This variable reflects the time a customer spends on calls during the day. It ranges
from 0.00 to a maximum of 350.8 and has a mean of 179.8. The churn rate increases if DayMins is greater
than 245.

Variable 7 - DayCalls:

As the figures below show, there is no visible pattern in churn with daily calls.

Variable 8 - MonthlyCharge:

The churning rate increases if the monthly charge increases.

Variable 9 - OverageFee:

As the figures below show, there is no visible pattern in churn with OverageFee.

Variable 10 - RoamMins:

RoamMins has a low variance and shows no visible pattern in churn.

Q2) Build Models and compare them to get to the best one.
Solution:

The dataset is split 70:30 into training and testing sets.
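A 70:30 split of this kind can be sketched as below. This is an illustrative Python sketch and not the R code used for the report: the seed value and the function name `train_test_split` are assumptions, and since the report does not say whether the split was stratified by Churn, a plain random shuffle is used here.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training part and a testing part."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split repeatable
    shuffled = list(rows)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 3333 observations of the dataset.
row_ids = list(range(3333))
train, test = train_test_split(row_ids)
print(len(train), len(test))
```

Every observation lands in exactly one of the two sets, so the model is evaluated only on rows it has not seen during training.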

Logistic Regression Model 1 – Using All Variables

Observation:

1. The following variables have a negative impact on churning: ContractRenewal, CustServCalls and
RoamMins.
2. AIC Score: 2210.4
3. VIF values of the correlated variables are inflated due to multicollinearity.
4. MonthlyCharge and DataUsage need to be removed from the dataset.

The Akaike Information Criterion (AIC) is a measure of the relative quality of a statistical model for a given
dataset. The preferred model is the one with the minimum AIC. For this model the AIC score is 2210.4. If
this is the lowest AIC score among the subsequent models to be tested, then we can consider it.
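As a hedged illustration of the selection rule, the sketch below uses the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood, and picks the minimum among the three AIC scores reported for the logistic regression models in this report. It is written in Python for illustration, and the names `aic` and `reported_aic` are assumptions made for this example.

```python
def aic(num_parameters, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * num_parameters - 2 * log_likelihood


# AIC scores as reported for the three logistic regression models in this report.
reported_aic = {"Model 1": 2210.4, "Model 2": 2206.6, "Model 3": 2204.6}

# The preferred model is the one with the minimum AIC.
preferred = min(reported_aic, key=reported_aic.get)
print(preferred, reported_aic[preferred])
```

Because the 2k term penalizes extra parameters, dropping variables that add little to the likelihood lowers the score.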

Logistic Regression Model 2 – Without Correlated Variables
Based on the observations from Model 1, we remove MonthlyCharge and DataUsage.

Observation:
1. AIC Score = 2206.6
2. AccountWeeks and DayCalls need to be removed as these variables are insignificant.

Logistic Regression Model 3 – Without Insignificant Variables

Observation:
1. AIC Score = 2204.6
2. Each coefficient value gives the effect of its variable on the log of the odds of churning. The
directions of these effects agree with what was observed in the EDA.
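The link between a logistic regression coefficient and the odds of churning can be sketched as follows. This is an illustrative Python sketch: the two coefficient values are hypothetical and are NOT the fitted values from the models above, and the names `odds_ratio` and `churn_probability` are assumptions made for this example.

```python
import math


def odds_ratio(coefficient):
    """A coefficient is a change in log-odds; exponentiating gives the odds ratio."""
    return math.exp(coefficient)


def churn_probability(log_odds):
    """Logistic function: converts log-odds into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients for illustration only (not taken from the report).
example_coefficients = {"x_negative": -2.0, "x_positive": 0.5}

for name, beta in example_coefficients.items():
    print(f"{name}: coefficient {beta:+.1f} -> odds ratio {odds_ratio(beta):.3f}")
```

An odds ratio below 1 means a one-unit increase in the variable lowers the odds of churning, and an odds ratio above 1 means it raises them.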

Accuracy of the base model is 0.8550855, which matches the share of non-churners in the dataset (2850/3333).

Confusion Matrix at threshold of 0.5

Accuracy = (835+16)/(835+20+129+16) = 0.851

Sensitivity = 16/(129+16) = 0.1103

ROC Curve

The ROC curve shows a good result, rising from 0 to 1 as it moves from left to right. However, we can
improve sensitivity by lowering the classification threshold to 0.2, which moves the operating point
along the curve towards a higher TRUE POSITIVE RATE.

Accuracy with a Lower Threshold of 0.2

Accuracy of the Logistic Model = (724+90)/(724+90+131+55) = 0.814

Sensitivity = 90/(90+55) = 0.6207

From the above calculations we can conclude that lowering the threshold value improves the
model's sensitivity (from about 0.11 to about 0.62), at the cost of a small drop in accuracy (from 0.851 to 0.814).
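The accuracy and sensitivity figures for the two thresholds can be recomputed from the confusion matrix counts. The sketch below is in Python for illustration. The counts are read off the calculations above (true negatives, false positives, false negatives, true positives), and the function name `classification_metrics` is an assumption made for this example.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual churners that are caught
        "specificity": tn / (tn + fp),  # share of non-churners correctly left alone
    }


# Counts taken from the calculations above.
at_050 = classification_metrics(tn=835, fp=20, fn=129, tp=16)
at_020 = classification_metrics(tn=724, fp=131, fn=55, tp=90)

for label, metrics in (("threshold 0.5", at_050), ("threshold 0.2", at_020)):
    print(label, {name: round(value, 4) for name, value in metrics.items()})
```

The same counts also give the specificity, which falls from about 0.977 to about 0.847 when the threshold is lowered. This is the price paid for catching more churners.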

KNN:
Accuracy was used to select the optimal model, using the largest value.

The K value used is 7.
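Selecting K by the largest accuracy can be sketched as follows. This is a minimal Python illustration, not the R code used in the report: the eight training points and two evaluation points are hypothetical toy data, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_points, train_labels, eval_points, eval_labels, k):
    """Share of evaluation points whose predicted label equals the true label."""
    hits = sum(
        knn_predict(train_points, train_labels, point, k) == label
        for point, label in zip(eval_points, eval_labels)
    )
    return hits / len(eval_labels)


def select_k(train_points, train_labels, eval_points, eval_labels, candidates):
    """Return the candidate k with the largest accuracy (first one wins a tie)."""
    return max(
        candidates,
        key=lambda k: knn_accuracy(train_points, train_labels, eval_points, eval_labels, k),
    )


# Hypothetical toy data: two well separated groups labelled 0 and 1.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
                (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.2, 1.2)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
eval_points = [(0.15, 0.15), (1.05, 1.05)]
eval_labels = [0, 1]

chosen_k = select_k(train_points, train_labels, eval_points, eval_labels, (1, 3, 5, 7))
print(chosen_k, knn_accuracy(train_points, train_labels, eval_points, eval_labels, 7))
```

Odd values of K are used as candidates so that a two-class vote cannot end in a tie.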

Performance of the Model – In Sample Confusion Matrix

              Actual
Prediction    0       1
0             1986    181
1             9       158

Accuracy      0.919
Sensitivity   0.466
Specificity   0.996

Performance of the Model – Out of Sample Confusion Matrix

              Actual
Prediction    0       1
0             841     91
1             14      53

Accuracy      0.895
Sensitivity   0.368
Specificity   0.984

Naive Bayes Model

Performance of the Model – In Sample Confusion Matrix

              Actual
Prediction    0       1
0             1940    250
1             55      89

Accuracy      0.869
Sensitivity   0.263
Specificity   0.972

Performance of the Model – Out of Sample Confusion Matrix

              Actual
Prediction    0       1
0             840     109
1             15      35

Accuracy      0.876
Sensitivity   0.243
Specificity   0.982

Model Comparison (out of sample, %)

Model                                 Accuracy   Sensitivity   Specificity
Logistic Regression (threshold 0.5)   85.1       11.0          97.7
KNN                                   89.5       36.8          98.4
Naïve Bayes                           87.6       24.3          98.2

Inference from the Model Comparison:

From the above table and the detailed model comparison we can conclude that, as KNN has the highest
accuracy and sensitivity of the three models, we can use it for decision making.

Actionables:

From the overall analysis we can infer the following for the variables in the dataset.

• Contract Renewal: if more customers renew their contracts, churning is likely to
decrease.

• Data Plan: customers who opt for a data plan are less likely to churn.

• Overage Fee and RoamMins: an increase in the overage fee will result in an increase in customer
churning.