
Car Usage Predictions
2019

Authored by: XXXX

Project Objectives

This case study was prepared for an organization that wants to study its employees' preferred mode of transport for commuting, and to predict whether an employee will use a car. We also want to know which variables are significant predictors of this decision. The objective is to build the best model, using machine learning techniques, that can identify the employees who prefer cars.

We will perform the following steps and analyze the data using machine learning modeling techniques to identify such employees:
 Exploratory Data Analysis (EDA) & Data Preparation
 Modeling
 Actionable Insights and Recommendations

The factors that predominantly play an important role in this choice are:
• Monthly salary
• Expenses
• Work Experience
• Distance
• Position they hold
• Age

In this case study, we will try to identify the most influential factors in an employee's decision to use a car as their preferred means of transport by building machine learning models.

Exploratory Data Analysis & Data Preparation

The dataset contains information on 418 employees: their current mode of transport along with personal and professional details such as age, salary, and work experience.

Variable    Description
Age         Age of the employee
Gender      Gender of the employee
Engineer    Whether the employee is an engineering graduate. 1 means engineer, 0 means not.
MBA         Whether the employee has an MBA. 1 means MBA, 0 means not.
Work Exp    Total work experience of the employee
Salary      Monthly salary of the employee
Distance    Average distance the employee travels
license     Whether the employee holds a valid driving license. 1 means yes, 0 means no.
Transport   The mode of transport the employee currently prefers for commuting.

Data Summary

Structure of Data:
'data.frame': 418 obs. of 9 variables:
 $ Age      : int 28 24 27 25 25 21 23 23 24 28 ...
 $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
 $ Engineer : int 1 1 1 0 0 0 1 0 1 1 ...
 $ MBA      : int 0 0 0 0 0 0 1 0 0 0 ...
 $ Work.Exp : int 5 6 9 1 3 3 3 0 4 6 ...
 $ Salary   : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
 $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
 $ license  : int 0 0 0 0 0 0 0 0 0 1 ...
 $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...

Target Variable: - Transport


Summary of Data

EDA & Visualization of Data

Percentage Split of Usage of mode of transportation:

 19.9% use Two Wheelers, 8.4% use Cars and 71.8% use Public Transport

Gender wise Split of Usage of mode of transportation:

 Very few females use cars compared to males. Both males and females mostly use
public transport. There is no significant difference due to gender.

Engineer/Non Engineer wise analysis of Usage of mode of transportation:

 No Significant difference due to Engineer/Non-Engineer

MBA/Non-MBA wise analysis of Usage of mode of transportation:

 No Significant difference due to MBA/Non-MBA

License wise analysis of Usage of mode of transportation:

 Driving license holders prefer 2-wheelers and cars over public transport.
 A significant number of people without a driving license use 2-wheelers.

Work experience wise analysis of Usage of mode of transportation:

 The higher the work experience, the greater the use of cars over 2-wheelers and
public transport.
 Employees with 15 to 25 years of work experience prefer cars.

Salary wise analysis of Usage of mode of transportation:

 The higher the salary, the lower the usage of 2-wheelers and public transport.

Distance wise analysis of Usage of mode of transportation:

 Cars are preferred for travelling distances greater than 13 miles.

Our primary interest, as per the problem statement, is to understand the factors influencing car usage.

Hence, we will create a new column for car usage. It will take the value 0 for public transport and 2-wheeler users and 1 for car users, so that we can examine the proportion of car users in the transport mode accordingly.
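A minimal sketch of this step (the data frame name cars_data is illustrative, not taken from the original analysis):

# Derive a binary CarUsage column: 1 = Car, 0 = Public Transport or 2Wheeler
cars_data$CarUsage <- as.factor(ifelse(cars_data$Transport == "Car", 1, 0))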

We can clearly see that the target class makes up less than 10% of the available dataset, so we will apply SMOTE in a later step. Before that, we convert the Engineer, MBA, and license variables into factors by executing the R code below.
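A sketch of that conversion, under the same naming assumptions:

# Convert the 0/1 indicator columns to factors
cars_data$Engineer <- as.factor(cars_data$Engineer)
cars_data$MBA      <- as.factor(cars_data$MBA)
cars_data$license  <- as.factor(cars_data$license)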

Checking Target variable proportion in overall data set

First, check the proportion of the target variable in the actual dataset:
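A one-line sketch of this check:

# Percentage split of the target classes in the raw data
round(prop.table(table(cars_data$CarUsage)) * 100, 2)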

The number of records for people travelling by car is in the minority, at roughly 10%. Hence, we need to use an appropriate sampling method.
We will use SMOTE to balance the target variable proportion, use the resulting train and test datasets in logistic regression to find the best-fit model, and later explore a couple of black-box models for prediction.

Applying SMOTE for data balancing
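A sketch using the SMOTE implementation in the DMwR package; the perc.over and perc.under values are illustrative, not the ones used in the original analysis:

library(DMwR)

# Drop the original Transport column so only CarUsage remains as the target
model_data <- cars_data[, setdiff(names(cars_data), "Transport")]

# Oversample the minority class and undersample the majority class
balanced_data <- SMOTE(CarUsage ~ ., data = model_data,
                       perc.over = 250, perc.under = 150)
round(prop.table(table(balanced_data$CarUsage)) * 100, 2)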

After balancing, we can see that the proportion of the target class has increased beyond 10%, and we can use this balanced dataset when validating the models that follow.
Let's create a subset and split it into train and test datasets.
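A sketch of the split (the 70:30 ratio matches the later k-NN section; the seed is illustrative):

set.seed(123)
idx   <- sample(seq_len(nrow(balanced_data)), size = 0.7 * nrow(balanced_data))
train <- balanced_data[idx, ]
test  <- balanced_data[-idx, ]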

Checking Correlation

Let's look at the correlation between all the variables and treat highly correlated
variables accordingly to build the regression model.
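One way to compute this, assuming the naming above:

# Correlation matrix over the numeric columns only
num_cols <- sapply(cars_data, is.numeric)
round(cor(cars_data[, num_cols]), 2)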

Correlation Interpretation:
 Age, Work Exp, and Salary are highly correlated.
 Age, Work Exp, and Salary are all moderately correlated with Distance and
license.
 Transport is marginally correlated with Gender, but not significantly.
Since the correlations alone do not clearly identify the variables from which we can predict the mode of transport, we will perform a logistic regression.

Modeling

Logistic Regression

We will start with a logistic regression analysis, which will give us clear insight into which variables are significant for the model, so that we can achieve more precision by eliminating the irrelevant ones.

Building Logistic Regression Model based upon all the given variables
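A sketch of Model 1, assuming the train data frame from the split above:

# Model 1: logistic regression on all available predictors
logit_full <- glm(CarUsage ~ Age + Gender + Engineer + MBA + Work.Exp +
                    Salary + Distance + license,
                  data = train, family = binomial)
summary(logit_full)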

Checking for Logistic Regression Model Multicollinearity
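The VIF check can be done with the car package:

library(car)  # provides vif()
vif(logit_full)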

Interpretation from the logistic model using all available variables, after checking the multicollinearity:
 Multicollinearity has inflated the VIF values of the correlated variables, making
the model unreliable.

 VIF values for Salary and Work Exp are 5.54 and 15.69 respectively, which are
clearly inflated.
 Being conservative and not accepting VIF values above 5, we will remove
Salary and Work Exp (highly correlated).

Steps for Variable Reduction:

 Use the full set of explanatory variables.
 Calculate the VIF for each variable.
 Remove the variable with the highest VIF.
 Recalculate all VIF values for the logistic model built with the new set of variables.
 Again remove the variable with the highest VIF, until all values are within the
threshold.

Creating Model 2 - Logistic Regression after Removing highly correlated variables


Create the 2nd model after removing the correlated variables Salary and Work Exp.
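A sketch of Model 2, dropping the two correlated variables:

# Model 2: remove the highly correlated Salary and Work.Exp
logit_m2 <- glm(CarUsage ~ Age + Gender + Engineer + MBA + Distance + license,
                data = train, family = binomial)
summary(logit_m2)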

Engineer, Distance, Gender, and MBA are insignificant, so we will remove them as well and create a new model based on the remaining variables.

Creating Model 3 - Logistic Regression built after Removing all insignificant variables
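A sketch of Model 3, assuming Age and license are the variables that remain after the eliminations described above:

# Model 3: keep only the significant predictors
logit_m3 <- glm(CarUsage ~ Age + license, data = train, family = binomial)
summary(logit_m3)
vif(logit_m3)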

In this newly built model, all variables are significant, and we can verify this by checking the multicollinearity as well.

The VIF values are now within range, all variables are significant, and the results make more sense and are in line with what we obtained from the EDA.

Regression Model Performance on Train and Test Data set

1. Confusion Matrix: -

We will start model evaluation on the train and test data by executing the code
below, to see how accurate our model is in identifying employees who prefer
cars as their mode of transport.

Calculating Confusion Matrix on Train Data: -


We predict the 0/1 classification for each row, then put the actual and predicted
values into a table to build the confusion matrix and check how accurate our
model is, by executing the R code below.
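A sketch (the 0.5 cutoff is illustrative):

train_prob <- predict(logit_m3, newdata = train, type = "response")
train_pred <- ifelse(train_prob > 0.5, 1, 0)
table(Actual = train$CarUsage, Predicted = train_pred)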
Calculating Confusion Matrix on Test Data: -
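And the equivalent sketch for the test data:

test_prob <- predict(logit_m3, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)
table(Actual = test$CarUsage, Predicted = test_pred)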

Confusion Matrix Output: -


From the confusion matrix we can clearly see that our model is 96.75% accurate
on the train data, and the test data confirms this with 96.20% accuracy. There is
a slight variation, but it is within range, so we can confirm that our model is a
good one.

2. ROC

The ROC curve is a plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also
known as the true positive rate.

Calculating ROC on Train Data
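A sketch using the ROCR package:

library(ROCR)

pred_tr <- prediction(train_prob, train$CarUsage)
perf_tr <- performance(pred_tr, "tpr", "fpr")
plot(perf_tr)                                  # ROC curve on train data
performance(pred_tr, "auc")@y.values[[1]]      # area under the curve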

Calculating ROC on Test Data
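And on the test data:

pred_te <- prediction(test_prob, test$CarUsage)
perf_te <- performance(pred_te, "tpr", "fpr")
plot(perf_te)                                  # ROC curve on test data
performance(pred_te, "auc")@y.values[[1]]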

ROC Output Analysis: -

The plot covers a large area under the curve, and we can predict well on the
true positive side.
 On the train data, the true positive rate is 99.66%; on the test data, it is 98.80%.
 There is no major variation between our test and train data.
 This suggests that our model is stable.

3. K-S chart

The K-S statistic measures the degree of separation between car users and
non-car users. By executing the code below on the train and test models, we
can see the K-S analysis results:
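A sketch that reuses the ROCR performance objects from the ROC step; K-S is the maximum gap between the TPR and FPR curves:

ks_train <- max(perf_tr@y.values[[1]] - perf_tr@x.values[[1]])
ks_test  <- max(perf_te@y.values[[1]] - perf_te@x.values[[1]])
ks_train; ks_test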

K-S Output Analysis:
From the K-S analysis we can clearly see that our model can distinguish between
people likely to prefer a car and those who are not:
 95.30% on train and 93.96% on test.
 There is a slight variation, but it is within range.
 We can confirm that our model is acceptable.

4. Gini chart

Gini is the ratio of the area between the ROC curve and the diagonal line to the
area of the upper triangle.
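The Gini figures below were most likely produced with the ineq package's Gini coefficient on the predicted probabilities (an assumption on our part; another common convention is Gini = 2 * AUC - 1):

library(ineq)
ineq(train_prob, type = "Gini")  # train
ineq(test_prob,  type = "Gini")  # test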

Gini Output Analysis

From the Gini analysis we can clearly see that:
 The train data separates car and non-car employees with a Gini of 73.06%,
and the test data with 71.93%.
 There is a slight variation, but it is within range.
 We can confirm that our model is acceptable.

k-NN Classification

k-NN is a supervised learning algorithm. It uses labelled input data to learn a function
that produces an appropriate output when given new, unlabelled data. So, let's build
our classification model by following the steps below:

Splitting data into Test and Train set in 70:30 ratio
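A sketch of the 70:30 split for the k-NN model (the seed is illustrative):

set.seed(456)
idx_knn   <- sample(seq_len(nrow(balanced_data)),
                    size = 0.7 * nrow(balanced_data))
knn_train <- balanced_data[idx_knn, ]
knn_test  <- balanced_data[-idx_knn, ]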

Creating k-NN model: With 3 neighbors

Creating k-NN model on Train and Test data set
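A sketch using the class package; knn() needs all-numeric predictors, so we scale a numeric subset of features (the chosen features are illustrative):

library(class)

feats    <- c("Age", "Work.Exp", "Salary", "Distance")
knn_tr_x <- scale(knn_train[, feats])
knn_te_x <- scale(knn_test[, feats],
                  center = attr(knn_tr_x, "scaled:center"),
                  scale  = attr(knn_tr_x, "scaled:scale"))

knn_pred3 <- knn(train = knn_tr_x, test = knn_te_x,
                 cl = knn_train$CarUsage, k = 3)
table(Actual = knn_test$CarUsage, Predicted = knn_pred3)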

Performing classification Model Performance Measures, when K=3

1. Confusion Matrix: -
We will start model evaluation on the train and test data by executing the code
below, to see how accurate our model is in identifying employees who prefer
cars as their mode of transport.

Calculating Confusion Matrix on Train Data: -

We predict the 0/1 classification for each row, then put the actual and predicted
values into a table to build the confusion matrix and check how accurate our
model is, by executing the R code below.

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: - From the confusion matrix we can clearly see that:
 The train data is 97.83% accurate in its predictions, and the test data
confirms this with 94.93% accuracy.
 There is a slight variation, but it is within range.
 We can confirm that our model is good.

2. ROC
The ROC curve is a plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also
known as the true positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that it covers a large area under the curve, and we can
predict well on the true positive side.
 On the train data, the true positive rate is 97.22%; on the test data, it is 92.58%.
 There is no major variation between our test and train data.
 This suggests that our model is stable.
3. K-S chart
The K-S statistic measures the degree of separation between car users and
non-car users. By executing the code below on the train and test models, we
can see the K-S analysis results:

K-S Output Analysis
From the K-S analysis we can clearly see that:
 The model distinguishes people likely to prefer a car from those who are not,
with 94.44% on train and 85.17% on test.
 There is a variation, and it is not within range.
 We can conclude that this model is not stable.

4. Gini chart
Gini is the ratio of the area between the ROC curve and the diagonal line to the
area of the upper triangle.

Gini Output Analysis

From the Gini analysis, we can clearly see that:
 The train data covers little of the area between car and non-car employees,
with a Gini of 15.39%; the test data gives 15.98%.
 The variation between them is slight and within range.
 On this measure, the model is acceptable.

Creating k-NN model: With 5 neighbors

Creating k-NN model on Train and Test data set
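The same sketch with k = 5, reusing the scaled matrices from the k = 3 model:

knn_pred5 <- knn(train = knn_tr_x, test = knn_te_x,
                 cl = knn_train$CarUsage, k = 5)
table(Actual = knn_test$CarUsage, Predicted = knn_pred5)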

Performing classification Model Performance Measures, when K=5

1. Confusion Matrix: -
We will start model evaluation on the train and test data by executing the code
below, to see how accurate our model is in identifying employees who prefer cars
as their mode of transport.

Calculating Confusion Matrix on Train Data: -
We predict the 0/1 classification for each row, then put the actual and predicted
values into a table to build the confusion matrix and check how accurate our
model is, by executing the R code below.

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: - From the confusion matrix we can clearly see that:

 The train data is 97.56% accurate in its predictions, and the test data
confirms this with 95.56% accuracy.
 There is a slight variation, but it is within range.
 We can confirm that our model is good.

2. ROC
The ROC curve is a plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also
known as the true positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -

From the plot we can see that it covers a large area under the curve, and we can
predict well on the true positive side.

 On the train data, the true positive rate is 97.02%; on the test data, it is 94.04%.
 There is no major variation between our test and train data.
 This suggests that our model is stable.

3. K-S chart

The K-S statistic measures the degree of separation between car users and
non-car users. By executing the code below on the train and test models, we
can see the K-S analysis results:

K-S Output Analysis

From the K-S analysis we can clearly see that:

 The model distinguishes people likely to prefer a car from those who are not,
with 94.04% on train and 88.08% on test.
 There is a variation, and it is not within range.
 We can conclude that this model is not stable.

4. Gini chart
Gini is the ratio of the area between the ROC curve and the diagonal line to the
area of the upper triangle.

Gini Output Analysis: From the Gini analysis we can clearly see that:
 The train data covers little of the area between car and non-car employees,
with a Gini of 15.32%; the test data gives 15.57%.
 The variation between them is slight and within range.
 On this measure, the model is acceptable.

Creating Naïve Bayes model

The Naive Bayes classifier assumes that the presence of a feature in a class is
unrelated to the presence of any other feature in the same class. Let's build the
model and see how well it performs as a classifier.
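A sketch using the e1071 package, under the same naming assumptions as before:

library(e1071)

nb_model <- naiveBayes(CarUsage ~ ., data = train)
nb_pred  <- predict(nb_model, newdata = test)
table(Actual = test$CarUsage, Predicted = nb_pred)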

Performing classification Model Performance Measures for Naïve Bayes

1. Confusion Matrix: -
Calculating Confusion Matrix on Train Data: -

Calculating Confusion Matrix on Test Data: -

Confusion Matrix Output: -

From the confusion matrix we can clearly see that our model is 95.94% accurate
on the train data and 93.67% accurate on the test data in predicting car usage.

2. ROC
The ROC curve is a plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also
known as the true positive rate.

Calculating ROC on Train Data

Calculating ROC on Test Data

ROC Output Analysis: -


From the plot we can see that it covers a large area under the curve.
 On the train data, the true positive rate is 72.93%; on the test data, it is 94.04%.
 There is a major variation between our test and train data.
 This shows that our model is not stable.
3. K-S chart
The K-S statistic measures the degree of separation between car users and
non-car users. By executing the code below on the train and test models, we
can see the K-S analysis results:

K-S Output Analysis


From the K-S analysis we can clearly see that:
 The model distinguishes people likely to prefer a car from those who are not,
with 45.87% on train and 50.03% on test.
 There is a slight variation, and it is within range.
 We can conclude that the model is stable on this measure.

Applying Bagging and Boosting Techniques

Bagging and Boosting: - These are ensemble methods that train multiple models
using the same algorithm, helping to create a strong learner from weak ones.
Bagging (aka Bootstrap Aggregating) is a way to decrease the variance of
predictions by generating additional training data from the original dataset,
sampling with replacement to produce multisets of the same cardinality/size as
the original data.
Boosting is a way of training weak learners sequentially.

Applying Bagging model:

This helps in comparing the predictions with the observed values, thereby
estimating the errors.
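A sketch with the ipred package (an assumption on our part; tree-based bagging via rpart is another common route):

library(ipred)

bag_model <- bagging(CarUsage ~ ., data = train, coob = TRUE)  # out-of-bag error
bag_pred  <- predict(bag_model, newdata = test)
table(Actual = test$CarUsage, Predicted = bag_pred)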

Interpretation:

Bagging here takes the baseline approach of calling everything true, which is an
extreme outcome; it does not represent the minority class properly, so it is not
preferable.
First, convert the dependent variable to numeric.

Next, use the balanced test data from the SMOTE analysis.

Pass the Boosting model:

We are using XGBoost, a specialized implementation of gradient-boosted decision
trees designed for performance.

XGBoost works with matrices that contain only numeric variables; hence, we first
change the data to a matrix.
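A sketch of that preparation (names illustrative; model.matrix one-hot encodes the factor columns):

library(xgboost)

# All-numeric feature matrices, response excluded
xgb_tr_x <- model.matrix(CarUsage ~ . - 1, data = train)
xgb_te_x <- model.matrix(CarUsage ~ . - 1, data = test)

# 0/1 numeric labels
xgb_tr_y <- as.numeric(as.character(train$CarUsage))
xgb_te_y <- as.numeric(as.character(test$CarUsage))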

Pass the XGBoost model:
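A sketch of the call; the parameter values are illustrative, not the originals. Note that nfold and early_stopping_rounds belong to xgb.cv, the cross-validation helper:

# Cross-validation to pick the number of boosting rounds
cv <- xgb.cv(data = xgb_tr_x, label = xgb_tr_y,
             objective = "binary:logistic",
             eta = 0.01, max_depth = 5, min_child_weight = 3,
             nrounds = 100, nfold = 5,
             early_stopping_rounds = 10, verbose = 0)

# Final model on the full training matrix
xgb_model <- xgboost(data = xgb_tr_x, label = xgb_tr_y,
                     objective = "binary:logistic",
                     eta = 0.01, max_depth = 5, min_child_weight = 3,
                     nrounds = 100, verbose = 0)

xgb_prob <- predict(xgb_model, xgb_te_x)
xgb_pred <- ifelse(xgb_prob > 0.5, 1, 0)
table(Actual = xgb_te_y, Predicted = xgb_pred)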

The parameters above are described as follows:

eta = the learning rate at which the values are updated; a small value means slow learning
max_depth = how many levels deep to expand the trees; the larger the depth, the more
complex the model and the higher the chance of overfitting. There is no standard value
for max_depth, and larger datasets require deeper trees to learn the rules from the data.
min_child_weight = blocks potential feature interactions to prevent overfitting
nrounds = controls the maximum number of iterations; for classification, it is similar
to the number of trees to grow
nfold = used for cross-validation
verbose = controls whether the output is printed; we do not want to see it
early_stopping_rounds = stop if there is no improvement for 10 consecutive rounds

Confusion Matrix Output: -

This shows a prediction accuracy of 100% in identifying the employees who use cars.
Unlike bagging, this model is a proper representation of both the majority and
minority classes.

Actionable Insights & Recommendations

The logistic regression, k-NN, and Naive Bayes models are all able to predict the
transport mode with very high accuracy. Using bagging and boosting, however, we
can predict the choice of transport mode with 100% accuracy. In this case, any of the
models (logistic regression, k-NN, Naive Bayes, or bagging/boosting) can be used for
high-accuracy prediction. The key aspect, however, is using SMOTE to balance the
minority and majority classes, without which our models would not be so accurate.

