Machine Learning Solution

Machine Learning – Predict the type of transport
Read the train and test data set

cars = read.csv("cars.csv")
Given sample data set containing 444 rows
carsTest = read.csv("test.csv")
Sample of two tests for which prediction must be done
Data exploration and analysis

str(cars)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
Variables like Engineer, MBA and license has been read as numeric so should be converted to factors
first.
cars$Engineer = as.factor(cars$Engineer)
cars$MBA = as.factor(cars$MBA)
cars$license = as.factor(cars$license)
Descriptive Analysis
summary(cars)
Age Gender Engineer MBA Work.Exp Salary
Distance
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. :
6.50 Min. : 3.20
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.:
9.80 1st Qu.: 8.80
Median :27.00 NA's: 1 Median : 5.0 Median :
13.60 Median :11.00
Mean :27.75 Mean : 6.3 Mean :
16.24 Mean :11.32
3rd Qu.:30.00 3rd Qu.: 8.0 3rd
Qu.:15.72 3rd Qu.:13.43
Max. :43.00 Max. :24.0 Max. :
57.00 Max. :23.40
license Transport
0:340 2Wheeler : 83
1:104 Car : 61
Public Transport:300
 We can conclude that we have majority of Males approx.. 75%
 Similarly Engineers outnumber MBA’s
 Total number of engineers and MBA’s is greater then 444, hence possibly
some of candidates have dual degree
 One of data point for MBA is missing
 Salary might have skewed distribution
 Again, public transport is most common mode of transportation
Visual Analysis
boxplot(cars$Age ~cars$Engineer, main = "Age vs Eng.")

boxplot(cars$Age ~cars$MBA, main ="Age Vs MBA")
As expected not much of difference here, people for all qulaifications and all work exp would be
employed in firm
Let us see the avg difference in salary for two profession
boxplot(cars$Salary ~cars$Engineer, main = "Salary vs Eng.")

boxplot(cars$Salary ~cars$MBA, main = "Salary vs MBA.")
We do not see any appreciable difference in salary of Engs Vs Non-Engs or Mba
vs Non-MBA’s
Also, mean salary for both MBA’s and Eng is around 16
hist(cars$Work.Exp, col = "red", main = "Distribution of work exp")

This is skewed towards right, again this would be on expected lines as there
would be more juniors then seniors in any firm
table(cars$license,cars$Transport)
2Wheeler Car Public Transport

0 60 13 267
1 23 48 33
boxplot(cars$Work.Exp ~ cars$Gender)
Not much of difference between mean work experience in two genders, so

population is equally distributed for both male and females.
Hypothesis Testing
1. Higher the salary more the chances of using car for commute.
boxplot(cars$Salary~cars$Transport, main="Salary vs Transport")

Plot clearly shows as salary increase, inclination of commuting by car is higher.
2. Again with age or work. Exp (Age and work exp would be collinear), propensity of using car
Increases
cor(cars$Age, cars$Work.Exp)
[1] 0.8408335
boxplot(cars$Age~cars$Transport, main="Age vs Transport")
As was the case with salary, we could see clear demarcation in usage of transport. With lower age group
2-wheeler is preferable and with higher work exp car is preferred.
3. As distance increase employee, would prefer car for comfort and ease.
boxplot(cars$Distance~cars$Transport, main="Distance vs Transport")

There is a slight pattern that could be observed here. For greater distance car is preferred followed by 2-
wheeler and then public transport.
4. Females would prefer more of private transfer then public transport
table(cars$Gender,cars$Transport)

Female 38 13 77
Male 45 48 223
We could see that around 40 % of females use private transport and 10% use car compared to males
where 15% prefers car and total of 30% uses private transport. Thus, even though percentage of car
usage
is high but they are also high on public transport.
Data cleaning
Missing values
anyNA(cars)
[1] TRUE
Finding out where the missing value is

cars[!complete.cases(cars), ]
Age Gender Engineer MBA Work.Exp Salary Distance license
Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 PublicTransport
Use KNN means method to impute the missing value

library(DMwR)
cars = knnImputation(cars, 5)
Normalize continuous variables
cars$Salary = log(cars$Salary)
Perform similar transformation on test data

carsTest$Salary = log(carsTest$Salary)
carsTest$Engineer = as.factor(carsTest$Engineer)
carsTest$MBA = as.factor(carsTest$MBA)
carsTest$license = as.factor(carsTest$license)
Create test and train data from sample data
library(caret)
random <- createDataPartition(cars$Transport, p=0.70, list=FALSE)
cars_train <- cars[ random,]
cars_test <- cars[-random,]
This sample has all the three categories representation above 10% so we can go ahead without any
oversampling
Model Building and Predictions
Naïve Bayes
library(e1071)
Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
Naive_Bayes_Model
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.42373 2.620893
Car 35.72093 3.340413
Public Transport 26.73333 2.924134
Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.2558140 0.7441860
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1395349 0.8604651
MBA
Y 0 1
2Wheeler 0.7966102 0.2033898
Car 0.7674419 0.2325581
Work.Exp
Y [,1] [,2]
2Wheeler 4.084746 3.114417
Car 15.674419 4.921870
Salary
Y [,1] [,2]
2Wheeler 2.452621 0.3659353
Car 3.514029 0.4321709
Distance
Y [,1] [,2]
2Wheeler 11.92881 3.524009
Car 15.85581 3.864263
license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2558140 0.7441860
This gives us the rule or factors which can help us employees decision to use car or not.
(These are summarized at the end)
General way to interpret this output is that for any factor variable say license we can say that 72% of
people without license use 2-wheeler and 27% with license.
For continuous variables for example distance we can say 2-wheeler is used by people for whom
commute distance is 11.9 with sd of 3.5
#Prediction on the test dataset

NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)
NB_Predictions 2Wheeler Car Public Transport

2Wheeler 8 0 6
Car 3 14 3
Public Transport 13 4 81
# prediction for test sample

NB_Predictions=predict(Naive_Bayes_Model,carsTest)
NB_Predictions
[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
LDA
We would once again import the two files and do data cleaning as required by
LDA. LDA works best with continuous variables hence convert factors as 1 and
0.
cars[145,4] = 0

cars$Gender<-ifelse(cars$Gender=="Male",1,0)
carsTest$Gender<-ifelse(carsTest$Gender=="Male",1,0)

library(MASS)
fit.ld=lda(Transport~., data=cars_train, cv=TRUE)
fit.ld
Call:
lda(Transport ~ ., data = cars_train, cv = TRUE)
Prior probabilities of groups:

0.1891026 0.1378205 0.6730769
Group means:
Age Gender Engineer MBA Work.Exp Salary
Distance license
2Wheeler 25.42373 0.5593220 0.7288136 0.1694915 4.186441 2.450022
11.56102 0.2372881
Car 35.67442 0.7441860 0.8139535 0.1860465 15.790698 3.536208
15.50000 0.7906977
Public Transport 26.76190 0.7666667 0.7285714 0.2857143 4.980952 2.515765
10.35238 0.1190476
Coefficients of linear discriminants:

LD1 LD2
Age -0.11042612 -0.3860466
Gender 0.25706348 -1.3517327
Engineer -0.14185048 0.2586975
MBA 0.18988407 -0.7316381
Work.Exp -0.07413621 0.2145325
Salary -0.58477768 -0.5036353
Distance -0.10677304 0.1340226
license -1.11223223 1.5268154
Proportion of trace:
LD1 LD2
0.9029 0.0971
Almost similar output as in Naïve Bayes
Predictions and accuracy
LDA_predictions = predict(fit.ld,cars_train)
table(LDA_predictions$class, cars_train$Transport)

2Wheeler 18 0 11
Car 3 36 3
LDA_predictions = predict(fit.ld,cars_test)
table(LDA_predictions$class, cars_test$Transport)

2Wheeler 11 0 6
Car 1 14 1
predict(fit.ld,carsTest)
$class
$posterior
1 0.2036210 7.228535e-05 0.7963068
2 0.2078997 5.165238e-06 0.7920952
$x
LD1 LD2
1 0.7702525 0.2470294
2 1.4835708 0.3306443
KNN
cars[145,4] = 0

cars$Gender<-ifelse(cars$Gender=="Male",1,0)
carsTest$Gender<-ifelse(carsTest$Gender=="Male",1,0)

library(class)
trControl <- trainControl(method = "cv", number = 10)

fit.knn <- train(Transport ~ .,
+ method = "knn",
+ tuneGrid = expand.grid(k = 2:20),
+ trControl = trControl,
+ metric = "Accuracy",
+ preProcess = c("center","scale"),
+ data = cars_train)
fit.knn
k-Nearest Neighbors
312 samples
8 predictor
3 classes: '2Wheeler', 'Car', 'Public Transport'
Pre-processing: centered (8), scaled (8)

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 280, 281, 280, 282, ...
Resampling results across tuning parameters:
k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860
13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)
KNN_predictions 2Wheeler Car Public Transport

2Wheeler 37 0 8
Car 0 35 2
KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 9 0 11
Car 1 15 3
predict(fit.knn,carsTest)
We see that all three models predict Public Transport for the two test samples
Let us summarize the conclusions from analysis and models for employee’s decision whether to use car
Or not:
 Important variables are Age, Work.Exp, Distance and License

 Age and Work.Exp are correlated hence we could use any one (prefer Work.Exp) here
 Hence employees with work exp of 10 and above are likely to use car
 Employees who must commute for distance greater than 12 are more likely to prefer car
 With license, we do see that 74% who commute through car have license and 89% who
commute through bus don’t have. But surprisingly 72% without license use 2-wheeler.
 Again, people with higher salaries (>20) are likely to use cars

Machine Learning Solution

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Machine Learning Solution

Uploaded by

Copyright:

Available Formats

Machine Learning – Predict the type of transport

Read the train and test data set

Data exploration and analysis

boxplot(cars$Age ~cars$Engineer, main = "Age vs Eng.")

Let us see the avg difference in salary for two profession

boxplot(cars$Salary ~cars$Engineer, main = "Salary vs Eng.")

hist(cars$Work.Exp, col = "red", main = "Distribution of work exp")

2Wheeler Car Public Transport

Not much of difference between mean work experience in two genders, so

boxplot(cars$Salary~cars$Transport, main="Salary vs Transport")

boxplot(cars$Age~cars$Transport, main="Age vs Transport")

boxplot(cars$Distance~cars$Transport, main="Distance vs Transport")

4. Females would prefer more of private transfer then public transport

2Wheeler Car Public Transport

Finding out where the missing value is

Use KNN means method to impute the missing value

Normalize continuous variables

Perform similar transformation on test data

Create test and train data from sample data

Model Building and Predictions

Naive Bayes Classifier for Discrete Predictors

#Prediction on the test dataset

NB_Predictions 2Wheeler Car Public Transport

# prediction for test sample

Normalize continuous variables

random <- createDataPartition(cars$Transport, p=0.70, list=FALSE)

Prior probabilities of groups:

Coefficients of linear discriminants:

Almost similar output as in Naïve Bayes

Predictions and accuracy

2Wheeler Car Public Transport

2Wheeler Car Public Transport

Normalize continuous variables

random <- createDataPartition(cars$Transport, p=0.70, list=FALSE)

trControl <- trainControl(method = "cv", number = 10)

Pre-processing: centered (8), scaled (8)

KNN_predictions 2Wheeler Car Public Transport

 Important variables are Age, Work.Exp, Distance and License

You might also like