Machine Learning Solution
cars = read.csv("cars.csv")
carsTest = read.csv("test.csv")
The test file holds the two samples for which predictions must be made.
Variables like Engineer, MBA and license have been read as numeric, so they should be converted to factors first.
cars$Engineer = as.factor(cars$Engineer)
cars$MBA = as.factor(cars$MBA)
cars$license = as.factor(cars$license)
Descriptive Analysis
summary(cars)
      Age           Gender    Engineer  MBA         Work.Exp        Salary         Distance      license            Transport
 Min.   :18.00   Female:128   0:109    0   :331   Min.   : 0.0   Min.   : 6.50   Min.   : 3.20   0:340   2Wheeler        : 83
 1st Qu.:25.00   Male  :316   1:335    1   :112   1st Qu.: 3.0   1st Qu.: 9.80   1st Qu.: 8.80   1:104   Car             : 61
 Median :27.00                         NA's:  1   Median : 5.0   Median :13.60   Median :11.00           Public Transport:300
 Mean   :27.75                                    Mean   : 6.3   Mean   :16.24   Mean   :11.32
 3rd Qu.:30.00                                    3rd Qu.: 8.0   3rd Qu.:15.72   3rd Qu.:13.43
 Max.   :43.00                                    Max.   :24.0   Max.   :57.00   Max.   :23.40
We can conclude that males form the majority, approx. 75%.
Similarly, Engineers outnumber MBAs.
The total number of Engineers and MBAs exceeds 444, so some candidates likely hold a dual degree.
One data point for MBA is missing.
Salary appears to have a skewed distribution (its mean, 16.24, is well above its median, 13.60).
Again, public transport is the most common mode of transportation.
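The dual-degree claim follows from inclusion-exclusion. A quick base-R check (the counts below are copied from the summary above, with the one missing MBA value excluded):

```r
n         <- 444   # total employees
engineers <- 335   # Engineer == 1
mbas      <- 112   # MBA == 1
# By inclusion-exclusion, at least this many employees hold both degrees:
min_dual <- engineers + mbas - n
min_dual   # 3
# On the actual data the exact overlap can be read off directly:
# table(cars$Engineer, cars$MBA)
```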
Visual Analysis
As expected, there is not much difference here; people of all qualifications and all levels of work experience are employed in the firm.
table(cars$license,cars$Transport)
boxplot(cars$Work.Exp ~ cars$Gender)
Hypothesis Testing
1. The higher the salary, the greater the chance of using a car for the commute.
2. Likewise, with age or work experience (age and work experience are likely collinear), the propensity to use a car increases.
cor(cars$Age, cars$Work.Exp)
[1] 0.8408335
As was the case with salary, we can see a clear demarcation in transport usage: in the lower age group a 2-wheeler is preferred, and with higher work experience a car is preferred.
3. As distance increases, an employee would prefer a car for comfort and ease.
table(cars$Gender,cars$Transport)
We can see that around 40% of females use private transport (10% by car), compared to males, where around 30% in total use private transport (15% by car). Thus, even though the percentage of car usage among males is higher, males are also heavy users of public transport.
Data cleaning
Missing values
anyNA(cars)
[1] TRUE
The single NA is in MBA; it is imputed as 0 before modelling. Salary is right-skewed, so we apply a log transform:
cars$Salary = log(cars$Salary)
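A short sketch of why the log transform helps: synthetic log-normal salaries stand in for the real column here, and a simple moment-based skewness estimate is compared before and after the transform.

```r
set.seed(1)
salary <- exp(rnorm(1000, mean = 2.6, sd = 0.5))    # right-skewed stand-in data
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # sample skewness
skew(salary)       # strongly positive before the transform
skew(log(salary))  # close to zero afterwards
```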
library(caret)
random <- createDataPartition(cars$Transport, p=0.70, list=FALSE)
cars_train <- cars[ random,]
cars_test <- cars[-random,]
This sample has representation above 10% for all three categories, so we can go ahead without any oversampling.
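The 10% check can be done with a one-line proportion table; the class counts below are the full-data counts from the summary above, standing in for the training split.

```r
transport <- factor(c(rep("2Wheeler", 83), rep("Car", 61),
                      rep("Public Transport", 300)))
props <- prop.table(table(transport))
round(props, 3)       # 0.187, 0.137, 0.676
all(props > 0.10)     # TRUE: no class is rare enough to need oversampling
```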
Naïve Bayes
library(e1071)
Naive_Bayes_Model = naiveBayes(Transport ~ ., data = cars_train)
Naive_Bayes_Model
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.42373 2.620893
Car 35.72093 3.340413
Public Transport 26.73333 2.924134
Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.2558140 0.7441860
Public Transport 0.2761905 0.7238095
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
Car 0.1395349 0.8604651
Public Transport 0.2714286 0.7285714
MBA
Y 0 1
2Wheeler 0.7966102 0.2033898
Car 0.7674419 0.2325581
Public Transport 0.7333333 0.2666667
Work.Exp
Y [,1] [,2]
2Wheeler 4.084746 3.114417
Car 15.674419 4.921870
Public Transport 4.866667 3.062559
Salary
Y [,1] [,2]
2Wheeler 2.452621 0.3659353
Car 3.514029 0.4321709
Public Transport 2.508357 0.3066213
Distance
Y [,1] [,2]
2Wheeler 11.92881 3.524009
Car 15.85581 3.864263
Public Transport 10.27286 3.090404
license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2558140 0.7441860
Public Transport 0.8857143 0.1142857
This gives us the factors that help explain an employee's decision to use a car or not
(these are summarized at the end).
The general way to interpret this output: for a factor variable such as license, among
2-wheeler users about 73% do not hold a license and 27% do.
For continuous variables such as Distance, the two columns are the class-conditional mean
and standard deviation: 2-wheeler users have a mean commute distance of 11.9 with an sd of 3.5.
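A sketch of how naive Bayes uses these (mean, sd) pairs: for a continuous predictor it evaluates a Gaussian density per class and favours the class with the highest likelihood. The numbers below are the Distance rows copied from the model output above; the 16 km commute is an illustrative query point.

```r
dist_params <- rbind("2Wheeler"         = c(mean = 11.92881, sd = 3.524009),
                     "Car"              = c(mean = 15.85581, sd = 3.864263),
                     "Public Transport" = c(mean = 10.27286, sd = 3.090404))
x   <- 16                                              # a 16 km commute
lik <- dnorm(x, dist_params[, "mean"], dist_params[, "sd"])
names(which.max(lik))   # "Car": the density is highest for car users
```

The full classifier multiplies such per-predictor likelihoods with the a-priori class probabilities before taking the argmax.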
LDA
We once again import the two files and do the data cleaning required by LDA. LDA works with continuous predictors, so factor variables should be recoded as 0/1 numerics.
cars = read.csv("cars.csv")
carsTest = read.csv("test.csv")
cars[145,4] = 0 # impute the single missing MBA value
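A sketch of the factor-to-0/1 recoding described above. Gender needs an explicit ifelse (the same pattern is used in the KNN section below); Engineer, MBA and license already hold 0/1 codes, so converting the factor back through character preserves them.

```r
gender <- c("Male", "Female", "Male")
ifelse(gender == "Male", 1, 0)    # 1 0 1

lic <- factor(c(0, 1, 1))
as.numeric(as.character(lic))     # 0 1 1 (not the level indices 1 2 2)
```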
library(MASS)
fit.ld = lda(Transport ~ ., data = cars_train, cv = TRUE)
(Note: MASS::lda expects CV = TRUE, capitalised; the lowercase cv here is silently ignored, so an ordinary LDA fit is returned rather than leave-one-out predictions.)
fit.ld
Call:
lda(Transport ~ ., data = cars_train, cv = TRUE)
Group means:
                      Age    Gender  Engineer       MBA  Work.Exp   Salary  Distance   license
2Wheeler         25.42373 0.5593220 0.7288136 0.1694915  4.186441 2.450022  11.56102 0.2372881
Car              35.67442 0.7441860 0.8139535 0.1860465 15.790698 3.536208  15.50000 0.7906977
Public Transport 26.76190 0.7666667 0.7285714 0.2857143  4.980952 2.515765  10.35238 0.1190476
Proportion of trace:
LD1 LD2
0.9029 0.0971
LD1 alone captures about 90% of the between-class variance, so the classes separate largely along a single discriminant direction.
LDA_predictions = predict(fit.ld,cars_train)
table(LDA_predictions$class, cars_train$Transport)
LDA_predictions = predict(fit.ld,cars_test)
table(LDA_predictions$class, cars_test$Transport)
predict(fit.ld,carsTest)
$class
[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
$posterior
2Wheeler Car Public Transport
1 0.2036210 7.228535e-05 0.7963068
2 0.2078997 5.165238e-06 0.7920952
$x
LD1 LD2
1 0.7702525 0.2470294
2 1.4835708 0.3306443
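The $class labels are just the row-wise argmax of the posterior matrix; the values below are copied from the output above.

```r
post <- rbind(c(0.2036210, 7.228535e-05, 0.7963068),
              c(0.2078997, 5.165238e-06, 0.7920952))
colnames(post) <- c("2Wheeler", "Car", "Public Transport")
colnames(post)[apply(post, 1, which.max)]
# "Public Transport" "Public Transport"
```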
KNN
cars = read.csv("cars.csv")
carsTest = read.csv("test.csv")
cars[145,4] = 0
cars$Gender<-ifelse(cars$Gender=="Male",1,0)
carsTest$Gender<-ifelse(carsTest$Gender=="Male",1,0)
The k-NN model is then tuned with caret over a grid of k values and stored as fit.knn; the fitted object reports:
312 samples
8 predictor
3 classes: '2Wheeler', 'Car', 'Public Transport'
k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860
13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3.
KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)
KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)
KNN_predictions 2Wheeler Car Public Transport
2Wheeler 9 0 11
Car 1 15 3
Public Transport 14 3 76
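From the test confusion matrix above (rows = predicted, columns = actual) we can compute overall accuracy and per-class recall in base R:

```r
cm <- matrix(c( 9,  0, 11,
                1, 15,  3,
               14,  3, 76), nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("2Wheeler", "Car", "Public Transport"),
                             actual    = c("2Wheeler", "Car", "Public Transport")))
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)                 # 0.758 overall on the test split
recall <- diag(cm) / colSums(cm)   # fraction of each actual class caught
round(recall, 2)                   # 2Wheeler 0.38, Car 0.83, Public Transport 0.84
```

The Car class is recovered well, while 2-wheeler users are frequently confused with public-transport users.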
predict(fit.knn,carsTest)
[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
We see that all three models predict Public Transport for the two test samples.
Let us summarize the conclusions from the analysis and models regarding an employee's decision whether or not to use a car: