Assignment AnjaliVats 244
1. Build various decision tree models using different combinations of independent variables.
2. Check the accuracy of each model.
3. Find the best model among those generated.
DataSet
The Carseats dataset is a data frame with 400 observations on the following 11 variables: Sales, CompPrice, Income, Advertising, Population, Price, ShelveLoc, Age, Education, Urban, and US.
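The same dataset also ships with the ISLR package, so its structure can be inspected without reading the CSV; a minimal sketch (assumes the ISLR package is installed):

```r
# The Carseats data frame is bundled with the ISLR package
library(ISLR)
# 400 observations of 11 variables, a mix of numeric and factor columns
dim(Carseats)
str(Carseats)
```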
Using rpart
1. Predicting whether the person belongs to the US or not based on the variables Income, Advertising, Population, and Price.
############ CODE 1 ##################
install.packages("rpart")
library(rpart)
getwd()
# Load the Carseats data
Carseats <- read.csv("C:/Users/intone/Desktop/MBA/T4/MA/After MidTerm/Carseats.csv")
attach(Carseats)
names(Carseats)
# Classification tree: US predicted from Income, Advertising, Population, and Price
tree_analysis <- rpart(US ~ Income + Advertising + Population + Price,
                       data = Carseats, method = "class")
tree_analysis
install.packages("rpart.plot")
library(rpart.plot)
# Plot the tree; extra = 1 shows the observation counts per class at each node
rpart.plot(tree_analysis, extra = 1)
2) Predicting whether the person belongs to an urban area or not based on the other parameters in the dataset, i.e. Income, Advertising, Education, Population, Price, Age, ShelveLoc, US, and Sales.
# Drop the first column (a row-index column from the CSV) and inspect the data
Carseats <- Carseats[, -1]
print(summary(Carseats))
set.seed(1234)
# Classification tree: Urban predicted from the remaining variables
tree_analysis <- rpart(Urban ~ Income + Advertising + Education + Population +
                         Price + Age + ShelveLoc + US + Sales,
                       data = Carseats, method = "class")
rpart.plot(tree_analysis, extra = 1)
Using the tree package
a) Creating a decision tree model by splitting the dataset into training and test data.
library(tree)
# High = "Yes" when Sales exceed 8 (thousand units); this is the response to predict
High <- as.factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
Carseats <- data.frame(Carseats, High)
set.seed(2)
# Split the data 50/50 into training and test sets
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
training_data <- Carseats[train, ]
testing_data <- Carseats[-train, ]
testing_High <- High[-train]
# Fit a classification tree on the training data (Sales excluded, since High is derived from it)
tree_model <- tree(High ~ . - Sales, training_data)
plot(tree_model)
text(tree_model, pretty = 0)
# Test-set misclassification rate of the unpruned tree
tree_pred <- predict(tree_model, testing_data, type = "class")
mean(tree_pred != testing_High)
# Cross-validation to choose the tree size, guided by misclassification error
cv_tree <- cv.tree(tree_model, FUN = prune.misclass)
names(cv_tree)
plot(cv_tree$size, cv_tree$dev, type = "b")
## prune the tree
pruned_model <- prune.misclass(tree_model, best = 9)
plot(pruned_model)
text(pruned_model, pretty = 0)
# Recompute predictions from the pruned tree before measuring its error
pruned_pred <- predict(pruned_model, testing_data, type = "class")
mean(pruned_pred != testing_High)
Best Model Generated
Comparing Accuracy
1) The best model generated is the one in which the US (Yes/No) label is predicted from the other variables. This model has an accuracy of around 89%.
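The accuracy figure can be reproduced by comparing predicted and actual labels; a sketch, assuming `tree_analysis` is the rpart model for US fitted in CODE 1:

```r
# Predicted US labels from the fitted classification tree
pred <- predict(tree_analysis, Carseats, type = "class")
# Confusion matrix: actual vs. predicted labels
table(Actual = Carseats$US, Predicted = pred)
# Accuracy = share of correctly classified observations
mean(pred == Carseats$US)
```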
Comparing Mean Errors
The lowest mean error value was obtained for model7 and model8, with 7 and 8 independent variables respectively, i.e. with the following combinations:
tree_model7 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc, training_data)
tree_model8 = tree(High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc + Urban, training_data)
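The model comparison above can be organized as a loop over candidate formulas; a sketch, assuming `training_data`, `testing_data`, and `testing_High` exist as in the tree-package section:

```r
library(tree)
# Candidate formulas (the two best-performing combinations from above)
formulas <- list(
  model7 = High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc,
  model8 = High ~ Advertising + Age + Price + Education + Income + Population + US + ShelveLoc + Urban
)
# Fit each model on the training data and compute its test misclassification rate
errors <- sapply(formulas, function(f) {
  m <- tree(f, data = training_data)
  p <- predict(m, testing_data, type = "class")
  mean(p != testing_High)
})
errors  # lower is better
```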