Assignment 2 PDF
Project Objective
The objective of this project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimal
regression model to predict customer satisfaction.
➢ Perform exploratory data analysis on the dataset with some charts and graphs; find outliers
and missing values.
➢ Check for evidence of multicollinearity.
➢ Perform simple linear regression of the dependent variable on every independent variable.
➢ Perform PCA/factor analysis by extracting 4 factors.
➢ Perform multiple linear regression with customer satisfaction as the dependent variable and the
four factors as independent variables.
Assumptions
The following observations are made from the statistics and insights of the dataset:
We assume that “Satisfaction” is the dependent variable and the rest of the variables are independent
variables.
Exploratory Data Analysis-Step by Step Approach
The various steps followed to analyse the case study are mentioned and explained below:
(3.1) Environment Set Up and Import the Data Set
The R packages used to analyse the data are listed below:
➢ readr → to read the csv data file.
➢ lattice → to display multivariate relationships.
➢ corrplot → to plot the correlation matrix.
➢ psych → for the KMO and Bartlett tests and factor analysis.
➢ nFactors → for factor analysis.
➢ car → to check multicollinearity.
➢ caTools → for partitioning the data.
Setting a working directory at the start of the R session makes importing and exporting data files and
code files easier. The working directory is simply the location/folder on the PC where I keep the data,
code, etc. related to the project.
Please refer to the appendix for the source code.
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.
The data in the file “Factor-Hair-Revised.csv” is stored in a data frame called “Hair”.
❖ attach: to refer to the variables in the data frame by their names alone, rather than as
components of the data frame.
❖ anyNA: to check the missing values in the dataset.
❖ dim: to check dimension (#rows/columns) of a data frame.
❖ str: Display internal structure of an R object.
❖ head: to return the first parts of a vector, matrix, table, data frame or function.
❖ tail: to return the last parts of a vector, matrix, table, data frame or function.
❖ summary: to produce result summaries of data and of various model-fitting functions.
❖ data.frame: to create tightly coupled collections of variables which share many of the
properties of matrices and of lists.
Working with Problem Statements
(i) Perform exploratory data analysis on the dataset. Showcase some charts,
graphs. Check for outliers and missing values.
To begin with, I set up the working directory and loaded the required packages.
Please refer to the Appendix for the code for all the problems and their solutions.
setwd("D:/PGP_BA_BI/PG_BA_BI/Advance Statistics/Assignment2") ## Setting up the working directory
###Perform EDA###
summary(Hair) ##To find out the summary of the data##
hair1=Hair[,-1] ##Remove the ID column, as it is just an identifier and is not required for the analysis##
##This creates a new data frame, hair1, with all variables except ID##
boxplot(hair1$Satisfaction,horizontal = TRUE,xlab=variables[12]) ## Box plot of the target variable##
boxplot(hair1[,-12], las = 2, names = variables[-12], cex.axis = 0.5) ## Box plot of the independent
variables##
##Now check for outliers in the dataset##
OutLiers <- sapply(hair1, function(x) boxplot.stats(x)$out) ## values beyond the boxplot whiskers ##
OutLiers
(ii) Evidence of Multicollinearity

##To check whether multicollinearity is evident in the dataset, we need to perform a series of
functions to verify the same.##
> corlnMtrx=cor(hair1[,2:11])
> corlnMtrx
                     Ecom       TechSup    CompRes Advertising    ProdLine SalesFImage  ComPricing
Ecom         1.0000000000  0.0008667887  0.1401793  0.42989071 -0.05268784  0.79154371  0.22946240
TechSup      0.0008667887  1.0000000000  0.0966566 -0.06287007  0.19262546  0.01699054 -0.27078668
CompRes      0.1401792611  0.0966565978  1.0000000  0.19691685  0.56141695  0.22975176 -0.12795425
Advertising  0.4298907110 -0.0628700668  0.1969168  1.00000000 -0.01155082  0.54220366  0.13421689
ProdLine    -0.0526878383  0.1926254565  0.5614170 -0.01155082  1.00000000 -0.06131553 -0.49494840
SalesFImage  0.7915437115  0.0169905395  0.2297518  0.54220366 -0.06131553  1.00000000  0.26459655
ComPricing   0.2294624014 -0.2707866821 -0.1279543  0.13421689 -0.49494840  0.26459655  1.00000000
WartyClaim   0.0518981915  0.7971679258  0.1404083  0.01079207  0.27307753  0.10745534 -0.24498605
OrdBilling   0.1561473316  0.0801018246  0.7568686  0.18423559  0.42440825  0.19512741 -0.11456703
DelSpeed     0.1916360683  0.0254406935  0.8650917  0.27586308  0.60185021  0.27155126 -0.07287173
WartyClaim OrdBilling DelSpeed
Ecom 0.05189819 0.15614733 0.19163607
TechSup 0.79716793 0.08010182 0.02544069
CompRes 0.14040830 0.75686859 0.86509170
Advertising 0.01079207 0.18423559 0.27586308
ProdLine 0.27307753 0.42440825 0.60185021
SalesFImage 0.10745534 0.19512741 0.27155126
ComPricing -0.24498605 -0.11456703 -0.07287173
WartyClaim 1.00000000 0.19706512 0.10939460
OrdBilling 0.19706512 1.00000000 0.75100307
DelSpeed 0.10939460 0.75100307 1.00000000
As we can see from the above correlation matrix:
1. CompRes and DelSpeed are highly correlated
2. OrdBilling and CompRes are highly correlated
3. WartyClaim and TechSup are highly correlated
4. OrdBilling and DelSpeed are highly correlated
5. Ecom and SalesFImage are highly correlated
The correlation between salesforce image and e-commerce is especially strong, as are the correlations
of delivery speed and order billing with complaint resolution, and between order billing and delivery
speed. We can safely assume that there is a high degree of collinearity among the independent
variables.
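Although the analysis here is in R, the arithmetic behind R's cor() is just Pearson's correlation coefficient; a minimal Python sketch (the helper name pearson_r is hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations (the formula behind cor())."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear toy data gives r = 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Values close to ±1, like the 0.865 between CompRes and DelSpeed above, are what flag collinear pairs.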
As the analysis suggests, the VIF for DelSpeed (Delivery Speed) is 6.516014, which is greater than
4, indicating the presence of multicollinearity.
The VIF for CompRes (Complaint Resolution) is also greater than 4, again suggesting
multicollinearity, which can destabilise the regression model.
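The VIF cutoff of 4 has a direct interpretation: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on all the other predictors. A Python sketch of this relationship (helper names are hypothetical):

```python
def vif_from_r2(r_squared):
    """VIF_j = 1 / (1 - R_j^2); larger R_j^2 means predictor j is
    better explained by the others, hence more collinear."""
    return 1.0 / (1.0 - r_squared)

def r2_from_vif(vif):
    """Invert the relation to recover R_j^2 from a reported VIF."""
    return 1.0 - 1.0 / vif

# DelSpeed's VIF of ~6.52 implies the other predictors explain ~85% of its variance
print(round(r2_from_vif(6.516014), 3))  # 0.847
```

So a VIF above 4 means more than 75% of that predictor's variance is already carried by the remaining predictors.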
(iii) Perform simple linear regression for the dependent variable with every
independent variable.
To start with the simple linear regression, each of the 11 independent variables was regressed
against the dependent variable, Satisfaction, using the lm function.
> lm.ProdQual=lm(Satisfaction~ProdQual,hair1)
> lm.ProdQual
Call:
lm(formula = Satisfaction ~ ProdQual, data = hair1)
Coefficients:
(Intercept) ProdQual
3.6759 0.4151
Satisfaction= 3.6759+0.4151*ProdQual
So, we can infer that for every one-unit increase in product quality, the satisfaction rating would
improve by 0.4151, keeping other things constant.
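Each fitted equation below comes from the ordinary least squares closed form; the arithmetic behind lm() with one predictor can be sketched in Python (hypothetical helper name and toy data):

```python
def simple_ols(x, y):
    """Closed-form simple OLS: slope = cov(x, y) / var(x),
    intercept = ybar - slope * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Toy data lying exactly on y = 2 + 0.5x recovers those coefficients
b0, b1 = simple_ols([2, 4, 6, 8], [3, 4, 5, 6])
print(b0, b1)  # 2.0 0.5
```

The same two-number output (intercept, slope) is what each of the Coefficients tables below reports.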
lm.Ecom=lm(Satisfaction~Ecom,hair1)
> lm.Ecom
Call:
lm(formula = Satisfaction ~ Ecom, data = hair1)
Coefficients:
(Intercept) Ecom
5.1516 0.4811
Satisfaction=5.1516+0.4811*Ecom
> lm.TechSup=lm(Satisfaction~TechSup,hair1)
> lm.TechSup
Call:
lm(formula = Satisfaction ~ TechSup, data = hair1)
Coefficients:
(Intercept) TechSup
6.44757 0.08768
Satisfaction=6.44757+0.08768*TechSup
> lm.CompRes=lm(Satisfaction~CompRes,hair1)
> lm.CompRes
Call:
lm(formula = Satisfaction ~ CompRes, data = hair1)
Coefficients:
(Intercept) CompRes
3.680 0.595
Satisfaction=3.680+0.595*CompRes
> lm.Advertising=lm(Satisfaction~Advertising,hair1)
> lm.Advertising
Call:
lm(formula = Satisfaction ~ Advertising, data = hair1)
Coefficients:
(Intercept) Advertising
5.6259 0.3222
Satisfaction=5.6259+0.3222*Advertising
> lm.ProdLine=lm(Satisfaction~ProdLine,hair1)
> lm.ProdLine
Call:
lm(formula = Satisfaction ~ ProdLine, data = hair1)
Coefficients:
(Intercept) ProdLine
4.0220 0.4989
Satisfaction=4.0220+0.4989*ProdLine
> lm.SalesFImage=lm(Satisfaction~SalesFImage,hair1)
> lm.SalesFImage
Call:
lm(formula = Satisfaction ~ SalesFImage, data = hair1)
Coefficients:
(Intercept) SalesFImage
4.070 0.556
Satisfaction=4.070+0.556*SalesFImage
> lm.ComPricing=lm(Satisfaction~ComPricing,hair1)
> lm.ComPricing
Call:
lm(formula = Satisfaction ~ ComPricing, data = hair1)
Coefficients:
(Intercept) ComPricing
8.0386 -0.1607
Satisfaction=8.0386+(-0.1607)*ComPricing
> lm.WartyClaim=lm(Satisfaction~WartyClaim,hair1)
> lm.WartyClaim
Call:
lm(formula = Satisfaction ~ WartyClaim, data = hair1)
Coefficients:
(Intercept) WartyClaim
5.3581 0.2581
Satisfaction=5.3581+0.2581*WartyClaim
> lm.OrdBilling=lm(Satisfaction~OrdBilling,hair1)
> lm.OrdBilling
Call:
lm(formula = Satisfaction ~ OrdBilling, data = hair1)
Coefficients:
(Intercept) OrdBilling
4.0541 0.6695
Satisfaction=4.0541+0.6695*OrdBilling
> lm.DelSpeed =lm(Satisfaction~DelSpeed,hair1)
> lm.DelSpeed
Call:
lm(formula = Satisfaction ~ DelSpeed, data = hair1)
Coefficients:
(Intercept) DelSpeed
3.2791 0.9364
Satisfaction=3.2791+0.9364*DelSpeed
(iv) Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and
name the Factors.
To perform the PCA/FA analysis, a series of tests needs to be performed, as described below along
with the analysis.
(a) To check the sphericity of the data, perform Bartlett's test: if the p-value < 0.05, we reject the
null hypothesis that the correlation matrix is an identity matrix, and we are good to go for the PCA/FA analysis.
cortest.bartlett(corlnMtrx,100)
$chisq
[1] 619.2726
$p.value
[1] 1.79337e-96
$df
[1] 55
As the p-value is less than 0.05, this is an ideal case for dimension reduction and we can proceed with
the PCA/FA analysis.
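For reference, the chi-square statistic reported by cortest.bartlett follows the standard formula chi² = −((n−1) − (2p+5)/6)·ln|R| with df = p(p−1)/2, which gives df = 55 for the p = 11 variables here. A Python sketch on a hypothetical 2×2 correlation matrix (where |R| = 1 − r²):

```python
import math

def bartlett_sphericity(det_R, n, p):
    """Bartlett's sphericity statistic from the determinant of the
    correlation matrix R; |R| near 1 (identity-like R) gives chi2 near 0."""
    chi2 = -((n - 1) - (2 * p + 5) / 6.0) * math.log(det_R)
    df = p * (p - 1) // 2
    return chi2, df

# Two variables with r = 0.8 and n = 100 observations:
chi2, df = bartlett_sphericity(1 - 0.8 ** 2, 100, 2)
print(round(chi2, 2), df)  # 99.61 1
```

A large chi-square relative to df yields the tiny p-value seen above, rejecting the identity-matrix null.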
(b) To check sampling adequacy in the data, perform the KMO (Kaiser-Meyer-Olkin) test: if the
MSA value is > 0.5, the data is sufficient for factor analysis.
KMO(corlnMtrx)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corlnMtrx)
Overall MSA = 0.65
MSA for each item =
   ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine SalesFImage  ComPricing
       0.51        0.63        0.52        0.79        0.78        0.62        0.62        0.75
 WartyClaim  OrdBilling    DelSpeed
       0.51        0.76        0.67
As the analysis shows an overall MSA of 0.65, which is greater than 0.5, we are good to go with factor analysis.
(c) The next step is to check the eigenvalues, which indicate how many factors the data can be
reduced to.
A=eigen(corlnMtrx)
> EV=A$values
> EV
[1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378 0.40151815 0.24695154 0.20355327
[10] 0.13284158 0.09842702
(d) The next step is to plot the eigenvalues in a scree plot to determine the number of factors
(Kaiser rule: eigenvalue >= 1; elbow rule).
plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", col
= "blue")
> lines(EV, col = "red")
> abline(h = 1, col = "green")
As the scree plot shows, four eigenvalues lie above the cutoff line, so we extract four factors
for the given dataset Factor-Hair-Revised.
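The Kaiser rule applied to the eigenvalues above can be sketched in a few lines of Python:

```python
# Eigenvalues reported by eigen(corlnMtrx) above
EV = [3.42697133, 2.55089671, 1.69097648, 1.08655606, 0.60942409,
      0.55188378, 0.40151815, 0.24695154, 0.20355327, 0.13284158, 0.09842702]

# Kaiser rule: retain only factors whose eigenvalue is >= 1
n_factors = sum(1 for ev in EV if ev >= 1)
print(n_factors)  # 4
```

This confirms the visual elbow-rule reading of the scree plot: four factors.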
(e) The second-last step is to reduce the dimensions into four factors as suggested by the above
analysis. So, we proceed with the PCA/factor analysis technique.
Test of the hypothesis that 4 components are sufficient.
Loadings:
             RC1    RC2    RC3    RC4
ProdQual                          0.876
Ecom                0.871
TechSup                    0.939
CompRes      0.926
Advertising         0.742
ProdLine                          0.642
SalesFImage         0.900
ComPricing                       -0.723
WartyClaim                 0.931
OrdBilling   0.864
DelSpeed     0.938
The 4 RCs explain about 80% of the cumulative variance in the dataset, which is a good number.
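The ~80% figure can be verified directly from the eigenvalues: for a correlation matrix the eigenvalues sum to the number of variables, so the share explained by the first four components is their sum over 11.

```python
# Eigenvalues of the correlation matrix, from the eigen() output above
EV = [3.42697133, 2.55089671, 1.69097648, 1.08655606, 0.60942409,
      0.55188378, 0.40151815, 0.24695154, 0.20355327, 0.13284158, 0.09842702]

# Proportion of total variance carried by the first 4 components
explained = sum(EV[:4]) / sum(EV)
print(round(explained, 3))  # 0.796
```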
##Now, just to explore, we perform factor analysis on the dataset too##
> fa <- fa(r= hair1[,-12], nfactors = 4, rotate = "varimax", fm="pa")
> fa
Factor Analysis using method = pa
Call: fa(r = corlnMtrx, nfactors = 4, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 PA4 h2 u2 com
ProdQual 0.02 -0.07 0.02 0.65 0.42 0.576 1.0
Ecom 0.07 0.79 0.03 -0.11 0.64 0.362 1.1
TechSup 0.02 -0.03 0.88 0.12 0.79 0.205 1.0
CompRes 0.90 0.13 0.05 0.13 0.84 0.157 1.1
Advertising 0.17 0.53 -0.04 -0.06 0.31 0.686 1.2
ProdLine 0.53 -0.04 0.13 0.71 0.80 0.200 1.9
SalesFImage 0.12 0.97 0.06 -0.13 0.98 0.021 1.1
ComPricing -0.08 0.21 -0.21 -0.59 0.44 0.557 1.6
WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1
The degrees of freedom for the null model are 55 and the objective function was 6.55
The degrees of freedom for the model are 17 and the objective function was 0.33
Both PCA and FA have revealed the factor structure; next we rename the factors and check the factor
scores.
hair2=cbind(hair1[,12],fa$scores)
> head(hair2)
PA1 PA2 PA3 PA4
[1,] 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
[2,] 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
[3,] 8.9 0.3637658 0.8361736 0.002979966 1.37548765
[4,] 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
[5,] 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
[6,] 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571
Now we rename the four factors from PA1, PA2, PA3, PA4 to Purchase, Marketing, Post Purchase and
Product Position.
colnames(hair2) <- c("Satisfaction","Purchase","Marketing","Post Purchase","Product Position")
> head(hair2)
Satisfaction Purchase Marketing Post Purchase Product Position
[1,] 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
[2,] 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
[3,] 8.9 0.3637658 0.8361736 0.002979966 1.37548765
[4,] 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
[5,] 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
[6,] 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571
The score matrix was converted into a data frame, and its variables, which are nothing but the FA
components, were given meaningful names for further analysis.
We achieved dimensionality reduction: just 4 factors can explain the 11 predictor variables of the
hair dataset through factor analysis.
model=lm(Satisfaction~.,hair2)
> summary(model)
Call:
lm(formula = Satisfaction ~ ., data = hair2)
Residuals:
Min 1Q Median 3Q Max
-1.7125 -0.4708 0.1024 0.4158 1.3483
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.06696 103.317 < 2e-16 ***
Purchase 0.57963 0.06857 8.453 3.32e-13 ***
Marketing 0.61978 0.06834 9.070 1.61e-14 ***
`Post Purchase` 0.05692 0.07173 0.794 0.429
`Product Position` 0.61168 0.07656 7.990 3.16e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Summary-
(a) The Pr(>|t|) value of the intercept (constant, beta-naught) shows that it is significant even at
the 0.001 level, so it is not zero and contributes to the regression model.
(b) The predictor variables Purchase, Marketing and Product Position have significant betas,
implying that the dependent variable Satisfaction is linearly associated with them.
(c) Post Purchase is the only variable with a high p-value, implying that its beta coefficient may
not be contributing significantly to the model, or may be zero.
(d) The extremely small p-value of the model's F-statistic gives evidence against the null
hypothesis; the model is statistically significant at this point.
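Using the fitted coefficients, a prediction for any set of factor scores is just the linear combination below; a Python sketch (the helper name is hypothetical, the coefficients are from the model summary above):

```python
# Fitted coefficients from the full-data model above
coef = {"Intercept": 6.91800, "Purchase": 0.57963, "Marketing": 0.61978,
        "Post Purchase": 0.05692, "Product Position": 0.61168}

def predict_satisfaction(scores):
    """Predicted Satisfaction = b0 + sum(b_j * factor score_j).
    Factor scores are standardized, so all-zero scores give the
    prediction for an average customer."""
    return coef["Intercept"] + sum(coef[k] * v for k, v in scores.items())

# A customer at the average of every factor is predicted at the intercept
print(predict_satisfaction({"Purchase": 0, "Marketing": 0,
                            "Post Purchase": 0, "Product Position": 0}))  # 6.918
```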
The data was then partitioned into training and test sets, and the model was refit on the training data:
Call:
lm(formula = Satisfaction ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-1.07153 -0.42578 -0.01652 0.38619 1.27831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.925120 0.066212 104.590 < 2e-16 ***
Purchase 0.553473 0.066208 8.360 6.02e-12 ***
Marketing 0.750898 0.068772 10.919 < 2e-16 ***
`Post Purchase` -0.001064 0.069597 -0.015 0.988
`Product Position` 0.525923 0.069971 7.516 1.94e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5554 on 66 degrees of freedom
Multiple R-squared: 0.8049, Adjusted R-squared: 0.7931
F-statistic: 68.09 on 4 and 66 DF, p-value: < 2.2e-16
> vif(lm2)
          Purchase          Marketing    `Post Purchase` `Product Position`
          1.011048           1.005816           1.000977           1.006056
> pred <- predict(model, newdata = test)
> summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.545 6.552 6.949 6.953 7.183 8.713
> SSE <- sum((pred - test$Satisfaction)^2)
> SSE
[1] 20.37887
> SST <- sum((mean(test$Satisfaction) - test$Satisfaction)^2)
> SST
[1] 36.26207
> rsq <- 1-SSE/SST
> rsq
[1] 0.4380114
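The hold-out R² computed above is simply 1 − SSE/SST; the same arithmetic can be sketched in Python (hypothetical helper name):

```python
def r_squared(pred, actual):
    """Hold-out R^2 = 1 - SSE/SST, as computed in the R session above:
    SSE measures prediction error, SST the variance around the mean."""
    mean_a = sum(actual) / len(actual)
    sse = sum((p - a) ** 2 for p, a in zip(pred, actual))
    sst = sum((mean_a - a) ** 2 for a in actual)
    return 1 - sse / sst

# With the SSE and SST reported above:
print(round(1 - 20.37887 / 36.26207, 7))  # 0.4380114
```

An R² well below the training value (0.80) indicates that predictive performance on unseen data is noticeably weaker than the in-sample fit.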
We can say that the "Satisfaction" ratings of the hair product depend very strongly on the overall
purchasing experience of the customer, i.e. how quickly the product is delivered and billed, and
whether complaints are resolved in the shortest possible time.
The factors Purchase, Marketing, Product Position are highly significant and Post Purchase is not
significant in the model.
Including an interaction model, we are able to make a better prediction, even though the interactions
did not give a significant increase compared to the individual variables.
So, we can infer that overall the model is valid and not overfit.
Source Code-Appendix
## (i) Perform exploratory data analysis on the dataset with some charts, graphs; find out outliers and missing values ##
setwd("D:/PGP_BA_BI/PG_BA_BI/Advance Statistics/Assignment2")
Hair=read.csv("Factor-Hair-Revised.csv",header = TRUE)
variables = c("Product Quality", "E-Commerce", "Technical Support", "Complaint Resolution",
"Advertising", "Product Line", "Salesforce Image", "Competitive Pricing",
"Warranty & Claims" , "Order & Billing" , "Delivery Speed" , "Customer Satisfaction") ## descriptive variable names
dim(Hair)
str(Hair)
names(Hair)
summary(Hair)
hair1=Hair[,-1]
str(hair1)
attach(hair1)
sum(is.na(hair1))
hist(hair1$Satisfaction, include.lowest=T, right=T, col=8, border=1, main = "Histogram of Customer Satisfaction")
boxplot(hair1$Satisfaction,horizontal = TRUE,xlab=variables[12])
OutLiers <- sapply(hair1, function(x) boxplot.stats(x)$out) ## values beyond the boxplot whiskers
OutLiers
corlnMtrx=cor(hair1[,-12])
corlnMtrx
lm1 = lm(Satisfaction~., data = hair1) ## full model, used to compute VIFs
vif(lm1)
class(corlnMtrx)
corlnMtrx=as.data.frame(corlnMtrx)
class(corlnMtrx)
##Perform simple linear regression for the dependent variable with every independent variable.
lm.ProdQual=lm(Satisfaction~ProdQual,hair1)
lm.ProdQual
Satisfaction= 3.6759+0.4151*ProdQual
lm.Ecom=lm(Satisfaction~Ecom,hair1)
lm.Ecom
Satisfaction=5.1516+0.4811*Ecom
lm.TechSup=lm(Satisfaction~TechSup,hair1)
lm.TechSup
Satisfaction=6.44757+0.08768*TechSup
lm.CompRes=lm(Satisfaction~CompRes,hair1)
lm.CompRes
Satisfaction=3.680+0.595*CompRes
lm.Advertising=lm(Satisfaction~Advertising,hair1)
lm.Advertising
Satisfaction=5.6259+0.3222*Advertising
lm.ProdLine=lm(Satisfaction~ProdLine,hair1)
lm.ProdLine
Satisfaction=4.0220+0.4989*ProdLine
lm.SalesFImage=lm(Satisfaction~SalesFImage,hair1)
lm.SalesFImage
Satisfaction=4.070+0.556*SalesFImage
lm.ComPricing=lm(Satisfaction~ComPricing,hair1)
lm.ComPricing
Satisfaction=8.0386+(-0.1607)*ComPricing
lm.WartyClaim=lm(Satisfaction~WartyClaim,hair1)
lm.WartyClaim
Satisfaction=5.3581+0.2581*WartyClaim
lm.OrdBilling=lm(Satisfaction~OrdBilling,hair1)
lm.OrdBilling
Satisfaction=4.0541+0.6695*OrdBilling
lm.DelSpeed =lm(Satisfaction~DelSpeed,hair1)
lm.DelSpeed
Satisfaction=3.2791+0.9264*DelSpeed
##Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the Factors.
cortest.bartlett(corlnMtrx,100)
KMO(corlnMtrx)
A=eigen(corlnMtrx)
EV=A$values
EV
plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", col = "blue")
lines(EV, col = "red")
abline(h = 1, col = "green")
pca = principal(hair1[,-12], nfactors = 4, rotate = "varimax") ## PCA with varimax rotation (psych)
pca
fa <- fa(r= hair1[,-12], nfactors = 4, rotate = "varimax", fm="pa")
fa
fa.diagram(fa)
hair2=cbind(hair1[,12],fa$scores)
head(hair2)
head(hair2)
class(hair2)
hair2=as.data.frame(hair2)
class(hair2)
model=lm(Satisfaction~.,hair2)
summary(model)
set.seed(1234)
library(caTools)
split = sample.split(hair2$Satisfaction, SplitRatio = 0.7) ## partition the data
train = subset(hair2, split == TRUE)
test = subset(hair2, split == FALSE)
dim(train)
dim(test)
lm2=lm(Satisfaction~.,data=train)
summary(lm2)
vif(lm2)
pred = predict(model, newdata = test) ## predictions on the hold-out test set
summary(pred)
SSE = sum((pred - test$Satisfaction)^2)
SSE
SST = sum((mean(test$Satisfaction) - test$Satisfaction)^2)
SST
rsq = 1 - SSE/SST
rsq