
Table of Contents

(1) Project Objective
(2) Assumptions
(3) Exploratory Data Analysis - Step by Step Approach
(3.1) Environment Set Up and Import the Data Set
(3.1.1) Install necessary packages and invoke libraries
(3.1.2) Set up working directory
(3.1.3) Import the Data Set in R
(4) Variable Identification
(5) Working with Problem Statements
(6) Solutions and Conclusions to Problem Statements
(7) Conclusion & Inference to the Problem Statements
(8) Source Code - Appendix

Project Objective

The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum
regression model to predict customer satisfaction.
➢ Perform exploratory data analysis on the dataset with some charts and graphs; find outliers
and missing values.
➢ Check for evidence of multicollinearity.
➢ Perform simple linear regression of the dependent variable on every independent variable.
➢ Perform PCA/Factor analysis by extracting 4 factors.
➢ Perform multiple linear regression with customer satisfaction as the dependent variable and the
four factors as independent variables.
Assumptions

The following assumptions are made from the statistics and insights of the data set:
We assume that "Satisfaction" is the dependent variable and the rest of the variables are independent
variables.
Exploratory Data Analysis-Step by Step Approach

The various steps followed to analyse the case study are mentioned and explained below:
(3.1) Environment Set Up and Import the Data Set

(3.1.1) Install necessary packages and invoke libraries

The R packages used to analyse the data are listed below (a short installation sketch follows the list):
➢ readr package → to read the csv data file.
➢ lattice package → to display multivariate relationships.
➢ corrplot package → to plot the correlation matrix.
➢ psych package → for PCA/Factor analysis (principal, fa, KMO, cortest.bartlett).
➢ nFactors package → to help determine the number of factors.
➢ car package → to check multicollinearity (vif).
➢ caTools package → for partitioning the data.
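Before the libraries can be invoked they must be installed once. A minimal sketch (package names as
assumed above; the psych package supplies the principal, fa, KMO and cortest.bartlett functions used later):

## One-time installation, then load the libraries each session
install.packages(c("readr", "lattice", "corrplot", "psych", "nFactors", "car", "caTools"))
library(corrplot)  # correlation plots
library(psych)     # principal(), fa(), KMO(), cortest.bartlett()
library(nFactors)  # scree-based checks on the number of factors
library(car)       # vif()
library(caTools)   # sample.split()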

(3.1.2) Set up working directory

Setting a working directory at the start of the R session makes importing and exporting data files and
code files easier. Basically, the working directory is the location/folder on the PC where the data,
code, etc. related to the project are kept.
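A minimal illustration (the path shown is the machine-specific one used in the appendix):

setwd("D:/PGP_BA_BI/PG_BA_BI/Advance Statistics/Assignment2")  ## point R at the project folder
getwd()  ## confirm the current working directory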
Please refer to the appendix for the source code.

(3.1.3) Import the Data Set in R

The given dataset is in .csv format. Hence, the command 'read.csv' is used for importing the file.
The data in the file "Factor-Hair-Revised.csv" is stored in a data frame called "Hair".

Please refer to the appendix for source code.


(4) Variable Identification

❖ attach: to refer to the variables in the data frame by their names alone, rather than as
components of the data frame.
❖ anyNA: to check for missing values in the dataset.
❖ dim: to check the dimension (number of rows/columns) of a data frame.
❖ str: to display the internal structure of an R object.
❖ head: to return the first part of a vector, matrix, table, data frame or function.
❖ tail: to return the last part of a vector, matrix, table, data frame or function.
❖ summary: to produce summaries of the results of various model-fitting functions.
❖ data.frame: tightly coupled collections of variables which share many of the
properties of matrices and of lists.

Working with Problem Statements

(i) Perform exploratory data analysis on the dataset. Showcase some charts,
graphs. Check for outliers and missing values.

Solutions and Conclusions to Q1 - Step-by-step approach

To begin with, I set up the working directory, and the variable-identification functions described
above are then applied.
Please refer to the Appendix for the code for all the problems and their solutions.
## setwd("D:/PGP_BA_BI/PG_BA_BI/Advance Statistics/Assignment2") ## Setting up the working directory

Hair=read.csv("Factor-Hair-Revised.csv",header = TRUE)##Reading the files in R##


attach(Hair)
variables <- c("Product Quality", "E-Commerce", "Technical Support", "Complaint Resolution",
"Advertising", "Product Line", "Salesforce Image", "Competitive Pricing",
"Warranty & Claims", "Order & Billing", "Delivery Speed", "Customer Satisfaction") ## Saving variable names in a character vector ##

###Perform EDA###

dim(Hair)##Check the Dimensions of the data##


> dim(Hair)
[1] 100 13

str(Hair)##Check the Structure of the data##


> str(Hair)
'data.frame': 100 obs. of 13 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...

names(Hair) ##To Find out names of the columns##


> names(Hair)
 [1] "ID"          "ProdQual"    "Ecom"        "TechSup"     "CompRes"     "Advertising" "ProdLine"    "SalesFImage"
 [9] "ComPricing"  "WartyClaim"  "OrdBilling"  "DelSpeed"    "Satisfaction"

summary(Hair) ##To Find out summary of the data##

hair1=Hair[,-1] ##Remove the ID column, as it is only a row identifier and is not required for the
analysis##

##Create a new data frame hair1 with all variables except ID##

str(hair1) ##Check the Structure of the new data hair1##


> str(hair1)
'data.frame': 100 obs. of 12 variables:
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...

attach(hair1)##Attach the new data##


> attach(hair1)

sum(is.na(hair1))##Check for missing values##


sum(is.na(hair1))
[1] 0

##Plot histogram of the target variable##
hist(hair1$Satisfaction, breaks = c(0:11), labels = TRUE,
     include.lowest = TRUE, right = TRUE, col = 8, border = 1,
     main = "Histogram of Customer Satisfaction",
     xlab = "Customer Satisfaction", ylab = "Numbers")

boxplot(hair1$Satisfaction, horizontal = TRUE, xlab = variables[12]) ##Box plot of the target
variable##

boxplot(hair1[,-12], las = 2, names = variables[-12], cex.axis = 0.5) ##Box plot of the independent
variables##

##Now to check for outliers in the dataset, counting per variable with the 1.5*IQR boxplot rule##
OutLiers <- sapply(hair1, function(x) length(boxplot.stats(x)$out))
OutLiers

(ii) Is there evidence of multicollinearity? Showcase your analysis.

##To check whether multicollinearity is evident in the dataset, we need to perform a series of
checks, described below.##

Step 1 → Create a correlation matrix.
corlnMtrx = cor(hair1[,-12]) ##correlation matrix over all 11 predictors (the appendix uses the same call)##

corlnMtrx
The matrix below shows the values rounded to 3 decimals (the ProdQual row/column is omitted in this
excerpt):

             Ecom TechSup CompRes Advertising ProdLine SalesFImage ComPricing WartyClaim OrdBilling DelSpeed
Ecom        1.000   0.001   0.140       0.430   -0.053       0.792      0.229      0.052      0.156    0.192
TechSup     0.001   1.000   0.097      -0.063    0.193       0.017     -0.271      0.797      0.080    0.025
CompRes     0.140   0.097   1.000       0.197    0.561       0.230     -0.128      0.140      0.757    0.865
Advertising 0.430  -0.063   0.197       1.000   -0.012       0.542      0.134      0.011      0.184    0.276
ProdLine   -0.053   0.193   0.561      -0.012    1.000      -0.061     -0.495      0.273      0.424    0.602
SalesFImage 0.792   0.017   0.230       0.542   -0.061       1.000      0.265      0.107      0.195    0.272
ComPricing  0.229  -0.271  -0.128       0.134   -0.495       0.265      1.000     -0.245     -0.115   -0.073
WartyClaim  0.052   0.797   0.140       0.011    0.273       0.107     -0.245      1.000      0.197    0.109
OrdBilling  0.156   0.080   0.757       0.184    0.424       0.195     -0.115      0.197      1.000    0.751
DelSpeed    0.192   0.025   0.865       0.276    0.602       0.272     -0.073      0.109      0.751    1.000

Step 2 → Use the corrplot function to visualise the correlation matrix.


corrplot(corlnMtrx, method = "number", number.cex = 0.7)
> corrplot(corlnMtrx, method = "number", number.cex = 0.7)

As we can see from the above correlation matrix:
1. CompRes and DelSpeed are highly correlated
2. OrdBilling and CompRes are highly correlated
3. WartyClaim and TechSup are highly correlated
4. OrdBilling and DelSpeed are highly correlated
5. Ecom and SalesFImage are highly correlated
The correlation between salesforce image and e-commerce is notably strong, as are the correlations
of delivery speed and order & billing with complaint resolution, and between order & billing and
delivery speed. We can safely assume that there is a high degree of collinearity between the
independent variables.

##Now, to check the evidence of multicollinearity##

To quantify the evidence of multicollinearity amongst the variables of the hair dataset, the Variance
Inflation Factor (VIF) was used. A VIF value > 4 suggests the presence of multicollinearity amongst
the predictor variables.

lm1=lm(Satisfaction ~ . , data = hair1)


vif(lm1)
lm1=lm(Satisfaction ~ . , data = hair1)
> vif(lm1)
   ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine
   1.635797    2.756694    2.976796    4.730448    1.508933    3.488185
SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed
   3.439420    1.635000    3.198337    2.902999    6.516014

The analysis shows that the VIF for DelSpeed (Delivery Speed) is 6.516014, which is greater than 4,
suggesting the presence of multicollinearity. CompRes (Complaint Resolution) also has a VIF greater
than 4, suggesting multicollinearity that can destabilise the regression model.
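A common remedy, sketched here for illustration only (it is not part of the analysis that follows), is
to drop the highest-VIF predictor and recheck the remaining values:

## Refit without DelSpeed (VIF = 6.52) and recheck VIFs (vif from the car package)
lm1b <- lm(Satisfaction ~ . - DelSpeed, data = hair1)
vif(lm1b)  ## the remaining VIFs should move closer to the acceptable range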
(iii) Perform simple linear regression for the dependent variable with every
independent variable.

To start with the simple linear regression, each of the 11 independent variables is regressed on the
dependent variable Satisfaction using the lm function.
> lm.ProdQual=lm(Satisfaction~ProdQual,hair1)
> lm.ProdQual

Call:
lm(formula = Satisfaction ~ ProdQual, data = hair1)

Coefficients:
(Intercept) ProdQual
3.6759 0.4151
Satisfaction= 3.6759+0.4151*ProdQual

Intercept Coefficient is equal to 3.6759

Slope or the variable coefficient of Product Quality is equal to 0.4151

So, we can infer that for a one-unit change in product quality, the satisfaction rating would improve
by 0.4151, keeping other things constant.
lm.Ecom=lm(Satisfaction~Ecom,hair1)
> lm.Ecom

Call:
lm(formula = Satisfaction ~ Ecom, data = hair1)

Coefficients:
(Intercept) Ecom
5.1516 0.4811
Satisfaction=5.1516+0.4811*Ecom
> lm.TechSup=lm(Satisfaction~TechSup,hair1)
> lm.TechSup

Call:
lm(formula = Satisfaction ~ TechSup, data = hair1)

Coefficients:
(Intercept) TechSup
6.44757 0.08768
Satisfaction=6.44757+0.08768*TechSup
> lm.CompRes=lm(Satisfaction~CompRes,hair1)
> lm.CompRes

Call:
lm(formula = Satisfaction ~ CompRes, data = hair1)

Coefficients:
(Intercept) CompRes
3.680 0.595
Satisfaction=3.680+0.595*CompRes
> lm.Advertising=lm(Satisfaction~Advertising,hair1)
> lm.Advertising

Call:
lm(formula = Satisfaction ~ Advertising, data = hair1)
Coefficients:
(Intercept) Advertising
5.6259 0.3222
Satisfaction=5.6259+0.3222*Advertising
> lm.ProdLine=lm(Satisfaction~ProdLine,hair1)
> lm.ProdLine

Call:
lm(formula = Satisfaction ~ ProdLine, data = hair1)

Coefficients:
(Intercept) ProdLine
4.0220 0.4989
Satisfaction=4.0220+0.4989*ProdLine
> lm.SalesFImage=lm(Satisfaction~SalesFImage,hair1)
> lm.SalesFImage
Call:
lm(formula = Satisfaction ~ SalesFImage, data = hair1)

Coefficients:
(Intercept) SalesFImage
4.070 0.556
Satisfaction=4.070+0.556*SalesFImage
> lm.ComPricing=lm(Satisfaction~ComPricing,hair1)
> lm.ComPricing

Call:
lm(formula = Satisfaction ~ ComPricing, data = hair1)

Coefficients:
(Intercept) ComPricing
8.0386 -0.1607
Satisfaction=8.0386+(-0.1607)*ComPricing
> lm.WartyClaim=lm(Satisfaction~WartyClaim,hair1)
> lm.WartyClaim

Call:
lm(formula = Satisfaction ~ WartyClaim, data = hair1)

Coefficients:
(Intercept) WartyClaim
5.3581 0.2581
Satisfaction=5.3581+0.2581*WartyClaim
> lm.OrdBilling=lm(Satisfaction~OrdBilling,hair1)
> lm.OrdBilling
Call:
lm(formula = Satisfaction ~ OrdBilling, data = hair1)
Coefficients:
(Intercept) OrdBilling
4.0541 0.6695
Satisfaction=4.0541+0.6695*OrdBilling
> lm.DelSpeed =lm(Satisfaction~DelSpeed,hair1)
> lm.DelSpeed

Call:

lm(formula = Satisfaction ~ DelSpeed, data = hair1)

Coefficients:
(Intercept) DelSpeed
3.2791 0.9364
Satisfaction = 3.2791 + 0.9364*DelSpeed
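The eleven regressions above can also be produced in a single pass. A compact sketch over the same
hair1 data frame (the names predictors and slr are illustrative helpers):

predictors <- names(hair1)[1:11]  ## all independent variables
slr <- lapply(predictors, function(v)
  lm(reformulate(v, response = "Satisfaction"), data = hair1))
names(slr) <- predictors
## one row of (Intercept, Slope) per predictor
t(sapply(slr, function(m) setNames(coef(m), c("Intercept", "Slope"))))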

(iv) Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and
name the Factors.

To perform the PCA/FA analysis, a series of tests needs to be performed, which are described
below together with the analysis.

(a) To check the sphericity of the data, perform Bartlett's test: if the p-value < 0.05, we reject
the null hypothesis that the correlation matrix is an identity matrix, so the variables are
correlated enough for PCA/FA.

cortest.bartlett(corlnMtrx,100)
$chisq
[1] 619.2726

$p.value
[1] 1.79337e-96

$df
[1] 55
As the p-value is far below 0.05, this is an ideal case for dimension reduction and we can proceed
with the PCA/FA analysis.

(b) To check Sample Adequacy in the data-Perform KMO (Kaiser-Meyer-Olkin) test, where if
MSA value >0.5, the data is sufficient for Factor Analysis.
KMO(corlnMtrx)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corlnMtrx)
Overall MSA = 0.65
MSA for each item =
   ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine
       0.51        0.63        0.52        0.79        0.78        0.62
SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed
       0.62        0.75        0.51        0.76        0.67

As the analysis shows an overall MSA value greater than 0.5, we are good to go with factor analysis.

(c) The next step is to compute the eigenvalues of the correlation matrix, to determine how many
factors the data can be reduced to.
A=eigen(corlnMtrx)
> EV=A$values
> EV
 [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
 [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702

(d) The next step is to plot the eigenvalues as a scree plot to determine the number of factors
(Kaiser rule: eigenvalue >= 1; elbow rule).

plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", col = "blue")
> lines(EV, col = "red")
> abline(h = 1, col = "green")

As the scree plot shows, four eigenvalues lie above the line, so we extract four factors for the
given data set Factor-Hair-Revised. The nFactors package listed earlier can corroborate this choice,
as sketched below.
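A minimal sketch using nFactors, following the package's documented parallel-analysis example
(assuming the package is installed; 100 subjects and 11 variables match the hair1 predictors):

ap <- parallel(subject = 100, var = 11, rep = 100, cent = 0.05)  ## eigenvalues of random data
nS <- nScree(x = EV, aparallel = ap$eigen$qevpea)                ## Kaiser, scree and parallel criteria
plotnScree(nS)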

(e) The penultimate step is to reduce the dimensions into four factors as suggested by the above
analysis. So, we will proceed with the PCA/factor analysis technique.

##PCA without Rotation##


> pca <- principal(corlnMtrx, nfactors = 4, rotate = "none")
> pca
Principal Components Analysis
Call: principal(r = corlnMtrx, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 PC4 h2 u2 com
ProdQual 0.25 -0.50 -0.08 0.67 0.77 0.232 2.2
Ecom 0.31 0.71 0.31 0.28 0.78 0.223 2.1
TechSup 0.29 -0.37 0.79 -0.20 0.89 0.107 1.9
CompRes 0.87 0.03 -0.27 -0.22 0.88 0.119 1.3
Advertising 0.34 0.58 0.11 0.33 0.58 0.424 2.4
ProdLine 0.72 -0.45 -0.15 0.21 0.79 0.213 2.0
SalesFImage 0.38 0.75 0.31 0.23 0.86 0.141 2.1
ComPricing -0.28 0.66 -0.07 -0.35 0.64 0.359 1.9
WartyClaim 0.39 -0.31 0.78 -0.19 0.89 0.108 2.0
OrdBilling 0.81 0.04 -0.22 -0.25 0.77 0.234 1.3
DelSpeed 0.88 0.12 -0.30 -0.21 0.91 0.086 1.4

                       PC1  PC2  PC3  PC4
SS loadings           3.43 2.55 1.69 1.09
Proportion Var 0.31 0.23 0.15 0.10
Cumulative Var 0.31 0.54 0.70 0.80
Proportion Explained 0.39 0.29 0.19 0.12
Cumulative Proportion 0.39 0.68 0.88 1.00

Mean item complexity = 1.9

Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06


Fit based upon off diagonal values = 0.97
> print(pca$loadings, cutoff = 0.5)
Loadings:
            PC1    PC2    PC3    PC4
ProdQual          -0.501         0.670
Ecom               0.713
TechSup                   0.794
CompRes     0.871
Advertising        0.581
ProdLine    0.716
SalesFImage        0.752
ComPricing         0.660
WartyClaim                0.778
OrdBilling  0.809
DelSpeed    0.876

                        PC1   PC2   PC3   PC4
SS loadings           3.427 2.551 1.691 1.087
Proportion Var 0.312 0.232 0.154 0.099
Cumulative Var 0.312 0.543 0.697 0.796

##PCA with Varimax Rotation##


> pca_r <- principal(corlnMtrx, nfactors = 4, rotate = "varimax")
> pca_r
Principal Components Analysis
Call: principal(r = corlnMtrx, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 RC4 h2 u2 com
ProdQual 0.00 -0.01 -0.03 0.88 0.77 0.232 1.0
Ecom 0.06 0.87 0.05 -0.12 0.78 0.223 1.1
TechSup 0.02 -0.02 0.94 0.10 0.89 0.107 1.0
CompRes 0.93 0.12 0.05 0.09 0.88 0.119 1.1
Advertising 0.14 0.74 -0.08 0.01 0.58 0.424 1.1
ProdLine 0.59 -0.06 0.15 0.64 0.79 0.213 2.1
SalesFImage 0.13 0.90 0.08 -0.16 0.86 0.141 1.1
ComPricing -0.09 0.23 -0.25 -0.72 0.64 0.359 1.5
WartyClaim 0.11 0.05 0.93 0.10 0.89 0.108 1.1
OrdBilling 0.86 0.11 0.08 0.04 0.77 0.234 1.1
DelSpeed 0.94 0.18 0.00 0.05 0.91 0.086 1.1

                       RC1  RC2  RC3  RC4
SS loadings           2.89 2.23 1.86 1.77
Proportion Var 0.26 0.20 0.17 0.16
Cumulative Var 0.26 0.47 0.63 0.80
Proportion Explained 0.33 0.26 0.21 0.20
Cumulative Proportion 0.33 0.59 0.80 1.00

Mean item complexity = 1.2


Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06

Fit based upon off diagonal values = 0.97


>
> print(pca_r$loadings, cutoff = 0.6)

Loadings:
            RC1    RC2    RC3    RC4
ProdQual                         0.876
Ecom               0.871
TechSup                   0.939
CompRes     0.926
Advertising        0.742
ProdLine                         0.642
SalesFImage        0.900
ComPricing                      -0.723
WartyClaim                0.931
OrdBilling  0.864
DelSpeed    0.938

                        RC1   RC2   RC3   RC4
SS loadings           2.893 2.234 1.856 1.774
Proportion Var 0.263 0.203 0.169 0.161
Cumulative Var 0.263 0.466 0.635 0.796

>

The 4 RCs explain about 80% of the cumulative variance in the dataset, which is a good number.
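This 80% figure can be cross-checked directly from the eigenvalues computed in step (c), since the
trace of the 11-variable correlation matrix is 11:

cumsum(EV) / sum(EV)  ## cumulative proportion of variance; the 4th entry is about 0.80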

##Now, just to explore, we will also run a Factor Analysis on the dataset##
> fa <- fa(r = hair1[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
> fa
Factor Analysis using method = pa
Call: fa(r = hair1[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 PA4 h2 u2 com
ProdQual 0.02 -0.07 0.02 0.65 0.42 0.576 1.0
Ecom 0.07 0.79 0.03 -0.11 0.64 0.362 1.1
TechSup 0.02 -0.03 0.88 0.12 0.79 0.205 1.0
CompRes 0.90 0.13 0.05 0.13 0.84 0.157 1.1
Advertising 0.17 0.53 -0.04 -0.06 0.31 0.686 1.2
ProdLine 0.53 -0.04 0.13 0.71 0.80 0.200 1.9
SalesFImage 0.12 0.97 0.06 -0.13 0.98 0.021 1.1
ComPricing -0.08 0.21 -0.21 -0.59 0.44 0.557 1.6
WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1

                       PA1  PA2  PA3  PA4
SS loadings           2.63 1.97 1.64 1.37
Proportion Var 0.24 0.18 0.15 0.12
Cumulative Var 0.24 0.42 0.57 0.69
Proportion Explained 0.35 0.26 0.22 0.18
Cumulative Proportion 0.35 0.60 0.82 1.00

Mean item complexity = 1.2


Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are 55 and the objective function was 6.55
The degrees of freedom for the model are 17 and the objective function was 0.33

The root mean square of the residuals (RMSR) is 0.02


The df corrected root mean square of the residuals is 0.03

Fit based upon off diagonal values = 1


Measures of factor score adequacy
PA1 PA2 PA3 PA4
Correlation of (regression) scores with factors 0.98 0.99 0.94 0.88
Multiple R square of scores with factors 0.96 0.97 0.88 0.78
Minimum correlation of possible factor scores 0.93 0.94 0.77 0.55

##Now, we will check the factor structure with a factor diagram (fa.diagram(fa), as in the appendix)##

Both the PCA and the FA have identified the factors; now we rename the factors and check the
factor scores.

(f) Next, rename the four factors.

hair2=cbind(hair1[,12],fa$scores)
> head(hair2)
PA1 PA2 PA3 PA4
[1,] 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
[2,] 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
[3,] 8.9 0.3637658 0.8361736 0.002979966 1.37548765
[4,] 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
[5,] 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
[6,] 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571

Now, rename the four factors from PA1, PA2, PA3, PA4 to Purchase, Marketing, Post Purchase and
Product Position.

colnames(hair2) <- c("Satisfaction","Purchase","Marketing","Post Purchase","Product Position")
> head(hair2)
Satisfaction Purchase Marketing Post Purchase Product Position
[1,] 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
[2,] 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
[3,] 8.9 0.3637658 0.8361736 0.002979966 1.37548765
[4,] 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
[5,] 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
[6,] 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571

The score matrix was converted into a data frame, and its variables, which are the FA components,
were given meaningful names for further analysis.

We achieved a dimensionality reduction: just 4 factors explain the 11 predictor variables of the
hair dataset through the FA analysis.
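One practical note: cbind() returns a matrix, so hair2 must be coerced to a data frame before lm()
can model it (the appendix does exactly this):

hair2 <- as.data.frame(hair2)  ## lm(Satisfaction ~ ., hair2) below expects a data frame
class(hair2)                   ## should now print "data.frame"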

(v) Perform multiple linear regression with customer satisfaction as the dependent
variable and the four factors as independent variables. Comment on the
model output and validity.

model=lm(Satisfaction~.,hair2)
> summary(model)

Call:
lm(formula = Satisfaction ~ ., data = hair2)

Residuals:
Min 1Q Median 3Q Max
-1.7125 -0.4708 0.1024 0.4158 1.3483

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.06696 103.317 < 2e-16 ***
Purchase 0.57963 0.06857 8.453 3.32e-13 ***
Marketing 0.61978 0.06834 9.070 1.61e-14 ***
`Post Purchase` 0.05692 0.07173 0.794 0.429
`Product Position` 0.61168 0.07656 7.990 3.16e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6696 on 95 degrees of freedom


Multiple R-squared: 0.6971, Adjusted R-squared: 0.6844
F-statistic: 54.66 on 4 and 95 DF, p-value: < 2.2e-16

Summary-

(a) The Pr(>|t|) value of the intercept (constant, beta-naught) shows that it is significant even at
the 0.001 level, so it is not zero and contributes to the regression model.
(b) The predictor variables Purchase, Marketing and Product Position have significant betas,
implying that the dependent variable Satisfaction is linearly associated with them.
(c) Post Purchase is the only variable with a high p-value, implying that its beta coefficient may
not contribute significantly to the model, or may be zero.

(d) The extremely small p-value of the model's F-statistic gives evidence against the null
hypothesis; the model is significantly valid at this point.
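As a further, optional validity check, the standard residual diagnostics can be plotted for the
fitted model; a minimal base-R sketch:

par(mfrow = c(2, 2))
plot(model)  ## residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))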

##Now, to train and test the model##


> set.seed(1234)
> library(caTools)
> index=sample.split(hair2$Satisfaction, SplitRatio = 0.7)
> train <- subset(hair2, index == TRUE)
> test <- subset(hair2, index == FALSE)
> dim(train)
[1] 71 5
> dim(test)
[1] 29 5
> lm2=lm(Satisfaction~.,data=train)
> summary(lm2)

Call:
lm(formula = Satisfaction ~ ., data = train)

Residuals:
Min 1Q Median 3Q Max
-1.07153 -0.42578 -0.01652 0.38619 1.27831

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.925120 0.066212 104.590 < 2e-16 ***
Purchase 0.553473 0.066208 8.360 6.02e-12 ***
Marketing 0.750898 0.068772 10.919 < 2e-16 ***
`Post Purchase` -0.001064 0.069597 -0.015 0.988
`Product Position` 0.525923 0.069971 7.516 1.94e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5554 on 66 degrees of freedom
Multiple R-squared: 0.8049, Adjusted R-squared: 0.7931
F-statistic: 68.09 on 4 and 66 DF, p-value: < 2.2e-16

> vif(lm2)
       Purchase       Marketing `Post Purchase` `Product Position`
       1.011048        1.005816        1.000977           1.006056
> pred <- predict(model, newdata = test)
> summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.545 6.552 6.949 6.953 7.183 8.713
> SSE <- sum((pred - test$Satisfaction)^2)
> SSE
[1] 20.37887
> SST <- sum((mean(test$Satisfaction) - test$Satisfaction)^2)
> SST
[1] 36.26207
> rsq <- 1-SSE/SST
> rsq
[1] 0.4380114

We can say that the "Satisfaction" rating of the hair product depends very highly on the overall
purchasing experience of the customer, i.e. how quickly the product is delivered and billed, and
whether complaints are resolved in the shortest possible time.

The factors Purchase, Marketing and Product Position are highly significant, while Post Purchase is
not significant in the model.

By including an interaction model we were able to make a slightly better prediction, even though the
interaction did not give a significant increase compared to the individual variables.

So, we can infer that, overall, the model is valid and not overfit.
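A quick numeric check of overfitting (a sketch; note that the transcript above scores the full-data
model on the test partition, whereas scoring the train-only model lm2 avoids any leakage):

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(train$Satisfaction, predict(lm2, newdata = train))  ## in-sample error
rmse(test$Satisfaction,  predict(lm2, newdata = test))   ## out-of-sample error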

Source Code-Appendix

## (i) Perform exploratory data analysis on the dataset with some charts,
graphs; find out outliers and missing values ##

setwd("D:/PGP_BA_BI/PG_BA_BI/Advance Statistics/Assignment2")

Hair=read.csv("Factor-Hair-Revised.csv",header = TRUE)

variables <- c("Product Quality" , "E-Commerce" , "Technical Support" , "Complaint Resolution" ,

"Advertising" , "Product Line" , "Salesforce Image", "Competitive Pricing" ,

"Warranty & Claims" , "Order & Billing" , "Delivery Speed" , "Customer Satisfaction")

dim(Hair)

str(Hair)

names(Hair)

summary(Hair)

hair1=Hair[,-1]

str(hair1)

attach(hair1)

sum(is.na(hair1))

hist(hair1$Satisfaction, breaks = c(0:11), labels = T,
     include.lowest = T, right = T, col = 8, border = 1, main = "Histogram of Customer Satisfaction",
     xlab = "Customer Satisfaction", ylab = "Numbers")

boxplot(hair1$Satisfaction,horizontal = TRUE,xlab=variables[12])

boxplot(hair1[,-12], las = 2, names = variables[-12], cex.axis = 0.5)

## Count outliers per variable using the 1.5*IQR boxplot rule ##
OutLiers <- sapply(hair1, function(x) length(boxplot.stats(x)$out))

OutLiers

##Is there evidence of multicollinearity? Showcase your analysis##

## Create Correlation Matrix ##

corlnMtrx=cor(hair1[,-12])

corlnMtrx

corrplot(corlnMtrx, method = "number", number.cex = 0.7)

lm1=lm(Satisfaction ~ . , data = hair1)

vif(lm1)

class(corlnMtrx)

corlnMtrx=as.data.frame(corlnMtrx)

class(corlnMtrx)

##Perform simple linear regression for the dependent variable with every independent variable##

lm.ProdQual=lm(Satisfaction~ProdQual,hair1)

lm.ProdQual

## Satisfaction = 3.6759 + 0.4151*ProdQual

lm.Ecom=lm(Satisfaction~Ecom,hair1)

lm.Ecom

## Satisfaction = 5.1516 + 0.4811*Ecom

lm.TechSup=lm(Satisfaction~TechSup,hair1)

lm.TechSup

## Satisfaction = 6.44757 + 0.08768*TechSup

lm.CompRes=lm(Satisfaction~CompRes,hair1)

lm.CompRes

## Satisfaction = 3.680 + 0.595*CompRes

lm.Advertising=lm(Satisfaction~Advertising,hair1)

lm.Advertising

## Satisfaction = 5.6259 + 0.3222*Advertising

lm.ProdLine=lm(Satisfaction~ProdLine,hair1)

lm.ProdLine

## Satisfaction = 4.0220 + 0.4989*ProdLine

lm.SalesFImage=lm(Satisfaction~SalesFImage,hair1)

lm.SalesFImage

## Satisfaction = 4.070 + 0.556*SalesFImage

lm.ComPricing=lm(Satisfaction~ComPricing,hair1)

lm.ComPricing

## Satisfaction = 8.0386 - 0.1607*ComPricing

lm.WartyClaim=lm(Satisfaction~WartyClaim,hair1)

lm.WartyClaim

## Satisfaction = 5.3581 + 0.2581*WartyClaim

lm.OrdBilling=lm(Satisfaction~OrdBilling,hair1)

lm.OrdBilling

## Satisfaction = 4.0541 + 0.6695*OrdBilling

lm.DelSpeed=lm(Satisfaction~DelSpeed,hair1)

lm.DelSpeed

## Satisfaction = 3.2791 + 0.9364*DelSpeed
## Note: these fitted equations are kept as comments; as bare code they would
## assign to a variable named Satisfaction and mask the attached column.

##Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the Factors.

cortest.bartlett(corlnMtrx,100)

KMO(corlnMtrx)

A=eigen(corlnMtrx)

EV=A$values

EV

plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", col = "blue")

lines(EV, col = "red")

abline(h = 1, col = "green")

##PCA without rotation and with varimax rotation##

pca <- principal(corlnMtrx, nfactors = 4, rotate = "none")

pca

print(pca$loadings, cutoff = 0.5)

pca_r <- principal(corlnMtrx, nfactors = 4, rotate = "varimax")

pca_r

print(pca_r$loadings, cutoff = 0.6)

fa <- fa(r= hair1[,-12], nfactors = 4, rotate = "varimax", fm="pa")

fa

fa.diagram(fa)

hair2=cbind(hair1[,12],fa$scores)

head(hair2)

colnames(hair2) <- c("Satisfaction","Purchase","Marketing","Post Purchase","Product Position")

head(hair2)

class(hair2)

hair2=as.data.frame(hair2)

class(hair2)

model=lm(Satisfaction~.,hair2)

summary(model)

##Perform Multiple linear regression with customer satisfaction as dependent variable and the
four factors as independent variables. Comment on the Model output and validity##

set.seed(1234)

library(caTools)

index=sample.split(hair2$Satisfaction, SplitRatio = 0.7)

train <- subset(hair2, index == TRUE)

test <- subset(hair2, index == FALSE)

dim(train)

dim(test)

lm2=lm(Satisfaction~.,data=train)

summary(lm2)

vif(lm2)

pred <- predict(model, newdata = test)

summary(pred)

SSE <- sum((pred - test$Satisfaction)^2)

SSE

SST <- sum((mean(test$Satisfaction) - test$Satisfaction)^2)

SST

rsq <- 1-SSE/SST

rsq
