Mini Project - Factor Hair Analysis: Sravanthi.M
Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
   3.1. Environment Set up and Data Import
        3.1.1. Install necessary Packages and Invoke Libraries
        3.1.2. Set up working Directory
        3.1.3. Import and Read the Dataset
   3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
   1. Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for outliers and missing values.
   1.2 EDA – Check for outliers and missing values and check the summary of the dataset
   3. Perform simple linear regression for the dependent variable with every independent variable
   4. Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors
      4.1 Perform PCA/FA and interpret the eigenvalues (apply the Kaiser normalization rule)
      4.2 Output interpretation: explain why only 4 factors are asked for in the question and whether choosing 4 factors is correct. Name the factors with correct explanations.
   5. Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables. Comment on the model output and validity. Your remarks should make it meaningful for everybody.
      5.1 Create a data frame with a minimum of 5 columns, 4 of which are different factors and the 5th column is Customer Satisfaction
      5.2 Perform multiple linear regression with Customer Satisfaction as the dependent variable and the four factors as independent variables
      5.3 MLR summary interpretation and significance (R, R2, adjusted R2, degrees of freedom, F-statistic, coefficients along with p-values)
6. Source Code
1 Project Objective
The objective of this report is to explore the Factor Hair dataset in R and generate insights about it. The exploration will consist of the following:
2 Assumptions
Is there evidence of multicollinearity?
Perform factor analysis by extracting four factors.
Name the four factors.
Perform multiple linear regression with customer satisfaction as the dependent variable
and the four factors as independent variables.
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all the
packages in one place improves code readability. For installation we use
install.packages("package name")
getwd(): returns an absolute file path representing the current working directory.
dim(): returns the dimensions (i.e. the number of rows and columns) of an object.
lm(): used to fit linear models; it is one of R's various model-fitting functions. The function invokes particular methods which depend on the class of the first argument.
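Putting these set-up steps together, a minimal sketch might look as follows (the working-directory path and the CSV file name are assumptions for illustration, not taken from the original):

```r
## Install once (commented out so repeated runs stay fast), then load libraries
# install.packages(c("corrplot", "psych", "car", "caTools", "tidyverse"))
library(psych)   # fa(), KMO(), cortest.bartlett()
library(car)     # vif()

# setwd("~/projects/factor-hair")                        # path is an assumption
Factorhair <- read.csv("Factor Hair.csv", header = TRUE) # file name is an assumption
dim(Factorhair)  # number of rows and columns
```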
4 Conclusion
From the given problem we have seen how Factor Analysis can be used to reduce the
dimensionality of a dataset, and we then used multiple linear regression on the
dimensionally reduced columns for further analysis/prediction. The following points were covered:
1. Checked for multicollinearity.
2. Performed Factor Analysis.
3. Named the four factors: Sales.Distri, Marketing, After.Sales.Service and Value.For.Money (Cust.Satisf being the dependent variable).
4. Performed multiple linear regression with customer satisfaction (Cust.Satisf) as the dependent variable and
the four factors above as independent variables.
getwd()
dim(Factorhair)
names(Factorhair)
str(Factorhair)
## summary of the data
summary(Factorhair)
From the summary we notice that the first column, "ID", is just the row number and is not
required further; we therefore remove it and rename the reduced dataset hair.
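A sketch of that clean-up step (the object names follow the report; treat the details as illustrative):

```r
## Drop the first column ("ID") and keep the working copy as "hair"
Factorhair <- data.frame(ID = 1:100, matrix(runif(1200, 1, 10), ncol = 12)) # stand-in data
hair <- Factorhair[ , -1]
variables <- colnames(hair)  # 12 variable names, reused later as plot labels
dim(hair)                    # 100 rows x 12 columns expected
```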
We need to find missing values
Syntax: sum(is.na(hair))
Box plot of dependent variable (Customer satisfaction)
Syntax: boxplot(`Customer Satisfaction`, horizontal = TRUE, xlab = variables[12],
col = "pink", border="blue",ylim = c(0,11))
l = round(min(hair[,i]),0)-1
n = variables[i]
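The two stray assignments above look like fragments of a per-variable plotting loop; a plausible reconstruction (the histogram call and upper axis limit are assumptions) is:

```r
## One histogram per variable; l and n match the fragments above
hair <- as.data.frame(matrix(runif(1200, 1, 10), ncol = 12)) # stand-in data
variables <- colnames(hair)
par(mfrow = c(3, 4))
for (i in c(1:12)) {
  l <- round(min(hair[, i]), 0) - 1  # lower axis limit, as in the fragment
  u <- round(max(hair[, i]), 0) + 1  # upper axis limit (assumed counterpart)
  n <- variables[i]                  # variable name used as the plot title
  hist(hair[, i], main = n, xlab = NULL, xlim = c(l, u), col = "lightblue")
}
```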
Boxplot of independent variables
par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue", border = "pink", cex.axis = 1)
Bivariate Analysis - Scatter Plot of independent variables against the dependent variable
Syntax: par(mfrow = c(3,3))
for (i in c(1:11)) {
  plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL, col = "red",
       cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1, xlim = c(0,10), ylim = c(0,10))
  abline(lm(formula = `Customer Satisfaction` ~ hair[,i]), col = "blue")
}
Finding Outliers in variables
Syntax:
OutLiers <- hair[(1:12), ]          # first 12 rows used as a collection template
OutLiers[ , ] <- NA                 # clear the template (assumed; avoids stale values)
for (i in c(1:12)) {
  Box_Plot <- boxplot(hair[, i], plot = FALSE)$out  # points beyond the whiskers (assumed source of Box_Plot)
  if (length(Box_Plot) > 0) {
    OutLiers[(1:length(Box_Plot)), i] <- Box_Plot
  }
}
write.csv(OutLiers, "OutLiers.csv")
3. Perform simple linear regression for the dependent variable with every independent variable
Ans: From the correlation matrix above we perform the Bartlett test of sphericity. If the p-value is less than 0.05,
the dataset is a good candidate for dimension reduction.
Syntax: cortest.bartlett(corlnMtrx, 100)
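The correlation matrix corlnMtrx referenced here is not constructed in this excerpt; a minimal sketch, assuming it covers the 11 independent variables:

```r
## Correlation matrix of the predictors, then Bartlett's test of sphericity
library(psych)
hair <- as.data.frame(matrix(runif(1200, 1, 10), ncol = 12)) # stand-in data
corlnMtrx <- cor(hair[ , -12])               # exclude Customer Satisfaction
corrplot::corrplot(corlnMtrx, method = "number", number.cex = 0.7)
psych::cortest.bartlett(corlnMtrx, n = 100)  # p < 0.05 supports factorability
```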
4. Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors
4.1 Perform PCA/FA and interpret the eigenvalues (apply the Kaiser normalization rule)
4.2 Output interpretation: explain why only 4 factors are asked for in the question and whether
choosing 4 factors is correct. Name the factors with correct explanations.
Ans: Kaiser-Meyer-Olkin (KMO) Test is a measure of how suited your data is for Factor Analysis.
Syntax: KMO(corlnMtrx)
The KMO statistic of 0.65 is also large (greater than 0.50). Hence Factor Analysis is considered as
an appropriate technique for further analysis of the data.
Calculate the Eigen values for the variables
Syntax:
A <- eigen(corlnMtrx)
EV <- A$values
EV
plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", pch = 20, col = "blue")
lines(EV, col = "red")
abline(h = 1, col = "green", lty = 2)
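To back the Kaiser cut-off with a number, the share of variance retained by four factors can be read off the eigenvalues (a small addition for illustration, not in the original):

```r
## Proportion and cumulative proportion of variance per component
EV <- c(4, 3, 2, 1, 0.5, 0.5)  # stand-in eigenvalues; use A$values in the report
prop <- EV / sum(EV)
cumsum(prop)                   # the 4th entry is the share kept by four factors
```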
5. Perform multiple linear regression with customer satisfaction as the dependent variable and the four
factors as independent variables. Comment on the model output and validity. Your remarks should
make it meaningful for everybody.
5.1 Create a data frame with a minimum of 5 columns, 4 of which are different factors and the
5th column is Customer Satisfaction
5.2 Perform Multiple Linear Regression with Customer Satisfaction as the Dependent Variable
and the four factors as Independent Variables
5.3 MLR summary interpretation and significance (R, R2, adjusted R2, degrees of freedom,
F-statistic, coefficients along with p-values)
5.4 Output Interpretation
Ans: As per the scree plot above, only four eigenvalues exceed 1 (the Kaiser rule), so we extract 4 factors from the 11 variables.
Without rotating:
Syntax:
FourFactor = fa(r= hair[,-12], nfactors =4, rotate ="none", fm ="pa")
print(FourFactor)
Loading <- print(FourFactor$loadings, cutoff = 0.3)
write.csv(Loading, "loading.csv")
fa.diagram(FourFactor)
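The rotated solution FourFactor1 used below is not defined in this excerpt; a plausible sketch, assuming an orthogonal varimax rotation (the rotation method is an assumption):

```r
## Rotated four-factor solution; "varimax" is an assumption
library(psych)
hair <- as.data.frame(matrix(runif(1200, 1, 10), ncol = 12)) # stand-in data
FourFactor1 <- fa(r = hair[ , -12], nfactors = 4, rotate = "varimax", fm = "pa")
Loading1 <- print(FourFactor1$loadings, cutoff = 0.3)  # hide loadings below 0.3
```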
write.csv(Loading1, "Loading1.csv")
fa.diagram(FourFactor1)
Create a new data frame using scores for four factors and dependent variable
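The data frame hair1 is used below, but its construction does not appear in this excerpt; a sketch consistent with the column names assigned next (the exact score ordering is an assumption):

```r
## Combine the dependent variable with the four rotated factor scores
hair <- as.data.frame(matrix(runif(1200, 1, 10), ncol = 12)) # stand-in data
scores <- matrix(rnorm(400), ncol = 4)  # stand-in for FourFactor1$scores
hair1 <- data.frame(hair[ , 12], scores)
colnames(hair1) <- c("Cust.Satisf", "Sales.Distri", "Marketing",
                     "After.Sales.Service", "Value.For.Money")
```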
head(hair1)
colnames(hair1) <-
c("Cust.Satisf","Sales.Distri","Marketing","After.Sales.Service","Value.For.Money")
class(hair1)
set.seed(1)
## creating two data sets: one to train the model and another to test the model
cat(" Train Dimension: ", dim(Train), "\n", "Test Dimension : ", dim(Test))
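The split itself is not shown in this excerpt; a sketch using caTools::sample.split, which the report's library list loads (the 70/30 ratio is an assumption):

```r
## Reproducible 70/30 split on the dependent variable
library(caTools)
set.seed(1)
hair1 <- data.frame(Cust.Satisf = rnorm(100), F1 = rnorm(100)) # stand-in data
split <- sample.split(hair1$Cust.Satisf, SplitRatio = 0.7)
Train <- subset(hair1, split == TRUE)
Test  <- subset(hair1, split == FALSE)
```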
linearModel = lm(Cust.Satisf ~., data = Train)
summary(linearModel)
vif(linearModel)
Check SSE – the sum of squared deviations of the actual values from the predicted values.
Check SSR – the sum of squared deviations of the predicted values (predicted using the regression) from the mean.
cat(" SST :", SST, "\n", "SSE :", SSE, "\n","SSR :", SSR, "\n","R squared Test :" , R.square.test)
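The quantities printed here are not computed in this excerpt; one common way to obtain them on the test set (variable names match the cat() call; SSR is taken as SST − SSE, one of several conventions):

```r
## Fit statistics on the hold-out set
Test <- data.frame(Cust.Satisf = c(1, 2, 3, 4))  # stand-in test data
pred <- c(1.1, 1.9, 3.2, 3.8)                    # stand-in for predict(linearModel, newdata = Test)
SST  <- sum((Test$Cust.Satisf - mean(Test$Cust.Satisf))^2)  # total variation
SSE  <- sum((Test$Cust.Satisf - pred)^2)                    # unexplained variation
SSR  <- SST - SSE                                           # explained variation
R.square.test <- 1 - SSE / SST
```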
6 Source Code
## Setting up the working directory and getting the working directory
getwd()
##Importing packages
library(corrplot)
install.packages("tidyverse")
library(tidyverse)
library(ggplot2)
install.packages("psych")
library(psych)
library(car)
install.packages("caTools")
library(caTools)
## Reading the data set (file name assumed; not shown in the original)
Factorhair <- read.csv("Factor Hair.csv", header = TRUE)
dim(Factorhair)
names(Factorhair)
str(Factorhair)
summary(Factorhair)
## Creating new data set named hair and removing column ID
hair <- Factorhair[ , -1]
dim(hair)
variables <- colnames(hair)  # display names; the original definition is not shown
colnames(hair) <- variables
summary(hair)
attach(hair)
hair
sum(is.na(hair))
## Per-variable histogram loop (reconstructed from fragments; plot call assumed)
par(mfrow = c(3, 4))
for (i in c(1:12)) {
  l = round(min(hair[,i]), 0) - 1
  n = variables[i]
  hist(hair[,i], main = n, xlim = c(l, round(max(hair[,i]), 0) + 1))
}
par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue",
border = "pink", cex.axis = 1)
## Bivariate Analysis
par(mfrow = c(3,3))
for (i in c(1:11)) {
  plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL,
       col = "red", cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
       xlim = c(0,10), ylim = c(0,10))
  abline(lm(formula = `Customer Satisfaction` ~ hair[,i]), col = "blue")
}
OutLiers <- hair[(1:12), ]        # first 12 rows used as a collection template
OutLiers[ , ] <- NA               # clear the template (assumed; avoids stale values)
for (i in c(1:12)) {
  Box_Plot <- boxplot(hair[, i], plot = FALSE)$out  # points beyond the whiskers (assumed)
  if (length(Box_Plot) > 0) {
    OutLiers[(1:length(Box_Plot)), i] <- Box_Plot
  }
}
write.csv(OutLiers, "OutLiers.csv")
corlnMtrx <- cor(hair[ , -12])    # correlation matrix (construction assumed)
corlnMtrx
cortest.bartlett(corlnMtrx, 100)
KMO(corlnMtrx)
A <- eigen(corlnMtrx)
EV <- A$values
EV
## Without rotating
FourFactor = fa(r = hair[,-12], nfactors = 4, rotate = "none", fm = "pa")
print(FourFactor)
Loading <- print(FourFactor$loadings, cutoff = 0.3)
write.csv(Loading, "loading.csv")
fa.diagram(FourFactor)
## With rotation ("varimax" is an assumption; the original call is not shown)
FourFactor1 = fa(r = hair[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
print(FourFactor1)
Loading1 <- print(FourFactor1$loadings,cutoff = 0.3)
write.csv(Loading1, "Loading1.csv")
fa.diagram(FourFactor1)
## Create a new data frame using scores for the four factors and the dependent variable
hair1 <- data.frame(hair[ , 12], FourFactor1$scores)  # construction assumed
colnames(hair1) <- c("Cust.Satisf","Sales.Distri","Marketing","After.Sales.Service","Value.For.Money")
head(hair1)
class(hair1)
set.seed(1)
## creating two data sets: one to train the model and another to test the model
linearModel = lm(Cust.Satisf ~., data = Train)
summary(linearModel)
vif(linearModel)
cat(" SST :", SST, "\n", "SSE :", SSE, "\n", "SSR :", SSR, "\n", "R squared Test :", R.square.test)