
ADVANCED STATISTICS

PROJECT

DONE BY: SAVITHA VINODH


PGP-DSBA Online
September 2021
DATE: 12/12/2021

Table of Contents
Problem 1A:
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually…………………………………………….... …6
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results………………………. 6
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results…………………….......7
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result…………………………………………….......8

Problem 1B:
1. What is the interaction between two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot……………………8
2. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education * Occupation). State the null and alternative
hypothesis and state your results. How will you interpret this result?..................................9
3. Explain the business implications of performing ANOVA for this particular case study…10

Problem 2:
1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?....................................................... 12
2. Is scaling necessary for PCA in this case? Give justification and perform scaling……......24
3. Comment on the comparison between the covariance and the correlation matrices from this
data [on scaled data] ……………………………………………………………………….25
4. Check the dataset for outliers before and after scaling. What insight do you derive
here? ……………………………………………………………………………….29
5. Extract the Eigenvalues and Eigenvectors. [Using Sklearn PCA print both] ……...30
6. Perform PCA and export the data of the Principal Component (eigenvectors) into a
data frame with the original features……………………………………………….32
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only) ………………………………………………….32
8. Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?.32
9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis?..............................................34

List of Figures:
Fig.no Name Page.no
1. Interaction Plot 8
2. Dist. Plot of Apps 14
3. Boxplot of Apps 14
4. Dist. Plot of Accept 14
5. Boxplot of Accept 14
6. Dist. Plot of Enroll 15
7. Boxplot of Enroll 15
8. Dist. Plot of Top10perc 15
9. Boxplot of Top10perc 15
10. Dist. Plot of Top25perc 16
11. Boxplot of Top25perc 16
12. Dist. Plot of F. Undergrad 16
13. Boxplot of F. Undergrad 16
14. Dist. Plot of P. Undergrad 17
15. Boxplot of P. Undergrad 17
16. Dist. Plot of Outstate 17
17. Boxplot of Outstate 17
18. Dist. Plot of Room. Board 18
19. Boxplot of Room. Board 18
20. Dist. Plot of Books 18
21. Boxplot of Books 18
22. Dist. Plot of Personal 19
23. Boxplot of Personal 19
24. Dist. Plot of PhD 19
25. Boxplot of PhD 19
26. Dist. Plot of Terminal 20
27. Boxplot of Terminal 20
28. Dist. Plot of S.F. Ratio 20
29. Boxplot of S.F. Ratio 20
30. Dist. Plot of Perc. alumni 21
31. Boxplot of perc. Alumni 21
32. Dist. Plot of Expend 21
33. Boxplot of Expend 21
34. Dist. Plot of Grad. rate 22
35. Boxplot of Grad. rate 22
36. Pair plot 23

37. Heatmap 24
38. Heatmap of Covariance 27
39. Heatmap of Correlation 29
40. Boxplot showing outlier before scaling 29
41. Boxplot showing outlier after scaling 30
42. Scree plot 33

List of Tables:
Table. No Name Page. No
1. Sample Dataset 5
2. ANOVA Table of Education and Salary 6
3. ANOVA Table of Occupation and Salary 7
4. Two-way ANOVA Table 9
5. Sample Dataset 11
6. Summary Table of Dataset 13
7. Sample values after Scaling 25
8. Covariance Matrix 25 -27
9. Correlation Matrix 27-28
10. Eigen Values 30
11. Eigen Vectors 31
12. Dataframe showing Principal Component with the original Features 32
13. Cumulative values of Eigenvalues 33
14. Selected Principal Component in Dataframe 34

Problem 1:
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals are collected and each person’s educational
qualification and occupation are noted. Educational qualification is at three levels, High school
graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.

Introduction:
Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of
two or more groups are significantly different from each other. ANOVA checks the impact of
one or more factors by comparing the means of different samples.

Data Description:
The given dataset consists of three variables namely:
• Education: In this field we have three categories Doctorate, Bachelors and HS-Grad.
• Occupation: In this field we have four categories Prof-Specialty, Sales, Adm-Clerical
and Exec-Managerial.
• Salary: In this field the salary of each individual is given.

Sample Dataset:

Table 1: Sample Dataset

Information about Dataset:

The given dataset consists of 40 non-null rows and 3 columns. Education and Occupation
are of object data type, whereas Salary is of integer data type.

1A. Q1. State the null and the alternate hypothesis for conducting one-way
ANOVA for both Education and Occupation individually.

The Hypothesis for the one-way ANOVA for Education and Salary:
Ho: The mean salary is the same across the 3 categories of Education.
Ha: The mean salary is different for at least one level of Education.

The Hypothesis for the one-way ANOVA for Occupation and Salary:
Ho: The mean salary is the same across the 4 categories of Occupation.
Ha: The mean salary is different for at least one level of Occupation.

Q2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Step 1: State the null and alternative hypothesis:
The Hypothesis for the one-way ANOVA for Education and Salary:
Ho: The mean salary is the same across the 3 categories of Education.
Ha: The mean salary is different for at least one level of Education.
Step 2: Decide the significance level
Here we select α= 0.05
Step 3: Identify the test statistic
Here we have one independent variable, Education. The Education variable has three categories.
One-way ANOVA determines how a response (Salary) is affected by the factor Education.
Step 4: Calculate p value using ANOVA table
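A minimal sketch of how this ANOVA table can be produced with statsmodels is shown below; the DataFrame name df and the file name are assumptions, with the data taken to have columns Education, Occupation and Salary.

```python
# Hypothetical file name; the data is assumed to have columns
# Education, Occupation and Salary.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('SalaryData.csv')

# One-way ANOVA of Salary on the categorical factor Education
model = ols('Salary ~ C(Education)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```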

Table 2: ANOVA Table of Education and Salary

Step 5: Decide to reject or accept null hypothesis
From the above ANOVA table, we can see that the p-value = 1.257709e-08.
Since the p-value < 0.05, we reject the null hypothesis. So, we conclude that at least one of the
categories of the Education field has a different mean salary.

Q3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Step 1: State the null and alternative hypothesis:
The Hypothesis for the one-way ANOVA for Occupation and Salary:
Ho: The mean salary is the same across the 4 categories of Occupation.
Ha: The mean salary is different for at least one level of Occupation.
Step 2: Decide the significance level
Here we select α= 0.05
Step 3: Identify the test statistic
Here we have one independent variable, Occupation. The Occupation variable has four categories.
One-way ANOVA determines how a response (Salary) is affected by the factor Occupation.
Step 4: Calculate p value using ANOVA table

Table 3: ANOVA table of Occupation and Salary

Step 5: Decide to reject or accept null hypothesis
From the above ANOVA table, we can see that the p-value = 0.458508.
Since the p-value > 0.05, we fail to reject the null hypothesis. So, we do not have evidence that the mean salary differs across the four categories of the Occupation field.

Q4: If the null hypothesis is rejected in either (2) or in (3), find out which
class means are significantly different. Interpret the result.
The null hypothesis is rejected in Q2: the mean salary is different for at least one of the three
categories of Education.
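The report does not show how the differing class means are identified; one common post-hoc approach is Tukey's HSD test. A minimal sketch, under the same DataFrame assumptions as the earlier ANOVA sketch:

```python
# Tukey HSD post-hoc test on the Education groups (df is assumed to hold
# the Salary data with an Education column, as in the earlier sketch).
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv('SalaryData.csv')   # hypothetical file name
tukey = pairwise_tukeyhsd(endog=df['Salary'], groups=df['Education'], alpha=0.05)
print(tukey.summary())
```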

1B. Q1. What is the interaction between two treatments? Analyze the
effects of one variable on the other (Education and Occupation) with the
help of an interaction plot.
Interaction effects occur when the effect of one variable depends on the value of another
variable.
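A minimal sketch of how such an interaction plot can be drawn with statsmodels (the DataFrame df and its column names are assumptions carried over from Problem 1A):

```python
# Interaction plot of mean Salary for each Education level across Occupations.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

df = pd.read_csv('SalaryData.csv')   # hypothetical file name
fig, ax = plt.subplots(figsize=(10, 6))
interaction_plot(x=df['Occupation'], trace=df['Education'],
                 response=df['Salary'], ax=ax)
plt.show()
```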

Fig.1: Interaction Plot

From the above interaction plot, it is evident that there is significant interaction between
Education and Occupation.
• For the Adm-clerical and Sales occupations, Doctorates and Bachelors earn the same salary.
• Doctorates earn a high salary in all four categories of occupation.
• For the Prof-specialty occupation, Bachelors and HS-grads earn nearly the same salary.
• HS-grad people do not hold Exec-managerial positions.
• HS-grad people earn a lower salary compared to Bachelors and Doctorates.
• Compared to the other three categories of Occupation, Bachelors earn their lowest salary in Prof-specialty.
• Doctorates in the Prof-specialty occupation earn the maximum salary in the given dataset.

Q2. Perform a two-way ANOVA based on Salary with respect to both
Education and Occupation (along with their interaction Education *
Occupation). State the null and alternative hypothesis and state your
results. How will you interpret this result?
Step 1: State the null and alternative hypothesis:
Null Hypothesis (Ho):
1) The mean salary is the same across the 3 categories of Education.
2) The mean salary is the same across the 4 categories of Occupation.
3) There is no interaction between the two variables Education and Occupation.

Alternative Hypothesis (Ha):
1) The mean salary differs for at least one category of Education.
2) The mean salary differs for at least one category of Occupation.
3) There is an interaction between the two variables Education and Occupation.
Step 2: Decide the significance level
Here we select α = 0.05
Step 3: Identify the test statistic
Here we have two independent variables, Education and Occupation. The Education variable has three levels and the Occupation variable has four levels.
Two-way ANOVA determines how a response (Salary) is affected by two factors, Education
and Occupation.
Step 4: Calculate p value using ANOVA table
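A minimal sketch of the two-way ANOVA with the interaction term, under the same DataFrame assumptions as before:

```python
# Two-way ANOVA of Salary on Education, Occupation and their interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('SalaryData.csv')   # hypothetical file name
model = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)',
            data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```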

Table 4: Two-Way ANOVA Table

Step 5: Decide to reject or accept null hypothesis
1) The p-value of Education is 5.466264e-12 < 0.05, so we reject null hypothesis (1) and conclude that the Education factor has a significant effect on Salary.
2) The p-value of Occupation is 7.211580e-02 > 0.05, so we fail to reject null hypothesis (2); on its own, the Occupation factor does not show a significant effect on Salary.
3) The p-value of the interaction (Education : Occupation) is 2.232500e-05 < 0.05, so we reject null hypothesis (3) and conclude that the interaction has a significant effect on Salary.
Hence, we interpret that there is an interaction between Education and Occupation with respect to Salary. Thus, the effect on Salary depends on the combination of Education and Occupation.

Q3. Explain the business implications of performing ANOVA for this particular case study.

For this case study, after performing ANOVA we can say that Salary depends on both
Education and Occupation. Among the three categories of Education, Doctorates earn the
highest salary while HS-grads earn the lowest. So, we can say that a person with higher
education has a higher probability of obtaining a higher-level occupation and thereby earning a higher salary.

Problem 2:
The dataset contains information on various colleges. You are expected to do
a Principal Component Analysis for this case study according to the
instructions given.
Introduction:
Here we are going to perform EDA and PCA.
Exploratory data analysis (EDA) is used to analyse and investigate data sets and summarize
their main characteristics, often employing data visualization methods.
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used
to reduce the dimensionality of large data sets, by transforming a large set of variables into a
smaller one that still contains most of the information in the large set.

Data Description:
The given dataset consists of 777 rows and 18 columns.
1. Names: Names of various universities and colleges.
2. Apps: Number of applications received.
3. Accept: Number of applications accepted.
4. Enroll: Number of new students enrolled.
5. Top10perc: Percentage of new students from the top 10% of their higher secondary class.
6. Top25perc: Percentage of new students from the top 25% of their higher secondary class.
7. F. Undergrad: Number of full-time undergraduate students.
8. P. Undergrad: Number of part-time undergraduate students.
9. Outstate: Out-of-state tuition cost.
10. Room. Board: Cost of room and board.
11. Books: Estimated book costs for a student.
12. Personal: Estimated personal spending for a student.
13. PhD: Percentage of faculty holding a PhD.
14. Terminal: Percentage of faculty with a terminal degree.
15. S. F. Ratio: Student/faculty ratio.
16. Perc. alumni: Percentage of alumni who donate.
17. Expend: Instructional expenditure per student.
18. Grad. Rate: Graduation rate.

Sample Dataset:

Table 5: Sample Dataset

Q1. Perform Exploratory Data Analysis [both univariate and multivariate
analysis to be performed]. What insight do you draw from the EDA?
Exploratory Data Analysis (EDA):
Checking data types of variables in the data frame:

The given data set consists of 18 variables, of which the Names variable is of object data type, the S. F.
Ratio variable is of float data type and the other 16 variables are of integer data type.
Checking Missing values in the data set:

Here we can see that all the variables show a non-null count of 777. Hence there are no missing
values in the data set.

Description of data set:

Table 6: Summary table of Data set

The summary table shows the count, mean, standard deviation, minimum value, 25% value,
50%value, 75% value and maximum value of all the numerical variables in the data set.

Checking Duplicate records in the Data set:

Hence there are no duplicate records in the given data set.

Univariate Analysis:
Univariate analysis is the simplest form of data analysis where the data being analysed
contains only one variable. The main purpose of univariate analysis is to describe the data
and find patterns that exist within it.
Now we can see the Description, Distribution plot and Boxplot of all numerical variables
individually:
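A minimal sketch of how these distribution plots and boxplots can be generated for every numeric column (the DataFrame name df and the file name are assumptions):

```python
# Distribution plot and boxplot for each numeric column of the college data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('CollegeData.csv')   # hypothetical file name

for col in df.select_dtypes(include='number').columns:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(df[col], kde=True, ax=axes[0])   # distribution plot
    sns.boxplot(x=df[col], ax=axes[1])            # boxplot to spot outliers
    axes[0].set_title(f'Distribution of {col}')
    axes[1].set_title(f'Boxplot of {col}')
    plt.show()
```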

Apps:

Fig 2: Dist. plot of apps Fig 3: Boxplot of Apps

From the above distribution plot of Apps, we can say that the data is right skewed, and the boxplot
clearly shows the presence of outliers.

Accept:

Fig 4: Dist. Plot of Accept Fig 5: Boxplot of Accept

From the above distribution plot of Accept, we can say that the data is right skewed and
boxplot clearly shows the presence of outliers.

Enroll:

Fig 6: Dist. Plot of Enroll Fig 7: Boxplot of Enroll

From the above distribution plot of Enroll, we can say that the data is right skewed and
boxplot clearly shows the presence of outliers.

Top10perc:

Fig 8: Dist. Plot of Top10perc Fig 9: Boxplot of Top10perc

From the above distribution plot of Top10perc, we can say that the data is right skewed and
boxplot clearly shows the presence of outliers.

Top25perc:

Fig 10: Dist. Plot of Top25perc Fig 11: Boxplot of Top25perc

From the above distribution plot of Top25perc, we can say that the data is approximately normally
distributed, and the boxplot clearly shows that there are no outliers.

F. Undergrad:

Fig 12: Dist. Plot of F. Undergrad Fig 13: Boxplot of F. Undergrad

From the above distribution plot of F. Undergrad, we can say that the data is right skewed
and boxplot clearly shows the presence of outliers.

P. Undergrad:

Fig 14: Dist. Plot of P. Undergrad Fig 15: Boxplot of P. Undergrad

From the above distribution plot of P. Undergrad, we can say that the data is right skewed
and boxplot clearly shows the presence of outliers.

Outstate:

Fig 16: Dist. Plot of Outstate Fig 17: Boxplot of Outstate

From the above distribution plot of Outstate, we can say that the data is normally distributed
and boxplot clearly shows the presence of only one outlier.

Room. Board:

Fig 18: Dist. Plot of Room. Board Fig 19: Boxplot of Room. Board

From the above distribution plot of Room. Board, we can say that the data is approximately normally
distributed, and the boxplot clearly shows the presence of outliers.

Books:

Fig 20: Dist. plot of Books Fig 21: Boxplot of Books

From the above distribution plot of Books, we can say that the data is bimodal, and the boxplot
clearly shows the presence of outliers.

Personal:

Fig 22: Dist. Plot of Personal Fig 23: Boxplot of Personal

From the above distribution plot of Personal, we can say that the data is right skewed and
boxplot clearly shows the presence of outliers.

PhD:

Fig 24: Dist. Plot of PhD Fig 25: Boxplot of PhD

From the above distribution plot of PhD, we can say that the data is left skewed and boxplot
clearly shows the presence of outliers.

Terminal:

Fig 26: Dist. Plot of Terminal Fig 27: Boxplot of Terminal

From the above distribution plot of Terminal, we can say that the data is left skewed and
boxplot clearly shows the presence of outliers.

S. F. Ratio:

Fig 28: Dist. Plot of S. F. Ratio Fig 29: Boxplot of S. F. Ratio

From the above distribution plot of S. F. Ratio, we can say that the data is normally
distributed and boxplot clearly shows the presence of outliers.

Perc. Alumni:

Fig 30: Dist. Plot of Perc. alumni Fig 31: Boxplot of Perc. alumni

From the above distribution plot of Perc. alumni, we can say that the data is slightly right
skewed and boxplot clearly shows the presence of outliers.

Expend:

Fig 32: Dist. Plot of Expend Fig 33: Boxplot of Expend

From the above distribution plot of Expend, we can say that the data is right skewed and
boxplot clearly shows the presence of outliers.

Grad. Rate:

Fig 34: Dist. Plot of Grad. rate Fig 35: Boxplot of Grad. rate

From the above distribution plot of Grad. Rate, we can say that the data is normally
distributed and boxplot clearly shows the presence of outliers.
Thus, from the above univariate analysis we can confirm that all variables have outliers
except Top25perc.
The distribution plots show which variables are normally distributed, left skewed or right
skewed.
Next, we can do multivariate analysis to learn more about the relationships between the variables.

Multivariate Analysis:
Data involving three or more variables is termed multivariate data. Multivariate analysis is
similar to bivariate analysis but considers more than two variables at a time.
A pair plot allows us to see both the distribution of single variables and the relationships
between pairs of variables. A pair plot is also called a scatter plot matrix.
The pair plot of the given dataset is shown below:

Fig 36: Pair plot

Here the diagonal shows the distribution of each single variable as a histogram, while the other
plots show the scatter plot of the corresponding pair of variables.
Another way of visualizing multivariate relationships is a heatmap.
A heat map chart, or heatmap, is a two-dimensional visual representation of data in which values
are encoded as colours, giving a convenient, insightful view of the information. Essentially,
this chart type is a data table whose rows and columns denote different sets of categories.
Heatmap of the given data is shown below.
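A minimal sketch of how the pair plot and the heatmap can be generated with seaborn (the file name and the DataFrame name num_df are assumptions):

```python
# Pair plot (scatter plot matrix) and correlation heatmap of the numeric columns.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; keep only the numeric columns.
num_df = pd.read_csv('CollegeData.csv').select_dtypes(include='number')

sns.pairplot(num_df, diag_kind='hist')
plt.show()

plt.figure(figsize=(12, 8))
sns.heatmap(num_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```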

Fig 37: Heatmap

• From the heatmap we can infer that the Accept variable is highly correlated with the Apps variable.
• F. Undergrad is also highly correlated with the Enroll variable.
• F. Undergrad is also correlated with the Apps and Accept variables.
• Terminal is highly correlated with the PhD variable.
• Other variable pairs show weaker correlations with one another.
Insights of EDA:
• The data set does not contain any missing values or duplicate records.
• Some variables are left skewed, some are right skewed and some are approximately normally distributed.
• All variables have outliers except the Top25perc variable.
• The heatmap clearly shows that several variables are highly correlated, so multicollinearity is likely to arise in further analysis.
• Thus, PCA has to be performed before doing further analysis.

Q2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
Yes, scaling is necessary for PCA in this case.

Here the dataset contains 17 numerical variables, each measured on a different scale. For
example, Apps is a count of applications whereas Grad. Rate is a percentage.

PCA requires that the input variables have a similar scale of measurement. PCA is driven by
variance, so if the data is left on its original scales, the features with the largest numeric
ranges (such as Apps or F. Undergrad) would dominate the principal components.

In the z-score scaling technique, the features are scaled so that each ends up with a mean of
zero and a standard deviation of one.
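One common way to apply z-score scaling in Python is sklearn's StandardScaler (scipy.stats.zscore gives an equivalent result); a minimal sketch, assuming the numeric columns are in a DataFrame named num_df as in the earlier sketch:

```python
# Z-score scaling of the numeric columns with StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# num_df is the assumed DataFrame of numeric columns from the EDA sketches.
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)
print(scaled_df.head())
```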

Hence, we conclude that scaling is necessary for PCA in this case. After applying the z-score
scaling technique the values of the variables change; a sample is shown below.

Table 7: Sample values after scaling

Q3. Comment on the comparison between the covariance and the correlation matrices
from this data [on scaled data].
Both covariance and correlation measure the relationship and dependency between two
variables.
Covariance:
Covariance indicates the direction of the linear relationship between variables. Its values range
from -infinity to +infinity. A positive covariance denotes a direct relationship whereas a
negative covariance denotes an inverse relationship between two variables.
Covariance is affected by changes in scale.

Table 8: Covariance Matrix

Fig 38: Heatmap of Covariance Matrix

From the above covariance matrix heatmap, we can see that both positive and negative
covariances occur in the given data.
Apps has a positive covariance with all variables, whereas Outstate has a negative covariance
with Enroll, F. Undergrad and P. Undergrad. Likewise, the covariance matrix shows the
relationship of each of the 17 variables with every other variable.
Correlation Matrix:
Correlation measures both the strength and the direction of the linear relationship between two
variables. Correlation is not affected by a change in scale and is limited to values in the
range -1 to +1.
Positively correlated variables have a direct relationship; negatively correlated variables have
an inverse relationship.
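Both matrices can be obtained directly from pandas; a minimal sketch, assuming the scaled values are in the DataFrame scaled_df from the previous sketch. On z-score scaled data the two matrices are essentially the same, differing only by the n/(n-1) factor that comes from the sample-variance convention.

```python
# Covariance and correlation matrices of the scaled data.
cov_matrix = scaled_df.cov()     # covariance matrix (uses the n-1 denominator)
corr_matrix = scaled_df.corr()   # correlation matrix
print(cov_matrix.round(2))
print(corr_matrix.round(2))
```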

Table 9: Correlation Matrix

Fig 39: Heatmap of Correlation

From the above correlation matrix heatmap, we can see that most of the variables are
positively correlated and a few are negatively correlated with each other. Accept, Enroll and
F. Undergrad show a high positive correlation with Apps, and S.F. Ratio shows a strong
negative correlation with Expend.

Q4. Check the dataset for outliers before and after scaling. What insight do
you derive here?
Outliers Before Scaling:

Fig 40: Boxplot showing outliers before scaling

Here we can see that, except for Top25perc, all the variables have outliers, and the variables
are on very different ranges.

Outliers after scaling:

Fig 41: Boxplot showing outliers after scaling

Outliers are still present after scaling, but the ranges have been brought onto a comparable
scale; scaling changes the scale of the data but does not treat outliers.
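A minimal sketch of how the outliers can be counted with the usual 1.5 x IQR rule before and after scaling, using the num_df and scaled_df names assumed in the earlier sketches:

```python
# Count IQR-based outliers per column; the counts are identical before and
# after z-score scaling because scaling is a linear transformation.
def count_outliers(frame):
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    mask = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return mask.sum()

print(count_outliers(num_df))      # outlier counts before scaling
print(count_outliers(scaled_df))   # outlier counts after scaling
```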

Q5. Extract the Eigenvalues and Eigenvectors. [Using Sklearn PCA print
both].
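A minimal sketch using sklearn's PCA, where explained_variance_ gives the eigenvalues and components_ gives the eigenvectors (scaled_df is the assumed name of the scaled data from the earlier sketches):

```python
# Fit PCA on the scaled data and print eigenvalues and eigenvectors.
from sklearn.decomposition import PCA

pca = PCA(n_components=17)        # one component per numeric feature
pca.fit(scaled_df)

print(pca.explained_variance_)    # eigenvalues
print(pca.components_)            # eigenvectors, one row per component
```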

Eigen Values:

Table 10: Eigen values

Eigen Vectors:

Table 11: Eigen Vectors

Q6. Perform PCA and export the data of the Principal Component
(eigenvectors) into a data frame with the original features.
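A minimal sketch of how the loadings can be placed in a DataFrame with the original feature names as columns (assuming the fitted pca object and scaled_df from the previous sketch):

```python
# DataFrame of eigenvectors (principal component loadings) labelled with
# the original feature names.
import pandas as pd

pc_df = pd.DataFrame(pca.components_,
                     columns=scaled_df.columns,
                     index=[f'PC{i + 1}' for i in range(pca.n_components_)])
print(pc_df)
```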

Table 12: Dataframe showing Principal Component with the Original features

Q7. Write down the explicit form of the first PC (in terms of the
eigenvectors. Use values with two places of decimals only).

Explicit form of the first PC:
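The coefficients themselves come from the report's output and are not reproduced here; a sketch like the following can print the first PC as a linear combination of the original (scaled) features, rounded to two decimals (assuming the fitted pca object and scaled_df from the earlier sketches):

```python
# Build the explicit linear-combination form of the first principal component.
terms = [f'({weight:+.2f} * {feature})'
         for weight, feature in zip(pca.components_[0], scaled_df.columns)]
print('PC1 = ' + ' '.join(terms))
```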

Q8. Consider the cumulative values of the eigenvalues. How does it help
you to decide on the optimum number of principal components? What do
the eigenvectors indicate?
A vital part of using PCA is the ability to estimate how many components are needed to
describe the data. This can be determined by looking at the cumulative explained variance
ratio as a function of the number of components and the scree plot.

Cumulative explained variance ratio is used to find a cut off for selecting the number of PCs.

Table 13: Cumulative values of the Eigenvalues

From these cumulative values, the first seven PCs explain 85.2% of the variation in the data.
Scree Plot:
The scree plot orders the eigen values from largest to smallest. This helps us to determine the
number of components.
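A minimal sketch of how the cumulative explained variance and the scree plot can be obtained (assuming the fitted pca object from the earlier sketches):

```python
# Cumulative explained variance ratio and a scree plot of the eigenvalues.
import numpy as np
import matplotlib.pyplot as plt

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))    # cumulative proportion of variance explained

plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue')
plt.title('Scree plot')
plt.show()
```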

Fig 42: Scree Plot

The above scree plot shows that the eigen values start to flatten into a straight line after the
seventh principal component. So, we select seven principal components for the given data set.
The first seven PCs are shown below in a dataframe together with the original features.

Table 14. Selected Principal components in Dataframe

The eigen vectors with the highest eigen values carry the most information about the
distribution of the data. So, the selected eigen vectors (principal components) retain 85.2% of
the information in this case study.

Q9. Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis?
In this case study, information on various colleges has been given. Here we perform EDA to
understand the given dataset and to help clean it up. EDA gives a clear picture of the
features and the relationships between them through univariate and multivariate analysis.
Principal Component Analysis (PCA) is a technique used to emphasize variation and bring
out strong patterns in a dataset. PCA is an unsupervised learning algorithm used for
dimensionality reduction in machine learning. The most important use of PCA is to
represent a multivariate data table as a smaller set of variables.
After performing PCA on the given dataset we selected seven PCs, reduced from the 17
numerical variables. These seven PCs retain most of the information, which helps us in
further analysis.

