
Advanced Statistics

Project

Business Report

16.05.2021

Rakshit Tibrewal

Index

Contents

Problem 1A
Problem 1B
Problem 2
Thank you

Problem 1A:

Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s
educational qualification and occupation are noted. Educational qualification is at three levels, High
school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or specialty, and Executive or managerial. A different number of observations are
in each level of education – occupation combination.

 [Assume that the data follows a normal distribution. In reality, the normality assumption may not
always hold if the sample size is small.]

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.

Solution:

We have formulated the null and alternate hypotheses for the one-way ANOVA as mentioned
below:

For education the hypotheses are:

The null hypothesis H0 is that the mean salary is the same across all categories of education.
The alternate hypothesis H1 is that the mean salary is different for at least one category of
education.

For occupation the hypotheses are:

The null hypothesis H0 is that the mean salary is the same across all categories of occupation.
The alternate hypothesis H1 is that the mean salary is different for at least one category of
occupation.

1.2 Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

Solution:

The p-value is less than 0.05, hence we reject the null hypothesis. Therefore, we infer that the mean
salary is different for at least one category of education.

The F ratio is 30.96, which means the variation between education levels (mean square between
groups) is 30.96 times the variation within the levels (mean square within groups). The differences
between the education levels are therefore large compared with the spread inside each level.
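The F ratio described above can be computed by hand to see what it measures. The following is a minimal sketch on made-up salary figures, not the values in SalaryData.csv; the group labels and the helper name one_way_f are our own assumptions for illustration.

```python
import numpy as np

# Made-up salaries for three education levels (illustrative only, not SalaryData.csv).
groups = {
    "HS-grad": np.array([50.0, 60.0, 55.0, 65.0]),
    "Bachelors": np.array([150.0, 170.0, 160.0, 180.0, 165.0]),
    "Doctorate": np.array([200.0, 220.0, 210.0]),
}


def one_way_f(samples):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    all_values = np.concatenate(samples)
    grand_mean = all_values.mean()
    k, n = len(samples), all_values.size
    # Between-group sum of squares: group size times squared offset of the group mean.
    ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    # Within-group sum of squares: spread of each value around its own group mean.
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
    df_between, df_within = k - 1, n - k
    # F is the ratio of the two mean squares.
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_b, df_w = one_way_f(groups.values())
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

A large F means the group means are far apart compared with the spread inside each group, which is the pattern we report for education.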

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

Solution:

The p-value is more than 0.05, hence we fail to reject the null hypothesis. We do not have enough
evidence to say that the mean salary differs across occupations.

The F ratio is 0.88, which is much lower than the F ratio for education. The variation between the
occupation levels is slightly smaller than the variation within each level.

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are
significantly different. Interpret the result. (Non Graded)

Solution:

The null hypothesis was rejected only for Education (1.2), so the class means differ significantly
across education levels, while no significant difference was found for Occupation (1.3). We infer
that education has a stronger effect on salary than occupation.
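To find out which pairs of class means differ, a post-hoc test such as Tukey's HSD is one common option. The sketch below is an assumption on our part about how this could be done; it runs on made-up salaries, and the labels are placeholders, so its output says nothing about the actual data.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Made-up salaries per education level (illustrative only).
hs_grad = np.array([50.0, 60.0, 55.0, 65.0, 58.0])
bachelors = np.array([150.0, 170.0, 160.0, 180.0, 165.0])
doctorate = np.array([200.0, 220.0, 210.0, 215.0, 205.0])
labels = ["HS-grad", "Bachelors", "Doctorate"]

result = tukey_hsd(hs_grad, bachelors, doctorate)

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j.
significant_pairs = [
    (labels[i], labels[j])
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
    if result.pvalue[i, j] < 0.05
]
print(significant_pairs)
```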

Problem 1B:

1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot.

Solution:

We noted that salary differs across the levels of education and occupation. The inferences are
mentioned below:

1. People with a bachelor's degree earn almost the same salary across occupations, except
when they are in the Professional or specialty category.

2. People with a doctorate are the highest paid overall, and we noted that they earn the
least when they work in Administrative and clerical profiles.

3. High school graduates are the lowest paid, and their line does not cross the lines of the
other education categories.

4. An interaction shows up where two lines meet or cross in the point plot. We noted that the
lines for doctorate and bachelor's degree holders meet in the Administrative and clerical and
the Sales occupations.
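The points drawn in an interaction plot are the mean salaries of each education and occupation combination. As a rough sketch, these cell means can be tabulated as below; the rows and column names are made up for illustration and are assumptions, not the contents of SalaryData.csv.

```python
import pandas as pd

# Made-up rows (illustrative only); the column names are assumptions.
df = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "Doctorate", "Doctorate",
                  "HS-grad", "HS-grad", "Bachelors", "Doctorate"],
    "Occupation": ["Sales", "Adm-clerical", "Sales", "Adm-clerical",
                   "Sales", "Adm-clerical", "Sales", "Sales"],
    "Salary": [160.0, 150.0, 170.0, 155.0, 60.0, 55.0, 180.0, 190.0],
})

# One row per occupation, one column per education level: the points of the plot.
cell_means = df.groupby(["Occupation", "Education"])["Salary"].mean().unstack("Education")
print(cell_means)

# If the gap between two education levels changes from one occupation to the next,
# the lines are not parallel, which is what an interaction looks like.
gap = cell_means["Doctorate"] - cell_means["Bachelors"]
print(gap)
```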

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction
Education*Occupation) with the variable ‘Salary’. State the null and alternative hypotheses and
state your results. How will you interpret this result?

Solution:

The null hypothesis H0 is that there is no interaction effect between Education and Occupation on
salary, and the alternative hypothesis H1 is that there is an interaction effect between them.
From the ANOVA output, the p-value for the Education:Occupation interaction term is 2.325e-05, which
is less than 0.05. Hence we reject the null hypothesis and infer that there is an interaction between
the two variables in the two-way ANOVA.
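A two-way ANOVA with an interaction term can be fitted along the following lines. This is a sketch under stated assumptions: the data are simulated with an interaction built in, the column names are placeholders, and the choice of a Type II table (typ=2) is our assumption, since the report does not state which type was used.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated data (illustrative only): five observations per cell.
rng = np.random.default_rng(0)
rows = []
for edu, edu_effect in [("HS-grad", 0.0), ("Bachelors", 80.0), ("Doctorate", 120.0)]:
    for occ, occ_effect in [("Adm-clerical", 0.0), ("Sales", 10.0)]:
        # Extra bump for one cell only, so an interaction is present by construction.
        bump = 60.0 if (edu == "Doctorate" and occ == "Sales") else 0.0
        for _ in range(5):
            salary = 60.0 + edu_effect + occ_effect + bump + rng.normal(0, 5)
            rows.append((edu, occ, salary))
df = pd.DataFrame(rows, columns=["Education", "Occupation", "Salary"])

# "*" expands to both main effects plus the Education:Occupation interaction.
model = ols("Salary ~ C(Education) * C(Occupation)", data=df).fit()
table = anova_lm(model, typ=2)
print(table)

p_interaction = table.loc["C(Education):C(Occupation)", "PR(>F)"]
```

The interaction row of the table is the one compared with 0.05 in the solution above.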

1.7 Explain the business implications of performing ANOVA for this particular case study.

Solution:

We analysed the two variables, education and occupation, using ANOVA techniques. We noted that
education shows a significant difference in mean salary, whereas occupation does not.

We also noted the relationship between them and their interaction from the above results. We infer
that people with a doctorate get the highest salary in their occupation compared with the other
education levels. This can help students select the right path. From a business standpoint, we can
prepare additional facilities and courses for students to pursue, and try to increase the pay for the
other education levels according to the data and the analysis results.

Problem 2:

The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following
file: Data Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?

Solution:

We have used Python to generate graphs for each variable for the univariate analysis, and for all
the variables together for the multivariate analysis. Refer to the Python file for the output. The
findings are as follows:

Univariate analysis

We noted that most of the variables are roughly symmetric, except for the variables Apps, Enroll,
F.Undergrad, P.Undergrad, and Expend, which are skewed to the right. The only variable skewed to the
left is PhD.
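The direction of skew can also be checked numerically rather than only from the graphs. The sketch below uses simulated columns with placeholder names, not the college dataset: a positive skewness value indicates a right tail and a negative value a left tail.

```python
import numpy as np
import pandas as pd

# Simulated columns (illustrative only); names are placeholders.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "right_tailed": rng.exponential(scale=1000.0, size=500),
    "left_tailed": 100.0 - rng.exponential(scale=10.0, size=500),
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=500),
})

# DataFrame.skew() returns one sample skewness value per numeric column.
skewness = df.skew()
print(skewness.round(2))
```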

Multivariate analysis

We observe all 17 variables using a pairplot in Python, which shows how each of the 17 variables
moves with every other variable. Looking at many variables at a time helps us understand the
relationships in the data.

We infer that the points for the variables Apps and F.Undergrad lie towards the left side of the
plots against all the other variables. We also note that the points for the PhD and Terminal
variables lie towards the left-hand side of the graph. Most of the other variables are plotted
around the middle of the graph.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

Solution:

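Yes, scaling is necessary for PCA in this case because the variables are measured in different units
and on different scales. PCA is driven by variance, so without scaling the variables with the largest
values would dominate the components. So before performing PCA we bring all the variables onto a
common scale using the scaling methods in Python.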
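One common scaling method is the z-score, which we sketch below on made-up numbers. The report does not say which scaler was used, so treating it as a z-score is an assumption. Each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np

# Made-up values on very different scales (illustrative only):
# column 0 is a large count, column 1 is a small ratio.
X = np.array([
    [1000.0, 10.0],
    [2500.0, 12.0],
    [1500.0, 11.0],
    [400.0, 8.0],
    [200.0, 14.0],
])

# Z-score: subtract each column's mean and divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print("variance before scaling:", X.var(axis=0, ddof=1))
print("variance after scaling: ", X_scaled.var(axis=0, ddof=1))
```

Before scaling, the first column's variance is far larger than the second's, so it would dominate the first principal component; after scaling both contribute equally.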

2.3 Comment on the comparison between the covariance and the correlation matrices from this
data. [on scaled data]

Solution:

We have generated the output in the Python file. We observe that the covariance matrix and the
correlation matrix are the same on the scaled data, because each scaled variable has unit variance.
Refer to the Python file.
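The equality of the two matrices on scaled data can be verified with a short check. This is a sketch on simulated data; it also shows that if the scaling divides by the population standard deviation (ddof=0) while the covariance uses n - 1, the two matrices differ only by the factor n / (n - 1).

```python
import numpy as np

# Simulated data with three columns on very different scales (illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
X[:, 1] += 0.02 * X[:, 2]  # make two of the columns correlated
n = X.shape[0]

corr_raw = np.corrcoef(X, rowvar=False)

# Scale with the sample standard deviation (ddof=1); np.cov also uses ddof=1 by default.
Z1 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_z1 = np.cov(Z1, rowvar=False)

# Scale with the population standard deviation (ddof=0) instead.
Z0 = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
cov_z0 = np.cov(Z0, rowvar=False)

print(np.allclose(cov_z1, corr_raw))
print(np.allclose(cov_z0, corr_raw * n / (n - 1)))
```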

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

Solution:

We have generated the output in the Python file. Using boxplots in Python, we observe that the
outliers are the same before and after scaling the data. Scaling changes only the units of each
variable, not the relative position of the points, so it does not remove outliers. Refer to the
Python file.
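A boxplot flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule to simulated values before and after z-score scaling; the helper name iqr_outlier_mask and the data are our own assumptions for illustration.

```python
import numpy as np


def iqr_outlier_mask(x):
    """Mark values outside the boxplot whiskers (1.5 * IQR beyond the quartiles)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)


# Simulated values with three planted extreme points at the end (illustrative only).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(100.0, 10.0, size=200), [250.0, 300.0, -80.0]])

# Z-score scaling is a shift and a positive rescale, so the ordering is unchanged.
z = (x - x.mean()) / x.std()

mask_raw = iqr_outlier_mask(x)
mask_scaled = iqr_outlier_mask(z)
print(mask_raw.sum(), mask_scaled.sum())
```

The same points are flagged in both cases, which matches what we observe in the boxplots.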

2.5 Extract the eigenvalues and eigenvectors.

Solution:

We have extracted the eigenvalues and eigenvectors in the Python file.
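For reference, one way to extract eigenvalues and eigenvectors is to decompose the covariance matrix of the scaled data. The following is a minimal sketch on simulated data, not the exact code in the Python file.

```python
import numpy as np

# Simulated data with four variables (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
X[:, 1] += X[:, 0]  # introduce some correlation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov = np.cov(Z, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (the first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i belongs to eigenvalues[i]

print(eigenvalues.round(3))
```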

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features

Solution:

We have completed the PCA and exported the principal components (eigenvectors) into a data
frame with the original features. Refer to the Python file.
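The export step can be sketched as follows: each row of the data frame is one principal component and each column is one original feature. The feature names and data here are placeholders and assumptions, not the columns of the college dataset.

```python
import numpy as np
import pandas as pd

# Placeholder feature names and simulated data (illustrative only).
features = ["feat_a", "feat_b", "feat_c"]
rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))
X[:, 2] += 0.8 * X[:, 0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order].T  # one row per principal component

loadings = pd.DataFrame(
    components,
    columns=features,
    index=[f"PC{i + 1}" for i in range(len(features))],
)
print(loadings.round(2))
```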

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).

Solution:

Refer to the Python file.
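The explicit form is the first eigenvector written out as a weighted sum of the scaled features. The sketch below shows the formatting with two decimal places; the weights and feature names are hypothetical, not the values from our PCA.

```python
def explicit_form(weights, names):
    """Write a principal component as a weighted sum, weights to two decimals."""
    terms = [f"({w:.2f} * {name})" for w, name in zip(weights, names)]
    return " + ".join(terms)


# Hypothetical loadings and feature names, for illustration only.
pc1 = explicit_form([0.25, -0.4, 0.881], ["feat_a", "feat_b", "feat_c"])
print("PC1 = " + pc1)
```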

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

Solution:

The eigenvalues were generated using Python. Five eigenvalues are greater than one, which helps in
selecting the number of principal components, which is 5. Eigenvectors indicate the direction
of the principal components (new axes).
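The cumulative values can be read as the share of total variance explained by the first few components. The sketch below uses hypothetical eigenvalues, not the ones from our analysis, and the 80% threshold is an assumption chosen only for illustration. It shows the greater-than-one rule alongside the cumulative rule.

```python
import numpy as np

# Hypothetical eigenvalues of a correlation matrix with 6 variables (they sum to 6).
eigenvalues = np.array([2.7, 1.5, 0.9, 0.5, 0.3, 0.1])

# Share of total variance per component, and the running total.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
print(cumulative.round(3))

# Rule 1: keep the components whose eigenvalue is greater than one.
n_greater_than_one = int((eigenvalues > 1).sum())

# Rule 2: keep the smallest number of components reaching the chosen threshold.
threshold = 0.80  # assumed threshold, not taken from the report
n_for_threshold = int(np.searchsorted(cumulative, threshold) + 1)

print(n_greater_than_one, n_for_threshold)
```

The two rules need not agree, so it is worth stating which one was used when reporting the number of components.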

2.9 Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components
Obtained]

Solution:

We have carried out the analysis using the Principal Component method for this case study. Principal
component analysis (PCA) is a technique for reducing the dimensionality of datasets with many
variables, increasing interpretability while minimizing information loss. It does so by creating new
uncorrelated variables that successively maximize variance.

In this case, we scaled the data, applied PCA, and generated a heat map of the principal components
against the original variables as output.

We infer from the heat map that the variables highlighted with red rectangular boxes have high
variances compared to the other variables in the data. This helps in taking business decisions for a
particular variable, such as whether to change or upgrade a process. It helps in carrying out
business tasks effectively and efficiently.

Thank
You