
Business Report

Advanced Statistics Project

Ruchi 8/15/21 PGP-DSBA


Table of Contents

Problem 1A
1A.1
1A.2
1A.3

Problem 1B
1B.1
1B.2
1B.3

Problem 2
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
Problem 1A

Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected, and each person's educational qualification and occupation are noted. Educational qualification is at three levels: High school graduate, Bachelor, and Doctorate. Occupation is at four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A different number of observations is in each education-occupation combination.

[Assume that the data follows a normal distribution. In reality, the normality assumption
may not always hold if the sample size is small.]
1A.1 State the null and the alternate hypothesis for conducting one-way ANOVA
for both Education and Occupation individually.

Null and alternate hypotheses for Education:

H0: The mean salary earned by individuals is the same across all education levels.

HA: The mean salary earned by individuals is different for at least one of the education levels.

Null and alternate hypotheses for Occupation:

H0: The mean salary earned by individuals is the same across all occupations.

HA: The mean salary earned by individuals is different for at least one of the occupations.

1A.2 Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

Level of significance: α = 0.05

Since the p-value is less than the significance level, we reject the null hypothesis and conclude that there is a difference in the mean salaries earned by individuals across education levels.
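The test above can be sketched with `scipy.stats.f_oneway`. A minimal sketch: the salary values below are hypothetical placeholders, since the actual SalaryData.csv is not reproduced in this report.

```python
from scipy import stats

# Hypothetical salary samples per education level (illustrative only;
# the real analysis would use the Salary column of SalaryData.csv
# grouped by the Education column).
hs_grad = [25000, 28000, 30000, 27000, 26000]
bachelors = [45000, 50000, 48000, 52000, 47000]
doctorate = [75000, 80000, 78000, 82000, 77000]

# One-way ANOVA: does mean salary differ across the three groups?
f_stat, p_value = stats.f_oneway(hs_grad, bachelors, doctorate)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: mean salaries differ across education levels")
```

The same call, with the groups swapped for the four occupation levels, produces the occupation ANOVA of 1A.3.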

1A.3 Perform a one-way ANOVA on Salary with respect to Occupation. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

Level of significance: α = 0.05

Since the p-value is greater than the level of significance, we fail to reject the null hypothesis. Hence, there is no statistical evidence that the mean salaries earned by individuals differ across occupations.

Problem 1B:
1B.1 What is the interaction between the two treatments? Analyze the effect of one variable on the other (Education and Occupation) with the help of an interaction plot. [Hint: use the 'pointplot' function from the 'seaborn' library.]

 There seems to be a strong interaction between Doctorate and Bachelors for the Adm-clerical and Sales occupations.
 There is also some interaction between Bachelors and HS-grad for the Prof-specialty occupation.
 There is no interaction between Doctorate and HS-grad for any of the occupations.
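The interaction plot is built from the mean salary of every Education x Occupation cell; non-parallel lines across cells indicate interaction. A minimal sketch with hypothetical values standing in for SalaryData.csv (the seaborn call the hint refers to is shown in a comment):

```python
import pandas as pd

# Small hypothetical subset standing in for SalaryData.csv (values illustrative).
df = pd.DataFrame({
    "Education":  ["Doctorate", "Doctorate", "Bachelors", "Bachelors",
                   "HS-grad", "HS-grad", "Doctorate", "Bachelors"],
    "Occupation": ["Adm-clerical", "Sales", "Adm-clerical", "Sales",
                   "Adm-clerical", "Sales", "Prof-specialty", "Prof-specialty"],
    "Salary":     [75000, 72000, 50000, 68000, 28000, 30000, 78000, 52000],
})

# Mean salary per Education x Occupation cell; these are the points the
# interaction plot draws:
#   sns.pointplot(x="Occupation", y="Salary", hue="Education", data=df)
cell_means = df.pivot_table(values="Salary", index="Occupation",
                            columns="Education", aggfunc="mean")
print(cell_means)
```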

1B.2 Perform a two-way ANOVA based on Salary with respect to both Education
and Occupation (along with their interaction Education*Occupation). State the
null and alternative hypotheses and state your results. How will you interpret this
result?

H0: The mean 'Salary' is equal across all levels of Education and all levels of Occupation, and there is no interaction between Education and Occupation.

HA: The mean 'Salary' differs for at least one level of Education or Occupation, or an interaction effect between the two exists.

 For Education, as the p-value is less than the significance level, the null hypothesis is rejected and we can conclude that education level has a significant impact on mean salary.
 For Occupation, as the p-value is greater than the significance level, the null hypothesis cannot be rejected and we can conclude that occupation has minimal to no impact on mean salary.
 Due to the inclusion of the interaction term, there is a slight change in the p-values of the first two treatments compared to the two-way ANOVA without the interaction term. Additionally, the p-value of the Education*Occupation interaction term suggests that its null hypothesis is rejected.
 In order to better understand the interaction effect of Education and Occupation on mean salary, we could add more independent variables.

1B.3 Explain the business implications of performing ANOVA for this particular
case study.

We can assume that the report is intended for someone without technical expertise, so the inferences obtained from the ANOVA will prove very helpful for understanding the given data. The takeaways are:

 Given the interaction effect of education and occupation on salary, occupation does have some level of impact on salary despite its lower individual significance.
 For a few occupations, a higher salary is given to individuals who only have a bachelor's degree rather than a Doctorate. This suggests the data is missing additional independent variables that can impact an individual's salary.
 A person with a Doctorate degree can be considered over-qualified for many job opportunities, which in turn results in people with less education (Bachelors or HS-grad) earning higher salaries in those roles.

Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various
colleges. You are expected to do a Principal Component Analysis for this case
study according to the instructions given. The data dictionary of the 'Education -
Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis
to be performed]. What insight do you draw from the EDA?

Exploratory Data Analysis:

 The Education - Post 12th Standard dataset has 777 rows and 18 variables in total.
 There are no categorical variables in the dataset.
 Most of the variables are of 'int64' type, except Name, which is of 'object' type, and S.F.Ratio, which is of 'float64' type.
 There are no missing values.
 There are no duplicate records.
 There are many outliers in the dataset which, if left untreated, will lead to inaccurate results.

Univariate Analysis:

[Histograms and distribution plots for each numeric variable; not reproduced here]
Additional inferences from the above graphs:

 The mean percentage of students in a university coming from the best (top 10) schools is 27%.
 More than 20 universities have over 50% of their student population coming from the best schools.
 The variables Apps, Accept, Enroll, F.Undergrad and P.Undergrad are highly right-skewed.
 PhD and Terminal are left-skewed.
 Grad.Rate, S.F.Ratio and Room.Board look more or less normally distributed.
 Very few universities have a very low S.F.Ratio of approximately 5.

Multivariate Analysis:

Heatmap showing correlation coefficients.

Observations and inferences:

 A few pairs of variables are strongly correlated:
 Application and Acceptance
 Top10perc and Graduation rate
 Terminal and PhD
 Accept and Enroll
 Top10perc and Top25perc
 Graduate students and Enrollments
 The heat map shows signs of multicollinearity, visible in the significant number of highly correlated feature pairs. Multicollinearity exists when an independent variable is highly correlated with one or more other independent variables in a multiple regression equation. It is a problem because it undermines the statistical significance of an independent variable.

2.2 Is scaling necessary for PCA in this case? Give justification and perform
scaling.

 The main objective of scaling (standardization) is to bring all variables onto a common scale. Since the ranges of the variables may vary widely, it is a necessary preprocessing step for many machine learning algorithms.
 In this method, we convert variables with different scales of measurement into a single scale.
 In this case, we have numeric attributes with entirely different ranges, which makes it difficult to compare the variables. Hence, we need to scale the data, which can be done by applying the z-score or sklearn's StandardScaler.

Dataset after the data has been scaled by applying the z-score:
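The z-score step can be sketched as follows. The two columns below are hypothetical stand-ins for dataset columns on very different scales (e.g. Apps vs S.F.Ratio):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for two columns of the education dataset
# that sit on very different scales.
X = np.array([[1660, 12.9],
              [2186, 10.5],
              [1428, 18.1],
              [ 417,  8.7]])

# z-score: (x - mean) / std, applied column by column.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_scaled.std(axis=0).round(6))   # 1 for every column
```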

2.3 Comment on the comparison between the covariance and the correlation
matrices from this data [on scaled data].

 Correlation is basically the scaled version of covariance; both can take positive, negative, or zero values. A positive value means the variables move together, a negative value means they move in opposite directions, and zero means there is no linear relationship between them.
 Correlation measures both the strength and the direction of the linear relationship between two variables, and is bounded between -1 and +1.
 Covariance measures only the direction of the linear relationship between variables; its magnitude depends on the units of the variables.
 On scaled (z-scored) data, the covariance matrix and the correlation matrix are essentially identical, since every variable has unit variance.
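The last point can be checked numerically with synthetic data (illustrative, not the education dataset). Note the `n/(n-1)` factor: `np.cov` divides by n-1 while the z-scores below use the population standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)  # a correlated pair

# z-score each column (population std, ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov_scaled = np.cov(Z, rowvar=False)  # covariance matrix of the scaled data
corr = np.corrcoef(X, rowvar=False)   # correlation matrix of the raw data

# On standardized data, covariance equals correlation up to the
# (n-1)/n sample-size factor:
print(np.allclose(cov_scaled * (99 / 100), corr))
```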

Correlation Matrix:

Covariance Matrix:

2.4 Check the dataset for outliers before and after scaling. What insight do you
derive here? [Please do not treat Outliers unless specifically asked to do so]

 Box plots showing the outliers before scaling the data:

 Box plots after the data is scaled:

Inferences:
 Looking at the above box plots, scaling did not actually remove the outliers. We need to take care of outliers separately.
 The scaled variables can now be compared easily with each other, as they are all on the same z-score scale.
 After scaling, the maximum and minimum values of the variables are on comparable scales.
 The mean of each variable is approximately 0 and the standard deviation is 1.
 The ranges of all variables are standardized, hence all variables are now unitless.

2.5 Extract the eigenvalues and eigenvectors. [Using sklearn PCA, print both.]

 Eigenvalues and eigenvectors are mainly used to capture the key information stored in a large matrix.
 Eigenvalues are a special set of scalar values associated with a system of linear equations in a matrix equation. Eigenvectors are the non-zero vectors whose direction is unchanged by the linear transformation; they are only stretched by a scalar factor, which is the corresponding eigenvalue.
 Eigenvalues and eigenvectors can improve efficiency in computationally intensive tasks by reducing the dimensions while making sure that the key information is not lost.
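With sklearn, the eigenvalues of the covariance matrix are exposed as `explained_variance_` and the eigenvectors as the rows of `components_`. A minimal sketch on random standardized data (5 columns standing in for the dataset's 17 numeric features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))            # stand-in for the scaled dataset
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
pca.fit(X)

eigenvalues = pca.explained_variance_    # one eigenvalue per component
eigenvectors = pca.components_           # rows are the (unit-length) eigenvectors
print(eigenvalues.round(3))
print(eigenvectors.round(3))
```

`pandas.DataFrame(pca.components_, columns=feature_names)` then exports the eigenvectors against the original features, which is the data frame 2.6 asks for.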

2.6 Perform PCA and export the data of the Principal Component (eigenvectors)
into a data frame with the original features.

Eigen Values:

Variance Ratio:

Inferences:
 The first principal component explains 33.3% of the variance, and the first two components together cover almost 62%. If we consider 8 components, we cover almost 90.8% of the variance.
 We need to first decide the purpose of reducing the dimensionality before choosing an approach, i.e., in order to decide how many eigenvalues and eigenvectors to keep, we need to look at the purpose first.
 The cumulative percentage gives the variance accounted for by the first n components; for example, the cumulative percentage of the second component is the sum of the first and second. This helps us decide how many components to keep.

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors; use values with two decimal places only). [Hint: write the linear equation of the PC in terms of the eigenvectors and the corresponding features.]
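The equation can be assembled programmatically from the first row of `components_`. A minimal sketch, where the three feature names and the random data are hypothetical placeholders for the dataset's 17 features:

```python
import numpy as np
from sklearn.decomposition import PCA

features = ["Apps", "Accept", "Enroll"]  # hypothetical subset of the columns
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))             # stand-in for the scaled data

pca = PCA().fit(X)
pc1 = pca.components_[0]                 # first eigenvector

# Linear equation of the first principal component, two decimal places:
equation = "PC1 = " + " + ".join(
    f"({w:.2f} * {name})" for w, name in zip(pc1, features))
print(equation)
```

On the real data this yields PC1 as a weighted sum of all 17 standardized features, with the eigenvector entries as weights.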

2.8 Consider the cumulative values of the eigenvalues. How does it help you to
decide on the optimum number of principal components? What do the
eigenvectors indicate?

 The first eigenvalue shows that almost 33% of the information in all 17 variables is captured by the first component.
 Similarly, the cumulative value of the first two eigenvalues shows that approx. 62% of the information is covered.
 Continuing this way, almost 90% of the information is covered if we consider only the first 8 components.
 These cumulative values in turn help us decide how many components to keep, which in this case is 8.
 Since the first 8 PCs provide more than 90% of the information, we can discard the rest, as they are less significant in terms of information.
 A scree plot also helps with this analysis: where the slope flattens out and becomes almost linear, the remaining components can be discarded.

In PCA, the eigenvectors (principal components) determine the directions of the new feature space. Each eigenvector gives the weights of a linear combination of the original features; applying this transformation changes the basis of the data without changing the information it contains.
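The "keep enough components for 90%" rule reduces to a cumulative sum. A minimal sketch with hypothetical explained-variance ratios chosen to mirror the values quoted above (33.3%, ~62% cumulative, 8 components for ~90%):

```python
import numpy as np

# Hypothetical explained-variance ratios for 17 components (sum to 1);
# illustrative stand-ins for pca.explained_variance_ratio_.
ratios = np.array([0.333, 0.287, 0.07, 0.06, 0.055, 0.05, 0.03, 0.025,
                   0.02, 0.018, 0.015, 0.012, 0.01, 0.006, 0.005, 0.003,
                   0.001])
cum = np.cumsum(ratios)

# Smallest number of components whose cumulative variance reaches 90%:
n_components = int(np.argmax(cum >= 0.90)) + 1
print(cum.round(3))
print("components kept:", n_components)
```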

2.9 Explain the business implication of using the Principal Component Analysis
for this case study. How may PCs help in the further analysis? [Hint: Write
Interpretations of the Principal Components Obtained]

 The concept of principal components is quite intuitive. Instead of dealing with a large number of possibly correlated variables, principal components are constructed as suitable linear combinations of the observed variables such that the components have two important properties:
 The principal components (PCs) together carry the total variance present in the data.
 PCs are orthogonal, i.e., uncorrelated with one another.
Reduction of dimension involves sacrificing a certain amount of variance. A balance must be struck so that a significant reduction in the number of dimensions is achieved while sacrificing the least possible amount of variance.

 For the above case study, we observed that the original dataset had 18 variables, but after carefully applying the concepts of PCA we were able to reduce it to 8 components, which helps a data scientist perform all sorts of statistical analysis on the problem and obtain correct results.

 By looking at the values we were able to see that the first 8 components covered approx.
90% of the information provided by the dataset hence we were able to reduce the
dimensionality.
