
DSBA

DIPTI PATIL
Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each
person’s educational qualification and occupation are noted. Educational qualification is at
three levels, High school graduate, Bachelor, and Doctorate. Occupation is at four levels,
Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A
different number of observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may
not always hold if the sample size is small.]
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.

Hypothesis for conducting one-way ANOVA for Education with respect to Salary:

Ho: The average salary is equal across all Education levels.

Ha: The average salary differs for at least one Education level.

Hypothesis for conducting one-way ANOVA for Occupation with respect to Salary:

Ho: The average salary is equal across all Occupation levels.

Ha: The average salary differs for at least one Occupation level.

2. Perform a one-way ANOVA on Salary with respect to Education. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

                   df    sum_sq        mean_sq       F         PR(>F)
C(sd.Education)    2.0   1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual           37.0  6.137256e+10  1.658718e+09  NaN       NaN

Assumption: Significance level (α) = 0.05

From the above table, we can see that the p-value (1.26e-08) < 0.05; therefore, we reject the null hypothesis.

Conclusion: At least one level of Education affects the average salary.

3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

Assumption: Significance level (α) = 0.05


                   df    sum_sq        mean_sq       F         PR(>F)
C(sd.Occupation)   3.0   1.125878e+10  3.752928e+09  0.884144  0.458508
Residual           36.0  1.528092e+11  4.244701e+09  NaN       NaN

From the above table, we can see that the p-value (0.4585) > 0.05; therefore, we fail to reject the null hypothesis.

Conclusion: No level of Occupation has a significant effect on the average salary.

4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result.

According to the plot above,

• HS graduates have a significantly lower mean salary than Doctorates and Bachelors.
• The mean salary for Bachelors and Doctorates is close for all occupations (mostly between 165K and 225K), except for Prof-specialty.
Problem 1B:

1. What is the interaction between two treatments? Analyze the effects of one variable
on the other (Education and Occupation) with the help of an interaction plot. [hint:
use the ‘pointplot’ function from the ‘seaborn’ library]

According to the above plot,

• There seems to be a moderate interaction between the two treatments (categorical variables).
• The interaction effect is less pronounced between the Bachelors and Doctorate treatments, but it is significant for HS-grad.
• The interaction effect for Prof-specialty is quite significant across treatments: Doctorates are paid much more for Prof-specialty than the other education levels.
• Education-wise, the mean salary for HS graduates is much lower than that for Bachelors and Doctorates.
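A minimal sketch of the interaction plot using seaborn's pointplot; the level names and salaries here are illustrative stand-ins for the case-study data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative stand-in for SalaryData.csv
sd = pd.DataFrame({
    "Occupation": ["Sales", "Sales", "Exec-managerial", "Exec-managerial"] * 2,
    "Education": ["HS-grad", "Doctorate"] * 4,
    "Salary": [40, 150, 60, 180, 45, 155, 62, 175],
})

# One line per Education level; crossing/diverging lines suggest interaction
ax = sns.pointplot(x="Occupation", y="Salary", hue="Education", data=sd)
ax.set_title("Interaction: Education x Occupation")
plt.savefig("interaction_plot.png")
```

Parallel lines in such a plot indicate little interaction; non-parallel or crossing lines, as seen for HS-grad and Prof-specialty in the report, indicate an interaction effect.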

2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?

Hypotheses for conducting a two-way ANOVA for Education and Occupation (with interaction) with respect to Salary:

Ho (Education): The average salary is equal across all Education levels.
Ho (Occupation): The average salary is equal across all Occupation levels.
Ho (Interaction): There is no interaction effect between Education and Occupation on the average salary.
Ha: In each case, at least one level (or level combination) has a different average salary.
Assumption: Significance level (α) = 0.05

Two-way ANOVA without the interaction term:

                                  df    sum_sq        mean_sq       F          PR(>F)
C(sd.Education)                   2.0   1.026955e+11  5.134773e+10  31.257677  1.981539e-08
C(sd.Occupation)                  3.0   5.519946e+09  1.839982e+09  1.120080   3.545825e-01
Residual                          34.0  5.585261e+10  1.642724e+09  NaN        NaN

Two-way ANOVA with the interaction term:

                                  df    sum_sq        mean_sq       F          PR(>F)
C(sd.Education)                   2.0   1.026955e+11  5.134773e+10  72.211958  5.466264e-12
C(sd.Occupation)                  3.0   5.519946e+09  1.839982e+09  2.587626   7.211580e-02
C(sd.Education):C(sd.Occupation)  9.0   3.813970e+10  4.237744e+09  5.959675   1.043931e-04
Residual                          29.0  2.062102e+10  7.110697e+08  NaN        NaN

From the above tables, we can see that:

• The p-value for ‘Education’ is < 0.05; therefore, we reject the null hypothesis. The levels of Education have a significant effect on the average salary earned.
• The p-value for the ‘Education × Occupation’ interaction is < 0.05; therefore, we reject the null hypothesis. At least one Education–Occupation combination has a significant effect on the average salary earned.
• The p-value for ‘Occupation’ is > 0.05; therefore, we fail to reject the null hypothesis. The levels of Occupation on their own have no significant effect on the average salary earned.

3. Explain the business implications of performing ANOVA for this particular case study.

• The average salary depends strongly on the level of Education and only weakly on the level of Occupation.
• Even though the point plot suggests Occupation does affect the average salary earned (Doctorates and Bachelors earn more than HS-grads), we may need to capture more observations for the ANOVA test to detect this effect.
• The Prof-specialty occupation pays Doctorates far more than Bachelors and HS-grads. This may mean that few Doctorate candidates are available for this position, or that the work requirement is very demanding, hence the highest average salary among Doctorates.
• For the Sales and Clerical occupations, the average salary for Bachelors and Doctorates is almost similar. This suggests that a higher qualification such as a Doctorate is not preferred or required for these jobs.
• The ANOVA test suggests that the combination of Education and Occupation levels does affect the average salary earned by a candidate.

Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You
are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be
found in the following file: Data Dictionary.xlsx.

Q1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

Exploratory Data Analysis:


• The dataset has 777 observations and 18 variables.
• 16 variables have datatype ‘int’, the ‘Names’ variable is of object datatype, and ‘S_F_Ratio’ is of float datatype.
• There are no missing values or duplicate rows.
• The describe function gives the five-point summary of the dataset.

• describe shows that most of the variables (Apps, Accept, Enroll, Top10perc, F_undergrad, P_undergrad, Personal, Expend) are right skewed, as the mean value > median value.
• For the other variables the mean is almost equal to the median, which suggests a roughly normal distribution. We will confirm this with distplots in the univariate analysis.
• From the distplots we see that variables like ‘PhD’ and ‘Terminal’ are somewhat left skewed.

• It is seen from the box plots above that variables like Apps, Accept, Enroll, F_undergrad, P_undergrad, Personal, and Expend have many outliers.
• From the correlation matrix above we can see that there is high correlation between:
o Applications and Applications Accepted.
o Enrolment and Applications Accepted.
o Full-time undergraduates and Enrolment.
o Top 10 percentage and Top 25 percentage.
o Top 10 percentage and Expenditure.
• The heatmap also shows the correlation between variables, which causes multicollinearity. Multicollinearity undermines the statistical significance of the independent variables.

Q2. Is scaling necessary for PCA in this case? Give justification and perform scaling.

Yes, scaling is necessary for PCA in this case. The main objective of scaling is to bring all the variables onto the same scale so they can be compared. It is a data-preprocessing step applied to the independent variables or features of the data.

For example, in the given data some variables are in tens of thousands, some in hundreds, and some in percentages, so the variables are not directly comparable. Since large values would otherwise outweigh smaller ones, we perform scaling.

We apply the Z-score method here.

Above is the scaled data, which we will use for PCA.
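A sketch of Z-score scaling with scikit-learn; the small frame here, with two illustrative column names from the data dictionary, stands in for the full CSV:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative subset of the numeric features (values made up)
df = pd.DataFrame({"Apps": [1660, 2186, 1428, 417],
                   "Top10perc": [23, 16, 22, 60]})

# StandardScaler subtracts the column mean and divides by the column std
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.mean().round(6))       # ~0 for every column
print(scaled.std(ddof=0).round(6))  # ~1 for every column (population std)
```

Equivalently, `(df - df.mean()) / df.std(ddof=0)` computes the same Z-scores by hand.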

Q3. Comment on the comparison between the covariance and the correlation matrices from this
data [on scaled data].

• After standardization, the covariance matrix is equal to the correlation matrix. The two measures always have the same sign (positive, negative, or 0): when the sign is positive the variables are positively correlated, when negative they are negatively correlated, and when 0 they are uncorrelated.
• Correlation measures both the strength and the direction of the linear relationship between two variables.
• Covariance measures how much two variables change in tandem; it indicates only the direction of the linear relationship between them.
• Below are the covariance and correlation matrices for the scaled data, which are equivalent.
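The equality of the two matrices on Z-scaled data can be verified numerically; random correlated data stands in for the scaled dataset here:

```python
import numpy as np

# Random correlated data in place of the scaled Education dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # Z-score (population std)

cov = np.cov(Z, rowvar=False, ddof=0)   # covariance of the scaled data
corr = np.corrcoef(Z, rowvar=False)     # correlation of the scaled data
print(np.allclose(cov, corr))           # True: they coincide after scaling
```

The diagonal of the covariance matrix is 1 after scaling, since each column has unit variance.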
Q4. Check the dataset for outliers before and after scaling. What insight do you derive here?
[Please do not treat Outliers unless specifically asked to do so]
Before Scaling:

After Scaling:
Insights:
• The scaled dataset has similar max values and comparable min values across variables.
• The mean value of each variable is close to 0 and the standard deviation close to 1.
• The ranges of the variables are hence standardized, and all quantities are unit-less.
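The insight that scaling standardizes the range but does not remove outliers can be checked with an IQR-based count; the columns and values below are illustrative:

```python
import pandas as pd

# Tiny illustrative frame: 5000 is an obvious outlier in 'Apps'
df = pd.DataFrame({"Apps": [100, 120, 110, 105, 5000],
                   "PhD": [60, 62, 61, 63, 64]})

def iqr_outliers(s):
    """Count values beyond the usual 1.5*IQR whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

# Z-scaling is a linear transform, so the same rows are flagged either way
scaled = (df - df.mean()) / df.std()
before = df.apply(iqr_outliers)
after = scaled.apply(iqr_outliers)
print(before.equals(after))  # True: scaling does not remove outliers
```

This is why the before/after box plots show the same outlier pattern, only on a different axis scale.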
Q5. Extract the eigenvalues and eigenvectors.[print both]

Kindly refer to the Jupyter notebook for all the eigenvector values.

• Eigenvalues and eigenvectors are mainly used to capture the key information stored in a large matrix.
• They provide a summary of the large matrix.
• Performing computation on a large matrix is slow and requires more memory and CPU; eigenvalues and eigenvectors can improve efficiency in computationally intensive tasks by reducing dimensions while ensuring the key information is maintained.
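Eigenvalues and eigenvectors can be extracted from the covariance matrix of the scaled data as below; random data with 4 features stands in for the 17 numeric features:

```python
import numpy as np

# Random stand-in for the scaled dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]       # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)   # eigenvalues, largest first
print(eigvecs)   # columns are the corresponding eigenvectors
```

The eigenvalues sum to the total variance (the trace of the covariance matrix), which is what makes the cumulative-variance argument in Q8 work.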

Q6. Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features

For performing PCA, we need to follow below steps:


• Step 1: Generate the covariance matrix
• Step 2: Get eigenvalues and eigenvector
• Step 3: View Scree Plot to identify the number of components to be built
• Step 4: We can perform PCA on the scaled dataset by importing PCA from
sklearn.decomposition. We get the following component output:

Loading of each feature on the components:

For all values of the loading components, kindly refer to the Jupyter notebook.

We load these components into a dataframe along with the list of columns we had earlier considered in the scaled data. Below is a representative screenshot showing the portion of each variable's variance captured by the PCs.
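The steps above can be sketched as follows; the four column names are an illustrative subset of the dataset's features, and random data stands in for the scaled values:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in for the scaled data; columns are an illustrative subset
rng = np.random.default_rng(2)
cols = ["Apps", "Accept", "Enroll", "Top10perc"]
X = rng.normal(size=(100, len(cols)))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit PCA and put the loadings (eigenvectors) in a labelled DataFrame
pca = PCA(n_components=len(cols))
pca.fit(Z)
loadings = pd.DataFrame(pca.components_,
                        columns=cols,
                        index=[f"PC{i + 1}" for i in range(len(cols))])
print(loadings.round(2))
```

Each row of `loadings` is one principal component expressed in terms of the original features, which is exactly the form asked for in Q7.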
Q7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with
two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors
and corresponding features]
Explicit form of the first PC
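A sketch of printing PC1 as a linear combination of the (scaled) features, with weights rounded to two decimals; the three feature names and random data are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in for the scaled dataset
rng = np.random.default_rng(3)
cols = ["Apps", "Accept", "Enroll"]
X = rng.normal(size=(100, len(cols)))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(Z)
pc1 = pca.components_[0]  # eigenvector of the first principal component

# Assemble the explicit linear equation, two decimal places per weight
equation = " + ".join(f"({w:.2f} * {c})" for w, c in zip(pc1, cols))
print("PC1 =", equation)
```

On the actual dataset the same loop over `pca.components_[0]` and the 17 feature names produces the equation shown in the report.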

Q8. Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

• As we see from the picture above, the first eigenvalue captures 33.26% of the information represented by all 17 variables.
• Similarly, the 1st and 2nd eigenvalues together capture 62.02% of the information, and so on.
• It is observed that by considering up to the 8th eigenvalue, the total variance captured is 90.79%.
• Hence, in this case, the first 8 eigenvalues capture 90.79% of the information present in the 17 numeric features of the original dataset.
• Eigenvectors indicate the directions of the principal components; we can multiply the original data by the eigenvectors to re-orient the data onto the new axes.

The picture above is the scree plot:
The Y-axis represents the variance captured from the features/dimensions/variables.
The X-axis represents the principal components.
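The cumulative-variance rule used above (keep enough PCs to cross roughly 90%) can be sketched as follows, with random correlated data standing in for the scaled dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random correlated stand-in for the scaled 17-feature dataset
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(Z)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pcs = int(np.argmax(cum_var >= 0.90)) + 1  # first count reaching 90%
print(n_pcs, cum_var.round(4))
```

Plotting `pca.explained_variance_ratio_` against the component index gives the scree plot described above.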

Q9. Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal
Components Obtained]

• Principal component analysis is a technique for dimension reduction: it combines the input variables in a specific way so that the “least important” variables can be dropped while the most valuable parts of all the variables are retained. As an added benefit, the “new” variables after PCA are all independent of one another. This is a benefit because linear models assume the independent variables are independent of one another; if we decide to fit a linear regression model with these “new” variables (principal component regression), this assumption is necessarily satisfied.

• There are 17 numeric variables in the given dataset. After doing PCA, we deduce that we can reduce the number of variables from 17 to 8, as 8 PCs capture about 90.79% of the information present in the dataset.

• Alternatively, further analysis could be done on 5 PCs, which capture 79.66% of the information in the 17 features, if that meets the business requirement.

• Below is the correlation map between PCs and the 17 variables/features.
