
Project: Advanced Statistics

Abhishek Gautam
PGP-DSBA
Date: 17th Oct 2021

1.1 Answer

One way ANOVA (Education)

Null Hypothesis H0: The mean salary is the same across all three education categories (Doctorate, Bachelors, HS-grad).

Alternate Hypothesis H1: The mean salary is different in at least one education category.

One way ANOVA (Occupation)

Null Hypothesis H0: The mean salary is the same across all four occupation categories (Prof-specialty, Sales, Adm-clerical, Exec-managerial).

Alternate Hypothesis H1: The mean salary is different in at least one occupation category.

1.2 Answer

One way ANOVA for ‘Education’

The ANOVA table for the Education variable is shown above.

Since the p-value = 1.257709e-08 is less than the significance level (alpha = 0.05), we reject the null hypothesis and conclude that there is a significant difference in the mean salary for at least one category of education.
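Below is a minimal sketch of how this one-way ANOVA could be produced with statsmodels; the file name SalaryData.csv and the column names Salary and Education are assumptions, not confirmed by the report.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv('SalaryData.csv')

# One-way ANOVA: does mean Salary differ across Education levels?
model = ols('Salary ~ C(Education)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # the PR(>F) column is the p-value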

One way ANOVA for ‘Occupation’

Since the p-value = 0.458508 is greater than the significance level (alpha = 0.05), we fail to reject the null hypothesis and conclude that there is no significant difference in the mean salaries across the four categories of occupation.

1.3 ANOVA for Occupation

Since the p-value is greater than alpha, we cannot reject H0 for Occupation.

1.4

1.5

Interaction between Education and Occupation

From the above interaction plot (a sketch for reproducing such a plot follows these observations) we can see that:
 Adm-clerical jobs pay comparably for Bachelors and Doctorates.
 Sales jobs pay about the same for Bachelors and Doctorates.
 Prof-specialty jobs show only a slight difference between HS-grads and Bachelors.
 Exec-managerial salaries for HS-grads and Doctorates are slightly higher.
 From the plot we can also see that people at each education level:
o Doctorates: are in the higher salary brackets, mostly in Prof-specialty roles, Exec-managerial roles, or Sales profiles; very few are doing Adm-clerical jobs.
o Bachelors: fall in the mid income range and are found mostly working as Exec-managers, Adm-clerks, or in Sales, but very few are found in Prof-specialty profiles.
o HS-grads: are in the low income brackets, mostly doing Prof-specialty or Adm-clerical work; a few are doing Sales, but hardly any are in an Exec-managerial role.
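A minimal sketch for drawing such an interaction plot with statsmodels, reusing the hypothetical df from the ANOVA sketch above:

import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

# Mean Salary per Education level, with one line (trace) per Occupation.
fig = interaction_plot(x=df['Education'], trace=df['Occupation'],
                       response=df['Salary'])
plt.show()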

1.6

H0: The mean salary is equal across all occupation types and across all education levels.

H1: At least one occupation type or education level has a different mean salary.

Alpha = 0.05

If p < 0.05, reject H0.

If p > 0.05, fail to reject H0.

Two way ANOVA (without interaction)

                 df   sum_sq     mean_sq    F          PR(>F)
C(Education)      2   1.03E+11   5.13E+10   31.25768   1.98E-08
C(Occupation)     3   5.52E+09   1.84E+09   1.12008    3.55E-01
Residual         34   5.59E+10   1.64E+09   NaN        NaN

We can see that there is some interaction between the two treatments, so we introduce an interaction term and perform the two way ANOVA again.

Two way ANOVA (with interaction)

                             df   sum_sq     mean_sq    F          PR(>F)
C(Education)                  2   1.03E+11   5.13E+10   72.21196   5.47E-12
C(Occupation)                 3   5.52E+09   1.84E+09   2.587626   7.21E-02
C(Education):C(Occupation)    6   3.63E+10   6.06E+09   8.519815   2.23E-05
Residual                     29   2.06E+10   7.11E+08   NaN        NaN

Since the p-value of the interaction term (2.23E-05) is less than the significance level (0.05), we can reject the null hypothesis.

Result: the mean salary is different for at least one combination of education and occupation.
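A minimal sketch of both two-way ANOVA fits with statsmodels, again assuming the hypothetical df and column names introduced earlier:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Additive model: main effects of Education and Occupation only.
m1 = ols('Salary ~ C(Education) + C(Occupation)', data=df).fit()
print(sm.stats.anova_lm(m1, typ=2))

# Model with the Education x Occupation interaction term added.
m2 = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)',
         data=df).fit()
print(sm.stats.anova_lm(m2, typ=2))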

1.7

2.1

2.2 Answer

 Our data set has 18 components, hence we get 18 principal components.
 Before performing PCA it is necessary to normalize the data.
 PCA calculates a new projection of our data set.
 Without normalization, PCA is skewed towards high-magnitude features. If we normalize our data, all variables have the same standard deviation, thus all variables have the same weight, and PCA calculates the relevant axes. Scaling can also speed up gradient descent and other iterative calculations.
 Scaling of data (z-score; a sketch follows this list):
 Z = (value - mean) / standard deviation
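A minimal sketch of this z-score scaling with scikit-learn; the file name College.csv and the DataFrame names are assumptions based on the variables listed below.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; keep numeric columns only.
df = pd.read_csv('College.csv')
num_df = df.select_dtypes(include='number')

scaler = StandardScaler()  # applies z = (x - mean) / std column-wise
scaled = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)
print(scaled.describe())   # mean ~ 0 and std ~ 1 for every column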

Statistical description:

             count  mean        std         min   25%   50%   75%    max    IQR   ll        ul
Apps           777  3001.63835  3870.20148    81   776  1558   3624  48094  2848  -3496     7896
Accept         777  2018.80438  2451.11397    72   604  1110   2424  26330  1820  -2126     5154
Enroll         777  779.972973  929.17619     35   242  434    902   6392   660   -748      1892
Top10perc      777  27.558559   17.640364      1    15  23     35    96     20    -15       65
Top25perc      777  55.796654   19.804778      9    41  54     69    100    28    -1        111
F.Undergrad    777  3699.90734  4850.42053   139   992  1707   4005  31643  3013  -3527.5   8524.5
P.Undergrad    777  855.298584  1522.43189     1    95  353    967   21836  872   -1213     2275
Outstate       777  10440.6692  4023.01648  2340  7320  9990  12925  21700  5605  -1087.5   21332.5
Room.Board     777  4357.52638  1096.69642  1780  3597  4200   5050  8124   1453  1417.5    7229.5
Books          777  549.380952  165.10536     96   470  500    600   2340   130   275       795
Personal       777  1340.64221  677.071454   250   850  1200   1700  6800   850   -425      2975
PhD            777  72.660232   16.328155      8    62  75     85    103    23    27.5      119.5
Terminal       777  79.702703   14.722359     24    71  82     92    100    21    39.5      123.5
S.F.Ratio      777  14.089704   3.958349     2.5  11.5  13.6   16.5  39.8   5     4         24
perc.alumni    777  22.743887   12.391801      0    13  21     31    64     18    -14       58
Expend         777  9660.17117  5221.76844  3186  6751  8377  10830  56233  4079  632.5     16948.5
Grad.Rate      777  65.46332    17.17771      10    53  65     78    118    25    15.5      115.5

(ll and ul are the outlier fences: ll = Q1 - 1.5*IQR, ul = Q3 + 1.5*IQR.)

Observations:
 After scaling, the standard deviation is 1.0 for all variables.
 Post scaling, the difference between the Q1 (25%) value and the minimum is smaller than in the original dataset for most of the variables.

2.3 Answer

 Both covariance and correlation matrices measure the relationship and the dependency between two variables.
 “Covariance” indicates the direction of the linear relationship between variables.
 “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables.
 Correlation is the scaled form of covariance; covariance is affected by a change in scale, while correlation is not.
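A short sketch contrasting the two matrices on the scaled data from the earlier sketch, and drawing the heatmap shown below (the seaborn usage is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

cov_matrix = scaled.cov()    # direction of linear relationships (scale-dependent)
corr_matrix = scaled.corr()  # strength and direction, scale-free, in [-1, 1]

sns.heatmap(corr_matrix, cmap='coolwarm')
plt.show()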

Heatmap

Observation:
 The highest correlation is seen among:
o Enroll with F.Undergrad
o Enroll with Accept
o Apps with Accept
 The least correlation is observed between the S.F.Ratio variable and: Expend, Outstate, Grad.Rate, perc.alumni, Room.Board and Top10perc.

2.4

After standardization of the data, a box plot is drawn again on the scaled data, and the describe function is used. Scaling makes little difference in terms of outlier reduction.

Box plot with scaled data:

Outliers influence the computation of the empirical mean and standard deviation; StandardScaler therefore cannot guarantee balanced feature scales in the presence of outliers.

Scaled data:

[Descriptive statistics table of the scaled data]

2.5 Perform PCA
In the table below we can see that the first principal component explains 33.12% of the variance in our dataset, while the first seven components capture 70.12% of the variance.
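A minimal sketch of the PCA fit with scikit-learn, reusing the hypothetical scaled DataFrame from the scaling sketch:

from sklearn.decomposition import PCA

pca = PCA()      # keep all components
pca.fit(scaled)

# Fraction of variance explained by each principal component.
print(pca.explained_variance_ratio_)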

Heat map: correlation matrix between the principal components and the original features

2.6 Answer

In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC1 is given by the linear combination of the original variables X_1, X_2, …, X_p:

PC_1 = w_{11}X_1 + w_{12}X_2 + … + w_{1p}X_p

The first principal component PC1 represents the component that retains the maximum variance of the data. The weight vector w_1 corresponds to an eigenvector of the covariance matrix

E = X^T X / (n - 1)

The explicit form of PC1 is as below:

[ 0.2487656,  0.2076015,  0.17630359,  0.35427395,  0.34400128,
  0.15464096, 0.0264425,  0.29473642,  0.24903045,  0.06475752,
 -0.04252854, 0.31831287, 0.31705602, -0.17695789,  0.20508237,
  0.31890875, 0.25231565 ]
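Under the same assumptions, these loadings are the first row of pca.components_, and the PC1 score of each observation is its projection onto that vector:

import numpy as np

w1 = pca.components_[0]          # loadings (weights) of PC1
pc1_scores = scaled.values @ w1  # projection of each observation onto PC1
print(np.round(w1, 4))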

2.7 Answer

Observations:
 The plot visually shows how much of the variance are explained, by how many principle
components.
 In the below plot we see that, the 1st PC explains variance 33.13%, 2nd PC explains 57.19% and so
on.
 Effectively we can get material variance explained (ie. 90%) by analysing 9 principle components
instead all of the 17 variables(attributes) in the dataset
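A sketch of the cumulative-variance plot and the 90% cut-off behind this observation, under the same assumptions as above:

import matplotlib.pyplot as plt
import numpy as np

cum_var = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(0.90, linestyle='--')    # 90% variance threshold
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.show()

# Smallest number of components explaining at least 90% of the variance.
print(np.argmax(cum_var >= 0.90) + 1)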

PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because rotation is a kind of linear transformation, the new dimensions are sums of the old ones. The eigenvectors (principal components) determine the directions, or axes, along which the linear transformation acts, stretching or compressing input vectors. They are the lines of change that represent the action of the larger matrix, the very “line” in linear transformation.

2.8
Business implications of using PCA
 PCA is used for EDA, and predictive models are then built on the continuous variables it produces.
 PCA performs dimensionality reduction by projecting each data point onto the first few principal components, obtaining lower-dimensional data while preserving as much of the data’s variation as possible. The more components retained, the more variance is preserved.
 The first principal component can equivalently be defined as the direction that maximizes the variance of the projection.

