
Business Report

Advanced Statistics Module Project - II

Prasad Mohan
PGPDSBA MAY 21 -A
Date: 15-08-2021

Executive Summary

The dataset Education – Post 12th Standard contains information on various colleges. We perform an Exploratory Data Analysis (EDA) and a Principal Component Analysis (PCA) on it for this case study.

Introduction

The purpose of this study is to perform an initial analysis of the data to identify duplicates, null or missing values, and other anomalies through an EDA. The next step is to check whether the data needs to be scaled and then to perform a PCA to find the key factors that represent the data. This is useful in business because too many dimensions require more resources to compute, and not all factors contribute significantly.

Data Description

# Column Dtype
--- ---------- --------------
1 Names object
2 Apps int64
3 Accept int64
4 Enroll int64
5 Top10perc int64
6 Top25perc int64
7 F.Undergrad int64
8 P.Undergrad int64
9 Outstate int64
10 Room.Board int64
11 Books int64
12 Personal int64
13 PhD int64
14 Terminal int64
15 S.F.Ratio float64
16 perc.alumni int64
17 Expend int64
18 Grad.Rate int64

Sample Dataset

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

The description of the data is as follows:

Table 1 – Data description

Figure 1 – Boxplot of data

1. There are 777 entries and 18 columns
2. There are zero null values
3. Data types – int64, float64 and object
4. No duplicate values are found
5. Nearly all columns contain a large number of outliers (see the boxplot)
6. The columns also show skewness
7. The data is not normally distributed
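These basic checks can be reproduced with a short pandas sketch. This is a minimal illustration only; the CSV file name used here is an assumption, and the column names follow the data description above.

import pandas as pd

# Load the Education - Post 12th Standard dataset (file name assumed)
df = pd.read_csv("Education - Post 12th Standard.csv")

print(df.shape)                  # (rows, columns) -> expected (777, 18)
print(df.isnull().sum().sum())   # total number of null values -> 0
print(df.duplicated().sum())     # number of duplicate rows -> 0
print(df.dtypes.value_counts())  # int64 / float64 / object column counts

# Column-wise skewness of the numeric features supports the boxplot impression
print(df.select_dtypes(include="number").skew().sort_values(ascending=False))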

Multivariate analysis:

For multivariate analysis, a correlation test is performed. Values close to zero indicate no established relationship, values close to 1 indicate a strong positive relationship, and values close to -1 indicate a strong negative relationship.
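A minimal sketch of this step, reusing the DataFrame df loaded above and assuming seaborn is available for the heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations of the numeric columns
corr = df.select_dtypes(include="number").corr()

# A heatmap makes the strongly related pairs easy to spot
plt.figure(figsize=(12, 9))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation matrix")
plt.tight_layout()
plt.show()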

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

Yes, scaling is necessary for PCA in this case. The features are measured on very different scales, and PCA is variance-based, so unscaled features with larger magnitudes would dominate the principal components. Based on our EDA, the data is also skewed and has many outliers. To minimize these effects, we standardize the data through scaling.

Scaled-data

From the new mean, standard deviation and maximum values we can observe that the data has been scaled: each feature now has mean ≈ 0 and standard deviation ≈ 1.

Table-2 Scaled dataset
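A minimal sketch of the scaling step, assuming z-scoring with sklearn's StandardScaler and reusing the DataFrame df from the EDA sketch:

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Drop the non-numeric identifier column before scaling
num_df = df.drop(columns=["Names"])

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(num_df), columns=num_df.columns)

# After z-scoring, every feature has mean ~0 and standard deviation ~1
print(scaled.describe().loc[["mean", "std", "max"]].round(2))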

2.3 Comment on the comparison between the covariance and the correlation matrices from
this data.[on scaled data]

A correlation matrix is used to study the strength of the relationship between two variables: it shows not only the direction of the relationship but also how strong it is. A covariance matrix, on the other hand, captures the direction of the linear relationship between variables, with a magnitude that depends on the units of the variables. Because standardization gives every feature unit variance, the covariance matrix computed on the scaled data is numerically identical to the correlation matrix, which is exactly what we observe when comparing the two.
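This can be verified directly on the scaled data from the previous sketch:

import numpy as np

# On standardized data the covariance and correlation matrices coincide,
# because every feature already has (near) unit variance.
cov_mat = np.cov(scaled.T)        # covariance matrix of the scaled features
corr_mat = np.corrcoef(scaled.T)  # correlation matrix of the scaled features

print(np.allclose(cov_mat, corr_mat, atol=1e-2))  # True (up to the ddof convention)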

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?

Outliers are present both before and after scaling: scaling is a linear transformation of each column, so it changes the scale of the values but does not remove outliers. What scaling does change is the location of the columns, so the medians are nicely aligned (around zero) after scaling.

Fig – Outliers before scaling

Fig – Outliers after scaling
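A minimal sketch of the outlier check, reusing num_df (unscaled) and scaled from the sketches above. Because z-scoring is a linear transformation of each column, the IQR-based outlier counts are identical before and after scaling:

# Count outliers per column using the 1.5*IQR rule
def iqr_outlier_count(frame):
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    mask = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return mask.sum()

print(iqr_outlier_count(num_df))   # outlier counts before scaling
print(iqr_outlier_count(scaled))   # identical counts after scaling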

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]

Eigenvalues:

Eigenvectors:
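A minimal sketch of how these were obtained with sklearn's PCA, reusing the scaled DataFrame:

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(scaled)

# Eigenvalues of the covariance matrix of the scaled data
print("Eigenvalues:\n", pca.explained_variance_.round(2))

# Eigenvectors: one row per principal component, one column per original feature
print("Eigenvectors:\n", pca.components_.round(2))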

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features

Steps:

1) Standardisation of the data
2) Computation of the covariance matrix
3) Computation of the eigenvalues and eigenvectors
4) Feature reduction / dimensionality reduction
5) Recasting the data

Step 1 is already completed. From the scree plot we can observe the following:

Fig – scree plot

The first 12 principal components explain nearly 97% of the variance in the data; hence we select only 12. After transforming the array, we get the output below:
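A minimal sketch of exporting the eigenvectors into a DataFrame with the original feature names, drawing the scree plot, and recasting the data onto the retained components (reusing pca, num_df and scaled from the previous sketches):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Eigenvectors (loadings) as a DataFrame keyed by the original feature names
loadings = pd.DataFrame(
    pca.components_,
    columns=num_df.columns,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)
print(loadings.round(2).head())

# Scree plot: individual and cumulative explained-variance ratio
cum_var = np.cumsum(pca.explained_variance_ratio_)
components = range(1, len(cum_var) + 1)
plt.plot(components, pca.explained_variance_ratio_, "o-", label="individual")
plt.plot(components, cum_var, "s--", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()

# Scores of the 777 colleges on the 12 retained components (the transformed array)
scores = pd.DataFrame(pca.transform(scaled)[:, :12],
                      columns=[f"PC{i+1}" for i in range(12)])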

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with
two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors
and corresponding features]

PC1 = 0.25*Apps + 0.21*Accept + 0.18*Enroll + 0.35*Top10perc + 0.34*Top25perc + 0.15*F.Undergrad + 0.03*P.Undergrad + 0.29*Outstate + 0.25*Room.Board + 0.06*Books - 0.04*Personal + 0.32*PhD + 0.32*Terminal - 0.18*S.F.Ratio + 0.21*perc.alumni + 0.32*Expend + 0.25*Grad.Rate

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components?

Typically, we want the cumulative explained variance to be between 95% and 99%. From the cumulative eigenvalues (and the scree plot) we observe that the first 12 components explain nearly 97% of the variance in the data; hence we select 12. If we decide to limit ourselves to 95%, we can work with fewer than 12 components.
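Alternatively, sklearn can pick the smallest number of components that reaches a chosen cumulative-variance threshold directly; a minimal sketch, reusing the scaled data:

from sklearn.decomposition import PCA

# Keep the smallest number of components whose cumulative variance reaches 95%
pca_95 = PCA(n_components=0.95)
reduced = pca_95.fit_transform(scaled)

print(pca_95.n_components_, "components retain",
      round(pca_95.explained_variance_ratio_.sum(), 3), "of the total variance")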

2.9 Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]

The data we had to work with was skewed and had many outliers. Moreover, we had to work with 17 numeric features. Dimensionality reduction should not be arbitrary but should be driven by the use case as well. In this case, 12 components were able to explain more than 97% of the variance. If the business requirements are not very demanding, we can even work with 8 components that explain nearly 90% of the variance. So the number of PCs to retain mainly depends on the business requirements.

******
