
San José State University

Math 253: Mathematical Methods for Data Visualization

Principal Component Analysis (PCA)


– A First Dimensionality Reduction Approach

Dr. Guangliang Chen



Introduction
• Many data sets nowadays have very high dimensions, which poses significant challenges in storing and processing them.

• We need a way to reduce the dimensionality of the data in order to lower the memory requirement while increasing processing speed.

• If we discard some dimensions, will that degrade performance?

• The answer can be no, as long as we do it carefully by preserving only the information that is needed by the task. In fact, it may even lead to better results in many cases.


Different dimensionality reduction algorithms preserve different kinds of information (when reducing the dimension):

• Principal Component Analysis (PCA): variance

• Multidimensional Scaling (MDS): distance

• ISOMap: geodesic distance

• Local Linear Embedding (LLE): local geometry

• Laplacian Eigenmaps: local affinity

• Linear Discriminant Analysis (LDA): separation among classes


A demonstration

“Useful” information of a data set is often contained in only a small number of dimensions.

[Figure: a scatter plot of data points illustrating this.]


Another example

Average intensity value of each pixel of the MNIST handwritten digits:


[Figure: plot of the mean intensity value at each pixel, with pixel index 0–800 on the horizontal axis and mean intensity 0–0.6 on the vertical axis.]

• Boundary pixels tend to be zero;

• Number of degrees of freedom of each digit is much less than 784.
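
The curve above is easy to reproduce. Below is a minimal MATLAB sketch, assuming the digit images are stored as the rows of an n × 784 matrix X with intensities scaled to [0, 1]:

m = mean(X, 1);                                 % average intensity at each of the 784 pixels
plot(m)                                         % boundary pixels show up as (near) zeros
xlabel('pixel index'), ylabel('mean intensity')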


The one-dimensional PCA problem


Problem. Given a set of data points x1, . . . , xn ∈ Rd, find a line S parametrized by x(t) = t · v + b (with ‖v‖ = 1) such that the orthogonal projections of the data onto the line,

    PS(xi) = v vT(xi − b) + b = ai v + b,    where ai := vT(xi − b),    1 ≤ i ≤ n,

have the largest possible variance.

[Figure: data points xi and two parallel lines, x = tv + b and x = tv + b′, with the projection coefficient ai marked along the first line.]


Mathematical formulation

First observe that for parallel lines, the projections are different, but the
amounts of variance are the same! ←− This implies that the choice of b
is not unique.

To make the problem well defined, we add a constraint by requiring that


    0 = ā = (1/n) Σ ai = (1/n) Σ vT(xi − b) = vT(x̄ − b).

This yields b = x̄ = (1/n) Σ xi, i.e., we only consider lines passing through the centroid of the data set.


We have thus eliminated the variable b from the problem, so that we only
need to focus on the unit-vector variable v (representing the direction of
the line).

Since we now have ā = 0, the variance of the projections is simply


    (1/(n − 1)) Σ ai²,

and we can correspondingly reformulate the original problem as follows:

    max_{v: ‖v‖=1} Σ ai²   (the “scatter”),    where ai = vT(xi − x̄).


Let us further rewrite the objective function:


    Σ ai² = Σ vT(xi − x̄) · (xi − x̄)Tv
          = Σ vT [(xi − x̄)(xi − x̄)T] v
          = vT [ Σ (xi − x̄)(xi − x̄)T ] v
          = vT C v,    where C := Σ (xi − x̄)(xi − x̄)T is a d × d matrix.

Remark. The matrix C is called the sample covariance matrix or scatter matrix of the data. It is square, symmetric, and positive semidefinite, because it is a sum of such matrices!


Accordingly, we have obtained the following (Rayleigh quotient) problem

    max_{v: ‖v‖=1} vT Cv

By applying the Rayleigh quotient theorem, we can easily obtain the following result.

Theorem 0.1. Given a set of data points x1, . . . , xn in Rd with centroid x̄ = (1/n) Σ xi, the optimal direction for projecting the data (in order to have maximum variance) is the top eigenvector of the sample covariance matrix C = Σ (xi − x̄)(xi − x̄)T:

    max_{v: ‖v‖=1} vT Cv = λ1    (the maximum scatter), achieved when v = v1.


Remark. It can be shown that

    max_{v: ‖v‖=1, v1Tv=0} vT Cv = λ2,    achieved when v = v2;

    max_{v: ‖v‖=1, v1Tv=0, v2Tv=0} vT Cv = λ3,    achieved when v = v3.

This shows that v2, v3, etc. are the next best orthogonal directions.


For each 1 ≤ i ≤ n, let

    ai = v1T(xi − x̄),    bi = v2T(xi − x̄)

(and so on for subsequent orthogonal directions). The scatter of the projections of the data onto each of those directions is

    Σ ai² = v1T C v1 = λ1,    Σ bi² = v2T C v2 = λ2.

[Figure: data points with the two orthogonal directions v1 and v2.]


The total scatter of the k-dimensional PCA projections is equal to the sum of the scatter along each direction. We prove this for the case of k = 2:

    Σ ‖(ai, bi) − (0, 0)‖² = Σ (ai² + bi²) = Σ ai² + Σ bi² = λ1 + λ2.

This is also the maximum possible amount of scatter that can be preserved by any plane of the same dimension.

Furthermore, the orthogonal projections onto different eigenvectors vi are uncorrelated: since Σ ai = 0 = Σ bi, their covariance is

    Σ ai bi = Σ v1T(xi − x̄)(xi − x̄)Tv2 = v1T C v2 = v1T(λ2 v2) = λ2 (v1Tv2) = 0.


Principal component analysis (PCA)


The previous procedure is called principal component analysis.

• vj is called the jth principal direction;

• The projection of the data point xi onto vj , i.e., vjT (xi − x̄), is
called the jth principal component of xi .

In fact, PCA is just a change of coordinate system to use the maximum-variance directions of the data set!


Example 0.1. Perform PCA (by hand) on the following data set (rows are data points):

    X = [  1  −1
          −1   1
           2   2
          −2  −2 ]
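
A quick MATLAB check of the hand computation (the centroid is (0, 0), so the scatter matrix is C = XTX = [10 6; 6 10], with eigenvalues 16 and 4 and top eigenvector (1, 1)/√2):

X  = [1 -1; -1 1; 2 2; -2 -2];
Xt = X - mean(X, 1);            % the centroid is (0, 0), so Xt equals X
C  = Xt' * Xt                   % scatter matrix: [10 6; 6 10]
[V, D] = eig(C);                % eigenvalues 4 and 16
[~, imax] = max(diag(D));
v1 = V(:, imax)                 % top principal direction: (1, 1)/sqrt(2), up to sign
a1 = Xt * v1                    % first principal components: 0, 0, 2*sqrt(2), -2*sqrt(2) (up to sign)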


Computing

PCA requires constructing a d × d matrix from the given data,

    C = Σ (xi − x̄)(xi − x̄)T,

and computing its (top) eigenvectors,

    C ≈ Vk Λk VkT,

which can be a significant challenge for large data sets in high dimensions.

We show that the eigenvectors of C can be efficiently computed from the Singular Value Decomposition (SVD) of the centered data matrix.


   
Let X = [x1, . . . , xn]T ∈ Rn×d and X̃ = [x̃1, . . . , x̃n]T ∈ Rn×d (where x̃i = xi − x̄) be the original and centered data matrices (rows are data points).

Then

    C = Σ x̃i x̃iT = [x̃1 . . . x̃n] · [x̃1, . . . , x̃n]T = X̃T X̃.

Again, this shows that C is square, symmetric and positive semidefinite and thus only has nonnegative eigenvalues.


PCA through SVD


Recall that the principal directions of a data set are given by the top eigenvectors of the sample covariance matrix

    C = X̃T X̃ ∈ Rd×d.

Algebraically, they are also the right singular vectors of X̃:

    X̃T X̃ = VΣT UT · UΣVT = V (ΣT Σ) VT,    with Λ = ΣT Σ.

Thus, one may just use the SVD of X̃ to compute the principal directions (and components), which is much more efficient.
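
A small numerical check of this connection (a sketch with random data): the right singular vectors of X̃ coincide with the eigenvectors of C = X̃TX̃, and λi = σi².

Xt = randn(200, 8);  Xt = Xt - mean(Xt, 1);         % a random centered data matrix
[~, D]    = eig(Xt' * Xt);                          % eigenvalues of C
[~, S, V] = svd(Xt, 'econ');                        % singular values (descending order)
max(abs(sort(diag(D), 'descend') - diag(S).^2))     % ~ 0 up to round-off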


Interpretations:

Let the SVD of a centered data matrix X̃ be the following:

    X̃ = U · Σ · VT = (UΣ) · VT

Then

• Columns of V (right singular vectors vi ) are principal directions;

• Squared singular values (λi = σi²) represent the amounts of scatter captured by each principal direction;

• Columns of UΣ are different principal components of the data.


To see the last one, consider any principal direction vj. The corresponding principal component is

    X̃vj = σj uj,

with scatter λj = σj².

Collectively, for the top k principal directions, the principal components of the entire data set are

    Y = [X̃v1 . . . X̃vk] = X̃ [v1 . . . vk] = X̃Vk    (an n × k matrix)
      = [σ1u1 . . . σkuk] = [u1 . . . uk] · diag(σ1, . . . , σk) = Uk Σk.


Note also the following:

• The total scatter preserved by the k-dimensional projections is

    Σ_{1≤j≤k} λj = Σ_{1≤j≤k} σj².

• A parametric equation of the k-dimensional PCA plane is

    x = x̄ + Vk α.

• Projections of the data onto this plane are given by the rows of

    PS(X) = 1x̄T + X̃Vk VkT,    where X̃Vk = Y

  (see the numerical sketch below).
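
These identities are easy to verify numerically. A minimal MATLAB sketch, assuming a data matrix X ∈ Rn×d (rows are data points) and a reduced dimension k are given:

xbar = mean(X, 1);
Xt   = X - xbar;                  % centered data matrix
[U, S, V] = svds(Xt, k);          % rank-k SVD
Y = Xt * V;                       % principal components ...
norm(Y - U*S, 'fro')              % ... equal Uk*Sigmak (zero up to round-off)
sum(diag(S).^2)                   % total scatter preserved by the k-dim projections
P = xbar + Y * V';                % rows of P are the projections onto the PCA plane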


An SVD-based algorithm for PCA


Input: Data matrix X ∈ Rn×d and integer k (with 0 < k < d)

Output: Top k principal directions v1, . . . , vk and corresponding principal components Y ∈ Rn×k.

Steps:

1. Center the data: X̃ = [x1 − x̄, . . . , xn − x̄]T, where x̄ = (1/n) Σ xi

2. Perform a rank-k SVD: X̃ ≈ Uk Σk VkT

3. Return: Y = X̃Vk = Uk Σk


Connection to orthogonal least-squares fitting


We have seen that the following two planes coincide:

(1) the PCA plane, which maximizes the projection variance;

(2) the orthogonal best-fit plane, which minimizes the orthogonal least-squares fitting error.

[Figure: a data point xi, its orthogonal projection pi onto the line through x̄, and the projection coefficient ai.]

Mathematical justification: writing pi = v vT(xi − x̄) + x̄ for the projection of xi (whose coefficient ai = vT(xi − x̄) is the principal component), we have

    Σ ‖xi − x̄‖²    =    Σ ai²    +    Σ ‖xi − pi‖²
    (total scatter)      (proj. variance)      (orthogonal fitting error)
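
A small numerical sanity check of this identity, here for random data and the first principal direction only (a sketch; any data set and any number of directions would work):

X  = randn(100, 5);                  % some data set
Xt = X - mean(X, 1);                 % centered data
[~, ~, v] = svds(Xt, 1);             % first principal direction
a  = Xt * v;                         % principal components a_i
P  = a * v';                         % centered projections p_i - xbar
norm(Xt, 'fro')^2 - (norm(a)^2 + norm(Xt - P, 'fro')^2)   % ~ 0 up to round-off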


Other interpretations of PCA


The PCA plane also tries to preserve, as much as possible, the Euclidean distances between the given data points:

    ‖yi − yj‖² ≈ ‖xi − xj‖²    for “most” pairs i ≠ j.

More on this when we get to the MDS part.

PCA can also be regarded as a feature extraction method:

    vj = (1/λj) C vj = (1/λj) X̃T (X̃vj) ∈ Col(X̃T),    for all j < rank(X̃).

This shows that each vj is a linear combination of the centered data points (and also a linear combination of the original data points).


MATLAB implementation of PCA


MATLAB built-in: [V, US] = pca(X); % Rows of X are observations

Alternatively, you may want to code it yourself:

Xtilde = X - mean(X,1);
[U,S,V] = svds(Xtilde, k); % k is the reduced dimension
Y = Xtilde*V;
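
For example, the three lines above can be tried on synthetic data lying near a low-dimensional subspace (a sketch; the sizes are arbitrary):

n = 500; d = 50; k = 3;
X = randn(n, k) * randn(k, d) + 0.01 * randn(n, d);   % roughly rank-3 data plus noise

Xtilde = X - mean(X, 1);
[U, S, V] = svds(Xtilde, k);
Y = Xtilde * V;                   % n-by-k principal components

diag(S).^2                        % scatter captured by each principal direction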


Application to data visualization


Given a high dimensional data set x1 , . . . , xn ∈ Rd , one can visualize the
data by

• projecting the data onto a 2 or 3 dimensional PCA plane and

• plotting the principal components as new coordinates


[Figure: 2-D scatter plot of the data in the new coordinates, with horizontal axis pc1 (X̃v1) and vertical axis pc2 (X̃v2).]
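
A minimal plotting sketch, assuming Y holds the top two principal components (e.g., computed as in the algorithm above) and labels is a numeric vector of class labels:

scatter(Y(:, 1), Y(:, 2), 10, labels, 'filled')   % color encodes the class label
xlabel('pc1'), ylabel('pc2')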


2D visualization of MNIST handwritten digits

1. The “average” writer

2. The full appearance of each digit class


[Figure: 2-D PCA projections of the MNIST digit classes 0–3, one panel per digit.]

(cont’d on next page)


[Figure: 2-D PCA projections of the digit classes 4–6 and 7–9, one panel per digit.]


How to set the parameter k in other settings?


Generally, there are two ways to choose the reduced dimension k:

• Set k = #“dominant” singular values

• Choose k such that the top k principal components explain a certain fraction of the total scatter of the data:

      Σ_{i=1}^k σi²   /   Σ_{i=1}^r σi²   >   p
      (explained variance)  (total variance)

Common values of p are .95 (the most commonly used), or .99 (more
conservative, less reduction), or .90, .80 (more aggressive).
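
A minimal MATLAB sketch of this criterion, assuming Xtilde is the centered data matrix and p is the desired fraction (e.g., p = 0.95):

s     = svd(Xtilde);                    % all singular values
ratio = cumsum(s.^2) / sum(s.^2);       % fraction of scatter explained by the top i directions
k     = find(ratio > p, 1)              % smallest k exceeding the threshold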


However, in practical contexts, it is often possible to go much lower than this threshold while maintaining or even improving the accuracy.

Example: MNIST handwritten digits


Concluding remarks on PCA


PCA projects the (centered) data onto a k-dim plane that

• maximizes the amount of variance in the projection domain,

• minimizes the orthogonal least-squares fitting error

As a dimension reduction and feature extraction method, it is

• unsupervised (blind to labels),

• nonparametric (model-free), and

• very popular!


PCA is a deterministic procedure, assuming no measurement errors in the data:

    X̃ = Y · VT

To extend it to deal with measurement errors, we can assume a statistical model

    X̃ij = Σ_{r=1}^k Fir Wrj + εij,    for all i, j,

which in matrix form is

    X̃ = F · W + E,

where F contains the factor scores, W the factor loadings, and E the errors. This method is called factor analysis and its solution can be derived by using the MLE approach.


Lastly, PCA is a linear projection method:

y = VT (x − x̄)

For nonlinear data, PCA will need to use a dimension higher than the
manifold dimension (in order to preserve most of the variance).


On the matter of centering


PCA requires data centering (equivalent to fitting a plane through the
centroid). What is the best plane through the origin (linear subspace)?


Why use linear subspaces?

They are very useful for modeling document collections:


How to fit a plane through the origin?

Theorem 0.2. Let X ∈ Rn×d be the given data set and k ≥ 1 an integer.
The best k-dimensional plane through the origin for fitting the data is
spanned by the top k right singular vectors of X:

    X ≈ Xk = (Uk Σk) VkT,

where the rows of Xk are the projections of the given data onto the plane, Uk Σk contains the coefficients, and the columns of Vk form a basis of the plane.

Proof. It suffices to solve

    min_{V ∈ Rd×k: VTV = Ik} ‖X − XVVT‖².

The optimal V is such that XVVT = Xk, and can be chosen to be Vk.


Example 0.2. Consider a data set of 3 points in R²:

    (1, 3), (2, 2), (3, 1).

The PCA line is

    x(t) = (2, 2) + t · (1/√2, −1/√2),

while the SVD line (the best-fit line through the origin) is

    x(t) = t · (1/√2, 1/√2).

[Figure: the three points together with the PCA line and the SVD line.]
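
A quick MATLAB check of this example (a sketch; the signs of the returned singular vectors may differ):

X = [1 3; 2 2; 3 1];
[~, ~, Vc] = svd(X - mean(X, 1));   % centered: Vc(:,1) is parallel to (1, -1)/sqrt(2)
[~, ~, V0] = svd(X);                % uncentered: V0(:,1) is parallel to (1, 1)/sqrt(2)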

Application: Visualization of 20 newsgroups data


Summary information:

• 18,774 documents partitioned nearly evenly across 20 different newsgroups.

• A total of 61,118 unique words (including stopwords) present in the corpus.

A significant challenge:

• The stopwords dominate in most documents in terms of frequency and make the newsgroups very hard to be separated.


A fake document-term matrix:

             the   an   zzzz   math   design   car   cars
    doc 1     8    12     1      4       2
    doc 2     7    10            3       4
    doc 3     9    15            5       2
    doc 4     5     9                    2       2      2
    doc 5     9     7                    3       3      1
    doc 6     1     1                                   2

We will not use any text processing software to perform stopword removal (or other kinds of language processing such as stemming), but rather rely on the following statistical operations (in the shown order) on the document-term frequency matrix X to deal with stopwords:


1. Convert all the frequency counts into binary (0/1) form:

             the   an   zzzz   math   design   car   cars
    doc 1     1     1     1      1       1
    doc 2     1     1            1       1
    doc 3     1     1            1       1
    doc 4     1     1                    1       1      1
    doc 5     1     1                    1       1      1
    doc 6     1     1                                   1


2. Remove words that occur either in exactly one document (rare words or typos) or in “too many” documents (stopwords or common words):

             math   design   car   cars
    doc 1      1       1
    doc 2      1       1
    doc 3      1       1
    doc 4              1       1     1
    doc 5              1       1     1
    doc 6                      1

(The remaining words occur in nj = 3, 5, 3, and 2 documents, respectively.)


3. Apply the inverse document frequency (IDF) weighting to the remaining columns of X:

       X(:, j) ← wj · X(:, j),    wj = log(n/nj),

   where nj is the number of documents that contain the j-th word:

             math     design   car      cars
    doc 1   0.6931    0.1823
    doc 2   0.6931    0.1823
    doc 3   0.6931    0.1823
    doc 4             0.1823   0.6931   1.0986
    doc 5             0.1823   0.6931   1.0986
    doc 6                      0.6931


4. Rescale the rows of X to have unit norm in order to remove the documents’ length information:

             math     design   car      cars
    doc 1   0.9671    0.2544
    doc 2   0.9671    0.2544
    doc 3   0.9671    0.2544
    doc 4             0.1390   0.5284   0.8375
    doc 5             0.1390   0.5284   0.8375
    doc 6                      1
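
A minimal MATLAB sketch of these four steps, assuming X is the n × d document-term frequency matrix (dense here for simplicity) and maxDocs is the chosen stopword cutoff (e.g., 939):

n  = size(X, 1);
B  = double(X > 0);                       % 1. binarize the counts
nj = sum(B, 1);                           %    document frequency of each word
keep = (nj >= 2) & (nj <= maxDocs);       % 2. drop rare words and stopwords
B  = B(:, keep);   nj = nj(keep);
W  = B .* log(n ./ nj);                   % 3. IDF weighting: wj = log(n/nj)
W  = W ./ max(sqrt(sum(W.^2, 2)), eps);   % 4. rescale each row to unit norm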


By applying the above procedure (a particular TF-IDF weighting scheme¹) to the 20 newsgroups data and keeping only the words that occur in between 2 and 939 documents (939 being the average cluster size), we obtain a matrix of 18,768 nonempty documents and 55,570 unique words, with average row sparsity 73.4.

For ease of demonstration, we focus on six newsgroups in the processed data set (one from each category) and project them by SVD onto a 3-dimensional plane through the origin for visualization.

¹ Full name: term frequency–inverse document frequency. See https://en.wikipedia.org/wiki/Tf-idf


We also display the top 20 words that are the most “relevant” to the
underlying topic of each class.

To rank the words based on relevance to each newsgroup, we first compute the top right singular vector v1 of a fixed newsgroup (without centering), which represents the dominant direction of the cluster.

Each keyword i corresponds to a distinct dimension of the data and is represented by the standard basis vector ei.

The following score can then be used to measure and compare the relevance of each keyword:

    score(i) = cos θi = ⟨v1, ei⟩ = v1(i),    i = 1, . . . , 55570.
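
A minimal MATLAB sketch of this ranking, assuming W is the processed TF-IDF matrix (as in the sketch above), vocab is a cell array of the 55,570 words, and idx selects the documents of one newsgroup:

[~, ~, v1] = svds(W(idx, :), 1);     % top right singular vector (no centering)
if sum(v1) < 0, v1 = -v1; end        % fix the arbitrary sign of v1
[~, order] = sort(v1, 'descend');    % rank keywords by score(i) = v1(i)
top20 = vocab(order(1:20))           % the 20 most relevant words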
