
San José State University

Math 253: Mathematical Methods for Data Visualization

Principal Component Analysis (PCA)


– A First Dimensionality Reduction Approach

Dr. Guangliang Chen



Introduction
• Many data sets nowadays have very high dimensions, which poses significant challenges in storing and processing them.

• We need a way to reduce the dimensionality of the data in order to lower the memory requirement while increasing processing speed.

• If we discard some dimensions, will that degrade performance?

• The answer can be no, as long as we do it carefully by preserving only the information that is needed by the task. In fact, it may even lead to better results in many cases.


Different dimensionality reduction algorithms preserve different kinds of information (when reducing the dimension):

• Principal Component Analysis (PCA): variance

• Multidimensional Scaling (MDS): distance

• ISOMap: geodesic distance

• Local Linear Embedding (LLE): local geometry

• Laplacian Eigenmaps: local affinity

• Linear Discriminant Analysis (LDA): separation among classes


A demonstration

“Useful” information of a data set is often contained in only a small number of dimensions.

[Figure: a scatter plot of data points illustrating this.]


Another example

Average intensity value of each pixel of the MNIST handwritten digits:


[Figure: plot of the mean intensity value at each pixel, with pixel index 0–800 on the horizontal axis and mean intensity 0–0.6 on the vertical axis.]

• Boundary pixels tend to be zero;

• Number of degrees of freedom of each digit is much less than 784.
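
The curve above is easy to reproduce. Below is a minimal MATLAB sketch, assuming the digit images are stored as the rows of an n × 784 matrix X with intensities scaled to [0, 1]:

m = mean(X, 1);                                 % average intensity at each of the 784 pixels
plot(m)                                         % boundary pixels show up as (near) zeros
xlabel('pixel index'), ylabel('mean intensity')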


The one-dimensional PCA problem


Problem. Given a set of data points x1, . . . , xn ∈ Rd, find a line S parametrized by x(t) = t · v + b (with ‖v‖ = 1) such that the orthogonal projections of the data onto the line,

    PS(xi) = v vT(xi − b) + b = ai v + b,    where ai := vT(xi − b),    1 ≤ i ≤ n,

have the largest possible variance.

[Figure: data points xi and two parallel lines, x = tv + b and x = tv + b′, with the projection coefficient ai marked along the first line.]


Mathematical formulation

First observe that for parallel lines, the projections are different, but the
amounts of variance are the same! ←− This implies that the choice of b
is not unique.

To make the problem well defined, we add a constraint by requiring that


    0 = ā = (1/n) Σ ai = (1/n) Σ vT(xi − b) = vT(x̄ − b).

This yields b = x̄ = (1/n) Σ xi, i.e., we only consider lines passing through the centroid of the data set.


We have thus eliminated the variable b from the problem, so that we only
need to focus on the unit-vector variable v (representing the direction of
the line).

Since we now have ā = 0, the variance of the projections is simply


    (1/(n − 1)) Σ ai²,

and we can correspondingly reformulate the original problem as follows:

    max_{v: ‖v‖=1} Σ ai²   (the “scatter”),    where ai = vT(xi − x̄).


Let us further rewrite the objective function:


    Σ ai² = Σ vT(xi − x̄) · (xi − x̄)Tv
          = Σ vT [(xi − x̄)(xi − x̄)T] v
          = vT [ Σ (xi − x̄)(xi − x̄)T ] v
          = vT C v,    where C := Σ (xi − x̄)(xi − x̄)T is a d × d matrix.

Remark. The matrix C is called the sample covariance matrix or scatter matrix of the data. It is square, symmetric, and positive semidefinite, because it is a sum of such matrices!


Accordingly, we have obtained the following (Rayleigh quotient) problem

    max_{v: ‖v‖=1} vT Cv

By applying the Rayleigh quotient theorem, we can easily obtain the following result.

Theorem 0.1. Given a set of data points x1, . . . , xn in Rd with centroid x̄ = (1/n) Σ xi, the optimal direction for projecting the data (in order to have maximum variance) is the top eigenvector of the sample covariance matrix C = Σ (xi − x̄)(xi − x̄)T:

    max_{v: ‖v‖=1} vT Cv = λ1    (the maximum scatter), achieved when v = v1.


Remark. It can be shown that

    max_{v: ‖v‖=1, v1Tv=0} vT Cv = λ2,    achieved when v = v2;

    max_{v: ‖v‖=1, v1Tv=0, v2Tv=0} vT Cv = λ3,    achieved when v = v3.

This shows that v2, v3, etc. are the next best orthogonal directions.


For each 1 ≤ i ≤ n, let

    ai = v1T(xi − x̄),    bi = v2T(xi − x̄)

(and so on for subsequent orthogonal directions). The scatter of the projections of the data onto each of those directions is

    Σ ai² = v1T C v1 = λ1,    Σ bi² = v2T C v2 = λ2.

[Figure: data points with the two orthogonal directions v1 and v2.]


The total scatter of the k-dimensional PCA projections is equal to the sum of the scatter along each direction. We prove this for the case of k = 2:

    Σ ‖(ai, bi) − (0, 0)‖² = Σ (ai² + bi²) = Σ ai² + Σ bi² = λ1 + λ2.

This is also the maximum possible amount of scatter that can be preserved by any plane of the same dimension.

Furthermore, the orthogonal projections onto different eigenvectors vi are uncorrelated: since Σ ai = 0 = Σ bi, their covariance is

    Σ ai bi = Σ v1T(xi − x̄)(xi − x̄)Tv2 = v1T C v2 = v1T(λ2 v2) = λ2 (v1Tv2) = 0.


Principal component analysis (PCA)


The previous procedure is called principal component analysis.

• vj is called the jth principal direction;

• The projection of the data point xi onto vj , i.e., vjT (xi − x̄), is
called the jth principal component of xi .

In fact, PCA is just a change of coordinate system to use the maximum-variance directions of the data set!


Example 0.1. Perform PCA (by hand) on the following data set (rows are data points):

    X = [  1  −1
          −1   1
           2   2
          −2  −2 ]
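
A quick MATLAB check of the hand computation (the centroid is (0, 0), so the scatter matrix is C = XTX = [10 6; 6 10], with eigenvalues 16 and 4 and top eigenvector (1, 1)/√2):

X  = [1 -1; -1 1; 2 2; -2 -2];
Xt = X - mean(X, 1);            % the centroid is (0, 0), so Xt equals X
C  = Xt' * Xt                   % scatter matrix: [10 6; 6 10]
[V, D] = eig(C);                % eigenvalues 4 and 16
[~, imax] = max(diag(D));
v1 = V(:, imax)                 % top principal direction: (1, 1)/sqrt(2), up to sign
a1 = Xt * v1                    % first principal components: 0, 0, 2*sqrt(2), -2*sqrt(2) (up to sign)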


Computing

PCA requires constructing a d × d matrix from the given data,

    C = Σ (xi − x̄)(xi − x̄)T,

and computing its (top) eigenvectors,

    C ≈ Vk Λk VkT,

which can be a significant challenge for large data sets in high dimensions.

We show that the eigenvectors of C can be efficiently computed from the Singular Value Decomposition (SVD) of the centered data matrix.


   
Let X = [x1, . . . , xn]T ∈ Rn×d and X̃ = [x̃1, . . . , x̃n]T ∈ Rn×d (where x̃i = xi − x̄) be the original and centered data matrices (rows are data points).

Then

    C = Σ x̃i x̃iT = [x̃1 . . . x̃n] · [x̃1, . . . , x̃n]T = X̃T X̃.

Again, this shows that C is square, symmetric and positive semidefinite and thus only has nonnegative eigenvalues.


PCA through SVD


Recall that the principal directions of a data set are given by the top eigenvectors of the sample covariance matrix

    C = X̃T X̃ ∈ Rd×d.

Algebraically, they are also the right singular vectors of X̃:

    X̃T X̃ = VΣT UT · UΣVT = V (ΣT Σ) VT,    with Λ = ΣT Σ.

Thus, one may just use the SVD of X̃ to compute the principal directions (and components), which is much more efficient.
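
A small numerical check of this connection (a sketch with random data): the right singular vectors of X̃ coincide with the eigenvectors of C = X̃TX̃, and λi = σi².

Xt = randn(200, 8);  Xt = Xt - mean(Xt, 1);         % a random centered data matrix
[~, D]    = eig(Xt' * Xt);                          % eigenvalues of C
[~, S, V] = svd(Xt, 'econ');                        % singular values (descending order)
max(abs(sort(diag(D), 'descend') - diag(S).^2))     % ~ 0 up to round-off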


Interpretations:

Let the SVD of a centered data matrix X̃ be the following:

    X̃ = U · Σ · VT = (UΣ) · VT

Then

• Columns of V (right singular vectors vi ) are principal directions;

• Squared singular values (λi = σi²) represent the amounts of scatter captured by each principal direction;

• Columns of UΣ are different principal components of the data.


To see the last one, consider any principal direction vj. The corresponding principal component is

    X̃vj = σj uj,

with scatter λj = σj².

Collectively, for the top k principal directions, the principal components of the entire data set are

    Y = [X̃v1 . . . X̃vk] = X̃ [v1 . . . vk] = X̃Vk    (an n × k matrix)
      = [σ1u1 . . . σkuk] = [u1 . . . uk] · diag(σ1, . . . , σk) = Uk Σk.


Note also the following:

• The total scatter preserved by the k-dimensional projections is

    Σ_{1≤j≤k} λj = Σ_{1≤j≤k} σj².

• A parametric equation of the k-dimensional PCA plane is

    x = x̄ + Vk α.

• Projections of the data onto this plane are given by the rows of

    PS(X) = 1x̄T + X̃Vk VkT,    where X̃Vk = Y

  (see the numerical sketch below).
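
These identities are easy to verify numerically. A minimal MATLAB sketch, assuming a data matrix X ∈ Rn×d (rows are data points) and a reduced dimension k are given:

xbar = mean(X, 1);
Xt   = X - xbar;                  % centered data matrix
[U, S, V] = svds(Xt, k);          % rank-k SVD
Y = Xt * V;                       % principal components ...
norm(Y - U*S, 'fro')              % ... equal Uk*Sigmak (zero up to round-off)
sum(diag(S).^2)                   % total scatter preserved by the k-dim projections
P = xbar + Y * V';                % rows of P are the projections onto the PCA plane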


An SVD-based algorithm for PCA


Input: Data matrix X ∈ Rn×d and integer k (with 0 < k < d)

Output: Top k principal directions v1, . . . , vk and corresponding principal components Y ∈ Rn×k.

Steps:

1. Center the data: X̃ = [x1 − x̄, . . . , xn − x̄]T, where x̄ = (1/n) Σ xi

2. Perform a rank-k SVD: X̃ ≈ Uk Σk VkT

3. Return: Y = X̃Vk = Uk Σk


Connection to orthogonal least-squares fitting


We have seen that the following two planes coincide:

(1) the PCA plane, which maximizes the projection variance;

(2) the orthogonal best-fit plane, which minimizes the orthogonal least-squares fitting error.

[Figure: a data point xi, its orthogonal projection pi onto the line through x̄, and the projection coefficient ai.]

Mathematical justification: writing pi = v vT(xi − x̄) + x̄ for the projection of xi (whose coefficient ai = vT(xi − x̄) is the principal component), we have

    Σ ‖xi − x̄‖²    =    Σ ai²    +    Σ ‖xi − pi‖²
    (total scatter)      (proj. variance)      (orthogonal fitting error)
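
A small numerical sanity check of this identity, here for random data and the first principal direction only (a sketch; any data set and any number of directions would work):

X  = randn(100, 5);                  % some data set
Xt = X - mean(X, 1);                 % centered data
[~, ~, v] = svds(Xt, 1);             % first principal direction
a  = Xt * v;                         % principal components a_i
P  = a * v';                         % centered projections p_i - xbar
norm(Xt, 'fro')^2 - (norm(a)^2 + norm(Xt - P, 'fro')^2)   % ~ 0 up to round-off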


Other interpretations of PCA


The PCA plane also tries to preserve, as much as possible, the Euclidean distances between the given data points:

    ‖yi − yj‖² ≈ ‖xi − xj‖²    for “most” pairs i ≠ j.

More on this when we get to the MDS part.

PCA can also be regarded as a feature extraction method:

    vj = (1/λj) C vj = (1/λj) X̃T (X̃vj) ∈ Col(X̃T),    for all j < rank(X̃).

This shows that each vj is a linear combination of the centered data points (and also a linear combination of the original data points).


MATLAB implementation of PCA


MATLAB built-in: [V, US] = pca(X); % Rows of X are observations

Alternatively, you may want to code it yourself:

Xtilde = X - mean(X,1);
[U,S,V] = svds(Xtilde, k); % k is the reduced dimension
Y = Xtilde*V;
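
For example, the three lines above can be tried on synthetic data lying near a low-dimensional subspace (a sketch; the sizes are arbitrary):

n = 500; d = 50; k = 3;
X = randn(n, k) * randn(k, d) + 0.01 * randn(n, d);   % roughly rank-3 data plus noise

Xtilde = X - mean(X, 1);
[U, S, V] = svds(Xtilde, k);
Y = Xtilde * V;                   % n-by-k principal components

diag(S).^2                        % scatter captured by each principal direction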


Application to data visualization


Given a high dimensional data set x1 , . . . , xn ∈ Rd , one can visualize the
data by

• projecting the data onto a 2 or 3 dimensional PCA plane and

• plotting the principal components as new coordinates


[Figure: 2-D scatter plot of the data in the new coordinates, with horizontal axis pc1 (X̃v1) and vertical axis pc2 (X̃v2).]
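
A minimal plotting sketch, assuming Y holds the top two principal components (e.g., computed as in the algorithm above) and labels is a numeric vector of class labels:

scatter(Y(:, 1), Y(:, 2), 10, labels, 'filled')   % color encodes the class label
xlabel('pc1'), ylabel('pc2')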


2D visualization of MNIST handwritten digits

1. The “average” writer

2. The full appearance of each digit class


[Figure: 2-D PCA projections of the MNIST digit classes 0–3, one panel per digit.]

(cont’d on next page)


[Figure: 2-D PCA projections of the digit classes 4–6 and 7–9, one panel per digit.]


How to set the parameter k in other settings?


Generally, there are two ways to choose the reduced dimension k:

• Set k = #“dominant” singular values

• Choose k such that the top k principal components explain a certain fraction of the total scatter of the data:

      Σ_{i=1}^k σi²   /   Σ_{i=1}^r σi²   >   p
      (explained variance)  (total variance)

Common values of p are .95 (the most commonly used), or .99 (more
conservative, less reduction), or .90, .80 (more aggressive).
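
A minimal MATLAB sketch of this criterion, assuming Xtilde is the centered data matrix and p is the desired fraction (e.g., p = 0.95):

s     = svd(Xtilde);                    % all singular values
ratio = cumsum(s.^2) / sum(s.^2);       % fraction of scatter explained by the top i directions
k     = find(ratio > p, 1)              % smallest k exceeding the threshold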


However, in practical contexts, it is often possible to go much lower than this threshold while maintaining or even improving the accuracy.

Example: MNIST handwritten digits


Concluding remarks on PCA


PCA projects the (centered) data onto a k-dim plane that

• maximizes the amount of variance in the projection domain,

• minimizes the orthogonal least-squares fitting error

As a dimension reduction and feature extraction method, it is

• unsupervised (blind to labels),

• nonparametric (model-free), and

• very popular!


PCA is a deterministic procedure, assuming no measurement errors in the data:

    X̃ = Y · VT

To extend it to deal with measurement errors, we can assume a statistical model

    X̃ij = Σ_{r=1}^k Fir Wrj + εij,    for all i, j,

which in matrix form is

    X̃ = F · W + E,

where F contains the factor scores, W the factor loadings, and E the errors. This method is called factor analysis and its solution can be derived by using the MLE approach.


Lastly, PCA is a linear projection method:

y = VT (x − x̄)

For nonlinear data, PCA will need to use a dimension higher than the
manifold dimension (in order to preserve most of the variance).


On the matter of centering


PCA requires data centering (equivalent to fitting a plane through the
centroid). What is the best plane through the origin (linear subspace)?


Why use linear subspaces?

They are very useful for modeling document collections:


How to fit a plane through the origin?

Theorem 0.2. Let X ∈ Rn×d be the given data set and k ≥ 1 an integer.
The best k-dimensional plane through the origin for fitting the data is
spanned by the top k right singular vectors of X:

    X ≈ Xk = (Uk Σk) VkT,

where the rows of Xk are the projections of the given data onto the plane, Uk Σk contains the coefficients, and the columns of Vk form a basis of the plane.

Proof. It suffices to solve

    min_{V ∈ Rd×k: VTV = Ik} ‖X − XVVT‖².

The optimal V is such that XVVT = Xk, and can be chosen to be Vk.


Example 0.2. Consider a data set of 3 points in R²:

    (1, 3), (2, 2), (3, 1).

The PCA line is

    x(t) = (2, 2) + t · (1/√2, −1/√2),

while the SVD line (the best-fit line through the origin) is

    x(t) = t · (1/√2, 1/√2).

[Figure: the three points together with the PCA line and the SVD line.]
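
A quick MATLAB check of this example (a sketch; the signs of the returned singular vectors may differ):

X = [1 3; 2 2; 3 1];
[~, ~, Vc] = svd(X - mean(X, 1));   % centered: Vc(:,1) is parallel to (1, -1)/sqrt(2)
[~, ~, V0] = svd(X);                % uncentered: V0(:,1) is parallel to (1, 1)/sqrt(2)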

Application: Visualization of 20 newsgroups data


Summary information:

• 18,774 documents partitioned nearly evenly across 20 different newsgroups.

• A total of 61,118 unique words (including stopwords) present in the corpus.

A significant challenge:

• The stopwords dominate in most documents in terms of frequency and make the newsgroups very hard to be separated.


A fake document-term matrix:

             the   an   zzzz   math   design   car   cars
    doc 1     8    12     1      4       2
    doc 2     7    10            3       4
    doc 3     9    15            5       2
    doc 4     5     9                    2       2      2
    doc 5     9     7                    3       3      1
    doc 6     1     1                                   2

We will not use any text processing software to perform stopword removal (or other kinds of language processing such as stemming), but rather rely on the following statistical operations (in the shown order) on the document-term frequency matrix X to deal with stopwords:


1. Convert all the frequency counts into binary (0/1) form:

             the   an   zzzz   math   design   car   cars
    doc 1     1     1     1      1       1
    doc 2     1     1            1       1
    doc 3     1     1            1       1
    doc 4     1     1                    1       1      1
    doc 5     1     1                    1       1      1
    doc 6     1     1                                   1


2. Remove words that occur either in exactly one document (rare words or typos) or in “too many” documents (stopwords or common words):

             math   design   car   cars
    doc 1      1       1
    doc 2      1       1
    doc 3      1       1
    doc 4              1       1     1
    doc 5              1       1     1
    doc 6                      1

(The remaining words occur in nj = 3, 5, 3, and 2 documents, respectively.)


3. Apply the inverse document frequency (IDF) weighting to the remaining columns of X:

       X(:, j) ← wj · X(:, j),    wj = log(n/nj),

   where nj is the number of documents that contain the j-th word:

             math     design   car      cars
    doc 1   0.6931    0.1823
    doc 2   0.6931    0.1823
    doc 3   0.6931    0.1823
    doc 4             0.1823   0.6931   1.0986
    doc 5             0.1823   0.6931   1.0986
    doc 6                      0.6931


4. Rescale the rows of X to have unit norm in order to remove the documents’ length information:

             math     design   car      cars
    doc 1   0.9671    0.2544
    doc 2   0.9671    0.2544
    doc 3   0.9671    0.2544
    doc 4             0.1390   0.5284   0.8375
    doc 5             0.1390   0.5284   0.8375
    doc 6                      1
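
A minimal MATLAB sketch of these four steps, assuming X is the n × d document-term frequency matrix (dense here for simplicity) and maxDocs is the chosen stopword cutoff (e.g., 939):

n  = size(X, 1);
B  = double(X > 0);                       % 1. binarize the counts
nj = sum(B, 1);                           %    document frequency of each word
keep = (nj >= 2) & (nj <= maxDocs);       % 2. drop rare words and stopwords
B  = B(:, keep);   nj = nj(keep);
W  = B .* log(n ./ nj);                   % 3. IDF weighting: wj = log(n/nj)
W  = W ./ max(sqrt(sum(W.^2, 2)), eps);   % 4. rescale each row to unit norm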


By applying the above procedure (a particular TF-IDF weighting scheme¹) to the 20 newsgroups data and keeping only the words that occur in between 2 and 939 documents (939 being the average cluster size), we obtain a matrix of 18,768 nonempty documents and 55,570 unique words, with average row sparsity 73.4.

For ease of demonstration, we focus on six newsgroups in the processed data set (one from each category) and project them by SVD onto a 3-dimensional plane through the origin for visualization.

¹ Full name: term frequency–inverse document frequency. See https://en.wikipedia.org/wiki/Tf-idf


We also display the top 20 words that are the most “relevant” to the
underlying topic of each class.

To rank the words based on relevance to each newsgroup, we first compute the top right singular vector v1 of a fixed newsgroup (without centering), which represents the dominant direction of the cluster.

Each keyword i corresponds to a distinct dimension of the data and is represented by the standard basis vector ei.

The following score can then be used to measure and compare the relevance of each keyword:

    score(i) = cos θi = ⟨v1, ei⟩ = v1(i),    i = 1, . . . , 55570.
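
A minimal MATLAB sketch of this ranking, assuming W is the processed TF-IDF matrix (as in the sketch above), vocab is a cell array of the 55,570 words, and idx selects the documents of one newsgroup:

[~, ~, v1] = svds(W(idx, :), 1);     % top right singular vector (no centering)
if sum(v1) < 0, v1 = -v1; end        % fix the arbitrary sign of v1
[~, order] = sort(v1, 'descend');    % rank keywords by score(i) = v1(i)
top20 = vocab(order(1:20))           % the 20 most relevant words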
