# Principal Components

About This Presentation
Title:

## Principal Components

Description:

### We could plot pairs of variables. There are 15 such pairs ... The plot does not 'tilt' in either direction. SM339 Spring 08 - Principal Components ...

Slides: 17
Provided by: johnt1
Transcript and Presenter's Notes

Title: Principal Components

1
Principal Components
• As part of a study on football helmets,
scientists collected head measurements from a
number of football players
• They measured 6 different aspects of the players'
heads

2
Principal Components
• Dealing with 6-dimensional data is difficult
• We could plot pairs of variables
• There are 15 such pairs
• But it turns out that even this may not give a
clear picture of the data
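
The count of 15 is the number of ways to choose 2 variables out of 6. A quick check in Python, using placeholder labels because the slides do not name the six measurements:

```python
from itertools import combinations

# Placeholder labels; the slides do not name the six head measurements.
variables = ["x1", "x2", "x3", "x4", "x5", "x6"]

# Each unordered pair of distinct variables corresponds to one scatter plot.
pairs = list(combinations(variables, 2))

print(len(pairs))  # 15
```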

3
Principal Components
• Suppose we have data distributed over the plane
in 3D that passes through (1,0,0), (0,1,0), (0,0,1)
• No matter which pair of axes we use, the plots
will just look like a cloud of points
• From those plots alone, we would never realize that
the data actually lies on a 2-dimensional surface

4
Principal Components
• Recall the formula for correlation in simple
linear regression
• $\text{Corr} = S_{xy} / \sqrt{S_{xx} S_{yy}}$
• The numerator, Sxy, is proportional to covariance
• It measures the extent to which larger values of
X are associated with larger (or smaller) values
of Y
• If Cov = 0, then X and Y are uncorrelated (no linear relationship)
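
As a small worked example of the correlation formula, the sketch below computes Sxy, Sxx and Syy for made-up values. It assumes the usual regression convention that these are sums of squares and cross-products about the means, which the slides do not spell out.

```python
import math

# Made-up sample values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Assumed definitions: sums of squares and cross-products about the means.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

corr = s_xy / math.sqrt(s_xx * s_yy)

# The sample covariance is Sxy divided by n - 1, hence "proportional to".
cov = s_xy / (len(x) - 1)

print(round(corr, 4), cov)
```

Here Sxy is positive, so larger values of x go with larger values of y and the correlation is positive as well.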

5
Principal Components
• In terms of a plot, this means that the plot of Y
vs X is just a cloud of points
• The plot does not tilt in either direction

6
Principal Components
• The problem with the data in the plane example is
that the values are correlated
• If we could look down the edge of the plane, then
we would see that there is not a third dimension
to the data
• All the data lies in only the two dimensions of
the plane
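
A minimal sketch of the plane example in Python. The sampling scheme (x and y drawn freely, z chosen so that x + y + z = 1) is an assumption made for illustration. Each original axis shows plenty of spread, but the coordinate perpendicular to the plane, which is what collapses when the plane is viewed edge-on, has essentially none.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

# Assumed sampling scheme: pick x and y freely, then set z so x + y + z = 1.
points = []
for _ in range(500):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(0.0, 1.0)
    points.append((x, y, 1.0 - x - y))

# Spread along each original axis: all three are clearly non-zero.
axis_variances = [statistics.variance(column) for column in zip(*points)]

# Coordinate along the unit normal (1, 1, 1) / sqrt(3) of the plane.
along_normal = [(x + y + z) / math.sqrt(3.0) for x, y, z in points]
normal_variance = statistics.variance(along_normal)

print(axis_variances)
print(normal_variance)
```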

7
Principal Components
• In reality, data rarely lies exactly on a plane
• But the cloud of points can extend much more in
some directions than in others
• Typically, the data forms a cloud that resembles
an ellipsoid

8
Principal Components
• If we can find the axes of the ellipsoid, we can
view the data in terms of these components
• The longest axis is the most interesting
• The shortest axis does not have much information

9
Principal Components
• NOTE: There are two ways to proceed
• We can work with the Covariance matrix
• Or we can work with the Correlation matrix
• The theory is developed for the Cov matrix
• Sometimes the Corr matrix makes more sense

10
Principal Components
• For variables $x_1, x_2, \ldots, x_k$, define the
covariance matrix so that the $(i,j)$ element is
$S_{x_i x_j}$
• This means that the matrix will be symmetric
• If two variables are uncorrelated, then the
corresponding element of Cov will be zero (or
nearly so)
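
A sketch of these properties in Python with NumPy, used here as a stand-in for the MATLAB-style calls on the later slides. The data are randomly generated placeholders, not the helmet measurements: columns 0 and 1 are built to be related, while columns 2 and 3 are independent.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Stand-in data: 100 observations of 4 variables (not the helmet data).
data = rng.normal(size=(100, 4))
data[:, 1] += 2.0 * data[:, 0]  # make columns 0 and 1 strongly related

# rowvar=False treats each column as a variable and each row as an observation.
C = np.cov(data, rowvar=False)

print(np.round(C, 2))
print(np.allclose(C, C.T))  # the covariance matrix is symmetric
```

In the printed matrix, the (0,1) entry should be far from zero, while the (2,3) entry should be small but not exactly zero, which is the "or nearly so" on the slide.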

11
Principal Components
• If we find the eigenvectors and eigenvalues of
Cov, this will diagonalize the Cov matrix
• C = cov(data)
• [V, D] = eig(C)
• D is a diagonal matrix of eigenvalues
• diag(D) returns a column vector of the eigenvalues

12
Principal Components
• If we transform our data by V, then Cov of the
transformed data will be D
• data2 = (data - ones(size(data)) * diag(mean(data))) * V
• Then cov(data2) = D
• This means that all the variables in data2 are
uncorrelated
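
A sketch of the transformation in Python with NumPy, again as a stand-in for the MATLAB-style expression on the slide. Subtracting `data.mean(axis=0)` plays the role of the `ones(...) * diag(...)` term, and the data are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Stand-in data with correlated columns (not the helmet data).
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
data = rng.normal(size=(200, 3)) @ mixing

eigenvalues, V = np.linalg.eigh(np.cov(data, rowvar=False))

# Center each column on its mean, then rotate onto the eigenvectors.
centered = data - data.mean(axis=0)
data2 = centered @ V

# The covariance of the transformed data is diagonal, with the
# eigenvalues on the diagonal.
cov2 = np.cov(data2, rowvar=False)
print(np.round(cov2, 6))
print(np.allclose(cov2, np.diag(eigenvalues)))
```

The off-diagonal entries of `cov2` are zero up to rounding error, which is the numerical form of "all the variables in data2 are uncorrelated".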

13
Principal Components
• data2 holds the principal components (PCs) of data
• Furthermore, the variances of the columns of data2 are the
eigenvalues of cov(data)
• This means that the PC with the largest eigenvalue is the
most variable PC
• This corresponds to the longest axis of the
original ellipsoid of data

14
Principal Components
• In some sense, the sum of the e-values is the
overall variance
• We can think of the individual e-values in terms
of what percent of the total they are
• eig() tends to return the e-values in ascending
order
• We want them in descending order
• dsort = -sort(-diag(D))
• Then cumsum(dsort)/sum(dsort) tells us what
fraction of the total variance is contained in the
first k PCs
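
The sorting and cumulative-sum steps, sketched in plain Python. The eigenvalues are made up for illustration and are listed in ascending order to mimic what the slide says eig() tends to return.

```python
from itertools import accumulate

# Made-up eigenvalues, in ascending order.
eigenvalues = [0.25, 0.75, 2.0, 5.0]

# Largest first, so that entry k covers the first k PCs.
dsort = sorted(eigenvalues, reverse=True)

# Running total divided by the overall total variance.
total = sum(dsort)
cumulative_fraction = [running / total for running in accumulate(dsort)]

print(cumulative_fraction)  # [0.625, 0.875, 0.96875, 1.0]
```

With these values the first PC alone holds 62.5% of the total variance and the first two together hold 87.5%.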

15
Principal Components
• General rule: use enough PCs to contain 80-90% of
the total
• Balance this against the number of PCs retained
• If only 2-3 PCs contain most of the total, then
our problem is a lot simpler than we thought
• Besides plots, we can use the PCs to detect
groupings of the data or outliers, say
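
One way to apply the general rule is to find the smallest number of PCs whose cumulative share reaches a chosen threshold. The helper name `components_needed` and the eigenvalues below are hypothetical, chosen only to illustrate the idea.

```python
def components_needed(eigenvalues, threshold):
    """Smallest k such that the k largest eigenvalues reach `threshold`
    as a fraction of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Made-up eigenvalues; cumulative shares are 62.5%, 87.5%, 96.875%, 100%.
eigenvalues = [0.25, 0.75, 2.0, 5.0]

print(components_needed(eigenvalues, 0.80))  # 2
print(components_needed(eigenvalues, 0.90))  # 3
```

The answer moves from 2 to 3 PCs as the threshold goes from 80% to 90%, which is the trade-off between coverage and the number of PCs kept.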

16
Principal Components
• Other multivariate methods
• MANOVA: ANOVA based on several variables rather
than just one
• MV discriminant analysis: what distinguishes one
group from another?
• Canonical correlation: what components of two
sets of data are most correlated?
• All of these involve ideas similar to PCA