Principal Components


SM339 Spring 08
Provided by: johnt1

Principal Components
  • As part of a study on football helmets,
    scientists collected head measurements from a
    number of football players
  • They measured 6 different aspects of the players

Principal Components
  • Dealing with 6 dimensional data is difficult
  • We could plot pairs of variables
  • There are 15 such pairs
  • But it turns out that even this may not give a
    clear picture of the data
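
The count of 15 is just "6 choose 2". A minimal Python sketch of the arithmetic; the labels x1..x6 are placeholders, since the slides do not name the six measurements:

```python
from itertools import combinations

# Placeholder labels for the six head measurements (not named on the slides).
variables = ["x1", "x2", "x3", "x4", "x5", "x6"]

# Every unordered pair of distinct variables is one possible scatter plot.
pairs = list(combinations(variables, 2))
print(len(pairs))  # 6 choose 2 = 15
```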

Principal Components
  • Suppose we have data that is distributed over the
    plane in 3D that goes through (1,0,0), (0,1,0), (0,0,1)
  • No matter which pair of axes we use, the plots
    will just look like a cloud of points
  • We will never realize that the data actually lies
    on a 2 dimensional surface
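
The plane example can be imitated with synthetic points. This is only a sketch: the normal distribution, the sample size and the seed are arbitrary choices, not from the slides. Each coordinate varies freely, yet x + y + z never moves away from 1.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic points on the plane x + y + z = 1, which passes through
# (1,0,0), (0,1,0) and (0,0,1).
points = []
for _ in range(500):
    x = random.gauss(0, 1)
    y = random.gauss(0, 1)
    points.append((x, y, 1.0 - x - y))

# Range of each coordinate: all three vary, so every pairwise
# scatter plot spreads over a 2D region.
spreads = [max(p[i] for p in points) - min(p[i] for p in points)
           for i in range(3)]

# One linear combination of the coordinates does not vary at all.
max_deviation = max(abs((x + y + z) - 1.0) for x, y, z in points)

print(spreads)
print(max_deviation)  # zero up to rounding error
```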

Principal Components
  • Recall the formula for correlation in simple
    linear regression
  • Corr = Sxy/√(Sxx·Syy)
  • The numerator, Sxy, is proportional to covariance
  • It measures the extent to which larger values of
    X are associated with larger (or smaller) values
    of Y
  • If Cov = 0, then X and Y are not linearly related
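
A small Python sketch of the correlation formula, with Sxy, Sxx and Syy taken as the usual sums of products of deviations from the means. The function name and the toy data are illustrative assumptions:

```python
import math

def correlation(xs, ys):
    """Corr = Sxy / sqrt(Sxx * Syy), using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = correlation(xs, [2.0, 4.0, 6.0, 8.0, 10.0])    # larger X goes with larger Y
r_down = correlation(xs, [10.0, 8.0, 6.0, 4.0, 2.0])  # larger X goes with smaller Y
print(r_up, r_down)
```

The sign of the result comes entirely from Sxy; the denominator only rescales it to lie between -1 and 1.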

Principal Components
  • In terms of a plot, this means that the plot of Y
    vs X is just a cloud of points
  • The plot does not tilt in either direction

Principal Components
  • The problem with the data in the plane example is
    that the values are correlated
  • If we could look down the edge of the plane, then
    we would see that there is not a third dimension
    to the data
  • All the data lies in only the two dimensions of
    the plane

Principal Components
  • In reality, data rarely lies exactly on a plane
  • But the cloud of points can extend much more in
    some directions than in others
  • Typically, the data forms a cloud that resembles
    an ellipsoid

Principal Components
  • If we can find the axes of the ellipsoid, we can
    view the data in terms of these components
  • The longest axis is the most interesting
  • The shortest axis does not have much information

Principal Components
  • NOTE: There are two ways to proceed
  • We can work with the Covariance matrix
  • Or we can work with the Correlation matrix
  • The theory is developed for the Cov matrix
  • Sometimes the Corr matrix makes more sense

Principal Components
  • For variables x1, x2, …, xk, define the
    covariance matrix so that the (i,j) element is
    Cov(xi, xj)
  • This means that the matrix will be symmetric
  • If two variables are uncorrelated, then the
    corresponding element of Cov will be zero (or
    nearly so)
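
For reference, one common way to write the sample covariance matrix, assuming n observations with x_{ti} the t-th observation of variable i (this notation is not from the slides). The n-1 divisor matches the default normalization of MATLAB's cov:

```latex
C_{ij} = \operatorname{Cov}(x_i, x_j)
       = \frac{1}{n-1} \sum_{t=1}^{n} (x_{ti} - \bar{x}_i)(x_{tj} - \bar{x}_j),
\qquad C_{ij} = C_{ji},
\qquad C_{ii} = \operatorname{Var}(x_i)
```

Swapping i and j only swaps the two factors in each product, which is why the matrix is symmetric.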

Principal Components
  • If we find the eigenvectors and eigenvalues of
    Cov, this will diagonalize the Cov matrix
  • c = cov(data)
  • [v,d] = eig(c)
  • d is a diagonal matrix of eigenvalues
  • diag(d) returns a list of the eigenvalues
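
What "diagonalize" means here can be written out as follows. C is the covariance matrix, the columns of V are its eigenvectors and D holds the eigenvalues; the symbols \lambda_j and v_j are assumed notation. Because C is symmetric, the eigenvectors can be chosen orthonormal, so V^T V = I:

```latex
C\,v_j = \lambda_j v_j \quad (j = 1, \dots, k)
\;\Longleftrightarrow\; C V = V D
\;\Longrightarrow\; V^{\mathsf{T}} C V = D,
\qquad V = [\, v_1 \;\cdots\; v_k \,],
\quad D = \operatorname{diag}(\lambda_1, \dots, \lambda_k)
```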

Principal Components
  • If we transform our data by v, then cov of the
    transformed data will be d
  • data2 = (data - ones(size(data))*diag(mean(data)))*v
  • Then cov(data2) = d
  • This means that all the variables in data2 are
    uncorrelated
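
The same steps can be tried in Python with NumPy, used here as a stand-in for the MATLAB cov and eig calls on the slides. This is a sketch on synthetic data: the mixing matrix, sample size and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data below are synthetic

# Made-up stand-in for `data`: 200 observations of 3 correlated variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
data = rng.normal(size=(200, 3)) @ mixing

c = np.cov(data, rowvar=False)  # columns are variables, as in MATLAB cov(data)
evals, v = np.linalg.eigh(c)    # eigh: for symmetric matrices, eigenvalues ascending

# Subtract each column mean, then rotate onto the eigenvectors.
data2 = (data - data.mean(axis=0)) @ v

# Covariance of the transformed data: diagonal, with the eigenvalues on it.
c2 = np.cov(data2, rowvar=False)
print(np.round(c2, 6))
print(np.allclose(c2, np.diag(evals)))
```

Broadcasting in `data - data.mean(axis=0)` does the job of the ones(size(data))*diag(mean(data)) construction.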

Principal Components
  • The columns of data2 are called the principal
    components of data
  • Furthermore, the variances of data2 are the
    eigenvalues of cov(data)
  • This means that the PC with the largest eigenvalue
    is the most variable
  • This corresponds to the longest axis of the
    original ellipsoid of data

Principal Components
  • In some sense, the sum of the e-values is the
    overall variance
  • We can think of the individual e-values in terms
    of what percent of the total they are
  • eig() tends to return the e-values in ascending
    order
  • We want them in descending order
  • dsort = -sort(-diag(d))
  • Then cumsum(dsort)/sum(dsort) tells us what
    percent of the total would be contained in the
    first k PCs
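
A short Python sketch of the sort-and-cumulate step. The eigenvalues are made-up numbers chosen so the arithmetic is easy to follow:

```python
from itertools import accumulate

# Made-up eigenvalues, in the ascending order eig() tends to return.
evals = [0.5, 1.5, 3.0, 5.0]

dsort = sorted(evals, reverse=True)  # largest first
total = sum(dsort)                   # the "overall variance"

# Fraction of the total contained in the first k PCs, for k = 1, 2, ...
cum_fraction = [s / total for s in accumulate(dsort)]
print(cum_fraction)  # [0.5, 0.8, 0.95, 1.0]
```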

Principal Components
  • General rule: use enough PCs to contain 80-90% of
    the total
  • Balance this against how many PCs that takes
  • If only 2-3 PCs contain most of the total, then
    our problem is a lot simpler than we thought
  • Besides plots, we can use the PCs to detect
    groupings of the data or outliers, say
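
One way to apply the rule of thumb is a small helper that returns the fewest PCs reaching a chosen share of the total. The function name, the default threshold and the eigenvalues below are all illustrative assumptions:

```python
def components_needed(eigenvalues, threshold=0.8):
    """Smallest k whose k largest eigenvalues hold at least `threshold` of the total."""
    dsort = sorted(eigenvalues, reverse=True)
    total = sum(dsort)
    running = 0.0
    for k, value in enumerate(dsort, start=1):
        running += value
        if running >= threshold * total:
            return k
    return len(dsort)

# Made-up eigenvalues for six variables.
evals = [4.1, 1.2, 0.4, 0.2, 0.07, 0.03]
k80 = components_needed(evals, 0.80)
k90 = components_needed(evals, 0.90)
print(k80, k90)  # 2 3
```

With these numbers, two PCs already pass 80% and a third is needed for 90%, which is the kind of trade-off the rule asks you to weigh.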

Principal Components
  • Other multivariate methods
  • MANOVA: ANOVA based on several variables rather
    than just one
  • MV discriminant analysis: what distinguishes one
    group from another?
  • Canonical correlation: what components of two
    sets of data are most correlated?
  • All of these involve ideas similar to PCA