Principal Component Analysis and Cluster Analysis on Minnesota Vital Statistics 2002-2006

1
Principal Component Analysis and Cluster
Analysis on Minnesota Vital Statistics 2002-2006

Teri Johnson

2
Outline

Principal Component Analysis
Cluster Analysis
Cluster Analysis on Principal Component Scores
Data
Conclusion
Related Work

3
Principal Component Analysis (PCA)

Introduced by Karl Pearson and Harold Hotelling
Explain the variation in a collection of
correlated variables in terms of a new collection
of uncorrelated variables

4
Theory of PCA
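PCA forms components that are linear combinations
of the original variables x_1, ..., x_p. The ith
component can be written as
$y_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \cdots + \alpha_{ip} x_p = \alpha_i^{T} x$
where the alphas can be written in vector form as
$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ip})^{T}$.
With $S$ the covariance matrix of the variables, the variance of the component is
$\operatorname{Var}(y_i) = \alpha_i^{T} S \alpha_i$
The covariance of two components is
$\operatorname{Cov}(y_i, y_j) = \alpha_i^{T} S \alpha_j$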
5
Theory of PCA
Then for each component, we want to maximize the
variance $\alpha_i^{T} S \alpha_i$ such that
$\alpha_i^{T} \alpha_i = 1$ and $\alpha_i^{T} \alpha_j = 0$ for $j < i$
To do this we use the method of Lagrange
multipliers, which finds the maximum of a
function of many variables subject to one or
more constraints
This gives a solution where $\alpha_i$ is an eigenvector of the covariance matrix $S$
corresponding to the ith largest eigenvalue $\lambda_i$
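
The Lagrange multiplier step for the first component can be written out as a short derivation. The symbols are assumptions made for this sketch: $S$ is the covariance matrix of the variables, $\alpha_1$ the coefficient vector of the first component, and $\lambda$ the multiplier. The later components follow the same pattern with the extra orthogonality constraints.

```latex
\begin{aligned}
L(\alpha_1, \lambda) &= \alpha_1^{T} S \alpha_1 - \lambda\,(\alpha_1^{T}\alpha_1 - 1) \\
\frac{\partial L}{\partial \alpha_1} &= 2 S \alpha_1 - 2\lambda \alpha_1 = 0
  \;\Longrightarrow\; S \alpha_1 = \lambda \alpha_1 \\
\operatorname{Var}(y_1) &= \alpha_1^{T} S \alpha_1 = \lambda\,\alpha_1^{T}\alpha_1 = \lambda
\end{aligned}
```

So the stationary points are eigenvectors of $S$, and the variance of the component equals its eigenvalue, which is why the largest eigenvalue gives the first component.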
6
Theory of PCA
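We rescale the principal components so that more
important components have a larger scale
Form a matrix $A = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ whose columns are the eigenvectors
Have a matrix $\Lambda$ whose diagonal elements are the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$
Using these matrices, the covariance matrix $S$ can be written as
$S = A \Lambda A^{T}$
Rescale the vectors using
$\alpha_i^{*} = \lambda_i^{1/2} \alpha_i$
Thus, with $A^{*} = (\alpha_1^{*}, \ldots, \alpha_p^{*})$, the covariance matrix in terms of the rescaled vectors is
$S = A^{*} A^{*T}$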
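
A minimal sketch of the eigenvector construction in Python with NumPy, assuming the components are taken from the sample covariance matrix. The function name `pca_eig` and the random data are hypothetical and are not the Minnesota data set; the sketch only illustrates that the component scores have a diagonal covariance matrix and that the rescaled eigenvectors reproduce the covariance matrix.

```python
import numpy as np

def pca_eig(X):
    """Return the covariance matrix, eigenvalues (descending),
    unit eigenvectors (as columns), and component scores."""
    Xc = X - X.mean(axis=0)                # center each variable
    S = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # put the largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                  # linear combinations of the variables
    return S, eigvals, eigvecs, scores

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations of 3 correlated variables
Z = rng.normal(size=(50, 3))
X = Z @ np.array([[2.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 0.5]])

S, lam, A, Y = pca_eig(X)
A_star = A * np.sqrt(lam)   # scale column i by the square root of eigenvalue i

print("eigenvalues:", lam)
print("covariance recovered from rescaled vectors:",
      np.allclose(A_star @ A_star.T, S))
```

The components are uncorrelated because the covariance matrix of the scores `Y` is the diagonal matrix of eigenvalues.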
7
Criteria for choosing the number of Principal
Components

Keep the components that explain about 70-90% of
the total variation.
Keep components whose eigenvalues are larger than
the average eigenvalue.
Choose the number of components based on the
elbow of the scree plot.
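
The first two criteria are simple arithmetic on the eigenvalues and can be sketched as below. The eigenvalues and the 0.8 threshold are made-up values for illustration, and the function names are hypothetical. The scree-plot criterion is a visual judgment and is not coded here.

```python
def cumulative_proportion(eigenvalues):
    """Cumulative share of total variation explained by the leading components."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for lam in eigenvalues:
        running += lam
        shares.append(running / total)
    return shares

def n_components_by_variance(eigenvalues, threshold=0.8):
    """Smallest number of leading components explaining at least `threshold`."""
    for k, share in enumerate(cumulative_proportion(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)

def n_components_above_average(eigenvalues):
    """Number of components whose eigenvalue exceeds the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam > mean)

# Hypothetical eigenvalues, sorted from largest to smallest
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

print(cumulative_proportion(eigenvalues))
print(n_components_by_variance(eigenvalues, threshold=0.8))
print(n_components_above_average(eigenvalues))
```

The two rules need not agree, which is one reason the scree plot is often consulted as well.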

8
PCA Graphs

Scatterplot

Biplot

9
Cluster Analysis

Introduced by R. C. Tryon
Determine groups of observations that are
internally uniform and well separated from other groups
Many procedures

10
K-means

Separates the data set into k groups
Groups are found by minimizing the within-group sum of
squares over all the variables

11
K-Means Algorithm

1. Find an initial partition.
2. Calculate the change in the within-group sum of
squares from moving each observation from its
cluster to another.
3. Make the changes that lead to the greatest improvement.
4. Repeat steps 2 and 3 until moving an observation
gives no further improvement.
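
The exchange procedure above can be sketched in plain Python. This is a simplified, brute-force illustration, not the presentation's implementation: it applies only the single best move per pass and recomputes the within-group sum of squares from scratch each time. The function names, the toy points, and the initial partition are all hypothetical.

```python
def wgss(points, labels, k):
    """Within-group sum of squares, summed over all clusters and variables."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

def kmeans_exchange(points, labels, k):
    """Repeatedly move the one observation that most reduces the WGSS."""
    labels = list(labels)
    while True:
        current = wgss(points, labels, k)
        best_gain, best_move = 0.0, None
        for i in range(len(points)):
            original = labels[i]
            for c in range(k):
                if c == original:
                    continue
                labels[i] = c                      # try the move
                gain = current - wgss(points, labels, k)
                if gain > best_gain + 1e-12:       # keep the largest improvement
                    best_gain, best_move = gain, (i, c)
            labels[i] = original                   # undo the trial move
        if best_move is None:                      # no move improves the criterion
            return labels
        i, c = best_move
        labels[i] = c

# Hypothetical data: two well-separated groups and a deliberately poor start
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
initial = [0, 1, 0, 1, 0, 1]
result = kmeans_exchange(points, initial, 2)
print(result, wgss(points, result, 2))
```

Each accepted move strictly lowers the criterion, so the loop stops at a partition that no single move can improve, which may be only a local optimum.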

12
Agglomerative Hierarchical Clustering

1. Start with each observation in its own cluster.
2. Find the pair of distinct clusters with the
smallest distance and combine them.
3. Repeat step 2 until there is only one cluster.

13
Cluster Analysis Distance

Many ways to measure distance
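Group Average Clustering
$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$
$d_{AB}$: distance between two clusters A and B
$d_{ij}$: distance between individuals i and j
$n_A$, $n_B$: number of observations in each cluster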
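
A minimal sketch of agglomerative clustering with the group average distance, in plain Python. It assumes Euclidean distance between individuals, which the slides do not specify, and the function names and the four toy points are hypothetical. The search over cluster pairs is brute force and meant only for small examples.

```python
from math import dist

def group_average(a, b, points):
    """Mean of all pairwise distances between clusters a and b (index lists)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    """Start with singletons and merge the closest pair until one cluster is left.
    Returns the merge history as (members of merged cluster, merge distance)."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        candidates = [
            (group_average(clusters[p], clusters[q], points), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        d, p, q = min(candidates)                  # closest pair of clusters
        merged = clusters[p] + clusters[q]
        history.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return history

# Hypothetical data: two pairs of nearby points
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
history = agglomerate(points)
for members, d in history:
    print(members, round(d, 4))
```

Cutting the merge history at a chosen distance gives the clusters; here the two nearby pairs merge first and the final merge happens at a much larger distance.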
14
The Use of Principal Components on Cluster
Analysis

ADVANTAGES
The time it takes to compute the clusters is
greatly reduced
Gives an idea of what the clusters would look like
and the characteristics of each cluster
Issues related to multicollinearity in the data
are avoided

DISADVANTAGES
Number of clusters varies
Distance varies
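
The multicollinearity point can be illustrated with a small NumPy sketch: two nearly identical variables are strongly correlated, while the principal component scores computed from them are uncorrelated. The data are randomly generated for illustration only and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: x2 is almost a copy of x1, so the two are collinear
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Principal component scores from the sample covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
scores = Xc @ eigvecs[:, order]

corr_original = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

print("correlation of x1 and x2:", round(corr_original[0, 1], 3))
print("score correlations are zero off the diagonal:",
      np.allclose(corr_scores, np.eye(3), atol=1e-8))
```

Clustering on the first few columns of `scores` instead of on `X` also reduces the number of variables entering each distance calculation. Distances computed on a subset of the scores differ from distances on the original variables, which is consistent with the disadvantages listed above.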

15
DATA

Collected from the Minnesota Department of Health:
demographic, birthrate, and mortality information
For years 2002 and 2006
Split into nine different regions

16
PCA of all MN counties

The First component: Average Component
The Second component: Mothers/Money Component
Positive values → poverty, child decency
The Third component: Death/Births Component
Positive values → births

17
PCA of all MN counties
18
PCA for the Regions in MN
Regions: North Prairie, South Prairie, East
Central, North, West Central, North Central,
South East, Metro, North East

The First component: Average Component

19
PCA for the Regions in MN
20
PCA on the Counties of the South East Region

The First Component: Average Component
The Second Component: Money Component
Negative values → poverty
The Third Component: Death/Births Component
Negative values → death

21
PCA on the Counties of the South East Region
22
PCA on the Counties of the East Central Region

The First component: Average Component
The Second component: Ethnicity Component
Negative values → American Indian
Positive values → White/Latin

23
PCA on the Counties of the East Central Region
24
PCA on the counties of the Metro Region

The First Component: Average Component

25
PCA on the counties of the Metro Region
26
PCA on the counties in the North Central Region

The First Component: Average Component
The Second Component: Money Component
Negative values → poverty

27
PCA on the counties in the North Central Region
28
PCA on the counties in the West Central Region

The First Component: Average Component
The Second Component: Mothers/Money Component
Negative values → money/type of birth
Positive values → poverty/birth rates/mothers

29
PCA on the counties in the West Central Region
30
PCA on the counties of the North Prairie Region

The First Component: Average Component
The Second Component: Wealth Component
Negative values → wealth
The Third Component: Death/Births Component
Negative values → death

31
PCA on the counties of the Northern Prairie Region
32
PCA on the counties of the South Prairie Region

The First Component: Average Component
The Second Component: Ethnicity Component
Negative values → minorities
The Third Component: American Indian/Births
Component
Negative values → births
Positive values → American Indian
The Fourth Component: Money Component
Negative values → income/unemployment

33
PCA on the counties of the Southern Prairie Region
34
PCA on the counties of the Northern region

The First Component: Average Component

35
PCA on the counties of the Northern region
36
PCA on the counties of the North East Region

The First Component: Average Component
The Second Component: Death/Births Component
Negative values → birth

37
PCA on the counties of the North East Region
38
Cluster Analysis on the counties of Minnesota
Original Data
PC Scores
39
Cluster Analysis on the counties of different
Regions
Original Data
PC Scores
40
Cluster Analysis on the counties of South East
Region
Original Data
PC Scores
41
Cluster Analysis on the counties of East Central
Region
Original Data
PC Scores
42
Cluster Analysis on the counties of Metro Region
Original Data
PC Scores
43
Cluster Analysis on the counties of North Central
Region
Original Data
PC Scores
44
Cluster Analysis on the counties of West Central
Region
Original Data
PC Scores
45
Cluster Analysis on the counties of North Prairie
Region
Original Data
PC Scores
46
Cluster Analysis on the counties of South Prairie
Region
Original Data
PC Scores
47
Cluster Analysis on the counties of Northern
Region
Original Data
PC Scores
48
Cluster Analysis on the counties of North East
Region
Original Data
PC Scores
49
Conclusion: PCA

First component is the average component
Second component is usually the money component
Third is a birth/death component
3 PCs (entire state) vs. 1 PC (region)
PCs differ by region
3 regions: no obvious outliers

50
Conclusion: Cluster Analysis

3 groups vs. 5 groups
Clusters differ between PC scores and original
data
What the counties are clustered by also differs
between PC scores and original data

51
Related Work

PCA
Sparse Principal Component Analysis
Kernel Principal Component Analysis
Cluster Analysis
divisive clustering
two-step clustering
inter-cluster dissimilarity: centroid method,
median clustering, and Ward's method

52
References

Everitt, Brian S. An R and S-Plus Companion to Multivariate Analysis (Springer Texts in Statistics). New York: Springer, 2007.
Jolliffe, I. T. Principal Component Analysis. New York: Springer, 2002.
Everitt, Brian, and Graham Dunn. Applied Multivariate Data Analysis. London: Arnold, Oxford UP, 2001.
Troy, Austin. "Geodemographic Segmentation." Encyclopedia of GIS. Springer, 2008. 348-349.
Rovan, J. "Principal Component Analysis." 2006. Univerze v Ljubljani. Spring 2009 <http://miha.ef.uni-lj.si/_dokumenti3plus2/193004/MVA_06_Principal_Component_Analysis.pdf>.
Manly, Bryan F. J. Multivariate Statistical Methods: A Primer. Boca Raton, FL: Chapman & Hall/CRC P, 2005.
"Minnesota Vital Statistics State and County Trends." Minnesota Department of Health. 2008. 2008-2009 <http://www.health.state.mn.us/divs/chs/Trends/>.
"MFRC Landscape Program Regions." Minnesota Forest Resources Council. 2008-2009 <http://www.frc.state.mn.us/Landscp/LandRegions.html>.
Zou, Hui, Trevor Hastie, and Rob Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15 (2006): 262-86.
Twining, Carole. "Kernel Principal Component Analysis." University of Edinburgh School of Informatics. 2001. Manchester University. Spring 2009 <http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TWINING1/kpca2html.html>.
Garson, G. D. "Cluster Analysis." College of Humanities and Social Sciences. 2009. North Carolina State University. Spring 2009 <http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm>.