Title: Principal Component Analysis and Cluster Analysis on Minnesota Vital Statistics 2002-2006
1Principal Component Analysis and Cluster
Analysis on Minnesota Vital Statistics 2002-2006
2Outline
- Principal Component Analysis
- Cluster Analysis
- Cluster Analysis on Principal Component Scores
- Data
- Conclusion
- Related Work
3Principal Component Analysis (PCA)
- Introduced by Karl Pearson and Harold Hotelling
- Explain the variation in a collection of
correlated variables in terms of a new collection
of uncorrelated variables
4Theory of PCA
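PCA makes components that are linear combinations
of the original variables. The ith component can be
written as
$y_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \cdots + \alpha_{ip} x_p = \alpha_i^T x$
where the alphas can be written in vector form
$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ip})^T$.
The variance of the component is
$\mathrm{Var}(y_i) = \alpha_i^T S \alpha_i$
The covariance of two components is
$\mathrm{Cov}(y_i, y_j) = \alpha_i^T S \alpha_j$
where $S$ is the sample covariance matrix of the original variables $x = (x_1, \ldots, x_p)^T$.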
5Theory of PCA
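Then for each component, we want to maximize the
variance $\alpha_i^T S \alpha_i$ such that
$\alpha_i^T \alpha_i = 1$ and $\alpha_i^T \alpha_j = 0$ for $j < i$
where $\alpha_i$ is the coefficient vector of the ith component and $S$ is the sample covariance matrix.
To do this we use the method of Lagrange
multipliers, which allows us to find the maximum of a
function of many variables that is subject to
one or more constraints.
This gives a solution where $\alpha_i$ is an eigenvector of
$S$ corresponding to the ith largest eigenvalue $\lambda_i$.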
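The eigenvector result above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set (the variable names and the data are illustrative, not the Minnesota data): it takes the eigenvectors of the sample covariance matrix as coefficient vectors and confirms that the resulting components are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

# Synthetic stand-in data: 50 observations of 3 variables, two of them correlated
# (illustrative only; not the Minnesota data).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(50, 1)),
    2.0 * base + 0.5 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

S = np.cov(X, rowvar=False)            # sample covariance matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]      # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of `scores` is one linear combination of the centered variables.
scores = (X - X.mean(axis=0)) @ eigvecs
score_cov = np.cov(scores, rowvar=False)

print("eigenvalues:        ", np.round(eigvals, 3))
print("component variances:", np.round(np.diag(score_cov), 3))
```

The off-diagonal entries of `score_cov` are zero up to rounding error, which is the "uncorrelated variables" property, and the unit-length constraint shows up as `eigvecs.T @ eigvecs` being the identity matrix.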
6Theory of PCA
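We rescale the principal components so more
important components have a higher scale.
Form a matrix $A = (\alpha_1, \alpha_2, \ldots, \alpha_p)$ whose columns are the eigenvectors of the covariance matrix $S$.
Have a matrix $\Lambda$ whose diagonal entries are the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$.
Using these matrices, we have
$S = A \Lambda A^T$
Rescale the vectors using
$\alpha_i^* = \lambda_i^{1/2} \alpha_i$
Thus, with $A^* = A \Lambda^{1/2}$, the covariance matrix in terms of the rescaled vectors is
$S = A^* (A^*)^T$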
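A small numerical check of the rescaling may help. This is a sketch under stated assumptions: NumPy is used, the 3x3 covariance matrix is hypothetical, and the names `A`, `Lam`, and `A_star` are chosen here to stand for the eigenvector matrix, the diagonal eigenvalue matrix, and the rescaled eigenvectors (each eigenvector multiplied by the square root of its eigenvalue).

```python
import numpy as np

# Hypothetical covariance matrix, chosen only for illustration.
S = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.4],
              [0.6, 0.4, 1.0]])

eigvals, A = np.linalg.eigh(S)         # columns of A are unit-length eigenvectors
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
eigvals, A = eigvals[order], A[:, order]

Lam = np.diag(eigvals)                 # diagonal matrix of eigenvalues
A_star = A * np.sqrt(eigvals)          # multiplies column i by sqrt(lambda_i)

print("S == A Lam A^T:    ", np.allclose(A @ Lam @ A.T, S))
print("S == A* A*^T:      ", np.allclose(A_star @ A_star.T, S))
print("squared lengths of rescaled vectors:", np.round((A_star ** 2).sum(axis=0), 4))
```

After rescaling, the squared length of each vector equals its eigenvalue, so components that explain more variance carry larger coefficients.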
7Criteria for choosing the number of Principal
Components
- Keep the components that explain about 70-90% of
the total variation.
- Keep components whose eigenvalues are larger than
the average eigenvalue.
- Choose the number of components based on the
elbow of the scree plot.
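The first two criteria can be computed directly from the eigenvalues. The sketch below assumes NumPy; the function name `choose_components`, the 80% target, and the eigenvalues are hypothetical and only illustrate the arithmetic. The elbow rule is a visual judgment on a plot of eigenvalue against component number, so it is not automated here.

```python
import numpy as np


def choose_components(eigvals, target=0.80):
    """Return the component counts suggested by the first two rules."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals / eigvals.sum())
    # Rule 1: smallest number of components whose cumulative share reaches the target.
    k_share = int(np.searchsorted(cumulative, target) + 1)
    # Rule 2: number of components whose eigenvalue exceeds the average eigenvalue.
    k_average = int(np.sum(eigvals > eigvals.mean()))
    return k_share, k_average, cumulative


# Hypothetical eigenvalues, for illustration only.
k_share, k_average, cumulative = choose_components([4.0, 2.0, 1.0, 0.5, 0.5])
print("cumulative share of variation:", cumulative)
print("components for 80% of variation:", k_share)
print("components above the average eigenvalue:", k_average)
```

The rules need not agree: with these made-up eigenvalues the 80% rule keeps three components while the average-eigenvalue rule keeps two.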
8PCA Graphs
9Cluster Analysis
- Introduced by R. C. Tryon
- Determine groups of observations that are uniform
within and well separated from other groups
- Many procedures
10K-means
- Separates the data set into k groups
- Groups are chosen by minimizing the within-group sum of
squares over all the variables
11K-Means Algorithm
1. Find an initial partition.
2. Calculate the change in the within-group sum of
squares from moving each observation from its own
cluster to another.
3. Make the change that leads to the greatest improvement.
4. Repeat steps 2 and 3 until no move of an observation
gives further improvement.
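The steps above can be sketched in plain Python. This is a deliberately naive version that recomputes the within-group sum of squares (WGSS) for every candidate move; the function names `wgss` and `exchange_kmeans`, the toy points, and the rule that a cluster is never emptied are assumptions made for the illustration. Library k-means routines usually use a different, faster update scheme.

```python
def wgss(points, labels, k):
    """Within-group sum of squares, summed over all variables."""
    total = 0.0
    for g in range(k):
        members = [p for p, lab in zip(points, labels) if lab == g]
        if not members:
            continue
        dim = len(members[0])
        centre = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centre[d]) ** 2 for p in members for d in range(dim))
    return total


def exchange_kmeans(points, labels, k):
    """Move one observation at a time for as long as the WGSS keeps decreasing."""
    labels = list(labels)
    current = wgss(points, labels, k)
    while True:
        best = None  # (new WGSS, observation index, target cluster)
        for i, old in enumerate(labels):
            if labels.count(old) == 1:
                continue  # assumption: never empty a cluster
            for g in range(k):
                if g == old:
                    continue
                labels[i] = g                    # try the move
                candidate = wgss(points, labels, k)
                labels[i] = old                  # undo it
                if candidate < current - 1e-12 and (best is None or candidate < best[0]):
                    best = (candidate, i, g)
        if best is None:                         # no move improves the WGSS
            return labels, current
        current, i, g = best                     # make the best move and repeat
        labels[i] = g


# Toy data: two well-separated groups and a deliberately poor initial partition.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
initial = [0, 1, 0, 1, 0, 1]
labels, final = exchange_kmeans(points, initial, k=2)
print(labels, round(final, 4))
```

Because every accepted move strictly lowers the WGSS, the loop must stop, but it can stop at a local minimum, which is why the choice of initial partition matters.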
12Agglomerative Hierarchical Clustering
1. Start with each observation in its own cluster.
2. Find the pair of distinct clusters with the
smallest distance and combine them.
3. Repeat step 2 until there is only one cluster.
13Cluster Analysis Distance
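- Many ways to measure distance
- Group Average Clustering
$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$
$d_{AB}$: distance between two clusters $A$ and $B$
$d_{ij}$: distance between individuals $i$ and $j$
$n_A$, $n_B$: number of observations in each cluster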
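Agglomerative clustering with the group average distance can be sketched in a few lines of plain Python. The sketch assumes Euclidean distance between individuals and uses hypothetical names (`group_average`, `agglomerate`) and toy points; it recomputes every cluster distance at each step, which is fine for a handful of observations but far slower than a library routine.

```python
import math
from itertools import combinations


def group_average(a, b, points):
    """Average of all pairwise distances between members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Merge the closest pair of clusters until only one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of observation indices.
    """
    clusters = [(i,) for i in range(len(points))]   # each observation on its own
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(pair[0], pair[1], points))
        d = group_average(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, d))
    return history


# Toy data: two tight pairs and one distant observation.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
history = agglomerate(points)
for a, b, d in history:
    print(a, "+", b, "at distance", round(d, 3))
```

The merge history is the information a dendrogram displays; cutting it at a chosen distance gives a partition into groups.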
14The Use of Principal Components on Cluster
Analysis
- ADVANTAGES
  - The time it takes to compute the clusters is
    greatly reduced
  - Gives an idea of what the clusters would look like
    and the characteristics of each cluster
  - Issues related to multicollinearity in the data
    are avoided
- DISADVANTAGES
  - The number of clusters can differ from that found with the original data
  - Distances between observations differ from those in the original data
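The trade-off can be seen by comparing distances computed from PC scores with distances computed from the original variables. The sketch below assumes NumPy and a synthetic table of 30 observations by 8 variables (a stand-in, not the Minnesota data); `pairwise` is a helper written for this example.

```python
import numpy as np

# Synthetic stand-in for a counties-by-variables table (not the Minnesota data):
# 8 correlated variables driven mostly by 2 underlying factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 2))
loadings = rng.normal(size=(2, 8))
X = latent @ loadings + 0.1 * rng.normal(size=(30, 8))

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                     # PC scores, one column per component


def pairwise(M):
    """Euclidean distance between every pair of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


d_original = pairwise(Xc)
d_all_scores = pairwise(scores)           # all 8 components kept
d_two_scores = pairwise(scores[:, :2])    # only the first 2 components kept

share = eigvals[:2].sum() / eigvals.sum()
print("share of variation in the first 2 PCs:", round(share, 3))
print("largest change in a distance when using 2 PCs:",
      round(np.abs(d_original - d_two_scores).max(), 3))
```

With all components kept the distances are unchanged, since the scores are only a rotation of the centered data. Dropping components shrinks some distances, so a clustering method working on the reduced scores can form different groups. The scores are also uncorrelated, which is how the multicollinearity issue is sidestepped.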
15DATA
- Collected from Minnesota Department of Health
- Demographic, birth rate, and mortality information
- For years 2002 and 2006
- Split into nine different regions
16PCA of all MN counties
- The First Component: Average Component
- The Second Component: Mothers/Money Component
  - Positive values → poverty, child dependency
- The Third Component: Death/Births Component
  - Positive values → births
17PCA of all MN counties
18PCA for the Regions in MN
Regions: North Prairie, South Prairie, East
Central, North, West Central, North Central,
South East, Metro, North East
- The First Component: Average Component
19PCA for the Regions in MN
20PCA on the Counties of the South East Region
- The First Component: Average Component
- The Second Component: Money Component
  - Negative values → poverty
- The Third Component: Death/Births Component
  - Negative values → death
21PCA on the Counties of the South East Region
22PCA on the Counties of the East Central Region
- The First Component: Average Component
- The Second Component: Ethnicity Component
  - Negative values → American Indian
  - Positive values → White/Latin
23PCA on the Counties of the East Central Region
24PCA on the counties of the Metro Region
- The First Component: Average Component
25PCA on the counties of the Metro Region
26PCA on the counties in the North Central Region
- The First Component: Average Component
- The Second Component: Money Component
  - Negative values → poverty
27PCA on the counties in the North Central Region
28PCA on the counties in the West Central Region
- The First Component: Average Component
- The Second Component: Mothers/Money Component
  - Negative values → money/type of birth
  - Positive values → poverty/birth rates/mothers
29PCA on the counties in the West Central Region
30PCA on the counties of the North Prairie Region
- The First Component: Average Component
- The Second Component: Wealth Component
  - Negative values → wealth
- The Third Component: Death/Births Component
  - Negative values → death
31PCA on the counties of the North Prairie Region
32PCA on the counties of the South Prairie Region
- The First Component: Average Component
- The Second Component: Ethnicity Component
  - Negative values → minorities
- The Third Component: American Indian/Births Component
  - Negative values → births
  - Positive values → American Indian
- The Fourth Component: Money Component
  - Negative values → income/unemployment
33PCA on the counties of the South Prairie Region
34PCA on the counties of the Northern region
- The First Component: Average Component
35PCA on the counties of the Northern region
36PCA on the counties of the North East Region
- The First Component: Average Component
- The Second Component: Death/Births Component
  - Negative values → birth
37PCA on the counties of the North East Region
38Cluster Analysis on the counties of Minnesota
Original Data
PC Scores
39Cluster Analysis on the counties of different
Regions
Original Data
PC Scores
40Cluster Analysis on the counties of South East
Region
Original Data
PC Scores
41Cluster Analysis on the counties of East Central
Region
Original Data
PC Scores
42Cluster Analysis on the counties of Metro Region
Original Data
PC Scores
43Cluster Analysis on the counties of North Central
Region
Original Data
PC Scores
44Cluster Analysis on the counties of West Central
Region
Original Data
PC Scores
45Cluster Analysis on the counties of North Prairie
Region
Original Data
PC Scores
46Cluster Analysis on the counties of South Prairie
Region
Original Data
PC Scores
47Cluster Analysis on the counties of Northern
Region
Original Data
PC Scores
48Cluster Analysis on the counties of North East
Region
Original Data
PC Scores
49Conclusion: PCA
- First component is the average component
- Second component is usually the money component
- Third is a birth/death component
- 3 PCs (entire state) vs. 1 PC (regions)
- PCs differ by region
- 3 regions: no obvious outliers
50Conclusion: Cluster Analysis
- 3 groups vs. 5 groups
- Different clusters from PC scores and original
data
- What the clusters are grouped by differs between PC scores and original
data
51Related Work
- PCA
  - Sparse Principal Component Analysis
  - Kernel Principal Component Analysis
- Cluster Analysis
  - Divisive clustering
  - Two-step clustering
  - Inter-cluster dissimilarity: centroid method,
    median clustering, and Ward's method
52References
- Everitt, Brian S. An R and S-Plus Companion to Multivariate Analysis (Springer Texts in Statistics). New York: Springer, 2007.
- Jolliffe, I. T. Principal Component Analysis. New York: Springer, 2002.
- Everitt, Brian, and Graham Dunn. Applied Multivariate Data Analysis. London: Arnold, Oxford UP, 2001.
- Troy, Austin. "Geodemographic Segmentation." Encyclopedia of GIS. Springer, 2008. 348-349.
- Rovan, J. "Principal Component Analysis." 2006. Univerze v Ljubljani. Spring 2009 <http://miha.ef.uni-lj.si/_dokumenti3plus2/193004/MVA_06_Principal_Component_Analysis.pdf>.
- Manly, Bryan F. J. Multivariate Statistical Methods: A Primer. Boca Raton, FL: Chapman & Hall/CRC P, 2005.
- "Minnesota Vital Statistics State and County Trends." Minnesota Department of Health. 2008. 2008-2009 <http://www.health.state.mn.us/divs/chs/Trends/>.
- "MFRC Landscape Program Regions." Minnesota Forest Resources Council. 2008-2009 <http://www.frc.state.mn.us/Landscp/LandRegions.html>.
- Zou, Hui, Trevor Hastie, and Rob Tibshirani. "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics 15 (2006): 262-86.
- Twining, Carole. "Kernel Principal Component Analysis." University of Edinburgh School of Informatics. 2001. Manchester University. Spring 2009 <http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TWINING1/kpca2html.html>.
- Garson, G. D. "Cluster Analysis." College of Humanities and Social Sciences. 2009. North Carolina State University. Spring 2009 <http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm>.
53Acknowledgements