Principal Component Analysis and Cluster Analysis on Minnesota Vital Statistics 2002-2006

1
Principal Component Analysis and Cluster
Analysis on Minnesota Vital Statistics 2002-2006
  • Teri Johnson

2
Outline
  • Principal Component Analysis
  • Cluster Analysis
  • Cluster Analysis on Principal Component Scores
  • Data
  • Conclusion
  • Related Work

3
Principal Component Analysis (PCA)
  • Introduced by Karl Pearson and Harold Hotelling
  • Explain the variation in a collection of
    correlated variables in terms of a new collection
    of uncorrelated variables

4
Theory of PCA
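PCA makes components that are linear combinations
of the original variables $x_1, \dots, x_p$. The
ith component can be written as
$y_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \dots + \alpha_{ip} x_p$
where the alphas can be written in vector form
$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ip})^T$.
The variance of the component is
$\mathrm{Var}(y_i) = \alpha_i^T S \alpha_i$
The covariance of two components is
$\mathrm{Cov}(y_i, y_j) = \alpha_i^T S \alpha_j$
where $S$ is the sample covariance matrix of the
original variables.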
5
Theory of PCA
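Then for each component, we want to maximize the
variance $\alpha_i^T S \alpha_i$, where $S$ is the sample
covariance matrix and $\alpha_i$ the coefficient vector, such that
$\alpha_i^T \alpha_i = 1$ and, for $j < i$, $\alpha_i^T \alpha_j = 0$
To do this we use the method of Lagrange
multipliers, which allows us to find the maximum of a
function of many variables that is subject to
one or more constraints
This gives a solution where $\alpha_i$ is an eigenvector of
$S$ corresponding to the ith largest eigenvalue $\lambda_i$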
6
Theory of PCA
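We rescale the principal components so more
important components have a higher scale
Form a matrix $A$ whose columns are the eigenvectors, where $A = [\alpha_1, \alpha_2, \dots, \alpha_p]$
Have a matrix $\Lambda$ whose diagonal entries are the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$
Using these matrices, we have $S = A \Lambda A^T$ for the covariance matrix $S$
Rescale the vectors using $\alpha_i^* = \lambda_i^{1/2} \alpha_i$
Thus, in terms of the rescaled vectors, the covariance matrix is $S = A^* (A^*)^T$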
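The theory on the last three slides can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using a small made-up data matrix (not the Minnesota data); the function name `pca_from_covariance` is hypothetical. It computes the sample covariance matrix S, its eigenvalues and eigenvectors, the rescaled vectors, and the component scores.

```python
import numpy as np

def pca_from_covariance(X):
    """PCA via the eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                   # centre each variable
    S = np.cov(Xc, rowvar=False)              # sample covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    lam = eigvals[order]
    A = eigvecs[:, order]                     # columns are alpha_1, ..., alpha_p
    # Rescale column i by sqrt(lambda_i); clip guards against tiny negative round-off.
    A_star = A * np.sqrt(np.clip(lam, 0.0, None))
    scores = Xc @ A                           # y_i = alpha_i^T x for every observation
    return S, lam, A, A_star, scores

# Made-up illustration data: 6 observations on 3 variables.
X = np.array([
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.7],
    [6.0, 12.2, 1.2],
    [7.0, 13.8, 0.9],
])
S, lam, A, A_star, scores = pca_from_covariance(X)
print("eigenvalues:", lam)
print("share of variance:", lam / lam.sum())
```

In this sketch the scores are uncorrelated with variances equal to the eigenvalues, and both `A @ np.diag(lam) @ A.T` and `A_star @ A_star.T` reproduce S up to rounding.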
7
Criteria for choosing the number of Principal
Components
  • Keep the components that explain about 70-90% of
    the total variation.
  • Keep components whose eigenvalues are larger than
    the average eigenvalue.
  • Choose the number of components based on the
    elbow of the scree plot.
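The first two criteria can be applied mechanically once the eigenvalues are known; the scree-plot elbow is a visual judgment and is not coded here. This is a small sketch with hypothetical eigenvalues and a hypothetical helper `choose_components`; the 80% target is an assumed value inside the 70-90% range.

```python
def choose_components(eigenvalues, target=0.80):
    """Apply two selection rules to eigenvalues sorted largest first."""
    total = sum(eigenvalues)

    # Rule 1: smallest k whose cumulative share of variance reaches the target.
    cumulative = 0.0
    k_variance = len(eigenvalues)
    for k, lam in enumerate(eigenvalues, start=1):
        cumulative += lam
        if cumulative / total >= target:
            k_variance = k
            break

    # Rule 2: count the eigenvalues that exceed the average eigenvalue.
    mean_lam = total / len(eigenvalues)
    k_average = sum(1 for lam in eigenvalues if lam > mean_lam)
    return k_variance, k_average

# Hypothetical eigenvalues for illustration only (not from the Minnesota data).
eigenvalues = [4.2, 1.9, 0.9, 0.6, 0.4]
print(choose_components(eigenvalues))  # → (3, 2)
```

The two rules need not agree, as here, which is one reason the scree plot is usually inspected as well.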

8
PCA Graphs
  • Scatterplot
  • Biplot

9
Cluster Analysis
  • Derived by R. C. Tryon
  • Determine groups of observations that are
    internally homogeneous and well separated from
    other groups
  • Many procedures

10
K-means
  • Separates the data set into k groups
  • Groups are chosen by minimizing the within-group
    sum of squares over all the variables

11
K-Means Algorithm
  • Find an initial partition
  • Calculate the change in the within-group sum of
    squares from moving each observation from one
    cluster to another.
  • Make the change that leads to the greatest
    improvement.
  • Repeat steps 2 and 3 until moving an observation
    gives no further improvement
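The four steps can be sketched directly in plain Python. This is a brute-force illustration under stated assumptions, not an efficient or standard library implementation: the names `wgss` and `kmeans_exchange` are hypothetical, one best move is made per pass, a cluster is never emptied, and the data are made up.

```python
def wgss(points, labels, k):
    """Within-group sum of squares, summed over all variables."""
    total = 0.0
    for g in range(k):
        members = [p for p, lab in zip(points, labels) if lab == g]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

def kmeans_exchange(points, labels, k):
    labels = list(labels)                    # step 1: initial partition is supplied
    best = wgss(points, labels, k)
    while True:
        best_move = None
        for i in range(len(points)):         # step 2: try moving each observation
            original = labels[i]
            if labels.count(original) == 1:
                continue                     # assumption: never empty a cluster
            for g in range(k):
                if g == original:
                    continue
                labels[i] = g
                candidate = wgss(points, labels, k)
                if candidate < best - 1e-12:
                    best, best_move = candidate, (i, g)
            labels[i] = original             # undo the trial move
        if best_move is None:                # step 4: stop when nothing improves
            return labels, best
        i, g = best_move                     # step 3: make the best single move
        labels[i] = g

# Made-up data: two well-separated groups and a deliberately poor starting partition.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
start = [0, 1, 0, 1, 0, 1]
final_labels, final_wgss = kmeans_exchange(points, start, k=2)
print(final_labels, final_wgss)
```

Each accepted move strictly lowers the within-group sum of squares, so the loop must terminate, though only at a local minimum that can depend on the initial partition.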

12
Agglomerative Hierarchical Clustering
  • Start with each observation in its own cluster
  • Find the pair of distinct clusters with the
    smallest distance and combine.
  • Repeat step 2 until there is only one cluster.

13
Cluster Analysis Distance
  • Many ways to measure distance
  • Group Average Clustering

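$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$
$d_{AB}$: distance between two clusters A and B
$d_{ij}$: distance between individuals i and j
$n_A$, $n_B$: number of observations in clusters A and B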
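The agglomerative procedure and the group average distance can be combined in a short sketch. It assumes Euclidean distance between individuals and uses made-up points; the function names are hypothetical and the search for the closest pair is deliberately naive.

```python
from itertools import combinations
from math import dist

def group_average(a, b, points):
    """Average of all pairwise distances between members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    clusters = [(i,) for i in range(len(points))]   # step 1: one cluster per observation
    merges = []
    while len(clusters) > 1:                        # step 3: repeat until one cluster
        # Step 2: find the pair of distinct clusters with the smallest distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(pair[0], pair[1], points))
        merges.append((a, b, group_average(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up points for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (10.0, 0.0)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The recorded merge distances are what a dendrogram would plot; cutting the sequence of merges at a chosen distance gives a partition into groups.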
14
The Use of Principal Components on Cluster
Analysis
  • ADVANTAGES
  • The time it takes to compute the clusters is
    greatly reduced
  • Gives an idea of what the cluster would look like
    and the characteristics of a cluster
  • Issues related to multicollinearity in the data
    are avoided
  • DISADVANTAGES
  • Number of clusters varies
  • Distance varies

15
DATA
  • Collected from Minnesota Department of Health
  • Demographic, birth rate, and mortality information
  • For years 2002 and 2006
  • Split into nine different regions

16
PCA of all MN counties
  • The First component: Average Component
  • The Second component: Mothers/Money Component
  • Positive values → poverty, child dependency
  • The Third component: Death/Births Component
  • Positive values → births

17
PCA of all MN counties
18
PCA for the Regions in MN
Regions: North Prairie, South Prairie, East
Central, North, West Central, North Central,
South East, Metro, North East
  • The First component: Average Component

19
PCA for the Regions in MN
20
PCA on the Counties of the South East Region
  • The First Component: Average Component
  • The Second Component: Money Component
  • Negative values → poverty
  • The Third Component: Death/Births Component
  • Negative values → death

21
PCA on the Counties of the South East Region
22
PCA on the Counties of the East Central Region
  • The First component: Average Component
  • The Second component: Ethnicity Component
  • Negative values → American Indian
  • Positive values → White/Latin

23
PCA on the Counties of the East Central Region
24
PCA on the counties of the Metro Region
  • The First Component: Average Component

25
PCA on the counties of the Metro Region
26
PCA on the counties in the North Central Region
  • The First Component: Average Component
  • The Second Component: Money Component
  • Negative values → poverty

27
PCA on the counties in the North Central Region
28
PCA on the counties in the West Central Region
  • The First Component: Average Component
  • The Second Component: Mothers/Money Component
  • Negative values → money/type of birth
  • Positive values → poverty/birth rates/mothers

29
PCA on the counties in the West Central Region
30
PCA on the counties of the North Prairie Region
  • The First Component: Average Component
  • The Second Component: Wealth Component
  • Negative values → wealth
  • The Third Component: Death/Births Component
  • Negative values → death

31
PCA on the counties of the North Prairie Region
32
PCA on the counties of the South Prairie Region
  • The First Component: Average Component
  • The Second Component: Ethnicity Component
  • Negative values → minorities
  • The Third Component: American Indian/Births
    Component
  • Negative values → births; positive values →
    American Indian
  • The Fourth Component: Money Component
  • Negative values → income/unemployment

33
PCA on the counties of the South Prairie Region
34
PCA on the counties of the Northern region
  • The First Component: Average Component

35
PCA on the counties of the Northern region
36
PCA on the counties of the North East Region
  • The First Component: Average Component
  • The Second Component: Death/Births Component
  • Negative values → birth

37
PCA on the counties of the North East Region
38
Cluster Analysis on the counties of Minnesota
Original Data
PC Scores
39
Cluster Analysis on the counties of different
Regions
Original Data
PC Scores
40
Cluster Analysis on the counties of South East
Region
Original Data
PC Scores
41
Cluster Analysis on the counties of East Central
Region
Original Data
PC Scores
42
Cluster Analysis on the counties of Metro Region
Original Data
PC Scores
43
Cluster Analysis on the counties of North Central
Region
Original Data
PC Scores
44
Cluster Analysis on the counties of West Central
Region
Original Data
PC Scores
45
Cluster Analysis on the counties of North Prairie
Region
Original Data
PC Scores
46
Cluster Analysis on the counties of South Prairie
Region
Original Data
PC Scores
47
Cluster Analysis on the counties of Northern
Region
Original Data
PC Scores
48
Cluster Analysis on the counties of North East
Region
Original Data
PC Scores
49
Conclusion: PCA
  • First component is the average component
  • Second component is usually the money component
  • Third is a birth/death component
  • 3 PC (entire state) vs. 1 PC (region)
  • PCs differ by region locale
  • 3 regions: no obvious outliers

50
Conclusion: Cluster Analysis
  • 3 groups vs. 5 groups
  • Different clusters result from the PC scores and
    from the original data
  • What the counties are clustered by also differs
    between the PC scores and the original data

51
Related Work
  • PCA
  • Sparse Principal Component Analysis
  • Kernel Principal Component Analysis
  • Cluster Analysis
  • divisive clustering
  • two-step clustering
  • inter-cluster dissimilarity: centroid method,
    median clustering, and Ward's method

52
References
  • Everitt, Brian S. An R and S-Plus Companion to
    Multivariate Analysis (Springer Texts in
    Statistics). New York: Springer, 2007.
  • Jolliffe, I. T. Principal component analysis. New
    York: Springer, 2002.
  • Everitt, Brian, and Graham Dunn. Applied
    multivariate data analysis. London: Arnold,
    Oxford UP, 2001.
  • Troy, Austin. "Geodemographic Segmentation."
    Encyclopedia of GIS. Springer, 2008. 348-349.
  • Rovan, J. "Principal Component Analysis." 2006.
    Univerze v Ljubljani. Spring 2009
    <http://miha.ef.uni-lj.si/_dokumenti3plus2/193004/MVA_06_Principal_Component_Analysis.pdf>.
  • Manly, Bryan F. J. Multivariate statistical
    methods: a primer. Boca Raton, FL: Chapman &
    Hall/CRC P, 2005.
  • "Minnesota Vital Statistics State and County
    Trends." Minnesota Department of Health. 2008.
    2008-2009 <http://www.health.state.mn.us/divs/chs/Trends/>.
  • "MFRC Landscape Program Regions." Minnesota
    Forest Resources Council. 2008-2009
    <http://www.frc.state.mn.us/Landscp/LandRegions.html>.
  • Zou, Hui, Trevor Hastie, and Rob Tibshirani.
    "Sparse Principal Component Analysis." Journal of
    Computational and Graphical Statistics 15 (2006):
    262-86.
  • Twining, Carole. "Kernel Principal Component
    Analysis." University of Edinburgh School of
    Informatics. 2001. Manchester University. Spring
    2009 <http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TWINING1/kpca2html.html>.
  • Garson, G. D. "Cluster Analysis." College of
    Humanities and Social Sciences. 2009. North
    Carolina State University. Spring 2009
    <http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm>.

53
Acknowledgements
  • UMM Stats Faculty