Title: Properties of sample covariance matrices from modern high-dimensional genomic data sets, RSS Manchester
1Properties of sample covariance matrices from modern high-dimensional genomic data sets
RSS Manchester, October 2008
- David Hoyle
- Faculty of Medicine,
- University of Manchester, UK
- david.hoyle_at_manchester.ac.uk
2Outline
- Molecular Biology
- PCA in Molecular Biology
- Pathologies of large covariance matrices
3Molecular Biology
- Biological system
- Data are complex, high-dimensional and noisy
4Exploratory Analysis
- Interested in detecting patterns in data
- Covariance is important; tools such as PCA are often used
5PCA
- This is Principal Component Analysis (PCA)
- Standard algorithm of multivariate statistics
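As an illustration of what PCA computes, here is a minimal sketch in Python with NumPy. It is not taken from the talk: the function name `pca` and the synthetic data are assumptions made for the example. PCA is done by eigendecomposition of the sample covariance matrix.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix.

    X holds one sample per row (shape: n_samples x n_features).
    """
    Xc = X - X.mean(axis=0)                 # centre each feature
    C = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance matrix
    evals, evecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]         # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    scores = Xc @ evecs[:, :n_components]   # project onto leading components
    return evals, evecs[:, :n_components], scores

# Synthetic data (an assumption for illustration): one high-variance feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
evals, components, scores = pca(X, n_components=2)
```

The eigenvalues returned are the variances along each principal direction, which is why the properties of the sample covariance spectrum matter for everything that follows.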
6PCA in Molecular Biology (1)
- PCA → diagnostic, prognostic
7PCA in Molecular Biology (2)
8PCA in Molecular Biology (3)
PCA → correct for stratification within GWA studies
Price et al., Nature Genetics 38:904-909, 2006
9Model Selection
- How to select S?
- Retain 90% of variance?
- Elbow in scree plot?
- Deviation beyond Marchenko-Pastur (MP) edge?
- Sampling variation
- Johnstone, Ann. Stat. (2001)
Asymptotically estimate of S ? beyond edge
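The "deviation beyond MP edge" criterion can be sketched as follows. This is a simplified illustration rather than the procedure used in the talk. It assumes the noise variance `sigma2` is known, and it ignores finite-sample fluctuations of the largest noise eigenvalue around the edge, which is the sampling variation the slide refers to. The function names are hypothetical.

```python
import numpy as np

def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur bulk for pure noise of variance sigma2."""
    return sigma2 * (1.0 + np.sqrt(n_features / n_samples)) ** 2

def count_beyond_edge(X, sigma2=1.0):
    """Count sample covariance eigenvalues above the MP upper edge."""
    n_samples, n_features = X.shape
    Xc = X - X.mean(axis=0)
    evals = np.linalg.eigvalsh(Xc.T @ Xc / n_samples)
    return int(np.sum(evals > mp_upper_edge(n_samples, n_features, sigma2)))

# Synthetic example (an assumption): unit-variance noise plus one strong
# direction of variance 11 along the first coordinate axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 100))
X[:, 0] *= np.sqrt(11.0)
n_signal = count_beyond_edge(X)
```

In practice a noise eigenvalue can occasionally land just above the edge, so a raw count like this is only a starting point for choosing S.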
10Goals
- To understand effects of high-dimensionality and small sample size on PCA
- Can we use understanding to improve estimation algorithms for high-dimensional data?
11Approaches
- Sample points are random vectors
- Sample covariance is a random matrix
- Use tools of Random Matrix Theory (RMT)
12RMT
- History in both Physics & Statistics
- Wishart distribution (1928)
- Semi-circle law of Wigner (1955)
- Physics - paradigms for real systems
- Statisticians interested in covariance
- Communities have diverged
13Physics, Statistics & RMT
- Use physics techniques to study the statistics problem
- Two aspects to focus upon
- Accuracy of sample eigenvalues
- Accuracy of sample eigenvectors
14Statistical Physics of Learning
[Figure: vectors B and J, with R = J·B]
- Statistical Mechanics of Learning, Engel & Van den Broeck
- Advanced Mean Field Methods, Opper & Saad
- Introduction to the Theory of Neural Computation, Hertz, Krogh & Palmer
- Statistical Physics of Spin Glasses and Information Processing, Nishimori
15Learning in PCA
$C = \sigma^2 I + \sigma^2 A B B^T$
Weak Signal-to-Noise
A 1 d
Or N large and p small
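To make the single-signal covariance model on this slide concrete, the sketch below draws zero-mean Gaussian samples whose population covariance is sigma^2 (I + A B B^T), with B a unit vector, and measures how well the top sample eigenvector aligns with B. The parameter names `n_samples` and `n_features`, the helper functions, and the Gaussian assumption are choices made for this example, not taken from the talk.

```python
import numpy as np

def sample_spiked(n_samples, n_features, A, sigma2=1.0, rng=None):
    """Draw samples with population covariance sigma2 * (I + A * B B^T)."""
    rng = np.random.default_rng() if rng is None else rng
    B = rng.normal(size=n_features)
    B /= np.linalg.norm(B)                       # unit-length signal direction
    z = rng.normal(size=(n_samples, 1))          # signal amplitude per sample
    noise = rng.normal(size=(n_samples, n_features))
    X = np.sqrt(sigma2) * (noise + np.sqrt(A) * z * B)
    return X, B

def top_eigenpair(X):
    """Largest eigenvalue and eigenvector of the sample covariance."""
    C = X.T @ X / X.shape[0]                     # model is zero-mean: no centring
    evals, evecs = np.linalg.eigh(C)             # ascending order
    return evals[-1], evecs[:, -1]

rng = np.random.default_rng(2)
X, B = sample_spiked(n_samples=2000, n_features=200, A=4.0, rng=rng)
top_eval, top_evec = top_eigenpair(X)
overlap_sq = float(np.dot(top_evec, B) ** 2)     # 1 means perfect recovery of B
```

Reducing `A` or `n_samples` in this sketch drives `overlap_sq` towards zero, which is the weak signal-to-noise regime the slide points at.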
16Connection to Statistics
- Spiked covariance model: Johnstone (2001)
- Learning curve phase transition obtained by D. Paul, Statistica Sinica 17 (2007), 1617-1642
- Johnstone, Baik, El Karoui, Bai, Silverstein, Péché
- In general
17Learning in PCA
Large sample size scenario is easy:
T.W. Anderson, Ann. Math. Stat., 1963
ML estimators are okay for large sample sizes
18Learning Theory
- Asymptotic limit achieves perfect learning
- But we don't have lots of data!
- Study scenario where
19Model
20Retarded Learning
Asymptotic theory
Replica Analysis: Reimann et al., J. Phys. A, 29 (1996) 3521; Hoyle & Rattray, Europhys. Lett., 62 (2003) 117
Variational Bounds: Herschkowitz & Opper, Phys. Rev. Lett. 86 (2001) 2174
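The slide's own equations did not survive, so the following sketch states the standard asymptotic result for a single-signal model with covariance sigma^2 (I + A B B^T), as it appears in the general literature on this problem. With alpha defined here as the number of samples divided by the dimension (an assumed notation), the squared overlap R^2 between the top sample eigenvector and B is zero for alpha * A^2 <= 1, and equals (alpha * A^2 - 1) / (A * (1 + alpha * A)) above that threshold. The function name is hypothetical.

```python
import numpy as np

def asymptotic_overlap_sq(alpha, A):
    """Asymptotic squared overlap R^2 for the single-signal model.

    alpha = n_samples / n_features (assumed notation); A = signal strength.
    Below the threshold alpha * A**2 = 1 no information about B is learned.
    """
    alpha = np.asarray(alpha, dtype=float)
    r2 = (alpha * A**2 - 1.0) / (A * (1.0 + alpha * A))
    return np.where(alpha * A**2 > 1.0, r2, 0.0)

A = 0.5
alphas = np.array([1.0, 4.0, 8.0, 40.0])
overlaps = asymptotic_overlap_sq(alphas, A)      # threshold is at alpha = 4
```

"Retarded learning" refers to the flat region: adding samples below the threshold does not improve the estimated direction at all.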
21Transition in top eigenvalue
- Quantity maximized is Rayleigh quotient of sample covariance matrix
- Maximum value of quotient gives top eigenvalue
- Phase transition in
implies phase transition in
22Top Eigenvalue
Hoyle & Rattray, Europhys. Lett., 62 (2003) 117
Hoyle & Rattray, Phys. Rev. E, 69 (2004) 026124
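The plot for this slide is not recoverable, so the sketch below uses the standard asymptotic expression for the top sample eigenvalue in the single-signal model, stated from the general literature rather than from the slide. With alpha = n_samples / n_features (assumed notation), the top eigenvalue tends to sigma^2 (1 + A)(1 + 1/(alpha A)) when alpha * A^2 > 1, and otherwise sits at the bulk edge sigma^2 (1 + 1/sqrt(alpha))^2. The function name and the simulation settings are assumptions for illustration.

```python
import numpy as np

def asymptotic_top_eigenvalue(alpha, A, sigma2=1.0):
    """Asymptotic top sample eigenvalue for covariance sigma2 * (I + A B B^T)."""
    if alpha * A**2 > 1.0:
        # Signal eigenvalue has separated from the noise bulk.
        return sigma2 * (1.0 + A) * (1.0 + 1.0 / (alpha * A))
    # Below the transition the top eigenvalue sticks to the bulk edge.
    return sigma2 * (1.0 + 1.0 / np.sqrt(alpha)) ** 2

# Quick simulation check with the signal placed along the first axis.
rng = np.random.default_rng(3)
n_samples, n_features, A = 2000, 200, 2.0
X = rng.normal(size=(n_samples, n_features))
X[:, 0] *= np.sqrt(1.0 + A)                      # population variance 1 + A
simulated = np.linalg.eigvalsh(X.T @ X / n_samples)[-1]
predicted = asymptotic_top_eigenvalue(n_samples / n_features, A)
```

Note that the predicted value exceeds the population eigenvalue 1 + A: the top sample eigenvalue is biased upwards when the dimension is comparable to the sample size.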
23Replica Analysis - Eigenvectors
Hoyle & Rattray, Phys. Rev. E 75 (2007), 016101
24Replica Analysis - Eigenspectra
Hoyle & Rattray, Phys. Rev. E, 69 (2004) 026124
25Replica Analysis - Eigenvectors
26Summary
- Understanding the properties of standard algorithms with high-dimensional data is important
- Common areas between statistical inference & statistical physics
- Considering distinguished asymptotic limit is important
- Can recover accurate estimates of population covariance eigenvalues
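One way the last point can work in the single-signal setting is to invert the asymptotic relation between a population eigenvalue and the corresponding sample eigenvalue. The sketch below is an illustration under stated assumptions, not necessarily the estimator presented in the talk. It assumes a known noise variance `sigma2`, defines `gamma` as n_features / n_samples, and uses the relation lambda = ell * (1 + gamma / (ell - 1)) in units of sigma2, which holds for sample eigenvalues above the bulk edge. The function name is hypothetical.

```python
import numpy as np

def estimate_population_eigenvalue(sample_eval, gamma, sigma2=1.0):
    """Invert lambda = ell * (1 + gamma / (ell - 1)) for ell.

    gamma = n_features / n_samples (assumed notation). Returns None when the
    sample eigenvalue lies inside the noise bulk, where no inversion exists.
    """
    lam = sample_eval / sigma2
    edge = (1.0 + np.sqrt(gamma)) ** 2
    if lam <= edge:
        return None
    # Solve ell^2 - (lam + 1 - gamma) * ell + lam = 0 and take the larger root.
    b = lam + 1.0 - gamma
    ell = 0.5 * (b + np.sqrt(b * b - 4.0 * lam))
    return sigma2 * ell

# A population eigenvalue of 3 with gamma = 0.5 gives a sample value of 3.75.
corrected = estimate_population_eigenvalue(3.75, gamma=0.5)
inside_bulk = estimate_population_eigenvalue(1.5, gamma=0.5)
```

The correction shrinks the observed eigenvalue, undoing the upward bias that arises when the dimension is comparable to the number of samples.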
27Thank you