Title: Spectral Graph Theory and Ancestry in Genomewide Association Studies
1 Spectral Graph Theory and Ancestry in Genome-wide
Association Studies
- Kathryn Roeder
- Coauthors: Diana Luca, Ann Lee, Bernie Devlin, Bert Klei
2 Find Genetic Variants Associated with Disease
- Data
  - Cases
  - Controls
  - Several unrecorded ethnic groups
  - ½ million genetic markers (SNPs)
- Analysis: logistic regression
  - Test each variant for association
3 Data
- X counts the number of minor alleles at each marker across the genome.
- Y takes the value 1 for affected and 0 for unaffected individuals.
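A minimal sketch of the per-SNP association test, assuming a genotype matrix `X` (N x L minor-allele counts) and a 0/1 phenotype vector `y` as defined above. Instead of fitting a full logistic regression at every marker, it uses the score test of the logistic slope (equivalent to the Cochran-Armitage trend test), whose statistic is N times the squared correlation between genotype and phenotype. The function name and the simulated data are illustrative assumptions, not part of the talk.

```python
import math
import numpy as np

def snp_score_tests(X, y):
    """Score test of the logistic-regression slope for each SNP (column of X).

    Returns chi-square statistics (1 df) and p-values. Monomorphic SNPs get
    statistic 0 and p-value 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    xc = X - X.mean(axis=0)              # center each SNP
    yc = y - y.mean()                    # center the phenotype
    sxy = xc.T @ yc                      # per-SNP cross-product with phenotype
    sxx = (xc ** 2).sum(axis=0)
    syy = (yc ** 2).sum()
    denom = sxx * syy
    r2 = np.divide(sxy ** 2, denom, out=np.zeros_like(sxy), where=denom > 0)
    stat = n * r2                        # N times squared correlation
    # Chi-square (1 df) upper tail: P(Z^2 > t) = erfc(sqrt(t / 2))
    pvals = np.array([math.erfc(math.sqrt(t / 2.0)) for t in stat])
    return stat, pvals

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
N, L = 400, 50
freqs = rng.uniform(0.1, 0.5, size=L)
X = rng.binomial(2, freqs, size=(N, L))  # minor-allele counts
y = rng.integers(0, 2, size=N)           # case/control labels
stat, pvals = snp_score_tests(X, y)
```

With half a million markers, the same function is applied to all columns at once; the p-values then need a multiple-testing correction, which is not shown here.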
4 Population Stratification
5 Distribution of the O blood type in native
populations around the world.
6 Spurious Associations
- Common genetic approach to control spurious findings
  - Restrict to common continental ancestry
- Statistical approaches using G
  - Match cases and controls by ancestry
[Figure: diagram with nodes G1, G2, C, and Y]
7 High Dimensional Data
- N
- Information content
  - Each SNP offers related information about ancestry
  - Information per SNP is minimal
  - Over the genome it adds up
8 Clusters of subpopulations
- Individuals from a common subpopulation are
correlated and form clusters.
9 Principal component map
- Let X be an $N \times L$ matrix, centered and scaled by allele frequency.
- Project $XX^T = U \Lambda U^T$ into the D-dimensional space defined by the D largest eigenvalues.
- Ancestry of the ith individual is $(u_{i1}, u_{i2}, \ldots, u_{iD})$.
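The principal component map can be sketched as follows. The slide does not give the exact scaling, so this sketch assumes each SNP column is centered at 2p and divided by sqrt(2p(1-p)), where p is the estimated minor-allele frequency; the function name and the toy two-population data are illustrative assumptions.

```python
import numpy as np

def pca_ancestry(G, n_dims):
    """Principal-component ancestry coordinates from an N x L genotype matrix G."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                    # estimated minor-allele frequency
    keep = (p > 0) & (p < 1)                    # drop monomorphic SNPs
    X = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    K = X @ X.T                                 # N x N matrix X X^T
    evals, evecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_dims]    # indices of the D largest
    return evals[order], evecs[:, order]        # row i = ancestry of individual i

# Toy data: two subpopulations with different allele frequencies (simulated).
rng = np.random.default_rng(1)
L = 500
p1 = rng.uniform(0.1, 0.9, size=L)
p2 = np.clip(p1 + rng.normal(0, 0.15, size=L), 0.05, 0.95)
G = np.vstack([rng.binomial(2, p1, size=(40, L)),
               rng.binomial(2, p2, size=(40, L))])
evals, U = pca_ancestry(G, n_dims=2)
```

In this toy example the first column of `U` separates the two simulated groups, which is the behavior the map relies on.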
10 PCA and Multidimensional Scaling
11 Representation of Ancestry (Patterson et al.)
12 Eigenstrat (Price et al., Nature Genetics 2006)
- Modeling ancestry using Eigenstrat
- Find eigenvectors $U_1, U_2, \ldots, U_{10}$
- Regress out eigenvectors to remove confounding by ancestry
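A rough sketch of the "regress out" step, in the spirit of Eigenstrat rather than a reproduction of it: both the genotype and the phenotype are residualized on the ancestry eigenvectors by least squares, and the association statistic is built from the correlation of the residuals. The (N - K - 1) scaling and the function names are assumptions made for this illustration.

```python
import numpy as np

def regress_out(v, A):
    """Residual of v after least-squares projection onto the columns of A."""
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def adjusted_statistic(g, y, U):
    """Association statistic for one SNP g and phenotype y, adjusted for eigenvectors U."""
    n, k = U.shape
    A = np.column_stack([np.ones(n), U])          # intercept plus ancestry axes
    g_adj = regress_out(np.asarray(g, dtype=float), A)
    y_adj = regress_out(np.asarray(y, dtype=float), A)
    r2 = (g_adj @ y_adj) ** 2 / ((g_adj @ g_adj) * (y_adj @ y_adj))
    return (n - k - 1) * r2

# Simulated confounding: allele frequency and disease rate both differ by group,
# but the SNP has no effect on disease within a group.
rng = np.random.default_rng(2)
z = np.repeat([0, 1], 100)                        # hidden ancestry group
g = rng.binomial(2, np.where(z == 1, 0.7, 0.2))
y = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
zc = z - z.mean()
U = (zc / np.linalg.norm(zc))[:, None]            # stand-in for a top eigenvector
unadjusted = adjusted_statistic(g, y, np.empty((len(z), 0)))
adjusted = adjusted_statistic(g, y, U)
```

Here `unadjusted` reflects the spurious association induced by ancestry, while `adjusted` drops once the ancestry axis is removed.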
13 Alternative to Regression is Matching
Pair Match vs Full Match
14 Distance Between Individuals
- Match cases and controls that are similar in ancestry
- Distance: Euclidean distance between eigenvectors, scaled by eigenvalues
- Use eigenvalues $\lambda_1, \lambda_2, \ldots$ to weight importance of axes
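One way to read this distance, sketched under assumptions: each ancestry axis k is multiplied by sqrt(lambda_k) before taking the ordinary Euclidean distance. The exact weighting is not given on the slide, and the nearest-control helper below is a simple heuristic for illustration, not the pair or full matching mentioned above.

```python
import numpy as np

def ancestry_distances(U, evals):
    """Pairwise Euclidean distances between rows of U, axis k weighted by sqrt(evals[k])."""
    Z = np.asarray(U, dtype=float) * np.sqrt(np.asarray(evals, dtype=float))
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def nearest_controls(dist, is_case):
    """For each case, the index of the closest control (illustrative heuristic)."""
    is_case = np.asarray(is_case, dtype=bool)
    cases = np.flatnonzero(is_case)
    controls = np.flatnonzero(~is_case)
    nearest = controls[np.argmin(dist[np.ix_(cases, controls)], axis=1)]
    return dict(zip(cases.tolist(), nearest.tolist()))

# Four individuals in a 2-dimensional ancestry space (made-up coordinates).
U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.1, 0.0]])
evals = np.array([4.0, 1.0])
dist = ancestry_distances(U, evals)
matches = nearest_controls(dist, is_case=[True, False, False, False])
```

Because the first axis carries the larger eigenvalue, a unit step along it counts twice as much as a unit step along the second axis.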
15 Determining D (number of dimensions)
- Testing for structure is equivalent to testing for large eigenvalues (Patterson et al. 2006).
- Standardize distribution based on
  - N subjects
  - L independent SNPs
- The Tracy-Widom distribution describes the distribution of the largest eigenvalue. Get a P-value for each dimension tested (Johnstone 1991).
- Weakness: the distribution depends on L, which must be estimated. The method-of-moments (MOM) estimator assumes the null.
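For reference, the kind of standardization involved can be written out. The form below is the centering and scaling in Johnstone's result for the largest eigenvalue $\ell_1$ of $Z^T Z$, where $Z$ is an $n \times p$ matrix of independent standard normal entries. How N and the effective number of SNPs L map onto $n$ and $p$ is not spelled out on the slide, so the symbols here should be read as assumptions.

```latex
\mu_{np} = \left(\sqrt{n-1} + \sqrt{p}\right)^{2}, \qquad
\sigma_{np} = \left(\sqrt{n-1} + \sqrt{p}\right)
              \left(\frac{1}{\sqrt{n-1}} + \frac{1}{\sqrt{p}}\right)^{1/3}, \qquad
\frac{\ell_1 - \mu_{np}}{\sigma_{np}} \;\xrightarrow{d}\; TW_1 .
```

The dependence on both dimensions is what makes the test sensitive to the estimate of L.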
16 Association data
- Cases
  - Americans of European descent
- Controls
  - Northern Germans
  - Southern Germans
17 Starting Data (D = 17)
PO or FS similarities
Duplicates or MZ-twins
18 Northern/Southern Germans
19 After eliminating outliers and unmatchables (D = 2)
20 Performance good, requires care
- Outliers: cases (or controls) that are unusual at many SNPs
  - High influence on eigenmap, failure to discover key dimensions of ancestry
- Unmatchables: cases far from controls or vice versa.
- Caution: must remove outliers and unmatchables
21 Multi-ethnic example
- Cases and controls descended from
  - Europe
  - Africa
  - Asia
- Expect hierarchical structure
  - Multi-continent
  - Differences within continent
22 Three Major Ethnic Groups (D = 2)
African Descent
European Descent
Asian Descent
23 European sub-sample (D = 4)
24 Problems
- Approach is too sensitive to outliers.
- Not able to discover hierarchical structure without performing several iterations of analysis.
- Tests for number of dimensions are not reliable when data are heterogeneous.
- Estimate of independent dimensions (L) is poor when data are heterogeneous.
- Try a new eigenmap.
25 Can spectral graph theory help?
- For N subjects, define a graph $G = (V, E)$
  - V: vertices
  - E: edges
- $W_{ij}$: weights between vertices (i, j)
  - large value for tight connection
  - 0 for no connection
  - Must be non-negative and symmetric
- $D_i$: degree of vertex i, $D_i = \sum_j W_{ij}$
26 Defining a Weight matrix
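The definition shown on this slide did not survive extraction. As a hedged illustration only, the sketch below builds the square-root-threshold weights named on a later slide, assuming they take the form w_ij = sqrt(max(k_ij, 0)) with K = XX^T for a centered and scaled genotype matrix X. Keeping the diagonal entries is also an assumption.

```python
import numpy as np

def sqrt_threshold_weights(X):
    """Weights w_ij = sqrt(max(k_ij, 0)), where K = X X^T (assumed form)."""
    K = np.asarray(X, dtype=float) @ np.asarray(X, dtype=float).T
    return np.sqrt(np.clip(K, 0.0, None))   # negative similarities become 0 (no edge)

def degrees(W):
    """Degree of each vertex: the sum of the weights on its edges."""
    return W.sum(axis=1)

# Three individuals with made-up, already centered coordinates.
X = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
W = sqrt_threshold_weights(X)
deg = degrees(W)
```

Individuals 0 and 2 point in opposite directions, so their similarity is negative and the thresholding removes the edge between them.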
27 Goal
- Want to embed a weighted graph into D-dimensional space so that similar vertices are close.
- A map that meets this requirement is the Laplacian eigenmap of G.
- Based on top non-trivial eigenvectors
28 Spectral embedding and clustering
- Close connection between locality-preserving spectral embeddings, spectral clustering and cuts.
- Min Cut finds outliers and clumps
- Other weighted cuts find major splits and clusters
29 Spectral Embedding: Laplacian
- Choices for embedding.
- Degree matrix
  - $D = \mathrm{diag}(D_1, \ldots, D_N)$
- Embedding based on decomposing
  - W
  - Laplacian $L = D - W$
  - Normalized Laplacian $D^{-1/2} L D^{-1/2}$
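A small sketch of the normalized-Laplacian embedding for a given weight matrix W. It assumes every vertex has positive degree, and it reads "top non-trivial eigenvectors" as those with the smallest non-zero eigenvalues of the normalized Laplacian (equivalently, the largest eigenvalues of the identity minus the normalized Laplacian). The function name and the two-block example are illustrative.

```python
import numpy as np

def laplacian_eigenmap(W, n_dims):
    """Embed a weighted graph using eigenvectors of the normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                           # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)               # assumes all degrees are positive
    lap = np.diag(d) - W                        # Laplacian D - W
    lap_norm = d_inv_sqrt[:, None] * lap * d_inv_sqrt[None, :]
    nu, V = np.linalg.eigh(lap_norm)            # ascending; nu[0] is about 0 (trivial)
    return nu[1:n_dims + 1], V[:, 1:n_dims + 1]

# Two tight blocks of three vertices joined by weak links (made-up weights).
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
nu, V = laplacian_eigenmap(W, n_dims=1)
```

In this example the single embedding coordinate has one sign on the first block and the opposite sign on the second, so the major split is recovered directly.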
30 Cuts
31 Choices of eigenmaps
- W
  - Emphasizes outliers
  - PCA is similar, since W is a 0-thresholded version of K
- Normalized Laplacian $D^{-1/2}(D - W)D^{-1/2}$
  - Finds major clusters
  - Robust to outliers
32 Eigen-gap Heuristic
- Tracy-Widom theory for eigenvalues of a covariance matrix no longer applies.
- The eigen-gap heuristic says: find the largest k such that $\lambda_k - \lambda_{k+1}$ is significantly greater than 0.
- Find the distribution of this statistic by simulation.
- Distribution depends on N, not L. Easy to obtain.
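The simulation idea can be sketched as below. Several choices here are assumptions rather than details from the talk: the null data are genotypes drawn from a single homogeneous population, the eigenvalues are those of D^{-1/2} W D^{-1/2} with square-root-threshold weights, negative eigenvalues are truncated to zero so that gaps in the lower tail are ignored, and the threshold is a quantile of the largest null gap.

```python
import numpy as np

def nontrivial_eigenvalues(G):
    """Eigenvalues of D^{-1/2} W D^{-1/2}, descending, trivial top eigenvalue dropped."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)
    X = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    W = np.sqrt(np.clip(X @ X.T, 0.0, None))      # square-root-threshold weights
    s = 1.0 / np.sqrt(W.sum(axis=1))
    lam = np.linalg.eigvalsh(s[:, None] * W * s[None, :])
    lam = np.sort(lam)[::-1][1:]                  # drop the trivial eigenvalue 1
    return np.clip(lam, 0.0, None)                # assumption: truncate negatives

def eigengaps(lam):
    """Gaps lam_k - lam_{k+1} between consecutive eigenvalues."""
    return lam[:-1] - lam[1:]

def null_gap_threshold(n, n_snps, n_sims, q, rng):
    """Quantile q of the largest eigengap under a simulated homogeneous population."""
    top = []
    for _ in range(n_sims):
        freqs = rng.uniform(0.1, 0.9, size=n_snps)
        G0 = rng.binomial(2, freqs, size=(n, n_snps))
        top.append(eigengaps(nontrivial_eigenvalues(G0)).max())
    return np.quantile(top, q)

def choose_dimension(lam, threshold):
    """Largest k (1-based) whose gap exceeds the threshold; 0 if none does."""
    big = np.flatnonzero(eigengaps(lam) > threshold)
    return int(big[-1]) + 1 if big.size else 0

# Simulated data with two subpopulations, for illustration only.
rng = np.random.default_rng(3)
n_snps = 500
p1 = rng.uniform(0.1, 0.9, size=n_snps)
p2 = np.clip(p1 + rng.normal(0, 0.15, size=n_snps), 0.05, 0.95)
G = np.vstack([rng.binomial(2, p1, size=(40, n_snps)),
               rng.binomial(2, p2, size=(40, n_snps))])
lam = nontrivial_eigenvalues(G)
threshold = null_gap_threshold(n=80, n_snps=n_snps, n_sims=20, q=0.95, rng=rng)
k = choose_dimension(lam, threshold)
```

A real analysis would use many more null simulations than the 20 used here to keep the example fast.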
33 Approach
- Find eigenvectors and eigenvalues of normalized Laplacian
- Use square-root-threshold weights
- Use eigengaps to determine dimension
34 Example with hierarchical structure (3 major groups, 3 subgroups in 1)
[Figure legend: Groups 2 and 3 in black; Group 1 clusters in red, blue, and green]
35 POPRES data
- Worldwide sample
  - 6000 people worldwide
  - 500,000 SNPs
- Ancestry (only used to validate)
  - Country of birth
  - Country of birth of parents
36-39 Eigenvectors for POPRES
[Figures, four slides; legend: Africa, Europe, China, Mexico, India]
40 Clustering
- Distance calculated based on D eigenvectors.
- Hierarchical clustering (Ward's algorithm): at each step the union of every possible cluster pair is considered. The clusters whose fusion results in the minimum information loss are combined.
- Start with k = 2
- Test for population structure
- If one cluster is homogeneous, set aside the cluster and continue
- Repeat this process until all clusters are homogeneous
- Result: K homogeneous clusters.
- Find matches and unmatchable observations
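The merging step of Ward's algorithm can be sketched in a few lines. This naive version takes "information loss" to mean the increase in within-cluster sum of squares, and it covers only the agglomeration into k clusters; the test for population structure and the repeated splitting of heterogeneous clusters are not shown. The function name and the toy coordinates are illustrative.

```python
import numpy as np

def ward_clusters(Z, k):
    """Agglomerative clustering of the rows of Z into k clusters (Ward's criterion)."""
    Z = np.asarray(Z, dtype=float)
    members = [[i] for i in range(len(Z))]       # start with singleton clusters
    while len(members) > k:
        best = None
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                na, nb = len(members[a]), len(members[b])
                gap = Z[members[a]].mean(axis=0) - Z[members[b]].mean(axis=0)
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * (gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        members[a] = members[a] + members[b]     # merge the cheapest pair
        del members[b]
    labels = np.empty(len(Z), dtype=int)
    for c, idx in enumerate(members):
        labels[idx] = c
    return labels

# Six individuals in a 2-dimensional eigenvector space (made-up coordinates).
Z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = ward_clusters(Z, k=2)
```

The pairwise search makes this cubic in the number of individuals, so it is meant to show the criterion, not to scale to thousands of subjects.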
41 Overall Clustering Results
- 4 Non-European clusters
  - Africa
  - China
  - Mexico
  - India
- 8 European clusters
42 Clusters of European Ancestry
- Clusters generate a map of history and geography of Europe
  - Portugal, Spain
  - Italy
  - Italy
  - Switzerland (France)
  - Ireland, England, UK, (France)
  - France, Belgium, (Switz, Italian)
  - Netherlands, Belgium, Germany, (UK)
  - Germany, Poland, Hungary, Romania
- Other nationalities with small samples fit too
  - Slovakia, Slovenia, Kosovo, Montenegro, Czech, Croatia
43 Comparison to principal components
- Using effective number of independent SNPs, find D = 3
  - missing structure in Europeans.
- Using actual number of SNPs, D = 27
  - pick up a lot of noise
  - Clustering finds 127 clusters
  - Unhelpful for interpretation.
44 Summary
- The traditional eigenmap based on principal components is sensitive to outliers and misses hierarchical structure.
- The normalized Laplacian leads to a better spectral embedding that helps to find more meaningful ancestry vectors.
- Resulting eigenvectors emphasize major clusters and are robust to outliers.
- Final clustering corresponds to true ancestry remarkably well.
- Matching cases and controls from homogeneous clusters provides appropriate strata for conditional logistic analysis.