1
Spectral Graph Theory and Ancestry in Genome-wide Association Studies

Kathryn Roeder
Coauthors: Diana Luca, Ann Lee, Bernie Devlin, Bert Klei

2
Find Genetic Variants Associated with Disease

Data:
Cases
Controls
Several unrecorded ethnic groups
½ million genetic markers (SNPs)
Analysis: logistic regression
Test each variant for association

3
Data

X counts the number of minor alleles at each
marker across the genome.
Y takes the value 1 for affected and 0 for
unaffected individuals.
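
To make the data layout concrete, here is a minimal sketch that simulates a small X and Y and computes one test statistic per SNP. The simulation settings and the function name `trend_test` are illustrative assumptions, not part of the talk. The statistic N * r^2 is the Cochran-Armitage trend statistic (the score test for the slope in a one-SNP logistic regression), used here as a lightweight stand-in for fitting a logistic regression at every marker.

```python
import numpy as np

def trend_test(X, y):
    """Return N * r^2 per SNP, where r is the correlation between allele count and status."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each SNP column
    yc = y - y.mean()                # center the phenotype
    num = (Xc * yc[:, None]).sum(axis=0) ** 2
    den = (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    # Monomorphic SNPs have zero variance; report 0 for them instead of dividing by 0.
    r2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return n * r2

# Simulated data (illustrative only): 200 subjects, 50 SNPs, no true association.
rng = np.random.default_rng(0)
n_subjects, n_snps = 200, 50
maf = rng.uniform(0.1, 0.5, size=n_snps)               # minor allele frequencies
X = rng.binomial(2, maf, size=(n_subjects, n_snps))    # X[i, l] in {0, 1, 2}
Y = rng.integers(0, 2, size=n_subjects)                # 1 = affected, 0 = unaffected
stats = trend_test(X, Y)                               # one statistic per SNP
```

Under the null, each statistic is approximately chi-square with one degree of freedom, so it can be compared with a chi-square cutoff after correcting for the number of markers tested.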

4
Population Stratification
5
Distribution of the O blood type in native
populations around the world.
6
Spurious Associations

Common genetic approach to control spurious
findings
Restrict to common continental ancestry
Statistical approaches using G
Match cases and controls by ancestry

[Diagram: variables G1, G2, C, and Y]
7
High Dimensional Data

N
Information content
Each SNP offers related information about
ancestry
Information per SNP is minimal
Over the genome it adds up

8
Clusters of subpopulations

Individuals from a common subpopulation are
correlated and form clusters.

9
Principal component map

Let $X$ be an $N \times L$ matrix, centered and scaled by
allele frequency.
Project $XX^T = U \Lambda U^T$ into D-dimensional space
defined by the D largest eigenvalues.
Ancestry of the ith individual is $(u_{i1}, u_{i2}, \ldots, u_{iD})$
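
The principal component map can be sketched in a few lines of NumPy. This is a minimal illustration, assuming genotypes coded 0/1/2 and scaling by the binomial standard deviation sqrt(2p(1-p)); the slide says only "scaled by allele frequency", so the exact scaling, the function name `pca_ancestry_map`, and the two-population simulation are assumptions.

```python
import numpy as np

def pca_ancestry_map(G, D):
    """G: N x L allele counts (0/1/2). Returns the top-D eigenvalues and N x D coordinates."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                      # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)                      # drop monomorphic markers
    G, p = G[:, keep], p[keep]
    X = (G - 2 * p) / np.sqrt(2 * p * (1 - p))    # center and scale each column
    evals, evecs = np.linalg.eigh(X @ X.T)        # N x N symmetric eigenproblem
    order = np.argsort(evals)[::-1][:D]           # indices of the D largest eigenvalues
    return evals[order], evecs[:, order]          # row i is (u_i1, ..., u_iD)

# Illustrative simulation: two subpopulations with diverged allele frequencies.
rng = np.random.default_rng(1)
n_per_group, n_snps = 40, 2000
p1 = rng.uniform(0.2, 0.8, size=n_snps)
p2 = np.clip(p1 + rng.normal(0.0, 0.15, size=n_snps), 0.05, 0.95)
G = np.vstack([rng.binomial(2, p1, size=(n_per_group, n_snps)),
               rng.binomial(2, p2, size=(n_per_group, n_snps))])
evals, coords = pca_ancestry_map(G, D=2)
```

With frequencies this far apart, the first coordinate should separate the two simulated groups. Working with the N x N matrix rather than the L x L covariance keeps the eigenproblem small when L is much larger than N.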

10
PCA and Multidimensional Scaling
11
Representation of Ancestry (Patterson et al.)
12
Eigenstrat (Price et al, Nature Genetics 2006)

Modeling ancestry using Eigenstrat
Find eigenvectors $U_1, U_2, \ldots, U_{10}$
Regress out eigenvectors to remove confounding of
ancestry
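
A minimal sketch of the "regress out" step follows. It residualizes both the genotype and the phenotype on the ancestry axes and then forms a correlation-based statistic of the form (N - K - 1) * r^2, which is my reading of the statistic in Price et al.; treat that form, the helper names, and the simulated confounding as assumptions. A known group label stands in for the top eigenvector so that the example stays short.

```python
import numpy as np

def residualize(v, U):
    """Subtract from v its least-squares fit on an intercept plus the columns of U."""
    A = np.column_stack([np.ones(U.shape[0]), U])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def adjusted_trend_stat(g, y, U):
    """Association statistic after removing ancestry axes U from genotype and phenotype."""
    g_adj, y_adj = residualize(g, U), residualize(y, U)
    r = np.corrcoef(g_adj, y_adj)[0, 1]
    n, k = U.shape
    return (n - k - 1) * r ** 2

# Illustrative confounding: ancestry drives both allele frequency and disease risk,
# but the SNP has no direct effect on disease.
rng = np.random.default_rng(2)
n = 1000
a = np.repeat([0, 1], n // 2)                                      # hidden ancestry label
g = rng.binomial(2, np.where(a == 1, 0.8, 0.2)).astype(float)      # genotype
y = rng.binomial(1, np.where(a == 1, 0.8, 0.2)).astype(float)      # case status
U = (a - a.mean())[:, None]          # stand-in for the leading eigenvector (assumption)

unadjusted = n * np.corrcoef(g, y)[0, 1] ** 2
adjusted = adjusted_trend_stat(g, y, U)
```

In this simulation the unadjusted statistic is inflated by stratification alone, while the adjusted statistic should fall back toward its null range.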

13
Alternative to Regression is Matching
Pair Match vs Full Match
14
Distance Between Individuals

Match cases and controls that are similar in
ancestry
Distance: Euclidean distance between
eigenvectors, scaled by eigenvalues
Use eigenvalues $\lambda_1, \lambda_2, \ldots$ to weight importance of
axes
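
One way to read the distance above is sketched below: scale axis k by sqrt(lambda_k), then take ordinary Euclidean distance. The square-root scaling is an assumption (the slide says only "scaled by eigenvalues"), and `nearest_control` is a hypothetical helper that does simple nearest-neighbor lookup, not the pair or full matching discussed on the previous slide.

```python
import numpy as np

def ancestry_distances(U, evals):
    """Pairwise distances between rows of U after scaling axis k by sqrt(lambda_k)."""
    Z = np.asarray(U, dtype=float) * np.sqrt(np.asarray(evals, dtype=float))
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def nearest_control(dist, is_case):
    """Map each case index to the index of its closest control."""
    is_case = np.asarray(is_case, dtype=bool)
    cases = np.flatnonzero(is_case)
    controls = np.flatnonzero(~is_case)
    sub = dist[np.ix_(cases, controls)]          # case-by-control distance block
    return dict(zip(cases.tolist(), controls[sub.argmin(axis=1)].tolist()))

# Tiny hand-made example: the first axis carries more weight than the second.
U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
evals = np.array([4.0, 1.0])
dist = ancestry_distances(U, evals)
matches = nearest_control(dist, [True, False, False, False])
```

Here individual 0 is closer to individual 2 than to individual 1, because a unit step along the first (heavier) axis counts double.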

15
Determining D (number of dimensions)

Testing for structure is equivalent to testing for
large eigenvalues (Patterson et al. 2006).
Standardize the distribution based on
N subjects
L independent SNPs
The Tracy-Widom distribution describes the distribution
of the largest eigenvalue. Get a P-value for each
dimension tested (Johnstone 2001).
Weakness: the distribution depends on L, which must be
estimated. The method-of-moments (MOM) estimator assumes the null.

16
Association data

Cases
Americans of European descent
Controls
Northern Germans
Southern Germans

17
Starting Data (D = 17)
PO or FS similarities
Duplicates or MZ-twins
18
Northern/Southern Germans
19
After eliminating outliers and unmatchables (D = 2)
20
Performance good, requires care

Outliers: cases (or controls) that are unusual
at many SNPs
High influence on eigenmap, failure to discover
key dimensions of ancestry
Unmatchables: cases far from controls or vice
versa.
Caution: must remove outliers and unmatchables

21
Multi-ethnic example.

Cases and controls descended from
Europe
Africa
Asia
Expect hierarchical structure
Multi-continent
Differences within continent

22
3 Major Ethnic Groups (D = 2)
African Descent
European Descent
Asian Descent
23
European sub-sample (D = 4)
24
Problems

Approach is too sensitive to outliers
Not able to discover hierarchical structure
without performing several iterations of
analysis.
Tests for number of dimensions are not reliable
when data are heterogeneous.
Estimate of independent dimensions (L) is poor
when data are heterogeneous.
Try a new eigenmap.

25
Can spectral graph theory help?

For N subjects, define a graph $G = (V, E)$
V: vertices
E: edges
$W_{ij}$: weights between vertices (i, j)
large value for tight connection
0 for no connection
Must be positive and symmetric
$D_i$: degree of the vertex

26
Defining a Weight matrix

27
Goal

Want to embed a weighted graph into D-dimensional
space so that similar vertices are close.
A map that meets this requirement is the
Laplacian eigenmap of G.
Based on top non-trivial eigenvectors

28
Spectral embedding and clustering

Close connection between locality-preserving
spectral embeddings, spectral clustering and
cuts.
Min Cut finds outliers and clumps
Other weighted cuts find major splits and
clusters

29
Spectral Embedding Laplacian

Choices for embedding.
Degree matrix
$D = \mathrm{diag}(D_1, \ldots, D_N)$
Embedding based on decomposing
W
Laplacian $L = D - W$
Normalized Laplacian $D^{-1/2} L D^{-1/2}$
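
The three matrices on this slide can be built directly from a weight matrix. The sketch below assumes a symmetric, nonnegative W and takes each degree to be the corresponding row sum of W; the function name `laplacians` and the toy graph (two disconnected pairs of vertices) are hypothetical.

```python
import numpy as np

def laplacians(W):
    """Return the degree matrix D, Laplacian D - W, and normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                          # degree of vertex i = sum_j W_ij
    D = np.diag(deg)
    L = D - W
    inv_sqrt = np.zeros_like(deg)
    pos = deg > 0                                # isolated vertices keep a zero entry
    inv_sqrt[pos] = 1.0 / np.sqrt(deg[pos])
    L_norm = inv_sqrt[:, None] * L * inv_sqrt[None, :]   # D^{-1/2} L D^{-1/2}
    return D, L, L_norm

# Toy graph: vertices {0, 1} and {2, 3} form two disconnected pairs.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 2],
              [0, 0, 2, 0]], dtype=float)
D, L, L_norm = laplacians(W)
eigvals = np.linalg.eigvalsh(L_norm)             # ascending order
n_components = int(np.sum(np.abs(eigvals) < 1e-10))
```

The multiplicity of the zero eigenvalue equals the number of connected components, which is why the eigenvectors beyond the trivial ones carry cluster structure.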

30
Cuts
31
Choices of eigenmaps

W
Emphasizes outliers
PCA analysis similar since
W is a 0-thresholded version of K
Normalized Laplacian $D^{-1/2}(D - W)D^{-1/2}$
Finds major clusters
Robust to outliers

32
Eigen-gap Heuristic

Tracy-Widom theory for eigenvalues of a
covariance matrix no longer applies.
The eigen-gap heuristic says:
find the largest k such that
$\lambda_k - \lambda_{k+1}$
is significantly greater than 0.
Find the distribution of this statistic by
simulation.
Distribution depends on N, not L. Easy to obtain.
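
A minimal sketch of the eigen-gap rule follows. It assumes the eigenvalues are sorted in decreasing order with any trivial eigenvalue already removed, and that a vector of gaps simulated under no structure (`null_gaps`) is available. How that null simulation is set up is not specified on the slide, so the constant array used below is only a placeholder, and the function names are hypothetical.

```python
import numpy as np

def eigengaps(evals):
    """Gaps lambda_k - lambda_{k+1} for eigenvalues sorted in decreasing order."""
    lam = np.sort(np.asarray(evals, dtype=float))[::-1]
    return lam[:-1] - lam[1:]

def largest_significant_gap(evals, null_gaps, alpha=0.05):
    """Largest k (1-based) whose gap exceeds the (1 - alpha) null quantile; 0 if none."""
    gaps = eigengaps(evals)
    cutoff = np.quantile(null_gaps, 1 - alpha)
    hits = np.flatnonzero(gaps > cutoff)
    return int(hits[-1]) + 1 if hits.size else 0

# Illustrative numbers only: two clear gaps followed by a flat tail.
evals = [1.0, 0.9, 0.5, 0.48, 0.47]
null_gaps = np.full(100, 0.05)        # placeholder for simulated null gaps
k = largest_significant_gap(evals, null_gaps)
```

In this toy case the gaps after the first and second eigenvalues exceed the cutoff and the later ones do not, so the rule keeps two dimensions.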

33
Approach

Find eigenvectors and eigenvalues of normalized
Laplacian
Use square-root-threshold weights
Use eigengaps to determine dimension

34
Example with hierarchical structure (3 major
groups, 3 subgroups in 1)
[Figure legend: Groups 2 and 3 in black; Group 1 clusters in red, blue, and green]
35
POPRES data

Worldwide sample
6000 people worldwide
500,000 SNPs
Ancestry (only used to validate)
Country of birth
Country of birth of parents

36
Eigenvectors for POPRES
[Figure legend: Africa, Europe, China, Mexico, India]
37
Eigenvectors for POPRES
[Figure legend: Africa, Europe, China, Mexico, India]
38
Eigenvectors for POPRES
[Figure legend: Africa, Europe, China, Mexico, India]
39
Eigenvectors for POPRES
[Figure legend: Africa, Europe, China, Mexico, India]
40
Clustering

Distance calculated based on D eigenvectors.
Hierarchical clustering (Ward's algorithm): at
each step the union of every possible cluster
pair is considered. The clusters whose fusion
results in the minimum information loss are
combined.
Start with k = 2
Test for population structure
If one cluster is homogeneous, set aside the
cluster and continue
Repeat this process until all clusters are
homogeneous
K = number of homogeneous clusters.
Find matches and unmatchable observations
Find matches and unmatchable observations
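
The Ward merging step can be sketched with a naive NumPy implementation. It takes "information loss" to be the increase in within-cluster sum of squares, n_a n_b / (n_a + n_b) * ||c_a - c_b||^2, which is the usual Ward criterion. The function name `ward_clusters` and the toy coordinates are hypothetical, and the test for homogeneity that drives the recursive splitting is not shown.

```python
import numpy as np

def ward_clusters(Z, k):
    """Agglomerative clustering with Ward's criterion; returns labels 0..k-1."""
    Z = np.asarray(Z, dtype=float)
    clusters = [[i] for i in range(len(Z))]      # start with singletons
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ia, ib = clusters[a], clusters[b]
                gap = Z[ia].mean(axis=0) - Z[ib].mean(axis=0)
                weight = len(ia) * len(ib) / (len(ia) + len(ib))
                cost = weight * (gap @ gap)      # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # fuse the cheapest pair
        del clusters[b]
    labels = np.empty(len(Z), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Toy coordinates in a 2-dimensional eigenmap: two well-separated groups.
Z = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = ward_clusters(Z, k=2)
```

This version recomputes every pairwise cost at each step, so it is only suitable for small examples; a library routine would be the practical choice for thousands of subjects.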

41
Overall Clustering Results

4 Non-European clusters
Africa
China
Mexico
India
8 European clusters

42
Clusters of European Ancestry

Clusters generate a map of history and geography
of Europe
Portugal, Spain
Italy
Italy
Switzerland (France)
Ireland, England, UK, (France)
France, Belgium, (Switz, Italian)
Netherlands, Belgium, Germany, (UK)
Germany, Poland, Hungary, Romania
Other nationalities with small samples fit too
Slovakia, Slovenia, Kosovo, Montenegro, Czech,
Croatia

43
Comparison to principal components

Using effective number of independent SNPs,
find D = 3
misses structure in Europeans.
Using actual number of SNPs, D = 27
picks up a lot of noise
Clustering finds 127 clusters
Unhelpful for interpretation.

44
Summary

The traditional eigenmap based on principal
components is sensitive to outliers and does not
reveal hierarchical structure in a single pass.
Normalized Laplacian leads to a better spectral
embedding that helps to find more meaningful
ancestry vectors.
Resulting eigenvectors emphasize major clusters
and are robust to outliers.
Final clustering corresponds to true ancestry
remarkably well.
Matching cases and controls from homogeneous
clusters provides appropriate strata for
conditional logistic analysis.