Spectral Graph Theory and Ancestry in Genomewide Association Studies

1
Spectral Graph Theory and Ancestry in Genome-wide
Association Studies
  • Kathryn Roeder
  • Coauthors: Diana Luca, Ann Lee, Bernie Devlin, Bert Klei

2
Find Genetic Variants Associated with Disease
  • Data
  • Cases
  • Controls
  • Several unrecorded ethnic groups
  • ½ million genetic markers (SNPs)
  • Analysis: logistic regression
  • Test each variant for association

3
Data
  • X counts the number of minor alleles at each
    marker across genome.
  • Y takes on value 1 for affected and 0 for
    unaffected individuals.

4
Population Stratification
5
Distribution of the O blood type in native
populations around the world.
6
Spurious Associations
  • Common genetic approach to control spurious
    findings
  • Restrict to common continental ancestry
  • Statistical approaches using G
  • Match cases and controls by ancestry

[Diagram with nodes G1, G2, C, Y]
7
High Dimensional Data
  • N
  • Information content
  • Each SNP offers related information about
    ancestry
  • Information per SNP is minimal
  • Across the genome it adds up

8
Clusters of subpopulations
  • Individuals from a common subpopulation are
    correlated and form clusters.

9
Principal component map
  • Let X be an N x L matrix, centered and scaled by
    allele frequency.
  • Project $XX^T = U \Lambda U^T$ into the D-dimensional space
    defined by the D largest eigenvalues.
  • Ancestry of the ith individual is $(u_{i1}, u_{i2}, \ldots, u_{iD})$
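
The principal component map above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `pca_ancestry` is hypothetical, and the scaling by the binomial standard deviation sqrt(2p(1-p)) is an assumed reading of "scaled by allele frequency".

```python
import numpy as np

def pca_ancestry(genotypes, n_dims):
    """Return (coords, eigvals): N x n_dims ancestry coordinates and their eigenvalues.

    genotypes: N x L array of minor-allele counts in {0, 1, 2}.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                  # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)                  # drop monomorphic SNPs (zero variance)
    G, p = G[:, keep], p[keep]
    # Center each SNP and scale by its assumed binomial standard deviation.
    X = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    K = X @ X.T                               # N x N matrix XX^T
    eigvals, eigvecs = np.linalg.eigh(K)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, order], eigvals[order]
```

Row i of the returned coordinate array plays the role of the ancestry vector of the ith individual.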

10
PCA and Multidimensional Scaling
11
Representation of Ancestry (Patterson et al.)
12
Eigenstrat (Price et al, Nature Genetics 2006)
  • Modeling ancestry using Eigenstrat
  • Find eigenvectors $U_1, U_2, \ldots, U_{10}$
  • Regress out eigenvectors to remove confounding of
    ancestry
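
As a rough illustration of "regressing out" eigenvectors, the sketch below residualizes both genotype and phenotype on the ancestry eigenvectors and tests the residuals for association. The function names are hypothetical, and the statistic (N - K - 1) r^2 is written from memory of the Eigenstrat description, so treat it as an assumption rather than a reference implementation.

```python
import numpy as np

def residualize(v, U):
    """Remove from v its least-squares fit on an intercept plus the columns of U."""
    A = np.column_stack([np.ones(len(v)), U])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def adjusted_stat(g, y, U):
    """Association statistic between genotype g and phenotype y after adjusting for U."""
    g_res = residualize(g, U)
    y_res = residualize(y, U)
    r = np.corrcoef(g_res, y_res)[0, 1]       # correlation of the adjusted variables
    n, k = len(y), U.shape[1]
    return (n - k - 1) * r ** 2               # assumed form of the adjusted statistic
```

When a SNP and the disease both track ancestry but are otherwise unrelated, the adjusted statistic should be much smaller than the unadjusted one.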

13
Alternative to Regression is Matching
Pair Match vs Full Match
14
Distance Between Individuals
  • Match cases and controls that are similar in
    ancestry
  • Distance: Euclidean distance between
    eigenvectors, scaled by eigenvalues
  • Use eigenvalues $\lambda_1, \lambda_2, \ldots$ to weight importance of
    axes
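
One way to read the eigenvalue-weighted distance is sketched below. The exact weighting is not given on the slide; here each axis is multiplied by the square root of its eigenvalue before taking Euclidean distances, which is an assumption, and `ancestry_distances` is a hypothetical name.

```python
import numpy as np

def ancestry_distances(U, eigvals):
    """Pairwise distances between rows of U, weighting axis d by eigvals[d].

    d(i, j)^2 = sum_d eigvals[d] * (U[i, d] - U[j, d])^2   (assumed weighting)
    """
    Z = np.asarray(U, dtype=float) * np.sqrt(np.asarray(eigvals, dtype=float))
    diff = Z[:, None, :] - Z[None, :, :]      # all pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy example: three individuals, two axes with eigenvalues 4 and 1.
dist = ancestry_distances([[0, 0], [1, 0], [0, 1]], [4.0, 1.0])
# dist[0, 1] is 2.0 (first axis counts double), dist[0, 2] is 1.0
```

Matching then amounts to pairing each case with controls that are close under this distance.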

15
Determining D (# dimensions)
  • Testing for structure is equivalent to testing for
    large eigenvalues (Patterson et al. 2006).
  • Standardize distribution based on
  • N subjects
  • L independent SNPs
  • The Tracy-Widom distribution describes the distribution
    of the largest eigenvalue. Get a P-value for each
    dimension tested (Johnstone 2001).
  • Weakness: the distribution depends on L, which must be
    estimated. The method-of-moments (MOM) estimator assumes the null.

16
Association data
  • Cases
  • Americans of European descent
  • Controls
  • Northern Germans
  • Southern Germans

17
Starting Data (D = 17)
PO or FS similarities
Duplicates or MZ-twins
18
Northern/Southern Germans
19
After eliminating outliers and unmatchables (D = 2)
20
Performance good, requires care
  • Outliers: cases (or controls) that are unusual
    at many SNPs
  • High influence on eigenmap, failure to discover
    key dimensions of ancestry
  • Unmatchables: cases far from controls or vice
    versa.
  • Caution: must remove outliers and unmatchables

21
Multi-ethnic example.
  • Cases and controls descended from
  • Europe
  • Africa
  • Asia
  • Expect hierarchical structure
  • Multi-continent
  • Differences within continent

22
3 Major Ethnic Groups (D = 2)
African Descent
European Descent
Asian Descent
23
European sub-sample (D = 4)
24
Problems
  • Approach is too sensitive to outliers
  • Not able to discover hierarchical structure
    without performing several iterations of
    analysis.
  • Tests for number of dimensions are not reliable
    when data are heterogeneous.
  • Estimate of independent dimensions (L) is poor
    when data are heterogeneous.
  • Try a new eigenmap.

25
Can spectral graph theory help?
  • For N subjects, define a graph G = (V, E)
  • V: vertices
  • E: edges
  • $W_{ij}$: weight between vertices i and j
  • large value for tight connection
  • 0 for no connection
  • Must be nonnegative and symmetric
  • $D_i = \sum_j W_{ij}$: degree of vertex i

26
Defining a Weight matrix

27
Goal
  • Want to embed a weighted graph into D-dimensional
    space so that similar vertices are close.
  • A map that meets this requirement is the
    Laplacian eigenmap of G.
  • Based on top non-trivial eigenvectors

28
Spectral embedding and clustering
  • Close connection between locality-preserving
    spectral embeddings, spectral clustering and
    cuts.
  • Min Cut finds outliers and clumps
  • Other weighted cuts find major splits and
    clusters

29
Spectral Embedding Laplacian
  • Choices for embedding.
  • Degree matrix
  • $D = \mathrm{diag}(D_1, \ldots, D_N)$
  • Embedding based on decomposing
  • W
  • Laplacian $L = D - W$
  • Normalized Laplacian $D^{-1/2} L D^{-1/2}$
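
The degree matrix, Laplacian and normalized Laplacian listed above can be computed directly from a weight matrix. The sketch below is illustrative only: `normalized_laplacian_embedding` is a hypothetical name, the graph is assumed connected with positive degrees, and the embedding uses the eigenvectors with the smallest non-trivial eigenvalues of the normalized Laplacian (equivalently, the top non-trivial eigenvectors of D^{-1/2} W D^{-1/2}).

```python
import numpy as np

def normalized_laplacian_embedding(W, n_dims):
    """Return (eigvals, eigvecs) for the n_dims smallest non-trivial eigenpairs."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)                       # D_i = sum_j W_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)           # assumes every vertex has degree > 0
    L = np.diag(deg) - W                      # Laplacian L = D - W
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # D^{-1/2} L D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L_norm) # ascending eigenvalues
    # Index 0 is the trivial eigenpair (eigenvalue 0) of a connected graph; skip it.
    return eigvals[1:n_dims + 1], eigvecs[:, 1:n_dims + 1]
```

For two tightly connected groups joined by weak edges, the sign of the first returned eigenvector separates the groups.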

30
Cuts
31
Choices of eigenmaps
  • W
  • Emphasizes outliers
  • PCA analysis similar, since
  • W is a 0-thresholded version of K
  • Normalized Laplacian $D^{-1/2} (D - W) D^{-1/2}$
  • Finds major clusters
  • Robust to outliers

32
Eigen-gap Heuristic
  • Tracy-Widom theory for eigenvalues of a
    covariance matrix no longer applies.
  • The eigen-gap heuristic says:
  • find the largest k such that
  • $\lambda_k - \lambda_{k+1}$
  • is significantly greater than 0.
  • Find the distribution of this statistic by
    simulation.
  • Distribution depends on N, not L. Easy to obtain.
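
The eigen-gap rule can be sketched as follows. The significance cutoff is taken as an input here; on the slide it comes from a simulated null distribution, which this sketch does not reproduce, and `eigengap_dimension` and `threshold` are hypothetical names.

```python
def eigengap_dimension(eigvals, threshold):
    """Largest k whose gap (k-th minus (k+1)-th largest eigenvalue) exceeds threshold.

    threshold stands in for a cutoff obtained by simulation (an assumption here).
    Returns 0 if no gap is large enough.
    """
    vals = sorted(eigvals, reverse=True)
    gaps = [vals[k] - vals[k + 1] for k in range(len(vals) - 1)]
    significant = [k + 1 for k, gap in enumerate(gaps) if gap > threshold]
    return max(significant) if significant else 0

# Two clear gaps followed by a flat tail give a dimension of 2.
n_dims = eigengap_dimension([1.0, 0.9, 0.5, 0.48, 0.47], threshold=0.05)
```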

33
Approach
  • Find eigenvectors and eigenvalues of normalized
    Laplacian
  • Use square-root-threshold weights
  • Use eigengaps to determine dimension
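
The slides do not spell out the weight formula. One plausible reading of "square-root-threshold weights" is to set negative entries of the similarity matrix K to zero and take the square root of the rest. The sketch below implements that assumed reading; the name `sqrt_threshold_weights` is hypothetical.

```python
import numpy as np

def sqrt_threshold_weights(K):
    """Assumed weights: W_ij = sqrt(K_ij) if K_ij >= 0, else 0."""
    K = np.asarray(K, dtype=float)
    return np.sqrt(np.clip(K, 0.0, None))     # threshold at 0, then square root
```

Applied to a symmetric K, the result is symmetric and nonnegative, as a weight matrix must be.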

34
Example with hierarchical structure (3 major
groups, 3 subgroups in 1)
[Figure legend: Groups 2 and 3 in black; Group 1 clusters in red, blue, green]
35
POPRES data
  • Worldwide sample
  • 6000 people worldwide
  • 500,000 SNPs
  • Ancestry (only used to validate)
  • Country of birth
  • Country of birth of parents

36
Eigenvectors for POPRES
Africa Europe China Mexico India
37
Eigenvectors for POPRES
Africa Europe China Mexico India
38
Eigenvectors for POPRES
Africa Europe China Mexico India
39
Eigenvectors for POPRES
Africa Europe China Mexico India
40
Clustering
  • Distance calculated based on D eigenvectors.
  • Hierarchical clustering (Ward's algorithm): at
    each step the union of every possible cluster
    pair is considered. The clusters whose fusion
    results in the minimum information loss are
    combined.
  • Start with k = 2
  • Test for population structure
  • If one cluster is homogeneous, set aside the
    cluster and continue
  • Repeat this process until all clusters are
    homogeneous
  • Result: K homogeneous clusters.
  • Find matches and unmatchable observations
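
The merging step of Ward's algorithm described above can be sketched naively in NumPy. This is an illustration only: `ward_clusters` is a hypothetical name, "information loss" is taken to be the increase in within-cluster sum of squares, and the recursive test for homogeneity is not included.

```python
import numpy as np

def ward_clusters(points, n_clusters):
    """Merge clusters greedily by Ward's criterion until n_clusters remain."""
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        # Consider the union of every possible cluster pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                mean_a = pts[clusters[a]].mean(axis=0)
                mean_b = pts[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * np.sum((mean_a - mean_b) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(pts), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels
```

In the setting of the slides, `points` would hold each individual's coordinates on the D retained eigenvectors, and the first call would use `n_clusters=2`.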

41
Overall Clustering Results
  • 4 Non-European clusters
  • Africa
  • China
  • Mexico
  • India
  • 8 European clusters

42
Clusters of European Ancestry
  • Clusters generate a map of history and geography
    of Europe
  • Portugal, Spain
  • Italy
  • Italy
  • Switzerland (France)
  • Ireland, England, UK, (France)
  • France, Belgium, (Switz, Italian)
  • Netherlands, Belgium, Germany, (UK)
  • Germany, Poland, Hungary, Romania
  • Other nationalities with small samples fit too
  • Slovakia, Slovenia, Kosovo, Montenegro, Czech,
    Croatia

43
Comparison to principal components
  • Using effective number of independent SNPs,
    find D = 3
  • missing structure in Europeans.
  • Using actual number of SNPs, D = 27
  • pick up a lot of noise
  • Clustering finds 127 clusters
  • Unhelpful for interpretation.

44
Summary
  • The traditional eigenmap based on principal
    components has bad properties.
  • Normalized Laplacian leads to a better spectral
    embedding that helps to find more meaningful
    ancestry vectors.
  • Resulting eigenvectors emphasize major clusters
    and are robust to outliers.
  • Final clustering corresponds to true ancestry
    remarkably well.
  • Matching cases and controls from homogeneous
    clusters provides appropriate strata for
    conditional logistic analysis.