Title: Dimensionality Reduction in the Analysis of Human Genetics Data
1Dimensionality Reduction in the Analysis of Human
Genetics Data
Petros Drineas Rensselaer Polytechnic
Institute Computer Science Department
To access my web page
drineas
2Human genetic history
Much of the biological and evolutionary history
of our species is written in our DNA
sequences. Population genetics can help
translate that historical message.
The genetic variation among humans is a small
portion of the human genome. All humans are
more than 99.9% identical.
4Our objective
 Fact
   Dimensionality reduction techniques (such as Principal Components Analysis, PCA) separate different populations and result in plots that correlate well with geography.
 Our goal
   Based on this observation, we seek unsupervised, efficient algorithms for the selection of a small set of genetic markers that can be used to
     capture population structure, and
     predict individual ancestry.
5The math behind
 To this end, we employ matrix algorithms and matrix decompositions such as
   the Singular Value Decomposition (SVD), and
   the CX decomposition.
 We provide novel, unsupervised algorithms for selecting ancestry informative markers.
6Overview
 Background
   The Singular Value Decomposition (SVD)
   The CX decomposition
 Removing redundant markers (Column Subset Selection Problem)
 Selecting Ancestry Informative Markers
   A worldwide set of populations
   Admixed European-American populations
   The POPulation REference Sample (POPRES)
7Single Nucleotide Polymorphisms (SNPs)
Single Nucleotide Polymorphisms (SNPs) are the most common
type of genetic variation in the genome across
different individuals. They are known locations
in the human genome where two alternate
nucleotide bases (alleles) are observed (out of
A, C, G, T).
SNPs
AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT
AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT
CG CG CG AT CT CT AG CT AG GG GT GA AG GG TT
TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG
GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG
GG TT TT CC GG TT GG GG TT GG AA GG TT TT GG
TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC
AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT
CT CT AG CT AG GG GT GA AG GG TT TT GG TT CC
CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG
CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT
AG CT AG GT GT GA AG GG TT TT GG TT CC CC CC
CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC
CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT
AG GG TT GG AA GG TT TT GG TT CC CC CG CC AG
AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA
CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG
TT GG AA GG TT TT GG TT CC CC CC CC GG AA AG
AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA
GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG
AA
individuals
There are about 10 million SNPs in the human genome,
so this matrix could have about 10 million columns.
8Our data as a matrix
[Figure: the individuals-by-SNPs genotype matrix from the previous slide, shown next to its numeric encoding, a matrix of the same shape (rows: individuals, columns: SNPs).]
Example encoding for a SNP with alleles A and G: the heterozygote AG maps to 0, and the two homozygotes AA and GG map to -1 and +1 (which homozygote receives which sign is a convention).
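The numeric encoding of genotypes can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `encode_genotypes`, the toy genotypes, and the choice of which allele is counted for each SNP are all assumptions made here for the example.

```python
import numpy as np

def encode_genotypes(genotypes, counted_alleles):
    """Map two-letter genotypes to -1/0/+1 by counting copies of one allele.

    genotypes       : list of rows (one per individual), each a list like "AG"
    counted_alleles : one allele per SNP (column); the choice is a convention
    """
    m, n = len(genotypes), len(counted_alleles)
    A = np.zeros((m, n), dtype=int)
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # 0, 1, or 2 copies of the counted allele, shifted to -1, 0, +1
            A[i, j] = g.count(counted_alleles[j]) - 1
    return A

# Hypothetical toy data: 3 individuals, 3 SNPs
genotypes = [["AG", "CT", "GG"],
             ["AA", "CC", "GT"],
             ["GG", "TT", "TT"]]
counted_alleles = ["G", "T", "G"]
A = encode_genotypes(genotypes, counted_alleles)
print(A)
```

Flipping the counted allele of a SNP only negates that column, which does not change the singular values of the matrix.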
9Forging population variation structure
 Genetic diversity and population (sub)structure are caused by
   Mutation
     Mutations are changes to the base pair sequence of the DNA.
   Natural selection
     Genotypes that correspond to favorable traits and are heritable become more common in successive generations of a population of reproducing organisms.
 Mutations increase genetic diversity.
 Under natural selection, beneficial mutations increase in frequency, while deleterious mutations decrease in frequency.
10Forging population variation structure
 Genetic drift
   Sampling effects on evolution
   Example: say that the RAF of a SNP in a small population is p.
   The offspring generation would (in expectation) have a RAF of p as well for the same SNP.
   In reality, it will have a RAF of p' (a drifted frequency).
 Gene flow
   Transfer of alleles between populations (immigration)
 Nonrandom mating
   Reduces interaction between (sub)populations.
 Other demographic events
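The genetic drift example can be illustrated with a small simulation. This sketch assumes a simple binomial resampling model (Wright-Fisher style) with diploid individuals; the function name `drift` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def drift(p, n_individuals, n_generations, rng):
    """Track an allele frequency under repeated binomial resampling."""
    freqs = [p]
    n_copies = 2 * n_individuals  # assumption: two allele copies per individual
    for _ in range(n_generations):
        # each allele copy of the next generation carries the allele w.p. p,
        # so the expected new frequency is p, but the realized one drifts
        p = rng.binomial(n_copies, p) / n_copies
        freqs.append(p)
    return np.array(freqs)

rng = np.random.default_rng(0)
small = drift(0.5, n_individuals=20, n_generations=50, rng=rng)
large = drift(0.5, n_individuals=20000, n_generations=50, rng=rng)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Under this model the frequency in the small population typically wanders much further from its starting value than the frequency in the large one.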
11Dimensionality reduction
[Figure: the numerically encoded individuals-by-SNPs matrix.]
 Mutation, natural selection, population history, etc. result in significant structure in the data.
 This structure can be extracted via dimensionality reduction techniques.
 In most cases, unsupervised techniques suffice!
 In most cases, linear dimensionality reduction techniques (PCA) suffice!
12Why study population structure?
 Mapping causative genes for common complex disorders (e.g., diabetes, heart conditions, obesity, etc.)
 History of human populations
 Genealogy
 Forensics
 Conservation genetics
13Population stratification
 Definition
   Correlation between subpopulations in the case/control samples and the phenotype under investigation.
 Effects
   Confounds the study, typically leading to false positive correlations.
 Solution
   The problem can be addressed either by careful sample collection or by statistical postprocessing of the results (Price et al (2006) Nat Genet).
14Population stratification (contd)
[Figure: cases and controls sampled from Population 1 and Population 2, with genotypes AA, AC, and CC.]
Example of the confounding effects of population stratification in an association study.
Marchini et al (2004) Nat Genet
15Recall our objective
 Develop unsupervised, efficient algorithms for the selection of a small set of SNPs that can be used to
   capture population structure, and
   predict individual ancestry.
 Why? Cost efficiency.
 Let's discuss (briefly) prior work.
16 Inferring population structure
[Figure: inferred population structure, with regions labeled Oceania, Africa, Europe, Central Asia, East Asia, Middle East, America.]
377 STRPs, Rosenberg et al (2004) Science
 Examples of available algorithms/software packages
   STRUCTURE: Pritchard et al (2000) Genetics
   FRAPPE: Li et al (2008) Science
17Selecting ancestry informative markers
 Existing methods (Fst, Informativeness, δ)
   Rosenberg et al (2003) Am J Hum Genet
   Allele frequency based.
   Require prior knowledge of individual ancestry (supervised).
   Such knowledge may not be available (e.g., populations of complex ancestry, large multicentered studies of anonymous samples, etc.).
 Unsupervised feature selection techniques are often preferable because they tend to not overfit the data.
18The Singular Value Decomposition (SVD)
Let A be a matrix with m rows (one for each
subject) and n columns (one for each SNP).
Matrix rows: points (vectors) in a Euclidean space.
E.g., given 2 objects (x and d), each described
with respect to two features, we get a 2-by-2 matrix.
Two objects are close if the angle between their
corresponding vectors is small.
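The notion of closeness via the angle between row vectors can be made concrete with a short sketch. The 2-by-2 matrix below is a made-up example (two objects, two features), and `angle_between` is a hypothetical helper name.

```python
import numpy as np

def angle_between(u, v):
    """Angle (in radians) between two nonzero vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip guards against round-off pushing the cosine slightly outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Two objects (rows), each described by two features (columns)
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])
angle_deg = np.degrees(angle_between(A[0], A[1]))
print(angle_deg)  # approximately 45 degrees for this toy matrix
```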
19SVD, intuition
Let the blue circles represent m data points in a
2D Euclidean space. Then, the SVD of the m-by-2
matrix of the data will return two singular vectors and two singular values.
20Singular values
$\sigma_1$ measures how much of the data variance is
explained by the first singular vector.
$\sigma_2$ measures how much of the data variance is
explained by the second singular vector.
21SVD formal definition
$A = U \Sigma V^T$
$\rho$: rank of A.
U (V): orthogonal matrix containing the left (right) singular vectors of A.
$\Sigma$: diagonal matrix containing the singular values of A.
22Rank-k approximations via the SVD
[Figure: $A = U \Sigma V^T$ for an objects-by-features matrix A; the leading singular values and vectors are marked as significant, the trailing ones as noise.]
23Rank-k approximations ($A_k$)
$A_k = U_k \Sigma_k V_k^T$
 $U_k$ ($V_k$): orthogonal matrix containing the top k left (right) singular vectors of A.
 $\Sigma_k$: diagonal matrix containing the top k singular values of A.
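A rank-k approximation built from the top k singular vectors and values can be sketched with NumPy as follows. The random matrix and the choice k = 2 are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # toy data: 8 objects, 5 features

# Thin SVD: A = U diag(s) Vt, singular values in s sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Keep only the top k singular vectors and singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error equals the energy in the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
tail = np.sqrt(np.sum(s[k:] ** 2))
print(err, tail)
```

The two printed numbers agree, which reflects the optimality of the truncated SVD among all rank-k approximations in the Frobenius norm.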
24PCA and SVD
Principal Components Analysis (PCA) essentially
amounts to the computation of the Singular Value
Decomposition (SVD) of a covariance matrix. SVD
is the algorithmic tool behind MultiDimensional
Scaling (MDS) and Factor Analysis.
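The connection between PCA and the SVD can be checked numerically. In this sketch the data matrix is random, and the covariance is taken over the columns (features) of the mean-centered matrix; both choices are assumptions for the example rather than a description of any particular genetics pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 4
A = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))

# PCA works on mean-centered columns
A_centered = A - A.mean(axis=0)
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

# Eigenvalues of the n-by-n covariance matrix, sorted in decreasing order
cov = np.cov(A_centered, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Principal component scores: projections onto the right singular vectors
scores = A_centered @ Vt.T

print(eigvals)
print(s ** 2 / (m - 1))
```

The covariance eigenvalues match the squared singular values divided by m - 1, and the scores equal the left singular vectors scaled by the singular values.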
25The data
Populations: European Americans, South Altaians, Spanish, African Americans, Japanese, Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala, Burunge, Quechua
Regions: Africa, Europe, E Asia, America
274 individuals from 12 populations genotyped on 10,000 SNPs. Shriver et al (2005) Human Genomics
26[Figure: plot of the individuals, with clusters labeled America, Africa, Asia, and Europe.]
Paschou et al (2007) PLoS Genetics
27[Figure: plot of the individuals, with clusters labeled America, Africa, Asia, and Europe.]
Not altogether satisfactory: the singular vectors
are linear combinations of all SNPs and, of
course, cannot be assayed! Can we find actual
SNPs that capture the information in the singular
vectors (e.g., SNPs spanning the same subspace)?
Paschou et al (2007) PLoS Genetics
28SVD decomposes a matrix as
$A \approx U_k X$, where $U_k$ contains the top k left singular vectors of A and $X = U_k^T A = \Sigma_k V_k^T$.
 The SVD has strong optimality properties.
 The columns of $U_k$ are linear combinations of up to all columns of A.
29CX decomposition
Goal: make (some norm of) $A - CX$ small, where C consists of c columns of A.
Why? If A is a subject-SNP matrix, then
selecting representative columns is equivalent to
selecting representative SNPs to capture the same
structure as the top eigenSNPs. We want c as
small as possible!
30CX decomposition
Goal: make (some norm of) $A - CX$ small, where C consists of c columns of A.
Theory: for any matrix A, we can find C such
that the norm of $A - CX$ is almost equal to the norm of $A - A_k$, with $c \approx k$.
31CX decomposition
C consists of c columns of A.
Easy to prove that the optimal $X = C^+ A$. ($C^+$ is the
Moore-Penrose pseudoinverse of C.) Thus, the
challenging part is to find good columns (SNPs)
of A to include in C. From a mathematical
perspective, this is a hard combinatorial problem.
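Once the columns in C are fixed, computing the best X is the easy part, as a short NumPy sketch shows. The matrix A and the column indices in `cols` are arbitrary choices made here for illustration; selecting good columns is the hard part and is not addressed by this snippet.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 10))

cols = [0, 3, 7]             # hypothetical choice of c = 3 columns
C = A[:, cols]

# Least-squares optimal X for fixed C: the pseudoinverse of C applied to A
X = np.linalg.pinv(C) @ A
residual = np.linalg.norm(A - C @ X, "fro")

# Any perturbation of X can only increase the Frobenius-norm residual
X_other = X + 0.1 * rng.standard_normal(X.shape)
residual_other = np.linalg.norm(A - C @ X_other, "fro")
print(residual, residual_other)
```

The product C X is the projection of A onto the column span of C, so the selected columns themselves are reproduced exactly.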
32A theorem
Drineas et al (2008) SIAM J Mat Anal Appl
Given an m-by-n matrix A, there exists an
algorithm that picks, in expectation, at most
$O(k \log k / \epsilon^2)$ columns of A, runs in $O(mn^2)$ time,
and with probability at least $1 - 10^{-20}$ satisfies
$\|A - CX\|_F \le (1 + \epsilon) \|A - A_k\|_F$.
33The CX algorithm
Input: m-by-n matrix A, target rank k, number of columns c
Output: C, the matrix consisting of the selected columns
 CX algorithm
   Compute probabilities $p_j$ summing to 1.
   For each j = 1, 2, ..., n, pick the j-th column of A with probability $\min\{1, c p_j\}$.
   Let C be the matrix consisting of the chosen columns. (C has in expectation at most c columns.)
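The sampling step of the CX algorithm can be sketched as follows. The function name `cx_sample` is hypothetical, and the uniform probabilities used in the example are a placeholder assumption; any probabilities summing to 1 can be passed in.

```python
import numpy as np

def cx_sample(A, p, c, rng):
    """Keep column j of A independently with probability min(1, c * p[j])."""
    keep_prob = np.minimum(1.0, c * p)
    keep = rng.random(A.shape[1]) < keep_prob
    idx = np.flatnonzero(keep)
    return A[:, idx], idx

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 100))
p = np.full(100, 1 / 100)     # placeholder: uniform probabilities summing to 1
c = 10
C, idx = cx_sample(A, p, c, rng)

# The expected number of kept columns is sum_j min(1, c * p_j) <= c
expected_columns = np.minimum(1.0, c * p).sum()
print(C.shape, expected_columns)
```

Because each column is kept independently, the number of columns in C varies from run to run; only its expectation is bounded by c.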
35Subspace sampling (Frobenius norm)
$V_k$: orthogonal matrix containing the top k right singular vectors of A.
$\Sigma_k$: diagonal matrix containing the top k singular values of A.
Remark: The rows of $V_k^T$ are orthonormal vectors, but its columns $(V_k^T)^{(i)}$ are not.
Subspace sampling in $O(mn^2)$ time: $p_j = \|(V_k^T)^{(j)}\|_2^2 / k$
 Leverage scores (many references in the statistics community)
 Normalization s.t. the $p_j$ sum up to 1
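The subspace sampling probabilities can be computed directly from the SVD. This sketch assumes the standard definition of leverage scores, namely the squared Euclidean norms of the columns of $V_k^T$, divided by k so that they sum to 1; the function name `leverage_probabilities` and the toy matrix are hypothetical.

```python
import numpy as np

def leverage_probabilities(A, k):
    """Squared column norms of V_k^T, normalized by k so they sum to 1."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                      # k-by-n, rows are orthonormal
    return np.sum(Vk_t ** 2, axis=0) / k

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 12))
p = leverage_probabilities(A, k=3)
print(p, p.sum())
```

The division by k works as a normalization because the k rows of $V_k^T$ are orthonormal, so the squared entries of the whole matrix add up to k.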
36Deterministic variant of CX
Paschou et al (2007) PLoS Genetics; Mahoney and Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick)
Output: the selected SNPs
 CX algorithm
   Compute the scores $p_j$.
   Pick the columns (SNPs) corresponding to the top c scores.
37Deterministic variant of CX (contd)
Paschou et al (2007) PLoS Genetics; Mahoney and Drineas (2009) PNAS
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick)
Output: the selected SNPs; we will call them PCA-Informative Markers or PCAIMs
 CX algorithm
   Compute the scores $p_j$.
   Pick the columns (SNPs) corresponding to the top c scores.
In order to estimate k for SNP data, we developed
a permutation-based test to determine whether a
certain principal component is significant or
not. (A similar test was presented in Patterson
et al (2006) PLoS Genetics)
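The deterministic variant (score every SNP, keep the top c) can be sketched on toy data. Everything below is an illustrative assumption: the function name `select_pcaims`, the use of squared column norms of $V_k^T$ as scores, and a synthetic matrix in which two groups of individuals differ only at SNPs 0 and 1. The value of k is fixed by hand rather than estimated with a permutation test.

```python
import numpy as np

def select_pcaims(A, k, c):
    """Return the indices of the c columns with the highest scores."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)   # one score per column (SNP)
    top = np.argsort(scores)[::-1][:c]        # indices of the c largest scores
    return top, scores

# Toy data: 40 individuals, 10 SNPs, small noise everywhere
rng = np.random.default_rng(5)
A = rng.normal(0.0, 0.05, size=(40, 10))
# Two groups of 20 individuals that differ only at SNPs 0 and 1
A[:20, 0] += 1.0
A[20:, 0] -= 1.0
A[:20, 1] -= 1.0
A[20:, 1] += 1.0

top, scores = select_pcaims(A, k=1, c=2)
print(sorted(top.tolist()))
```

On this toy matrix the two structure-carrying SNPs receive by far the largest scores, so they are the ones selected.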
38Worldwide data
Populations: European Americans, South Altaians, Spanish, African Americans, Japanese, Chinese, Puerto Rico, Nahua, Mende, Mbuti, Mala, Burunge, Quechua
Regions: Africa, Europe, E Asia, America
274 individuals, 12 populations, 10,000 SNPs
using the Affymetrix array. Shriver et al (2005)
Human Genomics
39Selecting PCA-correlated SNPs for individual
assignment to four continents (Africa, Europe,
Asia, America)
[Figure: PCA scores of the SNPs, plotted in chromosomal order, with the top 30 PCA-correlated SNPs marked; individuals are grouped by continent (Africa, Europe, Asia, America).]
Paschou et al (2007) PLoS Genetics
40Correlation coefficient between true and
predicted membership of an individual to a
particular geographic continent. (Use a subset
of SNPs, cluster the individuals using k-means.)
Paschou et al (2007) PLoS Genetics
41 Cross-validation on HapMap data
Paschou et al (2007) PLoS Genetics
42
 9 indigenous populations from four different continents (Africa, Europe, Asia, Americas)
 All SNPs and 10 principal components: perfect clustering!
 50 PCAIMs (SNPs): almost perfect clustering.
43The Human Genome Diversity Panel
Approx. 1,000 individuals, 650K Illumina Array
Li et al (2008) Science
44Highest scoring genes in HGDP dataset
Gene: Function (RefSeq)
EDAR: Ectodermal development, hair follicle formation.
PTK6: Intracellular signal transducer in epithelial tissues. Sensitization of cells to epidermal growth factor.
GALNT13: Initiates O-linked glycosylation of mucins.
SPATA20: Associated with spermatogenesis.
MCHR1: Plasma membrane protein which binds melanin-concentrating hormone. Probably involved in the neuronal regulation of food consumption.
FOXP1: Forkhead box transcription factors play important roles in the regulation of tissue- and cell type-specific gene transcription during both development and adulthood.
PSCD3: Involved in the control of Golgi structure and function.
CNTNAP2: Member of the neurexin family, which functions in the vertebrate nervous system as cell adhesion molecules and receptors.
OCA2: Skin/Hair/Eye pigmentation.
EGFR: This protein is a receptor for members of the epidermal growth factor family. Associated with the melanin pathway.
Barreiro et al (2008) Nat Genet; Sabeti et al
(2007) Nature; The International HapMap
Consortium (2007) Nature
45 A problem with the CX decomposition
Input: m-by-n matrix A, integer k, and c (number of SNPs to pick)
Output: the selected PCA-Informative Markers or PCAIMs
 CX algorithm
   Compute the scores $p_j$.
   Pick the columns (SNPs) corresponding to the top c scores.
Problem: Highly correlated SNPs (a.k.a. SNPs
that are in LD) get similar high scores, and
thus the deterministic variant would select
redundant SNPs. How do we remove this redundancy?
48Column Subset Selection Problem (CSSP)
Definition: Given an m-by-c matrix A, find k
columns of A, forming an m-by-k matrix C, that are
maximally uncorrelated (a.k.a. select maximally
uncorrelated SNPs).
Metric of correlation: A
common formulation is to select a set of SNPs
that span a parallelepiped of maximal
volume. This formulation is NP-hard (i.e.,
intractable even for small values of m, c, and
k).
Significant prior work: The CSSP has been
studied in the Numerical Linear Algebra
community, and many provably accurate
approximation algorithms exist.
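The volume criterion can be illustrated on a tiny matrix. This sketch assumes the usual definition of the volume of the parallelepiped spanned by the columns of C, which equals the product of the singular values of C; the helper name `volume` and the 3-by-3 matrix are made up for the example.

```python
import numpy as np

def volume(C):
    """Volume of the parallelepiped spanned by the columns of C."""
    # product of singular values, equal to sqrt(det(C^T C))
    return np.prod(np.linalg.svd(C, compute_uv=False))

A = np.array([[1.0, 0.9, 0.0],
              [0.0, 0.1, 1.0],
              [0.0, 0.0, 0.0]])

v_correlated = volume(A[:, [0, 1]])     # two nearly parallel columns
v_uncorrelated = volume(A[:, [0, 2]])   # two orthogonal columns
print(v_correlated, v_uncorrelated)
```

Nearly parallel (highly correlated) columns span a thin sliver with small volume, so maximizing the volume favors columns that are as uncorrelated as possible.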
49Prior work on the CSSP
[Figure: timeline of prior work, 1965 to 2000.]
Boutsidis, Mahoney, and Drineas (2009) SODA, under review in Num Math
50The greedy QR algorithm
We use a standard greedy approach (the
Rank-Revealing QR factorization). The algorithm
performs k iterations:
 In the first iteration, the top PCAIM is picked.
 In the second iteration, a PCAIM is picked that is as uncorrelated with the previously selected PCAIM as possible.
 In the third iteration, the chosen PCAIM has to be as uncorrelated as possible with the first two previously selected PCAIMs.
 And so on.
Efficient implementations are available, and run
in minutes for typical values of m, c, and k.
Paschou et al (2008) PLoS Genetics
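The greedy idea can be sketched as QR with column pivoting, written out in Gram-Schmidt style. This is a simplified illustration, not the implementation used in the cited work: the function name `greedy_column_selection` and the toy matrix are assumptions, the first pick is simply the column of largest norm (how columns would be scaled so that this coincides with the top-scoring PCAIM is not specified here), and k is assumed not to exceed the rank of A.

```python
import numpy as np

def greedy_column_selection(A, k):
    """Pick k columns greedily: largest residual norm first, then deflate."""
    R = np.array(A, dtype=float)    # residual copy; A itself is not modified
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        if chosen:
            norms[chosen] = -1.0    # never pick the same column twice
        j = int(np.argmax(norms))
        chosen.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        # remove from every column its component along the picked direction
        R -= np.outer(q, q @ R)
    return chosen

# Toy matrix: column 1 is almost a copy of column 0
A = np.array([[2.0, 1.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.5]])
selected = greedy_column_selection(A, k=3)
print(selected)
```

Column 1 is skipped even though it has the second-largest norm, because almost nothing of it remains once column 0 has been projected out.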
51A European American sample
 Datasets
   CHORI dataset
     980 European Americans, 300,000 SNPs (Illumina chip)
     Simon et al (2006) Am J Cardiol; Albert et al (2001) JAMA
   Coriell dataset
     541 European Americans, 300,000 SNPs (Illumina chip)
     Fung et al (2006) LANCET
   HapMap
     90 Yoruba (YRI), 90 CEPH (CEU), 90 Han Chinese and Japanese (CHB+JPT)
Paschou et al (2008) PLoS Genetics
52[Figure: PCA plot of European-American populations and HapMap 2 populations, with clusters labeled Europeans, Africans, and Asians. The top three eigenSNPs are presented.]
Paschou et al (2008) PLoS Genetics
54POPRES
 The Population Reference Sample
   6,000 individuals of European, African-American, East Asian, South Asian, and Mexican origin
   Genotyped with Affy 500K array set
   Nelson et al (2008) Am J Hum Genet
 3,192 Europeans
   Correlation between genetic structure and geographic origin
   Lao et al (2008) Current Biology
   Novembre et al (2008) Nature
 We analyzed 1,387 individuals from Novembre et al.
   Randomly included only 200 UK and 125 Swiss French (to even out sample sizes)
   Excluded
     Europeans not sampled in Europe
     Putative relatives
     Outliers in preliminary PCA
56Conclusions
 Using linear algebraic techniques (e.g., matrix decompositions) we selected markers that capture population structure.
 Our technique requires no prior assumptions and builds upon the power of SVD and PCA to identify population structure in various settings, including admixed populations.
 Prior theoretical work and mathematical understanding of the underlying problem were fundamental in designing our algorithm!
57Future research
 Unsupervised dimensionality reduction techniques are NOT successful in separating cases from controls in GWAS studies.
 Why?
   Because the disease signal is too weak.
 Potential remedies?
   Supervised techniques (Fisher Discriminant Analysis or LDA, etc.).
   Sparse approximations for regression problems.
 Goal?
   Design a global test that may help uncover effects of gene-gene interactions in disease risk.
58Acknowledgements
 Collaborators
   P. Paschou, Democritus University, Greece
   E. Ziv, UCSF
   E. Burchard, UCSF
   K. K. Kidd, Yale University
   M. Shriver, Penn State
   R. Krauss, Oakland Research Institute
 Students
   Asif Javed, RPI (now at IBM)
   Jamey Lewis, RPI
Funding: NSF (Drineas), Tourette Syndrome Association (Paschou), NIH (Ziv)
 P. Paschou, M. W. Mahoney, A. Javed, J. Kidd, A. Pakstis, S. Gu, K. Kidd, and P. Drineas. (2007) Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), pp. 96-107.
 P. Paschou, E. Ziv, E. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas. (2007) PCA-correlated SNPs for structure identification in worldwide human populations, PLoS Genetics, 3(9), pp. 1672-1686.
 P. Paschou, P. Drineas, J. Lewis, C. Nievergelt, D. Nickerson, J. Smith, P. Ridker, D. Chasman, R. Krauss, and E. Ziv. (2008) Tracing sub-structure in the European American population with PCA-informative markers, PLoS Genetics, 4(7), pp. 1-13.
 M. W. Mahoney and P. Drineas. (2009) CUR matrix decompositions for improved data analysis, Proceedings of the National Academy of Sciences, 106(3), pp. 697-702.