Title: Identifying conserved spatial patterns in genomes
1Identifying conserved spatial patterns in genomes
- Rose Hoberman
- Computer Science Department
- Carnegie Mellon University
University of Chicago, Oct 20, 2006
2My focus: Spatial Comparative Genomics
- Understanding genome structure, especially how
the spatial arrangement of elements within the
genome changes and evolves.
3A simple model of a genome
4Genomic Change
[Figure: an ancestral genome gives rise, through speciation, to species 1 and species 2. Labeled sources of change: sequence mutation, chromosomal rearrangements]
5Types of Genomic Rearrangements
- Inversions
- Duplications/Insertions
- Loss
[Figure: numbered genes on a chromosome illustrating each rearrangement]
6Types of Genomic Rearrangements
- Inversions
- Duplications/Insertions
- Loss
- Fissions and fusions
[Figure: numbered genes on chromosomes illustrating each rearrangement]
7An Essential Task for Spatial Comparative Genomics
Identify chromosomal regions that descended from
the same region in the genome of the common
ancestor
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
8Outline
- Introduction and Motivation
- Evolution of spatial organization
- Applications: why identify related genomic regions?
- Problem Background
- Why is this challenging?
- Introduction to cluster finding
- Results
- Statistics for pairwise clusters
- Statistics for three-way clusters
9Identification of homologous chromosomal segments
is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
Pevzner, Tesler. Genome Research 2003
10Identification of homologous chromosomal segments
is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
Guillaume Bourque et al. Genome Research 2004
11Identification of homologous chromosomal segments
is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
[Figure: an ancestral chromosome undergoes whole genome duplication, producing chromosome 1 and chromosome 2]
12Identification of homologous chromosomal segments
is a key task in comparative genomics
- Genome evolution
- Reconstruct history of chromosomal rearrangements
- Infer ancestral genetic map
- Phylogeny reconstruction
- Identify ancient whole genome duplications
McLysaght et al Nature Genetics, 2002.
13Identification of conserved chromosomal structure
is a key task in comparative genomics
- Understand gene function and regulation in bacteria
- Infer functional associations
- Predict operons
- Identify horizontal transfers
[Figure labels: Insertion, Loss, Inversions]
14Outline
- Introduction and Motivation
- Evolution of spatial organization
- Applications: why identify related genomic regions?
- Problem Background
- Why is this challenging?
- Introduction to cluster finding
- Results
- Statistics for pairwise clusters
- Statistics for three-way clusters
15Closely related genomes
[Figure: numbered genes along the chromosomes of Species 1 and Species 2]
Related regions are easy to identify
16Five hundred million years...
17More Distantly Related Genomes
[Figure: numbered genes along the chromosomes of two more distantly related genomes]
- Homologous regions are harder to detect, but there is still spatial evidence of common ancestry
- Similar gene content
- Neither gene content nor order is perfectly preserved
18The signature of diverged regions
[Figure: numbered genes along the chromosomes of two diverged genomes]
- Gene clusters
- Similar gene content
- Neither gene content nor order is perfectly preserved
19A Framework for Identifying Gene Clusters
- Find homologous genes
- Formally define a gene cluster
- Design an algorithm to find clusters
- Statistically verify clusters
20Why Validate Clusters Statistically?
- After sufficient time has passed, gene order will become randomized
- Uniform random data tends to be clumpy
- Some genes will end up close together in both genomes simply by chance
[Figure: numbered genes along the chromosomes of two genomes]
21Cluster Statistics
Cluster models   r-windows                  max-gap
2 regions        Durand and Sankoff, 2003   my work
3 regions        my work                    unsolved
22Outline
- Introduction and Motivation
- Evolution of spatial organization
- Applications: why identify related genomic regions?
- Problem Background
- Why is this challenging?
- Introduction to cluster finding
- Results
- Statistics for pairwise clusters
- Statistics for three-way clusters
23A Framework for Identifying Gene Clusters
- Find homologous genes
- Formally define a gene cluster
- Design an algorithm to find clusters
- Statistically verify clusters
24Gene Homology
- Identification of homologous gene pairs
- generally based on sequence similarity
- conserved genomic context is also informative
- Assumptions
- matches are binary (similarity scores are discarded)
- each gene is homologous to at most one gene in the other genome
25A Framework for Identifying Gene Clusters
- Find homologous genes
- Formally define a gene cluster
- Devise an algorithm to identify clusters
- Statistically verify clusters
26Where are the gene clusters?
- Intuitive notion: pairs of regions that are dense with homologs
- How can we formalize this intuition?
27The r-window cluster definition
r = 6
- Two windows of size r that share at least m homologous gene pairs
- r is a user-specified parameter
(Cavalcanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)
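The r-window definition above can be sketched in a few lines of Python. This is an illustrative brute-force scan, not the method of any of the cited papers; the function names and the toy genomes are hypothetical, and homologous genes are assumed to carry the same identifier in both genomes.

```python
def shared_homologs(window1, window2):
    """Count genes present in both windows (homologs share one identifier)."""
    return len(set(window1) & set(window2))


def r_window_clusters(genome1, genome2, r, m):
    """Return (i, j, shared) for every pair of r-windows sharing at least m genes.

    i and j are the 0-based start positions of the windows in each genome.
    """
    hits = []
    for i in range(len(genome1) - r + 1):
        w1 = genome1[i:i + r]
        for j in range(len(genome2) - r + 1):
            shared = shared_homologs(w1, genome2[j:j + r])
            if shared >= m:
                hits.append((i, j, shared))
    return hits


# Toy genomes (hypothetical): genes 1, 2 and 3 occur in both.
genome1 = [1, 2, 3, 4, 5, 6, 7, 8]
genome2 = [9, 3, 10, 1, 11, 2, 12, 13]
print(r_window_clusters(genome1, genome2, r=6, m=3))
```

The scan is quadratic in genome length, which is acceptable for a sketch; it makes no claim about how window pairs are enumerated in practice.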
28A max-gap chain
g? 2
gap? 3
- The distance or gap between genes is equal to the number of intervening genes
- A set of genes in a genome form a max-gap chain if
- the gap between adjacent genes is never greater than g (a user-specified parameter)
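As a small illustration of the chain definition above, the check can be written directly from the two rules. This is a sketch under the assumption that each gene is represented by its integer position in the genome; the function names are hypothetical.

```python
def gaps(positions):
    """Gaps between adjacent genes: the number of intervening genes."""
    ordered = sorted(positions)
    return [b - a - 1 for a, b in zip(ordered, ordered[1:])]


def is_max_gap_chain(positions, g):
    """True if no gap between adjacent genes is greater than g."""
    return all(gap <= g for gap in gaps(positions))


print(gaps([2, 3, 6, 8]))                 # gaps between adjacent chain members
print(is_max_gap_chain([2, 3, 6, 8], 2))  # every gap is at most 2
print(is_max_gap_chain([2, 3, 7], 2))     # the gap between 3 and 7 is 3
```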
29The max-gap cluster definition
gap? 3
g? 2
g? 3
- A set of genes form a max-gap cluster in two genomes if
- the genes form a max-gap chain in each genome
- the cluster is maximal (i.e. not contained within a larger cluster)
30A Framework for Identifying Gene Clusters
- Find homologous genes
- Formally define a gene cluster
- Devise an algorithm to identify clusters
- Statistically verify clusters
31Max-gap search algorithms
- Many genomic studies use a max-gap criterion
- Each group designs their own search algorithm
- These are often greedy algorithms, but greedy algorithms miss disordered max-gap clusters
- Hoberman et al, RECOMB Comp Genomics 2005
32Greedy, Agglomerative Algorithms
g = 2
- initialize a cluster as a single homologous pair
- search for a gene in proximity on both chromosomes
- either extend the cluster and repeat, or terminate
33Greedy Algorithms Impose Order Constraints
g = 2
A max-gap cluster of size four
- A greedy, agglomerative algorithm will not find
this cluster since there is no max-gap cluster of
size two
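The point of this slide can be checked on a small constructed example. The sketch below is an assumption-laden illustration: the gene names, the positions, and the particular greedy variant (extend with any gene that lies within gap g of some cluster member in both genomes) are hypothetical and are not taken from the talk's figure.

```python
def gap(p, q):
    """Number of intervening genes between positions p and q."""
    return abs(p - q) - 1


def is_max_gap_chain(positions, g):
    ordered = sorted(positions)
    return all(gap(a, b) <= g for a, b in zip(ordered, ordered[1:]))


def greedy_cluster(seed, pos1, pos2, g):
    """One simple greedy variant: grow from a seed gene, adding any gene that
    is within gap g of some cluster member in BOTH genomes."""
    cluster = {seed}
    extended = True
    while extended:
        extended = False
        for gene in pos1:
            if gene in cluster:
                continue
            near1 = any(gap(pos1[gene], pos1[c]) <= g for c in cluster)
            near2 = any(gap(pos2[gene], pos2[c]) <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                extended = True
    return cluster


# Positions of four matched genes; the other positions hold unmatched genes.
pos1 = {"A": 0, "B": 3, "C": 6, "D": 9}   # order in genome 1: A B C D
pos2 = {"C": 0, "A": 3, "D": 6, "B": 9}   # order in genome 2: C A D B
g = 2

chain_in_both = (is_max_gap_chain(pos1.values(), g)
                 and is_max_gap_chain(pos2.values(), g))
greedy_sizes = {s: len(greedy_cluster(s, pos1, pos2, g)) for s in sorted(pos1)}
print(chain_in_both)   # the four genes form a max-gap chain in each genome
print(greedy_sizes)    # size of the greedy cluster grown from each seed
```

In this configuration the pairs that are close in genome 1 (A-B, B-C, C-D) are never the pairs that are close in genome 2 (C-A, A-D, D-B), so no seed can be extended even though all four genes together satisfy the definition.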
34Algorithms and Definition Mismatch
- Greedy, bottom-up algorithms will not find all max-gap clusters
- There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002)
- Cluster statistics depend on the search space, which depends on which algorithm is used
35A Framework for Identifying Gene Clusters
- Find homologous genes
- Formally define a gene cluster
- Devise an algorithm to identify clusters
- Statistically verify clusters
An example
36Example: Whole-genome self-comparison to detect duplicated blocks
- Chose g = 30
- Compared all human chromosomes to all other
chromosomes to find max-gap gene clusters
McLysaght et al. Nature Genetics, 2002.
37How can we use statistical analysis?
- Could two regions display this degree of similarity simply by chance?
- Is g = 30 a reasonable choice of gap size?
- Are larger clusters less likely to occur by chance?
- How large does a cluster have to be before we are surprised to observe it?
[Figure labels: Chr 17; Chr 3; 10 genes duplicated out of 100; 29 genes; 3 genes duplicated out of 25; 13 genes]
McLysaght et al. Nature Genetics, 2002.
38Outline
- Introduction and Motivation
- Evolution of spatial organization
- Applications: why identify related genomic regions?
- Problem Background
- Why is this challenging?
- Introduction to cluster finding
- Results
- Statistics for pairwise clusters
- Statistics for three-way clusters
39Cluster Statistics
Cluster models   r-windows                  max-gap
2 regions        Durand and Sankoff, 2003   my work
3 regions        my work                    unsolved
40The max-gap definition is the most widely used
cluster definition in genomic analyses
- Overbeek et al 1999: inferring functional coupling of genes in bacteria
- Vision et al 2000: origins of genomic duplications in Arabidopsis
- Friedman and Hughes 2001: gene duplication and structure of eukaryotic genomes
- Tamames 2001: evolution of gene order conservation in prokaryotes
- Vandepoele et al 2002: microcolinearity between Arabidopsis and rice
- McLysaght et al 2002: genomic duplication during early chordate evolution
- Simillion et al 2002: hidden duplications in Arabidopsis
- Blanc et al 2003: recent polyploidy in Arabidopsis
- Luc et al 2003: gene teams for comparative genomics
- Chen et al 2004: operon prediction in newly sequenced bacteria
- Bourque et al 2005: comparison of mammalian and chicken genome architectures
Yet there is no formal statistical model for max-gap clusters
41The Question
Suppose two whole genomes were compared, and this
max-gap cluster was identified
- Is this cluster biologically meaningful?
- Could it have occurred in a comparison of two
random genomes?
42Statistical Testing
- Hypothesis testing
- Alternative hypothesis: shared ancestry
- Null hypothesis: random gene order
- Discard clusters that could have arisen under the null model
- Determine the probability of observing a similar cluster under the null hypothesis
43The Problem
h = 4
- Given an allowed gap size of g, what is the probability of observing a max-gap cluster containing exactly h matching gene pairs?
- How do we calculate this probability?
44The Inputs
n = 22, m = 6, h = 4, g = 2
- n: number of genes in each genome
- m: number of matching gene pairs
- g: the maximum gap allowed in a cluster
- h: number of matching genes in the cluster
45The probability when m = n
If gene content is identical
- the probability of a max-gap cluster is 1
- (regardless of the allowed gap size)
46Probability of a cluster of size h when m < n
[Figure labels: m genes, m-h genes, h genes]
Basic approach: Enumerate all ways to
- Place m-h remaining genes so they do not extend
the cluster
- Create chains of h genes in both genomes
- Normalize to get a probability
47Probability of observing a cluster of size h
P(cluster of size h) =
(number of ways to place h genes so they form a chain in both genomes)
× (number of ways to place m-h remaining genes so they do not extend the cluster)
/ (number of configurations of m gene pairs in two genomes of size n)
49Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
- Select h spots in each genome, so they form a max-gap chain
- Choose h genes to compose the cluster
- Assign each gene to a selected spot in each genome
Hoberman, Sankoff, Durand. RECOMB Comparative Genomics, 2004
50Probability of observing a cluster of size h
P(cluster of size h) =
(number of ways to place h genes so they form a chain in both genomes)
× (number of ways to place m-h remaining genes so they do not extend the cluster)
/ (number of configurations of m gene pairs in two genomes of size n)
51Counting the number of ways to place m-h genes
outside the cluster
g = 1, h = 3
- Approach
- design a rule specifying where the genes can be placed so that the cluster is not extended
- count the positions that satisfy the rule
52Counting the number of ways to place m-h genes
outside the cluster
gaps = 1, g = 1
- Rule 1: A gene can go anywhere except in the cluster (the white box).
Too lenient → overcounts
53Counting the number of ways to place m-h genes
outside the cluster
g 1
g 1
g 1
- Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box).
Too strict → undercounts
54Counting the number of ways to place m-h genes
outside the cluster
gap > 1
g = 1
h = 3
gap > 1
- Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box).
Too strict → undercounts
55Counting the number of ways to place m-h genes
outside the cluster
g = 1
gap > 1
- Rule 3: At most one member of each gene pair can be in the grey box.
Too lenient → overcounts
56Counting the number of ways to place m-h genes
outside the cluster
gaps = 1, g = 1
- Rule 3: At most one member of each gene pair can be in the grey box.
Too lenient → overcounts
57Counting the number of ways to place m-h genes
outside the cluster
g = 1
- Acceptable positions for a gene depend on the positions of the remaining genes
- Use strict and lenient rules to calculate upper and lower bounds on G, the number of ways to place the m-h remaining genes
58Estimating G
- Upper bound
- Erroneously enumerates this configuration
- Lower bound
- Fails to enumerate this configuration
59Probability of observing a cluster of size h
(number of ways to place h genes so they form a chain in both genomes)
× (number of ways to place m-h remaining genes so they do not extend the cluster)
Hoberman, Sankoff, Durand. Journal of Computational Biology, 2005
60What can we learn from this statistical result?
- Are we less likely to observe a large cluster (containing more gene pairs) than a small cluster?
- How large does a cluster have to be before we are surprised to observe it?
- How do we choose the maximum allowed gap value?
61Whole-genome comparison cluster statistics
n = 1000, m = 250, g = 20
With a significance threshold of 10^-4, any cluster containing 8 or more genes is significant.
[Figure axis label: h (cluster size)]
62E. coli vs B. subtilis
Algorithm: Bergeron et al, 02. Statistics: Hoberman et al, 05
Clusters above the orange line are significant at the 0.001 level
Complete cluster doesn't form until g = 110
Typical operon sizes
By g = 9, probabilities no longer decrease monotonically with cluster size
Under null hypothesis, by g = 25 all genes should form a single cluster
63- Statistical analysis of max-gap gene clusters
- Data analysis: Identifies statistically significant max-gap clusters
- Search: Suggests a more principled approach for choosing a gap size that will yield significant clusters
- Modeling: Provides insight into criteria for cluster definitions
- larger clusters should be less likely to occur by chance
64Outline
- Introduction and Motivation
- Evolution of spatial organization
- Applications: why identify related genomic regions?
- Problem Background
- Why is this challenging?
- Introduction to cluster finding
- Results
- Statistics for pairwise clusters
- Statistics for three-way clusters
65Cluster Statistics
Cluster models   r-windows                  max-gap
2 regions        Durand and Sankoff, 2003   my work
3 regions        my work                    unsolved
Joint work with Narayanan Raghupathy
66Duplicated segments can be particularly difficult
to identify
[Figure: an ancestral chromosome undergoes whole genome duplication, producing chromosome 1 and chromosome 2]
67Duplicated segments can be particularly difficult
to identify
[Figure: an ancestral chromosome undergoes whole genome duplication, followed by reciprocal gene loss on chr 1 and chr 2]
68- Comparisons of multiple genomes offer
significantly more power to detect highly
diverged homologous segments
[Figure labels: Arabidopsis thaliana region 1, Rice chr 2, Arabidopsis thaliana region 2]
Simillion et al, PNAS 2002
69Existing Statistical Approaches
[Figure: three overlapping windows W1, W2, W3 with gene counts x1, x2, x3, x12, x13, x23, and x123]
- Only consider x123, genes shared between all regions
- disregards pairwise overlaps
- Conduct multiple pairwise comparisons
- does not consider the greater impact of genes shared among all three regions
- requires that at least two pairwise clusters are independently significant
No existing statistical approach considers all these quantities in a single test.
Methods reviewed in Simillion et al, Bioessays, 2004
Durand and Sankoff, J Comp Biology, 2003
70r-windows for pairwise comparison
- Two windows of size r that share at least m
homologous gene pairs
71With two regions there is a natural test statistic
- An r-window cluster spanning two regions can be
characterized by the number of shared genes, m.
The probability that two random windows share m
genes
Durand and Sankoff, J Comp Biology, 2003
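For intuition about this test statistic, here is a minimal sketch under a deliberately simplified null model. It assumes both genomes contain the same n genes in independent random orders, so that the number of genes shared by two windows of size r follows a hypergeometric distribution. The published expressions may treat more general settings; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_shared_exactly(n, r, m):
    """P(two random r-windows share exactly m genes), simplified null model:
    identical gene content (n genes) and random gene order in both genomes."""
    return Fraction(comb(r, m) * comb(n - r, r - m), comb(n, r))


def prob_shared_at_least(n, r, m):
    """Tail probability: the two windows share m or more genes."""
    return sum(prob_shared_exactly(n, r, k) for k in range(m, r + 1))


# Window size and genome size borrowed from the n = 5000, r = 100 example
# used elsewhere in the talk; the threshold m = 3 is an arbitrary choice.
print(float(prob_shared_at_least(5000, 100, 3)))
```

Exact rational arithmetic is used so that the probabilities sum to exactly one, which makes the sketch easy to sanity-check.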
72With three regions there are many more quantities
to consider
- Three-way overlap: x123
- Pairwise overlaps: x12, x23, x13
[Figure: three overlapping windows with gene counts x1, x2, x3, x12, x13, x23, and x123]
We want to consider both types of overlaps
73Our r-window statistics for three windows
First test that uses both pairwise and three-way overlaps
[Figure: three overlapping windows with gene counts x1, x2, x3, x12, x13, x23, and x123]
- Given three randomly selected windows of length r, we derived equations for the probability of observing such a cluster under a null hypothesis of random gene order.
Raghupathy, Hoberman, Durand. APBC 2007.
74How serious are these limitations?
Our derived expressions can be used to compare
pairwise and three-way cluster probabilities for
typical genome parameters and window sizes.
- Pairwise tests have a number of limitations
- They do not consider the greater impact of genes shared among all three regions
- They require that at least two pairwise comparisons are independently significant
75What is the impact of ignoring three-way genes?
Each pair of regions shares exactly three genes
a) Three genes are shared by all three windows
b) Three distinct genes are shared by each pair of windows
[Figure: overlap diagrams for the two cases, with regions labeled 3 and 0]
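The two cases on this slide can be reproduced with a short sketch. It assumes each window is a set of gene identifiers, and that x12, x13 and x23 count genes shared by exactly two windows (excluding the three-way overlap x123), which is how the counts on the slide read. The gene names are hypothetical.

```python
def overlap_counts(w1, w2, w3):
    """Three-way overlap x123 and the pairwise-only overlaps x12, x13, x23."""
    w1, w2, w3 = set(w1), set(w2), set(w3)
    x123 = len(w1 & w2 & w3)
    return {
        "x123": x123,
        "x12": len(w1 & w2) - x123,
        "x13": len(w1 & w3) - x123,
        "x23": len(w2 & w3) - x123,
    }


def pairwise_shared(w1, w2, w3):
    """What a pairwise test sees: total genes shared by each pair of windows."""
    w1, w2, w3 = set(w1), set(w2), set(w3)
    return (len(w1 & w2), len(w1 & w3), len(w2 & w3))


# Case a): three genes shared by all three windows.
case_a = ({"a", "b", "c", "p"}, {"a", "b", "c", "q"}, {"a", "b", "c", "r"})
# Case b): three distinct genes shared by each pair of windows.
case_b = ({"d", "e", "f", "g", "h", "i"},
          {"d", "e", "f", "j", "k", "l"},
          {"g", "h", "i", "j", "k", "l"})

for name, windows in (("a", case_a), ("b", case_b)):
    print(name, overlap_counts(*windows), pairwise_shared(*windows))
```

Both cases give the same pairwise totals, while the overlap counts differ, which is the distinction a pairwise-only comparison cannot see.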
76Which cluster is least likely to occur by chance?
n = 5000, r = 100
[Figure: overlap diagrams with regions labeled h and 0]
77(a) is less likely to occur than (b)
but pairwise tests score them identically
a) Three genes are shared by all three windows
b) Three distinct genes are shared by each pair of windows
Pairwise tests cannot distinguish these two clusters
78How much of a limitation is this?
Our derived expressions can be used to compare
pairwise and three-way cluster probabilities for
typical genome parameters and window sizes.
- Pairwise tests have a number of limitations
- They do not consider the greater impact of genes shared among all three regions
- They require that at least two pairwise comparisons are independently significant
79When no genes appear in all three regions, are
pairwise statistical tests sufficient?
n = 5000, r = 100, x123 = 0
α = 0.001
[Figure labels: Two Pairwise Tests, Three-window, h]
80When no genes appear in all three regions, are
pairwise statistical tests sufficient?
- The three-window test is strictly more general than two pairwise tests
- The three-window test will always reject the null hypothesis for any cluster in which the pairwise tests reject the null hypothesis
[Figure labels: Two Pairwise Tests, Three-window]
81These results suggest that multi-region tests can
identify more distantly related homologous
regions.
- There is an ongoing debate about whether there were one or two whole genome duplications in early vertebrates
- More powerful statistical tests might help resolve this issue
82Summary
- First statistical tests for max-gap clusters, the most commonly used definition
- determines minimum significant cluster size
- allows more principled selection of gap parameter
- reveals problems with max-gap definition
- More sensitive statistical tests for comparisons of three genomic regions
- first test to consider both pairwise and three-way overlaps
- can identify more distantly related regions
- may be able to resolve outstanding questions in molecular evolution
83Acknowledgements
- Collaborators
- Dannie Durand, Biological Sciences and Computer Science, CMU
- David Sankoff, Math and Statistics, University of Ottawa
- Narayanan Raghupathy, Biological Sciences, CMU
- Funders
- Barbara Lazarus Women@IT Fellowship
- The Sloan Foundation
84Thanks
85Questions?
94Advantages of an analytical approach
- Analyzing incomplete datasets
- Principled parameter selection
- Understanding statistical trends
- Insight into tradeoffs between definitions
- Efficiency
- Accuracy
95The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so there are at least L-1 places left
Positions: 1 2 3 4 5 ... n-L+1 ... n
The maximum length of the chain is L = (h-1)g + h
96The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
97The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
Positions: 1 2 3 4 5 ... n-L+1 ... n
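The counting argument on these slides can be checked numerically on a small case. The sketch below assumes positions are numbered 1 to n. `interior_chain_count` is the product described above (starting positions that leave room for a chain of maximum length L, times the choices for each of the h-1 gaps), and the brute-force count also includes chains that start near the end of the genome. The function names are hypothetical.

```python
from itertools import combinations


def max_chain_length(h, g):
    """Longest possible chain: h genes plus h-1 gaps of size g."""
    return (h - 1) * g + h


def interior_chain_count(n, h, g):
    """Chains whose leftmost gene leaves at least L-1 positions to its right:
    (n - L + 1) starting positions times (g + 1) choices for each of h-1 gaps."""
    L = max_chain_length(h, g)
    return max(n - L + 1, 0) * (g + 1) ** (h - 1)


def brute_force_chain_count(n, h, g):
    """Count every set of h positions in 1..n that forms a max-gap chain."""
    count = 0
    for spots in combinations(range(1, n + 1), h):
        if all(b - a - 1 <= g for a, b in zip(spots, spots[1:])):
            count += 1
    return count


n, h, g = 10, 3, 1
print(interior_chain_count(n, h, g))     # chains away from the end of the genome
print(brute_force_chain_count(n, h, g))  # all chains, including those near the end
```

The difference between the two numbers is the contribution of chains near the end of the genome, where the gaps and their sum are both constrained.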
98Counting clusters at the end of the genome
- Gaps are constrained
- And sum of gaps is constrained
l = w-1
l = m
99A known solution
[Figure labels: gaps g1, g2, g3, ..., gm-1; l < w]
100Counting clusters at the end of the genome
- Gaps are constrained
- And sum of gaps is constrained
l = w-1
l = m
101Cluster Length
[Figure: triangular table of terms d(m,g,m), d(m,g,m+1), ..., d(m,g,w-2), d(m,g,w-1), with columns for cluster lengths m, m+1, ..., w-2, w-1, w, rows from l = w-1 down to l = m, and a marked line of symmetry]
102Exploiting Symmetry
[Figure: gap values (g, g-1, g-2, 2, 1) listed for chain lengths l = m, m+1, m+2 and w-2, w-1, w]
103Cluster Length
[Figure: triangular table of terms d(m,g,m), d(m,g,m+1), ..., d(m,g,w-2), d(m,g,w-1), with columns for cluster lengths m, m+1, ..., w-2, w-1, w, rows from l = w-1 down to l = m, and entries (g+1)^(m-1) added]
104Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Starting positions near end
Starting positions
Ways to place remaining h-1 genes