Title: Identifying conserved spatial patterns in genomes


1
Identifying conserved spatial patterns in genomes
  • Rose Hoberman
  • Computer Science Department
  • Carnegie Mellon University


University of Chicago Oct 20, 2006
2
My focus: Spatial Comparative Genomics
  • Understanding genome structure, especially how
    the spatial arrangement of elements within the
    genome changes and evolves.

3
A simple model of a genome
  • an ordered list of genes

4
Genomic Change
[Figure labels: Ancestral genome; speciation; species 1; species 2]
Sequence Mutation, Chromosomal Rearrangements
5
Types of Genomic Rearrangements
  • Inversions
  • Duplications/Insertions
  • Loss

6
Types of Genomic Rearrangements
  • Inversions
  • Duplications/Insertions
  • Loss
  • Fissions and fusions

7
An Essential Task for Spatial Comparative Genomics
Identify chromosomal regions that descended from
the same region in the genome of the common
ancestor
[Figure: gene-order maps of Species 1 and Species 2]
8
Outline
  • Introduction and Motivation
  • Evolution of spatial organization
  • Applications: why identify related genomic
    regions?
  • Problem Background
  • Why is this challenging?
  • Introduction to cluster finding
  • Results
  • Statistics for pairwise clusters
  • Statistics for three-way clusters

9
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

Pevzner, Tesler. Genome Research 2003
10
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

Guillaume Bourque et al. Genome Research 2004
11
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

[Figure labels: Ancestral chromosome; Whole genome duplication; chromosome 1; chromosome 2]
12
Identification of homologous chromosomal segments
is a key task in comparative genomics
  • Genome evolution
  • Reconstruct history of chromosomal rearrangements
  • Infer ancestral genetic map
  • Phylogeny reconstruction
  • Identify ancient whole genome duplications

McLysaght et al Nature Genetics, 2002.
13
Identification of conserved chromosomal structure
is a key task in comparative genomics
  • Understand gene function and regulation in
    bacteria
  • Infer functional associations
  • Predict operons
  • Identify horizontal transfers

[Figure labels: Insertion; Loss; Inversions]
14
Outline
  • Introduction and Motivation
  • Evolution of spatial organization
  • Applications: why identify related genomic
    regions?
  • Problem Background
  • Why is this challenging?
  • Introduction to cluster finding
  • Results
  • Statistics for pairwise clusters
  • Statistics for three-way clusters

15
Closely related genomes
[Figure: gene-order maps of Species 1 and Species 2]
Related regions are easy to identify
16
Five hundred million years...
17
More Distantly Related Genomes
  • Homologous regions are harder to detect, but
    there is still spatial evidence of common
    ancestry
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

18
The signature of diverged regions
  • Gene clusters
  • Similar gene content
  • Neither gene content nor order is perfectly
    preserved

19
A Framework for Identifying Gene Clusters
  1. Find homologous genes
  2. Formally define a gene cluster
  3. Design an algorithm to find clusters
  4. Statistically verify clusters

20
Why Validate Clusters Statistically?
  • After sufficient time has passed, gene order will
    become randomized
  • Uniform random data tends to be clumpy
  • Some genes will end up close together in both
    genomes simply by chance
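
The point about chance proximity can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper names (`count_close_pairs`, `random_positions`) are hypothetical, gene order is modeled as uniformly random, and the values n = 1000, m = 250, g = 20 are borrowed from the whole-genome example later in the talk.

```python
import random

def count_close_pairs(pos1, pos2, g):
    """Count pairs of homologs separated by at most g intervening genes
    in BOTH genomes. pos1/pos2 map a gene id to its position."""
    genes = sorted(pos1)
    close = 0
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            gap1 = abs(pos1[a] - pos1[b]) - 1
            gap2 = abs(pos2[a] - pos2[b]) - 1
            if gap1 <= g and gap2 <= g:
                close += 1
    return close

def random_positions(n, m, rng):
    """Place m homologous genes at distinct random positions among n genes."""
    return dict(zip(range(m), rng.sample(range(n), m)))

rng = random.Random(0)          # fixed seed so the run is repeatable
n, m, g = 1000, 250, 20         # assumed parameters, see lead-in
pos1 = random_positions(n, m, rng)
pos2 = random_positions(n, m, rng)
print(count_close_pairs(pos1, pos2, g))
```

Any nonzero count here arises from random order alone, which is why observed clusters need a statistical test before they are read as evidence of common ancestry.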

21
Cluster Statistics
Cluster Models:  r-windows                 max-gap
2 regions        Durand and Sankoff, 2003  my work
3 regions        my work                   unsolved
22
Outline
  • Introduction and Motivation
  • Evolution of spatial organization
  • Applications: why identify related genomic
    regions?
  • Problem Background
  • Why is this challenging?
  • Introduction to cluster finding
  • Results
  • Statistics for pairwise clusters
  • Statistics for three-way clusters

23
A Framework for Identifying Gene Clusters
  1. Find homologous genes
  2. Formally define a gene cluster
  3. Design an algorithm to find clusters
  4. Statistically verify clusters

24
Gene Homology
  • Identification of homologous gene pairs
  • generally based on sequence similarity
  • conserved genomic context is also informative
  • Assumptions
  • matches are binary (similarity scores are
    discarded)
  • each gene is homologous to at most one other gene
    in the other genome

25
A Framework for Identifying Gene Clusters
  • Find homologous genes
  • Formally define a gene cluster
  • Devise an algorithm to identify clusters
  • Statistically verify clusters

26
Where are the gene clusters?
  • Intuitive notion: pairs of regions that are dense
    with homologs
  • How can we formalize this intuition?

27
The r-window cluster definition
r = 6
  • Two windows of size r that share at least m
    homologous gene pairs
  • r is a user-specified parameter

(Cavalcanti et al 03, Durand and Sankoff 03,
Friedman and Hughes 01, Raghupathy and Durand 05)
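
The r-window definition can be sketched as a brute-force scan. This is an illustration only, not the method of the cited papers: the function name `r_window_clusters` is hypothetical, and each genome is assumed to be a list of gene-family ids in which homologous genes carry the same id.

```python
def r_window_clusters(genome1, genome2, r, m):
    """Return (i, j, shared) for every pair of windows of size r,
    starting at index i in genome1 and j in genome2, that share
    at least m homologous genes."""
    hits = []
    for i in range(len(genome1) - r + 1):
        window1 = set(genome1[i:i + r])
        for j in range(len(genome2) - r + 1):
            shared = window1 & set(genome2[j:j + r])
            if len(shared) >= m:
                hits.append((i, j, sorted(shared)))
    return hits

genome1 = [1, 2, 3, 4, 5, 6, 7, 8]
genome2 = [9, 3, 1, 10, 2, 11, 12, 13]
print(r_window_clusters(genome1, genome2, r=4, m=3))
```

The scan compares every window pair, so it is quadratic in genome length; it is meant to make the definition concrete rather than to be efficient.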
28
A max-gap chain
g? 2
gap? 3
  • The distance or gap between genes is equal to
    the number of intervening genes
  • A set of genes in a genome form a max-gap chain
    if
  • the gap between adjacent genes is never greater
    than g (a user-specified parameter)

29
The max-gap cluster definition
gap? 3
g? 2
g? 3
  • A set of genes forms a max-gap cluster in two
    genomes if
  • the genes form a max-gap chain in each genome
  • the cluster is maximal (i.e. not contained within
    a larger cluster)
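
The chain and cluster definitions can be made concrete with an exhaustive check. This is a sketch for tiny examples only, since it enumerates every subset of shared genes; the function names and the position dictionaries are hypothetical, and the gap is taken to be the number of intervening genes, as defined above.

```python
from itertools import combinations

def is_max_gap_chain(positions, g):
    """True if consecutive genes (by position) are separated by
    at most g intervening genes."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))

def max_gap_clusters(pos1, pos2, g, min_size=2):
    """Brute force: gene sets that form a max-gap chain in both genomes
    and are not contained in a larger such set."""
    shared = sorted(set(pos1) & set(pos2))
    chains = []
    for size in range(min_size, len(shared) + 1):
        for genes in combinations(shared, size):
            if (is_max_gap_chain([pos1[x] for x in genes], g)
                    and is_max_gap_chain([pos2[x] for x in genes], g)):
                chains.append(set(genes))
    # keep only maximal sets (no proper superset among the chains)
    return [c for c in chains if not any(c < other for other in chains)]

pos1 = {"a": 0, "b": 1, "c": 3, "d": 9}
pos2 = {"a": 5, "b": 2, "c": 4, "d": 20}
print(max_gap_clusters(pos1, pos2, g=1))
```

In this example {a, b} alone is not a chain in the second genome, yet {a, b, c} is a chain in both, so the only maximal cluster is {a, b, c}.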

30
A Framework for Identifying Gene Clusters
  • Find homologous genes
  • Formally define a gene cluster
  • Devise an algorithm to identify clusters
  • Statistically verify clusters

31
Max-gap search algorithms
  • Many genomic studies use a max-gap criterion
  • Each group designs its own search algorithm
  • These are often greedy algorithms, but greedy
    algorithms miss disordered max-gap clusters
  • Hoberman et al, RECOMB Comp Genomics 2005

32
Greedy, Agglomerative Algorithms
g = 2
  • initialize a cluster as a single homologous pair
  • search for a gene in proximity on both
    chromosomes
  • either extend the cluster and repeat, or
    terminate

33
Greedy Algorithms Impose Order Constraints
g = 2
A max-gap cluster of size four
  • A greedy, agglomerative algorithm will not find
    this cluster since there is no max-gap cluster of
    size two
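
The failure mode can be reproduced in a few lines. This is a minimal sketch, assuming one plausible reading of the greedy procedure (a gene is added when it lies within gap g of some current cluster member in both genomes); the gene positions are an invented example, not the configuration drawn on the slide.

```python
def is_max_gap_chain(positions, g):
    """True if consecutive genes are separated by at most g intervening genes."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))

def greedy_cluster(seed, pos1, pos2, g):
    """Grow a cluster from one gene, adding any gene that is within
    gap g of a cluster member in BOTH genomes, until nothing fits."""
    cluster = {seed}
    extended = True
    while extended:
        extended = False
        for gene in pos1:
            if gene in cluster:
                continue
            near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
            near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                extended = True
                break
    return cluster

g = 2
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9}   # order a b c d, gaps of 2
pos2 = {"b": 0, "d": 3, "a": 6, "c": 9}   # order b d a c, gaps of 2

# All four genes form a max-gap chain in each genome ...
print(is_max_gap_chain(pos1.values(), g), is_max_gap_chain(pos2.values(), g))
# ... but no seed can be extended, because no two genes are close in both.
print([sorted(greedy_cluster(s, pos1, pos2, g)) for s in pos1])
```

Neighbors in the first genome are never neighbors in the second, so every greedy run stops at a single gene even though the four genes together satisfy the max-gap definition.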

34
Algorithms and Definition Mismatch
  • Greedy, bottom-up algorithms will not find all
    max-gap clusters
  • There is an efficient divide-and-conquer
    algorithm to find maximal max-gap clusters
    (Bergeron et al, WABI, 2002)
  • Cluster statistics depend on the search space,
    which depends on which algorithm is used

35
A Framework for Identifying Gene Clusters
  • Find homologous genes
  • Formally define a gene cluster
  • Devise an algorithm to identify clusters
  • Statistically verify clusters

An example
36
Example: Whole-genome self-comparison to detect
duplicated blocks
  • Chose g = 30
  • Compared all human chromosomes to all other
    chromosomes to find max-gap gene clusters

McLysaght et al. Nature Genetics, 2002.
37
How can we use statistical analysis?
[Figure labels: Chr 17; Chr 3; 10 genes duplicated out of 100; 29 genes; 3 genes duplicated out of 25; 13 genes]
  • Could two regions display this degree of
    similarity simply by chance?
  • Is g = 30 a reasonable choice of gap size?
  • Are larger clusters less likely to occur by
    chance?
  • How large does a cluster have to be before we are
    surprised to observe it?

McLysaght et al. Nature Genetics, 2002.
38
Outline
  • Introduction and Motivation
  • Evolution of spatial organization
  • Applications: why identify related genomic
    regions?
  • Problem Background
  • Why is this challenging?
  • Introduction to cluster finding
  • Results
  • Statistics for pairwise clusters
  • Statistics for three-way clusters

39
Cluster Statistics
Cluster Models:  r-windows                 max-gap
2 regions        Durand and Sankoff, 2003  my work
3 regions        my work                   unsolved
40
The max-gap definition is the most widely used
cluster definition in genomic analyses
  • Overbeek et al 1999: inferring functional
    coupling of genes in bacteria
  • Vision et al 2000: origins of genomic
    duplications in Arabidopsis
  • Friedman and Hughes 2001: gene duplication and
    structure of eukaryotic genomes
  • Tamames 2001: evolution of gene order
    conservation in prokaryotes
  • Vandepoele et al 2002: microcolinearity between
    Arabidopsis and rice
  • McLysaght et al 2002: genomic duplication during
    early chordate evolution
  • Simillion et al 2002: hidden duplications in
    Arabidopsis
  • Blanc et al 2003: recent polyploidy in
    Arabidopsis
  • Luc et al 2003: gene teams for comparative
    genomics
  • Chen et al 2004: operon prediction in newly
    sequenced bacteria
  • Bourque et al 2005: comparison of mammalian and
    chicken genome architectures

Yet there is no formal statistical model for
max-gap clusters
41
The Question
Suppose two whole genomes were compared, and this
max-gap cluster was identified
  • Is this cluster biologically meaningful?
  • Could it have occurred in a comparison of two
    random genomes?

42
Statistical Testing
  • Hypothesis testing
  • Alternate hypothesis: shared ancestry
  • Null hypothesis: random gene order
  • Discard clusters that could have arisen under the
    null model
  • Determine the probability of observing a similar
    cluster under the null hypothesis

43
The Problem
h = 4
  • Given an allowed gap size of g, what is the
    probability of observing a max-gap cluster
    containing exactly h matching gene pairs?
  • How do we calculate this probability?

44
The Inputs
n = 22
m = 6
h = 4
g = 2
  • n: number of genes in each genome
  • m: number of matching gene pairs
  • g: the maximum gap allowed in a cluster
  • h: number of matching genes in the cluster
45
The probability when m = n
If gene content is identical
  • the probability of a max-gap cluster is 1
  • (regardless of the allowed gap size)

46
Probability of a cluster of size h when m < n
[Figure labels: m genes; h genes; m-h genes]
Basic approach: Enumerate all ways to
  1. Create chains of h genes in both genomes
  2. Place m-h remaining genes so they do not extend
    the cluster
  3. Normalize to get a probability

47
Probability of observing a cluster of size h
  (number of ways to place h genes so they form a
   chain in both genomes)
× (number of ways to place m-h remaining genes so
   they do not extend the cluster)
÷ (all configurations of m gene pairs in two
   genomes of size n)
48
Probability of observing a cluster of size h
  (number of ways to place h genes so they form a
   chain in both genomes)
× (number of ways to place m-h remaining genes so
   they do not extend the cluster)
÷ (all configurations of m gene pairs in two
   genomes of size n)
49
Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Select h spots in each genome, so they form a
max-gap chain
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
Hoberman, Sankoff, Durand RECOMB Comparative
Genomics, 2004
50
Probability of observing a cluster of size h
  (number of ways to place h genes so they form a
   chain in both genomes)
× (number of ways to place m-h remaining genes so
   they do not extend the cluster)
÷ (all configurations of m gene pairs in two
   genomes of size n)
51
Counting the number of ways to place m-h genes
outside the cluster
g = 1
h = 3
  • Approach
  • design a rule specifying where the genes can be
    placed so that the cluster is not extended
  • count the positions that satisfy the rule

52
Counting the number of ways to place m-h genes
outside the cluster
gaps 1
g = 1
  • Rule 1: A gene can go anywhere except in the
    cluster (the white box).

Too lenient → overcounts
53
Counting the number of ways to place m-h genes
outside the cluster
g = 1
  • Rule 2: Every gene must be at least g+1 positions
    from the cluster (outside the grey box).

Too strict → undercounts
54
Counting the number of ways to place m-h genes
outside the cluster
gap > 1
g = 1
h = 3
gap > 1
  • Rule 2: Every gene must be at least g+1 positions
    from the cluster (outside the grey box).

Too strict → undercounts
55
Counting the number of ways to place m-h genes
outside the cluster
g = 1
gap > 1
Too lenient → overcounts
  • Rule 3: At most one member of each gene pair can
    be in the grey box.

56
Counting the number of ways to place m-h genes
outside the cluster
gaps 1
g = 1
Too lenient → overcounts
  • Rule 3: At most one member of each gene pair can
    be in the grey box.

57
Counting the number of ways to place m-h genes
outside the cluster
g = 1
  • Acceptable positions for a gene depend on the
    positions of the remaining genes
  • Use strict and lenient rules to calculate upper
    and lower bounds on G

58
Estimating G
  • Upper bound
  • Erroneously enumerates this configuration
  • Lower bound
  • Fails to enumerate this configuration

59
Probability of observing a cluster of size h
  (number of ways to place h genes so they form a
   chain in both genomes)
× (number of ways to place m-h remaining genes so
   they do not extend the cluster)
Hoberman, Sankoff, Durand Journal of
Computational Biology, 2005
60
What can we learn from this statistical result?
  • Are we less likely to observe a large cluster
    (containing more gene pairs) than a small
    cluster?
  • How large does a cluster have to be before we are
    surprised to observe it?
  • How do we choose the maximum allowed gap value?

61
Whole-genome comparison cluster statistics
n = 1000, m = 250, g = 20
With a significance threshold of 10^-4, any
cluster containing 8 or more genes is significant.
[Plot x-axis: h (cluster size)]
62
E. coli vs. B. subtilis
Algorithm: Bergeron et al, 02; Statistics: Hoberman
et al, 05
Clusters above the orange line are significant at
the .001 level
Complete cluster doesn't form until g = 110
Typical operon sizes
By g = 9, probabilities no longer decrease
monotonically with cluster size
Under null hypothesis, by g = 25 all genes should
form a single cluster
63
Statistical analysis of max-gap gene clusters
  • Data analysis: Identifies statistically
    significant max-gap clusters
  • Search: Suggests a more principled approach for
    choosing a gap size that will yield significant
    clusters
  • Modeling: Provides insight into criteria for
    cluster definitions
  • larger clusters should be less likely to occur by
    chance

64
Outline
  • Introduction and Motivation
  • Evolution of spatial organization
  • Applications: why identify related genomic
    regions?
  • Problem Background
  • Why is this challenging?
  • Introduction to cluster finding
  • Results
  • Statistics for pairwise clusters
  • Statistics for three-way clusters

65
Cluster Statistics
Cluster Models:  r-windows                 max-gap
2 regions        Durand and Sankoff, 2003  my work
3 regions        my work                   unsolved
Joint work with Narayanan Raghupathy
66
Duplicated segments can be particularly difficult
to identify
[Figure labels: Ancestral chromosome; Whole genome duplication; chromosome 1; chromosome 2]
67
Duplicated segments can be particularly difficult
to identify
[Figure labels: Ancestral chromosome; Whole genome duplication; Reciprocal gene loss; chr 1; chr 2]
68
  • Comparisons of multiple genomes offer
    significantly more power to detect highly
    diverged homologous segments

Arabidopsis thaliana region 1
Rice chr 2
Arabidopsis thaliana region 2
Simillion et al, PNAS 2002
69
Existing Statistical Approaches
[Figure: three windows W1, W2, W3 and their overlap regions x1, x2, x3, x12, x13, x23, x123]
  • Only consider x123, genes shared between all
    regions
  • disregards pairwise overlaps
  • Conduct multiple pairwise comparisons
  • does not consider the greater impact of genes
    shared among all three regions
  • requires that at least two pairwise clusters are
    independently significant

No existing statistical approach considers all
these quantities in a single test.
Methods reviewed in Simillion et al, Bioessays,
2004
Durand and Sankoff, J Comp Biology, 2003
70
r-windows for pairwise comparison
  • Two windows of size r that share at least m
    homologous gene pairs

71
With two regions there is a natural test statistic
  • An r-window cluster spanning two regions can be
    characterized by the number of shared genes, m.

The probability that two random windows share m
genes
Durand and Sankoff, J Comp Biology, 2003
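
As a rough illustration of what such a probability looks like, the sketch below uses a deliberately simplified null model. It is not the Durand and Sankoff statistic: it assumes both genomes contain the same n genes with one-to-one homologs, random gene order, and two fixed (not searched-for) windows of r genes each. Under those assumptions the number of shared genes is hypergeometric. The function names are hypothetical.

```python
from math import comb

def prob_share_exactly(k, n, r):
    """P(two fixed windows of r genes share exactly k homologs),
    under the simplified model described above."""
    return comb(r, k) * comb(n - r, r - k) / comb(n, r)

def prob_share_at_least(k, n, r):
    """Upper tail: P(the windows share k or more homologs)."""
    return sum(prob_share_exactly(j, n, r) for j in range(k, r + 1))

# n and r chosen to match the values used elsewhere in the talk
n, r = 5000, 100
for k in (3, 5, 8):
    print(k, prob_share_at_least(k, n, r))
```

A whole-genome search examines many window pairs, so a significance test in practice must also account for the number of windows compared, which this sketch ignores.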
72
With three regions there are many more quantities
to consider
  • Three-way overlap: x123
  • Pairwise overlaps: x12, x23, x13

[Figure: three windows with overlap regions x1, x2, x3, x12, x13, x23, x123]
We want to consider both types of overlaps
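
The quantities named on this slide can be computed directly from three windows. This is a small sketch with a hypothetical function name; it assumes each window is given as a collection of gene-family ids, and that x12, x13 and x23 count genes shared by exactly two windows (excluding the three-way genes counted in x123).

```python
def overlap_counts(w1, w2, w3):
    """Split three windows into the seven regions of their Venn diagram."""
    w1, w2, w3 = set(w1), set(w2), set(w3)
    return {
        "x123": len(w1 & w2 & w3),        # shared by all three windows
        "x12": len((w1 & w2) - w3),       # shared by windows 1 and 2 only
        "x13": len((w1 & w3) - w2),
        "x23": len((w2 & w3) - w1),
        "x1": len(w1 - w2 - w3),          # unique to window 1
        "x2": len(w2 - w1 - w3),
        "x3": len(w3 - w1 - w2),
    }

print(overlap_counts([1, 2, 3, 4, 5], [1, 2, 6, 7], [1, 4, 6, 8]))
```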
73
Our r-window statistics for three windows
First test that uses both pairwise and three-way
overlaps
[Figure: three windows with overlap regions x1, x2, x3, x12, x13, x23, x123]
  • Given three randomly selected windows of length
    r, we derived equations for the probability of
    observing such a cluster under a null hypothesis
    of random gene order.

Raghupathy, Hoberman, Durand. APBC 2007.
74
How serious are these limitations?
Our derived expressions can be used to compare
pairwise and three-way cluster probabilities for
typical genome parameters and window sizes.
  • Pairwise tests have a number of limitations
  • They do not consider the greater impact of genes
    shared among all three regions
  • They require that at least two pairwise
    comparisons are independently significant

75
What is the impact of ignoring three-way genes?
  a) Three genes are shared by all three windows
  b) Three distinct genes are shared by each pair
    of windows

Each pair of regions shares exactly three genes
[Figure: overlap diagrams for (a) x123 = 3 with no pairwise-only genes, and (b) x12 = x13 = x23 = 3 with x123 = 0]
76
Which cluster is least likely to occur by chance?
n = 5000, r = 100
[Figure: the two cluster types, with overlap counts 0 and h]
77
(a) is less likely to occur than (b)
Pairwise tests cannot distinguish these two
clusters
  a) Three genes are shared by all three windows
  b) Three distinct genes are shared by each pair
    of windows

but pairwise tests score them identically
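
A few lines of arithmetic show why a pairwise view cannot separate the two clusters. This is a sketch with a hypothetical helper, assuming x12, x13 and x23 count genes shared by exactly two windows and x123 counts genes shared by all three.

```python
def pairwise_shared(x12, x13, x23, x123):
    """Genes each pair of windows shares, as a pairwise test sees it:
    a three-way gene is counted once in every pair."""
    return (x12 + x123, x13 + x123, x23 + x123)

def distinct_shared_genes(x12, x13, x23, x123):
    """Total number of distinct genes appearing in more than one window."""
    return x12 + x13 + x23 + x123

cluster_a = (0, 0, 0, 3)   # three genes shared by all three windows
cluster_b = (3, 3, 3, 0)   # three distinct genes shared by each pair

print(pairwise_shared(*cluster_a), pairwise_shared(*cluster_b))
print(distinct_shared_genes(*cluster_a), distinct_shared_genes(*cluster_b))
```

Both clusters present three shared genes to every pairwise comparison, although (a) involves three distinct genes and (b) involves nine.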
78
How much of a limitation is this?
Our derived expressions can be used to compare
pairwise and three-way cluster probabilities for
typical genome parameters and window sizes.
  • Pairwise tests have a number of limitations
  • They do not consider the greater impact of genes
    shared among all three regions
  • They require that at least two pairwise
    comparisons are independently significant

79
When no genes appear in all three regions, are
pairwise statistical tests sufficient?
n = 5000, r = 100, x123 = 0
α = 0.001
[Plot legend: Two Pairwise Tests; Three-window]
80
When no genes appear in all three regions, are
pairwise statistical tests sufficient?
  • The three-window test is strictly more general
    than two pairwise tests
  • The three-window test will always reject the null
    hypothesis for any cluster in which the pairwise
    tests reject the null hypothesis

[Plot legend: Two Pairwise Tests; Three-window]
81
These results suggest that multi-region tests can
identify more distantly related homologous
regions.
  • There is an ongoing debate about whether there
    were one or two whole genome duplications in
    early vertebrates
  • More powerful statistical tests might help
    resolve this issue

82
Summary
  • First statistical tests for max-gap clusters, the
    most commonly used definition
  • determines minimum significant cluster size
  • allows more principled selection of gap parameter
  • reveals problems with max-gap definition
  • More sensitive statistical tests for comparisons
    of three genomic regions
  • first test to consider both pairwise and
    three-way overlaps
  • can identify more distantly related regions
  • may be able to resolve outstanding questions in
    molecular evolution

83
Acknowledgements
  • Collaborators
  • Dannie Durand, Biological Sciences and Computer
    Science, CMU
  • David Sankoff, Math and Statistics, University of
    Ottawa
  • Narayanan Raghupathy, Biological Sciences, CMU
  • Funders
  • Barbara Lazarus Women@IT Fellowship
  • The Sloan Foundation

84
Thanks
85
Questions?
94
Advantages of an analytical approach
  • Analyzing incomplete datasets
  • Principled parameter selection
  • Understanding statistical trends
  • Insight into tradeoffs between definitions
  • Efficiency
  • Accuracy

95
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 places left
[Figure: genome positions 1, 2, 3, 4, 5, ..., n-L+1, ..., n]
The maximum length of the chain is L = (h-1)g + h
96
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
97
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
[Figure: genome positions 1, 2, 3, 4, 5, ..., n-L+1, ..., n]
98
Counting clusters at the end of the genome
  • Gaps are constrained
  • And sum of gaps is constrained

l = w-1
l = m
99
[Figure: a chain with gaps g_1, g_2, g_3, ..., g_(m-1) and total length l < w]
A known solution
100
Counting clusters at the end of the genome
  • Gaps are constrained
  • And sum of gaps is constrained

l = w-1
l = m
101
Cluster Length
[Figure: triangular table of counts d(m,g,l) for cluster lengths l = m, m+1, ..., w-2, w-1, with rows from l = w-1 down to l = m and a line of symmetry]
102
Exploiting Symmetry
[Figure: table relating cluster lengths l (m, m+1, m+2, ...) and w, w-1, w-2, ... to gap values g, g-1, g-2, ..., 1, 2]
103
Cluster Length
[Figure: triangular table of counts d(m,g,l) for cluster lengths l = m, m+1, ..., w-2, w-1, annotated with (g+1)^(m-1)]
104
Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
  (Starting positions: n-L+1)
× (Ways to place remaining h-1 genes: (g+1)^(h-1))
+ (Chains with starting positions near end)
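
The count can be checked numerically on small cases. This sketch is not the closed-form derivation from the talk: the interior term uses (n-L+1)(g+1)^(h-1) with L = (h-1)g + h as stated on the earlier slides, while the near-end term is simply enumerated by recursion. Function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

def count_chains(n, h, g):
    """Return (interior, near_end): chains of h genes among positions 1..n,
    split by whether the leftmost gene leaves room for the longest chain."""
    L = (h - 1) * g + h                      # maximum chain length
    unconstrained_starts = max(n - L + 1, 0)

    @lru_cache(maxsize=None)
    def ways(last, remaining):
        """Ways to place `remaining` more genes to the right of `last`."""
        if remaining == 0:
            return 1
        stop = min(last + g + 1, n)          # next gene is at most g+1 away
        return sum(ways(nxt, remaining - 1) for nxt in range(last + 1, stop + 1))

    interior = unconstrained_starts * (g + 1) ** (h - 1)
    near_end = sum(ways(s, h - 1) for s in range(unconstrained_starts + 1, n + 1))
    return interior, near_end

def count_chains_bruteforce(n, h, g):
    """Check every h-subset of positions against the max-gap condition."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))
        for pos in combinations(range(1, n + 1), h)
    )

interior, near_end = count_chains(10, 3, 1)
print(interior, near_end, count_chains_bruteforce(10, 3, 1))
```

For n = 10, h = 3, g = 1 the interior term is 6 × 4 = 24 and the near-end starts add 4 more chains, matching the brute-force total of 28.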
