Locating conserved genes in whole genome scale

Learn more at: https://cs.nyu.edu


1
Locating conserved genes in whole genome scale
  • Prudence Wong
  • University of Liverpool
  • June 2005
  • joint work with HL Chan, TW Lam, HF Ting,
    SM Yiu (HKU), and WK Sung (NUS)

2
Outline
  • Motivation
  • Challenges of Whole Genome Alignment
  • Four approaches and their performance
  • Longest Common Subsequence
  • Clustering Approach
  • Mutation Sensitive Selection
  • Hybrid Approach
  • Remarks

4
Mouse Human
Mouse and human are genetically very similar.
Do they look the same?
What do we mean by similar?
Many genes found in human are also found in mouse;
these are called conserved genes.
[Figure labels: Mouse Chromosome 16, Human Chromosome 16, m16, h03]
5
Whole Genome Alignment
Identify regions on the genomes that possibly
contain their conserved genes.
[Figure label: possibly a mutation]
Differences in the ordering of conserved genes could be
related to mutations. For related species, the number
of mutations is usually small.
7
Data size
  • Usually very large (e.g., human chromosomes vs
    mouse chromosomes)

Examples
Human Chr No.   Length   Mouse Chr No.   Length
1               245M     1               134M
3               200M     2               181M
11              135M     7               134M
15              100M     8               129M
20              64M      16              99M

Global alignment tools cannot be used because the
sequences are too large.
8
Observations
  • a conserved gene may not be identical in the two
    genomes; nevertheless, there are some common
    substrings unique to this conserved gene (called
    MUMs, maximal unique matches)
  • locate all MUMs over the two genomes; note that not
    every MUM corresponds to a conserved gene
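
As a rough illustration of what a MUM is, the following Python sketch finds maximal matches that occur exactly once in each of two short strings. It is a naive scan written for clarity, not the suffix-tree construction used on real chromosomes; the names `find_mums` and `count_occurrences` and the `min_len` threshold are assumptions made for this example, and the reverse strand is ignored.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for each maximal unique match.

    Naive illustration only: far too slow for genome-sized inputs.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip pairs that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both strings.
            if (count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, length))
    return mums


print(find_mums("TTACGTGG", "CCACGTAA"))  # [(2, 2, 4)]: the shared "ACGT"
```

On the toy input the only MUM is "ACGT" at offset 2 in both strings; shorter shared fragments fall below `min_len`.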

9
Number of MUMs
Mouse Chr No.   Human Chr No.   No. of MUMs
7               19              52,394
15              22              71,613
16              16              66,536
16              22              61,200
17              16              29,001
17              19              56,236
19              11              29,814

The number of MUMs is small compared with the chromosome lengths.
10
MUMs for M16-H03
[Figure labels: Conserved genes; Mouse Chromosome 16; Human Chromosome 03]
11
How to choose the right MUMs?
Generation of MUMs using a suffix tree
13
MUM Selection
  • MUMmer-1 [Delcher et al., Nucleic Acids Research 1999]
      • longest common subsequences (effectively assumes
        no mutations)
  • MUMmer-2 [Delcher et al., Nucleic Acids Research 2002],
    MUMmer-3 [Kurtz et al., Genome Biology 2004]
      • clustering heuristics
      • most popular tool to uncover conserved genes at
        whole-genome (WG) scale
  • MaxMinCluster [Wong et al., Bioinformatics 2004]
      • clustering, optimization
  • MSS: Mutation Sensitive Selection [Chan et al.,
    Bioinformatics 2005]
      • captures mutations
  • Hybrid approach [Chan et al., Bioinformatics 2005]
      • combines the mutation-sensitive and clustering
        approaches

(MaxMinCluster, MSS, and the hybrid approach are our results)
14
Overview of Results
  • Average coverage (sensitivity) in %

                      Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3              77 (27)       66 (71)                    43 (62)
MaxMinCluster         84 (29)       69 (75)                    45 (59)
MSS                   91 (29)       79 (75)                    36 (53)
MUMmer-3 + MSS        91 (28)       79 (75)                    48 (43)
MaxMinCluster + MSS   91 (27)       79 (82)                    51 (53)

  • coverage: percentage of published conserved genes that are reported
  • sensitivity: percentage of reported MUMs that reside in
    published conserved genes
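
The two measures can be made concrete with a small Python sketch. It assumes, purely for illustration, that genes are (start, end) intervals on one genome, that selected MUMs are (start, length) pairs on the same genome, and that a gene counts as reported when at least one selected MUM lies entirely inside it; the slides do not spell out these details, and the function name is hypothetical.

```python
def coverage_and_sensitivity(genes, mums):
    """Return (coverage %, sensitivity %) for selected MUMs.

    genes: list of (start, end) intervals of published conserved genes.
    mums:  list of (start, length) for the MUMs a method reports.
    Assumption: a MUM "resides in" a gene if it lies fully inside it.
    """
    def inside(mum, gene):
        start, length = mum
        return gene[0] <= start and start + length <= gene[1]

    # Genes hit by at least one reported MUM.
    covered = sum(any(inside(m, g) for m in mums) for g in genes)
    # Reported MUMs that fall inside some published gene.
    in_gene = sum(any(inside(m, g) for g in genes) for m in mums)

    coverage = 100.0 * covered / len(genes) if genes else 0.0
    sensitivity = 100.0 * in_gene / len(mums) if mums else 0.0
    return coverage, sensitivity


genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 20), (50, 30), (150, 20), (210, 40), (350, 25)]
print(coverage_and_sensitivity(genes, mums))  # (50.0, 60.0)
```

Here two of the four genes contain a reported MUM (coverage 50%) and three of the five reported MUMs lie inside a gene (sensitivity 60%).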

15
Overview of Results
MSS outperforms MaxMinCluster and MUMmer-3 on
closely related species

16
Overview of Results
BUT MSS performs worse on species relatively
farther apart

17
Overview of Results

both hybrid approaches perform well for species
farther apart
19
Longest Common Subsequence
LCS
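
The LCS-style selection can be pictured as picking the heaviest set of MUMs that appear in the same order in both genomes. The following is a minimal sketch, assuming each MUM is a (pos_a, pos_b, length) tuple, that a MUM's weight is its length, and that overlaps between MUMs are ignored. The name `heaviest_chain` is hypothetical and the quadratic dynamic program is chosen for readability; it is not taken from MUMmer itself.

```python
def heaviest_chain(mums):
    """Select the max-weight subset of MUMs increasing in both genomes.

    mums: list of (pos_a, pos_b, length); weight is taken to be length.
    """
    if not mums:
        return []
    mums = sorted(mums)                   # order by position in genome A
    n = len(mums)
    best = [m[2] for m in mums]           # best chain weight ending at i
    prev = [-1] * n                       # back-pointer for reconstruction
    for i in range(n):
        for j in range(i):
            same_order = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if same_order and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    # Walk back from the heaviest chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]


mums = [(10, 50, 30), (20, 10, 25), (60, 40, 40), (100, 90, 20)]
print(heaviest_chain(mums))  # [(20, 10, 25), (60, 40, 40), (100, 90, 20)]
```

The MUM at (10, 50) is dropped because keeping it would break the common ordering; any MUM displaced by a mutation is lost in the same way, which is the limitation the later approaches address.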
20
Outline
LCS Approach (MUMmer-1) does not take mutations
into account
  • MUMmer-2 and MUMmer-3 cluster MUMs by heuristics
  • MaxMinCluster formalizes clustering as a
    combinatorial optimization problem

21
Clustering approach
  • Observations
  • Noise MUMs are usually short and isolated
  • A conserved gene usually contains a sequence of
    MUMs that are close together and have sufficient
    total length => clusters

[Figure labels: Gene X, Gene Y, Noise]
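
The observations above suggest a simple grouping rule, sketched below in Python. It assumes MUMs are (pos_a, pos_b, length) tuples, treats two consecutive MUMs as "close" when the gap between them is at most `gap` in both genomes, and keeps a group only if its total MUM length reaches `min_size`. These definitions and the function name are assumptions for illustration; noisy clusters and the MaxMinCluster optimization are not covered.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group nearby, same-order MUMs and drop groups that are too small.

    mums: list of (pos_a, pos_b, length).
    Returns a list of clusters, each a list of MUMs.
    """
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            prev_a, prev_b, prev_len = current[-1]
            gap_a = mum[0] - (prev_a + prev_len)
            gap_b = mum[1] - (prev_b + prev_len)
            # Close in both genomes and in the same relative order.
            if not (0 <= gap_a <= gap and 0 <= gap_b <= gap):
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    # Short, isolated groups are treated as noise.
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


mums = [(0, 1000, 25), (40, 1040, 20), (500, 3000, 15),
        (900, 5000, 30), (950, 5050, 30)]
for cluster in cluster_mums(mums):
    print(cluster)
```

With the defaults, the isolated MUM of length 15 is discarded as noise and the other four MUMs form two clusters.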
22
Challenge
  • Challenge: some conserved genes do not induce
    clusters of sufficient length
  • Solution: relax the definition of clusters to
    allow the presence of noise

23
Noisy cluster
  • Suppose Gap = 100, MinSize = 40

[Figure labels: > 100 apart; length 20; a 1-noisy cluster]
24
Noisy cluster
  • Suppose Gap = 100, MinSize = 40

[Figure labels: > 100 apart; length 20; a 2-noisy cluster]
25
MaxMinCluster
  • Problem formulation
  • find a collection of k-noisy clusters such that
    the smallest cluster has the maximum weight
  • Dynamic programming: O(k^2 n^2) time, O(k^2 n) space

26
Outline
Capture mutations more directly
27
Mutation Sensitive Selection
  • select subsets of MUMs (one from each genome) such
    that one subset can be transformed into the other
    by a few mutations
  • three types of mutations: reversal,
    transposition, reversed-transposition

28
k-mutated subsequences
  • Given two sequences A and B and an integer k,
  • a subsequence X of A and a subsequence Y of B
    are called a pair of k-mutated subsequences if X
    can be transformed into Y by at most k mutations

[Figure labels: a pair of 2-mutated subsequences; reversal; transposition]
MUMs are signed; a reversal flips the sign of the MUMs involved
29
Mutation Sensitive Selection
  • Problem formulation
  • To find a pair of k-mutated subsequences with
    maximum weight
  • We believe that the problem is NP-hard
  • The Genome Rearrangement Problem, believed to be
    NP-hard, can be reduced to this problem
  • We give an efficient approximation algorithm
  • the resulting weight is close to (at least
    1/(3k+1) times) the maximum possible weight

O(n^2 log n + k n^2) time, O(n^2) space
31
Hybrid Approach
  • first apply a clustering approach to identify
    clusters that are obviously conserved genes
  • can apply either MUMmer-3 or MaxMinCluster
  • these clusters are treated as MUMs with bigger
    weight
  • then apply MSS to process these MUMs together with
    the remaining MUMs
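
The hand-off between the two stages might look like the Python sketch below. It assumes the clustering step has already produced clusters of (pos_a, pos_b, length) MUMs sorted by position, and it uses the summed member length as the "bigger weight" of a collapsed cluster; the slides do not say how that weight is chosen, so this is only one plausible choice. The function name is hypothetical, and the MSS step itself is not implemented here.

```python
def collapse_clusters(clusters, remaining):
    """Turn each cluster into one heavy item and merge with leftover MUMs.

    clusters:  list of clusters, each a position-sorted list of
               (pos_a, pos_b, length) MUMs.
    remaining: MUMs that did not fall into any cluster.
    Returns items (pos_a, pos_b, span_a, weight) sorted by pos_a.
    Assumption: a cluster's weight is the total length of its MUMs.
    """
    items = []
    for cluster in clusters:
        first, last = cluster[0], cluster[-1]
        span_a = last[0] + last[2] - first[0]       # extent on genome A
        weight = sum(length for _, _, length in cluster)
        items.append((first[0], first[1], span_a, weight))
    for pos_a, pos_b, length in remaining:
        items.append((pos_a, pos_b, length, length))  # plain MUM: weight = length
    return sorted(items)


clusters = [[(0, 1000, 25), (40, 1040, 20)]]
remaining = [(500, 3000, 15)]
print(collapse_clusters(clusters, remaining))
# [(0, 1000, 60, 45), (500, 3000, 15, 15)]
```

The resulting weighted items would then be passed to the mutation-sensitive selection step in place of the raw MUM list.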

33
Remarks
  • Experiments show that
  • MaxMinCluster > LCS
  • MSS > MaxMinCluster for closely related species
  • MSS does not perform well for species relatively
    farther apart
  • Hybrid approach is the best for both closely
    related and farther apart species

34
Thank you!
  • Q & A

35
Approximation Algorithm
  • Super-Backbone
  • maximum weight common subsequences
  • Identify k mutation blocks
  • having high weight
  • do not overlap with Super-Backbone too much
  • this is formulated as a sub-problem and solved
    optimally by dynamic programming
  • Report Super-Backbone + k mutation blocks

O(n^2 log n + k n^2) time, O(n^2) space
36
Mutations
  • three types of mutations: reversal,
    transposition, reversed-transposition

a b c d e f g h i j k l m n o p q r s t u v w x y z
a d c b e f g h i j k l m n o p q r s t u v w x y z
a d c b e k l m n o p q r s t u v w x y f g h i j z
a d c b e k l t s r q p o m n u v w x y f g h i j z
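
The three steps in the alphabet example can be replayed with a short Python sketch. The helper names and index conventions are assumptions made for illustration: a transposition is modelled as swapping two adjacent blocks, and a reversed-transposition as the same swap with the moved block also reversed. Signs of MUMs are not modelled.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place within the sequence."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, k):
    """Swap the adjacent blocks seq[i:j] and seq[j:k]."""
    return seq[:i] + seq[j:k] + seq[i:j] + seq[k:]


def reversed_transposition(seq, i, j, k):
    """Swap adjacent blocks seq[i:j] and seq[j:k], reversing seq[j:k]."""
    return seq[:i] + seq[j:k][::-1] + seq[i:j] + seq[k:]


s0 = "abcdefghijklmnopqrstuvwxyz"
s1 = reversal(s0, 1, 4)                   # reverse "bcd"
s2 = transposition(s1, 5, 10, 25)         # move "fghij" to just before "z"
s3 = reversed_transposition(s2, 7, 9, 15) # move "opqrst" before "mn", reversed
for s in (s0, s1, s2, s3):
    print(" ".join(s))
```

The four printed lines match the four sequences on the slide, so the final sequence is reached from the alphabet with three mutations, one of each type.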