PatternHunter II: Highly Sensitive and Fast Homology Search

1
PatternHunter II: Highly Sensitive and Fast Homology Search
Bioinformatics and Computational Molecular Biology (Fall 2005): Representation
Ming Li, Bin Ma, Derek Kisman, John Tromp
R94922059
2
Overview
  • Homology search
  • Local alignment algorithms
  • PH I
  • PH II
  • Multiple Spaced Seeds
  • Computing hit probability
  • Finding a good seed set
  • PH II Design
  • Performance

3
Local alignment
  • Smith-Waterman
  • Smith and Waterman, 1981; Waterman and Eggert, 1987
  • SSearch
  • FASTA
  • Wilbur and Lipman, 1983; Lipman and Pearson, 1985
  • BLAST
  • Altschul et al., 1990; Altschul et al., 1997
  • BLAST family: BLASTN, BLASTP, etc.
  • MEGABLAST

4
PatternHunter
  • Seed
  • Trade-off: sensitivity <-> computation
  • Consecutive k letters
  • k = 11 in BLASTN, k = 28 in MEGABLAST
  • Nonconsecutive k letters
  • Spaced seed
  • A 0/1 model whose weight k is the number of 1 positions
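
The difference between a consecutive seed and a spaced seed can be sketched in a few lines. This is a minimal illustration, not PatternHunter's implementation: the function names `weight` and `seed_matches` are hypothetical, and the seed is assumed to be written as a string of 1s (positions that must match) and 0s (don't-care positions).

```python
def weight(seed: str) -> int:
    """Weight of a seed: the number of '1' (must-match) positions."""
    return seed.count("1")


def seed_matches(seed: str, s: str, t: str, i: int, j: int) -> bool:
    """True if s[i:] and t[j:] agree at every '1' position of the seed.

    Positions marked '0' in the seed are ignored, which is what makes
    the seed "spaced".
    """
    if i < 0 or j < 0 or i + len(seed) > len(s) or j + len(seed) > len(t):
        return False
    return all(s[i + k] == t[j + k] for k, c in enumerate(seed) if c == "1")


# A mismatch at the don't-care position is tolerated by the spaced seed
# but not by the consecutive seed of the same length.
spaced_ok = seed_matches("1101", "ACGT", "ACTT", 0, 0)
consecutive_ok = seed_matches("1111", "ACGT", "ACTT", 0, 0)
```

With these definitions, a consecutive seed is simply the special case where the model contains no 0s.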

5
PatternHunter II
  • Genome Informatics 14 (2003)
  • Extends the single optimized spaced seed of PH to
    multiple seeds
  • Speed: comparable to BLASTN (MEGABLAST)
  • Sensitivity: approaching Smith-Waterman (SSearch)

6
Definition
  • A homologous region, R
  • A seed hits R
  • A seed set A = {a1, ..., ak} hits R
  • Similarity
  • R has similarity level p (the fraction of identities)
  • Sensitivity
  • Hit probability
  • Optimal (DP): 1

7
Computing Hit Probability
  • NP-hard for multiple seeds
  • DP for a single seed
  • Extend the DP to multiple seeds

8
Computing Hit Probability of Multiple Seeds
  • Let A = {a1, ..., ak} be a set of k seeds and R a random
    region of length L with similarity level p.
  • Binary string b is a suffix of R[0:i]
  • Answer: f(L, ε), where ε is the empty string
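
The idea of a dynamic program over region suffixes can be sketched as follows. This is a simplified illustration under stated assumptions, not the recurrence from the slides: it tracks the probability of "no hit so far" for every possible suffix of length up to (longest seed - 1), which is exponential in the seed length, whereas the original algorithm restricts itself to suffixes that matter for the seed set. The region is modeled as a 0/1 string in which each position is an identity (1) independently with probability p. All function names are hypothetical.

```python
from itertools import product


def ends_with_hit(window: str, seeds) -> bool:
    """True if some seed matches the 0/1 window ending at its last position."""
    for s in seeds:
        if len(s) <= len(window):
            offset = len(window) - len(s)
            if all(window[offset + k] == "1" for k, c in enumerate(s) if c == "1"):
                return True
    return False


def hit_probability(seeds, L: int, p: float) -> float:
    """Probability that at least one seed hits a random region of length L."""
    span = max(len(s) for s in seeds)
    states = {"": 1.0}  # suffix of a hit-free prefix -> probability mass
    for _ in range(L):
        nxt = {}
        for suffix, prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if ends_with_hit(window, seeds):
                    continue  # this mass now counts as "hit"
                key = window[-(span - 1):] if span > 1 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return 1.0 - sum(states.values())


def brute_force_hit_probability(seeds, L: int, p: float) -> float:
    """Cross-check by enumerating all 2^L regions (small L only)."""
    total = 0.0
    for bits in product("01", repeat=L):
        region = "".join(bits)
        if any(ends_with_hit(region[:i], seeds) for i in range(1, L + 1)):
            ones = region.count("1")
            total += p ** ones * (1.0 - p) ** (L - ones)
    return total
```

For short regions the two functions should agree up to floating-point error, which gives a simple sanity check before running the dynamic program on realistic lengths such as L = 64.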

9
Computing Hit Probability of Multiple Seeds
10
Computing Hit Probability of Multiple Seeds
11
Finding a Good Seed Set
  • NP-hard for both optimal seed and multiple seeds
  • Greedy

12
Finding a Good Seed Set
  • Compute the 1st seed a1 that maximizes the hit
    probability of {a1}
  • Compute the 2nd seed a2 that maximizes the hit
    probability of {a1, a2}
  • Repeat until either
  • The desired number of seeds is reached
  • The desired hit probability is reached
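
The greedy procedure above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: candidates are enumerated exhaustively for a small weight and span (impractical for weight 11), the hit probability is computed with a simple exponential-state DP over a 0/1 region model, and all names (`candidate_seeds`, `greedy_seed_set`, `hit_probability`) are hypothetical.

```python
from itertools import combinations


def hit_probability(seeds, L, p):
    """P(some seed hits a random 0/1 region of length L; each bit is 1 w.p. p)."""
    span = max(len(s) for s in seeds)
    states = {"": 1.0}  # suffix of a hit-free prefix -> probability mass
    for _ in range(L):
        nxt = {}
        for suffix, prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                w = suffix + bit
                hit = any(
                    len(s) <= len(w)
                    and all(w[len(w) - len(s) + k] == "1"
                            for k, c in enumerate(s) if c == "1")
                    for s in seeds
                )
                if not hit:
                    key = w[-(span - 1):] if span > 1 else ""
                    nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return 1.0 - sum(states.values())


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) that start and end with 1."""
    out = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            s = ["0"] * span
            s[0] = s[-1] = "1"
            for k in inner:
                s[k] = "1"
            out.append("".join(s))
    return out


def greedy_seed_set(candidates, L, p, max_seeds, target=1.0):
    """Add, one at a time, the seed that most increases the set's hit probability."""
    chosen, best_prob = [], 0.0
    while len(chosen) < max_seeds and best_prob < target:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: hit_probability(chosen + [c], L, p))
        chosen.append(best)
        best_prob = hit_probability(chosen, L, p)
    return chosen, best_prob
```

A call such as `greedy_seed_set(candidate_seeds(3, 5), L=16, p=0.7, max_seeds=2)` returns the chosen seeds and their combined hit probability; both stopping conditions from the list above appear as the `max_seeds` and `target` parameters.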

13
Finding a Good Seed Set
  • May not optimize the combined hit probability
  • Good enough
  • Optimal
  • 16 weight-11 seeds, L = 64, similarity = 70%; first
    four seeds: 111010010100110111, 111100110010100001011,
    110100001100010101111, 1110111010001111
  • Greedy
  • 16 weight-12 seeds, L = 64, similarity = 70%; first
    four seeds: 111010010100110111, 1111000100010011010111,
    1100110100101000110111, 1110100011110010001101

14
Performance of the seeds
  • Curves, from low to high
  • Solid: weight-11 seeds, k = 1, 2, 4, 8, 16
  • Dashed: 1 seed, weight = 10, 9, 8, 7

15
Performance of the seeds
  • Reducing the weight by 1
  • Increases the expected number of hits by a factor
    of 4
  • Doubling the number of seeds
  • Increases the expected number of hits by a factor
    of 2
  • Better: multiple seeds
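
The two factors above follow from a simple back-of-the-envelope model, sketched here. The assumptions are not from the slides: bases are taken to be uniform and independent over the 4-letter DNA alphabet, edge effects are ignored, and the function name `expected_random_hits` is hypothetical. Under that model a weight-w seed matches a given pair of positions by chance with probability 4^(-w), so k seeds on sequences of lengths m and n give roughly k * m * n * 4^(-w) random hits.

```python
def expected_random_hits(k: int, w: int, m: int, n: int) -> float:
    """Approximate number of chance hits for k seeds of weight w
    between random DNA sequences of lengths m and n."""
    return k * m * n * 0.25 ** w


base = expected_random_hits(k=1, w=11, m=10_000, n=10_000)
lighter = expected_random_hits(k=1, w=10, m=10_000, n=10_000)
doubled = expected_random_hits(k=2, w=11, m=10_000, n=10_000)

weight_factor = lighter / base  # effect of reducing the weight by 1
seed_factor = doubled / base    # effect of doubling the number of seeds
```

Here `weight_factor` comes out as 4 and `seed_factor` as 2, matching the slide: adding seeds costs fewer extra random hits than lowering the weight.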

16
PH II Performance
  • Compare with BLAST (BLASTN) and Smith-Waterman (SSearch)
  • Sensitivity of SSearch: 1
  • Alignment score
  • BLAST methods (hash, DP)
  • match = 1, mismatch = -1, gap open = -5, gap extension = -1
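
To make the scoring parameters concrete, the sketch below scores an already-aligned pair of sequences. It is an illustration only: the function name `alignment_score` is hypothetical, gaps are written as '-', and a gap of length g is assumed to cost gap open + g × gap extension, a convention the slides do not spell out.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two aligned rows of equal length under affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row the current gap is in: 'a', 'b', or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if gap_row != row:
                score += gap_open  # a new gap starts here
            score += gap_extend    # every gap column pays the extension
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score
```

For example, under this assumed convention `alignment_score("AAGGTT", "AA--TT")` gives 4 matches plus one gap of length 2, that is 4 - 5 - 2 = -3.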

17
PH II Performance
  • From low to high
  • Solid: PH II with 1, 2, 4, 8 seeds, weight 11
  • Dashed: BLASTN, seed weight 11

18
Complexity Proof
  • Finding optimal spaced seeds
  • NP-hard
  • Finding one optimal seed
  • NP-hard
  • Computing the hit probability of multiple seeds
  • NP-hard