PatternHunter: A Fast and Highly Sensitive Homology Search Method

1
PatternHunter: A Fast and Highly Sensitive Homology Search Method
  • Bin Ma
  • Department of Computer Science
  • University of Western Ontario

2
A homology between mouse and human genomes
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
Smith-Waterman is the most accurate method. Time complexity: O(mn).
3
BLAST finds a hit and then extends
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
Seed match hit
4
Example of missing a target
  • Fail: the pair below contains no run of 11 consecutive matches, so a
    contiguous length-11 seed misses it.
  • GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
  • GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
  • Dilemma:
  • Sensitivity (the success rate of finding a homology) needs shorter
    seeds.
  • Speed needs longer seeds.
  • Mega-BLAST uses seeds of length 28.

5
PatternHunter uses spaced seeds
  • 111010010100110111 (called a model)
  • Eleven required matches (weight = 11)
  • Seven don't-care positions
  • GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT
  • GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT
  •        111010010100110111
  • Hit: all the required matches are satisfied.
  • BLAST's seed model: 11111111111
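
The hit definition above can be checked directly on the two sequences from this slide. The following is a minimal sketch, assuming an ungapped comparison of the two sequences at the same offsets; the function name `seed_hits` is hypothetical and not part of PatternHunter.

```python
def seed_hits(a, b, seed):
    """Return the start offsets where every '1' position of `seed`
    falls on a matching pair of bases in the ungapped pair (a, b)."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [
        start
        for start in range(n - len(seed) + 1)
        if all(a[start + i] == b[start + i] for i in required)
    ]


# The two sequences shown on the slide.
SEQ_A = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
SEQ_B = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"

SPACED = "111010010100110111"   # PatternHunter's weight-11 model
CONTIGUOUS = "1" * 11           # BLAST's weight-11 model

print("spaced seed hits at:", seed_hits(SEQ_A, SEQ_B, SPACED))
print("contiguous seed hits at:", seed_hits(SEQ_A, SEQ_B, CONTIGUOUS))
```

On this pair the spaced model hits once (0-based offset 7), while the contiguous model of the same weight finds nothing, because the longest run of consecutive matches is shorter than 11.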

6
Observations re. spaced seeds
  • Seed models with different shapes can detect different homologies.
  • Two consequences:
  • Some models may detect more homologies than others.
  • This gives a more sensitive homology search: PatternHunter I.
  • Several seed models can be used simultaneously to hit more
    homologies.
  • This approaches a 100% sensitive homology search: PatternHunter II.

7
Spaced Seed: PatternHunter I
8
Weight of a seed
  • Lemma: The expected number of hits of a weight-W, length-M seed
    model within a length-L region with similarity p is
    $(L - M + 1)\,p^W$.
  • Proof: There are $L - M + 1$ positions at which a hit can occur. At
    each position, $p^W$ hits are expected. Q.E.D.
  • Seed models with the same weight generate approximately the same
    number of hits.
  • Speed is approximately the same.
  • Sensitivity is not necessarily the same.
  • What matters is the number of hits vs. the number of regions that
    contain hits.

GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT
       111010010100110111
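
The lemma on this slide is easy to evaluate numerically. Below is a small sketch, assuming each position matches independently with probability p; the function names are hypothetical. The brute-force version enumerates every match/mismatch pattern of a short region and is only there to confirm the closed form.

```python
from itertools import product


def expected_hits(region_len, seed, p):
    """Closed form from the lemma: (L - M + 1) * p**W."""
    weight = seed.count("1")
    return (region_len - len(seed) + 1) * p ** weight


def expected_hits_brute_force(region_len, seed, p):
    """Exact expectation by enumerating all 2**L match patterns
    (feasible only for short regions)."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    total = 0.0
    for region in product((0, 1), repeat=region_len):
        prob = 1.0
        for bit in region:
            prob *= p if bit else 1.0 - p
        hits = sum(
            all(region[start + off] for off in required)
            for start in range(region_len - len(seed) + 1)
        )
        total += prob * hits
    return total


# Length-64 region at 70% similarity, both weight-11 models.
print(expected_hits(64, "111010010100110111", 0.7))  # about 0.93
print(expected_hits(64, "1" * 11, 0.7))              # about 1.07
```

The two weight-11 models produce nearly the same expected number of hits, which is why their speed is similar; the difference in sensitivity comes from how those hits are distributed over regions.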
9
Simulated sensitivity curves
10
Why are spaced seeds better?
[Figure: the contiguous seed 11111111111 placed twice on TTGACCTCACC?, and
the spaced seed 111010010100110111 placed twice on CAA?A??A?C??TA?TGG?;
"?" marks an unspecified base.]
  • BLAST's seed usually uses more than one hit to detect one homology
    (redundant).
  • Spaced seeds use fewer hits to detect one homology (efficient).

11
PH's seed does not overlap heavily
  • PH's seed does not overlap heavily with itself when shifted:
    111010010100110111
     111010010100110111
      111010010100110111
       111010010100110111
        111010010100110111
         111010010100110111
          111010010100110111
    ......
  • The hits at different positions are independent.
  • The probability of having the second hit is $5p^6$,
  • compared to BLAST's model: $p + p^2 + p^3 + p^4$.
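
One way to see why a shifted copy of the spaced seed rarely hits again is to count, for each shift, how many required positions of the shifted seed are not already covered by the required positions of the original. This is a minimal sketch with a hypothetical helper name; it only counts positions and does not reproduce the probability estimate quoted on the slide.

```python
def new_matches_needed(seed, shift):
    """Number of required positions of the shifted seed that are not
    already required (and hence known to match) in the unshifted seed."""
    required = {i for i, c in enumerate(seed) if c == "1"}
    shifted = {i + shift for i in required}
    return len(shifted - required)


SPACED = "111010010100110111"
CONTIGUOUS = "1" * 11

for shift in range(1, 5):
    print(
        shift,
        new_matches_needed(SPACED, shift),
        new_matches_needed(CONTIGUOUS, shift),
    )
```

For the contiguous model a shift of d needs only d new matches, so a second hit next to the first is likely. For the spaced model every shift needs at least 6 new matches, so neighbouring hits are close to independent.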

12
Indeed
  • Indeed, given that there is at least one hit in a length-64, 70%
    similar homology, the average number of hits in that region is
  • 2.0 for PH's weight-11 seed
  • 3.6 for the contiguous weight-11 seed.

13
A dynamic programming algorithm to compute
sensitivity
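  • $R[1..n]$: random homology, with $\Pr(R[i] = 1) = p$. We want
    $\Pr(R \text{ is hit by a seed model } x)$.
  • $DP[i, s]$ denotes $\Pr(R[1..i] \text{ is hit} \mid R[1..i] \text{ ends with } s)$.
  • $DP[i, s] = 1$, if $|s| = |x|$ and $s$ is hit;
  • $DP[i, s] = DP[i-1,\, s[1..|s|-1]]$, if $|s| = |x|$ and $s$ is not hit;
  • $DP[i, s] = p \cdot DP[i, 1s] + (1-p) \cdot DP[i, 0s]$, otherwise.
  • Running time: $O(n\,2^{|x|})$. A better algorithm exists.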
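
The sketch below computes the same quantity, Pr(R is hit by x), with a forward dynamic program. It is not the slide's exact recurrence: as an assumption made for speed, the state is the set of seed placements that are still consistent with the bases seen so far, stored as a bitmask, which keeps the number of states small. The names `sensitivity` and `mean_hits_given_hit` are hypothetical.

```python
from collections import defaultdict


def sensitivity(seed, region_len, p):
    """Pr(seed hits a random region whose positions match
    independently with probability p)."""
    span = len(seed)
    care = int(seed[::-1], 2)  # bit k is set when seed[k] == '1'
    hit_bit = 1 << span
    # State: bit k set means a placement that has consumed k positions
    # is still consistent. Bit 0 (a fresh placement) is always set.
    no_hit = {1: 1.0}
    for _ in range(region_len):
        nxt = defaultdict(float)
        for alive, prob in no_hit.items():
            # A match keeps every placement; a mismatch kills those
            # whose next seed position is a required '1'.
            for survivors, pb in ((alive, p), (alive & ~care, 1.0 - p)):
                new_alive = (survivors << 1) | 1
                if new_alive & hit_bit:
                    continue  # a placement completed: the region is hit
                nxt[new_alive] += prob * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def mean_hits_given_hit(seed, region_len, p):
    """E[hits | at least one hit] = E[hits] / Pr(at least one hit)."""
    expected = (region_len - len(seed) + 1) * p ** seed.count("1")
    return expected / sensitivity(seed, region_len, p)


for name, model in (("spaced", "111010010100110111"), ("contiguous", "1" * 11)):
    print(name, sensitivity(model, 64, 0.7), mean_hits_given_hit(model, 64, 0.7))
```

For a length-64 region at 70% similarity, the conditional averages computed this way come out close to the 2.0 and 3.6 quoted for the two weight-11 models.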

14
PatternHunter I performance
  • Running time / memory on a 700 MHz Pentium III PC with 1 GB of memory:

    Data set                                    Blastn      MB28           PH
    E. coli (4.7M) vs. H. inf (1.8M)            716s/158M   5s/561M        34s/78M
    Arabidopsis chr2 (19.6M) vs. chr4 (17.5M)   --          21720s/1087M   5020s/279M
    Human chr21 (26.2M) vs. chr22 (35M)         --          --             14512s/419M

  • Human (3G) vs. Mouse (3G):
  • Using a 2-hit, weight-12 seed, PH took 6 days on a 1 GHz Pentium III
    PC with 2 GB of memory.
  • With BLAST, it would otherwise take months on parallel computers to
    finish.

15
Multiple Seeds: PatternHunter II
16
PatternHunter II: Optimized Multiple Seeds
  • Basic Searching Algorithm
  • Select a group of spaced seed models
  • For each hit of each model, conduct extension to
    find a homology.
  • Selecting an optimal set of multiple seeds is NP-hard.

17
Seed Selection Algorithm
  • 1. Let A be an empty set.
  • 2. Let s be the seed such that A ∪ {s} has the highest hit
    probability.
  • 3. A = A ∪ {s}; if |A| < K, go to 2.
  • Approximation ratio: 1 - 1/e.
  • Computing the hit probability of multiple seeds is NP-hard.
  • An efficient algorithm exists when the number of zeros is limited.
  • A PTAS computes the probability approximately.
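
The greedy loop above can be sketched on a toy problem. Everything concrete here is an assumption chosen to keep the example fast: short weight-4 candidate seeds, a length-20 region, and an exact hit probability computed by a simple dynamic program over the most recent match/mismatch bits (practical only for short seeds). The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hit_probability(seeds, region_len, p):
    """Exact Pr(at least one seed hits); seeds must start with '1'."""
    span = max(len(s) for s in seeds)
    masks = [int(s, 2) for s in seeds]  # bit 0 = newest position
    keep = (1 << (span - 1)) - 1        # remember the last span-1 bits
    no_hit = {0: 1.0}                   # empty history acts as mismatches
    for _ in range(region_len):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if any((window & m) == m for m in masks):
                    continue            # region is hit; leave the no-hit mass
                nxt[window & keep] += prob * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def candidate_seeds(weight, max_len):
    """All seeds of the given weight (>= 2) that start and end with '1'."""
    seeds = []
    for length in range(weight, max_len + 1):
        for inner in combinations(range(1, length - 1), weight - 2):
            bits = ["0"] * length
            bits[0] = bits[-1] = "1"
            for i in inner:
                bits[i] = "1"
            seeds.append("".join(bits))
    return seeds


def greedy_select(candidates, k, region_len, p):
    """Steps 1-3 of the slide: repeatedly add the seed that maximizes
    the hit probability of the growing set."""
    chosen = []
    while len(chosen) < k:
        best = max(
            (s for s in candidates if s not in chosen),
            key=lambda s: hit_probability(chosen + [s], region_len, p),
        )
        chosen.append(best)
    return chosen


candidates = candidate_seeds(weight=4, max_len=7)
selected = greedy_select(candidates, k=2, region_len=20, p=0.7)
print(selected, hit_probability(selected, 20, 0.7))
```

The greedy choice is only guaranteed to be within the stated approximation ratio of the best set of K seeds; it is not guaranteed to be optimal.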

18
PTAS to compute the probability approximately.
  • Randomly generate m homologies independently.
    Suppose n of them are hit by our seeds. Let p be
    the sensitivity of our seeds.
  • If , then with probability
  • 1-2/K,
  • This can be proved using Chernoff bounds.
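
The sampling step described on this slide can be sketched as follows: generate m random regions, count how many are hit by at least one seed, and report n/m. The region length, identity and sample size below are illustrative assumptions, and the function names are hypothetical; the sketch does not reproduce the slide's sample-size bound.

```python
import random


def is_hit(region, seeds):
    """True if any seed has all its required positions on matches."""
    for seed in seeds:
        required = [i for i, c in enumerate(seed) if c == "1"]
        for start in range(len(region) - len(seed) + 1):
            if all(region[start + i] for i in required):
                return True
    return False


def estimate_sensitivity(seeds, region_len, identity, m, rng):
    """Monte Carlo estimate n/m of the sensitivity of a seed set."""
    n = 0
    for _ in range(m):
        region = [rng.random() < identity for _ in range(region_len)]
        if is_hit(region, seeds):
            n += 1
    return n / m


# Four weight-12 seeds, as listed in this deck for identity 0.7, length 64.
SEEDS = [
    "111011001011010111",
    "1111000100010011010111",
    "1100110100101000110111",
    "1110100011110010001101",
]

one = estimate_sensitivity(SEEDS[:1], 64, 0.7, 2000, random.Random(0))
four = estimate_sensitivity(SEEDS, 64, 0.7, 2000, random.Random(0))
print("one seed:", one, "four seeds:", four)
```

Reusing the same random seed for both calls means both estimates are computed on the same sampled regions, so the four-seed estimate can never be lower than the one-seed estimate.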

19
The seeds obtained under a simple homology
distribution
  • (homology identity = 0.7, homology length = 64)
  • 111011001011010111,
  • 1111000100010011010111,
  • 1100110100101000110111,
  • 1110100011110010001101,

20
Simulated sensitivity curves
  • Solid curves: multiple (1, 2, 4, 8, 16) weight-12 spaced seeds.
  • Dashed curves: optimal spaced seeds with weight 11, 10, 9, 8.
  • Typically, doubling the number of seeds gains more sensitivity than
    decreasing the weight by 1.

[Figure labels: two weight-12 seeds; one weight-11 seed; one weight-12 seed]
21
Coding region seeds
  • The first two bases of a codon are more conserved than the third
    base.
  • Coding-region matches have patterns like 110110.
  • Seeds trained under a coding-region homology distribution are
    called coding-region seeds.
  • PH II's default seeds were trained under a simple distribution
    (0.8, 0.8, 0.5).

22
Experiments on real data
  • About 30k mouse ESTs (25 Mb) and 4k human ESTs (3 Mb), downloaded
    from NCBI GenBank.
  • Low-complexity regions were filtered out.
  • SSearch (Smith-Waterman method) finds all pairs of ESTs with
    significant local alignments.
  • Check what percentage of those pairs can be found by BLAST and by
    different configurations of PatternHunter.

23
Sensitivity curves
24
Recent development
  • Can 100% sensitivity be achieved with reasonable speed?
  • Yes.
  • When similarity is above 80%, 100% sensitivity can be achieved with
    approximately 40 weight-9 seeds.

25
Open questions
  • Can the hit probability of one seed (or a constant number of seeds)
    be computed in polynomial time?
  • Currently, polynomial-time algorithms exist when the number of 0s
    in a seed is O(log n).
  • A PTAS also exists.
  • Can the optimal seed (or set of seeds) be found in polynomial time?
  • For general distributions of the homologies, these problems are
    NP-hard.

26
How are the hits found efficiently?
  • Put all the seeds of database in a lookup table.
  • For each seed in the query, find all the
    occurrences of the seed in the database by
    looking at the lookup table.
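
The lookup-table idea can be sketched with a dictionary keyed by the bases at the seed's required positions. This is a minimal illustration, assuming a single seed model and plain Python strings; the function names are hypothetical and the real PatternHunter data structures are not described in these slides.

```python
from collections import defaultdict


def spaced_key(seq, start, required):
    """Bases of `seq` at the seed's required positions, from `start`."""
    return "".join(seq[start + i] for i in required)


def build_index(database, seed):
    """Map each spaced key to the database offsets where it occurs."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    index = defaultdict(list)
    for start in range(len(database) - len(seed) + 1):
        index[spaced_key(database, start, required)].append(start)
    return index, required


def find_hits(query, index, required, seed_len):
    """Return (query_offset, database_offset) pairs that share a key."""
    hits = []
    for q in range(len(query) - seed_len + 1):
        for d in index.get(spaced_key(query, q, required), ()):
            hits.append((q, d))
    return hits


SEED = "111010010100110111"
DATABASE = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"
QUERY = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"

index, required = build_index(DATABASE, SEED)
print(find_hits(QUERY, index, required, len(SEED)))
```

Each query position costs one dictionary lookup, so the work is proportional to the query length plus the number of hits, rather than to the product of the two sequence lengths.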