PatternHunter: faster and more sensitive homology search

1
PatternHunter: faster and more sensitive homology search
  • By Bin Ma, John Tromp and Ming Li

B92902019, B92902033, B92902039, B92902072, B92902086, B92902087
2
Agenda
  • PatternHunter
  • Spaced Seed
  • Algorithm
  • Performance
  • PatternHunter II
  • Algorithm
  • Performance
  • Translated PatternHunter

3
PatternHunter Spaced Seed
4
Outline
  • A short review of BLAST
  • Some definitions and background
  • Differences and similarities between BLAST and PatternHunter
  • Why is PatternHunter better?
  • Nonconsecutive seeds
  • Proof

5
Blast Algorithm
  • Find seeded matches
  • Extend to HSPs (High-scoring Segment Pairs)
  • Gapped extension by dynamic programming
  • Report significant local alignments

6
A short review about BLAST
  • Find hits.
  • BLAST first scans the database for words that
    score at least T when aligned with some word
    within the query sequence. Any aligned word pair
    satisfying this condition is called a hit.

7
A short review about BLAST
  • Find HSPs
  • An HSP (High-scoring Segment Pair) is much longer
    than a single word pair, and may therefore
    entail multiple hits on the same diagonal within
    a relatively short distance of one another.

8
A short review about BLAST
  • Generate gapped alignment
  • This means that two or more HSPs in BLAST with
    scores well below 38 bits can, in combination,
    rise to statistical significance. If any one of
    these HSPs is missed, so may be the combined
    result.

9
A short review about BLAST
  • In summary, the new gapped BLAST algorithm
    requires two non-overlapping hits of score at
    least T, within a distance A of one another, to
    invoke an ungapped extension of the second hit.
    If the resulting HSP has a normalized score of at
    least $S_g$ bits, a gapped extension is triggered.

10
Some definitions, some background
  • Similarity
  • How similar are two sequences?
  • Usually means the probability that the same
    symbol appears at a given position of the two sequences.
  • Sensitivity
  • The probability of finding a local alignment.
  • Specificity
  • Of all local alignments found, the fraction that are
    truly homologous.

11
Define the Seed
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • Defining the seed
  • w -> weight, or number of positions to match
  • Blastn: 11; MegaBlast: 28
  • model -> relative positions of the letters for each w
  • m -> length of the model window

12
Seed Parameters
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • Model: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1
  • w = 11
  • m = 18
  • Letters are 0 or 1: 1 = exact match required; 0 = no match required, any value
  • This is PatternHunter's most sensitive model
  • The Blastn seed is all 1s
13
Seed, Hit, Homology
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • What is a seed?
  • Seeds determine how an algorithm looks for hits
  • What is a hit?
  • A hit indicates a similarity that may reflect a
    homology

14
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
Human-Mouse genome homology (a hit is marked on the slide)
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
15
Example
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • Consider the following two sequences
  • GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
  • GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
  • What is the difference between BLAST and
    PatternHunter in finding seed hits?

16
BLAST uses consecutive seeds
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • In BLAST, we often use the consecutive model with
    weight 11.
  • GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
  • GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
  • The seed 11111111111 is slid along the alignment, but the
    longest run of consecutive matches here is only 9.
  • So the consecutive seed fails to find the alignment between
    the two sequences.

17
Consecutive seeds
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • There is also a dilemma for BLAST-type searches.
  • Dilemma
  • Sensitivity needs shorter seeds
  • but these give too many random hits and slow computation
  • Speed needs longer seeds
  • but these lose distant homologies

18
PatternHunter uses non-consecutive seed
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • In PatternHunter, we often use the spaced model
    with weight 11 and length 18.
  • GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
  • GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
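  • 111010010100110111 (placed under the sequences starting at the
    8th base, all eleven 1-positions fall on matches, so the seed hits)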
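
The contrast between the two seeds on this example can be checked with a short Python sketch. It only compares the two sequences position by position (no gaps, same offset in both), and the helper names `match_string` and `seed_hits` are illustrative, not part of PatternHunter.

```python
def match_string(s, t):
    """Return a 0/1 string marking the positions where s and t agree."""
    return "".join("1" if a == b else "0" for a, b in zip(s, t))


def seed_hits(seed, s, t):
    """Return every offset at which all 1-positions of the seed are matches."""
    matches = match_string(s, t)
    required = [k for k, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(matches) - len(seed) + 1)
        if all(matches[i + k] == "1" for k in required)
    ]


# The two example sequences from the slides.
S = "GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT"
T = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"

BLAST_SEED = "1" * 11                # consecutive model, weight 11
PH_SEED = "111010010100110111"       # spaced model, weight 11, length 18

print("consecutive seed hits:", seed_hits(BLAST_SEED, S, T))
print("spaced seed hits:     ", seed_hits(PH_SEED, S, T))
```

Offsets are 0-based, so an offset of 7 corresponds to the 8th base.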

19
Consecutive vs. Nonconsecutive?
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002
  • The non-consecutive seed is the primary
    difference and strength of PatternHunter
  • Blastn: 1 1 1 1 1 1 1 1 1 1 1
  • PatternHunter: 1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

20
A trivial comparison between spaced and consecutive seeds
Reference Ming Li, NHC2005
  • Consider 111 and 1101.
  • To make seed 111 fail, we can use
  • 110110110110
  • 66.66% similarity
  • But we can prove that seed 1101 hits every sufficiently
    long region with similarity above 61%.

21
Proof
Reference Ming Li, NHC2005
  • Suppose there is a region of length 100 that is not
    hit by 1101.
  • We can break the region into blocks of the form $1^a 0^b$.
    Apart from the last block, every block falls into one of
    the following cases
  • $1\,0^b$ for $b \ge 1$
  • $11\,0^b$ for $b \ge 2$
  • $111\,0^b$ for $b \ge 2$
  • In each such block, the similarity is $\le 3/5$.
  • The last block has at most 3 matches.
  • So in total there are at most 61 matches in 100
    positions, and the similarity is $\le 61\%$.
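
The bound can also be checked numerically. The sketch below is not the block argument from the slide: it is a small dynamic program over the last few bits that finds the largest number of matches a 0/1 region can contain without being hit. The function names are illustrative.

```python
from itertools import product


def is_hit(seed, window):
    """True if every 1-position of the seed falls on a '1' of the window."""
    return all(w == "1" for s, w in zip(seed, window) if s == "1")


def max_matches_without_hit(seed, n):
    """Largest number of 1s in a length-n 0/1 string that the seed never hits.

    State = the last len(seed)-1 bits; value = best number of 1s so far.
    Assumes n >= len(seed) - 1.
    """
    m = len(seed)
    # Strings shorter than the seed cannot be hit, so every start state is valid.
    best = {"".join(bits): bits.count("1") for bits in product("01", repeat=m - 1)}
    for _ in range(n - (m - 1)):
        nxt = {}
        for state, ones in best.items():
            for c in "01":
                window = state + c
                if is_hit(seed, window):
                    continue  # appending c would create a hit
                key = window[1:]
                nxt[key] = max(nxt.get(key, -1), ones + (c == "1"))
        best = nxt
    return max(best.values())


print(max_matches_without_hit("111", 100))   # consecutive seed of weight 3
print(max_matches_without_hit("1101", 100))  # spaced seed of weight 3
```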

22
Formalize
Reference Ming Li, NHC2005
  • Given an i.i.d. sequence (homology region) with
    Pr(1) = p and Pr(0) = 1 - p for each bit
  • 1100111011101101011101101011111011101
  • Which seed is more likely to hit this region?
  • BLAST seed: 11111111111
  • Spaced seed: 111010010100110111
23
Expect Less, Get More
Reference Ming Li, NHC2005
  • Lemma: The expected number of hits of a weight-W,
    length-M seed model within a length-L region with
    homology level p is
  • $(L - M + 1)\,p^W$
  • Proof. $E(\text{hits}) = \sum_{i=1}^{L-M+1} p^W = (L - M + 1)\,p^W$
  • Example: in a region of length 64 with p = 0.7
  • Pr(BLAST seed hits) = 0.3
  • E(# of hits by BLAST seed) = 1.07
  • Pr(optimal spaced seed hits) = 0.466, 50% more
  • E(# of hits by spaced seed) = 0.93, 14% less
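
The two expected values in the example follow directly from the lemma. A minimal Python check (the function name `expected_hits` is illustrative):

```python
def expected_hits(L, M, W, p):
    """Expected number of hits of a weight-W, length-M seed in a length-L region."""
    return (L - M + 1) * p ** W


# BLAST seed: weight 11, length 11. Spaced seed: weight 11, length 18.
print(round(expected_hits(64, 11, 11, 0.7), 2))
print(round(expected_hits(64, 18, 11, 0.7), 2))
```

The spaced seed is longer, so it has fewer starting positions and therefore fewer expected hits, even though its probability of hitting at least once is higher.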

24
Why Is Spaced Seed Better?
Reference Ming Li, NHC2005
  • A wrong, but intuitive, proof: seed s, interval
    I, similarity p
  • $E(\text{hits}) = \Pr(s \text{ hits}) \times E(\text{hits} \mid s \text{ hits})$
  • Thus
  • $\Pr(s \text{ hits}) = L\,p^w / E(\text{hits} \mid s \text{ hits})$
  • For the optimized spaced seed, $E(\text{hits} \mid s \text{ hits})$ is
    obtained by shifting the seed against itself:

    Shift   Non-overlapping 1s   Prob
    1       6                    $p^6$
    2       6                    $p^6$
    3       6                    $p^6$
    4       7                    $p^7$
    ...
  • For the spaced seed the divisor is $1 + p^6 + p^6 + p^6 + p^7 + \dots$
  • For the BLAST seed the divisor is bigger: $1 + p + p^2 + p^3 + \dots$
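
The non-overlap counts in this argument can be reproduced by shifting a seed against itself and counting the 1-positions of the shifted copy that do not land on a 1-position of the original. A small sketch, with `non_overlap` as an illustrative name:

```python
def non_overlap(seed, shift):
    """Number of 1-positions of the shifted seed not covered by the original seed."""
    ones = {i for i, c in enumerate(seed) if c == "1"}
    shifted = {i + shift for i in ones}
    return len(shifted - ones)


PH_SEED = "111010010100110111"
BLAST_SEED = "1" * 11

print([non_overlap(PH_SEED, k) for k in range(1, 5)])
print([non_overlap(BLAST_SEED, k) for k in range(1, 5)])
```

A shifted consecutive seed needs only one extra match per position of shift, while a shifted spaced seed needs six or seven, which is why hits of the spaced seed cluster less.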

25
Simulated sensitivity curves
Reference Ming Li, NHC2005
26
Observations of spaced seeds
Reference Ming Li, NHC2005
  • Seed models with different shapes can detect
    different homologies.
  • Two consequences
  • Some models may detect more homologies than
    others
  • -> more sensitive homology search
  • PatternHunter I
  • Several seed models can be used simultaneously to hit
    more homologies
  • -> approaching 100% sensitive homology search
  • PatternHunter II

27
PatternHunter Algorithm Performance
28
Outline
  • Hit generation
  • Hit extension
  • Gapped extension
  • Performance

29
Hit generation
  • Index created for each position in the query
    sequence

30
Hit generation
  • Similar to MegaBlast: hash tables
  • Encode A, T, C, G in binary as
  • 00, 01, 10, 11 respectively
  • Compute the seed value at each position of one of the
    sequences and record the offsets in the hash table

31
Hit generation
  • An example
  • Now we want to find hits between sequences S and
    T

32
Spaced seed
  • For sequence T
  • Encoding: A = 00, T = 01, C = 10, G = 11
  • Scan: A T A T G C A T
  • Model: 1 1 0 1 0 1 1 0
  • Seed: A T T C A
  • Value: 0001011000 = 88
  • Weight = 5 -> the value is between 0 and $2^{10} - 1$
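
The encoding and lookup steps can be sketched in Python as follows. This is a simplified illustration rather than PatternHunter's implementation: it uses a dictionary instead of a fixed array of $4^w$ entries, assumes the sequences contain only A, T, C and G, and the names `spaced_key`, `build_index` and `find_hits` are made up for the example.

```python
from collections import defaultdict

# 2-bit encoding from the slide: A = 00, T = 01, C = 10, G = 11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def spaced_key(window, model):
    """Pack the letters under the 1-positions of the model into an integer."""
    key = 0
    for base, m in zip(window, model):
        if m == "1":
            key = (key << 2) | CODE[base]
    return key


def build_index(t, model):
    """Map each key value to the list of positions in t where it occurs."""
    index = defaultdict(list)
    for pos in range(len(t) - len(model) + 1):
        index[spaced_key(t[pos:pos + len(model)], model)].append(pos)
    return index


def find_hits(s, index, model):
    """Return (position in s, position in t) pairs that share a key."""
    hits = []
    for i in range(len(s) - len(model) + 1):
        key = spaced_key(s[i:i + len(model)], model)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


MODEL = "11010110"
print(spaced_key("ATATGCAT", MODEL))  # the slide's example window
index = build_index("GGATATGCATGG", MODEL)
print(find_hits("ATATGCAT", index, MODEL))
```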
33
After filling in the hash table
  • Hash table (index value -> positions in T):
  • 0 -> 10, 19, 34
  • 1 -> (NULL)
  • 2 -> 14
  • 3 -> 10, 48, 134
  • ...
  • 87 -> ...
  • 88 -> 2, 8, 33
  • For each position in S
  • 1. Calculate the integer value
  • 2. Find hits in S by looking up the value
34
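Hash tables space required
  • $4^w$ integers + $|T|$ integers
  • Total: $4^{w+1} + 4|T|$ bytes
  • [Figure: the hash table drawn as linked lists of positions in T:
    0 -> 34 -> 19 -> 10; 1 -> (NULL); 2 -> 14; 3 -> 134 -> 48 -> 10; ...;
    88 -> 33 -> 8 -> 2]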
35
Does building the hash table cost a lot?
  • If the number of hits found for one index is
    large, the cost of computing the index is relatively
    negligible.

36
Hit extension
  • HSP: High-scoring Segment Pair
  • Scan the hits with a window, and choose the
    highest-scoring one.

37
Hit extension
[Figure: sequences S and T, with the chosen hit marked]
38
Hit extension
  • Set the mid point of the chosen hit as the cut
    point, split the graph into 4

39
Hit extension
[Figure: sequences S and T, split into 4 at the cut point]
40
Hit extension
  • Then run Smith-Waterman in 2 of the 4 parts,
    until the score reaches the dropoff score.

41
Hit extension
[Figure: sequences S and T; Smith-Waterman is run in two of the four parts]
  • Cost: 1/2 O(mn)
42
Hit extension
  • If the resulting segment pair has a score below
    a certain minimum, ignore it.
  • Otherwise we obtain an HSP and go on to the next step,
    gapped extension.

43
Hit extension
  • A question: when extending in 2 directions, how do we
    synchronize the scores?

44
Gapped Extension
  • To find the best way to extend an HSP to the left
    across gaps.
  • To extend an HSP we try all candidates from a
    diagonal-sorted set.
  • Penalties for gap open, gap extension, and cropping

45
Gapped Extension
Search front
46
From left to right
Optimal Left
Too Far Right
Too Far Right
Optimal Left
47
From left to right
Optimal Left
Too Far Right
48
Descriptions in the paper
  • We use a red-black tree for this.
  • An HSP is inserted when the optimal alignment to its left
    has been found
  • An HSP is retired from the tree once newly generated HSPs
    are too far beyond its right endpoint to make use
    of it.

49
Thought 1
  • The first one will be inserted -> fast

50
Thought 1
  • May not find the best one

End
Start
Better
Worse
51
Thought 2
  • Insert HSP when the optimal alignment to its left
    is found

Not complete HSP
52
Thought 2
Insert both HSPs
Far but long (Good)
Close but short (Bad)
Next turn
53
Thought 1
  • Retired alignments are put into a priority queue
    according to their scores.

Tree 1
Tree 2
54
Performance
Ref. Altschul,S.F. et al. (1997) Nucleic Acids
Res., 25, 3389-3402.
Ref. Bin Ma, John Tromp, Ming Li Bioinformatics
Vol. 18 no. 3 2002
55
PatternHunter II
56
Outline
  • Overview
  • PatternHunter II design
  • Computing hit probability
  • Finding seeds set
  • Seed performance
  • PHII performance

57
Overview
  • PatternHunter: spaced seed
  • PH2 is designed for better sensitivity: achieve a
    sensitivity approaching that of Smith-Waterman
    with a speed similar to the default Blastn
  • Extend the single spaced seed to multiple seeds
  • Two main problems
  • Large memory required for multiple hash tables
  • Complexity of finding the optimal seed combination

58
PatternHunter II design
  • A hash table is built for each seed
  • All hits generated from all hash tables are used
    for gap extension
  • In two-hit mode, two nearby hits can be from
    different hash tables

59
PatternHunter II design (cont.)
  • Large memory problem
  • Divide into smaller segments
  • e.g., with k = 8, w = 11, and n = 32 x 10^6,
  • the hash tables use about 256 MB of
    memory
  • Extend alignments across division boundaries
  • Alignments may still be lost

60
Computing hit probability
  • Use DP, but extend the algorithm from single seed
    to multiple seeds
  • Definition
  • Homologous region R with length L
  • The substring from i to j is denoted by R[i : j]
  • A set of k seeds A = {a_1, ..., a_k}
  • A hits R if there is an a_i that hits R
  • p is called the similarity level of R if each position
    of R is an identity with probability p

61
Computing hit probability (cont.)
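  • For a binary string b and $|b| \le i \le L$, define f(i, b) as the
    probability that A hits the prefix R[0 : i], given that this
    prefix ends with b
  • The goal is to find $f(L, \varepsilon)$, where $\varepsilon$ is the empty string
  • For any $i > |b|$, we have $f(i, b) = (1 - p)\,f(i, 0b) + p\,f(i, 1b)$
  • So f(i, b) can be computed from other values f(i', b')
    computed earlier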

62
Computing hit probability (cont.)
  • Definition
  • b is compatible with a seed a if $b[|b| - j] = 1$
    whenever $a[|a| - j] = 1$, for $0 < j \le \min(|a|, |b|)$
  • Define
  • B: the set of binary strings that are not hit
    by A but are compatible with some a in A.
  • B(x): the longest proper prefix of x that is in B

63
Computing hit probability (cont.)
  • First, $\varepsilon$ is in B
  • Suppose b is in B; then b is compatible with some
    a in A by definition. Therefore, 1b is also
    compatible with some a in A
  • If 1b is not in B, it must be hit by some a in A, so
    $f(i, 1b) = 1$
  • If 0b is not in B, it cannot be hit by A,
    therefore it cannot be compatible with any a in
    A, so $f(i, 0b) = f(i - |b| + |b'|, 0b')$, where $0b' = B(0b)$
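
To make the idea of the computation concrete, here is a simplified Python sketch. It is not the Algorithm DP of the slides: instead of restricting the states to the set B, it tracks every suffix of length up to (longest seed - 1), which is only practical for short seeds. All names are illustrative.

```python
def ends_with_hit(seed, s):
    """True if the seed, aligned to the end of s, has all its 1-positions on 1s."""
    if len(s) < len(seed):
        return False
    tail = s[len(s) - len(seed):]
    return all(t == "1" for a, t in zip(seed, tail) if a == "1")


def hit_probability(seeds, L, p):
    """Pr(some seed hits a random 0/1 region of length L), each bit 1 with prob. p."""
    k = max(len(a) for a in seeds) - 1  # bits of history needed
    # miss[suffix] = probability that the prefix read so far is not hit
    # and ends with this suffix.
    miss = {"": 1.0}
    for _ in range(L):
        nxt = {}
        for suffix, prob in miss.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                s = suffix + bit
                if any(ends_with_hit(a, s) for a in seeds):
                    continue  # this prefix is hit; it no longer counts as a miss
                key = s[-k:] if k > 0 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * q
        miss = nxt
    return 1.0 - sum(miss.values())


print(hit_probability(["111"], 20, 0.7))
print(hit_probability(["1101"], 20, 0.7))
print(hit_probability(["111", "1101"], 20, 0.7))
```

For a weight-11 seed of length 18 this naive version may visit up to $2^{17}$ suffixes per position, which is the kind of cost that restricting the states to B is meant to avoid.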

64
Computing hit probability (cont.)
Ref. Li,M. et al. (2004) J. Bioinform. Comput. Biol., 2,
417-440.
65
Computing hit probability (cont.)
  • Can also compute k-hits probability
  • Change f(i,b) to f(i,b,k)
  • We already have the case k = 1. By induction, compute each
    f(i, b, k) from f(i, b, k-1)

66
Computing hit probability (cont.)
Ref. Li,M. et al. (2004) J. Bioinform. Comput. Biol., 2,
417-440.
67
Computing hit probability (cont.)
  • Complexity
  • It is proved that computing the hit probability
    of multiple seeds is NP-hard
  • The time complexity of the algorithm is which

68
Computing hit probability (cont.)
  • Algorithm DP implemented on a PC
  • It took 0.70 sec to compute the hit probability for a
    set of 16 weight-11 seeds with length $\le 21$ on a
    random region of length 64
  • It took only 0.37 sec for the same number of seeds
    and the same length when the weight was changed to 12
  • The running time largely depends on the maximum
    number of 0s in any seed

69
Finding seeds set
  • We cannot enumerate all possible seed sets with
    Algorithm DP
  • Their number is exponential!
  • Also, finding the optimal spaced seed set has been
    proved NP-hard
  • Use a greedy method

70
Finding seeds set (cont.)
  • Compute the first seed a1, which maximizes the hit
    probability of the set {a1}
  • Then compute the second seed a2 for the set {a1,
    a2}, then a3, and so on
  • Compute ai until
  • the desired number of seeds is reached, or
  • the desired hit probability is reached
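
The greedy procedure can be sketched in Python as below. The hit-probability routine is a naive stand-in for Algorithm DP, and the toy sizes (weight 3, length at most 5, region length 16) are assumptions chosen so that the example runs instantly; the function names are illustrative.

```python
from itertools import combinations


def hit_probability(seeds, L, p):
    """Naive DP over suffixes: Pr(some seed hits a random region of length L)."""
    k = max(len(a) for a in seeds) - 1

    def hit_at_end(a, s):
        tail = s[len(s) - len(a):]
        return len(s) >= len(a) and all(t == "1" for c, t in zip(a, tail) if c == "1")

    miss = {"": 1.0}
    for _ in range(L):
        nxt = {}
        for suffix, prob in miss.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                s = suffix + bit
                if any(hit_at_end(a, s) for a in seeds):
                    continue
                key = s[-k:] if k > 0 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * q
        miss = nxt
    return 1.0 - sum(miss.values())


def candidate_seeds(weight, max_len):
    """All seeds of the given weight (>= 2) that start and end with 1."""
    seeds = []
    for length in range(weight, max_len + 1):
        for ones in combinations(range(1, length - 1), weight - 2):
            seed = ["0"] * length
            seed[0] = seed[-1] = "1"
            for i in ones:
                seed[i] = "1"
            seeds.append("".join(seed))
    return seeds


def greedy_seed_set(candidates, k, L, p):
    """Add, one at a time, the seed that most increases the hit probability."""
    chosen = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        best = max(remaining, key=lambda c: hit_probability(chosen + [c], L, p))
        chosen.append(best)
    return chosen


seeds = greedy_seed_set(candidate_seeds(3, 5), k=2, L=16, p=0.5)
print(seeds, hit_probability(seeds, 16, 0.5))
```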

71
Finding seeds set (cont.)
  • The greedy method may not optimize the hit probability
  • It is still time-consuming
  • e.g. it took 12 CPU days on a Pentium 4 3GHz PC
    to compute a set of 16 weight-11 seeds, each
    no longer than 21
  • It takes much longer if the seeds become
    slightly longer
  • A different approach is needed

72
Finding seeds set (cont.)
  • Suppose we already have N seeds, and C is the
    candidate set for the (N+1)-th seed
  • For each c in C, estimate the hit probability on
    m random region samples
  • m is reasonably large, such as 500
  • Remove the worst-performing half of C, and
    increase m to 2m
  • Repeat until only one seed is left
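
A minimal Python sketch of this sampling-and-halving loop follows. The random regions are i.i.d. 0/1 strings as in the earlier "Formalize" slide; the function names, the fixed random seed and the small sample sizes in the usage lines are assumptions made for the example.

```python
import random


def hits(seeds, region):
    """True if any seed hits the 0/1 region at any offset."""
    for a in seeds:
        ones = [j for j, c in enumerate(a) if c == "1"]
        for i in range(len(region) - len(a) + 1):
            if all(region[i + j] == "1" for j in ones):
                return True
    return False


def estimate(seeds, samples):
    """Fraction of the sampled regions hit by the seed set."""
    return sum(hits(seeds, r) for r in samples) / len(samples)


def select_next_seed(current, candidates, L, p, m=500, rng=None):
    """Halve the candidate pool repeatedly while doubling the sample size."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > 1:
        samples = [
            "".join("1" if rng.random() < p else "0" for _ in range(L))
            for _ in range(m)
        ]
        pool.sort(key=lambda c: estimate(current + [c], samples), reverse=True)
        pool = pool[: (len(pool) + 1) // 2]  # keep the better half
        m *= 2
    return pool[0]


candidates = ["111", "1101", "1011", "11001", "10101", "10011"]
print(select_next_seed([], candidates, L=16, p=0.7, m=50))
```

Weak candidates are discarded after a cheap, noisy estimate, so the expensive large samples are spent only on the few candidates that are hard to tell apart.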

73
Seed performance
  • Two ways to increase the sensitivity
  • Increase the number of seeds
  • Reduce the weight of a single seed
  • Both increase running time
  • The sensitivity of doubling the number of seeds
    is approximately equal to reducing the weight of
    a single seed by 1
  • At high level, doubling the number of seeds
    achieves better sensitivity

74
Seed performance (cont.)
  • From low to high
  • Solid curves: using the first k (= 1, 2, 4, 8, 16)
    weight-11 seeds
  • Dashed curves: single optimal seeds of weight w (= 10, 9, 8,
    7)

Ref. Li,M. et al. (2004) J. Bioinform. Comput. Biol., 2,
417-440.
75
Comparison
  • Sensitivity / Speed
  • PatternHunter II
  • Blast
  • Smith-Waterman algorithm
  • SSearch

76
SSearch Configuration
  • Smith-Waterman algorithm
  • A sub-program in the FASTA package
  • FASTA package
  • ftp://ftp.virginia.edu/pub/FASTA/

77
Common Environment
  • Score scheme
  • Match: 1
  • Mismatch: -1
  • Gap open: -5
  • Gap extension: -1
  • Local alignments with scores $\ge 16$

78
Common Environment
  • DNA sequences
  • 2 sets of human and mouse EST sequences
  • ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
  • month.est_human.Z
  • month.est_mouse.Z
  • Pentium IV 3GHz Linux PC

79
Term Explanation
  • EST
  • Expressed Sequence Tag
  • A unique stretch of DNA within a coding region of
    a gene that is useful for identifying that gene.
  • A short sub-sequence of a transcribed sequence.

80
Term Explanation
  • Coding Regions
  • Regions of DNA/RNA sequences that code for
    proteins. Usually starts with a start codon (ATG)
    and ends with a stop codon.
  • The coding region of a gene is the portion of DNA
    that is transcribed into mRNA and translated into
    proteins.

81
Repeat Masking
  • Fact
  • ESTs contain long runs of identical letters
  • Especially of As and Ts
  • Example (will be shown later)
  • Solution
  • Replace every run of ten or more identical
    letters with Ns.
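
The masking rule can be sketched with a regular expression. The function name `mask_repeats` and the `min_run` parameter are illustrative; the threshold of ten comes from the slide.

```python
import re


def mask_repeats(seq, min_run=10):
    """Replace every run of min_run or more identical letters with Ns."""
    # (.) captures one letter; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_repeats("ACGT" + "A" * 12 + "CGT"))
print(mask_repeats("ACGT" + "A" * 9 + "CGT"))  # run too short, left unchanged
```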

82
SSearch Result
  • Number of human ESTs: 4
  • Number of mouse ESTs: 2005
  • EST example (show)

Ref. Li,M. et al. (2004) J. Bioinform. Comput. Biol., 2,
417-440.
83
Optimal Versus Sub-Optimal
  • Neither PatternHunter nor Blast tries to compute
    the optimal alignments for the homologies they
    have found.
  • Q: Why not find the optimal alignments?
  • A:
  • Use Blast or PH2 to detect the homologies, then compute the
    optimal alignments.

84
Found
  • SSearch finds a local alignment
  • with score x
  • PatternHunter II finds a local alignment
  • with score $\ge x/2$
  • Then the alignment counts as found for that pair of ESTs

85
Sensitivity Definition
  • Smith-Waterman
  • Finds y pairs of ESTs
  • with local alignment score at least x
  • Other programs
  • y' of the y pairs can be found
  • with alignment score $\ge x/2$
  • Sensitivity = ratio y' / y

86
Blastn Configuration
  • Version 2.2.6
  • From NCBI's website
  • -F F option
  • To turn off the low-complexity region filtering
  • Weight 11 seeds
  • 11111111111

87
Speed comparison
Ref. Li,M. et al. (2004) J. Bioinform. Comput. Biol., 2,
417-440.
88
Sensitivity comparison
  • From low to high
  • Dashed: Blastn, seed weight 11
  • Solid: PH II with 1, 2, 4, 8 seeds of weight 11

89
Compare with other seeds
  • From left to right
  • PH II, two weight 11 seeds
  • PH II, one weight 10 seed
  • 1101100101000101101
  • HMM model

90
Seed Selection
  • Use heuristic or exponential time algorithms
  • For general seed selection problem
  • PTAS
  • polynomial time approximation scheme

91
Homology Search
  • Time-consuming
  • DNA-DNA searches
  • Blastn
  • translated DNA-protein searches
  • tBlastx
  • tPH
  • protein-protein searches
  • Small query and database sizes

92
Conclusion
  • Optimized spaced seeds
  • Blastn vs. PH II
  • Same sensitivity
  • Speeds up by 5-100 times
  • Optimized multiple spaced seeds
  • PH II vs. Smith-Waterman
  • Approximately the same sensitivity
  • More than 1000 times faster

93
Translated PatternHunter
94
Outline
  • What's translated search?
  • BLAST's translated search
  • Translated Pattern Hunter
  • Performance

95
What's translated search?
  • To translate a DNA sequence into a protein
    sequence for alignment with another protein
    sequence
  • But what's translation?

96
What's translation?
  • In biology, translation means to translate DNA
    into amino acids (AA) with a universal genetic
    code map, one codon (3 nucleotides) at a time.
  • The DNA sequence is first transcribed into an RNA
    sequence in which all Ts are replaced by Us

97
The Genetic code
  • We can use translation in homology search since
    the genetic code is universal
  • Degeneracy: several DNA codons map to the same AA
  • They usually differ in the third codon position
  • Translation is one-way: DNA -> protein

98
Why do we need translated search?
  • When a DNA database or a protein database is not
    available
  • Blastx: DNA query, protein database
  • tBlastn: protein query, DNA database
  • To find very distant homologies
  • tBlastx: DNA query and DNA database, both translated
  • Slowest, but detects functional and structural homology
    in addition to sequence homology
  • Why?

99
Substitution Matrix
  • Some AAs are similar in their chemical or
    physical properties
  • Not only match/mismatch in substitution anymore!
  • Stop codon is assigned the most negative score in
    BLAST and tPH
  • PAM (Point Accepted Mutation)
  • Based on global alignment of closely related
    proteins (1% divergence for PAM1)
  • BLOSUM (BLOck SUbstitution Matrix)
  • Based on local alignment of divergent proteins
    (62% similarity for BLOSUM 62)

100
Substitution Matrix
  • Short alignments need to be relatively strong to
    rise above background noise, so they can only detect
    closely related homologies

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)
Adapted from NCBI substitution matrix
101
BLAST's translated search
  • The same in tBlast, tBlastn, tBlastx
  • Aligns the 6-frame translations of the DNA
    sequence against another protein sequence

102
Reading Frame of DNA Sequence
  • The DNA sequence can be read in six reading
    frames, three in the forward and three in the
    reverse direction.

Open Reading Frame
103
BLAST's translated search
  1. Translate the DNA sequence into all 6 possible
    frames
  2. Align each frame against the protein sequence,
    just like BLASTp.
  3. The pairs with significant scores are reported
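
Step 1 can be sketched in Python as follows. The codon table is the standard genetic code written in the usual compact TCAG order; the function names and the use of "X" for codons containing unknown letters are choices made for this example.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA (or RNA) string codon by codon, starting at position 0."""
    dna = dna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    dna = dna.upper().replace("U", "T")
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Return the three forward and three reverse reading-frame translations."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


for name, protein in sorted(six_frame_translations("ATGAAATAGC").items()):
    print(name, protein)
```

Each of the six translated strings would then be aligned against the protein sequence, as in steps 2 and 3.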

104
How good is significant?
  • The expected number of alignments scoring S or
    greater between two sequences of lengths m and n is
  • $E = mnKe^{-\lambda S}$ or $E = mne^{-S'}$, where $S' = \lambda S - \ln K$ is the normalized score
  • where $K$, $\lambda$, used for normalization, depend on the
    sequence composition
  • Different $K$, $\lambda$ are used for each frame
  • Non-coding sequences tend to yield alignments of
    marginal significance

105
Translated PatternHunter
  • The version of PH for translated search
  • Compared with PatternHunter, tPH uses very
    different algorithms for hit generation and
    gapped extensions

106
Hit Generation in tPH
  • Weight 5 instead of 11
  • Space complexity: $20^5$ (vs. $4^{11}$ in PH)
  • Length 6 or 7
  • Does not require exact matches
  • Hit: all five pairs have scores $\ge 0$ and the
    total score is above a tolerance T
  • Use BLOSUM 62
  • Multiple seeds are used

107
Hit Generation in tPH
  • Seed 1011, T = 7
[Figure: a query sequence and an indexed subject sequence, with all possible hits between them marked]
108
Gapped Extension in tPH
  • The same as in BLAST?
  • BLAST can't handle frame shift errors
  • Huh?

109
Frame Shift Error
  • When a single DNA base is deleted or inserted, it causes
    the reading frame to shift
[Figure: a nucleotide sequence illustrating the shifted reading frame]
  • BLAST can't detect such variation
  • It aligns the 6 frames with the subject independently
  • In fact, most frame shift mutations
    completely abolish the protein's function
  • They are usually lethal

110
Frame Shift Error
  • In this example
  • BLAST can only find at most two separated
    segments
  • tPH can connect them with a single deletion of
    C
  • How?

111
Gapped Extension in tPH
  • tPH regards the DNA sequence as a sequence of
    overlapping codons
  • It uses a modified Smith-Waterman algorithm that can
    take frame shifts into account
  • Substitution: $S(i-1, j-3) + s(p_i, n_{j-2..j})$
  • Insertion of DNA: $S(i, j-1) + \text{frameshift}$
  • Insertion of DNA: $S(i, j-2) + \text{frameshift}$
  • Insertion of AA: $S(i, j-3) + \text{gap}$
  • Deletion of AA: $S(i-1, j) + \text{gap}$
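
The five cases can be turned into a small local-alignment dynamic program. The Python sketch below is a simplified illustration under stated assumptions, not tPH's implementation: the substitution score is a hypothetical match/mismatch value (+2 / -1) instead of a BLOSUM matrix, the frameshift and gap penalties default to -1 and -2, proteins are written in one-letter codes, and only the best score is returned.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frameshift_local_score(protein, dna, match=2, mismatch=-1,
                           frameshift=-1, gap=-2):
    """Best local score of protein vs. DNA, allowing frame shifts.

    S[i][j] = best score of an alignment ending at protein[:i] and dna[:j].
    """
    dna = dna.upper().replace("U", "T")
    rows, cols = len(protein) + 1, len(dna) + 1
    S = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            cands = [0]  # local alignment: a cell may start fresh
            if i >= 1 and j >= 3:  # substitution: one AA against one codon
                aa = CODON_TABLE.get(dna[j - 3:j], "X")
                s = match if aa == protein[i - 1] else mismatch
                cands.append(S[i - 1][j - 3] + s)
            if j >= 1:  # one extra nucleotide: frame shift
                cands.append(S[i][j - 1] + frameshift)
            if j >= 2:  # two extra nucleotides: frame shift
                cands.append(S[i][j - 2] + frameshift)
            if j >= 3:  # a whole extra codon: gap
                cands.append(S[i][j - 3] + gap)
            if i >= 1:  # an amino acid with no codon: gap
                cands.append(S[i - 1][j] + gap)
            S[i][j] = max(cands)
            best = max(best, S[i][j])
    return best


print(frameshift_local_score("DT", "GACACT"))   # Asp-Thr in frame
print(frameshift_local_score("DT", "GACAACT"))  # one inserted A shifts the frame
```

With an inserted nucleotide, the two codon matches are still joined into one alignment at the cost of a single frameshift penalty, rather than being reported as two separate segments.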

112
Scoring Scheme
  • n = GACACUAGAAUCG
  • P = Asp Arg Tyr Ser
[Figure: dynamic-programming matrix with a first row of zeros; the remaining rows start 0 0 0 6 4 3, 0 0 0 8, 0 0 0 6 and 0 0 0 10]
Query    GAC ACU A-- GAA --- UCG
         Asp Thr --- Glu Tyr Ser
Subject  Asp --- --- Arg Tyr Ser
  • $S(i-1, j-3) + s(p_i, n_{j-2..j})$
  • $S(i, j-1) + \text{frameshift}$ (-1)
  • $S(i, j-2) + \text{frameshift}$ (-1)
  • $S(i, j-3) + \text{gap}$ (-2)
  • $S(i-1, j) + \text{gap}$ (-2)
113
Performance Evaluation
  • 4407 human expressed sequence tag (EST) sequences
  • Split in the middle as subject and query

114
Number of Alignments Found
  • T = 12 for BLAST
  • 3x speed
  • Higher sensitivity

Ref. Derek Kisman et al, Bioinformatics Vol. 21
no. 4 2005
115
Unique Alignment Found
  • Most contain frameshifts

Ref. Derek Kisman et al, Bioinformatics Vol. 21
no. 4 2005
116
Using 4 Seeds
  • Differs from PH2
  • Short seeds
  • High dependency between seeds

Ref. Derek Kisman et al, Bioinformatics Vol. 21
no. 4 2005
117
Reference
  • PatternHunter
  • Bin Ma, John Tromp, Ming Li Bioinformatics Vol.
    18 no. 3 2002
  • Ming Li, NHC2005
  • PatternHunter II
  • Li,M., Ma,B., Kisman,D. and Tromp,J. (2004)
    J. Bioinform. Comput. Biol., 2, 417-440.
  • PowerPoint by NTU R94922059

118
Reference
  • tPatternHunter
  • Derek Kisman, Ming Li, Bin Ma, and Li Wang,
    Bioinformatics Vol. 21 no. 4 2005
  • Others
  • Wikipedia http://en.wikipedia.org/wiki
  • NCBI http://www.ncbi.nlm.nih.gov

119
Thank you for your attention!