CSE182-L4: Keyword matching - PowerPoint PPT Presentation

1 / 60
About This Presentation
Title:

CSE182-L4: Keyword matching

Description:

Q: What is the score of the best alignment of the two strings s and t? ... Concatenate all of the database sequences to form one giant sequence. ... – PowerPoint PPT presentation

Number of Views:32
Avg rating:3.0/5.0
Slides: 61
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Category:

less

Transcript and Presenter's Notes

Title: CSE182-L4: Keyword matching


1
CSE182-L4 Keyword matching
2
Backward scoring
  • Defin Sbi,j Best scoring alignment of the
    suffixes si1..n and tj1..m
  • Q What is the score of the best alignment of the
    two strings s and t?
  • HW Write the recurrences for Sb

3
Forward/Backward computations
j
m
1
  • Fj Score of the best scoring alignment of
    s1..n/2 and t1..j
  • Fj Sn/2,j
  • Bj Score of the best scoring alignment of
    sn/21..n and tj1..m
  • Bj Sbn/2,j

n/2
4
Forward/Backward computations
j
m
1
  • At the optimal coordinate, j
  • FjBjSn,m
  • In O(nm) time, and O(m) space, we can compute one
    of the coordinates on the optimum path.

n/2
5
Forward, Backward computation
  • There exists a coordinate, j
  • FjBjSn,m
  • In O(nm) time, and O(m) space, we can compute one
    of the coordinates on the optimum path.

6
Linear Space Alignment
  • Align(1..n,1..m)
  • For all 1ltj lt m
  • Compute FjS(n/2,j)
  • For all 1ltj lt m
  • Compute BjSb(n/2,j)
  • j maxj FjBj
  • X Align(1..n/2,1..j)
  • Y Align(n/2..n,j..m)
  • Return X,j,Y

7
Linear Space complexity
  • T(nm) c.nm T(nm/2) O(nm)
  • Space O(m)

8
Summary
  • We considered the basics of sequence alignment
  • Opt score computation
  • Reconstructing alignments
  • Local alignments
  • Affine gap costs
  • Space saving measures
  • Can we recreate Blast?

9
Blast and local alignment
  • Concatenate all of the database sequences to form
    one giant sequence.
  • Do local alignment computation with the query.

10
Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
11
Why is Blast Fast?
12
Silly Question!
  • True or False No two people in new york city
    have the same number of hair

13
Observations
  • Much of the database is random from the querys
    perspective
  • Consider a random DNA string of length n.
  • PrAPrC PrGPrT0.25
  • Assume for the moment that the query is all As
    (length m).
  • What is the probability that an exact match to
    the query can be found?

14
Basic probability
  • Probability that there is a match starting at a
    fixed position i 0.25m
  • What is the probability that some position i has
    a match.
  • Dependencies confound probability estimates.

15
Basic ProbabilityExpectation
  • Q Toss a coin each time it comes up heads, you
    get a dollar
  • What is the money you expect to get after n
    tosses?
  • Let Xi be the amount earned in the i-th toss

16
Expected number of matches
  • Expected number of matches can still be computed.

i
  • Let Xi1 if there is a match starting at
    position i, Xi0 otherwise
  • Expected number of matches

17
Expected number of exact Matches is small!
  • Expected number of matches n0.25m
  • If n107, m10,
  • Then, expected number of matches 9.537
  • If n107, m11
  • expected number of hits 2.38
  • n107,m12,
  • Expected number of hits 0.5 lt 1
  • Bottom Line An exact match to a substring of the
    query is unlikely just by chance.

18
Observation 2
  • What is the pigeonhole principle?

19
Why is this important?
  • Suppose we are looking for sequences that are 80
    identical to the query sequence of length 100.
  • Assume that the mismatches are randomly
    distributed.
  • What is the probability that there is no stretch
    of 10 bp, where the query and the subject match
    exactly?
  • Rough calculations show that it is very low.
    Exact match of a short query substring to a truly
    similar subject is very high.
  • The above equation does not take dependencies
    into account
  • Reality is better because the matches are not
    randomly distributed

20
Just the Facts
  • Consider the set of all substrings of the query
    string of fixed length W.
  • Prob. of exact match to a random database string
    is very low.
  • Prob. of exact match to a true homolog is very
    high.
  • Keyword Search (exact matches) is MUCH faster
    than sequence alignment

21
BLAST
Database (n)
  • Consider all (m-W) query words of size W (Default
    11)
  • Scan the database for exact match to all such
    words
  • For all regions that hit, extend using a dynamic
    programming alignment.
  • Can be many orders of magnitude faster than SW
    over the entire string

22
Why is BLAST fast?
  • Assume that keyword searching does not consume
    any time and that alignment computation the
    expensive step.
  • Query m1000, random Db n107, no TP
  • SW O(nm) 1000107 1010 computations
  • BLAST, W11
  • E(11-mer hits) 1000 (1/4)11 1072384
  • Number of computations 23841001002.384107
  • Ratio1010/(2.384107)420
  • Further speed improvements are possible

23
Keyword Matching
  • How fast can we match keywords?
  • Hash table/Db index? What is the size of the hash
    table, for m11
  • Suffix trees? What is the size of the suffix
    trees?
  • Trie based search. We will do this in class.







AATCA
567
24
Related notes
  • How to choose the alignment region?
  • Extend greedily until the score falls below a
    certain threshold
  • What about protein sequences?
  • Default word size 3, and mismatches are
    allowed.
  • Like sequences, BLAST has been evolving
    continuously
  • Banded alignment
  • Seed selection
  • Scanning for exact matches, keyword search versus
    database indexing

25
P-value computation
  • How significant is a score? What happens to
    significance when you change the score function
  • A simple empirical method
  • Compute a distribution of scores against a random
    database.
  • Use an estimate of the area under the curve to
    get the probability.
  • OR, fit the distribution to one of the standard
    distributions.

26
Z-scores for alignment
  • Initial assumption was that the scores followed a
    normal distribution.
  • Z-score computation
  • For any alignment, score S, shuffle one of the
    sequences many times, and recompute alignment.
    Get mean and standard deviation
  • Look up a table to get a P-value

27
Blast E-value
  • Initial (and natural) assumption was that scores
    followed a Normal distribution
  • 1990, Karlin and Altschul showed that ungapped
    local alignment scores follow an exponential
    distribution
  • Practical consequence
  • Longer tail.
  • Previously significant hits now not so
    significant

28
Exponential distribution
  • Random Database, Pr(1) p
  • What is the expected number of hits to a sequence
    of k 1s
  • Instead, consider a random binary Matrix.
    Expected of diagonals of k 1s

29
  • As you increase k, the number decreases
    exponentially.
  • The number of diagonals of k runs can be
    approximated by a Poisson process
  • In ungapped alignments, we replace the coin
    tosses by column scores, but the behaviour does
    not change (Karlin Altschul).
  • As the score increases, the number of alignments
    that achieve the score decreases exponentially

30
Blast E-value
  • Choose a score such that the expected score
    between a pair of residues lt 0
  • Expected number of alignments with a particular
    score
  • For small values, E-value and P-value are the
    same

31
Blast Variants
  1. What is mega-blast?
  2. What is discontiguous mega-blast?
  3. Phi-Blast/Psi-Blast?
  4. BLAT?
  5. PatternHunter?

Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
32
Silly Quiz
  • Name a famous Bioinformatics Researcher
  • Name a famous Bioinformatics Researcher who is a
    woman

33
Scoring DNA
  • DNA has structure.

34
DNA scoring matrices
  • So far, we considered a simple match/mismatch
    criterion.
  • The nucleotides can be grouped into Purines (A,G)
    and Pyrimidines.
  • Nucleotide substitutions within a group
    (transitions) are more likely than those across a
    group (transversions)

35
Scoring proteins
  • Scoring protein sequence alignments is a much
    more complex task than scoring DNA
  • Not all substitutions are equal
  • Problem was first worked on by Pauling and
    collaborators
  • In the 1970s, Margaret Dayhoff created the first
    similarity matrices.
  • One size does not fit all
  • Homologous proteins which are evolutionarily
    close should be scored differently than proteins
    that are evolutionarily distant
  • Different proteins might evolve at different
    rates and we need to normalize for that

36
PAM 1 distance
  • Two sequences are 1 PAM apart if they differ in 1
    of the residues.

1 mismatch
  • PAM1(a,b) Prresidue b substitutes residue a,
    when the sequences are 1 PAM apart

37
PAM1 matrix
  • Align many proteins that are very similar
  • Is this a problem?
  • PAM1 distance is the probability of a
    substitution when 1 of the residues have changed
  • Estimate the frequency Pba of residue a being
    substituted by residue b.
  • S(a,b) log10(Pab/PaPb) log10(Pba/Pb)

38
PAM 1
39
PAM distance
  • Two sequences are 1 PAM apart when they differ in
    1 of the residues.
  • When are 2 sequences 2 PAMs apart?

2 PAM
40
Higher PAMs
  • PAM2(a,b) ?c PAM1(a,c). PAM1 (c,b)
  • PAM2 PAM1 PAM1 (Matrix multiplication)
  • PAM250
  • PAM1PAM249
  • PAM1250

41
Note This is not the score matrix What happens
as you keep increasing the power?
42
Scoring using PAM matrices
  • Suppose we know that two sequences are 250 PAMs
    apart.
  • S(a,b) log10(Pab/PaPb) log10(Pba/Pb)
    log10(PAM250(a,b)/Pb)

43
BLOSUM series of Matrices
  • Henikoff Henikoff Sequence substitutions in
    evolutionarily distant proteins do not seem to
    follow the PAM distributions
  • A more direct method based on hand-curated
    multiple alignments of distantly related proteins
    from the BLOCKS database.
  • BLOSUM60 Merge all proteins that have greater
    than 60. Then, compute the substitution
    probability.
  • In practice BLOSUM62 seems to work very well.

44
PAM vs. BLOSUM
  • What is the correspondence?
  • PAM1 Blosum1
  • PAM2 Blosum2
  • Blosum62
  • PAM250 Blosum100

45
Dictionary Matching, R.E. matching, and position
specific scoring
46
Keyword search
  • Recall In BLAST, we get a collection of keywords
    from the query sequence, and identify all db
    locations with an exact match to the keyword.
  • Question Given a collection of strings
    (keywords), find all occrrences in a database
    string where they keyword might match.

47
Dictionary Matching
1POTATO 2POTASSIUM 3TASTE
P O T A S T P O T A T O
database
dictionary
  • Q Given k words (si has length li), and a
    database of size n, find all matches to these
    words in the database string.
  • How fast can this be done?

48
Dict. Matching string matching
  • How fast can you do it, if you only had one word
    of length m?
  • Trivial algorithm O(nm) time
  • Pre-processing O(m), Search O(n) time.
  • Dictionary matching
  • Trivial algorithm (l1l2l3)n
  • Using a keyword tree, lpn (lp is the length of
    the longest pattern)
  • Aho-Corasick O(n) after preprocessing O(l1l2..)
  • We will consider the most general case

49
Direct Algorithm
P O P O P O T A S T P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
P O T A T O
  • Observations
  • When we mismatch, we (should) know something
    about where the next match will be.
  • When there is a mismatch, we (should) know
    something about other patterns in the dictionary
    as well.

50
The Trie Automaton
  • Construct an automaton A from the dictionary
  • Av,x describes the transition from node v to a
    node w upon reading x.
  • Au,T v, and Au,S w
  • Special root node r
  • Some nodes are terminal, and labeled with the
    index of the dictionary word.

1POTATO 2POTASSIUM 3TASTE
v
u
1
r
S
2
w
3
51
An O(lpn) algorithm for keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition
  • Increment current pointer
  • Move to a new node
  • If terminal node success
  • Else
  • Retract current pointer
  • Increment start pointer
  • Move to root repeat

52
Illustration
P O T A S T P O T A T O
v
1
S
53
Idea for improving the time
  • Suppose we have partially matched pattern i
    (indicated by l, and c), but fail subsequently.
    If some other pattern j is to match
  • Then prefix(pattern j) suffix first c-l
    characters of pattern(i))

c
l
P O T A S T P O T A T O
P O T A S S I U M
Pattern i
T A S T E
1POTATO 2POTASSIUM 3TASTE
Pattern j
54
Improving speed of dictionary matching
  • Every node v corresponds to a string sv that is a
    prefix of some pattern.
  • Define Fv to be the node u such that su is the
    longest suffix of sv
  • If we fail to match at v, we should jump to Fv,
    and commence matching from there
  • Let lpv su

2
3
4
5
1
S
11
6
7
9
10
8
55
An O(n) alg. For keyword matching
  • Start with the first position in the db, and the
    root node.
  • If successful transition
  • Increment current pointer
  • Move to a new node
  • If terminal node success
  • Else (if at root)
  • Increment current pointer
  • Mv start pointer
  • Move to root
  • Else
  • Move start pointer forward
  • Move to failure node

56
Illustration
P O T A S T P O T A T O
l
c
1
P
O
T
A
T
O
v
T
S
U
I
S
M
A
S
E
T
57
Time analysis
  • In each step, either c is incremented, or l is
    incremented
  • Neither pointer is ever decremented (lpv lt
    c-l).
  • l and c do not exceed n
  • Total time lt 2n

l
c
P O T A S T P O T A T O
58
Blast Putting it all together
  • Input Query of length m, database of size n
  • Select word-size, scoring matrix, gap penalties,
    E-value cutoff

59
Blast Steps
  1. Generate an automaton of all query keywords.
  2. Scan database using a Dictionary Matching
    algorithm (O(n) time). Identify all hits.
  3. Extend each hit using a variant of local
    alignment algorithm. Use the scoring matrix and
    gap penalties.
  4. For each alignment with score S, compute the
    bit-score, E-value, and the P-value. Sort
    according to increasing E-value until the cut-off
    is reached.
  5. Output results.

60
Protein Sequence Analysis
  • What can you do if BLAST does not return a hit?
  • Sometimes, homology (evolutionary similarity)
    exists at very low levels of sequence similarity.
  • A Accept hits at higher P-value.
  • This increases the probability that the sequence
    similarity is a chance event.
  • How can we get around this paradox?
  • Reformulated Q suppose two sequences B,C have
    the same level of sequence similarity to sequence
    A. If A B are related in function, can we assume
    that A C are? If not, how can we distinguish?
Write a Comment
User Comments (0)
About PowerShow.com