# L3: Blast: Keyword match basics - PowerPoint PPT Presentation

Title:

## L3: Blast: Keyword match basics

Description:

### What is the probability that some position i has a match. ... Exact match of a short query substring to ... Scan the database for exact match to all such words ... – PowerPoint PPT presentation

Slides: 36
Provided by: vineet50
Transcript and Presenter's Notes

Title: L3: Blast: Keyword match basics

1
L3: Blast: Keyword match basics
2
Silly Quiz
• TRUE or FALSE
• In New York City at any moment, there are 2
people (not bald) with exactly the same number
of hairs!

3
Assignment 1 is online
• Due 10/6 (Thursday) in class.

4
An O(nm) algorithm for score computation
For i = 1 to n, for j = 1 to m
• The iteration order ensures that all values on the right-hand side of the recurrence are computed in earlier steps.
• How much space do you need?
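The recurrence itself did not survive in this transcript, so the following is only a minimal sketch of the usual global-alignment score computation. The scoring values (match +1, mismatch -1, gap -1) and the name `alignment_score` are assumptions for illustration, not taken from the slides.

```python
def alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score using the full (n+1) x (m+1) table (assumed scoring)."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Every value on the right-hand side was filled in an earlier step.
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align s[i] with t[j]
                          S[i - 1][j] + gap,      # s[i] against a gap
                          S[i][j - 1] + gap)      # t[j] against a gap
    return S[n][m]


print(alignment_score("ACGT", "AGT"))  # 2: three matches and one gap
```

The table holds (n+1)(m+1) entries, which is the O(nm) space the slide asks about.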

5
Alignment?
• Is O(nm) space too much?
• What if the query and database are each 1Mbp?

6
Alignment (Linear Space)
• Score computation

For i = 1 to n, for j = 1 to m
7
Linear Space
• In linear space, we can compute each row of the D.P. from the previous row alone.
• But we still need to compute the optimum path from the origin (0,0) to (n,m)
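A minimal sketch of the row-by-row idea, assuming the same illustrative scoring as a standard global alignment (match +1, mismatch -1, gap -1). The function name `last_row` is hypothetical. Only two rows are alive at any time, so the score is available but the path is not.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-1):
    """Return the final D.P. row for aligning s against every prefix of t.

    Uses O(len(t)) space: only the previous and the current row are kept.
    """
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur  # the older row is no longer needed
    return prev


print(last_row("ACGT", "AGT")[-1])  # 2, the score of the full alignment
```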

8
Linear Space (contd)
• At $i = n/2$, we know the scores of all the optimal paths ending at that row.
• Define $F_j = S_{n/2,j}$
• One of these j is on the true path. Which one?

9
Backward alignment
• Let $S^b_{i,j}$ be the optimal score of aligning $s_{i..n}$ with $t_{j..m}$
• Define $B_j = S^b_{n/2,j}$
• One of these j is on the true path. Which one?

10
Forward, Backward computation
• At the optimal coordinate j,
• $F_j + B_j = S_{n,m}$
• In O(nm) time and O(m) space, we can compute one of the coordinates on the optimum path.

11
Linear Space Alignment
• Align(1..n, 1..m)
• For all 1 ≤ j ≤ m
• Compute $F_j = S(n/2, j)$
• For all 1 ≤ j ≤ m
• Compute $B_j = S^b(n/2, j)$
• $j^* = \arg\max_j (F_j + B_j)$
• X = Align(1..n/2, 1..j*)
• Y = Align(n/2..n, j*..m)
• Return X, j*, Y
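The recursion above can be sketched as follows. This is an illustrative divide-and-conquer (Hirschberg-style) implementation under assumed scoring (match +1, mismatch -1, gap -1); `last_row` and `align` are hypothetical names, and the string slicing is kept for readability rather than strict space efficiency.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-1):
    """Final D.P. row for s against all prefixes of t, in O(len(t)) space."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def align(s, t, match=1, mismatch=-1, gap=-1):
    """Return (aligned_s, aligned_t) for an optimal global alignment."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Base case: pair the single character with its best partner in t,
        # unless putting it against a gap scores better.
        j = max(range(m), key=lambda k: match if s == t[k] else mismatch)
        sub = match if s == t[j] else mismatch
        if sub + (m - 1) * gap >= (m + 1) * gap:
            return "-" * j + s + "-" * (m - 1 - j), t
        return s + "-" * m, "-" + t
    mid = n // 2
    F = last_row(s[:mid], t, match, mismatch, gap)              # forward scores
    B = last_row(s[mid:][::-1], t[::-1], match, mismatch, gap)  # backward scores
    # B[m - j] is the score of aligning s[mid:] with t[j:].
    j = max(range(m + 1), key=lambda k: F[k] + B[m - k])
    x = align(s[:mid], t[:j], match, mismatch, gap)
    y = align(s[mid:], t[j:], match, mismatch, gap)
    return x[0] + y[0], x[1] + y[1]


print(align("ACGT", "AGT"))  # ('ACGT', 'A-GT')
```

Each level of the recursion only needs the two score rows `F` and `B`, which is where the O(m) working space comes from.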

12
Linear Space complexity
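• $T(nm) = c \cdot nm + T(nm/2) = O(nm)$, since the costs form a geometric series $c \cdot nm\,(1 + 1/2 + 1/4 + \dots) \le 2c \cdot nm$
• Space: O(m)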

13
Why is Blast Fast?
14
Large database search
Database size n = 10M, query size m = 300. O(nm) = $3 \times 10^9$ computations
15
Observations
• Much of the database is random from the query's perspective
• Consider a random DNA string of length n.
• $\Pr[A] = \Pr[C] = \Pr[G] = \Pr[T] = 0.25$
• Assume for the moment that the query is all 1s (length m).
• What is the probability that an exact match to the query can be found?

16
Basic probability
• Probability that there is a match starting at a fixed position i = $0.25^m$
• What is the probability that some position i has a match?
• Dependencies confound probability estimates.

17
Basic Probability: Expectation
• Q: Toss a coin; each time it comes up heads, you get a dollar
• What is the money you expect to get after n tosses?
• Let $X_i$ be the amount earned in the i-th toss
• By linearity of expectation, $E[\sum_i X_i] = \sum_i E[X_i] = n/2$
18
Expected number of matches
• Expected number of matches can still be computed.
• Let $X_i = 1$ if there is a match starting at position i, $X_i = 0$ otherwise
• Expected number of matches = $E[\sum_i X_i] = \sum_i E[X_i] = n \cdot 0.25^m$
19
Expected number of exact Matches is small!
• Expected number of matches = $n \cdot 0.25^m$
• If $n = 10^7$, $m = 10$,
• then expected number of matches = 9.537
• If $n = 10^7$, $m = 11$,
• expected number of hits = 2.38
• If $n = 10^7$, $m = 12$,
• expected number of hits ≈ 0.6
• Bottom line: an exact match to a substring of the query is unlikely just by chance.
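The figures on this slide can be checked with a few lines of arithmetic. The sketch below follows the slide in using n rather than n - m + 1 start positions; the function name `expected_matches` is just for illustration.

```python
def expected_matches(n, m, p=0.25):
    """Expected number of exact matches of a length-m word in a random string of length n."""
    return n * p ** m


for m in (10, 11, 12):
    print(m, round(expected_matches(10**7, m), 3))
# 10 9.537
# 11 2.384
# 12 0.596
```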

20
Observation 2
• What is the pigeonhole principle?

21
Why is this important?
• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
• Assume that the mismatches are randomly distributed.
• What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
• Rough calculations show that it is very low. The probability of an exact match of a short query substring to a truly similar subject is very high.
• The rough calculation does not take dependencies into account
• Reality is better because the matches are not randomly distributed
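The slide's equation is missing from the transcript, so the sketch below is one plausible reading rather than the lecturer's calculation. `rough_no_match` treats the 91 overlapping 10-bp windows as independent; `exact_no_match` tracks the current run of matches and is exact only under the assumed model of independent positions, each matching with probability 0.8. Both function names are hypothetical.

```python
def rough_no_match(length=100, k=10, identity=0.8):
    """P(no exact k-bp stretch), pretending the overlapping windows are independent."""
    windows = length - k + 1
    return (1 - identity ** k) ** windows


def exact_no_match(length=100, k=10, identity=0.8):
    """P(no run of k consecutive matches) when positions match independently."""
    # probs[r] = probability that the current run of matches has length r
    # and no run of length k has been seen so far.
    probs = [1.0] + [0.0] * (k - 1)
    for _ in range(length):
        new = [0.0] * k
        for run, pr in enumerate(probs):
            new[0] += pr * (1 - identity)      # a mismatch resets the run
            if run + 1 < k:
                new[run + 1] += pr * identity  # a match extends the run
            # otherwise the run reaches k and this mass leaves the "no match" event
        probs = new
    return sum(probs)


print(rough_no_match(), exact_no_match())
```

Comparing the two numbers gives a sense of how much the dependency caveat matters: the independent-window shortcut comes out smaller than the run-length computation, because overlapping windows are strongly correlated.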

22
Just the Facts
• Consider the set of all substrings of the query
string of fixed length W.
• Prob. of exact match to a random database string
is very low.
• Prob. of exact match to a true homolog is very
high.
• Keyword Search (exact matches) is MUCH faster
than sequence alignment

23
BLAST
[Figure: database of length n]
• Consider all (m-W+1) query words of size W (default 11)
• Scan the database for exact matches to all such words
• For all regions that hit, extend using a dynamic programming alignment.
• Can be many orders of magnitude faster than Smith-Waterman (SW) over the entire string
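The first two steps can be sketched with a dictionary index over the query words. This is a simplified illustration, not how BLAST is actually implemented; `find_seeds` is a hypothetical name, and the extension step is left out.

```python
from collections import defaultdict


def find_seeds(query, database, W=11):
    """Return (query_pos, db_pos) pairs where a length-W query word matches exactly."""
    # Index every length-W word of the query by its start position.
    index = defaultdict(list)
    for q in range(len(query) - W + 1):
        index[query[q:q + W]].append(q)
    # Scan the database once, looking each window up in the index.
    hits = []
    for d in range(len(database) - W + 1):
        for q in index.get(database[d:d + W], ()):
            hits.append((q, d))
    return hits


print(find_seeds("ACGTAC", "TTACGTTT", W=4))  # [(0, 2)]
```

Only the regions around the returned hits would then be handed to the more expensive alignment step.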

24
Why is BLAST fast?
• Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
• Query m = 1000, random Db $n = 10^7$, no true positives (TP)
• SW: O(nm) = $1000 \times 10^7 = 10^{10}$ computations
• BLAST, W = 11
• E(11-mer hits) = $1000 \times (1/4)^{11} \times 10^7 \approx 2384$
• Number of computations = $2384 \times 100 \times 100 = 2.384 \times 10^7$
• Ratio = $10^{10} / (2.384 \times 10^7) \approx 420$
• Further speed improvements are possible
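The arithmetic on this slide can be reproduced as below. The 100 × 100 factor is read here as the cost of one dynamic programming extension per hit, which is an interpretation of the slide; `blast_speedup` and its parameter names are illustrative.

```python
def blast_speedup(n=10**7, m=1000, W=11, region=100):
    """Return (expected random hits, BLAST cost, SW cost / BLAST cost)."""
    sw_cost = n * m                            # full Smith-Waterman table
    expected_hits = m * n * 0.25 ** W          # random W-mer hits
    blast_cost = expected_hits * region * region  # one region x region D.P. per hit
    return expected_hits, blast_cost, sw_cost / blast_cost


hits, cost, ratio = blast_speedup()
print(round(hits), round(ratio))  # 2384 419
```

The exact ratio is about 419.4, which the slide rounds to 420.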

25
Keyword Matching
• How fast can we match keywords?
• Hash table/Db index? What is the size of the hash table for m = 11?
• Suffix trees? What is the size of the suffix
trees?
• Trie based search. We will do this in class.

[Figure: example index entry, keyword AATCA with value 567]
26
Related notes
• How to choose the alignment region?
• Extend greedily until the score falls below a
certain threshold
• Default word size 3 (for protein searches), and mismatches are allowed.
• Like sequences, BLAST has been evolving
continuously
• Banded alignment
• Seed selection
• Scanning for exact matches, keyword search versus
database indexing
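The "extend greedily until the score falls below a certain threshold" idea can be sketched as an ungapped extension with a drop-off rule. This is a simplification: it ignores gaps and banded alignment, and the scoring (match +1, mismatch -1), the `drop` value, and the function names are assumptions for illustration.

```python
def _extend(query, db, q, d, step, match, mismatch, drop):
    """Walk in one direction; return (best score gain, length giving that gain)."""
    best = score = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        score += match if query[q] == db[d] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break  # the score fell too far below the best seen so far
        q += step
        d += step
    return best, best_len


def extend_seed(query, db, q, d, W, match=1, mismatch=-1, drop=3):
    """Extend an exact seed at (q, d) both ways; return (score, query_start, query_end)."""
    right, right_len = _extend(query, db, q + W, d + W, +1, match, mismatch, drop)
    left, left_len = _extend(query, db, q - 1, d - 1, -1, match, mismatch, drop)
    return W * match + left + right, q - left_len, q + W + right_len


print(extend_seed("AAACGTAA", "CAACGTAC", q=3, d=3, W=3))  # (6, 1, 7)
```

The returned span is half-open in query coordinates; the database span is the same shifted by `d - q` because no gaps are introduced.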

27
P-value computation
• How significant is a score? What happens to significance when you change the score function?
• A simple empirical method
• Compute a distribution of scores against a random
database.
• Use an estimate of the area under the curve to
get the probability.
• OR, fit the distribution to one of the standard
distributions.

28
Z-scores for alignment
• Initial assumption was that the scores followed a
normal distribution.
• Z-score computation
• For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation
• Look up a table to get a P-value
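A minimal sketch of the shuffle-based Z-score, using an assumed global alignment score (match +1, mismatch -1, gap -1) and the standard normal distribution in place of a printed table. The function names, the number of shuffles, and the fixed seed are illustrative choices.

```python
import random
import statistics


def score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score in linear space (assumed scoring)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def z_score(s, t, shuffles=200, seed=0):
    """Return (Z, P) where P assumes the shuffled scores are normally distributed."""
    rng = random.Random(seed)
    observed = score(s, t)
    chars = list(t)
    samples = []
    for _ in range(shuffles):
        rng.shuffle(chars)  # shuffle one sequence, keep its composition
        samples.append(score(s, "".join(chars)))
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    z = (observed - mu) / sigma
    p = 1 - statistics.NormalDist().cdf(z)  # upper tail of the standard normal
    return z, p


print(z_score("ACGTACGTACGTACGT", "ACGTACGTACGTACGT"))
```

As the next slide points out, the normality assumption is the weak point of this method for local alignment scores.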

29
Blast E-value
• Initial (and natural) assumption was that scores
followed a Normal distribution
• 1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution
• Practical consequence
• Longer tail.
• Previously significant hits now not so
significant

30
Exponential distribution
• Random database, Pr(1) = p
• What is the expected number of hits to a sequence of k 1s?
• Instead, consider a random binary matrix. What is the expected number of diagonals of k 1s?

31
• As you increase k, the number decreases
exponentially.
• The number of diagonals of k runs can be
approximated by a Poisson process
• In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin-Altschul).
• As the score increases, the number of alignments
that achieve the score decreases exponentially
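The exponential decrease can be illustrated on the binary-matrix model. The sketch below counts diagonal windows of k consecutive 1s (not maximal runs), which is an assumption about what the missing formula counted; under that reading the expectation for an n × m matrix is (n-k+1)(m-k+1)·p^k. Function names are hypothetical.

```python
def expected_diagonals(n, m, k, p):
    """Expected number of diagonal windows of k 1s in a random n x m binary matrix."""
    return (n - k + 1) * (m - k + 1) * p ** k


def count_diagonals(matrix, k):
    """Count diagonal windows of k consecutive 1s in a given binary matrix."""
    n, m = len(matrix), len(matrix[0])
    return sum(
        all(matrix[i + d][j + d] for d in range(k))
        for i in range(n - k + 1)
        for j in range(m - k + 1)
    )


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(count_diagonals(identity, 2))  # 2

# Each extra position multiplies the expectation by roughly p.
for k in (5, 10, 15):
    print(k, expected_diagonals(1000, 1000, k, 0.25))
```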

32
Blast E-value
• Choose a score such that the expected score between a pair of residues is negative
• Expected number of alignments with a particular score
• For small values, E-value and P-value are the same
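The formula is missing from the transcript; the standard Karlin-Altschul form is E = K·m·n·e^(−λS), with P = 1 − e^(−E). The sketch below uses that form. The values of K and λ are made-up placeholders, since in practice they depend on the scoring system, and the function names are illustrative.

```python
import math


def e_value(score, m, n, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are placeholder values, not real BLAST parameters.
    """
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such alignment: 1 - exp(-E)."""
    return -math.expm1(-e)


for s in (40, 60, 100):
    e = e_value(s, m=300, n=10**7)
    print(s, e, p_value(e))
```

For small E the two columns agree closely, because 1 − e^(−E) ≈ E when E is near zero; for large E the P-value saturates at 1 while the E-value keeps growing.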

33
Blast Variants
• What is mega-blast? Longer seeds.
• What is discontiguous mega-blast? Seeds with don't-care values.
• Phi-Blast/Psi-Blast? Later.
• BLAT? Database pre-processing.
• PatternHunter? Seeds with don't-care values.
34
Keyword Matching
[Figure: trie of keywords matched against the text P O T A S T P O T A T O]
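Since the trie figure is lost, here is a minimal sketch of trie-based keyword matching over the same text. The keyword set is invented for illustration, and the search restarts from every text position (O(n·W) time) rather than using failure links as Aho-Corasick would. The `"$"` end marker assumes that character never occurs in the alphabet.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a keyword."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root


def trie_search(text, words):
    """Return (start, keyword) for every occurrence of a keyword in text."""
    root = build_trie(words)
    hits = []
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no keyword continues with this character
            if "$" in node:
                hits.append((start, node["$"]))
    return hits


print(trie_search("POTASTPOTATO", ["POTATO", "TAT", "OT"]))
# [(1, 'OT'), (6, 'POTATO'), (7, 'OT'), (8, 'TAT')]
```

All keywords are matched in a single pass over the text, which is what makes this attractive for scanning a database with every query word at once.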