BLAST for faster inexact searches, based on Larry Hunter's Bioinformatics class


Provided by: digita1


1
BLAST for faster inexact searches
Based on Larry Hunter's Bioinformatics class
2
Why BLAST?

Dynamic programming solutions are relatively slow
Need some way to search a large database to find
sequences that have an inexact match to a query
sequence
Competing solutions: FASTA and BLAST.
Both are imperfect approximations to DP. DP finds some distantly related sequences that the approximations miss.
BLAST is more commonly used, although both are fine.

3
Sequence search basics

BLAST/FASTA are 50-100x faster than DP
If searching for coding regions, always translate
nucleotide to amino acid sequence. WHY?
Use appropriate substitution and gap scores
BLOSUM62 is good for weak protein similarities
Use PAM30, PAM70 or BLOSUM80 for better results
on more similar sequences, BLOSUM45 for the most
distant
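
One common answer to the "WHY?" above is that the genetic code is degenerate: two coding sequences can differ at many nucleotide positions and still encode the same protein, so the similarity is easier to see at the amino acid level. A minimal sketch, assuming two made-up DNA fragments and a partial codon table that covers only the codons used here (the names `CODONS`, `translate` and `identity` are illustrative):

```python
# Partial standard codon table: only the codons used in this example.
CODONS = {
    "CTG": "L", "TTA": "L",
    "TCT": "S", "AGC": "S",
    "CGT": "R", "AGA": "R",
    "GGT": "G", "GGA": "G",
}

def translate(dna):
    """Translate a DNA string in frame 0, three bases at a time."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))

def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two hypothetical fragments that use different synonymous codons.
dna1 = "CTGTCTCGTGGT"
dna2 = "TTAAGCAGAGGA"

dna_identity = identity(dna1, dna2)                            # low at the DNA level
protein_identity = identity(translate(dna1), translate(dna2))  # identical peptides
```

Both fragments translate to the same four-residue peptide even though most of their nucleotides differ, which is the situation a protein-level search handles well.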

4
BLOSUM62 Score Matrix

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3

5
How does BLAST work?

BLAST2 (gapped BLAST)
Break the sequence into overlapping words, by default of length 3: n-2 words for a sequence of length n. ABCDE → ABC, BCD, CDE
For each word, define about 50 other words that are similar (scoring at or above a substitution matrix threshold T)
Repeat for each of the n-2 words, giving about 50n words (out of 20^3 = 8000 possible)
Use an index to find all places in the DB with an exact match to any of those words.
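
The word-generation and index-lookup steps can be sketched as follows. This is a toy illustration, not BLAST's implementation: it assumes a four-letter alphabet (A, I, L, V) with scores copied from the corresponding BLOSUM62 entries, a threshold T = 11 chosen for the example, and illustrative function names.

```python
from collections import defaultdict
from itertools import product

# Toy scores for A, I, L, V only (values taken from the BLOSUM62 matrix).
TOY_SCORES = {
    ("A", "A"): 4, ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "V"): 3,
    ("L", "L"): 4, ("L", "V"): 1,
    ("V", "V"): 4,
}

def score(a, b):
    """Symmetric lookup in the toy score table."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))

def words(seq, w=3):
    """All overlapping words of length w: len(seq) - w + 1 of them."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def neighborhood(word, threshold, alphabet="AILV"):
    """Every word whose position-by-position score against `word` is >= threshold."""
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }

def find_hits(query, subject, w=3, threshold=11):
    """Return (query_pos, subject_pos) pairs where a neighborhood word matches exactly."""
    index = defaultdict(list)              # word -> positions in the subject
    for pos, word in enumerate(words(subject, w)):
        index[word].append(pos)
    hits = []
    for q_pos, word in enumerate(words(query, w)):
        for neighbor in neighborhood(word, threshold):
            for s_pos in index.get(neighbor, []):
                hits.append((q_pos, s_pos))
    return sorted(hits)
```

For instance, `find_hits("ILV", "AAVLVAA")` reports a hit even though the subject never contains ILV itself, because VLV is in the neighborhood of ILV at this threshold.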

6
Extending alignments

Identify database sequences that contain multiple matching words on the same diagonal (think DP alignments) and within a short distance of each other.
Extend these short, ungapped alignments in both directions along the sequence as long as the alignment score increases.
Call these extended alignments HSPs, for high-scoring segment pairs
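
A minimal sketch of ungapped extension from a single word hit. The slide's rule is "extend while the score increases"; the version below adds a small drop-off tolerance (`x_drop`) so that it stops once the running score falls too far below the best score seen. The tolerance, the +2/-3 scoring and all function names are assumptions made for illustration.

```python
def simple_score(a, b):
    """Hypothetical scoring: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3

def _best_extension(pairs, score, x_drop):
    """Walk along residue pairs; return (length, gain) of the best-scoring prefix."""
    best_gain = gain = best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        gain += score(a, b)
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break                      # score dropped too far below the best: stop
    return best_len, best_gain

def extend_hit(query, subject, q_pos, s_pos, word_len=3,
               score=simple_score, x_drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (q_start, q_end, total_score), with q_end exclusive.
    """
    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(word_len))
    right_len, right_gain = _best_extension(
        zip(query[q_pos + word_len:], subject[s_pos + word_len:]), score, x_drop)
    left_len, left_gain = _best_extension(
        zip(reversed(query[:q_pos]), reversed(subject[:s_pos])), score, x_drop)
    return (q_pos - left_len,
            q_pos + word_len + right_len,
            seed + left_gain + right_gain)
```

Starting from the word DEF in `"QQACDEFGHQQ"` and `"WWACDEFGHWW"`, the extension grows to cover ACDEFGH and stops at the flanking mismatches.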

7
What about Statistical Significance?

How do we know that an HSP is not due to chance?
Matching sequences is different from sampling a normal distribution
Use an extreme value distribution
Do Monte Carlo studies to estimate probabilities of random matches

8
Extreme value distributions use K and λ parameters

Estimated by aligning a lot of random sequences drawn from a particular distribution of amino acids, and fitting the extreme value distribution to those alignments
Depend on the particular substitution matrix

9
Is an HSP Significant?

What is the probability of scoring at least as large as x by chance?
Use the extreme value distribution, where
m is the length of the database,
n is the length of the query,
l is the average length of an alignment between two random sequences of those lengths using this scoring scheme.
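
Assuming the usual Karlin-Altschul form for these symbols, the expected number of chance HSPs with score at least x is E = K (m - l)(n - l) e^(-λx), and the probability of seeing at least one is P = 1 - exp(-E). The sketch below evaluates that form; the formula itself, the function name and the example parameter values are assumptions rather than something stated on the slide.

```python
import math

def hsp_significance(x, m, n, l, K, lam):
    """Return (E, P) for an HSP of score x under the assumed extreme value form.

    m, n: database and query lengths; l: average random-alignment length,
    subtracted from both as an edge correction; K, lam: fitted parameters.
    """
    m_eff = max(m - l, 1)
    n_eff = max(n - l, 1)
    expected = K * m_eff * n_eff * math.exp(-lam * x)
    p_value = -math.expm1(-expected)   # 1 - exp(-E), accurate for small E
    return expected, p_value

# Illustrative values only: K and lam depend on the substitution matrix
# and on the amino acid distribution used to fit them.
e_value, p_value = hsp_significance(x=80, m=1_000_000, n=300, l=30, K=0.134, lam=0.318)
```

When E is small, P is nearly equal to E, which is why the two are often used interchangeably for strong hits.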

10
How to make gapped BLAST2

Multiple HSPs in one target sequence → possibility of a gapped alignment.
How to produce the final gapped alignment?
Just run DP on the (relatively) small number of database sequences that produce multiple above-threshold HSPs.
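
The DP step applied to the surviving candidates can be sketched as a local alignment score in the style of Smith-Waterman. This is a simplified stand-in: it assumes a linear gap penalty and a +2/-1 match/mismatch score (both chosen for illustration), and it returns only the best score, not the alignment itself.

```python
def match_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def local_alignment_score(a, b, score=match_score, gap=-2):
    """Best local (gapped) alignment score between a and b, linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

With these scores, `local_alignment_score("ACDEFGH", "ACDFGH")` joins the two ungapped pieces ACD and FGH across a one-residue gap, which is the situation that multiple HSPs in one target suggest.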

11
BLAST2

Default on NCBI web site
Provides gapped alignments
Uses BLAST to find HSPs, then runs DP to find optimal alignments. Fast (enough) database search, along with optimal alignments.
Still might miss some alignments that DP would find if used as the database search tool
Can set certain gap penalties, word sizes and
thresholds in Advanced settings

12
Validation

How do we know how well BLAST (or any other
approach) is working?
Various ways to measure
Sensitivity: How many of the actually homologous sequences is BLAST finding?
Specificity: Of the sequences that BLAST says are similar, how many are actually homologous?
These depend on the E value (cutoff).
Sensitivity/specificity trade-off!

13
How to calculate

Define
True Positives (TP): Homologous and above threshold
False Positives (FP): Above threshold, but not homologous
False Negatives (FN): Homologous, but below threshold
True Negatives (TN): Below threshold and not homologous
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Alternative to Specificity is Positive Predictive Value, PPV = TP / (TP + FP)

14
Relationships

For DB searching, specificity can be less useful than PPV, since TN is very large.
There is a tradeoff between sensitivity and specificity (or PPV), since changing the cutoff (E threshold) influences these measures in opposite directions.
A stricter threshold (higher score cutoff, lower E value cutoff) means fewer FP but also fewer TP
A looser threshold (lower score cutoff, higher E value cutoff) means more TP but also more FP

16
Example

Imagine 100 homologous sequences in the DB
Set BLAST threshold very high, returns 50 sequences, 45 of which are actual homologs
TP = 45, FP = 5, FN = 55
Sensitivity = 45% (45/100), PPV = 90% (45/50)
Set BLAST threshold low, returns 150 sequences including 95 actual homologs
TP = 95, FP = 55, FN = 5
Sensitivity = 95% (95/100), PPV = 63% (95/150)
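
The arithmetic in this example can be checked with a few lines of code. The function names are illustrative; the counts are the ones given in the example.

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that were returned: TP / (TP + FN)."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Fraction of returned sequences that are true homologs: TP / (TP + FP)."""
    return tp / (tp + fp)

# Strict threshold: 50 sequences returned, 45 of them homologs (100 homologs in the DB).
strict = (sensitivity(tp=45, fn=55), ppv(tp=45, fp=5))

# Loose threshold: 150 sequences returned, 95 of them homologs.
loose = (sensitivity(tp=95, fn=5), ppv(tp=95, fp=55))
```

Loosening the threshold raises sensitivity from 0.45 to 0.95 while PPV falls from 0.90 to about 0.63, which is the trade-off in concrete numbers.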

17
ROC curves

Shows the entire sensitivity/specificity trade-off over all thresholds in one graph
Not always smooth!
The top left corner is perfect discrimination; the 45° line is random choice.
Closer to the top left is better; the area under the curve is called the ROC score
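
A minimal sketch of how the curve and its area could be computed from a ranked hit list. It assumes each hit has a score (higher is better) and a known homolog/non-homolog label, lowers the cutoff one hit at a time, ignores tied scores, and uses the trapezoid rule for the area. The function names and input layout are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points as the cutoff is lowered.

    labels[i] is True when hit i is a real homolog.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def roc_area(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A ranking that places every homolog above every non-homolog hugs the top left corner and has area 1.0; a ranking no better than chance gives an area near 0.5.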

18
For sequence searching

Since TN is very large, we can make a similar
graph of FP vs. TP at various cutoffs
In this graph, further to the right is better

19
Sequence-based database searching

Matches of >50% identity in a 20-40 amino acid region occur frequently by chance.
Most sequences that share statistically
significant similarity throughout their entire
lengths are homologous.
If A is homologous to B, and B to C, then A must
be homologous to C, even if they share no
significant sequence similarity.

20
Class Web Site