BLAST for faster inexact searches: Based on Larry Hunter's Bioinformatics class

Provided by: digita1
1
BLAST for faster inexact searches
Based on Larry Hunter's Bioinformatics class
2
Why BLAST?
  • Dynamic programming solutions are relatively slow
  • Need some way to search a large database to find
    sequences that have an inexact match to a query
    sequence
  • Competing solutions: FASTA and BLAST.
  • Both are imperfect approximations to DP: DP finds
    some distantly related sequences that the
    approximations miss.
  • BLAST is more commonly used, although both are
    fine.

3
Sequence search basics
  • BLAST/FASTA are 50-100x faster than DP
  • If searching for coding regions, always translate
    nucleotide to amino acid sequence. WHY?
  • Use appropriate substitution and gap scores
  • BLOSUM62 is a good default for detecting weak protein
    similarities
  • Use PAM30, PAM70 or BLOSUM80 for better results
    on more similar sequences, BLOSUM45 for the most
    distant

4
BLOSUM62 Score Matrix
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
    R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
    N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
    D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
    C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
    Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
    E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
    G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
    H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3

5
How does BLAST work?
  • BLAST2 (gapped BLAST)
  • Break the query sequence into overlapping words, by
    default of length 3: n-2 words for a sequence of
    length n. ABCDE → ABC, BCD, CDE
  • For each word, define about 50 other words that are
    similar (scoring at least threshold T under the
    substitution matrix)
  • Repeat for each of the n-2 words, giving about
    50n words (out of 20^3 = 8000 possible)
  • Use an index to find all places in the DB with an
    exact match to any of those words.
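
The word-generation steps above can be sketched in a few lines of Python. This is only an illustration, not BLAST's actual implementation: the function names `words` and `neighborhood` are hypothetical, the alphabet is cut down to three residues (A, R, N) so the example stays tiny, and the threshold value is chosen arbitrarily. The scores are the BLOSUM62 entries for those residues.

```python
from itertools import product

# BLOSUM62 entries for a toy 3-letter alphabet (A, R, N); real BLAST uses all 20.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2,
    ("R", "R"): 5, ("R", "N"): 0, ("N", "N"): 6,
}

def score(a, b):
    """Symmetric lookup of the substitution score for residues a and b."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]

def words(seq, w=3):
    """All overlapping words of length w: n - w + 1 of them (n - 2 for w = 3)."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def neighborhood(word, alphabet, threshold):
    """Every word whose score against `word` is at least `threshold`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits

print(words("ABCDE"))                    # the slide's example
print(neighborhood("ARN", "ARN", 10))    # similar words for one query word
```

Enumerating all candidate words is affordable here because there are only 20^3 = 8000 of them for the full amino acid alphabet.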

6
Extending alignments
  • Identify database sequences that contain multiple
    matching words on the same diagonal (think DP
    alignments) and within a short distance of each other.
  • Extend these short, ungapped alignments in both
    directions along the sequence for as long as the
    alignment score increases.
  • Call these extended alignments HSPs, for
    high-scoring segment pairs
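
A minimal sketch of ungapped extension along one diagonal follows. It makes several assumptions that are not in the slides: the names `extend_seed` and `_extend` are hypothetical, the scoring function is a toy match/mismatch scheme rather than a substitution matrix, and extension stops once the running score falls more than `x_drop` below the best score seen (a common way to make "as long as the score increases" tolerant of a single bad position).

```python
def _extend(query, target, qi, ti, step, score, x_drop):
    """Walk one direction from (qi, ti); return (best gain, residues kept)."""
    gain = best_gain = 0
    kept = walked = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        gain += score(query[qi], target[ti])
        walked += 1
        if gain > best_gain:
            best_gain, kept = gain, walked
        elif best_gain - gain > x_drop:
            break  # score has dropped too far below the best; give up
        qi += step
        ti += step
    return best_gain, kept

def extend_seed(query, target, q_start, t_start, word_len, score, x_drop=3):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, query start, target start, length) of the resulting HSP.
    """
    seed = sum(score(query[q_start + k], target[t_start + k])
               for k in range(word_len))
    right_gain, right_len = _extend(query, target, q_start + word_len,
                                    t_start + word_len, +1, score, x_drop)
    left_gain, left_len = _extend(query, target, q_start - 1,
                                  t_start - 1, -1, score, x_drop)
    return (seed + left_gain + right_gain,
            q_start - left_len, t_start - left_len,
            left_len + word_len + right_len)

def toy_score(a, b):
    """Hypothetical scoring: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3

hsp = extend_seed("KKACDEFGHRR", "PPACDEFGHSS", 4, 4, 3, toy_score)
print(hsp)  # score, start in query, start in target, HSP length
```

In this example the seed DEF grows to the segment ACDEFGH, and the mismatching flanks are left out.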

7
What about Statistical Significance?
  • How do we know that an HSP is not due to chance?
  • Matching sequences is different from sampling a
    normal distribution
  • Use an extreme value distribution
  • Do Monte Carlo studies to estimate
    probabilities of random matches

8
Extreme value distributions use K and λ parameters
  • Estimated by aligning a lot of random sequences
    drawn from a particular distribution of amino
    acids, and fitting the extreme value distribution
    to those alignments
  • Depend on the particular substitution matrix

9
Is an HSP Significant?
  • What is the probability of scoring at least as
    large as x by chance?
  • Extreme value distribution, where m is the length
    of the database,
  • n is the length of the query,
  • l is the average length of an alignment between two
    random sequences of those lengths using this
    scoring scheme.
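
In the standard Karlin-Altschul form, the probability of an HSP scoring at least x by chance is P(S ≥ x) = 1 - exp(-K m' n' e^(-λx)). The sketch below evaluates that expression. Treating l as an edge correction (m' = m - l, n' = n - l) is an assumption about how the slide's l enters the formula. The function name `hsp_p_value` is hypothetical, and the K and λ values in the example are placeholders rather than fitted parameters.

```python
import math

def hsp_p_value(x, m, n, K, lam, l=0.0):
    """Chance probability of an HSP score >= x, plus the expected count E.

    m: database length, n: query length, K and lam: fitted EVD parameters,
    l: average random-alignment length, used here as a length correction
    (an assumption of this sketch).
    """
    m_eff = max(m - l, 1.0)
    n_eff = max(n - l, 1.0)
    expected = K * m_eff * n_eff * math.exp(-lam * x)  # the E value
    p = -math.expm1(-expected)                         # 1 - exp(-E), numerically stable
    return p, expected

# Placeholder parameters, chosen only to show the shape of the calculation.
p, e = hsp_p_value(x=60, m=1_000_000, n=250, K=0.1, lam=0.3)
print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two numbers are nearly equal, which is why BLAST output can report E values in place of probabilities.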

10
How to make gapped BLAST2
  • Multiple HSPs in one target sequence →
    possibility of gapped alignment.
  • How to produce final gapped alignment?
  • Just run DP on (relatively) small number of
    database sequences that produce multiple
    above-threshold HSPs.
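
The DP step can be illustrated with a small Smith-Waterman local alignment score. This is a sketch under simplifying assumptions, not what BLAST2 runs internally: scoring is a toy match/mismatch scheme with a linear gap penalty (all three values are arbitrary), and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, keeping only two DP rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell may restart at 0, extend diagonally, or open a gap.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

print(smith_waterman_score("ACGTT", "ACTT"))  # the gapped alignment beats the ungapped one
```

The cost is proportional to the product of the two sequence lengths, which is why it is run only on the few database sequences that survive the word-hit filter.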

11
BLAST2
  • Default on NCBI web site
  • Provides gapped alignments
  • Uses BLAST to find HSPs, then runs DP to find
    optimal alignments. Fast (enough) database
    search, along with optimal alignments.
  • As a database search tool, it might still miss
    some alignments that DP would find
  • Can set certain gap penalties, word sizes and
    thresholds in Advanced settings

12
Validation
  • How do we know how well BLAST (or any other
    approach) is working?
  • Various ways to measure
  • Sensitivity: how many of the actually homologous
    sequences is BLAST finding?
  • Specificity: of the sequences that BLAST says are
    similar, how many are actually homologous?
    (Strictly, this is the positive predictive value.)
  • These depend on the E value (cutoff).
    Sensitivity/specificity trade-off!

13
How to calculate
  • Define:
  • True Positives (TP): homologous and above
    threshold
  • False Positives (FP): above threshold, but not
    homologous
  • False Negatives (FN): homologous, but below
    threshold
  • True Negatives (TN): below threshold and not
    homologous
  • Sensitivity = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • An alternative to specificity is Positive Predictive
    Value, PPV = TP / (TP + FP)
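
These definitions translate directly into code. The sketch below assumes each database hit has a score and a known homolog/non-homolog label, and that "above threshold" means a score at or above a cutoff. The function names and example data are hypothetical.

```python
def confusion(scores, is_homolog, cutoff):
    """Count (TP, FP, FN, TN) for hits scored at or above `cutoff`."""
    tp = fp = fn = tn = 0
    for s, h in zip(scores, is_homolog):
        if s >= cutoff:
            if h:
                tp += 1
            else:
                fp += 1
        elif h:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def ppv(tp, fp):
    return tp / (tp + fp)

# Made-up scores and labels, purely for illustration.
scores = [90, 80, 70, 60, 50, 40]
labels = [True, True, False, True, False, False]
tp, fp, fn, tn = confusion(scores, labels, cutoff=65)
print(tp, fp, fn, tn, sensitivity(tp, fn), specificity(tn, fp), ppv(tp, fp))
```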

14
Relationships
  • For DB searching, specificity can be less useful
    than PPV, since TN is very large.
  • There is a tradeoff between sensitivity and
    specificity (or PPV), since changing the cutoff
    (E threshold) influences these measures in
    opposite directions.
  • A stricter threshold (higher score cutoff, i.e. lower
    E value) means fewer FP but also fewer TP
  • A more permissive threshold (lower score cutoff, i.e.
    higher E value) means more TP but also more FP

15
Example
  • Imagine 100 homologous sequences in the DB
  • Set BLAST threshold very high, returns 50
    sequences, 45 of which are actual homologs
  • TP = 45, FP = 5, FN = 55
  • Sensitivity = 45%, PPV = 90%
  • Set BLAST threshold low, returns 150 sequences
    including 95 actual homologs
  • TP = 95, FP = 55, FN = 5
  • Sensitivity = 95%, PPV = 63% (95/150)
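
The arithmetic in this example can be checked with a short script. The helper name `operating_point` is hypothetical. It derives FP and FN from the number of sequences returned and the number of true homologs among them.

```python
def operating_point(total_homologs, returned, true_hits):
    """Derive counts and rates for one cutoff from the example's numbers."""
    tp = true_hits
    fp = returned - true_hits          # returned but not homologous
    fn = total_homologs - true_hits    # homologous but not returned
    return {
        "TP": tp, "FP": fp, "FN": fn,
        "sensitivity": tp / (tp + fn),
        "PPV": tp / (tp + fp),
    }

strict = operating_point(total_homologs=100, returned=50, true_hits=45)
permissive = operating_point(total_homologs=100, returned=150, true_hits=95)
print(strict)
print(permissive)
```

Moving from the strict to the permissive cutoff raises sensitivity from 0.45 to 0.95 while PPV falls from 0.90 to about 0.63.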

17
ROC curves
  • Shows entire sensitivity/specificity trade off
    over all thresholds in one graph
  • Not always smooth!
  • The top left corner is perfect discrimination;
    the 45° line is random choice.
  • Closer to the top left is better; the area under
    the curve is called the ROC score
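
A ROC curve and its area can be computed by sweeping the cutoff over a ranked hit list. The sketch below is illustrative only: the function names are hypothetical, the scores and labels are made up, tied scores are not handled specially, and the area uses the trapezoid rule.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) as the cutoff is relaxed."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def roc_score(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

perfect = roc_points([9, 8, 7, 6], [1, 1, 0, 0])
mixed = roc_points([9, 8, 7, 6], [1, 0, 1, 0])
print(roc_score(perfect), roc_score(mixed))
```

A ranking that puts every homolog ahead of every non-homolog reaches the top left corner and scores 1.0; a random ranking would score about 0.5.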

18
For sequence searching
  • Since TN is very large, we can make a similar
    graph of FP vs. TP at various cutoffs
  • In this graph, further to the right is better

19
Sequence-based database searching
  • Matches of >50% identity in a 20-40 amino acid
    region occur frequently by chance.
  • Most sequences that share statistically
    significant similarity throughout their entire
    lengths are homologous.
  • If A is homologous to B, and B to C, then A must
    be homologous to C, even if they share no
    significant sequence similarity.

20
Class Web Site
  • Read description of BLAST
  • Go through tutorial on sequence searching
  • Start thinking about a project
  • Use PubMed to find recent papers