BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990.

Description:

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of ... – PowerPoint PPT presentation

Number of Views:253
Avg rating:3.0/5.0
Slides: 27
Provided by: Saurab2
Category:

less

Transcript and Presenter's Notes

Title: BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990.


1
BLASTBasic Local Alignment Search ToolAltschul
et al. J. Mol Bio. 1990.
  • CS 466
  • Saurabh Sinha

2
Motivation
  • Sequence homology to a known protein suggest
    function of newly sequenced protein
  • Bioinformatics task is to find homologous
    sequence in a database of sequences
  • Databases of sequences growing fast

3
Alignment
  • Natural approach to check if the query sequence
    is homologous to a sequence in the database is to
    compute alignment score of the two sequences
  • Alignment score counts gaps (insertions,
    deletions) and replacements
  • Minimizing the evolutionary distance

4
Alignment
  • Global alignment optimize the overall similarity
    of the two sequences
  • Local alignment find only relatively conserved
    subsequences
  • Local similarity measures preferred for database
    searches
  • Distantly related proteins may only share
    isolated regions of similarity

5
Alignment
  • Dynamic programming is the standard approach to
    sequence alignment
  • Algorithm is quadratic in length of the two
    sequences
  • Not practical for searches against very large
    database of sequences (e.g., whole genome)

6
Scoring alignments
  • Scoring matrix 4 x 4 matrix (DNA) or 20 x 20
    matrix (protein)
  • Amino acid sequences PAM matrix
  • Consider amino acid sequence alignment for very
    closely related proteins, extract replacement
    frequencies (probabilities), extrapolate to
    greater evolutionary distances
  • DNA sequences match 5, mismatch -4

7
BLAST the MSP
  • Given two sequences of same length, the
    similarity score of their alignment (without
    gaps) is the sum of similarity values for each
    pair of aligned residues
  • Maximal segment pair (MSP) Highest scoring pair
    of identical length segments from the two
    sequences
  • The similarity score of an MSP is called the MSP
    score
  • BLAST heuristically aims to find this

8
Locally maximal segment pair
  • A molecular biologist may be interested in all
    conserved regions shared by two proteins, not
    just their highest scoring pair
  • A segment pair (segments of identical lengths) is
    locally maximal if its score cannot be improved
    by extending or shortening in either direction
  • BLAST attempts to find all locally maximal
    segment pairs above some score cutoff.

9
Rapid approximation of MSP score
  • Goal is to report those database sequences that
    have MSP score above some threshold S.
  • Statistics tells us what is the highest threshold
    S at which chance similarities are likely to
    appear
  • Tractability to statistical analysis is one of
    the attractive features of the MSP score

10
Rapid approximation of MSP score
  • BLAST minimizes time spent on database sequences
    whose similarity with the query has little chance
    of exceeding this cutoff S.
  • Main strategy seek only segment pairs (one from
    database, one query) that contain a word pair
    with score gt T
  • Intuition If the sequence pair has to score
    above S, its most well matched word (of some
    predetermined small length) must score above T
  • Lower T gt Fewer false negatives
  • Lower T gt More pairs to analyze

11
Implementation
  1. Compile a list of high scoring words
  2. Scan database for hits to this word list
  3. Extend hits

12
Step 1 Compiling list of words from query
sequence
  • For proteins List of all w-length words that
    score at least T when compared to some word in
    query sequence
  • Question Does every word in the query sequence
    make it to the list?
  • For DNA list of all w-length words in the query
    sequence, often with w12

13
Step 2 Scanning the database for hits
  • Find exact matches to list words
  • Can be done in linear time
  • two methods (next slides)
  • Each word in list points to all occurrences of
    the word in word list from previous step

14
Scanning the database for hits
  • Method 1 Let w4, so 204 possible words
  • Each integer in 0 204-1 is an index for an
    array
  • Array element point to list of all occurrences of
    that word in query
  • Not all 204 elements of array are populated
  • only the ones in word list from previous step

15
Scanning the database for hits
  • Method 2 use deterministic finite automaton or
    finite state machine.
  • Similar to the keyword trees seen in course.
  • Build the finite state machine out of all words
    in word list from previous step

16
Step 3 Extending hits
  • Once a word pair with score gt T has been found,
    extend it in each direction.
  • Extend until score gt S is obtained
  • During extension, score may go up, and then down,
    and then up again
  • Terminate if it goes down too much (a certain
    distance below the best score found for shorter
    extensions)
  • One implementation allows gaps during extension

17
BLAST approximating the MSP
  • BLAST may not find all segment pairs above
    threshold S
  • Trying to approximate the MSP
  • Bounds on the error not hard bounds, but
    statistical bounds
  • Highly likely to find the MSP

18
Statistics
  • Suppose the MSP has been calculated by BLAST (and
    suppose this is the true MSP)
  • Suppose this observed MSP scores S.
  • What are the chances that the MSP score for two
    unrelated sequences would be gt S?
  • If the chances are very low, then we can be
    confident that the two sequences must not have
    been unrelated

19
Statistics
  • Given two random sequences of lengths m and n
  • Probability that they will produce an MSP score
    of gt x ?

20
Statistics
  • Number of separate SPs with score gt x is Poisson
    distributed with mean y(x) Kmn exp(-?x), where
  • ? is the positive solution of
  • ?pipjexp(?s(i,j)) 1
  • K is a constant
  • s(i,j) is the scoring matrix, pi is the frequency
    of i in random sequences

21
Statistics
  • Poisson distribution
  • Pr(x) (e- ? ?x)/x!
  • Pr(SPs gt ?)
  • 1 - Pr(SPs lt ?-1)

22
Statistics
  • For ?1, Pr(SPs gt 1) 1-e-y(x)
  • Choose S such that 1-e-y(S) is small
  • Suppose the probability of having at least 1 SP
    with score gt S is 0.001.
  • This seems reasonably small
  • However, if you test 10000 random sequences, you
    expect 10 to cross the threshold
  • Therefore, require E-value to be small.
  • That is, expected number of random sequence pairs
    with score gt S should be small.

23
More statistics
  • We just saw how to choose threshold S
  • How to choose T ?
  • BLAST is trying to find segment pairs (SPs)
    scoring above S
  • If an SP scores S, what is the probability that
    it will have a w-word match of score T or more?
  • We want this probability to be high

24
More statistics Choosing T
  • Given a segment pair (from two random sequences)
    that scores S, what is the probability q that it
    will have no w-word match scoring above T?
  • Want this q to be low
  • Obtained from simulations
  • Found to decrease exponentially as S increases

25
BLAST is the universally used bioinformatics tool
26
http//flybase.org/blast/
Write a Comment
User Comments (0)
About PowerShow.com