Statistics Inference of BLAST - PowerPoint PPT Presentation

Provided by: alh6

1
Statistics Inference of BLAST
  • Ai-Ling Hour
  • 02-29052464
  • 022446_at_mail.fju.edu.tw

2
Contents
  • Algorithms
  • Parameters
  • Score
  • p-value
  • E-value

3
References
  • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

4
http://www.bioinfbook.org/
http://www.sdsc.edu/~babu/UCSD/week02/dbSearch_tut.html
5
BLAST procedure
  • Input: query sequence, database, E-value threshold
  • Output: homologs (subject sequences) with their HSPs
6
Preliminary
  • BLAST programs
  • Input / Output
  • Gap penalty
  • Score matrix
  • Filter
  • HSP
  • Score
  • E-Value

7
Algorithm
  • The original algorithm does not allow gaps but
    allows multiple hits to the same database
    sequence.
  • BLAST programs were designed for fast database
    searching, with minimal sacrifice of sensitivity
    to distantly related sequences.
  • BLAST programs search databases in a special,
    compressed format.
  • BLAST looks first for short subsequences which it
    then tries to extend.

8
How BLAST works
  • Make a list of high-scoring words (score > T)
  • Compare the word list against the database
  • If two hits fall within a given window, perform a
    gapped extension of the second hit in both directions
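The first step above can be sketched as follows. This is a minimal illustration, not the BLAST implementation: the function name `neighborhood_words`, the toy scoring function (+5 match, -4 mismatch) and the values of w and T are all assumptions chosen for the example. It enumerates every word of length w and keeps those scoring above T against some word of the query.

```python
from itertools import product

def neighborhood_words(query, w, T, alphabet, score):
    """Map each word scoring > T against a query word to the query offsets it matches."""
    hits = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        # Brute-force enumeration of all |alphabet|**w candidate words.
        for cand in product(alphabet, repeat=w):
            s = sum(score(a, b) for a, b in zip(qword, cand))
            if s > T:
                hits.setdefault("".join(cand), []).append(i)
    return hits

def toy_score(a, b):
    # Hypothetical scoring scheme, used only for this sketch.
    return 5 if a == b else -4

words = neighborhood_words("ACGTAC", w=3, T=5, alphabet="ACGT", score=toy_score)
print(len(words), words["ACT"])
```

With these toy values a word passes the threshold only if it differs from a query word in at most one position. Lowering T admits more words, which is how T trades speed for sensitivity.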

9
BLAST Search Algorithm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
10
[Figure: trade-off between sensitivity (better / worse) and search speed (slower / faster) as a function of word size w and threshold T. Source: http://www.bioinfbook.org/]
11
Program Advanced Options
  • -G  Cost to open gap (Integer); default: 5 for nucleotides, 11 for proteins
  • -E  Cost to extend gap (Integer); default: 2 for nucleotides, 1 for proteins
  • -q  Penalty for nucleotide mismatch (Integer); default: -3
  • -r  Reward for nucleotide match (Integer); default: 1
  • -e  Expect value (Real); default: 10
  • -W  Word size (Integer); default: 11 for nucleotides, 3 for proteins
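To make the roles of these options concrete, here is a small sketch that scores an already-aligned pair of nucleotide strings using the nucleotide defaults listed above. It assumes the convention that a gap of length k costs G + k*E. The function `raw_score` is hypothetical and not part of any BLAST program.

```python
def raw_score(a, b, reward=1, penalty=-3, gap_open=5, gap_extend=2):
    """Score two aligned strings of equal length; '-' marks a gap position.

    Assumes a gap of length k costs gap_open + k * gap_extend.
    Adjacent gaps in opposite sequences are treated as one gap (a simplification).
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # Pay the opening cost only on the first column of a gap.
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            score += reward if x == y else penalty
            in_gap = False
    return score

print(raw_score("ACGTACGT", "ACGAACGT"))    # 7 matches, 1 mismatch
print(raw_score("ACGT-ACGT", "ACGTTACGT"))  # 8 matches, one gap of length 1
```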
12
Filtering
  • Low-complexity
  • SEG: amino acid sequences
  • DUST: nucleic acid sequences
  • Human repeats, ...
  • lookup table
  • Lower Case

13
SEG output
14
Score of Alignment
  • How strong an alignment can be expected from
    chance alone
  • To analyze how high a score is likely to arise by
    chance, a model of random sequences is needed.
  • The expected score for aligning a random pair of
    amino acids is required to be negative.
  • An extreme value distribution

15
[Figure: the probability density function of the extreme value distribution (characteristic value u = 0 and decay constant λ = 1) compared with the normal distribution; x-axis: x from -5 to 5; y-axis: probability from 0 to 0.40. Source: http://www.bioinfbook.org/]
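The two curves in this figure can be recomputed with a short sketch. It assumes the standard Gumbel form of the extreme value density, f(x) = λ exp(-λ(x - u) - exp(-λ(x - u))), with u = 0 and λ = 1 as in the caption. The function names are illustrative.

```python
import math

def evd_pdf(x, u=0.0, lam=1.0):
    """Extreme value (Gumbel) density with characteristic value u and decay constant lam."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density, for comparison."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

for x in range(-5, 6):
    print(f"{x:3d}  normal={normal_pdf(x):.4f}  evd={evd_pdf(x):.4f}")
```

The extreme value density peaks at x = u and decays more slowly on the right than the normal density, which is why high chance scores are more likely than a normal model would suggest.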
16
Extreme Value Distribution
  • The most one can say reliably is that if 100
    random alignments all score lower than the
    alignment of interest, the P-value in question is
    likely less than 0.01.
  • Multiple tests: an alignment with a P-value of 0.0001
    in the context of a single trial may be assigned
    a P-value of only 0.1 if it was selected as the
    best among 1000 independent trials.
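The multiple-testing figure quoted above can be checked with a one-line calculation. This sketch assumes the trials are independent, as the slide states. The function name is illustrative.

```python
def best_of_n_pvalue(p_single, n_trials):
    """P-value of the best result among n independent trials, each with P-value p_single."""
    return 1.0 - (1.0 - p_single) ** n_trials

p = best_of_n_pvalue(0.0001, 1000)
print(round(p, 4))  # close to 0.1, as stated on the slide
```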

17
Entropy
18
Entropy
  • Random DNA: $p = 0.25$ for each base
  • $H = -4(0.25)(-2) = 2$ bits
  • $p(A/T) = 0.9$, $p(C/G) = 0.1$
  • $H = -2[(0.45)(-1.15) + (0.05)(-4.32)] = 1.47$ bits
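Both entropy values on this slide follow from the Shannon formula H = -Σ p log2 p. A minimal sketch, with the second case read as p(A) = p(T) = 0.45 and p(C) = p(G) = 0.05:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])
at_rich = entropy_bits([0.45, 0.45, 0.05, 0.05])
print(uniform, round(at_rich, 2))
```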

19
lod Score
  • Pairs of amino acids or nucleotides
  • log2 odds ratio
  • Random model: $q_{ij} = p_i p_j$
  • $S_{ij} = \log_2(q_{ij} / p_i p_j)$
  • Random pairing: $S_{ij} = 0$

20
Score
  • The scores of any substitution matrix with
    negative expected score can be written uniquely
    in the form
  • $S_{ij} = \ln(q_{ij} / p_i p_j) / \lambda$
  • where the qij, called target frequencies, are
    positive numbers that sum to 1, the pi are
    background frequencies for the various residues,
    and lambda is a positive constant.
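Because the target frequencies must sum to 1, lambda is the positive solution of Σ p_i p_j exp(λ S_ij) = 1. The sketch below solves this by bisection for a simple nucleotide scheme (+1 match, -3 mismatch, uniform background). The function name, the bracketing strategy and the example scoring scheme are assumptions made for illustration.

```python
import math
from itertools import product

def solve_lambda(score, p, tol=1e-12):
    """Find lambda > 0 with sum_ij p_i p_j exp(lambda * S_ij) = 1, by bisection.

    Requires a negative expected score and at least one positive score.
    """
    pairs = [(p[a] * p[b], score(a, b)) for a, b in product(list(p), repeat=2)]
    if sum(w * s for w, s in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if not any(s > 0 for _, s in pairs):
        raise ValueError("at least one score must be positive")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    lo, hi = 1e-6, 1.0          # f(lo) < 0 just above the trivial root at 0
    while f(hi) <= 0:           # grow the bracket until f changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
match_mismatch = lambda a, b: 1 if a == b else -3

lam = solve_lambda(match_mismatch, background)
# Implied target frequencies q_ij = p_i p_j exp(lambda * S_ij).
q = {(a, b): background[a] * background[b] * math.exp(lam * match_mismatch(a, b))
     for a, b in product(background, repeat=2)}
print(round(lam, 3), round(sum(q.values()), 6))
```

Plugging the resulting q_ij back into ln(q_ij / p_i p_j) / λ recovers the original +1 and -3 scores.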

21
Bit Score
  • Raw scores have little meaning without detailed
    knowledge of the scoring system used, or more
    simply its statistical parameters K and lambda.
  • Unless the scoring system is understood, citing a
    raw score alone is like citing a distance without
    specifying feet, meters, or light years.
  • $S' = (\lambda S - \ln K) / \ln 2$

22
Parameters
  • The parameters K and λ can be thought of simply
    as natural scales for the search space size and
    the scoring system respectively.

23
E-value
  • Bit score S', which has a standard set of units.
  • The E-value corresponding to a given bit score is
    simply
  • $E = mn \, 2^{-S'}$

24
E-value
  • The number of hits one can "expect" to see just
    by chance when searching a database of a
    particular size
  • The E value describes the random background noise
    that exists for matches between sequences.
  • The expected number of HSPs with score at least S
    is given by the formula
  • $E = Kmn \, e^{-\lambda S}$
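The raw-score and bit-score forms of the E-value are the same quantity, which a short sketch can confirm. Here m and n are taken to be the query and database lengths, and the bit score is computed as S' = (λS - ln K) / ln 2. The values of λ, K, m, n and S are illustrative assumptions, not parameters taken from these slides.

```python
import math

def bit_score(raw, lam, K):
    """Convert a raw score S to a bit score S'."""
    return (lam * raw - math.log(K)) / math.log(2.0)

def evalue_from_raw(raw, m, n, lam, K):
    """E = K m n exp(-lambda S)."""
    return K * m * n * math.exp(-lam * raw)

def evalue_from_bits(bits, m, n):
    """E = m n 2^(-S')."""
    return m * n * 2.0 ** (-bits)

# Illustrative values only (assumed for this example).
lam, K = 0.318, 0.134
m, n, raw = 250, 50_000_000, 100

e_raw = evalue_from_raw(raw, m, n, lam, K)
e_bits = evalue_from_bits(bit_score(raw, lam, K), m, n)
print(e_raw, e_bits)
```

The two results agree up to floating-point rounding, which is the point of the bit score: once S' is known, K and λ are no longer needed to compute E.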

25
p-value
  • The number of random HSPs with score ≥ S is
    described by a Poisson distribution.
  • This means that the probability of finding
    exactly x HSPs with score ≥ S is given by
    $e^{-E} E^x / x!$, where E is the E-value of S.
  • Specifically, the chance of finding zero HSPs with
    score ≥ S is $e^{-E}$, so the probability of finding
    at least one such HSP is $1 - e^{-E}$.
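The Poisson relationship above translates directly into code. This is a minimal sketch with illustrative function names. `math.expm1` is used so that very small E-values do not lose precision.

```python
import math

def prob_exactly(x, E):
    """Poisson probability of exactly x chance HSPs when E are expected."""
    return math.exp(-E) * E ** x / math.factorial(x)

def pvalue_from_evalue(E):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-E)

for E in (10, 1, 0.1, 0.001):
    print(E, f"{pvalue_from_evalue(E):.8f}")
```

For small E the p-value is nearly equal to E, while for E near 10 it saturates just below 1.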

26
E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values.

E         p
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000

Table 4.4 page 107
27
Query length: 142196. Database: 6,672,153 sequences;
23,415,242,475 total letters. E < 1E-100
28
Query length: 142196. Database: 6,672,153 sequences;
23,415,242,475 total letters. E < 1E-100
29
Query length: 352. Database: 6,672,153 sequences;
23,415,242,475 total letters. E < 1E-50
30
Query length: 352. Database: 6,672,153 sequences;
23,415,242,475 total letters. E < 1E-50
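As a rough check on these examples, the relation E = mn 2^(-S') can be inverted to give the bit score needed to reach a given E-value for each search space. The sketch uses the query and database lengths quoted on these slides directly. It ignores the edge-effect corrections to m and n that BLAST applies, so the numbers are only approximate.

```python
import math

def min_bit_score(m, n, e_threshold):
    """Smallest bit score S' for which m * n * 2**(-S') <= e_threshold."""
    return math.log2(m * n / e_threshold)

db_letters = 23_415_242_475
short_query = min_bit_score(352, db_letters, 1e-50)
long_query = min_bit_score(142196, db_letters, 1e-100)
print(round(short_query, 1), round(long_query, 1))
```

A longer query enlarges the search space, so a given bit score corresponds to a larger E-value.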
32
Thank you