Alignment Score Statistics - PowerPoint PPT Presentation

About This Presentation
Title:

Alignment Score Statistics

Description:

cbio course, spring 2005, Hebrew University (Alignment) Score Statistics ... of length n sampled from the uniform distribution over the DNA alphabet. ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Alignment Score Statistics


1
(Alignment) Score Statistics
2
Motivation
  • Reminder
  • Basic motivation: we want to check whether 2
    sequences are related or not
  • We align the 2 sequences and get a score s which
    measures how similar they are
  • Given s, do we accept the hypothesis that the 2 are
    related, or reject it?

How high should s be so that we believe they
are related?
3
Motivation (2)
  • We need a rigorous way to decide on a threshold
    s* so that for s > s*, we call the sequences
    related
  • Note
  • s* should obviously be s*(n,m), where n and m are
    the lengths of the 2 sequences aligned
  • When we try matching sequence x against a database
    of N (N >> 1) sequences, we need to account for
    the fact that we might see high scores just by
    chance
  • We can make 2 kinds of mistakes in our calls
  • FP (false positive): calling unrelated sequences
    related
  • FN (false negative): missing truly related
    sequences
  • ⇒ We want our rigorous way to control FP and FN
    mistakes

4
Motivation (3)
  • The problem of assigning statistical significance
    to scores and controlling our FP and FN mistakes
    is of general interest.
  • Examples
  • Similarity scores between a protein sequence and a
    profile HMM
  • Log-ratio scores when searching for DNA sequence
    motifs
  • ...
  • The methods we develop now will be of general use

5
Reminder
  • In the last lesson we talked about 2 ways to
    analyze alignment scores and their significance
  • Bayesian
  • Classical EVD approach
  • We reviewed how the number of FP mistakes can be
    controlled using each of these approaches
  • We reviewed the Karlin-Altschul (1990) results

6
Review: First Approach (Bayesian)
Assume we have two states in our world: M
(Model: related sequences) and R (Random:
unrelated sequences). Given a fixed alignment of
two sequences (x,y), we ask which state it
came from: M or R?
We saw:
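The formula itself was an image on the slides (the next two slides
have no transcript). A plausible reconstruction, assuming the standard
Bayesian treatment of the model-vs-random question (as in Durbin et
al.):

    P(M \mid x, y) = \sigma\bigl(S'(x, y)\bigr), \qquad
    S'(x, y) = \log \frac{P(x, y \mid M)}{P(x, y \mid R)}
             + \log \frac{P(M)}{P(R)}, \qquad
    \sigma(t) = \frac{1}{1 + e^{-t}}

i.e. the posterior probability that the pair is related is the
logistic function applied to the log-odds alignment score plus the
log prior odds.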
7
(No Transcript)
8
(No Transcript)
9
Review: Bayesian Approach (cont.)
  • We saw that in order to control the expected
    number of false identifications when testing
    scores that came from R, we need the threshold
    over the scores S to satisfy S > log(number of
    trials × K) (see the sketch below)
  • Where
  • The number of trials for scoring a sequence of
    length m in local alignment against N sequences
    of length n is nmN
  • K in [0,1] is a correlation factor compensating
    for the fact that the trials are correlated.
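A minimal Python sketch of this threshold rule (the helper name and
the example numbers are illustrative, not from the slides):

    import math

    def bayes_threshold(n, m, N, K=1.0):
        # S must exceed log(K * n * m * N): with about K*n*m*N
        # effective trials, this keeps the expected number of false
        # identifications under R below ~1.
        return math.log(K * n * m * N)

    # e.g. a query of length m = 300 against N = 1e5 database
    # sequences of length n = 300:
    print(bayes_threshold(n=300, m=300, N=1e5))  # ~22.9 (natural log)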

10
Review: EVD Approach
  • In the EVD approach we are interested in the
    question: given a score s for aligning x and y,
    if this s came from a distribution of scores for
    unrelated sequences (like R in the Bayesian
    approach), what is the probability of seeing a
    score as good as s by chance, simply because I
    tried so many matches of sequences against x?
  • R here is the null hypothesis we are testing
    against.
  • If P(score > s | we tried N scores) < threshold
    (say 0.01), then we reject the null hypothesis
    (R)
  • NOTE
  • There is no second hypothesis here.
  • We are guarding against type I errors (FP)
  • No control or assumptions are made about FN here!
  • This setting is appropriate for the problem we
    have at hand (database search)

11
(No Transcript)
12
(No Transcript)
13
Toy Problem
  • Let s,t be two randomly chosen DNA sequences of
    length n, sampled from the uniform distribution
    over the DNA alphabet.
  • Align s versus t with no gaps (i.e. s_1 is
    aligned to t_1, ..., s_n is aligned to t_n).
  • What is the probability that there are k matches
    (not necessarily contiguous ones) between s and
    t?
  • Suppose you are a researcher and you have two
    main hypotheses:
  • Either these two sequences are totally unrelated,
    or there was a common ancestor to both of them
    (there is no indel option here).
  • How would you use the number of matches to decide
    between the two options and attach a statistical
    confidence measure to this decision? (See the
    sketch below.)
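Under the uniform null each position matches independently with
probability 1/4, so the number of matches is Binomial(n, 1/4). A
minimal sketch of the computation (standard library only; the function
names are illustrative):

    from math import comb

    def match_pmf(k, n, p=0.25):
        # P(exactly k of the n aligned positions match), each position
        # matching independently with probability p
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def pvalue(k, n, p=0.25):
        # P(at least k matches) under the null -- the score's p-value
        return sum(match_pmf(j, n, p) for j in range(k, n + 1))

    # Decision sketch: call the sequences related when the p-value of
    # the observed number of matches falls below a chosen threshold.
    print(pvalue(30, 100))  # p-value of score 30 for n = 100 (next slide)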

14
P-value for a score of 30 with n = 100
NOTE: As in our real problem, the p-value of a score
depends on n
15
EVD for our problem
In the EVD approach, we are interested in the
question: what is the probability of seeing a
score as good as S, coming only from matches to
unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that
P[max(S_1, ..., S_N) > S] < 0.05, where the S_i are
scores of matches against unrelated sequences
sampled i.i.d., then, since by independence
P[max(S_1, ..., S_N) <= S] = (1 - pvalue(S))^N,
we need (1 - pvalue(S))^N > 0.95, i.e.
pvalue(S) < 1 - 0.95^(1/N)
16
(No Transcript)
17
Guarding against mistakes, evaluating performance
  • In the EVD approach we kept guarding against FP
    mistakes. This is very important when doing tasks
    where many tests are performed, as in our case of
    database search
  • Sometimes we are not able to compute the EVD and
    we still want to control the FPR. A very strict
    and simple solution is the Bonferroni-corrected
    p-value: pvalue × N
  • Where N is the number of tests performed.
  • Note: the relation to the union bound is clear
  • Problem: Bonferroni controls the FWER (family-wise
    error rate), i.e. the probability of seeing even 1
    mistake (a FP mistake) in the results we report as
    significant.
  • It does so with basically no assumptions on the
    distribution, the relations between the hypotheses
    tested, etc., and still guarantees control over
    the FWER
  • The price to pay is in FN ...

18
Bonferroni example for our case
  • We saw
  • If we want to guarantee that
    P[max(S_1, ..., S_N) > S] < 0.05, where the S_i
    are scores of matches against unrelated sequences
    sampled i.i.d., then
  • (1 - pvalue(S))^N > 0.95
  • i.e. pvalue(S) < 1 - 0.95^(1/N)

Compare, for N = 10, the result of this equation,
0.005116, to the Bonferroni-corrected p-value
threshold 0.05/N = 0.005; for N = 20, 0.00256 vs.
0.002, etc. (A sketch reproducing these numbers
follows below.)
If we had used the strict Bonferroni threshold for
the same guarantee level we wanted, we might have
rejected some good results.
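A minimal Python sketch reproducing these numbers (the exact per-test
threshold from the i.i.d. derivation above versus the Bonferroni one):

    for N in (10, 20):
        exact = 1 - 0.95 ** (1 / N)  # from (1 - pvalue(S))^N > 0.95
        bonf = 0.05 / N              # Bonferroni: slightly stricter
        print(f"N={N}: exact={exact:.6f}  Bonferroni={bonf:.6f}")
    # N=10: exact=0.005116  Bonferroni=0.005000
    # N=20: exact=0.002561  Bonferroni=0.002500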
19
How to estimate performance?
  • Say you have a method with a score (in our case, a
    method scoring local alignments with affine gaps
    and a scoring matrix Sigma, e.g. Sigma = PAM1)
  • You set a threshold over the scores based on some
    criterion (e.g. an EVD estimate of the scores of
    random matches)
  • You want to evaluate your method's performance on
    some test data set. The data set would typically
    contain some true and some false examples.
  • Assumption: you KNOW the answers for this test
    set!
  • Aim: you want to see the tradeoff you get for
    using various thresholds on the scores, in terms
    of FP and FN on the data set.

20
ROC curves
  • ROC: Receiver Operating Characteristic curve

[ROC plot: x-axis FPR from 0 to 100; y-axis sensitivity from 0 to
100; best performance is the top-left corner.]
Sensitivity = TP / (TP + FN): what fraction of the true ones we
capture (y-axis).
FPR (False Positive Rate, the empirical p-value) = FP / (FP + TN)
= FP / (all real negatives): what fraction of the bad ones we let
pass (x-axis).
21
ROC curves
  • NOTE: each point on the ROC curve matches a
    certain threshold over the method's scores
  • Each method gives a different curve
  • We can now compare methods' performance
  • At a certain point on the graph
  • Via the total area under the curve (AUC); a
    sketch of building the curve follows below
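A minimal sketch of building the empirical ROC curve by sweeping the
threshold down the observed scores (the scores and labels here are
made up for illustration):

    def roc_points(scores, labels):
        # labels[i] is True for a real positive example. Each pass
        # through the loop lowers the threshold past one more score,
        # yielding one (FPR, sensitivity) point per threshold.
        pos = sum(labels)
        neg = len(labels) - pos
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, is_true in sorted(zip(scores, labels), reverse=True):
            if is_true:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    print(roc_points([9.1, 7.5, 6.2, 5.0, 3.3],
                     [True, True, False, True, False]))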

22
FDR
  • A less stringent statistical criterion is the FDR
    (False Discovery Rate), suggested by Benjamini and
    Hochberg (1995).
  • Main idea: control the rate of false reports
    within the total set of reports you give.
  • i.e. FDR = 5% means that the expected ratio of
    false detections to the total number of
    detections is going to be 5%:
  • E[FP / (FP + TP)] = 0.05
  • The expectation (E) is taken over the total
    distribution, which may contain both true and
    false hypotheses. When there are no true
    hypotheses, FDR control coincides with FWER
    (Bonferroni-style) control; but if not, it gives
    the test more power. (See the sketch below.)
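A minimal sketch of the Benjamini-Hochberg step-up procedure that
controls the FDR at level q (the example p-values are made up):

    def benjamini_hochberg(pvalues, q=0.05):
        # Report the largest k such that the k-th smallest p-value
        # satisfies p_(k) <= q * k / N; everything up to that rank
        # is called a discovery.
        N = len(pvalues)
        order = sorted(range(N), key=lambda i: pvalues[i])
        k_max = 0
        for rank, i in enumerate(order, start=1):
            if pvalues[i] <= q * rank / N:
                k_max = rank
        return {order[r] for r in range(k_max)}  # indices of discoveries

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.30]))
    # -> {0, 1}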

23
EVD for our problem (recap)
In the EVD approach, we are interested in the
question: what is the probability of seeing a
score as good as S, coming only from matches to
unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that
P[max(S_1, ..., S_N) > S] < 0.05, where the S_i are
scores of matches against unrelated sequences
sampled i.i.d., then (1 - pvalue(S))^N > 0.95,
i.e. pvalue(S) < 1 - 0.95^(1/N).
Compare, for N = 10, the result 0.005116 to the
Bonferroni-corrected p-value 0.05/N = 0.005; for
N = 20, 0.00256 vs. 0.002, etc.
24
Back to our Toy Problem
Assume the data we need to handle came from two
sources, as in the Bayesian approach: R, unrelated
sequences, with p(a,a) = 0.25; and M, related
sequences, with p(a,a) = 0.4 and p(a,b) = 0.2 for
each b ≠ a. We use the delta scoring matrix,
i.e. S(a,a) = 1 and S(a,b) = 0, so the alignment
score is simply the number of matches. (See the
sketch below.)
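A minimal sketch of the two score distributions this model implies
(reading p(a,a) as the per-position match probability, so the score,
i.e. the number of matches, is Binomial(n, 0.25) under R and
Binomial(n, 0.4) under M; n = 100 is an arbitrary illustrative
choice):

    from math import comb, log

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n = 100
    null_p  = [binom_pmf(k, n, 0.25) for k in range(n + 1)]  # under R
    model_p = [binom_pmf(k, n, 0.40) for k in range(n + 1)]  # under M
    # Per-score log-odds: the quantity the Bayesian approach
    # thresholds when deciding between M and R.
    log_odds = [log(m / r) for m, r in zip(model_p, null_p)]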
25
Finish with a Thought
  • In our toy problem, what's the relation between
    the graph of the last slide and the ROC curve we
    talked about?
  • How does the relative amount of samples from M
    and R in our data set affect the ROC? What should
    the total distribution over the scores look like?