Title: Alignment Score Statistics
1. (Alignment) Score Statistics
2. Motivation
- Reminder
- Basic motivation: we want to check if two sequences are related or not.
- We align two sequences and get a score s which measures how similar they are.
- Given s, do we accept the hypothesis that the two are related, or reject it?
How high should s be so that we believe they are related?
3. Motivation (2)
- We need a rigorous way to decide on a threshold s* so that for s > s*, we call the sequences related.
- Note:
- s* should obviously be s*(n,m), where n and m are the lengths of the two sequences aligned.
- When we try matching sequence x against a database of N (N >> 1) sequences, we need to account for the fact that we might see high scores just by chance.
- We can make two kinds of mistakes in our calls:
- FP (false positive)
- FN (false negative)
- We want our rigorous way to control FP/FN mistakes.
4. Motivation (3)
- The problem of assigning statistical significance to scores and controlling our FP and FN mistakes is of general interest.
- Examples:
- Similarities between a protein sequence and a profile HMM
- Log-ratio scores when searching for DNA sequence motifs
- ...
- The methods we develop now will be of general use.
5. Reminder
- In the last lesson we talked about two ways to analyze alignment scores and their significance:
- Bayesian
- Classical EVD approach
- We reviewed how the number of FP mistakes can be controlled using each of these approaches.
- We reviewed the Karlin-Altschul (1990) results.
6. Review: First Approach (Bayesian)
Assume we have two states in our world: M (Model: related sequences) and R (Random: unrelated sequences). Given a fixed alignment of two sequences (x, y), we ask from which state it came: M or R?
We saw:
S = log [ P(x,y | M) / P(x,y | R) ] = sum_i s(x_i, y_i)
Where:
s(a,b) = log ( p_ab / (q_a q_b) ), with p_ab the pair probabilities under M and q_a the background letter frequencies under R.
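The log-odds score and posterior above can be sketched in a few lines. This is a minimal sketch under an assumed toy per-position model (the 0.4 / 0.2 numbers below are borrowed from the toy problem later in the deck, not a real substitution model):

```python
import math

# Assumed toy model: under M a position matches with probability 0.4
# (each specific mismatch letter has probability 0.2); under R the
# aligned letter is uniform, so each letter has probability 0.25.
def log_odds_score(x, y):
    S = 0.0
    for a, b in zip(x, y):
        p_M = 0.4 if a == b else 0.2   # P(y_i = b | x_i = a, M)
        p_R = 0.25                     # P(y_i = b | x_i = a, R), uniform
        S += math.log(p_M / p_R)
    return S

def posterior_M(x, y, prior_M=0.5):
    # Bayes' rule: P(M | x, y) = sigmoid(S + log prior odds)
    z = log_odds_score(x, y) + math.log(prior_M / (1.0 - prior_M))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, posterior_M("AAAAAAAA", "AAAAAAAA") is close to 1 under this toy model, while mostly-mismatched pairs fall below 0.5.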
9. Review: Bayesian Approach (cont.)
- We saw that in order to control the expected number of false identifications when testing scores which came from R, we need the threshold over the scores S to satisfy S > log(number of trials x K).
- Where:
- The number of trials for scoring a sequence of length m in local alignment against N sequences of length n is nmN.
- K in (0,1) is a correlation factor compensating for the fact that the trials are correlated.
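The Karlin-Altschul-style threshold rule above can be sketched numerically. The lam and K values below are assumed placeholders (real values are derived from the scoring matrix and background frequencies), so this shows the form of the result, not usable parameters:

```python
import math

def karlin_altschul_evalue(S, m, n, N, lam=0.3, K=0.1):
    # Expected number of chance local-alignment hits scoring >= S when a
    # length-m query is searched against N database sequences of length n:
    # E = K * m * n * N * exp(-lam * S).  lam and K are placeholders here.
    return K * m * n * N * math.exp(-lam * S)

def score_threshold(E, m, n, N, lam=0.3, K=0.1):
    # Inverting E for S shows the slide's point: the threshold grows like
    # log(K * number of trials), with number of trials = m * n * N.
    return math.log(K * m * n * N / E) / lam
```

Note how multiplying the database size N by 10 raises the required threshold only additively, by log(10)/lam.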
10. Review: EVD Approach
- In the EVD approach we are interested in the question: given a score s for aligning x and y, if this s came from the distribution of scores for unrelated sequences (like R in the Bayesian approach), what is the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x?
- R here is the null hypothesis we are testing against.
- If P(score > s | we tried N scores) < threshold (say 0.01), then we reject the null hypothesis (R).
- NOTE:
- There is no second hypothesis here.
- We are guarding against type I errors (FP).
- No control or assumptions are made about FN here!
- This setting is appropriate for the problem we have at hand (database search).
13. Toy Problem
- Let s, t be two randomly chosen DNA sequences of length n, sampled from the uniform distribution over the DNA alphabet.
- Align s versus t with no gaps (i.e. s1 is aligned to t1, ..., sn is aligned to tn).
- What is the probability that there are k matches (not necessarily contiguous ones) between s and t?
- Suppose you are a researcher and you have two main hypotheses: either these two sequences are totally unrelated, or there was a common ancestor to both of them (there is no indel option here).
- How would you use the number of matches to decide between the two options, and attach a statistical confidence measure to this decision?
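Since each of the n positions matches independently with probability 1/4 under the uniform model, the number of matches is Binomial(n, 1/4). A short standard-library check:

```python
from math import comb

def p_k_matches(n, k, p=0.25):
    # P(exactly k matches) = C(n, k) * p^k * (1-p)^(n-k),
    # with p = 1/4 for uniform i.i.d. DNA.
    return comb(n, k) * p**k * (1 - p)**(n - k)
```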
14. P-value for score 30 and n = 100
NOTE: As in our real problem, the p-value of a score depends on n.
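The slide's point, that the p-value of the same raw score depends on n, can be checked directly with the binomial tail:

```python
from math import comb

def pvalue(score, n, p=0.25):
    # P(#matches >= score) under the null: Binomial(n, 1/4) upper tail.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(score, n + 1))
```

The same score of 30 is much more significant for n = 100 than for n = 120, where the expected number of chance matches is already 30.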
15. EVD for our problem
In the EVD approach, we are interested in the question: what is the probability of seeing a score as good as S, only from matches to unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then we need (1 - pvalue(S))^N > 0.95, i.e. pvalue(S) < 1 - 0.95^(1/N).
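Writing the computation above out as code:

```python
def p_max_exceeds(pval, N):
    # P[max(S1..SN) > S] for N i.i.d. null scores, each with
    # per-score p-value pval: 1 - (1 - pval)^N.
    return 1.0 - (1.0 - pval) ** N

def per_score_cutoff(alpha, N):
    # Largest per-score p-value that still guarantees
    # P[max > S] <= alpha:  (1 - p)^N >= 1 - alpha
    # =>  p <= 1 - (1 - alpha)^(1/N).
    return 1.0 - (1.0 - alpha) ** (1.0 / N)
```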
17. Guarding against mistakes, evaluating performance
- In the EVD approach we kept guarding against FP mistakes. This is very important when doing tasks where many tests are performed, as in our case of database search.
- Sometimes we are not able to compute the EVD and we still want to control the FPR. A very strict and simple solution is the Bonferroni-corrected p-value: pvalue x N.
- Where N is the number of tests performed.
- Note: the relation to the union bound is clear.
- Problem: Bonferroni controls the FWER (family-wise error rate), i.e. the probability of seeing even one mistake (a FP) among the results we report as significant.
- It does so with basically no assumptions on the distribution, the relations between the hypotheses tested, etc., and still guarantees control over the FWER.
- The price to pay is in FN...
18. Bonferroni Example in our case
- We saw:
- If we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then:
- (1 - pvalue(S))^N > 0.95
- i.e. pvalue(S) < 1 - 0.95^(1/N)
Compare, for N = 10, the result of this equation (0.005116) to the Bonferroni-corrected p-value threshold 0.05/N = 0.005. For N = 20: 0.00256 vs. 0.0025, etc.
If we used the strict Bonferroni correction for the same guarantee level we wanted, we might have rejected some good results.
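The numerical comparison on this slide can be reproduced directly:

```python
def exact_cutoff(alpha, N):
    # Exact per-test cutoff for i.i.d. null scores: 1 - (1 - alpha)^(1/N).
    return 1.0 - (1.0 - alpha) ** (1.0 / N)

def bonferroni_cutoff(alpha, N):
    # Stricter union-bound cutoff; needs no independence assumption.
    return alpha / N

for N in (10, 20, 100):
    print(N, round(exact_cutoff(0.05, N), 6), bonferroni_cutoff(0.05, N))
```

Bonferroni's cutoff is always the smaller (stricter) of the two, which is exactly the loss of power the slide mentions.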
19. How to estimate performance?
- Say you have a method with a score (in our case: a method scoring local alignment with affine gaps and a scoring matrix Sigma, e.g. Sigma = PAM1).
- You set a threshold over the scores based on some criterion (e.g. EVD estimation of the scores in random matches).
- You want to evaluate your method's performance on some test data set. The data set would typically contain some true and some false examples.
- Assumption: you KNOW the answers for this test set!
- Aim: you want to see the tradeoff you get for using various thresholds on the scores, in terms of FP and FN on the data set.
20. ROC curves
- ROC: Receiver Operating Characteristic curve.
- Sensitivity = TP / (TP + FN): what fraction of the true ones we capture (y-axis, 0 to 100).
- FPR (False Positive Rate) = empirical p-value = FP / (FP + TN) = FP / (# real negatives): what fraction of the bad ones we pass (x-axis, 0 to 100).
- Best performance is the top-left corner of the plot: 100 sensitivity at 0 FPR.
21. ROC curves (cont.)
- NOTE: each point on the ROC curve matches a certain threshold over the method's scores.
- Each method gives a different curve.
- We can now compare methods' performance:
- At a certain point on the graph
- Via the total area under the curve
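A minimal sketch of building the empirical ROC points and the area under the curve from scored, labeled examples (standard library only; names are illustrative):

```python
def roc_points(scores, labels):
    # Sweep the threshold from high to low; each prefix of the
    # score-sorted examples yields one (FPR, sensitivity) point.
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoid rule over consecutive (FPR, sensitivity) points.
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A method that ranks every true pair above every false one traces the top-left corner and gets AUC = 1.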
22. FDR
- A less stringent statistical criterion is the FDR (False Discovery Rate), suggested by Benjamini and Hochberg (1995).
- Main idea: control the rate of false reports within the total set of reports you give.
- i.e. FDR = 5% means that the expected ratio of false detections to your total number of detections is going to be 5%:
- E[ FP / (FP + TP) ] = 0.05
- The expectation (E) is taken over the total distribution, which may contain both true and false hypotheses. When there are no true hypotheses, FDR control coincides with Bonferroni-style FWER control; otherwise it gives the test more power.
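A sketch of the Benjamini-Hochberg step-up procedure the slide refers to:

```python
def benjamini_hochberg(pvalues, q=0.05):
    # Sort the p-values; find the largest rank i with
    # p_(i) <= (i / N) * q, and reject the i smallest p-values.
    # Returns the 0-based indices (into the input) of rejected hypotheses.
    N = len(pvalues)
    order = sorted(range(N), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / N:
            cutoff_rank = rank
    return set(order[:cutoff_rank])
```

With p-values [0.001, 0.012, 0.039, 0.041, 0.6] and q = 0.05, BH rejects the two smallest, while the Bonferroni cutoff 0.05/5 = 0.01 rejects only the first: the extra power the slide mentions.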
23. EVD for our problem (recap)
In the EVD approach, we are interested in the question: what is the probability of seeing a score as good as S, only from matches to unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then (1 - pvalue(S))^N > 0.95, i.e. pvalue(S) < 1 - 0.95^(1/N).
Compare, for N = 10, the result (0.005116) to the Bonferroni-corrected threshold 0.05/N = 0.005. For N = 20: 0.00256 vs. 0.0025, etc.
24. Back to our Toy Problem
Assume the data we need to handle came from two sources, as in the Bayesian approach:
- R (unrelated sequences): p(a,a) = 0.25
- M (related sequences): p(a,a) = 0.4, p(a,b) = 0.2
Delta scoring matrix, i.e. S(a,a) = 1, S(a,b) = 0.
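With the delta scoring matrix, the alignment score is simply the number of matching positions, so it is Binomial(n, 0.25) under R and Binomial(n, 0.4) under M. A sketch of the two score models and the resulting posterior:

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_M_given_score(k, n, prior_M=0.5):
    # Bayes' rule over the two binomial score models:
    # score ~ Bin(n, 0.4) under M, Bin(n, 0.25) under R.
    lM = binom_pmf(n, k, 0.40) * prior_M
    lR = binom_pmf(n, k, 0.25) * (1.0 - prior_M)
    return lM / (lM + lR)
```

Scanning k from 0 to n shows the posterior rising from near 0 to near 1 around the crossover between the two binomial distributions.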
25. Finish with a Thought
- In our toy problem, what is the relation between the graph of the last slide and the ROC curve we talked about?
- How does the relative amount of samples from M and R in our data set affect the ROC? How should the total distribution over the scores look?