Title: Alignment Score Statistics
1. (Alignment) Score Statistics
2. Motivation
- Reminder
- Basic motivation: we want to check if two sequences are related or not.
- We align two sequences and get a score s which measures how similar they are.
- Given s, do we accept the hypothesis that the two are related, or reject it?
How high should s be so that we believe they are related?
3. Motivation (2)
- We need a rigorous way to decide on a threshold s* so that for s > s*, we call the sequences related.
- Note:
- s* should obviously be s*(n,m), where n and m are the lengths of the two sequences aligned.
- When we try matching sequence x against a database of N (N >> 1) sequences, we need to account for the fact that we might see high scores just by chance.
- We can make two kinds of mistakes in our calls:
- FP (false positive)
- FN (false negative)
- We want our rigorous way to control FP/FN mistakes.
4. Motivation (3)
- The problem of assigning statistical significance to scores and controlling our FP and FN mistakes is of general interest.
- Examples:
- Similarities between a protein sequence and a profile HMM
- Log-ratio scores when searching for DNA sequence motifs
- ...
- The methods we develop now will be of general use.
5. Reminder
- In the last lesson we talked about two ways to analyze alignment scores and their significance:
- Bayesian
- Classical EVD approach
- We reviewed how the number of FP mistakes can be controlled using each of these approaches.
- We reviewed the Karlin-Altschul (1990) results.
6. Review: First Approach (Bayesian)
Assume we have two states in our world: M (Model: related sequences) and R (Random: unrelated sequences). Given a fixed alignment of two sequences (x, y), we ask from which state it came: M or R?
We saw:
S = log [ P(x,y | M) / P(x,y | R) ] = sum_i s(x_i, y_i)
Where:
s(a,b) = log ( p_ab / (q_a q_b) ), with p_ab the pair probabilities under M and q_a the background letter frequencies under R.
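The log-odds score and posterior above can be sketched in a few lines. This is a minimal sketch under an assumed toy per-position model (the 0.4 / 0.2 numbers below are borrowed from the toy problem later in the deck, not a real substitution model):

```python
import math

# Assumed toy model: under M a position matches with probability 0.4
# (each specific mismatch letter has probability 0.2); under R the
# aligned letter is uniform, so each letter has probability 0.25.
def log_odds_score(x, y):
    S = 0.0
    for a, b in zip(x, y):
        p_M = 0.4 if a == b else 0.2   # P(y_i = b | x_i = a, M)
        p_R = 0.25                     # P(y_i = b | x_i = a, R), uniform
        S += math.log(p_M / p_R)
    return S

def posterior_M(x, y, prior_M=0.5):
    # Bayes' rule: P(M | x, y) = sigmoid(S + log prior odds)
    z = log_odds_score(x, y) + math.log(prior_M / (1.0 - prior_M))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, posterior_M("AAAAAAAA", "AAAAAAAA") is close to 1 under this toy model, while mostly-mismatched pairs fall below 0.5.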
9. Review: Bayesian Approach (cont.)
- We saw that in order to control the expected number of false identifications when testing scores which came from R, we need the threshold over the scores S to satisfy S > log(number of trials x K).
- Where:
- The number of trials for scoring a sequence of length m in local alignment against N sequences of length n is nmN.
- K in (0,1) is a correlation factor compensating for the fact that the trials are correlated.
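The Karlin-Altschul-style threshold rule above can be sketched numerically. The lam and K values below are assumed placeholders (real values are derived from the scoring matrix and background frequencies), so this shows the form of the result, not usable parameters:

```python
import math

def karlin_altschul_evalue(S, m, n, N, lam=0.3, K=0.1):
    # Expected number of chance local-alignment hits scoring >= S when a
    # length-m query is searched against N database sequences of length n:
    # E = K * m * n * N * exp(-lam * S).  lam and K are placeholders here.
    return K * m * n * N * math.exp(-lam * S)

def score_threshold(E, m, n, N, lam=0.3, K=0.1):
    # Inverting E for S shows the slide's point: the threshold grows like
    # log(K * number of trials), with number of trials = m * n * N.
    return math.log(K * m * n * N / E) / lam
```

Note how multiplying the database size N by 10 raises the required threshold only additively, by log(10)/lam.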
10. Review: EVD Approach
- In the EVD approach we are interested in the question: given a score s for aligning x and y, if this s came from the distribution of scores for unrelated sequences (like R in the Bayesian approach), what is the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x?
- R here is the null hypothesis we are testing against.
- If P(score > s | we tried N scores) < threshold (say 0.01), then we reject the null hypothesis (R).
- NOTE:
- There is no second hypothesis here.
- We are guarding against type I errors (FP).
- No control or assumptions are made about FN here!
- This setting is appropriate for the problem we have at hand (database search).
13. Toy Problem
- Let s, t be two randomly chosen DNA sequences of length n, sampled from the uniform distribution over the DNA alphabet.
- Align s versus t with no gaps (i.e. s1 is aligned to t1, ..., sn is aligned to tn).
- What is the probability that there are k matches (not necessarily contiguous ones) between s and t?
- Suppose you are a researcher and you have two main hypotheses: either these two sequences are totally unrelated, or there was a common ancestor to both of them (there is no indel option here).
- How would you use the number of matches to decide between the two options, and attach a statistical confidence measure to this decision?
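Since each of the n positions matches independently with probability 1/4 under the uniform model, the number of matches is Binomial(n, 1/4). A short standard-library check:

```python
from math import comb

def p_k_matches(n, k, p=0.25):
    # P(exactly k matches) = C(n, k) * p^k * (1-p)^(n-k),
    # with p = 1/4 for uniform i.i.d. DNA.
    return comb(n, k) * p**k * (1 - p)**(n - k)
```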
14. P-value for score 30 and n = 100
NOTE: As in our real problem, the p-value of a score depends on n.
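The slide's point, that the p-value of the same raw score depends on n, can be checked directly with the binomial tail:

```python
from math import comb

def pvalue(score, n, p=0.25):
    # P(#matches >= score) under the null: Binomial(n, 1/4) upper tail.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(score, n + 1))
```

The same score of 30 is much more significant for n = 100 than for n = 120, where the expected number of chance matches is already 30.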
15. EVD for our problem
In the EVD approach, we are interested in the question: what is the probability of seeing a score as good as S, only from matches to unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then we need (1 - pvalue(S))^N > 0.95, i.e. pvalue(S) < 1 - 0.95^(1/N).
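Writing the computation above out as code:

```python
def p_max_exceeds(pval, N):
    # P[max(S1..SN) > S] for N i.i.d. null scores, each with
    # per-score p-value pval: 1 - (1 - pval)^N.
    return 1.0 - (1.0 - pval) ** N

def per_score_cutoff(alpha, N):
    # Largest per-score p-value that still guarantees
    # P[max > S] <= alpha:  (1 - p)^N >= 1 - alpha
    # =>  p <= 1 - (1 - alpha)^(1/N).
    return 1.0 - (1.0 - alpha) ** (1.0 / N)
```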
17. Guarding against mistakes, evaluating performance
- In the EVD approach we kept guarding against FP mistakes. This is very important when doing tasks where many tests are performed, as in our case of database search.
- Sometimes we are not able to compute the EVD and we still want to control the FPR. A very strict and simple solution is the Bonferroni-corrected p-value: pvalue x N.
- Where N is the number of tests performed.
- Note: the relation to the union bound is clear.
- Problem: Bonferroni controls the FWER (family-wise error rate), i.e. the probability of seeing even one mistake (a FP) among the results we report as significant.
- It does so with basically no assumptions on the distribution, the relations between the hypotheses tested, etc., and still guarantees control over the FWER.
- The price to pay is in FN...
18. Bonferroni Example in our case
- We saw:
- If we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then:
- (1 - pvalue(S))^N > 0.95
- i.e. pvalue(S) < 1 - 0.95^(1/N)
Compare, for N = 10, the result of this equation (0.005116) to the Bonferroni-corrected p-value threshold 0.05/N = 0.005. For N = 20: 0.00256 vs. 0.0025, etc.
If we used the strict Bonferroni correction for the same guarantee level we wanted, we might have rejected some good results.
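The numerical comparison on this slide can be reproduced directly:

```python
def exact_cutoff(alpha, N):
    # Exact per-test cutoff for i.i.d. null scores: 1 - (1 - alpha)^(1/N).
    return 1.0 - (1.0 - alpha) ** (1.0 / N)

def bonferroni_cutoff(alpha, N):
    # Stricter union-bound cutoff; needs no independence assumption.
    return alpha / N

for N in (10, 20, 100):
    print(N, round(exact_cutoff(0.05, N), 6), bonferroni_cutoff(0.05, N))
```

Bonferroni's cutoff is always the smaller (stricter) of the two, which is exactly the loss of power the slide mentions.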
19. How to estimate performance?
- Say you have a method with a score (in our case: a method scoring local alignment with affine gaps and a scoring matrix Sigma, e.g. Sigma = PAM1).
- You set a threshold over the scores based on some criterion (e.g. EVD estimation of the scores in random matches).
- You want to evaluate your method's performance on some test data set. The data set would typically contain some true and some false examples.
- Assumption: you KNOW the answers for this test set!
- Aim: you want to see the tradeoff you get for using various thresholds on the scores, in terms of FP and FN on the data set.
20. ROC curves
- ROC: Receiver Operating Characteristic curve.
- Sensitivity = TP / (TP + FN): what fraction of the true ones we capture (y-axis, 0 to 100).
- FPR (False Positive Rate) = empirical p-value = FP / (FP + TN) = FP / (# real negatives): what fraction of the bad ones we pass (x-axis, 0 to 100).
- Best performance is the top-left corner of the plot: 100 sensitivity at 0 FPR.
21. ROC curves (cont.)
- NOTE: each point on the ROC curve matches a certain threshold over the method's scores.
- Each method gives a different curve.
- We can now compare methods' performance:
- At a certain point on the graph
- Via the total area under the curve
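A minimal sketch of building the empirical ROC points and the area under the curve from scored, labeled examples (standard library only; names are illustrative):

```python
def roc_points(scores, labels):
    # Sweep the threshold from high to low; each prefix of the
    # score-sorted examples yields one (FPR, sensitivity) point.
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Trapezoid rule over consecutive (FPR, sensitivity) points.
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A method that ranks every true pair above every false one traces the top-left corner and gets AUC = 1.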
22. FDR
- A less stringent statistical criterion is the FDR (False Discovery Rate), suggested by Benjamini and Hochberg (1995).
- Main idea: control the rate of false reports within the total set of reports you give.
- i.e. FDR = 5% means that the expected ratio of false detections to your total number of detections is going to be 5%:
- E[ FP / (FP + TP) ] = 0.05
- The expectation (E) is taken over the total distribution, which may contain both true and false hypotheses. When there are no true hypotheses, FDR control coincides with Bonferroni-style FWER control; otherwise it gives the test more power.
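A sketch of the Benjamini-Hochberg step-up procedure the slide refers to:

```python
def benjamini_hochberg(pvalues, q=0.05):
    # Sort the p-values; find the largest rank i with
    # p_(i) <= (i / N) * q, and reject the i smallest p-values.
    # Returns the 0-based indices (into the input) of rejected hypotheses.
    N = len(pvalues)
    order = sorted(range(N), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / N:
            cutoff_rank = rank
    return set(order[:cutoff_rank])
```

With p-values [0.001, 0.012, 0.039, 0.041, 0.6] and q = 0.05, BH rejects the two smallest, while the Bonferroni cutoff 0.05/5 = 0.01 rejects only the first: the extra power the slide mentions.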
23. EVD for our problem (recap)
In the EVD approach, we are interested in the question: what is the probability of seeing a score as good as S, only from matches to unrelated sequences, if I tried N such matches?
Compute: if we want to guarantee that P[max(S1, ..., SN) > S] < 0.05, where the Si are scores of matches against unrelated sequences sampled i.i.d., then (1 - pvalue(S))^N > 0.95, i.e. pvalue(S) < 1 - 0.95^(1/N).
Compare, for N = 10, the result (0.005116) to the Bonferroni-corrected threshold 0.05/N = 0.005. For N = 20: 0.00256 vs. 0.0025, etc.
24. Back to our Toy Problem
Assume the data we need to handle came from two sources, as in the Bayesian approach:
- R (unrelated sequences): p(a,a) = 0.25
- M (related sequences): p(a,a) = 0.4, p(a,b) = 0.2
Delta scoring matrix, i.e. S(a,a) = 1, S(a,b) = 0.
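With the delta scoring matrix, the alignment score is simply the number of matching positions, so it is Binomial(n, 0.25) under R and Binomial(n, 0.4) under M. A sketch of the two score models and the resulting posterior:

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_M_given_score(k, n, prior_M=0.5):
    # Bayes' rule over the two binomial score models:
    # score ~ Bin(n, 0.4) under M, Bin(n, 0.25) under R.
    lM = binom_pmf(n, k, 0.40) * prior_M
    lR = binom_pmf(n, k, 0.25) * (1.0 - prior_M)
    return lM / (lM + lR)
```

Scanning k from 0 to n shows the posterior rising from near 0 to near 1 around the crossover between the two binomial distributions.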
25. Finish with a Thought
- In our toy problem, what is the relation between the graph of the last slide and the ROC curve we talked about?
- How does the relative amount of samples from M and R in our data set affect the ROC? How should the total distribution over the scores look?