Statistical Significance Assesment of Local Sequence Alignment - PowerPoint PPT Presentation

1 / 40

About This Presentation

Title:

Statistical Significance Assesment of Local Sequence Alignment

Description:

Comparing sequences of letters or data strings. Essential for computational tasks from file and web searches to image processing ... – PowerPoint PPT presentation

Number of Views:71

Avg rating:3.0/5.0

Slides: 41

Provided by: Nich179

Category:

more less

Transcript and Presenter's Notes

Title: Statistical Significance Assesment of Local Sequence Alignment

1
Statistical Significance Assesment of Local
Sequence Alignment
Nicholas Chia
2
Outline

Why should you care?
How does it work?
Statistics of gapped alignments the need for
speed
A Markov Model approach
Where does the physics come in?
And does it solve the problem?
Conclusion

3
What is Sequence Alignment?

Comparing sequences of letters or data strings
Essential for computational tasks from file and
web searches to image processing
One of the most widely used tool in molecular
biology

4
The Biology of Sequences I
Fundamental biology is based on sequences of
nucleic and amino acids!
DNA
RNA
Protein
mRNA
tRNA
rRNA
siRNA
miRNA
smRNA
5
The Biology of Sequence II
DNA
RNA
Proteins
mRNA
tRNA
rRNA
siRNA
miRNA
smRNA
DNA
Replication is not perfect. Three types of errors
occur.
Mutation
GCCTATACCGTA
GCCTATTCCGTA
Insertion
GCCTATACCGTA
GCCTATAGCCGTA
Deletion
GCCTATACCGTA
GCCTATCCGTA
Basis for evolution
6
Role of Alignment in Biology

Identifying function and structure

protein families
all known proteins

Discerning evolutionary relationships

Tree of Life web project

Systems biology

7
Gapped Alignment
scoring matrix
number of gaps
alignment score
gap cost
G
A
T
C
G
G
T
A
C
-
Smith-Waterman Local Alignment for ? gt 1
8
Choosing a Scoring System
In theory, the optimal mismatch and gap costs are
related to the "true" number of mutations,
insertions, and deletions between a pair of
sequences.
Number of "true" Mutation
1 / Mismatch cost, ?

Number of "true" Gaps
1 / Gap cost, ?

In practice, one does not observe the
evolutionary path at each step and thus the
"true" path is lost.
Instead, one judges the optimality of a scoring
system by the number of true hits versus the
number of false hits and its ability to reliably
find future true hits. Ultimately, sequence
relationships are determined by experimental
assays probing for function and structure.
What distinguishes a true hit from a false hit?
9
Significance Assessment
Studying the distribution of alignment scores
among random sequences yields information about
the rarity, a.k.a., biological import, of a given
alignment score
Evidence suggests gapped alignment scores are
distributed according to the extreme value
distribution
but statistically characterizing the slow
exponential tail of this distribution requires a
large number of simulations.
score
maximum alignment score
Can we find a better method to understand gapped
distributions?
alignment
Can we evaluate ? faster?
10
Small Change in Geometry
T
C
G
T
A
Needleman-Wunsch global alignment score
G
C
T
If we can solve for ?, we know ?!
G
C
T
C
G
but how do we solve for ??
A
i.e., how do we account for the length dependence
of the score?
Change geometry!

fixing the width also fixes the state space
allowing us to model alignment as a Markov
process

replaces a 2D length dependence with a 1D width
dependence where we can use a Markov matrix to
solve infinite t behavior

t1
11
The Markov Model
Using score differences between lattice sites as
our elements, we can write a Markov matrix
describing the transition from t to t1
In this way we can describe the fixed width
dynamics for infinite t
t
t1
largest eigenvalue of the modified Markov matrix
In order to obtain information about the relevant
quantity
we modify our Markov matrix to include the
necessary ?
all alignment parameters
dependence.
12
Understanding the W-dependence
Definition
... not quite that simple

cannot solve for extremely large W
non-trivial width dependence

So, what can we do?
Kardar-Parisi-Zhang Systems
ASEP Asymmetric Exclusion Process
KPZ systems all share properties on a course
grained level
KPZ
Sequence Alignment
13
KPZ Surface Growth
Describes, on a course grained level, tumor
growth, bacterial growth, flame fronts, forest
formations, asymmetric exclusion processes,
geological formations, avalanches, crystal
defects, directed polymers, sputter deposition,
electrochemical deposition, shoreline
growth/recession, sequence alignment, and vicious
imbibitions.
height profile
noise
surface tension

The KPZ Equation

surface relaxation
lowest order non-linear growth term
The set of processes which are governed by the
KPZ equation are collectively known as the KPZ
universality class. Properties found for any
system apply to all systems in a given
universality class.
Thus, just by studying the dynamics of a single
member of the KPZ universality class, we can
uncover truths for the large number of dynamic
processes.
M. Kardar, G. Parisi, and Y. Zhang, Phys. Rev.
Lett. 56 889 (1986).
14
Universal Properties
T. Haplin-Healy and Y. Zhang, Phys. Rep. 254 215
(1995).
15
Totally ASymmetric Exclusion Process

Single Particle TASEP
Periodic boundary
Particles move only to the right (1) if space
exists (2) with probability q
Sublattice-parallel update scheme - inherently
discrete time scheme

Multiple Particle TASEP
Only one particle per site considered for
hopping, up to n particles per site

16
Background
Generating Function
Particle current or hopping per site
Summarizes system parameters (num. of part.,
hopping prob., part. per site)
Derrida-Lebowitz Scaling Function
Proposed universal scaling function, i.e.,
completely independent of system, solved
analytically
How long does this solution take to converge?
Applies to the continuous time ASEP
17
Finite Size Effects
Once again, this solution converges very slowly.
Direct calculation of
is possible via matrix method. However, slow
convergence toward the limits N??, t?? make this
method difficult.
Extremely time consuming calculation
18
Reformulated Solution
Define Infinite Limit
Can calculate directly via a matrix method for
small system sizes (N 22)
Analytical solution exists for single particle
case for all negative ?
Reformulation
Can measure G directly! Not using cumulant ratios.
Additional term arises from our newly defined
left hand side which must go to zero for fixed N
as ??-?
Does this agree with the observed behavior?
Scaling Function N ? ?
19
Continuous Time ASEP
Since Derrida et al. have already solved
analytically for a continuous time ASEP for ? and
?N, this method can be tested directly using
their solutions.
When plotted in this manner, the rescaled small
system size solutions quickly converge to the
universal scaling function. This can be done much
faster and more easily than with previous methods.
20
Understanding the W-dependence
Definition
... not quite that simple

cannot solve for extremely large W
non-trivial width dependence

Kardar-Parisi-Zhang Systems
So, what can we do?
Derrida Lebowitz, PRL 1998
From Derrida and Lebowitz comes the scaling
function G, which gives the form of the width
dependence as follows
ASEP
KPZ
Sequence Alignment
Gapped Alignment
KPZ systems all share properties on a course
grained level
By understanding the W-dependence, we can solve
for ? and ?!
21
Single Particle TASEP
Solution for the Single Particle TASEP
Plot
Since there is a solution for l in the single
particle TASEP, one can plot the left hand side
versus the right hand side and rescale
appropriately in order to isolate the DL proposed
universal behavior.
22
Multi-Particle TASEP
Plot
Once a and b are fitted for one N, these scaling
factors give good agreement for size systems,
i.e., the scaling factors a and b are independent
of system size!
23
Calculating ?
parameter dependent scaling factors
Equation for ?
Equation for ?W
W-dependence
Can solve for ? if we know the scaling factors a?
and b?!
Fit difference in order to obtain scaling factors
Solution for ?
24
Convergence of ?
0.65
?
0.6
0.55
0
2
4
6
8
10
12
W
25
Results
26
Conclusion

Succeeded in calculating ? with precision on
fast timescales
Demonstrated a non-sampling method for
calculating ?
Successful synthesis of computational biology
and statistical physics
In principle, can be generalized to more complex
scoring systems
Uncovered a method for numerically verifying the
Derrida-Lebowitz Scaling Function and for using
it to overcome finite size effects

27
Acknowledgments
Ralf Bundschuh
Nigel Goldenfeld Carl Woese
Eduardo Fradkin
28
Technique for Calculating ?w(??)
The System Parameters - ?

match-mismatch scoring matrix

linear gap costs

technical condition to reduce state space

periodic boundary conditions

Bernoulli randomness (uncorrelated bonds)
approximation

Calculating ?w(??)

construct the modified Markov matrices
symbolically
eigenvalue simply too difficult to solve
symbolically large matrices (105)
ARPACK solves for the largest eigenvalue ?w with
precision and speed since the matrices are sparse
(N log N) - Lehoucq et al., SIAM 1997

29
Mapping
30
Detecting Protein Homologs using Hybrid
PSI-BLAST
Nicholas Chia, Yuheng Li?, Mario Lauria?, and
Ralf Bundschuh?
Department of Physics Department of Computer
Science and Engineering ?Biophysics Graduate
Program The Ohio State University
31
Protein Classification
individual protein families
all known proteins
query sequence
BLAST
PSI-BLAST (position specific scoring matrix)
Reliable protein classification requires the use
of structural information in order to detect
distant family members
How can we improve homology detection?
Number of sequences growing far faster than
number of structures
growth rate
Protein structures
sequences
32
Probabilistic Alignment
number of gaps
scoring matrix
BLOSUM62
alignment weight
gap weight
C
A
S
E
T
C
S
A
-
E
Typically, deterministic alignments measure
homology by scoring only the optimal alignment
path one path one score
However, probabilistic alignment measures
homology by summing the weights over all
alignment paths many paths one score
33
Hybrid Alignment
Deterministic Alignment
vs.
Global alignment algorithm searches for the
highest scoring path from upper-left to
lower-right and reports its score
Global alignment sums weights over all alignment
paths from upper-left to lower-right and reports
a final score
C
A
S
E
T
C
A
S
E
T
C
C
A
A
S
S
E
E
Local alignment searches all alignment segments
beginning or ending anywhere and returns highest
scoring segment
Local alignment sums over alignment segments
beginning everywhere and returns the highest
scoring endpoint
Local alignment scores for random sequences are
distributed according to the Gumbel distribution
Logarithm of local alignment scores for random
sequences are distributed according to the Gumbel
distribution
O(N2) computational complexity
Hybrid PSI-BLAST
NCBI PSI-BLAST
34
Significance Assessment
Studying the distribution of alignment scores
among random sequences yields information about
the rarity, a.k.a., biological import, of a given
alignment score
Need to solve for the two Gumbel parameters for
Hybrid Alignment
but while incorporating position-specific
scoring matrices and gap costs.
log of maximum alignment score
even easier ???
If the average total weight entering a point
equals 1, i.e.,
How can this property be used to benefit homology
detection?
then
35
Position-Specific Gap Costs
Since ?1 for any ?, we may freely choose
position dependent gap weights without undergoing
the usual difficulty of the significance
assessment.
Hybrid PSI-BLAST
PSI-BLAST

Choosing gap weights for a sequence model
Use existing alignments
Need to know gap probabilities in alignment

36
Measuring Probabilities
C
A
S
E
T
C
A
S
E
max score measures alignment likelihood
forward-backward algorithm
Probability of a gap transition
Probability of a match transition
Probability of any transition
?
?
?
(1-2?)
backward
forward
x
max score
37
Iterative Model Building

Score query against sequences in database. Select
close homologs for inclusion in model building.
Extract transition and amino acid emission
probabilities of homologs with the
forward-backward algorithm.
Construct an alignment model from the average
transition and amino acid emission probabilities.
Search sequence database again, this time scoring
sequences in database against the newly built
model.
Iterate steps 2-4 until convergence or maximum
iterations are reached.

C
A
S
E
T
C
A
S
E
38
Technical Details
The System Parameters
match?match match?insertion match?deletion
insertion?match insertion?insertion

affine gap scheme

deletion?match deletion?deletion
insertion ? deletion
deletion ? insertion

finite size correction of the Gumbel parameters
? and ?

initial scoring scheme BLOSUM62 (11,1)

5 rounds of iteration

pseudocount

weighting scheme

adaptive model inclusion threshold

The Database

Superfamily PDB90 v1.69

Fuzzy countingSUPERFAMILY counting rules

39
Results
40
Conclusions

Hybrid combines the features of probabilistic
alignment with well understood significance
assessments and can be used to improve sequence
homology detection
Position specific gap weights improve homology
detection

Future Work