Title: Optimization of a New Score Function for the Detection of Remote Homologs
1. Optimization of a New Score Function for the Detection of Remote Homologs
2. Introduction
- New method to calculate a score function, aiming to optimize the ability to discriminate between homologs and non-homologs
- Existing software uses the following to compute an alignment score:
3. Alignment Score
- Counted from the alignment:
  - Number of times AA i is aligned with AA j
  - Number of gaps in the alignment
  - Number of residues in each gap beyond one
- Score function / substitution matrix:
  - Contribution to score for AA match/mismatch
  - Contribution to score for gap initialization
  - Contribution to score for gap extension
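The quantities on this slide can be combined into a single alignment score. The following is a minimal Python sketch, assuming the usual affine-gap convention in which match/mismatch contributions are summed and gap costs are subtracted. The names `alignment_score`, `pair_counts`, `gap_open`, and `gap_extend`, and all numeric values, are illustrative and do not come from the slides.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def alignment_score(
    pair_counts: Dict[Pair, int],
    matrix: Dict[Pair, float],
    n_gaps: int,
    n_extra_gap_residues: int,
    gap_open: float,
    gap_extend: float,
) -> float:
    """Return sum_ij N_ij * M_ij - n_gaps * gap_open - n_extra * gap_extend.

    Assumption: gap penalties are given as positive numbers and subtracted.
    """
    # Substitution part: each aligned residue pair contributes its matrix entry.
    substitution_part = sum(
        count * matrix[pair] for pair, count in pair_counts.items()
    )
    # Gap part: one opening cost per gap, one extension cost per extra residue.
    gap_part = n_gaps * gap_open + n_extra_gap_residues * gap_extend
    return substitution_part - gap_part


# Toy example with made-up values (not real substitution matrix entries).
toy_matrix = {("A", "A"): 4.0, ("A", "G"): 0.0, ("L", "L"): 4.0}
toy_counts = {("A", "A"): 3, ("A", "G"): 1, ("L", "L"): 2}
toy_score = alignment_score(
    toy_counts, toy_matrix,
    n_gaps=1, n_extra_gap_residues=2, gap_open=10.0, gap_extend=1.0,
)
print(toy_score)
```

Writing the score in terms of counts makes it linear in the matrix entries and gap penalties, so all of them can be handled as parameters of one function.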
4. Current Methods to Calculate Homology
- p(Sr > x): probability that a random pair of proteins of the same length would score higher than x, the score of the given pair
- E: expected number of random proteins in the database that would have at least that score
- P: probability that there is at least one random pair with a higher score
- As p(Sr > x), E, and P increase, the likelihood that the given pair is homologous decreases
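The slides do not state how the three quantities are related. One common convention, assumed here and not taken from the slides, is E = N * p for a database of N sequences and P = 1 - exp(-E) under a Poisson approximation. A hedged Python sketch with the hypothetical helper names `expected_hits` and `prob_at_least_one`:

```python
import math


def expected_hits(p_random: float, db_size: int) -> float:
    """E: expected number of random database hits scoring above x.

    Assumption: every database sequence has the same probability p_random.
    """
    return db_size * p_random


def prob_at_least_one(e_value: float) -> float:
    """P: probability of at least one random hit (Poisson approximation).

    -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    """
    return -math.expm1(-e_value)


# Illustrative numbers only.
e_value = expected_hits(p_random=1e-6, db_size=100_000)
p_value = prob_at_least_one(e_value)
```

Under this convention all three numbers move together, which matches the last bullet: larger values mean weaker evidence for homology.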
5. Current Score Matrices
- PAM (percent accepted mutations): Dayhoff
- GCB, JTT: PAM-style matrices derived from larger sequence datasets
- BLOSUM62: Henikoff & Henikoff, constructed using a dataset of aligned sequence blocks
- STR: protein sequences aligned based on their observed structures
6. Limitations of Current Score Functions
- Current score functions assume independent evolution of each location, overlooking correlations
- Score functions are derived from a database of properly aligned proteins, not from alignments between random sequences
- Gap penalties are set a priori
7. Theory
- Z score for alignment
- Characterize the significance of an alignment score by calculating the likelihood that this score or higher would be obtained by a random match
- Account for variations in E with the length of the proteins
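A Z score measures how many standard deviations an alignment score lies above the mean score of random matches. The sketch below assumes the random distribution is estimated by scoring shuffled copies of one sequence, which is a common approach but may not be the procedure used by the authors. `z_score`, `shuffled_scores`, and the `score_fn` argument are hypothetical names.

```python
import random
import statistics
from typing import Callable, List


def z_score(score: float, random_scores: List[float]) -> float:
    """Z = (score - mean of random scores) / std of random scores."""
    mean = statistics.mean(random_scores)
    std = statistics.stdev(random_scores)  # sample standard deviation
    if std == 0:
        raise ValueError("random scores have zero variance")
    return (score - mean) / std


def shuffled_scores(
    seq_a: str,
    seq_b: str,
    score_fn: Callable[[str, str], float],
    n_shuffles: int = 100,
    seed: int = 0,
) -> List[float]:
    """Score seq_a against shuffled copies of seq_b.

    Shuffling keeps the length and composition of seq_b, so the random
    scores are comparable with the score of the real pair.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        residues = list(seq_b)
        rng.shuffle(residues)
        scores.append(score_fn(seq_a, "".join(residues)))
    return scores
```

In practice `score_fn` would be a full pairwise aligner; any function that takes two sequences and returns a number fits this interface.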
8. Theory
- Score function optimized by maximizing the confidence <C> over the training set
- Avoids dependence on extreme E values (easily detected or overly distant homologies)
- Eliminates contribution of falsely identified homologies (overly distant)
9. Database Preparation
- To optimize the score function for detection of distant homologs, use a set of known homologs whose homology cannot be reliably determined with standard pairwise comparison
- Training set: 900 pairs of proteins in the same COG with < 25% sequence identity
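Selecting pairs below a sequence identity cutoff can be sketched as follows. The definition of identity used here (matching residues divided by aligned, non-gap columns) is an assumption, since the slides do not say which definition was applied. `percent_identity` and `select_remote_pairs` are hypothetical helper names.

```python
from typing import List, Tuple


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def select_remote_pairs(
    aligned_pairs: List[Tuple[str, str]], max_identity: float = 25.0
) -> List[Tuple[str, str]]:
    """Keep only pairs whose identity is strictly below the cutoff."""
    return [
        (a, b) for a, b in aligned_pairs
        if percent_identity(a, b) < max_identity
    ]


# Toy aligned pairs (made-up sequences).
toy_pairs = [("ACDEFGHI", "AKLMNPQR"), ("ACDE", "ACDF")]
remote = select_remote_pairs(toy_pairs)
```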
10. Optimization of Score Function
- Align using the BLOSUM62 matrix
- Calculate Z and C for each pair of homologs, then average over the pairs in the training set to yield <C>
- Generate initial alignments using the gap penalties that yielded the highest C values
- 10 cycles of optimization and realignment until the score function converged
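The alternation between optimizing the parameters and realigning can be sketched as a generic loop. The slides do not specify how the parameters are updated or how C is computed, so `align`, `refit`, and `mean_confidence` are hypothetical placeholders supplied by the caller, and the convergence test on <C> is an assumption.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Pair = TypeVar("Pair")
Params = TypeVar("Params")
Alignment = TypeVar("Alignment")


def optimize_score_function(
    pairs: Sequence[Pair],
    initial_params: Params,
    align: Callable[[Pair, Params], Alignment],
    refit: Callable[[List[Alignment], Params], Params],
    mean_confidence: Callable[[List[Alignment], Params], float],
    max_cycles: int = 10,
    tol: float = 1e-6,
) -> Tuple[Params, float]:
    """Alternate parameter refits and realignments until <C> stops changing."""
    params = initial_params
    alignments = [align(pair, params) for pair in pairs]
    previous = mean_confidence(alignments, params)
    for _ in range(max_cycles):
        # Step 1: adjust matrix entries and gap penalties on fixed alignments.
        params = refit(alignments, params)
        # Step 2: realign every training pair with the updated parameters.
        alignments = [align(pair, params) for pair in pairs]
        current = mean_confidence(alignments, params)
        converged = abs(current - previous) < tol
        previous = current
        if converged:
            break
    return params, previous
```

Realignment is needed after every refit because the best alignment of a pair depends on the score function itself, so the alignments used for fitting go stale once the parameters change.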
11. Results
- Small changes in gap penalties; most of the improvement comes from refinements of the substitution matrix
- OPTIMA, the resulting score function:
  - Significantly improved average confidence <C> compared with other score matrices
  - <p(Sr > x)> and <P> significantly decreased
12. Summary
- Aim: optimize the score matrix to discriminate between homologs and non-homologs
- The OPTIMA score function is more successful at discriminating between homologs and non-homologs than standard score matrices
- Gap penalties are treated as additional parameters to be optimized