Title: Evolution and Scoring Rules
1Evolution and Scoring Rules
- Example Score
- 5 x (# matches) + (-4) x (# mismatches) + (-7) x (total length of all gaps)
- Example Score
- 5 x (# matches) + (-4) x (# mismatches) + (-5) x (# gap openings) + (-2) x (total length of all gaps)
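The two example scoring rules can be sketched in a few lines of Python. This is only an illustration: the function names and the test alignment are made up for this note, and the input is assumed to be two already-aligned strings of equal length with "-" marking gaps.

```python
def alignment_stats(a, b, gap="-"):
    """Count matches, mismatches, gap openings and total gap length
    for two aligned strings of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_open = gap_len = 0
    prev_gap_row = None  # which sequence held a gap in the previous column
    for x, y in zip(a, b):
        if x == gap or y == gap:
            row = 0 if x == gap else 1
            gap_len += 1
            if row != prev_gap_row:  # a gap starts (or switches sequence)
                gap_open += 1
            prev_gap_row = row
        else:
            prev_gap_row = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gap_open, gap_len


def linear_score(a, b):
    # First example rule: every gap position costs 7
    m, mm, _, g = alignment_stats(a, b)
    return 5 * m - 4 * mm - 7 * g


def affine_score(a, b):
    # Second example rule: 5 per gap opening plus 2 per gap position
    m, mm, o, g = alignment_stats(a, b)
    return 5 * m - 4 * mm - 5 * o - 2 * g


s1 = "ACGT--AC"
s2 = "ACCTGGAC"
print(linear_score(s1, s2), affine_score(s1, s2))  # 7 12
```

The same alignment scores higher under the second rule because its single gap of length 2 is charged once for opening instead of the full penalty at every gap position.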
4Scoring Matrices
5Scoring Rules vs. Scoring Matrices
- Nucleotide vs. Amino Acid Sequence
- The choice of a scoring rule can strongly influence the outcome of sequence analysis
- Scoring matrices implicitly represent a particular theory of evolution
- Elements of the matrices specify the similarity of one residue to another
6Translation - Protein Synthesis: Every 3 nucleotides (codon) are translated into one amino acid
- DNA (A, T, G, C) -> RNA (A, U, G, C): 1:1
- RNA -> Protein (20 amino acids): 3:1
- Replication (DNA -> DNA), Transcription (DNA -> RNA), Translation (RNA -> Protein)
7Nucleotide sequence determines the amino acid
sequence
8Translation - Protein Synthesis
- RNA: 5' -> 3'
- Protein: N-term -> C-term
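As a small illustration of the 3:1 codon-to-amino-acid mapping, the sketch below transcribes a DNA coding strand and translates it with the standard genetic code. The function names are chosen for this note; the code reads the RNA 5' -> 3', emits the protein N-term -> C-term, assumes the reading frame starts at the first base, and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """DNA coding strand -> RNA (1 nucleotide : 1 nucleotide)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """RNA -> protein (3 nucleotides : 1 amino acid), stopping at a stop codon."""
    seq = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA
```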
11Log Likelihoods used as Scoring Matrices
- PAM - % Accepted Mutations: 1500 changes in 71 groups w/ > 85% similarity
- BLOSUM - Blocks Substitution Matrix: 2000 blocks from 500 families
12Log Likelihoods used as Scoring Matrices: BLOSUM
13Likelihood Ratio for Aligning a Single Pair of
Residues
- Above (numerator): the probability that two residues are aligned by evolutionary descent
- Below (denominator): the probability that they are aligned by chance
- Pi, Pj are frequencies of residue i and j in all protein sequences (abundance)
14Likelihood Ratio of Aligning Two Sequences
15- The alignment score of aligning two sequences is the log likelihood ratio of the alignment under two models
- Common ancestry
- By chance
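The relationship between the per-residue ratio and the whole-alignment score can be sketched as follows. The symbol q for the joint probability of an aligned pair, the two-letter alphabet, and all numbers are assumptions made for illustration; the sketch also assumes an ungapped alignment whose columns are independent, so the alignment ratio is the product of the column ratios and its log is their sum.

```python
import math


def pair_ratio(q_ij, p_i, p_j):
    """Likelihood ratio for one aligned pair: descent (q_ij) vs. chance (p_i * p_j)."""
    return q_ij / (p_i * p_j)


def alignment_log_ratio(s1, s2, q, p, base=2.0):
    """Log likelihood ratio of an ungapped alignment: sum of per-column logs."""
    total = 0.0
    for x, y in zip(s1, s2):
        total += math.log(pair_ratio(q[(x, y)], p[x], p[y]), base)
    return total


# Hypothetical two-letter alphabet; q sums to 1 and has marginals p
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

score = alignment_log_ratio("AAB", "ABB", q, p)
# Product of the column ratios 1.6 * 0.4 * 1.6 equals 2 ** score
print(round(2 ** score, 3))  # 1.024
```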
16- PAM and BLOSUM matrices are all log likelihood matrices
- More specifically
- An alignment that scores 6 means that the alignment by common ancestry is 2^(6/2) = 8 times as likely as expected by chance.
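The conversion in this example can be written out directly. The sketch assumes scores are in half-bit units (score = 2 x log2 of the likelihood ratio), which is what the 2^(6/2) arithmetic implies; the function names are illustrative, and a matrix built with another scaling would need a different `units_per_bit`.

```python
import math


def likelihood_ratio_from_score(score, units_per_bit=2):
    """How many times more likely the alignment is under common ancestry than by chance."""
    return 2 ** (score / units_per_bit)


def score_from_likelihood_ratio(ratio, units_per_bit=2):
    """Inverse conversion, rounded to an integer score as in published matrices."""
    return round(units_per_bit * math.log2(ratio))


print(likelihood_ratio_from_score(6))  # 8.0
```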
17BLOSUM matrices for Protein
- S. Henikoff and J. Henikoff (1992). Amino acid substitution matrices from protein blocks. PNAS 89: 10915-10919
- Training Data: 2000 conserved blocks from BLOCKS database. Ungapped, aligned protein segments. Each block represents a conserved region of a protein family
18Constructing BLOSUM Matrices of Specific
Similarities
- Sets of sequences have widely varying similarity. Sequences with similarity above a threshold are clustered.
- If the clustering threshold is 62%, the final matrix is BLOSUM62
19- A toy example of constructing a BLOSUM matrix
from 4 training sequences
20Constructing a BLOSUM matrix: Step 1. Counting mutations
21Constructing a BLOSUM matrix: Step 2. Tallying mutation frequencies
22Constructing a BLOSUM matrix: Step 3. Matrix of mutation probs.
23Step 4. Calculate abundance of each residue (marginal prob.)
24Step 5. Obtaining a BLOSUM matrix
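The five steps can be sketched on a toy block. The four training sequences below are hypothetical (the slide's own example data was not transcribed), the scores are assumed to be in half-bit units, and the sketch skips the similarity clustering and weighting used for the real matrices.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_from_block(block):
    """Toy BLOSUM-style construction from one ungapped block of aligned sequences."""
    # Steps 1-2: count every residue pair in every column (unordered pairs)
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    # Step 3: turn counts into pair probabilities q_ij
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Step 4: residue abundance p_i (marginal probability)
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    # Step 5: log-odds score, observed vs. expected by chance
    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return q, dict(p), scores


# Hypothetical block of 4 aligned training sequences over a 2-letter alphabet
q, p, scores = blosum_from_block(["AAC", "AAC", "ACC", "ACA"])
print(scores)
```

Pairs that never occur in the block get no entry here; a real construction works from enough data that every pair is observed.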
25- Constructing the real BLOSUM62 Matrix
26Steps 1-3. Mutation Frequency Table
27Step 4. Calculate Amino Acid Abundance
28Step 5. Obtaining BLOSUM62 Matrix
30PAM Matrices (Point Accepted Mutations)
- Mutations accepted by natural selection
31PAM Matrices
- Accepted Point Mutation
- Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation
- Based on evolutionary principles
32Constructing PAM Matrix Training Data
33PAM Phylogenetic Tree
34PAM Accepted Point Mutation
35Mutability
36Total Mutation Rate
is the total mutation rate of all amino acids
37Normalize Total Mutation Rate
38Mutation Probability Matrix Normalized Such that the Total Mutation Rate is 1%
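The normalization can be sketched on a toy alphabet. The slides' own formulas were not transcribed, so the code follows the usual textbook description of Dayhoff's procedure and should be read as an assumption: off-diagonal entries are proportional to mutability times the share of accepted mutations, and a single scale factor `lam` is chosen so that the frequency-weighted total mutation rate is 1%. All counts, frequencies, and mutabilities below are made up.

```python
def pam1_matrix(counts, freqs, mutability, target=0.01):
    """Toy PAM1 mutation probability matrix.

    counts[i][j]: accepted point mutations between residues i and j
                  (symmetric, zero diagonal).
    Returns M with M[i][j] = probability that residue j is replaced by i.
    """
    n = len(freqs)
    # Total mutation rate before scaling: sum over j of f_j * m_j
    total_rate = sum(f * m for f, m in zip(freqs, mutability))
    lam = target / total_rate  # scale so the total becomes `target` (1%)
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]

    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * mutability[j]  # residue j unchanged
            else:
                M[i][j] = lam * mutability[j] * counts[i][j] / col_sums[j]
    return M


# Hypothetical data for a 3-letter alphabet
counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
freqs = [0.5, 0.3, 0.2]
mutability = [1.0, 0.8, 1.2]
M1 = pam1_matrix(counts, freqs, mutability)

# Fraction of residues expected to change in one PAM period
print(sum(f * (1 - M1[j][j]) for j, f in enumerate(freqs)))
```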
39Mutation Probability Matrix (transposed), M x 10000
40- PAM1 mutation prob. matr.
- PAM2 Mutation Probability Matrix?
- Mutations that happen in twice the evolution period of that for a PAM1
41PAM Matrix Assumptions
42In two PAM1 periods
- A->R = A->A and A->R or
- A->N and N->R or
- A->D and D->R or
- ... or
- A->V and V->R
43Entries in a PAM-2 Mut. Prob. Matr.
44PAM-k Mutation Prob. Matrix
45PAM-1 log likelihood matrix
46PAM-k log likelihood matrix
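Summing over every intermediate residue, as in the PAM-2 case, is a matrix product, so a PAM-k mutation probability matrix is the k-th power of PAM1. The sketch below uses a hypothetical 3-letter PAM1 matrix with M[i][j] = probability that j is replaced by i. The log-odds step divides by the frequency of the replacing residue and uses half-bit units to match the scoring example earlier in these slides; Dayhoff's published matrices use a 10 x log10 scaling instead.

```python
import math


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(M1, k):
    """PAM-k mutation probability matrix: PAM1 applied k times."""
    n = len(M1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, M1)
    return result


def log_odds(Mk, freqs, units_per_bit=2):
    """Score for i vs. j: log of P(j replaced by i) over the chance of meeting i."""
    n = len(freqs)
    return [[round(units_per_bit * math.log2(Mk[i][j] / freqs[i]))
             for j in range(n)] for i in range(n)]


# Hypothetical PAM1-like matrix; every column sums to 1
M1 = [[0.990, 0.02, 0.01],
      [0.006, 0.97, 0.01],
      [0.004, 0.01, 0.98]]

M2 = pam_k(M1, 2)
# Entry for residue 0 -> residue 1 over two periods: sum over intermediates x
print(M2[1][0], sum(M1[1][x] * M1[x][0] for x in range(3)))
```

Repeated multiplication keeps the example short; for a large k on the full 20 x 20 matrix, exponentiation by squaring or a linear algebra library would be the usual choice.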
47PAM-250
48- PAM60: 60%, PAM80: 50%, PAM120: 40% (sequence similarity)
- PAM-250 matrix provides a better-scoring alignment than lower-numbered PAM matrices for proteins of 14-27% similarity
49Sources of Error in PAM
50Comparing Scoring Matrices
- PAM
  - Based on extrapolation of a small evol. period
  - Track evolutionary origins
  - Homologous seqs. during evolution
- BLOSUM
  - Based on a range of evol. periods
  - Conserved blocks
  - Find conserved domains
51Choice of Scoring Matrix
52Global Alignment with Affine Gaps
- Complex Dynamic Programming
53Problem w/ Independent Gap Penalties
- The occurrence of x consecutive deletions/insertions is more likely than the occurrence of x isolated mutations
- We should penalize a gap of length x less than x times the penalty for one gap
54Affine Gap Penalty
- Penalty for a gap of length x: w1 + x * w2
- w2 is the penalty for each gap position
- w1 is the extra penalty for the 1st gap position (gap opening)
55Scoring Rule not Additive!
- We need to know if the current gap is a new gap or the continuation of an existing gap
- Use three Dynamic Programming matrices to keep track of the previous step
56- S1 is the vertical sequence
- S2 is the horizontal sequence
- (From Diagonal) a(i,j): current position is a match
- (From Left) b(i,j): current position is a gap in S1
- (From Above) c(i,j): current position is a gap in S2
- Filling the next element in each matrix depends on the previous step, which is stored in the three matrices.
58Current step is a gap in S2; the last step was
- a match: new gap in S2
- a gap in S2: a continued gap in S2
- a gap in S1: a gap in S2 following a gap in S1
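The three-matrix scheme can be sketched as below. The recurrences on the image-only slides were not transcribed, so this is a standard affine-gap formulation written to match the definitions of a, b, and c above, not necessarily the exact equations of the deck. The parameter values reuse the earlier example rule (match 5, mismatch -4, w1 = 5, w2 = 2); the function name is illustrative and only the optimal score is returned, with no traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(s1, s2, match=5, mismatch=-4, w1=5, w2=2):
    """Best global alignment score when a gap of length x costs w1 + x * w2.

    a[i][j]: best score ending with s1[i-1] aligned to s2[j-1] (from diagonal)
    b[i][j]: best score ending with a gap in s1 (from left)
    c[i][j]: best score ending with a gap in s2 (from above)
    """
    n, m = len(s1), len(s2)
    a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    c = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0
    for j in range(1, m + 1):  # leading gap in s1
        b[0][j] = -(w1 + j * w2)
    for i in range(1, n + 1):  # leading gap in s2
        c[i][0] = -(w1 + i * w2)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            # Gap in s1: continue an existing one (w2) or open a new one (w1 + w2)
            b[i][j] = max(a[i][j - 1] - w1 - w2,
                          b[i][j - 1] - w2,
                          c[i][j - 1] - w1 - w2)
            # Gap in s2: new gap after a match, continued gap, or gap after a gap in s1
            c[i][j] = max(a[i - 1][j] - w1 - w2,
                          c[i - 1][j] - w2,
                          b[i - 1][j] - w1 - w2)
    return max(a[n][m], b[n][m], c[n][m])


# Six matches and one gap of length 2: 6 * 5 - (5 + 2 * 2) = 21
print(affine_global_score("ACGTAC", "ACGTGGAC"))  # 21
```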
65Decisions in Seq. Alignment
- Local or global alignment?
- Which program to use
- Type of scoring matrix
- Value of gap penalty
66 Aij10
67PAM-k log-likelihood matrix