Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730 - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730

Description:

Lecture 3: Scoring Matrices and Multiple Sequence Alignment. BMI ... The name comes from an operations research task, and has nothing to do with ... Waterman ... – PowerPoint PPT presentation

Number of Views:521
Avg rating:3.0/5.0
Slides: 42
Provided by: kunh
Category:

less

Transcript and Presenter's Notes

Title: Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730


1
Lecture 3 Scoring Matrices and Multiple Sequence
Alignment BMI/IBGP 730
  • Kun Huang
  • Department of Biomedical Informatics
  • Ohio State University

2
Review of sequence alignment Scoring
matrices Multiple sequence alignment Tools and
examples
3
Dynamic programming
  • The name comes from an operations research task,
    and has nothing to do with writing programs.
    Programming use tabular structure for computing.

17
4
Dynamic programming matrix
  • Each cell has the score for the best aligned
    sequence prefix up to that position.
  • Example
  • ATGCT vs. ACCGCT
  • Match 2, mismatch 0, gap -1

Matching matrix, NOT the dynamical programming
matrix!
5
Dynamic programming matrix
0 1 2 3 4 5
6
0 1 2 3 4 5 6 7
A A
A T A _
A _ A T
A T A C
6
Optimal alignment by traceback
  • We traceback a path that gets us the highest
    score. If we don't have end gap penalties,
    then takeany path from thelast row or columnto
    the first.
  • Otherwise we needto include the top and bottom
    corners

0 1 2 3 4 5
6
0 1 2 3 4 5 6 7
AT - GCT ACCGCT
A - TGCT ACCGCT
7
Dynamic programming
  • Global alignment an alignment of two or more
    sequences that matches as many characters as
    possible in all of the sequences. Needleman-Wunch
    algorithm
  • Local alignment an alignment that includes only
    the best matching, highest-scoring regions in two
    or more sequences. Smith-Waterman algorithm
  • Difference all the scores are kept in the
    dynamical programming matrix for global
    alignment only the positive scores are kept in
    the dynamical programming matrix for local
    alignment, the negative ones are converted to
    zero.

8
BLAST Algorithm Intuition
The BLAST algorithm.The BLAST algorithm is a
heuristic search method that seeks words of
length W (default 3 in blastp) that score at
least T when aligned with the query and scored
with a substitution matrix. Words in the database
that score T or greater are extended in both
directions in an attempt to find a locally
optimal ungapped alignment or HSP (high scoring
pair) with a score of at least S or an E value
lower than the specified threshold. HSPs that
meet these criteria will be reported by BLAST,
provided they do not exceed the cutoff value
specified for number of descriptions and/or
alignments to report.

9
BLAST Algorithm Intuition
Databases are pre-indexed by the words. Without
gaps Altschul, S. F., Gish, W., Miller, W.,
Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990)
215403-410 With gaps Altschul, S. F., Madden,
T. L., Schaffer, A. A., Zhang, J., Zhang, Z.,
Miller, W., Lipman, D. J., Nucleic Acids Research
(1997) 25(17)3389-3402
http//www.compbio.dundee.ac.uk/ftp/preprints/revi
ew93/Figure10.pdf
10
An alignment score
  • An alignment score is the sum of all the match
    scores of an alignment, with a penalty subtracted
    for each gap.
  • Gap penalties are usually "affine" meaning that
    the penalty for a gap computed as a linear
    function of its length x
  • gap penalty ax b.a b c -
    - da c c e f d9 2 7 6 gt 24 - (10 2) 12

Gap start continuationpenalty
AlignmentScore
Matchscore
11
How we pick substitution scores?
  • Problem how to score each residue pair in two
    aligned sequences?
  • intuitively the score should reflect the
    likelihood that two sequences are aligned because
    they related as opposed to a random alignment
  • We assume the letter a occurs independently with
    frequency qa
  • random match scenario the likelihood that the
    letters a and b occur by chance is qaqb
  • true match scenario the aligned pair a,b occurs
    with joint probability pab

12
Substitution scores
  • The likelihood that the residue pair (a, b)
    occurs as an aligned pair, as opposed to an
    unaligned pair, is pab/qaqb
  • If we use the logarithm of this ratio (log-odds
    ratio), the score for the entire alignment is
    additive

13
Substitution matrices
  • As we wanted, the total score is the sum of
    individual scores s(a, b) for each aligned pair
  • The s(a, b) can be arranged in a matrix called
    substitution or score matrix
  • for proteins, this is a 20 x 20 matrix
  • A number of such matrices have been proposed,
    each computed according to different strategies

14
Alignment as An Optimization Problem
  • Optimization criteria / cost function
  • What sort of alignment should be considered
  • Scoring system
  • Additive model
  • Based on probability compared with random
    sequence (PAM, BLOSUM)
  • Assumption of independence
  • More complicated cases
  • Gap penalty linear (s -gd)
  • affine (s -d (g-1)e)

15
Clusters of amino acids
Hydrophobic
Polar
A
T
Q
D
P
V
G
S
N
E
A
Acid
I
L
Y
F
K
M
H
C
W
R
Basic
Aromatic
16
Match scores
  • Match scores are oftencalculated on the basis
    of the frequency of particular mutations in
    very similar sequences.
  • We can then transform substitution frequencies
    into log odds scores, and then normalize.

17
Example of construction of a phylogenetic tree
Organism A A W T V A A A V R T S I Organism B A Y
T V A A A V R T S I Organism C A W T V A A A V L
T S I
A
C
B
W to Y
L to R
  • Assumptions
  • high similarity gt each position should have only
    changed once during evolution
  • number of changes proportional to evolutionary
    distance

18
The PAM (point accepted mutation) approach
  • Amino acid substitutions were estimated using
    1572 changes in 71 groups of protein sequences
    with at least 85 similarity
  • reasons for requiring such high degree of
    similarity
  • to exclude possibility that observed
    substitutions reflect two successive changes
    instead of just one
  • to make possible building phylogenetic trees for
    each group and reconstruct probable history of
    amino acid changes
  • The number of changes of every amino acid a into
    every b was the base for computing the scores
  • two-step procedure
  • of changes -gt relative mutability (relative
    number of changes)
  • relative mutability -gt scores

19
Computing relative mutability in PAM
  • Problem how to account for differences across
    groups in sequence composition, mutation rate?
  • a normalization factor called exposure to
    mutation for each amino acid was computed for
    each group
  • the of changes per amino acid (in each group)
    was divided by this factor
  • the relative frequencies thus obtained were then
    summed for all groups (sum called relative
    mutability)
  • Exposure to mutation of a defined as Pa K
  • Pa frequency of occurrences of a in group
  • K total of all changes in group per 100 sites

20
Deriving scores in PAM
  • Relative mutability used to compute PAM1
  • ma fa/(100paK)
  • Mab (fab/fa) ma
  • PPhelt-gtTyr 260/1572 relative mutability of
    Phe fraction of Phelt-gtTyr out of 260 Phe
    changes
  • Repeat for all other 18 PPhelt-gtwhatever
  • The resulting 20 transition probabilities were
    normalized so that their sum was 1
  • To obtain the scores, each Pab probability was
  • divided by Pb (likelihood, odds ratio)
  • the log taken, multiplied by 10 and rounded

21
PAM matrices and evolutionary distance
  • All these computations were based on an amount of
    evolution corresponding to 1 mutation in 100
    amino acids on average
  • called a 1 PAM evolutionary distance
  • Assuming a Markov model for amino acid
    substitutions the matrices for an evolutionary
    distances of N can be computed as PAM1N
  • Markov model assumes that every change is
    independent of the previous changes at that site
  • example PAM250 is computed as PAM1250

22
Alignment as An Optimization Problem
  • Optimization criteria / cost function
  • What sort of alignment should be considered
  • Scoring system
  • Additive model
  • Based on probability compared with random
    sequence (PAM, BLOSUM)
  • Assumption of independence
  • More complicated cases
  • Gap penalty linear (s -gd)
  • affine (s -d (g-1)e)

23
The PAM250 scoring matrix
24
BLOSUM
  • The BLOSUM matrix approach is based on the study
    of 2000 conserved amino acid patterns
  • protein were grouped in 500 families based on
    similarity of biochemical function
  • the patterns were found by looking for motifs of
    type d1 aa1 d2 aa2 d3 aa3, where di are
    stretches of up to 24 amino acids located in all
    sequences
  • These initial patterns were organized in larger
    ungapped regions (blocks) of size between 3 and
    60 amino acids
  • The multiple alignment of these blocks is the
    basis for determining the probabilities of amino
    acid changes
  • To better balance the count, sequences with more
    than K similarity were grouped together in one
    sequence before counting the amino acid
    substitutions in the multiple alignment
  • The version of the matrix thus derived is called
    BLOSUMK

25
BLOSUM62
  • BLOSUM62 is widely used for protein sequence
    alignments
  • Differences with PAM
  • no assumptions on evolutionary model
  • based on scoring at conserved positions in
    blocks, rather than in whole sequences
  • based on analysis of larger number of families

26
Which scoring matrix to use?
  • For protein match scores, two main options
  • PAM based on global alignments of closely related
    sequences. Normalized to changes per 100 sites,
    then exponentiated for more distant relatives.
  • BLOSUM based on local alignments in much more
    diverse sequences
  • Picking the right distance is important, and may
    be hard to do. BLOSUM seems to work better for
    more evolutionarily distant sequences. BLOSUM62
    is a good default.

27
Picking gap penalties
  • Many different possible forms
  • Most common is affine (gap open gap continue
    penalities)
  • More complex penalties have been proposed.
  • Penalties must be commensurate with match scores.
    Therefore, the match scoring scheme influences
    the gap penalty
  • Most alignment programs suggest appropriate
    penalties for each match score option.

28
The significance of an alignment
  • Significance testing is the branch of statistics
    that is concerned with assesing the probability
    that a particular result could have occurred by
    chance.
  • How do we calculate the probability that an
    alignment occurred by chance?
  • Either with a model of evolution, or
  • Empirically, by scrambling our sequences and
    calculating scores on many randomized sequences.

29
Multiple Sequence Alignment
  • Often applied to proteins
  • Proteins that are similar in sequence are often
    similar in structure and function
  • Sequence changes more rapidly in evolution than
    does structure and function.

30
Overview of Methods
  • Dynamic programming too computationally
    expensive to do a complete search uses
    heuristics
  • Progressive starts with pair-wise alignment of
    most similar sequences adds to that
  • Iterative make an initial alignment of groups
    of sequences, adds to these (e.g. genetic
    algorithms)
  • Locally conserved patterns
  • Statistical and probabilistic methods

31
Dynamic Programming
  • Computational complexity even worse than for
    pair-wise alignment because were finding all the
    paths through an n-dimensional hyperspace (We can
    picture this in 2 or 3 dimensions.)
  • Can align about 7 relatively short (200-300)
    protein sequences in a reasonable amount of time
    not much beyond that

32
A Heuristic for Reducing the Search Space in
Dynamic Programming
  • Lets picture this in 3 dimensions. It
    generalizes to n.
  • Consider the pair-wise alignments of each pair of
    sequences.
  • Create a phylogenetic tree from these scores.
  • Consider a multiple sequence alignment built from
    the phylogenetic tree.
  • These alignments circumscribe a space in which to
    search for a good (but not necessarily optimal)
    alignment of all n sequences.

33
Phylogenetic Tree as a Guilding Tree
  • Dynamic programming uses a phylogenetic tree to
    build a first-cut MSA.
  • The tree shows how protein could have evolved
    from shared origins over evolutionary time.

34
Dynamic Programming -- MSA
  • Create a phylogenetic tree based on pair-wise
    alignments (Pairs of sequences that have the best
    scores are paired first in the tree.)
  • Do a first-cut msa by incrementally doing
    pair-wise alignments in the order of alikeness
    of sequences as indicated by the tree. Most alike
    sequences aligned first.
  • Use the pair-wise alignments and the first-cut
    msa to circumscribe a space within which to do a
    full msa that searches through this solution
    space.
  • The score for a given alignment of all the
    sequences is the sum of the scores for each pair,
    where each of the pair-wise scores is multiplied
    by a weight ? indicating how far the pair-wise
    score differs from the first-cut msa alignment
    score.

35
Heuristic Dynamic Programming Method for MSA
  • Does not guarantee an optimal alignment of all
    the sequences in the group.
  • Does get an optimal alignment within the space
    chosen.

36
Progressive Methods
  • Similar to dynamic programming method in that it
    uses the first step (i.e., it creates a
    phylogenetic tree, aligns the most-alike pair,
    and incrementally adds sequences to the alignment
    in order of alikeness as indicated by the
    tree.)
  • Differs from dynamic programming method for MSA
    in that it doesnt refine the first-cut MSA by
    doing a full search through the reduced search
    space. (This is the computationally expensive
    part of DP MSA in that, even though weve cut
    down the search space, its still big when we
    have many sequences to align.)

37
Progressive Method
  • Generally proceeds as follows
  • Choose a starting pair of sequences and align
    them
  • Align each next sequence to those already
    aligned, one at a time
  • Heuristic method doesnt guarantee an optimal
    alignment
  • Details vary in implementation
  • How to choose the first sequence to align?
  • Align all subsequence sequences cumulatively or
    in subfamilies?
  • How to score?

38
ClustalW
  • Based on phylogenetic analysis
  • A phylogenetic tree is created using a pairwise
    distance matrix and nearest-neighbor algorithm
  • The most closely-related pairs of sequences are
    aligned using dynamic programming
  • Each of the alignments is analyzed and a profile
    of it is created
  • Alignment profiles are aligned progressively for
    a total alignment
  • W in ClustalW refers to a weighting of scores
    depending on how far a sequence is from the root
    on the phylogenetic tree (See p. 154 of
    Bioinformatics by Mount.)

39
Problems with Progressive Method
  • Highly sensitive to the choice of initial pair to
    align. If they arent very similar, it throws
    everything off.
  • Its not trivial to come up with a suitable
    scoring matrix or gap penalties.

40
ClustalW
http//www.ebi.ac.uk/clustalw/ http//align.genome
.jp
41
ClustalW
gtquery MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNET
GTISISQLNKNVWVHTELGYFSGEAVPS NGLVLNTSKGLVLVDSSWDD
KLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTAL
T AELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWL
PQYQILAGGCLVKSASSKDLGNV ADAYVNEWSTSIENVLKRYGNINLV
VPGHGEVGDRGLLLHTLDLLKgtgi2984094 MGGFLFFFLLVLFSF
SSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGV
LVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREV
GAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNL
NVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLP
FLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKY
IKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVN
AYYVYFEIENEILMGEgtgi115023spP10425MKKNTLLKVGL
CVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTE
LGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVT
DVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQT
VTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNL
GNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
gtgi115030spP25910MKTVFILISMLFPVAVMAQKSVKISDD
ISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQ
TEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTI
DLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTE
NILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGH
GDYGGTELIEHTKQIVNQYIESTSKPgtgi282554pirS25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRAR
RLAEWVDKLAAGPGRTVVNTH FHGDHAFGNQVFAPGTRIIAHEDMRSA
MVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVE
LICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAAL
DRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWVQRLAADAVDRRLTP
LQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEI
FAELVAYNGGQLPTCLA
Write a Comment
User Comments (0)
About PowerShow.com