Alignment and Algorithms - PowerPoint PPT Presentation

About This Presentation
Title:

Alignment and Algorithms


Slides: 29
Provided by: umichEdu7

Transcript and Presenter's Notes

Title: Alignment and Algorithms


1
Alignment and Algorithms
  • Scoring Matrices: models of evolution
  • Dynamic Programming: speed
  • Pairwise Sequence Alignment, variants
  • Significance
  • Pattern recognition, Hidden Markov Models (HMM)
  • Multiple Sequence Alignment (MSA)
  • Protein families
  • Structure features

2
Scoring Models and Evolution
  • There are several systems of scoring matrices.
  • They are statistically based on curated data, i.e., on
    confident alignments of multiple sequences.
  • The two best known:
  • PAM matrices (Dayhoff et al., 1967)
  • BLOSUM matrices (Henikoff & Henikoff, 1992)

3
Scoring matrices, cont.
  • Recall: p(a,b) is the probability that a (an amino
    acid residue) mutated into b.
  • PAM (Point Accepted Mutation) uses sequence data
    for proteins which can be aligned with 99%
    sequence identity. In such aligned sequences,
    one counts the frequency of each residue, and of
    each pair of residues in a column.

4
Scoring matrices, cont.
  • Basically, the score s(a,b) from this data is a
    LOD (log-odds) score:
  • $s(a,b) = \log \frac{p(a,b)}{p(a)\,p(b)}$
  • This compares two models: significant
    association vs. independent randomness.
  • Scores are pruned and scaled to make the matrices
    integer-valued.
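
The log-odds computation can be sketched in a few lines of Python. This is an illustration only: the two-letter alphabet, the pair counts, the base-2 logarithm, and the `scale` factor are all assumptions made for the example, not values from the PAM or BLOSUM constructions.

```python
import math
from collections import defaultdict

def lod_scores(pair_counts, scale=2.0):
    """Integer log-odds scores from counts of aligned residue pairs.

    pair_counts maps an unordered pair (a, b) to the number of alignment
    columns in which a and b face each other (toy input, assumed).
    """
    total = sum(pair_counts.values())
    joint = defaultdict(float)            # symmetric joint p(a, b)
    for (a, b), n in pair_counts.items():
        if a == b:
            joint[(a, a)] += n / total
        else:                             # direction is unknown: split evenly
            joint[(a, b)] += n / (2 * total)
            joint[(b, a)] += n / (2 * total)
    marginal = defaultdict(float)         # background p(a)
    for (a, _b), p in joint.items():
        marginal[a] += p
    # s(a,b) = log2( p(a,b) / (p(a) p(b)) ), scaled and rounded to an integer
    return {
        (a, b): round(scale * math.log2(p / (marginal[a] * marginal[b])))
        for (a, b), p in joint.items()
    }

toy_counts = {("A", "A"): 40, ("B", "B"): 40, ("A", "B"): 20}
toy_scores = lod_scores(toy_counts)
```

With these toy counts, identical pairs get a positive score and the mismatch gets a negative one, which is the "association vs. independence" comparison in miniature.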

5
Scoring matrices, cont.
  • Interpretation: Dayhoff et al. define a short
    unit of evolutionary time, a PAM unit. Only very
    conservative changes will be seen in the
    composition of evolving proteins on this scale.
  • To extend to more distant similarities, they used a
    time-independent Markov assumption.

6
Scoring matrices, cont.
  • $p(a,b) = p(a \to b \mid t=1)$ gives the transition
    probability from a to b (or b to a; we cannot
    keep track of directionality) in one unit of PAM
    time.
  • The time-independent Markov model assumes that
    $p(a \to b \mid t=2)$, the probability in two units,
    is simply the sum over all pairs of intermediate
    one-time-unit transitions,
  • a to c followed by c to b:
    $p(a \to b \mid t=2) = \sum_c p(a \to c \mid t=1)\, p(c \to b \mid t=1)$

7
Scoring matrices, cont.
  • If P(1) is the original 20x20 matrix of amino
    acid transition probabilities, then the above
    says simply
  • $P(n) = P(1) \times P(1) \times \cdots \times P(1) = P(1)^n$ (n factors)
  • (matrix multiplication)
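
As a sketch of the extrapolation step, the following Python computes P(n) by repeated matrix multiplication. The 3x3 matrix `P1` is a made-up stand-in for the real 20x20 PAM1 transition matrix, and the helper names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(P, n):
    """P(n) = P(1) multiplied by itself n times (n >= 1)."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result

# Toy one-step transition matrix over three "residues"; each row sums to 1.
P1 = [
    [0.990, 0.007, 0.003],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

P2 = mat_power(P1, 2)      # two PAM units
P250 = mat_power(P1, 250)  # analogue of the PAM250 transition probabilities
```

Each row of P(n) remains a probability distribution, and the diagonal (the chance of seeing the same residue) shrinks as n grows, which is why larger n suits more distant sequences.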

8
Scoring matrices, cont.
  • So, PAM(n) = matrix of LOD scores from transitions
    in PAM time n.
  • Available for n = 1 to n = 500; smaller n is for more
    closely related sequences.
  • Most commonly used: n = 50 to 250.
  • Critique: accurate for small times, goes off at
    larger times. Compare nucleic acid changes to
    amino acid changes.

9
PAM 250 MATRIX
10
Scoring matrices, cont.
  • BLOSUM Matrices
  • From the BLOCKS database.
  • BLOSUM = BLOcks SUbstitution Matrix
  • BLOCKS is a database of aligned sequences in which
    conserved regions are marked.
  • The BLOSUM(L) transition probabilities are computed
    from sequences with L% sequence identity. The score
    is a LOD score again. So, large BLOSUM L <--> small
    PAM n.
  • Warning: corrections are needed for oversampling of
    closely related sequences!

11
Scoring matrices, cont.
  • If the BLOSUM62 matrix is compared to PAM160
    (its closest equivalent), then it is found that
    the BLOSUM matrix is less tolerant of
    substitutions to or from hydrophilic amino acids,
    while more tolerant of hydrophobic changes and of
    cysteine and tryptophan mismatches.
  • PAM is less sensitive to the effect of amino acid
    return substitutions over long time periods.

12
BLOSUM62 Matrix
13
Dynamic Programming
  • Problem: find (1) the best score and (2) the best
    alignment.
  • Naïve approach: evaluate each alignment and compare.
  • Too many alignments!
  • Need more speed: dynamic programming (DP).

14
Dynamic Programming, cont.
  • The number of alignments is exponentially large.
  • Use the fact that the problem has a linear
    (left-to-right) structure, inherited from the
    linearity of proteins and nucleic acids.
  • Analogous to a distance problem that originally
    arose in CS: edit distance.
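
For reference, the CS analogue mentioned above can be sketched as the standard edit (Levenshtein) distance computation, with unit costs for insertion, deletion, and substitution. The function name is an arbitrary choice for this example.

```python
def edit_distance(x, y):
    """Minimum number of single-character edits turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        cur = [i]  # turning x[:i] into the empty string costs i deletions
        for j in range(1, len(y) + 1):
            substitute = prev[j - 1] + (x[i - 1] != y[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]
```

The left-to-right structure is what makes this work: each cell depends only on three neighbouring cells, so the whole table costs on the order of len(x) * len(y) steps.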

15
Dynamic Programming, cont.
  • The basic algorithms here are:
  • (1) Needleman-Wunsch (global alignment)
  • (2) Smith-Waterman (local alignment)
  • BLAST is a heuristic version of SW.
  • (3) Many variants.

16
Dynamic Programming, cont.
  • The NW algorithm can be understood pictorially.
  • It has three basic steps:
  • (1) Initialization
  • (2) Iteration
  • (3) Traceback
  • Basic insight: you only have to optimize the
    score one step at a time, and can avoid looking
    at most alignments.

17
Dynamic Programming, cont.
  • Missing scoring feature: gap penalties.
  • These are not as well understood as substitutions.
    Use geometric random variable models.
  • s(a,-) is independent of a; say s(a,-) = -d, the
    gap penalty.
  • For longer gaps, there are two methods:
  • Linear: k gaps in a row give penalty k x (-d).
  • Affine: k gaps in a row give penalty
    (-e) + (k-1) x (-d),
  • where e > 0 is the gap initiation penalty.
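
The two gap models can be written as small functions. This sketch assumes that d and e are positive numbers and that the functions return the (negative) contribution to the alignment score; the function names and the example values are illustrative.

```python
def linear_gap(k, d):
    """Score contribution of k consecutive gap positions, each costing d."""
    return k * (-d)

def affine_gap(k, e, d):
    """Score contribution of a gap of length k >= 1: the first position
    costs e (initiation), each further position costs d (extension)."""
    return -e + (k - 1) * (-d)
```

With e much larger than d, one long gap is cheaper than several short ones, for example `affine_gap(3, e=10, d=1)` versus three separate gaps of length 1.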

18
Dynamic Programming, cont.
  • Notice that the affine gap score does not depend
    only on the position, since you have to know whether
    you are continuing a gap from the left. This is the
    first example of dependency in scoring.
  • The key to the iterative step is to realize there are
    only three ways to extend an alignment by one column:
        R1    R1    -
        r1    -     r1
  • We have to update the score by taking the max of
    three possibilities: max( s0 + s(R1,r1), s0 +
    s(R1,-), s0 + s(-,r1) ), where s0 is the best score
    of the subalignment preceding the new column in
    each case.
  • Then getting the scores of all possible
    subalignments takes just
  • about 5MN computations (for sequences of lengths M
    and N), instead of an exponential number.
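
The iteration can be sketched as a score-only Needleman-Wunsch fill. The scoring values (match, mismatch, linear gap penalty d) are placeholder assumptions; a real run would look up s(a,b) in a PAM or BLOSUM matrix instead.

```python
def nw_score(x, y, match=1, mismatch=-1, d=2):
    """Best global alignment score of x and y with a linear gap penalty d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, M + 1):
        F[i][0] = -d * i
    for j in range(1, N + 1):
        F[0][j] = -d * j
    # Iteration: each cell is the max over the three ways to add a column.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # R1 over r1
                          F[i - 1][j] - d,       # R1 over gap
                          F[i][j - 1] - d)       # gap over r1
    return F[M][N]
```

The double loop visits each of the (M+1)(N+1) cells once and does a constant amount of work per cell, which is where the "about 5MN" count comes from.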

19
Dynamic Programming, cont.
  • Deriving the actual alignment requires storage:
    one stores a traceback arrow at each position
    in the comparison array.
  • This information can be translated into the
    correct alignment.
  • Variants:
  • Smith-Waterman local alignment. The adjustment is
    to reset to 0 and start over again any time an NW
    score goes below zero, and proceed. One chooses the
    largest score of a segment, and traces back as
    earlier.
  • Linear storage: saves memory.
  • Repeats: allows for duplication of segments.
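
A minimal Smith-Waterman sketch with stored traceback arrows follows. The scoring values and the function name are assumptions for illustration, and ties are broken arbitrarily (first candidate wins), so only one of possibly several optimal local alignments is returned.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Return (best local score, aligned piece of x, aligned piece of y)."""
    M, N = len(x), len(y)
    H = [[0] * (N + 1) for _ in range(M + 1)]
    arrow = [[None] * (N + 1) for _ in range(M + 1)]  # traceback arrows
    best, best_pos = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                       # reset: start a new segment here
                (H[i - 1][j - 1] + s, "diag"),
                (H[i - 1][j] - d, "up"),
                (H[i][j - 1] - d, "left"),
            ]
            H[i][j], arrow[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until an arrow-less (reset) cell is reached.
    i, j = best_pos
    ax, ay = [], []
    while arrow[i][j] is not None:
        move = arrow[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `smith_waterman("XXXACGTXXX", "YYACGTYY")` picks out the shared `ACGT` segment and ignores the unrelated flanks.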

20
Dynamic Programming, cont.
  • Variants (cont.)
  • Non-overlapping repeats: SW, avoiding previous
    best alignments.
  • Sampling of highest hits: the best alignment
    score may be an artifact of our scoring system, and
    many alignments may be high scoring.
  • Affine gap penalties: more states actually have to
    be carried along, a precursor to Hidden
    Markov Models.
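
The extra states needed for affine gaps can be sketched with three score tables, one per way the last column can look (the usual Gotoh-style formulation, named here as the technique being illustrated). The sketch assumes a gap of length k contributes -e - (k-1)d with e, d > 0, and uses placeholder match and mismatch scores.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, e=3, d=1):
    """Best global alignment score with affine gaps (score only)."""
    n, m = len(x), len(y)
    # M: last column pairs x[i] with y[j]; X: x[i] over a gap; Y: gap over y[j].
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -e - (i - 1) * d
    for j in range(1, m + 1):
        Y[0][j] = -e - (j - 1) * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs e; continuing the same kind of gap costs d.
            X[i][j] = max(M[i - 1][j] - e, Y[i - 1][j] - e, X[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - e, X[i][j - 1] - e, Y[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])
```

Which table a cell's value came from plays the role of a hidden state: the score of the next column depends on it, not only on the position.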

21
Significance
  • The E-value in BLAST was discussed last time. It is
    closely related to the statistical concept of a
    p-value.
  • S = score of the best alignment is a random variable.
  • The p-value for an alignment is the probability
    that a score at least this high could have been
    found by chance.

22
Significance, cont.
  • This can be estimated in two ways:
  • Ungapped local alignment: Karlin-Altschul (KA)
    statistics.
  • Generally, KA is not available; rely instead on
    simulations: take two sequences A and B, randomly
    shuffle one, and recompare. This
    eliminates composition bias in the test samples,
    but randomizes the rest of the biological
    information in the sequence, namely the order of
    the residues (or nts). SW computes this for
    hundreds of shuffles.
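
The shuffling procedure can be sketched as follows. The local scoring function, its parameters, the number of shuffles, and the "(hits + 1) / (shuffles + 1)" estimator are assumptions chosen for this example rather than what any particular package does.

```python
import random

def local_score(x, y, match=2, mismatch=-1, d=2):
    """Smith-Waterman score only, keeping just two rows in memory."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        best = max(best, max(cur))
        prev = cur
    return best

def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Empirical p-value: how often a shuffled b scores at least as well."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # same composition, random order
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)
```

With 200 shuffles the smallest p-value this estimator can report is 1/201, so resolving very small p-values this way needs many more shuffles.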

23
HMMs
  • These are a special case of a general CS problem:
    pattern recognition, or machine learning. Can one
    make the machine a master at recognizing or
    classifying certain things? There is a large
    family of such graphical models; this is one of
    the simplest.
  • Basic problem in molecular biology (MB): can one
    recognize different functional or structural regions
    in a protein from its a.a. sequence? Its fold? Or,
    less specifically, what are the common features of a
    group of proteins?

24
HMMs, cont.
  • For example, there exist very many G-protein
    coupled receptors (GPCRs), which fall into seven
    classes. Can one (a) decide whether a new protein is
    a GPCR, and (b) assign its class, from
    sequence data alone?
  • Simplest example problem: the dishonest casino. You
    are given a sequence of observed random variables
    or states of the system (faces of a die), but you
    don't see the hidden states (e.g., whether the casino
    is using a fair or biased die). Notice that this
    kind of example has a linear structure to it,
    like our a.a. sequences.
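
A dishonest-casino model can be written down and sampled in a few lines. All numbers here (switching probabilities, the loaded die favouring six, the 50/50 start) are illustrative assumptions, as are the names `START`, `TRANS`, `EMIT`, and `sample_casino`.

```python
import random

# Hidden states: "F" = fair die, "L" = loaded die.
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {face: 1 / 6 for face in range(1, 7)},
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def sample_casino(n, seed=0):
    """Generate n rolls; return (hidden states, observed faces)."""
    rng = random.Random(seed)

    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    hidden, rolls = [], []
    state = draw(START)
    for _ in range(n):
        hidden.append(state)
        rolls.append(draw(EMIT[state]))   # the observer sees only this
        state = draw(TRANS[state])        # the casino may swap dice
    return hidden, rolls
```

An observer gets only `rolls`; the list `hidden` is what an HMM algorithm would try to reconstruct.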

25
HMMs, cont.
  • The problem is: given an observed path (of
    observable states), determine the underlying path
    through the hidden states. I.e., given the record
    of the die rolls, say when the biased die was
    used and when the fair one. Notice that in this
    model we know how many visible states there are,
    and we are assuming we know how many hidden states
    there are.
  • Of course, we cannot know with certainty when
    the fair or biased die is used; we must seek a
    probable answer, and the algorithms show how to
    choose a most probable path through the model.
  • The harder problem (the biological problem) is when
    we do not know what the hidden states are,
    except where we can define them, e.g.,
    structurally. For example, you could run through
    a protein a.a. sequence and ask an HMM to decide
    which residues are situated in alpha helices of
    the protein, or which lie in the cell
    membrane.

26
HMMs, cont.
  • In the first problem, if we know all the
    parameters, meaning mainly the transition
    probabilities for the hidden Markov states, then it
    is easy to find a dynamic programming algorithm for
    finding the best path. This is the Viterbi
    algorithm.
  • This is sometimes called parsing the hidden Markov
    model, because the method originated in speech
    recognition work.
  • The harder problem is to figure out what the
    parameters are for the hidden states. This is
    called training the model, and is done by the
    Baum-Welch algorithm, a version of the so-called
    EM (Expectation Maximization) algorithm.
  • BW, or EM, is an iterative optimization algorithm
    that finds a local, not necessarily global, optimum.
    That is, it starts from a given initial guess for
    the parameters involved and performs a
    calculation which is guaranteed to improve (make
    no worse, actually) the previous guess. One
    iterates and stops once the improvement falls
    below a suitable threshold.
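
The Viterbi step can be sketched for a two-state dishonest-casino model. The parameter values below are illustrative assumptions (fair die "F", loaded die "L" favouring six), the function signature is hypothetical, and the observation list is assumed to be non-empty. Log probabilities are used to avoid underflow on long sequences.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden-state path for a non-empty observation list."""
    # V[t][s]: best log-probability of any path ending in state s at time t.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            scores[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    # Traceback from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path)

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {face: 1 / 6 for face in range(1, 7)},
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
```

Calling `viterbi([6] * 10, STATES, START, TRANS, EMIT)` labels a run of sixes as loaded, while a run with no sixes is labelled fair. Training the parameters themselves (Baum-Welch) is a separate, iterative computation not shown here.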

27
HMMs, cont.
  • Finally, a harder problem still is to learn the
    correct architecture of the model. There are
    packages for this in the case of protein
    families and multiple sequence alignments.
  • In the case of structure, for example, one sees a
    classical problem in statistics: one can let the
    data tell directly what the patterns should be
    (unsupervised learning), or one can use
    expertise to guide the data. The latter allows
    fainter structural signals to be heard. Problem: is
    this fitting the data too tightly, making the
    resulting model less generally applicable?
    Example: TMHMM and unusual but necessary faint
    signals.

28
Further notes are available at:
  • http://www.math.lsa.umich.edu/dburns/547/547syll.html
  • You can link from there to lectures on sequence
    alignment and hidden Markov models.