Why do we care about a sequence alignment - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Why do we care about a sequence alignment

Description:

Simple and Affine Gap Scores ... Affine gap penalties have an open penalty and a ... a multi-tiered affine gap score is required to fit existing data sets well. ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 42
Provided by: jamesh78
Category:

less

Transcript and Presenter's Notes

Title: Why do we care about a sequence alignment


1
Why do we care about a sequence alignment?
  • For a new protein (or DNA) whose sequence is
    related to another one about which something is
    already known, a lot can be learned with little
    effort. This can include
  • Biochemical function (e.g. protein kinase)
  • Organismal function (e.g. cell-cycle regulator)
  • Structural information (e.g. modeling against a
    known X-ray crystal structure)
  • To determine gene structure (e.g. splicing
    pattern) or likely subcellular or tissue
    expression pattern.

(cont)
2
(why do we care, cont.)
  • To study evolutionary relationships among genes
    and organisms.
  • Sequence assembly in genome projects
  • Find sequence overlaps for assembly
  • Quality control using multiple sequence runs

3
Homology, orthology, paralogy, similarity
  • Two sequences are homologous if they derive from
    a common ancestral sequence (includes paralogs
    and orthologs).
  • Two homologous sequences in the same organism
    are called paralogous. They arose by duplication
    and divergence from a shared ancestral sequence.
  • Two homologous sequences in different organisms
    may be orthologous (a strict definition is
    thornier because of paralogy more in later
    class).
  • Two sequences that can be aligned in a
    convincing manner are said to have sequence
    similarity.
  • If two sequences have similarity over a
    substantial length, this nearly always reflects
    homology (convergence is unlikely).

4
Major alignment methods
  • Hand alignment. Surprisingly effective but
    LABORIOUS.
  • Dynamic programming methods. There are many
    closely-related algorithms with specific names,
    including Smith-Waterman and Needleman-Wunsch.
  • Hidden Markov Models and related strict
    probabilistic methods. (In algorithmic terms,
    these are dynamic programming as well but are
    rarely called that in the literature.)
  • Dialign a method that aligns ungapped blocks
    (diagonals) and then arranges them with gaps
    between them. GS554/Papers/SequenceAlignment/Diali
    gn1A.pdf

5
Dynamic Programming Alignment
  • Variants of this method are used by most
    pairwise and multiple alignment programs (e.g.
    Clustal).
  • Fundamental problem is how to produce an optimal
    alignment of related protein or DNA sequences,
    given some assumptions about sequence
    conservation.
  • What is "optimal"? What assumptions have to be
    made to make the alignment feasible?

6
Optimal Alignments
  • One definition of optimal an alignment in which
    each aligned sequence residue descended from the
    same ancestral residue.
  • Another definition each aligned sequence
    residue plays the same functional role for the
    two proteins.
  • Typically, these two definitions are closely
    related.

7
A self-evidently true set of related sequences
CLUSTAL W(1.4) multiple sequence alignment
identical in all
. conservative changes worm
CaMKII EARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDI
VAREFYSEADASHCIQQI fly CaMKII
EARICRKLHHPNIVRLHDSIQEENYHYLVFDLVTGGELFEDIVAREFYSE
ADASHCIQQI rat CaMKII EARICRLLKHPNIVRLHDSISEEGHH
YLIFDLVTGGELFEDIVAREYYSEADASHCIQQI
. ..
worm CaMKII LESIAYCHSNGIVHRDLKPENLLLA
SKAKGAAVKLADFGLAIEVN-DSEAWHGFAGTPGY fly CaMKII
LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVKLADFGLAIEVQGDHQA
WFGFAGTPGY rat CaMKII LEAVLHCHQMGVVHRDLKPENLLLAS
KLKGAAVKLADFGLAIEVEGEQQRWFGFAGTPGY
.. . . .

Among them, these proteins have a total of
approximately 2 billion years of evolution (!!)
8
A dubious case
CLUSTAL W(1.4) multiple sequence
alignment ATTSTDGLISNGAERLRLQGSRLQTSRFACFRCCGNIIT
YLVRLRSTPEELEQRYKSKEI EARICRKLHHPNIVRLHDSIQEENYHYL
VFDLVTGGELFEDIVAREFYSEADASHCIQQI . . .
. . .. . . .
.. DKFLE--KEKHTFRRQVK--LLLLGAGESGKSTFLKQMRIIHGVN
FDYELLLEYQS---V LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVK
LADFGLAIEVQGDHQAWFGFAGTPGY . .. . .
. . . . . .
Are these sequences evolutionarily related?
No - they are a G-protein alpha subunit and a CaM
kinase
9
Assumptions Made for (most) Alignments
  • That there is a useful alignment at all. (Any
    two proteins with gt 20 amino acid identity over
    their full length are certainly derived from a
    common ancestral sequence.)
  • That each residue evolves independently of
    others.
  • That the primary sequence is the predominant
    determinant of function (biological assumption).
  • That residues remain in the same order (no
    changes in domain order etc.).

10
Graph theory and terminology
  • A sequence graph represents residues as edges
    between nodes (also called vertices).
  • Such a graph can represent a single sequence
    (above) or can be 2-dimensional (or more) and
    represent a comparison between two sequences.

thanks to Phil Green for some of these slide
figures
11
2-dimensional graph representing alignments
12
Graphs and sequence alignments
  • Paths through the graph correspond to
    alignments of the sequences, with each edge on
    the path corresponding to a column of the
    alignment.
  • Diagonal edges correspond to two aligned
    residues.
  • Horizontal and vertical edges correspond to a
    residue in one sequence and a gap in the other.

13
A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
Exercise convince yourself that there are 3mn
possible alignments.
14
Edge Weights on Graphs
  • Edge weights correspond to scores for an aligned
    residue or gap (a simple version would be 1 for a
    match, 0 otherwise).
  • The weight of a path is the sum of weights for
    each edge on the path.
  • The highest weight path corresponds to the
    highest scoring alignment for that scoring
    system.
  • Weights are assigned using a substitution score
    matrix, such as the BLOSUM62 matrix (the
    Needleman-Wunsch paper just uses 1 for each
    match).

15
How do we find the highest weight path?
16
Needleman-Wunsch / Smith-Waterman Algorithm
  • Start at the top left corner on the graph (or
    the bottom right corner).
  • Proceed across and down the graph, adding up
    match scores for every path and recording in a
    2-D array (note in fact you dont have to
    record them all).
  • At the end, choose the path that gives the best
    score at the bottom row or right edge.
  • For alignment, trace back through best score
    path.
  • It is possible to prove that this is the best
    alignment.

17
A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
18
Informal inductive proof of best alignment path
Consider the last step in the best alignment path
to node a below. This path must come from one of
the three nodes shown, where X, Y, and Z are the
cumulative scores of the best alignments up to
those nodes. We can reach node a by three
possible paths an A-B match, a gap in sequence A
or a gap in sequence B
The best-scoring path to a is the maximum of X
match Y gap Z gap
BUT the best paths to X, Y, and Z are analogously
the max of their three upstream possibilities,
etc. Inductively QED.
19
Traceback for best alignment path
For alignment, node a must have two associated
values the score of the best path leading to a
and which node that score came from (X, Y, or Z).
A similar pair of values can be kept for every
node in the graph.
At the end, search along the bottom and right
edges for the highest scoring node and start
traceback from that node. Follow the node
pointers backward from there. Et voila! Notes
there are minor variants on the algorithm that
permit global, local, and repeat alignments,
improved gap models etc. These include always
starting traceback from the lower right corner,
never letting the best score drop below zero, and
keeping track of gap open vs. gap extension
(requires keeping three best scores at each node).
20
Protein score matrices
  • DNA score matrices are much simpler (and we
    wont cover them).
  • Quantitatively represent the degree of
    conservation of typical amino acid residues over
    evolutionary time.
  • All possible amino acid changes are represented
    (matrix of size 20 x 20).
  • Most commonly used are three different BLOSUM
    matrices tuned for different degrees of
    evolutionary divergence. (Henikoff and Henikoff)

21
BLOSUM62 Score Matrix
22
Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
23
BLOSUM62 Score Matrix example D
Bad scores very dissimilar
24
Amino acid structures
Hydrophobic
Polar
Charged
25
How are BLOSUM scores inferred?
  • Find sets of sequences whose alignment is
    thought to be correct (this is largely
    bootstrapped by alignment).
  • Measure how often various amino acid pairs occur
    in all columns of all the alignments.
  • Normalize this to the expected frequency of such
    pairs randomly in the same set.
  • Derive a log-odds score (usually in half bits).

26
Example of a short block of 29 sequences
Column 1 29/29 H (histidine)
Column 15 24/29 P (proline)
3 H (histidine) 2 S (serine)
Column 12 mostly L (leucine), I (isoleucine), or
V (valine).
Column 18 very diverse.
27
Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
28
Pair frequency vs. expectation
Sample column count
29
Log-odds score calculation (so adding scores
multiplying probabilities)
For use in alignment programs computational speed
is important so these are often rounded to the
nearest integer and (to reduce round-off error)
they are often multiplied by 2 (or more) first,
giving a half-bit score
30
Sample frequencies from this BLOSUM matrix
(half-bit scores)
Frequency of C over all proteins 0.0162
Reverse calculation of aligned C-C pair frequency
in BLOSUM data set
C - C
you have to look this up
31
Constructing Blocks
  • Blocks are ungapped alignments of multiple
    sequences, usually 20 to 100 amino acids long.
  • Cluster the members of each block according to
    their percent identity.
  • Make pair counts and score matrix from similarly
    clustered blocks.
  • Each BLOSUM matrix is named for the percent
    identity cutoff in step 2 (e.g. BLOSUM70).

(Note that there is some circularity to the
process the blocks are aligned using score
matrices!)
32
Log-odds Scores - Reminder
Where is the alignment score, qij is the real
pair frequency, eij is the expected random pair
frequency.
33
Probabilistic Interpretation of Scores (ungapped)
  • By converting scores back to probabilities, we
    can give a probabilistic interpretation to an
    alignment score.

34
The need for gaps and gap penalties
  • one kind of mutation is an insertion or deletion
    of one or more nucleotides, often called an
    indel.
  • most of the time indels in protein sequences are
    eliminated by selection, but not always.
  • represented by gap symbols, which mean a
    deletion in the gap-bearing sequence OR an
    insertion in the other sequence (usually you
    cant tell which).

TACTTGGATCCGATCAGGAGGAACCGATTC TACTTGGATCCGA--AGGA
GGAACCGATTC
35
BUT we need to penalize gaps!
TACTTGGATCCGATCAGGAGGAACCGATTC T--TT---T--G----GG-
G-------TT-
  • convince yourself that if we allow gaps without
    affecting score, we can always generate a nearly
    perfect alignment (constrained only by residue
    frequency and sequence length).
  • thinking biologically, like an amino acid
    change, an indel indicates evolutionary
    divergence.

36
Simple and Affine Gap Scores
  • Simple gap score - using the same dynamic
    programming method, add a negative score for each
    gap added (often described as subtracting a
    penalty).
  • However in real sequences, positions where a
    single gap occurs are much more likely to have a
    longer gap (note biochemical basis).
  • Affine gap penalties have an open penalty and a
    much smaller extend penalty.
  • Choice of penalties is relatively arbitrary
    (heuristic) and not based on good evolutionary
    models (as match scores are).

37
Randomly Distributed Gaps
(probability of a gap at each position in the
sequence)
note - the slope of the line on a log-linear
plot will vary according to the frequency of
gaps, but it will always be linear
38
Distribution of alignment gap lengths in large
set of structurally-aligned proteins
log-log plot
  • a multi-tiered affine gap score is required to
    fit existing data sets well.
  • nearly all affine gap scores currently are one
    tiered (open and extend).

observed
39
Alignment gap lengths regraphed
log-linear plot
Reasonably fit by 4 affine steps
40
Assignment for Tues.
Download and install the Bonsai alignment
program. http//depts.washington.edu/jtlab/softwar
e/softwareIndex.html (also on course web links
page near bottom) Multiple align sample sequence
set 3 (Help menu). There should be 45
sequences. Save the tree and multiple alignment
as JPG images and send those to me. (note there
may be issues with viewing the tree on
Macs) Select a column that is moderately
conserved in the multiple alignment and look at
the column scores for the column. Pick one each
of the best and worst scoring amino acids and
write a sentence explaining (qualitatively) why
the score makes sense. Bonsai has documentation
files that will help you do all this and to
understand it (note they are badly out of date
and need rewriting).
41
Assignment part 2
F Y Y L F L F F Y
  • Find out the actual frequency of the 20 amino
    acids in proteins (overall).
  • Use this column of data to construct a 3 x 3
    amino acid score matrix, using rounded half-bit
    scores.
  • (dont use pseudocounts)
Write a Comment
User Comments (0)
About PowerShow.com