Why do we care about a sequence alignment - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

Why do we care about a sequence alignment

Description:

worm CaMKII EARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDIVAREFYSEADASHCIQQI ... worm CaMKII LESIAYCHSNGIVHRDLKPENLLLASKAKGAAVKLADFGLAIEVN-DSEAWHGFAGTPGY ... – PowerPoint PPT presentation

Number of Views:48
Avg rating:3.0/5.0
Slides: 33
Provided by: jamesh78
Category:

less

Transcript and Presenter's Notes

Title: Why do we care about a sequence alignment


1
Why do we care about a sequence alignment?
  • For a new protein (or DNA) whose sequence is
    related to another one about which something is
    already known, a lot can be learned with little
    effort. This can include
  • Biochemical function (e.g. protein kinase)
  • Organismal function (e.g. cell-cycle regulator)
  • Structural information (e.g. modeling against a
    known X-ray crystal structure)
  • To determine gene structure (e.g. splicing
    pattern) or likely subcellular or tissue
    expression pattern.

(cont)
2
(why do we care, cont.)
  • To study evolutionary relationships among genes
    and organisms.
  • Sequence assembly in genome projects
  • Find sequence overlaps for assembly
  • Quality control using multiple sequence runs

3
Homology, orthology, paralogy, similarity
  • Two sequences are homologous if they derive from
    a common ancestral sequence (includes paralogs
    and orthologs).
  • Two homologous sequences in the same organism
    are called paralogous. They arose by duplication
    and divergence from a shared ancestral sequence.
  • Two homologous sequences in different organisms
    may be orthologous (a strict definition is
    thornier because of paralogy more in later
    class).
  • Two sequences that can be aligned in a
    convincing manner are said to have sequence
    similarity.
  • If two sequences have similarity over a
    substantial length, this nearly always reflects
    homology (convergence is unlikely).

4
Major alignment methods
  • Hand alignment. Surprisingly effective but
    LABORIOUS.
  • Dynamic programming methods. There are many
    closely-related algorithms with specific names,
    including Smith-Waterman and Needleman-Wunsch.
  • Hidden Markov Models and related strict
    probabilistic methods. (In technical algorithmic
    terms, these are dynamic programming as well but
    are rarely called that in the literature.)
  • Dialign a method that aligns ungapped blocks
    (diagonals) and then arranges them with gaps
    between them. GS554/Papers/SequenceAlignment/Diali
    gn1A.pdf

5
Dynamic Programming Alignment
  • Variants of this method are used by most
    pairwise and multiple alignment programs (e.g.
    Clustal).
  • Fundamental problem is how to produce an optimal
    alignment of related protein or DNA sequences,
    given some assumptions about sequence
    conservation.
  • What is "optimal"? What assumptions have to be
    made to make the alignment feasible?

6
Optimal Alignments
  • One definition of optimal an alignment in which
    each aligned sequence residue descended from the
    same ancestral residue.
  • Another definition each aligned sequence
    residue plays the same functional role for the
    two proteins.
  • Typically, these two definitions are closely
    related.

7
A self-evidently true set of related sequences
CLUSTAL W(1.4) multiple sequence alignment
identical in all
. conservative changes worm
CaMKII EARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDI
VAREFYSEADASHCIQQI fly CaMKII
EARICRKLHHPNIVRLHDSIQEENYHYLVFDLVTGGELFEDIVAREFYSE
ADASHCIQQI rat CaMKII EARICRLLKHPNIVRLHDSISEEGHH
YLIFDLVTGGELFEDIVAREYYSEADASHCIQQI
. ..
worm CaMKII LESIAYCHSNGIVHRDLKPENLLLA
SKAKGAAVKLADFGLAIEVN-DSEAWHGFAGTPGY fly CaMKII
LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVKLADFGLAIEVQGDHQA
WFGFAGTPGY rat CaMKII LEAVLHCHQMGVVHRDLKPENLLLAS
KLKGAAVKLADFGLAIEVEGEQQRWFGFAGTPGY
.. . . .

Among them, these proteins have a total of
approximately 2 billion years of evolution (!!)
8
A dubious case
CLUSTAL W(1.4) multiple sequence
alignment ATTSTDGLISNGAERLRLQGSRLQTSRFACFRCCGNIIT
YLVRLRSTPEELEQRYKSKEI EARICRKLHHPNIVRLHDSIQEENYHYL
VFDLVTGGELFEDIVAREFYSEADASHCIQQI . . .
. . .. . . .
.. DKFLE--KEKHTFRRQVK--LLLLGAGESGKSTFLKQMRIIHGVN
FDYELLLEYQS---V LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVK
LADFGLAIEVQGDHQAWFGFAGTPGY . .. . .
. . . . . .
Are these sequences evolutionarily related?
No - they are a G-protein alpha subunit and a CaM
kinase
9
Assumptions Made for Alignments
  • That there is a useful alignment at all. (Any
    two proteins with gt 20 amino acid identity over
    their full length are certainly derived from a
    common ancestral sequence.)
  • That each residue evolves independently of
    others.
  • That the primary sequence is the predominant
    determinant of function (biological assumption).

10
Graph theory and terminology
  • A sequence graph represents residues as edges
    between nodes (also called vertices).
  • Such a graph can represent a single sequence
    (above) or can be 2-dimensional (or more) and
    represent a comparison between two sequences.

thanks to Phil Green for some of these slide
figures
11
2-dimensional graph representing alignments
12
Graphs and sequence alignments
  • Paths through the graph correspond to
    alignments of the sequences, with each edge on
    the path corresponding to a column of the
    alignment.
  • Diagonal edges correspond to a column of two
    aligned residues.
  • Horizontal and vertical edges correspond to a
    column with a residue in one sequence and a gap
    in the other.

13
A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
Exercise convince yourself that there are 3mn
possible alignments.
14
Edge Weights on Graphs
  • Edge weights correspond to residue scores in an
    alignment column (a simple version would be 1 for
    a match, 0 otherwise).
  • The weight of a path is the sum of weights for
    each edge on the path.
  • The highest weight path corresponds to the
    highest scoring alignment for that scoring
    system.
  • Weights are assigned using a substitution score
    matrix, such as the BLOSUM62 matrix (the
    Needleman-Wunsch paper just uses 1 for each
    match).

15
How do we find the highest weight path?
16
Needleman-Wunsch / Smith-Waterman Algorithm
  • Start at the top left corner on the graph (or
    the bottom right corner).
  • Proceed across and down the graph, adding up
    match scores for every path and recording in a
    2-D array.
  • At the end, choose the path that gives the best
    score at the bottom row or right edge.
  • For alignment, trace back through best score
    path.
  • It is possible to prove that this is the best
    alignment.

17
A
C
G
T
T
G
A
G
A
T
A
C
C
C
G
C
A
T
G
A
T
G
A
The above path corresponds to the following
alignment
18
Protein score matrices
  • DNA score matrices are much simpler (and we
    wont cover them).
  • Quantitatively represent the degree of
    conservation of typical amino acid residues over
    evolutionary time.
  • All possible amino acid changes are represented
    (matrix of size 20 x 20).
  • Most commonly used are three different BLOSUM
    matrices tuned for different degrees of
    evolutionary divergence. (Henikoff and Henikoff)

19
BLOSUM62 Score Matrix
20
Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
21
BLOSUM62 Score Matrix example D
Bad scores very dissimilar
22
Amino acid structures
Hydrophobic
Polar
Charged
23
How are BLOSUM scores inferred?
  • Find sets of sequences whose alignment is
    thought to be correct (this is largely
    bootstrapped by alignment).
  • Measure how often various amino acid pairs occur
    in all columns of all the alignments.
  • Normalize this to the expected frequency of such
    pairs randomly in the same set.
  • Derive a log-odds score (in half bits).

24
Example of a short block of 29 sequences
Column 1 29/29 H (histidine)
Column 15 24/29 P (proline)
3 H (histidine) 2 S (serine)
Column 12 mostly L (leucine), I (isoleucine), or
V (valine).
Column 18 very diverse.
25
Amino acid structures
Hydrophobic
Polar
Charged
phenylalanine F
26
Pair frequency vs. expectation
Sample column count
27
Log-odds score calculation (so adding scores
multiplying probabilities)
For use in alignment programs computational speed
is important so these are often rounded to the
nearest integer and (to reduce round-off error)
they are often multiplied by 2 (or more) first,
giving a half-bit score
28
Sample frequencies from this BLOSUM matrix
(half-bit scores)
In Swiss-Prot C 0.0162
Reverse calculation of aligned C-C pair frequency
in BLOSUM data set
C - C
you have to look this up
29
Constructing Blocks
  • Blocks are ungapped alignments of multiple
    sequences, usually 20 to 100 amino acids long.
  • Cluster the members of each block according to
    their percent identity.
  • Make pair counts and score matrix from similarly
    clustered blocks.
  • Each BLOSUM matrix is named for the percent
    identity cutoff in step 2 (e.g. BLOSUM70).

(Note that there is some circularity to the
process the blocks are aligned using score
matrices!)
30
Informal inductive proof of best alignment path
Consider the last step in the best alignment path
to node a below. This path must come from one of
the three nodes shown, where X, Y, and Z are the
cumulative scores of the best alignments up to
those nodes. We can reach node a by three
possible paths an A-B match, a gap in sequence A
or a gap in sequence B
The best-scoring path to a is the maximum of X
match Y gap Z gap
BUT the best paths to X, Y, and Z are analogously
the max of their three upstream possibilities,
etc. Inductively QED.
31
Traceback for best alignment path
For alignment, node a must have two associated
values the score of the best path leading to a
and which node that score came from (X, Y, or Z).
A similar pair of values is kept for every node
in the graph.
At the end, search along the bottom and right
edges for the highest scoring node and start
traceback from that node. Follow the node
pointers backward from there. Et voila! Notes
there are minor variants on the algorithm that
permit global, local, and repeat alignments,
improved gap models etc. These include always
starting traceback from the lower right corner,
never letting the best score drop below zero, and
keeping track of gap open vs. gap extension
(requires keeping three best scores at each node).
32
Assignment
F Y Y L F L F F Y
  • Find out the frequency of the 20 amino acids in
    proteins (overall).
  • Use this column of data to construct a 3 x 3
    amino acid score matrix, using rounded half-bit
    scores.
  • (dont use pseudocounts)
Write a Comment
User Comments (0)
About PowerShow.com