Sequence Alignment: A Brief Review

Slides: 31
Provided by: mei92
1
Sequence Alignment: A Brief Review
COMP650 10/17/01 Jie Song
2
Why align biological sequences (DNA, RNA and
protein)?
• Identify homology
Nature is conservative. Biological research, and
even human society, often benefits from homology
established through evolution. See the insulin
example.
• Predict function
The primary sequence determines higher-level
structure and function. Conserved regions in a
protein sequence are good candidates for
functional domains.
• Infer evolutionary relationships
Organisms sharing more homology are usually more
closely related in evolution.
3
Example: compare human and rabbit insulin proteins
4
• Pairwise Sequence Alignment
  • Edit distance
  • Dynamic programming
  • Global sequence alignment
  • Local sequence alignment
  • Other heuristic methods in practice
• Multiple Sequence Alignment
  • Standard dynamic programming
  • Progressive methods
  • Simultaneous methods
  • Other heuristic methods in practice
5
Pairwise alignment - Edit Distance
• Similarity vs. distance
Two ways to measure how alike two sequences are.
Both functions associate a numeric value with a
specific pair of sequences.
• Unit cost function w
Suppose a and b are nucleotides or amino acids
(or a gap, written -). Then the simplest unit
cost function is
$$w(a, a) = 0, \qquad w(a, b) = 1 \ (a \neq b), \qquad w(a, -) = w(-, b) = 1$$
(This function is likely too simple for practical
use.)
6
Pairwise alignment - Edit Distance
• Definitions
Given two sequences s and t:
The cost of an alignment of s and t is the sum of
the costs of all editing operations leading from
s to t.
The optimal alignment of s and t is the alignment
with the minimal cost among all possible
alignments.
The edit distance $d_w$ of s and t is the cost
of the optimal alignment.
7
Pairwise alignment - Edit Distance
• A simple example
s = AGCACTA
t = CAGCCTA
One possible alignment (there are many others) is
s  -AGCACTA
t  CAGC-CTA
Cost of this alignment:
$$w(-, C) + w(A, A) + w(G, G) + w(C, C) + w(A, -) + w(C, C) + w(T, T) + w(A, A) = 2$$
But we don't yet know whether this alignment is
the optimal one. If it is, then the edit distance
of s and t is 2.
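The cost calculation above can be checked mechanically. The following is a minimal sketch, assuming the unit cost function from the earlier slide and `-` as the gap symbol; the function names `unit_cost` and `alignment_cost` are illustrative and not from the presentation.

```python
GAP = "-"

def unit_cost(a, b):
    """Unit cost: 0 for identical symbols, 1 for a mismatch or a gap."""
    return 0 if a == b else 1

def alignment_cost(row_s, row_t):
    """Sum the column costs of an alignment given as two gapped rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(unit_cost(a, b) for a, b in zip(row_s, row_t))

# The alignment shown on the slide
print(alignment_cost("-AGCACTA", "CAGC-CTA"))  # → 2
```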
8
Pairwise alignment - Dynamic Programming
• Goal
Aligning two sequences means finding their
optimal alignment, and with it their edit
distance. The smaller the edit distance, the more
similar the two sequences are.
• Problem
Calculating the cost of any single alignment is
straightforward, but how many possible alignments
are there for two given sequences? The number
grows exponentially with sequence length, so
enumerating them all is not feasible.
• Solution
Use dynamic programming to break the original,
complicated problem into smaller, manageable
sub-problems.
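To get a feel for how many alignments there are, one can count them with a small recurrence. This sketch assumes that an alignment is a sequence of columns of the three kinds used in these slides (symbol/symbol, symbol/gap, gap/symbol) and that alignments differing only in the order of adjacent gap columns are counted separately; `count_alignments` is a hypothetical helper name.

```python
def count_alignments(m, n):
    """Number of alignments of two sequences of lengths m and n.

    The last column is (s_i, t_j), (s_i, -) or (-, t_j), so
    N(i, j) = N(i-1, j-1) + N(i-1, j) + N(i, j-1), with N(i, 0) = N(0, j) = 1.
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]
                            + counts[i - 1][j]
                            + counts[i][j - 1])
    return counts[m][n]

print(count_alignments(2, 2))  # → 13
print(count_alignments(7, 7))  # → 48639, for two length-7 sequences
```

Even for the two 7-letter sequences of the earlier example there are tens of thousands of candidates, which is why exhaustive enumeration is not an option for real sequences.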
9
Pairwise alignment - Dynamic Programming
• How to break the problem?
Consider two prefixes $s_{0..i}$ and $t_{0..j}$, and
assume we already know the optimal alignments of
the shorter prefixes ($s_{0..i-1}$ and $t_{0..j-1}$,
$s_{0..i-1}$ and $t_{0..j}$, $s_{0..i}$ and $t_{0..j-1}$).
Then the optimal alignment of $s_{0..i}$ and
$t_{0..j}$ must be an extension of one of them by
adding
  • $(s_i, t_j)$, or
  • $(s_i, -)$, or
  • $(-, t_j)$

So,
$$d_w(i, j) = d_w(s_{0..i}, t_{0..j}) = \min \begin{cases} d_w(s_{0..i-1}, t_{0..j-1}) + w(s_i, t_j) \\ d_w(s_{0..i-1}, t_{0..j}) + w(s_i, -) \\ d_w(s_{0..i}, t_{0..j-1}) + w(-, t_j) \end{cases}$$
10
Pairwise alignment - Global Sequence Alignment
• Needleman-Wunsch algorithm
Given two sequences $s_{0..m}$ and $t_{0..n}$, find the
optimal alignment (or edit distance) by dynamic
programming.
What do we still need? The boundary conditions:
$$d_w(s_{0..0}, t_{0..0}) = 0$$
$$d_w(s_{0..0}, t_{0..j}) = d_w(s_{0..0}, t_{0..j-1}) + w(-, t_j)$$
$$d_w(s_{0..i}, t_{0..0}) = d_w(s_{0..i-1}, t_{0..0}) + w(s_i, -)$$
The optimal alignment is the path found by
traceback.
Time and space complexity are $O(mn)$, i.e.
$O(n^2)$ for two sequences of length n.
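The recurrence and boundary conditions can be turned into a table-filling routine. This is a minimal sketch under the unit cost function (mismatch and gap both cost 1); the name `edit_distance` and its parameters are illustrative, and Python's 0-based strings mean that row i of the table corresponds to the prefix `s[:i]`.

```python
def edit_distance(s, t, mismatch_cost=1, gap_cost=1):
    """Fill the dynamic programming table d and return d[m][n]."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # boundary: s[:i] against the empty prefix
        d[i][0] = d[i - 1][0] + gap_cost
    for j in range(1, n + 1):          # boundary: empty prefix against t[:j]
        d[0][j] = d[0][j - 1] + gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = 0 if s[i - 1] == t[j - 1] else mismatch_cost
            d[i][j] = min(d[i - 1][j - 1] + w,       # add (s_i, t_j)
                          d[i - 1][j] + gap_cost,    # add (s_i, -)
                          d[i][j - 1] + gap_cost)    # add (-, t_j)
    return d[m][n]

# The pair from the edit distance example
print(edit_distance("AGCACTA", "CAGCCTA"))  # → 2
```

The result confirms that the alignment of cost 2 shown on the earlier slide is in fact optimal.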
11
Pairwise alignment - Global Sequence Alignment
• A simple example
s = AAGT
t = AT
(Figure: dynamic programming matrix with traceback
path; see the second graph on page 94 of the
textbook.)
Resulting alignment:
s  AAGT
t  -A-T
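A traceback can be sketched as follows. The unit costs and the tie-breaking order (diagonal first, then a gap in t, then a gap in s) are assumptions of this sketch; other tie-breaking rules can return a different alignment of equal cost. The function name `global_align` is illustrative.

```python
def global_align(s, t, mismatch_cost=1, gap_cost=1):
    """Return (distance, aligned_s, aligned_t) using table fill plus traceback."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + gap_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = 0 if s[i - 1] == t[j - 1] else mismatch_cost
            d[i][j] = min(d[i - 1][j - 1] + w,
                          d[i - 1][j] + gap_cost,
                          d[i][j - 1] + gap_cost)

    # Walk back from the bottom-right corner, collecting columns in reverse
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = 0 if s[i - 1] == t[j - 1] else mismatch_cost
            if d[i][j] == d[i - 1][j - 1] + w:        # column (s_i, t_j)
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + gap_cost:  # column (s_i, -)
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:                                            # column (-, t_j)
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return d[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))

print(global_align("AAGT", "AT"))  # → (2, 'AAGT', '-A-T')
```

With this tie-breaking order the sketch reproduces the alignment shown on the slide.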
12
Pairwise alignment - Local Sequence Alignment
• Motivation
  • Global alignment only considers the similarity
    of two entire sequences.
  • Biologically significant similarities are often
    present in only certain parts of DNA or protein
    sequences.

Consider two proteins carrying a functionally
related subunit, where most of each protein does
not contribute to this function. From the point
of view of global alignment, the two may look
very different.
How can we highlight the small regions of high
similarity?
  • In practice, local alignment is used much more
    commonly than global alignment.


13
Pairwise alignment - Local Sequence Alignment
• Smith-Waterman algorithm
The basic idea is to evaluate similarity instead
of distance; the rest should look familiar.
$w(a, b) > 0$ if a and b are similar, $w(a, b) < 0$
if a and b are not similar, and $w(a, -) < 0$,
$w(-, b) < 0$.
$$\mathrm{sim}(i, j) = \mathrm{sim}(s_{0..i}, t_{0..j}) = \max \begin{cases} 0 \\ \mathrm{sim}(s_{0..i-1}, t_{0..j-1}) + w(s_i, t_j) \\ \mathrm{sim}(s_{0..i-1}, t_{0..j}) + w(s_i, -) \\ \mathrm{sim}(s_{0..i}, t_{0..j-1}) + w(-, t_j) \end{cases}$$
$$\mathrm{sim}(i, 0) = 0 \ \text{for} \ 0 \le i \le m, \qquad \mathrm{sim}(0, j) = 0 \ \text{for} \ 0 \le j \le n$$
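The similarity recurrence can be sketched in code much like the distance version. The scores used here (match +1, mismatch -1, gap -1) are assumptions chosen only to satisfy the sign conditions above, and `local_alignment_score` is an illustrative name. The sketch returns only the best score, not the aligned region.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local similarity score: the largest entry anywhere in the table."""
    m, n = len(s), len(t)
    # First row and column stay 0: a local alignment may start anywhere
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            sim[i][j] = max(0,                         # start a fresh alignment here
                            sim[i - 1][j - 1] + w,     # add (s_i, t_j)
                            sim[i - 1][j] + gap,       # add (s_i, -)
                            sim[i][j - 1] + gap)       # add (-, t_j)
            best = max(best, sim[i][j])
    return best

# The shared region ACGT is found despite unrelated flanking residues
print(local_alignment_score("GGACGTCC", "TTACGTAA"))  # → 4
```

The `0` in the maximum is what distinguishes this from the global recurrence: a poorly scoring prefix is simply dropped instead of being carried along.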
14
Pairwise alignment - Local Sequence Alignment
• Example
s = TTGACACCCTCCCAATTGTA
t = ACCCCAGGCTTTACACAT
Global alignment result
s TTGACACCCTCC-CAATTGTA
t ACCCCAGGCTTTACACAT---
Local alignment result
s ---------TTGACACCCTCCCAATTGTA
t ACCCCAGGCTTTACACAT-----------
Why the difference? For local alignment:
  • The final score is the largest score in any
    vertex of the matrix
  • The optimal path may start and end anywhere in
    the matrix
  • There is no penalty for end gaps
15
Pairwise alignment - Local Sequence Alignment
• Example
Scoring function: $w(a, a) = 1$, $w(a, b) = 0$ for
$a \neq b$ (usually a negative number),
$w(a, -) = w(-, b) = -1$
16
Pairwise alignment - Other Heuristic Methods
• Motivation
Algorithms based on dynamic programming have a
time complexity of $O(n^2)$.
Today GenBank has approximately $10^9$ documented
sequences.
Faster methods are necessary for database search.
Heuristic methods aim for $O(n)$ run time, trading
away some precision and the guarantee of
optimality.
17
Pairwise alignment - Other Heuristic Methods
• FASTA
  • Consider exact matches between short substrings
    of s and t, i.e., all pairs of substrings with
    $s_{i..i+k} = t_{j..j+k}$ for a given parameter k.
  • If a significant number of matches is found,
    FASTA uses a dynamic programming method to
    compute the optimal alignment.
• Precision depends on k: a larger k results in
  fewer exact matches.
• In practice, FASTA misses very few cases of
  significant homology.
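The exact-match step can be illustrated with a small index over substrings. This is only a sketch of the seeding idea, not the FASTA program itself: the helper names are hypothetical, substrings are taken to have length exactly k, and grouping matches by diagonal (i - j) is one common way to spot regions where many matches line up.

```python
from collections import Counter, defaultdict

def kmer_matches(s, t, k):
    """Return all (i, j) with s[i:i+k] == t[j:j+k], ordered by i then j."""
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)      # positions of each length-k word in t
    return [(i, j)
            for i in range(len(s) - k + 1)
            for j in index.get(s[i:i + k], [])]

def diagonal_counts(matches):
    """Count matches per diagonal i - j; a crowded diagonal hints at similarity."""
    return Counter(i - j for i, j in matches)

matches = kmer_matches("ACGTAC", "TACGT", 3)
print(matches)                    # → [(0, 1), (1, 2), (3, 0)]
print(diagonal_counts(matches))   # diagonal -1 holds two matches, diagonal 3 one
```

Building the index takes time linear in the length of t, which is where the speed advantage over a full dynamic programming table comes from.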
18
Pairwise alignment - Other Heuristic Methods
• BLAST
  • Consider ungapped alignments (rather than exact
    matches) of substrings with a fixed length k.
  • Use a scoring function to measure similarity.

• Given a threshold S, BLAST reports all database
  entries that have a segment pair with the query
  scoring higher than S.
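The reporting rule can be sketched with a brute-force scan. This is a toy illustration of "score ungapped segment pairs of length k and report entries above S", not BLAST's actual procedure, which relies on word seeds, extension and substitution matrices. The function names, the +1/-1 scoring and the example database are all assumptions of the sketch.

```python
def segment_score(a, b, match=1, mismatch=-1):
    """Score two equal-length segments column by column, without gaps."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def best_segment_pair(query, subject, k):
    """Highest score over all pairs of length-k segments, or None if too short."""
    best = None
    for i in range(len(query) - k + 1):
        for j in range(len(subject) - k + 1):
            score = segment_score(query[i:i + k], subject[j:j + k])
            if best is None or score > best:
                best = score
    return best

def report_hits(query, database, k, threshold):
    """Names of database entries whose best segment pair scores above threshold."""
    hits = []
    for name, sequence in database.items():
        score = best_segment_pair(query, sequence, k)
        if score is not None and score > threshold:
            hits.append(name)
    return hits

database = {"seq1": "TTACGA", "seq2": "GGGGGG"}   # made-up entries
print(report_hits("ACGT", database, k=4, threshold=1))  # → ['seq1']
```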
19
Multiple alignment
• Major applications
  • Motif discovery
  • Phylogenetic inference

• Challenges
  • Optimal alignment is known to be NP-complete
  • Standard dynamic programming has time
    complexity $O(n^k)$, given k sequences of length n
  • Simple scores of similarity are no longer useful
  • Heuristic methods are often necessary, and a
    large number of methods have been proposed


20
Multiple alignment - Standard Dynamic Programming
Recall the dynamic programming recurrence for
pairwise alignment. What if we have three
sequences instead of two? Consider the prefixes
$s_{0..i}$, $t_{0..j}$, $u_{0..k}$: they have seven
possible combinations of shorter prefixes.
Now the edit distance becomes
$$d_w(i, j, k) = d_w(s_{0..i}, t_{0..j}, u_{0..k}) = \min \begin{cases} d_w(s_{0..i-1}, t_{0..j-1}, u_{0..k-1}) + w(s_i, t_j, u_k) \\ d_w(s_{0..i-1}, t_{0..j-1}, u_{0..k}) + w(s_i, t_j, -) \\ d_w(s_{0..i-1}, t_{0..j}, u_{0..k}) + w(s_i, -, -) \\ d_w(s_{0..i-1}, t_{0..j}, u_{0..k-1}) + w(s_i, -, u_k) \\ d_w(s_{0..i}, t_{0..j-1}, u_{0..k-1}) + w(-, t_j, u_k) \\ d_w(s_{0..i}, t_{0..j-1}, u_{0..k}) + w(-, t_j, -) \\ d_w(s_{0..i}, t_{0..j}, u_{0..k-1}) + w(-, -, u_k) \end{cases}$$
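The seven-way recurrence can be sketched directly as a three-dimensional table. The column cost used here, "number of distinct symbols minus one, with the gap counted as a symbol", is an assumption: it agrees with the three-way cost function given on the next slide for gap-free columns, but the slides do not say how gaps are charged. The names `column_cost` and `three_way_distance` are illustrative.

```python
from itertools import product

GAP = "-"

def column_cost(a, b, c):
    """0 if all three symbols agree, 1 if two agree, 2 if all differ."""
    return len({a, b, c}) - 1

def three_way_distance(s, t, u):
    """Minimal alignment cost of three sequences by dynamic programming."""
    m, n, p = len(s), len(t), len(u)
    # d[i][j][k] = minimal cost of aligning the prefixes s[:i], t[:j], u[:k]
    d = [[[0] * (p + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    # Seven moves: each sequence contributes a residue (1) or a gap (0),
    # excluding the all-gap column
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i, j, k in product(range(m + 1), range(n + 1), range(p + 1)):
        if i == j == k == 0:
            continue
        best = None
        for di, dj, dk in moves:
            if di > i or dj > j or dk > k:
                continue                     # move would leave the table
            a = s[i - 1] if di else GAP
            b = t[j - 1] if dj else GAP
            c = u[k - 1] if dk else GAP
            candidate = d[i - di][j - dj][k - dk] + column_cost(a, b, c)
            if best is None or candidate < best:
                best = candidate
        d[i][j][k] = best
    return d[m][n][p]

print(three_way_distance("AC", "AC", "AG"))  # → 1
```

The table has (m+1)(n+1)(p+1) cells and each cell looks at up to seven predecessors, which is the cubic growth in both time and space that makes this approach impractical beyond a few short sequences.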
21
Multiple alignment - Standard Dynamic Programming
We may redefine the cost function:
$$w(a, a, a) = 0, \qquad w(a, a, b) = 1 \ (a \neq b), \qquad w(a, b, c) = 2 \ (a \neq b,\ b \neq c,\ a \neq c)$$
Traceback is done in three-dimensional space.
This approach is only suitable for small problems.
Time complexity: $O(n^k)$. Space complexity:
$O(n^k)$. Suppose each vertex needs 4 bytes. Then
aligning 4 proteins of average size (300 amino
acids) needs
$300^4 \times 4 = 8.1 \times 10^9 \times 4 = 3.24 \times 10^{10}$
bytes, about 31,640 MB of memory.
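The memory estimate is plain arithmetic and easy to reproduce for other sizes. A small sketch, assuming a full table of $n^k$ vertices at a fixed number of bytes each; `dp_table_bytes` is a hypothetical helper name.

```python
def dp_table_bytes(length, num_sequences, bytes_per_vertex=4):
    """Memory for a full DP table over num_sequences sequences of equal length."""
    return length ** num_sequences * bytes_per_vertex

vertices = 300 ** 4
print(vertices)                 # → 8100000000, i.e. 8.1e9 vertices
print(dp_table_bytes(300, 4))   # → 32400000000 bytes, roughly 30 GB
print(dp_table_bytes(300, 2))   # → 360000 bytes for the pairwise case
```

Going from two sequences to four raises the requirement from a few hundred kilobytes to tens of gigabytes, which is the point the slide makes.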
22
Multiple alignment - Standard Dynamic Programming
• Example: Carrillo-Lipman method (MSA2.0)
http://stateslab.bioinformatics.med.umich.edu/ibc/msa.html
• Generates an optimal multiple sequence alignment
• The basic idea is that the optimal path is
  usually near the diagonal; if we can calculate
  some bounds for it, then we can ignore the other
  parts of the multi-dimensional matrix.
• Reduces space complexity more than time
  complexity
• Still strictly limits the number and length of
  sequences
Usually aligns fewer than 8 sequences
23
Multiple alignment - Progressive methods
• Also known as iterative methods
• Bains 1986, Feng and Doolittle 1987, Higgins
  1996, etc.
• Start with two sequences and align them to get
  a consensus sequence; more sequences are then
  aligned to it in some order
• Each alignment produces a new consensus
  sequence, until the global consensus sequence is
  reached
• The final alignment depends on the order!
• Very fast; can easily align hundreds of
  sequences
24
Multiple alignment - Progressive methods
A graphic demonstration using Clustal as an
example (flowchart steps, in order):
  • INPUT SEQUENCES
  • EXHAUSTIVE PAIRWISE ALIGNMENT (may use a fast
    algorithm)
  • UPGMA ANALYSIS (produces a similarity tree, or
    phylogenetic tree)
  • TAKE TWO MOST SIMILAR SEQUENCES FOR 2-WAY
    ALIGNMENT
  • OUTPUT A CONSENSUS WITH GAP INSERTION (once a
    gap, always a gap)
  • MORE SEQUENCES? YES: align the next sequence to
    the consensus. NO: go to the final alignment.
  • FINAL ALIGNMENT
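The first steps of this flow, exhaustive pairwise comparison followed by picking the two most similar sequences, can be sketched as below. This is not Clustal's implementation: the sketch uses the unit-cost edit distance as the pairwise measure and skips the UPGMA tree entirely, and the names `edit_distance` and `most_similar_pair` as well as the example sequences are assumptions.

```python
from itertools import combinations

def edit_distance(s, t):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    previous = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        current = [i]
        for j in range(1, len(t) + 1):
            w = 0 if s[i - 1] == t[j - 1] else 1
            current.append(min(previous[j - 1] + w,   # add (s_i, t_j)
                               previous[j] + 1,       # add (s_i, -)
                               current[j - 1] + 1))   # add (-, t_j)
        previous = current
    return previous[-1]

def most_similar_pair(sequences):
    """Indices of the pair with the smallest pairwise edit distance."""
    return min(combinations(range(len(sequences)), 2),
               key=lambda pair: edit_distance(sequences[pair[0]],
                                              sequences[pair[1]]))

sequences = ["ACGT", "ACGA", "TTTT"]   # made-up input
print(most_similar_pair(sequences))    # → (0, 1)
```

For k input sequences this step alone needs k(k-1)/2 pairwise comparisons, which is why the flowchart notes that a fast pairwise algorithm may be used here.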
25
(No Transcript)
26
Multiple alignment - Simultaneous methods
• Motivation
Progressive methods suffer from an order-dependency
problem.
Progressive methods easily get stuck at local
optima (like hill-climbing).
• DCA: Divide and Conquer Alignment
Recursively divide the original sequences until
the pieces are suitable for alignment by MSA, then
align them and concatenate the results.
27
(No Transcript)
28
Multiple alignment - Other methods
• Use known motifs in protein databases as
  heuristics
• Use multi-dimensional dot plots
• Sampling alignment
• Hidden Markov Models
29
Multiple alignment - Other methods
• Finally, eyes and hands are still helpful
You might see this alignment generated by a
program.
Improving it by hand is easy.
30
Review of multiple sequence alignment methods
Chan S.C., Wong A.K. and Chiu D.K. (1992) A
survey of multiple sequence alignments methods.
Bull. Math. Biol. 54:563-598.
McClure M.A., Vasi T.K. and Fitch W.M. (1994)
Comparative analysis of multiple protein-sequence
alignment methods. Mol. Biol. Evol. 11:571-592.
http://www.csu.edu.au/ci/vol04/mulali/mulali.html