Title: Multiple Sequence Alignment A survey of the various programs available and application of MSA in add
1Multiple Sequence Alignment A survey of the
various programs available and application of MSA
in addressing certain biological problems
2Sequence Comparison
- Aligning two sequences is the cornerstone of
Bioinformatics. - Sequence alignment is the basic step upon which
everything else built. - Sequence alignments are employed in
- predicting de novo secondary structure of
proteins, - The initial functional assignment
- knowledge-based tertiary structure predictions,
- Interpretation of data from genome sequencing
projects - Inference of phylogenetic trees and resolution of
ancestral relationships between species.
3Sequence Comparisons
- Homology Searches
- Look for homologous sequences in databases using
FASTA or BLAST program - Pattern Searches
- Used for searching short sequence patterns
- Multiple Sequence Alignment
- For aligning and comparing 3 or more related
sequences - Profile Analysis
- A profile is created from a multiple sequence
alignment
4MSA vs Pairwise
- PSA
- based on seq. similarity we can identify
unknown biological relationship - MSA
- Similar to PSA
- But also possible to identify conserved
sub-patterns based on known biological
relationship - High seq similarity functional structural
similarity (PSA) - Sequences with functional and structural
similarity can differ in sequences - PSA cannot detect this case
- Example Haemoglobin
5MSA vs Pairwise
- Structurally and functionally conserved molecules
can differ in sequence PSA cannot reveal
conserved patterns - Comparing of 2 sequences with high similarity
patterns detection lost due to high similarity - MSA useful in revealing critical patterns from
multiple related sequences
6Multiple Sequence Alignment
- Homology
- Homologous sequences, derive from common ancestor
- Inferred by sequence similarity
- MSA useful to demonstrate homology
- Weak similarity non-significant in pairwise
comparison - could be highly significant if same residues are
conserved - in other distantly related sequences
7Multiple Sequence Alignment
- Global or Local Alignments
- Substitution Matrices and weighting gaps
- Best alignment is one that represents an
evolutionary scenario - Mutational events in evolution considered in MSA
- Substitutions
- Insertions
- Deletions
- BLOSUM and PAM based on evolutionary distances
- Affine gap penalty model
- p a bL
- p a b blog(L)
- Scoring MSA
- Sum of Pairs score (SP) for columns
8http//www.people.virginia.edu/wrp/papers/ismb200
0.pdf
9Which MSA method?
- Global Methods
- Optimal Algorithms (MSA, MWT, MUSEQAL)
- Progressive (MULTALIN, PILEUP, CLUSTAL, MULTAL,
AMULT, DFALIGN, T-Coffee, MAP, PRRP, AMPS) - Local methods
- PIMA, DIALIGN, PRALINE, POA, MACAW, BlockMaker,
Iteralign - Combined (GENALIGN, ASSEMBLE, DCA)
- Statistical (HMMT, SAGA, SAM, Match Box)
- Parsimony (MALIGN, TreeAlign)
10Progressive algorithm
- Alignment is only an approximate solution
(heuristic based) - Simplest and effective ways of MSA
- Less time and less memory
- Sequences are added one by one to the multiple
alignment based on a pre-computed dendogram - Sequence addition is by PSA algorithm
- Disadvantage
- Once a sequence is aligned, cant be changed even
if it conflicts with later sequence additions - Examples
- PileUp, ClustalW, MultAlign, T-Coffee, etc
11Exact Algorithms
- Useful in cases where sequences are extremely
divergent - Simultaneously aligns all sequences
- Disadvantage
- Need to generalize Needleman-Wunsch algorithm
- Only for a maximum of 3 sequences
- Causes exponential increase in time and memory as
number of sequences increase - Examples
- MSA (up to 10 closely related sequences)
12Iterative algorithms
- Iterate over a existing sub-optimal solution,
modifying it at each step, until a convergence
point is met. - Examples
- SAGA, AMPS, Praline, IterAlign
13Consistency-based Algorithms
- Given a set of sequences, an optimal MSA is one
that agrees most with all possible optimal
pairwise alignments - Do not depend on specific subs. Matrix
- Score associated with alignment of 2 residues
depends on their indexes (position within protein
sequence) - Most consistent are often the ones close to truth
- Examples
- T-Coffee, DiAlign
14Multiple Alignment by profile HMM training
- Given n sequences , consider the following
cases - If the profile HMM is known, the following
procedure can be applied - Align each sequence S(i) to the profile
separately. - Accumulate the obtained alignments to a multiple
alignment. -
- If the profile HMM is not known, one can use
the following technique in order to obtain an HMM
profile from the given sequences - Choose a length L for the profile HMM and
initialize the transition and emission
probabilities. - Train the model using the Baum-Welch algorithm,
on all the training sequences. - Obtain the multiple alignment from the resulting
profile HMM, as in the previous case. - http//www.math.tau.ac.il/rshamir/algmb/00/scribe
00/html/lec06/node11.html
15Schematic of T-Coffee
16Testing the methods
- BAliBASE benchmark
- Correct Alignments
- Core Blocks of Conserved Motifs
- Typical Hard Problem Sets
17BAliBASE reference sets, showing the number of
alignments in each set
18BAliBASE - Reference set 1
19Scores based on core blocks(V1)
Scores based on full-length alignment(V1)
20Median score for Reference set 2
21Median score for aligning subgroups of sequences
in Reference set 3
22Median score for N/C terminal extensions
Reference set 4
23Median score for internal insertions Reference
set 5
24Results
- Core blocks aligned well over long sequences
all programs - Due to different patterns of conservation
- Alignment unreliable in the twilight zone
- Iterative method did well under distinct
alignment conditions, but not in the presence of
an orphan sequence - Global algorithms accurate and reliable for
- equidistant sequences
- divergent families of sequences and
- alignment of orphan sequence with a family
- Local algorithm DiAlign
- Best for N/C terminal extensions
- Internal insertions
25ClustalW vs DiAlign vs T-Coffee vs POA
- ClustalW global progressive method
- Guide tree created
- Successive pairwise alignment
- Poa progressive using partially ordered graphs
- No tree to guide alignment of sequences
- 2 most similar seqs are aligned and others are
added to this one profile in a stepwise fashion - DiAlign local algorithm
- Aligns whole segments
- PSA performed and ungapped alignments used
- T-Coffee progressive global local
- Performs PSA twice, once global (ClustalW) and
local (LAlign) - Results combined into primary library, then
extension step - Progressive alignment using info from library
26(No Transcript)
27Results of BAliBASE testing
28CPU time consumed by each program to align sets
of increasingly long sequences
29Results
- Poor performance by all programs with increase in
evolutionary distance - Increase in seq length better alignment
- T-Coffee best for low moderate evolutionary
distances - DiAlign good for high evolutionary distances
30What Can We Do With MSAs?
- Motif / pattern identification
- Structural modeling
- Phylogenetic analysis
- Molecular evolutionary analyses
- Identification of conserved genomic regions
across species
31Phylogenetic Analysis
- Visual representation of a MSA as a tree of
relationships - Many methods
- Distance builds tree by clustering sequences
according to their similarity - Parsimony finds tree that minimizes the number
of changes required - Maximum Likelihood finds tree that maximizes
likelihood given parameters - Bayesian Markov chain Monte Carlo simulation to
calculate posterior probabilities of trees
32Determining Relationships
Phylogram
Cladogram
(Baldauf 2003)
33Rooted vs Unrooted Trees
Rooted
Unrooted
(Baldauf 2003)
34Many taxa (567), Few genes (3)
(Soltis et al. 1999)
35Few taxa (13), Many genes (61)
(Goremykin et al. 2003)
36Many taxa, Many genes
Coming Soon
37Detecting Gene Families
(Baldauf 2003)
38Human, Gorilla, Chimp Glycophorin gene family
(Wang et al. 2003)
39Bootstrapping
(Baldauf 2003)
40Molecular Evolution Background
- Types of changes in protein-coding DNA
- Silent (synonymous)
- Replacement (nonsynonymous)
- Based on degeneracy of the genetic code
- K frequency of change between two sequences
- Ka of replacement changes per replacement
site - Driven by natural selection
- Reflect level of protein conservation
- Ks of silent changes per silent site
- Neutral, not affected by selective forces
- Used to estimate the neutral mutation rate
41Estimation of Substitution Rates
- Absolute Rate
- R ½ K / T, where T time since last common
ancestor - Rates are directly comparable
- Relative Rate
- Two species of interest and one outgroup
- Compare K from species 1 and outgroup against K
from species 2 and outgroup
42Relative Rate
Rat
K Rat-Kan K Hum-Kan
Human
K Rat-Kan - K Hum-Kan
-
Kangaroo
43Evaluation of Selective Pressures, The Ka / Ks
Ratio
- Ka / Ks gt 1 indicates positive selection
- Ka / Ks 1 indicates no selection
- Ka / Ks lt 1 indicates purifying selection
44280 homologs, Macaque-Human
(Wang et al. 2003)
45Conserved Genomic Regions
- Need complete genomes or homologous genomic
regions - Identify exons from distantly related species
- Identify regulatory elements from more closely
related species
46(Thomas et al. 2003)
47References
- Biological Sequence Analysis R. Durbin, S.
Eddy, A. Krogh G. Mitchison - Bioinformatics Sequence, structure and databanks
D. Higgins W. Taylor - Recent progress in multiple sequence alignment a
survey. - http//www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd
RetrievedbPubMedlist_uids11966409doptAbstra
ct - Quality assessment of multiple alignment
programs.http//www.ncbi.nlm.nih.gov/entrez/query
.fcgi?cmdRetrievedbPubMedlist_uids12354624do
ptAbstract - Multiple sequence alignment with the Clustal
series of programs.http//www.ncbi.nlm.nih.gov/en
trez/query.fcgi?cmdRetrievedbPubMedlist_uids1
2824352doptAbstract - MAVID multiple alignment serverhttp//www.ncbi.nl
m.nih.gov/entrez/query.fcgi?cmdRetrievedbPubMed
list_uids12824358doptAbstract - Multiple alignment of sequences and structures.
http//www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd
RetrievedbPubMedlist_uids12510583doptAbstrac
t - Fast algorithms for large-scale genome alignment
and comparison.http//www.ncbi.nlm.nih.gov/entrez
/query.fcgi?cmdRetrievedbPubMedlist_uids12034
836doptAbstract