Multiple Sequence Alignment A survey of the various programs available and application of MSA in add - PowerPoint PPT Presentation

1 / 47

About This Presentation

Title:

Multiple Sequence Alignment A survey of the various programs available and application of MSA in add

Description:

Optimal Algorithms (MSA, MWT, MUSEQAL) ... Iterate over a existing sub-optimal solution, modifying it at each step, until a ... – PowerPoint PPT presentation

Number of Views:86

Avg rating:3.0/5.0

Slides: 48

Provided by: kann2

Category:

more less

Transcript and Presenter's Notes

Title: Multiple Sequence Alignment A survey of the various programs available and application of MSA in add

1
Multiple Sequence Alignment A survey of the
various programs available and application of MSA
in addressing certain biological problems

Jeff Mower
Kiran Annaiah

2
Sequence Comparison

Aligning two sequences is the cornerstone of
Bioinformatics.
Sequence alignment is the basic step upon which
everything else built.
Sequence alignments are employed in
predicting de novo secondary structure of
proteins,
The initial functional assignment
knowledge-based tertiary structure predictions,
Interpretation of data from genome sequencing
projects
Inference of phylogenetic trees and resolution of
ancestral relationships between species.

3
Sequence Comparisons

Homology Searches
Look for homologous sequences in databases using
FASTA or BLAST program
Pattern Searches
Used for searching short sequence patterns
Multiple Sequence Alignment
For aligning and comparing 3 or more related
sequences
Profile Analysis
A profile is created from a multiple sequence
alignment

4
MSA vs Pairwise

PSA
based on seq. similarity we can identify
unknown biological relationship
MSA
Similar to PSA
But also possible to identify conserved
sub-patterns based on known biological
relationship
High seq similarity functional structural
similarity (PSA)
Sequences with functional and structural
similarity can differ in sequences
PSA cannot detect this case
Example Haemoglobin

5
MSA vs Pairwise

Structurally and functionally conserved molecules
can differ in sequence PSA cannot reveal
conserved patterns
Comparing of 2 sequences with high similarity
patterns detection lost due to high similarity
MSA useful in revealing critical patterns from
multiple related sequences

6
Multiple Sequence Alignment

Homology
Homologous sequences, derive from common ancestor
Inferred by sequence similarity
MSA useful to demonstrate homology
Weak similarity non-significant in pairwise
comparison
could be highly significant if same residues are
conserved
in other distantly related sequences

7
Multiple Sequence Alignment

Global or Local Alignments
Substitution Matrices and weighting gaps
Best alignment is one that represents an
evolutionary scenario
Mutational events in evolution considered in MSA
Substitutions
Insertions
Deletions
BLOSUM and PAM based on evolutionary distances
Affine gap penalty model
p a bL
p a b blog(L)
Scoring MSA
Sum of Pairs score (SP) for columns

8
http//www.people.virginia.edu/wrp/papers/ismb200
0.pdf
9
Which MSA method?

Global Methods
Optimal Algorithms (MSA, MWT, MUSEQAL)
Progressive (MULTALIN, PILEUP, CLUSTAL, MULTAL,
AMULT, DFALIGN, T-Coffee, MAP, PRRP, AMPS)
Local methods
PIMA, DIALIGN, PRALINE, POA, MACAW, BlockMaker,
Iteralign
Combined (GENALIGN, ASSEMBLE, DCA)
Statistical (HMMT, SAGA, SAM, Match Box)
Parsimony (MALIGN, TreeAlign)

10
Progressive algorithm

Alignment is only an approximate solution
(heuristic based)
Simplest and effective ways of MSA
Less time and less memory
Sequences are added one by one to the multiple
alignment based on a pre-computed dendogram
Sequence addition is by PSA algorithm
Disadvantage
Once a sequence is aligned, cant be changed even
if it conflicts with later sequence additions
Examples
PileUp, ClustalW, MultAlign, T-Coffee, etc

11
Exact Algorithms

Useful in cases where sequences are extremely
divergent
Simultaneously aligns all sequences
Disadvantage
Need to generalize Needleman-Wunsch algorithm
Only for a maximum of 3 sequences
Causes exponential increase in time and memory as
number of sequences increase
Examples
MSA (up to 10 closely related sequences)

12
Iterative algorithms

Iterate over a existing sub-optimal solution,
modifying it at each step, until a convergence
point is met.
Examples
SAGA, AMPS, Praline, IterAlign

13
Consistency-based Algorithms

Given a set of sequences, an optimal MSA is one
that agrees most with all possible optimal
pairwise alignments
Do not depend on specific subs. Matrix
Score associated with alignment of 2 residues
depends on their indexes (position within protein
sequence)
Most consistent are often the ones close to truth
Examples
T-Coffee, DiAlign

14
Multiple Alignment by profile HMM training

Given n sequences , consider the following
cases
If the profile HMM is known, the following
procedure can be applied
Align each sequence S(i) to the profile
separately.
Accumulate the obtained alignments to a multiple
alignment.
If the profile HMM is not known, one can use
the following technique in order to obtain an HMM
profile from the given sequences
Choose a length L for the profile HMM and
initialize the transition and emission
probabilities.
Train the model using the Baum-Welch algorithm,
on all the training sequences.
Obtain the multiple alignment from the resulting
profile HMM, as in the previous case.
http//www.math.tau.ac.il/rshamir/algmb/00/scribe
00/html/lec06/node11.html

15
Schematic of T-Coffee
16
Testing the methods

BAliBASE benchmark
Correct Alignments
Core Blocks of Conserved Motifs
Typical Hard Problem Sets

17
BAliBASE reference sets, showing the number of
alignments in each set
18
BAliBASE - Reference set 1
19
Scores based on core blocks(V1)
Scores based on full-length alignment(V1)
20
Median score for Reference set 2
21
Median score for aligning subgroups of sequences
in Reference set 3
22
Median score for N/C terminal extensions
Reference set 4
23
Median score for internal insertions Reference
set 5
24
Results

Core blocks aligned well over long sequences
all programs
Due to different patterns of conservation
Alignment unreliable in the twilight zone
Iterative method did well under distinct
alignment conditions, but not in the presence of
an orphan sequence
Global algorithms accurate and reliable for
equidistant sequences
divergent families of sequences and
alignment of orphan sequence with a family
Local algorithm DiAlign
Best for N/C terminal extensions
Internal insertions

25
ClustalW vs DiAlign vs T-Coffee vs POA

ClustalW global progressive method
Guide tree created
Successive pairwise alignment
Poa progressive using partially ordered graphs
No tree to guide alignment of sequences
2 most similar seqs are aligned and others are
added to this one profile in a stepwise fashion
DiAlign local algorithm
Aligns whole segments
PSA performed and ungapped alignments used
T-Coffee progressive global local
Performs PSA twice, once global (ClustalW) and
local (LAlign)
Results combined into primary library, then
extension step
Progressive alignment using info from library

26
(No Transcript)
27
Results of BAliBASE testing
28
CPU time consumed by each program to align sets
of increasingly long sequences
29
Results

Poor performance by all programs with increase in
evolutionary distance
Increase in seq length better alignment
T-Coffee best for low moderate evolutionary
distances
DiAlign good for high evolutionary distances

30
What Can We Do With MSAs?

Motif / pattern identification
Structural modeling
Phylogenetic analysis
Molecular evolutionary analyses
Identification of conserved genomic regions
across species

31
Phylogenetic Analysis

Visual representation of a MSA as a tree of
relationships
Many methods
Distance builds tree by clustering sequences
according to their similarity
Parsimony finds tree that minimizes the number
of changes required
Maximum Likelihood finds tree that maximizes
likelihood given parameters
Bayesian Markov chain Monte Carlo simulation to
calculate posterior probabilities of trees

32
Determining Relationships
Phylogram
Cladogram
(Baldauf 2003)
33
Rooted vs Unrooted Trees
Rooted
Unrooted
(Baldauf 2003)
34
Many taxa (567), Few genes (3)
(Soltis et al. 1999)
35
Few taxa (13), Many genes (61)
(Goremykin et al. 2003)
36
Many taxa, Many genes
Coming Soon
37
Detecting Gene Families
(Baldauf 2003)
38
Human, Gorilla, Chimp Glycophorin gene family
(Wang et al. 2003)
39
Bootstrapping
(Baldauf 2003)
40
Molecular Evolution Background

Types of changes in protein-coding DNA
Silent (synonymous)
Replacement (nonsynonymous)
Based on degeneracy of the genetic code
K frequency of change between two sequences
Ka of replacement changes per replacement
site
Driven by natural selection
Reflect level of protein conservation
Ks of silent changes per silent site
Neutral, not affected by selective forces
Used to estimate the neutral mutation rate

41
Estimation of Substitution Rates

Absolute Rate
R ½ K / T, where T time since last common
ancestor
Rates are directly comparable
Relative Rate
Two species of interest and one outgroup
Compare K from species 1 and outgroup against K
from species 2 and outgroup

42
Relative Rate
Rat
K Rat-Kan K Hum-Kan

Human
K Rat-Kan - K Hum-Kan
-
Kangaroo
43
Evaluation of Selective Pressures, The Ka / Ks
Ratio

Ka / Ks gt 1 indicates positive selection
Ka / Ks 1 indicates no selection
Ka / Ks lt 1 indicates purifying selection

44
280 homologs, Macaque-Human
(Wang et al. 2003)
45
Conserved Genomic Regions

Need complete genomes or homologous genomic
regions
Identify exons from distantly related species
Identify regulatory elements from more closely
related species

46
(Thomas et al. 2003)
47
References

Biological Sequence Analysis R. Durbin, S.
Eddy, A. Krogh G. Mitchison
Bioinformatics Sequence, structure and databanks
D. Higgins W. Taylor
Recent progress in multiple sequence alignment a
survey.
http//www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd
RetrievedbPubMedlist_uids11966409doptAbstra
ct
Quality assessment of multiple alignment
programs.http//www.ncbi.nlm.nih.gov/entrez/query
.fcgi?cmdRetrievedbPubMedlist_uids12354624do
ptAbstract
Multiple sequence alignment with the Clustal
series of programs.http//www.ncbi.nlm.nih.gov/en
trez/query.fcgi?cmdRetrievedbPubMedlist_uids1
2824352doptAbstract
MAVID multiple alignment serverhttp//www.ncbi.nl
m.nih.gov/entrez/query.fcgi?cmdRetrievedbPubMed
list_uids12824358doptAbstract
Multiple alignment of sequences and structures.
http//www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd
RetrievedbPubMedlist_uids12510583doptAbstrac
t
Fast algorithms for large-scale genome alignment
and comparison.http//www.ncbi.nlm.nih.gov/entrez
/query.fcgi?cmdRetrievedbPubMedlist_uids12034
836doptAbstract