Multiple Sequence Alignment: A survey of the various programs available and application of MSA in addressing certain biological problems

Provided by: kann2
1
Multiple Sequence Alignment: A survey of the
various programs available and application of MSA
in addressing certain biological problems
  • Jeff Mower
  • Kiran Annaiah

2
Sequence Comparison
  • Aligning two sequences is the cornerstone of
    bioinformatics.
  • Sequence alignment is the basic step upon which
    everything else is built.
  • Sequence alignments are employed in:
  • De novo prediction of protein secondary
    structure
  • Initial functional assignment
  • Knowledge-based tertiary structure prediction
  • Interpretation of data from genome sequencing
    projects
  • Inference of phylogenetic trees and resolution of
    ancestral relationships between species

3
Sequence Comparisons
  • Homology searches:
  • Look for homologous sequences in databases using
    the FASTA or BLAST programs
  • Pattern searches:
  • Used for searching for short sequence patterns
  • Multiple sequence alignment:
  • For aligning and comparing 3 or more related
    sequences
  • Profile analysis:
  • A profile is created from a multiple sequence
    alignment
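
As an illustration of how a profile can be derived from a multiple sequence alignment, the sketch below computes per-column residue frequencies. It is a minimal example, not the method of any particular profile program: the function name `build_profile` and the toy alignment are assumptions made for this illustration, and real profiles usually add sequence weighting and pseudocounts.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {symbol: frequency} dict per alignment column.

    `alignment` is a list of equal-length strings; '-' marks a gap and is
    counted like any other symbol in this simplified sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        profile.append({sym: n / len(alignment) for sym, n in counts.items()})
    return profile


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ACG-T", "ACGAT", "TCGAT", "ACGAT"]
profile = build_profile(toy_alignment)

# The most frequent symbol in each column gives a simple consensus.
consensus = "".join(max(column, key=column.get) for column in profile)
print(consensus)
```

Each column of the profile records how often every residue occurs at that position, which is what allows a new sequence to be scored against the whole family instead of against a single member.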

4
MSA vs Pairwise
  • PSA (pairwise sequence alignment):
  • Sequence similarity can be used to identify an
    unknown biological relationship
  • MSA:
  • Similar to PSA
  • Can also identify conserved sub-patterns based on
    a known biological relationship
  • High sequence similarity implies functional and
    structural similarity (detectable by PSA)
  • Sequences with functional and structural
    similarity can nevertheless differ in sequence
  • PSA cannot detect this case
  • Example: haemoglobin

5
MSA vs Pairwise
  • Structurally and functionally conserved molecules
    can differ in sequence, so PSA cannot reveal
    their conserved patterns
  • When two highly similar sequences are compared,
    conserved patterns are lost in the overall
    similarity
  • MSA is useful for revealing critical patterns
    from multiple related sequences

6
Multiple Sequence Alignment
  • Homology:
  • Homologous sequences derive from a common ancestor
  • Homology is inferred from sequence similarity
  • MSA is useful for demonstrating homology
  • A weak similarity that is non-significant in a
    pairwise comparison could be highly significant
    if the same residues are conserved in other
    distantly related sequences

7
Multiple Sequence Alignment
  • Global or Local Alignments
  • Substitution Matrices and weighting gaps
  • Best alignment is one that represents an
    evolutionary scenario
  • Mutational events in evolution considered in MSA
  • Substitutions
  • Insertions
  • Deletions
  • BLOSUM and PAM matrices are based on evolutionary
    distances
  • Affine gap penalty model:
  • p = a + bL (a = gap-opening penalty, b =
    gap-extension penalty, L = gap length)
  • p = a + b log(L)
  • Scoring MSA
  • Sum of Pairs score (SP) for columns
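
The sum-of-pairs score and the affine gap penalty can be sketched as follows. This is a simplified illustration under stated assumptions: the match, mismatch and gap values and the default penalties are arbitrary placeholders chosen for the example, not values taken from the slides, and the column score charges a flat per-position gap cost instead of the affine penalty. A real scorer would look up residue pairs in a substitution matrix such as BLOSUM or PAM.

```python
from itertools import combinations


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty p = a + b*L for one gap of L positions.

    gap_open (a) and gap_extend (b) are placeholder defaults.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same column (toy scoring scheme)."""
    if x == "-" and y == "-":
        return 0  # two gaps aligned to each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Sum the pairwise scores of every pair of rows, column by column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(x, y) for x, y in combinations(column, 2))
    return total


# Hypothetical toy alignment, for illustration only.
print(sum_of_pairs(["AC-", "ACG", "TCG"]))
print(affine_gap_penalty(3))
```

Because every pair of rows contributes to every column, the cost of evaluating one alignment grows quadratically with the number of sequences.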

8
http://www.people.virginia.edu/~wrp/papers/ismb2000.pdf
9
Which MSA method?
  • Global Methods
  • Optimal Algorithms (MSA, MWT, MUSEQAL)
  • Progressive (MULTALIN, PILEUP, CLUSTAL, MULTAL,
    AMULT, DFALIGN, T-Coffee, MAP, PRRP, AMPS)
  • Local methods
  • PIMA, DIALIGN, PRALINE, POA, MACAW, BlockMaker,
    Iteralign
  • Combined (GENALIGN, ASSEMBLE, DCA)
  • Statistical (HMMT, SAGA, SAM, Match Box)
  • Parsimony (MALIGN, TreeAlign)

10
Progressive algorithm
  • The alignment is only an approximate solution
    (heuristic-based)
  • One of the simplest and most effective approaches
    to MSA
  • Requires less time and less memory
  • Sequences are added one by one to the multiple
    alignment, following a pre-computed dendrogram
  • Each sequence is added using a PSA algorithm
  • Disadvantage:
  • Once a sequence is aligned, its alignment cannot
    be changed, even if it conflicts with sequences
    added later
  • Examples:
  • PileUp, ClustalW, MultAlign, T-Coffee, etc.

11
Exact Algorithms
  • Useful in cases where sequences are extremely
    divergent
  • Align all sequences simultaneously
  • Disadvantages:
  • Require generalizing the Needleman-Wunsch
    algorithm to more than two sequences
  • Direct generalization is practical only for a
    maximum of 3 sequences
  • Time and memory increase exponentially as the
    number of sequences increases
  • Examples:
  • MSA (up to 10 closely related sequences)
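
To make the cost of exact alignment concrete, here is a minimal sketch of the pairwise Needleman-Wunsch recurrence, together with a helper that counts how many table cells a direct generalization to N sequences would need. The scoring values are placeholder assumptions (a linear gap cost, no substitution matrix), and `dp_cells` is a hypothetical helper introduced only for this illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


def dp_cells(lengths):
    """Number of cells in the dynamic-programming table for these lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


print(needleman_wunsch_score("ACGT", "AGT"))
for n_sequences in (2, 3, 5):
    print(n_sequences, dp_cells([100] * n_sequences))
```

For N sequences the table has one dimension per sequence, and each cell depends on up to 2^N - 1 neighbouring cells, which is why both time and memory grow exponentially with the number of sequences.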

12
Iterative algorithms
  • Iterate over an existing sub-optimal solution,
    modifying it at each step, until it converges.
  • Examples:
  • SAGA, AMPS, Praline, IterAlign

13
Consistency-based Algorithms
  • Given a set of sequences, an optimal MSA is the
    one that agrees most with all possible optimal
    pairwise alignments
  • Do not depend on a specific substitution matrix
  • The score for aligning two residues depends on
    their indices (positions within the protein
    sequences)
  • The most consistent alignments are often the ones
    closest to the truth
  • Examples:
  • T-Coffee, DiAlign

14
Multiple Alignment by profile HMM training
  • Given n sequences S(1), ..., S(n), consider the
    following cases:
  • If the profile HMM is known, the following
    procedure can be applied:
  • Align each sequence S(i) to the profile
    separately.
  • Accumulate the obtained alignments into a multiple
    alignment.
  • If the profile HMM is not known, the following
    technique can be used to obtain a profile HMM
    from the given sequences:
  • Choose a length L for the profile HMM and
    initialize the transition and emission
    probabilities.
  • Train the model on all the training sequences
    using the Baum-Welch algorithm.
  • Obtain the multiple alignment from the resulting
    profile HMM, as in the previous case.
  • http://www.math.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec06/node11.html

15
Schematic of T-Coffee
16
Testing the methods
  • BAliBASE benchmark
  • Correct Alignments
  • Core Blocks of Conserved Motifs
  • Typical Hard Problem Sets

17
BAliBASE reference sets, showing the number of
alignments in each set
18
BAliBASE - Reference set 1
19
Scores based on core blocks(V1)
Scores based on full-length alignment(V1)
20
Median score for Reference set 2
21
Median score for aligning subgroups of sequences
in Reference set 3
22
Median score for N/C terminal extensions
Reference set 4
23
Median score for internal insertions Reference
set 5
24
Results
  • Core blocks aligned well over long sequences
    (all programs)
  • Due to different patterns of conservation
  • Alignment is unreliable in the twilight zone
  • The iterative method did well under distinct
    alignment conditions, but not in the presence of
    an orphan sequence
  • Global algorithms are accurate and reliable for:
  • Equidistant sequences
  • Divergent families of sequences
  • Alignment of an orphan sequence with a family
  • Local algorithm (DiAlign) is best for:
  • N/C terminal extensions
  • Internal insertions

25
ClustalW vs DiAlign vs T-Coffee vs POA
  • ClustalW: global progressive method
  • Guide tree created
  • Successive pairwise alignment
  • POA: progressive, using partially ordered graphs
  • No tree to guide alignment of sequences
  • The 2 most similar sequences are aligned, and the
    others are added to this one profile in a
    stepwise fashion
  • DiAlign: local algorithm
  • Aligns whole segments
  • PSA performed and ungapped alignments used
  • T-Coffee: progressive, global and local
  • Performs PSA twice, once global (ClustalW) and
    once local (LAlign)
  • Results combined into a primary library, then an
    extension step
  • Progressive alignment using information from the
    library
26
(No Transcript)
27
Results of BAliBASE testing
28
CPU time consumed by each program to align sets
of increasingly long sequences
29
Results
  • All programs perform poorly as evolutionary
    distance increases
  • Increased sequence length gives better alignment
  • T-Coffee is best for low to moderate evolutionary
    distances
  • DiAlign is good for high evolutionary distances

30
What Can We Do With MSAs?
  • Motif / pattern identification
  • Structural modeling
  • Phylogenetic analysis
  • Molecular evolutionary analyses
  • Identification of conserved genomic regions
    across species

31
Phylogenetic Analysis
  • Visual representation of an MSA as a tree of
    relationships
  • Many methods:
  • Distance: builds a tree by clustering sequences
    according to their similarity
  • Parsimony: finds the tree that minimizes the
    number of changes required
  • Maximum likelihood: finds the tree that maximizes
    the likelihood given the parameters
  • Bayesian: Markov chain Monte Carlo simulation to
    calculate posterior probabilities of trees
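
The distance approach can be illustrated with a small sketch that computes pairwise p-distances from an alignment and clusters the sequences by average linkage (UPGMA). This is one possible distance method, chosen here only for illustration; the slides do not name a specific one. The function names and the toy sequences are assumptions, and branch lengths and multiple-hit corrections are omitted.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, alignment):
    """Cluster sequences by average pairwise distance.

    Returns the tree topology as nested tuples of names (no branch lengths).
    """
    # Each cluster key is a subtree; the value lists its member row indices.
    clusters = {name: [i] for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): p_distance(alignment[i], alignment[j])
        for i, j in combinations(range(len(names)), 2)
    }

    def average_distance(c1, c2):
        members1, members2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((i, j))] for i in members1 for j in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merged = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = merged
    return next(iter(clusters))


# Hypothetical toy alignment, for illustration only.
names = ["human", "chimp", "mouse"]
alignment = ["ACGTACGT", "ACGTACGA", "TCGAAGGA"]
tree = upgma(names, alignment)
print(tree)
```

The two most similar sequences are joined first, and the remaining sequence attaches to that pair, mirroring the clustering-by-similarity idea of distance methods.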

32
Determining Relationships
Phylogram
Cladogram
(Baldauf 2003)
33
Rooted vs Unrooted Trees
Rooted
Unrooted
(Baldauf 2003)
34
Many taxa (567), Few genes (3)
(Soltis et al. 1999)
35
Few taxa (13), Many genes (61)
(Goremykin et al. 2003)
36
Many taxa, Many genes
Coming Soon
37
Detecting Gene Families
(Baldauf 2003)
38
Human, Gorilla, Chimp Glycophorin gene family
(Wang et al. 2003)
39
Bootstrapping
(Baldauf 2003)
40
Molecular Evolution Background
  • Types of changes in protein-coding DNA:
  • Silent (synonymous)
  • Replacement (nonsynonymous)
  • Based on the degeneracy of the genetic code
  • K = frequency of change between two sequences
  • Ka = number of replacement changes per
    replacement site
  • Driven by natural selection
  • Reflects the level of protein conservation
  • Ks = number of silent changes per silent site
  • Neutral, not affected by selective forces
  • Used to estimate the neutral mutation rate
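
The distinction between silent and replacement changes can be shown with a short sketch that compares two codon-aligned coding sequences using the standard genetic code. It only counts differing codons of each type; it is a deliberate simplification and does not compute Ka or Ks themselves, which additionally require counting silent and replacement sites and correcting for multiple substitutions. The function name and the toy sequences are assumptions made for this example.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def count_changes(seq1, seq2):
    """Return (silent, replacement) counts of differing codons.

    Both sequences must be in frame and aligned codon by codon. Codons that
    contain gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    silent = replacement = 0
    for pos in range(0, len(seq1), 3):
        codon1, codon2 = seq1[pos:pos + 3], seq2[pos:pos + 3]
        if codon1 == codon2:
            continue
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            silent += 1       # same amino acid: synonymous change
        else:
            replacement += 1  # different amino acid: nonsynonymous change
    return silent, replacement


# Hypothetical toy sequences: Met-Leu-Glu versus Met-Leu-Asp.
print(count_changes("ATGCTTGAA", "ATGCTCGAT"))
```

The CTT to CTC change leaves leucine unchanged and is silent, whereas GAA to GAT swaps glutamate for aspartate and is a replacement.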

41
Estimation of Substitution Rates
  • Absolute rate:
  • R = K / (2T), where T = time since the last common
    ancestor
  • Rates are directly comparable
  • Relative rate:
  • Two species of interest and one outgroup
  • Compare K between species 1 and the outgroup
    against K between species 2 and the outgroup
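
Both rate estimates reduce to simple arithmetic, sketched below. The factor of 2 in the absolute rate reflects that changes accumulate along both lineages since the common ancestor. The function names and all numeric inputs are hypothetical values chosen only to show the calculation.

```python
def absolute_rate(k, divergence_time):
    """Absolute substitution rate R = K / (2T).

    k is the number of substitutions per site between two sequences and
    divergence_time is T, the time since their last common ancestor.
    """
    if divergence_time <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2 * divergence_time)


def relative_rate_difference(k_species1_outgroup, k_species2_outgroup):
    """Difference in distance to the outgroup between the two species.

    A value near zero suggests similar rates in both lineages; a positive
    value suggests species 1 has changed more since the split.
    """
    return k_species1_outgroup - k_species2_outgroup


# Hypothetical inputs: K = 0.2 substitutions per site, T = 80 million years.
print(absolute_rate(0.2, 80e6))
# Hypothetical distances of two species to a shared outgroup.
print(relative_rate_difference(0.5, 0.3))
```

The relative-rate comparison needs no divergence time, which is its main advantage when fossil dates are uncertain.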

42
Relative Rate
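[Figure: relative-rate test with Rat and Human as the species of interest and Kangaroo as the outgroup, comparing K(Rat-Kan) with K(Hum-Kan) through the difference K(Rat-Kan) - K(Hum-Kan)]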
43
Evaluation of Selective Pressures, The Ka / Ks
Ratio
  • Ka / Ks > 1 indicates positive selection
  • Ka / Ks = 1 indicates no selection
  • Ka / Ks < 1 indicates purifying selection
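
The interpretation of the ratio can be written as a small classifier. This is only a sketch: the `tolerance` band around 1 is an assumption added for the example, since estimated ratios are never exactly 1, and in practice a statistical test would decide whether a ratio differs significantly from 1.

```python
def classify_selection(ka, ks, tolerance=0.05):
    """Interpret a Ka/Ks ratio; `tolerance` is an illustrative assumption."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "purifying selection"
    return "no selection"


# Hypothetical Ka and Ks values, for illustration only.
for ka, ks in [(0.30, 0.10), (0.10, 0.10), (0.01, 0.10)]:
    print(ka, ks, classify_selection(ka, ks))
```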

44
280 homologs, Macaque-Human
(Wang et al. 2003)
45
Conserved Genomic Regions
  • Need complete genomes or homologous genomic
    regions
  • Identify exons from distantly related species
  • Identify regulatory elements from more closely
    related species

46
(Thomas et al. 2003)
47
References
  • Biological Sequence Analysis. R. Durbin, S.
    Eddy, A. Krogh, and G. Mitchison
  • Bioinformatics: Sequence, Structure and Databanks.
    D. Higgins and W. Taylor
  • Recent progress in multiple sequence alignment: a
    survey.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11966409&dopt=Abstract
  • Quality assessment of multiple alignment
    programs.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12354624&dopt=Abstract
  • Multiple sequence alignment with the Clustal
    series of programs.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12824352&dopt=Abstract
  • MAVID multiple alignment server.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12824358&dopt=Abstract
  • Multiple alignment of sequences and structures.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12510583&dopt=Abstract
  • Fast algorithms for large-scale genome alignment
    and comparison.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12034836&dopt=Abstract