Multiple Sequence Alignment - PowerPoint PPT Presentation

Slides: 59
Provided by: LEC64
1
Multiple Sequence Alignment
Julie Thompson
Laboratory of Integrative Bioinformatics and Genomics
IGBMC, Strasbourg, France
julie_at_igbmc.fr
2
Multiple Sequence Alignment
  • Introduction: what is a multiple alignment?
  • Multiple alignment construction
  • Traditional approaches: optimal, progressive
  • Alignment parameters
  • Iterative and co-operative approaches
  • Multiple alignment analysis
  • Quality analysis/error detection
  • Conserved/homologous regions
  • Multiple alignment applications

3
What is a multiple alignment?
  • a representation of a set of sequences, where
    equivalent residues (e.g. functional, structural)
    are aligned in rows or more usually columns

Example: part of an alignment of SH2 domains from 14 sequences
[Figure legend: conserved identical residues; conserved similar residues]
4
What is a multiple alignment?
[Figure: alignment annotated with conserved residues, secondary structure and conservation profile]
5
Multiple Sequence Alignment
  • Introduction: what is a multiple alignment?
  • Multiple alignment construction
  • Traditional approaches: optimal, progressive
  • Alignment parameters
  • Iterative and co-operative approaches
  • Multiple alignment analysis
  • Quality analysis/error detection
  • Conserved/homologous regions
  • Multiple alignment applications

6
Multiple Alignment Construction
  • Optimal multiple alignment
  • example: MSA (Lipman et al. 1989, Gupta et al. 1995)

7
Optimal multiple alignment
Extension of dynamic programming for 2 sequences => N dimensions
Example: alignment of 3 sequences
Problem: calculation time and memory requirements
Time proportional to N^k for k sequences of length N => limited to less than 10 sequences
Alignment of 5 sulfate binding proteins, length 224-263 residues:
MSA         OMA          ClustalW
>12 hours   6 2.9 min    0.6 sec
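To see why the optimal approach stops scaling, the N^k growth can be made concrete. The sketch below is only an illustration under simple assumptions: it counts the cells of a full k-dimensional dynamic programming lattice and the number of neighbouring cells each one depends on. The function name `dp_cost` is hypothetical and the count ignores the pruning tricks that programs such as MSA use.

```python
def dp_cost(seq_len, num_seqs):
    """Return (cells, predecessors_per_cell) for a full k-dimensional DP lattice.

    cells: one cell per combination of prefix lengths, (N + 1)^k in total.
    predecessors_per_cell: every non-empty subset of the k sequences can
    advance by one residue, giving 2^k - 1 moves into each interior cell.
    """
    cells = (seq_len + 1) ** num_seqs
    predecessors = 2 ** num_seqs - 1
    return cells, predecessors


if __name__ == "__main__":
    # Sequence length of 250 is an arbitrary illustrative value.
    for k in (2, 3, 5, 10):
        cells, preds = dp_cost(250, k)
        print(f"k={k:2d}  cells={cells:.3e}  predecessors per cell={preds}")
```

Both factors grow exponentially with the number of sequences, which is consistent with the slide's limit of fewer than 10 sequences.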
8
Multiple Alignment Construction
  • Optimal multiple alignment
  • MSA, OMA
  • Progressive multiple alignment
  • ClustalW (Thompson et al. NAR. 1994)
  • ClustalX (Thompson et al. NAR. 1997)

9
Progressive multiple alignment
Idea: progressively align pairs of sequences (or groups of sequences)
10
Progressive multiple alignment
1) Pairwise alignments of all sequences
The alignment can be obtained by:
- local or global method
- dynamic programming or heuristic method (e.g. K-tuple count)
Ex: local pairwise alignments of globin sequences

Hbb_human 3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse 1 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse 3 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
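As an illustration of the dynamic programming option for step 1, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch style). The match, mismatch and gap values are arbitrary assumptions chosen for readability; a real program would use a residue similarity matrix and affine gap penalties, and the function name `global_align` is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two strings with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] aligned with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

A local (Smith-Waterman style) variant would clamp cell scores at zero and trace back from the highest-scoring cell instead of the corner.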
11
Progressive multiple alignment
2) Construction of a distance matrix
Example in ClustalW/X: distance between 2 sequences = 1 - (No. identical residues / No. aligned residues)
Ex: 7 globin sequences
1    -
2   .17   -
3   .59  .60   -
4   .59  .59  .13   -
5   .77  .77  .75  .75   -
6   .81  .82  .73  .74  .80   -
7   .87  .86  .86  .88  .93  .90   -
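The distance defined on this slide (one minus the fraction of identical residues among aligned residues) can be sketched as follows. This is a minimal illustration, not ClustalW's code: the function name and the choice to skip any column containing a gap character ("-" or ".") are assumptions.

```python
GAP_CHARS = "-."


def identity_distance(row_a, row_b):
    """Distance = 1 - identical / aligned for two rows of a pairwise alignment.

    Columns where either row has a gap character are not counted as aligned.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = identical = 0
    for x, y in zip(row_a, row_b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        aligned += 1
        if x.upper() == y.upper():
            identical += 1
    if aligned == 0:
        return 1.0  # nothing aligned: treat as maximally distant (assumption)
    return 1.0 - identical / aligned


if __name__ == "__main__":
    print(identity_distance("ACGT", "ACGA"))
```

Applying such a function to every pair of sequences fills the lower triangle of a matrix like the 7-globin example.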
12
Progressive multiple alignment
3) Decide order of alignment
  • Sequential branching
  • Construction of a guide tree
  • - Neighbor-Joining (NJ)
  • - UPGMA
  • - Maximum likelihood
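To show how a guide tree fixes the order of alignment, here is a minimal UPGMA sketch run on the 7-globin distance matrix from the previous slide. UPGMA is used here only because it is the simplest of the listed methods (ClustalW itself uses Neighbor-Joining); the function name `upgma` and the nested-tuple tree representation are assumptions.

```python
def upgma(labels, lower):
    """Build a UPGMA tree from a lower-triangular distance matrix.

    lower[j][i] is the distance between items i and j for i < j.
    Returns (tree, merges): tree is a nested tuple of labels, merges lists
    (left_subtree, right_subtree, distance) in the order clusters were joined.
    """
    trees = {i: label for i, label in enumerate(labels)}
    sizes = {i: 1 for i in trees}
    dist = {(i, j): lower[j][i] for j in range(len(labels)) for i in range(j)}
    next_id = len(labels)
    merges = []
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        merges.append((trees[i], trees[j], dist[(i, j)]))
        new_size = sizes[i] + sizes[j]
        # Distance to every other cluster is the size-weighted average.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / new_size
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        new_tree = (trees[i], trees[j])
        del trees[i], trees[j], sizes[i], sizes[j]
        trees[next_id] = new_tree
        sizes[next_id] = new_size
        next_id += 1
    (root,) = trees.values()
    return root, merges


# Lower triangle of the 7-globin distance matrix shown on the slide.
GLOBIN_DIST = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]

if __name__ == "__main__":
    tree, merges = upgma(["1", "2", "3", "4", "5", "6", "7"], GLOBIN_DIST)
    print(tree)
    for left, right, d in merges:
        print(left, right, round(d, 4))
```

The merge order is the alignment order: the closest pair (sequences 3 and 4, distance .13) is aligned first, then sequences 1 and 2, and larger groups are joined later as profiles.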
13
Progressive multiple alignment
4) Progressive multiple alignment
The sequences are aligned progressively (global or local algorithm):
- alignment of 2 sequences
- alignment of 1 sequence and a profile (group of sequences)
- alignment of 2 profiles (groups of sequences)
[Figure: schematic blocks of aligned sequences]
14
Progressive multiple alignment
[Figure labels: H1, H3, H2, H4, H6, H7, H5]
15
Progressive multiple alignment
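[Figure: progressive alignment programs, by guide-tree method and alignment type]
Tree method   Global               Local
SB            multal               SBpima
UPGMA         multalign, pileup
NJ            clustalx
ML                                 MLpima
SB - sequential branching
UPGMA - Unweighted Pair Group Method with Arithmetic mean
ML - maximum likelihood
NJ - neighbor-joining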
16
Alignment parameters: similarity matrices
Dynamic programming methods score an alignment
using residue similarity matrices, containing a
score for matching all pairs of residues
For nucleotide sequences:
Transitions (A-G or C-T) are more frequent than
transversions (e.g. A-T or C-G)
More complex matrices exist where matches between
ambiguous nucleotides are given values whenever
there is any overlap in the sets of nucleotides
represented
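A nucleotide scoring scheme along these lines can be sketched as below. The numeric scores are arbitrary assumptions chosen only so that transitions are penalised less than transversions; the ambiguity table uses the standard IUPAC nucleotide codes, and both function names are hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def nucleotide_score(x, y, match=2, transition=-1, transversion=-2):
    """Score a pair of unambiguous bases; the default values are illustrative."""
    x, y = x.upper(), y.upper()
    if x not in BASES or y not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if x == y:
        return match
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return transition
    return transversion


def codes_overlap(x, y):
    """True when two IUPAC codes share at least one possible base."""
    return bool(set(IUPAC[x.upper()]) & set(IUPAC[y.upper()]))


if __name__ == "__main__":
    print(nucleotide_score("A", "G"), nucleotide_score("A", "T"))
    print(codes_overlap("R", "A"), codes_overlap("R", "Y"))
```

A more complex matrix of the kind mentioned on the slide would assign a positive value wherever `codes_overlap` is true.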
17
Alignment parameters: similarity matrices
For proteins, a wide variety of matrices exist:
Identity, PAM, Blosum, Gonnet etc.
Matrices are generally constructed by observing
the mutations in large sets of alignments, either
sequence-based or structure-based
Matrices range from strict ones for comparing
closely related sequences to soft ones for very
divergent sequences.
e.g. PAM250 corresponds to an evolutionary
distance of 250 PAMs, or approximately 80% residue
divergence; PAM1 corresponds to less than
1% divergence
18
Alignment parameters: similarity matrices
A single best matrix does not exist!
  • Altschul, 1991 suggests PAM250 for related
    sequences, PAM120 when the sequences are not
    known to be related and PAM40 to search for short
    segments of highly similar sequences.
  • Henikoff and Henikoff, 1993 suggest Blosum62 as a
    good all-round matrix, Blosum45 for more
    divergent sequences and Blosum100 for strongly
    related sequences
  • ClustalW automatically selects a suitable
    matrix depending on the observed pairwise
    identity
  • By default: ID > 35%: Gonnet 80
  • 35% > ID > 25%: Gonnet 250
  • ID < 25%: Gonnet 350
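The default thresholds listed above amount to a simple lookup on pairwise identity. The sketch below only restates the slide's rule; it is not ClustalW's source code, the function name is hypothetical, and the handling of identities of exactly 35% or 25% is an assumption because the slide does not specify it.

```python
def select_gonnet_matrix(percent_identity):
    """Pick a Gonnet matrix from pairwise percent identity (0-100).

    Thresholds follow the slide; boundary values fall into the softer matrix.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity > 35:
        return "Gonnet 80"
    if percent_identity > 25:
        return "Gonnet 250"
    return "Gonnet 350"


if __name__ == "__main__":
    for pid in (83, 30, 12):
        print(pid, select_gonnet_matrix(pid))
```

The design follows the earlier point that strict matrices suit closely related sequences and soft ones suit divergent sequences.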

19
Alignment parameters: gap penalties
  • A gap penalty is a cost for introducing gaps into
    the alignment, corresponding to insertions or
    deletions in the sequences

SFGDLSNPGAVMG
HF-DLS-----HG
  • proportional gap costs charge a fixed penalty for
    each residue aligned with a gap - the cost of a
    gap is proportional to its length

GAP_COST = uk, where k is the length of the gap
  • linear or affine gap costs define a cost for
    introducing or opening a gap, plus a
    length-dependent extension cost

GAP_COST = v + uk, where v is the gap opening cost and u is the gap extension cost
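The two gap cost formulas can be compared on the gapped sequence from this slide. This is a small sketch with arbitrary penalty values (v = 10, u = 1) chosen only for illustration; the function names are hypothetical.

```python
import re


def proportional_gap_cost(k, u):
    """Cost u*k for a gap of length k."""
    return u * k


def affine_gap_cost(k, v, u):
    """Cost v + u*k for a gap of length k (opening plus extension)."""
    return v + u * k


def gap_lengths(aligned_row, gap_char="-"):
    """Lengths of the consecutive gap runs in one row of an alignment."""
    return [len(run) for run in re.findall(re.escape(gap_char) + "+", aligned_row)]


if __name__ == "__main__":
    row = "HF-DLS-----HG"           # second row of the slide's example
    lengths = gap_lengths(row)      # gaps of length 1 and 5
    v, u = 10, 1                    # illustrative penalties, not program defaults
    print(sum(proportional_gap_cost(k, u) for k in lengths))  # 6
    print(sum(affine_gap_cost(k, v, u) for k in lengths))     # 26
```

With the proportional scheme, six single-residue gaps would cost the same as these two gaps; the affine scheme charges the opening cost v per gap, so it favours fewer, longer gaps.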
20
Alignment parameters: gap penalties
  • ClustalW uses position-specific gap penalties to
    make gaps more or less likely at different
    positions in the alignment
  • Gap penalties are lowered at existing gaps and
    increased near to existing gaps
  • Gap penalties are lowered in hydrophilic
    stretches
  • Otherwise, gap opening penalties are modified
    according to the observed relative frequencies of
    residues adjacent to gaps (Pascarella and Argos, 1992)

Goal is to introduce gaps in sequence segments
corresponding to flexible regions of the protein
structure
21
Multiple Alignment Construction
  • Optimal multiple alignment
  • MSA, OMA
  • Progressive multiple alignment
  • ClustalW, ClustalX
  • Iterative multiple alignment
  • PRRP (Gotoh, 1993)
  • SAGA (Notredame et al. NAR. 1996)
  • DIALIGN (Morgenstern et al. 1999)
  • HMMER (Eddy 1998), SAM (Karplus et al. 2001)

22
Iterative refinement
PRRP (Gotoh, 1993) refines an initial progressive
multiple alignment by iteratively dividing the
alignment into 2 profiles and realigning them.
[Flowchart: initial alignment (global progressive) -> divide sequences into 2 groups (profile 1, profile 2) -> pairwise profile alignment -> refined alignment -> (no: repeat)]
23
Genetic Algorithms
SAGA (Notredame et al. 1996) evolves a population
of alignments in a quasi evolutionary manner,
iteratively improving the fitness of the
population
24
Segment-to-segment alignment
Dialign (Morgenstern et al. 1996) compares
segments of sequences instead of single residues
1. Construct dot-plots of all possible pairs of sequences
[Figure: dot-plot of Sequence i against Sequence j]
2. Find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
3. Local alignment - residues between the diagonals are not aligned
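The idea of a diagonal in a dot-plot can be illustrated with a deliberately simplified sketch: it reports ungapped runs of identical residues between two sequences. This is not the DIALIGN algorithm, which weights segments by similarity and then selects a consistent set across all sequence pairs; the function name and the `min_len` cut-off are assumptions.

```python
def diagonal_segments(seq_i, seq_j, min_len=3):
    """Find maximal ungapped runs of identical residues (dot-plot diagonals).

    Returns a list of (start_in_i, start_in_j, length) with length >= min_len.
    """
    segments = []
    n, m = len(seq_i), len(seq_j)
    # A diagonal of the dot-plot is the set of cells with a fixed offset j - i.
    for offset in range(-(n - 1), m):
        i = max(0, -offset)
        j = i + offset
        run = 0
        while i < n and j < m:
            if seq_i[i] == seq_j[j]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run reaching the end of the diagonal
            segments.append((i - run, j - run, run))
    return segments


if __name__ == "__main__":
    print(diagonal_segments("xxGATTACAyy", "GATTACA", min_len=4))
```

Residues outside the reported segments are left unaligned, which corresponds to the lowercase regions in the example alignment.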
25
Multiple alignment methods
[Figure: classification of multiple alignment methods]
Progressive (Global / Local):
  SB: multal, SBpima
  UPGMA: multalign, pileup
  NJ: clustalx
  ML: MLpima
Iterative:
  prrp
  Genetic Algo.: saga
  HMM: hmmt
  dialign
26
League Table based on BAliBASE benchmark database
Comparison of programs
Reference 1: < 6 sequences
Reference 2: a family with an orphan
Reference 3: several sub-families
Reference 4: long N/C terminal extensions
Reference 5: long insertions
[Table: program rankings per reference set, with categories < 100 residues, > 400 residues and All; programs grouped as GLOBAL, LOCAL and iterative; several entries N/A]
  • Iterative algorithms can improve alignment
    quality, but can be slow
  • Global algorithms work well when sequences are
    homologous over their full lengths, local
    algorithms are better for non-colinear sequences

Thompson et al. 1999
27
Multiple Alignment Construction
  • Optimal multiple alignment
  • MSA, OMA
  • Progressive multiple alignment
  • ClustalW, ClustalX
  • Iterative multiple alignment
  • PRRP, SAGA, DIALIGN, HMMER, SAM
  • Co-operative multiple alignment
  • T-COFFEE (Notredame et al. 2000)
    http://igs-server.cnrs-mrs.fr/Tcoffee/
  • DbClustal (Thompson et al. 2000)
    http://www-igbmc.u-strasbg.fr/BioInfo/
  • MAFFT (Katoh et al. 2002)
    http://www.biophys.kyoto-u.ac.jp/katoh/programs/align/mafft/
  • MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
  • Probcons (Do et al. 2005)
  • Kalign (Lassmann et al. 2005)

28
DbClustal
http://bips.u-strasbg.fr/PipeAlign/
[Figure: Query Sequence -> Blast Database Search -> Database Hits, with Domain A, Domain B and Domain C marked]
29
Comparison: ClustalW / DbClustal
[Figure panels: ClustalW; DbClustal]
30
MAFFT
  • Local homologous segments detected using a Fast
    Fourier Transform
  • Pairwise alignments are performed using
    restricted global dynamic programming
  • Multiple alignment is built up using a
    progressive algorithm, similar to ClustalW
  • Multiple alignment is then iteratively refined
    by dividing alignment into 2 parts and realigning

31
MAFFT
Pairwise alignments
[Figure: correlation c(k) plotted against lag k]
1. Fast Fourier Transform to detect local conserved segments
2. Segment Level Dynamic Programming to select consistent segments
3. Fix residues at the centre of each segment pair and realign between fixed points (white regions only)
32
State-of-the-art
  • Co-operative algorithms have led to significant
    improvements

[Figure: results on BAliBASE 3 reference sets]
Ref 11: < 20% ID
Ref 12: 20-40% ID
Ref 2: orphan
Ref 3: sub-families
Ref 4: extensions
Ref 5: insertions
... but none of the methods currently available is capable of producing high-quality alignments for all test cases
Thompson et al. 2005, 2006
33
RNA alignment methods
  • Comparison using BRAliBASE RNA structure
    alignments (Gardner et al, 2005)
  • Above 60% identity, sequence and structure based
    approaches have similar scores
  • Algorithms incorporating structural information
    outperform pure sequence methods. However, these
    algorithms are computationally demanding which
    severely limits their use in practice.
  • Some more recent methods:
  • Sequence: R-Coffee (Wilm, 2008), MAFFT (Katoh, 2008)
  • Structure: LARA (Bauer, 2007), FoldalignM (Torarinsson, 2007), SCARNA (Tabei, 2008)

34
DNA alignment methods
  • Complete genomes
  • Local alignments (BlastZ, MultiZ, MUMmer, ...)
  • Global alignments (MGA, Multi-LAGAN, MAVID, MAUVE, MAP2, Mulan, ...)

Reviewed in Dewey and Pachter, Human Molecular
Genetics, 2006
35
Multiple Sequence Alignment
  • Introduction: what is a multiple alignment?
  • Multiple alignment construction
  • Traditional approaches: optimal, progressive
  • Alignment parameters
  • Iterative and co-operative approaches
  • Multiple alignment analysis
  • Quality analysis/error detection
  • Conserved/homologous regions
  • Multiple alignment applications

36
Multiple alignment analysis
  • Are the sequences correctly aligned?
  • Quality analysis: alignment objective functions
    (SP, NorMD)
  • Error detection and correction (RASCAL, Refiner)
  • Are the sequences in the alignment homologous?
  • Conserved/homologous regions (MCOFFEE, LEON)
  • Conserved (functional) residues

37
Objective functions
Sum-of-pairs (Carrillo and Lipman, 1988): sum of scores for all pairs of sequences

Blosum62    N    C
N           6   -3
C          -3    9

Sequence 1  N N N
Sequence 2  N N N
Sequence 3  N N C
Sequence 4  N C C

Seq1-2: 3 pairs N-N: 3x6 = 18
Seq1-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq1-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq2-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq2-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq3-4: 1 pair N-N, 1 pair N-C, 1 pair C-C: 6 + (-3) + 9 = 12
Total = 48
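The worked example on this slide can be checked with a few lines of code. This is a minimal sketch: the score table holds only the three Blosum62 values quoted on the slide, gaps are not handled, and the function name is hypothetical.

```python
from itertools import combinations

# The Blosum62 values quoted on the slide (N-N, N-C, C-C).
SCORES = {("N", "N"): 6, ("N", "C"): -3, ("C", "N"): -3, ("C", "C"): 9}


def sum_of_pairs(alignment, scores):
    """Sum the substitution scores over every pair of rows and every column."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for x, y in zip(row_a, row_b):
            total += scores[(x, y)]
    return total


if __name__ == "__main__":
    alignment = ["NNN", "NNN", "NNC", "NCC"]  # Sequences 1-4 from the slide
    print(sum_of_pairs(alignment, SCORES))    # 48
```

Because every pair of rows is visited, the cost grows with the square of the number of sequences.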
  • Information content (Hertz et al, 1999)
  • Entropy column scores (between 0 and 1), summed over
    all columns in the alignment
  • NorMD (Thompson et al, 2001)
  • Column scores
  • Normalisation for the sequence set to be aligned
    (number, length, similarity)
  • < 0.3: bad alignment
  • 0.3-0.7: some local errors
  • > 0.7: good alignment
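An entropy-based column score between 0 and 1 can be sketched as follows. This is one possible normalisation, chosen for illustration and not necessarily the definition used by Hertz et al.: Shannon entropy of the residue frequencies in a column, divided by the maximum entropy for the alphabet and subtracted from 1. The function names, the default alphabet size of 20 and the decision to ignore gap characters are all assumptions.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20, gap_chars="-."):
    """Return 1 - H/H_max for one column: 1 = fully conserved, 0 = uniform."""
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0  # all-gap column carries no information (assumption)
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


def alignment_conservation(rows, alphabet_size=20):
    """Sum of the column scores over all columns of the alignment."""
    return sum(column_conservation(col, alphabet_size) for col in zip(*rows))


if __name__ == "__main__":
    rows = ["NNN", "NNN", "NNC", "NCC"]
    print([round(column_conservation(col), 3) for col in zip(*rows)])
    print(round(alignment_conservation(rows), 3))
```

Unlike sum-of-pairs, a score of this kind depends only on residue frequencies within each column and not on a substitution matrix.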

38
Objective functions: NorMD
[Figure panels: window length 8; window length 40]
39
Error detection and correction
  • RASCAL (Thompson et al, 2003), Refiner
    (Chakrabarti et al, 2006)

RASCAL
40
Error detection and correction
  • RASCAL, errors within core blocks

metalloprotease
41
Error detection and correction
  • RASCAL, errors between core blocks

methyltransferase
42
Homology detection methods
  • Sequence percent identity
  • > 30% identity => sequences are homologous
  • 15-30% identity => twilight zone
  • Local analysis of positional conservation
  • AL2CO (Pei, Grishin, 2001), SEGID (Wang, Zu, 2003), NorMD
  • Conserved regions
  • LEON (Thompson et al, 2004), MCOFFEE (Moretti et al, 2007)

43
Homology analysis with LEON
  • vertical analysis: sequence clustering,
    intermediate sequences
  • horizontal analysis: residue conservation,
    motif context information
  • composition analysis: prediction of
    compositionally biased segments
  • Homologous regions are delineated
  • Removal of sequences non-homologous to query

44
Homology analysis with LEON
Query sequence: DKK1_HUMAN
BlastP results:
DKK1_HUMAN   Dickkopf related protein-1 precursor      1e-151
DKK3_MOUSE   Dickkopf related protein-3 precursor      8e-07
TXCA_CAEEX   Neurotoxic peptide caeron precursor.      0.007
PRK1_RAT     Prokineticin 1 precursor                  0.021
VPRA_DENPO   Intestinal toxin 1 _MIT                   0.10
Q8BKK7       MEGF11 protein.                           0.10
COL_RABIT    Colipase precursor.                       0.13
PRK2_HUMAN   Prokineticin 2 precursor                  0.17
Q7XZ34       Growth factor (Fragment).                 0.17
1imt_        VENOM. MAMBA INTESTINAL TOXIN 1,          0.23
Q863H5       Bv8/prokineticin 2-like protein.          0.30
VE6_RHPV1    E6 protein.                               1.1
COL_CANFA    Colipase precursor.                       3.3
Q9Y7V5       Conidiospore surface protein.             3.3
COLA_HORSE   Procolipase A precursor (Fragment).       4.3
O00508       Latent TGF-beta binding protein-4.        5.6
1pco_        LIPASE PROTEIN COFACTOR.                  7.3
Q8SRF4       GTP binding protein.                      7.3
NTC1_MOUSE   Neurogenic locus notch homolog            9.6
45
Homology analysis with LEON
dkk1
dkk2
dkk3
Prokineticin / Intestinal toxin
Lipase protein cofactor
46
Structural proteomics: target characterisation
Detection of structural homologs for targets in
the SPINE (Structural Proteomics in Europe)
project
47
Conserved residue analysis
  • Active site residues are under evolutionary
    pressure to maintain their functional integrity
    and undergo fewer mutations than less
    functionally important amino acids
  • Methods:
  • Evolutionary trace (Lichtarge et al, 1996):
    sequence conservation patterns in homologous
    proteins are mapped onto the protein surface to
    generate clusters identifying functional
    interfaces

48
Conserved residue analysis
  • Comparison of sequence-based methods
  • FRcons combines information:
  • conservation at each site
  • amino acid distribution
  • predicted secondary structure (ss)
  • predicted relative solvent accessibility (rsa)

FRcons: Fischer et al., Bioinformatics 2008
49
OrdAli: Ordered Alignment Analysis
color scheme
  • residues conserved in all sequences in family
  • structural or functional importance,
    characteristic motifs
  • residues conserved within a sub-group of
    sequences
  • discriminant residues

50
Schematic alignment of aspartyl-tRNA synthetases
  • universal proteins, play a key role in translation

[Figure: schematic alignment with position ticks from 180 to 560 and from 690 to 930, showing the anticodon binding domain, catalytic core I (Motif I, flipping loop, Motif II), the insertion domain and catalytic core II (Motif III); marked residues: P, L, Q, PQ, KQ, R, G, H; legend: family conserved, Archaea-Bacteria, Archaea-Eukaryote, Arc, Bac, Euc]
51
PipeAlign: automatic protein analysis
http://www-igbmc.u-strasbg.fr/PipeAlign/
52
(No Transcript)
53
Multiple sequence alignment editors
No automatic method is 100% reliable - manual
verification and refinement is essential!
SeqLab (GCG Wisconsin Package)
SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html
UNIX/Linux, Windows 95, MAC OS 8, 9, X
WEB servers:
GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
Jalview (Clamp, 1998) http://www.ebi.ac.uk/michele/jalview/
CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
54
Multiple Sequence Alignment
  • Introduction: what is a multiple alignment?
  • Multiple alignment construction
  • Traditional approaches: optimal, progressive
  • Alignment parameters
  • Iterative and co-operative approaches
  • Multiple alignment analysis
  • Conserved/homologous regions
  • Quality analysis/error detection
  • Multiple alignment applications

55
Central role of multiple alignments
domain structure
conserved, functional sites
56
Central role of multiple alignments
Multiple alignment
57
Example: protein, RNA complexes
ASP tRNA
ASP tRNA synthetase
aspRS, tRNA interactions
Ruff et al, 1991
58
Example: Bardet-Biedl Syndrome
Identification of new genes responsible for BBS
BBS: a rare autosomal recessive genetic disease, probably caused by a defect at the basal body of ciliated cells
Phenotypes: obesity, retinopathy, polydactyly, mental retardation, hypogonadism, renal failure
9 genes are known to be involved: BBS1-BBS9
In a comparative genomics study, Li et al. (2004) identified 688 genes implicated in cilia and flagella
BBS10 gene shows a high frequency of mutation (20% of patients)
  • Clinical studies have identified a candidate
    chromosomal region of 8 Mb with approx. 23 genes
  • including 4 genes from the set of 688

J. Muller et al. 2006