Title: Bioinformatics Methods Course Multiple Sequence Alignment Burkhard Morgenstern University of G
1 Bioinformatics Methods Course: Multiple Sequence Alignment
Burkhard Morgenstern, University of Göttingen
Institute of Microbiology and Genetics, Department of Bioinformatics
Göttingen, October/November 2006
2Tools for multiple sequence alignment
- T Y I M R E A Q Y E
- T C I V M R E A Y E
-
3Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V M R E A - Y E
-
4Tools for multiple sequence alignment
- T Y I M R E A Q Y E
- T C I V M R E A Y E
- Y I M Q E V Q Q E
- Y I A M R E Q Y E
-
5Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V M R E A - Y E
- Y - I - M Q E V Q Q E
- Y I A M R E - Q Y E
-
6Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V M R E A - Y E
- - Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
- Astronomical Number of possible alignments!
7Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V - M R E A Y E
- - Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
- Astronomical Number of possible alignments!
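To give a feeling for "astronomical": even for just two sequences the number of possible gapped alignments grows very quickly. The sketch below counts them under the assumption that every alignment column is either a residue pair or a single residue facing a gap, and that different orderings of adjacent gaps count as different alignments. The function name `count_pairwise_alignments` is chosen for this illustration and does not come from the course.

```python
from functools import lru_cache


def count_pairwise_alignments(m: int, n: int) -> int:
    """Count the alignments of two sequences with m and n residues.

    Every column is a residue pair, a gap in the first sequence,
    or a gap in the second sequence (no column with two gaps).
    """
    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 1  # the remaining residues can only face gaps
        # last column: residue pair, gap in sequence 2, gap in sequence 1
        return count(i - 1, j - 1) + count(i - 1, j) + count(i, j - 1)

    return count(m, n)


# The first two sequences on the slides have 10 residues each.
print(count_pairwise_alignments(10, 10))  # 8097453
```

With four sequences instead of two, the number of possibilities is far larger still, which is why exhaustive enumeration is not an option.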
8Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V M R E A - Y E
- - Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
- Which one is the best ???
9Tools for multiple sequence alignment
- Questions in development of alignment programs:
- (1) What is a good alignment?
- → objective function (score)
- (2) How to find a good alignment?
- → optimization algorithm
- First question far more important!
10Tools for multiple sequence alignment
- Before defining an objective function (scoring scheme):
- What is a biologically good alignment?
11Tools for multiple sequence alignment
- Criteria for alignment quality:
- 3D structure: align residues at corresponding positions in the 3D structure of the protein!
- Evolution: align residues with common ancestors!
12Tools for multiple sequence alignment
- T Y I - M R E A Q Y E
- T C I V - M R E A Y E
- - Y I - M Q E V Q Q E
- - Y I A M R E - Q Y E
- Alignment = hypothesis about sequence evolution
- Search for the most plausible hypothesis!
13Tools for multiple sequence alignment
- Compute for amino acids a and b:
- Probability $p_{a,b}$ of substitution a → b (or b → a)
- Frequency $q_a$ of a
- Define
- $s(a,b) = \log \frac{p_{a,b}}{q_a \, q_b}$
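The log-odds definition of s(a,b) can be tried out with a few lines of Python. This is a minimal sketch: the probabilities and frequencies below are made-up numbers for illustration, not values from a real substitution matrix, and the natural logarithm is an assumption (real matrices usually rescale and round the scores).

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Similarity score s(a,b) = log(p_ab / (q_a * q_b))."""
    return math.log(p_ab / (q_a * q_b))


# Invented example values: both residues have frequency 0.1.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen more often than by chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than by chance: negative
```

A pair that is substituted exactly as often as expected by chance gets a score of zero.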
16Tools for multiple sequence alignment
- Traditional objective functions
- Define score of alignments as:
- Sum of individual similarity scores s(a,b)
- Gap penalty g for each gap in alignment
- Needleman-Wunsch scoring system (1970) for pairwise alignment (= alignment of two sequences)
17
- T Y W I V
- T - - L V
- Example:
- Score = s(T,T) + s(I,L) + s(V,V) - 2g
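The example score can be recomputed mechanically. The sketch below assumes a penalty g for every gap character and uses three invented similarity values; the table `toy_s` and the function name are hypothetical and only serve this illustration.

```python
def alignment_score(row1, row2, s, g):
    """Score a pairwise alignment given as two equally long gapped strings.

    s maps a sorted residue pair, e.g. ("I", "L"), to its similarity score;
    g is the penalty subtracted for every gap character.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= g
        else:
            total += s[tuple(sorted((a, b)))]
    return total


# Toy similarity values, invented for this example only
toy_s = {("T", "T"): 5, ("I", "L"): 2, ("V", "V"): 4}
print(alignment_score("TYWIV", "T--LV", toy_s, g=3))  # 5 + 2 + 4 - 2*3 = 5
```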
18
- T Y W I V
- T - - L V
- Idea: an alignment with optimal (maximal) score is probably biologically meaningful.
- Dynamic programming algorithm finds an optimal alignment for two sequences efficiently (Needleman and Wunsch, 1970).
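A minimal sketch of the dynamic programming idea for two sequences, assuming a linear gap penalty g per gap position and integer similarity scores (so that the traceback comparisons are exact). The scoring function `identity` is a toy stand-in for a real substitution matrix.

```python
def needleman_wunsch(x, y, s, g):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * g
    for j in range(1, n + 1):
        F[0][j] = -j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # pair
                          F[i - 1][j] - g,                          # gap in y
                          F[i][j - 1] - g)                          # gap in x
    # Trace back from the bottom-right corner to recover one optimal alignment
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def identity(a, b):
    """Toy similarity: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


score, ax, ay = needleman_wunsch("TYWIV", "TLV", identity, g=1)
print(score)
print(ax)
print(ay)
```

The table has (m+1)(n+1) cells, so time and memory grow with the product of the two sequence lengths.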
19Tools for multiple sequence alignment
- Traditional objective functions can be generalized to multiple alignment (e.g. sum-of-pairs score, tree alignment)
- Needleman-Wunsch algorithm can also be generalized to find an optimal multiple alignment, but:
- Very time and memory consuming!
- → Heuristic algorithm needed, i.e. a fast but sub-optimal solution
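As an illustration of the sum-of-pairs idea, the sketch below scores a multiple alignment by adding up the scores of all induced pairwise alignments. Ignoring columns where both sequences of a pair have a gap is a common convention that is assumed here; the scoring function `identity` and the gap penalty are toy choices.

```python
from itertools import combinations


def sum_of_pairs_score(rows, s, g):
    """Sum the pairwise scores over all pairs of rows of a multiple alignment."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for row1, row2 in combinations(rows, 2):
        for a, b in zip(row1, row2):
            if a == "-" and b == "-":
                continue          # empty column in this pairwise projection
            if a == "-" or b == "-":
                total -= g        # residue facing a gap
            else:
                total += s(a, b)  # residue pair
    return total


def identity(a, b):
    """Toy similarity: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


# One of the four-sequence alignments shown on the earlier slides
rows = ["TYI-MREAQYE",
        "TCIV-MREAYE",
        "-YI-MQEVQQE",
        "-YIAMRE-QYE"]
print(sum_of_pairs_score(rows, identity, g=1))
```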
20Tools for multiple sequence alignment
- Most commonly used heuristic for multiple alignment:
- Progressive alignment
- (mid 1980s)
21Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WWRLNDKEGYVPRNLLGLYP
- AVVIQDNSDIKVVPKAKIIRD
- YAVESEAHPGSFQPVAALERIN
- WLNYNETTGERGDFPGTYVEYIGRKKISP
22Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WWRLNDKEGYVPRNLLGLYP
- AVVIQDNSDIKVVPKAKIIRD
- YAVESEAHPGSFQPVAALERIN
- WLNYNETTGERGDFPGTYVEYIGRKKISP
- Guide tree
23Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASFQPVAALERIN
- WLNYNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
24Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASVQ--PVAALERIN------
- WLN-YNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
25Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN-
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASVQ--PVAALERIN------
- WLN-YNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
26Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN--------
- WW--RLNDKEGYVPRNLLGLYP--------
- AVVIQDNSDIKVVP--KAKIIRD-------
- YAVESEA---SVQ--PVAALERIN------
- WLN-YNE---ERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
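The rule "once a gap - always a gap" can be pictured as follows: when an already aligned group is aligned to another sequence or group, a new gap is inserted as a whole column into every row of the group, and gaps that are already there are never removed. The helper below is a hypothetical illustration of that bookkeeping only, not the profile alignment itself.

```python
def insert_gap_column(profile, col):
    """Insert a gap at index col into every row of an aligned group.

    Existing gaps stay where they are; the group only ever gets longer.
    """
    return [row[:col] + "-" + row[col:] for row in profile]


# The first two rows of the alignment on the slides, treated as one group
group = ["WCEAQTKNGQGWVPSNYITPVN",
         "WW--RLNDKEGYVPRNLLGLYP"]
for row in insert_gap_column(group, 7):
    print(row)
```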
27CLUSTAL W
- Most important software program:
- CLUSTAL W
- J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment, Nucleic Acids Res. 22, 4673-4680
- (20,000 citations in the literature)
28Tools for multiple sequence alignment
- Problems with traditional approach
- Results depend on gap penalty
- Heuristic guide tree determines alignment
- alignment used for phylogeny reconstruction
- Algorithm produces global alignments.
29Tools for multiple sequence alignment
- Problems with traditional approach
- But
- Many sequence families share only local similarity
- E.g. sequences share one conserved motif
30Local sequence alignment
EYENS
ERYENS
ERYAS
Find common motif in sequences; ignore the rest
31Local sequence alignment
E-YENS
ERYENS
ERYA-S
Find common motif in sequences; ignore the rest
32Local sequence alignment
E-YENS
ERYENS
ERYA-S
Find common motif in sequences; ignore the rest
Local alignment
33 Gibbs Motif Sampler
Local multiple alignment without gaps
C.E. Lawrence et al. (1993), Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, Science 262, 208-214
34 Traditional alignment approaches: either global or local methods!
35 New question: sequence families with multiple local similarities
Neither local nor global methods applicable
36 New question: sequence families with multiple local similarities
Alignment possible if order conserved
37The DIALIGN approach
- Morgenstern, Dress, Werner (1996), PNAS 93, 12098-12103
- Combination of global and local methods
- Assemble multiple alignment from gap-free local pair-wise alignments ("fragments")
38The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
44The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
45The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
46The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
47The DIALIGN approach
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
48The DIALIGN approach
Consistency!
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
49The DIALIGN approach
-
- atc------TAATAGTTAaactccccCGTGC-TTag
- cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
- caaa--GAGTATCAcc----------CCTGaaTTGAATaa
-
50The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
-
51The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaccctgaattgaagagtatcacataa
- (1) Calculate all optimal pair-wise alignments
-
52The DIALIGN approach
-
- Multiple alignment
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
- (1) Calculate all optimal pair-wise alignments
-
54The DIALIGN approach
-
- Fragments from optimal pair-wise alignments
- might be inconsistent
-
-
55The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
58The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
59The DIALIGN approach
-
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
60The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
61The DIALIGN approach
- Score of alignment:
- Define weight score for fragments based on probability of random occurrence
- Score of alignment: sum of weight scores of fragments
- Goal: find consistent set of fragments with maximum total weight
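For two sequences, a set of fragments is consistent if the fragments can be put in an order in which each one starts after the previous one has ended in both sequences. Under that assumption the best-scoring consistent set can be found by a small dynamic program. The sketch below covers only this pairwise case; the `Fragment` type and the weights are invented for illustration, and the multiple-sequence case needs an additional strategy that is not shown here.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    start1: int    # start position in sequence 1 (0-based)
    start2: int    # start position in sequence 2 (0-based)
    length: int    # gap-free, so the same length in both sequences
    weight: float  # weight score of the fragment


def best_consistent_chain(fragments: List[Fragment]) -> Tuple[float, List[Fragment]]:
    """Return the maximum total weight and one chain of fragments achieving it."""
    frags = sorted(fragments, key=lambda f: (f.start1, f.start2))
    if not frags:
        return 0.0, []
    best = [f.weight for f in frags]  # best chain weight ending with fragment k
    prev = [-1] * len(frags)          # predecessor in that chain, -1 for none
    for k, f in enumerate(frags):
        for j in range(k):
            p = frags[j]
            ends_before = (p.start1 + p.length <= f.start1 and
                           p.start2 + p.length <= f.start2)
            if ends_before and best[j] + f.weight > best[k]:
                best[k] = best[j] + f.weight
                prev[k] = j
    k = max(range(len(frags)), key=best.__getitem__)
    total = best[k]
    chain = []
    while k != -1:
        chain.append(frags[k])
        k = prev[k]
    return total, chain[::-1]


a = Fragment(0, 0, 4, 3.0)
b = Fragment(5, 2, 3, 2.0)  # overlaps a in sequence 2, so it conflicts with a
c = Fragment(6, 6, 3, 1.5)
total, chain = best_consistent_chain([a, b, c])
print(total)  # 4.5
print(chain)  # fragments a and c
```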
62The DIALIGN approach
-
- Advantages of segment-based approach
- Program can produce global and local alignments!
- Sequence families alignable that cannot be
aligned with standard methods
63 T-COFFEE
- C. Notredame, D. Higgins, J. Heringa (2000), T-Coffee: A novel algorithm for multiple sequence alignment, J. Mol. Biol.
67 T-COFFEE
- Less sensitive to spurious pairwise similarities
- Can handle local homologies better than CLUSTAL
68 T-COFFEE
- Idea:
- Build library of pairwise alignments
- Alignments of seq i with seq j and of seq j with seq k support the alignment of seq i with seq k.
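The support idea can be sketched as a "library extension" step: the weight of a residue pair from seq i and seq k is increased whenever both residues are aligned to the same residue of a third sequence j. Taking the minimum of the two weights along such a path is an assumption made for this sketch, as are the data layout (residues as `(sequence_name, position)` tuples) and the example weights; this is not the T-COFFEE code.

```python
from collections import defaultdict


def extend_library(library):
    """Add support from third sequences to the weights of residue pairs.

    library maps (residue_a, residue_b) to a weight, where a residue is a
    (sequence_name, position) tuple. The result maps frozenset pairs to
    the extended weight.
    """
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (r1, r2), w in library.items():
        neighbours[r1][r2] = w
        neighbours[r2][r1] = w
        extended[frozenset((r1, r2))] += w  # direct evidence
    for linked in neighbours.values():
        partners = list(linked.items())
        for x in range(len(partners)):
            for y in range(x + 1, len(partners)):
                (ra, wa), (rb, wb) = partners[x], partners[y]
                if ra[0] != rb[0]:  # only pairs from different sequences
                    extended[frozenset((ra, rb))] += min(wa, wb)
    return dict(extended)


# Invented weights for one residue from each of the sequences i, j and k
ri, rj, rk = ("i", 3), ("j", 5), ("k", 2)
library = {(ri, rj): 80, (rj, rk): 60, (ri, rk): 20}
extended = extend_library(library)
print(extended[frozenset((ri, rk))])  # 20 direct + min(80, 60) via seq j = 80.0
```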
69 Evaluation of multi-alignment methods
- Alignment evaluation by comparison to trusted benchmark alignments.
- True alignment known from information about structure or evolution.
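One common way to compare a program's output with a benchmark alignment is to ask which fraction of the residue pairs aligned in the reference is also aligned in the test alignment. The sketch below assumes this sum-of-pairs style measure and "-" as the only gap character; the function names are chosen for this illustration and the official benchmark scoring tools may differ in detail.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Return the set of residue pairs that share a column.

    A residue is identified as (row_index, index_in_ungapped_sequence).
    """
    pairs = set()
    counters = [0] * len(rows)
    for column in zip(*rows):
        residues = []
        for k, ch in enumerate(column):
            if ch != "-":
                residues.append((k, counters[k]))
                counters[k] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_accuracy(test_rows, reference_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    reference = aligned_pairs(reference_rows)
    if not reference:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(reference & aligned_pairs(test_rows)) / len(reference)


reference = ["ABC", "ABC"]
test = ["ABC-", "AB-C"]
print(sum_of_pairs_accuracy(test, reference))  # 2 of 3 reference pairs recovered
```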
70 Evaluation of multi-alignment methods
- For protein alignment:
- M. McClure et al. (1994)
- 4 protein families, known functional sites
- J. Thompson et al. (1999)
- Benchmark database, 130 known 3D structures (BAliBASE)
- T. Lassmann, E. Sonnhammer (2002)
- BAliBASE and simulated evolution (ROSE)
71 Evaluation of multi-alignment methods
72 Evaluation of multi-alignment methods
BAliBASE reference alignments

1aboA   1  .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE
1ycsB   1  kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE
1pht    1  gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA   1  .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG
1vie    1  .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA  36  WCEAQt..kngqGWVPSNYITPVN......
1ycsB  39  WWWARl..ndkeGYVPRNLLGLYP......
1pht   51  WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA  27  AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie   28  YAVESeahpgsvQIYPVAALERIN......

Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE
74 Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins
75 Evaluation of multi-alignment methods
- BAliBASE: 5 categories of benchmark sequences (globally related, internal gaps, end gaps)
- CLUSTAL W, T-COFFEE, MAFFT, PROBCONS perform well on globally related sequences, DIALIGN superior for local similarities
76 Evaluation of multi-alignment methods
- Conclusion: no single best multi-alignment program!
- Advice: try different methods!
77Anchored sequence alignment
- Idea: semi-automatic alignment
- Use expert knowledge to define constraints instead of fully automated alignment
- Define parts of the sequences where the biologically correct alignment is known as anchor points, align the rest of the sequences automatically.
78Anchored sequence alignment
- NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
- IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
- GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
-
-
79Anchored sequence alignment
- NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
- IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
- GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
-
- Anchor points in multiple alignment
80Anchored sequence alignment
- NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
- IIHREDKGVIYALWDYEPQND DELPMKEGDCMT
- GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
-
- Anchor points in multiple alignment
81Anchored sequence alignment
- -------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn
- iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT-------
- -------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--
- Anchored multiple alignment
82Algorithmic questions
- Goal:
- Find optimal alignment (consistent set of fragments) under constraints given by user-specified anchor points!
83Algorithmic questions
- Additional input file with anchor points
- 1 3 215 231 5 4.5
- 2 3 34 78 23 1.23
- 1 4 317 402 8 8.5
-
84Algorithmic questions
- NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
- IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
- GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
-
89 Algorithmic questions
- Additional input file with anchor points
- 1 3 215 231 5 4.5
- 2 3 34 78 23 1.23
- 1 4 317 402 8 8.5
- Columns: sequences, start positions, length, score
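The anchor file shown above can be read with a few lines of Python. This is a minimal sketch that assumes exactly six whitespace-separated columns per line in the order given on the slide (two sequence numbers, two start positions, length, score); the `Anchor` type and the function name are hypothetical and not part of any alignment program.

```python
from typing import Iterable, List, NamedTuple


class Anchor(NamedTuple):
    seq1: int     # number of the first sequence
    seq2: int     # number of the second sequence
    start1: int   # start position in the first sequence
    start2: int   # start position in the second sequence
    length: int   # length of the anchored segment
    score: float  # user-defined priority of the anchor


def parse_anchor_lines(lines: Iterable[str]) -> List[Anchor]:
    """Parse anchor points, one per line; blank lines are skipped."""
    anchors = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        seq1, seq2, start1, start2, length = map(int, fields[:5])
        anchors.append(Anchor(seq1, seq2, start1, start2, length, float(fields[5])))
    return anchors


example = """\
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
"""
anchors = parse_anchor_lines(example.splitlines())
for anchor in anchors:
    print(anchor)
```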