Title: CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment
- Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, and Chuan Yi Tang
- National Tsing Hua University, Taiwan
- Sixth International Conference on Bioinformatics (InCoB2007), HKUST, Hong Kong
2Outline
- Introduction
- Motivation
- Algorithm
- Experiments
- Conclusions
3Introduction
- Multiple sequence alignment (MSA)
  - NP-hard problem
- The heuristic methods for MSA
  - Progressive method
    - ClustalW, T-Coffee, POA, etc.
  - Iterative method
    - Muscle, DIALIGN, etc.
  - Probabilistic method
    - Probcons, Hmmt, Muscle, etc.
  - Anchor-based method
    - MAFFT, Align-m, etc.
4Introduction (cont)
- Pairwise alignment
  - Uses dynamic programming to find the optimal alignment. [Needleman, J. Mol. Biol. 1970; Smith, J. Mol. Biol. 1981]
- Three-sequence alignment
  - More accurate than pairwise alignment. [Murata, PNAS 1985]
  - Introduced a linear gap penalty. [Gotoh, J. Theor. Biol. 1986]
  - Space has been reduced from O(N^3) to O(N^2) with an affine gap penalty. [Huang, ACM 1993]
  - Useful for MSA. [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
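As an illustration of the dynamic-programming idea behind pairwise alignment, the following is a minimal sketch of Needleman-Wunsch global alignment. The scores (match 1, mismatch -1, linear gap -2) and the function name are assumptions chosen for the example; real protein aligners use a substitution matrix and affine or position-specific gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The table has (N+1) x (M+1) cells, so time and space are both quadratic in the sequence length.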
5Introduction (cont)
- Progressive multiple sequence alignment (progressive pairwise MSA)
  - Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned.
  - The resulting alignment is affected by the initial branching order.
- Problems with gaps
  - Once introduced, a gap is never removed.
  - An insertion gap may be counted multiple times. [Loytynoja, PNAS 2005]
6Introduction (cont)
- Progressive triple MSA: aln3nn
  - Published in BMC Bioinformatics, July 2007. [Matthias, BMC Bioinformatics 2007]
  - Every alignment step is a three-sequence alignment.
  - The three-sequence alignment uses the same affine gap penalty as [Huang, ACM 1993].
  - Uses Huang's three-sequence alignment algorithm.
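To make the idea of a simultaneous three-sequence alignment concrete, here is a minimal sketch that fills a three-dimensional dynamic-programming table and returns the optimal sum-of-pairs score. It is not Huang's algorithm and not aln3nn: it uses a linear gap penalty, cubic space, and assumed scores (match 1, mismatch -1, gap -2); the function names are hypothetical.

```python
from itertools import product


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols in a column; a gap-gap pair costs nothing."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def three_way_score(a, b, c):
    """Optimal sum-of-pairs score of aligning three sequences at once."""
    seqs = (a, b, c)
    # Seven column types: each sequence contributes a residue (1) or a gap (0),
    # excluding the all-gap column.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    best = {(0, 0, 0): 0}
    # product() visits cells in lexicographic order, so every predecessor
    # of a cell has already been filled when the cell is reached.
    for i, j, k in product(range(len(a) + 1), range(len(b) + 1), range(len(c) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        cell = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            col = [s[p] if d else "-"
                   for s, p, d in zip(seqs, (pi, pj, pk), (di, dj, dk))]
            col_score = (pair_score(col[0], col[1]) +
                         pair_score(col[0], col[2]) +
                         pair_score(col[1], col[2]))
            cell = max(cell, best[pi, pj, pk] + col_score)
        best[i, j, k] = cell
    return best[len(a), len(b), len(c)]


print(three_way_score("ACGT", "AGT", "ACGT"))  # 6
```

Each cell looks at seven predecessors instead of three, so the work grows as O(N^3); this is why reducing the space from cubic to quadratic matters for longer sequences.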
7Motivation
- CrossWA: combine three-sequence and pairwise alignments
- Minimize the problems of progressive pairwise MSA
  - Use three-sequence alignment to reduce the effect of the initial branching order.
- Increase the accuracy of alignment
  - Three-sequence alignment may obtain more accurate alignments.
  - Keep pairwise alignment because three-sequence alignment is not always better than pairwise alignment.
  - For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty. [Thompson, Bioinformatics 1995]
  - Introduce a position-specific gap penalty into three-sequence alignment, which differs from the aln3nn algorithm.
- Avoid increasing the computing time
8Motivation (cont)
- Comparison of three protein sequences among
different methods
9Motivation (cont)
- Three-sequence alignment vs. progressive pairwise MSA with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref1-5)
  - Three-sequence alignment with position-specific gap penalty and sequence weighting
10Motivation (cont)
- Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn) on reference set 1 of BAliBASE 2.0. [Matthias, BMC Bioinformatics 2007, 7]
11General Process of Progressive Multiple Sequence Alignment
- Unaligned sequences
- Step 1. Calculating distance matrix
- Step 2. Constructing guide tree
- Step 3. Alignment
  - Aligning pairs of sequences or groups along the branching order
- Aligned sequences
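The way a guide tree determines the alignment order can be sketched as a post-order traversal. The sketch below only records which groups would be aligned at each step; it performs no actual alignment. The nested-tuple tree encoding, the sequence names A to E, and the function name are assumptions made for the example.

```python
def progressive_order(tree, steps=None):
    """Return the sequences under this node and the merge steps in post-order.

    A leaf is a sequence name (str); an internal node is a pair (left, right).
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return [tree], steps
    left, _ = progressive_order(tree[0], steps)
    right, _ = progressive_order(tree[1], steps)
    # At this point both child groups are "aligned"; record the step that
    # would align the two groups with each other.
    steps.append((tuple(left), tuple(right)))
    return left + right, steps


guide_tree = (("A", "B"), ("C", ("D", "E")))  # hypothetical guide tree
group, steps = progressive_order(guide_tree)
for left, right in steps:
    print("align", left, "with", right)
```

For this tree the steps are A with B, D with E, C with (D, E), and finally (A, B) with (C, D, E), so errors made in the earliest steps are carried into every later one.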
12Algorithm
- Process of CrossWA
  - Step 1. Construct the distance matrix.
  - Step 2. Build the guide tree by Neighbour-Joining.
    - Sequence weights are calculated.
  - Step 3. Build a new guide tree modified from the guide tree of Step 2.
    - Branches are changed to allow both three-sequence and pairwise alignments.
    - Sequence weights are recalculated.
  - Step 4. Alignment.
    - Pairwise alignment
    - Three-sequence alignment
      - Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.
13Algorithm (cont)
- Unaligned sequences
- Step 1. Calculating distance matrix
- Step 2. Constructing guide tree
- Step 3. Constructing new tree modified from the guide tree in step 2
- Step 4. Alignment
  - Aligning pairs or triples of sequences (or groups) along the branching order of the new tree
  - Progressive pairwise MSA vs. three-sequence alignment
- Aligned sequences
14Algorithm (cont)
Type I
Type II
Type III
15Algorithm (cont)
- The evaluation of three-sequence alignment
  - If SP(S) > SP(T), then keep S
  - If SP(T) > SP(S), then keep T
- Progressive pairwise alignment of A, B, C: S = Align(A, Align(B, C))
- Three-sequence alignment of A, B, C: T = Align(A, B, C)
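The selection rule on this slide can be sketched as follows. The SP function here is a plain sum-of-pairs score with assumed values (match 1, mismatch -1, gap -2), not the weighted, matrix-based score a real aligner would use, and keeping S on a tie is an assumption, since the slide does not say what happens when the scores are equal.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue              # gap-gap pairs are ignored
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


def keep_better(S, T):
    """Keep T only if it scores strictly higher than S (tie rule is an assumption)."""
    return T if sp_score(T) > sp_score(S) else S


S = ["ACGT", "A-GT", "ACGT"]  # hypothetical progressive pairwise result
T = ["ACGT", "AG-T", "ACGT"]  # hypothetical three-sequence result
print(sp_score(S), sp_score(T))  # 6 2
print(keep_better(S, T))         # ['ACGT', 'A-GT', 'ACGT']
```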
16Algorithm (cont)
- Modification of sequence weights
  - Sequence weights are calculated in the same way as in ClustalW.
[Figure: guide trees with labelled nodes A, B, C and D]
Weight of Hba_Human = 0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = 0.194
Length between node A and node C = 0.219 + 0.061 = 0.280
Weight of Hba_Human = 0.055 + 0.280/2 + 0.077/5 = 0.210
- Gap penalty strategy
  - Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).
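The weight arithmetic on this slide follows the ClustalW scheme: each branch on the path from a sequence to the root contributes its length divided by the number of sequences that share it. The sketch below reproduces the two Hba_Human calculations; the function name and the list-of-pairs encoding of the path are assumptions. With the rounded branch lengths shown on the slide, the first sum comes to about 0.193 rather than the reported 0.194, presumably because the slide's figure was computed from unrounded lengths.

```python
def sequence_weight(path):
    """ClustalW-style weight: sum of branch_length / sequences_sharing_branch.

    path is a list of (branch_length, n_sequences_sharing) pairs from the
    leaf up to the root.
    """
    return sum(length / shared for length, shared in path)


# Branch lengths and sharing counts as printed on the slide.
original_tree = [(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)]
# After the branch change: 0.219 + 0.061 = 0.280 and 0.015 + 0.062 = 0.077.
modified_tree = [(0.055, 1), (0.280, 2), (0.077, 5)]

print(round(sequence_weight(original_tree), 3))  # 0.193
print(round(sequence_weight(modified_tree), 3))  # 0.21
```

Changing the branching order changes how many sequences share each branch, which is why the weights have to be recalculated for the new tree.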
17Experiments
- System environment
  - Linux (AMD Opteron 250, 2.4 GHz, with 512 MB of memory)
- Data source
  - BAliBASE 2.0
    - Reference sets 1-5. [T-Coffee, Muscle, Probcons, aln3nn, etc.]
    - Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed. [Muscle]
18Experiments (cont)
- Scoring functions
  - Sum-of-pairs (SP)
  - Total Column Score (TC)
  - Proportion probability (%)
    - No. of best alignments of the method / No. of total test sets
- Compared algorithms
  - CrossWAfast, CrossWAfull, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6.
  - CrossWAfast uses only type I of the branch changing rule.
  - CrossWAfull uses all types of the branch changing rule.
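The proportion measure defined on this slide can be computed as in the sketch below. It assumes that when several methods tie for the best score on a test set, each of them is credited; the slide does not state the tie rule. The method names and scores in the example are made up.

```python
def best_proportion(scores):
    """Percentage of test sets on which each method achieves the best score.

    scores maps a method name to a list with one score per test set.
    Ties credit every tied method (an assumption).
    """
    methods = list(scores)
    n_sets = len(scores[methods[0]])
    wins = dict.fromkeys(methods, 0)
    for t in range(n_sets):
        top = max(scores[m][t] for m in methods)
        for m in methods:
            if scores[m][t] == top:
                wins[m] += 1
    return {m: 100.0 * wins[m] / n_sets for m in methods}


# Hypothetical SP scores of two methods on four test sets.
example = {"X": [0.9, 0.5, 0.7, 0.6], "Y": [0.8, 0.5, 0.9, 0.4]}
print(best_proportion(example))  # {'X': 75.0, 'Y': 50.0}
```

Under this tie rule the percentages of all methods can add up to more than 100 for a given reference set.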
19Experiments (cont)
- The comparison of SP scores among different alignment methods (each cell: average SP score / proportion of best alignments in %; number of test sets in parentheses)
Method       Ref1 (81)   Ref2 (19)   Ref3 (12)   Ref4 (7)    Ref5 (12)
CrossWAfast  0.774 / 22  0.872 / 5   0.669 / 8   0.657 / 0   0.741 / 0
CrossWAfull  0.777 / 30  0.877 / 15  0.685 / 32  0.658 / 0   0.762 / 17
ClustalW     0.773 / 11  0.876 / 16  0.656 / 0   0.674 / 0   0.762 / 0
T-Coffee     0.787 / 34  0.884 / 37  0.692 / 8   0.718 / 57  0.825 / 50
Muscle       0.776 / 28  0.891 / 37  0.713 / 67  0.728 / 43  0.822 / 33
20Experiments (cont)
- The comparison of TC scores among different alignment methods (each cell: average TC score / proportion of best alignments in %; number of test sets in parentheses)
Method       Ref1 (81)   Ref2 (19)   Ref3 (12)   Ref4 (7)    Ref5 (11)
CrossWAfast  0.665 / 25  0.498 / 16  0.368 / 8   0.333 / 29  0.529 / 18
CrossWAfull  0.671 / 33  0.515 / 32  0.390 / 42  0.301 / 29  0.546 / 18
ClustalW     0.673 / 33  0.489 / 42  0.358 / 17  0.320 / 14  0.543 / 0
T-Coffee     0.676 / 30  0.434 / 21  0.323 / 17  0.409 / 29  0.625 / 45
Muscle       0.679 / 22  0.475 / 32  0.408 / 42  0.396 / 29  0.658 / 36
21Experiments (cont)
- The SP scores of each method for different average-identity ranges in the Reference 1 data set (each cell: average SP score / proportion of best alignments in %)
Method       < 25%       20-40%      > 35%
CrossWAfast  0.491 / 35  0.825 / 19  0.941 / 18
CrossWAfull  0.500 / 40  0.827 / 31  0.942 / 25
ClustalW     0.493 / 10  0.817 / 12  0.938 / 11
T-Coffee     0.477 / 32  0.840 / 35  0.954 / 57
Muscle       0.487 / 25  0.824 / 27  0.948 / 18
22Experiments (cont)
- The TC scores of each method for different average-identity ranges in the Reference 1 data set (each cell: average TC score / proportion of best alignments in %)
Method       < 25%       20-40%      > 35%
CrossWAfast  0.208 / 28  0.714 / 20  0.891 / 21
CrossWAfull  0.303 / 33  0.712 / 24  0.893 / 46
ClustalW     0.324 / 38  0.713 / 12  0.887 / 29
T-Coffee     0.274 / 33  0.744 / 36  0.905 / 33
Muscle       0.299 / 19  0.725 / 36  0.907 / 46
23Experiments (cont)
- The performance of CrossWA with 20 sequences
24Experiments (cont)
- The Performance of CrossWA with 40 sequences
25Experiments (cont)
- Comparison of performance among different methods
with 20 sequences
26Experiments (cont)
- Comparison of performance among different methods
with 40 sequences
27Conclusions
- Three-sequence alignment can produce a better alignment than pairwise alignment, but not for all data sets.
- Combining three-sequence alignment and pairwise alignment keeps the better alignment at every alignment step in progressive MSA.
- The experimental results show that CrossWA can be another useful tool for aligning multiple sequences.
- CrossWA can be used to align DNA sequences.
- For aligning genome data, computing time is a problem. It can be solved by parallel programming. [CY Lin, ICPP 2007]
28Web service
http://140.114.91.10/Genome
29Reference
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. [Needleman, J Mol Biol 1970]
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. [Smith, J Mol Biol 1981]
- Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82:3073-3077. [Murata, PNAS 1985]
- Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 327-337. [Gotoh, J Theor Biol 1986]
- Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993]
- Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2:161-167. [Makoto, Bioinformatics 1993]
30Reference (cont)
- CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, 160-165. [CY Lin, CMCT 2006]
- CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing 2007. [CY Lin, ICPP 2007]
- Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557-10562. [Loytynoja, PNAS 2005]
- Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics 2007. [Matthias, BMC Bioinformatics July, 2007]
- Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11:181-186. [Thompson, Bioinformatics 1995]
31- Thank you for your attention
Che-Lun Hung allen_at_sslab.cs.nthu.edu.tw Chun-Yuan Lin cylin_at_sslab.cs.nthu.edu.tw
Yeh-Ching Chung ychung_at_cs.nthu.edu.tw Chuan Yi Tang cytang_at_cs.nthu.edu.tw
32- Position-specific gap penalty vs. affine gap penalty in three-sequence alignment (300 test sets)
Method                                       Ave SP  Ave TC
Three Align (position-specific gap penalty)  0.81    0.75
Three Align (affine gap penalty)             0.75    0.70
33- Algorithm of three-sequence alignment with
position-specific gap penalty