CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment

1
CrossWA: A new approach of combining pairwise and three-sequence alignments to improve the accuracy for highly divergent sequence alignment
  • Che-Lun Hung, Chun-Yuan Lin, Yeh-Ching Chung, and
    Chuan Yi Tang
  • National Tsing Hua University, Taiwan

Sixth International Conference on Bioinformatics (InCoB 2007), HKUST, Hong Kong
2
Outline
  • Introduction
  • Motivation
  • Algorithm
  • Experiments
  • Conclusions

3
Introduction
  • Multiple sequence alignment (MSA)
    • NP-hard problem
  • Heuristic methods for MSA
    • Progressive methods: ClustalW, T-Coffee, POA, etc.
    • Iterative methods: Muscle, DIALIGN, etc.
    • Probabilistic methods: Probcons, Hmmt, Muscle, etc.
    • Anchor-based methods: MAFFT, Align-m, etc.

4
Introduction (cont)
  • Pairwise alignment
    • Uses dynamic programming to find the optimal alignment [Needleman, J Mol Biol 1970; Smith, J Mol Biol 1981]
  • Three-sequence alignment
    • More accurate than pairwise alignment [Murata, PNAS 1985]
    • Linear gap penalty introduced [Gotoh, J Theor Biol 1986]
    • Space reduced from O(N^3) to O(N^2) with affine gap penalty [Huang, ACM 1993]
    • Useful for MSA [Makoto, Bioinformatics 1993; CY Lin, CMCT 2006, ICPP 2007]
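The dynamic programming behind pairwise alignment can be sketched as below. This is a minimal Needleman-Wunsch illustration, not the code used in CrossWA: the function name `needleman_wunsch` and the toy scores (match +1, mismatch -1, linear gap -1) are assumptions made for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a linear gap penalty (toy scores, assumed)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("GATTACA", "GATACA")` returns the optimal score together with the two gapped strings; real aligners replace the toy scores with a substitution matrix and affine or position-specific gap penalties.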

5
Introduction (cont)
  • Progressive multiple sequence alignment (progressive pairwise MSA)
    • Aligns pairs of sequences following the branching order of the guide tree until all sequences are aligned.
    • The resulting alignment is affected by the initial branching order.
  • Problems with gaps
    • Once introduced, a gap is never removed.
    • An insertion gap may be counted multiple times [Loytynoja, PNAS 2005].

6
Introduction (cont)
  • Progressive triple MSA: aln3nn
    • Published by Matthias in BMC Bioinformatics, July 2007.
    • Every alignment step is a three-sequence alignment.
    • The three-sequence alignment uses Huang's algorithm with the same affine gap penalty [Huang, ACM 1993].
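A three-sequence alignment extends the pairwise recurrence to a three-dimensional table with seven possible moves per cell. The sketch below is only an illustration of that idea: it uses a linear gap penalty and O(N^3) space, so it is neither Huang's quadratic-space algorithm nor the affine-gap scheme of aln3nn. The names `pair_score` and `align3_score` and the toy scores are assumptions.

```python
from itertools import product

def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols in a column; '-' denotes a gap (toy scores, assumed)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def align3_score(a, b, c):
    """Optimal sum-of-pairs score for three sequences with a linear gap penalty."""
    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    # best[i][j][k] = best score for aligning a[:i], b[:j], c[:k]
    best = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    best[0][0][0] = 0
    # Seven moves: each sequence either consumes a residue (1) or takes a gap (0)
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(la + 1), range(lb + 1), range(lc + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = a[i - 1] if di else "-"
            y = b[j - 1] if dj else "-"
            z = c[k - 1] if dk else "-"
            column = pair_score(x, y) + pair_score(x, z) + pair_score(y, z)
            best[i][j][k] = max(best[i][j][k], best[pi][pj][pk] + column)
    return best[la][lb][lc]
```

The cubic table is what makes three-sequence alignment expensive; the quadratic-space and parallel variants cited in the slides address exactly this cost.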

7
Motivation
  • CrossWA: combine three-sequence and pairwise alignments
  • Minimize the problems of progressive pairwise MSA
    • Use three-sequence alignment to reduce the effect of the initial branching order.
  • Increase the accuracy of alignment
    • Three-sequence alignment may obtain more accurate alignments.
    • Keep pairwise alignment, because three-sequence alignment is not always better than pairwise alignment.
    • For pairwise alignment, a position-specific gap penalty is more accurate than an affine gap penalty [Thompson, Bioinformatics 1995].
    • Introduce a position-specific gap penalty into three-sequence alignment, which differs from the aln3nn algorithm.
  • Avoid increasing the computing time

8
Motivation (cont)
  • Comparison of three protein sequences among
    different methods

9
Motivation (cont)
  • Three-sequence alignment vs. progressive pairwise MSA with three sequences (430 test sets, randomly selected from BAliBASE 2.0 Ref1-5)
  • Three-sequence alignment with position-specific gap penalty and sequence weighting

10
Motivation (cont)
  • Progressive pairwise MSA (ClustalW) vs. progressive triple MSA (aln3nn) on reference set 1 of BAliBASE 2.0 [Matthias, BMC Bioinformatics 2007, 7]

11
General process of progressive multiple sequence alignment
  • Unaligned sequences
  • Step 1. Calculating distance matrix
  • Step 2. Constructing guide tree
  • Step 3. Alignment: aligning pair sequences or groups along the branching order
  • Aligned sequences
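Steps 1 and 2 of this process can be sketched as follows. The example is a rough illustration under stated assumptions, not the procedure of any particular tool: the alignment-free k-mer distance and the average-linkage (UPGMA-style) clustering are stand-ins chosen for brevity, the names `kmer_distance` and `guide_tree` are hypothetical, and every sequence is assumed to be at least `k` residues long.

```python
from itertools import combinations

def kmer_distance(a, b, k=2):
    """1 - (shared k-mers / k-mers in the shorter sequence); a crude distance."""
    ka = [a[i:i + k] for i in range(len(a) - k + 1)]
    kb = [b[i:i + k] for i in range(len(b) - k + 1)]
    shared = sum(min(ka.count(w), kb.count(w)) for w in set(ka))
    return 1.0 - shared / min(len(ka), len(kb))

def guide_tree(seqs):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    # Step 1: distance matrix over all pairs of sequence names
    dist = {frozenset(p): kmer_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    # Each cluster is (subtree, member names)
    clusters = [(name, [name]) for name in seqs]

    def avg(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    # Step 2: repeatedly join the two closest clusters
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]
```

For example, `guide_tree({"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTGGGG"})` joins A and B first because they share most k-mers, and step 3 would then align along that branching order.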
12
Algorithm
  • Process of CrossWA
    • Step 1. Construct the distance matrix.
    • Step 2. Build the guide tree (Neighbour-Joining).
      • Sequence weights are calculated.
    • Step 3. Build a new guide tree modified from the guide tree.
      • Branches are changed for three-sequence and pairwise alignments.
      • Sequence weights are recalculated.
    • Step 4. Alignment.
      • Pairwise alignment
      • Three-sequence alignment
        • Compare with the alignment produced by progressive pairwise alignment of the same three sequences and select the better one.

13
Algorithm (cont)
  • Unaligned sequences
  • Step 1. Calculating distance matrix
  • Step 2. Constructing guide tree
  • Step 3. Constructing new tree modified from the guide tree in step 2
  • Step 4. Alignment: aligning pair or three sequences (or groups) along the branching order of the new tree (three-sequence alignment vs. progressive pairwise MSA)
  • Aligned sequences
14
Algorithm (cont)
  • The branch changing rule
    • Type I
    • Type II
    • Type III
15
Algorithm (cont)
  • The evaluation of three-sequence alignment
    • S = Align(B, C), then S = Align(A, S) (progressive pairwise alignment)
    • T = Align(A, B, C) (three-sequence alignment)
    • If SP(S) > SP(T) then keep S
    • If SP(T) > SP(S) then keep T
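The selection rule can be illustrated with a small sum-of-pairs scorer. This is a sketch under assumptions, not the CrossWA scoring code: the toy scores (match +1, mismatch -1, gap -1), the names `sp_score` and `keep_better`, and the tie-breaking in favour of S are all choices made for the example; the slides do not say how ties are handled.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # a gap aligned with a gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total

def keep_better(s, t):
    """Return the alignment with the higher SP score (ties keep S, an assumption)."""
    return s if sp_score(s) >= sp_score(t) else t

# S: result of the progressive pairwise route, T: result of the three-sequence route
S = ["ACG-T", "ACG-T", "A-GGT"]
T = ["ACGT", "ACGT", "AGGT"]
best = keep_better(S, T)
```

Here T scores higher than S under the toy scheme, so `best` is T; with a different input the pairwise result S could win, which is why both routes are kept.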
16
Algorithm (cont)
  • Modification of sequence weights
    • The calculation of sequence weights is the same as in ClustalW.
    • Weight of Hba_Human in the guide tree = 0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = 0.194
    • Length between node A and node C = 0.219 + 0.061 = 0.280
    • Weight of Hba_Human in the new tree = 0.055 + 0.280/2 + 0.077/5 = 0.210
  • The strategy of gap penalty
    • Introduce a position-specific gap penalty into three-sequence alignment (modified from ClustalW).
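The weight arithmetic above follows the ClustalW scheme, in which each branch on the path from a leaf to the root contributes its length divided by the number of sequences sharing it. A small sketch, assuming that reading of the slide; the function name `sequence_weight` and the `(length, sharing)` representation are hypothetical.

```python
def sequence_weight(path):
    """Sum of branch_length / n_sharing over the branches from a leaf to the root."""
    return sum(length / n_sharing for length, n_sharing in path)

# Path for Hba_Human in the guide tree, as (branch length, sequences sharing it)
guide_tree_path = [(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)]

# Path in the new tree: the A-to-C branch has length 0.219 + 0.061 = 0.280;
# the 0.077 term is taken directly from the slide
new_tree_path = [(0.055, 1), (0.280, 2), (0.077, 5)]

w_guide = sequence_weight(guide_tree_path)
w_new = sequence_weight(new_tree_path)
```

With the rounded branch lengths shown on the slide, `w_new` comes out at about 0.210, and `w_guide` agrees with the quoted 0.194 to within rounding of the branch lengths.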

17
Experiments
  • System environment
    • Linux (AMD Opteron 250, 2.4 GHz, with 512 MB of memory)
  • Data source
    • BAliBASE 2.0
    • Reference sets 1-5, as in T-Coffee, Muscle, Probcons, aln3nn, etc.
    • Reference sets 6-8 contain repeats, inversions and transmembrane helices, for which none of the tested algorithms is designed [Muscle]

18
Experiments (cont)
  • Scoring functions
    • Sum-of-pairs (SP) score
    • Total column (TC) score
    • Proportion (%)
      • Number of test sets on which the method gives the best alignment / total number of test sets
  • Compared algorithms
    • CrossWAfast, CrossWAfull, ClustalW 1.83, T-Coffee 5.05, Muscle 3.6
    • CrossWAfast uses only Type I of the branch changing rule.
    • CrossWAfull uses all types of the branch changing rule.
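The SP and TC scores listed above can be sketched as follows, assuming the usual benchmark-style definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is a simplified illustration, not the BAliBASE scoring program, and the function names are hypothetical.

```python
from itertools import combinations

def columns(alignment):
    """Turn gapped strings into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for s, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        cols.append(tuple(entry))
    return cols

def aligned_pairs(alignment):
    """Set of (seq_a, idx_a, seq_b, idx_b) residue pairs placed in the same column."""
    pairs = set()
    for col in columns(alignment):
        for (s1, i1), (s2, i2) in combinations(enumerate(col), 2):
            if i1 is not None and i2 is not None:
                pairs.add((s1, i1, s2, i2))
    return pairs

def sp_accuracy(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)

def tc_accuracy(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

TC is the stricter of the two: a single misplaced residue spoils a whole column, which is consistent with the TC values in the result tables being lower than the SP values.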

19
Experiments (cont)
  • Comparison of SP scores among different alignment methods (number of test sets in parentheses; % = proportion of test sets on which the method gives the best alignment)

Method        Ref1 (81)    Ref2 (19)    Ref3 (12)    Ref4 (7)     Ref5 (12)
              SP      %    SP      %    SP      %    SP      %    SP      %
CrossWAfast   0.774  22    0.872   5    0.669   8    0.657   0    0.741   0
CrossWAfull   0.777  30    0.877  15    0.685  32    0.658   0    0.762  17
ClustalW      0.773  11    0.876  16    0.656   0    0.674   0    0.762   0
T-Coffee      0.787  34    0.884  37    0.692   8    0.718  57    0.825  50
Muscle        0.776  28    0.891  37    0.713  67    0.728  43    0.822  33
20
Experiment (cont)
  • Comparison of TC scores among different alignment methods (number of test sets in parentheses; % = proportion of test sets on which the method gives the best alignment)

Method        Ref1 (81)    Ref2 (19)    Ref3 (12)    Ref4 (7)     Ref5 (11)
              TC      %    TC      %    TC      %    TC      %    TC      %
CrossWAfast   0.665  25    0.498  16    0.368   8    0.333  29    0.529  18
CrossWAfull   0.671  33    0.515  32    0.390  42    0.301  29    0.546  18
ClustalW      0.673  33    0.489  42    0.358  17    0.320  14    0.543   0
T-Coffee      0.676  30    0.434  21    0.323  17    0.409  29    0.625  45
Muscle        0.679  22    0.475  32    0.408  42    0.396  29    0.658  36
21
Experiments (cont)
  • SP scores of each method for different average sequence identities in the Reference 1 data set (% = proportion of test sets on which the method gives the best alignment)

Method        < 25%        20-40%       > 35%
              SP      %    SP      %    SP      %
CrossWAfast   0.491  35    0.825  19    0.941  18
CrossWAfull   0.500  40    0.827  31    0.942  25
ClustalW      0.493  10    0.817  12    0.938  11
T-Coffee      0.477  32    0.840  35    0.954  57
Muscle        0.487  25    0.824  27    0.948  18
22
Experiments (cont)
  • TC scores of each method for different average sequence identities in the Reference 1 data set (% = proportion of test sets on which the method gives the best alignment)

Method        < 25%        20-40%       > 35%
              TC      %    TC      %    TC      %
CrossWAfast   0.208  28    0.714  20    0.891  21
CrossWAfull   0.303  33    0.712  24    0.893  46
ClustalW      0.324  38    0.713  12    0.887  29
T-Coffee      0.274  33    0.744  36    0.905  33
Muscle        0.299  19    0.725  36    0.907  46
23
Experiments (cont)
  • The performance of CrossWA with 20 sequences

24
Experiments (cont)
  • The Performance of CrossWA with 40 sequences

25
Experiments (cont)
  • Comparison of performance among different methods
    with 20 sequences

26
Experiments (cont)
  • Comparison of performance among different methods
    with 40 sequences

27
Conclusions
  • Three-sequence alignment can produce a better alignment than pairwise alignment, but not for all data sets.
  • Combining three-sequence and pairwise alignment keeps the better alignment at every alignment step in progressive MSA.
  • The experimental results suggest that CrossWA can be another useful tool for aligning multiple sequences.
  • CrossWA can be used to align DNA sequences.
  • For aligning genome data, computing time is a problem. It can be addressed by parallel programming [CY Lin, ICPP 2007].

28
Web service
http://140.114.91.10/Genome
29
Reference
  • Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. [Needleman, J Mol Biol 1970]
  • Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. [Smith, J Mol Biol 1981]
  • Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci U S A 1985, 82:3073-3077. [Murata, PNAS 1985]
  • Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 327-337. [Gotoh, J Theor Biol 1986]
  • Huang X: Alignment of three sequences in quadratic space. Applied Computing Review 1993, 1:7-11. [Huang, ACM 1993]
  • Makoto H, Maski H, Masato I, Tomoyuki T: MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. J Mol Biol 1993, 2:161-167. [Makoto, Bioinformatics 1993]

30
Reference (cont)
  • CY Lin, CT Huang, YC Chung, Chuan YT: Parallel Three-sequence Alignment with Space-efficient. Proceedings of the 23rd Workshop on Combinatorial Mathematics and Computation Theory, Chang-Hua, Taiwan, April 2006, 160-165. [CY Lin, CMCT 2006]
  • CY Lin, CT Huang, YC Chung, Chuan YT: Efficient Parallel Algorithm for Optimal Three-Sequences Alignment. International Conference on Parallel Processing 2007. [CY Lin, ICPP 2007]
  • Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557-10562. [Loytynoja, PNAS 2005]
  • Matthias K, Peter FS: Progressive multiple sequence alignments from triplets. BMC Bioinformatics 2007. [Matthias, BMC Bioinformatics July, 2007]
  • Thompson JD: Introducing variable gap penalties to sequence alignment in linear space. Bioinformatics 1995, 11:181-186. [Thompson, Bioinformatics 1995]

31
  • Thank you for your attention

Che-Lun Hung allen_at_sslab.cs.nthu.edu.tw Chun-Yuan Lin cylin_at_sslab.cs.nthu.edu.tw
Yeh-Ching Chung ychung_at_cs.nthu.edu.tw Chuan Yi Tang cytang_at_cs.nthu.edu.tw
32
  • Position-specific gap penalty vs. affine gap penalty in three-sequence alignment (300 test sets)

Method                                        Ave SP   Ave TC
Three Align (position-specific gap penalty)   0.81     0.75
Three Align (affine gap penalty)              0.75     0.70
33
  • Algorithm of three-sequence alignment with
    position-specific gap penalty