# Multiple Sequence Alignment Based on Compact Set - PowerPoint PPT Presentation


## Multiple Sequence Alignment Based on Compact Set

Slides: 35
Provided by: jmch3

Transcript and Presenter's Notes

Title: Multiple Sequence Alignment Based on Compact Set

1
Multiple Sequence Alignment Based on Compact Set
• Department of Computer Science
• National Tsing Hua University
• Chuan Yi Tang

2
Multiple Sequence Alignment
• Given a set of sequences, the MSA problem is to
find an alignment of the sequences such that some
objective function is minimized
• e.g., the Sum-of-Pairs (SP) score
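As a rough illustration of the objective named on this slide, the sketch below sums a pairwise cost over every pair of rows and every column of an alignment. The unit costs (mismatch 1, gap 2) and the function names are assumptions made for the example, not values taken from the talk.

```python
from itertools import combinations

def pair_cost(a: str, b: str, mismatch: int = 1, gap: int = 2) -> int:
    """Cost of one aligned character pair (hypothetical unit costs)."""
    if a == b:
        return 0          # matches and gap-gap pairs cost nothing
    if a == "-" or b == "-":
        return gap        # a residue aligned against a gap
    return mismatch       # two different residues

def sp_score(alignment: list[str]) -> int:
    """Sum the pairwise cost over every pair of rows and every column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        pair_cost(x, y)
        for r1, r2 in combinations(alignment, 2)
        for x, y in zip(r1, r2)
    )

rows = ["AC-T", "ACGT", "A-GT"]
print(sp_score(rows))  # 2 + 4 + 2 = 8 with the costs above
```

With a cost formulation like this one, a lower SP score is better; a similarity-based scoring (as used later in the talk) flips the direction.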

3
MSA with SP-Score: Exact Algorithm and Heuristics
• k: number of sequences; n: length of the sequences
• Exact (using dynamic programming)
• O((2n)^k): D. Sankoff, Simultaneous solution of the RNA
folding, alignment and protosequence problems,
SIAM J. Appl. Math. (1985)
• Heuristics
• D.F. Feng, R.F. Doolittle, Progressive sequence
alignment as a prerequisite to correct
phylogenetic trees, J. Mol. Evol. 25, 351-360
(1987)
• S.F. Altschul, D.J. Lipman, Trees, stars, and multiple
biological sequence alignment, SIAM J. Appl.
Math. (1989)
• D.J. Lipman, S.F. Altschul, A tool for multiple
sequence alignment, Proc. Natl. Acad. Sci. U.S.A. (1989)
• S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of
multiple sequence comparison methods, Bull. Math.
Biol. (1992)

4
MSA with SP-Score: Complexity
• J. Comput. Biol. 1994 Winter; 1(4): 337-48
• On the complexity of multiple sequence
alignment.
• Wang L., Jiang T.
• McMaster University, Hamilton, Ontario, Canada.
• We study the computational complexity of two
popular problems in multiple sequence alignment
• 1. Multiple alignment with SP-score =>
NP-complete (non-metric)
• 2. Multiple tree alignment => MAX SNP-hard
• Theoretical Computer Science 259 (2001) 63-79
• The complexity of multiple sequence alignment
with SP-score that is a metric
• Paola Bonizzoni, Gianluca Della Vedova
• 1. Multiple alignment with SP-score =>
NP-complete (metric)

5
MSA with SP-Score: Approximation
• Approximation algorithms
• Performance ratio of 2 - 2/k: D. Gusfield, Efficient
methods for multiple sequence alignment with
guaranteed error bounds, Bull. Math. Biol. (1993)
• Performance ratio of 2 - 3/k: P. Pevzner, Multiple
alignment, communication cost, and graph
matching, SIAM J. Appl. Math. (1992)
• Performance ratio of 2 - l/k (assembling l-way
alignments, l < k): V. Bafna, E.L. Lawler and
P. Pevzner, Approximation algorithms for multiple
sequence alignment, Theor. Comput. Sci. (1997)
• Polynomial Time Approximation Scheme (PTAS)
• MSA within a constant band that allows only a
constant number of insertion and deletion gaps of
arbitrary length per sequence on average: M.
Li, B. Ma and L. Wang, Near optimal alignment
within a band in polynomial time, STOC 2000.

6
Compact Set Definition
• Let S be a set of n objects S1, S2, S3, ..., Sn, and let
D(Si, Sj) denote the distance between Si and Sj in
the distance matrix D.
• Consider any subset C of S. If every distance
between an element in C and an element not in C is
larger than the longest distance within C, then C is
called a compact set.
• Properties
• The entire set S is a compact set.
• Each set consisting of a single object is also a
compact set.
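The definition above can be checked directly from a distance matrix. The following is a minimal brute-force sketch: the function names and the 5-object matrix are made up for illustration, and the enumeration over all subsets is exponential, so it only suits very small inputs (the Dekel, Hu and Ouyang paper cited later in the talk gives an efficient algorithm).

```python
from itertools import combinations

def is_compact(D, members):
    """True if every border distance exceeds the longest inside distance."""
    inside = set(members)
    outside = [j for j in range(len(D)) if j not in inside]
    if len(inside) <= 1 or not outside:
        return True  # singletons and the entire set are compact by definition
    longest_inside = max(D[i][j] for i, j in combinations(sorted(inside), 2))
    shortest_border = min(D[i][j] for i in inside for j in outside)
    return shortest_border > longest_inside

def nontrivial_compact_sets(D):
    """All compact sets with at least 2 and fewer than n members."""
    n = len(D)
    return [
        set(c)
        for size in range(2, n)
        for c in combinations(range(n), size)
        if is_compact(D, c)
    ]

# Hypothetical symmetric distance matrix over objects 0..4.
D = [
    [0, 1, 3, 8, 9],
    [1, 0, 4, 8, 9],
    [3, 4, 0, 8, 9],
    [8, 8, 8, 0, 2],
    [9, 9, 9, 2, 0],
]
print(nontrivial_compact_sets(D))  # [{0, 1}, {3, 4}, {0, 1, 2}]
```

In this example {0, 1} is nested inside {0, 1, 2}, and {3, 4} is disjoint from both.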

7
Compact Set Example
[Figure: distance matrix over S1 to S6 with Compact Set 1, Compact Set 2 and Compact Set 3 marked. For compact set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
8
Compact Set Example(cont)
• Compact sets are hierarchical: any two compact
sets are either disjoint or nested

9
MSA Compact Set
• Consider an example with 12 protein sequences
• S1 MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTA
RQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDAN
I
• S2 SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDV
TVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
VTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
• S4 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTF
RRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTL
VFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDA
LDVANKIGII
• S5 MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKY
ATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIK
• S6 MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQ
LFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
PKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVF
VANKIGII
• S8 MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKY
ARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIK
• S9 MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSV
TFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNN
DALDIANKIGFI
• S10 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPT
FRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNT
LVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYD
ALDVANKIGII
• S11 APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKY
RKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAM
KKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAY
IRLSASYDALDTANKMGLV

Original sequence
10
MSA Compact Set(cont)
[Figures: original distance matrix; original compact set tree]
A good MSA should preserve the compact sets as well
11
MSA Compact Set(cont)
• S1 -----------------MAPSAPAKTAKALDAKKKVVKGKRTTHR
RQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA
• S2 ---------------------------------------------
------------------------------------SSIIDYPLVTEKAM
DEMDFQNKLQFIVDIDAAKPEIRDV
• S3 ---------------------------------------------
-----------------------------------SWDVIKHPHVTEKAM
• S4 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKK
KKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES
• S5 ----------------------MAPSTKATAAKKAVVKGTNGKKA
LKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAM
KK
• S6 ---------------------------------------------
---------------------------------MDAFDVIKTPIVSEKTM
KLIEEENRLVFYVERKATKEDIKEA
SQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE
• S8 ----------------------MAPSTKAASAKKAVVKGSNGSKA
LKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAM
KK
• S9 ------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQ
RRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN
• S10 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHK
KKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE
• S11 -----------------------APSAKATAAKKAVVKGTNGK
AMKK
KRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP

MSA by MSA1
12
MSA Compact Set(cont)
• S1 ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--
R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQ
HPL
• S2 --------------------------------------------
-------------------------------------------S--SIID
YPLVTEKAMDEMDFQNKLQFIVDID- AAK
• S3 --------------------------------------------
-------------------------------------------SW-DVIK
HPHVTEKAMNDMDFQNKLQFAVD-DRA
• S4 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--
K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIK
FP
• S5 -----------------MAPST-KATAAKKAVVKGT-NG--K--
KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIE
QPITSET
• S6 --------------------------------------------
-----------------------------------------MDAF-DVIK
TPIVSEKTMKLIEEENRLVFYVER-KATK
KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILK
YP
• S8 -----------------MAPST-KAASAKKAVVKGS-NG--S--
KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIV
APIASET
• S9 MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--
TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR
• S10 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK-
-K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AII
KF
• S11 ------------------APSA-KATAAKKAVVKGT-NG--K-
EQPITSET
TKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR

MSA by MSA2
13
MSA Compact Set(cont)
Compact Set Tree by MSA1
Distance Matrix by MSA1
14
MSA Compact Set(cont)
Compact Set Tree by MSA2
Distance Matrix by MSA2
15
Measure of Compact Set Preservation
• How can we measure compact set preservation
quantitatively?
• N1: number of original compact set relations
• N2: number of those relations preserved after MSA
• Estimate by compact set preservation: N2 / N1

16
Measure of Compact Set Preservation(cont)
Original compact set relations:
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3,
4 5 1, 4 5 2, 4 5 3
[Figure: distance matrix]
N1 = 10
17
Measure of Compact Set Preservation(cont)
Compact set relations after MSA:
1 2 4, 1 2 5, 1 4 3, 3 5 1, 2 4 3, 3 5 2, 1 2 3,
1 4 5, 2 4 5, 3 5 4

Original compact set relations:
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3,
4 5 1, 4 5 2, 4 5 3
[Figures: distance matrix after MSA; compact set tree after MSA]
N2 = 10 - 7 = 3
Estimate by compact set preservation: 3/10
18
Why Pairwise Compact Sets?
• The evolutionary tree is the real judge
• An evolutionary tree tends to minimize the
total length of its evolutionary edges (the tree
size) derived from the pairwise distances, which
makes it appear compact
• This holds in experiments

19
Compact Set Relation Preserved Rate for
Evolutionary Tree
(# of relations preserved in the evolutionary tree) /
(# of compact set relations of the pairwise distance)
Larger is better
20
Compact Set Evaluation Algorithm
• Step 1: Construct the original compact set tree T
and the compact set tree after MSA, T'.
• Step 2: Traverse T' in preorder to generate the
compact set relations after MSA, R', and mark the
entries in the hash table H according to R'.
• Step 3: Traverse T in preorder to generate the
original compact set relations R, and check
which entries of R are marked in the hash
table H.
• Total time complexity: O( ), where n is the
number of sequences
• Reference
• 1. E. Dekel, J. Hu and W. Ouyang, An optimal
algorithm for finding compact sets, Inform.
Process. Lett. 44 (1992) 285-289
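The three steps can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: trees are written as nested tuples of integer labels, a relation is stored as a triple (a, b, c) meaning that a and b share a compact set that excludes c, and a Python set plays the role of the hash table H. The two example trees were chosen so that their relations match the triples listed on the earlier measurement slides.

```python
from itertools import combinations

def leaves(node):
    """Labels of all sequences below a tree node."""
    if isinstance(node, int):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def relations(tree):
    """Preorder walk emitting (a, b, c): a, b in a compact set, c outside it."""
    everything = set(leaves(tree))
    found = set()

    def visit(node, is_root):
        if isinstance(node, int):
            return
        if not is_root:  # the root (entire set) excludes nothing
            inside = sorted(leaves(node))
            outside = everything - set(inside)
            for a, b in combinations(inside, 2):
                for c in outside:
                    found.add((a, b, c))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found

def preservation(original_tree, msa_tree):
    """Return (N2, N1): preserved relations and original relations."""
    marked = relations(msa_tree)          # Step 2: fill the hash table H
    original = relations(original_tree)   # Step 3: generate R and look it up
    return len(original & marked), len(original)

T = (((1, 2), 3), (4, 5))        # assumed original compact set tree
T_msa = (((1, 2), 4), (3, 5))    # assumed compact set tree after MSA
print(preservation(T, T_msa))    # (3, 10), i.e. 3/10 as on the example slide
```

Enumerating triples this way touches at most on the order of n^3 relations for n sequences.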

21
Our Strategy for MSA
• Progressive alignment (Feng and Doolittle,
1987)
• Nearest neighbors first (using the Kruskal merging
order of the minimum spanning tree (MST))
• Set-to-set alignment. Once a gap, always a gap.

[Figure: Kruskal merging order tree over S1 to S4 with merge steps 1, 2 and 3 and the intermediate sets set1 and set2]
Alignment of S1 and S2:
S1 AACAGACTTA-
S2 -ACAGACTTAA
Alignment of S3 and S4:
S3 ----ACAGACTCCA
S4 TTTAAAAGTC----
Set-to-set alignment of the two sets:
S1 ---AACAGACTT-A-
S2 ----ACAGACTT-AA
S3 ----ACAGACTCCA-
S4 TTTAAAAGTC-----
22
Q: Why do we use the MST Kruskal order?
A1: It has a structure similar to that of the compact set tree
[Figure: MST order merge tree compared with the compact set tree]
A2: The MST Kruskal order is easy to obtain
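To illustrate how little is needed to obtain such an order, here is a minimal sketch of Kruskal's algorithm with a union-find structure that records which two groups are joined at each step. The function name and the 4-sequence distance matrix are hypothetical; the talk does not show its own implementation.

```python
def kruskal_merge_order(D):
    """Return the merges (group_a, group_b, distance) in Kruskal order."""
    n = len(D)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise distances, shortest first.
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    groups = {i: [i] for i in range(n)}
    merges = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both ends already in the same group
        merges.append((sorted(groups[ri]), sorted(groups[rj]), w))
        parent[rj] = ri
        groups[ri] = groups[ri] + groups.pop(rj)
    return merges

# Hypothetical distances between four sequences 0..3.
D = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
for a, b, w in kruskal_merge_order(D):
    print(a, "+", b, "at distance", w)
```

Each recorded merge corresponds to one set-to-set alignment in the progressive strategy: the two closest sequences first, then the next closest pair, then the two resulting sets.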
23
Score function
Score components, illustrated on an alignment: match, mismatch, gap-open, gap-extended, begin-gap, end-gap
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
24
Strategy of set-to-set alignment
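• Score(8, 8) is the maximum of three cases:

$$
\mathrm{Score}(8,8)=\max\begin{cases}
\mathrm{Score}(7,7)+(\alpha_8,\beta_8)\\
\mathrm{Score}(7,8)+(\alpha_8,G_3)\\
\mathrm{Score}(8,7)+(G_2,\beta_8)
\end{cases}
$$

$$
(\alpha_8,\beta_8)=(G,C)+(G,-)+(G,G)+(-,C)+(-,-)+(-,G)=(-10)+(-15)+(10)+(-15)+(0)+(-15)=-45
$$

Time complexity of aligning set α to set β:
$O(s_\alpha s_\beta l_\alpha l_\beta)$, here $2\cdot 3\cdot 8\cdot 8$, where $s_\alpha$, $s_\beta$ are the
numbers of sequences in set α and set β
respectively, and $l_\alpha$, $l_\beta$ are the lengths of the
resulting sequences in set α and set β respectively.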
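The recurrence on this slide can be sketched as a small dynamic program over alignment columns. Several simplifications are assumed here: the pair scores (match 10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked example, a single flat gap penalty replaces the separate gap-open, gap-extended, begin-gap and end-gap terms of the score function slide, G2 and G3 are taken to be all-gap columns for the 2-sequence and 3-sequence sets, and only the best score is returned, not the alignment itself.

```python
def pair_score(x, y, match=10, mismatch=-10, gap=-15):
    """Score of one character pair taken across the two sets."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def column_score(col_a, col_b):
    """Sum of pair scores between a column of set A and a column of set B."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)

def set_to_set_score(set_a, set_b):
    """Best score for aligning two already-aligned sets column by column."""
    la, lb = len(set_a[0]), len(set_b[0])
    cols_a = ["".join(row[i] for row in set_a) for i in range(la)]
    cols_b = ["".join(row[j] for row in set_b) for j in range(lb)]
    gap_a = "-" * len(set_a)  # all-gap column inserted into set A
    gap_b = "-" * len(set_b)  # all-gap column inserted into set B

    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return score[la][lb]

# The worked column from the slide: {G, -} against {C, -, G}.
print(column_score("G-", "C-G"))  # -45
```

Existing columns are only ever kept whole or padded with all-gap columns, which is what "once a gap, always a gap" requires. Each table cell costs one column comparison of size s_α × s_β, and there are about l_α × l_β cells.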
25
Time Complexity of our strategy
• The worst case occurs when the binary merge tree is
balanced.
• Total set-to-set time complexity is bounded by
• where l is the length of the resulting sequences
and n is the number of sequences.
• The worst-case time complexity is $O(n^2 l^2)$
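One way to arrive at a bound of this form for the balanced case, offered here as a sketch rather than as the derivation shown on the original slide: assume n is a power of two, every intermediate alignment has length at most l, and each merge of sets with $s_\alpha$ and $s_\beta$ sequences costs $s_\alpha s_\beta l^2$. At depth i of the balanced merge tree there are $2^{i-1}$ merges, each joining two sets of $n/2^i$ sequences, so

$$
\sum_{i=1}^{\log_2 n} 2^{\,i-1}\left(\frac{n}{2^{i}}\right)^{2} l^{2}
= n^{2} l^{2}\sum_{i=1}^{\log_2 n}\frac{1}{2^{\,i+1}}
= \frac{n(n-1)}{2}\,l^{2}
= O(n^{2} l^{2}).
$$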

26
MSA Useful tools
• GCG (Genetics Computer Group) PileUp
• http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
• Clustal W

27
Clustal W
• Pairwise alignment
• Calculate distance matrix
• Construct the unrooted Neighbor-Joining (NJ) tree
• Construct the rooted NJ tree
• rooted at mid-point
• Progressive alignment
• Align following the rooted NJ tree
• set-to-set alignment

28
Experiment
29
SP Score Result
Clustal W and our method give better results than GCG
Larger is better
30
Compact Set Relation Failure Rate: Result
(# of relations not preserved) / (# of source compact
set relations)
Smaller is better
31
Three-point Relative Scale Preserved Rate
For every three species A, B and C, we check whether
their relative distance relation in the original
distance matrix is identical to the one in the MSA
distance matrix.
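A possible reading of this measure, sketched below, is to compare for every triple how its three pairwise distances are ordered in each matrix. Treating the "relative distance relation" as that ordering is an assumption, as are the function names and the two small matrices.

```python
from itertools import combinations

def sign(x, y):
    """-1, 0 or 1 depending on whether x is below, equal to or above y."""
    return (x > y) - (x < y)

def relative_scale(D, a, b, c):
    """How the three pairwise distances among a, b, c compare to each other."""
    ab, ac, bc = D[a][b], D[a][c], D[b][c]
    return (sign(ab, ac), sign(ab, bc), sign(ac, bc))

def three_point_preserved(D_orig, D_msa):
    """Return (number of triples with identical relation, number of triples)."""
    triples = list(combinations(range(len(D_orig)), 3))
    same = sum(
        relative_scale(D_orig, *t) == relative_scale(D_msa, *t) for t in triples
    )
    return same, len(triples)

# Hypothetical matrices: the second differs only in the distance between 1 and 2.
D_orig = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
D_msa = [
    [0, 1, 5, 6],
    [1, 0, 4, 8],
    [5, 4, 0, 2],
    [6, 8, 2, 0],
]
same, total = three_point_preserved(D_orig, D_msa)
print(f"{same}/{total} triples keep their relative scale")
```

Returning the two counts separately avoids a division by zero when fewer than three sequences are given.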
32
I Believe Tree Only
• One might still not believe that the original
pairwise distance is a good judge
• One might believe only the true evolutionary tree

33
Compact Set Relation Failure Rate
Take the 12 protein sequences as an example
(# of relations not preserved) / (# of source
compact set relations)
[Table: failure rate by distance measure and MSA method]
Smaller is better
34
Future Work
• Are our measurement and algorithms really good?
• Simulations and a web service
• Does our MSA by set-to-set alignment satisfy some
approximation property?
• Theoretical proof
• How can we reduce the running time?
• Hardwired dynamic programming
• e.g., PARACEL: http://www.paracel.com/