Multiple Sequence Alignment Based on Compact Set

- Department of Computer Science
- National Tsing Hua University
- Chuan Yi Tang

Multiple Sequence Alignment

- Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function, e.g. the Sum-of-Pairs (SP) score, is minimized.
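The Sum-of-Pairs objective named above can be illustrated with a minimal sketch. The cost values (`mismatch=1`, `gap=2`) and the function names are assumptions chosen for illustration, not values from the slides; lower totals are better, in line with the minimization stated above.

```python
from itertools import combinations

GAP = "-"

def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of aligning two symbols (hypothetical values; lower is better)."""
    if a == b:
        return 0  # identical residues, or two gaps facing each other
    if a == GAP or b == GAP:
        return gap
    return mismatch

def sp_score(alignment):
    """Sum-of-pairs score: add the column-wise cost over every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_cost(a, b)
        for x, y in combinations(alignment, 2)
        for a, b in zip(x, y)
    )

print(sp_score(["AC-T", "ACGT", "A-GT"]))  # 8
```

With k rows of length l this takes O(k^2 l) time, which is only the cost of scoring a given alignment; finding the alignment that minimizes the score is the hard part.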

MSA with SP-Score: Exact Algorithm and Heuristics

- k: number of sequences; n: length of the sequences
- Exactly (using dynamic programming)
- O((2n)^k): D. Sankoff, Simultaneous solution of the RNA folding, alignment and protosequence problems, SIAM J. Appl. Math. (1985)
- Heuristics
- D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25, 351-360 (1987)
- S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM J. Appl. Math. (1989)
- D.J. Lipman, S.F. Altschul, A tool for multiple sequence alignment, Proc. Natl. Acad. Sci. U.S.A. (1989)
- S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol. (1992)

MSA with SP-Score: Complexity

- J. Comput. Biol. 1994 Winter; 1(4): 337-48
- On the complexity of multiple sequence alignment.
- Wang L., Jiang T.
- McMaster University, Hamilton, Ontario, Canada.
- We study the computational complexity of two popular problems in multiple sequence alignment
- 1. multiple alignment with SP-score => NP-complete (non-metric)
- 2. multiple tree alignment => MAX SNP-hard
- Theoretical Computer Science 259 (2001) 63-79
- The complexity of multiple sequence alignment with SP-score that is a metric
- Paola Bonizzoni, Gianluca Della Vedova
- 1. multiple alignment with SP-score => NP-complete (metric)

MSA with SP-Score: Approximation

- Approximation algorithms
- Performance ratio of 2 - 2/k: D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol. (1993)
- Performance ratio of 2 - 3/k: P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM J. Appl. Math. (1992)
- Performance ratio of 2 - l/k (assembling l-way alignments, l < k): V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Theor. Comput. Sci. (1997)
- Polynomial Time Approximation Scheme (PTAS)
- MSA within a constant band, allowing only a constant number of insertion and deletion gaps of arbitrary length per sequence on average: M. Li, B. Ma and L. Wang, Near optimal alignment within a band in polynomial time, STOC 2000.

Compact Set Definition

- Let S be the set of n objects S1, S2, S3, ..., Sn, and let D(Si, Sj) denote the distance between Si and Sj in the distance matrix D.
- Consider any subset C of S. If every distance between an element in C and an element not in C is larger than the longest distance within C, then C is called a compact set.
- Property
- The entire set S is a compact set.
- Each set consisting of a single object is also a compact set.
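The definition can be checked directly on a small distance matrix. The sketch below is a brute-force illustration under assumptions: the 5-object matrix `D` is made up for the example, and enumerating all subsets is exponential, so it is only meant for tiny inputs.

```python
from itertools import combinations

def is_compact(D, C):
    """True if every border distance exceeds the longest distance inside C."""
    inside = sorted(set(C))
    outside = [v for v in range(len(D)) if v not in inside]
    if len(inside) < 2 or not outside:
        return True  # singletons and the entire set are compact by definition
    max_inside = max(D[a][b] for a, b in combinations(inside, 2))
    min_border = min(D[a][c] for a in inside for c in outside)
    return min_border > max_inside

def compact_sets(D):
    """Enumerate all compact sets by brute force (fine for a handful of objects)."""
    n = len(D)
    return [
        set(c)
        for k in range(1, n + 1)
        for c in combinations(range(n), k)
        if is_compact(D, c)
    ]

# Hypothetical symmetric distance matrix over 5 objects (indices 0..4).
D = [
    [0, 1, 3, 10, 11],
    [1, 0, 4, 10, 12],
    [3, 4, 0, 10, 11],
    [10, 10, 10, 0, 2],
    [11, 12, 11, 2, 0],
]

nontrivial = [sorted(c) for c in compact_sets(D) if 1 < len(c) < len(D)]
print(nontrivial)  # [[0, 1], [3, 4], [0, 1, 2]]
```

In this example {0, 1} sits inside {0, 1, 2}, which shows the nesting that makes compact sets form a tree.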

Compact Set Example

[Figure: six objects S1 to S6 with their distance matrix and three groupings labeled Compact Set 1, Compact Set 2 and Compact Set 3. For Compact Set 3 the maximal inside edge is 10 and the minimal border edge is 11.]

Compact Set Example (cont.)

- Compact sets are hierarchical

MSA Compact Set

- Consider an example of 12 protein sequences
- S1: MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI
- S2: SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
- S3: SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
- S4: MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
- S5: MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI
- S6: MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
- S7: MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII
- S8: MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI
- S9: MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI
- S10: MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
- S11: APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI
- S12: MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV

Original sequences

MSA Compact Set (cont.)

[Figure: original distance matrix and original compact set tree]

A good MSA should preserve the compact sets as well.

MSA Compact Set (cont.)

- S1: -----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA
- S2: ---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV
- S3: --------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV
- S4: --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES
- S5: ----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK
- S6: ------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA
- S7: ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE
- S8: ----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK
- S9: ------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN
- S10: --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE
- S11: -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK
- S12: MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP

MSA by MSA1

MSA Compact Set (cont.)

- S1: ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL
- S2: ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID-AAK
- S3: ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA
- S4: MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP
- S5: -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET
- S6: -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK
- S7: MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP
- S8: -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET
- S9: MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR
- S10: MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF
- S11: ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET
- S12: ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR

MSA by MSA2

MSA Compact Set (cont.)

[Figure: compact set tree by MSA1 and distance matrix by MSA1]

MSA Compact Set (cont.)

[Figure: compact set tree by MSA2 and distance matrix by MSA2]

Measure of Compact Set Preservation

- How can we measure compact set preservation quantitatively?
- N1 = # of the original compact set relations
- N2 = # of the relations preserved after MSA
- Estimate by Compact Set Preservation = N2 / N1

Measure of Compact Set Preservation (cont.)

Original compact set relations (from the original distance matrix):

(1 2 4) (1 2 5) (1 3 4) (1 3 5) (2 3 4) (2 3 5) (1 2 3) (4 5 1) (4 5 2) (4 5 3)

N1 = 10

Measure of Compact Set Preservation (cont.)

Compact set relations after MSA (from the distance matrix and compact set tree after MSA):

(1 2 4) (1 2 5) (1 4 3) (3 5 1) (2 4 3) (3 5 2) (1 2 3) (1 4 5) (2 4 5) (3 5 4)

Of the 10 original relations, 7 are not preserved after MSA, so N2 = 10 - 7 = 3.

Estimate by Compact Set Preservation = 3/10
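The count on this slide can be reproduced with a short sketch. It assumes that a relation written as three numbers "a b c" means that a and b share a compact set that excludes c, so the order of the first two numbers does not matter; this reading is an interpretation of the slide, and the function names are illustrative.

```python
def normalize(triples):
    """Treat (a, b, c) as the unordered pair {a, b} separated from c."""
    return {(frozenset((a, b)), c) for a, b, c in triples}

def preservation(original, after):
    """Return (N1, N2, N2 / N1) for two lists of relation triples."""
    orig = normalize(original)
    kept = orig & normalize(after)
    return len(orig), len(kept), len(kept) / len(orig)

# Relation triples copied from the slides.
original = [(1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 3, 5), (2, 3, 4),
            (2, 3, 5), (1, 2, 3), (4, 5, 1), (4, 5, 2), (4, 5, 3)]
after = [(1, 2, 4), (1, 2, 5), (1, 4, 3), (3, 5, 1), (2, 4, 3),
         (3, 5, 2), (1, 2, 3), (1, 4, 5), (2, 4, 5), (3, 5, 4)]

print(preservation(original, after))  # (10, 3, 0.3)
```

Under that reading, the three surviving relations are (1 2 4), (1 2 5) and (1 2 3), which matches N2 = 3.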

Why Pair Wise Compact Set?

- The evolutionary tree is the real judge
- An evolutionary tree tends to minimize the total length of its evolutionary edges (the tree size) computed from pairwise distances, which makes it appear compact
- This holds in experiments

Compact Set Relation Preserved Rate for

Evolutionary Tree

Preserved rate = # of relations preserved in the evolutionary tree / # of compact set relations of the pairwise distance

The larger, the better.

Compact Set Evaluation Algorithm

- Step 1: Construct the original compact set tree T and the compact set tree after MSA T' [1].
- Step 2: Traverse T' in preorder to generate the compact set relations after MSA, R', and mark the entries of the hash table H according to R'.
- Step 3: Traverse T in preorder to generate the original compact set relations R, and check for each relation in R whether its entry is marked in the hash table H.
- Total time complexity: O( ), where n is the number of sequences
- Reference
- 1. E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding compact sets, Inform. Process. Lett. 44 (1992) 285-289
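The three steps can be sketched as follows. This is a simplified illustration, not the algorithm of reference 1: compact sets are found by merging components along sorted edges (Kruskal style) and testing each new component against the definition, and a Python `set` stands in for the hash table H. Both distance matrices are hypothetical.

```python
from itertools import combinations

def is_compact(D, C):
    """Definition check for a set C with 2 <= |C| < n."""
    inside = sorted(C)
    outside = [v for v in range(len(D)) if v not in C]
    max_in = max(D[a][b] for a, b in combinations(inside, 2))
    min_out = min(D[a][c] for a in inside for c in outside)
    return min_out > max_in

def compact_sets(D):
    """Non-trivial compact sets, found among the components of a Kruskal merge."""
    n = len(D)
    comp = {v: frozenset([v]) for v in range(n)}  # vertex -> current component
    found = []
    for _, i, j in sorted((D[i][j], i, j) for i, j in combinations(range(n), 2)):
        if comp[i] == comp[j]:
            continue
        merged = comp[i] | comp[j]
        for v in merged:
            comp[v] = merged
        if len(merged) < n and is_compact(D, merged):
            found.append(merged)
    return found

def relations(D):
    """Relations (a, b, c): a < b share a compact set that excludes c."""
    rel = set()
    for C in compact_sets(D):
        outside = [v for v in range(len(D)) if v not in C]
        for a, b in combinations(sorted(C), 2):
            rel.update((a, b, c) for c in outside)
    return rel

def evaluate(D_original, D_after):
    """Return (N1, N2): original relations and how many survive after MSA."""
    H = relations(D_after)      # Step 2: mark relations after MSA
    R = relations(D_original)   # Step 3: look up each original relation
    return len(R), sum(1 for r in R if r in H)

D_ORIG = [[0, 1, 3, 10, 11], [1, 0, 4, 10, 12], [3, 4, 0, 10, 11],
          [10, 10, 10, 0, 2], [11, 12, 11, 2, 0]]
D_MSA = [[0, 1, 10, 4, 11], [1, 0, 10, 5, 12], [10, 10, 0, 10, 2],
         [4, 5, 10, 0, 11], [11, 12, 2, 11, 0]]

print(evaluate(D_ORIG, D_MSA))  # (10, 3)
```

The two matrices were chosen so that the compact sets mirror the earlier 5-object example (0-based indices), which is why the result is again 3 preserved relations out of 10.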

Our Strategy for MSA

- Progressive alignment (Feng and Doolittle, 1987)
- with neighbor first (using the Minimal Spanning Tree (MST) Kruskal merging order)
- Set-to-set alignment. Once a gap, always a gap.
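The Kruskal merging order can be derived from a distance matrix with a union-find structure. The sketch below is illustrative only: the function name and the 4-sequence distance matrix are assumptions, and each recorded merge is the pair of sets that a set-to-set alignment step would combine.

```python
def kruskal_merge_order(D):
    """Return the list of (set_a, set_b) merges in Kruskal order."""
    n = len(D)
    parent = list(range(n))
    members = {i: [i] for i in range(n)}  # root -> sequences in that set

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    edges = sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different sets: align them next
            merges.append((sorted(members[ri]), sorted(members[rj])))
            parent[rj] = ri
            members[ri] = members[ri] + members.pop(rj)
    return merges

# Hypothetical distances for four sequences S1..S4 (indices 0..3).
D = [
    [0, 1, 5, 6],
    [1, 0, 5, 7],
    [5, 5, 0, 2],
    [6, 7, 2, 0],
]
print(kruskal_merge_order(D))  # [([0], [1]), ([2], [3]), ([0, 1], [2, 3])]
```

Sorting the edges dominates the cost at O(n^2 log n), which is small next to the alignment work itself.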

Kruskal merging order tree

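[Figure: merge tree over the leaves S1, S2, S3, S4 with merge steps numbered 1, 2, 3 and the two intermediate sets labeled set1 and set2]

Alignment of S1 and S2:
S1 AACAGACTTA-
S2 -ACAGACTTAA

Alignment of S3 and S4:
S3 ----ACAGACTCCA
S4 TTTAAAAGTC----

Alignment of all four sequences:
S1 ---AACAGACTT-A-
S2 ----ACAGACTT-AA
S3 ----ACAGACTCCA-
S4 TTTAAAAGTC-----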

Q: Why do we use the MST Kruskal order?

A1: It has a structure similar to that of the compact set tree.

[Figure: MST order merge tree compared with the compact tree]

A2: The MST Kruskal order is easily obtained.

Score function

[Figure: the alignment below annotated with the scored events: match, mismatch, begin-gap, gap-open, gap-extended, end-gap]

---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---

Strategy of set-to-set alignment

- Score(8, 8) = Max { Score(7, 7) + (α8, β8), Score(7, 8) + (α8, G3), Score(8, 7) + (G2, β8) }, where Gk denotes a column of k gaps
- (α8, β8) = (G,C) + (G,-) + (G,G) + (-,C) + (-,-) + (-,G) = (-10) + (-15) + (10) + (-15) + (0) + (-15) = -45
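The recurrence and the column score above can be sketched as a small dynamic program. The pair values (match 10, mismatch -10, residue against gap -15, gap against gap 0) are read off the worked example; treating every gap alike is a simplification, since the score function slide distinguishes begin-gap, gap-open, gap-extended and end-gap. Function names are illustrative.

```python
def pair_score(a, b, match=10, mismatch=-10, gap=-15):
    """Score of two symbols; values taken from the worked example."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(col_a, col_b):
    """Sum of pair scores between one column of set alpha and one of set beta."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)

def set_to_set_score(set_a, set_b):
    """Best score of aligning two already-aligned sets column by column."""
    la, lb = len(set_a[0]), len(set_b[0])
    cols_a = ["".join(s[i] for s in set_a) for i in range(la)]
    cols_b = ["".join(s[j] for s in set_b) for j in range(lb)]
    gap_a, gap_b = "-" * len(set_a), "-" * len(set_b)  # all-gap columns
    S = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        S[i][0] = S[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        S[0][j] = S[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                S[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                S[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return S[la][lb]

print(column_score("G-", "C-G"))  # -45
```

Each table cell evaluates column scores of sα · sβ pairs, and the table has (lα + 1)(lβ + 1) cells, so the sketch does on the order of sα sβ lα lβ work.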

Time complexity of set α to set β alignment: O(sα sβ lα lβ), which is 2 × 3 × 8 × 8 in the example above, where sα, sβ are the numbers of sequences in set α and set β respectively, and lα, lβ are the lengths of the resulting sequences in set α and set β respectively.

Time Complexity of our strategy

- The worst case happens when the binary tree is balanced.
- Total set-to-set time complexity is bounded by [equation], where l is the length of the resulting sequences and n is the number of sequences.
- The worst-case time complexity is O(n^2 l^2).
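One way to arrive at an O(n^2 l^2) bound is sketched below. It assumes that every intermediate alignment has length at most l and that one set-to-set alignment costs sα sβ lα lβ, as stated for the set α to set β case. Every pair of input sequences is separated by exactly one merge of the tree, so summing sα sβ over the n - 1 merges counts each pair once:

```latex
\sum_{\text{merges } (\alpha,\beta)} s_\alpha\, s_\beta\, l_\alpha\, l_\beta
\;\le\; l^2 \sum_{\text{merges } (\alpha,\beta)} s_\alpha\, s_\beta
\;=\; l^2 \binom{n}{2}
\;=\; O(n^2 l^2).
```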

MSA Useful tools

- GCG (Genetics Computer Group) PileUp
- http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
- ClustalW
- http://clustalw.genome.ad.jp/

Clustal W

- Pairwise alignment
- Calculate distance matrix
- Construct the unrooted Neighbor-Joining (NJ) tree
- Construct the rooted NJ tree
- rooted at mid-point
- Progressive alignment
- Align following the rooted NJ tree
- set-to-set alignment

Experiment

SP Score Result

ClustalW's results and ours are better than GCG's.

The larger, the better.

Compact Set Relation Failure Rate Result

Failure rate = # of relations not preserved / # of source compact set relations

The smaller, the better.

Three-point Relative Scale Preserved Rate

For every three species A, B, C, we evaluate whether their relative distance relations are identical in the original distance matrix and in the MSA distance matrix.
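A minimal sketch of this measure follows. It assumes that the "relative distance relation" of a triple is the ordering of its three pairwise distances D(A,B), D(A,C) and D(B,C); the slides do not spell this out, so that reading and the function names are assumptions.

```python
from itertools import combinations

def cmp(x, y):
    """Three-way comparison: -1, 0 or 1."""
    return (x > y) - (x < y)

def triple_pattern(D, a, b, c):
    """How the three pairwise distances of (a, b, c) compare with each other."""
    d = (D[a][b], D[a][c], D[b][c])
    return (cmp(d[0], d[1]), cmp(d[0], d[2]), cmp(d[1], d[2]))

def three_point_preserved_rate(D_original, D_msa):
    """Fraction of triples whose distance ordering is the same in both matrices."""
    triples = list(combinations(range(len(D_original)), 3))
    same = sum(
        triple_pattern(D_original, *t) == triple_pattern(D_msa, *t)
        for t in triples
    )
    return same / len(triples)

# Hypothetical 3-species matrices whose distance orderings are reversed.
D1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
D2 = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]
print(three_point_preserved_rate(D1, D1))  # 1.0
print(three_point_preserved_rate(D1, D2))  # 0.0
```

Comparing orderings rather than raw values keeps the measure independent of how the two matrices are scaled.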

I Believe Tree Only

- One might still not believe that the original pairwise distance is a good judge
- One believes the true evolutionary tree only

Compact Set Relation Failure Rate

Take Protein 12 for example

Failure rate = # of relations not preserved / # of source compact set relations

[Table: failure rate by Distance and MSA_Method]

The smaller, the better.

Future Work

- Are our measurement and algorithms really good?
- Simulations and Web service
- Does our MSA by set-to-set alignment satisfy some approximation property?
- Theoretical proof
- How can we reduce the time?
- Hardwired dynamic programming
- e.g. PARACEL http://www.paracel.com/