"An%20Eulerian%20path%20approach%20to%20global%20multiple%20alignment%20for%20DNA%20sequences - PowerPoint PPT Presentation

1
CPSC 689-604
"An Eulerian path approach to global multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Journal of Computational Biology 10-6, pp. 803-819 (2003).
"An Eulerian path approach to local multiple alignment for DNA sequences" by Y. Zhang and M. Waterman, Proc. National Academy of Science of USA 102-5, pp. 1285-1290 (2005).
Presented by Jaehee Jung, Mar 4 2005
2
Outline
  • Motivation
  • Hamiltonian and Eulerian paths
  • Superpath problem
  • Global Alignment
  • Global Alignment Algorithm
  • Probability Analysis
  • Complexity
  • Discussion
  • Local Alignment
  • Local Alignment Algorithm
  • Significance Estimation
  • Complexity
  • Discussion

3
Motivation - Hamiltonian path
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
[Figure: graph with one vertex per 3-tuple of S: ATG, TGG, GTG, GGC, GCA, GCG, CGT, TGC]
Possible reconstructions:
ATGCGTGGCA
ATGGCGTGCA
The Hamiltonian path problem is NP-complete
4
Motivation - Eulerian path
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Vertices correspond to (l-1)-tuples
Edges correspond to l-tuples from the spectrum
ATGGCGTGCA
ATGCGTGGCA
An Eulerian path, which visits every edge exactly once, corresponds to a
sequence reconstruction
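
The reconstruction on this slide can be sketched in code. The following is a minimal illustration, not the authors' implementation: it builds the graph of (l-1)-tuples from the spectrum S and walks one Eulerian path with Hierholzer's algorithm. The function name `reconstruct` is hypothetical.

```python
from collections import defaultdict

def reconstruct(spectrum):
    """Return one sequence whose l-tuples are exactly the given spectrum."""
    succ = defaultdict(list)      # (l-1)-tuple -> list of successor (l-1)-tuples
    balance = defaultdict(int)    # out-degree minus in-degree of each vertex
    for tup in spectrum:
        u, v = tup[:-1], tup[1:]  # every l-tuple is one edge u -> v
        succ[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one surplus outgoing edge; if the graph is
    # balanced (Eulerian cycle), any vertex works.
    start = next((v for v in sorted(balance) if balance[v] == 1), min(balance))
    stack, path = [start], []
    while stack:                  # Hierholzer's algorithm
        u = stack[-1]
        if succ[u]:
            stack.append(succ[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum has no Eulerian path")
    # Spell the sequence: first vertex, then the last letter of each next vertex.
    return path[0] + "".join(v[-1] for v in path[1:])

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(reconstruct(S))  # one of the two reconstructions shown on the slide
```

Which of the two reconstructions is returned depends on the order in which edges are stored; both use every l-tuple exactly once.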
5
Global multiple alignment
  • Global multiple alignment
  • Entire sequences are aligned into one configuration
  • Time and memory cost
  • L: sequence length
  • N: number of sequences
  • Multiple sequence alignment
  • Many heuristic algorithms
  • Progressive alignment strategies
  • Align the closest pair of sequences
  • Align the next closest pair of sequences
  • Ex: MULTAL, CLUSTALW, T-COFFEE

6
Global multiple alignment
  • Many heuristic algorithms (contd.)
  • Iterative refinement strategies
  • Local alignment to construct a multiple alignment
    based on segment-to-segment comparison
  • Refine the initial alignment iteratively by local
    alignment
  • Ex: DIALIGN
  • Iteratively divide the sequences into two groups
    and realign them
  • Ex: PRRP
  • Stochastic iterative strategies
  • Ex: HMMT, SAM
  • ISSUE
  • Robust only under certain conditions
  • Local optimum problem (iterative strategies)
  • => Efficient use of time and memory space

7
Motivation
EULER [1, 2]
  • Fragment assembly in DNA sequencing using the Eulerian superpath approach
  • The Eulerian path problem is easy to solve in a de Bruijn graph
  • Contribution: discards the traditional overlap-layout-consensus approach
  • Obtains error-free data by an error-correction procedure
EulerAlign [3]
  • Global multiple DNA sequence alignment problem using Eulerian paths
  • Similar to the Star method
  • Assumes all input sequences are derived from a common ancestral sequence
8
Star Alignment Example
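x1 = MPE, x2 = MKE, x3 = MSKE, x4 = SKE
[Figure: the pairwise alignments MPE/MKE, MSKE/-MKE, and SKE/MKE are merged one at a time, first into -MPE, -MKE, MSKE and then into the final alignment]
Final alignment:
-MPE
-MKE
MSKE
-SKE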
  • Compute the alignments of all sequence pairs
  • Pick one of the N sequences as the
    consensus
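
As a rough sketch of the two steps above, the code below compares all sequence pairs and picks the sequence closest to all others. Using unit-cost edit distance as the pairwise score is an assumption made for illustration; the slide does not state the scoring scheme, and `edit_distance` and `star_center` are hypothetical names.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Pick the sequence with the smallest total distance to all others."""
    return min(seqs, key=lambda s: sum(edit_distance(s, t) for t in seqs))

print(star_center(["MPE", "MKE", "MSKE", "SKE"]))  # MKE
```

For the four example sequences the total distances are 5 (MPE), 3 (MKE), 4 (MSKE), and 4 (SKE), so MKE is chosen under this scoring.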

9
Motivation - Eulerian Superpath
  • Superpath Problem (EULER [2])
  • Given an Eulerian graph and a collection of paths
    in this graph, find an Eulerian path in this
    graph that contains all these paths as subpaths
  • Solution
  • Transform the graph G and the system of paths P into G1 and P1
  • Make a series of equivalent transformations
  • (G, P) -> (G1, P1) -> (G2, P2) -> ... -> (Gk, Pk)

10
Motivation - Eulerian Superpath
  • Equivalent transformation
  • X,Y detachment

Py->
11
Motivation - Eulerian Superpath
  • Equivalent transformation
  • X,Y detachment
  • P consistent with Px,y1 but inconsistent with
    Px,y2
  • P is resolvable

12
Motivation - Eulerian Superpath
  • Equivalent transformation
  • X,Y detachment
  • P inconsistent with both Px,y1 and Px,y2
  • Has no solution (not encountered in the NM
    project)

NM project: difficult-to-assemble and
repeat-rich bacterial genomes
13
Motivation - Eulerian Superpath
  • Equivalent transformation
  • X,Y detachment
  • P consistent with both Px,y1 and Px,y2
  • Difficult situation
  • Postpone such edges until all resolvable edges are analyzed

14
Motivation - Eulerian Superpath
  • Equivalent transformation
  • X-cut
  • Removes x from the paths in P->x and Px-> without affecting the graph G

15
Eulerian global alignment - the algorithm
  1. Construct a directed de Bruijn graph
  2. Transform the de Bruijn graph into a DAG
  3. Extract a consensus path from the DAG according
    to the edge weights
  4. Do fast pairwise alignment between the consensus
    path and each input sequence
  5. Construct the final multiple alignment according
    to the pairwise alignments

16
(1) (2) (3) (4) (5) Construct a directed
de Bruijn graph
Sequence: CCTTAG
5-tuples (edges): CCTTA, CTTAG
4-tuples (vertices): CCTT, CTTA, CTTA, TTAG
Merge the identical vertices CTTA: CCTT -> CTTA -> TTAG
Construction of the de Bruijn graph for CCTTAG
and k = 5
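
The construction above can be sketched as follows. This is a minimal illustration under the slide's conventions (k-tuples are edges, (k-1)-tuples are vertices, identical vertices and edges are merged); the function name `de_bruijn` is hypothetical, and the multiplicity of an edge is counted as the number of times its k-tuple occurs in the input.

```python
from collections import Counter

def de_bruijn(seqs, k):
    """Return (vertices, edges); edges maps (u, v) to its multiplicity."""
    edges = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kt = s[i:i + k]
            # Identical k-tuples land on the same key, so they are merged
            # and their multiplicity accumulates.
            edges[(kt[:-1], kt[1:])] += 1
    vertices = {v for edge in edges for v in edge}
    return vertices, edges

vertices, edges = de_bruijn(["CCTTAG"], 5)
print(sorted(vertices))  # ['CCTT', 'CTTA', 'TTAG']
print(dict(edges))       # {('CCTT', 'CTTA'): 1, ('CTTA', 'TTAG'): 1}
```

Passing several sequences that share k-tuples raises the multiplicity of the shared edges, which is what the later consensus step relies on.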
17
de Bruijn Graph Construction
  • Assume that there are no sequencing errors.
  • Construct the de Bruijn graph, taking all
    (k-1)-mers appearing in the set of fragments as
    vertices.
  • Example fragments: TCACA, ACAA, GTCA
  • Sequencing errors have to be corrected before
    construction of the de Bruijn graph
  • Example: read ACGGCTAT; other reads
    CTAACTGC, CTGCTA, AACTGCT; correction: T

18
(1) (2) (3) (4) (5) Construct a directed
de Bruijn graph
[Figure: an example of the initial de Bruijn graph, with each edge labeled by its multiplicity]
19
(1) (2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Transform the de Bruijn graph into a DAG
  • Tangle
  • A vertex that has more than one incoming or
    outgoing edge
  • Created by random matches, repeats, and mutations
    in the DNA sequences
  • Result: cycles
  • Goal: eliminate tangles, because they cause many cycles

[Figure: a tangle at vertex vi]
20
(1) (2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Claim
  • Define a left edge E->vi of vertex vi to be an edge
    that points to vi
  • If a vertex vi has two or more left edges En->vi
    (n = 1, 2, 3, ...) that are contained in the same
    sequence path, there must exist a cycle in the
    graph
  • Proof
  • The path visits vi when it traverses E1->vi and
    visits vi again when it traverses E2->vi, so the
    part of the path between the two visits is a cycle
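
The claim can be checked on a single sequence path with a small sketch. The helper below (a hypothetical name, `tangled_vertices`) lists every vertex that the path of one sequence enters through two or more distinct left edges; by the claim, each such vertex lies on a cycle.

```python
from collections import defaultdict

def tangled_vertices(seq, k):
    """Map each (k-1)-tuple vertex to its left edges if it has at least two."""
    left_edges = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kt = seq[i:i + k]
        left_edges[kt[1:]].add(kt)  # the k-tuple kt is an edge pointing to kt[1:]
    return {v: e for v, e in left_edges.items() if len(e) >= 2}

# ACG is entered once through GACG and once through TACG, so the path
# ACG -> CGT -> GTT -> TTA -> TAC -> ACG closes a cycle.
print(tangled_vertices("GACGTTACGA", 4))
```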

21
(2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Rule of transformation
  • The sequence information in Evi-> is partitioned into two
    superedges, E1->vi-> and E2->vi->
  • The multiplicities of the superedges E1->vi-> and E2->vi->
    are computed

[Figure: edges E1->vi and E2->vi, edge Evi->, vertices vi and vj, and the superedges E1->vi-> and E2->vi->]
A tangle at vi is eliminated by making a copy vi'
of vertex vi and separating the two superedges
22
(2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Rule of transformation

[Figure: edges E1->vi, E1vi->, E2->vi, E2vi-> and the superedges E1->vi->, E2->vi->]
A tangle at vi is eliminated by making a copy vi'
of vertex vi
23
(2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Safe transformation
  • Does not introduce a loss of similarity

24
(2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Unsafe transformation
  • Introduces a loss of similarity

25
(2) (3) (4) (5) Transform the de
Bruijn graph into a DAG
  • Remove all cycles by performing safe
    transformations
  • Leave all unsafe transformations for later

[Figure: the graph after transformation into a DAG, with each edge labeled by its multiplicity; the heaviest path is the consensus path]
26
(1) (2) (3) (4) (5) Extract a consensus
path from DAG
  • Greedy algorithm
  • Searches for a heaviest path in linear time
  • Not optimal but satisfactory
  • Weight of each edge
  • Proportional to its multiplicity and length
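
The slide does not spell out the greedy rule, so the following is only one plausible sketch: start from the heaviest edge and keep extending the path forward and backward along the heaviest available edge. The function name `greedy_heavy_path` and the edge format `(u, v, weight)` are assumptions made for this example, and the input is assumed to be a DAG.

```python
from collections import defaultdict

def greedy_heavy_path(edges):
    """edges: list of (u, v, weight) in a DAG. Returns (total_weight, path)."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for u, v, w in edges:
        out_edges[u].append((w, v))
        in_edges[v].append((w, u))
    u, v, w = max(edges, key=lambda e: e[2])  # seed with the heaviest edge
    path, total = [u, v], w
    while out_edges[path[-1]]:                # extend forward greedily
        w, nxt = max(out_edges[path[-1]])
        path.append(nxt)
        total += w
    while in_edges[path[0]]:                  # extend backward greedily
        w, prv = max(in_edges[path[0]])
        path.insert(0, prv)
        total += w
    return total, path

dag = [("a", "b", 3), ("b", "d", 1), ("a", "c", 2), ("c", "d", 5), ("d", "e", 1)]
print(greedy_heavy_path(dag))  # (8, ['a', 'c', 'd', 'e'])
```

Because each step only looks at the edges of the current end vertex, the result can miss the true heaviest path, which matches the "not optimal but satisfactory" remark.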

27
(2) (3) (4) (5) Fast pairwise alignment
  • Banded pairwise alignment algorithm
  • The positional shifts between two candidate
    letters in two sequences are bounded by a constant
  • Align the consensus sequence with each input
    sequence
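
A banded alignment only fills the dynamic programming cells whose row and column indices differ by at most a constant. The sketch below computes a banded unit-cost edit distance rather than the paper's scoring function; the name `banded_edit_distance` and the parameter `band` are hypothetical.

```python
def banded_edit_distance(a, b, band):
    """Edit distance using only cells with |i - j| <= band.

    Returns None when the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    INF = float("inf")
    prev = {j: j for j in range(min(m, band) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                best = i                            # a[:i] against empty prefix
            else:
                best = min(prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]),
                           cur.get(j - 1, INF) + 1)  # gap in a
            best = min(best, prev.get(j, INF) + 1)   # gap in b
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_edit_distance("ACGTACGT", "ACGTTCGT", 2))  # 1
```

With a fixed band the work per row is constant, so aligning the consensus against a sequence of length L takes time proportional to L times the band width.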

28
(1) (2) (3) (4) (5) Construct the final
multiple alignment
  • Combine the pairwise alignments to construct the final
    multiple alignment
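
One common way to combine alignments that all share the consensus is to give every consensus position as many insertion columns as the longest insertion any sequence has there. The sketch below follows that idea; it is an illustration rather than the authors' procedure, `merge_on_consensus` is a hypothetical name, and the example reuses the Star alignment sequences with MKE standing in for the consensus.

```python
def merge_on_consensus(consensus, pairwise):
    """pairwise: list of (consensus_row, sequence_row) gapped strings."""
    n = len(consensus)
    width = [0] * (n + 1)  # insertion columns needed before each position
    parsed = []
    for cons_row, seq_row in pairwise:
        inserts = [""] * (n + 1)  # letters inserted before each position
        matched = []              # letter or '-' aligned to each position
        i = 0
        for c, s in zip(cons_row, seq_row):
            if c == "-":
                inserts[i] += s   # sequence letter with no consensus partner
            else:
                matched.append(s)
                i += 1
        parsed.append((inserts, matched))
        for j in range(n + 1):
            width[j] = max(width[j], len(inserts[j]))
    rows = []
    for inserts, matched in parsed:
        parts = []
        for j in range(n):
            parts.append(inserts[j].ljust(width[j], "-"))
            parts.append(matched[j])
        parts.append(inserts[n].ljust(width[n], "-"))
        rows.append("".join(parts))
    return rows

pairs = [("MKE", "MPE"), ("MKE", "MKE"), ("-MKE", "MSKE"), ("MKE", "SKE")]
for row in merge_on_consensus("MKE", pairs):
    print(row)
```

With these pairwise alignments the rows come out as -MPE, -MKE, MSKE, -SKE, the same final alignment as in the Star alignment example.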

29
Probability Analysis
  • Assume all input sequences are derived from a
    common ancestral sequence S0
  • N -> identical S0
  • N: number of sequences
  • L: average sequence length
  • k: size of the k-tuple
  • Mutation rate
  • No mutation: the N sequences are exactly the same as S0
  • Multiplicity of each edge: N
  • With mutation: weight of an edge in S0
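
To make the effect of mutation on edge multiplicity concrete, here is a small calculation under a simplifying assumption that is not stated on the slide: each letter mutates independently with rate `eps`, so a k-tuple of S0 survives unchanged in one sequence with probability (1 - eps)^k and its multiplicity is binomial. The names `eps`, `expected_multiplicity`, and `prob_multiplicity_at_least` are hypothetical.

```python
import math

def expected_multiplicity(n_seqs, k, eps):
    """Expected number of sequences that keep a given k-tuple of S0 intact."""
    return n_seqs * (1 - eps) ** k

def prob_multiplicity_at_least(m, n_seqs, k, eps):
    """Binomial tail: probability that at least m sequences keep the k-tuple."""
    p = (1 - eps) ** k
    return sum(math.comb(n_seqs, i) * p ** i * (1 - p) ** (n_seqs - i)
               for i in range(m, n_seqs + 1))

print(expected_multiplicity(20, 10, 0.0))  # 20.0, the no-mutation case
print(expected_multiplicity(20, 10, 0.05))
print(prob_multiplicity_at_least(5, 20, 10, 0.05))
```

Under this model, larger k lowers the expected multiplicity of the edges of S0, which is the trade-off behind the choice of k-tuple size.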

30
Probability Analysis
  • Large Deviation Theorem (L.D.T.) for the binomial
    estimate
  • If the bound holds, then the consensus path exists and is
    accurate

31
Computational complexity
  • Construction and transformation of the graph
  • Find the heaviest path
  • Banded pairwise alignment

32
Discussion
  • Choice of k-tuple size
  • The larger k, the lower the multiplicity of each edge
  • Suited to larger N
  • The smaller k, the less likely a k-tuple is unique in the
    sequence
  • Gives high multiplicity even for small N
  • Estimate k using the L.D.T.
  • Graph transformation may lose information
  • Unsafe transformations lose similarity
    information
  • Arbitrary scoring function

33
Local multiple alignment
  • Difficulty
  • Locations, sizes, structures, and number of conserved
    regions
  • Local multiple alignment
  • PIMA, MACAW, DIALIGN
  • Subproblem of local alignment
  • Motif finding
  • Gibbs motif sampler
  • Ex: MEME
  • Limitation
  • Size of the data, length of the motif

34
Local multiple alignment
  • Another specific problem of local alignment
  • Entire genome sequences
  • Large-scale sequence comparison
  • Local alignment
  • Using pairwise sequence comparison
  • Not accurate: errors accumulate and ruin the final
    result
  • Comparing each sequence with a DB
  • Finds only conserved regions

35
Local Alignment Algorithm
  1. Construct the de Bruijn graph by overlapping k-tuples
  2. Cut thin edges by estimating the statistical
    significance of each edge with a Poisson
    heuristic
  3. Resolve cycles in the graph
  4. Extract a heaviest path as the consensus
  5. Construct and output a multiple alignment from
    pairwise alignments
  6. Declump the de Bruijn graph and return to step 5 to
    find other patterns

36
(1) (2) (3) (4) (5) (6) Construct de
Bruijn graph
  • ATGT: ATG, TGT
  • ATGC: ATG, TGC
  • CTGT: CTG, TGT

[Figure: vertices AT, CT, TG, GT, GC; edges ATG, CTG, TGT, TGC]
3-tuple de Bruijn graph built by gluing identical
edges and vertices
37
(1) (2) (3) (4) (5) (6) Cut thin edges
  • Uninteresting edges
  • Huge number of thin edges (small multiplicity)
  • Remove an edge after estimating the probability
    that it is formed by random matches

[Figure: (a) before removing thin edges; (b) after removing thin edges]
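
A rough sketch of cutting thin edges is given below. It is not the paper's estimate: it assumes uniform base frequencies (each letter has probability 1/4), treats the chance count of one fixed k-tuple as Poisson, and keeps only edges whose multiplicity would be unlikely by chance. The names `poisson_tail`, `cut_thin_edges`, and the threshold `alpha` are hypothetical.

```python
import math
from collections import Counter

def poisson_tail(m, lam):
    """P(X >= m) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i)
                     for i in range(m))

def cut_thin_edges(seqs, k, alpha=1e-3):
    """Keep k-tuples (edges) whose multiplicity is significant at level alpha."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    positions = sum(max(0, len(s) - k + 1) for s in seqs)
    lam = positions / 4 ** k  # expected chance occurrences of one fixed k-tuple
    return {t: m for t, m in counts.items() if poisson_tail(m, lam) < alpha}

seqs = ["ACCGATTACAGTTC", "TGAGATTACAGCAA", "CATGATTACAGGGT"]
print(cut_thin_edges(seqs, 6))  # only the 6-tuples of the shared GATTACAG remain
```

In this toy input every 6-tuple outside the shared region occurs once and is cut, while the three shared 6-tuples have multiplicity 3 and survive.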
38
(1) (2) (3) (4) (5) (6)Resolve cycles
in graph
  • Tandem repeat
  • A repeat appears as a cycle in the graph
  • Ambiguous how many times the path traverses a cycle
  • Solved with the superpath approach

39
(1) (2) (3) (4) (5) (6) Extract a
heaviest path as the consensus
  • Heaviest path
  • Shortest path algorithm with negated edge weights
  • Using topological sort
  • Costs linear time (acyclic graph)
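
On an acyclic graph the heaviest path can be found exactly by processing vertices in topological order, which is equivalent to a shortest path search on negated weights. The sketch below assumes positive edge weights and an edge list of `(u, v, weight)` triples; the function name `heaviest_path` is hypothetical.

```python
from collections import defaultdict, deque

def heaviest_path(edges):
    """edges: list of (u, v, weight) in a DAG. Returns (total_weight, path)."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    best = {n: 0 for n in nodes}  # heaviest path weight ending at each vertex
    prev = {}
    for u in order:               # relax every edge once, in topological order
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u
    end = max(nodes, key=lambda n: best[n])
    path = [end]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    return best[end], path[::-1]

dag = [("a", "b", 3), ("b", "d", 1), ("a", "c", 2), ("c", "d", 5), ("d", "e", 1)]
print(heaviest_path(dag))  # (8, ['a', 'c', 'd', 'e'])
```

Each vertex and edge is handled a constant number of times, so the running time is linear in the size of the graph.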

40
(1) (2) (3) (4) (5) (6) Construct and
output a multiple alignment
  • Find the consensus
  • Banded version of local pairwise alignment
  • Declumping algorithm to find segments similar to
    the consensus
  • Optimal alignment has p > p0
  • p0 is computed assuming a Poisson distribution

41
(1) (2) (3) (4) (5) (6) Construct and
output a multiple alignment
  • Declumping algorithm

42
(1) (2) (3) (4) (5) (6) Declumping
graph
  • Remove the information of previously output local
    alignments
  • Allows additional patterns
  • Ex: XYZ, PYQ
  • Do not remove the edge of Y
  • Reduce its multiplicity
  • Repeat
  • Finding the consensus -> consensus alignment ->
    declumping the graph
  • Until no significant local alignments are left
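
The declumping idea, reducing multiplicities instead of deleting edges, can be sketched as below. This is an illustration only: edges are represented by their k-tuples, `declump` is a hypothetical name, and each reported segment is assumed to lower the multiplicity of every k-tuple it contains by one.

```python
from collections import Counter

def declump(multiplicity, reported_segments, k):
    """Lower edge multiplicities used by already reported alignments."""
    remaining = Counter(multiplicity)
    for seg in reported_segments:
        for i in range(len(seg) - k + 1):
            kt = seg[i:i + k]
            if remaining[kt] > 0:
                remaining[kt] -= 1  # reduce, but never delete, the edge
    return remaining

graph = Counter({"ACG": 4, "CGT": 4, "GTA": 2})
after = declump(graph, ["ACGT", "ACGT"], 3)
print(after)  # Counter({'ACG': 2, 'CGT': 2, 'GTA': 2})
```

An edge shared with another pattern, like Y in the XYZ and PYQ example, keeps whatever multiplicity the other pattern contributes and can still be found in the next iteration.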

43
Significance Estimation
  • Estimate the P-value of a local multiple alignment
  • Remove thin edges formed by random matches
  • Rank multiple outputs by statistical significance
  • Estimate the minimum multiplicity of mutation-free
    edges
  • Local alignment is more complicated than the global
    case
  • Positions and orders of the conserved regions in
    each sequence

44
Poisson clumping heuristic
  • Pairwise alignment
  • H is the optimal clump score
  • p(2) is the probability that two letters are
    identical
  • L1, L2 are the adjusted lengths of the two sequences
  • L1 L2 p(2)^x is an approximation to the
    expected number of clumps with score x
  • Multiple alignment
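
For the pairwise case, the approximation can be turned into a P-value sketch. The code assumes that x is the clump score being tested and that the number of such clumps is roughly Poisson, so the chance of seeing at least one is 1 - exp(-L1 L2 p(2)^x); both the reading of x and the function names are assumptions for illustration.

```python
import math

def expected_clumps(l1, l2, p2, x):
    """Approximate expected number of clumps: L1 * L2 * p(2)^x."""
    return l1 * l2 * p2 ** x

def poisson_p_value(l1, l2, p2, x):
    """P(at least one clump) under a Poisson count with the mean above."""
    return -math.expm1(-expected_clumps(l1, l2, p2, x))

# Two sequences of adjusted length 1000 with uniform letters, p(2) = 1/4.
print(expected_clumps(1000, 1000, 0.25, 10))
print(poisson_p_value(1000, 1000, 0.25, 10))
print(poisson_p_value(1000, 1000, 0.25, 20))
```

Raising the score x shrinks the expected number of chance clumps geometrically, so higher-scoring clumps receive much smaller P-values.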

45
Computation Efficiency
  • k: tuple size
  • l: pattern length found in each iteration
  • N: number of sequences
  • L: average sequence length
  • Time
  • Graph construction and transformation
  • Pairwise alignment with declumping
  • Space
  • The size of the alignment matrix
46
Discussion
  • Tuple size (10-20)
  • How to distinguish a true pattern from a
    concatenation of different patterns
  • The current version focuses on DNA, not protein sequences

47
Assignment 5
  • The de Bruijn graph and Eulerian path approach is
    applied only to DNA because its alphabet consists
    of four nucleotides: A, C, G, T. Give an
    efficient algorithm for multiple sequence
    alignment of proteins (an alphabet of 20
    characters) using a graph.
  • Hint: do not use the de Bruijn graph or the Eulerian
    graph. (The graph structure is embedded in the dynamic
    programming algorithm.)

If you have questions, contact me:
jhjung_at_cs.tamu.edu
48
Reference
  • [1] "A new algorithm for DNA sequence assembly" by R. Idury and
    M. Waterman, Journal of Computational Biology 2,
    pp. 291-306 (1995).
  • [2] "An Eulerian path approach to DNA fragment
    assembly" by P. A. Pevzner, H. Tang, and M. Waterman,
    Proc. National Academy of Science of USA 98,
    pp. 9748-9753 (2001).
  • [3] "An Eulerian path approach to global multiple
    alignment for DNA sequences" by Y. Zhang and M.
    Waterman, Journal of Computational Biology 10-6,
    pp. 803-819 (2003).
  • [4] "An Eulerian path approach to local multiple
    alignment for DNA sequences" by Y. Zhang and M.
    Waterman, Proc. National Academy of Science of
    USA 102-5, pp. 1285-1290 (2005).