Title: Analysis of Tree Edit Distance Algorithms, Serge Dulucq and Hélène
1 Analysis of Tree Edit Distance Algorithms
Serge Dulucq and Hélène
- B89902009
- B89902011
- B89902045
2 Outline
- Introduction
- Edit Distance for Trees and Forests
- Cover Strategies
4 Motivation
- One way of comparing two ordered trees is by measuring their edit distance
- Application areas
  - Comparison of hierarchically structured data
  - Alignment of RNA secondary structures in computational biology
- Two algorithms using dynamic programming
  - Zhang-Shasha
  - Klein
5 Purpose
- A general analysis of dynamic programming for edit distance algorithms
- Study the complexity of those decompositions by counting the exact number of distinct recursive calls
- Define a new edit distance algorithm for trees which improves on the original algorithms with respect to the number of recursive calls
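7 Trees and forests
- A tree is a node (called the root) connected to an ordered sequence of disjoint trees
- Such a sequence is called a forest
- We write $l(A_1 \circ \cdots \circ A_n)$ for the tree composed of the node $l$ connected to the sequence of trees $A_1, \ldots, A_n$
[Figure: two example ordered trees on nodes 1 to 5, and the general shape of $l(A_1 \circ \cdots \circ A_n)$: a root $l$ above the subtrees $A_1, A_2, \ldots, A_n$.]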
8 Notation
- $|F|$ denotes the number of nodes of the forest $F$
- $SF(F)$ is the set of all subforests of $F$
- $F(i)$, where $i$ is a node of $F$, denotes the subtree of $F$ rooted at $i$
- $\deg(i)$ is the degree of $i$, that is, the number of children of $i$
[Figure: an example forest $F$ on nodes 1 to 10 with $|F| = 10$, one of its subforests (an element of $SF(F)$), the subtree $F(2)$, and $\deg(4) = 2$.]
9 Edit distance
- Let $F$ and $G$ be two forests. The edit distance between $F$ and $G$, denoted $d(F, G)$, is the minimal cost of the edit operations needed to transform $F$ into $G$
- Operations
  - Substitution
  - Insertion
  - Deletion
- Let $C_s$, $C_i$, $C_d$ denote the costs of substitution, insertion and deletion
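10 Recursive relationship (1/3)
- Strings
- $u$, $v$ are strings; $x$, $y$ are alphabet symbols
- $d(xu, yv) = \min\{\, C_d(x) + d(u, yv),\ C_i(y) + d(xu, v),\ C_s(x, y) + d(u, v) \,\}$
- $d(ux, vy) = \min\{\, C_d(x) + d(u, vy),\ C_i(y) + d(ux, v),\ C_s(x, y) + d(u, v) \,\}$
[Figure: the strings built from $u$, $v$ and the symbols $x$, $y$, peeled either from the left or from the right.]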
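The first string recurrence (peeling the leftmost symbols) can be sketched as a memoized function. This is only an illustration: the function name and the default unit costs `cd`, `ci`, `cs` are assumptions, not part of the slides.

```python
from functools import lru_cache

def string_edit_distance(s, t,
                         cd=lambda x: 1,                      # assumed unit deletion cost
                         ci=lambda y: 1,                      # assumed unit insertion cost
                         cs=lambda x, y: 0 if x == y else 1): # assumed substitution cost
    """Edit distance between s and t using d(xu, yv) = min{...}."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between the suffixes s[i:] and t[j:].
        if i == len(s):
            return sum(ci(y) for y in t[j:])   # only insertions remain
        if j == len(t):
            return sum(cd(x) for x in s[i:])   # only deletions remain
        x, y = s[i], t[j]
        return min(cd(x) + d(i + 1, j),         # delete x
                   ci(y) + d(i, j + 1),         # insert y
                   cs(x, y) + d(i + 1, j + 1))  # substitute x by y

    return d(0, 0)

print(string_edit_distance("kitten", "sitting"))
```

Peeling from the right instead (the second recurrence) gives the same distance; for strings both directions visit the same number of subproblems, which is no longer true for forests.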
11 Recursive relationship (2/3)
- Trees
- $l$, $l'$ are roots; $F$, $F'$ are forests
- $d(l(F), l'(F')) = \min\{\, C_d(l) + d(F, l'(F')),\ C_i(l') + d(l(F), F'),\ C_s(l, l') + d(F, F') \,\}$
[Figure: the trees $l(F)$ and $l'(F')$.]
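12 Recursive relationship (3/3)
- Forests
- $T$, $T'$ are forests
- Left decomposition
  - $d(l(F) \circ T,\ l'(F') \circ T') = \min\{\, C_d(l) + d(F \circ T,\ l'(F') \circ T'),\ C_i(l') + d(l(F) \circ T,\ F' \circ T'),\ d(l(F), l'(F')) + d(T, T') \,\}$
- Right decomposition
  - $d(T \circ l(F),\ T' \circ l'(F')) = \min\{\, C_d(l) + d(T \circ F,\ T' \circ l'(F')),\ C_i(l') + d(T \circ l(F),\ T' \circ F'),\ d(l(F), l'(F')) + d(T, T') \,\}$
- A direction indicates whether the left or the right decomposition is applied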
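The two forest decompositions can be sketched together in one memoized function whose direction is chosen by a caller-supplied strategy. This is a minimal sketch under stated assumptions: trees are `(label, children)` tuples, forests are tuples of trees, all costs are unit costs, and the names `forest_distance` and `strategy` are illustrative only.

```python
from functools import lru_cache

def forest_size(forest):
    """Number of nodes in a forest of (label, children) trees."""
    return sum(1 + forest_size(children) for _, children in forest)

def forest_distance(f, g, strategy):
    """Return (distance, number of distinct pairs evaluated).

    strategy(f, g) must return "left" or "right" for nonempty f and g.
    """
    @lru_cache(maxsize=None)
    def d(f, g):
        if not f or not g:
            # Only deletions or only insertions remain (unit costs assumed).
            return forest_size(f) + forest_size(g)
        if strategy(f, g) == "left":
            tf, rest_f, tg, rest_g = f[0], f[1:], g[0], g[1:]
            f_minus_root = tf[1] + rest_f      # F o T
            g_minus_root = tg[1] + rest_g      # F' o T'
        else:
            tf, rest_f, tg, rest_g = f[-1], f[:-1], g[-1], g[:-1]
            f_minus_root = rest_f + tf[1]      # T o F
            g_minus_root = rest_g + tg[1]      # T' o F'
        if not rest_f and not rest_g:
            # Tree against tree: substitute the roots, then compare children.
            match = (0 if tf[0] == tg[0] else 1) + d(tf[1], tg[1])
        else:
            # d(l(F), l'(F')) + d(T, T')
            match = d((tf,), (tg,)) + d(rest_f, rest_g)
        return min(1 + d(f_minus_root, g),     # delete the root l
                   1 + d(f, g_minus_root),     # insert the root l'
                   match)

    return d(f, g), d.cache_info().currsize

A = (("a", (("b", ()), ("c", ()))),)
B = (("a", (("b", ()),)),)
print(forest_distance(A, B, lambda f, g: "left"))
print(forest_distance(A, B, lambda f, g: "right"))
```

Every strategy returns the same distance; only the second component, the number of distinct recursive calls, depends on the strategy, and that number is what the rest of the analysis counts.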
13 Example
[Figure: the successive subforests produced by the left decomposition of an example tree on nodes 1 to 5, and those produced by the right decomposition of the same tree.]
14 Strategy & Relevant forests
- Let $F$ and $G$ be two forests. A strategy is a mapping from $SF(F) \times SF(G)$ to $\{\text{left}, \text{right}\}$
- Let $(F, F')$ be a pair of forests provided with a strategy $\phi$. The set $RF_\phi(F, F')$ of relevant forests is defined as the least subset of $SF(F) \times SF(F')$ such that if the decomposition of $(F, F')$ meets the pair $(G, G')$, then $(G, G')$ belongs to $RF_\phi(F, F')$
- $RF_\phi(F)$ and $RF_\phi(F')$ denote the projections of $RF_\phi(F, F')$ on $SF(F)$ and $SF(F')$
- $\mathrm{relevant}(\cdot)$ denotes the number of relevant forests
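15 Proposition (1/2)
- $F = F' = \emptyset \Rightarrow RF_\phi(F, F') = \emptyset$
- $\phi(F, F') = \text{left}$, $F = l(G) \circ T$, $F' = \emptyset$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(G \circ T, F')$
- $\phi(F, F') = \text{right}$, $F = T \circ l(G)$, $F' = \emptyset$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(T \circ G, F')$
- $\phi(F, F') = \text{left}$, $F = \emptyset$, $F' = l'(G') \circ T'$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(F, G' \circ T')$
Reminder of the two decompositions:
$d(l(G) \circ T,\ l'(G') \circ T') = \min\{\, C_d(l) + d(G \circ T,\ l'(G') \circ T'),\ C_i(l') + d(l(G) \circ T,\ G' \circ T'),\ d(l(G), l'(G')) + d(T, T') \,\}$
$d(T \circ l(G),\ T' \circ l'(G')) = \min\{\, C_d(l) + d(T \circ G,\ T' \circ l'(G')),\ C_i(l') + d(T \circ l(G),\ T' \circ G'),\ d(l(G), l'(G')) + d(T, T') \,\}$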
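16 Proposition (2/2)
- $\phi(F, F') = \text{right}$, $F = \emptyset$, $F' = T' \circ l'(G')$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(F, T' \circ G')$
- $\phi(F, F') = \text{left}$, $F = l(G) \circ T$, $F' = l'(G') \circ T'$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(G \circ T, F') \cup RF_\phi(F, G' \circ T') \cup RF_\phi(l(G), l'(G')) \cup RF_\phi(T, T')$
- $\phi(F, F') = \text{right}$, $F = T \circ l(G)$, $F' = T' \circ l'(G')$
  - $\Rightarrow RF_\phi(F, F') = \{(F, F')\} \cup RF_\phi(T \circ G, F') \cup RF_\phi(F, T' \circ G') \cup RF_\phi(l(G), l'(G')) \cup RF_\phi(T, T')$
Reminder of the two decompositions:
$d(l(G) \circ T,\ l'(G') \circ T') = \min\{\, C_d(l) + d(G \circ T,\ l'(G') \circ T'),\ C_i(l') + d(l(G) \circ T,\ G' \circ T'),\ d(l(G), l'(G')) + d(T, T') \,\}$
$d(T \circ l(G),\ T' \circ l'(G')) = \min\{\, C_d(l) + d(T \circ G,\ T' \circ l'(G')),\ C_i(l') + d(T \circ l(G),\ T' \circ G'),\ d(l(G), l'(G')) + d(T, T') \,\}$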
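Projected on the first forest, the rules of the proposition say that a forest contributes itself, the forest with one root removed, the tree that was split off, and the remainder. The sketch below enumerates that projection for a strategy that, by assumption, looks only at the first forest. Trees are `(node_id, children)` tuples with unique ids (an assumption that makes equal tuples mean equal subforests); `relevant_forests` and `complete_binary_tree` are illustrative names.

```python
import itertools

def relevant_forests(forest, strategy):
    """Set of nonempty relevant subforests of `forest`.

    strategy(f) returns "left" or "right" for a nonempty forest f.
    """
    seen = set()
    stack = [forest]
    while stack:
        f = stack.pop()
        if not f or f in seen:
            continue
        seen.add(f)
        if strategy(f) == "left":
            tree, rest = f[0], f[1:]
            stack += [tree[1] + rest, (tree,), rest]   # G o T, l(G), T
        else:
            tree, rest = f[-1], f[:-1]
            stack += [rest + tree[1], (tree,), rest]   # T o G, l(G), T
    return seen

def complete_binary_tree(height, ids=None):
    """Complete binary tree with unique node ids assigned in preorder."""
    ids = ids if ids is not None else itertools.count(1)
    node = next(ids)
    if height == 1:
        return (node, ())
    return (node, (complete_binary_tree(height - 1, ids),
                   complete_binary_tree(height - 1, ids)))

tree = complete_binary_tree(3)            # 7 nodes
print(len(relevant_forests((tree,), lambda f: "left")))
```

Swapping in another `strategy` function makes it easy to compare how many relevant forests different strategies produce on the same tree.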
17 Lemma 1
- Given a tree $A = l(A_1 \circ \cdots \circ A_n)$, for any strategy we have
  - $\mathrm{relevant}(A) \ge |A| - |A_i| + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$
  - where $i \in [1..n]$ is such that the size of $A_i$ is maximal
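18 Proof (1/2)
- Let $F = A_1 \circ \cdots \circ A_n$; then $RF(A) = \{A\} \cup RF(F)$
  - $\Rightarrow \mathrm{relevant}(A) = 1 + \mathrm{relevant}(F)$
- When $n = 1$
  - $F = A_1$, $A = l(A_1)$
  - $\Rightarrow \mathrm{relevant}(A) = 1 + \mathrm{relevant}(A_1) = |A| - |A_1| + \mathrm{relevant}(A_1)$
- When $n > 1$
  - Suppose the direction is left. Let $A_1 = l(F_1)$ and $T = A_2 \circ \cdots \circ A_n$
  - $RF(F) = \{F\} \cup RF(A_1) \cup RF(T) \cup RF(F_1 \circ T)$
  - $|RF(F_1 \circ T) \setminus (RF(F_1) \cup RF(T))| \ge \min(|F_1|, |T|)$
  - $\Rightarrow \mathrm{relevant}(F) \ge 1 + \mathrm{relevant}(A_1) + \mathrm{relevant}(T) + \min(|F_1|, |T|)$
  - Let $j \in [2..n]$ be such that $|A_j|$ is maximal among $A_2, \ldots, A_n$; by induction on $T$,
  - $\Rightarrow \mathrm{relevant}(F) \ge 1 + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n) + |T| - |A_j| + \min(|F_1|, |T|)$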
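19 Take a look
- Goal: $\mathrm{relevant}(A) \ge |A| - |A_i| + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$
- $\Leftrightarrow \mathrm{relevant}(F) \ge |F| - |A_i| + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$
- So far: $\mathrm{relevant}(F) \ge 1 + |T| - |A_j| + \min(|F_1|, |T|) + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$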
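20 Proof (2/2)
- It remains to show $1 + |T| - |A_j| + \min(|F_1|, |T|) \ge |F| - |A_i|$
- 1) If $|F_1| \le |T|$
  - $\Rightarrow 1 + |T| + \min(|F_1|, |T|) = 1 + |T| + |F_1| = |F|$
  - Since $|A_j| \le |A_i|$
  - $\Rightarrow 1 + |T| - |A_j| + \min(|F_1|, |T|) = |F| - |A_j| \ge |F| - |A_i|$
- 2) If $|F_1| > |T|$
  - $\Rightarrow |F| - |A_i| = |T|$ (since $i = 1$)
  - $\Rightarrow 1 + |T| - |A_j| + \min(|F_1|, |T|) = 1 + |T| + |T| - |A_j| \ge 1 + |T| > |F| - |A_i|$
- $\Rightarrow \mathrm{relevant}(F) \ge |F| - |A_i| + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$
- $\Rightarrow \mathrm{relevant}(A) \ge |A| - |A_i| + \mathrm{relevant}(A_1) + \cdots + \mathrm{relevant}(A_n)$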
21 Lemma 2
- For every natural number $n$, there exists a tree $A$ of size $n$ such that, for any strategy, $\mathrm{relevant}(A)$ has a lower bound in $\Omega(n \log n)$
- For the complete balanced binary tree $T_n$ of size $n$,
  - prove by induction on $n$ that
  - $\mathrm{relevant}(T_n) \ge (n+1)\log_2(n+1)/2$
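The induction can be checked numerically by unfolding the bound of Lemma 1 on complete binary trees. In this sketch a tree is simply the tuple of its children (labels play no role in counting); `lower_bound` and `complete_binary_tree` are illustrative names, and a single node is taken to have exactly one relevant forest.

```python
import math

def size(tree):
    """Number of nodes; a tree is the tuple of its child trees."""
    return 1 + sum(size(c) for c in tree)

def lower_bound(tree):
    """Unfold Lemma 1: |A| - max|A_i| + sum of the children's bounds."""
    if not tree:
        return 1                      # a single node: one relevant forest
    largest = max(size(c) for c in tree)
    return size(tree) - largest + sum(lower_bound(c) for c in tree)

def complete_binary_tree(height):
    return () if height == 1 else (complete_binary_tree(height - 1),) * 2

for h in range(1, 8):
    t = complete_binary_tree(h)
    n = size(t)                       # n = 2**h - 1
    print(n, lower_bound(t), (n + 1) * math.log2(n + 1) / 2)
```

On these trees the unfolded bound and $(n+1)\log_2(n+1)/2$ coincide, since both satisfy the recurrence $b(h) = 2^{h-1} + 2\,b(h-1)$ with $b(1) = 1$.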
23 Idea
- Suppose the direction is left
- $RF(l(F) \circ T) = \{l(F) \circ T\} \cup RF(l(F)) \cup RF(F \circ T) \cup RF(T)$
Since $T$ is a subforest of $F \circ T$, we want to eliminate the nodes of $F$ in $F \circ T$ first, so that $RF(F \circ T)$ and $RF(T)$ share as many relevant forests as possible!
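24 Cover
- Let $F$ be a forest. A cover $r$ of $F$ is a mapping from $F$ to $F \cup \{\text{left}, \text{right}\}$ satisfying, for each node $i$ in $F$:
  - if $\deg(i) = 0$ or $1$, then $r(i) \in \{\text{left}, \text{right}\}$
  - if $\deg(i) > 1$, then $r(i)$ is a child of $i$
[Figure: two example covers of a tree on nodes 1 to 4; nodes of degree 0 or 1 are labelled left or right.]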
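The two conditions of a cover translate directly into a small validity check. The representation is an assumption made for illustration: `children` maps each node to the ordered list of its children, `cover` maps each node to `"left"`, `"right"` or one of its children, and node ids are integers.

```python
def is_cover(children, cover):
    """Check that `cover` satisfies both conditions of the definition."""
    for node, kids in children.items():
        choice = cover.get(node)
        if len(kids) <= 1:
            # deg(i) = 0 or 1: r(i) must be a direction.
            if choice not in ("left", "right"):
                return False
        elif choice not in kids:
            # deg(i) > 1: r(i) must be one of the children of i.
            return False
    return True

# Hypothetical example tree: 1 has children 2 and 3, and 3 has the child 4.
children = {1: [2, 3], 2: [], 3: [4], 4: []}
print(is_cover(children, {1: 3, 2: "left", 3: "right", 4: "left"}))
print(is_cover(children, {1: "left", 2: "left", 3: "right", 4: "left"}))
```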
25 Cover strategy
- Given a pair of trees $(A, B)$ and a cover $r$ for $A$, we associate a unique strategy $\phi$ as follows.
- If $\deg(i) = 0$ or $1$, then $\phi(A(i), G) = r(i)$, for each forest $G$ in $B$
- If $A(i)$ is of the form $l(A_1 \circ \cdots \circ A_n)$ with $n > 1$, then let $p \in \{1, \ldots, n\}$ be such that the favorite child $r(i)$ is the root of $A_p$. For each forest $G$ of $B$, we define
  - $\phi(A(i), G) = \text{right}$ whenever $p = 1$, left otherwise
  - $\phi(T \circ A_p \circ \cdots \circ A_n, G) = \text{left}$, for each forest $T$ of $A_1 \circ \cdots \circ A_{p-1}$
  - $\phi(A_p \circ T, G) = \text{right}$, for each forest $T$ of $A_{p+1} \circ \cdots \circ A_n$
- The tree $A$ is called the cover tree. A strategy is a cover strategy if there exists a cover tree associated to it
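26 Cover strategy (continued)
$\phi(A(i), G) = \text{right}$ whenever $p = 1$, left otherwise
$\phi(T \circ A_p \circ \cdots \circ A_n, G) = \text{left}$, for each forest $T$ of $A_1 \circ \cdots \circ A_{p-1}$
$\phi(A_p \circ T, G) = \text{right}$, for each forest $T$ of $A_{p+1} \circ \cdots \circ A_n$
[Figure: a node $i$ whose subtree $A(i)$ has the children subtrees $A_1, A_2, A_3, A_4$, compared with a forest $G$.]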
27 Some Tasks
- The order of our tasks
  - Analyze tree A
  - Analyze tree B
  - Combine the analyses of tree A and tree B
  - Count the distinct pairs (recursively)
28 Analyze tree A
29 Tree A
- Focus on $\mathrm{relevant}(A)$ (in detail)
- Cover strategies in A
- How A differs from B
30 Lemma 3
[Figure: Lemma 3, stated pictorially on two forests $F$ and $G$ with nodes $i$ and $j$.]
This is trivial
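31 Lemma 4
- $RF(l(F) \circ T) = \{l(F) \circ T,\ F_1 \circ T,\ \ldots,\ F_k \circ T\} \cup RF(l(F)) \cup RF(T)$
- where
  - $k = |F|$ is the number of nodes of $F$
  - $F_{m+1}$ is obtained from $F_m$ by a left decomposition
  - so $F_1, F_2, \ldots, F_k$ are the forests produced by successive left decompositions of $F$
- This holds because, for a cover strategy, $\phi(l(F) \circ T) = \text{left}$
- The proof applies the decomposition recursively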
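32 $RF(l(F) \circ T)$
Since this is a cover strategy, the direction is left
[Figure: $l(F) \circ T$ is decomposed into $l(F)$, $T$ and $F \circ T$, which give $RF(l(F))$, $RF(T)$ and $RF(F \circ T)$.]
$RF(l(F) \circ T) = \{l(F) \circ T\} \cup RF(l(F)) \cup RF(T) \cup RF(F \circ T)$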
33 $RF(F \circ T)$
Continue..
Since this is a cover strategy, the direction is left
[Figure: left decomposition of $F \circ T$, with the labels $F_1$, $\subseteq RF(l(F))$ and $RF(T)$.]
34 So ...
[Figure: the successive left decompositions of $F \circ T$.]
$F_1 \circ T,\ \ldots,\ F_k \circ T$
35 Conclusion
- $RF(l(F) \circ T) = \{l(F) \circ T,\ F_1 \circ T,\ \ldots,\ F_k \circ T\} \cup RF(l(F)) \cup RF(T)$
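36 Lemma 5
- $\mathrm{relevant}(A) = |A| - |A_j| + \mathrm{relevant}(A_1) + \mathrm{relevant}(A_2) + \cdots + \mathrm{relevant}(A_n)$
- where $A = l(A_1 \circ A_2 \circ \cdots \circ A_n)$
- and $A_j$ is the favorite child of $A$
- This gives the number of relevant forests of a cover tree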
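37 A
[Figure: the tree $A = l(A_1 \circ \cdots \circ A_n)$ with the subtree $A_j$ highlighted.]
$A_j$ is the favorite child of $A$, $j \in [1..n]$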
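38 Part 1: $|A| - |A_j|$
Note:
$\phi(A(i), G) = \text{right}$ whenever $p = 1$, left otherwise
$\phi(T \circ A_p \circ \cdots \circ A_n, G) = \text{left}$, for each forest $T$ of $A_1 \circ \cdots \circ A_{p-1}$
$\phi(A_p \circ T, G) = \text{right}$, for each forest $T$ of $A_{p+1} \circ \cdots \circ A_n$
Since $A_j$ is the favorite child of $A$, $|A| - |A_j|$ is the number of relevant forests of $A$ that strictly contain $A_j$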
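39 Part 2: $\mathrm{relevant}(A_1) + \mathrm{relevant}(A_2) + \cdots + \mathrm{relevant}(A_n)$
Note: $RF(A_1 \circ A_2 \circ \cdots \circ A_n) = \{A_1 \circ A_2 \circ \cdots \circ A_n\} \cup RF(F_1 \circ A_2 \circ \cdots \circ A_n) \cup RF(A_1) \cup RF(A_2 \circ \cdots \circ A_n)$
[Figure: the forest $A_1, A_2, A_3, A_4, \ldots, A_n$.]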
40 Conclusion
- $\mathrm{relevant}(A) = |A| - |A_j| + \mathrm{relevant}(A_1) + \mathrm{relevant}(A_2) + \cdots + \mathrm{relevant}(A_n)$
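Lemma 5 turns the count for a cover tree into a simple recurrence, so different choices of favorite child can be compared directly. In this sketch a tree is the tuple of its children, `favorite` is a hypothetical callback returning the index of the favorite child of a node with several children, and the example tree is made up for illustration.

```python
def size(tree):
    """Number of nodes; a tree is the tuple of its child trees."""
    return 1 + sum(size(c) for c in tree)

def relevant(tree, favorite):
    """relevant(A) = |A| - |A_j| + relevant(A_1) + ... + relevant(A_n)."""
    if not tree:
        return 1                                  # single node
    # With one child there is no choice to make: j is that child.
    j = favorite(tree) if len(tree) > 1 else 0
    return size(tree) - size(tree[j]) + sum(relevant(c, favorite) for c in tree)

def rightmost(kids):
    return len(kids) - 1

def heaviest(kids):
    return max(range(len(kids)), key=lambda k: size(kids[k]))

# Root with three children: a leaf, a chain of three nodes, a leaf.
example = ((), (((),),), ())
print(relevant(example, rightmost))
print(relevant(example, heaviest))
```

Choosing the largest child as favorite meets the lower bound of Lemma 1 at every node, which is why it never produces more relevant forests of A than any other cover.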
41 Free node
- What is a free node?
- Definition
  - the root of A
  - a node whose parent is of degree greater than 1 and which is not the favorite child
[Figure: a tree in which the favorite children and the free nodes are marked.]
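The definition of a free node can be sketched as a single pass over the tree. The dictionaries `children` and `cover` follow an assumed representation (node to ordered child list, and node to direction or favorite child); the example tree is hypothetical.

```python
def free_nodes(children, cover, root):
    """Return the set of free nodes of a cover tree."""
    free = {root}                                  # the root is always free
    for node, kids in children.items():
        if len(kids) > 1:
            # Children of a node of degree > 1, except its favorite child.
            free.update(k for k in kids if k != cover[node])
    return free

# Node 1 has children 2, 3, 4 with favorite child 3; node 3 has the child 5.
children = {1: [2, 3, 4], 2: [], 3: [5], 4: [], 5: []}
cover = {1: 3, 2: "left", 3: "left", 4: "left", 5: "left"}
print(sorted(free_nodes(children, cover, root=1)))
```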
42 Analyze tree B
43 Tree B
- B is not treated the same way as A
- So there is no cover strategy on B
- Focus on the following three things
  - Rightmost forests
  - Leftmost forests
  - Special forests
44 Three Things (1)
Rightmost $\cup$ leftmost $=$ special?
NO!
- Definition
  - Rightmost forests
    - the subforests obtained from B by applying only left decompositions
  - Leftmost forests
    - the subforests obtained from B by applying only right decompositions
  - Special forests
    - the subforests obtained from B by applying left or right decompositions
45 Example
[Figure: successive left decompositions of B, producing all rightmost forests of B.]
46 Three Things (2)
- Three categories
- The relevant forests of A fall into three categories
  - ($\alpha$) those that are compared with all rightmost forests of B
  - ($\beta$) those that are compared with all leftmost forests of B
  - ($\gamma$) those that are compared with all special forests of B
Why?
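47 Three Things (3)
- The number of rightmost, leftmost and special forests ($\mathrm{right}$, $\mathrm{left}$, $\mathrm{special}$)
  - $\mathrm{right}(B) = \sum(|B(i)|,\ i \in B) - \sum(|B(i)|,\ i \text{ is a rightmost child})$
  - $\mathrm{left}(B) = \sum(|B(i)|,\ i \in B) - \sum(|B(i)|,\ i \text{ is a leftmost child})$
  - $\mathrm{special}(B) = |B|(|B|+3)/2 - \sum(|B(i)|,\ i \in B)$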
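48 Computing $\mathrm{right}(B)$, $\mathrm{left}(B)$
- Rightmost forests: they correspond to the cover strategy in which every favorite child is the rightmost child, so that every decomposition is a left decomposition
- Leftmost forests: they correspond to the cover strategy in which every favorite child is the leftmost child, so that every decomposition is a right decomposition
$\mathrm{right}(B) = \sum(|B(i)|,\ i \in B) - \sum(|B(i)|,\ i \text{ is a rightmost child})$
Recursively: $\mathrm{right}(B) = |B| - |B_n| + \mathrm{right}(B_1) + \cdots + \mathrm{right}(B_n)$
$\mathrm{left}(B) = \sum(|B(i)|,\ i \in B) - \sum(|B(i)|,\ i \text{ is a leftmost child})$
Recursively: $\mathrm{left}(B) = |B| - |B_1| + \mathrm{left}(B_1) + \cdots + \mathrm{left}(B_n)$
Review: $\mathrm{relevant}(B) = |B| - |B_j| + \mathrm{relevant}(B_1) + \cdots + \mathrm{relevant}(B_n)$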
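The closed formulas and the recursive forms can be cross-checked on a small tree. As before, a tree is assumed to be the tuple of its children; the function names and the example tree are illustrative only.

```python
def size(tree):
    """Number of nodes; a tree is the tuple of its child trees."""
    return 1 + sum(size(c) for c in tree)

def subtrees(tree):
    """Yield B(i) for every node i of B."""
    yield tree
    for child in tree:
        yield from subtrees(child)

def total_size(tree):
    """Sum of |B(i)| over all nodes i of B."""
    return sum(size(s) for s in subtrees(tree))

def right_count(tree):
    rightmost = sum(size(s[-1]) for s in subtrees(tree) if s)
    return total_size(tree) - rightmost

def left_count(tree):
    leftmost = sum(size(s[0]) for s in subtrees(tree) if s)
    return total_size(tree) - leftmost

def special_count(tree):
    n = size(tree)
    return n * (n + 3) // 2 - total_size(tree)

def right_recursive(tree):
    """right(B) = |B| - |B_n| + right(B_1) + ... + right(B_n)."""
    if not tree:
        return 1
    return size(tree) - size(tree[-1]) + sum(right_recursive(c) for c in tree)

# Root with three children: a leaf, a node with two leaves, a leaf.
B = ((), ((), ()), ())
print(right_count(B), left_count(B), special_count(B), right_recursive(B))
```

For this six-node tree the special forests can also be listed by hand (14 of them), which matches the closed formula.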
50 Comparison
- Two types (for A)
  - Tree comparison
    - free node
    - favorite child
  - Forest comparison
51 Lemma 6
- Let $F$ be a relevant forest of A
  - if the direction is left, then $F$ is at least compared with all rightmost forests of B
  - if the direction is right, then $F$ is at least compared with all leftmost forests of B
Why?
52 Free nodes comparison
- Lemma 7
- Let $i$ be a free node of A
  - if the direction of $i$ is left, then $A(i)$ is ($\alpha$)
  - if the direction of $i$ is right, then $A(i)$ is ($\beta$)
Reminder:
($\alpha$) those that are compared with all rightmost forests of B
($\beta$) those that are compared with all leftmost forests of B
($\gamma$) those that are compared with all special forests of B
53 Proof of Lemma 7
If the direction of $i$ is left, then $A(i)$ is ($\alpha$)
- Consider $G$, the largest forest of B such that $(A(i), G)$ belongs to $RF(A, B)$ and $G$ is not a rightmost forest
- $G$ cannot be B itself, so $(A(i), G)$ is produced by the decomposition of another relevant pair
- Which pair produces $(A(i), G)$?
- There are four cases
G is a tree!
Not a free node!
Case 1: a node of B is removed. Since the direction of $A(i)$ is left, there exist a node $l$ and two forests $H$ and $P$ such that $G = H \circ P$ and $(A(i), l(H) \circ P)$ is in $RF(A, B)$, with $(A(i), l(H) \circ P) \to (A(i), G)$ by removing $l$. Since $G$ is the largest forest that is not rightmost, $l(H) \circ P$ is a rightmost forest of B, so $G = H \circ P$ is also a rightmost forest of B: contradiction.
Case 2: a node of A is removed. There exists a node $l$ such that $(l \circ A(i), G) \to (A(i), G)$, $(A(i) \circ l, G) \to (A(i), G)$ or $(l(A(i)), G) \to (A(i), G)$.
Case 3: tree comparison. $(A(i) \circ F_1, G \circ F_2) \to (A(i), G)$ or $(F_1 \circ A(i), F_2 \circ G) \to (A(i), G)$, where $(A(i), G)$ is the pair of trees split off by the decomposition.
Case 4: forest comparison. $(T_1 \circ A(i), T_2 \circ G) \to (A(i), G)$ or $(A(i) \circ T_1, G \circ T_2) \to (A(i), G)$, where $(A(i), G)$ is the pair of forests left over by the decomposition.
54 Forests comparison
- Lemma 8
- Let $F$ be a relevant forest of A that is not a tree. Let $i$ be the lowest common ancestor of the set of nodes of $F$ and $j$ be the favorite child of $i$
  - if $F$ is a rightmost forest whose leftmost tree is not $A(j)$, then $F$ has the same category as $A(i)$
  - if $F$ is a leftmost forest, then $F$ has the same category as $A(i)$
  - else $F$ is ($\gamma$)
55 Proof of Lemma 8, cases (1) and (2)
- The fact
  - (1) and (2) are trivial!
Both cases follow from how the decomposition treats the favorite child of the lowest common ancestor (LCA) of the forest, which fixes the category of the forest.
56 Proof of Lemma 8, case (3)
- $F$ is a forest that falls under neither (1) nor (2),
- its leftmost tree is $A(j)$, the tree rooted at the favorite child,
- so the direction of $F$ is right
- Now consider a forest $G$
If $G$ is a rightmost forest of B: $F$ is not a leftmost forest, so the favorite child is not the leftmost child and the direction of $A(i)$ is left; by the lemma, $(A(i), G)$ is relevant, and $(A(i), G) \to (F, G)$ by successive decompositions.
If $G$ is not a rightmost forest of B: $G$ is obtained from B by right decompositions and the direction of $F$ is right, so $(F, G)$ is relevant.
57 Favorite children comparison
- Lemma 9
- Let $i$ be a node of A that is not free, and $j$ be the parent of $i$
  - if the direction of $i$ is left, $i$ is the rightmost child of $j$ and the direction of $A(j)$ is left, then $A(i)$ is ($\alpha$)
  - if the direction of $i$ is right, $i$ is the leftmost child of $j$ and the direction of $A(j)$ is right, then $A(i)$ is ($\beta$)
  - else $A(i)$ is ($\gamma$)
58 Proof of Lemma 9
- The fact
  - all three cases are trivial!
(1) the left case, (2) the right case, (3) otherwise
59 Final Task
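60 Notation
- Let $i$ be a node of A, and let $j$ be the parent of $i$ (if $i$ is not the root)
  - $\mathrm{Free}(A(i)) = \mathrm{relevant}(A(i), B)$ if $i$ is free
  - $\mathrm{Right}(A(i)) = \mathrm{relevant}(A(i), B)$ if $A(j)$ is ($\alpha$)
  - $\mathrm{Left}(A(i)) = \mathrm{relevant}(A(i), B)$ if $A(j)$ is ($\beta$)
  - $\mathrm{All}(A(i)) = \mathrm{relevant}(A(i), B)$ if $A(j)$ is ($\gamma$)
- So, $\mathrm{relevant}(A, B) = \mathrm{Free}(A)$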
61 Theorem
- Let $(A, B)$ be a pair of trees, $A$ being a cover tree
- 7 cases
62 Case 1
- If A is reduced to a single node whose direction is right
$\mathrm{Free}(A) = \mathrm{left}(B)$
$\mathrm{Right}(A) = \mathrm{special}(B)$
$\mathrm{Left}(A) = \mathrm{left}(B)$
$\mathrm{All}(A) = \mathrm{special}(B)$
63 Case 2
- If A is reduced to a single node whose direction is left
$\mathrm{Free}(A) = \mathrm{right}(B)$
$\mathrm{Right}(A) = \mathrm{right}(B)$
$\mathrm{Left}(A) = \mathrm{special}(B)$
$\mathrm{All}(A) = \mathrm{special}(B)$
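64 Case 3
- If $A = l(A')$ and the direction of $l$ is right
- ($A'$ is a tree)
$\mathrm{Free}(A) = \mathrm{left}(B) + \mathrm{Left}(A')$
$\mathrm{Right}(A) = \mathrm{special}(B) + \mathrm{All}(A')$
$\mathrm{Left}(A) = \mathrm{left}(B) + \mathrm{Left}(A')$
$\mathrm{All}(A) = \mathrm{special}(B) + \mathrm{All}(A')$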
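65 Case 4
- If $A = l(A')$ and the direction of $l$ is left
- ($A'$ is a tree)
$\mathrm{Free}(A) = \mathrm{right}(B) + \mathrm{Right}(A')$
$\mathrm{Right}(A) = \mathrm{right}(B) + \mathrm{Right}(A')$
$\mathrm{Left}(A) = \mathrm{special}(B) + \mathrm{All}(A')$
$\mathrm{All}(A) = \mathrm{special}(B) + \mathrm{All}(A')$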
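66 Case 5
- If $A = l(A_1 \circ \cdots \circ A_n)$ and the favorite child is the leftmost child
$\mathrm{Free}(A) = \mathrm{left}(B)\,(|A| - |A_1|) + \mathrm{Left}(A_1) + \mathrm{Free}(A_2) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{Right}(A) = \mathrm{special}(B)\,(|A| - |A_1|) + \mathrm{All}(A_1) + \mathrm{Free}(A_2) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{Left}(A) = \mathrm{left}(B)\,(|A| - |A_1|) + \mathrm{Left}(A_1) + \mathrm{Free}(A_2) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{All}(A) = \mathrm{special}(B)\,(|A| - |A_1|) + \mathrm{All}(A_1) + \mathrm{Free}(A_2) + \cdots + \mathrm{Free}(A_n)$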
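67 Case 6
- If $A = l(A_1 \circ \cdots \circ A_n)$ and the favorite child is the rightmost child
$\mathrm{Free}(A) = \mathrm{right}(B)\,(|A| - |A_n|) + \mathrm{Right}(A_n) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{n-1})$
$\mathrm{Right}(A) = \mathrm{right}(B)\,(|A| - |A_n|) + \mathrm{Right}(A_n) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{n-1})$
$\mathrm{Left}(A) = \mathrm{special}(B)\,(|A| - |A_n|) + \mathrm{All}(A_n) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{n-1})$
$\mathrm{All}(A) = \mathrm{special}(B)\,(|A| - |A_n|) + \mathrm{All}(A_n) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{n-1})$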
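68 Case 7
- If $A = l(A_1 \circ \cdots \circ A_n)$ and the favorite child is $A_j$, with $1 < j < n$
$\mathrm{Free}(A) = \mathrm{right}(B)\,(1 + |A_1 \circ \cdots \circ A_{j-1}|) + \mathrm{special}(B)\,|A_{j+1} \circ \cdots \circ A_n| + \mathrm{All}(A_j) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{j-1}) + \mathrm{Free}(A_{j+1}) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{Right}(A) = \mathrm{right}(B)\,(1 + |A_1 \circ \cdots \circ A_{j-1}|) + \mathrm{special}(B)\,|A_{j+1} \circ \cdots \circ A_n| + \mathrm{All}(A_j) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{j-1}) + \mathrm{Free}(A_{j+1}) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{Left}(A) = \mathrm{special}(B)\,(|A| - |A_j|) + \mathrm{All}(A_j) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{j-1}) + \mathrm{Free}(A_{j+1}) + \cdots + \mathrm{Free}(A_n)$
$\mathrm{All}(A) = \mathrm{special}(B)\,(|A| - |A_j|) + \mathrm{All}(A_j) + \mathrm{Free}(A_1) + \cdots + \mathrm{Free}(A_{j-1}) + \mathrm{Free}(A_{j+1}) + \cdots + \mathrm{Free}(A_n)$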
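The seven cases can be sketched as one recursive function. Several simplifications are assumptions of this sketch rather than part of the theorem: a tree is the tuple of its children, every node of degree 0 or 1 shares the same `direction`, and `favorite` is a hypothetical callback giving the index of the favorite child of a node of degree greater than 1.

```python
def size(tree):
    return 1 + sum(size(c) for c in tree)

def subtrees(tree):
    yield tree
    for c in tree:
        yield from subtrees(c)

def forest_counts(b):
    """Return (right(B), left(B), special(B)) from the closed formulas."""
    total = sum(size(s) for s in subtrees(b))
    right = total - sum(size(s[-1]) for s in subtrees(b) if s)
    left = total - sum(size(s[0]) for s in subtrees(b) if s)
    n = size(b)
    return right, left, n * (n + 3) // 2 - total

def counts(a, b_counts, direction, favorite):
    """Return Free, Right, Left and All of the cover tree `a` as a dict."""
    r, l, s = b_counts
    if len(a) <= 1:                                   # cases 1 to 4
        zero = dict(free=0, right=0, left=0, all=0)
        sub = counts(a[0], b_counts, direction, favorite) if a else zero
        if direction == "right":
            low, high = l + sub["left"], s + sub["all"]
            return dict(free=low, left=low, right=high, all=high)
        low, high = r + sub["right"], s + sub["all"]
        return dict(free=low, right=low, left=high, all=high)
    j = favorite(a)
    subs = [counts(c, b_counts, direction, favorite) for c in a]
    others = sum(subs[k]["free"] for k in range(len(a)) if k != j)
    rest = size(a) - size(a[j])                       # |A| - |A_j|
    high = s * rest + subs[j]["all"] + others
    if j == 0:                                        # case 5
        low = l * rest + subs[j]["left"] + others
        return dict(free=low, left=low, right=high, all=high)
    if j == len(a) - 1:                               # case 6
        low = r * rest + subs[j]["right"] + others
        return dict(free=low, right=low, left=high, all=high)
    before = 1 + sum(size(c) for c in a[:j])          # case 7
    after = sum(size(c) for c in a[j + 1:])
    low = r * before + s * after + subs[j]["all"] + others
    return dict(free=low, right=low, left=high, all=high)

A = ((), ((), ()), ())
B = (((),), ())
bc = forest_counts(B)
print(counts(A, bc, "left", lambda kids: len(kids) - 1)["free"])
print(counts(A, bc, "left", lambda kids: len(kids) // 2)["free"])
```

With direction left everywhere and the rightmost child as favorite, only cases 2, 4 and 6 are used, and the recursion collapses to the product of right(A) and right(B); other covers can be compared by passing a different `favorite`.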
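69 Conclusion
- Steps
  - Given two trees A and B
  - Compute $\mathrm{right}(B)$, $\mathrm{left}(B)$, $\mathrm{special}(B)$
  - Compute $\mathrm{Free}(A)$ recursively, by the theorem
  - $\mathrm{relevant}(A, B) = \mathrm{Free}(A)$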
70 Example
- For the Zhang-Shasha algorithm
$\mathrm{relevant}(A, B) = \mathrm{right}(A) \times \mathrm{right}(B)$
Why?
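72 Choose the favorite child (1)
- Choose a good favorite child to minimize $\mathrm{Free}(A)$
Case 5 (the favorite child is the leftmost child)
Case 6 (the favorite child is the rightmost child)
Case 7 (the favorite child is neither the leftmost nor the rightmost child)
$\mathrm{Free}(A) = \min$ over these three cases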
73 Choose the favorite child (2)
Not necessarily !!
Why?
Need preprocessing time !!
74 The end
Happy New Year !