http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/fulltext.ppt

1
???????????????????????
  • ?? ??
  • ??????????
  • ??????

http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/fulltext.ppt
2
??
  • ????????????????
  • ????
  • ??????
  • ????
  • ????

3
??
  • ???????????
  • WWW, ???
  • ??, ??, ??
  • ?????????
  • ?????????????????
  • ???????????
  • ??????????

4
???????????
  • sequential search
  • signature file [Mooers 49]
  • ???????????????
  • inverted file [Bleier 67]
  • ?????????????
  • digital tree (trie)
  • ????????

5
Inverted file??????
????????????,?????
  • sorted array
  • ??????????????
  • prefix B-tree
  • ?????
  • trie
  • prefix?????????

6
Word indexes vs. Full-text indexes
full-text indexes
word indexes
  • ???????????
  • ???????
  • ?????
  • ?????
  • sorted array
  • prefix B-tree
  • trie
  • ????????
  • ???????
  • ?????
  • ?????
  • suffix array
  • String B-tree
  • suffix tree

7
Full-text index??????
  • suffix tree [Weiner 73]
  • suffix array [Manber, Myers 93]
  • String B-tree [Ferragina, Grossi 95]

8
Suffix tree
  • ???????suffix(???)???compacted trie
  • ???????????????
  • ???????
  • unbalanced

9
Suffix array
  • ???????suffix?????????????????
  • ?????(5N)
  • ?????
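
As a concrete illustration of the suffix array named on this slide, here is a minimal Python sketch. It builds the array with a plain comparison sort and answers pattern queries by binary search over it. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is a stand-in chosen for brevity, not one of the construction algorithms discussed later.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction by direct comparison sort; suitable for small inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

For the text "tobeornottobe" used later in these slides, `find_occurrences` reports positions 0 and 9 (0-based) for the pattern "tobe". All suffixes that start with the pattern are adjacent in the array, which is why two binary searches are enough.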

10
String B-tree
  • suffix??????B-tree??????
  • ????disk????????(blind tree)
  • ?????????
  • ????????
  • ??? 13N
  • 1????????

abab
11
I/O complexity
  • ???I/O complexity
  • ???I/O complexity
  • ???I/O complexity

12
???I/O complexity
  • Suffix tree
  • N??????
  • String B-tree
  • Suffix array

p ?????? occ ???? N ????
13
???I/O complexity
  • Suffix tree
  • N??????
  • String B-tree
  • Suffix array
  • ??????????String B-tree?????

p ?????? N ???? B ??????????
14
???I/O complexity
  • suffix tree
    (optimal)
  • suffix array
  • String B-tree

N ???? M ?????? B ??????????
15
????????
  • Suffix tree???
  • ????
  • ?????
  • Suffix array???
  • ????
  • ?????

16
Suffix tree???
  • ???? (????)
  • Weiner 73
  • McCreight 76
  • Ukkonen 95
  • Farach 97
  • divide and conquer, batch??
  • ?????
  • Farach, Ferragina, Muthukrishnan 98

17
Disk???suffix tree??
  • ???????sorting?scan???
  • ??sorting???I/O complexity (optimal)


18
Sorting I/O complexity
  • ?????sorting???I/O complexity???
  • tree?????lca?K? (K??range minima)
  • tree T ?Euler Tour ET(T)???????
  • ???????????K??
  • tree?????????mark???????
  • uncompacted trie?merge
  • suffix tree????suffix link
  • suffix tree???
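
The legible parts of this slide mention lca queries, range minima, and the Euler tour ET(T). The connection between them can be sketched in Python as below: the lowest common ancestor of two nodes is the shallowest node on the Euler tour between their first visits. The names `euler_tour` and `lca` and the dictionary-based tree are assumptions for illustration, and the linear scan stands in for a real range-minima structure or the batched, sorting-based processing the slide refers to.

```python
def euler_tour(children, root):
    """Return (tour, depth, first): visited nodes, their depths, first tour index per node."""
    tour, depth, first = [], [], {}

    def visit(node, d):
        first.setdefault(node, len(tour))
        tour.append(node)
        depth.append(d)
        for child in children.get(node, []):
            visit(child, d + 1)
            # Returning from a child visits the parent again.
            tour.append(node)
            depth.append(d)

    visit(root, 0)
    return tour, depth, first


def lca(tour, depth, first, u, v):
    """Lowest common ancestor of u and v as a range minimum over tour depths."""
    lo, hi = sorted((first[u], first[v]))
    # Linear scan for the minimum depth; a stand-in for a range-minima structure.
    k = min(range(lo, hi + 1), key=depth.__getitem__)
    return tour[k]
```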

19
Block vs. Random I/O
  • 2-way merge
  • M/B-way merge
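
To make the contrast between 2-way and M/B-way merging concrete, the following Python sketch merges sorted runs with a given fan-in and counts the passes. Each pass reads and writes all the data once, so fewer passes means fewer block transfers. The function name `merge_passes` is hypothetical, and the sketch works on in-memory lists; it only models the pass count, not actual disk I/O.

```python
import heapq


def merge_passes(runs, fan_in):
    """Merge sorted runs fan_in at a time; return (merged list, number of passes)."""
    passes = 0
    while len(runs) > 1:
        # One pass: every group of fan_in runs is merged into a single longer run.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes
```

With 16 initial runs, a fan-in of 2 needs 4 passes while a fan-in of 4 needs 2. In the external-memory setting the largest usable fan-in is about M/B, one block-sized buffer per input run.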

20
Disk??suffix tree????
  • tree????????????? ??????

21
Algorithm outline
  • Odd tree???
  • Even tree???
  • merge??


22
Building the odd tree
  • ????2???1???????? ??N/2???????
  • ???????suffix tree???????
  • ???????

Example: abab -> AA

23
Building the even tree
  • ?????suffix?????radix sort??
  • (?????, ?????suffix????)
  • ????suffix??lcp????
  • compacted trie???
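
The sorting step for even suffixes can be sketched in Python under the following reading of this slide: an even suffix is a single character followed by an odd suffix, so once the odd suffixes are sorted, each even suffix can be ordered by the pair (first character, rank of the following odd suffix). The name `sort_even_suffixes` is hypothetical, indices are 0-based (so the slide's odd positions 1, 3, 5, ... are indices 0, 2, 4, ...), and the built-in sort stands in for radix sort. Computing lcp values and building the compacted trie are not shown.

```python
def sort_even_suffixes(text, odd_order):
    """Order the even suffixes given the sorted odd suffixes.

    odd_order: 0-based start indices 0, 2, 4, ... in lexicographic suffix order.
    Returns the 0-based start indices 1, 3, 5, ... in lexicographic suffix order.
    """
    n = len(text)
    rank = {start: r for r, start in enumerate(odd_order)}
    # The empty suffix at index n sorts before every non-empty suffix.
    rank[n] = -1
    # Sort by (first character, rank of the odd suffix that follows it).
    return sorted(range(1, n, 2), key=lambda i: (text[i], rank[i + 1]))
```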

abab
2
4
24
Merging the odd and even trees
  • anchor pair?????
  • side tree pair?????
  • pull node?????
  • merge node?????
  • Te?To?merge??

25
Suffix array?????????
  • quick sort
  • ??????????????
  • ternary partitioning [Bentley, Sedgewick 97]
  • ????????????
  • ????????????
  • doubling algorithm
  • Manber, Myers 93
  • Sadakane, Imai 98
  • ???????

26
Doubling algorithm
  • Karp, Miller, Rosenberg 72
  • ????????????[Arge et al. 97]
  • ?? 1, 2, 4, ????????????
  • log n ??????????????

27
Suffix sorting by doubling (1/5)
  • ?suffix????1???????????
  • ???????????
  • ????????suffix?2???????
  • ????? (?????? ???2??????)
  • ????????suffix?3,4???????
  • ???????????
  • ???suffix?????????????
  • ??????????????skip??
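
The doubling idea can be sketched in Python as follows: once suffixes are ranked by their first h characters, ordering them by the pair (rank of suffix i, rank of suffix i + h) ranks them by their first 2h characters. This is a minimal sketch in the spirit of [Manber, Myers 93], with the hypothetical name `suffix_array_by_doubling`. It re-sorts the whole array each round with the built-in sort and does not skip groups that are already fully sorted, so it is not the refined algorithm of [Sadakane, Imai 98].

```python
def suffix_array_by_doubling(text):
    """Sort suffixes by prefix doubling: compare on 2, 4, 8, ... characters."""
    n = len(text)
    if n == 0:
        return []
    # rank[i] is the group number of suffix i when compared on its first h characters.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    h = 1
    while True:
        # Order on 2h characters is given by the pair (rank[i], rank[i + h]).
        # A suffix with fewer than h + 1 characters gets -1 so that it sorts first.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            # Start a new group whenever the pair differs from the previous one.
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:
            # Every group has size 1, so the order is final.
            return sa
        h *= 2
```

At most about log n rounds are needed, since the compared prefix length doubles each round. For "tobeornottobe" the result agrees with a direct sort of the suffixes.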

28
Suffix sorting by doubling (2/5)
[Figure: arrays I[i], V[I[i]] and V[I[i]+1] for the text "tobeornottobe"; each suffix is shown by its first few characters.]
29
Suffix sorting by doubling (3/5)
[Figure: arrays I[i], V[I[i]] and V[I[i]+2] for the text "tobeornottobe"; each suffix is shown by its first few characters.]
30
Suffix sorting by doubling (4/5)
[Figure: arrays I[i], V[I[i]] and V[I[i]+4] for the text "tobeornottobe"; each suffix is shown by its first few characters.]
31
Suffix sorting by doubling (5/5)
[Figure: final arrays I[i] and V[I[i]] for the text "tobeornottobe"; I = 13 11 2 12 3 6 10 1 4 7 5 9 0 8, where position 13 is the end of the text.]
?????
32
Suffix array??????????
  • Gonnet, Baeza-Yates, Snider 92
  • disk?sequential access??
  • Crauser, Ferragina 98
  • doubling algorithm discarding

33
Doubling algorithm discarding
  • doubling algorithm?????????
  • ????
  • M/B-way ??????????
  • ?????????
  • ??????????????????

34
Word indexes vs. Full-text indexes
35
???
word indexes
full-text indexes
  • ???????
  • ??????1/7 (???, ????)
  • ????????
  • ????????
  • DNA????????
  • ????????
  • ?????????????
  • ??????????
  • ??????????
  • ??????????????
  • ??????????????
  • AND??????

36
Full-text index?????
  • ??????????????
  • ?????????????????
  • ???grep????????
  • ???????
  • Full-text index??word index?????
  • ?????????
  • ?????index??????
  • index?????????????????

37
??
  • ???????????
  • AND???
  • ????????
  • OR??
  • ?????????????
  • ???????????
  • ????????
  • ???????????
  • word index????

38
????????
  • Block sorting??? [Burrows, Wheeler 94]
  • suffix array????????????????
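
The legible parts of this slide name block sorting [Burrows, Wheeler 94] together with the suffix array. One well-known link between the two is that the block-sorting transform can be read directly off a suffix array: the output is the character preceding each suffix, taken in suffix array order. The Python sketch below shows this under some assumptions: the name `bwt_from_suffix_array` is hypothetical, a sentinel "$" smaller than every text character is appended, and the suffix array is built naively.

```python
def bwt_from_suffix_array(text, sentinel="$"):
    """Return the block-sorting (Burrows-Wheeler) transform of text + sentinel."""
    # Assumption: sentinel does not occur in text and sorts before all its characters.
    s = text + sentinel
    # Naive suffix array construction; a stand-in for the algorithms in these slides.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The character before suffix i is s[i - 1]; for i == 0 this wraps to the sentinel.
    return "".join(s[i - 1] for i in sa)
```

For example, `bwt_from_suffix_array("banana")` gives "annb$aa". Equal characters tend to cluster in the output, which is what makes it compress well.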

39
??
  • ?????????????NTT???????
  • ??????????????

40
????(1/3)
  • [1] L. Arge, P. Ferragina, R. Grossi, and J. S.
    Vitter. On sorting strings in external memory. In
    ACM Symposium on Theory of Computing, pp.
    540--548, 1997.
  • [2] J. L. Bentley and R. Sedgewick. Fast algorithms
    for sorting and searching strings. In
    Proceedings of the 8th Annual ACM-SIAM Symposium
    on Discrete Algorithms, pp. 360--369, 1997.
  • [3] M. Burrows and D. J. Wheeler. A Block-sorting
    Lossless Data Compression Algorithm. Technical
    Report 124, Digital SRC Research Report, 1994.
  • [4] A. Crauser and P. Ferragina. External memory
    construction of full-text indexes. In DIMACS
    Workshop on External Memory Algorithms and/or
    Visualization, 1998.
  • [5] M. Farach. Optimal Suffix Tree Construction
    with Large Alphabets. In 38th Symp. on
    Foundations of Computer Science, pp. 137--143,
    1997.

41
????(2/3)
  • [6] P. Ferragina and R. Grossi. The String B-Tree:
    a new data structure for string search in
    external memory and its applications. Journal of
    the ACM, 1998.
  • [7] G. H. Gonnet, R. Baeza-Yates, and T. Snider. New
    Indices for Text: PAT trees and PAT arrays. In
    W. Frakes and R. Baeza-Yates, editors, Information
    Retrieval: Algorithms and Data Structures,
    chapter 5, pp. 66--82. Prentice-Hall, 1992.
  • [8] R. M. Karp, R. E. Miller, and A. L. Rosenberg.
    Rapid identification of repeated patterns in
    strings, trees and arrays. In 4th ACM Symposium
    on Theory of Computing, pp. 125--136, 1972.
  • [9] U. Manber and G. Myers. Suffix arrays: A New
    Method for On-Line String Searches. SIAM Journal
    on Computing, Vol. 22, No. 5, pp. 935--948, October
    1993.

42
????(3/3)
  • [10] E. M. McCreight. A space-economical suffix
    tree construction algorithm. Journal of the ACM,
    Vol. 23, No. 2, pp. 262--272, 1976.
  • [11] K. Sadakane and H. Imai. A Cooperative
    Distributed Text Database Management Method
    Unifying Search and Compression Based on the
    Burrows-Wheeler Transformation. In Proceedings of
    NewDB98, 1998.
  • [12] K. Sadakane and H. Imai. Constructing Suffix
    Arrays of Large Texts. In Proceedings of DEWS'98,
    1998.
  • [13] E. Ukkonen. On-line construction of suffix
    trees. Algorithmica, Vol. 14, No. 3, pp. 249--260,
    September 1995.
  • [14] P. Weiner. Linear Pattern Matching Algorithms.
    In Proceedings of the 14th IEEE Symposium on
    Switching and Automata Theory, pp. 1--11, 1973.
