Title: Compressed Permuterm Index


1
Compressed Permuterm Index
  • Paolo Ferragina
  • Dipartimento di Informatica, Università di Pisa

2
A basic problem
  • Given a dictionary D of variable-length strings,
    design a compressed data structure that
    supports
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)
  • (Compacted) Trie
  • Two versions: one for D and one for D^R (the reversed strings)
  • Intersect the two answers
  • Need to store D for resolving edge labels

3
A basic problem
  • Given a dictionary D of variable-length strings,
    compress them in a way that we can
    efficiently support
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Permuterm Index (Garfield, 1976) → reduce
any query to a prefix query over a larger
dictionary
4
Permuterm Index (Garfield, 1976)
  • Take a dictionary D = {yahoo, google}
  • Append a special char $ to the end of each
    string
  • Generate all rotations of these strings
  • yahoo$
  • ahoo$y
  • hoo$ya
  • oo$yah
  • o$yaho
  • $yahoo
  • google$
  • oogle$g
  • ogle$go
  • gle$goo
  • le$goog
  • e$googl
  • $google

  • Prefix(ya) → Prefix($ya)
  • Suffix(oo) → Prefix(oo$)
  • Substring(oo) → Prefix(oo)
  • PrefixSuffix(y,o) → Prefix(o$y)
Permuterm Dictionary (PD)
Space problems: PD stores every rotation of every string
Any query on D reduces to a prefix-query on PD
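
The reduction on this slide can be sketched in a few lines of Python. This is a minimal, uncompressed illustration and not the compressed index of the talk: the permuterm dictionary is kept as a sorted list and prefix queries use binary search. All function names are hypothetical, and `$` is assumed not to occur in the dictionary strings.

```python
from bisect import bisect_left

def build_permuterm(words, end="$"):
    """Return the sorted list of (rotation, word) pairs for every rotation of word + end."""
    entries = []
    for w in words:
        t = w + end
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], w))
    entries.sort()
    return entries

def prefix_query(entries, p):
    """Return the words that own a rotation starting with p."""
    i = bisect_left(entries, (p,))  # first rotation >= p
    found = set()
    while i < len(entries) and entries[i][0].startswith(p):
        found.add(entries[i][1])
        i += 1
    return found

# Each dictionary query becomes one prefix query on the rotations.
def prefix(entries, a):
    return prefix_query(entries, "$" + a)

def suffix(entries, b):
    return prefix_query(entries, b + "$")

def substring(entries, g):
    return prefix_query(entries, g)

def prefix_suffix(entries, a, b):
    return prefix_query(entries, b + "$" + a)

pd = build_permuterm(["yahoo", "google"])
print(prefix(pd, "ya"), suffix(pd, "oo"), substring(pd, "oo"), prefix_suffix(pd, "y", "o"))
```

The list holds one rotation per character of each string (plus one for `$`), which is exactly the space problem the slide points at.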
5
The FM-index
Ferragina-Manzini, JACM 2005
  • The result
  • Count(P): O(p) time, where p = |P|
  • Locate(P): O(occ · polylog(|T|)) time
  • Display(T[i,i+L]): O(L + polylog(|T|)) time
  • Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
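
To make the space bound concrete, the sketch below computes the k-th order empirical entropy H_k(T) that appears in it, using the standard definition: the average of the zero-order entropies of the characters following each length-k context, weighted by how often the context occurs. The function names are hypothetical and the code is only meant to illustrate the quantity, not how an FM-index is built.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zero-order empirical entropy of s, in bits per character."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values()) if n else 0.0

def hk(text, k):
    """k-th order empirical entropy of text, in bits per character."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])  # char that follows each k-gram
    return sum(len(f) * h0("".join(f)) for f in followers.values()) / len(text)

print(h0("abab"), hk("abababab", 1))
```

A text such as `abababab` has H_0 = 1 bit per character but H_1 = 0, because each character is fully determined by the one before it; this is the kind of regularity the bound |T| H_k(T) rewards.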

New concept: the FM-index is an opportunistic
data structure
The Compressed Permuterm Index builds upon the best
two features of the FM-index
6
Third ingredient: FM-index substring search
[Figure: the sorted rotations of mississippi#, with the last column L]
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i
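
A minimal sketch of the substring search this slide refers to: build the last column L by sorting all rotations, then count the occurrences of a pattern with backward search. It assumes the text ends with a unique end-marker that sorts before every other character (here `#`), and it computes rank with a plain linear scan, whereas a real FM-index would use compressed rank structures to reach O(p) counting. Function names are hypothetical.

```python
def bwt(text):
    """Last column of the sorted rotation matrix of text (text must end with a unique smallest marker)."""
    rows = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    return "".join(text[i - 1] for i in rows)  # char preceding each rotation start

def count(L, pattern):
    """Number of occurrences of pattern in the original text, via backward search on L."""
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total          # number of characters in L smaller than c
        total += L.count(c)
    first, last = 0, len(L)   # half-open range of rows prefixed by the current pattern suffix
    for c in reversed(pattern):
        if c not in C:
            return 0
        first = C[c] + L[:first].count(c)  # naive rank, O(n) per step
        last = C[c] + L[:last].count(c)
        if first >= last:
            return 0
    return last - first

L = bwt("mississippi#")
print(L, count(L, "ssi"))
```

The pattern is consumed right to left, and after each step the range `[first, last)` covers exactly the rows that start with the suffix of the pattern read so far.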
7
Compressed Permuterm Index
Z = $hat$hip$hop$hot$#
Some queries are trivial...
  • Prefix(α) → substring search($α) within Z
  • Suffix(β) → substring search(β$) within Z
  • Substr(γ) → substring search(γ) within Z
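
The three reductions can be tried out with ordinary string search standing in for the FM-index. The sketch assumes the layout Z = $s1$s2...$sm$# over the sorted dictionary and non-empty query strings without `$` or `#`; the helper names are hypothetical, and a compressed index would replace `str.find` with backward search over the BWT of Z.

```python
def make_z(words):
    """Concatenate the sorted dictionary, delimiting every string with '$' and closing with '#'."""
    return "$" + "$".join(sorted(words)) + "$#"

def occurrences(z, p):
    """Yield every start position of p in z, overlaps included."""
    i = z.find(p)
    while i != -1:
        yield i
        i = z.find(p, i + 1)

def string_at(z, pos):
    """Dictionary string covering position pos (pos must fall on a letter)."""
    start = z.rfind("$", 0, pos + 1) + 1
    end = z.find("$", pos)
    return z[start:end]

def prefix(z, a):
    return {string_at(z, i + 1) for i in occurrences(z, "$" + a)}

def suffix(z, b):
    return {string_at(z, i) for i in occurrences(z, b + "$")}

def substring(z, g):
    return {string_at(z, i) for i in occurrences(z, g)}

z = make_z(["hat", "hip", "hop", "hot"])
print(z, prefix(z, "hi"), suffix(z, "t"), substring(z, "o"))
```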
8
PrefixSuffix search
9
PrefixSuffix(ho,p)
[Figure: backward search steps for this query; labels: ho, LF, CLF]
No change in the time/space bounds of compressed
indexes
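
One way to picture a PrefixSuffix query over the BWT of Z is sketched below. The details are assumptions made for this illustration and are not taken from the slides: Z = $s1$...$sm$# over the sorted dictionary, the symbol order `$` < letters < `#`, a naive linear-time rank, and a hypothetical "jump" step (presumably what the CLF label stands for) that moves from the row starting with $s_k to the next `$` row, whose last-column character is the final character of s_k. The search matches α and `$` backward, jumps, then matches β backward.

```python
def sort_key(s):
    # Assumed symbol order for this sketch: '$' < letters < '#'.
    return [(0 if c == "$" else 2 if c == "#" else 1, c) for c in s]

def build_index(words):
    """Return (L, C, m): last column, first-row table, and number of strings."""
    words = sorted(set(words))
    z = "$" + "$".join(words) + "$#"
    rows = sorted(range(len(z)), key=lambda i: sort_key(z[i:] + z[:i]))
    L = "".join(z[i - 1] for i in rows)
    C = {}
    for r, i in enumerate(rows):
        C.setdefault(z[i], r)  # first row whose rotation starts with z[i]
    return L, C, len(words)

def step(L, C, c, first, last):
    """One backward-search step on the half-open row range [first, last)."""
    if c not in C:
        return 0, 0
    return C[c] + L[:first].count(c), C[c] + L[:last].count(c)

def prefix_suffix_count(index, alpha, beta):
    """Count the strings that start with alpha and end with beta."""
    L, C, m = index
    first, last = 0, len(L)
    for c in reversed("$" + alpha):
        first, last = step(L, C, c, first, last)
    last = min(last, m)        # rows 0..m-1 start with $s_1..$s_m; row m starts with $#
    if first < last:
        first, last = first + 1, last + 1  # jump to the rows holding the last characters
    for c in reversed(beta):
        first, last = step(L, C, c, first, last)
    return max(0, last - first)

index = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(index, "ho", "p"))
```

The jump is a constant-time shift of the row range, so the query costs the same number of backward-search steps as an ordinary pattern search of length |α| + |β| + 1, consistent with the slide's remark that the bounds do not change.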
10
Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
  • Rank(s): row of $s$
  • Select(i): backward from L[i+1]
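
Rank and Select can be sketched on the same kind of index. As before, the layout is an assumption of this illustration rather than something stated on the slide: Z = $s1$...$sm$# over the sorted dictionary, symbol order `$` < letters < `#`, 0-based ranks, and naive rank on L. Under these assumptions the row found by backward-searching $s$ is the lexicographic rank of s, and the i-th string is read right to left by following LF from row i+1 until a `$` is met. Function names are hypothetical.

```python
def sort_key(s):
    # Assumed symbol order for this sketch: '$' < letters < '#'.
    return [(0 if c == "$" else 2 if c == "#" else 1, c) for c in s]

def build_index(words):
    """Return (L, C) for Z = $s1$...$sm$# built over the sorted dictionary."""
    z = "$" + "$".join(sorted(set(words))) + "$#"
    rows = sorted(range(len(z)), key=lambda i: sort_key(z[i:] + z[:i]))
    L = "".join(z[i - 1] for i in rows)
    C = {}
    for r, i in enumerate(rows):
        C.setdefault(z[i], r)  # first row whose rotation starts with z[i]
    return L, C

def lf(L, C, row):
    """Row of the rotation that starts one text position earlier."""
    c = L[row]
    return C[c] + L[:row].count(c)

def rank(index, s):
    """0-based lexicographic rank of s in the dictionary, or None if s is absent."""
    L, C = index
    first, last = 0, len(L)
    for c in reversed("$" + s + "$"):
        if c not in C:
            return None
        first, last = C[c] + L[:first].count(c), C[c] + L[:last].count(c)
        if first >= last:
            return None
    return first

def select(index, i):
    """Return the i-th string (0-based, 0 <= i < m) by walking backward from row i + 1."""
    L, C = index
    row, chars = i + 1, []
    while L[row] != "$":
        chars.append(L[row])
        row = lf(L, C, row)
    return "".join(reversed(chars))

index = build_index(["hat", "hip", "hop", "hot"])
print(rank(index, "hop"), select(index, 3))
```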
11
A test on URLs
[Figure: "Choose your trade-off" plot; label: dict-size]
Trade-off
  • Time of 20–60 μsec/char, and space close to bzip
  • Time close to Front-Coding (4 μsec/char), but
    <50% of its space