Title: Compressed Permuterm Index
1 Compressed Permuterm Index
- Paolo Ferragina
- Dipartimento di Informatica, Università di Pisa
2 A basic problem
- Given a dictionary D of variable-length strings, design a compressed data structure that supports string ↔ id mapping and the queries:
- Prefix(a): find all s in D that are prefixed by a
- Suffix(b): find all s in D that are suffixed by b
- Substring(g): find all s in D that contain g
- PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
- (Compacted) Trie
- Two versions: one for D and one for D^R (the reversed strings)
- Intersect answers
- Need to store D for resolving edge-labels
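The baseline above can be sketched in a few lines. This is only an illustration: two sorted arrays stand in for the two compacted tries (an assumption made to keep the example short), and the names `build_baseline`, `prefix_matches` and `prefix_suffix` are hypothetical, not taken from the slides.

```python
from bisect import bisect_left


def prefix_matches(sorted_pairs, p):
    """Return the ids of all keys in sorted_pairs that start with p."""
    ids = set()
    i = bisect_left(sorted_pairs, (p,))  # first pair whose key is >= p
    while i < len(sorted_pairs) and sorted_pairs[i][0].startswith(p):
        ids.add(sorted_pairs[i][1])
        i += 1
    return ids


def build_baseline(words):
    """Sorted (string, id) pairs for D and for the reversed strings D^R."""
    fwd = sorted((w, i) for i, w in enumerate(words))
    rev = sorted((w[::-1], i) for i, w in enumerate(words))
    return fwd, rev


def prefix_suffix(index, a, b):
    """PrefixSuffix(a, b): intersect Prefix(a) on D with Prefix(reversed b) on D^R."""
    fwd, rev = index
    return prefix_matches(fwd, a) & prefix_matches(rev, b[::-1])


words = ["yahoo", "google", "yoyo"]
index = build_baseline(words)
print(sorted(words[i] for i in prefix_suffix(index, "y", "o")))
```

Both answer sets are materialized before the intersection, so the cost depends on the sizes of Prefix(a) and Suffix(b) rather than on the size of the final answer.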
3 A basic problem
- Given a dictionary D of variable-length strings, compress them so that we can efficiently support string ↔ id mapping and the queries:
- Prefix(a): find all s in D that are prefixed by a
- Suffix(b): find all s in D that are suffixed by b
- Substring(g): find all s in D that contain g
- PrefixSuffix(a,b) = Prefix(a) ∩ Suffix(b)
Permuterm Index (Garfield, 76): reduce any query to a prefix query over a larger dictionary
4 Permuterm Index (Garfield, 1976)
- Take a dictionary D = {yahoo, google}
- Append a special char $ to the end of each string
- Generate all rotations of these strings
- yahoo$
- ahoo$y
- hoo$ya
- oo$yah
- o$yaho
- $yahoo
- google$
- oogle$g
- ogle$go
- gle$goo
- le$goog
- e$googl
- $google
- Prefix(ya) → Prefix($ya)
- Suffix(oo) → Prefix(oo$)
- Substring(oo) → Prefix(oo)
- PrefixSuffix(y,o) → Prefix(o$y)
Permuterm Dictionary (PD): space problems
Any query on D reduces to a prefix-query on PD
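The reduction on this slide can be sketched as follows. This is a minimal, uncompressed illustration: it assumes the special char is `$` and that `$` does not occur in the dictionary strings, and the function names are hypothetical. The permuterm dictionary is stored as a sorted list of (rotation, original string) pairs, and every query becomes one prefix search.

```python
from bisect import bisect_left


def build_permuterm(words, end="$"):
    """Sorted list of (rotation, original string) over all rotations of s + end."""
    pd = []
    for w in words:
        t = w + end
        for i in range(len(t)):
            pd.append((t[i:] + t[:i], w))
    pd.sort()
    return pd


def prefix_query(pd, p):
    """All original strings having some rotation prefixed by p."""
    out = set()
    i = bisect_left(pd, (p,))  # first rotation >= p
    while i < len(pd) and pd[i][0].startswith(p):
        out.add(pd[i][1])
        i += 1
    return out


def prefix(pd, a):
    return prefix_query(pd, "$" + a)


def suffix(pd, b):
    return prefix_query(pd, b + "$")


def substring(pd, g):
    return prefix_query(pd, g)


def prefix_suffix(pd, a, b):
    # Only matches strings long enough to hold a and b without overlap.
    return prefix_query(pd, b + "$" + a)


pd = build_permuterm(["yahoo", "google"])
print(prefix(pd, "ya"), suffix(pd, "oo"), substring(pd, "oo"), prefix_suffix(pd, "y", "o"))
```

The list holds one entry per character of every string (plus one for `$`), which is the space blow-up that motivates the compressed version.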
5 The FM-index
Ferragina-Manzini, JACM 05
- The result
- Count(P): O(p) time
- Locate(P): O(occ · polylog(|T|)) time
- Display(T[i,i+L]): O(L + polylog(|T|)) time
- Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure
The Compressed Permuterm index builds upon the best two features of the FM-index
6 Third ingredient: FM-index substring search
Sorted rotations of mississippi# (the last column is L):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
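A minimal sketch of the substring search on this example follows. It assumes `#` as the end marker (smaller than every letter) and uses a naive `str.count` in place of the compressed rank structure a real FM-index would use, so it illustrates the backward-search logic only, not the time or space bounds. The function names are illustrative.

```python
def bwt(text, end="#"):
    """Last column L of the sorted rotations of text + end."""
    t = text + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def count(L, pattern):
    """Number of occurrences of pattern, via backward search over L."""
    # C[c] = number of characters in the text that are smaller than c.
    C = {}
    for i, ch in enumerate(sorted(L)):
        C.setdefault(ch, i)
    lo, hi = 0, len(L)  # half-open range of rows prefixed by the suffix read so far
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + L.count(c, 0, lo)  # naive rank of c in L[0:lo]
        hi = C[c] + L.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


L = bwt("mississippi")
print(L, count(L, "ssi"), count(L, "pp"), count(L, "ism"))
```

The pattern is consumed right to left, and each step shrinks the range of rows to those prefixed by one more character.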
7 Compressed Permuterm Index
Z = $hat$hip$hop$hot$#
Some queries are trivial...
- Prefix(a) = Substring search($a) within Z
- Suffix(b) = Substring search(b$) within Z
- Substr(g) = Substring search(g) within Z
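The three trivial reductions can be illustrated directly on Z. In this sketch Python's `str.find` stands in for the FM-index substring search, Z is assumed to be the sorted strings joined by `$` with a final `#`, and the helper names are hypothetical.

```python
from bisect import bisect_right


def build_z(words):
    """Z = $s1$s2...$sm$# over the sorted strings, plus the offset of each leading $."""
    words = sorted(set(words))
    z = "$" + "$".join(words) + "$#"
    starts, pos = [], 0
    for w in words:
        starts.append(pos)  # offset in z of the "$" that precedes w
        pos += len(w) + 1
    return z, words, starts


def strings_matching(index, pattern):
    """Dictionary strings in which an occurrence of pattern within Z starts."""
    z, words, starts = index
    hits = set()
    i = z.find(pattern)
    while i != -1:
        hits.add(words[bisect_right(starts, i) - 1])
        i = z.find(pattern, i + 1)
    return hits


def prefix(index, a):
    return strings_matching(index, "$" + a)


def suffix(index, b):
    return strings_matching(index, b + "$")


def substr(index, g):
    return strings_matching(index, g)


index = build_z(["hat", "hip", "hop", "hot"])
print(index[0], prefix(index, "hi"), suffix(index, "p"), substr(index, "o"))
```

PrefixSuffix is the one query that does not map to a single substring of Z, because the prefix and the suffix of a string are not adjacent in Z.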
8 PrefixSuffix search
9 PrefixSuffix(ho,p)
[Figure: backward search for p$ho over the BWT of Z, with labels ho, LF and CLF]
No change in time/space bounds of compressed
indexes
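One way to realize a PrefixSuffix search over the BWT of Z is sketched below. It is an illustration under stated assumptions, not the slides' exact procedure. The end marker is written `~` instead of `#`, so that it sorts after every letter. The dictionary characters are assumed to sort strictly between `$` and `~` (for example lowercase ASCII letters). Rank is a naive `str.count`. With these choices the rows starting with `$s_1`, ..., `$s_m` occupy positions 0..m-1, and row i+1 starts with the `$` that closes the i-th string, so shifting a range of `$` rows by one position moves the search from the start of each string to its end.

```python
SEP, END = "$", "~"  # assumed markers; END sorts after all letters


def build_index(words):
    """BWT column L, the C array and m for Z = $s1$s2...$sm$~ (strings sorted)."""
    words = sorted(set(words))
    z = SEP + SEP.join(words) + SEP + END
    order = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    last = "".join(z[i - 1] for i in order)  # char cyclically preceding each row
    C = {}
    for i, ch in enumerate(sorted(z)):
        C.setdefault(ch, i)  # C[c] = number of chars smaller than c
    return last, C, len(words)


def prefix_suffix_count(index, a, b):
    """Count the strings prefixed by a and suffixed by b by searching b$a backward."""
    last, C, m = index
    if not a and not b:
        return m
    lo, hi = 0, len(last)  # half-open range of rows
    for c in reversed(b + SEP + a):
        if c not in C:
            return 0
        # Jump: a row "$s_i..." is redirected to the row of the "$" closing s_i.
        if lo < m:
            lo += 1
        if hi <= m:
            hi += 1
        lo = C[c] + last.count(c, 0, lo)  # standard LF step with naive rank
        hi = C[c] + last.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(index, "ho", "p"), prefix_suffix_count(index, "h", "t"))
```

The jump only changes ranges that lie among the `$` rows, so Prefix, Suffix and Substring queries run through the same loop unchanged, and each step still costs two rank operations.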
10 Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
- Rank(s) = row of $s$
- Select(i) = backward from L[i+1]
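Rank and Select of strings can be sketched on the same kind of index. The assumptions are those of a toy setting: the end marker is written `~` so that it sorts after every letter, dictionary characters sort strictly between `$` and `~`, ranks are 0-based, and rank on L is a naive `str.count`. The function names are illustrative.

```python
SEP, END = "$", "~"  # assumed markers; END sorts after all letters


def build_index(words):
    """BWT column L and C array for Z = $s1$s2...$sm$~ (strings sorted)."""
    words = sorted(set(words))
    z = SEP + SEP.join(words) + SEP + END
    order = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    last = "".join(z[i - 1] for i in order)
    C = {}
    for i, ch in enumerate(sorted(z)):
        C.setdefault(ch, i)
    return last, C


def lf(index, c, i):
    """Position reached from boundary i when prepending c (naive rank)."""
    last, C = index
    return C[c] + last.count(c, 0, i)


def rank_string(index, s):
    """0-based lexicographic rank of s in D (the row of $s$), or None if absent."""
    last, C = index
    lo, hi = 0, len(last)
    for c in reversed(SEP + s + SEP):
        if c not in C:
            return None
        lo, hi = lf(index, c, lo), lf(index, c, hi)
        if lo >= hi:
            return None
    return lo


def select_string(index, i):
    """The i-th string (0-based), decoded backward from row i + 1."""
    last, _ = index
    row, chars = i + 1, []
    while last[row] != SEP:
        chars.append(last[row])
        row = lf(index, last[row], row)
    return "".join(reversed(chars))


index = build_index(["hat", "hip", "hop", "hot"])
print(rank_string(index, "hop"), select_string(index, 2))
```

Select walks backward one character at a time until it meets the `$` that opens the string, so its cost is proportional to the length of the decoded string.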
11 A test on URLs
Choose your trade-off
[Figure: trade-off plot; axis label: dict-size]
Trade-off
- Time of 20-60 msec/char, and space close to bzip
- Time close to Front-Coding (4 msec/char), but <50% of its space