XML Compression and Indexing - PowerPoint PPT Presentation

About This Presentation
Title:

XML Compression and Indexing

Description:

The Future of Web Search Barcelona, May 2006 XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Universit di Pisa [Joint with F. Luccio, G ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: XML Compression and Indexing


1
XML Compression and Indexing
The Future of Web Search Barcelona, May 2006
  • Paolo Ferragina
  • Dipartimento di Informatica, Università di Pisa
  • Joint with F. Luccio, G. Manzini, S.
    Muthukrishnan

Under patenting by Pisa-Rutgers Univ.
2
Compressed Permuterm Index
  • Paolo Ferragina, Rossano Venturini
  • Dipartimento di Informatica, Università di Pisa

Under Y!-patenting
3
A basic problem
  • Given a dictionary D of strings, having variable
    length, design a compressed data structure that
    supports
  • string ↔ id
  • Prefix(α): find all strings in D that are
    prefixed by α
  • Suffix(β): find all strings in D that are
    suffixed by β
  • Substring(γ): find all strings in D that contain
    γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

IR book of Manning-Raghavan-Schütze → Tolerant Retrieval Problem (wildcards)
  • Prefix(α) = α*
  • Suffix(β) = *β
  • Substring(γ) = *γ*
  • PrefixSuffix(α,β) = α*β
4
A basic problem
  • Given a dictionary D of strings, having variable
    length, design a compressed data structure that
    supports
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Hashing → cannot handle non-exact searches
5
A basic problem
  • Given a dictionary D of strings, having variable
    length, design a compressed data structure that
    supports
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

(Compacted) Trie
  • Two versions, for D and for D^R; intersect the answers
  • No substring search (unless using a Suffix Trie)
  • Need to store D for resolving edge-labels
6
A basic problem
  • Given a dictionary D of strings, having variable
    length, design a compressed data structure that
    supports
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Front coding...
7
Front-coding
uk-2002 crawl: 250 Mb
bzip: 10%
Be back on this, later on!
  • Two versions, for D and for D^R; intersect the
    answers
  • Need some extra data structures for bucket
    identification
  • No substring search
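
As a minimal sketch of the front-coding idea, assuming the plain textbook scheme (each string in sorted order is stored as the length of the prefix it shares with its predecessor plus the remaining suffix), the following Python code encodes and decodes a small dictionary. The function names are hypothetical, and the bucketing mentioned above is left out for brevity.

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_words):
    """Store each word as (prefix length shared with predecessor, rest)."""
    out, prev = [], ""
    for w in sorted_words:
        k = lcp_len(prev, w)
        out.append((k, w[k:]))
        prev = w
    return out


def front_decode(pairs):
    """Rebuild the words by reusing the first k chars of the previous word."""
    words, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        words.append(prev)
    return words


words = sorted(["hot", "hat", "hop", "hip"])
encoded = front_encode(words)
print(encoded)  # [(0, 'hat'), (1, 'ip'), (1, 'op'), (2, 't')]
print(front_decode(encoded) == words)  # True
```

Decoding is sequential, which is why a practical layout would restart the encoding at bucket boundaries and keep an extra structure to find the right bucket.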

8
A basic problem
  • Given a dictionary D of strings, having variable
    length, compress them in a way that we can
    efficiently support
  • string ↔ id
  • Prefix(α): find all s in D that are prefixed by α
  • Suffix(β): find all s in D that are suffixed by β
  • Substring(γ): find all s in D that contain γ
  • PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)

Permuterm Index (Garfield, 76) → Reduce any query to a prefix query over a larger dictionary
9
Permuterm Index [Garfield, 1976]
  • Take a dictionary D = {yahoo, google}
  • Append a special char $ to the end of each
    string
  • Generate all rotations of these strings

yahoo$     google$
ahoo$y     oogle$g
hoo$ya     ogle$go
oo$yah     gle$goo
o$yaho     le$goog
$yahoo     e$googl
           $google

  • Prefix(ya) = Prefix($ya)
  • Suffix(oo) = Prefix(oo$)
  • Substring(oo) = Prefix(oo)
  • PrefixSuffix(y,o) = Prefix(o$y)
Permuterm Dictionary
Space problems
Any query on D reduces to a prefix-query on PD
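
The reduction can be sketched in a few lines of Python. This is an uncompressed toy version, assuming `$` as the end marker and a plain sorted list as the permuterm dictionary PD; the helper names are hypothetical and it makes no attempt to save space.

```python
from bisect import bisect_left

END = "$"  # assumed end-of-string marker; must not occur in any term


def build_permuterm(words):
    """Return the sorted list of (rotation, original word) pairs."""
    entries = []
    for w in words:
        t = w + END
        for i in range(len(t)):
            entries.append((t[i:] + t[:i], w))
    entries.sort()
    return entries


def prefix_lookup(entries, p):
    """All words having at least one rotation that starts with p."""
    i = bisect_left(entries, (p,))
    found = set()
    while i < len(entries) and entries[i][0].startswith(p):
        found.add(entries[i][1])
        i += 1
    return found


def prefix(entries, a):
    return prefix_lookup(entries, END + a)


def suffix(entries, b):
    return prefix_lookup(entries, b + END)


def substring(entries, g):
    return prefix_lookup(entries, g)


def prefix_suffix(entries, a, b):
    return prefix_lookup(entries, b + END + a)


pd = build_permuterm(["yahoo", "google"])
print(prefix(pd, "ya"))              # {'yahoo'}
print(suffix(pd, "oo"))              # {'yahoo'}
print(sorted(substring(pd, "oo")))   # ['google', 'yahoo']
print(prefix_suffix(pd, "y", "o"))   # {'yahoo'}
```

Every term of length n contributes n+1 rotations, which is the space problem noted above.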
10
Compressed Permuterm Index
SIGIR 07
  • It deploys two ingredients
      • Permuterm index
      • Compressed full-text index
  • Theoretically
      • Query ops take optimal time, proportional to
        pattern length
      • Space occupancy is |D| H_k(D) + o(|D| log |Σ|)
        bits
  • Technically
      • A simple reduction step: Permuterm → Compressed
        index
      • Re-use known machinery on compressed indexes
      • Achieve bzip-compression at Front-coding speed

11
The Burrows-Wheeler Transform (1994)
Take the text T = mississippi#
All rotations of T (once sorted, F is the first column and L the last column):
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
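
A minimal Python sketch of the transform, assuming `#` as an end marker that sorts before every letter. It sorts all rotations explicitly, which is fine for illustration but far from how a real implementation would build the BWT.

```python
def bwt(text: str) -> str:
    """Last column L of the sorted rotation matrix of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


T = "mississippi#"
F = "".join(sorted(T))  # first column: the symbols of T in sorted order
L = bwt(T)
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```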
12
Compressing L is effective
  • Key observation
  • L is locally homogeneous
  • Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in
    (de)compression!

13
The FM-index
[Ferragina-Manzini, JACM 05]
The survey of Navarro-Mäkinen contains many other
indexes
  • The result
  • Count(P): O(p) time
  • Locate(P): O(occ · polylog(|T|)) time
  • Display(T[i, i+L]): O(L + polylog(|T|)) time
  • Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits

New concept: the FM-index is an opportunistic
data structure
The Compressed Permuterm index builds upon the best
two features of the FM-index
14
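First ingredient: L → F mapping
[Figure: sorted rotation matrix of mississippi#, columns F and L, middle part unknown. Rows shown (F, middle, L):]
# mississipp i
i #mississip p
i ppi#missis s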
15
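First ingredient: L → F mapping
[Figure: sorted rotation matrix of mississippi#, columns F and L, middle part unknown. Rows shown (F, middle, L):]
# mississipp i
i #mississip p
i ppi#missis s
The FM-index is actually a Rank data structure over the BWT:
O(1) time and H_k space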
16
Second ingredient: Backward step
[Figure: sorted rotation matrix of mississippi#, columns F and L, middle part unknown. Rows shown (F, middle, L):]
# mississipp i
i #mississip p
i ppi#missis s
T is scanned backward by using the LF-mapping
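
The backward scan can be sketched as follows. This assumes the BWT was built on a text ending with a unique marker `#` that sorts before all other symbols, so that row 0 of the sorted matrix starts with the marker; `lf_mapping` and `inverse_bwt` are hypothetical names, and the rank information is computed by a plain scan rather than by a compressed Rank structure.

```python
def lf_mapping(last: str):
    """LF[i] = row whose first symbol is the very occurrence stored at L[i]."""
    C = {}
    for pos, ch in enumerate(sorted(last)):
        C.setdefault(ch, pos)  # C[ch] = first row starting with ch
    seen, lf = {}, []
    for ch in last:
        k = seen.get(ch, 0)    # occurrences of ch in L before this one
        lf.append(C[ch] + k)
        seen[ch] = k + 1
    return lf


def inverse_bwt(last: str, end: str = "#") -> str:
    """Recover T by walking backward from the row that starts with end."""
    lf = lf_mapping(last)
    row, out = 0, []
    for _ in range(len(last) - 1):
        out.append(last[row])  # symbol preceding the current position in T
        row = lf[row]
    return "".join(reversed(out)) + end


text = inverse_bwt("ipssm#pissii")
print(text)  # mississippi#
```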
17
Third ingredient: substring search
Sorted rotations of T = mississippi# (the last symbol of each row gives L):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
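
Counting the occurrences of a pattern by backward search can be sketched like this. It is an illustrative version under simplifying assumptions: `rank` is a linear scan over L (a real FM-index answers it in O(1) over compressed space), `#` is the assumed end marker, and the function names are hypothetical.

```python
def bwt(text: str) -> str:
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def build_index(text: str):
    last = bwt(text)
    C = {}
    for pos, ch in enumerate(sorted(text)):
        C.setdefault(ch, pos)  # first row of the sorted matrix starting with ch
    return last, C


def rank(last: str, ch: str, i: int) -> int:
    """Number of occurrences of ch in last[0:i]."""
    return last.count(ch, 0, i)


def count(index, pattern: str) -> int:
    """Rows prefixed by pattern, found by scanning pattern right to left."""
    last, C = index
    lo, hi = 0, len(last)  # half-open row range [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + rank(last, ch, lo)
        hi = C[ch] + rank(last, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


fm = build_index("mississippi#")
print(count(fm, "ssi"))   # 2
print(count(fm, "issi"))  # 2
print(count(fm, "pp"))    # 1
print(count(fm, "sm"))    # 0
```

Each pattern symbol costs two rank calls, which is where the O(p) bound for Count(P) comes from once rank takes constant time.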
18
The Compressed Permuterm
Z = $hat$hip$hop$hot$#
Some queries are trivial...
  • Prefix(α) = Substring search($α) within Z
  • Suffix(β) = Substring search(β$) within Z
  • Substr(γ) = Substring search(γ) within Z
19
PrefixSuffix search
20
PrefixSuffix(ho,p)
[Figure: backward search for ho on the BWT matrix, with steps labelled LF and CLF]
No change in time/space bounds of compressed
indexes
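
One way to answer PrefixSuffix with a single backward search is to modify the LF step so that, on reaching the start of a dictionary string, the search jumps to the end of that same string. The Python sketch below illustrates that idea under stated assumptions: the transcript does not define CLF, so the `jump_to_end` rule, the markers `$` and `~`, and all function names are assumptions of this example, and rank is a plain scan. It is not the authors' implementation.

```python
SEP, END = "$", "~"  # assumed markers: SEP sorts before letters, END after them


def build(words):
    ws = sorted(set(words))
    z = SEP + SEP.join(ws) + SEP + END           # e.g. $hat$hip$hop$hot$~
    order = sorted(range(len(z)), key=lambda i: z[i:] + z[:i])
    last = "".join(z[i - 1] for i in order)      # BWT of z (cyclic rotations)
    C = {}
    for pos, ch in enumerate(sorted(z)):
        C.setdefault(ch, pos)
    return last, C, len(ws)


def jump_to_end(row, m):
    # Rows 0..m-1 start with SEP followed by the 1st..m-th string in sorted
    # order; the L symbol of the next row is the last symbol of that string.
    return row + 1 if row < m else row


def count_cyclic(index, pattern):
    last, C, m = index
    lo, hi = 0, len(last)                        # half-open row range
    for step, ch in enumerate(reversed(pattern)):
        if ch not in C:
            return 0
        first, final = lo, hi - 1
        if step > 0:                             # no jump on the initial range
            first, final = jump_to_end(first, m), jump_to_end(final, m)
        lo = C[ch] + last.count(ch, 0, first)
        hi = C[ch] + last.count(ch, 0, final + 1)
        if lo >= hi:
            return 0
    return hi - lo


def prefix_suffix(index, a, b):
    """Number of strings prefixed by a and suffixed by b."""
    return count_cyclic(index, b + SEP + a)


idx = build(["hot", "hat", "hop", "hip"])
print(prefix_suffix(idx, "ho", "p"))  # 1 (hop)
print(prefix_suffix(idx, "h", "p"))   # 2 (hip, hop)
```

The jump only renumbers the two range boundaries, so each step still costs two rank operations, in line with the claim that the bounds of the underlying compressed index do not change.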
21
Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries...
  • Rank(s) → row of s
  • Select(i) → backward from L[i+1]
22
Experiments
  • Three dictionaries
  • Term dictionary Trec WT10G
  • Host dictionary (reversed) UK-2005
  • Url dictionary (host reversed) first 190Mb of
    UK-2005

           Term      Host      Url
size       118 Mb    34 Mb     190 Mb
strings    10 Mil    2 Mil     3 Mil
FC         40%       45%       30%
bzip       33%       25%       10%
PrefixSuffix search needs 2
24
A test on URLs
Choose your trade-off
The MRS book says: "one disadvantage of the PI is
that its dictionary becomes quite large,
including as it does all rotations of each term."
Now, they mention CPI
Trade-off
  • Time of 20-60 μsec/char, and space close to bzip
  • Time close to Front-Coding (4 μsec/char), but
    <50% of its space

25
We proposed an approach for dictionary storage
  • Theory: optimal time and entropy-bounds for space
  • Practice: trades time vs space, thus fitting
    user needs