Compressing and Indexing Strings and Labeled Trees



1
Compressing and Indexing Strings and (labeled)
Trees
  • Paolo Ferragina
  • Dipartimento di Informatica, Università di Pisa

2
Two types of data
  • String: a raw sequence of symbols from an
    alphabet Σ
  • Texts
  • DNA sequences
  • Executables
  • Audio files
  • ...
  • Labeled tree: a tree of arbitrary shape and depth
    whose nodes are labeled with strings drawn from
    an alphabet Σ
  • XML files
  • Parse trees
  • Tries and Suffix Trees
  • Compiler intermediate representations
  • Execution traces
  • ...

3
What do we mean by Indexing ?
  • Word-based indexes, here a notion of word must
    be devised !
  • Inverted files, Signature files, Bitmaps.
  • Full-text indexes, no constraint on text and
    queries !
  • Suffix Array, Suffix tree, ...
  • Path indexes that also support navigational
    operations !
  • see next...

Subset of XPath [W3C]
4
What do we mean by Compression ?
  • Data compression has two positive effects
  • Space saving (or, enlarge memory at the same
    cost)
  • Performance improvement
  • Better use of memory levels closer to CPU
  • Increased network, disk and memory bandwidth
  • Reduced (mechanical) seek time

8
Study the interplay of Compression and Indexing
  • Do we witness a paradoxical situation ?
  • An index injects redundant data, in order to
    speed up the pattern searches
  • Compression removes redundancy, in order to
    squeeze the space occupancy
  • NO, new results proved a mutual reinforcement
    behaviour !
  • Better indexes can be designed by exploiting
    compression techniques
  • Better compressors can be designed by exploiting
    indexing techniques
  • More surprisingly, strings and labeled trees are
    closer than expected !
  • Labeled-tree compression can be reduced to string
    compression
  • Labeled-tree indexing can be reduced to special
    string indexing problems

9
Our journey over string data
  • Index design (Weiner 73)
  • Compressor design (Shannon 48)
  • Burrows-Wheeler Transform (1994)
  • Suffix Array (87 and 90)
  • Wavelet Tree [Grossi-Gupta-Vitter, Soda 03]
  • Improved indexes and compressors for strings
    [Ferragina-Manzini-Makinen-Navarro, 04]
  • And many other papers of many other authors...
10
The Suffix Array [Baeza-Yates and Gonnet, 87;
Manber and Myers, 90]
T = mississippi#
P = si
  • The suffix permutation cannot be any permutation of {1, ..., N}
  • Number of binary texts = 2^N ≪ N! = number of permutations on
    {1, 2, ..., N}
  • Ω(N) bits is the worst-case lower bound
  • Ω(N H(T)) bits for compressible texts?

Several papers characterize the SA's
permutation: [Duval et al., 02] [Bannai et al., 03]
[Munro et al., 05] [Stoye et al., 05]
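
As a concrete illustration of the suffix array of this slide, here is a minimal sketch in Python. It assumes an end marker `#` that is smaller than every letter, and the function name `suffix_array` is ours; the naive sort is meant for small examples only, not as the construction used by the cited papers.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes of `text`,
    listed in lexicographic order of the suffixes.

    Naive construction (sorts the suffixes directly): fine for a slide-sized
    example, far too slow for real inputs.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "mississippi#"  # '#' is an assumed end marker, smaller than every letter
SA = suffix_array(T)
for pos in SA:
    print(pos, T[pos - 1:])
```

Each printed line pairs a suffix-array entry with the suffix it points to, so the sorted order can be checked by eye.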
11
Can we compress the Suffix Array ?
[Ferragina-Manzini, Focs 00]
[Ferragina-Manzini, JACM 05]
  • The FM-index is a data structure that mixes the
    best of
  • the Suffix Array data structure
  • the Burrows-Wheeler Transform
  • The theoretical result
  • Query complexity: O(p + occ log^ε N) time
  • Space occupancy: O(N H_k(T)) + o(N) bits,
    which is o(N) if T is compressible
  • The corollary is that
  • The Suffix Array is compressible
  • It is a self-index

The index does not depend on k: the bound holds for all
k, simultaneously.
New concept: the FM-index is an opportunistic
data structure that takes advantage of
repetitiveness in the input data to achieve
compressed space occupancy, and still efficient
query performance.
12
The Burrows-Wheeler Transform (1994)
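Let us be given a text T = mississippi#. Its rotations are:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sorting the rotations gives the first column F and the last column L.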
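
The transform of this slide can be sketched in a few lines of Python. This is a didactic version that sorts all rotations explicitly, assuming `#` as an end marker smaller than every letter; the function name `bwt` is ours, and practical tools build the transform from a suffix array instead.

```python
def bwt(text):
    """Return the columns (F, L) of the sorted rotation matrix of `text`.

    `text` is assumed to end with a unique end marker that is smaller
    than every other symbol.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


F, L = bwt("mississippi#")
print("F =", F)
print("L =", L)
```

Column L is the output of the transform; column F is just the symbols of the text in sorted order.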
13
Why is L so interesting for compression?
F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
  • A key observation
  • L is locally homogeneous
  • Bzip vs. Gzip: 20% vs. 33% compression ratio!
    Some theory behind it: [Manzini, JACM 01]

Building the BWT ⇒ SA construction. Inverting the
BWT ⇒ array visit. Overall O(N) time, but
slower than gzip.
14
L is helpful for full-text searching ?
F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i
T = mississippi#
15
A useful tool: the L → F mapping
F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
To implement the LF-mapping we need an
oracle occ(c, j) = rank of char c in L[1, j]
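
To make the LF-mapping concrete, the sketch below implements it with a deliberately naive `occ` (a linear scan) and uses it to invert the BWT. It assumes the end marker `#` sorts first, so row 0 of the sorted matrix starts with it; the helper names `build_C`, `occ`, `lf` and `inverse_bwt` are ours.

```python
from collections import Counter


def build_C(L):
    """C[c] = number of symbols in L that are strictly smaller than c."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


def occ(L, c, j):
    """Rank of c in the prefix of L of length j (naive linear scan)."""
    return L[:j].count(c)


def lf(L, C, i):
    """Map row i (0-based) of column L to the row of column F that holds
    the same character of the text."""
    c = L[i]
    return C[c] + occ(L, c, i + 1) - 1


def inverse_bwt(L):
    """Rebuild the text from L by walking the LF-mapping backwards."""
    C = build_C(L)
    i, out = 0, []          # row 0 starts with the end marker
    for _ in range(len(L)):
        out.append(L[i])    # L[i] is the text symbol preceding row i
        i = lf(L, C, i)
    s = "".join(reversed(out))  # end marker first, then the text
    return s[1:] + s[0]


print(inverse_bwt("ipssm#pissii"))
```

A real index replaces the linear-scan `occ` with a succinct rank structure, which is where the wavelet tree of the later slides comes in.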
16
Substring search in T (Count the pattern
occurrences)
  • Find the first c in L[fr, lr]
  • Find the last c in L[fr, lr]
  • L-to-F mapping of these chars

The Occ() oracle is enough (i.e. Rank/Select
primitives over L)
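
The counting procedure above (backward search) can be sketched as follows. The column L is hard-coded for T = mississippi# with an assumed end marker `#`; `occ` is a naive scan standing in for a Rank structure, and the half-open row range `[fr, lr)` is our own convention.

```python
from collections import Counter

L = "ipssm#pissii"  # BWT of "mississippi#" ('#' = assumed end marker)


def build_C(L):
    """C[c] = number of symbols in L that are strictly smaller than c."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


C = build_C(L)


def occ(c, j):
    """Number of occurrences of c among the first j symbols of L."""
    return L[:j].count(c)


def count(P):
    """Number of occurrences of pattern P in the text, by backward search."""
    fr, lr = 0, len(L)  # rows [fr, lr) are prefixed by the processed suffix of P
    for c in reversed(P):
        if c not in C:
            return 0
        # LF-map the first and last c of L[fr, lr) in one shot
        fr = C[c] + occ(c, fr)
        lr = C[c] + occ(c, lr)
        if fr >= lr:
            return 0
    return lr - fr


print(count("si"), count("ssi"), count("pps"))
```

Each pattern symbol costs two `occ` calls, so with a constant-time oracle the whole count takes O(p) time.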
17
Many details are missing...
  • What about a large Σ?
  • Wavelet Tree and variations [Grossi et al., Soda
    03] [F.M.-Makinen-Navarro, Spire 04]
  • New approaches to Rank/Select primitives [Munro
    et al., Soda 06]
  • Efficient and succinct index construction [Hon
    et al., Focs 03]
  • In practice, lightweight algorithms: (5+ε)N bytes
    of space
  • see [Manzini-Ferragina, Algorithmica 04]

18
Five years of history...
FM-index [Ferragina-Manzini, Focs 00]
  • Space: 5 N H_k(T) + o(N) bits, for any k
  • Search: O(p + occ log^ε N)
Compact Suffix Array [Grossi-Vitter, Stoc 00]
  • Space: Θ(N) bits + text
  • Search: O(p + polylog(N) + occ log^ε N)
  • o(p) time with Patricia Tree, O(occ) for short P
Look at the survey by Gonzalo Navarro and Veli
Makinen
  • Wavelet Tree, WT variant
  • q-gram index [Kärkkäinen-Ukkonen, 96]
  • Succinct Suffix Tree: N log N + Θ(N) bits
    [Munro et al., 97ss]
  • LZ-index: Θ(N) bits and fast occ retrieval
    [Navarro, 03]
  • Variations over CSA and FM-index [Navarro, Makinen]
19
What's next?
20
What about their practicality ?
December 2003
January 2005
22
Is this a technological breakthrough ?
24
Where we are...
Labeled Trees ?
Data type
Indexing
Compressed Indexing
25
Why do we care about labeled trees?
26
An XML excerpt
<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>
27
A tree interpretation...
  • XML document exploration = tree navigation
  • XML document search = labeled subpath
    searches

Subset of XPath [W3C]
28
Our problem
  • Consider a rooted, ordered, static tree T of
    arbitrary shape, whose t nodes are labeled with
    symbols from an alphabet Σ.
  • We wish to devise a succinct representation for T
    that efficiently supports some operations over
    T's structure:
  • Navigational operations: parent(u), child(u, i),
    child(u, i, c)
  • Subpath searches over a sequence of k labels
  • Seminal work by Jacobson [Focs 90] dealt with
    binary unlabeled trees, achieving O(1) time per
    navigational operation and 2t + o(t) bits.
  • Munro-Raman [Focs 97], then many others,
    extended this to unlabeled trees of arbitrary degree
    and a richer set of navigational ops: subtree
    size, ancestor, ...
  • Geary et al. [Soda 04] were the first to deal
    with labeled trees and navigational operations,
    but the space is Θ(t |Σ|) bits.

Yet, subpath searches are unexplored
29
Our journey over labeled trees [Ferragina et
al., Focs 05]
  • We propose the XBW-transform, which mimics on trees
    the nice structural properties of the BW-transform
    on strings.
  • The XBW-transform linearizes the tree T in such a
    way that
  • the indexing of T reduces to implementing simple
    rank/select operations over a string of symbols
    from Σ.
  • the compression of T reduces to using any k-th
    order entropy compressor (gzip, bzip, ...) over a
    string of symbols from Σ.

30
The XBW-Transform
S_α:  C  B  D   c    a    c   A  b   a   D   c    B  D   b    a
S_π:  ε  C  BC  DBC  DBC  BC  C  AC  AC  AC  DAC  C  BC  DBC  BC
Step 1. Visit the tree in pre-order. For each
node, write down its label and the labels on its
upward path
31
The XBW-Transform
S_α:  C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:  ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
Step 2. Stably sort according to S_π
32
The XBW-Transform
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
Step 3. Add a binary array S_last marking the rows
corresponding to last children
XBW can be built and inverted in optimal O(t)
time
Key facts: nodes correspond to items in
<S_last, S_α>. Node numbering has useful properties
for compression and indexing
XBW takes optimal t log |Σ| + 2t bits
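
The three steps can be reproduced with a short Python sketch. The tree below is the one implied by the S_α and S_π columns of these slides; the nested-tuple encoding and the function name `xbw` are our own illustrative choices, and this simple sort-based version does not claim the optimal O(t) time mentioned above.

```python
def xbw(tree):
    """Return (S_last, S_alpha, S_pi) for a tree given as (label, [children]).

    S_pi holds, for each node, the labels on its upward path
    (parent first, root last), written as a string.
    """
    rows = []  # one (upward path, label, is_last_child) triple per node

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, label, is_last))          # Step 1: pre-order visit
        for idx, child in enumerate(children):
            visit(child, label + path, idx == len(children) - 1)

    visit(tree, "", True)
    rows.sort(key=lambda r: r[0])                    # Step 2: stable sort on S_pi
    s_last = [1 if r[2] else 0 for r in rows]        # Step 3: mark last children
    s_alpha = [r[1] for r in rows]
    s_pi = [r[0] for r in rows]
    return s_last, s_alpha, s_pi


leaf = lambda x: (x, [])
TREE = ("C", [
    ("B", [("D", [leaf("c"), leaf("a")]), leaf("c")]),
    ("A", [leaf("b"), leaf("a"), ("D", [leaf("c")])]),
    ("B", [("D", [leaf("b")]), leaf("a")]),
])

s_last, s_alpha, s_pi = xbw(TREE)
print(" ".join(map(str, s_last)))
print(" ".join(s_alpha))
print(" ".join(p or "ε" for p in s_pi))
```

Python's `list.sort` is stable, which is what keeps nodes with the same upward path in pre-order, as Step 2 requires.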
33
The XBW-Transform is highly compressible
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
  • XBW is highly compressible
  • S_α is locally homogeneous (like BWT for strings)
  • S_last has some structure (because of T's
    structure)

34
XML compression: XBW + PPMdi!
String compressors are not so bad !?!
35
Structural properties of XBW
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
  • Properties
  • Relative order among nodes having the same leading
    path reflects the pre-order visit of T
  • Children are contiguous in XBW (delimited by 1s)
  • Children reflect the order of their parents

36
The XBW is searchable
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
SS:      0  1   0   0   1   0   0   0   1  0  0  1    0    0    0
(the 1s in SS mark the first row of the A, B, C and D blocks of S_π)
  • XBW indexing: reduction to string indexing
  • Store succinct and efficient Rank and Select
    data structures over these three arrays

37
Subpath search in XBW
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
SS:      0  1   0   0   1   0   0   0   1  0  0  1    0    0    0
P = B D
Their children have upward path D B
  • Inductive step
  • Pick the next char in P[i+1], i.e. D
  • Search for the first and last D in S_α[fr, lr]
  • ⇒ Jump to their children

38
Subpath search in XBW
S_last:  1  0   0   1   0   1   0   1   0  0  1  1    0    1    1
S_α:     C  b   a   D   D   c   D   a   B  A  B  c    c    a    b
S_π:     ε  AC  AC  AC  BC  BC  BC  BC  C  C  C  DAC  DBC  DBC  DBC
SS:      0  1   0   0   1   0   0   0   1  0  0  1    0    0    0
P = B D
Look at S_last to find the 2nd and 3rd group of
children
Their children have upward path D B
  • Inductive step
  • Pick the next char in P[i+1], i.e. D
  • Search for the first and last D in S_α[fr, lr]
  • ⇒ Jump to their children

Two occurrences because of two 1s
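
The subpath search of these two slides can be sketched with naive rank/select over the arrays S_last and S_α. Two assumptions are ours: internal nodes carry uppercase labels and leaves lowercase ones (true for the slide's tree), and block boundaries are derived from S_last rather than from the extra bit array shown on the slide. The helper names are hypothetical.

```python
S_LAST = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S_ALPHA = list("CbaDDcDaBABccab")


def rank(seq, sym, j):
    """Occurrences of sym among the first j items of seq."""
    return seq[:j].count(sym)


def select(seq, sym, k):
    """0-based position of the k-th (k >= 1) occurrence of sym in seq."""
    seen = 0
    for pos, x in enumerate(seq):
        if x == sym:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def children_range(c, z1, z2):
    """Rows [first, last] holding the children of the z1-th .. z2-th
    nodes labeled c (1-based ranks within S_ALPHA)."""
    # groups of children preceding c's block: the root row plus one group
    # per internal node whose label is smaller than c
    before = 1 + sum(1 for x in S_ALPHA if x.isupper() and x < c)
    first = select(S_LAST, 1, before + z1 - 1) + 1
    last = select(S_LAST, 1, before + z2)
    return first, last


def subpath_search(P):
    """Count the occurrences of the downward label path P (P[0] topmost)."""
    prev = P[0]
    z1, z2 = 1, S_ALPHA.count(prev)   # all nodes labeled P[0]
    for c in P[1:]:
        if z1 > z2 or not prev.isupper():
            return 0                  # no match so far, or a leaf cannot be extended
        fr, lr = children_range(prev, z1, z2)   # jump to their children
        z1 = rank(S_ALPHA, c, fr) + 1           # first c in S_ALPHA[fr, lr]
        z2 = rank(S_ALPHA, c, lr + 1)           # last c in S_ALPHA[fr, lr]
        prev = c
    return max(0, z2 - z1 + 1)


print(subpath_search("BD"))
```

For P = B D the search ends on the 2nd and 3rd D of S_α, which matches the two groups of children found on the slide.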
39
XML Compressed Indexing
What about XPress and XGrind? XPress ≈ 30% (dblp
50%), XGrind ≈ 50%; no software running
40
In summary [Ferragina et al., Focs 05]
  • The XBW-transform takes optimal space 2t + t log
    |Σ|, and can be computed in optimal linear time.
  • We can compress and index the XBW-transform so
    that
  • its space occupancy is the optimal t H_0(T) + 2t +
    o(t) bits
  • navigational operations take O(log |Σ|) time
  • subpath searches take O(p log |Σ|) time

If |Σ| = polylog(t), no log |Σ| factor (log log |Σ|
for general Σ [Munro et al., Soda 06])
New bread for Rank/Select people!!
  • It is possible to extend these ideas to other
    XPath queries, like
  • //path[text()="substring"]
  • //path1//path2
  • ...

41
The overall picture on Compressed Indexing...
Data type
Indexing
Kosaraju, Focs 89
Strong connection
Compressed Indexing
42
Mutual reinforcement relationship...
We investigated the reinforcement relation:
Compression ideas ⇒ Index design
Let's now turn to the other direction:
Indexing ideas ⇒ Compressor design (the Booster)
43
Compression Boosting for strings [Ferragina et
al., J.ACM 2005]
  • Qualitatively, the booster offers various
    properties
  • The more compressible s is, the shorter c' is
    w.r.t. c (c' and c being the outputs of the boosted
    compressor and of A alone)
  • It deploys compressor A as a black box, hence no
    change to A's structure is needed
  • No loss in time efficiency, actually it is
    optimal
  • Its performance holds for any string s, and it
    performs better than Gzip and Bzip
  • It is fully combinatorial, hence it does not
    require any parameter estimation

44
An interesting compression paradigm
PPC paradigm (Permutation, Partition, Compression)
  • Problem 1. Fix a permutation P. Find a
    partitioning strategy and a compressor that
    minimize the number of compressed bits.
  • If P = Id, this is classic data compression!
  • Problem 2. Fix a compressor C. Find a permutation
    P and a partitioning strategy that minimize the
    number of compressed bits.
  • Taking P = Id, PPC cannot be worse than compressor
    C alone.
  • Our booster showed that a good P can make PPC
    far better.
  • Other contexts: Tables [AT&T people], Graphs
    [Boldi-Vigna, WWW 04]

Theory is missing, here!
45
Compression of labeled trees [Ferragina et al.,
Focs 05]
Extend the definition of H_k to labeled trees by
taking as the k-context of a node its leading path of
length k (related to Markov random fields
over trees)
A new paradigm for compressing the tree T:
XBW(T)
46
Thanks !!
47
Where we are ...
We investigated the reinforcement relation:
Compression ideas ⇒ Index design
Let's now turn to the other direction:
Indexing ideas ⇒ Compressor design (the Booster)
48
What do we mean by boosting ?
A memoryless compressor is poor in that it
assigns codewords to symbols according only to
their frequencies (e.g. Huffman). It incurs
some obvious limitations:
  • T = a^n b^n (highly compressible)
  • T = random string of n a's and n b's (incompressible)
Both strings have the same symbol frequencies, so a
memoryless compressor treats them alike.
49
The empirical entropy Hk
H_k(T) = (1/|T|) Σ_{|w|=k} |T_w| H_0(T_w)
  • T_w = string of symbols that precede w in T

Example: given T = mississippi, we have T_i = mssp and T_is = ms
  • Problems with this approach
  • How to go from all T_w back to the string T?
  • How do we efficiently choose the best k?
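
The definition of H_k can be checked numerically with a small Python sketch. It follows the slide's convention that T_w collects the symbols preceding w; the function names `H0`, `contexts` and `Hk` are ours, and the string a^8 b^8 is only a small stand-in for the a^n b^n example.

```python
from collections import Counter, defaultdict
from math import log2


def H0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def contexts(T, k):
    """Map each length-k substring w of T to T_w, the string of symbols
    that immediately precede the occurrences of w in T."""
    ctx = defaultdict(str)
    for i in range(1, len(T) - k + 1):
        ctx[T[i:i + k]] += T[i - 1]
    return dict(ctx)


def Hk(T, k):
    """k-th order empirical entropy: (1/|T|) * sum over w of |T_w| * H0(T_w)."""
    if k == 0:
        return H0(T)
    return sum(len(tw) * H0(tw) for tw in contexts(T, k).values()) / len(T)


T = "mississippi"
print(contexts(T, 1))
print(H0(T), Hk(T, 1), Hk(T, 2))

S = "a" * 8 + "b" * 8          # a^n b^n with n = 8
print(H0(S), Hk(S, 1))         # same frequencies as a random mix, far lower H_1
```

The last line shows the limitation of memoryless compressors: H_0 sees one bit per symbol, while a single symbol of context already exposes the regularity.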


50
Use BWT to approximate Hk
Bwt(T)
⇒ compress pieces of bwt(T) up to H_0
Remember that...
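
A small sketch of why the BWT helps here: grouping the symbols of bwt(T) by the length-k prefix of their sorted rotation yields contiguous pieces, one per context w, and each piece is a permutation of the symbols preceding w. The code assumes an end marker `#` smaller than every letter; `bwt_pieces` is a hypothetical helper name.

```python
from collections import defaultdict


def bwt_pieces(text, k):
    """Split bwt(text) into pieces, one per length-k context.

    Each rotation contributes its last symbol to the piece named after
    its first k symbols. Since the rotations are sorted, every piece is
    a contiguous substring of the BWT.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    pieces = defaultdict(str)
    for row in rotations:
        pieces[row[:k]] += row[-1]
    return dict(pieces)


pieces = bwt_pieces("mississippi#", 1)
for w in sorted(pieces):
    print(repr(w), pieces[w])
```

Compressing every piece separately down to its H_0 therefore costs about |T| H_k(T) bits in total, without ever materializing the strings T_w.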
51
Finding the best pieces to compress...
Leaf cover?
(Figure: suffix tree of T with leaves in suffix-array order
12 11 8 5 2 1 10 9 7 4 6 3; the leaf covers L1 and L2
correspond to H_1(T) and H_2(T))
Goal: find the best BWT-partition induced by a
Leaf Cover!!
Some leaf covers are related to H_k!!!
52
A compression booster [Ferragina et al.,
JACM 05]
  • Let Compr be the compressor we wish to boost
  • Let LC_1, ..., LC_r be the partition of BWT(T)
    induced by a leaf cover LC, and let us define the
    cost of LC as cost(LC, Compr) = Σ_j |Compr(LC_j)|
  • Goal: find the leaf cover LC of minimum cost
  • A post-order visit of the suffix
    tree (suffix array) suffices, in optimal time
  • We have Cost(LC, Compr) ≤ Cost(H_k, Compr) ≈
    H_k(T), for all k ≥ 0

This is purely combinatorial. We do not need any
knowledge of the statistical properties of the
source, no parameter estimation, no training, ...
54
2001
55
Locate the pattern occurrences in T
T = mississippi#
4
From s's position we get 4 + 3 = 7, ok!!
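
The locate step can be sketched as follows: keep the suffix-array value only for some rows, and from any other row walk the LF-mapping backwards in the text until a sampled row is met. The sampling policy used here (text position 1 plus every multiple of 4) is a hypothetical choice made so that the example reproduces the slide's 4 + 3 = 7; `#` is an assumed end marker.

```python
from collections import Counter

L = "ipssm#pissii"                              # BWT of "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]    # used only to pick the samples

# Hypothetical sampling: row -> text position, kept for position 1
# and for every position that is a multiple of 4.
SAMPLED = {row: pos for row, pos in enumerate(SA) if pos == 1 or pos % 4 == 0}


def build_C(L):
    """C[c] = number of symbols in L that are strictly smaller than c."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C


C = build_C(L)


def occ(c, j):
    """Number of occurrences of c among the first j symbols of L."""
    return L[:j].count(c)


def lf(row):
    """Row of the suffix that starts one text position earlier."""
    c = L[row]
    return C[c] + occ(c, row + 1) - 1


def locate(P):
    """Sorted 1-based text positions of the occurrences of P."""
    fr, lr = 0, len(L)
    for c in reversed(P):                 # backward search, as for counting
        if c not in C:
            return []
        fr, lr = C[c] + occ(c, fr), C[c] + occ(c, lr)
        if fr >= lr:
            return []
    positions = []
    for row in range(fr, lr):
        steps = 0
        while row not in SAMPLED:         # walk back until a sampled row
            row = lf(row)
            steps += 1
        positions.append(SAMPLED[row] + steps)
    return sorted(positions)


print(locate("si"))
```

Sampling position 1 guarantees that a backward walk never wraps around the end marker; denser sampling trades space for fewer LF steps per occurrence.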
56
What about their practicality ?
  • We have a library that currently offers
  • The FM-index: build, search, display, ...
  • The Suffix Array construction in space (5+ε) n
    bytes
  • The LCP Array construction in space (6+ε) n
    bytes

57
What about word-based searches ?
P = bzip
T = bzipbzip2unbzip2unbzip
...the post-processing phase can be time
consuming!
  • The FM-index can be adapted to support word-based
    searches
  • Preprocess T and transform it into a digested
    text DT

Word search in T = substring search in DT
  • Use the FM-index over the digested text DT

58
The WFM-index
  • Digested T is derived from a Huffman variant
    [Moura et al., 98]
  • Symbols of the Huffman tree are the words of T
  • The Huffman tree has fan-out 128
  • Codewords are byte-aligned and tagged

Any word
P = bzip
1. Dictionary of words
3. FM-index built on DT
59
A historical perspective
  • Shannon showed a narrower result for a
    stationary ergodic source S
  • Idea: compress groups of k chars in the string T
  • Result: compression ratio → the entropy of S, for k
    → ∞
  • Various limitations
  • It works for a source S
  • It must modify A's structure, because of the
    alphabet change
  • For a given string T, the best k is found by
    trying k = 0, 1, ..., |T|
  • Ω(|T|^2) time slowdown
  • k is eventually fixed and this is not an optimal
    choice!

Any string s
Black-box
O(|s|) time
Variable-length contexts
Two key components: Burrows-Wheeler Transform and
Suffix Tree
60
How do we find the best partition (i.e. k)
  • Approximate via MTF [Burrows-Wheeler,
    94]
  • MTF is efficient in practice [bzip2]
  • Theory and practice showed that we can aim for
    more!
  • Use Dynamic Programming [Giancarlo-Sciortino,
    CPM 03]
  • It finds the optimal partition
  • Very slow, the time complexity is cubic in |T|

Surprisingly, full-text indexes help in
finding the optimal partition in optimal linear
time!!
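
For reference, the MTF (move-to-front) step named in the first bullet can be sketched as below. This is the textbook coder, not bzip2's exact implementation; the function names and the explicit `alphabet` argument are our own choices.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its current position in a list that is
    updated by moving the symbol just seen to the front."""
    table = list(alphabet)
    codes = []
    for ch in s:
        idx = table.index(ch)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes


def mtf_decode(codes, alphabet):
    """Inverse of mtf_encode, replaying the same list updates."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


codes = mtf_encode("ipssm#pissii", "#imps")   # BWT of mississippi#
print(codes)
print(mtf_decode(codes, "#imps"))
```

Runs of equal symbols in the BWT become runs of zeros, which is how MTF turns local homogeneity into a skewed distribution that a simple coder can exploit without choosing any partition explicitly.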
61
Example not one k
xs ynzn gt yxs yn-1 , zxs zn-1