Adam Silberstein, Hao He, Ke Yi, Jun Yang - PowerPoint PPT Presentation

1
BOXes: Efficient Maintenance of Order-Based
Labeling for Dynamic XML Data

Adam Silberstein, Hao He, Ke Yi, Jun Yang
Duke University
Durham, North Carolina, USA

2
XML labeling

Assign labels to XML elements to capture the
document hierarchy
Facilitates query processing by providing
efficient checking of relationships between
elements
Having a labeling scheme for dynamic documents is
important
As more and larger data is maintained as XML,
need to be able to make updates
Problem has been addressed by many academic and
industry groups (Niagara, Timber, Microsoft
ORDPATH, etc.)

3
Order-based labeling

Popular method is to assign each element an
interval (start_label, end_label) based on
document order of its start and end tags
If tag t1 precedes tag t2 in the document, then
t1's label is less than t2's
Widely used by many systems (e.g., Niagara,
Timber) in processing XPath location steps
E1 is an ancestor of E2 iff E1's interval
contains that of E2
Labeling a static document is easy, but what if
document is updated?
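
A minimal sketch of the interval scheme above, assuming labels are consecutive integers handed out in document order (one per start tag, one per end tag). The function names and the sample document are illustrative, not taken from the slides.

```python
import xml.etree.ElementTree as ET

def label_intervals(root):
    """Assign (start_label, end_label) to every element in document order."""
    labels = {}
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1                  # label for the start tag
        start = counter
        for child in elem:
            visit(child)
        counter += 1                  # label for the end tag
        labels[elem] = (start, counter)

    visit(root)
    return labels

def is_ancestor(labels, e1, e2):
    """E1 is an ancestor of E2 iff E1's interval strictly contains E2's."""
    s1, t1 = labels[e1]
    s2, t2 = labels[e2]
    return s1 < s2 and t2 < t1

doc = ET.fromstring(
    "<bib><book><title/><author/></book><book><section/></book></bib>"
)
labels = label_intervals(doc)
book1, book2 = list(doc)
title = book1[0]
print(labels[doc], labels[title])        # (1, 12) (3, 4)
print(is_ancestor(labels, book1, title)) # True
print(is_ancestor(labels, book2, title)) # False
```

Labeling a static document this way leaves no room between adjacent labels, which is why any later insertion forces relabeling.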

[Figure: example XML tree with elements bib, book, title, author, section, bookref]
4
Immutable labeling scheme

Cohen et al., PODS 2002
Any immutable labeling scheme (i.e., label values
don't change once assigned) will necessarily
require Ω(N) bits per label, where N is the size
of the document
Can do better if we know something about the
document structure in advance, but still hopeless
in adversarial cases

5
Dynamic labeling scheme

Allow labels to be mutable
When we run out of labels to assign, change some
existing labels to make space
Updating various copies (e.g., in inverted
keyword indexes) is problematic
"One more level of indirection solves
everything"
Map immutable label IDs to mutable label values
using, say, a heap file
Challenges addressed by our BOXes
How to reduce relabeling cost?
How to do it in an I/O-efficient manner?
How to avoid the extra indirection when accessing
labels?
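
The indirection idea can be sketched as follows. This is an illustration only: `LabelStore` is a hypothetical in-memory stand-in for the heap file, and the inverted index is a plain dictionary.

```python
class LabelStore:
    """Maps immutable label IDs to mutable label values (heap-file stand-in)."""

    def __init__(self):
        self._values = []             # list index = immutable label ID

    def new_label(self, value):
        self._values.append(value)
        return len(self._values) - 1  # the ID never changes afterwards

    def lookup(self, label_id):
        return self._values[label_id]

    def relabel(self, label_id, new_value):
        self._values[label_id] = new_value

store = LabelStore()
lid = store.new_label(20)
inverted_index = {"xml": [lid]}       # copies hold the ID, not the value
store.relabel(lid, 40)                # relabeling touches one place only
print(store.lookup(inverted_index["xml"][0]))  # 40
```

The cost is the extra lookup on every access to a label value, which is the indirection the last challenge refers to.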

6
Naïve relabeling

To insert a new label between two existing labels
(e.g., 20 and 30)
Assign the average to the new label (e.g.,
avg(20, 30) = 25)
If there is no space between existing labels
(e.g., 2 and 3), relabel everything to leave
equally sized gaps between adjacent labels
Easily broken by an adversary that repeatedly
inserts into the smallest gap
For a gap of k bits, it takes only k+1 insertions
to trigger relabeling
Using floating-point numbers instead of integers
won't help, because the number of bit patterns
still poses the same limit
Must cut down the cost of relabeling!
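
A small sketch of the adversarial case, assuming integer labels and integer averaging. The function names are hypothetical. The adversary always inserts into the gap right after the lower label.

```python
def insertions_until_relabel(lo, hi):
    """Count insertions into the gap just after `lo` until averaging fails."""
    count = 0
    while True:
        count += 1
        mid = (lo + hi) // 2      # average of the two neighbouring labels
        if mid == lo:             # no integer lies strictly between lo and hi
            return count          # this insertion triggers relabeling
        hi = mid                  # the adversary now targets the gap (lo, mid)

def relabel_evenly(n, gap):
    """Relabel n labels so that adjacent labels are `gap` apart."""
    return [gap * (i + 1) for i in range(n)]

print(insertions_until_relabel(0, 2 ** 10))  # 11, i.e. k+1 for a gap of k = 10 bits
print(insertions_until_relabel(20, 30))      # 4
print(relabel_evenly(3, 10))                 # [10, 20, 30]
```

Each insertion halves the gap, so the number of insertions a gap survives is only logarithmic in its size, while the relabeling it triggers touches every label.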

7
Approach 1: Tree-based relabeling

[Figure: tree partitioning the label value space; labels shown include 32, 33, 34, 36, 37]

A complete tree recursively partitions the label
value space into a hierarchy of ranges
Invariant: all labels found beneath a node fall
into the node's associated range
An insertion that does not cause any node splits
requires, in the worst case, relabeling within
the same leaf
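
As a rough illustration of relabeling confined to one leaf, the sketch below spreads a leaf's entries evenly inside the leaf's own range. The even spacing and the function name are assumptions for illustration; the slides only state that relabeling stays within the leaf.

```python
def relabel_within_leaf(num_entries, range_lo, range_hi):
    """Spread num_entries labels evenly inside the range [range_lo, range_hi)."""
    width = range_hi - range_lo
    if num_entries > width:
        raise ValueError("leaf range is too crowded; the node must split")
    step = width // num_entries
    return [range_lo + i * step for i in range(num_entries)]

# A leaf owning the range [32, 40) that now holds four entries.
print(relabel_within_leaf(4, 32, 40))  # [32, 34, 36, 38]
```

No label outside the leaf's range changes, so the invariant for every other node is untouched.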

8
Tree-based relabeling: split

An overflowing node is a good indication that its
associated range is getting crowded
Splitting a node causes ranges to be reassigned,
and any label that moves to a new range must be
reassigned

9
B-tree is not good enough

Regular B-tree reorganizes too frequently
A node at level i (assuming leaves are at level
0) can split every (B/2)^(i+1) insertions, where B
is the block size or the maximum fanout
But this split involves relabeling up to B^(i+2)
labels
A factor of 2^(i+1) B difference!
Alternative: weight-balanced B-tree [Arge and
Vitter, FOCS 1996]
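
The gap between split frequency and relabeling cost can be checked with a few lines of arithmetic. This sketch assumes the two quantities are (B/2)^(i+1) insertions between splits and B^(i+2) labels relabeled per split; the values of B and i are arbitrary.

```python
def insertions_between_splits(B, i):
    """Fewest insertions after which a level-i node can split again."""
    return (B // 2) ** (i + 1)

def labels_relabeled_per_split(B, i):
    """Upper bound on labels touched when a level-i node splits."""
    return B ** (i + 2)

B, i = 64, 2
ratio = labels_relabeled_per_split(B, i) // insertions_between_splits(B, i)
print(ratio, 2 ** (i + 1) * B)  # 512 512
```

The ratio grows with the level, so splits high in a regular B-tree are paid for by too few insertions.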

10
W-BOX: Weight-balanced B-tree for Ordering XML

Weight of a node = number of leaf entries below
it
Basic idea: balance tree by weight rather than
fanout
A weight-balanced B-tree has two parameters:
Branching parameter a (2 less than ½ of max
fanout)
Leaf parameter k (roughly ½ of max leaf capacity)
And the following constraints (tuned specifically
for W-BOX):
All leaves are at the same depth, and root has
more than one child
A node at level i (assuming leaves are at level
0) has weight < 2 a^i k
A node at level i (except root) has weight >
a^i k - 2 a^(i-1) k
Implies that internal fanout is in [max/4 - 1,
max], so
Emptier than a regular B-tree
Still O(logB N) height and O(N/B) space, where B
is the block size
Implies that weight(parent(u)) = O(B weight(u))
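
The fanout range can be derived from the weight constraints. The sketch below assumes they read weight < 2 a^i k and weight > a^i k - 2 a^(i-1) k for a non-root node at level i. The function names and the sample parameters are illustrative.

```python
from fractions import Fraction

def weight_bounds(a, k, i):
    """Open interval (low, high) for the weight of a non-root node at level i."""
    low = Fraction(a) ** i * k - 2 * Fraction(a) ** (i - 1) * k
    high = 2 * Fraction(a) ** i * k
    return low, high

def fanout_bounds(a, k, i):
    """Open interval of possible fanouts for an internal node at level i >= 1."""
    low, high = weight_bounds(a, k, i)
    child_low, child_high = weight_bounds(a, k, i - 1)
    # Fewest children: lightest parent over heaviest children, and vice versa.
    return low / child_high, high / child_low

lo, hi = fanout_bounds(a=34, k=50, i=1)
print(lo, hi)  # 16 289/4
```

With a = 34 the fanout lies strictly between 16 and 72.25, so it ranges over 17 to 72. That matches a being 2 less than half of the maximum fanout (72/2 - 2) and a minimum fanout of max/4 - 1.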

11
Complexity of W-BOX

Space is O(N/B)
Bits per label is at most log N 1d1.3
loga(N/k)log be
Amortized update cost is O(logB N) I/Os, because
W-BOX splits much less frequently than a regular
B-tree: a node u will not be split again until
Ω(weight(u)) leaf entries are inserted below u
Splitting u in the worst case involves relabeling
all entries below u's parent, with
O(weight(parent(u))/B) = O(weight(u)) I/Os
Worst-case lookup cost is one I/O, given the heap
file record associated with the label (which
points to the W-BOX leaf containing the label
value)

12
Approach 2: Virtual labels

Since updating labels is so messy, why physically
store them? Why not just provide a way to
reconstruct them efficiently?
Given the path from root to the leaf entry, we
can construct a multi-component label consisting
of the ordinal positions of the child links
traversed
But without storing any labels (which are the
B-tree search key values), how do we obtain this
path in the first place?

13
B-BOX: Back-linked Keyless B-tree for Ordering XML

Given the heap file record associated with the
label, begin search at the leaf containing the
B-BOX entry
Scan through the leaf to find the record pointer,
which gives the record's ordinal position

Follow back-link from the child to the parent
Scan through the parent to find this child,
which gives the child's ordinal position
Repeat
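
The bottom-up reconstruction of a virtual label can be sketched as follows. This is an in-memory illustration under simplifying assumptions: `Node` is a hypothetical class, records are plain strings, and scanning a node is a list search rather than a block read.

```python
class Node:
    """Keyless tree node with a back-link to its parent."""

    def __init__(self):
        self.parent = None        # back-link
        self.entries = []         # child nodes (internal) or record pointers (leaf)

    def add(self, entry):
        self.entries.append(entry)
        if isinstance(entry, Node):
            entry.parent = self
        return entry

def virtual_label(leaf, record):
    """Collect ordinal positions from the leaf up to the root."""
    components = [leaf.entries.index(record)]    # position of record in its leaf
    child = leaf
    while child.parent is not None:
        components.append(child.parent.entries.index(child))
        child = child.parent                     # follow the back-link
    return tuple(reversed(components))           # root-to-leaf order

root = Node()
leaf0, leaf1 = root.add(Node()), root.add(Node())
leaf0.add("a"); leaf0.add("b"); leaf1.add("c")
print(virtual_label(leaf0, "b"), virtual_label(leaf1, "c"))  # (0, 1) (1, 0)
```

Comparing the tuples lexicographically gives document order, so no label value ever needs to be stored or rewritten.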

14
Complexity of B-BOX

Space is O(N/B)
Bits per label is at most log N + 1 +
⌈(log N - 1)/(log B - 1)⌉
Worst-case lookup cost is O(logB N) I/Os
Worst-case update cost is O(B logB N) I/Os
Every node split relocates B/2 children to a
different parent, requiring B/2 I/Os to update
their back-links
Splits can happen at every level
But no need to reorganize siblings of splitting
node
Amortized update cost is O(1), because
splits are infrequent: leaf splits only every
B/2 insertions, level-1 node splits only every
(B/2)^2 insertions, level-2 node splits only every
(B/2)^3 insertions, and so on
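
The O(1) amortized bound follows from summing the per-level costs. The sketch assumes each split costs B/2 back-link I/Os and that a level-i split happens at most once per (B/2)^(i+1) insertions. The function name is hypothetical.

```python
from fractions import Fraction

def amortized_backlink_ios(B, levels):
    """Back-link I/Os per insertion, summed over the given number of levels."""
    half = Fraction(B, 2)
    # One split per (B/2)^(i+1) insertions at level i, costing B/2 I/Os each.
    return sum(half / half ** (i + 1) for i in range(levels))

print(amortized_backlink_ios(64, 3))           # 1057/1024
print(float(amortized_backlink_ios(64, 50)))   # stays below 64/62
```

The terms form a geometric series with ratio 2/B, so the sum is bounded by 1/(1 - 2/B) regardless of the tree height.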

15
Ordinal support

BOXes can be extended to support exact ordinal
labels
Augment with size fields, noting number of
records below an entry
W-BOX
After retrieving the label as normal, traverse
top-down searching for it and sum all size fields
to left of traversed pointers in all nodes
Lookup becomes O(logB N)
B-BOX
Initialize counter to number of entries on
starting leaf to left of query record
During bottom-up traversal, at each node, add to
counter all size fields to left of record
Update becomes O(logB N)

16
Ordinal support
[Figure: tree annotated with size fields, including 12, 9, and 3]

W-BOX top-down: ordinal for the query record is (9+12)+3+2 = 26
B-BOX bottom-up: ordinal for the query record is 2+3+(9+12) = 26
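
The ordinal computation can be sketched as a sum of size fields to the left of the traversed pointers. The tree shape below is assumed for illustration: only the size fields 9, 12, and 3 and the leaf offset 2 come from the slide, and the remaining values (7 and 5) are made-up fillers.

```python
def ordinal_from_path(path, leaf_offset):
    """path: (size_fields, child_index) pairs for each internal node on the path."""
    total = leaf_offset                          # entries to the left in the leaf
    for size_fields, child_index in path:
        total += sum(size_fields[:child_index])  # subtrees to the left of the pointer
    return total

# Root entries hold sizes 9, 12, 7 and we follow the third pointer;
# the next node holds sizes 3, 5 and we follow the second pointer.
path = [([9, 12, 7], 2), ([3, 5], 1)]
print(ordinal_from_path(path, leaf_offset=2))  # 26
```

Addition is commutative, so the top-down W-BOX traversal and the bottom-up B-BOX traversal arrive at the same total.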

17
Bulk operations

Bulk construction
Bulk loading done by filling leaves with no
splitting
Inserting an XML subtree (see paper for deletion)
Find the insertion point in leaf
W-BOX: traverse upward to find the lowest node that
can accommodate the subtree's number of nodes
B-BOX:
Bulk construct a new B-BOX, T, with h levels
Traverse existing B-BOX upward, ripping nodes
at the insertion point, h levels up
Place T into resulting gap
Result: all root-to-leaf paths have the same length
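
Bulk construction by filling leaves can be sketched as a bottom-up packing pass. This is a simplified illustration with nested lists as nodes and a hypothetical `bulk_load` function. It ignores the balance constraints a real W-BOX or B-BOX would also have to satisfy for the last, possibly underfull, node on each level.

```python
def bulk_load(entries, capacity):
    """Pack entries into full leaves, then pack each level into the next."""
    level = [entries[i:i + capacity] for i in range(0, len(entries), capacity)]
    levels = [level]
    while len(level) > 1:                 # stop once a single root remains
        level = [level[i:i + capacity] for i in range(0, len(level), capacity)]
        levels.append(level)
    return levels                         # levels[0] = leaves, levels[-1] = [root]

levels = bulk_load(list(range(10)), capacity=4)
print(len(levels), len(levels[0]), levels[0][2])  # 2 3 [8, 9]
```

Because every node is written once and never split, the construction cost is linear in the number of entries.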

18
Experiment: Concentrated insertions

Designed to stress-test the data structures
2-level XML document with 2 million elements
Insert 0.5 million elements one by one, always
right in the middle of the document
Naïve performs poorly even with 256 more bits
BOXes handle this near-worst case gracefully
B-BOX is most efficient
Bear in mind that W-BOX lookup has constant
cost but B-BOX is logarithmic

[Figure: Avg. I/Os per insert for naïve-256, naïve-64, naïve-16, naïve-4, B-BOX, and W-BOX]
19
Experiment: XMark

Designed to test normal operations
XMark document with 336K elements
Insert elements one by one in document order
Start accounting after 200K elements
Naïve still struggles, unless it has 32 more bits
But the overhead of manipulating long labels
would be high for query processing, which is
not measured in this figure
BOXes still very efficient
Labels fit in machine word

[Figure: Avg. I/Os per insert for naïve-32, naïve-16, naïve-8, naïve-4, naïve-2, B-BOX, and W-BOX]
20
Removing indirection

Basic caching
Each reference to a label is augmented with a
cached value and a last-cached timestamp
Each document maintains a last-updated timestamp
If (last-cached > last-updated), the cached value
is valid; otherwise, pay the full cost of lookup
Good enough for rarely updated documents, less
effective when there is a steady update workload
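
A minimal sketch of the timestamp check, assuming a per-document logical clock. The class names, the dictionary standing in for the full lookup, and the `full_lookups` counter are illustrative additions.

```python
class Document:
    def __init__(self):
        self.clock = 0            # logical clock shared by updates and caching
        self.last_updated = 0
        self.labels = {}          # stand-in for the full (expensive) lookup

    def update(self, label_id, value):
        self.clock += 1
        self.last_updated = self.clock
        self.labels[label_id] = value

class LabelRef:
    """A reference to a label, augmented with a cached value and timestamp."""

    def __init__(self, doc, label_id):
        self.doc, self.label_id = doc, label_id
        self.cached_value, self.last_cached = None, -1
        self.full_lookups = 0

    def value(self):
        if self.last_cached > self.doc.last_updated:
            return self.cached_value            # cache is still valid
        self.full_lookups += 1                  # pay the full cost of lookup
        self.cached_value = self.doc.labels[self.label_id]
        self.doc.clock += 1
        self.last_cached = self.doc.clock
        return self.cached_value

doc = Document()
doc.update("x", 20)
ref = LabelRef(doc, "x")
first, second = ref.value(), ref.value()   # second call is served from the cache
doc.update("x", 25)                        # any update invalidates every cache
third = ref.value()
print(first, second, third, ref.full_lookups)  # 20 20 25 2
```

A single document-wide timestamp is cheap to maintain, but every update invalidates all cached values at once, which is why the scheme degrades under a steady update workload.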

21
Caching + logging

Observation: the effect of an update on existing
labels can often be described succinctly for
W-BOX and B-BOX
Example: insert a new label before 109 on a leaf
whose largest label is 123; assuming no split,
the effect can be described as [109, 123] + 1
Keep a log of last k updates in memory
Consult the log to see if a cached label value
can be brought up to date by applying the effects
of subsequent updates in order
If (last-cached < earliest logged update), pay
full cost of lookup
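
The log replay can be sketched as follows, assuming each logged effect has the form "labels in [lo, hi] shift by delta" and carries a timestamp. The `UpdateLog` class and its methods are hypothetical names for illustration.

```python
from collections import deque

class UpdateLog:
    """Keeps the effects of the last k updates in memory."""

    def __init__(self, k):
        self.entries = deque(maxlen=k)    # (timestamp, lo, hi, delta), oldest first

    def record(self, timestamp, lo, hi, delta):
        self.entries.append((timestamp, lo, hi, delta))

    def refresh(self, cached_value, last_cached):
        """Return the up-to-date value, or None if a full lookup is needed."""
        if self.entries and last_cached < self.entries[0][0]:
            return None                   # older updates may have left the log
        value = cached_value
        for timestamp, lo, hi, delta in self.entries:
            if timestamp > last_cached and lo <= value <= hi:
                value += delta            # apply later effects in order
        return value

log = UpdateLog(k=2)
log.record(3, 50, 60, 1)
log.record(6, 109, 123, 1)                # insert before 109, leaf maximum 123
print(log.refresh(115, last_cached=5))    # 116
print(log.refresh(100, last_cached=5))    # 100
log.record(9, 10, 20, 1)                  # evicts the update logged at time 3
print(log.refresh(115, last_cached=5))    # None
```

The size of the log bounds how stale a cached value may be before the full lookup becomes unavoidable.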

22
Conclusion

XML labeling difficult for dynamic documents
BOXes facilitate mutable labels of size O(log N)
BOXes trade off update/lookup cost
W-BOX: logarithmic update (amortized), constant
lookup
B-BOX: constant update (amortized), logarithmic
lookup
Both handle arbitrary insertion/deletion patterns
and XML tree shapes
Indirection/lookup overhead mitigated by caching
and logging

Questions?

24
Related Work

Dewey encoding [Tatarinov et al., SIGMOD 2002]
Combine local ordering of each element on
incoming path
Microsoft ORDPATH [O'Neil et al., SIGMOD 2004]
Extends Dewey to support inserts using
careting-in
Ω(N) bits/label for some insertion sequences or
tree shapes
Relabeling for equally-sized gaps [Jagadish et
al., VLDBJ 2002; Halverson et al., VLDB 2003;
etc.], and use of floating-point labels [Amagasa
et al., ICDE 2003]
High relabeling cost for some insertion sequences
Maintaining order in a linked list [Dietz 1982,
1987; Bender et al., ESA 2002] and application to
XML labeling [Fisher et al., CIKM 2003; Chen et
al., EDBT Workshop 2004]
Internal-memory data structures