Title: Range CUBE: Efficient Cube Computation by Exploiting Data Correlation
1. Range CUBE: Efficient Cube Computation by Exploiting Data Correlation
2. Outline
- Problems Faced by Data Cube
- Range Cubing to the Rescue
- How Range Cube Achieves It
- Example of Range Cube
- Range Cube Algorithm
- Range Cube Experimental Results
3. Problems Faced by Data Cube
- Prohibitively expensive
- Loss of format semantics
- Loss of precision
- Efficient data structure supporting this definition
- High disk I/O time
4. Solution
- H-Cubing
- Star Cubing
- Condensed Cube
- Quotient Cube
- Range cube
5. Range Cubing Features
- Efficient ways to compute data cube
- Efficient way to compress data cube
- No loss of precision
- Preserve the roll-up/drill-down semantics
- Efficient with respect to H-Cubing (1/13th)
- Uses less than one ninth of the space of the full cube
6. Range Cube Features (cont.)
- Preserves the semantics and format of the native data cube
- Works easily with a wide range of current database and data mining applications
- Less sensitive to dimension order
- Other indexing or compression techniques, such as Dwarf, can be applied
- Can incorporate performance approaches
7. How does Range Cube achieve this?
- By using a new data structure called the RANGE TRIE, which is used to compress the data and identify correlation
- A range trie captures data correlation by finding dimension values that imply other dimension values.
8. What's a Trie
- A trie of size (h, b) is a tree of height h and branching factor b
- All keys can be regarded as integers in the range [0, b^h)
- Each key K can be represented as an h-digit number in base b: K1 K2 K3 ... Kh
- Keys are stored at the leaf level; the path from the root resembles the decomposition of the key into digits
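The digit decomposition described above can be sketched in a few lines of Python. This is only an illustration of a plain (h, b) trie, not of the range trie itself; the names `to_digits` and `DigitTrie` are made up for this example, and nested dictionaries stand in for the nodes.

```python
def to_digits(key, h, b):
    """Decompose an integer key into exactly h base-b digits, most significant first."""
    if not 0 <= key < b ** h:
        raise ValueError("key must lie in [0, b^h)")
    digits = []
    for _ in range(h):
        key, digit = divmod(key, b)
        digits.append(digit)
    return digits[::-1]


class DigitTrie:
    """Hypothetical (h, b) trie: each level consumes one digit of the key."""

    def __init__(self, h, b):
        self.h, self.b = h, b
        self.root = {}

    def insert(self, key, value):
        digits = to_digits(key, self.h, self.b)
        node = self.root
        for digit in digits[:-1]:
            node = node.setdefault(digit, {})   # interior levels
        node[digits[-1]] = value                # leaf level stores the value

    def lookup(self, key):
        digits = to_digits(key, self.h, self.b)
        node = self.root
        for digit in digits[:-1]:
            node = node.get(digit)
            if node is None:
                return None
        return node.get(digits[-1])


trie = DigitTrie(h=3, b=10)
trie.insert(42, "forty-two")       # path from the root: 0 -> 4 -> 2
print(to_digits(42, 3, 10), trie.lookup(42))
```

Every lookup follows exactly h edges, so the cost depends on the key length rather than on the number of keys stored.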
9. Efficiency of cube computation comes from three aspects
- It compresses the base table into a range trie, so that cells with identical aggregation values are calculated only once.
- The traversal takes advantage of simultaneous aggregation: after the initial range trie is built, each m-dimensional cell is computed from a set of (m + 1)-dimensional cells. At the same time, it facilitates Apriori pruning.
- The reduced cube size requires less output I/O time.
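To make the second point concrete, here is a minimal sketch of computing m-dimensional cells from (m + 1)-dimensional ones. It works on plain dictionaries rather than on a range trie, it assumes a distributive aggregate (SUM), and the function name `roll_up` and the sample values are invented for the illustration.

```python
from collections import defaultdict


def roll_up(cuboid, drop):
    """Aggregate an (m+1)-dimensional cuboid into an m-dimensional one.

    cuboid maps a cell (tuple of dimension values) to its SUM aggregate;
    drop is the position of the dimension to aggregate away.
    """
    result = defaultdict(int)
    for cell, total in cuboid.items():
        coarser_cell = cell[:drop] + cell[drop + 1:]
        result[coarser_cell] += total
    return dict(result)


# Hypothetical 3-dimensional cells over dimensions (A, B, C).
abc = {
    ("a1", "b1", "c1"): 10,
    ("a1", "b1", "c2"): 20,
    ("a2", "b2", "c1"): 30,
}
ab = roll_up(abc, drop=2)   # 2-dimensional cells from 3-dimensional cells
a = roll_up(ab, drop=1)     # 1-dimensional cells from 2-dimensional cells
print(ab)
print(a)
```

Each coarser cuboid is derived from an already aggregated finer one instead of rescanning the base table, which is the saving that simultaneous aggregation aims for. This shortcut does not carry over directly to holistic functions such as the median.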
10. Base Table
11. The lattice of cuboids derived from the base table
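The base table itself is not reproduced here, so the following sketch uses three hypothetical dimensions named A, B and C to show what the lattice contains: one cuboid per subset of the dimensions, from the base cuboid down to the empty (apex) cuboid.

```python
from itertools import combinations


def cuboids(dimensions):
    """List every cuboid (subset of dimensions), finest first."""
    return [
        combo
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
    ]


def direct_roll_ups(cuboid):
    """Cuboids reachable by aggregating away exactly one dimension."""
    return [cuboid[:i] + cuboid[i + 1:] for i in range(len(cuboid))]


lattice = cuboids(("A", "B", "C"))   # dimension names are placeholders
for c in lattice:
    print(c, "->", direct_roll_ups(c))
```

With n dimensions the lattice has 2^n cuboids, which is why computing the full cube becomes expensive as dimensions are added.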
12. Construction of Range Trie
13. Construction of Range Trie
14. Information stored in Range Trie
- Dimension values of tuples determine the structure of a range trie, and are stored in the nodes along paths from the root to leaves.
- Measure values determine the aggregation values of the nodes.
15. Properties of a Range Trie
- The maximum depth of the range trie is the number of dimensions n.
- The number of leaf nodes in a range trie is the number of tuples with distinct dimension values, which is bounded by the total number of tuples.
- Because siblings have distinct values on their start dimension, the fan-out of a parent node is bounded by the cardinality of the start dimension of its child nodes.
- Each interior node has at least two child nodes, since its parent node contains all dimension values common to all child nodes.
- Each node represents a set of tuples represented by the leaf nodes below it.
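The properties above can be observed on a small example. The sketch below builds a trie with these properties in one batch pass over an in-memory table; it is an assumption-laden illustration, not the construction procedure shown on the slides. The names `Node`, `build_node` and `count_leaves`, the SUM aggregate, and the sample rows are all hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, key, total):
        self.key = key        # {dimension index: value} shared by all tuples below
        self.total = total    # aggregate of the measure (SUM in this sketch)
        self.children = []


def build_node(rows, dims):
    """rows: list of (dimension_values, measure); dims: dimensions not fixed by ancestors."""
    first = rows[0][0]
    # Dimension values common to every tuple are pulled up into this node's key.
    common = [d for d in dims if all(values[d] == first[d] for values, _ in rows)]
    node = Node({d: first[d] for d in common}, sum(m for _, m in rows))
    rest = [d for d in dims if d not in common]
    if rest:
        start = rest[0]                       # start dimension of the children
        groups = defaultdict(list)
        for row in rows:
            groups[row[0][start]].append(row)
        for value in sorted(groups):          # siblings differ on the start dimension
            node.children.append(build_node(groups[value], rest))
    return node


def count_leaves(node):
    return 1 if not node.children else sum(count_leaves(c) for c in node.children)


# Hypothetical base table: (store, city, product) with a sales measure.
rows = [
    (("S1", "C1", "P1"), 10),
    (("S1", "C1", "P2"), 20),
    (("S2", "C2", "P1"), 30),
    (("S2", "C2", "P1"), 5),
]
root = build_node(rows, dims=[0, 1, 2])
print(root.total, [child.key for child in root.children], count_leaves(root))
```

In this table every tuple with store S1 also has city C1, so both values land in the same node: that is the sense in which one dimension value implies another. The two identical S2 tuples collapse into a single leaf, matching the bound on the number of leaves.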
16. Range Trie Construction
17. Size of Range Trie
- Suppose we have a range trie on a D-dimensional dataset with T tuples.
- The depth of the range trie is D in the worst case, when the dataset is very dense and the trie structure is like a full tree.
- It is log_N T in the average case, where N is the average fan-out of the nodes.
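As a quick arithmetic check of the average-case estimate, the snippet below evaluates log_N T for a dataset of T = 200K tuples and D = 6 dimensions. The average fan-out N = 10 is an assumed value chosen only for illustration, and capping the result at D reflects the worst-case bound stated above.

```python
import math


def average_depth(num_tuples, avg_fanout, num_dims):
    """Estimated depth log_N(T), capped at the worst-case depth D."""
    return min(num_dims, math.log(num_tuples, avg_fanout))


# T = 200,000 tuples, assumed average fan-out N = 10, D = 6 dimensions
print(round(average_depth(200_000, 10, 6), 2))
```

With a smaller fan-out the logarithm exceeds D, and the estimate simply falls back to the worst-case depth.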
18. Range Cubing vs. H-Cubing: Evaluating the effectiveness
19. Evaluating the impact of skewness (dimensions = 6, cardinality = 100, number of tuples = 200K)
20. Evaluating the impact of sparsity (dimensions = 6, tuples = 200K, cardinality = 10 to 10000)
21. Evaluating performance on a real data set: weather
22. Future Work
- Incorporating constraints with range cube computation
- Dealing with holistic functions
- Applying different compression techniques to compress the cube
- Supporting incremental and batch updates
23. Questions?