Range CUBE: Efficient Cube Computation by Exploiting Data Correlation

Provided by: Rush5

1
Range CUBE: Efficient Cube Computation by
Exploiting Data Correlation

By
Rushabh Ajmera

2
Outline

Problems Faced by Data Cube
Range Cubing to the Rescue
How Range Cube achieves it
Example of Range cube
Range Cube Algorithm
Range cube Experimental Results

3
Problems faced by Data cube

Prohibitively Expensive
Loss of format semantics
Loss of precision
efficient data structure supporting this
definition
High disk I/O time

4
Solution

H-Cubing
Star Cubing
Condensed Cube
Quotient Cube
Range cube

5
Range Cubing Features

Efficient ways to compute data cube
Efficient way to compress data cube
No loss of precision
Preserve the roll-up/drill-down semantics
Efficient with respect to H-Cubing (1/13th of the running time)
Less than one ninth of the space of the full cube

6
Range Cube Features (cont.)

Preserves the semantics and format of the
native data cube
Works easily with a wide range of current database
and data mining applications
Less sensitive to dimension order
Other indexing or compression techniques, such as
Dwarf, can be applied
Can incorporate performance approaches

7
How does Range Cube achieve this?

It uses a new data structure called the RANGE TRIE,
which compresses the input data and identifies
correlation among attribute values.
A range trie captures data correlation by finding
dimension values that imply other dimension values.

8
What is a Trie?

A trie of size (h, b) is a tree of height h and
branching factor b.
All keys can be regarded as integers in the range
[0, b^h).
Each key K can be represented as an h-digit number
in base b: K1 K2 K3 ... Kh.
Keys are stored at the leaf level; the path from
the root resembles the decomposition of a key
into digits.
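
As an illustration of the trie definition above, here is a minimal Python sketch. The names key_to_digits and DigitTrie are made up for this example and do not come from the presentation; nested dictionaries stand in for trie nodes.

```python
def key_to_digits(key, h, b):
    """Decompose an integer key into h base-b digits, most significant first."""
    if not 0 <= key < b ** h:
        raise ValueError("key out of range for a trie of size (h, b)")
    digits = []
    for _ in range(h):
        key, digit = divmod(key, b)
        digits.append(digit)
    return digits[::-1]


class DigitTrie:
    """Trie of height h and branching factor b (hypothetical helper class).

    Each level consumes one digit of the key, so a stored key corresponds
    to one root-to-leaf path of length h.
    """

    def __init__(self, h, b):
        self.h, self.b = h, b
        self.root = {}

    def insert(self, key):
        node = self.root
        for digit in key_to_digits(key, self.h, self.b):
            node = node.setdefault(digit, {})

    def contains(self, key):
        node = self.root
        for digit in key_to_digits(key, self.h, self.b):
            if digit not in node:
                return False
            node = node[digit]
        return True
```

For instance, with h = 3 and b = 3 the key 11 decomposes into the digits 1, 0, 2, so inserting it creates the path root, 1, 0, 2.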

9
Efficiency of cube computation comes from three
aspects

It compresses the base table into a range trie,
so that cells with identical aggregation values
are calculated only once.
The traversal takes advantage of simultaneous
aggregation: after the initial range trie is
built, each m-dimensional cell is computed from
a group of (m + 1)-dimensional cells. At the same
time, it facilitates Apriori pruning.
The reduced cube size requires less output I/O
time.
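
The idea of computing a coarser cell from finer cells can be sketched in a few lines of Python. This is only a plain roll-up over a dictionary of cells, not the range cubing traversal itself; the function name roll_up, the sample cells, and the use of SUM as the measure are assumptions made for the example.

```python
from collections import defaultdict


def roll_up(cells, drop):
    """Aggregate finer cells into coarser ones (hypothetical helper).

    `cells` maps a tuple of dimension values to a summed measure.
    The dimension at index `drop` is replaced by '*' and the measures
    of all cells that become identical are added together.
    """
    out = defaultdict(int)
    for dims, measure in cells.items():
        coarse = dims[:drop] + ("*",) + dims[drop + 1:]
        out[coarse] += measure
    return dict(out)


# Made-up 3-dimensional cells (store, city, product) with a SUM measure.
cells_3d = {
    ("S1", "C1", "P1"): 10,
    ("S1", "C1", "P2"): 5,
    ("S2", "C1", "P1"): 7,
}
cells_2d = roll_up(cells_3d, drop=2)   # aggregate away the product
cells_1d = roll_up(cells_2d, drop=0)   # then aggregate away the store
```

Each coarser level is derived from the level just below it rather than from the base table, which is the saving the slide refers to.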

10
Base Table
11
The lattice of cuboids derived from the base table
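
The slide image is not part of this transcript, but the lattice itself is easy to enumerate: an n-dimensional base table has 2^n cuboids, one per subset of the dimensions. A small sketch, with placeholder dimension names A, B, C and a made-up function name:

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return the cuboids grouped by level (number of group-by dimensions).

    Level 0 holds the apex cuboid (), the last level holds the base cuboid
    that groups by every dimension.
    """
    levels = []
    for k in range(len(dimensions) + 1):
        levels.append(list(combinations(dimensions, k)))
    return levels


lattice = cuboid_lattice(["A", "B", "C"])
for level, cuboids in enumerate(lattice):
    print(level, cuboids)
```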
12
Construction of Range Trie
13
Construction of Range Trie
14
Information stored in Range Trie

Dimension Values of tuples determine the
structure of a range trie, and are stored in the
nodes along paths from the root to leaves.
Measure Values determine the aggregation values
of the nodes.

15
Properties of a Range Trie

The maximum depth of the range trie is the number
of dimensions n.
The number of leaf nodes in a range trie is the
number of tuples with distinct dimension values,
which is bounded by the total number of tuples.
Because siblings have distinct values on their
start dimension, the fan-out of a parent node is
bounded by the cardinality of the start dimension
of its child nodes.
Each interior node has at least two child nodes,
since a parent node absorbs all dimension values
common to all of its child nodes.
Each node represents a set of tuples represented
by the leaf nodes below it.
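
The properties above can be checked on a small example. The sketch below is a simplified batch construction written for illustration only; it is not the construction algorithm from the presentation, and the names Node, build, leaves and depth, the COUNT aggregate, and the sample rows are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    key: dict                      # dimension index -> value held by this node
    count: int                     # aggregate: number of tuples below the node
    children: list = field(default_factory=list)


def build(tuples, dims):
    """Build a (sub)trie over `tuples`, looking only at dimension indices `dims`."""
    # Values shared by every tuple in this subtree go into the node itself.
    common = {d: tuples[0][d] for d in dims
              if all(t[d] == tuples[0][d] for t in tuples)}
    node = Node(key=common, count=len(tuples))
    remaining = [d for d in dims if d not in common]
    if remaining:
        start = remaining[0]       # start dimension of the child nodes
        groups = defaultdict(list)
        for t in tuples:
            groups[t[start]].append(t)
        node.children = [build(g, remaining) for g in groups.values()]
    return node


def leaves(node):
    return 1 if not node.children else sum(leaves(c) for c in node.children)


def depth(node):
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


# Made-up rows (store, city, product); each store sits in exactly one city.
rows = [("S1", "C1", "P1"), ("S1", "C1", "P2"),
        ("S2", "C2", "P1"), ("S2", "C2", "P1")]
root = build(rows, [0, 1, 2])
```

In this example the child for store S1 also holds city C1, because S1 implies C1 in the data. The trie has three leaves for three distinct tuples, depth 2 (at most the number of dimensions, 3), and every interior node has two children.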

16
Range Trie Construction
17
Size of Range Trie

Suppose we have a range trie on a D-dimensional
dataset with T tuples.
The depth of the range trie is D in the worst
case, when the dataset is very dense and the trie
structure is like a full tree.
It is log_N T in the average case, where N is the
average fan-out of the nodes.

18
Range cubing vs. H-cubing: Evaluating the
effectiveness
19
Evaluating the impact of skewness: dimensions = 6,
cardinality = 100, number of tuples = 200K
20
Evaluating the impact of sparsity: dimensions = 6,
tuples = 200K, cardinality = 10 to 10000
21
Evaluating performance on a real data set (weather)
22
Future Work