1
Range CUBE: Efficient Cube Computation by Exploiting Data Correlation
  • By
  • Rushabh Ajmera

2
Outline
  • Problems Faced by the Data Cube
  • Range Cubing to the Rescue
  • How Range Cube Achieves It
  • Example of a Range Cube
  • The Range Cube Algorithm
  • Range Cube Experimental Results

3
Problems Faced by the Data Cube
  • Prohibitively expensive to compute
  • Loss of format semantics
  • Loss of precision
  • No efficient data structure supporting this definition
  • High disk I/O time

4
Solutions
  • H-Cubing
  • Star Cubing
  • Condensed Cube
  • Quotient Cube
  • Range Cube

5
Range Cubing Features
  • An efficient way to compute the data cube
  • An efficient way to compress the data cube
  • No loss of precision
  • Preserves the roll-up/drill-down semantics
  • More efficient than H-Cubing (roughly 1/13th of the runtime)
  • Uses less than one-ninth of the space of the full cube

6
Range Cube Features (cont.)
  • Preserves the semantics and format of the native data cube
  • Works easily with a wide range of current database and data mining applications
  • Less sensitive to dimension ordering
  • Other indexing or compression techniques, such as Dwarf, can be applied on top of it
  • Can incorporate other performance-oriented approaches

7
How does Range Cube achieve this?
  • It uses a new data structure called the range trie, which both compresses the data and identifies correlation
  • A range trie captures data correlation by finding dimension values that imply other dimension values (a toy illustration follows)
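
To make the idea concrete, here is a minimal Python sketch (not the paper's algorithm) of what it means for one dimension value to imply another: a value of dimension A implies a value of dimension B if every tuple containing it carries that same B value. The toy base table and the helper names are illustrative only.

```python
from collections import defaultdict

# Toy base table with three dimensions (A, B, C); the values are illustrative.
base_table = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a2", "b2", "c1"),
]

# For every value of dimension A, collect the B values it co-occurs with.
b_values_for_a = defaultdict(set)
for a, b, _c in base_table:
    b_values_for_a[a].add(b)

# A value of A implies a value of B if it always co-occurs with that single B value.
implied = {a: next(iter(bs)) for a, bs in b_values_for_a.items() if len(bs) == 1}
print(implied)  # {'a1': 'b1', 'a2': 'b2'}
```

In a range trie, such implied values can be stored together on a single node, which is the source of the compression.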

8
What's a Trie?
  • A trie of size (h, b) is a tree of height h and branching factor b
  • All keys can be regarded as integers in the range [0, b^h)
  • Each key K can be represented as an h-digit number in base b: K_1 K_2 K_3 ... K_h
  • Keys are stored at the leaf level; the path from the root corresponds to the decomposition of the key into digits (a minimal sketch follows)
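
As a rough illustration of this definition, the following Python sketch stores integer keys in a trie of size (h, b) by decomposing each key into its h base-b digits; the class and function names are mine, not from the paper.

```python
class TrieNode:
    """Node of a trie of size (h, b): height h, branching factor b."""
    def __init__(self):
        self.children = {}  # digit in 0..b-1 -> child TrieNode
        self.keys = []      # keys stored at the leaf level

def insert(root, key, h, b):
    """Insert an integer key in [0, b**h) by decomposing it into h base-b digits."""
    digits, k = [], key
    for _ in range(h):
        digits.append(k % b)
        k //= b
    node = root
    for d in reversed(digits):  # follow the digits from most to least significant
        node = node.children.setdefault(d, TrieNode())
    node.keys.append(key)       # the key lives at the leaf of its digit path

# Example: in a trie of size (h=3, b=10), key 425 follows the path 4 -> 2 -> 5.
root = TrieNode()
insert(root, 425, h=3, b=10)
```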

9
Efficiency of cube computation comes from three aspects
  • It compresses the base table into a range trie, so that cells with identical aggregation values are calculated only once.
  • The traversal takes advantage of simultaneous aggregation; that is, an m-dimensional cell is computed from a set of (m + 1)-dimensional cells after the initial range trie is built (see the sketch after this list). At the same time, it facilitates Apriori pruning.
  • The reduced cube size requires less output I/O
    time.
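
Below is a minimal, dictionary-based sketch of the simultaneous-aggregation idea: an m-dimensional cuboid is computed from an (m + 1)-dimensional one instead of re-scanning the base table. The data and the SUM measure are illustrative; this is not the range-cubing traversal itself.

```python
from collections import defaultdict

# Cells of a 3-dimensional cuboid (A, B, C) -> SUM of the measure; values are illustrative.
cuboid_abc = {("a1", "b1", "c1"): 5, ("a1", "b1", "c2"): 3, ("a2", "b2", "c1"): 7}

def roll_up(cuboid, drop):
    """Aggregate an (m + 1)-dimensional cuboid into an m-dimensional one
    by dropping the dimension at index `drop` and summing the measure."""
    result = defaultdict(int)
    for cell, value in cuboid.items():
        result[cell[:drop] + cell[drop + 1:]] += value
    return dict(result)

cuboid_ab = roll_up(cuboid_abc, drop=2)
print(cuboid_ab)  # {('a1', 'b1'): 8, ('a2', 'b2'): 7}
```

Because SUM is distributive, each 2-dimensional cell is obtained directly from the 3-dimensional cells above it in the lattice of cuboids.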

10
Base Table
11
The lattice of cuboids derived from the base table
12
Construction of Range Trie
13
Construction of Range Trie
14
Information stored in Range Trie
  • Dimension Values of tuples determine the
    structure of a range trie, and are stored in the
    nodes along paths from the root to leaves.
  • Measure Values determine the aggregation values of the nodes (a hypothetical node layout is sketched below).
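
Purely as an illustration of this slide, here is a hypothetical Python node layout; the field names and the choice of SUM as the measure are assumptions, not the paper's actual structure.

```python
from dataclasses import dataclass, field

@dataclass
class RangeTrieNode:
    """Hypothetical node layout: dimension values are stored on the nodes along the
    path from the root; each node carries the aggregated measure (here, a SUM)
    of the tuples represented by the leaves below it."""
    dim_values: dict = field(default_factory=dict)  # dimension name -> value stored at this node
    aggregate: float = 0.0                          # aggregation value of the node
    children: list = field(default_factory=list)    # child RangeTrieNode objects
```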

15
Properties of a Range Trie
  • The maximum depth of the range trie is the number
    of dimensions n.
  • The number of leaf nodes in a range trie is the
    number of tuples with distinct dimension values,
    which is bounded by the total number of tuples.
  • Because siblings have distinct values on their
    start dimension, the fan-out of a parent node is
    bounded by the cardinality of the start dimension
    of its child nodes.
  • Each interior node has at least two child nodes, since a parent node already holds all dimension values common to its child nodes, so a single child would simply be merged into its parent (checked in the sketch after this list).
  • Each node represents the set of tuples represented by the leaf nodes below it.
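
As a small sanity check, the following sketch verifies one of these properties, the minimum fan-out of interior nodes, on the hypothetical RangeTrieNode sketched under slide 14.

```python
def interior_fanout_ok(node):
    """Check that every interior node in the subtree rooted at `node` has at
    least two children (leaf nodes have none); uses the hypothetical
    RangeTrieNode sketched under slide 14."""
    if node.children and len(node.children) < 2:
        return False
    return all(interior_fanout_ok(child) for child in node.children)
```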

16
Range Trie Construction
17
Size of Range Trie
  • Suppose we have a range trie on a D-dimensional
    dataset with T tuples.
  • The depth of the range trie is D in the worst
    case, when the dataset is very dense and the trie
    structure is like a full tree.
  • It is log_N T in the average case, where N is the average fan-out of the nodes (a worked example follows).
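
A small worked example, using the 200K-tuple setting from the experiments and an assumed average fan-out of N = 10:

```python
import math

# Illustrative only: T = 200,000 tuples, as in the experiments (slides 19-20),
# with an assumed average node fan-out of N = 10.
T, N = 200_000, 10
print(round(math.log(T, N), 1))  # 5.3, the average depth log_N(T)
```

For the 6-dimensional datasets used in the experiments, this average depth of about 5.3 is already below the worst-case depth D = 6.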

18
Range Cubing vs. H-Cubing: Evaluating the Effectiveness
19
Evaluating the impact of skewness (dimensions = 6, cardinality = 100, number of tuples = 200K)
20
Evaluating the impact of sparsity (dimensions = 6, tuples = 200K, cardinality = 10 to 10,000)
21
Evaluating performance on a real dataset (weather)
22
Future Work
  • Incorporating constraints with range cube computation
  • Dealing with holistic functions
  • Applying different compression techniques to compress the cube
  • Supporting incremental and batch updates

23
Questions?