Ripple Joins for Online Aggregation - PowerPoint PPT Presentation

About This Presentation
Title:

Ripple Joins for Online Aggregation

Description:

Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton – PowerPoint PPT presentation

Number of Views:85
Avg rating:3.0/5.0
Slides: 22
Provided by: Rayt190
Learn more at: https://crystal.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: Ripple Joins for Online Aggregation


1
Ripple Joins for Online Aggregation
  • by
  • Peter J. Haas and Joseph M. Hellerstein
  • published in June 1999
  • presented by
  • Ronda Hilton

2
Overview
  • This paper tells how to join a bunch of tables
  • and get the SUM, COUNT, or AVG
  • in GROUP BY clauses
  • showing approximate results immediately
  • and the confidence interval of the results
  • from the first few tuples retrieved
  • updating a GUI display with closer approximation
    information as the join adds more tuples.

3
Ripple joins compared to our previous topics
  • General research area algorithms
  • another approximation algorithm
  • online processing
  • not maintaining a sample set
  • aggregate queries joins, and group-by
  • requires random retrieval
  • uses probabilistic calculations to determine the
    quality of the approximate result
  • not optimizing
  • implemented as middleware on the DBMS

4
Traditional Hash Join stores the smaller relation
in memory
  • Two relations R and S
  • with a common attribute
  • on each distinct value of that attribute,
  • match up the tuples which have the same value.
  • Example
  • select R.roomnumber,
  • COUNT(S.homeroom)
  • from Rooms R
  • join Student S on
  • R.roomnumberS.homeroom
  • For each tuple r in R
  • add hash(roomnumber) to the hashtable in memory
  • if hashtable has filled up memory
  • for every tuple s in S
  • if hash(homeroom) is found in the hashtable
  • add tuple r and tuple s to the output
  • reset the hashtable
  • Finally, scan S and add the resulting join tuples
    to the output.

5
What's different about ripple join?
  • Traditional hash join blocks until the entire
    query output is finished.
  • Ripple join reports approximate results after
    each sampling step, and allows user intervention.
  • In the inner loop, an entire table is scanned.
  • Ripple join expands the sample set incrementally.

6
The most important difference
  • The tuples are processed in random order.

7
Pipelining
  • In pipelining join algorithms, as the join
    progresses, more and more information gets added
    to the result.
  • In ripple joins, each new tuple gets joined with
    all previously-seen tuples of the other
    operand(s).
  • The relative rates of the two (or more) operands
    are dynamically adjusted.

8
Worst-case scenario
  • Ripple join reduces to a nested loop join.

9
The relations do not have to be relatively equal
size.
Aspect ratio how many tuples are retrieved from
each base relation per sampling step. e.g. ß1
1, ß2 3, Ripple join adjusts the aspect
ratio according to the sizes of the base
relations.
10
Rectangular version
11
What can the end user control?
  • how many groups continue to process
  • Any one group can be stopped.
  • All other groups will continue to process
    (faster).
  • the speed of the query selection process
  • What happens to make the process faster?
  • More tuples are skipped in the aggregation, so
    the approximation will be less accurate, and the
    confidence interval will be wider.
  • The end user controls the trade-off between speed
    and accuracy.

12
GUI, 1999
13
Confidence interval
  • A running confidence interval displays how close
    this answer is to the final result.
  • This could be calculated in many ways.
  • The authors present an example calculation built
    on extending the Central Limit Theorem.

14
Central Limit Theorem
  • ˆµn is estimator for true µ
  • average of the n values in the sample
  • a random quantity
  • CLT for large n
  • (e.g. after joining 30 tuples),
  • ˆµn has a normal distribution with mean µ and
    variance s2 /n

15
Random variable Z
  • Shift and scale ˆµn to get a "standardized"
    random variable Z
  • (ˆµn µ) / (s /vn)
  • Z also has a standard normal distribution.
  • There are a lot of ways to compute
  • the zp values.

16
"Interval" column on the GUI
  • The authors use ˆsn as an estimator for true
    variance
  • en ( zp ˆsn ) / vn
  • This is displayed quantity as the final
    half-width of the confidence interval.

17
Why call this "Ripple Join"?
  1. The algorithm seems to ripple out from a corner
    of the join.
  2. Acronym "Rectangles of Increasing Perimeter
    Length"

18
Variants of ripple join
  • Block ripple join
  • Index ripple join
  • Hash ripple join

19
Performance
20
Further publications
  • Eddies Continuously Adaptive Query Processing,
    by Ron Avnur and Joseph M. Hellerstein, MOD 2000,
    Dallas
  • Confidence Bounds for Sampling-Based GROUP BY
    Estimates, by Fei Xu, Christopher Jermaine, and
    Alin Dobra, ACMTrans. Datab. Syst. 33, 3 (Aug.
    2008)
  • Wavelet synopsis for hierarchical range queries
    with workloads, by Sudipto Guha, Hyoungmin Park,
    and Kyuseok Shim, VLDB Journal (2008)
    1710791099

21
Questions?
Write a Comment
User Comments (0)
About PowerShow.com