RPJ: Producing Fast Join Results on Streams through Rate-based Optimization

1
RPJ: Producing Fast Join Results on Streams
through Rate-based Optimization
  • Yufei Tao, City University of Hong Kong
  • Man Lung Yiu, University of Hong Kong
  • Dimitris Papadias, HKUST
  • Marios Hadjieleftheriou, UC Riverside
  • Nikos Mamoulis, University of Hong Kong

2
Problem
  • R1 and R2 are two finite relations with a common
    attribute A.
  • Their data arrive in the form of continuous
    streams.
  • The goal is to return all the results of the
    join R1 ⋈ R2 (on attribute A) in a progressive
    manner.

3
Progressiveness
  • Return as many results as possible in the same
    amount of time.
  • Continue to return results even when data
    transmission is blocked.

4
Processing Flow
5
XJoin (Urhan and Franklin, Data Eng. Bulletin '00)
  • A tuple t of R1 arrives →
  • Probe the memory part of R2 →
  • Add t to the memory part of R1 →
  • Memory overflow?
  • → Flush the largest hash partition (of either R1
    or R2)
  • Transmission blocked →
  • Join the largest hash partition with the
    corresponding disk partition of the other
    relation.
  • Still blocked?
  • → Repeat with the next largest.
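
The flow on this slide can be sketched in a few lines of Python. This is a toy model, not the authors' implementation: the class name `XJoinSketch`, its methods, and the logical clock are all hypothetical, tuples are represented only by their join value, and "disk" is just another in-process dictionary. To avoid re-emitting a pair that was already produced on arrival, the sketch compares a memory tuple's arrival time with a disk tuple's flush time; that check is an assumption of this sketch, and repeated calls to `on_blocked` are not de-duplicated.

```python
from collections import defaultdict
from itertools import count


class XJoinSketch:
    """Toy model of the XJoin flow; memory and disk are both plain dicts."""

    def __init__(self, num_partitions=2, memory_capacity=3):
        self.m = num_partitions
        self.capacity = memory_capacity
        self.clock = count(1)  # logical time, advanced on arrivals and flushes
        # memory[r][p]: list of (value, arrival_time) for relation r, partition p
        # disk[r][p]:   list of (value, flush_time)
        self.memory = {1: defaultdict(list), 2: defaultdict(list)}
        self.disk = {1: defaultdict(list), 2: defaultdict(list)}

    def _memory_size(self):
        return sum(len(part) for rel in self.memory.values() for part in rel.values())

    def _largest_memory_partition(self):
        keys = [(r, p) for r in (1, 2) for p in range(self.m)]
        return max(keys, key=lambda k: len(self.memory[k[0]][k[1]]))

    def on_arrival(self, rel, value):
        """Probe the other relation's memory part, store the tuple, flush on overflow."""
        now = next(self.clock)
        other = 2 if rel == 1 else 1
        p = hash(value) % self.m
        results = [v for v, _ in self.memory[other][p] if v == value]
        self.memory[rel][p].append((value, now))
        if self._memory_size() > self.capacity:
            self._flush_largest()
        return results  # one entry (the join value) per result produced

    def _flush_largest(self):
        """Move the largest in-memory hash partition (of either relation) to disk."""
        r, p = self._largest_memory_partition()
        now = next(self.clock)
        self.disk[r][p].extend((v, now) for v, _ in self.memory[r][p])
        self.memory[r][p].clear()

    def on_blocked(self):
        """Join the largest memory partition with the other relation's disk partition."""
        r, p = self._largest_memory_partition()
        other = 2 if r == 1 else 1
        # Emit only pairs not already produced on arrival: the memory tuple
        # must have arrived after the disk tuple was flushed.
        return [v for v, arrived in self.memory[r][p]
                for d, flushed in self.disk[other][p]
                if v == d and arrived > flushed]
```

Feeding the sketch a few arrivals with `on_arrival(relation, value)` and then calling `on_blocked()` returns the join values that were missed because their partner had already been flushed.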

6
Hash Merge Join (Mokbel et al., ICDE '04)
  • Differs from XJoin in:
    • Flush policy: flush all, flush largest, flush
      smallest, or adaptive.
    • Action when transmission is blocked: joins two
      disk partitions using progressive sort-merge
      join.

7
Implications
  • The design of a stream-join algorithm includes
    deciding:
    • the flushing policy;
    • the actions when the transmission is blocked.

8
RPJ (Rate-based Progressive Join)
  • We assume that the received tuples reflect the
    subsequent arrival distribution.
  • This is usually true when:
    • tuples arrive in a random order, and
    • the join attribute is not the key of any
      relation.

9
RPJ Flushing
  • The flushing policy aims at maximizing the
    probability that an arriving tuple produces a
    join result with an in-memory tuple.
  • That is, it maximizes the number of join results
    produced by the arriving tuples before the next
    flushing.

10
RPJ Flushing (cont.)
  • An example when the memory overflows:
  • If 1 tuple needs to be flushed, we should flush a
    tuple of R2 with value 1.
  • If only 2 tuples can be retained (i.e., flush 75
    tuples), we should keep only the 2 tuples of R1
    with value 1.

11
RPJ Flushing (cont.)
  • To flush 60 tuples, first remove all tuples of R2
    with value 1 → n2(1) = 0.
  • Then flush 10 tuples of R1 with value 0 → n1(0) =
    10.
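
One way to read the flushing slides is as a greedy selection, sketched below. The slides do not give a formula, so the scoring rule is an assumption: the benefit of keeping an in-memory tuple of one relation with value v is estimated as the fraction of received tuples that belong to the other relation and have value v (relying on the stated assumption that received tuples reflect future arrivals). The function name `choose_flush` and its parameters are hypothetical.

```python
from collections import Counter


def choose_flush(in_memory, received, num_to_flush):
    """Return how many tuples to flush from each (relation, value) group.

    in_memory: Counter mapping (relation, value) -> tuples currently in memory
    received:  Counter mapping (relation, value) -> tuples received so far
    """
    total = sum(received.values()) or 1  # avoid division by zero

    def keep_benefit(group):
        rel, value = group
        other = 2 if rel == 1 else 1
        # Assumed estimate: probability that the next arriving tuple belongs
        # to the other relation and carries the same join value.
        return received[(other, value)] / total

    flushed = Counter()
    remaining = num_to_flush
    # Flush the least useful groups first, a whole group at a time if possible.
    for group in sorted(in_memory, key=keep_benefit):
        if remaining == 0:
            break
        take = min(in_memory[group], remaining)
        flushed[group] = take
        remaining -= take
    return flushed


# Illustrative counts, chosen to be consistent with the numbers quoted on the
# slides (77 tuples in memory at the first overflow); they are an inference,
# not data taken from the original figure.
memory = Counter({(1, 0): 20, (1, 1): 2, (2, 0): 5, (2, 1): 50})
plan = choose_flush(memory, received=memory, num_to_flush=60)
print(plan)           # groups to flush and how many tuples from each
print(memory - plan)  # what stays in memory afterwards
```

With these counts, flushing 60 tuples removes all 50 tuples of R2 with value 1 and then 10 tuples of R1 with value 0, which matches the order described on the slide.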

12
Transmission blocked
  • Motivation: Joining two disk partitions may
    actually produce faster results than joining a
    memory and a disk partition.

13
Transmission blocked (cont.)
  • Assume that each relation is hashed into m
    partitions.
  • When the transmission is blocked, RPJ calculates
    the expected output rate for each of the
    following 3m choices:
    • Joining the i-th memory partition of a relation
      with the corresponding disk partition of the
      other relation (2m choices).
    • Joining the i-th disk partitions of the two
      relations (m choices).
  • The expected output rate is computed by
    maintaining the necessary statistics.
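
The selection among the 3m choices can be sketched as follows. The slides do not say how the expected output rate is derived from the statistics, so the sketch takes it as a caller-supplied function; `enumerate_choices`, `best_blocked_action`, `expected_rate`, and the choice labels are all hypothetical names.

```python
def enumerate_choices(m):
    """List the 3m candidate actions for m hash partitions per relation."""
    choices = []
    for i in range(m):
        choices.append(("mem1-disk2", i))   # memory partition i of R1 with disk partition i of R2
        choices.append(("mem2-disk1", i))   # memory partition i of R2 with disk partition i of R1
        choices.append(("disk1-disk2", i))  # disk partitions i of both relations
    return choices


def best_blocked_action(m, expected_rate):
    """Pick the action with the highest expected output rate.

    expected_rate(kind, i) is an assumed callback returning the estimated
    number of results per unit time for that action.
    """
    return max(enumerate_choices(m), key=lambda c: expected_rate(*c))


# Made-up rate estimates for m = 2, purely for illustration.
rates = {("mem1-disk2", 0): 4.0, ("disk1-disk2", 1): 9.0}
choice = best_blocked_action(2, lambda kind, i: rates.get((kind, i), 0.0))
print(choice)
```

In this made-up example a disk-to-disk join has the highest estimate, which illustrates the motivation stated on the previous slide.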

14
Experiments
  • Domain of the join attribute: the integers in
    [1, 10000].
  • Data distribution: skewed.
  • Memory: large enough to hold 100k tuples.
  • Flush 10k tuples each time.
  • We test the impacts of the following factors:
    • Network reliability
    • Arrival distribution
      • Harmony
      • Reverse
    • Relative speed of the two streams
      • 1:1: both relations have 1 million tuples
      • 1:5: one relation has 1 million tuples, the
        other 5 million

15
Experiment 1: Reliable Network

16
Experiment 1: Reliable Network (cont.)

17
Experiment 2: Unreliable Network

18
Experiment 2: Unreliable Network (cont.)

19
Future Work
  • Join with a range predicate
  • Multi-dimensional data
  • Load shedding for dealing with limited memory