RPJ: Producing Fast Join Results on Streams through Rate-based Optimization


1
RPJ: Producing Fast Join Results on Streams
through Rate-based Optimization

Yufei Tao, City University of Hong Kong
Man Lung Yiu, University of Hong Kong
Dimitris Papadias, HKUST
Marios Hadjieleftheriou, UC Riverside
Nikos Mamoulis, University of Hong Kong

2
Problem

R1 and R2 are two finite relations with a common attribute A.
Their data arrive in the form of continuous streams.
The goal is to return all the results of the join of R1 and R2 on A in a progressive manner.

3
Progressiveness

Return as many results as possible in the same
amount of time.
Continue to return results even when data
transmission is blocked.

4
Processing Flow
5
XJoin (Urhan and Franklin, Data Eng. Bulletin 00)

A tuple t of R1 arrives ->
Probe the memory part of R2 ->
Add t to the memory part of R1 ->
Memory overflow?
-> Flush the largest hash partition (of either R1 or R2)
Transmission blocked ->
Join the largest hash partition with the corresponding disk partition of the other relation.
Still blocked?
-> Repeat with the next largest.
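
The arrival path above (probe, insert, flush on overflow) can be sketched roughly as follows. This is a minimal illustration, not XJoin itself: the class name, the in-memory "disk" dictionaries, and the capacity parameter are all assumptions, and the blocked-transmission and cleanup stages, which recover results involving flushed tuples, are omitted.

```python
from collections import defaultdict


class StreamJoinSketch:
    """Toy symmetric hash join with a 'flush the largest partition' policy."""

    def __init__(self, num_partitions=4, memory_capacity=6):
        self.m = num_partitions
        self.capacity = memory_capacity  # max tuples kept in memory (assumed knob)
        # memory[rel][p] and disk[rel][p] hold (key, payload) pairs; rel is 0 or 1.
        self.memory = [defaultdict(list), defaultdict(list)]
        self.disk = [defaultdict(list), defaultdict(list)]  # stand-in for real disk

    def in_memory(self):
        return sum(len(part) for rel in self.memory for part in rel.values())

    def on_arrival(self, rel, key, payload):
        """Handle one arriving tuple; return the join results found in memory."""
        other = 1 - rel
        p = hash(key) % self.m
        # Probe the memory part of the other relation.
        results = [(payload, o) if rel == 0 else (o, payload)
                   for k, o in self.memory[other][p] if k == key]
        # Add the tuple to the memory part of its own relation.
        self.memory[rel][p].append((key, payload))
        # Memory overflow? Flush the largest hash partition of either relation.
        if self.in_memory() > self.capacity:
            self._flush_largest()
        return results

    def _flush_largest(self):
        rel, p = max(((r, q) for r in (0, 1) for q in self.memory[r]),
                     key=lambda rq: len(self.memory[rq[0]][rq[1]]))
        self.disk[rel][p].extend(self.memory[rel].pop(p))
```

Results are returned as (R1 payload, R2 payload) pairs as soon as a match is found in memory, which is what makes the output progressive.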

6
Hash Merge Join (Mokbel et al., ICDE 04)

Differs from XJoin in:
Flush policy: flush all, largest, smallest, or adaptive
Action when transmission is blocked: joins two disk partitions using progressive sort-merge join
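
For orientation, a plain sort-merge join of two partitions might look like the sketch below. It is not the progressive variant used by Hash Merge Join, whose details the slides do not give. The function name and the (key, payload) tuple layout are assumptions. Written as a generator, it hands back each result as soon as it is found.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) tuples on key; yields (key, l, r)."""
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of tuples on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][0] == key:
                for _, r_payload in right[j:j_end]:
                    yield (key, left[i][1], r_payload)
                i += 1
            j = j_end
```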

7
Implications

The design of a stream-join algorithm includes deciding:
the flushing policy;
the actions when the transmission is blocked.

8
RPJ (Rate-based Progressive Join)

We assume that the received tuples reflect the subsequent arrival distribution.
Usually true when:
tuples arrive in a random order, and
the join attribute is not the key of any relation.

9
RPJ Flushing

The flushing policy aims at maximizing the probability that an arriving tuple produces a join result with an in-memory tuple.
That is, it aims to maximize the number of join results produced by the arriving tuples before the next flushing.
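
One way to write this objective down, as a sketch under assumed notation: let n_i(v) be the number of in-memory tuples of R_i with join value v, and let p_i(v) be the estimated probability that the next arriving tuple belongs to R_i and has value v. The symbol p_i is introduced here for illustration and is not taken from the slides.

```latex
% Expected number of join results produced by the next arriving tuple
E[\text{results}] \;=\; \sum_{v} \bigl( p_1(v)\, n_2(v) \;+\; p_2(v)\, n_1(v) \bigr)

% Flushing one in-memory tuple of R_2 with value v lowers this by p_1(v);
% flushing one tuple of R_1 with value v lowers it by p_2(v).
```

Under this model, the loss from flushing a tuple depends only on how likely a matching tuple is to arrive on the other stream, so the tuples with the least likely partners are the cheapest to flush.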

10
RPJ Flushing (cont.)

An example when the memory overflows:
If 1 tuple needs to be flushed, we should flush a tuple of R2 with value 1.
If only 2 tuples can be retained (i.e., flush 75 tuples), we should keep only the 2 tuples of R1 with value 1.

11
RPJ Flushing (cont.)

To flush 60 tuples, first remove all tuples of R2 with value 1 -> n2(1) = 0.
Then flush 10 tuples of R1 with value 0 -> n1(0) = 10.
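
The greedy order in this example can be sketched as below. This is a simplified illustration, not the authors' implementation: the function name and the `partner_prob` input are hypothetical, and the counts and probabilities are made-up numbers chosen only so that the outcome matches the slide (all 50 tuples of R2 with value 1 go first, then 10 of the 20 tuples of R1 with value 0).

```python
def choose_flush(counts, partner_prob, num_to_flush):
    """Pick tuples to flush, least useful first.

    counts[(rel, value)]       -> in-memory tuples of relation rel with that value
    partner_prob[(rel, value)] -> estimated probability that the next arrival
                                  joins with such a tuple (hypothetical input)
    Returns a dict (rel, value) -> number of tuples to flush.
    """
    plan = {}
    remaining = num_to_flush
    # Visit groups in increasing order of usefulness.
    for group in sorted(counts, key=lambda g: partner_prob[g]):
        if remaining == 0:
            break
        take = min(counts[group], remaining)
        if take:
            plan[group] = take
            remaining -= take
    return plan


# Illustrative numbers only (not from the slides' figure).
counts = {(1, 0): 20, (1, 1): 5, (2, 0): 10, (2, 1): 50}
partner_prob = {(2, 1): 0.05, (1, 0): 0.10, (2, 0): 0.35, (1, 1): 0.50}
plan = choose_flush(counts, partner_prob, 60)
# plan == {(2, 1): 50, (1, 0): 10}, leaving n2(1) = 0 and n1(0) = 10
```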

12
Transmission blocked

Motivation: joining two disk partitions may actually produce results faster than joining a memory and a disk partition.

13
Transmission blocked (cont.)

Assume that each relation is hashed into m partitions.
When the transmission is blocked, RPJ calculates the expected output rate for each of the following 3m choices:
Joining the i-th memory partition of a relation with the corresponding disk partition of the other relation (2m choices)
Joining the i-th disk partitions of the two relations (m choices)
The expected output rate is computed by maintaining the necessary statistics.
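
The selection step can be sketched as follows. The slides do not give the formula for the expected output rate, so it is passed in here as a hypothetical callback `estimate_rate`. The function names and the tuple encoding of each choice are likewise assumptions.

```python
def blocked_candidates(m):
    """Enumerate the 3m candidate joins for m hash partitions per relation."""
    candidates = []
    for i in range(m):
        # Memory partition i of R1 with disk partition i of R2.
        candidates.append(("mem-disk", 1, i))
        # Memory partition i of R2 with disk partition i of R1.
        candidates.append(("mem-disk", 2, i))
        # Disk partition i of R1 with disk partition i of R2.
        candidates.append(("disk-disk", None, i))
    return candidates


def best_blocked_action(m, estimate_rate):
    """Return the candidate with the highest estimated output rate.

    estimate_rate(candidate) is a hypothetical callback built from whatever
    statistics the system maintains.
    """
    return max(blocked_candidates(m), key=estimate_rate)
```

With m = 2 and a lookup table of made-up rates, `best_blocked_action(2, rates.get)` returns whichever of the six entries has the largest rate, which may well be a disk-disk join.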

14
Experiments

The domain of the join attribute: the integers in [1, 10000].
Data distribution: skewed.
Memory: large enough to hold 100k tuples.
Flush 10k tuples each time.
We test the impact of the following factors:
Network reliability
Arrival distribution: Harmony, Reverse
Relative speed of the two streams:
1:1 (both relations have 1 million tuples)
1:5 (one relation has 1 million tuples, the other 5 million)
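
A workload of this shape could be generated roughly as below. The slides say only that the data are skewed, so the Zipf-like weighting, the `skew` exponent, and the function name are all assumptions. The sizes in the usage lines are scaled down from the slides' 1 million and 5 million tuples.

```python
import random


def skewed_stream(num_tuples, domain_size=10_000, skew=1.0, seed=0):
    """Draw join-attribute values from [1, domain_size] with Zipf-like skew (assumed)."""
    rng = random.Random(seed)
    values = range(1, domain_size + 1)
    # Value v gets weight 1 / v**skew, so small values are far more frequent.
    weights = [1.0 / (v ** skew) for v in values]
    return rng.choices(values, weights=weights, k=num_tuples)


# 1:5 setting, scaled down: one stream is five times longer than the other.
r1 = skewed_stream(1_000, seed=1)
r2 = skewed_stream(5_000, seed=2)
```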

15
Experiment 1: Reliable Network
16
Experiment 1: Reliable Network (cont.)
17
Experiment 2: Unreliable Network
18
Experiment 2: Unreliable Network (cont.)
19
Future Work