A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHURI, RAJEEV MOTWANI, and VIVEK NARASAYYA

1
A Paper on RANDOM SAMPLING OVER JOINS
by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA

PRESENTED BY,
JEEVAN KUMAR GOGINENI
SARANYA GOTTIPATI

2
Outline

Introduction
Semantics of Sample
Algorithms of Sampling
Join Sampling Problem
New Strategies for Join Sampling
Extensions and Negative Results
Experimental Evaluations
Conclusions

3
Terms Used

SAMPLE(R, f) is an SQL operation that returns a random sample of relation R.
f is the fraction of the tuples of R to be sampled.
R is the relation produced when a query Q is
evaluated.

4
Introduction

Evaluating a query in full and then sampling its output is inefficient.
OLAP and data mining applications often need only a sample of the result of
the query posed.
Sampling must therefore be supported on the result of an
arbitrary SQL query.

5
Continued

The paper supports random sampling as a primitive
relational operation in relational databases:
the SAMPLE(R, f) operation.
Idea: partially evaluate Q to generate a sample of R directly.
The sample operation may appear anywhere in the query
tree T.
Focus: commuting the sample operation with
a single join operation, so that it can be pushed down the tree.

6
Semantics of Sample

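1. Sampling with Replacement (WR)
Draw f*n tuples from R uniformly and independently; a tuple may appear more than once.
2. Sampling without Replacement (WoR)
Draw f*n distinct tuples from R uniformly at random.
3. Independent Coin Flips (CF)
Include each tuple with probability f, independent of other
tuples.
f: fraction of tuples in R to be sampled
n: number of tuples in R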
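
The three semantics can be sketched in a few lines of Python. This is only an illustration over an in-memory list; the function names and the use of round(f * n) for the sample size are assumptions made here, not part of the slides.

```python
import random

def sample_wr(rel, f, rng=random):
    """WR: f*n independent uniform draws; duplicates are possible."""
    r = round(f * len(rel))
    return [rng.choice(rel) for _ in range(r)]

def sample_wor(rel, f, rng=random):
    """WoR: f*n distinct positions of rel, chosen uniformly."""
    r = round(f * len(rel))
    return rng.sample(rel, r)

def sample_cf(rel, f, rng=random):
    """CF: keep each tuple with probability f, independently of the others."""
    return [t for t in rel if rng.random() < f]
```

Note that WR and WoR return a sample of fixed size, while the size of a CF sample is itself random (f*n only in expectation).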

7
Algorithms for Unweighted Sequential WR Sampling

Black-Box U1: Given relation R with n tuples, generate an unweighted WR sample of size r.
Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.

Both black boxes solve the same problem; they differ in:
Whether the size of the relation being sampled must be known in advance.
How they scan the relation.
Whether they need significant auxiliary memory.
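
The slides do not show the algorithms themselves. The following is a minimal sketch of two standard one-pass techniques that fit the two black boxes: binomial splitting when n is known in advance (no reservoir, output streamed), and a reservoir of r slots when n is unknown (output available only after the scan). The function names and the simple O(x) binomial helper are assumptions made for illustration.

```python
import random

def binomial(x, p, rng):
    """Number of successes in x independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(x))

def black_box_u1(stream, n, r, rng=random):
    """n known in advance; tuples are emitted while scanning; no reservoir."""
    x = r  # sample slots still to be filled
    for i, t in enumerate(stream):
        if x == 0:
            break
        # Each remaining slot picks tuple i with probability 1/(n - i).
        copies = binomial(x, 1.0 / (n - i), rng)
        for _ in range(copies):
            yield t
        x -= copies

def black_box_u2(stream, r, rng=random):
    """n not needed; keeps a reservoir of r tuples; result ready after the scan."""
    reservoir = [None] * r
    seen = 0
    for t in stream:
        seen += 1
        for j in range(r):
            # Each slot independently switches to the new tuple with probability 1/seen.
            if rng.random() < 1.0 / seen:
                reservoir[j] = t
    return reservoir
```

The trade-off is the one named in the criteria above: the first variant needs the relation size but no extra memory, the second needs memory proportional to r but no advance knowledge of n.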

8
Algorithms for Weighted Sequential WR Sampling

Black-Box WR1: Given relation R with n tuples, generate a weighted WR sample of size r.
Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.

Both black boxes solve the same problem; they differ in:
Whether the size (total weight) of the relation being sampled must be known in advance.
How they scan the relation.
Whether they need significant auxiliary memory.
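
As a hedged sketch, the unweighted one-pass techniques extend to weights by replacing "1 out of the remaining tuples" with "this tuple's weight out of the remaining weight". The names weighted_wr1 and weighted_wr2, the weight callback, and the restriction to non-negative (ideally integer) weights are assumptions made for this illustration.

```python
import random

def binomial(x, p, rng):
    """Number of successes in x independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(x))

def weighted_wr1(stream, weight, total_weight, r, rng=random):
    """Total weight known in advance; output is streamed; no reservoir."""
    x = r          # sample slots still to be filled
    done = 0       # weight of the tuples already scanned
    for t in stream:
        if x == 0:
            break
        w = weight(t)
        if w <= 0:
            continue  # zero-weight tuples can never be sampled
        p = min(1.0, w / (total_weight - done))
        copies = binomial(x, p, rng)
        for _ in range(copies):
            yield t
        x -= copies
        done += w

def weighted_wr2(stream, weight, r, rng=random):
    """Total weight not needed; reservoir of r tuples; result ready after the scan."""
    reservoir = [None] * r
    running = 0    # total weight seen so far
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue
        running += w
        for j in range(r):
            if rng.random() < w / running:
                reservoir[j] = t
    return reservoir
```

With all weights equal to 1 both functions reduce to unweighted WR sampling.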

9
The Difficulty of Join Sampling
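
A small experiment illustrates why the sample operation cannot simply be pushed below the join: joining independent samples of R1 and R2 is not a random sample of R1 join R2, and on skewed data it is usually empty. The relation layout below (one rare and one frequent join value, mirrored between the two relations) and all function names are assumptions chosen for illustration.

```python
import random

def join(r1, r2):
    """Equi-join on the first attribute of each tuple."""
    by_key = {}
    for a, c in r2:
        by_key.setdefault(a, []).append(c)
    return [(a, b, c) for a, b in r1 for c in by_key.get(a, [])]

def coin_flip_sample(rel, f, rng):
    """Keep each tuple independently with probability f."""
    return [t for t in rel if rng.random() < f]

def build_skewed(k):
    """a1 is rare in R1 and frequent in R2; a2 is the other way round."""
    r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
    r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
    return r1, r2

def empty_fraction(k, f, trials, rng):
    """Fraction of trials in which the join of the two samples is empty."""
    r1, r2 = build_skewed(k)
    empty = 0
    for _ in range(trials):
        s1 = coin_flip_sample(r1, f, rng)
        s2 = coin_flip_sample(r2, f, rng)
        if not join(s1, s2):
            empty += 1
    return empty / trials
```

Here the full join has 2k tuples, half for each join value, yet a sample of R1 rarely contains the single a1 tuple and a sample of R2 rarely contains the single a2 tuple, so the join of the samples almost always produces nothing.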
10
Classification of the Problem

Case A: No information is available for either
R1 or R2.
Case B: No information is available for R1,
but indexes and/or statistics are available
for R2.
Case C: Indexes/statistics are available for
both R1 and R2.

11
Previous Sampling Strategies

Strategy Naive-Sample
Strategy Olken-Sample
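
The slide only names the two earlier strategies. As a rough sketch: Naive-Sample computes the whole join and samples from it, while Olken-Sample draws a tuple from R1, looks up its matches in R2, and accepts it with probability proportional to the number of matches (rejection sampling). The in-memory dictionary standing in for an index on R2, the tuple layout (join attribute first), and the function names are assumptions made here.

```python
import random
from collections import defaultdict

def build_index(r2):
    """Stand-in for an index on R2's join attribute (first field)."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    return index

def naive_sample(r1, r2, r, rng=random):
    """Compute the full join, then draw r result tuples with replacement."""
    index = build_index(r2)
    full = [(t1, t2) for t1 in r1 for t2 in index.get(t1[0], [])]
    return [rng.choice(full) for _ in range(r)] if full else []

def olken_sample(r1, r2, r, rng=random):
    """Rejection sampling: accept t1 with probability m2(v) / M."""
    index = build_index(r2)
    if not any(t1[0] in index for t1 in r1):
        return []  # empty join: nothing could ever be accepted
    max_freq = max(len(v) for v in index.values())  # M, the largest frequency in R2
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)
        matches = index.get(t1[0], [])
        if matches and rng.random() < len(matches) / max_freq:
            out.append((t1, rng.choice(matches)))
    return out
```

The naive version pays for the full join; the rejection version avoids it but needs random access to both relations and may reject many candidates when frequencies in R2 are skewed.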

12
New Strategies for Join Sampling

The three new sampling strategies are:
Strategy Stream-Sample
Strategy Group-Sample
Strategy Frequency-Partition-Sample

13
Table showing the information about R1 and R2
14
Strategy Stream Sample

Performs only a sequential sample from R1
Does not generate excess tuples
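
A minimal sketch of the idea, under assumptions: R1 is sampled with replacement with each tuple weighted by the number of matching tuples in R2, and each sampled tuple is then paired with one random match found through an index on R2. Every sampled tuple yields exactly one output tuple, so nothing is rejected. The dictionary standing in for the index, the function name, and the use of random.choices in place of a one-pass weighted sampler are illustrative choices, not the slides' implementation.

```python
import random
from collections import defaultdict

def stream_sample(r1, r2, r, rng=random):
    # Stand-in for an index on R2's join attribute (first field of each tuple).
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)

    # Weight of t1 = number of R2 tuples it joins with, m2(t1.A).
    weights = [len(index.get(t1[0], [])) for t1 in r1]
    if sum(weights) == 0:
        return []  # empty join

    # Weighted WR sample of R1; a sequential weighted sampler would do this in one pass.
    s1 = rng.choices(r1, weights=weights, k=r)

    # Pair each sampled t1 with one uniformly chosen matching R2 tuple.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]
```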

15
Strategy Group Sample
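
The slide gives only the title. One way to sketch the idea, under assumptions: take a weighted WR sample of R1 (weights from frequency statistics on R2), join that sample with R2 in a single scan of R2, and keep one random match per sampled R1 tuple, so no index on R2 is required. The function name, the Counter standing in for the statistics, and the per-tuple one-slot reservoir are illustrative choices.

```python
import random
from collections import Counter, defaultdict

def group_sample(r1, r2, r, rng=random):
    # Frequency statistics for R2's join attribute (first field), m2(v).
    freq = Counter(t2[0] for t2 in r2)
    weights = [freq.get(t1[0], 0) for t1 in r1]
    if sum(weights) == 0:
        return []  # empty join

    # Weighted WR sample of R1.
    s1 = rng.choices(r1, weights=weights, k=r)

    # Positions in s1 grouped by join value, so R2 can be scanned once.
    slots = defaultdict(list)
    for pos, t1 in enumerate(s1):
        slots[t1[0]].append(pos)

    chosen = [None] * r
    seen = Counter()
    for t2 in r2:  # single sequential scan of R2, no index
        v = t2[0]
        if v not in slots:
            continue
        seen[v] += 1
        for pos in slots[v]:
            # One-slot reservoir per sampled t1: uniform over its matches.
            if rng.random() < 1.0 / seen[v]:
                chosen[pos] = t2
    return list(zip(s1, chosen))
```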
16
Strategy Frequency-Partition-Sample

Assumes that we have full statistics for R2.
Uses Strategy Group-Sample for high-frequency
values.
Uses Strategy Naive-Sample for low-frequency values.
Join attribute values need not be of high
frequency in both operand relations.
Determines how the sample is distributed between
the high- and low-frequency subdomains.
Advantage: It needs summary statistics in the
form of histograms for R2.

17
Continued
18
Extensions and Negative Results

The inherent difficulty of join sampling:
Even if we have large samples from R1 and R2
and detailed statistics, it is in general not possible
to generate a non-empty random sample of R1
join R2 from those samples alone.
Dealing with join trees:
Pushing the sample operation down to the
operands.

19
Experimental Evaluations

Naive-Sample: Add the U1 operator as the root of the tree.
Olken-Sample: Create a uniform random sample T from
the key values of R1.
Stream-Sample: Insert the WR1 operator as a child of
the join operator.
Frequency-Partition-Sample: Implement a modified
version of the WR1 operator for producing a random
sample from R1.

20
Experimental results
21
Continued
22
Continued
23
Conclusions

Studied the issues involved in implementing sampling
as a primitive operation.
Presented a series of sampling strategies.
Provided new schemes for sequential random
sampling under uniform and weighted sampling
distributions.
Even more efficient strategies can be developed.