A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA

Learn more at: https://crystal.uta.edu

1
A Paper on RANDOM SAMPLING OVER JOINS
by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA
  • PRESENTED BY,
  • JEEVAN KUMAR GOGINENI
  • SARANYA GOTTIPATI

2
Outline
  • Introduction
  • Semantics of Sample
  • Algorithms of Sampling
  • Join Sampling Problem
  • New Strategies for Join Sampling
  • Extensions and Negative Results
  • Experimental Evaluations
  • Conclusions

3
Terms Used
  • SAMPLE(R, f) is an SQL operation that returns a random
    sample of relation R.
  • f is the fraction of the tuples of R to include in the
    sample.
  • R is the relation produced when a query Q is evaluated.

4
Introduction
  • Evaluating a query in full and then sampling its output
    is inefficient.
  • OLAP and data mining applications often need only a
    sample of the result of the query posed.
  • Sampling must therefore be supported on the result of an
    arbitrary SQL query.

5
Continued
  • Goal: support random sampling as a primitive
    relational operation in relational databases.
  • This primitive is the SAMPLE(R, f) operation.
  • Partially evaluate Q to generate a sample of R, rather
    than computing R in full.
  • The sample operation may appear anywhere in the query
    tree T.
  • Push the sample operation down the tree; the key step is
    commuting it with a single join operation.

6
Semantics of Sample
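  • 1. Sampling with Replacement (WR): draw f·n tuples from R
    uniformly and independently; duplicates are possible.
  • 2. Sampling without Replacement (WoR): draw f·n distinct
    tuples from R.
  • 3. Independent Coin Flips (CF): include each tuple with
    probability f, independently of other tuples.
  • f: fraction of tuples in R to be sampled
  • n: number of tuples in R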
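The three semantics above can be contrasted in a few lines of Python. This is only an illustrative sketch over an in-memory list: the function names are made up for this example, and the sample size is assumed to be f·n rounded to the nearest integer.

```python
import random

def sample_wr(R, f, rng):
    """WR: f*n independent uniform draws; the same tuple may repeat."""
    r = round(f * len(R))
    return [rng.choice(R) for _ in range(r)]

def sample_wor(R, f, rng):
    """WoR: f*n distinct tuples (distinct positions in R)."""
    r = round(f * len(R))
    return rng.sample(R, r)

def sample_cf(R, f, rng):
    """CF: keep each tuple independently with probability f."""
    return [t for t in R if rng.random() < f]

rng = random.Random(0)
R = list(range(100))
wr = sample_wr(R, 0.1, rng)    # exactly 10 tuples, duplicates possible
wor = sample_wor(R, 0.1, rng)  # exactly 10 distinct tuples
cf = sample_cf(R, 0.1, rng)    # about 10 tuples; the size itself is random
```

Note that WR and WoR fix the sample size, while under CF only the expected size is f·n.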

7
Algorithms for Unweighted Sequential WR Sampling
Black-Box U1: Given relation R with n tuples, generate an unweighted WR sample of size r.
Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.
Points of comparison between the two:
  • Whether the size of the relation being sampled must be known.
  • How the relation is scanned.
  • Whether significant auxiliary memory is needed.
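To make these trade-offs concrete, here are two one-pass WR samplers in Python. They are sketches in the spirit of the black boxes on this slide, not a transcription of the paper's pseudocode. The function names are hypothetical, and the binomial draw is simulated with repeated coin flips to stay within the standard library.

```python
import random

def wr_sample_known_size(stream, n, r, rng):
    """One sequential pass; needs n up front, no buffer besides the output."""
    out = []
    x = r  # sample slots still to be filled
    for i, t in enumerate(stream):
        if x == 0:
            break
        p = 1.0 / (n - i)  # chance that a remaining slot picks this tuple
        copies = sum(rng.random() < p for _ in range(x))  # Binomial(x, p)
        out.extend([t] * copies)
        x -= copies
    return out

def wr_sample_unknown_size(stream, r, rng):
    """One sequential pass; n is not needed, but r tuples are buffered."""
    slots = [None] * r
    for seen, t in enumerate(stream, start=1):
        for j in range(r):
            # Each slot independently switches to the new tuple w.p. 1/seen.
            if rng.random() < 1.0 / seen:
                slots[j] = t
    return slots

s1 = wr_sample_known_size(iter(range(50)), 50, 8, random.Random(1))
s2 = wr_sample_unknown_size(iter(range(50)), 8, random.Random(1))
```

The first variant emits tuples as they stream by but must know n. The second works on a stream of unknown length at the cost of holding r tuples in memory until the end.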

8
Algorithms for Weighted Sequential WR Sampling
Black-Box WR1: Given relation R with n tuples, generate a weighted WR sample of size r.
Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
Points of comparison between the two:
  • Whether the size of the relation being sampled must be known.
  • How the relation is scanned.
  • Whether significant auxiliary memory is needed.
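A weighted one-pass WR sampler can be sketched the same way. The version below assumes the total weight is known in advance and that weights are non-negative integers, which keeps the arithmetic exact. The function name and the (tuple, weight) input format are assumptions made for this example.

```python
import random

def weighted_wr_sample(stream, total_weight, r, rng):
    """One pass over (tuple, weight) pairs; each of the r draws picks
    tuple t with probability weight(t) / total_weight."""
    out = []
    x = r                     # sample slots still to be filled
    remaining = total_weight  # weight of the tuples not yet scanned
    for t, w in stream:
        if x == 0:
            break
        p = w / remaining if remaining > 0 else 0.0
        copies = sum(rng.random() < p for _ in range(x))  # Binomial(x, p)
        out.extend([t] * copies)
        x -= copies
        remaining -= w
    return out

items = [("a", 1), ("b", 0), ("c", 3)]
sample = weighted_wr_sample(items, 4, 20, random.Random(2))
```

A tuple with weight 0 is never emitted, and the last tuple with positive weight absorbs every slot that is still open.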

9
The Difficulty of Join Sampling
10
Classification of the Problem
  • Case A: No information is available for either
    R1 or R2.
  • Case B: No information is available for R1,
    but indexes and/or statistics are available
    for R2.
  • Case C: Indexes/statistics are available for
    both R1 and R2.

11
Previous Sampling Strategies
  • Strategy Naive-Sample
  • Strategy Olken-Sample
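For reference, the two earlier strategies can be sketched as follows. The relations are assumed to be lists of tuples joined on their first field, and an in-memory dictionary stands in for an index on R2. All names here are illustrative. Naive-Sample materializes the whole join and then samples it. Olken-Sample is shown in its usual rejection-sampling form: pick a random R1 tuple and accept it with probability proportional to its number of matches in R2.

```python
import random
from collections import defaultdict

def build_index(R2):
    """Group R2 tuples by join-attribute value (first field)."""
    index = defaultdict(list)
    for t2 in R2:
        index[t2[0]].append(t2)
    return index

def naive_sample(R1, R2, r, rng):
    """Compute the full join, then draw r tuples with replacement."""
    index = build_index(R2)
    joined = [(t1, t2) for t1 in R1 for t2 in index.get(t1[0], [])]
    return [rng.choice(joined) for _ in range(r)] if joined else []

def olken_sample(R1, R2, r, rng):
    """Rejection sampling; needs random access to R1 and an index on R2."""
    index = build_index(R2)
    M = max((len(v) for v in index.values()), default=0)  # max frequency in R2
    if M == 0 or not any(t1[0] in index for t1 in R1):
        return []  # empty join: nothing to sample
    out = []
    while len(out) < r:
        t1 = rng.choice(R1)
        matches = index.get(t1[0], [])
        if rng.random() < len(matches) / M:  # accept w.p. m2(v) / M
            out.append((t1, rng.choice(matches)))
    return out

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "s")]
print(naive_sample(R1, R2, 4, random.Random(0)))
print(olken_sample(R1, R2, 4, random.Random(0)))
```

The rejection step is what makes Olken-Sample uniform over the join, and also what makes it wasteful when most R1 tuples have few matches relative to M.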

12
New Strategies for Join Sampling
  • Three new sampling strategies are proposed:
  • Strategy Stream-Sample.
  • Strategy Group-Sample.
  • Strategy Frequency-Partition-Sample.

13
Table showing the information about R1 and R2
14
Strategy Stream Sample
  • Performs only a sequential sample from R1
  • Does not generate excess tuples
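A minimal sketch of the idea, assuming the relations are lists of tuples joined on their first field and that frequency information for R2 is available (here simply derived from an in-memory dictionary). Each R1 tuple is weighted by the number of R2 tuples it joins with, so every sampled R1 tuple yields exactly one join tuple and nothing is rejected. For brevity the weighted step uses `random.choices` rather than a one-pass sequential sampler, and the function name is hypothetical.

```python
import random
from collections import defaultdict

def stream_sample(R1, R2, r, rng):
    """WR sample of size r from the join of R1 and R2 on the first field."""
    index = defaultdict(list)
    for t2 in R2:
        index[t2[0]].append(t2)
    # Weight of an R1 tuple = number of R2 tuples with the same join value.
    weights = [len(index.get(t1[0], [])) for t1 in R1]
    if sum(weights) == 0:
        return []  # empty join
    s1 = rng.choices(R1, weights=weights, k=r)  # weighted WR sample of R1
    # Pair each sampled R1 tuple with one uniformly chosen matching R2 tuple.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "s")]
print(stream_sample(R1, R2, 6, random.Random(0)))
```

A join tuple (t1, t2) is produced with probability proportional to m2(t1.A) · 1/m2(t1.A), which is the same for every join tuple, where m2(v) denotes the frequency of value v in R2.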

15
Strategy Group Sample
16
Strategy Frequency-Partition-Sample
  • Assumes that full statistics are available for R2.
  • Uses Strategy Group-Sample for high-frequency
    values.
  • Uses Strategy Naive-Sample for low-frequency values.
  • Join attribute values need not be of high
    frequency in both operand relations.
  • The sample must be distributed between the
    high- and low-frequency subdomains.
  • Advantage: it needs only summary statistics, in the
    form of histograms, for R2.
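One way to split a WR sample of size r between the two subdomains is to flip r coins, each landing on the high-frequency side with probability equal to that side's share of the join. The sketch below assumes the two partial join sizes, called n_hi and n_lo here, have already been determined. The names are illustrative and not taken from the slides.

```python
import random

def split_sample_size(r, n_hi, n_lo, rng):
    """Return (r_hi, r_lo): how many of the r samples to draw from the
    high-frequency and low-frequency parts of the join."""
    total = n_hi + n_lo
    if total == 0:
        return 0, 0  # empty join
    p_hi = n_hi / total  # share of join tuples in the high-frequency part
    r_hi = sum(rng.random() < p_hi for _ in range(r))
    return r_hi, r - r_hi

r_hi, r_lo = split_sample_size(100, n_hi=900, n_lo=100, rng=random.Random(0))
```

Because each WR draw is independent, assigning every draw to a part with these probabilities and then sampling uniformly within the chosen part gives a uniform WR sample of the whole join.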

17
Continued
18
Extensions and Negative Results
  • The inherent difficulty of join sampling:
  • Even with large samples from R1 and R2
    and detailed statistics, it is not in general possible
    to generate any non-empty random sample of R1
    join R2.
  • Dealing with join trees:
  • Pushing the sample operation down to the
    operands.
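The intuition behind the first point can be seen on a small skewed instance. This is an illustration only, not the paper's proof. In the made-up relations below, each join value is rare in one relation and frequent in the other, so the join has 2k tuples, yet joining independent coin-flip samples of R1 and R2 usually yields nothing.

```python
import random

def join(R1, R2):
    """Nested-loop equi-join on the first field."""
    return [(t1, t2) for t1 in R1 for t2 in R2 if t1[0] == t2[0]]

def skewed_relations(k):
    """a1 is rare in R1 and frequent in R2; a2 is the other way round."""
    R1 = [("a1", 0)] + [("a2", i) for i in range(1, k + 1)]
    R2 = [("a2", 0)] + [("a1", i) for i in range(1, k + 1)]
    return R1, R2

def coin_flip_sample(R, f, rng):
    return [t for t in R if rng.random() < f]

def fraction_empty(k, f, trials, rng):
    """Fraction of trials in which the join of the two samples is empty."""
    R1, R2 = skewed_relations(k)
    empty = 0
    for _ in range(trials):
        s1 = coin_flip_sample(R1, f, rng)
        s2 = coin_flip_sample(R2, f, rng)
        if not join(s1, s2):
            empty += 1
    return empty / trials

R1, R2 = skewed_relations(50)
print(len(join(R1, R2)))                            # size of the full join
print(fraction_empty(50, 0.1, 200, random.Random(0)))
```

The join of the samples is non-empty only when one of the two rare tuples happens to be sampled, which is unlikely for small f however large k is.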

19
Experimental Evaluations
  • Naïve Sample: add a U1 operator as the root of the tree.
  • Olken Sample: create a uniform random sample T from
    the key values of R1.
  • Stream Sample: insert a WR1 operator as a child of
    the join operator.
  • Frequency-Partition-Sample: implement a modified
    version of the WR1 operator for producing a random
    sample from R1.

20
Experimental results
21
Continued
22
Continued
23
Conclusions
  • Studied the issues involved in implementing sampling
    as a primitive operation.
  • Presented a series of sampling strategies.
  • Provided new schemes for sequential random
    sampling for uniform and weighted sampling
    distributions.
  • Even more efficient strategies can be developed.

24
  • QUESTIONS??
  • Thank you