Title: A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA
1 A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHURI, RAJEEV MOTWANI, VIVEK NARASAYYA
- PRESENTED BY,
- JEEVAN KUMAR GOGINENI
- SARANYA GOTTIPATI
2 Outline
- Introduction
- Semantics of Sample
- Algorithms of Sampling
- Join Sampling Problem
- New Strategies for Join Sampling
- Extensions and Negative Results
- Experimental Evaluations
- Conclusions
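3 Terms Used
- SAMPLE(R, f) is an SQL operation that produces a random sample of relation R.
- f is the sampling fraction: the fraction of the tuples of R to include in the sample.
- R is the relation produced when a query Q is evaluated.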
4 Introduction
- Evaluating a query in full and then sampling its output is inefficient.
- OLAP and data mining often use only a sample of the result of the query posed.
- Sampling must be supported on the result of an arbitrary SQL query.
5 Continued
- Support random sampling as a primitive relational operation in relational databases.
- SAMPLE(R, f) operation.
- Partially evaluate Q to generate a sample of R.
- The sample operation can appear anywhere in the query tree T.
- Push the sample operation down the tree; the key step is commuting it with a single join operation.
6 Semantics of Sample
- 1. Sampling with Replacement (WR)
- 2. Sampling without Replacement (WoR)
- 3. Independent Coin Flips (CF)
- Under CF, each tuple is sampled with probability f, independently of the other tuples.
- f: fraction of tuples of R to sample
- n: number of tuples in R
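The three semantics can be contrasted with a minimal Python sketch. It assumes the relation fits in memory as a list; the function names `sample_wr`, `sample_wor`, and `sample_cf` are illustrative and do not come from the paper.

```python
import random

def sample_wr(rows, f, rng):
    """WR: draw round(f * n) tuples uniformly and independently; duplicates allowed."""
    n = len(rows)
    k = round(f * n)
    return [rows[rng.randrange(n)] for _ in range(k)]

def sample_wor(rows, f, rng):
    """WoR: draw round(f * n) distinct tuples; every subset of that size is equally likely."""
    k = round(f * len(rows))
    return rng.sample(rows, k)

def sample_cf(rows, f, rng):
    """CF: keep each tuple independently with probability f; the sample size is random."""
    return [t for t in rows if rng.random() < f]

rng = random.Random(0)
R = list(range(100))
print(len(sample_wr(R, 0.1, rng)), len(sample_wor(R, 0.1, rng)), len(sample_cf(R, 0.1, rng)))
```

WR and WoR return exactly round(f * n) tuples, while the size of a CF sample only equals f * n in expectation.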
7 Algorithms for Unweighted Sequential WR Sampling
- Black-Box U1: Given relation R with n tuples, generate an UNWEIGHTED WR sample of size r.
- Black-Box U2: Given relation R with n tuples, generate an UNWEIGHTED WR sample of size r.
The two black boxes are compared on:
- Whether the size of the relation being sampled must be known.
- How the relation is scanned.
- Whether significant auxiliary memory is needed.
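A sketch of two standard one-pass techniques that fit these criteria follows. It is an illustration under stated assumptions, not necessarily the paper's exact procedures: `wr_sample_known_n` assumes n is known up front and streams its output with no buffer, while `wr_sample_unknown_n` needs no n but keeps r slots in memory. Both function names are hypothetical.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` independent coin flips with bias p."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def wr_sample_known_n(stream, n, r, rng):
    """One sequential pass, n known in advance: emit copies of each tuple as it
    streams by, so the output keeps the input order and needs no buffer."""
    remaining = r
    for i, t in enumerate(stream):
        if remaining == 0:
            break
        # Each still-unassigned sample slot lands on tuple i with probability 1/(n - i).
        copies = binomial(remaining, 1.0 / (n - i), rng)
        for _ in range(copies):
            yield t
        remaining -= copies

def wr_sample_unknown_n(stream, r, rng):
    """One sequential pass, n unknown: r independent size-1 reservoirs,
    at the cost of O(r) auxiliary memory."""
    slots = [None] * r
    for seen, t in enumerate(stream, start=1):
        for j in range(r):
            # The tuple seen at position `seen` replaces slot j with probability 1/seen.
            if rng.random() < 1.0 / seen:
                slots[j] = t
    return slots

print(list(wr_sample_known_n(iter(range(50)), 50, 8, random.Random(1))))
print(wr_sample_unknown_n(iter(range(50)), 8, random.Random(2)))
```

If the n passed to `wr_sample_known_n` is larger than the true stream length, fewer than r tuples are emitted, which is why that variant depends on an accurate size.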
8 Algorithms for Weighted Sequential WR Sampling
- Black-Box WR1: Given relation R with n tuples, generate a WEIGHTED WR sample of size r.
- Black-Box WR2: Given relation R with n tuples, generate a WEIGHTED WR sample of size r.
The two black boxes are compared on:
- Whether the size of the relation being sampled must be known.
- How the relation is scanned.
- Whether significant auxiliary memory is needed.
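The weighted case can be sketched in the same way. The code below is an assumption-laden illustration: the stream yields hypothetical `(tuple, weight)` pairs with non-negative integer weights, `weighted_wr_known_total` assumes the total weight is known in advance, and `weighted_wr_unknown_total` does not but keeps r slots in memory.

```python
import random

def weighted_wr_known_total(stream, total_weight, r, rng):
    """One pass, total weight known: tuple t gets each remaining sample slot
    with probability w(t) / (weight not yet scanned)."""
    remaining, seen_weight = r, 0
    for t, w in stream:
        if remaining == 0:
            break
        if w == 0:
            continue  # zero-weight tuples can never be sampled
        p = w / (total_weight - seen_weight)
        copies = sum(1 for _ in range(remaining) if rng.random() < p)
        for _ in range(copies):
            yield t
        remaining -= copies
        seen_weight += w

def weighted_wr_unknown_total(stream, r, rng):
    """One pass, total weight unknown: r independent weighted size-1 reservoirs."""
    slots = [None] * r
    running = 0
    for t, w in stream:
        if w == 0:
            continue
        running += w
        for j in range(r):
            # Replace slot j with probability w / (total weight seen so far).
            if rng.random() < w / running:
                slots[j] = t
    return slots

data = [("a", 0), ("b", 3), ("c", 0), ("d", 1)]
print(list(weighted_wr_known_total(iter(data), 4, 6, random.Random(3))))
print(weighted_wr_unknown_total(iter(data), 6, random.Random(4)))
```

In both variants a tuple of weight w ends up in any given sample slot with probability w divided by the total weight.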
9 The Difficulty of Join Sampling
10 Classification of the Problem
- Case A: No information is available for either R1 or R2.
- Case B: No information is available for R1, but indexes and/or statistics are available for R2.
- Case C: Indexes/statistics are available for both R1 and R2.
11 Previous Sampling Strategies
- Strategy Naive-Sample
- Strategy Olken-Sample
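The two earlier strategies can be sketched as follows, as they are commonly described: Naive-Sample computes the whole join and then samples it, while Olken-Sample draws a random tuple from R1, picks a random matching tuple from R2 through an index, and accepts the pair with probability m2(v)/M, where m2(v) is the frequency of the join value in R2 and M is the maximum such frequency. The in-memory lists, the convention that the join attribute is field 0 of each tuple, and the function names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_index(r2):
    """Hypothetical in-memory index on R2: join value -> matching tuples."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    return index

def naive_sample(r1, r2, r, rng):
    """Compute the full join, then draw r tuples uniformly with replacement."""
    index = build_index(r2)
    joined = [(t1, t2) for t1 in r1 for t2 in index.get(t1[0], [])]
    if not joined:
        return []
    return [joined[rng.randrange(len(joined))] for _ in range(r)]

def olken_sample(r1, r2, r, rng):
    """Rejection sampling: needs random access to R1, an index on R2, and M."""
    index = build_index(r2)
    max_freq = max((len(ts) for ts in index.values()), default=0)
    if max_freq == 0 or not any(t1[0] in index for t1 in r1):
        return []  # empty join: nothing to sample
    out = []
    while len(out) < r:
        t1 = r1[rng.randrange(len(r1))]
        matches = index.get(t1[0], [])
        # Accept with probability m2(v) / M so every join tuple is equally likely.
        if rng.random() < len(matches) / max_freq:
            out.append((t1, matches[rng.randrange(len(matches))]))
    return out

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "r")]
print(naive_sample(R1, R2, 3, random.Random(0)))
print(olken_sample(R1, R2, 3, random.Random(0)))
```

The rejection step is what makes Olken-Sample wasteful when frequencies are skewed: most draws from R1 are discarded when m2(v) is small relative to M.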
12 New Strategies for Join Sampling
- Three new sampling strategies are introduced:
- Strategy Stream-Sample
- Strategy Group-Sample
- Strategy Frequency-Partition-Sample
13 Table showing the information about R1 and R2
14 Strategy Stream-Sample
- Performs only a sequential sample from R1
- Does not generate excess tuples
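A minimal sketch of the idea, as it is usually described: draw a weighted WR sample from R1 in which each tuple's weight is the frequency m2(v) of its join value in R2, then pair every sampled tuple with one random matching tuple from R2. Every draw yields exactly one output tuple, so nothing is rejected. The sketch assumes in-memory lists with the join attribute in field 0, and it uses `random.choices` as a stand-in for a one-pass weighted sequential sampler; `stream_sample` is an illustrative name.

```python
import random
from collections import defaultdict

def stream_sample(r1, r2, r, rng):
    """Weighted WR sample of R1 (weight = m2 of the join value), then one
    uniformly chosen matching R2 tuple per sampled R1 tuple."""
    index = defaultdict(list)          # assumed index on R2's join attribute
    for t2 in r2:
        index[t2[0]].append(t2)
    weights = [len(index.get(t1[0], [])) for t1 in r1]
    if sum(weights) == 0:
        return []                      # empty join
    # Stand-in for a sequential weighted WR sampler over R1.
    s1 = rng.choices(r1, weights=weights, k=r)
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "r")]
print(stream_sample(R1, R2, 4, random.Random(0)))
```

Weighting by m2(v) and then choosing uniformly among the m2(v) matches gives every join tuple the same probability, which is why no acceptance test is needed.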
15 Strategy Group-Sample
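A rough sketch of how a strategy of this kind can work when only frequency statistics, and no index, are available for R2: take a weighted WR sample S1 of R1 with weights m2(v), join S1 with R2 in a single scan of R2, and keep one uniformly chosen match per sampled tuple. This follows the usual description of Group-Sample but is an illustration only; the `freq2` statistics dictionary, the field-0 join attribute, and the function name are assumptions.

```python
import random
from collections import Counter, defaultdict

def group_sample(r1, r2, freq2, r, rng):
    """freq2 maps a join value to its frequency in R2 (assumed accurate statistics)."""
    weights = [freq2.get(t1[0], 0) for t1 in r1]
    if sum(weights) == 0:
        return []
    # Step 1: weighted WR sample of R1, stand-in for a sequential sampler.
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: one scan of R2; each sampled slot keeps a size-1 reservoir
    # over the R2 tuples that join with it (its "group").
    slots_by_value = defaultdict(list)
    for i, t1 in enumerate(s1):
        slots_by_value[t1[0]].append(i)
    chosen = [None] * r
    seen = Counter()
    for t2 in r2:
        v = t2[0]
        if v not in slots_by_value:
            continue
        seen[v] += 1
        for i in slots_by_value[v]:
            if rng.random() < 1.0 / seen[v]:
                chosen[i] = t2
    return list(zip(s1, chosen))

R1 = [(1, "a"), (2, "b"), (3, "c")]
R2 = [(1, "p"), (1, "q"), (2, "r")]
print(group_sample(R1, R2, Counter(t[0] for t in R2), 4, random.Random(0)))
```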
16 Strategy Frequency-Partition-Sample
- Assumes that full statistics are available for R2.
- Uses Strategy Group-Sample for high-frequency values.
- Uses Strategy Naive-Sample for low-frequency values.
- Join attribute values need not be of high frequency in both operand relations.
- Determines the distribution of the sample between the high-frequency and low-frequency subdomains.
- Advantage: it needs only summary statistics, in the form of histograms, for R2.
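The partitioning idea can be illustrated with a heavily simplified in-memory sketch. It is not the paper's procedure: the high-frequency side is handled by plain weighted sampling rather than a full Group-Sample, the `threshold` parameter and the function name are hypothetical, and the join attribute is assumed to be field 0 of each tuple. What it does show is how each of the r draws is assigned to the high- or low-frequency subdomain in proportion to that subdomain's join size.

```python
import random
from collections import defaultdict

def frequency_partition_sample(r1, r2, r, threshold, rng):
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    freq2 = {v: len(ts) for v, ts in index.items()}       # statistics for R2
    hi_values = {v for v, m in freq2.items() if m >= threshold}

    r1_hi = [t for t in r1 if t[0] in hi_values]
    r1_lo = [t for t in r1 if t[0] not in hi_values]

    # High-frequency side: weights m2(v); their sum is the hi join size.
    hi_weights = [freq2[t[0]] for t in r1_hi]
    n_hi = sum(hi_weights)
    # Low-frequency side: the join is small, so compute it outright.
    j_lo = [(t1, t2) for t1 in r1_lo for t2 in index.get(t1[0], [])]
    n_lo = len(j_lo)
    if n_hi + n_lo == 0:
        return []

    # Flip r coins to decide how many draws come from the hi subdomain.
    r_hi = sum(1 for _ in range(r) if rng.random() < n_hi / (n_hi + n_lo))
    out = []
    if r_hi:
        for t1 in rng.choices(r1_hi, weights=hi_weights, k=r_hi):
            out.append((t1, rng.choice(index[t1[0]])))
    for _ in range(r - r_hi):
        out.append(j_lo[rng.randrange(n_lo)])
    rng.shuffle(out)  # avoid listing all hi draws before all lo draws
    return out

R1 = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
R2 = [(1, "p"), (1, "q"), (1, "s"), (2, "r")]
print(frequency_partition_sample(R1, R2, 5, 2, random.Random(5)))
```

In this toy input, value 1 is the only high-frequency value, so the expensive full join is computed only for the single low-frequency match on value 2.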
17 Continued
18 Extensions and Negative Results
- The inherent difficulty of join sampling
- Even with large samples from R1 and R2 and detailed statistics, it is not possible to generate any non-empty random sample of R1 join R2.
- Dealing with join trees
- Pushing down the sample operation to the operands.
19 Experimental Evaluations
- Naive-Sample: Add a U1 operator as the root of the tree.
- Olken-Sample: Create a uniform random sample T from the key values of R1.
- Stream-Sample: Insert a WR1 operator as a child of the join operator.
- Frequency-Partition-Sample: Implement a modified version of the WR1 operator for producing a random sample from R1.
20 Experimental Results
21 Continued
22 Continued
23 Conclusions
- Studied the issues involved in implementing sampling as a primitive operation.
- Presented a series of sampling strategies.
- Provided new schemes for sequential random sampling for uniform and weighted sampling distributions.
- Even more efficient strategies can be developed.