Endbiased Samples for Join Cardinality Estimation - PowerPoint PPT Presentation

About This Presentation
Title:

Endbiased Samples for Join Cardinality Estimation

Description:

Proposing a new type of summaries based on a special type of sampling ... the amount of correlation between tables. Uncorrelated tables generated ... – PowerPoint PPT presentation

Number of Views:49
Avg rating:3.0/5.0
Slides: 22
Provided by: lilian58
Category:

less

Transcript and Presenter's Notes

Title: Endbiased Samples for Join Cardinality Estimation


1
End-biased Samples for Join Cardinality Estimation
  • Cristian Estan, Jeffrey F. Naughton
  • Computer Sciences Department
  • University of Wisconsin-Madison

2
Problem description
  • Estimating join size
  • Not restricted to key-foreign key joins
  • Based on summaries of the two tables computed
    separately
  • Two main contributions of this paper
  • Proposing a new type of summaries based on a
    special type of sampling
  • Extensive experimental comparison of many types
    of summaries

3
We can get more accurate estimates!
  • AGMS99 showed that on certain data sets
  • All summaries give inaccurate estimates
  • Estimates based on random sampling are within
    constant factor of bound
  • We show that
  • On other data sets, our estimates significantly
    more accurate than those with random sampling
  • No known summaries give estimates more accurate
    than all others for every data set

4
Overview
  • End-biased samples
  • Theoretical comparison against other
    sampling-based methods
  • Experimental comparison against sketches and
    histograms

5
Building the end-biased samples
  • If frequency of every value known for both tables
    ? exact join size
  • We keep a sample of this data
  • Sampling probability proportional to frequency
    DLT01
  • Sampling decisions correlated by using a shared
    hash function F90,DG00,EKMV04

Frequency of values of join attribute in table A
Frequency of values of join attribute in table B
(c,10)
(d,1)
(g,1)
(g,1)
(m,2)
(m,1)
(s,5)
(r,7)
(t,1)
(z,1)
6
Estimating join size
  • Let av be the frequency of value v in table A, bv
    in B and pv the probability that v is selected
    into both samples
  • Sum contribution of values in both samples (av
    bv/pv) to estimate join size
  • If av Ta and bvTb , pv 1
  • If av Ta and bvltTb , pv bv/Tb
  • If av ltTa and bvTb , pv av/Ta
  • If av ltTa and bvltTb , pv min(av/Ta,bv/Tb)

7
Why correlate the samples?
  • Example tables with 1000 values appearing once,
    50 values common to both tables
  • We sample with probability 1/10
  • Sample size 100 for each table
  • Comparison
  • pv
  • Common values sampled
  • Join size estimate

8
Comparison of sampling methods
9
Overview
  • End-biased samples
  • Theoretical comparison against other
    sampling-based methods
  • Experimental comparison against sketches and
    histograms

10
Experimental methodology
  • Randomly generated tables with 1,000,000 tuples
  • Explored multiple configurations
  • Varied the peakedness of the distribution
  • Varied memory budget from 204 to 659,456 words
  • Varied the amount of correlation between tables
  • Uncorrelated tables generated independently
  • Positively correlated frequent values likely
    same in both tables
  • Negatively correlated unlikely frequent values
    same in the two tables
  • 1,000 runs for each configuration

11
Summaries compared
  • End-biased samples
  • End-biased equi-depth histograms PC84
  • Sketches AGMS99,DGGR02,GGR04
  • Concise samples GM98
  • Counting samples GM98

12
Comparison with histograms
13
Comparison with sketches
14
Memory comparison
15
Qualitative comparison
16
Conclusions
  • End-biased samples and sketches are the best
    summaries for the join size estimation problem
    addressed in this paper
  • End-biased samples are compelling if
  • Selections on the join attribute are required
  • Summaries must be very concise
  • The frequencies of join attributes in the two
    tables are strongly correlated

17
Questions?
Thank you!
Scripts and results for experiments available at
http//www.cs.wisc.edu/estan/ebs.tar.gz
18
Estimating the join size
19
Related work sampling methods
  • GM98 concise samples, counting samples
  • DLT01 smart sampling
  • F90,EKMV04 using a hash function to select
    values used as summary of data

20
Related work join size estimation
  • Histograms
  • Multidimensional histograms
  • GG02,GK04 Wavelets
  • AGMS99,DGGR02,GGR04 Sketches

21
Variance of join size estimate
  • No slide, point to the paper.
Write a Comment
User Comments (0)
About PowerShow.com