Histograms for Selectivity Estimation - PowerPoint PPT Presentation

About This Presentation
Title:

Histograms for Selectivity Estimation

Description:

We need to store some statistics of the database to estimate the selectivity ... Histograms are useful in estimating the selectivity of a query ... – PowerPoint PPT presentation

Number of Views:203
Avg rating:3.0/5.0
Slides: 32
Provided by: CSI115
Category:

less

Transcript and Presenter's Notes

Title: Histograms for Selectivity Estimation


1
Histograms for Selectivity Estimation
  • Speaker Ho Wai Shing

2
Contents
  • Introduction What is a histogram? How to use a
    histogram?
  • A taxonomy of single-dimensional histograms
  • Some experimental results
  • Some approaches for multi-dimensional histograms
  • Conclusions
  • Future Work

3
Introduction
  • Many modules of a DB require selectivity
    estimation (estimating the query result size)
  • e.g.,
  • query optimizer -- determine the nesting in
    indexed nested loop join
  • user interface -- return a rough answer to the
    users

4
Introduction
  • We need to store some statistics of the database
    to estimate the selectivity
  • Histogram is one of the most common statistics to
    be stored in practice.
  • Quite accurate, needs reasonably small space.

5
What is a Histogram?
  • Histograms approximate the frequency distribution
    of an attribute (or a set of attributes)
  • group attribute values into "buckets"
  • approximate the actual frequencies by the
    statistical information stored in each bucket.

6
Histogram Example
  • Consider the following distribution
  • This is an "equi-width" histogram
  • ?(1) ? 3

7
Histogram the problem
  • a pair of dual problem 1
  • Given a data distribution, a limit B on the
    length of H, and an error metric E(), find the
    histogram H that minimizes E(H).
  • Given the data distribution, a limit ? on the
    error, and an error metric E(), find the
    histogram H of smallest length for which E(H) is
    at most ?.

8
Two Goals for Histograms
  • for selections (exact or range queries)
  • for joins
  • focus on histograms for selections in this talk

9
Taxonomy of Histogram
  • based on the paper by Poosala et al. 2 in
    SIGMOD'96
  • on single-dimensional histograms
  • proposed a generalized histogram-generating
    algorithm
  • different decisions in each step results in
    different histograms

10
Generalized Histogram Generating Algorithm
  • consider the data distribution as a two column
    table T(value, frequency)
  • create a third attribute a3 (sort parameter)
    based on the first two attributes, sort the table
    according to a3.
  • specify a subclass of histogram
  • create a 4th attribute a4 (source parameter)
  • partition T into B buckets s.t. it satisfies some
    constraints on a4.

11
Example Equi-Width Histograms
  • a3 value
  • all histograms are possible
  • a4 value
  • constraint every bucket should contain the same
    number of data values

12
Example End-Biased Equi-Depth Histograms
  • a3 value
  • all but one buckets must be singletons
  • a4 frequency
  • constraint all buckets should have the same
    total frequency counts

13
Taxonomy
  • Dimensions
  • partition classes -- serial, end-biased
  • a3 -- value (V), frequency (F), area (A)
  • a4 -- spread (S), frequency (F), cum. freq (C),
    area (A)
  • constraints -- equi-sum, v-optimal, max-diff,
    compressed, spline-based

14
Constraints
  • Equi-sum each bucket should have the same sum of
    a4
  • V-Optimal divide the buckets so that the
    variance of the overall frequency approximation
    is minimized
  • Spline-based the cumulative freq. satisfies a
    piece-wise linear approximation.

15
Constraints (cont.)
  • Max-diff bucket boundaries are at top-(B-1)
    adjacent a4 differences.
  • Compressed (comp.) top-n entries with the
    highest a3 values are stored exactly, others are
    stored using equi-sum.

16
Taxonomy
17
Equi-Width Histograms
  • discussed in Kooi's thesis (1980) 3
  • denoted by "Equi-Sum(V, S)" in the taxonomy
  • mergeable buckets must have contiguous values
  • merge criteria is about the spread
  • based on equi-sum

18
Equi-Depth Histograms
  • proposed by Piatetsky-Shapiro and Connell in
    SIGMOD'84 4
  • denoted by "Equi-Sum(V, F)" in the taxonomy
  • mergeable buckets must have contiguous values
  • merge criteria is about the frequencies
  • based on equi-sum

19
V-Optimal(F, F) Histograms
  • proposed by Ioannidis and Christodoulakis in 1993
    5
  • mergeable buckets must have contiguous
    frequencies
  • merge criteria is to minimize sum-squared error
    on frequencies within a bucket

20
V-Optimal(V, F) Histograms
  • proposed Poosala et al. in 1996 2
  • mergeable buckets must have contiguous values
  • merge criteria is to minimize sum-squared error
    on frequencies

21
Max-Diff(V, F) Histograms
  • proposed Poosala et al. in 1996 2
  • mergeable buckets must have contiguous values
  • merge criteria is to minimize sum-squared error
    on frequencies

22
Compressed(V, F) Histograms
  • proposed Poosala et al. in 1996 2
  • mergeable buckets must have contiguous values
  • merge criteria is equi-depth except the more
    frequent n values.

23
Summary
Data Distribution
Max-Diff(V, F)
V-optimal(F, F)
equi-width
Compressed(V, F)
V-optimal(V, F)
equi-depth
24
Estimation Example
  • ?(4) 2 (actual value)
  • equi-width ?(4) ? 4
  • equi-depth ?(4) ? 2.8
  • V-optimal(F,F) ?(4) ? 9(0)4(1)1(0) 4
  • V-optimal(V,F) ?(4) ? 2.4(1)9(0)2.7(0) 2.4
  • Max-Diff(V,F) ?(4) ? 2.4
  • Compressed(V,F) ?(4) ? 2.2

25
Experimental Results
  • 100000 tuples
  • 200 attribute values
  • 2000 samples for construction

26
Experimental Results
  • cusp_max value distribution
  • random value freq. relation
  • frequencies fit to Zipf distribution

27
Other Experiment Parameters
  • Skew of frequency
  • Skew of data value distribution
  • Sample size in construction
  • Accuracy vs. Storage
  • Data Distributions (freq., values, correlations)
  • Queries

28
Conclusions
  • Histograms are useful in estimating the
    selectivity of a query
  • Different techniques to use histogram for
    approximating the data exist
  • v-optimal or MaxDiff histograms can have good
    accuracy for 1-D case

29
Future Work
  • The methods presented can't solve the n-D
    histogram problem completely
  • Try to apply SF-Tree to store and retrieve the
    buckets in multi-dimensional histogram
    efficiently.

30
References
  • 1 H.V. Jagadish, Nick Koudas, S. Muthukrishnan,
    Viswanath Poosala, Ken Sevcik, Torsten Suel,
    Optimal Histograms with Quality Guarantees,
    VLDB98
  • 2 Viswanath Poosala, Yannis Ioannidis, Peter
    Haas, Eugene Shekita, Improved Histograms for
    Selectivity Estimation of Range Predicates,
    SIGMOD96
  • 3 R. P. Kooi, The Optimization of Queries in
    Relational Databases, PhD Thesis, Case Western
    Reserver University, 1980

31
References
  • 4 M. Muralikrishna and D. DeWitt, Equi-Depth
    Histograms for Estimating Selectivity Factors for
    Multi-Dimensional Queries, SIGMOD88
Write a Comment
User Comments (0)
About PowerShow.com