Histograms for Selectivity Estimation presentation

About This Presentation

Transcript and Presenter's Notes

Title: Histograms for Selectivity Estimation

1
Histograms for Selectivity Estimation

Speaker Ho Wai Shing

2
Contents

Introduction What is a histogram? How to use a
histogram?
A taxonomy of single-dimensional histograms
Some experimental results
Some approaches for multi-dimensional histograms
Conclusions
Future Work

3
Introduction

Many modules of a DB require selectivity
estimation (estimating the query result size)
e.g.,
query optimizer -- determine the nesting in
indexed nested loop join
user interface -- return a rough answer to the
users

4
Introduction

We need to store some statistics of the database
to estimate the selectivity
Histogram is one of the most common statistics to
be stored in practice.
Quite accurate, needs reasonably small space.

5
What is a Histogram?

Histograms approximate the frequency distribution
of an attribute (or a set of attributes)
group attribute values into "buckets"
approximate the actual frequencies by the
statistical information stored in each bucket.

6
Histogram Example

Consider the following distribution
This is an "equi-width" histogram
?(1) ? 3

7
Histogram the problem

a pair of dual problem 1
Given a data distribution, a limit B on the
length of H, and an error metric E(), find the
histogram H that minimizes E(H).
Given the data distribution, a limit ? on the
error, and an error metric E(), find the
histogram H of smallest length for which E(H) is
at most ?.

8
Two Goals for Histograms

for selections (exact or range queries)
for joins
focus on histograms for selections in this talk

9
Taxonomy of Histogram

based on the paper by Poosala et al. 2 in
SIGMOD'96
on single-dimensional histograms
proposed a generalized histogram-generating
algorithm
different decisions in each step results in
different histograms

10
Generalized Histogram Generating Algorithm

consider the data distribution as a two column
table T(value, frequency)
create a third attribute a3 (sort parameter)
based on the first two attributes, sort the table
according to a3.
specify a subclass of histogram
create a 4th attribute a4 (source parameter)
partition T into B buckets s.t. it satisfies some
constraints on a4.

11
Example Equi-Width Histograms

a3 value
all histograms are possible
a4 value
constraint every bucket should contain the same
number of data values

12
Example End-Biased Equi-Depth Histograms

a3 value
all but one buckets must be singletons
a4 frequency
constraint all buckets should have the same
total frequency counts

13
Taxonomy

Dimensions
partition classes -- serial, end-biased
a3 -- value (V), frequency (F), area (A)
a4 -- spread (S), frequency (F), cum. freq (C),
area (A)
constraints -- equi-sum, v-optimal, max-diff,
compressed, spline-based

14
Constraints

Equi-sum each bucket should have the same sum of
a4
V-Optimal divide the buckets so that the
variance of the overall frequency approximation
is minimized
Spline-based the cumulative freq. satisfies a
piece-wise linear approximation.

15
Constraints (cont.)

Max-diff bucket boundaries are at top-(B-1)
adjacent a4 differences.
Compressed (comp.) top-n entries with the
highest a3 values are stored exactly, others are
stored using equi-sum.

16
Taxonomy
17
Equi-Width Histograms

discussed in Kooi's thesis (1980) 3
denoted by "Equi-Sum(V, S)" in the taxonomy
mergeable buckets must have contiguous values
merge criteria is about the spread
based on equi-sum

18
Equi-Depth Histograms

proposed by Piatetsky-Shapiro and Connell in
SIGMOD'84 4
denoted by "Equi-Sum(V, F)" in the taxonomy
mergeable buckets must have contiguous values
merge criteria is about the frequencies
based on equi-sum

19
V-Optimal(F, F) Histograms

proposed by Ioannidis and Christodoulakis in 1993
5
mergeable buckets must have contiguous
frequencies
merge criteria is to minimize sum-squared error
on frequencies within a bucket

20
V-Optimal(V, F) Histograms

proposed Poosala et al. in 1996 2
mergeable buckets must have contiguous values
merge criteria is to minimize sum-squared error
on frequencies

21
Max-Diff(V, F) Histograms

proposed Poosala et al. in 1996 2
mergeable buckets must have contiguous values
merge criteria is to minimize sum-squared error
on frequencies

22
Compressed(V, F) Histograms

proposed Poosala et al. in 1996 2
mergeable buckets must have contiguous values
merge criteria is equi-depth except the more
frequent n values.

23
Summary
Data Distribution
Max-Diff(V, F)
V-optimal(F, F)
equi-width
Compressed(V, F)
V-optimal(V, F)
equi-depth
24
Estimation Example

?(4) 2 (actual value)
equi-width ?(4) ? 4
equi-depth ?(4) ? 2.8
V-optimal(F,F) ?(4) ? 9(0)4(1)1(0) 4
V-optimal(V,F) ?(4) ? 2.4(1)9(0)2.7(0) 2.4
Max-Diff(V,F) ?(4) ? 2.4
Compressed(V,F) ?(4) ? 2.2

25
Experimental Results

100000 tuples
200 attribute values
2000 samples for construction

26
Experimental Results

cusp_max value distribution
random value freq. relation
frequencies fit to Zipf distribution

27
Other Experiment Parameters

Skew of frequency
Skew of data value distribution
Sample size in construction
Accuracy vs. Storage
Data Distributions (freq., values, correlations)
Queries

28
Conclusions

Histograms are useful in estimating the
selectivity of a query
Different techniques to use histogram for
approximating the data exist
v-optimal or MaxDiff histograms can have good
accuracy for 1-D case

29
Future Work

The methods presented can't solve the n-D
histogram problem completely
Try to apply SF-Tree to store and retrieve the
buckets in multi-dimensional histogram
efficiently.

30
References

1 H.V. Jagadish, Nick Koudas, S. Muthukrishnan,
Viswanath Poosala, Ken Sevcik, Torsten Suel,
Optimal Histograms with Quality Guarantees,
VLDB98
2 Viswanath Poosala, Yannis Ioannidis, Peter
Haas, Eugene Shekita, Improved Histograms for
Selectivity Estimation of Range Predicates,
SIGMOD96
3 R. P. Kooi, The Optimization of Queries in
Relational Databases, PhD Thesis, Case Western
Reserver University, 1980

31
References

4 M. Muralikrishna and D. DeWitt, Equi-Depth
Histograms for Estimating Selectivity Factors for
Multi-Dimensional Queries, SIGMOD88

Write a Comment

User Comments (0)

About PowerShow.com

Histograms for Selectivity Estimation PowerPoint PPT Presentation