CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF - PowerPoint PPT Presentation

Provided by: cryst
Learn more at: https://crystal.uta.edu

1
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES
Swarup Acharya, Phillip Gibbons, Viswanath Poosala
(Information Sciences Research Center, Bell Labs, New Jersey)
Divya Rao
2
Outline
  • Introduction
  • Background
  • Aqua System
  • Problem Formulation
  • Solutions
  • Query Rewriting strategies
  • Experiment
  • Conclusion

3
Introduction
  • Group-by queries: among the most important
    classes of queries in decision support systems.
  • Congressional samples: a hybrid union of uniform
    and biased samples
  • Goal: techniques for obtaining fast, highly
    accurate answers to group-by queries

4
Background
  • Uniform random sampling is not effective for
    group-by queries.
  • Ex: A group-by query on the US Census database to
    determine the per-capita income of every state.
  • Group sizes differ hugely: California, for
    example, is 70 times more populated than Wyoming.
  • Accuracy depends heavily on the number of sample
    tuples that belong to a group, so groups with few
    tuples get much less accurate answers than the
    larger ones.
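
To make the last point concrete, here is a minimal sketch of how a uniform sample splits across two groups. The group sizes and sample size are hypothetical (only the 70:1 ratio comes from the slide), and the helper name is illustrative. It also uses the standard fact that the standard error of a sample mean shrinks roughly as 1/sqrt(n).

```python
import math


def expected_uniform_counts(group_sizes, sample_size):
    """Expected number of sample tuples per group under uniform sampling."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}


# Hypothetical relative sizes; only the 70:1 ratio is taken from the slide.
sizes = {"California": 70, "Wyoming": 1}
counts = expected_uniform_counts(sizes, sample_size=710)

# Standard error scales as 1/sqrt(n), so the small group's error is
# larger by about sqrt(n_large / n_small), all else being equal.
error_ratio = math.sqrt(counts["California"] / counts["Wyoming"])
print(counts, error_ratio)
```

With these numbers the large group expects 700 sample tuples and the small group only 10.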

5
Background
  • Uniform random sampling is appropriate only when
    the utility of the data to the user mirrors the
    data distribution
  • It falls short for multi-table queries, and when
    different data have equal representation but
    their utility to the user is skewed
  • Ex: Data warehouses where the usefulness of data
    degrades with time
  • Here the sample has to collect more tuples from
    the recent data, which cannot be achieved through
    a uniform random sample over the entire
    warehouse.

6
Biased Sampling
  • Use precomputed biased samples to address the
    shortcomings of uniform sampling
  • Advantages of using precomputed biased samples:
      • Queries can be answered without accessing the
        original data at query run time
      • Storing the sample tuples in disk blocks
        avoids the overhead of random scanning
  • Disadvantage: Precomputed samples must be
    committed to before the query is seen, so they
    are not suitable for user-controlled progressive
    refinement.

7
Aqua System
  • Aqua is an efficient decision support system
    providing approximate answers to queries

8
Aqua System (Contd.)
  • Aqua is a middleware tool that can sit atop any
    DBMS managing a data warehouse
  • Aqua maintains statistical summaries of the data,
    called synopses, and uses them to answer queries
  • Aqua provides probabilistic error/confidence
    bounds on the answer

9
Aqua System (Contd.)
10
Aqua System (Contd.)
11
Problem Formulation
  • Main aim is to provide accurate answers to
    group-by queries in an approximate query
    answering system
  • Let $c_i$ and $c_i'$ be the exact and approximate
    aggregate values for group $g_i$. The error is
    the percentage relative error $e$ in the estimate
    of $c_i$:
  • $e = \frac{|c_i - c_i'|}{c_i} \times 100$
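
The error metric can be sketched in a few lines. This assumes the error is taken as a magnitude (absolute difference); the function name is illustrative, not taken from the presentation.

```python
def percent_relative_error(exact, approx):
    """Percentage relative error of an approximate aggregate value."""
    if exact == 0:
        raise ValueError("relative error is undefined for an exact value of 0")
    return abs(exact - approx) / abs(exact) * 100


# Hypothetical values: exact aggregate 200, approximate answer 190.
err = percent_relative_error(200, 190)
print(err)
```

Here an estimate of 190 for an exact value of 200 is a 5 percent relative error.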

12
Solutions
  • Theorem: Divide the sample space X equally among
    the groups and take a uniform random sample
    within each group.
  • Map this theorem to various classes of group-by
    queries with arbitrary mixes of groupings.
  • Ex: the US Congress and its two chambers, the
    House and the Senate

13
House
  • The House has representatives from each state in
    proportion to the state's population
  • Applying the theorem to the House (queries with
    no group-bys, so a single group) gives a uniform
    random sample in which each group is represented
    in proportion to its size
  • For a given aggregate operation, the quality of
    approximate answers increases with the query
    selectivity
  • Answers to queries with the same aggregate and
    equal selectivities will typically have similar
    quality guarantees.

14
Senate
  • The Senate has an equal number of representatives
    from each state
  • Applying the theorem to the Senate, the sample
    space X is divided equally among the groups
  • Each group in the sample has at least as many
    sample points as any other group in the entire
    sample
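
A minimal sketch of a Senate-style sample, assuming the rows fit in memory and each row carries its group. The function and parameter names are hypothetical, and capping a group at its own size when it has fewer tuples than its share is a choice made here, not something the slide specifies.

```python
import random
from collections import defaultdict


def senate_sample(rows, group_of, sample_space, seed=0):
    """Split sample_space equally among groups; sample uniformly within each."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)

    per_group = sample_space // len(groups)  # equal share for every group
    sample = {}
    for g, members in groups.items():
        k = min(per_group, len(members))  # a group cannot give more than it has
        sample[g] = rng.sample(members, k)
    return sample


# Hypothetical data: one large group and one small group.
rows = [("CA", i) for i in range(700)] + [("WY", i) for i in range(10)]
s = senate_sample(rows, group_of=lambda r: r[0], sample_space=20)
print({g: len(tuples) for g, tuples in s.items()})
```

Both groups end up with 10 sample tuples, although one group is 70 times larger than the other.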

15
Problems with House and Senate
  • Using samples from the House would result in very
    few sample points for smaller groups
  • The Senate allocates fewer tuples to the larger
    groups than the House does.
  • Hence another technique, called Congress:
    collect both the House and the Senate samples

16
Basic Congress
  • Apply the theorem to aggregate queries that
    either group by a given set of attributes or have
    no group-bys at all.
  • Collecting both the House and the Senate samples
    doubles the sample space
  • Reduce this factor of 2 so that the combined
    sample fits in the original sample space X
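
One way to read this slide, sketched below under stated assumptions: give each group the larger of its House and Senate allocations, then scale all allocations down so they sum to X again. The max-then-scale step is an interpretation of the slide, which does not spell out the formula; the function names and group sizes are hypothetical.

```python
def house(sizes, X):
    """Allocation proportional to group size (uniform sample of the table)."""
    total = sum(sizes.values())
    return {g: X * n / total for g, n in sizes.items()}


def senate(sizes, X):
    """Equal allocation to every group."""
    return {g: X / len(sizes) for g in sizes}


def basic_congress(sizes, X):
    """Assumed reading: per-group max of House and Senate, rescaled to X."""
    h, s = house(sizes, X), senate(sizes, X)
    best = {g: max(h[g], s[g]) for g in sizes}
    scale = X / sum(best.values())  # shrink so the total is X again
    return {g: v * scale for g, v in best.items()}


# Hypothetical group sizes.
sizes = {"A": 700, "B": 200, "C": 100}
alloc = basic_congress(sizes, X=300)
print(alloc)
```

In this example the small group C receives more than its House share, and the large group A receives more than its Senate share, while the total stays at X.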

17
Congress
  • For the sample space X, the final sample size
    allocated to each group is given by,
  • Where the expected sample space allocated to g is

18
Query rewriting
  • Scaling up the aggregate expressions
  • Deriving error bounds on the estimate
  • Generating unbiased answers using tuples in the
    biased sample
  • Scale factor is the inverse of sampling rate
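
As a rough illustration of scaling up, the sketch below estimates a per-group SUM from a biased sample by weighting each sample tuple with its scale factor (the inverse of its group's sampling rate). The data layout, names, and rates are assumptions made for the example.

```python
def estimate_sum(sample, sampling_rate):
    """Estimate SUM per group from (group, value) sample tuples.

    sampling_rate maps each group to the fraction of its tuples sampled.
    """
    estimates = {}
    for group, value in sample:
        scale_factor = 1.0 / sampling_rate[group]  # inverse of sampling rate
        estimates[group] = estimates.get(group, 0.0) + value * scale_factor
    return estimates


# Hypothetical sample: CA sampled at 25%, WY at 50%.
sample = [("CA", 10), ("CA", 20), ("WY", 5)]
rates = {"CA": 0.25, "WY": 0.5}
est = estimate_sum(sample, rates)
print(est)
```

Each CA tuple stands in for four original tuples and each WY tuple for two, so the estimated sums are 120 for CA and 10 for WY.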

19
(No Transcript)
20
Rewriting Strategies
  • The key step in scaling is to efficiently
    associate each tuple with its corresponding scale
    factor
  • a) Store the scale factor with each tuple
      • i) Integrated Rewriting
      • ii) Nested-integrated Rewriting
  • b) Use a separate table to store the scale factor
      • iii) Normalized Rewriting
      • iv) Key-normalized Rewriting
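
The difference between the strategies can be sketched with simplified SQL, run here through Python's built-in sqlite3 module. The table and column names are hypothetical and the queries are simplified analogues, not the rewrites Aqua actually generates. Key-normalized rewriting is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- a) scale factor stored with each sample tuple
CREATE TABLE samp_int (grp TEXT, val REAL, sf REAL);
INSERT INTO samp_int VALUES ('CA', 10, 4), ('CA', 20, 4), ('WY', 5, 2);
-- b) sample tuples plus a separate scale-factor table
CREATE TABLE samp (grp TEXT, val REAL);
INSERT INTO samp VALUES ('CA', 10), ('CA', 20), ('WY', 5);
CREATE TABLE aux (grp TEXT PRIMARY KEY, sf REAL);
INSERT INTO aux VALUES ('CA', 4), ('WY', 2);
""")

queries = {
    # Integrated: one multiplication per sample tuple.
    "integrated": """
        SELECT grp, SUM(val * sf) FROM samp_int
        GROUP BY grp ORDER BY grp""",
    # Nested-integrated: aggregate first, then one multiplication
    # per (group, scale factor) pair.
    "nested_integrated": """
        SELECT grp, SUM(s * sf) FROM
            (SELECT grp, sf, SUM(val) AS s FROM samp_int GROUP BY grp, sf) AS t
        GROUP BY grp ORDER BY grp""",
    # Normalized: scale factors come from a join with the auxiliary table.
    "normalized": """
        SELECT samp.grp, SUM(samp.val * aux.sf) FROM samp
        JOIN aux ON samp.grp = aux.grp
        GROUP BY samp.grp ORDER BY samp.grp""",
}

results = {name: conn.execute(q).fetchall() for name, q in queries.items()}
print(results)
```

All three queries return the same scaled sums. They differ in how many multiplications they perform and in whether a join is needed.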

21
(No Transcript)
22
Experiments
  • Experimental testbed: Aqua system with Oracle v7
    as the back-end DBMS

Parameter                 Range of values    Default value
Table size (T)            100k-6M tuples     1M
Sample percentage (SP)    1-75               7
Num. groups               10-200k            1000
Group-size skew (z)       0-1.5              0.86
Experimental Parameters
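
To give a feel for the group-size skew parameter, here is a small sketch that generates group sizes assuming z is a Zipf-style exponent, so the group at rank r gets a share proportional to 1/r^z. The slide does not state which distribution was used, so this is an assumption, and the function name is illustrative.

```python
def zipf_group_sizes(table_size, num_groups, z):
    """Split table_size across groups with weights 1 / rank**z (assumed Zipf)."""
    weights = [1.0 / (rank ** z) for rank in range(1, num_groups + 1)]
    total = sum(weights)
    return [table_size * w / total for w in weights]


# z = 0 gives equal-sized groups; larger z gives more skewed group sizes.
even = zipf_group_sizes(1000, 10, z=0)
skewed = zipf_group_sizes(1000, 10, z=0.86)
print(even[:3], skewed[:3])
```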
23
Experiment(Contd.)
  • Study to identify a scheme that can provide
    consistently good performance

24
Performance of various allocation strategies
  • Performance on different query sets
  • Queries with no group-bys: House performs well
  • Queries with three group-bys: Senate has low
    errors
  • Queries with two group-bys: Both Senate and House
    perform poorly in this case
  • Congress performs consistently close to the best
    for queries of all types. Other techniques
    perform well only in a limited part of the
    spectrum

25
(No Transcript)
26
  • Performance of different sample sizes
  • The errors in Congress drop as the sample space
    increases

27
  • Performance of group count
  • Integrated and Nested-integrated perform better
    than Normalized and Key-normalized due to the
    absence of a join operation
  • Nested-integrated performs better than Integrated
    due to significantly fewer multiplications.

28
Conclusions
  • Demonstrated that uniform samples are not enough
    to accurately answer all group-by queries
  • Proposed new techniques based on biased sampling
  • Congressional sampling was introduced, and the
    sampling strategies were validated experimentally
    for both the accuracy of their estimates for
    group-by queries and their execution efficiency
  • All the techniques have been incorporated into
    the Aqua System.

29
Questions??