A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries PowerPoint PPT Presentation

presentation player overlay
About This Presentation
Transcript and Presenter's Notes

Title: A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries


1
A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries
  • Surajit Chaudhuri
  • Gautam Das
  • Vivek Narasayya
  • Presented by
  • Sushanth Sivaram Vallath

2
Why Approximate query Answering?
  • Most applications are OLAP and data mining for
    analyzing large databases.
  • Most of the queries are aggregation queries on
    these large databases.
  • Hence expensive and resource intensive.
  • Approximate answers given accurately and
    efficiently benefits the scalability of the
    application.

3
Approaches to Approximation/Data Reduction
  • Use pre-computed samples of data instead of
    complete data
  • Sampling
  • Weighted Sampling
  • Congressional sampling
  • On the fly sampling (error prone when selections,
    GROUP BYs and joins are used)
  • Workload (Deterministic solution for identical
    workloads)

4
Approaches to Approximation/Data Reduction
(contd.)
  • similar workloads, are considered as
    optimization problem (minimizing the error)
  • Histograms
  • Wavelets

5
Attacking similar workloads
  • Stratified sampling
  • Minimize error in estimation of aggregates

6
Drawbacks of previous studies
  • Lack of rigorous problem formulations leds to
    solutions that are difficult to evaluate
    theoretically.
  • Does not deal with uncertainty in incoming
    queries that are similar but identical
  • Ignore the variance in data distribution of
    aggregated columns

7
Architecture for AQP
  • Queries with selections, foreign-key joins and
    GROUP BY, containing aggregation functions such
    as COUNT, SUM, AVG.

8
Architecture for AQP
9
Architecture for AQP
  • Offline Component building of sample
  • Online Component
  • Rewrites an incoming query to use the sample to
    answer the query approximately
  • Reports the answer with error estimates

10
Pre-computing Samples for Fixed Workload
  • Fundamental Regions For a given relation R and
    workload W, consider partitioning the records in
    R into a minimum number of regions R1, R2, , Rr
    such that for any region Rj, each query in W
    selects either all records in Rj or none.

11
Fixed Workload
  • Identify all fundamental regions ? r
  • After step 1, Case A (r k) and Case B (r gtk)
    (k ? sample size)
  • Case A (r k) (Pick Sample records) Selects the
    samples by picking exactly one record from each
    important fundamental region
  • Assigns appropriate values to additional columns
    in the sample records
  • Case B (r gt k) (Pick Sample records) select k
    regions and then pick one record from each of the
    selected regions. The heuristic is to select top
    k.
  • Assign Values to Additional Columns. This is an
    optimization problem, which is solved by
    partially differentiating and the resulting
    linear equations using Gauss-Seidel method.

12
Disadvantage
  • Per-query (probabilistic) error guarantee is not
    possible.
  • If incoming query is not identical to a query in
    the given workload, FIXED can result in
    unpredictable errors.

13
Lifting Workload to Query Distributions
  • Should be resilient to situations where similar
    but not identical queries
  • Similarity is not based on syntax. If the
    records returned by the two queries have
    significant overlap, it is similar.
  • Each record will have a probability associated
    with it such that the incoming query will select
    this record

14
Rationale for Stratified Sampling
Minimize the MSE of the lifted workload
  • An effective scheme for stratification should be
    such that the expected variance, over all queries
    in each stratum, is small, and allocate more
    samples to strata with larger expected variances.

15
Example
16
Solution for single-table selection queries with
Aggregation
  • Stratification
  • How many strata required during partition?
  • How many records should each strata have?
  • Allocation
  • Determine the number of samples required across
    each strata
  • Sampling

17
Pragmatic Issues
  • Identifying Fundamental Regions
  • Handling Large Number of Fundamental Regions
  • Obtaining Integer Solutions
  • Obtaining an Unbiased Estimator

18
Extensions for more General Queries
  • GROUP BY Queries
  • JOIN Queries
  • Other Extensions
  • Mix of COUNT and SUM queries

19
Conclusion
  • The solutions FIXED and STRAT handles the
    problems of data variance, heterogeneous mixes of
    queries, GROUP BY and foreign-key joins.
Write a Comment
User Comments (0)
About PowerShow.com