A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries presentation

About This Presentation

Transcript and Presenter's Notes

Title: A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries

1
A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries

Surajit Chaudhuri
Gautam Das
Vivek Narasayya

Presented by
Sushanth Sivaram Vallath

2
Why Approximate query Answering?

Most applications are OLAP and data mining for
analyzing large databases.
Most of the queries are aggregation queries on
these large databases.
Hence expensive and resource intensive.
Approximate answers given accurately and
efficiently benefits the scalability of the
application.

3
Approaches to Approximation/Data Reduction

Use pre-computed samples of data instead of
complete data
Sampling
Weighted Sampling
Congressional sampling
On the fly sampling (error prone when selections,
GROUP BYs and joins are used)
Workload (Deterministic solution for identical
workloads)

4
Approaches to Approximation/Data Reduction
(contd.)

similar workloads, are considered as
optimization problem (minimizing the error)
Histograms
Wavelets

5
Attacking similar workloads

Stratified sampling
Minimize error in estimation of aggregates

6
Drawbacks of previous studies

Lack of rigorous problem formulations leds to
solutions that are difficult to evaluate
theoretically.
Does not deal with uncertainty in incoming
queries that are similar but identical
Ignore the variance in data distribution of
aggregated columns

7
Architecture for AQP

Queries with selections, foreign-key joins and
GROUP BY, containing aggregation functions such
as COUNT, SUM, AVG.

8
Architecture for AQP
9
Architecture for AQP

Offline Component building of sample
Online Component
Rewrites an incoming query to use the sample to
answer the query approximately
Reports the answer with error estimates

10
Pre-computing Samples for Fixed Workload

Fundamental Regions For a given relation R and
workload W, consider partitioning the records in
R into a minimum number of regions R1, R2, , Rr
such that for any region Rj, each query in W
selects either all records in Rj or none.

11
Fixed Workload

Identify all fundamental regions ? r
After step 1, Case A (r k) and Case B (r gtk)
(k ? sample size)
Case A (r k) (Pick Sample records) Selects the
samples by picking exactly one record from each
important fundamental region
Assigns appropriate values to additional columns
in the sample records
Case B (r gt k) (Pick Sample records) select k
regions and then pick one record from each of the
selected regions. The heuristic is to select top
k.
Assign Values to Additional Columns. This is an
optimization problem, which is solved by
partially differentiating and the resulting
linear equations using Gauss-Seidel method.

12
Disadvantage

Per-query (probabilistic) error guarantee is not
possible.
If incoming query is not identical to a query in
the given workload, FIXED can result in
unpredictable errors.

13
Lifting Workload to Query Distributions

Should be resilient to situations where similar
but not identical queries
Similarity is not based on syntax. If the
records returned by the two queries have
significant overlap, it is similar.
Each record will have a probability associated
with it such that the incoming query will select
this record

14
Rationale for Stratified Sampling
Minimize the MSE of the lifted workload

An effective scheme for stratification should be
such that the expected variance, over all queries
in each stratum, is small, and allocate more
samples to strata with larger expected variances.

15
Example
16
Solution for single-table selection queries with
Aggregation

Stratification
How many strata required during partition?
How many records should each strata have?
Allocation
Determine the number of samples required across
each strata
Sampling

17
Pragmatic Issues

Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solutions
Obtaining an Unbiased Estimator

18
Extensions for more General Queries

GROUP BY Queries
JOIN Queries
Other Extensions
Mix of COUNT and SUM queries

19
Conclusion

The solutions FIXED and STRAT handles the
problems of data variance, heterogeneous mixes of
queries, GROUP BY and foreign-key joins.

Write a Comment

User Comments (0)

About PowerShow.com

A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries PowerPoint PPT Presentation