Title: A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries
1A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries
- Surajit Chaudhuri
- Gautam Das
- Vivek Narasayya
- Presented by
- Sushanth Sivaram Vallath
2Why Approximate query Answering?
- Most applications are OLAP and data mining for
analyzing large databases. - Most of the queries are aggregation queries on
these large databases. - Hence expensive and resource intensive.
- Approximate answers given accurately and
efficiently benefits the scalability of the
application.
3Approaches to Approximation/Data Reduction
- Use pre-computed samples of data instead of
complete data - Sampling
- Weighted Sampling
- Congressional sampling
- On the fly sampling (error prone when selections,
GROUP BYs and joins are used) - Workload (Deterministic solution for identical
workloads)
4Approaches to Approximation/Data Reduction
(contd.)
- similar workloads, are considered as
optimization problem (minimizing the error) - Histograms
- Wavelets
5Attacking similar workloads
- Stratified sampling
- Minimize error in estimation of aggregates
6Drawbacks of previous studies
- Lack of rigorous problem formulations leds to
solutions that are difficult to evaluate
theoretically. - Does not deal with uncertainty in incoming
queries that are similar but identical - Ignore the variance in data distribution of
aggregated columns
7Architecture for AQP
- Queries with selections, foreign-key joins and
GROUP BY, containing aggregation functions such
as COUNT, SUM, AVG.
8Architecture for AQP
9Architecture for AQP
- Offline Component building of sample
- Online Component
- Rewrites an incoming query to use the sample to
answer the query approximately - Reports the answer with error estimates
10Pre-computing Samples for Fixed Workload
- Fundamental Regions For a given relation R and
workload W, consider partitioning the records in
R into a minimum number of regions R1, R2, , Rr
such that for any region Rj, each query in W
selects either all records in Rj or none.
11Fixed Workload
- Identify all fundamental regions ? r
- After step 1, Case A (r k) and Case B (r gtk)
(k ? sample size) - Case A (r k) (Pick Sample records) Selects the
samples by picking exactly one record from each
important fundamental region - Assigns appropriate values to additional columns
in the sample records - Case B (r gt k) (Pick Sample records) select k
regions and then pick one record from each of the
selected regions. The heuristic is to select top
k. - Assign Values to Additional Columns. This is an
optimization problem, which is solved by
partially differentiating and the resulting
linear equations using Gauss-Seidel method.
12Disadvantage
- Per-query (probabilistic) error guarantee is not
possible. - If incoming query is not identical to a query in
the given workload, FIXED can result in
unpredictable errors.
13Lifting Workload to Query Distributions
- Should be resilient to situations where similar
but not identical queries - Similarity is not based on syntax. If the
records returned by the two queries have
significant overlap, it is similar. - Each record will have a probability associated
with it such that the incoming query will select
this record
14Rationale for Stratified Sampling
Minimize the MSE of the lifted workload
- An effective scheme for stratification should be
such that the expected variance, over all queries
in each stratum, is small, and allocate more
samples to strata with larger expected variances.
15Example
16Solution for single-table selection queries with
Aggregation
- Stratification
- How many strata required during partition?
- How many records should each strata have?
- Allocation
- Determine the number of samples required across
each strata - Sampling
17Pragmatic Issues
- Identifying Fundamental Regions
- Handling Large Number of Fundamental Regions
- Obtaining Integer Solutions
- Obtaining an Unbiased Estimator
18Extensions for more General Queries
- GROUP BY Queries
- JOIN Queries
- Other Extensions
- Mix of COUNT and SUM queries
19Conclusion
- The solutions FIXED and STRAT handles the
problems of data variance, heterogeneous mixes of
queries, GROUP BY and foreign-key joins.