A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries PowerPoint PPT Presentation

presentation player overlay
About This Presentation
Transcript and Presenter's Notes

Title: A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries


1
A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries
  • Surajit Chaudhuri
  • Gautam Das
  • Vivek Narasayya
  • Proceedings of the 2001 ACM SIGMOD International
    Conference on Management of Data p. 295 - 306
  • Presented by
  • Rebecca M. Atchley

2
Motivation
  • Decision Support applications - OLAP and data
    mining for analyzing large databases have become
    very popular
  • Most queries are aggregation queries on these
    large databases
  • Expensive and resource intensive
  • Approximate answers given accurately and
    efficiently benefit the scalability of the
    application and are usually Good Enough

3
Approaches to Approximation/Data Reduction
  • This papers approach uses pre-computed samples
    of data instead of complete data
  • Sampling approaches
  • Weighted Sampling
  • Congressional sampling
  • On-the-fly sampling (error-prone when selections,
    GROUP BYs and joins are used)
  • Workload (Deterministic solution for identical
    workloads)
  • similar workloads, are considered as an
    optimization problem (minimizing the error)

4
Approaches to Approximation/Data Reduction
(contd.)
  • Histograms
  • Wavelets

5
Drawbacks of previous work
  • Lack of rigorous problem formulations lead to
    solutions that are difficult to evaluate
    theoretically
  • Does not deal with uncertainty in incoming
    queries that are similar but not identical
    Assumes fixed workload
  • Ignores the variance in data distribution of
    aggregated columns

6
Attacking similar workloads
  • Stratified sampling
  • Minimize error in estimation of aggregates

7
Architecture for AQP
  • Queries with selections, foreign-key joins and
    GROUP BY, containing aggregation functions such
    as COUNT, SUM, AVG.

8
Architecture for AQP
9
Architecture for AQP
  • Inputs a database and a workload W
  • Offline component for selecting a sample
  • Online component that
  • Rewrites an incoming query to use the sample to
    answer the query approximately.
  • Reports the answer with an estimate of the error
    in the answer.
  • ScaleFactor As in previous works, each record
    in the sample contains an additional column,
    ScaleFactor. The value of the aggregate column
    of each record in the sample is first scaled up
    by multiplying with the ScaleFactor, and then
    aggregated.

10
Architecture for AQP
  • A workload W is specified as a set of pairs of
    queries and their corresponding weights i.e., W
    ltQ1,w1gt, ltQq,wqgt
  • Weight wi indicates the importance of query Qi in
    the workload.
  • Without loss of generality, assume the weights
    are normalized, i.e., Siwi 1

11
Architecture for AQP
  • If correct answer for query Q is y while
    approximate answer is y
  • Relative error E(Q) y - y / y
  • Squared error SE(Q) (y - y / y)²
  • If correct answer for the ith group is yi while
    approximate answer is yi
  • Squared error in answering a GROUP BY query Q
  • SE(Q) (1/g) Si ((yi yi)/ yi)²
  • Given a probability distribution of queries pw
  • Mean squared error for the distribution
  • MSE(pw) SQ pw(Q)SE(Q), (where pw(Q) is
    probability of query Q)
  • Root mean squared error (L2)
  • Other error metrics
  • L1 metric the mean error over all queries in
    workload
  • L8 metric the max error over all queries

12
Special Case Fixed Workload
  • Fundamental Regions For a given relation R and
    workload W, consider partitioning the records in
    R into a minimum number of regions R1, R2, , Rr
    such that for any region Rj, each query in W
    selects either all records in Rj or none.

13
FIXEDSAMP Solution a Deterministic Algorithm
called FIXED
  • Step 1 Identify all fundamental regions ? r
  • After step 1, Case A (r k) and Case B (r gtk)
  • (k ? sample size)
  • Case A (r k) (Pick Sample records) Selects the
    samples by picking exactly one record from each
    important fundamental region
  • Assigns appropriate values to additional columns
    in the sample records.
  • Case B (r gt k) (Pick Sample records) select k
    regions and then pick one record from each of the
    selected regions. The heuristic is to select top
    k.
  • Assign Values to Additional Columns. This is an
    optimization problem, which is solved by
    partially differentiating and the resulting
    linear equations using Gauss-Seidel method.

14
Disadvantage
  • Per-query (probabilistic) error guarantee is not
    possible
  • If incoming query is not identical to a query in
    the given workload, FIXED can result in
    unpredictable errors

15
Rationale for Stratified Sampling
Minimize the MSE of the lifted workload
  • Stratified sampling is a well-known
    generalization of uniform sampling where a
    population is partitioned into multiple strata
    and samples are selected uniformly from each
    stratum with important strata contributing
    relatively more samples
  • An effective scheme for stratification should be
    such that the expected variance, over all queries
    in each stratum, is small, and allocate more
    samples to strata with larger expected variances.

16
Lifting Workload to Query Distributions
  • Should be resilient to situations where similar
    but not identical queries
  • Similar-ity not based on syntax
  • If records returned by the two queries have
    significant overlap, theyre similar
  • Each record will have a probability associated
    with it such that the incoming query will select
    this record

17
The Non-Special Case with a Lifted Workload
  • Problem SAMP
  • Focus on Single-Relation queries with aggregation
    containing SUM or COUNT, W consisting of 1 query
    Q on Relation R
  • Input R, Pw (a probability distribution
    function specified by W), and k
  • Output A sample of k records (with the
    appropriate additional column(s)) such that the
    MSE(Pw) is minimized.
  • lifted workload
  • For a given W, define a lifted workload pw, i.e.,
    a probability distribution of incoming queries.
  • high if Q is similar to queries in W
  • low if dissimilar
  • Instead of mapping queries to probabilities, PQ
    maps subsets of R to probabilities.
  • Objective is to define the distribution PQ

18
The Non-Special Case with a Lifted Workload
  • lifted workload (Cont.)
  • Two parameters d (½ d 1) and ? (0 ? ½)
    define the degree to which the workload
    influences the query distribution. For any
    given record inside (resp. outside) RQ, the
    parameter d (resp. ?) represents the probability
    that an incoming query will select this record.
  • PQ(R) is the probability of occurrence of any
    query that selects exactly the set of records R.
  • n1, n2, n3, and n4 are the counts of records in
    the regions.
  • n2 or n4 large (large overlap), PQ(R) is
    high
  • n1 or n3 large (small overlap), PQ(R) is low

19
Example
Query Q1 SELECT COUNT() FROM R WHERE PRODUCTID
IN (3,4) Population POPQ1 0,0,1,1
20
Solution STRAT for single-table selection queries
with Aggregation
  • Stratification
  • How many strata required during partition?
  • How many records should each stratum have?
  • Allocation
  • Determine the number of samples required across
    each strata
  • Sampling

21
Allocation
  • After stratification into
  • the r fundamental regions for COUNT query
  • h finer subdivisions of the fundamental regions
    for SUM query (since each r may have large
    internal variance in the aggregate column, we
    cannot use the same stratification as in the
    COUNT case)
  • How do we do distribute the k records for the
    sample across the r (for COUNT) or hr (for SUM)
    strata?

22
The Allocation Step for COUNT Query
  • We want to minimize the error over queries in pw
    .
  • k1, kr are unknown variables such that Skj
    k.
  • From Equation (2) on an earlier slide,
  • MSE(pW) can be expressed as a weighted sum of the
    MSE of each query in the workload
  • Lemma 2 MSE(pW) Si wi MSE(pQ)

23
The Allocation Step for COUNT Query (Cont.)
  • For any Q e W, we express MSE(pQ) as a function
    of the kjs.
  • Lemma 3 For a COUNT query Q in W,
  • let ApproxMSE(pQ)
  • Then

24
The Allocation Step for COUNT Query (Cont.)
  • Since we have an (approximate) formula for
    MSE(pQ), we can express MSE(pw) as a function
    of the kjs variables.
  • Corollary 1 MSE(pw) Sj(aj / kj), where each
    aj is a function of n1,,nr, d, and ?.
  • aj captures the importance of a region it is
    positively correlated with nj as well as the
    frequency of queries in the workload that access
    Rj.
  • Now we can minimize MSE(pw).

25
The Allocation Step for COUNT Query (Cont.)
  • Lemma 4 Sj (aj / kj) is minimized subject to Sj
    kj k
  • if kj k ( sqrt(aj) / Si sqrt(ai) )
  • This provides a closed-form and computationally
    inexpensive solution to the allocation problem
    since aj depends only on d, ? and the number of
    tuples in each fundamental region.

26
Allocation Step for SUM Query
  • Like COUNT, we express an optimization problem
    with hr unknowns k1,, khr.
  • Unlike COUNT, the specific values of the
    aggregate column in each region (as well as the
    variance of values in each region) influence
    MSE(pQ).
  • Let yj(Yj) be the average (sum) of the aggregate
    column values of all records in region Rj. Since
    the variance within each region is small, each
    value within the region can be approximated as
    simply yj. Thus to express MSE(pQ) as a
    function of the kjs for a SUM query Q in W

27
Allocation Step for SUM Query
  • As with COUNT, MSE(pW) for SUM is functionally of
    the form Sj(aj / kj), and aj depends on the same
    parameters n1, nhr , d, and ? (see Corollary
    1).
  • The same minimization procedure can be used as in
    Lemma 4.

28
(No Transcript)
29
Experimental Results
  • FIXED solution for FIXEDSAMP, fixed workload,
    identical queries
  • STRAT solution for SAMP, workloads with
    single-table selection queries with aggregation
  • PREVIOUS WORK
  • USAMP uniform random sampling
  • WSAMP weighted sampling
  • OTLIDX outlier indexing combined with weighted
    sampling
  • CONG Congressional sampling

30
Experimental Results
31
Experimental Results
32
Experimental Results
33
Experimental Results
34
Experimental Results
35
Experimental Results
36
Experimental Results
37
Experimental Results
38
Pragmatic Issues
  • Identifying Fundamental Regions
  • Handling Large Number of Fundamental Regions
  • Obtaining Integer Solutions
  • Obtaining an Unbiased Estimator

39
Extensions for more General Queries
  • GROUP BY Queries
  • JOIN Queries
  • Other Extensions
  • Mix of COUNT and SUM queries

40
Conclusion
  • The solutions FIXED and STRAT handle the problems
    of data variance, heterogeneous mixes of queries,
    GROUP BY and foreign-key joins.

Questions?
41
Thank you!
Write a Comment
User Comments (0)
About PowerShow.com