1
APPROXIMATE QUERY PROCESSING IN DATABASES
  • By
  • Jatinder Paul

2
Introduction
[Diagram: a user issues a SQL query to a Decision Support System (DSS) and receives an exact answer, with long response times!]
  • In recent years, advances in data collection and management technologies have led to a proliferation of very large databases.
  • Decision support applications such as Online Analytical Processing (OLAP) and data mining tools are gaining popularity. Executing such applications on large volumes of data can be resource intensive.
  • For very large (multi-gigabyte) databases, processing even a single query involves accessing enormous amounts of data, leading to very high running times.

3
What is Approximate Query Processing (AQP)?
  • Keeping query processing time small is very important in data mining and decision support systems (DSS).
  • In aggregate queries (involving SUM, COUNT, AVG, etc.), precision to the last decimal is not needed, e.g., "What is the average sale of a product across all the states of the US?"
  • Exactness of the result is not very important; the result can be given as an approximate answer with some error guarantee.
  • e.g., average salary: 59,000 +/- 500 (with 95% confidence) in 10 seconds vs. 59,152.25 in 10 minutes.
  • Approximate Query Processing (AQP) techniques sacrifice accuracy to improve running time.

4
Different AQP techniques
  • All AQP techniques involve building some kind of synopsis (summary) of the database.
  • AQP techniques differ in the way the synopsis is built:
  • Sampling: the most preferred and practically used technique.
  •   Randomized sampling: sampling uniformly at random.
  •   Deterministic sampling: selecting the best sample that minimizes the total error in the approximate answers.
  • Histograms: creating histograms that represent the database.
  • Wavelets: still a research area!
  • This presentation concentrates on various sampling techniques.

5
Basic Sampling-Based AQP Architecture
  • Pre-processing phase (offline): a sample is built.
  • Runtime phase (online): queries are rewritten to run against the sample; the result is then scaled to give the approximate answer. A minimal sketch of this step follows.
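Below is a minimal sketch of the runtime rewrite-and-scale step, assuming a base table sales, a pre-built 1% uniform sample sales_sample, and SQLite; the table names and sampling rate are illustrative, not from the slides.

```python
import sqlite3

SAMPLING_RATE = 0.01  # assumed rate of the pre-built uniform sample

def approximate_sum(conn: sqlite3.Connection, column: str) -> float:
    # Rewrite the query to run against the sample instead of the base
    # table, then scale the result by the inverse of the sampling rate.
    (total,) = conn.execute(f"SELECT SUM({column}) FROM sales_sample").fetchone()
    return total / SAMPLING_RATE if total is not None else 0.0
```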

6
Limitations of Uniform Sampling for Aggregation Queries
  • We discuss three problems that adversely affect the accuracy of sampling-based estimation:
  •   the presence of skewed data in aggregate values,
  •   the effect of low selectivity in selection queries, and
  •   the presence of small groups in group-by queries.
  • These problems are discussed in detail below.

7
Effect of data skew
  • A skewed database is characterized by the existence of certain tuples that deviate from the rest in terms of their contribution to the aggregate value.
  • These tuples are called outliers. They are the values that contribute heavily to the error in uniform sampling.
  • Outliers in the data result in large variance.
  • Large variance => large standard deviation, and
  • Large standard deviation => large error in results.

8
Effect of data skew
  • Example: Consider a relation R containing
  •   two columns <ProductId, Revenue> and
  •   four records <1, 10>, <2, 10>, <3, 10>, <4, 1000>,
  • and the aggregate query Q: SELECT SUM(Revenue) FROM R.
  • Use a sample S of two records from R to answer the query.
  • The query is answered by
  •   running it against S and
  •   scaling the result by a factor of two (since we are using a 50% sample).
  • Consider a sample S1 = {<1, 10>, <3, 10>}. The estimated answer for Q using S1 is 40, which is a severe underestimate of the actual answer (1030).
  • Now consider another sample S2 = {<1, 10>, <4, 1000>}. The estimated answer for Q using S2 is 2020, which is a significant overestimate.
  • Thus, large variance in the aggregate column can lead to large relative errors. The sketch below replays this example.
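The following sketch replays the slide's example, enumerating every possible 50% sample of R and scaling each by the inverse sampling rate:

```python
import itertools

revenues = [10, 10, 10, 1000]   # the Revenue column of R
actual = sum(revenues)          # 1030

for sample in itertools.combinations(revenues, 2):
    estimate = 2 * sum(sample)  # scale the 50% sample by a factor of two
    print(f"sample={sample} -> estimate={estimate} (actual={actual})")
# Samples that miss the outlier estimate 40; samples that hit it estimate 2020.
```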

9
Handling data skew: Outlier indexes
  • The main idea is to deal with outliers separately and sample from the rest of the relation.
  • Partition the relation R into two sub-relations:
  •   RO (outliers) and
  •   RNO (non-outliers).
  • The query Q can now be considered as the UNION of two sub-queries:
  •   Q applied to RO and
  •   Q applied to RNO.

10
Handling data skew: Outlier indexes
  • Preprocessing steps:
  • 1. Determine outliers: specify the sub-relation RO of the relation R deemed to be the set of outliers.
  • 2. Sample non-outliers: select a uniform random sample T of the relation RNO.
  • Query processing steps (sketched in code below):
  • 1. Aggregate outliers: apply the query to the outliers in RO, accessed via the outlier index.
  • 2. Aggregate non-outliers: apply the query to the sample T and extrapolate to obtain an estimate of the query result for RNO.
  • 3. Combine aggregates: combine the approximate result for RNO with the exact result for RO to obtain an approximate result for R.
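A minimal sketch of these steps for SUM, assuming outliers are simply values above a fixed threshold (the actual outlier-selection criterion minimizes variance; the threshold here is an illustrative stand-in):

```python
import random

def outlier_index_sum(values, threshold, sample_size):
    outliers = [v for v in values if v >= threshold]      # R_O, via outlier index
    non_outliers = [v for v in values if v < threshold]   # R_NO
    exact_part = sum(outliers)                            # 1. aggregate outliers exactly
    sample = random.sample(non_outliers, min(sample_size, len(non_outliers)))
    scale = len(non_outliers) / len(sample) if sample else 0.0
    approx_part = scale * sum(sample)                     # 2. extrapolate non-outliers
    return exact_part + approx_part                       # 3. combine aggregates

print(outlier_index_sum([10, 10, 10, 1000], threshold=100, sample_size=2))
```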

11
Outlier indexes: Other issues
  • Since the database content changes over time, the outlier indexes and samples must be refreshed appropriately. The samples need to be refreshed periodically, as pre-computed samples become stale.
  • An alternative is to do the sampling completely online, i.e., make it a part of query processing.
  • For the COUNT aggregate, outlier indexing is not beneficial, since there is no variance among the data values.
  • Also, outlier indexing is not useful for aggregates that depend on the rank order of tuples rather than their actual values (e.g., aggregates such as MIN, MAX, and MEDIAN).

12
Effect of low selectivity
  • Most queries involve selection conditions and/or group-bys.
  • If the selectivity of a query is low (i.e., it selects few records), this adversely impacts the accuracy of sampling-based estimation.
  • A selection query partitions the relation into two sub-relations:
  •   tuples that satisfy the condition (the relevant sub-relation) and
  •   tuples that do not.
  • If we sample uniformly from the relation, the number of tuples sampled from the relevant sub-relation will be proportional to its size. If this relevant sample size is small due to low selectivity of the query, it may lead to large error.

13
Effect of small groups
  • The effect of small groups is the same as the effect of low selectivity.
  • Group-by queries also partition the relation into numerous sub-relations (tuples that belong to specific groups).
  • Thus, for uniform sampling to perform well, the relevant sub-relation should be large, which is not the case in general.

14
Using Workload Information: A Solution to Low Selectivity and Small Groups
  • The approach to this problem is to use weighted sampling (instead of uniform sampling), exploiting workload information while drawing the sample.
  • The essential idea behind the weighted sampling scheme is to sample more from subsets of data that are small in size but important, i.e., have high usage. This results in low error.
  • This approach is based on the fact that the usage of a database is typically characterized by considerable locality in the access pattern, i.e., queries against the database access certain parts of the data more than others.

15
Using Workload Information: A Solution to Low Selectivity and Small Groups
  • The use of workload information involves the following steps:
  • 1. Workload collection: obtain a workload consisting of representative queries against the database. Modern database systems provide tools to log queries posed against the server (e.g., the Profiler component of Microsoft SQL Server).
  • 2. Trace query patterns: the workload can be analyzed to obtain parsed information, e.g., the set of selection conditions that are posed.
  • 3. Trace tuple usage: executing the workload reveals additional information on the usage of specific tuples, e.g., the frequency of access to each tuple, i.e., the number of queries in the workload for which it passes the selection condition.
  • 4. Weighted sampling: perform sampling by taking into account the weights of tuples (from step 3).

16
Weighted Sampling
  • Assume that a tuple ti has weight wi if the tuple ti is required to answer wi of the queries in the workload.
  • The normalized weight wi' is defined as wi' = wi / Σj wj.
  • A tuple is accepted in the sample with probability pi = n · wi', where n is the sample size.
  • Thus, the probability with which each tuple is accepted in the sample varies from tuple to tuple.
  • The inverse of this probability is the multiplication factor associated with the tuple when answering the query: each aggregate computed over this tuple gets multiplied by this multiplication factor. A sketch of the scheme follows.
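A small sketch of this scheme, assuming each tuple's weight is the number of workload queries that select it; the min(1, ...) clamp is a practical guard for tuples whose computed probability would exceed one:

```python
import random

def weighted_sample(tuples, weights, n):
    total = sum(weights)
    sample = []
    for t, w in zip(tuples, weights):
        p = min(1.0, n * w / total)  # acceptance probability p_i = n * w_i'
        if random.random() < p:
            # Keep the multiplication factor 1/p_i with the tuple; aggregates
            # over this tuple are scaled by it at query-answering time.
            sample.append((t, 1.0 / p))
    return sample

rows = [("CA", 100), ("NC", 20), ("WY", 5)]
usage = [50, 30, 20]  # how many workload queries touch each tuple
print(weighted_sample(rows, usage, n=2))
```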

17
Problem with Group-By Queries
  • Decision support queries routinely segment the data into groups.
  • For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups, e.g., the state of California has nearly 70 times the population of Wyoming.
  • As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups, because accuracy is highly dependent on the number of sample tuples belonging to the group.
  • Standard error is inversely proportional to √n for a uniform sample, where n is the uniform random sample size.

18
Solution: Congressional Samples
  • Congressional samples are a hybrid union of uniform and biased samples.
  • One strategy is to divide the available sample space X equally among the g groups and take a uniform random sample within each group.
  • Consider the US Congress, which is a hybrid of the House and the Senate:
  •   The House has representatives from each state in proportion to its population.
  •   The Senate has an equal number of representatives from each state.
  • We now apply the House and Senate scenario to representing different groups (contrasted in the sketch below):
  •   House sample: uniform random sampling, so each group is represented in proportion to its size.
  •   Senate sample: sample an equal number of tuples from each group.
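A small sketch contrasting House-style (proportional) and Senate-style (equal) allocation of a sample budget X across groups; the state populations are rough illustrative figures:

```python
def house_allocation(group_sizes, X):
    total = sum(group_sizes.values())
    return {g: X * n / total for g, n in group_sizes.items()}  # proportional

def senate_allocation(group_sizes, X):
    return {g: X / len(group_sizes) for g in group_sizes}      # equal per group

states = {"CA": 39_000_000, "NC": 10_500_000, "WY": 580_000}
print(house_allocation(states, X=1000))   # large groups dominate the space
print(senate_allocation(states, X=1000))  # every group gets the same space
```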

19
Congressional Samples
  • We define a strategy S1 as follows: divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
  • Congressional approach: in this approach we consider the entire set of possible group-by queries over a relation R.
  • Let F be the set of non-empty groups under the full grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h under any other grouping T ⊆ G is the union of one or more groups g from F.
  • Constructing Congress:
  • 1. Apply S1 to each grouping T ⊆ G.
  • 2. Let the set of non-empty groups under the grouping T be given, and let mT be the number of such groups.
  • 3. By S1, each of the non-empty groups in T should get a uniform random sample of X/mT tuples from the group.

20
Congressional Samples
  • Constructing Congress (continued):
  • 4. Thus, for each subgroup g in F of a group h in T, the expected space allocated to g is simply S(g,T) = (X / mT) · (ng / nh), where ng and nh are the number of tuples in g and h respectively.
  • 5. Then, for each group g, take the maximum over all T of S(g,T) as the sample size for g, and scale it down to limit the space used to X. The final formula is SampleSize(g) = X · maxT S(g,T) / Σg' maxT S(g',T).
  • 6. For each group g in F, select a uniform random sample of size SampleSize(g).
  • Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples. A sketch of the allocation follows.
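A sketch of the Congress allocation under the assumption that, for each grouping T, we know mT and the size nh of each finest group's containing group h; the census-style toy data is illustrative:

```python
def congress_sizes(n_g, groupings, X):
    """n_g: {g: n_g}; groupings: {T: (m_T, {g: n_h})}; X: sample budget."""
    best = {g: max((X / m_T) * (n_g[g] / n_h[g])          # S(g,T)
                   for m_T, n_h in groupings.values())    # max over all T
            for g in n_g}
    total = sum(best.values())
    return {g: X * s / total for g, s in best.items()}    # scale down to X

# Finest partitioning by (state, gender); groupings: {state}, {gender}, both.
n_g = {("CA", "M"): 300, ("CA", "F"): 300, ("WY", "M"): 5, ("WY", "F"): 5}
groupings = {
    "state":  (2, {g: 600 if g[0] == "CA" else 10 for g in n_g}),
    "gender": (2, {g: 305 for g in n_g}),
    "both":   (4, dict(n_g)),
}
print(congress_sizes(n_g, groupings, X=100))  # small WY groups get outsized space
```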
21
Rewriting for biased samples
  • Query rewriting involves two key steps:
  •   a) scaling up the aggregate expressions, and
  •   b) deriving error bounds on the estimate.
  • For each tuple, let its scale factor (ScaleFactor) be the inverse sampling rate of its stratum.
  • All the sample tuples belonging to a group have the same ScaleFactor. Thus, the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor.
  • There are two approaches to doing this (both sketched below):
  •   a) store the ScaleFactor (SF) with each tuple in the sample relation, or
  •   b) use a separate table to store the ScaleFactors for the groups.
  • Each approach has its pros and cons.
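Sketches of the two rewrites; all table and column names are illustrative, not from the slides:

```python
# (a) ScaleFactor stored with each tuple of the sample relation:
rewrite_a = """
    SELECT state, SUM(income * SF)        -- per-tuple scale-up
    FROM census_sample
    GROUP BY state
"""

# (b) ScaleFactors kept in a separate per-group table, joined at runtime:
rewrite_b = """
    SELECT s.state, SUM(s.income * f.SF)
    FROM census_sample s JOIN scale_factors f ON s.state = f.state
    GROUP BY s.state
"""
```

Roughly, (a) avoids a join but widens every sample row, while (b) keeps the sample narrow at the cost of a runtime join; this is the trade-off behind the pros and cons noted above.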

22
Dynamic Sample Selection for AQP
  • All previous strategies for AQP use a single sample with a fixed bias and do not take advantage of extra disk space.
  • Increasing the size of a sample stored on disk increases the running time of a query executing against that sample.
  • Dynamic sampling gets around this problem by:
  • 1. Creating a large sample containing a family of differently biased sub-samples during the pre-processing phase.
  • 2. Using only a portion of the sample to answer each query at runtime.
  • Because there are many different sub-samples with different biases available to choose from at runtime, the chances increase that one of them will be a good fit for any particular query that is issued.
  • And since only a small portion of the overall sample is used to answer a query, response time is low.

23
Generic Dynamic Sample Selection Architecture
  • Pre-processing phase
  • Consists of two steps:
  • The data distribution and query distribution (e.g., workload information) are examined to identify a set of biased samples to be created.
  • Samples are created and stored in the database along with the metadata.

Figure 2. Pre-processing phase (the database is divided into samples; metadata identifies the characteristics of each sample)
24
Generic Dynamic Sample Selection Architecture
  • Runtime phase
  • Consists of two steps:
  • The query is rewritten to run against the sample tables.
  • The appropriate sample table(s) to use for a given query Q are determined by comparing Q with the metadata annotations for the samples.

Figure 3. Runtime phase
  • We need algorithms to decide:
  • which samples to build during pre-processing, and
  • which samples to use for query answering during the runtime phase.
  • It should be possible to quickly determine which of the various samples to use to answer a given query.

25
Small Group Sampling: A Dynamic Sample Selection Technique
  • Small group sampling targets the most common type of analysis query: aggregation queries with group-bys.
  • Basic idea: uniform sampling performs well for the large groups in a group-by query, but for small groups it performs poorly, since it does not draw enough representatives from them.
  • The small group sampling approach uses a combination of:
  •   a uniform random sample (the overall sample) covering the large groups, and
  •   one or more small group tables that contain only rows from small groups. The small group tables are not downscaled.

26
Small Group Sampling Algorithm Description
  • Pre-processing phase
  • The pre-processing algorithm takes two input parameters:
  •   the base sampling rate r, which determines the size of the uniform random sample that is created (i.e., the overall sample), and
  •   the small group fraction t, which determines the maximum size of each small group sample table.
  • Let N = the number of rows in the database and C = the set of columns in the database.
  • The pre-processing algorithm can be implemented efficiently by making just two scans:
  • 1. The first scan identifies the frequently occurring values for each column and their approximate frequencies.
  • 2. In the second scan, the small group tables for each column in S are constructed along with the overall sample.

27
Small Group Sampling Algorithm Description
  • Pre-processing phase algorithm
  • First pass of the algorithm:
  • 1. Initially, the set S is initialized to C.
  • 2. Count the number of occurrences of each distinct value in each column of the database and determine the common values for each column (this can be done using a separate hash table per column).
  • 3. If the number of distinct values for a column exceeds a threshold u, remove that column from S and stop maintaining its counts.
  • L(C) is defined as the minimum set of values from C whose frequencies sum to at least N(1 - t).
  • Second pass of the algorithm:
  • 1. Determine the set of common values L(C) for each column C.
  • 2. Rows with values from the set L(C) will not be included in the small group table for C, but rows with all other values will be; there are at most Nt such rows.
  • 3. If column C has no small groups, remove it from S.
  • 4. After computing L(C) for every C ∈ S, the algorithm creates a metadata table containing a mapping from each column name to a unique index between 0 and |S| - 1.

28
Pre-Processing Phase Algorithm
  • The final step in pre-processing is to make a second scan of the database to construct the sample tables.
  • Each row containing an uncommon value for one or more columns (i.e., a value not in the set L(C)) is added to the small group sample table for the appropriate column(s).
  • At the same time as the small group tables are being constructed, the pre-processing algorithm also creates the overall sample, using reservoir sampling to maintain a uniform random sample of rN tuples.
  • Each row added to either a small group table or the overall sample is tagged with an extra bit-mask field (of length |S|) indicating the set of small group tables to which that row was added. This field is used during runtime query processing to avoid double-counting rows that are assigned to multiple sample tables. A condensed sketch of the two scans follows.
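A condensed sketch of the two-scan pre-processing, assuming rows is a list of dicts; the distinct-value threshold u and the bit-mask tagging are omitted for brevity:

```python
import random
from collections import Counter

def preprocess(rows, columns, r, t):
    N = len(rows)
    # Scan 1: count value frequencies per column.
    counts = {c: Counter(row[c] for row in rows) for c in columns}
    # L(C): the minimum set of common values whose frequencies sum to >= N(1 - t).
    L = {}
    for c in columns:
        common, covered = set(), 0
        for value, freq in counts[c].most_common():
            if covered >= N * (1 - t):
                break
            common.add(value)
            covered += freq
        L[c] = common
    # Scan 2: build small group tables and the rN-row overall sample
    # (reservoir sampling keeps the sample uniform in a single pass).
    small, reservoir, target = {c: [] for c in columns}, [], int(r * N)
    for i, row in enumerate(rows):
        for c in columns:
            if row[c] not in L[c]:          # uncommon value -> small group table
                small[c].append(row)
        if len(reservoir) < target:
            reservoir.append(row)
        elif (j := random.randrange(i + 1)) < target:
            reservoir[j] = row
    return small, reservoir
```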

29
Small Group Sampling Algorithm Description
  • Runtime phase
  • When a query arrives at runtime, it is rewritten to run against the sample tables instead of the base fact table.
  • Each query is executed against the overall sample, scaling the aggregate values by the inverse of the sampling rate r.
  • In addition, for each column C ∈ S in the query's group-by list, the query is executed against that column's small group table. The aggregate values are not scaled when executing against the small group sample tables.
  • Finally, the results from the various sample queries are aggregated together into a single approximate query answer. Since a row can be included in multiple sample tables, the rewritten queries include filters that avoid double-counting rows. The combination step is sketched below.
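A sketch of the runtime combination step, assuming group-by results arrive as dicts mapping group to SUM and that the bit-mask filters have already removed rows covered by a small group table from the overall-sample result:

```python
def combine(overall_result, small_group_results, r):
    answer = {g: v / r for g, v in overall_result.items()}  # scale by 1/r
    for sg in small_group_results:
        for g, v in sg.items():
            answer[g] = answer.get(g, 0) + v  # small group tables are unscaled
    return answer

print(combine({"CA": 1200.0}, [{"WY": 37.0}], r=0.01))
```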

30
Small group sampling vs. congressional sampling
  • Sample generality: congressional sampling creates only a single sample, so the sample must necessarily be very general purpose in nature and is only a loose fit for any particular query. Small group sampling uses the dynamic sample selection architecture and thus gains the benefits of more specialized samples, each tuned for a narrower, more specific class of queries.
  • Pre-processing cost: the pre-processing time required by congressional sampling is proportional to the number of different combinations of grouping columns, which is exponential in the number of columns; this renders it impractical for typical data warehouses that have dozens or hundreds of potential grouping columns. The pre-processing time for small group sampling is linear in the number of columns in the database.
31
Problem with Joins
  • The approximate query strategies discussed so far use some sort of sample (called a base sample) drawn by uniform random sampling.
  • The use of base samples to estimate the output of a join of two or more relations, however, can produce a poor quality approximation, for the following two reasons:
  • 1. Non-uniform result sample: in general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non-uniformity significantly degrades the accuracy of the answer and the confidence bounds.
  • 2. Small join result sizes: the join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds, since these critically depend on the query result size.
  • Thus, it is in general impossible to produce good quality approximate answers using samples on the base relations alone.

32
Solution: Join Synopses
  • We need to pre-compute samples of join results to make quality answers possible.
  • A naïve solution is to execute all possible join queries of interest and collect samples of their results. But this is not feasible, since it is too expensive to compute and maintain.
  • One strategy is to compute samples of the results of a small set of distinguished joins, which can be used to obtain random samples of all possible joins in the schema. We refer to the samples of these distinguished joins as join synopses.
  • Join synopses are a solution for queries with only foreign key joins.

33
Join Synopses
  • Lemma J.1: The sub-graph of G on the k nodes in any k-way foreign key join must be a connected sub-graph with a single root node. (Here G denotes the schema graph, whose nodes are the relations and whose edges are the foreign key references between them.)
  • We denote the relation corresponding to the root node as the source relation for the k-way foreign key join.
  • Lemma J.2: There is a 1-1 correspondence between tuples in a relation r1 and tuples in any k-way foreign key join with source relation r1.
  • For each relation r, there is some maximum foreign key join (i.e., one having the largest number of relations) with r as the source relation. For example, in the accompanying figure (not reproduced in this transcript), the join of C with all of its descendants is the maximum foreign key join with source relation C.

34
Join Synopses
  • Definition of join synopses: for each node u in G, corresponding to a relation r1:
  •   let J(u) be the output of the maximum foreign key join with source r1;
  •   let S(u) be a uniform random sample of r1;
  •   the join synopsis JS(u) is then the output of joining S(u) with the remaining relations of that maximum foreign key join.
  • The join synopses of a schema consist of JS(u) for all u in G.
  • Join synopses are also referred to as join samples. A toy sketch follows.
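A toy sketch of building a join synopsis, assuming a schema where orders references customers via a foreign key (all names illustrative). By Lemma J.2, each sampled source tuple joins with exactly one tuple downstream, so the result is a uniform sample of the join output:

```python
import random

customers = {1: "Alice", 2: "Bob", 3: "Carol"}             # referenced relation
orders = [(101, 1, 50.0), (102, 1, 20.0), (103, 3, 75.0)]  # (id, cust_fk, total)

# Sample the source relation, then join the sampled tuples with the
# relation they reference to form the join synopsis.
sampled = random.sample(orders, 2)
join_synopsis = [(oid, fk, total, customers[fk]) for oid, fk, total in sampled]
print(join_synopsis)
```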

35
Join Synopses
  • A single join synopsis can be used for a large number of distinct joins.
  • Lemma J.3: From a single join synopsis for a node whose maximum foreign key join has k relations, we can extract a uniform random sample of the output of between k and 2^(k-1) distinct foreign key joins.
  • In other words, a join synopsis of relation r can be used to provide a random sample of any join involving r and one or more of its descendants.
  • Since Lemma J.2 fails to apply in general to any relation other than the source relation, the joining tuples in any relation r other than the source relation will not in general be a uniform random sample of r. Thus, distinct join synopses are needed for each node/relation.
  • Since tuples in join synopses are the results of multi-way joins, a possible concern is that they will be too large because they have many columns. To reduce the columns stored for tuples in join synopses, we can eliminate redundant columns (for example, join columns) and store only the columns of interest. Small relations can be stored in their entirety rather than as part of join synopses.

36
References
  1. Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, Vivek Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. ICDE 2001.
  2. Surajit Chaudhuri, Gautam Das, Vivek Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.
  3. Brian Babcock, Surajit Chaudhuri, Gautam Das. Dynamic Sample Selection for Approximate Query Processing. SIGMOD Conference 2003.
  4. S. Acharya, P. B. Gibbons, V. Poosala, S. Ramaswamy. Join Synopses for Approximate Query Answering. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 275-286, June 1999.
  5. Gautam Das. Survey of Approximate Query Processing Techniques (invited tutorial). SSDBM 2003.
  6. P. B. Gibbons, S. Acharya, V. Poosala. Aqua: Approximate Query Answering System.
  7. Yong Yao, Johannes Gehrke. Query Processing for Sensor Networks.

37
  • THANK YOU