Maintenance of Data Cubes and Summary Tables in a Data Warehouse - PowerPoint PPT Presentation

1 / 34
About This Presentation
Title:

Maintenance of Data Cubes and Summary Tables in a Data Warehouse

Description:

Maintenance of Data Cubes. and. Summary Tables. in a. Data Warehouse. Carmen Gillette. John Schloman ... contains a large amount of information collected from ... – PowerPoint PPT presentation

Number of Views:235
Avg rating:3.0/5.0
Slides: 35
Provided by: schl4
Category:

less

Transcript and Presenter's Notes

Title: Maintenance of Data Cubes and Summary Tables in a Data Warehouse


1
Maintenance of Data Cubes and Summary Tables
in a Data Warehouse
Carmen Gillette John Schloman
2
Data Warehouse
  • contains a large amount of information collected
    from a variety of independent sources
  • OLAP (On-line Analytical Processing) hundreds of
    complex aggregate queries over large volumes of
    data
  • Views must be updated to reflect changes that are
    made to the data sources (incrementally,
    calculations, intial)

3
How are Changes with the Views Conducted?
  • Immediately as soon as a change form a source
    is received
  • Deferred a time when a large batch of updates
    is applied to the warehouse at once.

View Immediate Changes Why could this cause a
problem?
4
Why could immediated changes cause problems
within the view?
  • Each update into the warehouse requires updating
    the views
  • The overhead increases with the number of views
    and its complexity
  • Warehouse applications require that the state of
    the views not change while the warehouse is being
    accessed, therefore the user will see a
    consistent snapshot of the warehouse across
    multiple queries during analysis

5
Components of Data Warehouses
  • Large amounts of Data
  • Maintaining Multiple Views
  • Maintaining Multiple Summary Tables
  • OLAP query performance

6
Goals
Efficiently maintaining a set of summary tables
in a warehouse
  • Incremental Maintenance techniques for summary
    tables- using Summary-Delta Tables
  • General Strategy to minimize the batch time
    needed for maintenance by splitting the work
    into propagate and refresh functions for
    aggregate views
  • Maintaining multiple summary tables

7
Example Dimension and Fact Tables
Fact Table
Pos(storeID,itemID,date,qty,price)
Dimension Table
stores(storeID,city,region)
Dimension Table
items(itemID,name,category,cost)
Dimension Hierarchies is a set of functional
dependencies among the attributes of the
dimension table
storeID
city
region
8
Summary TablesMaterialized Views
9
Maintaining Summary Tables Using Propagation and
Refresh
What is the summary-delta table?represents the
net changes to the summary table due to changes
to the fact table Is this the same as the summary
table?
  • Propagate creates a summary delta table from
    the deferred of changes
  • Refresh applies the net changes represented in
    the summary-delta table to the summary table
    updates the summary table to reflect changes in
    the summary-delta table

10
Example Updating Summary Table
11
Reuse of the Summary-Delta Tables
The Summary-Delta table of one are used to
compute other Summary-Delta tables
Fewer tuple access
12
Using Aggregate Functions for Maintaining Summary
Tables
  • Distributive computed by the partitioning input
    into disjoint sets, aggregating each set
    individually, ect.
  • Algebraic scalar function of distributive
    aggregate functions
  • Holistic computed by dividing into parts

Self- Maintainable Distributive COUNT()
(Insertions and Deletions) determines when all
the tuples in a group have been deleted MIN,MAX
value for the group is recomputed from base data
13
Data Cubes
  • Dimension hierarchy information is obtained
    implicitly
  • Explicit joins with the dimension tables

K dimension yields 2K cube views
  • SQL SELECT-FROM-WHERE-GROUP BY
  • 2K of the dimension attributes

14
Generalized Cube Views
  • Many systems and warehouses do not fit into the
    structure of cube views. There is then a need for
    a more generalized form. Generalized cube views
    are traditional cube-style views that are
    extended in the following
  • a. Different views may compute different
    aggregate functions.
  • b. Some views may compute aggregate functions on
    columns used as dimension attributes in other
    views (e.g. the group-by attributes of other
    views).
  • c. Views may join with different combinations of
    dimension tables.

15
Dimension hierarchies and lattices
  • As said before, dimensional hierarchy information
    is stored in separate dimensional tables (e.g.
    stores(storeID, city, region)) and to group facts
    higher up on the hierarchy, the fact table must
    be joined with the dimensional table.

16
Dimension hierarchies and lattices (cont.)
  • But if we look at the dimensional hierarchy as a
    lattice, we can construct a lattice representing
    the possible set of views by grouping on the
    dimensional hierarchies. Here Figure 4 is grouped
    with stores(storeID, city, region) and
    items(itemID,category) to result in Figure 5.

17
Partially-materialized lattices
  • A lattice obtained by removing some nodes
    (usually information that is not materialized in
    the warehouse).
  • If we are removing node n, all incoming edges
    (n1,n) and outgoing edges (n,n2) must be combined
    into a new edge (n1,n2).
  • In SQL the query defining view n2 along edge
    (n1,n2) is obtained by replacing view n in the
    FROM clause with view n1.

18
Basic Summary-Delta Maintenance Algorithm
  • To maintain the generalized cube view there are
    some requirements
  • a. All aggregate functions must be
    self-maintainable or made self-maintainable by
    adding the necessary COUNT functions.
  • b. The holistic functions MIN and MAX may be used
    (in which case they are handled during the
    refresh function).

19
Propagate function
  • As the propagate function only affects the
    summary-delta table, leaving the summary table
    open, it is important to do as much work in the
    Propagation step to minimize the amount of time
    the system is unavailable during refresh.
  • To help make the calculation of the summary-delta
    table easier to understand, some of the work is
    split into three virtual views prepare-changes,
    prepare-insertions, and prepare-deletions.

20
Preparing Changes
  • These tables represent the changes to the
    aggregate functions caused by individual
    insertions and deletions.
  • An aggregate-source attribute corresponds to each
    of the aggregate functions computed by the
    summary table. If the summary table included the
    aggregate function SUM(AB), a aggregate-source
    attribute would be included in prepare-insertions
    (AB) and prepare-deletions (-AB).

21
Preparing Changes
  • prepare-changes is derived from a join of
    prepare-insertions and -deletions.

22
Compute the summary-delta table
  • The summary-delta table is computed by
    aggregating the prepare-changes virtual view.
  • The resulting summary-delta table has the same
    schema as the summary table (except that the
    aggregate functions represent changes to the
    aggregate functions in the summary table).

23
Refresh Function
  • The changes in the summary-delta table are
    applied to the summary table. Each tuple in the
    summary-delta table causes a change to a single
    corresponding tuple in the summary table
    (inserted, deleted, or updated).
  • COUNT and SUM based aggregates are calculated
    directly.
  • If MIN and MAX aggregates are used the new values
    are compared to see if the summary table values
    need to be changed.

24
Efficiently maintaining multiple summary tables
  • Just as multiple cube views can be arranged into
    a lattice, summary tables too can be arranged
    into a lattice a V-lattice.
  • Furthermore our summary-delta tables can be
    arranged into a lattice a D-lattice.

25
Placing generalized cube views into a lattice
  • Cube view v2 should be derivable from cube view
    v1 (placed above v2 in the cube lattice) and
    therefore v1 is the ancestor of v2. This is true
    if
  • 1. each group-by attribute of v2 is either a
    group-by attribute of v1 or is an attribute of a
    dimension table whose foreign key is a group-by
    attribute of v1.
  • 2. each aggregate function a(E) of v2 either
    appears in v1 or E is an expression over group-by
    attributes of v1, or E is an expression over
    attribute of dimension tables whose foreign keys
    are group-by attributes of v1.

26
Optimizing the lattice
  • This means pushing down joins, aggregate
    functions, and dimension columns as low down into
    the lattice as possible. This is done for the
    following reasons
  • a. as one traverses down the data cube, the
    number of tuples is likely to decrease, so fewer
    tuples are needed to be involved in the join.
  • b. joining with all dimension tables at the
    top-most view results in very wide tuples, which
    require more room in memory and disk.

27
Summary-delta lattice
  • In the same way we formed the summary tables into
    the V-lattice, we would like to do the same for
    the summary delta tables (the D-lattice). Again
    the hope is that we would be able to exploit the
    lattice structure for a boost in efficiency.
  • Theorem 5.1 The D-lattice is identical to the
    V-lattice, including the queries along each edge,
    modulo a change in the names of tables at each
    node (e.g. from view v to summary-delta table
    sd_v). Also some of the tables in the FROM clause
    are uniformly replaced by the prepare-changes
    table.

28
Summary-delta lattice (cont.)
  • In the example, the summary-delta D-lattice is
    the same as the partially-materialized V-lattice
    of Figure 8.

29
Computing the summary-delta lattice
  • The problem of computing the summary-delta
    lattice turns out to map directly to the problem
    of computing multiple summary tables from
    scratch.
  • Their solutions can be used to derive an
    efficient propagation strategy on how to
    sort/hash inputs, what order to evaluate
    summary-delta tables, and which of the incoming
    lattice edges to use to evaluate a summary-delta
    table.

30
Performance
  • The performance of the refresh operation depended
    heavily on the number of updates/deletes vs.
    inserts on the summary tables. Therefore two
    types of changes to the pos table was considered
  • Update-Generation Changes insertions and
    deletions of equal number of tuples over existing
    date, store, and item values.
  • Insertion-Generation Changes insertions over new
    dates, but existing store and item values.

31
Performance (cont.)
  • Performance Graph Legend
  • Rematerialize- time to rematerialize using the
    lattice structure but no summary delta tables.
  • Summary Delta Maint.- time to rematerialize
    using the lattice and summary delta tables (both
    propagate and refresh steps).
  • Propagate- time to propagate using lattice.
  • Propagate (w/o lattice)- time to propagate
    without using lattice.

32
Performance (cont.)
33
Conclusions
  • Other maintenance papers have used the delta
    paradigm. But the summary-delta paradigm is to
    compute a summary-delta table that represents a
    summary of the changes applied to the
    materialized view. The refresh is more complex
    than the union/difference in the delta paradigm.
  • The summary-delta paradigm is one of few to
    consider maintenance of aggregate functions and
    the only to consider the problem of maintaining
    multiple aggregate views.
  • The formal split of the propagate and refresh
    functions allows for access to the summary tables
    during the propagation step.
  • The usage of a lattice structure in the V-lattice
    and D-lattice provides a logical, space, and time
    increase in efficiency.

34
Questions
  • 1. How would triggers for insertions, deletions,
    and updates to the data warehouse be appropriate
    because of the lattice?
  • 2. For the data cube in Figure 4 in the article,
    how would the (storied, itemID) view be affected
    by changes to the (storeID) view? Changes to the
    (storeID, itemID, date) view?
  • 3. How does propagation take place with the
    lattice? Without the lattice?
  • 4. The paper only talked briefly of using ECA
    triggers with the Summary Delta Table. Would the
    Summary Delta Table benefit from the use of
    Continual Queries from the first seminar? Why?
Write a Comment
User Comments (0)
About PowerShow.com