Title: Maintenance of Data Cubes and Summary Tables in a Data Warehouse
1Maintenance of Data Cubes and Summary Tables
in a Data Warehouse
Carmen Gillette John Schloman
2Data Warehouse
- contains a large amount of information collected
from a variety of independent sources - OLAP (On-line Analytical Processing) hundreds of
complex aggregate queries over large volumes of
data - Views must be updated to reflect changes that are
made to the data sources (incrementally,
calculations, intial)
3How are Changes with the Views Conducted?
- Immediately as soon as a change form a source
is received - Deferred a time when a large batch of updates
is applied to the warehouse at once.
View Immediate Changes Why could this cause a
problem?
4Why could immediated changes cause problems
within the view?
- Each update into the warehouse requires updating
the views - The overhead increases with the number of views
and its complexity - Warehouse applications require that the state of
the views not change while the warehouse is being
accessed, therefore the user will see a
consistent snapshot of the warehouse across
multiple queries during analysis
5Components of Data Warehouses
- Large amounts of Data
- Maintaining Multiple Views
- Maintaining Multiple Summary Tables
- OLAP query performance
6Goals
Efficiently maintaining a set of summary tables
in a warehouse
- Incremental Maintenance techniques for summary
tables- using Summary-Delta Tables - General Strategy to minimize the batch time
needed for maintenance by splitting the work
into propagate and refresh functions for
aggregate views - Maintaining multiple summary tables
7Example Dimension and Fact Tables
Fact Table
Pos(storeID,itemID,date,qty,price)
Dimension Table
stores(storeID,city,region)
Dimension Table
items(itemID,name,category,cost)
Dimension Hierarchies is a set of functional
dependencies among the attributes of the
dimension table
storeID
city
region
8Summary TablesMaterialized Views
9Maintaining Summary Tables Using Propagation and
Refresh
What is the summary-delta table?represents the
net changes to the summary table due to changes
to the fact table Is this the same as the summary
table?
- Propagate creates a summary delta table from
the deferred of changes - Refresh applies the net changes represented in
the summary-delta table to the summary table
updates the summary table to reflect changes in
the summary-delta table
10Example Updating Summary Table
11Reuse of the Summary-Delta Tables
The Summary-Delta table of one are used to
compute other Summary-Delta tables
Fewer tuple access
12Using Aggregate Functions for Maintaining Summary
Tables
- Distributive computed by the partitioning input
into disjoint sets, aggregating each set
individually, ect. - Algebraic scalar function of distributive
aggregate functions - Holistic computed by dividing into parts
Self- Maintainable Distributive COUNT()
(Insertions and Deletions) determines when all
the tuples in a group have been deleted MIN,MAX
value for the group is recomputed from base data
13Data Cubes
- Dimension hierarchy information is obtained
implicitly - Explicit joins with the dimension tables
K dimension yields 2K cube views
- SQL SELECT-FROM-WHERE-GROUP BY
- 2K of the dimension attributes
14Generalized Cube Views
- Many systems and warehouses do not fit into the
structure of cube views. There is then a need for
a more generalized form. Generalized cube views
are traditional cube-style views that are
extended in the following - a. Different views may compute different
aggregate functions. - b. Some views may compute aggregate functions on
columns used as dimension attributes in other
views (e.g. the group-by attributes of other
views). - c. Views may join with different combinations of
dimension tables.
15Dimension hierarchies and lattices
- As said before, dimensional hierarchy information
is stored in separate dimensional tables (e.g.
stores(storeID, city, region)) and to group facts
higher up on the hierarchy, the fact table must
be joined with the dimensional table.
16Dimension hierarchies and lattices (cont.)
- But if we look at the dimensional hierarchy as a
lattice, we can construct a lattice representing
the possible set of views by grouping on the
dimensional hierarchies. Here Figure 4 is grouped
with stores(storeID, city, region) and
items(itemID,category) to result in Figure 5.
17Partially-materialized lattices
- A lattice obtained by removing some nodes
(usually information that is not materialized in
the warehouse). - If we are removing node n, all incoming edges
(n1,n) and outgoing edges (n,n2) must be combined
into a new edge (n1,n2). - In SQL the query defining view n2 along edge
(n1,n2) is obtained by replacing view n in the
FROM clause with view n1.
18Basic Summary-Delta Maintenance Algorithm
- To maintain the generalized cube view there are
some requirements - a. All aggregate functions must be
self-maintainable or made self-maintainable by
adding the necessary COUNT functions. - b. The holistic functions MIN and MAX may be used
(in which case they are handled during the
refresh function).
19Propagate function
- As the propagate function only affects the
summary-delta table, leaving the summary table
open, it is important to do as much work in the
Propagation step to minimize the amount of time
the system is unavailable during refresh. - To help make the calculation of the summary-delta
table easier to understand, some of the work is
split into three virtual views prepare-changes,
prepare-insertions, and prepare-deletions.
20Preparing Changes
- These tables represent the changes to the
aggregate functions caused by individual
insertions and deletions. - An aggregate-source attribute corresponds to each
of the aggregate functions computed by the
summary table. If the summary table included the
aggregate function SUM(AB), a aggregate-source
attribute would be included in prepare-insertions
(AB) and prepare-deletions (-AB).
21Preparing Changes
- prepare-changes is derived from a join of
prepare-insertions and -deletions.
22Compute the summary-delta table
- The summary-delta table is computed by
aggregating the prepare-changes virtual view. - The resulting summary-delta table has the same
schema as the summary table (except that the
aggregate functions represent changes to the
aggregate functions in the summary table).
23Refresh Function
- The changes in the summary-delta table are
applied to the summary table. Each tuple in the
summary-delta table causes a change to a single
corresponding tuple in the summary table
(inserted, deleted, or updated). - COUNT and SUM based aggregates are calculated
directly. - If MIN and MAX aggregates are used the new values
are compared to see if the summary table values
need to be changed.
24Efficiently maintaining multiple summary tables
- Just as multiple cube views can be arranged into
a lattice, summary tables too can be arranged
into a lattice a V-lattice. - Furthermore our summary-delta tables can be
arranged into a lattice a D-lattice.
25Placing generalized cube views into a lattice
- Cube view v2 should be derivable from cube view
v1 (placed above v2 in the cube lattice) and
therefore v1 is the ancestor of v2. This is true
if - 1. each group-by attribute of v2 is either a
group-by attribute of v1 or is an attribute of a
dimension table whose foreign key is a group-by
attribute of v1. - 2. each aggregate function a(E) of v2 either
appears in v1 or E is an expression over group-by
attributes of v1, or E is an expression over
attribute of dimension tables whose foreign keys
are group-by attributes of v1.
26Optimizing the lattice
- This means pushing down joins, aggregate
functions, and dimension columns as low down into
the lattice as possible. This is done for the
following reasons - a. as one traverses down the data cube, the
number of tuples is likely to decrease, so fewer
tuples are needed to be involved in the join. - b. joining with all dimension tables at the
top-most view results in very wide tuples, which
require more room in memory and disk.
27Summary-delta lattice
- In the same way we formed the summary tables into
the V-lattice, we would like to do the same for
the summary delta tables (the D-lattice). Again
the hope is that we would be able to exploit the
lattice structure for a boost in efficiency. - Theorem 5.1 The D-lattice is identical to the
V-lattice, including the queries along each edge,
modulo a change in the names of tables at each
node (e.g. from view v to summary-delta table
sd_v). Also some of the tables in the FROM clause
are uniformly replaced by the prepare-changes
table.
28Summary-delta lattice (cont.)
- In the example, the summary-delta D-lattice is
the same as the partially-materialized V-lattice
of Figure 8.
29Computing the summary-delta lattice
- The problem of computing the summary-delta
lattice turns out to map directly to the problem
of computing multiple summary tables from
scratch. - Their solutions can be used to derive an
efficient propagation strategy on how to
sort/hash inputs, what order to evaluate
summary-delta tables, and which of the incoming
lattice edges to use to evaluate a summary-delta
table.
30Performance
- The performance of the refresh operation depended
heavily on the number of updates/deletes vs.
inserts on the summary tables. Therefore two
types of changes to the pos table was considered - Update-Generation Changes insertions and
deletions of equal number of tuples over existing
date, store, and item values. - Insertion-Generation Changes insertions over new
dates, but existing store and item values.
31Performance (cont.)
- Performance Graph Legend
- Rematerialize- time to rematerialize using the
lattice structure but no summary delta tables. - Summary Delta Maint.- time to rematerialize
using the lattice and summary delta tables (both
propagate and refresh steps). - Propagate- time to propagate using lattice.
- Propagate (w/o lattice)- time to propagate
without using lattice. -
32Performance (cont.)
33Conclusions
- Other maintenance papers have used the delta
paradigm. But the summary-delta paradigm is to
compute a summary-delta table that represents a
summary of the changes applied to the
materialized view. The refresh is more complex
than the union/difference in the delta paradigm. - The summary-delta paradigm is one of few to
consider maintenance of aggregate functions and
the only to consider the problem of maintaining
multiple aggregate views. - The formal split of the propagate and refresh
functions allows for access to the summary tables
during the propagation step. - The usage of a lattice structure in the V-lattice
and D-lattice provides a logical, space, and time
increase in efficiency.
34Questions
- 1. How would triggers for insertions, deletions,
and updates to the data warehouse be appropriate
because of the lattice? - 2. For the data cube in Figure 4 in the article,
how would the (storied, itemID) view be affected
by changes to the (storeID) view? Changes to the
(storeID, itemID, date) view? - 3. How does propagation take place with the
lattice? Without the lattice? - 4. The paper only talked briefly of using ECA
triggers with the Summary Delta Table. Would the
Summary Delta Table benefit from the use of
Continual Queries from the first seminar? Why?