Online%20Analytical%20Processing%20(OLAP) - PowerPoint PPT Presentation

View by Category
About This Presentation
Title:

Online%20Analytical%20Processing%20(OLAP)

Description:

Heineken. Coke. Product Name. Feb. Mar. Feb. Month. non relational representation ... 157. 83. 74. Doritos. 154. 80. 74. Heineken. 153. 85. 68. Pepsi. 970 ... – PowerPoint PPT presentation

Number of Views:64
Avg rating:3.0/5.0
Slides: 51
Provided by: dbU7
Learn more at: http://db.ucsd.edu
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Online%20Analytical%20Processing%20(OLAP)


1
Online Analytical Processing (OLAP)
  • An Overview

Kian Win Ong, Nicola Onose
Mar 3rd 2006
2
Overview
  • Motivation
  • Multi-Dimensional Data Model
  • Research Areas
  • Optimizations
  • Materializing multiple aggregates simultaneously
  • Materialization strategy

3
Motivation
  • Aggregation, summarization and exploration
  • Of historical data
  • To help management make informed decisions

4
Different Goal
  • Aggregation, summarization and exploration
  • Of historical data
  • To help management make informed decisions

Product Branch Time Price
Coke (0.5 gallon) Convoy Street 2006-03-01 090001 1.00
Pepsi (0.5 gallon) UTC 2006-03-01 090001 1.03
Coke (1 gallon) UTC 2006-03-01 090002 1.50
Altoids Costa Verde 2006-03-01 090133 0.30
... ... ... ...
  • Find the total sales for each product and month
  • Find the percentage change in the total monthly
    sales for each product

5
Different Requirements
  • OLTP On-Line Transaction Processing
  • OLAP On-Line Analytical Processing

OLTP OLAP
Tasks Day to day operation High level decision support
Size of database Gigabytes Terabytes
Time span Recent, up-to-date Spanning over months / years
Size of working set Tens of records, accessed through primary keys Consolidated data from multiple databases
Workload Structured / repetitive Ad-hoc, exploratory queries
Performance Transaction throughput Query latency
6
Overview
  • Motivation
  • Multi-Dimensional Data Model
  • Research Areas
  • Optimizations
  • Materializing multiple aggregates simultaneously
  • Materialization strategy

7
Query Language Extensions
  • In the real world, data is stored in RDBs.

8
Query Language Extensions
  • In the real world, data is stored in RDBs.
  • How to express N-dimensional problems using 2D
    tables?

9
Query Language Extensions
  • In the real world, data is stored in RDBs.
  • How to express N-dimensional problems using 2D
    tables?
  • Can we combine OLAP and SQL queries?
  • Jim Gray et al Data Cube A Relational
    Aggregation Operator 1997

10
Query Language Extensions
Problems with GROUP BY
  1. histograms

SELECT sales, prod_name, population FROM
sales_history GROUP BY Population(City,
State) as
population
11
Query Language Extensions
Problems with GROUP BY
  1. histograms
  2. rollup/drilldown

non relational representation
Product Category Product Name Month Sales Sales by Cat., by Name Sales by Cat.
Drinks Coke Feb 30.3
Mar 93.9 124.2
Heineken Feb 34.8
Mar 123.8 158.6 282.8
12
Query Language Extensions
Problems with GROUP BY
  1. histograms
  2. rollup/drilldown

relational, but the rollup is huge
Product Category Product Name Month Sales Sales by Cat., by Name Sales by Cat.
Drinks Coke Feb 30.3 124.2 282.8
Drinks Coke Mar 93.9 124.2 282.8
Drinks Heineken Feb 34.8 158.6 282.8
Drinks Heineken Mar 123.8 158.6 282.8
13
Query Language Extensions
Problems with GROUP BY
  1. histograms
  2. rollup/drilldown
  3. cross tabulations

Product Category Product Name Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke Total 124.2
Drinks Heineken Feb 34.8
Drinks Heineken Mar 123.8
Drinks Heineken Total 158.6
Drinks Total Total 282.8
Could be represented as
14
Query Language Extensions
Problems with GROUP BY
  1. histograms
  2. rollup/drilldown
  3. cross tabulations

Drinks Feb Mar Total
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
Total 65.1 217.7 282.8
2-D aggregation is more compact and more natural
15
Query Language Extensions
Problems with GROUP BY
  1. histograms
  2. rollup/drilldown
  3. cross tabulations
  4. complex expressions, hard to optimize

when reducing to 1-D aggregation (GROUP BY) need
2number of dim. GROUP BYs
16
Query Language Extensions
Reducing the number of attributes
Product Category Product Name Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
Drinks Heineken Feb 34.8
Drinks Heineken Mar 123.8
Drinks Heineken ALL 158.6
Drinks ALL ALL 282.8
Drinks ALL Feb 65.1
Drinks ALL Mar 217.7
17
Query Language Extensions
Reducing the number of attributes
  • introduce a new value ALL

Drinks Feb Mar Total (ALL)
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
Total (ALL) 65.1 217.7 282.8
ALL the set over which we aggregate
18
Query Language Extensions
General approach
  • GROUP BY (1D)

Sales by Product Name Feb Mar
Coke 30.3 93.9
Heineken 34.8 123.8
SUM 65.1 217.7
19
Query Language Extensions
General approach
the corresponding relation
Product Category Product Name Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
Drinks Heineken Feb 34.8
Drinks Heineken Mar 123.8
Drinks Heineken ALL 158.6
Drinks ALL Feb 65.1
Drinks ALL Mar 217.7
Drinks ALL ALL 282.8
  • GROUP BY (1D)
  • Cross Tab (2D)

Drinks Feb Mar ALL
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
ALL 65.1 217.7 282.8
20
Query Language Extensions
General approach
Product Category Product Name Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2

Snacks Doritos Feb 123.8
Snacks Doritos Mar 158.6
Snacks Doritos ALL 65.1

ALL ALL ALL 964.0
  • GROUP BY (1D)
  • Cross Tab (2D)
  • Cube (3D)

By cat. and name (does it make sense?)
By cat. and month
By month and name
21
Query Language Extensions
General approach
  • GROUP BY (1D)
  • Cross Tab (2D)
  • Cube (3D)
  • Any hypercube can be represented as a relation!

22
Query Language Extensions
General approach
  • a CUBE relation, with aggregation function f(.)
  • (x1, x2, , xn-1, xn, f() )
  • (x1, xn-1, , xn, ALL, f() )
  • (x1, x2, , ALL, xn, f() )
  • after ROLLUP , reduce to a linear of tuples
  • (x1, x2, , xn-1, xn, f() )
  • (x1, xn-1, , xn, ALL, f() )
  • (x1, x2, , ALL, ALL, f() )
  • (ALL, ALL, , ALL, ALL, f() )

23
Query Language Extensions
The new operators CUBE, ROLLUP
SELECT prod_category, prod_name, month,
SUM(sales) AS sales FROM sales_history
GROUP BY CUBE prod_category,
prod_name, month
Product Category Product Name Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2

Drinks ALL Feb 99.8

ALL ALL ALL 964.0
Idea Group by the CUBE list. Union the
aggregates. Introduce the ALL values.
24
Query Language Extensions
The new operators CUBE, ROLLUP
SELECT prod_category, month, day, state,
prod_name, SUM(sales) AS sales
FROM sales_history GROUP BY
prod_category ROLLUP month, day CUBE city,
state
Product Category Month Day State Product Name Sales
Drinks Feb 26 CA Coke 12.3
Drinks Feb 26 CA Heineken 5.4
Drinks
Drinks Feb 26 CA ALL 30.4
Drinks Feb 26 ALL Coke
Drinks
Snacks Feb 26 CA Doritos 12.0
Snacks
25
Overview
  • Motivation
  • Multi-Dimensional Data Model
  • Research Areas
  • Optimizations
  • Materializing multiple aggregates simultaneously
  • Materialization strategy

26
Research Areas
  • SQL language extensions
  • Server architecture
  • Parallel processing
  • Index structures
  • Materialized views

27
Overview
  • Motivation
  • Multi-Dimensional Data Model
  • Research Areas
  • Optimizations
  • Materializing multiple aggregates simultaneously
  • Materialization strategy

28
Simultaneous
Multi-Dimensional
Aggregates
  • Y. Zhao, P. Deshpande, J. Naughton An Array-Based
    Algorithm for Simultaneous Multidimensional
    Aggregates SIGMOD 1997
  • Optimization to calculate multiple aggregates
    simultaneously
  • Useful for materialization of aggregate views

29
Multiple Aggregates
Aggregate on
Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
... ... ... ...
Month / Product Feb Mar Total
Altoids 36 131 167
Coke 37 138 175
Doritos 21 136 157
Heineken 44 110 154
Pepsi 31 122 153
Pringles 37 126 164
Total 206 764 970
30
Multiple Aggregates
City / Product San Diego Los Angeles Total
Altoids 90 77 167
Coke 89 86 175
Doritos 74 83 157
Heineken 74 80 154
Pepsi 68 85 153
Pringles 73 90 164
Total 469 501 970
Aggregate on
Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
... ... ... ...
Month / Product Feb Mar Total
Altoids 36 131 167
Coke 37 138 175
Doritos 21 136 157
Heineken 44 110 154
Pepsi 31 122 153
Pringles 37 126 164
Total 206 764 970
Month / City Feb Mar Total
Los Angeles 112 358 469
San Diego 95 407 501
Total 206 764 970
31
Multiple Aggregates
Aggregate on
  1. Sales by Product / City
  2. Sales by Product / Month
  3. Sales by Month / City
  4. Sales by Product
  5. Sales by City
  6. Sales by Month
  7. Sales (Total)

Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
... ... ... ...
  • Is it possible to
  • make a single pass over the transactional table?
  • calculate multiple aggregates simultaneously?

32
Chunking
64
Partition transactional data into array chunks
14
15
16
13
12
9
10
11
42
Dimension B
City
36
5
8
6
7
Array Chunk
20
1
2
3
4
1
Dimension C
12
Month
Dimension A
Product
Product City Month Sales
Coke San Diego Feb 06 12
33
Naïve Algorithm
64
14
15
16
13
Dimension A




12
9
10
11
42
Dimension B
36
5
8
6
7
20
4
1
2
3
Dimension C
Pivot on AB aggregate on all C
Dimension A
34
Naïve Algorithm
64
13
14
15
16












9
12
10
11
42
Dimension B
36
5
8
6
7
20
1
2
3
4
Dimension C
Pivot on AB aggregate on all C
Dimension A
Pivot on AC aggregate on all B
Pivot on BC aggregate on all A
35
Single Pass Algorithm
64
AB



1 2 3 4
14
15
16
13
AC
12
9
10
11



1 2 3 4
42
B
36
5
8
6
7
20
1
2
3
4
Dimension C
BC



1 2 3 4
Dimension A
Make a single pass over data
36
Single Pass Algorithm
64
AB
13
9 10 11 12
5 6 7 8
1 2 3 4
14
15
16
13
AC
12
9
10
11



1 5 9 13 2 6 10 3 7 11 4 5 12
42
B
36
5
8
6
7
20
1
2
3
4
Dimension C
BC
13
9 10 11 12
5 6 7 8
1 2 3 4
Dimension A
Simultaneously maintain multiple aggregates
37
Single Pass Algorithm
64
AB
13
9 10 11 12
5 6 7 8
1 2 3 4
14
15
16
13
AC
12
9
10
11



1 5 9 13 2 6 10 3 7 11 4 5 12
42
B
36
5
8
6
7
20
1
2
3
4
Dimension C
BC
13
9 10 11 12
5 6 7 8
1 2 3 4
Dimension A
Write out completed aggregates
38
Single Pass Algorithm
64
AB
13
9 10 11 12
5 6 7 8
1 2 3 4
14
15
16
13
AC
12
9
10
11



1 5 9 13 2 6 10 3 7 11 4 5 12
42
B
36
5
8
6
7
20
1
2
3
4
Dimension C
BC



13
Dimension A
Only allocate memory that is necessary
39
Single Pass Algorithm
AB
13
9 10 11 12
5 6 7 8
1 2 3 4
Array Chunk
ABC 4 x 4 x 4
AC



1 5 9 13 2 6 10 3 7 11 4 5 12
AB 16 x 4 x 4
AC 4 x 4 x 4
BC 4 x 4
B 4
C 4
A 4 x 4
BC



13
Minimum memory spanning tree
all 1
40
Multi Pass Algorithm
Recursively aggregate
ABCD
ABC
ABD
ACD
BCD
AB
AC
BC
BD
CD
AD
D
A
C
B
all
41
Overview
  • Motivation
  • Multi-Dimensional Data Model
  • Research Areas
  • Optimizations
  • Materializing multiple aggregates simultaneously
  • Materialization strategy

42
Implementing Data Cubes
  • Biggest problem for data warehouses the size
  • Space / time trade-off accelerate queries by
    materializing the cube

43
Implementing Data Cubes
  • Biggest problem for data warehouses the size
  • Space / time trade-off accelerate queries by
    materializing the cube
  • The size of the relations gets even bigger!

44
Implementing Data Cubes
  • Biggest problem for data warehouses the size
  • Space / time trade-off accelerate queries by
    materializing the cube
  • The size of the relations gets even bigger!
  • M(ultidimensional)OLAP good query performance,
    but bad scalability
  • R(elational)OLAP very scalable query
    performance improved by materializing (partial)
    results

45
Implementing Data Cubes
  • V. Harinarayan, A. Rajaraman, J.D.
    Ullman Implementing Data Cubes
    Efficiently SIGMOD 1996 Presents a
    materialization strategy for the cells of the
    cube.

46
Implementing Data Cubes
Month
Year
Day
Month
Week
Year
Time Id
City Id
Product Id
Sales
Week
City Id
City
State
Product Id
Name
Category
Category Id
Category Name
47
Implementing Data Cubes
  • casted as particular case of the rewriting using
    views problem
  • what cells to materialize ? what SQL views to
    materialize

48
Implementing Data Cubes
  • casted as particular case of the rewriting using
    views problem
  • what cells to materialize ? what SQL views to
    materialize

p product t time c city
  • simple idea Q1 depends on Q2 (Q1Q2) if Q1 can
    be fully answered using the results of Q2

49
Implementing Data Cubes
  • but cube dimensions are usually hierarchical

X
X
p product t time c city
  • direct-product lattice

50
Implementing Data Cubes
  • Def. cost of answering Q of rows in the table
    of ancestor(Q)
  • It can be estimated w/o materializing the views
  • Assume that all queries are identical to some
    view in the lattice

51
Implementing Data Cubes
  • For a set S and a view v B(v,S) ?wv, (w not in
    S) maxcost(w)-cost(v), 0
  • Greedy algorithm for selecting k views to
    materialize from the lattice
  • S top view
  • For i1 to k, add v to S s.t. B(v,S) is
    maximized
  • The greedy algorithm is an (e-1)/e 0.63 approx.
    of the optimum.

52
Discussion
  • Questions from the audience
About PowerShow.com