OLAP on line analytical processing - PowerPoint PPT Presentation

Loading...

PPT – OLAP on line analytical processing PowerPoint presentation | free to download - id: 150625-MTQ2N



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

OLAP on line analytical processing

Description:

Relational OLAP (ROLAP): extended relational DBMS that maps operations on ... Low maintenance as the data warehouse matures. Highest possible performance. CS 336 ... – PowerPoint PPT presentation

Number of Views:93
Avg rating:3.0/5.0
Slides: 43
Provided by: joachim8
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: OLAP on line analytical processing


1
OLAP (on line analytical processing)
2
Three-Tier Decision Support Systems
  • Warehouse database server
  • Almost always a relational DBMS, rarely flat
    files
  • OLAP servers
  • Relational OLAP (ROLAP) extended relational DBMS
    that maps operations on multidimensional data to
    standard relational operators
  • Multidimensional OLAP (MOLAP) special-purpose
    server that directly implements multidimensional
    data and operations
  • Clients
  • Query and reporting tools
  • Analysis tools
  • Data mining tools

3
The Complete Decision Support System
Information Sources
Data Warehouse Server (Tier 1)
OLAP Servers (Tier 2)
Clients (Tier 3)
e.g., MOLAP
Analysis
Semistructured Sources
Data Warehouse
serve
extract transform load refresh etc.
Query/Reporting
serve
e.g., ROLAP
Operational DBs
Data Mining
serve
Data Marts
4
Approaches to OLAP Servers
  • Relational DBMS as Warehouse Servers
  • Two possibilities for OLAP servers
  • (1) Relational OLAP (ROLAP)
  • Relational and specialized relational DBMS to
    store and manage warehouse data
  • OLAP middleware to support missing pieces
  • (2) Multidimensional OLAP (MOLAP)
  • Array-based storage structures
  • Direct access to array data structures

5
OLAP Server Query Engine Requirements
  • Aggregates (maintenance and querying)
  • Decide what to precompute and when
  • Query language to support multidimensional
    operations
  • Standard SQL falls short
  • Scalable query processing
  • Data intensive and data selective queries

6
OLAP for Decision Support
  • Support ad-hoc querying for business analyst
  • Think in terms of spreadsheets
  • View sales data by geography, time, or product
  • Extend spreadsheet analysis model to work with
    warehouse data
  • Large data sets
  • Semantically enriched to understand business
    terms
  • Combine interactive queries with reporting
    functions
  • Multidimensional view of data is the foundation
    of OLAP
  • Data model, operations, etc.

7
Warehouse Models Operators
  • Data Models
  • relations
  • stars snowflakes
  • cubes
  • Operators
  • slice dice
  • roll-up, drill down
  • pivoting
  • other

8
Multi-Dimensional Data
  • Measures - numerical data being tracked
  • Dimensions - business parameters that define a
    transaction
  • Example Analyst may want to view sales data
    (measure) by geography, by time, and by product
    (dimensions)
  • Dimensional modeling is a technique for
    structuring data around the business concepts
  • ER models describe entities and relationships
    vs dimensional models that describe measures
    and dimensions

9
The Multi-Dimensional Model
  • Sales by product line over the past six months
  • Sales by store between 1990 and 1995

Store Info
Key columns joining fact table to dimension tables
Numerical Measures
Prod Code Time Code Store Code Sales Qty
Fact table for measures
Product Info
Dimension tables
Time Info
. . .
10
Dimensional Modeling
  • Dimensions are organized into hierarchies
  • E.g., Time dimension days ? weeks ? quarters
  • E.g., Product dimension product ? product line ?
    brand
  • Dimensions have attributes

11
Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
12
An expanded view of the model shows three
dimensions Time, Store and Product. Attribute
hierarchies are vertical relationships, while
extended attribute characteristics are diagonal.
13
                                                  
                     
  • In the time dimension,
  • a given date is further described by its extended
    attributes "current flag," "sequence" and "day of
    the week."
  • Extended attribute characteristics have no impact
    on granularity. The fact that February 4, 1996 is
    on a Sunday has no effect on the fact that we
    collect sales by day. In practice, though. we may
    wish to compare Sunday sales to other days. We
    can do this by constraining our query on the
    extended attribute characteristic "day of the
    week" without having to gather any additional
    information.
  • The store dimension includes an extended
    attribute characteristic at a higher level each
    region has a Regional Manager.
  • Attribute hierarchies imply aggregation of data
    stores roll up into districts districts into
    regions.

14
ROLAP Dimensional Modeling Using Relational DBMS
  • Special schema design star, snowflake
  • Special indexes bitmap, multi-table join
  • Special tuning maximize query throughput
  • Proven technology (relational model, DBMS), tend
    to outperform specialized MDDB especially on
    large data sets
  • Products
  • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

15
MOLAP Dimensional Modeling Using the Multi
Dimensional Model
  • MDDB a special-purpose data model
  • Facts stored in multi-dimensional arrays
  • Dimensions used to index array
  • Sometimes on top of relational DB
  • Products
  • Pilot, Hyperion, Gentia, Express

16
Star Schema (in RDBMS)
17
Star Schema Example
18
Star Schema with Sample Data
19
The Classic Star Schema
  • A single fact table, with detail and summary data
  • Fact table primary key has only one key column
    per dimension
  • Each key is generated
  • Each dimension is a single table, highly
    denormalized

Benefits Easy to understand, easy to define
hierarchies, reduces of physical joins, low
maintenance, very simple metadata Drawbacks
Summary data in the fact table yields poorer
performance for summary levels, huge dimension
tables a problem
20
The Classic Star Schema
  • The fact table may also contain partially
    consolidated data, such as sales dollars for a
    region, for a given product for a given time
    period.
  • Confusion is possible if ALL consolidated data
    isn't included. For example, if data is
    consolidated only to the district level, a query
    for regional information will bring back no
    records and, hence, report no sales activity. For
    this reason, simple stars MUST contain either
  • ALL of the combinations of aggregated data or
  • At least views of every combination

21
  • One approach is to create a multi-part key,
    identifying each record by the combination of
    store/district/region. Using compound keys in a
    dimension table can cause problems
  • It requires three separate metadata definitions
    to define a single relationship, which adds to
    complexity in the design and sluggishness in
    performance.
  • Since the fact table must carry all three keys
    as part of its primary key, addition or deletion
    of levels in the hierarchy (such as the addition
    of "territory" between store and district) will
    require physical modification of the fact table,
    a time-consuming process that limits flexibility
  • Carrying all of the segments of the compound
    dimensional key in the fact table increases the
    size of the crucial fact table index, a real
    determinant to both performance and scalability.

22
                                                
                           One alternative to
compound keys is to concatenate the keys into a
single key. While this approach solves the first
two problems with compound keys (extra metadata
and rigidity in the fact table), the size of the
key is still a problem. Also, as in the example
above, dealing with nulls can be confusing.
23
  •                                                   
                            
  • Instead of "meaningful" keys generate the
    smallest possible key that will insure uniqueness
    of each record. Integers are the most efficient
    in most cases.
  • Note that the meaningful keys do not have to
    disappear they may be shifted to non-key
    attribute columns. If, in fact, these attributes
    are used frequently in queries (where
    region_description is "North"), the columns can
    still be indexed, even if they aren't used as the
    key.
  • The use of generated keys is preferred because
  • It allows for the highest level of flexibility
    of metadata
  • Low maintenance as the data warehouse matures
  • Highest possible performance

24
The Classic Star Schema
The biggest drawback dimension tables must carry
a level indicator for every record and every
query must use it. In the example below, without
the level constraint, keys for all stores in the
NORTH region, including aggregates for region and
district will be pulled from the fact table,
resulting in error.
Example Select A.STORE_KEY, A.PERIOD_KEY,
A.dollars from Fact_Table A where A.STORE_KEY in
(select STORE_KEY from Store_Dimension
B where region North and Level 2) and
etc...
Level is needed whenever aggregates are stored
with detail facts.
25
The Level Problem
  • Level is a problem because it causes potential
    for error. If the query builder, human or
    program, forgets about it, perfectly reasonable
    looking WRONG answers can occur.
  • One alternative the FACT CONSTELLATION model
    (summary tables)

The biggest drawback of the level indicator is
that it limits flexibility (we may not know all
of the levels in the attribute hierarchies at
first). By limiting ourselves to only certain
levels, we force a physical structure that may
change, resulting in higher maintenance costs and
more downtime. The level concept is a useful
tool for very controlled data warehouses, that
is, those that either have no ad hoc users or at
least only those ad hoc users who are
knowledgably about the database. In particular,
when the results of queries are pre-formatted
reports or extracts to smaller systems, such as
data marts, the drawbacks of the level indicator
are not so evident.
26
The Fact Constellation Schema
District Fact Table
Region Fact Table
District_ID PRODUCT_KEY PERIOD_KEY
Region_ID PRODUCT_KEY PERIOD_KEY
Dollars Units Price
Dollars Units Price
27
The chart above is composed of all of the tables
from the Classic Star, plus aggregated fact
(summary) tables. For example, the Store
dimension is formed of a hierarchy of store-gt
district -gt region. The District fact table
contains ONLY data aggregated by district,
therefore there are no records in the table with
STORE_KEY matching any record for the Store
dimension at the store level. Therefore, when we
scan the Store dimension table, and select keys
that have district "Texas," they will only
match STORE_KEY in the District fact table when
the record is aggregated for stores in the Texas
district. No double (or triple, etc.) counting is
possible and the Level indicator is not needed.
These aggregated fact tables can get very
complicated, though. For example, we need a
District and Region fact table, but what level of
detail will they contain about the product
dimension? All of the following STORE/PROD
DISTRICT/PROD REGION/PROD STORE/BRAND
DISTRICT/BRAND REGION/BRAND STORE/MANUF
DISTRICT/MANUF REGION/MANUF And these are just
the combinations from two dimensions!
28
The Fact Constellation Schema
In the Fact Constellations, aggregate tables are
created separately from the detail, therefore,
it is impossible to pick up, for example, Store
detail when querying the District Fact Table.
Major Advantage No need for the Level
indicator in the dimension tables, since no
aggregated data is stored with lower-level
detail Disadvantage Dimension tables are still
very large in some cases, which can slow
performance front-end must be able to detect
existence of aggregate facts, which requires more
extensive metadata
29
Another Alternative to Level
  • Fact Constellation is a good alternative to the
    Star, but when dimensions have very high
    cardinality, the sub-selects in the dimension
    tables can be a source of delay.
  • Another drawback is that multiple SQL statements
    may be needed to answer a single question, for
    example measure the percent to total of a
    district to its region. Separate queries are
    needed to both the district and region fact
    tables, then some complex "stitching" together of
    the results is needed.
  • Once again, it is easy to see that even with its
    disadvantages, the Classic Star enjoys the
    benefit of simplicity over its alternatives.
  • An alternative is to normalize the dimension
    tables by attribute level, with each smaller
    dimension table pointing to an appropriate
    aggregated fact table, the Snowflake Schema ...

30
The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District
ID Region_ID
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
31
The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District ID District
Desc. Region_ID Region Desc. Regional Mgr.
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
The original Store Dimension table, completely
de-normalized, is kept intact, since certain
queries can benefit by its all-encompassing
content.
32
The Snowflake Schema
  • No LEVEL in dimension tables
  • Dimension tables are normalized by decomposing at
    the attribute level
  • Each dimension table has one key for each level
    of the dimensions hierarchy
  • The lowest level key joins the dimension table to
    both the fact table and the lower level attribute
    table

How does it work? The best way is for the query
to be built by understanding which summary levels
exist, and finding the proper snowflaked
attribute tables, constraining there for keys,
then selecting from the fact table.
33
Notice how the Store dimension table generates
subsets of records. First, all records from the
table (where level "District" in the Star) are
extracted, and only those attributes that refer
to that level (District Description, for example)
and the keys of the parent hierarchy (Region_ID)
are included in the table. Though the tables are
subsets, it is absolutely critical that column
names are the same throughout the schema.
The diagram above is a partial schema - it only
shows the "snowflaking" of one dimension. In
fact, the product and time dimensions would be
similarly decomposed as follows Product -
product -gt brand -gt manufacturer (color and size
are extended attribute characteristics of the
attribute "product," not part of the attribute
hierarchy) Time - day -gt month -gt quarter -gt
year
34
The Snowflake Schema
  • Additional features The original Store Dimension
    table, completely de-normalized, is kept intact,
    since certain queries can benefit by its
    all-encompassing content.
  • In practice, start with a Star Schema and create
    the snowflakes with queries. This eliminates
    the need to create separate extracts for each
    table, and referential integrity is inherited
    from the dimension table.

Advantage Best performance when queries involve
aggregation Disadvantage Complicated
maintenance and metadata, explosion in the number
of tables in the database
35
Aggregates
  • Add up amounts for day 1
  • In SQL SELECT sum(amt)
  • FROM sale
  • WHERE date 1

81
36
Aggregates
  • Add up amounts by day
  • In SQL SELECT date, sum(amt)
  • FROM sale
  • GROUP BY date

37
Aggregates (Another Example)
  • Add up amounts by day, product
  • In SQL SELECT date, sum(amt)
  • FROM sale
  • GROUP BY date, prodId

rollup
drill-down
38
Aggregates
  • Operators sum, count, max, min, median,
    ave
  • Having clause
  • In order to retrieve the values of these
    functions only for groups that satisfy certain
    conditions, we use HAVING.
  • SELECT date, sum(amt)
  • FROM sale
  • GROUP BY date
  • HAVING amt gt 20
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

39
  • ROLAP aggregation performance
  • D3 Stores 100
  • D2 Products 1000
  • D1 Days 10,000
  • Its given that on average, each store sells
    5,000 different products during the entire
    period.
  • Sparsity
  • Cube size 100 1,000 10,000 1,000,000,000
  • Actual data 100 5,000 500,000
  • Sparsity 0.05
  • Relational calculation
  • Data
  • Length of Fact Table r 500,000 records
  • Record size R 100 Bytes
  • Block size B 1024 Bytes
  • Block factor bfr 1024/100 10

40
  • Single Retrieval store, product, day
  • Binary search for index log2 1667 11 disk
    accesses
  • one data access total 12 disk accesses
  • 12 10msec 120 msec 0.12 sec

Join In order to see the total sales for a
particular month for a particular product
SELECT Sum ( SalesFact.SalesDollars) AS
SumOfSalesDollars FROM TimeDimension JOIN
(ProductDimension JOIN StoreDimension JOIN
SalesFact ON StoreDimension.StoreID
SalesFact.StoreID ON Product
Dimension.ProductID SalesFact.ProductID) ON
TimeDimension.TimeID SalesFact.TimeID WHERE Sto
reDimension.StoreName Dizengoff AND
ProductDimension.ProductName White Cheese
AND TimeDimension.Month3 AND
TimeDimension.Year1999
41
  • To perform a JOIN needed
  • Access (FT StoreDim. Result1Table
    ProductDim. Result2Table TimeDim.)
  • For simplicity lets assume only one FT access.
  • Monthly aggregation store, product, month
  • Making search for 1st month, 2nd , etc.
  • For each month 12 30 1000 100 36,000,000
    accesses
  • But there are 10 days per block, therefore we
    need only one access for every 10 days. It makes
    it 3.6 Million accesses per month.
  • Number of months 10,000/30 333
  • Total 333 3,600,000 1,200,000,000 accesses
  • 12,000,000 sec 3,300 hours!

42
Advantages of ROLAP Dimensional Modeling
  • Define complex, multi-dimensional data with
    simple model
  • Standard SQL tools
  • Allows the data warehouse to evolve with
    relatively low maintenance
  • HOWEVER! Star schema and relational DBMS are not
    the magic solution
  • Query optimization is still problematic
About PowerShow.com