Data%20Warehouse%20Modeling - PowerPoint PPT Presentation

About This Presentation
Title:

Data%20Warehouse%20Modeling

Description:

DBMS Magazine, April, 1995. OLAP VS OLTP. OLTP (RELATIONAL) ATOMIZED ... QUESTIONS LIKE WHAT WERE TOTAL SALES OF NUTS?' OR WHAT WERE TOTAL SALES FOR THE EAST? ... – PowerPoint PPT presentation

Number of Views:166
Avg rating:3.0/5.0
Slides: 81
Provided by: paul53
Category:

less

Transcript and Presenter's Notes

Title: Data%20Warehouse%20Modeling


1
Data Warehouse Modeling
  • CSSE 522
  • Instructor Paul Chen

2
Data Warehouse Modeling
  • PURPOSE AND FOCUS
  • APPROACH
  • LOGICAL DATA MODELING vs. DATA WAREHOUSE DATA
    MODELING
  • CRITERIA FOR SELECTING DATA FOR INCLUSION INTO A
    DW
  • TRANSFORM THE LOGICAL DATA MODEL INTO
  • THE DATA WAREHOUSE MODEL
  • DIMENSIONALITY MODELING
  • IMPLEMENT THE DW DATA MODEL

3
PURPOSE AND FOCUS
  • A WELL-DESIGN DATA WAREHOUSE MODEL
  • WILL PERFORM THE FOLLOWING TASKS
  • ANSWER THE BUSINESS USERS QUESTIONS AND SUPPORT
    THE KEY BUSINESS PROCESSES.
  • RECOGNIZE THE SIMILARITIES AND DIFFERENCES IN
    DATA STORED IN VARIOUS LEGACY SYSTEMS AND
    EXTRACT FROM THESE SYSTEMS ONLY PERTINENT DATA.

4
PURPOSE AND FOCUS (Contd)
  • CONTROL THE COLLECTION AND LOADING PROCESSES OF
    THE LEGACY SYSTEM.
  • MAP THE KEY BUSINESS PROCESSES TO CUSTOMER NEEDS.

5
APPROACH
  • Create the high level enterprise ERD
  • Develop logical data model for subject area only
  • Create data warehouse data model from LDM
  • Develop physical data model
  • The above is an iterative process user reviews
    are critical.

6
The First Iteration of the DW
  • Selecting the first subject area or areas to be
    populated
  • Use the enterprise level data model in
    selecting appropriate subject area(s)
  • Three Options
  • -- Implement a single subject area (best
    option)
  • -- Implement a subset of a subject area
  • -- Implement a subset of several subject areas
    (most common)
  • Determine how much data should be loaded and its
    variety

7
Logical Data Model vs. DW Data Model -Focus
  • LDW
  • Business operations
  • (e.g., what orders from
  • the two months backlog
  • are scheduled to be
  • shipped today?).
  • DWDW
  • Business intelligence
  • (e.g., how many orders,
  • by product, were shipped
  • more than two months
  • late over the past year?).

8
Logical Data Model vs. DW Data Model -Content
  • Transaction driven allow updates to reflect
    current business transactions dynamic
  • Analysis driven query primarily historical
    summarized derived

9
Logical Data Model vs. DW Data Model -Access
  • Structured queries
  • Ad hoc queries. Used to
  • drill-down into data for
  • analysis

10
Logical Data Model vs. DW Data Model -Table
  • Normalized
  • Organized around business rules
  • Element of time
  • Maybe specified
  • Repeating group
  • Shown only once
  • Denormalized
  • Organized around usage and stability
  • Must be specified
  • Can contain data arrays

11
Data Warehouse Models and Operators
  • Data Models
  • -- relations
  • -- stars snowflakes
  • -- cubes
  • Operators
  • -- slice dice
  • -- roll-up, drill down
  • -- pivoting

12
CRITERIA FOR SELECTING DATA FOR INCLUSION INTO A
DW
  • Subject Criteria
  • Time Based Criteria
  • Attribute Criteria

13
Subject Criteria
  • Subject areas are collections of like data that
    support
  • analysis of the major subjects in a business.
    Election
  • criteria
  • They consist of two or more attributes.
  • They are essential to the successful operation of
    the target system or business area to meet client
    objectives.
  • They can be defined by governing business rules.

14
Subject Criteria
  • Weight the Subject areas based upon the following
  • criteria
  • They exhibit measurable results to the Users and
    Management.
  • They are visible within the business and through
    management.
  • They are manageable.

15
OAG (Official Airlines Guide)
  • This subject contains airline schedule data
    containing
  • from/to airports, airline and flight numbers,
  • departure/arrival times, airline model and
    approximate
  • number and cargo capacity.
  • Data is drawn from the monthly operational
    schedules
  • of commercial airlines. Today, access to the OAG
    comes
  • in hard copy and electronic copy, via touch tone
    telephone
  • or downloaded to a workstation.

16
Time Based Criteria
  • Due to disk space constraint, data selected must
    be time
  • relevant in terms of trend, predictability, and
  • profitability for the enterprise.

17
Attribute Criteria
  • Attributes should be chosen for the following
    reasons
  • Their applicability for trend analysis
  • Their applicability for historical analysis
  • Their additivity
  • Their textual support of additive attributes

18
TRANSFORM THE LOGICAL DATA MODEL INTO DWDM
  • Remove purely operational data
  • Add an Element of Time to the key structure
  • Accommodate multiple hierarchies and classes
  • Add derived data
  • Add summarization schemes

19
Remove purely operational data
  • Staff TableStaff NoFname XLname
    XAddress XTel XSexDOBJob Title
    XTyping speedBranch No

X
20
ADD AN ELEMENT OF TIME TO THE KEY STRUCTURE
  • Time is probably the most common dimension in a
    multidimensional databases. It is used to project
    trends-sales trends, market trends, and so forth.
  • A series of numbers representing a particular
    variable (such as sales) over time is called a
    time series. (for ex. 52 weekly sales numbers for
    auto is a time-series).
  • Do not mix different periodicities in one
    dimension (A time series always has a particular
    periodicity, such as weekly, monthly, quarterly,
    yearly, and so on).

21
When do we keep time- series data?
  • When trends and patterns are desired
  • When comparisons are needed (e,g., last quarter
    to this quarter)
  • For example, Auto Sales information by month
    or by calendar year.

22
Accommodate multiple hierarchies and classes
  • DIMENSIONS are roughly equivalent to Fields in a
    relational database. In the relational table,
    there are fields called Product and Region..
    In the dimensional data, Product and region
    are both Dimension.
  • The single biggest factor in determining how many
    dimensions youll need for a particular database
    is the existence of multiple hierarchies and
    classes.

23
Accommodate multiple hierarchies and classes
  • If your OLAP server supports multiple hierarchies
    and
  • classes within one dimension, store them in one
  • dimension.
  • Classes are typically attributes such as size
    color and
  • other characteristics that define a subset of the
    members
  • of a dimension.

24
Accommodate multiple hierarchies and classes
  • For example
  • A common use for multiple hierarchies is in the
  • geographic dimension. (Sales Territory might roll
    up into
  • City, State and Region.)
  • For Classes, A car line might be defined by
    Model, Make,
  • and Series.

25
Simple Hierarchies (Roll up) Classes Within
Dimensions --Dimension Hierarchies
Region Total
Central
East
West
Chevrolet
make
model
Series
26
Multiple Levels of Hierarchies
27
Some OLAP servers support multiple hierarchies
within one dimension. One child can have many
parents.
Sales Region
State
City
Sales Zone
Dealer
28
Roll up
Without multiple hierarchies, the previous
database would have to be represented
with separate dimensions for each roll-up.
Region Zone Dealer
State City Dealer
29
ADD DERIVED DATA
  • Benefits
  • Less space used
  • Enhanced performance
  • Breaking_lease
  • Percentage_of_breaking_lease (lt 3_months)
  • Percentage_of_breaking_lease (gt3 but lt 6
    months)
  • Percentage_of_breaking_lease (gt6 but lt9
    months)
  • Percentage_of_breaking_lease (gt9 but lt12
    months)
  • Percentage_of_breaking_lease (gt 12 months)

30
Add summarization schemes
  • Simple summation
  • Summation by group
  • Aggregation
  • Vertical summarization

31
Simple Summation --Add summarization schema
  • Individual Daily sales
  • Date Product Qty Sales
  • Jan 1 nuts 100 300
  • Jan 1 nuts 200 600
  • Jan 2 nuts 300 900
  • Jan 2 nuts 100 300
  • Jan 3 Nuts 50 150
  • Jan 3 Nuts 40 120

Daily Sales Summary Date Product Qty
Sales Jan 1 Nuts 300 900 Jan 2
Nuts 400 1,200 Jan 3 Nuts 90
20
32
Summation by group
  • Group data attributes based on usage and
  • stability.
  • Group stable and slowly changing data all in one
    table
  • Group unstable and frequently changing data all
    in another table

33
Aggregation
  • Aggregation is used to create data marts.
  • For instance, a group of users frequently
    perform analysis comparing sales across
    geographic regions, broken by product line. If a
    data mart were created that stores the sales data
    already aggregated to the desired level, the
    users queries would be simpler.

34
Aggregation
  • Add up amounts by day
  • in sql SELECT date, sum (amt) FROM SALE
  • GROUP BY date

p
sale
Store
date
amt
ans
date
sum
p1 p2 p1 p2
1 1 2 1
1 2 4 3
c1 c2 c3 c1
6 4
1 2
Roll Up Drill Down
35
Aggregation
  • Operators sum, count, max, min, median, ave
  • HAVING clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

36
AGGREGATION THE KEY TO CONSISTENTLY FAST RESPONSE
  • PRODUCT EAST WEST CENTRAL TOTAL
  • NUTS 50 40 30
    120
  • SCREWS 60 50 60 170
  • BOLTS 100 120 80 300
  • WASHERS 90 100 40 230
  • TOTAL 300 310 210
    820

37
Multiple Reads Database Writes
  • In the above example, computing the totals
    involves 28 (44 34) database reads and eight
    database writes. A typical relational database
    can read about 200 records per second and writes
    perhaps 20 records per second. So consolidating
    this tiny database would take less than one
    second. However, for some larger tables,
    computing for totals could take days or even
    weeks to consolidate.

38
Vertical Summarization
  • Summarization building upon a single dimensional
  • theme
  • Monthly renters
  • Total of all renters
  • Total of new renters
  • Total rental income
  • Monthly sales
  • Staff name
  • Total sales
  • Total houses sold

39
Dimensionality Modeling
  • Star Schema Snowflake Schema
  • Whats OLAP?
  • OLTP vs. OLAP
  • Multi-dimensional Model, Data Operators
  • ROLAP vs. MOLAP

40
Star Schema Snowflake Schema
  • FACT TABLE A Fact Table is a table in a
    relational
  • database with a multi-part key. Each element
    of the key is itself a foreign key to a single
    dimension tale.
  • Star Schema (or Star Joint Schema)
  • A specific organization of a database in
    which a fact
  • table with a composite key is joined to a
    number of single-level dimension tables, each
    with a single, primary key
  • -- Kimball Ralph, Data Warehouse Toolkit ---

41
Star Schema Snowflake Schema (Contd)
  • Snowflake Schema
  • A variant of the star schema where each
    dimension can
  • have its dimensions.
  • Dimension Tables
  • They are the constraints used in forming the
    fact table.

42
A STAR JOIN SCHEMA(Based on Case study on Renton
Food Cooperative)
Times
Food Item
Sales
Food Item Key Food Item Desc Qty
time key day of week quarter year
Food Item Key Profile Key Time Key YTD_Sales_dolla
rs YTD_Sales_qty
YTD_
Member Profile
Profile key Profile desc Territory
Demographics
Demographic Key
Age category
Cluster 1 Population
Income category
Cluster 2 Population
43
Whats OLAP?
  • OLAP (ON-LINE ANALYTICAL PROCESSING)
    DESCRIBES A CLASS OF DATABASE SERVERS WHICH
    ARE DESIGNED FOR LIVE AND AD HOC DATA ACCESS AND
    ANALYSIS

44
Whats OLAP? (Contd)
  • On-Line Analytical Processing (OLAP) is a
    category of software technology that enables
    analysts, managers and executives to gain insight
    into data through fast, consistent, interactive
    access to a wide variety of possible views of
    information that has been transformed from raw
    data to reflect the real dimensions of the
    enterprise as understood by the user
  • -- DBMS Magazine, April, 1995

45
OLAP VS OLTP
  • OLTP (RELATIONAL)
  • ATOMIZED
  • PRESENT
  • RECORD-AT-A-TIME
  • PROCESS ORIENTED
  • OLAP (MULTIDIMENSIONAL)
  • SUMMARIZED
  • HISTORICAL
  • MANY RECORDS-AT-A-TIME
  • SUBJECT ORIENTED

46
OLAP VS OLTP (contd)
  • WHILE OLTP APPLICATIONS GENERALLY DO NOT
    REQUIRE HISTORICAL DATA, NEARLY EVERY OLAP
    APPLICATION IS CONCERNED WITH VIEWING TRENDS AND
    THEREFORE REQUIRES HISTORICAL DATA. OLTP
    APPLICATIONS AND DATABASE TEND TO BE ORGANIZED
    AROUND SPECIFIC PROCESSES (SUCH AS ORDER ENTRY),
    OLAP APPLICATIONS TEND TO BE SUBJECT-ORIENTED
    ANSWERING SUCH QUESTIONS AS WHAT PRODUCTS ARE
    SELLING WELL OR WHAT ARE MY WEAKEST SALES
    OFFICES?

47
A list of some important features supported by
some OLAP Servers
  • Special time-series data types
  • Special dimensions for variables (complex
    mathematical relationships, such as computed
    averages, and simultaneous equations)
  • Multiple hierarchies within a dimension
  • Classes with a dimension
  • Virtual variables (computed on the fly at run
    time, such as gross margin derived from
    revenues and expenses.

48
Multidimensional Model
  • In a multidimensional model, data is stored as
    Facts and Dimensions instead of rows and columns
  • Multidimensional DB (MDB)
  • A database in which the data is stored and
    managed in a multi-dimensional manner.

49
Relational Database vs. Multidimensional OLAP
Server
  • With a multidimensional OLAP server, we can
    perform the same consolidation with row and
    column arithmetic. Whereas a relational database
    can access a few hundred records per second, a
    good OLAP server should be capable of
    consolidating 20,000 to 30,000 cells (equivalent
    to records in the relational table) per second,
    including the time to write the totals to the
    database. The multidimensional OLAP database will
    take up less space since the names of the regions
    and products are not repeated in the
    multidimensional database as they are in the
    relational database.

50
WHAT IS MULTIDIMENSIONAL DATA?
  • RELATIONAL DATABASES ARE ORGANIZED AROUND A LIST
    OF RECORDS. EACH RECORD CONTAINS RELATED
    INFORMATION WHICH ARE ORGANIZED INTO FIELDS.
  • CUS NAME CUSTOMER TELEPHONE ADDRESS
  • JACK 10001 345-4444 40
    MAIN
  • WALTER 10002 345-6666 30
    ELM
  • HOOVER 10003 345-8588 6
    BELLRED
  • THIS TABLE HAS ONLY ONE DIMENSION.

51
WHAT IS MULTIDIMENSIONAL DATA?
  • LOOKING AT CUSTOMER BY TELEPHONE OR
  • TELEPHONE BY CUSTOMER ONLY PRODUCES
  • A ONE-FOR-ONE CORRESPONDENCE.

52
WHAT IS MULTIDIMENSIONAL DATA?
  • Lets take a look at an example of a relational
    table where
  • there is more than one-to-one correspondence
    between the
  • fields.
  • In the following table, we have sales data for
    each product
  • in each region-- four products sold in three
    regions. The
  • Sales data is a two-dimensional matrix (Product
    and
  • Region).

53
(No Transcript)
54
A much better way to represent the data
Two-Dimensional Matrix
  • PRODUCT EAST WEST CENTRAL
  • NUTS 50 40 30
  • SCREWS 60 50 60
  • BOLTS 100 120 80
  • WASHERS 90 100 40

55
QUERY ON MULTIDIMENSIONAL DATA
  • QUESTIONS LIKE WHAT WERE TOTAL SALES OF
    NUTS? OR WHAT WERE TOTAL SALES FOR THE EAST? TO
    FIND THE ANSWER IN THE TWO DIMENSIONAL TABLE,
    JUST FIND THE CELL CALLED EAST AND ADD UP ALL
    THE NUMBERS IN THE COLUMN.

56
MULTIDIMENSIONAL DATA ---Cube
  • Fact table view
    multi-dimensional cube

c1
c2
p
c3
sale
Store
date
amt
p1
4
Day 2
p1 p2 p1 p2
1 1 2 1
1 2 4 3
c1 c2 c3 c1
p2
C3
C1
C2
Day 1
1
P1 P2
3
2
57
MULTIDIMENSIONAL DATA ---Cube (contd)
  • Fact table view
    multi-dimensional cube

amt
sale
prodid
c3
storeid
c1
c2
p1 p2
p1 p2 p1 p2
12 11
50
c1 c1 c3 c2
12 11 50 8
8
58
MULTIDIMENSIONAL DATA ---Slice Dice
  • Slice and Dice the ability to rotate the data
    cube for multidimensional reporting
  • Cells roughly equivalent to Records in a
    relational
  • database.

PRODUCT EAST WEST CENTRAL NUTS 50
40 30 SCREWS 60 50
60 BOLTS 100 120 80 WASHERS 90 100
40
59
Slice and Dice (Pivoting)
  • A three-dimensional array has a total of six
    faces, or views. A four-dimensional array has
    twelve views. An n-dimensional array has n(n-1)
    views. The ability to rotate the data cube is the
    main technology for multi-dimensional reporting
    and is sometimes called slice and dice.

Region
Product
60
Practical Limitations on Database Size
  • In general, as the number of dimensions
    increases, the number of cells in the database
    increases exponentially.
  • for ex., a two-dimensional database with 100
    products and 100 regions would have 10,000 cells.
    If we add a third dimension for time with 52
    weeks, we now have 520,000 cells.
  • Most commercial OLAP servers hit the cell
    limit long before they run out of dimensions.

61
ROLAP vs. MOLAP
  • Two Prevailing Strategies for Data Storage and
    Access
  • ROLAP (relational)
  • Layer multidimensional metadata on top of an
    RDBMS
  • Provide external calculation facility to
    compensate SQL limitations
  • MOLAP (multidimensional)
  • Store data in multidimensional arrays
  • Contain internal multidimensional calculation
    engine

62
Multi-dimensional Database
total
5 Regions
52 States
5,000 Cities
Relational Database
50,000 customers
63
IMPLEMENT THE DW DATA MODEL
  • Issues to consider
  • Split fact tables -- if table is too large
  • -- Vertical or Horizontal segmentation
  • Denormalize dimension tables if joins are
    excessive
  • Stripe tables across disks if not using RAID
  • DBAs Issues
  • -- Data Recovery Technology
  • -- Data Distribution
  • -- Indexing Strategies

64
Data Recovery Technology
  • RAID (Redundant Arrays of Independent Disks)
  • RAID level 0
  • RAID level 1
  • RAID level 2
  • RAID level 3
  • RAID level 4
  • RAID level 5
  • Most popular RAID Usage

65
RAID 0
  • No data redundancy
  • Data divided and striped (sector interleaved
    across groups of drives without parity) equally
  • Ability to read/write to multiple disks at one
    time

Disk2
Disk3
Disk1
66
Stripe the BLOB (Binary, large Objects) Across
Several Disks
Block Header
record 2-a
record 1
record 2-b
record 3
67
RAID 0
Advantages
Disadvantages
  • High I/O rate
  • Transfer time is proportional to number of
    devices in array
  • Lowest cost solution
  • Read/write to multiple
  • disks at once
  • No provision for data redundancy
  • Lowest reliability

68
RAID 1
  • Hardware level disk mirroring
  • No data striping
  • Highest level of reliability of RAID devices
  • Most expensive solution

Disk1 Mirror
Disk2 Mirror
Disk1
Disk2
69
Data Distribution
  • Vertical Segmentation
  • Horizontal Segmentation
  • Parallel Processing
  • Data Replication

70
Vertical Segmentation

Branch_id PKSchool_id PK Month_yr School_nameSc
hool_Address
  • Separate attributes into other tables

Ref School Branch Branch_id PKSchool_id
PK Month_yr School_nameSchool_Address
Number_of_GraduatesNumber_of_underGraduates
Semaster_Tuition
Branch_id PKSchool_id PK Month_yr Number_of_Grad
uatesNumber_of_underGraduates Semaster_Tuition
71
Vertical Segmentation
  • Separate attributes in other tables
  • Overhead of shared locks may be reduced
  • Table scans can be faster
  • Could cause excessive joins

72
Horizontal Segmentation
  • Separate subset of data to another table
  • For example, separate yearly sales data into
    tables
  • containing only monthly data
  • Using UNION to query multiple tables.

73
Horizontal Segmentation
  • Separate subsets of data to another table (Jan,
    Feb, ..)
  • Multiple queries of multiple tables (UNION)
  • Breaking up tables will speed table scans

74
Parallel processing
  • Types
  • -- Tightly Coupled Systems
  • -- Loosely Coupled Systems
  • -- Massively Parallel Processing

75
Tightly Coupled Systems Symmetric Multiprocessing
(SMP)
  • Accessed through one memory pool
  • Multiple CPUs
  • Performance is limited by bandwidth of the
    network
  • Load Balancing
  • Shared operating systems
  • Shared access to common set of disks

76
SMP
Memory Operating System
CPU
CPU
CPU Master
CPU
Shared Disks
77
Loosely Coupled Systems
  • Each node has 1-n CPUs
  • Memory not shared between nodes
  • Communication over high-speed bus
  • Any node can be a symmetric multiprocessing

78
Massively Parallel Processing(MPP)
  • Supports thousands of nodes
  • Low cost/processing ration
  • Memory is not shared
  • All nodes can access I/O devices
  • Operating system is not shared

79
MPP
High Speed Network
Memory
Memory
Memory
CPU
CPU
CPU
Shared Disks
Non-shared Disks
or
80
Indexing Strategies
  • Bitmap Index
  • Join Index
  • R-tree Index
  • B-tree Index
Write a Comment
User Comments (0)
About PowerShow.com