What%20is%20a%20Data%20Warehouse - PowerPoint PPT Presentation

About This Presentation
Title:

What%20is%20a%20Data%20Warehouse

Description:

data needs to be stored in the DW in a singular, globally-acceptable fashion ... ability of many application designers to create inconsistent applications is ... – PowerPoint PPT presentation

Number of Views:90
Avg rating:3.0/5.0
Slides: 34
Provided by: sasade
Category:
Tags: 20data | 20warehouse | 20a | 20is

less

Transcript and Presenter's Notes

Title: What%20is%20a%20Data%20Warehouse


1
What is a Data Warehouse
  • by W. H. Inmon
  • http//www.cait.wustl.edu/cait/papers/prism/vol1_n
    o1/

2
What is a Data Warehouse?
  • A data warehouse is a
  • subject-oriented,
  • integrated,
  • time-variant,
  • nonvolatile,
  • collection of data in support of management's
    decision making process
  • The data comes from the operational environment
  • The data warehouse is always a physically
    separate store

3
Difference Between Operational Systems and Data
and the Data Warehouse (DW)
  • DW is oriented around the major subjects of the
    enterprise
  • The data-driven, subject orientation is in
    contrast to the more classical process/functional
    orientation of applications
  • The DW world focuses on data modeling and
    database design exclusively
  • DW data excludes data that will not be used for
    DSS processing
  • DW data spans a spectrum of time and the
    relationships found in the data warehouse are many

4
The data warehouse has a strong subject
orientation
Operational
Data warehouse
Customer
Loans
Vendor
Savings
Product
Bank card
Activity
Trust
An application orientation
A subject orientation
5
Integration
  • Data found within the DW is integrated
  • ALWAYS
  • WITH NO EXCEPTIONS
  • consistent naming conventions
  • consistent measurement of variables
  • consistent encoding structures
  • consistent physical attributes of data
  • data needs to be stored in the DW in a singular,
    globally-acceptable fashion

6
When data is moved to the DW from the
application-oriented operational environment, the
data is integrated before entering the DW
Operational
Data warehouse
appl A - m, f appl B - 1, 0 appl C - x, y appl D
- male, female
m, f
pipeline cm
appl A - pipeline cm appl B - pipeline
inches appl C - pipeline mcf appl D - pipeline yds
balance dec fixed (13,2)
appl A - balance dec fixed (13,2) appl B -
balance pic 9(9)v99 appl C - balance dec fixed
(11,0) appl D - balance pic s9(7)v99 comp 3
appl A - description appl B - description appl C
- description appl D - description
description
appl A - bal-on-hand appl B - current-balance appl
C - cash-in-house appl D - balance
balance
appl A - date (Julian) appl B - date
(yymmdd) appl C - date (mmddyy) appl D - date
(absolute)
date (Julian)
7
Integration
  • The collective ability of many application
    designers to create inconsistent applications is
    legendary
  • The integration affects almost every aspect of
    design - the physical characteristics of data,
    the dilemma of having more than one source of
    data, the issue of inconsistent naming standards,
    inconsistent date formats, and so forth

8
Time Variancy
  • All data in the data warehouse is accurate as of
    some moment in time (i.e., not "right now")
  • In the operational environment data is accurate
    as of the moment of access
  • Data found in the warehouse is said to be "time
    variant

9
Time Variancy
Operational
Data warehouse
  • Current value data
  • time horizon -- 60 - 90 days
  • key may or may not have an
  • element of time
  • data can be updated
  • Snapshot data
  • time horizon -- 5 - 10 years
  • key contains an element of
  • time
  • once snapshot is made,
  • record cannot be updated

10
Nonvolatile
Change
Replace
Insert
Insert
Load
Replace
Replace
Access
Change
Operational
Data warehouse
Data is updated on a record-by-record basis
regularly
Data is loaded into the warehouse and is accessed
there, but once the snapshot of data is made, the
data in the warehouse does not change
11
Nonvolatile
  • The basic manipulation of data that occurs in the
    data warehouse is simple
  • There are only two kinds of operations
  • the initial loading of data
  • the access of data
  • There is no update of data
  • The need to be cautious of the update anomaly is
    no factor
  • Liberties can be taken to optimize the access of
    data

12
Nonvolatile
  • Another consequence is in the technology
  • Technologies to support
  • record-by-record update in an on-line mode
  • backup and recovery
  • transaction and data integrity
  • detection and remedy of deadlock
  • are quite complex and unnecessary for data
    warehouse processing
  • DW environment is VERY, VERY different from the
    classical operational environment

13
Nonvolatile
  • The source of nearly all data warehouse data is
    the operational environment
  • It is a temptation to think that there is massive
    redundancy of data between the two environments
  • In fact there is a MINIMUM of data redundancy
  • data is filtered much data never passes out of
    the operational environment
  • the time horizon of data is very different
  • the data warehouse contains summary data
  • data undergoes a fundamental transformation as it
    passes into the data warehouse

14
The Structure of the Warehouse
  • Data warehouses have a distinct structure
  • Different components of the data warehouse are
  • meta data
  • current detail data
  • older detail data
  • lightly summarized data
  • highly summarized data
  • The major concern is the current detail data
  • the most recent happenings are always of great
    interest
  • voluminous, stored at the lowest level of
    granularity
  • disk storage is fast to access but expensive and
    complex to manage

15
There are different levels of summarization and
detail that demark the data warehouse
Highly summarized
Lightly summarized
META DATA
Current data
Older detail data
16
The Structure of the Warehouse
  • Older detail data is stored on some form of mass
    storage
  • it is infrequently accessed
  • it is stored at a level of detail consistent with
    current detailed data
  • Lightly summarized data is distilled from the low
    level of detail found at the current detailed
    level
  • it is almost always stored on disk storage
  • the design issues are
  • what unit of time is the summarization done over
  • what attributes will the lightly summarized data
    contain

17
The Structure of the Warehouse
  • Highly summarized data is compact and easily
    accessible
  • Meta data plays a special and very important role
    in the data warehouse
  • It is used as
  • a directory to help locate the contents
  • a guide to the mapping of data as the data is
    transformed from the operational to the DW
    environment
  • a guide to the algorithms used for summarization

18
An example of the levels of summarization that
might be found in the data warehouse
national sales by month 1988-1996
monthly sales by product line 1993-1996
national sales by week 1986-1996
weekly sales by subproduct 1988-1996
META DATA
sales detail 1995-1996
sales detail 1985-1994
19
An Example of the Data Warehouse
  • Old sales detail is that detail about sales that
    is older than 1995
  • The current value detail contains data from 1995
    to 1996
  • The sales detail is summarized weekly by
    subproduct line and by region to produce the
    lightly summarized stores of data
  • The weekly sales detail is further summarized
    monthly along even broader lines to produce the
    highly summarized data
  • Meta data contains (at the least!)
  • the structure of the data
  • the algorithms used for summarization
  • the mapping from the operational environment to
    the data warehouse

20
Old Detail Storage Medium
  • A wide variety of storage media that should be
    considered for storing older detail data
  • photo optical storage
  • CD-ROM
  • micro fiche
  • magnetic tape
  • mass storage
  • It is entirely likely that other storage media
    will serve the needs

21
The Flow of Data Inside the Data Warehouse
Summarization process
Operational environment
Aging process
22
Flow of Data
  • As data enters the data warehouse from the
    operational environment, it is transformed
  • Upon entering the data warehouse, data goes into
    the current detail level of detail
  • It resides there and is used there until one of
    three events occurs
  • it is purged
  • it is summarized, and/or
  • it is archived

23
The higher the levels of summarization, the more
the usage of the data
24
Summarized Data
  • The more summarized the data, the quicker and
    more efficient it is to get to the data
  • The DSS analyst in a pre-data warehouse
    environment has used data at the detailed level
  • One of the tasks of the data architect is to wean
    the DSS user from constantly using data at the
    lowest level of detail
  • installing a chargeback system
  • pointing out very good response time when dealing
    with data at a high level of summarization

25
Other Considerations
  • Data at the higher levels of summarization can be
    freely indexed
  • Data at the lower levels of detail is so
    voluminous that it can be indexed sparingly
  • The data model and formal design applies almost
    exclusively to the current level of detail
  • The data modeling activities do not apply to the
    levels of summarization

26
Indexes and Data Model
Data model
27
Partitioning of DW Data
  • Partitioning can be done in two ways
  • at the DBMS level
  • the DBMS is aware of the partitions and manages
    them accordingly
  • the automatic management of the partitions is
    inflexible
  • at the application level
  • the responsibility for the management of the
    partitions is left up to the programmer
  • provides flexibility in the management of data in
    the data warehouse

28
Current detail data is almost always partitioned
29
The internal structuring of data in a sample data
warehouse
current detailed data
customer history
order/customer
88 - present
part
part/order
part
part
part/assembly
parts shipments
parts bill of material
assembly history
93
95
94
92
91
87 - present
30
An Example of a Data Warehouse
  • The levels of summarization are not shown, nor is
    the old detail archive shown
  • There are tables of the same type divided over
    time
  • For different types of tables there are different
    units of time physically dividing the units of
    data
  • Different tables are linked by means of a common
    identifier

31
Other Anomalies
  • Public summary data is summary data that has been
    calculated outside the boundaries of the data
    warehouse but is used throughout the corporation
  • Another anomaly is that of external data
  • Another exceptional type of data sometimes found
    in a data warehouse is that of permanent detail
    data stored for ethical or legal reasons
  • the medium the data is stored on must be as
    safety proof as possible
  • the data must be able to be restored
  • the data needs special treatment in the indexing
    to be accessible

32
Summary
  • A data warehouse is a subject-oriented,
    integrated, time-variant, nonvolatile collection
    of data in support of management's decision needs
  • There are four levels of data warehouse data
  • old detail
  • current detail
  • lightly summarized data
  • highly summarized data
  • Meta data is also an important part of the data
    warehouse environment

33
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com