Data Warehousing and Data Mining - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

Data Warehousing and Data Mining

Description:

Decision makers require access to all data ... query performance can be speeded up by denormalising into a single dimension table ... – PowerPoint PPT presentation

Number of Views:76
Avg rating:3.0/5.0
Slides: 27
Provided by: sues5
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing and Data Mining


1
Data Warehousing and Data Mining
  • Sue Patience/David Nelson
  • April 2004

2
Contents
  • Data Warehousing
  • Data Mining
  • Further Reading

3
Data Warehousing
  • OLTP (online transaction processing) systems
  • range in size from megabytes to terabytes
  • high transaction throughput
  • Decision makers require access to all data
  • A data warehouse is a subject-oriented,
    integrated, time-variant and non-volatile
    collection of data in support of managements
    decision-making process (Inmon 1993)

4
Benefits
  • Potential high returns on investment
  • 90 of companies in 1996 reported return of
    investment (over three years) of gt 40
  • Competitive advantage
  • Data can reveal previously unknown, unavailable
    and untapped information
  • Increased productivity of corporate
    decision-makers
  • Integration allows more substantive, accurate and
    consistent analysis

5
Comparison
Source Connolly Begg, p 1049
6
Data Warehouses
  • Types of Data
  • Detailed
  • Summarised
  • Meta-data
  • Archive
  • Back-up

7
Architecture
Mainframe operational data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly Begg p1053
8
Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Operational data source n
Archive/backup
Source Connolly Begg p1058
9
Information Flow Processes
  • Five primary information flows
  • Inflow - extraction, cleansing and loading of
    data from source systems into warehouse
  • Upflow - adding value to data in warehouse
    through summarizing, packaging and distributing
    data
  • Downflow - archiving and backing up data in
    warehouse
  • Outflow - making data available to end users
  • Metaflow - managing the metadata

10
Problems of Data Warehousing
  • Underestimation of resources for data loading
  • Hidden problems with source systems
  • Required data not captured
  • Increased end-user demands
  • Data homogenization
  • High demand for resources
  • Data ownership
  • High maintenance
  • Long duration projects
  • Complexity of integration

11
Data Warehousing Tools Technologies
  • Extraction, Cleansing Transformation Tools
  • Code generators
  • Database data replication tools
  • Dynamic transformation engines

12
Data Warehouse Requirements
  • Data Warehouse DBMS
  • Load performance Processing
  • Data Quality management
  • Query performance
  • Scalability
  • Advanced Query functionality

13
Data Warehouse Design
  • Data must be designed to allow ad-hoc queries to
    be answered with acceptable performance
    constraints
  • Queries usually require access to factual data
    generated by business transactions
  • e.g. find the average number of properties rented
    out with a monthly rent greater than 700 at each
    branch office over the last six months
  • Uses Dimensionality Modelling

14
Dimensionality Modelling
  • Similar to E-R modelling but with constraints
  • composed of one fact table with a composite
    primary key
  • dimension tables have a simple primary key which
    corresponds exactly to one foreign key in the
    fact table
  • uses surrogate keys based on integer values
  • Can efficiently and easily support ad-hoc
    end-user queries

15
Star Schemas
  • The most common dimensional model
  • A fact table surrounded by dimension tables
  • Fact tables
  • contains FK for each dimension table
  • large relative to dimension tables
  • read-only
  • Dimension tables
  • reference data
  • query performance can be speeded up by
    denormalising into a single dimension table
  • Normally create a dimension table for time

16
E-R Model Example
17
Star Schema Example
18
Data Mining
  • The process of extracting valid, previously
    unknown, comprehensible and actionable
    information from large databases and using it to
    make crucial business decisions. (Simoudis 1996)
  • focus is to reveal information which is hidden or
    unexpected
  • patterns and relationships are identified by
    examining the underlying rules and features of
    the data
  • work from data up
  • require large volumes of data

19
Data Mining Techniques
  • Four Operations
  • predictive modelling
  • database segmentation
  • link analysis
  • deviation direction

20
Data Mining Techniques
  • Predictive Modelling
  • using observations to form a model of the
    important characteristics of some phenomenon.
  • Two techniques-
  • Classification
  • Value prediction

21
Classification Example Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
22
Data Mining Techniques
  • Database Segmentation
  • to partition a database into an unknown number of
    segments (or clusters) of records which share a
    number of properties

23
Data Mining Techniques
  • Link Analysis
  • establish associations between individual records
    (or sets of records) in a database
  • e.g. when a customer rents property for more
    than two years and is more than 25 year olds,
    then in 40 of cases, the customer will buy the
    property
  • Techniques
  • Association discovery
  • Sequential pattern discovery
  • Similar time sequence discovery

24
Data Mining Techniques
  • Deviation Detection
  • identify outliers, something which deviates
    from some known expectation or norm

25
Mining and Warehousing
  • Data mining needs single, separate, clean,
    integrated, self-consistent data source
  • Data warehouse well equipped
  • populated with clean, consistent data
  • contains multiple sources
  • utilises query capabilities
  • capability to go back to data source

26
Further Reading
  • Connolly and Begg, 3rd edition, chapters 30, 31
    and 32.
  • Connolly and Begg, 4th edition, chapters 31 - 34.
  • W H Inmon, Building the Data Warehouse, New York,
    Wiley and Sons, 1993.
  • Benyon-Davies P, Database Systems (2nd ed),
    Macmillan Press, 2000, ch 34, 35 36.
Write a Comment
User Comments (0)
About PowerShow.com