Data Warehousing and Data Mining - PowerPoint PPT Presentation

1 / 40
About This Presentation
Title:

Data Warehousing and Data Mining

Description:

range in size from megabytes to terabytes. high transaction throughput ... 'A data warehouse is a subject-oriented, integrated, time-variant and non ... – PowerPoint PPT presentation

Number of Views:1186
Avg rating:3.0/5.0
Slides: 41
Provided by: DavidN161
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing and Data Mining


1
Data Warehousing and Data Mining
  • David Nelson
  • May 2006

2
Contents
  • Data Warehousing
  • Benefits
  • Architecture
  • Data Warehouse Design
  • Data Analysis Techniques
  • OLAP
  • Data Mining
  • Temporal Databases
  • Further Reading

3
Data Warehousing
  • OLTP (online transaction processing) systems
  • range in size from megabytes to terabytes
  • high transaction throughput
  • Decision makers require access to all data
  • Historical and current
  • 'A data warehouse is a subject-oriented,
    integrated, time-variant and non-volatile
    collection of data in support of managements
    decision-making process' (Inmon 1993)

4
Benefits
  • Potential high returns on investment
  • 90 of companies in 1996 reported return of
    investment (over 3 years) of gt 40
  • Competitive advantage
  • Data can reveal previously unknown, unavailable
    and untapped information
  • Increased productivity of corporate
    decision-makers
  • Integration allows more substantive, accurate and
    consistent analysis

5
Comparison
Source Connolly and Begg p1153
6
Typical Architecture
Mainframe operational n/w,h/w data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly and Begg p1157
7
Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Operational data source n
Archive/backup
Source Connolly and Begg p1162
8
Information Flow Processes
  • Five primary information flows
  • Inflow - extraction, cleansing and loading of
    data from source systems into warehouse
  • Upflow - adding value to data in warehouse
    through summarizing, packaging and distributing
    data
  • Downflow - archiving and backing up data in
    warehouse
  • Outflow - making data available to end users
  • Metaflow - managing the metadata

9
Problems of Data Warehousing
  • Underestimation of resources for data loading
  • Hidden problems with source systems
  • Required data not captured
  • Increased end-user demands
  • Data homogenization
  • High demand for resources
  • Data ownership
  • High maintenance
  • Long duration projects
  • Complexity of integration

10
Data Warehouse Design
  • Data must be designed to allow ad-hoc queries to
    be answered with acceptable performance
    constraints
  • Queries usually require access to factual data
    generated by business transactions
  • e.g. find the average number of properties rented
    out with a monthly rent greater than 700 at each
    branch office over the last six months
  • Uses Dimensionality Modelling

11
E-R Model Example
12
Dimensionality Modelling
  • Similar to E-R modelling but with constraints
  • composed of one fact table with a composite
    primary key
  • dimension tables have a simple primary key which
    corresponds exactly to one foreign key in the
    fact table
  • uses surrogate keys based on integer values
  • Can efficiently and easily support ad-hoc
    end-user queries

13
Star Schemas
  • The most common dimensional model
  • A fact table surrounded by dimension tables
  • Fact tables
  • contains FK for each dimension table
  • large relative to dimension tables
  • read-only
  • Dimension tables
  • reference data
  • query performance speeded up by denormalising
    into a single dimension table

14
Star Schema Example
15
Other Schemas
  • Snowflake schemas
  • variant of star schema
  • each dimension can have its own dimensions
  • Starflake schemas
  • hybrid structure
  • contains mixture of (denormalised) star and
    (normalised) snowflake schemas

16
Data Analysis Techniques
  • Various end-user tools are available to interact
    with the data warehouse for retrieving
    information
  • Two of the most popular methods are
  • OLAP
  • Data mining tools

17
OLAP
  • Online Analytical Processing
  • dynamic synthesis, analysis and consolidation of
    large volumes of multi-dimensional data
  • normally implemented using specialized
    multi-dimensional DBMS
  • a method of visualising and manipulating data
    with many inter-relationships
  • Support common analytical operations such as
  • consolidation
  • drill-down
  • slicing and dicing

18
Codds OLAP Rules
  • 1. Multi-dimensional conceptual view
  • 2. Transparency
  • 3. Accessibility
  • 4. Consistent reporting performance
  • 5. Client-server architecture
  • 6. Generic dimensionality
  • 7. Dynamic sparse matrix handling
  • 8. Multi-user support
  • 9. Unrestricted cross-dimensional operations
  • 10. Intuitive data manipulation
  • 11. Flexible reporting
  • 12. Unlimited dimensions and aggregation levels

19
OLAP Tools
  • Categorised according to architecture of
    underlying database
  • Multi-dimensional OLAP
  • data typically aggregated and stored according to
    predicted usage
  • use array technology
  • Relational OLAP
  • use of relational meta-data layer with enhanced
    SQL
  • Managed Query Environment
  • deliver data direct from DBMS or MOLAP server to
    desktop in form of a datacube

20
MOLAP
RDB Server
MOLAP server
Request
Result
Load
Database/Application Logic Layer
Presentation Layer
21
ROLAP
ROLAP server
Request
SQL
RDB Server
Result
Result
Database Layer
Presentation Layer
Application Logic Layer
22
MQE
End-user tools
RDB Server
SQL
Result
MOLAP server
Request
Load
Result
23
Data Mining
  • The process of extracting valid, previously
    unknown, comprehensible and actionable
    information from large databases and using it to
    make crucial business decisions (Simoudis, 1996)
  • focus is to reveal information which is hidden or
    unexpected
  • patterns and relationships are identified by
    examining the underlying rules and features of
    the data
  • work from data up
  • require large volumes of data

24
Example Data Mining Applications
  • Retail/Marketing
  • Identifying buying patterns of customers
  • Finding associations among customer demographic
    characteristics
  • Predicting response to mailing campaigns
  • Market basket analysis

25
Example Data Mining Applications
  • Banking
  • Detecting patterns of fraudulent credit card use
  • Identifying loyal customers
  • Predicting customers likely to change their
    credit card affiliation
  • Determining credit card spending by customer
    groups

26
Data Mining Techniques
  • Four main techniques
  • predictive modelling
  • database segmentation
  • link analysis
  • deviation direction

27
Data Mining Techniques
  • Predictive Modelling
  • using observations to form a model of the
    important characteristics of some phenomenon
  • Techniques
  • Classification
  • Value Prediction

28
Classification Example Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
29
Data Mining Techniques
  • Database Segmentation
  • to partition a database into an unknown number of
    segments (or clusters) of records which share a
    number of properties
  • Techniques
  • Demographic clustering
  • Neural clustering

30
Database Segmentation Scatterplot Example
31
Data Mining Techniques
  • Link Analysis
  • establish associations between individual records
    (or sets of records) in a database
  • e.g. when a customer rents property for more
    than two years and is more than 25 years old,
    then in 40 of cases, the customer will buy the
    property
  • Techniques
  • Association discovery
  • Sequential pattern discovery
  • Similar time sequence discovery

32
Data Mining Techniques
  • Deviation Detection
  • identify outliers, something which deviates
    from some known expectation or norm
  • Statistics
  • Visualisation

33
Deviation Detection Visualisation Example
34
Mining and Warehousing
  • Data mining needs single, separate, clean,
    integrated, self-consistent data source
  • Data warehouse well equipped
  • populated with clean, consistent data
  • contains multiple sources
  • utilises query capabilities
  • capability to go back to data source

35
Temporal Databases
  • Normal databases contain current data only
  • Updates occur as propositions become untrue
  • Database systems, especially when supporting data
    warehouses, need to include features for handling
    temporal data
  • Temporal databases
  • Contain historical data (can include future data)
  • Rarely updated
  • SQL-92 contains some support for dates, times,
    intervals and periods
  • TSQL temporal SQL
  • SQL2003 contains extended support for time
    periods

36
Temporal Database Types
  • Two common types of temporal database
  • Valid Time
  • Time that the event occurred
  • Transaction Time
  • Time that the event was stored in the database
  • Bi-temporal
  • Contains both Valid Time and Transaction Time

37
Temporal Queries
  • Operations
  • Temporal Selection
  • Temporal Projection
  • Temporal Join
  • Temporal Union
  • Temporal Intersect
  • Predicates
  • Precedes
  • Overlaps
  • Contains

38
Oracle Temporal Features
  • Oracle supports the following data types
  • TIMESTAMP
  • TIMESTAMP WITH LOCAL TIMEZONE
  • INTERVAL YEAR TO MONTH
  • INTERVAL DAY TO SECOND
  • Simple examples are given in the example sheet on
    the module web page

39
Summary
  • Data warehouses support data (including
    historical data) for decision support analysis
  • OLAP and data mining are tools which can be used
    with data warehouses
  • Databases have added extensions such as temporal
    SQL and OLAP tools which enable databases to
    better support data warehousing applications

40
Further Reading
  • Connolly and Begg, chapters 31 to 34.
  • W H Inmon, Building the Data Warehouse, New York,
    Wiley and Sons, 1993.
  • Benyon-Davies P, Database Systems (2nd ed),
    Macmillan Press, 2000, ch 34, 35 36.
Write a Comment
User Comments (0)
About PowerShow.com