Difference between Data Warehouse and Data Mining? PowerPoint PPT Presentation

presentation player overlay
About This Presentation
Transcript and Presenter's Notes

Title: Difference between Data Warehouse and Data Mining?


1
(No Transcript)
2
  • Data Warehousing
  • OLAP
  • Data Mining
  • Further Reading

Enroll Now https//goo.gl/QbTVal
3
Data Warehousing
  • OLTP (online transaction processing) systems
  • range in size from megabytes to terabytes
  • high transaction throughput
  • Decision makers require access to all data
  • Historical and current
  • 'A data warehouse is a subject-oriented,
    integrated, time-variant and non-volatile
    collection of data in support of managements
    decision-making process' (Inmon 1993)

Enroll Now https//goo.gl/QbTVal
4
Benefits
  • Potential high returns on investment
  • 90 of companies in 1996 reported return of
    investment (over 3 years) of gt 40
  • Competitive advantage
  • Data can reveal previously unknown, unavailable
    and untapped information
  • Increased productivity of corporate
    decision-makers
  • Integration allows more substantive, accurate and
    consistent analysis

5
Typical Architecture
Mainframe operational n/w,h/w data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly and Begg p1157
6
Data Warehouses
  • Types of Data
  • Detailed
  • Summarised
  • Meta-data
  • Archive/Back-up

Enroll Now https//goo.gl/QbTVal
7
Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Archive/backup
Operational data source n
Source Connolly and Begg p1162
8
Information Flow Processes
  • Five primary information flows
  • Inflow - extraction, cleansing and loading of
    data from source systems into warehouse
  • Upflow - adding value to data in warehouse
    through summarizing, packaging and distributing
    data
  • Downflow - archiving and backing up data in
    warehouse
  • Outflow - making data available to end users
  • Metaflow - managing the metadata

9
Problems of Data Warehousing
  1. Underestimation of resources for data loading
  2. Hidden problems with source systems
  3. Required data not captured
  4. Increased end-user demands
  5. Data homogenization
  6. High demand for resources
  7. Data ownership
  8. High maintenance
  9. Long duration projects
  10. Complexity of integration

10
Data Warehouse Design
  • Data must be designed to allow ad-hoc queries to
    be answered with acceptable performance
    constraints
  • Queries usually require access to factual data
    generated by business transactions
  • e.g. find the average number of properties rented
    out with a monthly rent greater than 700 at each
    branch office over the last six months
  • Uses Dimensionality Modelling

11
Dimensionality Modelling
  • Similar to E-R modelling but with constraints
  • composed of one fact table with a composite
    primary key
  • dimension tables have a simple primary key which
    corresponds exactly to one foreign key in the
    fact table
  • uses surrogate keys based on integer values
  • Can efficiently and easily support ad-hoc
    end-user queries

Enroll Now https//goo.gl/QbTVal
12
Star Schemas
  • The most common dimensional model
  • A fact table surrounded by dimension tables
  • Fact tables
  • contains FK for each dimension table
  • large relative to dimension tables
  • read-only
  • Dimension tables
  • reference data
  • query performance speeded up by denormalising
    into a single dimension table

13
E-R Model Example
Enroll Now https//goo.gl/QbTVal
14
Star Schema Example
Enroll Now https//goo.gl/QbTVal
15
Other Schemas
  • Snowflake schemas
  • variant of star schema
  • each dimension can have its own dimensions
  • Starflake schemas
  • hybrid structure
  • contains mixture of (denormalised) star and
    (normalised) snowflake schemas

16
OLAP
  • Online Analytical Processing
  • dynamic synthesis, analysis and consolidation of
    large volumes of multi-dimensional data
  • normally implemented using specialized
    multi-dimensional DBMS
  • a method of visualising and manipulating data
    with many inter-relationships

17
Codds OLAP Rules
  • 1. Multi-dimensional conceptual view
  • 2. Transparency
  • 3. Accessibility
  • 4. Consistent reporting performance
  • 5. Client-server architecture
  • 6. Generic dimensionality
  • 7. Dynamic sparse matrix handling
  • 8. Multi-user support
  • 9. Unrestricted cross-dimensional operations
  • 10. Intuitive data manipulation

18
OLAP Tools
  • Categorised according to architecture of
    underlying database
  • Multi-dimensional OLAP
  • data typically aggregated and stored according to
    predicted usage
  • use array technology
  • Relational OLAP
  • use of relational meta-data layer with enhanced
    SQL
  • Managed Query Environment
  • deliver data direct from DBMS or MOLAP server to
    desktop in form of a datacube

19
MOLAP
RDB Server
MOLAP server
Request
Result
Load
Database/Application Logic Layer
Presentation Layer
Enroll Now https//goo.gl/QbTVal
20
ROLAP
ROLAP server
Request
SQL
RDB Server
Result
Result
Database Layer
Presentation Layer
Application Logic Layer
Enroll Now https//goo.gl/QbTVal
21
MQE
End-user tools
RDB Server
SQL
Result
MOLAP server
Request
Load
Result
Enroll Now https//goo.gl/QbTVal
22
Data Mining
  • The process of extracting valid, previously
    unknown, comprehensible and actionable
    information from large databases and using it to
    make crucial business decisions
  • focus is to reveal information which is hidden
    or unexpected
  • patterns and relationships are identified by
    examining the underlying rules and features of
    the data
  • work from data up
  • require large volumes of data

23
Example Data Mining Applications
  • Retail/Marketing
  • Identifying buying patterns of customers
  • Finding associations among customer demographic
    characteristics
  • Predicting response to mailing campaigns
  • Market basket analysis

Enroll Now https//goo.gl/QbTVal
24
Example Data Mining Applications
  • Banking
  • Detecting patterns of fraudulent credit card use
  • Identifying loyal customers
  • Predicting customers likely to change their
    credit card affiliation
  • Determining credit card spending by customer
    groups

Enroll Now https//goo.gl/QbTVal
25
Data Mining Techniques
  • Four main techniques
  • Predictive Modeling
  • Database Segmentation
  • Link Analysis
  • Deviation Direction

Enroll Now https//goo.gl/QbTVal
26
Data Mining Techniques
  • Predictive Modelling
  • using observations to form a model of the
    important characteristics of some phenomenon
  • Techniques
  • Classification
  • Value Prediction

Enroll Now https//goo.gl/QbTVal
27
Classification Example- Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
Enroll Now https//goo.gl/QbTVal
28
Data Mining Techniques
  • Database Segmentation
  • to partition a database into an unknown number of
    segments (or clusters) of records which share a
    number of properties
  • Techniques
  • Demographic clustering
  • Neural clustering

Enroll Now https//goo.gl/QbTVal
29
Segmentation Scatterplot Example
Enroll Now https//goo.gl/QbTVal
30
Data Mining Techniques
  • Link Analysis
  • establish associations between individual records
    (or sets of records) in a database
  • e.g. when a customer rents property for more
    than two years and is more than 25 years old,
    then in 40 of cases, the customer will buy the
    property
  • Techniques
  • Association discovery
  • Sequential pattern discovery
  • Similar time sequence discovery

31
Data Mining Techniques
  • Deviation Detection
  • identify outliers, something which deviates
    from some known expectation or norm
  • Statistics
  • Visualisation

Enroll Now https//goo.gl/QbTVal
32
Deviation Detection Visualisation Example
33
Mining and Warehousing
  • Data mining needs single, separate, clean,
    integrated, self-consistent data source
  • Data warehouse well equipped
  • populated with clean, consistent data
  • contains multiple sources
  • utilises query capabilities
  • capability to go back to data source

34
Further Reading
  • Connolly and Begg, chapters 31 to 34.
  • W H Inmon, Building the Data Warehouse, New York,
    Wiley and Sons, 1993.
  • Benyon-Davies P, Database Systems (2nd ed),
    Macmillan Press, 2000, ch 34, 35 36.

Enroll Now https//goo.gl/QbTVal
35
(No Transcript)
36
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com