ITEC 423 Data Warehousing and Data Mining

About This Presentation

Title:

ITEC 423 Data Warehousing and Data Mining

Description:

ITEC 423 Data Warehousing and Data Mining Lecture 2 – PowerPoint PPT presentation

Number of Views:650

Avg rating:3.0/5.0

Slides: 46

Provided by: edut1299

Category:

more less

Transcript and Presenter's Notes

Title: ITEC 423 Data Warehousing and Data Mining

1
ITEC 423 Data Warehousing and Data Mining

Lecture 2

DATA WAREHOUSEBUILDING BLOCKS
2
(No Transcript)
3
What is a data warehouse?

A data warehouse is a
subject-oriented,
Integrated (consolidated)
time-variant, and
nonvolatile
collection of data in support of managements
decision-making process.
W. H. Inmon

4
Subject-Oriented Data

Warehouse is organized around major subjects of
the enterprise rather than major application
areas
Contains decision-support data rather than
application-oriented data.
The focus of the design is
providing users easy access to data so that
current and future questions can be answered

5
Application-Oriented vs Subject-Oriented
6
Integrated or Consolidated Data

Integrates corporate level application-oriented
data from different source systems
data is often inconsistent or missing
Integrated data source must be made consistent to
present a unified view of the data to the users.

7
Integrated Data
8
Time-Variant Data
This is necessary to support trending,
forecasting, and time-based performance
reporting, such as current year versus previous
year.

Data in the warehouse is only accurate and valid
at some point in time or over some time interval.
contains slices of data across different periods
of time.
With these data slices, the user can view current
and past reports.
data represents a series of snapshots.
Time-variance is also shown in the extended time
that data is stored
contains several years worth of data
implicit or explicit association of time with all
data

9
Non-Volatile Data

Data in the warehouse is not updated in real-time
but is refreshed from operational systems on a
regular basis
New data is always added as a supplement to the
database, rather than a replacement.

10
Data Granularity

operational database
data is usually kept at the lowest level of
detail
the units of sale are captured at the level of
units of a product per transaction at the check
-out counter.
the quantity ordered is captured and stored at
the level of units of a product per order
received from the customer.
If summary data is needed, the individual
transactions are grouped.
data warehouse
Initial requests are for summary data to use for
analysis.
total sale units of a product in an entire
region.
Progessively more details may be required
breakdown by states in the region.
Examine sale units at individual stores.

11
Data Granularity

Is the level of detail
keep data summarized at different levels of
detail

THREE DATA LEVELS IN A BANKING DATA WAREHOUSE
12
Typical Properties of a data warehouse

A data warehouse is housed on an enterprise
mainframe server.
Data from various online transaction processing
(OLTP) applications and other sources is
selectively extracted and organized
Read-only Copy
re-structured
Data warehouse database is used for processing
analytical applications and user queries.

13
OLTP vs. Warehousing

Organized by transactions vs. Organized by
particular subject
More number of users vs. less
Accesses few records vs. entire table
Smaller database vs. Large database
Normalized data structure vs. Unnormalized
Continuous update vs. periodic update (load)

14
Data Warehouse Compared to OLTP
PROPERTY OLTP DATA WAREHOUSE
ACTIVITIES Processes Analysis
RESPONSE TIME Subseconds to seconds Seconds to hours
OPERATIONS DML Primarily read-only
NATURE OFDATA 30-60 days Snapshots over time
DATA ORGANIZE. By application By subject,time
SIZE Small to large Large to very large
DATA SOURCES Operational Internal Operational, Internal, External
15
Data Warehouse Compared to OLTP
PROPERTY
OLTP
DATA WAREHOUSE
ACTIVITIES
Processes
Analysis
RESPONSE TIME
Subseconds
to
Seconds to hours
seconds
OPERATIONS
DML
Primarily read
-
only
NATURE OF
30
-
60 days
Snapshots over time
DATA
DATA ORGANIZE.
By application
By
subject,time
SIZE
Small to large
Large to very large
DATA SOURCES
Operational
Operational,
Internal
Internal, External
Predictable
Unpredictable
USAGE CURVEE
16
Data Warehouse or Data Mart
17
Data Warehouse Compared to Data Mart
Property Data Warehouse Data Mart
Scope Enterprise Department
Subjcts Multiple Single-subject, line of business (LOB)
Data Source Many Few
Size (typical) 100 GB to gt 1 TB lt 100 GB
Implementationtime Months to years Months
Data Warehouse
Data Mart
18
Data Warehouse Compared to Data Mart
Property
Data Warehouse
Data Mart
Scope
Enterprise
Department
Subjcts
Multiple
Single
-
subject, line
of business (LOB)
Data Source
Many
Few
Size (typical)
100 GB to gt 1 TB
lt 100 GB
Implementation
Months to years
Months
time
Data Warehouse
Data Mart
19
Which one to build first? Data warehouse or Data
Mart?
20
Top Down Approach
21
Top Down Approach

Build the overall, big, enterprise- wide data
warehouse.
Instead of collection of fragmented islands of
information.
Data warehouse is large and integrated.
would take longer to build and has a high risk
of failure.
If you do not have experienced professionals on
your team, this approach could be hazardous.
Difficult to sell this approach to senior
management and sponsors
They are not likely to see results soon enough.

22
Pros and Cons of Top Down Approach

Advantages

Disadvantages

A truly corporate effort, an enterprise view of
data
Inherently architected, not a union of disparate
data marts
Single, central storage of data about the content
Centralized rules and control
May see quick results if implemented with
iterations

Takes longer to build even with an iterative
method
High exposure to risk of failure
Needs high level of cross-functional skills
High outlay without proof of concept

23
Bottom Up Approach
24
Bottom Up Approach

Build departmental data marts one by one based on
priority
Collection of data marts make up the data
warehouse
Beware of data fragmentation
Independent data marts are blind to the overall
requirements of the entire organization.
Data marts contain
data at the lowest level of granularity
summaries depending on the needs for analysis
Data marts are joined or unioned together by
conforming the dimensions

25
Pros and Cons of Bottom Up Approach

Disadvantages

Advantages

Faster and easier implementation of manageable
pieces
Favorable return on investment and proof of
concept
Less risk of failure
Inherently incremental
can schedule important data marts first
Allows project team to learn and grow

Each data mart has its own narrow view of data
Permeates redundant data in every data mart
Perpetuates inconsistent and irreconcilable data
Proliferates unmanageable interfaces

26
Architectural Types
27
Architectural Types

Centralized Data Warehouse
Takes into account the enterprise-level
information requirements
Atomic level data at the lowest level of
granularity is stored
Some summarized data may be included
Queries and applications access the central data
warehouse.
No separate data marts
Independent Data Marts
Evolves in companies where the organizational
units develop their own data marts for their own
specific purposes
Each data mart serves a particular organizational
unit
More than one version of the truth may be found
Data marts are independent of one another
Different data marts may have inconsistent data
definitions and standards
Such variances hinder analysis of data across
data marts.

28
Architectural Types

Federated
An existing legacy of an assortment of DSS in the
form of operational systems, extracted datasets,
primitive data marts,
May not be possible to discard investment and
start from scratch
Practical solution is a federated architectural
type
data may be physically or logically integrated
through shared key fields, overall global
metadata , distributed queries, and such other
methods
No one overall data warehouse
Data-Mart Bus
Conformed supermarts approach
Analyzing requirements for a specific business
subject such as orders, shipments, billings,
insurance claims, car rentals, and ...
Build the first data mart (supermart) using
business dimensions and metrics
These business dimensions will be shared in the
future data marts.
Conform dimensions among the various data marts
Result would be logically integrated supermarts
that will provide an enterprise view of the data
Data marts contain atomic data organized as a
dimensional data model
Results from adopting an enhanced bottom-up
approach to data warehouse development

29
Architectural Types

Hub-and-S poke
Similar to the centralized data warehouse
architecture
Overall enterprise-wide data warehouse
Atomic data is stored in the centralized data
warehouse
Major and useful difference is the presence of
dependent data marts in this architectural type
Dependent data marts obtain data from the
centralized data warehouse
The centralized data warehouse forms the hub to
feed data to the data marts on the spokes
Dependent data marts may be developed for a
variety of purposes
departmental analytical needs, specialized
queries, data mining, and ...
Dependent data mart may have normalized,
denormalized, summarized, or dimensional data
structures based on individual requirements
Most queries are directed to the dependent data
marts although the centralized data warehouse may
itself be used for querying
Result s from adopting a top-down approach to
data warehouse development.

30
Building Blocks of Data Warehouses
31
What is OLAP?

online analytical processing
Approach to answer multi-dimensional analytical
queries.
part of the broader category of
business intelligence reporting, data mining.
Applications include
business reporting for sales,
management reporting,
budgeting
forecasting

32
Typical Data Warehousing Process
Phase I - STRATEGY Identify business
requirements Define objectives purpose of DW
Phase II - DEFINITION Project scoping and
planning Using building block approach
Phase III - ANALYSIS Information requirements are
defined
Phase IV - DESIGN Database structures to hold
base data and summaries are created Translation
mechanisms are designed
Phase V - BUILD DOCUMENT The warehouse is built
and documentation is developed
Phase VI - POPULATE, TEST TRAIN The warehouse
is populated and tested the users are trained on
system and tools
Iterative
Phase VII - DISCOVERY EVOLUTION The warehouse
is monitored and adjustments are applied, or
future extensions are planned
33
What Does All This Mean?

On a daily basis, organizations turn to their
data warehouses to answer a limitless variety of
questions.
Nothing is free
these benefits do come with a cost.
The value of a data warehouse is a result of the
new and changed business processes it enables.
There are limitations
A DW cannot correct problems with the data,
although it may help to clearly identify them.

34
Comparison of Typical DW Costs and Benefits

Costs
Hardware, software, development personnel and
consultant costs.
Operational costs like ongoing systems
maintenance.
Benefits
Added Revenue
Will the new (business objective) process
generate new customers (what is the estimated
value?)
Will the new (business objective) process
increase the buying propensity of existing
customers (by how much?)
Is the new process necessary to ensure that the
competition doesn't offer a demanded service that
you can't match?
Reduced costs
What costs of current systems will be eliminated?
Is the new process intended to make some
operation more efficient? If so, how and what is
the dollar value?

35
The Cost of Warehousing Data

Expenditures can be categorized as one-time
initial costs or as recurring, ongoing costs.
The initial costs can further be identified as
for hardware or software.
Expenditures can also be categorized as capital
costs (associated with acquisition of the
warehouse) or as operational costs (associated
with running and maintaining the warehouse)

36
Expenditures Associated with Building a DW
Recurring Costs One-Time Costs
Capital Hardware maintenance Software maintenance Terminal analysis Middleware Hardware Software Disk DBMS CPU Terminal analysis Network Middleware Terminal analysis Network Log utility Processing Metadata Infrastructure
Operational Ongoing refreshment Integration transformation Data model maintenance Record identification maintenance Metadata infrastructure maintenance Archival of data Data aging within the DW Integration/transformation processing specification Metadata infrastructure population System of record definition Data dictionary language definition Network transfer definition CASE/Repository interface Initial data warehouse population Data model definition Database design definition
37
Cost is Highly Variable

A company that spends less money for their data
warehouse is often happier with it.
The main justification for the development
expense is that a DW reduces the cost of
accessing the information owned by the
organization.
Since information has to be retrieved just once
(when it is placed in the warehouse), DW users
see a lower cost on each report generated.

38
Typical Multidatabase Report and Screen Generation
Data download and transformation contribute to
retrieval costs for every report or screen
generated
39
Typical DW Report and Screen Generation
Data upload and transformation costs occur just
once. Retrieval costs are lower.
40
Farmers and Explorers

Every corporation has two types of DW users.
Farmers know what they want before they set out
to find it. They submit small queries and
retrieve small nuggets of information.
Explorers are quite unpredictable. They often
submit large queries. Sometimes they find
nothing, sometimes they find priceless nuggets.
Cost justification for the DW is usually done on
the basis of the results obtained by farmers
since explorers are unpredictable.

41
Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse. The
warehouse feeds specialized information to
departments.
42
The Data Mart is More Specialized
The data mart serves the needs of one business
unit, not the organization.
43
Foundations of Data Mining

Data mining is the process of using raw data to
infer important business relationships.
Despite a consensus on the value of data mining,
a great deal of confusion exists about what it
is.
It is a collection of powerful techniques
intended for analyzing large datasets.
There is no single data mining approach, but
rather a set of techniques that can be used in
combination with each other.

44
The Roots of Data Mining

The approach has roots in practice dating back
over 30 years.
In the early 1960s, data mining was called
statistical analysis, and the pioneers were
statistical software companies such as SAS and
SPSS.
By the 1980s, the traditional techniques had been
augmented by new methods such as fuzzy logic,
heuristics and neural networks.

45
A General Approach

Although all data mining endeavors are unique,
they possess a common set of process steps
Infrastructure preparation choice of hardware
platform, the database system and one or more
mining tools
Exploration looking at summary data, sampling
and applying intuition
Analysis each discovered pattern is analyzed
for significance and trends

46
A General Approach (continued)

Interpretation Once patterns have been
discovered and analyzed, the next step is to
interpret them. Considerations include business
cycles, seasonality and the population the
pattern applies to.
Exploitation this is both a business and a
technical activity. One way to exploit a pattern
is to use it for prediction. Others are to
package, price or advertise the product in a
different way.

47
Review Vocabulary

Data warehouse
Data mart
OLTP
OLAP
Dimensional Model
Subject Oriented
Time variant
Non volatile
Integrated/consolidate

Write a Comment

User Comments (0)

About PowerShow.com

ITEC 423 Data Warehousing and Data Mining - PowerPoint PPT Presentation

ITEC 423 Data Warehousing and Data Mining

ITEC 423 Data Warehousing and Data Mining Lecture 2 – PowerPoint PPT presentation