Title: Data Warehousing
1Data Warehousing Data Mining Concepts
- David Chrestman
- Evan Rakestraw
- Andy Stevens
2Data Warehousing
- A data warehouse is a collection of information
with a supporting system that is - Intended for decision-support applications.
- Optimized for data retrieval, not transaction
processing.
3Data Warehousing
- A data warehouse is characterized as a
subject-oriented, integrated, time-variant
collection of data to support the decisions of
management. - Data warehouses provide access to data for
complex analysis, knowledge discovery, and
decision making.
4Data Warehousing
- A data warehouse can be normalized or
denormalized. - Data warehouse data is not changed often.
- Outputs from data warehouses are usually tabular
listings or formatted formal reports.
5Data Warehousing
- Data warehouses support high-performance demands
on data with support for applications such as - OLAP
- DSS
- Data Mining Applications
6OLAP
- Online Analytical Processing describes the
analysis of complex data from the data warehouse. - OLAP tools use distributed computing capabilities
for analyses that require more storage and
processing power than found on an average desktop.
7DSS
- DSS stands for Decision Support Systems and is
also known as Executive Information Systems
(EIS). - DSS supplies higher level data for complex
decisions.
8Data Mining
- Data Mining is used for knowledge discovery.
- Data Mining is the process of searching data for
unanticipated new knowledge.
9Data Warehousing
- Compared with transactional databases,
information in a data warehouse is regarded as
non-real-time with periodic updating. - Warehouse updates are handled by the warehouses
acquisition component that provides all required
processing.
10Data Warehousing Characteristics
- Multidimensional Conceptual View
- Generic Dimensionality
- Unlimited Dimensions and Aggregation Levels
- Unrestricted Cross-Dimensional Operations
- Dynamic Sparse Matrix Handling
- Client-Server Architecture
11Data Warehousing Characteristics
- Multi-User Support
- Accessibility
- Transparency
- Intuitive Data Manipulations
- Consistent Reporting Performance
- Flexible Reporting
12Data Warehousing
- Data Warehouses are generally 10 times larger
than the source databases, and are likely to be
in terabytes. - The issue of their size has been dealt with
using - Enterprise-Wide Data Warehouses
- Virtual Data Warehouses
- Data Marts
13Data Warehousing
- Enterprise-Wide Data Warehouses are huge projects
requiring massive investment of time and
resources. - Virtual Data Warehouses provide views of
operational databases that are materialized for
efficient access. - Data Marts generally are targeted to a subset of
an organization and are tightly focused.
14Data Modeling For Warehouses
- Multidimensional models take advantage of
inherent relationships in data to populate data
in multidimensional matrices called data cubes. - If there are more than three dimensions they may
be called hypercubes.
15Data Modeling For Warehouses
- Examples of dimensions in a data warehouses would
be a companys fiscal periods, products, and
regions. - Changing the dimensional orientation is easily
accomplished using rotation (also called
pivoting).
16Two Dimensional Matrix Model
17Three Dimensional Data Cube
18Pivoted 3D Data Cube
19Data Modeling For Warehouses
- Roll-up display moves up the hierarchy, grouping
into larger units along a dimension. - Example Summing weekly data by month, quarter,
or year. - Drill-down display moves down the hierarchy,
dividing into smaller units along a dimension. - Example Dividing country sales by region and
then region by sub-region.
20Roll-Up
21Drill-Down
22Data Modeling For Warehouses
- The multidimensional storage model involves two
types of tables - Dimension Table consists of tuples of attributes
of the dimension. - Fact Table has tuples (one per recorded fact),
contains some measured variables, and identifies
it with pointers to dimension tables.
23Data Modeling For Warehouses
- Two common multidimensional schemas are
- Star Schema consists of a fact table with a
single table for each dimension. - Snowflake Schema a variation on the star schema
in which the dimensional tables from a star
schema are organized into a hierarchy by
normalization
24Star Schema
25Snowflake Schema
26Data Modeling For Warehouses
- To support high performance access, data
warehouse storage utilizes indexing techniques - Bitmap Indexing constructs a bit vector for each
value in a domain being indexed. - Join Indexing uses traditional indexes to
maintain relationships between primary key and
foreign key values
27Building A Data Warehouse
- Warehouse design should specifically support
ad-hoc querying, which is accessing data with any
meaningful combination of values for attributes
in the dimension or fact tables. - This is because there is no way to anticipate all
possible queries or analyses during the design
phase.
28Building A Data Warehouse
- Acquisition of data for the warehouse involves
the following steps - Data must be extracted from multiple,
heterogeneous sources - Data must be formatted for consistency within the
warehouse - Data must be cleaned before loaded into the
warehouse to ensure validity.
29Building A Data Warehouse
- Data must be fitted into the data model of the
warehouse. - Data must be loaded into warehouse.
30Building A Data Warehouse
31Building A Data Warehouse
- Due to the large size of data loaded into a
warehouse, the updating policy must keep in mind
a few questions - How up-to-date must the data be?
- Can the warehouse go offline, and for how long?
- What is the storage availability?
- What is the loading time?
32Building A Data Warehouse
- Data storage in a warehouse must be able to
reflect the specialization of optimized access,
which involves - Storing data according to data model.
- Maintaining required data structures.
- Maintaining appropriate access paths.
- Providing for time-variant data as new data is
added.
33Building A Data Warehouse
- Supporting the updating of the warehouse data.
- Refreshing the data.
- Purging the data.
34Building A Data Warehouse
- Data warehouses must be designed with
consideration to the environment in which they
reside. Important considerations - Usage projections
- Fit of the data model
- Characteristics of available sources
- Design of the metadata component
35Building A Data Warehouse
- Modular component design
- Design for manageability and change
- Considerations of distributed and parallel
architecture
36Oracle Data Warehouses
- Oracle offers their Oracle Warehouse Builder
(OWB) in Oracle 10g. - OWB manages the full life-cycle of data and
metadata for the Oracle 10g Database. - Provides an easy to use, graphical environment to
rapidly design, deploy, and manage business
intelligence systems.
37Oracle Data Warehouses
- Oracle Warehouse Builder
- Provides specific mapping operators to cleanse
data upon loading - Extracts data from heterogeneous data sources
- Provides users with a graphical environment to
model the ETL processes - Reduces extraction, transformation and loading
development times
38Oracle Data Warehouses
- More information on OWB at http//www.oracle.com/
technology/products/warehouse/index.html
39Data Warehouse Summary
- A data warehouse is a collection of data used for
research and decision support. - Data warehouses exist to facilitate complex,
data-intensive, and frequent ad hoc queries. - Data warehouses must provide far greater and more
efficient query support than is demanded of
transactional databases.
40Data Mining
- Data mining refers to the mining or discovery of
new information in terms of patterns or rules
from vast amounts of data. - To date, it is not well-integrated with database
management systems.
41Goals of Data Mining
- Prediction
- Identification
- Classification
- Optimization
42Prediction
- Shows how certain attributes within the data will
behave in the future. - Example
- Certain seismic wave patterns may predict an
earthquake with high probability.
43Identification
- Data patterns can be used to identify the
existence of an item, and event, or an activity. - Example
- Intruders trying to break a system may be
identified by the programs executed, files
accessed, and CPU time per session.
44Classification
- Catagorizes data so that different categories can
be identified - Example
- Customers in a supermarket can be categorized
into discount-seeking shoppers, shoppers in a
rush, loyal regular shoppers, shoppers attached
to name brands, and infrequent shoppers.
45Optimization
- Optimizes the use of limited resources
- Maximize output variables such as sales or profits
46Knowledge discovered in Data Mining
- Association
- Classification hierarchies
- Sequential patterns
- Patterns within time series
- Clustering
- For most applications the desired knowledge is a
combination of these types.
47Association Rules
- These rules correlate the presence of a set of
items with another range of values for another
set of variables. - Example
- When a person buys a gallon of milk that person
is likely to buy a box of cereal.
48Classification Hierarchies
- Works from an existing set of events to create a
hierarchy of classes. - Example
- A model may be developed for the factors that
determine the desirability of location of a store
on a scale of - 1-10.
49Sequential Patterns
- A sequence of actions or events is sought.
- Example
- If a patient underwent cardiac bypass surgery
for blocked arteries and an aneurysm and later
developed high blood urea within a year of
surgery, he or she is likely to suffer from
kidney failure within the next 18 months.
50Patterns Within Time Series
- Similarities detected within positions of a time
series of data - A sequence of data taken in intervals such as
daily sales or daily closing stock prices. - Example
- Two products show the same selling pattern in
the summer but a different one in the winter.
51Clustering
- A population of events or items partitioned into
sets of similar elements. - Example
- An entire population of treatment data on a
disease may be divided into groups based on
similarity of side effects produced
52Interest Measures
- Two common interest measures
- Support
- Confidence
53Support
- Refers to how frequently a specific itemset
occurs in the database. - LHS (left-hand side) gt RHS (right-hand side)
- Support is the percentage of transactions that
contain all of the items in the itemset, LHS U
RHS.
54Confidence
- The probability that one item (RHS) will be
purchased along with another item (LHS). - Computed as
- the support (LHS U RHS)/support (LHS).
55Support/Confidence Example
- Milk gt Juice and Bread gt Juice
- Support Milk, Juice 50
- Support Bread, Juice 25
- Confidence Milk gt Juice 66.7
- Confidence Bread gt Juice 50
- ( of milks and juices together / of milks)
56Knowledge Discovery in Databases (KDD)
- Steps in the KDD
- Data Collection
- Data Cleansing
- Data Mining
- Evaluation/Interpretation
57Data Collection
- Identify the target data set
- Acquire the relevant application domain knowledge
58Data Cleaning
- Remove noise
- Account for missing fields
- Transform field values to common units
- Generate Data
59Data Mining
- Extract patterns of interest from the data
generated in the data cleaning phase.
60Evaluation/Interpretation
- Evaluates and Interprets the mined data.
- Determines if the data is meaningful and useful.
61KDD Diagram
62Evolution of Data Mining
- Data Collection (1960s)
- Data Access (1980s)
- Data Warehousing Decision Support
- (1990s)
- Data Mining (Emerging Today)
63Evolution Timeline and Enabling Technologies
1960s
1980s
1990s
Today
Computers, tapes, disks
Relational databases SQL, ODBC
OLAP, multidimensional databases, data warehouses
Advanced algorithms, multiprocessor computers,
massive databases
64Oracle Data Mining
- Oracle Data Mining is an option to Oracle 10g
Database Enterprise Edition - It embeds data mining functionality for making
classifications, predictions, and associations - Also extracts new features from data, clusters
data and ranks relative attribute importance.
65What mining capabilities does Oracle Data Mining
support?
- Oracle Data Mining provides programmatic access
to six data mining algorithms embedded in Oracle
Database. - SVM
- NMF
- MDL
- Enhanced K-Means
- O-Cluster
- ABN
66Areas that use these algorithms
- Classification
- Regression
- Association Rules
- Clustering
- Attribute Importance
- Feature Extraction
- Text Mining
67Classification
- Predicts binary or multi-class outcomes.
- In binary problems, each record either will or
will not exhibit the modeled behavior. For
example, a model could be built to predict
whether a customer will churn or remain loyal. - Predictions for multi-class problems where there
are several possible outcomes can also be made.
For example, a model could be built to predict
which class of service will be preferred by each
prospect. - Deals with discrete/categorical target attributes.
68Regression
- Creates predictive models.
- Deals with numerical/continuous target attributes
- If the target attribute contains floating-point
values, a regression technique is required. - If the target attribute contains string or
discrete integer values, a classification
technique is used. - Most common form is linear regression
- A line that best fits the data is calculated
69Association Rules
- Detect associated or co-occurring events hidden
in data. - Often used to find popular products that are
related to each other, such as milk and cereal
being associated with each other.
70Oracle Association Rules
- Used to identify co-occurring items or events in
a variety of business problems, such as - Which items or products is this person most
likely to buy or like? - The associations discovered are useful in
designing special promotions, products bundles,
and store displays.
71Clustering
- ODM provides two Clustering algorithms for the
segmentation of cases in a dataset. - Enhanced version of K-Means algorithm
- O-Cluster algorithm
72Attribute Importance
- Measures the predictive power of each attribute
in classifying the target values - Produces a list of attributes ranked by relative
importance. - This information can be used to reduce the size
of input data, increasing the speed of mining
tasks.
73Feature Extraction
- A combination of attributes in the data that is
of special interest and captures important
characteristics of the data. - Creates a new set of features by decomposing the
original data. - Lets you describe the data with a number of
features smaller than the number of original
attributes.
74Text Mining
- Conventional data mining done using text
features. - Usually keywords, frequencies of words, or other
document-derived features. -
- Once you derive text features, you mine them just
as you would any other data.
75Text Mining (cont.)
- Some of the applications for text mining include
- Create and manage taxonomies
- Classify or categorize documents
- Integrating search capabilities and
classification and clustering of documents
returned from a search - Automatic extraction of topics
- Feature extraction for subsequent mining
76Sources
- http//web.cecs.pdx.edu/saxenas/cs544/CS544_kdd.p
df - http//www.thearling.com/text/dmwhite/dmwhite.htm
- http//www.oracle.com/technology/products/bi/odm/o
dm_10g_faq.html
77Other Types of Association Rules
- Multidimensional Associations
- e.g. People who buy milk between 600 800
- Partition into non-overlapping labels
78Other Types of Association Rules
- Negative Associations
- Harder than finding a positive association
- Must find interesting rules
- Use hierarchies
79Additional Considerations
- Millions of transactions
- Variability
- Multiple dimensions
- Quality of data is variable
80Classification
- What is Classification?
- The process of learning a model that describes
different classes of data - Classes are predetermined
- Supervised Learning
81Clustering
- What is Clustering?
- Partitioning data without having a training
sample - Unsupervised Learning
- Goal place records in groups so that records are
with similar records and not with dissimilar
records
82Clustering
- Decision Trees
- http//www.20q.net/
83Neural Networks
- What is a Neural Network?
- A technique derived from artificial intelligence
research using generalized regression and an
iterative method
84Neural Networks
- Supervised Networks
- Adaptive methods that attempt to reduce the
output error
85Neural Networks
- Unsupervised Networks
- Adaptive methods that develop internal
representations without sample outputs
86Genetic Algorithms
- What is a Genetic Algorithm?
- A class of randomized search procedures capable
of adaptive search over a wide range of
topologies - Based on the A,C,T,G nucleotides of DNA
87Genetic Algorithms
- Uses a set of solutions
- Randomized algorithm due to probabilistic
operators - Finds near-optimal balance between knowledge
acquisition and exploitation
88Applications of Data Mining
- Marketing
- Finance
- Manufacturing
- Health Care
89Applications - Marketing
- Analysis of customer behavior
- Determination of marketing strategies
- Design of catalogs and store layouts
90Applications - Finance
- Analysis of creditworthiness
- Evaluation of financing options
- Fraud detection
91Applications - Manufacturing
- Optimization of resources
- Optimal design of manufacturing processes
- Product design
92Applications Health Care
- Discovering patterns in radiology images
- Analysis of experimental data
- Optimization of processes within a hospital
93Commercial Data Mining Tools
- Use Open Database Connectivity (ODBC)
- Use GUIs
- Application Programming Interface (API)
94Commercial Data Mining Tools
- Examples
- IBM Intelligent Miner
- DBMiner
- SAS Enterprise Miner
- Silicon Graphics MineSet
95Commercial Data Mining Tools
- Future Directions
- More fully developed internet capability
- Hybrid approaches
- Parallel and Distributed environments
- Windows NT and UNIX will be standard
- Proprietary input formats expected to disappear
- Non-standard data