Data Warehousing - PowerPoint PPT Presentation

1 / 95
About This Presentation
Title:

Data Warehousing

Description:

... Health Care Discovering patterns in radiology images Analysis of experimental data Optimization of processes within a hospital ... multiprocessor computers, ... – PowerPoint PPT presentation

Number of Views:375
Avg rating:3.0/5.0
Slides: 96
Provided by: DavidCh70
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing


1
Data Warehousing Data Mining Concepts
  • David Chrestman
  • Evan Rakestraw
  • Andy Stevens

2
Data Warehousing
  • A data warehouse is a collection of information
    with a supporting system that is
  • Intended for decision-support applications.
  • Optimized for data retrieval, not transaction
    processing.

3
Data Warehousing
  • A data warehouse is characterized as a
    subject-oriented, integrated, time-variant
    collection of data to support the decisions of
    management.
  • Data warehouses provide access to data for
    complex analysis, knowledge discovery, and
    decision making.

4
Data Warehousing
  • A data warehouse can be normalized or
    denormalized.
  • Data warehouse data is not changed often.
  • Outputs from data warehouses are usually tabular
    listings or formatted formal reports.

5
Data Warehousing
  • Data warehouses support high-performance demands
    on data with support for applications such as
  • OLAP
  • DSS
  • Data Mining Applications

6
OLAP
  • Online Analytical Processing describes the
    analysis of complex data from the data warehouse.
  • OLAP tools use distributed computing capabilities
    for analyses that require more storage and
    processing power than found on an average desktop.

7
DSS
  • DSS stands for Decision Support Systems and is
    also known as Executive Information Systems
    (EIS).
  • DSS supplies higher level data for complex
    decisions.

8
Data Mining
  • Data Mining is used for knowledge discovery.
  • Data Mining is the process of searching data for
    unanticipated new knowledge.

9
Data Warehousing
  • Compared with transactional databases,
    information in a data warehouse is regarded as
    non-real-time with periodic updating.
  • Warehouse updates are handled by the warehouses
    acquisition component that provides all required
    processing.

10
Data Warehousing Characteristics
  • Multidimensional Conceptual View
  • Generic Dimensionality
  • Unlimited Dimensions and Aggregation Levels
  • Unrestricted Cross-Dimensional Operations
  • Dynamic Sparse Matrix Handling
  • Client-Server Architecture

11
Data Warehousing Characteristics
  • Multi-User Support
  • Accessibility
  • Transparency
  • Intuitive Data Manipulations
  • Consistent Reporting Performance
  • Flexible Reporting

12
Data Warehousing
  • Data Warehouses are generally 10 times larger
    than the source databases, and are likely to be
    in terabytes.
  • The issue of their size has been dealt with
    using
  • Enterprise-Wide Data Warehouses
  • Virtual Data Warehouses
  • Data Marts

13
Data Warehousing
  • Enterprise-Wide Data Warehouses are huge projects
    requiring massive investment of time and
    resources.
  • Virtual Data Warehouses provide views of
    operational databases that are materialized for
    efficient access.
  • Data Marts generally are targeted to a subset of
    an organization and are tightly focused.

14
Data Modeling For Warehouses
  • Multidimensional models take advantage of
    inherent relationships in data to populate data
    in multidimensional matrices called data cubes.
  • If there are more than three dimensions they may
    be called hypercubes.

15
Data Modeling For Warehouses
  • Examples of dimensions in a data warehouses would
    be a companys fiscal periods, products, and
    regions.
  • Changing the dimensional orientation is easily
    accomplished using rotation (also called
    pivoting).

16
Two Dimensional Matrix Model
17
Three Dimensional Data Cube
18
Pivoted 3D Data Cube
19
Data Modeling For Warehouses
  • Roll-up display moves up the hierarchy, grouping
    into larger units along a dimension.
  • Example Summing weekly data by month, quarter,
    or year.
  • Drill-down display moves down the hierarchy,
    dividing into smaller units along a dimension.
  • Example Dividing country sales by region and
    then region by sub-region.

20
Roll-Up
21
Drill-Down
22
Data Modeling For Warehouses
  • The multidimensional storage model involves two
    types of tables
  • Dimension Table consists of tuples of attributes
    of the dimension.
  • Fact Table has tuples (one per recorded fact),
    contains some measured variables, and identifies
    it with pointers to dimension tables.

23
Data Modeling For Warehouses
  • Two common multidimensional schemas are
  • Star Schema consists of a fact table with a
    single table for each dimension.
  • Snowflake Schema a variation on the star schema
    in which the dimensional tables from a star
    schema are organized into a hierarchy by
    normalization

24
Star Schema
25
Snowflake Schema
26
Data Modeling For Warehouses
  • To support high performance access, data
    warehouse storage utilizes indexing techniques
  • Bitmap Indexing constructs a bit vector for each
    value in a domain being indexed.
  • Join Indexing uses traditional indexes to
    maintain relationships between primary key and
    foreign key values

27
Building A Data Warehouse
  • Warehouse design should specifically support
    ad-hoc querying, which is accessing data with any
    meaningful combination of values for attributes
    in the dimension or fact tables.
  • This is because there is no way to anticipate all
    possible queries or analyses during the design
    phase.

28
Building A Data Warehouse
  • Acquisition of data for the warehouse involves
    the following steps
  • Data must be extracted from multiple,
    heterogeneous sources
  • Data must be formatted for consistency within the
    warehouse
  • Data must be cleaned before loaded into the
    warehouse to ensure validity.

29
Building A Data Warehouse
  • Data must be fitted into the data model of the
    warehouse.
  • Data must be loaded into warehouse.

30
Building A Data Warehouse
31
Building A Data Warehouse
  • Due to the large size of data loaded into a
    warehouse, the updating policy must keep in mind
    a few questions
  • How up-to-date must the data be?
  • Can the warehouse go offline, and for how long?
  • What is the storage availability?
  • What is the loading time?

32
Building A Data Warehouse
  • Data storage in a warehouse must be able to
    reflect the specialization of optimized access,
    which involves
  • Storing data according to data model.
  • Maintaining required data structures.
  • Maintaining appropriate access paths.
  • Providing for time-variant data as new data is
    added.

33
Building A Data Warehouse
  • Supporting the updating of the warehouse data.
  • Refreshing the data.
  • Purging the data.

34
Building A Data Warehouse
  • Data warehouses must be designed with
    consideration to the environment in which they
    reside. Important considerations
  • Usage projections
  • Fit of the data model
  • Characteristics of available sources
  • Design of the metadata component

35
Building A Data Warehouse
  • Modular component design
  • Design for manageability and change
  • Considerations of distributed and parallel
    architecture

36
Oracle Data Warehouses
  • Oracle offers their Oracle Warehouse Builder
    (OWB) in Oracle 10g.
  • OWB manages the full life-cycle of data and
    metadata for the Oracle 10g Database.
  • Provides an easy to use, graphical environment to
    rapidly design, deploy, and manage business
    intelligence systems.

37
Oracle Data Warehouses
  • Oracle Warehouse Builder
  • Provides specific mapping operators to cleanse
    data upon loading
  • Extracts data from heterogeneous data sources
  • Provides users with a graphical environment to
    model the ETL processes
  • Reduces extraction, transformation and loading
    development times

38
Oracle Data Warehouses
  • More information on OWB at http//www.oracle.com/
    technology/products/warehouse/index.html

39
Data Warehouse Summary
  • A data warehouse is a collection of data used for
    research and decision support.
  • Data warehouses exist to facilitate complex,
    data-intensive, and frequent ad hoc queries.
  • Data warehouses must provide far greater and more
    efficient query support than is demanded of
    transactional databases.

40
Data Mining
  • Data mining refers to the mining or discovery of
    new information in terms of patterns or rules
    from vast amounts of data.
  • To date, it is not well-integrated with database
    management systems.

41
Goals of Data Mining
  • Prediction
  • Identification
  • Classification
  • Optimization

42
Prediction
  • Shows how certain attributes within the data will
    behave in the future.
  • Example
  • Certain seismic wave patterns may predict an
    earthquake with high probability.

43
Identification
  • Data patterns can be used to identify the
    existence of an item, and event, or an activity.
  • Example
  • Intruders trying to break a system may be
    identified by the programs executed, files
    accessed, and CPU time per session.

44
Classification
  • Catagorizes data so that different categories can
    be identified
  • Example
  • Customers in a supermarket can be categorized
    into discount-seeking shoppers, shoppers in a
    rush, loyal regular shoppers, shoppers attached
    to name brands, and infrequent shoppers.

45
Optimization
  • Optimizes the use of limited resources
  • Maximize output variables such as sales or profits

46
Knowledge discovered in Data Mining
  • Association
  • Classification hierarchies
  • Sequential patterns
  • Patterns within time series
  • Clustering
  • For most applications the desired knowledge is a
    combination of these types.

47
Association Rules
  • These rules correlate the presence of a set of
    items with another range of values for another
    set of variables.
  • Example
  • When a person buys a gallon of milk that person
    is likely to buy a box of cereal.

48
Classification Hierarchies
  • Works from an existing set of events to create a
    hierarchy of classes.
  • Example
  • A model may be developed for the factors that
    determine the desirability of location of a store
    on a scale of
  • 1-10.

49
Sequential Patterns
  • A sequence of actions or events is sought.
  • Example
  • If a patient underwent cardiac bypass surgery
    for blocked arteries and an aneurysm and later
    developed high blood urea within a year of
    surgery, he or she is likely to suffer from
    kidney failure within the next 18 months.

50
Patterns Within Time Series
  • Similarities detected within positions of a time
    series of data
  • A sequence of data taken in intervals such as
    daily sales or daily closing stock prices.
  • Example
  • Two products show the same selling pattern in
    the summer but a different one in the winter.

51
Clustering
  • A population of events or items partitioned into
    sets of similar elements.
  • Example
  • An entire population of treatment data on a
    disease may be divided into groups based on
    similarity of side effects produced

52
Interest Measures
  • Two common interest measures
  • Support
  • Confidence

53
Support
  • Refers to how frequently a specific itemset
    occurs in the database.
  • LHS (left-hand side) gt RHS (right-hand side)
  • Support is the percentage of transactions that
    contain all of the items in the itemset, LHS U
    RHS.

54
Confidence
  • The probability that one item (RHS) will be
    purchased along with another item (LHS).
  • Computed as
  • the support (LHS U RHS)/support (LHS).

55
Support/Confidence Example
  • Milk gt Juice and Bread gt Juice
  • Support Milk, Juice 50
  • Support Bread, Juice 25
  • Confidence Milk gt Juice 66.7
  • Confidence Bread gt Juice 50
  • ( of milks and juices together / of milks)

56
Knowledge Discovery in Databases (KDD)
  • Steps in the KDD
  • Data Collection
  • Data Cleansing
  • Data Mining
  • Evaluation/Interpretation

57
Data Collection
  • Identify the target data set
  • Acquire the relevant application domain knowledge

58
Data Cleaning
  • Remove noise
  • Account for missing fields
  • Transform field values to common units
  • Generate Data

59
Data Mining
  • Extract patterns of interest from the data
    generated in the data cleaning phase.

60
Evaluation/Interpretation
  • Evaluates and Interprets the mined data.
  • Determines if the data is meaningful and useful.

61
KDD Diagram
62
Evolution of Data Mining
  • Data Collection (1960s)
  • Data Access (1980s)
  • Data Warehousing Decision Support
  • (1990s)
  • Data Mining (Emerging Today)

63
Evolution Timeline and Enabling Technologies
1960s
1980s
1990s
Today
Computers, tapes, disks
Relational databases SQL, ODBC
OLAP, multidimensional databases, data warehouses
Advanced algorithms, multiprocessor computers,
massive databases
64
Oracle Data Mining
  • Oracle Data Mining is an option to Oracle 10g
    Database Enterprise Edition
  • It embeds data mining functionality for making
    classifications, predictions, and associations
  • Also extracts new features from data, clusters
    data and ranks relative attribute importance.

65
What mining capabilities does Oracle Data Mining
support?
  • Oracle Data Mining provides programmatic access
    to six data mining algorithms embedded in Oracle
    Database.
  • SVM
  • NMF
  • MDL
  • Enhanced K-Means
  • O-Cluster
  • ABN

66
Areas that use these algorithms
  • Classification
  • Regression
  • Association Rules
  • Clustering
  • Attribute Importance
  • Feature Extraction
  • Text Mining

67
Classification
  • Predicts binary or multi-class outcomes.
  • In binary problems, each record either will or
    will not exhibit the modeled behavior. For
    example, a model could be built to predict
    whether a customer will churn or remain loyal.
  • Predictions for multi-class problems where there
    are several possible outcomes can also be made.
    For example, a model could be built to predict
    which class of service will be preferred by each
    prospect.
  • Deals with discrete/categorical target attributes.

68
Regression
  • Creates predictive models.
  • Deals with numerical/continuous target attributes
  • If the target attribute contains floating-point
    values, a regression technique is required.
  • If the target attribute contains string or
    discrete integer values, a classification
    technique is used.
  • Most common form is linear regression
  • A line that best fits the data is calculated

69
Association Rules
  • Detect associated or co-occurring events hidden
    in data.
  • Often used to find popular products that are
    related to each other, such as milk and cereal
    being associated with each other.

70
Oracle Association Rules
  • Used to identify co-occurring items or events in
    a variety of business problems, such as
  • Which items or products is this person most
    likely to buy or like?
  • The associations discovered are useful in
    designing special promotions, products bundles,
    and store displays.

71
Clustering
  • ODM provides two Clustering algorithms for the
    segmentation of cases in a dataset.
  • Enhanced version of K-Means algorithm
  • O-Cluster algorithm

72
Attribute Importance
  • Measures the predictive power of each attribute
    in classifying the target values
  • Produces a list of attributes ranked by relative
    importance.
  • This information can be used to reduce the size
    of input data, increasing the speed of mining
    tasks.

73
Feature Extraction
  • A combination of attributes in the data that is
    of special interest and captures important
    characteristics of the data.
  • Creates a new set of features by decomposing the
    original data.
  • Lets you describe the data with a number of
    features smaller than the number of original
    attributes.

74
Text Mining
  • Conventional data mining done using text
    features.
  • Usually keywords, frequencies of words, or other
    document-derived features.
  • Once you derive text features, you mine them just
    as you would any other data.

75
Text Mining (cont.)
  • Some of the applications for text mining include
  • Create and manage taxonomies
  • Classify or categorize documents
  • Integrating search capabilities and
    classification and clustering of documents
    returned from a search
  • Automatic extraction of topics
  • Feature extraction for subsequent mining

76
Sources
  • http//web.cecs.pdx.edu/saxenas/cs544/CS544_kdd.p
    df
  • http//www.thearling.com/text/dmwhite/dmwhite.htm
  • http//www.oracle.com/technology/products/bi/odm/o
    dm_10g_faq.html

77
Other Types of Association Rules
  • Multidimensional Associations
  • e.g. People who buy milk between 600 800
  • Partition into non-overlapping labels

78
Other Types of Association Rules
  • Negative Associations
  • Harder than finding a positive association
  • Must find interesting rules
  • Use hierarchies

79
Additional Considerations
  • Millions of transactions
  • Variability
  • Multiple dimensions
  • Quality of data is variable

80
Classification
  • What is Classification?
  • The process of learning a model that describes
    different classes of data
  • Classes are predetermined
  • Supervised Learning

81
Clustering
  • What is Clustering?
  • Partitioning data without having a training
    sample
  • Unsupervised Learning
  • Goal place records in groups so that records are
    with similar records and not with dissimilar
    records

82
Clustering
  • Decision Trees
  • http//www.20q.net/

83
Neural Networks
  • What is a Neural Network?
  • A technique derived from artificial intelligence
    research using generalized regression and an
    iterative method

84
Neural Networks
  • Supervised Networks
  • Adaptive methods that attempt to reduce the
    output error

85
Neural Networks
  • Unsupervised Networks
  • Adaptive methods that develop internal
    representations without sample outputs

86
Genetic Algorithms
  • What is a Genetic Algorithm?
  • A class of randomized search procedures capable
    of adaptive search over a wide range of
    topologies
  • Based on the A,C,T,G nucleotides of DNA

87
Genetic Algorithms
  • Uses a set of solutions
  • Randomized algorithm due to probabilistic
    operators
  • Finds near-optimal balance between knowledge
    acquisition and exploitation

88
Applications of Data Mining
  • Marketing
  • Finance
  • Manufacturing
  • Health Care

89
Applications - Marketing
  • Analysis of customer behavior
  • Determination of marketing strategies
  • Design of catalogs and store layouts

90
Applications - Finance
  • Analysis of creditworthiness
  • Evaluation of financing options
  • Fraud detection

91
Applications - Manufacturing
  • Optimization of resources
  • Optimal design of manufacturing processes
  • Product design

92
Applications Health Care
  • Discovering patterns in radiology images
  • Analysis of experimental data
  • Optimization of processes within a hospital

93
Commercial Data Mining Tools
  • Use Open Database Connectivity (ODBC)
  • Use GUIs
  • Application Programming Interface (API)

94
Commercial Data Mining Tools
  • Examples
  • IBM Intelligent Miner
  • DBMiner
  • SAS Enterprise Miner
  • Silicon Graphics MineSet

95
Commercial Data Mining Tools
  • Future Directions
  • More fully developed internet capability
  • Hybrid approaches
  • Parallel and Distributed environments
  • Windows NT and UNIX will be standard
  • Proprietary input formats expected to disappear
  • Non-standard data
Write a Comment
User Comments (0)
About PowerShow.com