MBA 7020 Business Analysis Foundations Data Warehousing - PowerPoint PPT Presentation

1 / 48
About This Presentation
Title:

MBA 7020 Business Analysis Foundations Data Warehousing

Description:

Data Warehouse & Relational Database. Designing & Building the Data Warehouse. Appendix: SQL ... Hot Dogs IMPLIES Hot Dog Buns. TV IMPLIES TV Warranty. Inexplicable ... – PowerPoint PPT presentation

Number of Views:244
Avg rating:3.0/5.0
Slides: 49
Provided by: BHREn8
Category:

less

Transcript and Presenter's Notes

Title: MBA 7020 Business Analysis Foundations Data Warehousing


1
MBA 7020Business Analysis FoundationsData
Warehousing Data Mining July 25, 2005
2
Agenda
Data Mining
Designing Building the Data Warehouse
Appendix SQL
3
The Data Warehouse
  • The Data Warehouse
  • is physically separated from all other
    operational systems
  • holds aggregated data and transactional data for
    management separate from that data used for
    online transaction processing

4
Data Flow
Operational Data Store
Data Warehouse
Personal Data Warehouse
Legacy Systems
Data Mart
Metadata
5
Metadata
  • What is Metadata?
  • Data about Data
  • Without metadata, the data is meaningless
  • Provides consistency of the truth
  • Components of Metadata
  • Transformation Mapping
  • Extraction and Relationship History
  • Algorithms for Summarization (and calculations)
  • Data Ownership
  • Patterns of Warehouse Access
  • Business Friendly naming conventions
  • Status Information

6
Data Warehouse Vendors
  • Business Objects
  • Cognos
  • Hyperion
  • IBM
  • Microsoft
  • NCR / Teradata
  • Oracle
  • SAS

7
Relational Database
  • A relational database is a collection of data
    items organized as a set of formally-described
    tables from which data can be accessed or
    reassembled in many different ways without having
    to reorganize the database tables. The relational
    database was invented by E. F. Codd at IBM in
    1970.
  • The standard user and application program
    interface to a relational database is the
    structured query language (SQL). SQL statements
    are used both for interactive queries for
    information from a relational database and for
    gathering data for reports.
  • A relational database is a set of tables
    containing data fitted into predefined
    categories. Each table (which is sometimes called
    a relation) contains one or more data categories
    in columns. Each row contains a unique instance
    of data for the categories defined by the
    columns. For example, a typical business order
    entry database would include a table that
    described a customer with columns for name,
    address, phone number, and so forth. Another
    table would describe an order product, customer,
    date, sales price, and so forth. A user of the
    database could obtain a view of the database that
    fitted the user's needs. For example, a branch
    office manager might like a view or report on all
    customers that had bought products after a
    certain date. A financial services manager in the
    same company could, from the same tables, obtain
    a report on accounts that needed to be paid.

8
Relational Database
When creating a relational database, you can
define the domain of possible values in a data
column and further constraints that may apply to
that data value. For example, a domain of
possible customers could allow up to ten possible
customer names but be constrained in one table to
allowing only three of these customer names to be
specifiable. The definition of a relational
database results in a table of metadata or formal
descriptions of the tables, columns, domains,
and constraints. Meta is a prefix that in most
information technology usages means "an
underlying definition or description." Thus,
metadata is a definition or description of data
and metalanguage is a definition or description
of language. A database is a collection of data
that is organized so that its contents can easily
be accessed, managed, and updated. The most
prevalent type of database is the relational
database, a tabular database in which data is
defined so that it can be reorganized and
accessed in a number of different ways. A
distributed database is one that can be dispersed
or replicated among different points in a
network. An object-oriented programming database
is one that is congruent with the data defined in
object classes and subclasses. SQL (Structured
Query Language) is a standard interactive and
programming language for getting information from
and updating a database. Although SQL is both an
ANSI and an ISO standard, many database products
support SQL with proprietary extensions to the
standard language. Queries take the form of a
command language that lets you select, insert,
update, find out the location of data, and so
forth.
9
Business Intelligence Environment
Internal Source Systems
Data Warehouse Data Mart
Extract, Transformation and Load
External Data Sources
10
Relational Database
  • IBM DB2, DB2/400
  • Microsoft SQL/Server
  • Teradata
  • Oracle
  • Sybase
  • Informix / Red Brick
  • Microsoft Access
  • MySQL

11
SQL
  • SQL Structured Query Language
  • DDL Data Definition Language
  • Create
  • Drop
  • Alter
  • DML Data Manipulation Language
  • Insert
  • Update
  • Delete
  • Select

12
Relational Database
SQL Request
RDBMS
Software Application
Result Set
13
Agenda
Data Mining
Designing Building the Data Warehouse
Appendix SQL
14
Why Business Intelligence
  • Improve consistency and accuracy of reporting
  • Reduce stress on operational systems for
    reporting and analysis
  • Faster access to information
  • BI tools provide increased analytical
    capabilities
  • Empowering the Business User
  • Companies are realizing that data is a companys
    most underutilized asset

15
ERM vs. DM
  • ERM - Entity Relationship Model
  • Remove redundancy
  • Efficiency of transactions
  • DM - Dimensional Model
  • Intuitive View of the Data
  • Efficiency of access and analysis

16
Dimensional Model
17
Retail Sales Dimensional Model (Partial)
18
Fact Table
  • Contains Foreign Keys that relate to Dimension
    Tables
  • Have a many-to-one relationship to Dimension
    Tables
  • Contains Metrics to be aggregated
  • Typically does not contain any non-foreign key or
    non-metric data elements
  • Level of Granularity defines depth and
    flexibility of analysis

19
Dimension Table
  • Contains a Primary Key that relates to the Fact
    Table(s)
  • Has a one-to-many relationship to the Fact
    Table(s)
  • Contains Descriptive data used to limit and
    aggregated metrics from the Fact Table(s)
  • Can sometimes contain pre-aggregated data

20
Agenda
Data Mining
Designing Building the Data Warehouse
Appendix SQL
What is Data Mining? Market Basket
Analysis Marketing Analytics Direct Marketing
Campaign Cluster Analysis
21
What is Data Mining?
  • A set of activities used to find new, hidden, or
    unexpected patterns in data
  • Verification versus Discovery
  • Accuracy in predicting consumer behavior

22
OLAP Online Analytical Processing
  • MOLAP Multidimensional OLAP
  • ROLAP Relational OLAP

23
Limitations of Data Mining
  • All relevant data items / attributes may not be
    collected by the operational systems
  • Data noise or missing values (data quality)
  • Large database requirements and
    multi-dimensionality

24
Techniques and Technologies
  • Techniques Used to Mine the Data
  • Classification
  • Association
  • Sequence
  • Cluster
  • Data Mining Technologies
  • Statistical Analysis
  • Neural Networks, Genetic Algorithms and Fuzzy
    Logic
  • Decision Trees

25
General Data Mining Methods
  • Predicting which customers will purchase, based
    on demographics, psychographics, firmographics,
    service history, transactions, credit history,
    etc. Statistical algorithms and decision trees
    are used for these problems with much success.
  • Market Basket Analysis which customers who
    purchase an additional telephone line are also
    likely to purchase dialup internet service?
    Pattern matching works well associative rules,
    fuzzy logic, neural networks.
  • Which types of activities precede each other eg,
    do customer hospitality and gaming activities
    show patterns or sequences? We use a combination
    of statistical modeling and simulations to
    identify these trigger points for action, and to
    estimate the marginal value of each.
  • Clustering is useful for determining similar
    groups based on how closely they resemble each
    other. Multitude of clustering techniques exist,
    with the primary difference being in how they
    define what is close. Clustering can be very
    useful for marketing messaging and advertising,
    strategy development and implementation, and
    channel development.

Classification
Association
Sequencing
Clustering
26
Analytics Process
DATA
KNOWLEDGE
LEVERAGING
POST
DISCOVERY
PREPARATION
DEVELOPMENT
ANALYTICS
ANALYSIS
FEEDBACK
IDENTIFYING
HYPOTHESIS
CUSTOMER
DATA WAREHOUSE
TESTING
OPPORTUNITIES
BEHAVIOR
RESULTS
SCORING
EXTERNAL DATA
DECOMPOSITION
SCOPING
STATISTICAL
APPEND
DIRECT MAIL
MODELING
EFFORT
TELEMARKETING
OBJECTIVE
DATA EXTRACTION
FEEDBACK FOR
SEGMENTATION
EMAIL
SETTING
REFINING
ANALYTICS
LOYALTY
DATA VALIDATION
OFFER
DEVELOPING
CAMPAIGN
OPTIMIZATION
HYPOTHESES
27
Market Basket Analysis
  • Market Basket Analysis
  • Most common and useful in Marketing
  • What products customers purchase together
  • Diapers and Beer sell well on Thursday nights
  • Benefits
  • Better target marketing
  • Product positioning with stores (virtual stores)
  • Inventory management
  • Limitations
  • Large volume of real transactions needed
  • Difficult to correlate frequently purchased items
    with infrequently purchased items
  • Results of previous transactions could have been
    affected by other marketing promotions

28
Market Basket Analysis
  • Association Rules for Market Basket Analysis
  • All associations are unidirectional and take on
    the following form
  • Left-hand side rule IMPLIES Right-hand side rule
  • Left and Right hand side can both contain
    multiple items (Multi-dimensional Market
    Analysis)
  • Examples
  • Steak IMPLIES Red Wine
  • Hunting Magazines IMPLIES Smokeless Tobacco

29
Market Basket Analysis
  • 3 Measures of Market Basket Analysis
  • Support the percentage of baskets in the
    analysis where the rule is true
  • Of 100 baskets 11 contained both steaks and red
    wine.
  • 11 support
  • Confidence the percentage of Left-hand side
    items that also have right-side items
  • Of the 17 baskets that contained steak, 11
    contained red wine.
  • 65 confidence
  • Lift compares the likelihood of finding the
    right-hand item in any random basket
  • Also referred to as Improvement
  • Lift of less than 1 means it is less predictive
    than random choice
  • If Confidence is 35, but the right-hand side
    items is in 40 of the baskets, the rule offers
    no Improvement of random selection.

30
Market Basket Analysis
  • Market Basket Analysis results can be
  • Trivial
  • Hot Dogs IMPLIES Hot Dog Buns
  • TV IMPLIES TV Warranty
  • Inexplicable
  • Virtual Items Associating non-items or other
    attributes into the correlation study
  • New Customer

31
Marketing Analytics Landscape
32
Direct Marketing Campaign Platform
33
Cluster Analysis
  • Definition The identification and grouping of
    consumers that share similar characteristics
  • Yields better understanding of
    prospects/customers
  • Translates into improved business results
    through revised strategies attributes
  • Definition The identification and grouping of
    consumers that share similar characteristics
  • Process
  • Data Selection
  • Missing Values
  • Standardization
  • Removal of Outliers
  • Cluster Analysis Considerations

34
Cluster Analysis
  • Only want a small subset of variables for
    clustering
  • Weed out undesirable variables
  • Can use PROC FACTOR, PROC CORR
  • Can use expert system
  • Consideration for observations, weighting
  • Probably done with factor analysis
  • If not, then two options
  • Set Missing to Mean of data
  • Set Missing to Value of Equivalent Performance
  • No right or wrong answer
  • Might do both - depending on variables

35
Clustering
Midscale / Business Traveler
Midscale / Leisure Traveler
Upscale / Business Traveler Loan Dependent
Upscale / Leisure Traveler
Prospect Base
Country Club / Resort Set
Upscale / Business Traveler Prosperous Traveler
Other
36
Cluster Analysis
 
37
Cluster Analysis
 
38
Cluster Analysis
Cluster 1
Cluster 1
Cluster 1
------------
Calculate Scores (ROI, Response, Utilization)
Overlay Profitability Estimate
Low
RETURN
High
Evaluate Risk-Return Tradeoff (by Offer and by
Cluster)
Low
Mail
RISK
Make Final Selections
No-Mail
High
DM/Offer 1
DM /Offer 2
DM /Offer N
--------
 
39
Agenda
Data Mining
Appendix SQL
Designing Building the Data Warehouse
40
SQL Select Statement
SELECT column1, column2, . . .
FROM table1, table2, . . .
WHERE criteria1 AND/OR criteria2 . . . .
.
ORDER BY column1, column1, . . .
41
SQL Select Statement
SELECT column1, column2, . . .
FROM table1, table2, . . .
WHERE criteria1 AND/OR criteria2 . . . .
.
GROUP BY column1, column1, . . .
  • HAVING criteria1
  • AND/OR criteria2
  • . . . . .

Aggregation
ORDER BY column1, column1, . . .
42
SQL Example 1
SQL CREATE TABLE ADDR_BOOK ( NAME char(30),
COMPANY char(20), E_MAIL char (25)
Output Name Company Email John
Smith Microsoft john.smith_at_microsoft.com Jeff
Jones Delta jeffjones_at_delta.com
43
SQL Example 2
2a) SQL SELECT NAME, COMPANY, E_MAIL FROM
ADDR_BOOK WHERE COMPANY Microsoft'
Output Name Company Email John
Smith Microsoft john.smith_at_microsoft.com
2b) Table - Product ID Name Category I Internet
A B Browsers A A Application Null G Graphics N
ull SQL SELECT ID, NAME from PRODUCT WHERE
CATEGORY NULL
44
SQL Example 3
SQL SELECT ADDR_BOOK.NAME, COMPANY.EMAIL FROM
ADDR_BOOK, COMPANY WHERE ADDR_BOOK.EMPLOYEE_ID
COMPANY.EMPLOYEE_ID Output Name Email John
Smith john.smith_at_microsoft.com Jeff
Jones jeffjones_at_delta.com
45
SQL Example 4
SQL CREATE TABLE CUSTOMER ( CUST_NO INTEGER,
FIRST_NAME CHAR(30), LAST_NAME CHAR(30),
ADDRESS CHAR(50), CITY CHAR(30), STATE CHAR
(2), ZIP_CODE CHAR(9), COUNTRY CHAR(20) )
CREATE TABLE ORDER ( ORDER_NO INTEGER,
DATE_ENTERED DATE, CUST_NO INTEGER )
SQL SELECT ORDER.ORDER_NO, CUSTOMER.NAME,
CUSTOMER.ADDRESS, CUSTOMER.CITY,
CUSTOMER.ZIP_CIDE, CUSTOMER.COUNTRY FROM ORDER,
CUSTOMER WHERE ORDER.CUST_NO CUSTOMER.CUST_NO
AND ORDER.DATE_ENTERED '1998-20-11'
46
SQL Example 5
SQL CREATE TABLE ADDR_BOOK ( NAME char(30),
COMPANY char(20), E_MAIL char (25)
Output Name Company Email John
Smith Microsoft john.smith_at_microsoft.com Jeff
Jones Delta jeffjones_at_delta.com
47
SQL Example 6 Referential Integrity
SQL CREATE TABLE CUSTOMER ( CUST_NO INTEGER
PRIMARY KEY, FIRST_NAME CHAR(30), LAST_NAME
CHAR(30), ADDRESS CHAR(50), CITY CHAR(30),
ZIP_CODE CHAR(9), COUNTRY CHAR(20) ) CREATE
TABLE ORDER ( ORDER_NO INTEGER PRIMARY KEY,
DATE_ENTERED DATE, CUST_NO INTEGER REFERENCES
CUSTOMER (CUST_NO) )
SQL CREATE TABLE ORDER_ITEMS ( ORDER_NO INTEGER,
ITEM_NO INTEGER, PRODUCT CHAR(30), QUANTITY
INTEGER, UNIT_PRICE MONEY ) ALTER TABLE
ORDER_ITEMS ADD PRIMARY KEY PK_ORDER_ITEMS
(ORDER_NO, ITEM_NO) ALTER TABLE ORDER_ITEMS
ADD FOREIGN KEY FK_ORDER_ITEMS_1 (ORDER_NO)
REFERENCES ORDER (ORDER_NO)
48
SQL Example 7 Index
When you have a primary key, you already have an
implicitly (or explicitly) defined unique index
on the primary key columns. It's generally a good
idea to define non-unique indexes on the foreign
keys. SQL CREATE UNIQUE INDEX PK_CUSTOMER ON
CUSTOMER (CUST_NO) CREATE UNIQUE INDEX PK_ORDER
ON ORDER (ORDER_NO) CREATE INDEX FK_ORDER_1 ON
ORDER (CUST_NO) CREATE UNIQUE INDEX
PK_ORDER_ITEMS ON ORDER_ITEMS (ORDER_NO, ITEM_NO)
CREATE INDEX FK_ORDER_ITEMS_1 ON ORDER_ITEMS
(ORDER_NO)
Write a Comment
User Comments (0)
About PowerShow.com