IST 210: Organization of Data Data Mining and Data Warehouse - PowerPoint PPT Presentation

1 / 21
About This Presentation
Title:

IST 210: Organization of Data Data Mining and Data Warehouse

Description:

Commonly, the data at a warehouse is of two types ... young people tend to spend more time watching TV, then put more beer commercials ... – PowerPoint PPT presentation

Number of Views:198
Avg rating:3.0/5.0
Slides: 22
Provided by: jwa22
Category:

less

Transcript and Presenter's Notes

Title: IST 210: Organization of Data Data Mining and Data Warehouse


1
IST 210 Organization of DataData Mining and
Data Warehouse
  • James Z. Wang
  • School of Information Sciences and Technology
  • http//wang.ist.psu.edu

2
Data Warehousing
user query
result
Warehouse
Combiner
Wrapper
Wrapper
Wrapper
3
Schemas of Data Warehouse
  • Commonly, the data at a warehouse is of two types
  • Fact data very large, accumulation of facts such
    as sales
  • Often INSERT only once there, a tuple remains
  • Dimension Data smaller, generally static,
    information about the entities involved in the
    facts

4
Example
  • Fact data about a Wal-Mart store is recorded in a
    relation with schema
  • Sales (storeID,product,customer,day,time,price)
  • Dimension data could include a relation for
    stores, one for customers, and one for products
  • Stores (storeID, addr, phone)
  • Customers (name, addr, phone, creditcard)
  • Products (barcode, name, manf)

5
OLTP vs. OLAP
  • Most database operations are of a type called
    on-line transaction processing (OLTP)
  • Short, simple queries and frequent updates
    involving one or a small number of tuples
  • Example answering queries from a Web interface,
    recording sales at cash-registers, selling
    airline tickets
  • Of increasing importance are operations of the
    on-line analytic processing (OLAP)
  • Few, but very complex and time-consuming queries
    (can run for hours)

6
OLAP
  • Updates are infrequent, and/or the answer to the
    query in not dependent on having an absolutely
    up-to-date database
  • Example Amazon analyses purchases by all its
    customers to come up with a front screen with
    products of likely interest to the customers
  • Example Analysts at Wal-Mart look for items with
    increasing sales at stores in some region

7
Common Architecture
  • Local databases, say one per branch store, handle
    OLTP
  • A data warehouse integrating information from all
    branches handles OLAP
  • Complex OLAP queries are considered data mining
    activities

8
Data Mining Overview
  • Data Mining Large-scale data processing designed
    to extract patterns from data
  • (1) Find association rules or frequent itemsets
  • Example if a customer likes product A, what
    other products should we recommend to this
    customer
  • (2) Find dominating factors
  • Example what is the most important factor(s)
    related to your IST 210 midterm grade?

9
Data Mining
SQL Queries
Warehouse
Patterns Extracted
Data Preparation
Data Mining Program
10
(1) Find Association Rules
  • An important source of data for association rule
    is market baskets
  • As a customer passes through the checkout, we
    learn what items they buy together, e.g., diaper
    and beer
  • Gives us data with schema Baskets (ID, item)
  • Marketers would like to know what items people
    buy together
  • Example if people tend to buy diaper and beer
    together, put them near each other
  • Example run a sale on diaper and raise the price
    of beer

11
Simplest Problem
  • Find the frequent pairs of items
  • SELECT b1.item, b2.item
  • FROM Baskets b1, Baskets b2
  • WHERE b1.IDb2.ID AND b1.itemltb2.item
  • GROUP BY b1.item, b2.item
  • HAVING COUNT () gt 10000
  • This query is prohibitively expensive for large
    data.

12
A-Priori Trick
  • A-priori algorithm uses the fact that a pair
    (item1, item2) cannot appear together 10000 times
    unless item1 and item2 both appear at least 10000
    times
  • More efficient implementation
  • INSERT INTO Baskets1(ID, item)
  • SELECT FROM Baskets
  • WHERE item IN (
  • SELECT item From Baskets
  • GROUP BY item
  • HAVING COUNT() gt 10000)
  • Then run the query for pairs on Basket1 instead
    of Baskets.

13
(2) Find Dominating Factor
  • An important source of data for dominating
    factors is survey forms
  • As a customer fills out survey forms or
    questionnaires, we learn about the consumer
    demographics and the consumer behaviors
  • Marketers would like to know what factors are
    important for certain behavior
  • Example if young people tend to spend more time
    watching TV, then put more beer commercials
  • Example send junk mails to young people about
    subscription to cable TV programs

14
Method
  • Mostly statistical methods
  • CART Classification and Regression Trees
  • One of the best algorithms
  • Developed by Stanford and Berkeley Statisticians
  • Capable of automatic generation of predictions
  • Very fast (processing millions of data points in
    a few minutes)

15
A Market Project
  • We did a market study a total of 9409
    questionnaires were filled out by shopping mall
    customers in the S.F. area
  • Use the customer demographics as predictor
    variables
  • Train customer behavior models using the CART
    algorithm

16
Data Collection
  • Input gender (M/F), marital status (married,
    living together, divorced or separated, widowed,
    never married), age, education, occupation,
    income, dual incomes (if married), number of
    persons in household, number of kids under 18,
    householder status (own, rent, live with
    parents), type of home, ethnic classification,
    main language
  • Behaviors number of round trip flights last
    year, have a pet or not, hours spent watching TV
    each day, frequents night clubs,

17
Market Findings
  • After executing CART (15 secs CPU time), we
    found
  • Number of round trip flights last year depends
    mainly on occupation, annual income of household,
    and education. People with a professional,
    managerial, or military occupation, or people
    with high income or high education, tend to
    travel more.
  • Have a pet or not depends mainly on home type
    (house), householder status (own), and number of
    persons in the household (more family members).
  • Hours watching TV depends mainly on education
    (low), occupation (workers), age (mid-aged),
    marriage (not married), and income (low).
  • Frequent night clubs depends mainly on age
    (18-34), marriage status (unmarried), and gender
    (male).

18
Another Project (spring 2001)Study IST 210
Midterm Grades
  • Input 121 students in sections 2 and 3
  • The survey forms you filled in with your
    abilities in Math, UNIX, Windows, C, PERL, JAVA,
    DB, Visual Basic, HTML
  • Each item has three levels (Weak, OK, and Excel)
  • Behaviors
  • Your midterm I grade (lt70 or gt70)
  • Your midterm II grade (lt60 or gt60)
  • Execution of CART 0.05 sec CPU time

19
Findings on Midterm I
Math is most important to midterm I grade.
lt70
excel
excel OK
gt70
JAVA
WIN
excel OK
OK Weak
HTML
weak
gt70
excel
gt70
weak
MATH
gt70
lt70
excel OK
excel OK
OK Weak
C
JAVA
excel OK
lt70
HTML
weak
weak
weak
lt70
20
Findings on Midterm II
lt60
excel
lt60
OK
WIN
excel
OK weak
HTML
HTML
excel OK
gt60
UNIX
OK weak
Weak
gt60
excel OK
gt60
weak
C
Experience in C is most important to midterm II
grade.
lt60
Weak
excel OK
JAVA
gt60
weak
21
Summary
  • Data warehouse is designed mainly for data mining
    purposes
  • OLTP vs.OLAP (transaction vs. analytic)
  • Data mining
  • Find association rules
  • Find dominating factors
  • CART an automatic algorithm
Write a Comment
User Comments (0)
About PowerShow.com