MINING - PowerPoint PPT Presentation

1 / 64
About This Presentation
Title:

MINING

Description:

DATA WAREHOUSE. A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining) – PowerPoint PPT presentation

Number of Views:155
Avg rating:3.0/5.0
Slides: 65
Provided by: wiu62
Category:

less

Transcript and Presenter's Notes

Title: MINING


1
MINING WAREHOUSING (MSS2) BY CHANDRA S.
AMARAVADI
2
EXTENSIONS TO DSS
  • BI systems (aka EIS)
  • Geographical Information Systems (GIS)
  • Collaborative Systems (formerly GDSS)
  • Expert Systems
  • OLAP/Data mining/warehousing

3
DATA WAREHOUSES
4
DATA WAREHOUSE
A large collection of historical data that is
organized specifically for use in decision
support (i.e. OLAP, data mining)
5
DATA WAREHOUSING
USERS
APPLICATIONS
WAREHOUSE
User1
Prodn.
S E R V E R
Mktg.
User2
Fin.
User3
User4
Acctg.
6
THE DATA LIFE CYCLE
Dashboard/BI
Internal Data
Data Visualization
Data Marts
OLAP
External Data
Data Warehouse
Decisions
Internal Data
Data Mining
Data Marts
Knowledge
The activities taking place with respect to data
for warehouse/OLAP/mining
7
INTEGRATION OF DATA
Data Warehouse
application A m,f
M/F application B
1,0 application C x,y application D male,
female
8
CHARACTERISTICS OF WAREHOUSES
Data warehouses have a number of characteristics
  • Subject-oriented
  • Integrated
  • Time-variant
  • Non-volatile

9
KEY FEATURES OF DATA WAREHOUSES
Subject-oriented A data warehouse is organized
around major subjects, such as customer,
supplier, product, and sales. Integrated A data
warehouse is usually constructed by integrating
data from multiple heterogeneous sources, such as
relational databases, flat files, and on-line
transaction records. Time-variant Data are
stored to provide information from a historical
perspective (e.g., the past 5-10 years). Every
key structure in the data warehouse contains,
either implicitly or explicitly, an element of
time Nonvolatile The data in a warehouse is
permanent.
10
DESIGN OF WAREHOUSE
Design of warehouses is similar to databases
  • Schema not radically different
  • W/H designed primarily for queries
  • Database will be converted to cube at run time.

11
ORGANIZING THE W/H
STAR SCHEMA Consists of a large central table
and a set of smaller tables, one each for each
dimension. SNOWFLAKE SCHEMA A variant of the
star schema, Where some dimension tables are
normalized, thereby splitting the data into
additional tables. CONSTELLATION SCHEMA A
collection of stars.
12
STAR SCHEMA OF A DATA WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
What are the dimensions here?
13
SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
14
CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR
SALES AND SHIPPING (FYI)
DEPARTMENTS
BRANCH
EMPLOYEES
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
15
EXAMPLE OF WAREHOUSE MARTS
Weekly sales by region 2007-2013
Sales by Product Line 2010-2013
Weekly sales by state 2005-2013
Weekly sales by product 2008-2013
Data Mart
Data Mart
Warehouse Sales Detail
Sales Detail 2003-2013
16
OLAP DATA WAREHOUSES
17
ONLINE ANALYTICAL PROCESSING
OLAP Tools to analyze data in a warehouse for
decision support. How many light bulbs sold in
December?
  • Aggregation (data summarization)
  • Dimension
  • Slice Dice
  • Cube organization (Multi-dimensional database)

18
AGGREGATION CUBE
Region
Units
NE
SE
SW
North. E 40 South E.
20 South W. 30 Midwest
20 South W.
50 North E. 65
65
20
30
40
50
sales in the Northern region?
19
DIMENSIONS CONCEPT HIERARCHIES
A dimension is an aspect of the data, it is a
characteristic of a variable such as location,
for sales variable. Dimensions can have
hierarchies (or various levels of aggregations) A
concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level,
more general concepts
20
CONCEPT HIERARCHY
Mfr. dimension
iPhone5
iPhone4
150,600
GALII
GAL
FX
G
G2
13,600
30,000
15,067
18,240
51,326
21
MULTI-DIMENSIONAL ORGANIZATION
Sales, costs etc.
nw
Products (tables, desks, lamps..)
sw
Regions
Cube organization supports slice dice
22
MULTI-DIMENSIONAL ORGANIZATION..


LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359





LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359



March sales


LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359



February sales
January sales
shows multi-dimensional/cube organization
23
CUBE ORGANIZATION
  • Data from warehouse imported into memory
  • A sophisticated 3D representation is created
  • Referred to as sparse matrix
  • Sides of cube are dimensions
  • Allows slice dice
  • Answers to high level queries/reports

24
DATA MINING
25
DATA MINING
Application of statistical and AI techniques to
identify patterns that exist in large databases
but are hidden in the vast amounts of data.
e.g. sequence/association, classification, and
clustering
  • Buyer non-buyer
  • Success/failure of an operation
  • Customers who spend more
  • Preferences of target groups

26
SOME DATA MINING APPLICATIONS
  • Predicting the probability of default for
    consumer loans
  • Predicting audience response to TV advertisements
  • Predicting the probability that a cancer patient
    will respond to radiation therapy.
  • Predicting the probability that an offshore well
    will produce oil
  • A phone company mined data from 140 million
    households, each with as many as 10,000
    attributes, including life-style and calling
    habits. Identified 22 profiles

27
DATA MINING AS A STEP IN THE PROCESS OF
KNOWLEDGE DISCOVERY
Evaluation and Presentation
Knowledge
Data Mining
Selection and Transformation
Patterns
Data Warehouse
Cleaning and Integration
Database
Flat files
28
DATA MINING PROCESS
Cleaning integration data is brought in from
multiple sources Selection transformation
sometimes called dimensionality reduction, it is
concerned with selection of dimensions and
sometimes the raw data needs to be transformed to
suit the problem e.g. calculate margin. Data
mining - process of extracting data patterns,
using statistical or AI techniques. Pattern
evaluation - identifying patterns useful and
relevant to the organizational context.
Knowledge presentation -- Visualization and
knowledge representation techniques are used to
present the mined knowledge to the user.
29
OLAP, MINING WAREHOUSING...
Data warehousing refers to the use of high
speed/high capacity servers to store historical
transaction information and to make this
information accessible to decision makers.
OLAP is used to perform high level analysis of
data based on data summarization (aggregation)
and slice and dice operations. For e.g. how many
shoes sold in midwest in Feb? Data mining refers
to identification of patterns from data.
30
DATA MINING TECHNIQUES
31
TYPES OF ANALYSES
Sequence -- Activities occurring one after
another e.g. loan after buying car,
warranty. Association -- (AKA Market Basket
Analysis) Activities
which occur together (e.g. bread and
meat) Classification -- Identifying profiles of
data classified into
pre-defined groups (frequent infrequent
shoppers) Clustering --
Identifying natural characteristics of data
(what major areas are cust.
coming from?)
32
SEQUENTIAL ANALYSIS
33
SEQUENTIAL ANALYSIS
Predict purchasing
  • 40 of customers buy a gray skirt six months
    after buying a red jacket
  • Identified by time-series analysis, Neural
    networks or genetic algorithms

34
IDENTIFYING SEQUENTIAL PATTERNS..
applications in forecasting exchange rates, meat
consumption , bankruptcies etc.
35
ASSOCIATIVE ANALYSIS
36
ASSOCIATION
  • Identifies items purchased together
  • Customers who bought items A, B, C also buy Y
  • 85 percent of customers who buy a certain wine
    brand also buy a certain type of pasta
  • Many men buy a six-pack when they purchase
    diapers
  • Enhances product placement in stores

37
ANALYSIS TECHNIQUES ASSOCIATION
Algorithm A-priori
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
  • list all items 1 item set (C1)
  • filter by min. transaction support (L1)
  • identify 2-item sets (L1 L1) (C2 )
  • filter and generate L2
  • repeat process...

Database
Min. transaction support is the number (sometimes
given as ) of transactions in which the item
must occur.
38
A-PRIORI ALGORITHM
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
Itemset Sup.
A 2
B 3
C 3
D 1
E 3
Itemset Sup.
A 2
B 3
C 3
E 3
minimum Support 2
Scan Database
C1
L1
Itemset
A B
A C
A E
B C
B E
C E
Itemset Sup.
A B 1
A C 2
A E 1
B C 2
B E 3
C E 2
Itemset Sup.
A C 2
B C 2
B E 3
C E 2
Scan Database
minimum Support 2
L2
C2
C2
Scan Database
Itemset Sup.
B C E 2
Itemset
B C E
Itemset Sup.
B C E 2
L3
C3
C3
39
DISCUSSION
Apply associative rule mining (Use A-priori
algorithm) to the following portfolios of clients
of a brokerage company, to identify stocks that
are purchased together. Use a minimum support of
two.
Client1 AAPL, ORCL, WFC, BAC
Client2 BAC, WTU
Client3 GOOG, IBM, C
Client4 WTU, WFC, C, IBM
40
CLASSIFICATION ANALYSIS
41
CLASSIFICATION
A technique for grouping data into pre-defined
classes using certain attributes of the data.
E.g. defaulter or not, cruise customer, 4G
subscriber or not etc.
  • Define classes
  • Identify classification criteria
  • Select suitable technique
  • application

42
CLASSIFICATION TECHNIQUES
  • Discriminant analysis (categorical membership)
  • Bayesian classification
  • Logistic regression (probabilistic membership)
  • Neural networks

43
DISCRIMINANT ANALYSIS
A method of classification that uses a
Discriminant Function to decide classes
  • DF based on attributes and weights.
  • Calculated as a weighted average of attributes
    and weights (of training data) (Zave score)
  • E.g. (GMAT 200 UGPA) gt 1200
  • Given a new set of data, calculate Z score
  • Compare with Zav

DF Discriminant Function
a simplified version
44
CLASSIFICATION WITH BAYESIAN ANALYSIS
45
REFRESHER ON PROBABILITY
Probability is the chance that an event/outcome
will take place.
  • E.g. Cubs had 43 wins 36 losses in one season
  • Probability of winning?

46
PRIOR (CONDITIONAL) PROBABILITIES
Prior probabilities are knowledge of other events
which may help improve predictions
  • E.g. p(IPO success) 0.33
  • Suppose we know that a big company behind IPO
  • p(IPO success/big company) 0.99

Which is higher?
P(successful cellphone call) or P(successful
call/subscriber in service area)
47
CONDITIONAL PROBABILITIES..
College of students
CBT 1,800
COAS 2,300
COFAC 1,200
COE 6,700
Total 12,000
If we see a student in the union and he/she is a
WIU student What is the probability he/she is a)
CBT?, b) COAS?, c) COFAC?, d) COE?
48
BAYESIAN ANALYSIS
Bayes theorem can be exploited for classification
  • Bayes theorem
  • P(A/B) P(B/A) x P(A)/P(B)
  • Classify into class I or II based on conditional
    probability
  • P(class I/x) gt P(class II/x)?
  • E.g. wiu student in union, which college are
    they?

A method for classifying objects/events into
classes based on probabilities of occurrence of
the objects/events
x is some condition e.g. surgery or being a
shopper in a retail chain
49
INTRODUCTION TO BAYESIAN ANALYSIS
We are interested in p(person becoming a
manager/mba) How can we use Bayes theorem?
ASSUME
  • 300 m population 100 m employees
  • 500,000 are managers
  • 10,000 managers go to college for an MBA
  • 20 m go to college
  • 100,000 do MBA

you need to write formula using terms from the
problem
50
THIS SLIDE IS INTENTIONALLY BLANK
51
BAYESIAN ANALYSIS - EXAMPLE
PROBLEM
  • An observer has collected information about
  • Eagles Hawks for a long time. If a new bird
  • is spotted with a certain wingspan (x), need to
    know
  • whether eagle or hawk

From Data Mining Methods for Knowledge
Discovery by K.Cios, W.Pedrycz, R.Swiniarski
52
BAYESIAN ANALYSIS - EXAMPLE
CLASSIFICATION PROBLEM
  • Compare p(eagle/x) and p(hawk/x)
  • Whichever is higher ? unknown bird
  • We dont know p(eagle/x) or p(hawk/x)

53
BAYESIAN ANALYSIS - EXAMPLE
Bayes theorem
Decision rule?
54
PROBABILITY DENSITY FUNCTION
Shows from observations of birds, the
probability of a bird having a particular wingspan
P(xeagle), P(x hawk)
55
PROBABILITY OF EAGLE OR HAWK
N Number of birds
neagle nhawk
0.8
0.2
56
USING BAYES THEOREM
New birds size 45 cm
(from known probability density functions)
p(45eagle) 2.22 x 10-2 p(45hawk) 1.10 x 10-2
2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2
0.01776 gt 0.0021
Decision rule predicts eagle
57
CLUSTERING
58
DATA MINING FUNCTIONS
Clustering
  • Goal is to identify natural groupings of data.
    applicationications in market segmentation,
    discovering affinity groups, and defect analysis
  • based on euclidean distance
  • similar to centroid
  • trial error approach
  • need to guess of clusters

59
CLUSTER ANALYSIS OF CUSTOMER LOCATIONS IN A CITY
60
CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS
Income Medium Children 2
Income High Children 1 Car Luxury
Income Medium Children3
Car Sedan and Car Truck
Cluster 4
Cluster 1
Cluster 3
Income Low Children 0 Car Compact
Cluster 2
61
A SIMPLE EXAMPLE
ACCT MONTHLY SPENDING
109905 200
112343 30
124205 1,510
125687 11,078
130984 341
131987 543
140001 178
140387 2,000
150933 1,750
167912 230
62
(No Transcript)
63
(No Transcript)
64
DISCUSSION QUESTIONS
  • What kind of data is stored in a data warehouse?
  • What are the characteristics of a data warehouse?
  • How is data organized in a warehouse?
  • What tool(s) can we use to analyze data in a
    warehouse?
  • What type of multidimensional organization of
    data is also called a sparse matrix?
  • What is a concept hierarchy?
  • What technique can you use to predict the
    probability that a cancer patient will respond to
    radiation therapy?
  • What type of mining identifies natural
    characteristics of data?
  • What technique is useful for predicting trends?
  • What is minimum transaction support?
  • How is Bayes theorem used for classification?
Write a Comment
User Comments (0)
About PowerShow.com