Chapter 2 Data Mining Processes and Knowledge Discovery

Slides: 50

Provided by: MHE485

Transcript and Presenter's Notes

1
Chapter 2: Data Mining Processes and Knowledge
Discovery

Identify actionable results

2
Contents

Describes the Cross-Industry Standard Process for
Data Mining (CRISP-DM), a set of phases that can
be used in data mining studies
Discusses each phase in detail
Gives an example illustration
Discusses a knowledge discovery process

3
CRISP-DM

Cross-Industry Standard Process for Data Mining
One of the first comprehensive attempts toward a
standard process model for data mining
Independent of industry sector and technology

4
CRISP-DM Phases

Business (or problem) understanding
Data understanding
A systematic process to try to make sense of the
massive amounts of data generated from daily
operations.
Data preparation
Transform and create the data set for modeling
Modeling
Evaluation
Check good models; evaluate to ensure nothing is
missing
Deployment

5
Business Understanding

Solve a specific problem
Determining business objectives, assessing the
current situation, establishing data mining
goals, and developing a project plan.
Clear definition helps
Measurable success criteria
Convert business objectives to set of data-mining
goals
What to achieve in technical terms, such as:
What types of customers are interested in each of
our products?
What are typical profiles of customers?

6
Data Understanding

Initial data collection, data description, data
exploration, and the verification of data
quality.
Three issues considered in data selection:
Set up a concise and clear description of the
problem. For example, a retail DM project may
seek to identify spending behaviors of female
shoppers who purchase seasonal clothes.
Identify the relevant data for the problem
description, such as demographic, credit card
transactional, and financial data.
Select the variables that are relevant and
important for the project.

7
Data Understanding (cont.)

Data types
Demographic data (income, education, age)
Socio-graphic data (hobby, club membership)
Transactional data (sales records, credit card
spending)
Quantitative data are measurable using numerical
values.
Qualitative data, also known as categorical data,
contain both nominal and ordinal data (see also
page 22).
Related data can come from many sources:
Internal
ERP (or MIS)
Data Warehouse
External
Government data
Commercial data
Created
Research

8
Data Preparation

Once the available data sources are identified, the
data need to be selected, cleaned, built into the
desired form, and formatted.
Clean data: formats, gaps, filters, outliers,
redundancies (see page 22)
Unified numerical scales
Nominal data
Code (such as gender data: male and female)
Ordinal data
Nominal code or scale (excellent, fair, poor)
Cardinal data (Categorical, A, B, C levels)

9
Types of Data
Type         Features    Synonyms
Numerical    Continuous  Range
             Integer     Range
Binary       Yes/No      Flag
Categorical  Finite      Set
Date/Time                Range
String                   Typeless
Text                     String
Range: numeric values (integer, real, or date/time)
Set: data with distinct multiple values (numeric, string, or date/time)
Typeless: for other types of data
10
Data Preparation (Cont.)

Several statistical methods and visualization
tools can be used to preprocess the selected
data.
Statistics such as max, min, mean, and mode can
be used to aggregate or smooth the data.
Scatter plots and box plots can be used to filter
outliers.
More advanced techniques, such as regression
analysis, cluster analysis, decision trees, or
hierarchical analysis, may be applied in data
preprocessing.
In some cases, data preprocessing could take over
50% of the time of the entire data mining
process.
Shortening data processing time can reduce much
of the total computation time in data mining.
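
The summary statistics and box-plot filtering mentioned on this slide can be sketched in a few lines. This is only an illustration: the spending values are made up, the function names are hypothetical, and the 1.5 x IQR fence is the usual box-plot convention rather than something the slide specifies.

```python
import statistics

def summarize(values):
    """Return the basic statistics used to aggregate or smooth a column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
    }

def split_outliers(values, k=1.5):
    """Separate values inside and outside the box-plot fences.

    Assumption: fences sit k * IQR below Q1 and above Q3 (k = 1.5 is
    the common box-plot default, not a value taken from the slides).
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

# Hypothetical spending amounts with one extreme value.
spending = [12, 14, 15, 15, 16, 18, 19, 95]
summary = summarize(spending)
kept, outliers = split_outliers(spending)
print(summary)
print("kept:", kept, "outliers:", outliers)
```

Whether a flagged value is really noise or a genuinely important case is a business judgment; the rule only points to candidates.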

11
Data Preparation: Data Transformation

Data transformation uses simple mathematical
formulations or learning curves to convert
different measurements of the selected, cleaned
data into a unified numerical scale for data
analysis.
Data transformation can be used to
Transform from numerical to numerical scales, to
shrink or enlarge the given data, such as
(x - min)/(max - min) to shrink the data into the
interval [0, 1].
Recode categorical data to numerical scales.
Categorical data can be ordinal (less, moderate,
strong) or nominal (red, yellow, blue, ...), such
as 1 = yes, 0 = no.

See page 24 for more details.
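
As a minimal sketch of the two transformations described on this slide, the snippet below applies the min-max formula (x - min)/(max - min) and recodes categories to numbers. The function names, the sample values, and the 1/2/3 coding for the ordinal scale are assumptions made for illustration only.

```python
def min_max_scale(values):
    """Rescale numbers linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical code tables; only the yes/no coding appears on the slide.
ORDINAL_CODES = {"less": 1, "moderate": 2, "strong": 3}
BINARY_CODES = {"yes": 1, "no": 0}

def recode(values, codes):
    """Replace each category label with its numeric code."""
    return [codes[v] for v in values]

scaled = min_max_scale([10, 20, 40])
ordinal = recode(["strong", "less", "moderate"], ORDINAL_CODES)
flags = recode(["yes", "no", "yes"], BINARY_CODES)
print(scaled, ordinal, flags)
```

Numeric codes for nominal data (such as colors) carry no order, so a model that treats them as magnitudes may be misled; that caveat is a general one, not a point made in the slides.
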
12
Modeling

Data modeling is where the data mining software
is used to generate results for various
situations. Data visualization and cluster
analysis are useful for initial analysis.
Depending on the data type,
if the task is to group data, discriminant
analysis is applied.
If the purpose is estimation, regression is
appropriate if the data are continuous (and
logistic regression if not).
Neural networks could be applied for both tasks.
Data Treatment
Training set for development of the model.
Test set for testing the model that is built.
Maybe others for refining the model
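
The training/test treatment can be sketched as a random split. The two-thirds training fraction, the fixed seed, and the record names are assumptions chosen for the example; the slide itself does not prescribe a ratio here.

```python
import random

def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into training and test sets."""
    rng = random.Random(seed)      # fixed seed keeps the split repeatable
    shuffled = list(records)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical case identifiers.
cases = [f"case_{i}" for i in range(1, 31)]
train, test = split_train_test(cases)
print(len(train), "training records,", len(test), "test records")
```

A third slice can be held back in the same way when a separate set is wanted for refining the model.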

13
Data mining techniques

Techniques
Association: the relationship of a particular
item in a data transaction to other items in the
same transaction is used to predict patterns. See
also page 25 for an example.
Classification: the methods are intended for
learning different functions that map each item
of the selected data into one of a predefined set
of classes. Two key research problems related to
classification results are the evaluation of
misclassification and prediction power (C4.5).
Mathematical modeling is often used to construct
classification methods, such as binary decision
trees (CART), neural networks (nonlinear), linear
programming (boundary), and statistics.
See also pages 25 and 26 for more explanation.

14
Data mining techniques (Cont.)

Clustering: takes ungrouped data and uses
automatic techniques to put the data into
groups.
Clustering is unsupervised and does not require a
learning set. (Chapter 5)
Prediction: related to regression techniques,
used to discover the relationship between the
dependent and independent variables.
Sequential patterns: seek to find similar
patterns in data transactions over a business
period.
The mathematical models behind sequential
patterns are logic rules, fuzzy logic, and so on.
Similar time sequences: applied to discover
sequences similar to a known sequence over both
past and current business periods.

15
Evaluation

Does model meet business objectives?
Any important business objectives not addressed?
Does model make sense?
Is model actionable?

PDCA
CRISP-DM
16
Deployment

DM can be used to verify previously held
hypotheses or for knowledge discovery.
DM models can be applied to business purposes,
including prediction or identification of key
situations
Ongoing monitoring and maintenance
Evaluate performance against success criteria
Market reaction and competitor changes (remodel
or fine-tune)

17
Example

Training set for computer purchase
16 records
5 attributes
Goal
Find classifier for consumer behavior

18
Database (1st half)
Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    <=30   Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    <=30   Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes
19
Database (2nd half)
Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   <=30   High     No       Fair       Male    No
A11   <=30   High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   <=30   Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   <=30   Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No
20
Data Selection

Gender has weak relationship with purchase
Based on correlation
Drop gender
Selected Attribute Set
Age, Income, Student, Credit

21
Data Preprocessing

Income unknown in Case 15
Credit not available in Case 16
Drop these noisy cases

22
Data Transformation

Assign numerical values to each attribute
Age: <=30 = 3, 31-40 = 2, >40 = 1
Income: High = 3, Medium = 2, Low = 1
Student: Yes = 2, No = 1
Credit: Excellent = 2, Fair = 1
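
The coding scheme on this slide maps directly onto lookup tables. The sketch below encodes two records from the example database (A1 and A12); the `encode` function is hypothetical, and the label `<=30` is an assumed spelling of the youngest age bracket, which the transcript shows only as 30.

```python
# Code tables copied from the slide; "<=30" is an assumed label.
AGE = {"<=30": 3, "31-40": 2, ">40": 1}
INCOME = {"High": 3, "Medium": 2, "Low": 1}
STUDENT = {"Yes": 2, "No": 1}
CREDIT = {"Excellent": 2, "Fair": 1}

def encode(age, income, student, credit):
    """Turn one record's four selected attributes into numeric codes."""
    return (AGE[age], INCOME[income], STUDENT[student], CREDIT[credit])

a1 = encode("31-40", "High", "No", "Fair")        # case A1
a12 = encode(">40", "Low", "Yes", "Excellent")    # case A12
print(a1, a12)
```

A record with a value outside the tables, such as the unknown income in case 15, raises a `KeyError`, which is one reason the noisy cases are dropped before this step.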

23
Data Mining

Categorize output
Buys = C1, Doesn't buy = C2
Conduct analysis
Model says A8 and A10 don't buy; the rest do
Of the actual yes, 7 correct and 1 not
Of the actual no, 2 correct
Confusion matrix

24
Data Interpretation and Test Data Set

Test on independent data

Case  Actual  Model
B1    Yes     Yes    (1)
B2    Yes     Yes    (2)
B3    Yes     Yes    (3)
B4    Yes     Yes    (4)
B5    Yes     Yes    (5)
B6    Yes     Yes    (6)
B7    Yes     Yes    (7)
B8    No      No     (do not)
B9    No      Yes
B10   No      No     (do not)
25
Confusion Matrix
            Model Buy  Model Not  Totals
Actual Buy  7          0          7
Actual Not  1          2          3
Totals      8          2          10
26
Measures

Correct classification rate
9/10 = 0.90
Cost function
cost of error
model says buy, actual no: 20
model says no, actual buy: 200
1 x 20 + 0 x 200 = 20
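
Both measures can be recomputed from the test-set confusion matrix (7, 0, 1, 2). The following is a small sketch; the dictionary layout and the constant names are illustrative choices, while the counts and the two error costs come from the slides.

```python
# Keys are (actual, model) pairs; counts come from the confusion matrix.
confusion = {
    ("buy", "buy"): 7,
    ("buy", "not"): 0,
    ("not", "buy"): 1,
    ("not", "not"): 2,
}

total = sum(confusion.values())
correct = confusion[("buy", "buy")] + confusion[("not", "not")]
accuracy = correct / total

# Error costs from the slide (units are not stated there).
COST_FALSE_BUY = 20     # model says buy, actual no
COST_MISSED_BUY = 200   # model says no, actual buy

total_cost = (confusion[("not", "buy")] * COST_FALSE_BUY
              + confusion[("buy", "not")] * COST_MISSED_BUY)
print(f"correct classification rate = {accuracy:.2f}, cost = {total_cost}")
```

With unequal costs, a model with a slightly lower classification rate can still be preferable if it avoids the expensive kind of error.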

27
Goals

Avoid broad concepts
Gain insight, discover meaningful patterns, learn
interesting things
Can't measure attainment
Narrow and specify
Identify customers likely to renew; reduce churn
Rank order by propensity (favor) to

28
Goals

Description: what is
understand
explain
discover knowledge
Prescription: what should be done
classify
predict

29
Goal

Method A
four rules, explains 70%
Method B
fifty rules, explains 72%
BEST?
Gain understanding: Method A better
minimum description length (MDL)
Reduce cost of mailing: Method B better

30
Measurement

Accuracy
How well does model describe observed data?
Confidence levels
a proportion of the time between lower and upper
limits
Comprehensibility
Whole or parts?

31
Measuring Predictive

Classification and prediction
error rate = incorrect/total
requires evaluation set be representative
Estimators
predicted - actual (MAD, MSE, MAPE)
variance = sum((predicted - actual)^2)
standard deviation = square root of variance
distance - how far off
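
The estimator measures named on this slide can be sketched as follows. The definitions used are the conventional ones (each averaged over the number of observations), which is an assumption since the slide gives only the abbreviations; the actual and predicted values are invented for the example.

```python
import math

def error_measures(actual, predicted):
    """Compute MAD, MSE, RMSE, and MAPE for paired observations.

    Assumption: every actual value is nonzero, otherwise MAPE is undefined.
    """
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n              # mean absolute deviation
    mse = sum(e * e for e in errors) / n               # mean squared error
    rmse = math.sqrt(mse)                              # root of MSE
    mape = 100 * sum(abs(e) / abs(a)
                     for a, e in zip(actual, errors)) / n  # percent error
    return {"MAD": mad, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Hypothetical actual and predicted values.
actual = [100, 200, 400]
predicted = [110, 190, 380]
measures = error_measures(actual, predicted)
print(measures)
```

MSE penalizes large misses more heavily than MAD, while MAPE expresses the error relative to the size of the actual value.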

32
Statistics

Population - entire group studied
Sample - subset from population
Bias - difference between sample average and
population average
mean, median, mode
distribution
significance
correlation, regression (hamming distance)

33
Classification Models

LIFT = probability in class by sample divided by
probability in class by population
if population probability is 20% and
sample probability is 30%,
LIFT = 0.3/0.2 = 1.5
Best lift is not necessarily best; need sufficient
sample size as confidence increases.
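
The lift calculation is a single ratio. The sketch below reproduces the slide's 30% versus 20% example and also derives lift from raw counts; the function names and the counts are hypothetical.

```python
def lift(sample_probability, population_probability):
    """Lift = class probability in the sample / class probability overall."""
    return sample_probability / population_probability

def lift_from_counts(sample_hits, sample_size, population_hits, population_size):
    """Same ratio, starting from counts instead of probabilities."""
    return lift(sample_hits / sample_size, population_hits / population_size)

slide_example = lift(0.30, 0.20)
# Hypothetical counts: 30 of 100 selected customers respond,
# against 200 of 1,000 customers overall.
from_counts = lift_from_counts(30, 100, 200, 1000)
print(round(slide_example, 2), round(from_counts, 2))
```

A very high lift measured on a handful of cases is unreliable, which is the sample-size caution in the last bullet.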

34
Lift Chart
35
Measuring Impact

Ideal - (NPV) because of expenditure
Mass mailing may be better
Depends on
fixed cost
cost per recipient
cost per respondent
value of positive response
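
One way to see why mass mailing may be better is to combine the four factors into a simple profit figure. The formula below is an assumed, simplified model (it ignores discounting, so it is not an NPV), and every number is invented for illustration.

```python
def campaign_profit(recipients, respondents, fixed_cost,
                    cost_per_recipient, cost_per_respondent,
                    value_per_response):
    """Profit = value of responses minus fixed and per-unit costs.

    Assumption: a single-period model with no discounting.
    """
    revenue = respondents * value_per_response
    cost = (fixed_cost
            + recipients * cost_per_recipient
            + respondents * cost_per_respondent)
    return revenue - cost

# Hypothetical mass mailing: everyone is contacted, 2% respond.
mass = campaign_profit(100_000, 2_000, 5_000, 1, 5, 100)

# Hypothetical targeted mailing: a model picks 20,000 people, 6% respond,
# but building the model adds 10,000 to the fixed cost.
targeted = campaign_profit(20_000, 1_200, 15_000, 1, 5, 100)

print("mass:", mass, "targeted:", targeted)
```

In this made-up case the mass mailing earns more despite its lower response rate, because the cost per recipient is small relative to the value of a response; with a higher cost per recipient the ranking can flip.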

36
Bottom Line

Return on investment

37
Example Application

Telephone industry
Problem: unpaid bills
Data mining used to develop models to predict
nonpayment as early as possible

See page 27
38
Knowledge Discovery Process
1 Data Selection: learning the application domain; creating target data set
2 Data Preprocessing: data cleaning and preprocessing
3 Data Transformation: data reduction and projection
4 Data Mining: choosing function; choosing algorithms; data mining
5 Data Interpretation: interpretation; using discovered knowledge
39
1 Business Understanding

Predict which customers would be insolvent
In time for firm to take preventive measures (and
avert losing good customers)
Hypothesis
Insolvent customers would change calling habits
and phone usage during a critical period before
and immediately after termination of billing period

40
2 Data Understanding

Static customer information available in files
Bills, payments, usage
Used data warehouse to gather and organize data
Coded to protect customer privacy

41
Creating Target Data Set

Customer files
Customer information
Disconnects
Reconnections
Time-dependent data
Bills
Payments
Usage
100,000 customers over 17-month period
Stratified (hierarchical) sampling to assure all
groups appropriately represented

42
3 Data Preparation

Filtered out incomplete data
Deleted inexpensive calls
Reduced data volume by about 50%
Low number of fraudulent cases
Cross-checked with phone disconnects
Lagged data made synchronization necessary

43
Data Reduction and Projection

Information grouped by account
Customer data aggregated by 2-week periods
Discriminant analysis on 23 categories
Calculated average owed by category (significant)
Identified extra charges (significant)
Investigated payment by installments (not
significant)

44
Choosing Data Mining Function

Classes
Most possibly solvent (99.3%)
Most possibly insolvent (0.7%)
Costs of error widely different
New data set created through stratified sampling
Retained all insolvent
Altered distribution to 90% solvent
Used 2,066 cases total
Critical period identified
Last 15 two-week periods before service
interruption
Variables defined by counting measures in
two-week periods
46 variables as candidate discriminant factors
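
The rebalancing step (retain every insolvent case, then sample solvent cases until they form 90% of the new set) can be sketched as below. The function, the seed, and the customer identifiers are hypothetical, and the 10,000-customer population is chosen only to mirror the 99.3% / 0.7% class split; it is not the study's data.

```python
import random

def rebalance(solvent, insolvent, solvent_share=0.90, seed=0):
    """Keep all insolvent cases and draw solvent cases to reach solvent_share."""
    rng = random.Random(seed)
    # Solve n_solvent / (n_solvent + n_insolvent) = solvent_share.
    n_solvent = round(len(insolvent) * solvent_share / (1 - solvent_share))
    n_solvent = min(n_solvent, len(solvent))   # cannot draw more than exist
    return rng.sample(solvent, n_solvent) + list(insolvent)

# Hypothetical population of 10,000 with 0.7% insolvent customers.
insolvent = [f"I{i}" for i in range(70)]
solvent = [f"S{i}" for i in range(9_930)]

sample = rebalance(solvent, insolvent)
share = sum(1 for c in sample if c.startswith("S")) / len(sample)
print(len(sample), "cases,", f"{share:.0%} solvent")
```

Because the rare class is oversampled relative to reality, predicted probabilities from a model trained on such a set generally need to be interpreted against the original 99.3% / 0.7% distribution.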

45
4 Modeling

Discriminant Analysis
Linear model
SPSS stepwise forward selection
Decision Trees
Rule-based classifier, C5, C4.5
Neural Networks
Nonlinear model

46
Data Mining

Training set: about 2/3
Rest: test
Discriminant analysis
Used 17 variables
Equal costs: 0.875 correct
Unequal costs: 0.930 correct
Rule-based: 0.952 correct
Neural network: 0.929 correct

47
5 Evaluation

1st objective to maximize accuracy of predicting
insolvent customers
Decision tree classifier best
2nd objective to minimize error rate for solvent
customers
Neural network model close to Decision tree
Used all 3 on case-by-case basis

48
Coincidence Matrix: Combined Models
                  Model insolvent  Model solvent  Unclass  Totals
Actual insolvent  19               17             28       64
Actual solvent    1                626            27       654
Totals            20               643            55       718
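
A short sketch can recompute the margins of the coincidence matrix from its six cells and derive a few rates. The cell values are taken from the table; the summary rates chosen here (accuracy among classified cases, share of insolvent customers flagged) are illustrative measures, not figures reported in the slides. From the cells, the unclassified column sums to 28 + 27 = 55.

```python
# Rows are actual classes; columns are the combined models' output.
cells = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclass": 28},
    "solvent":   {"insolvent": 1,  "solvent": 626, "unclass": 27},
}

row_totals = {actual: sum(row.values()) for actual, row in cells.items()}
col_totals = {
    col: sum(cells[actual][col] for actual in cells)
    for col in ("insolvent", "solvent", "unclass")
}
grand_total = sum(row_totals.values())

classified = grand_total - col_totals["unclass"]
correct = cells["insolvent"]["insolvent"] + cells["solvent"]["solvent"]
accuracy_on_classified = correct / classified
insolvent_flagged = cells["insolvent"]["insolvent"] / row_totals["insolvent"]

print(row_totals, col_totals, grand_total)
print(f"accuracy on classified cases: {accuracy_on_classified:.3f}")
print(f"share of insolvent customers flagged: {insolvent_flagged:.3f}")
```
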
49
6 Implementation