Title: Chapter 3 Data Mining Concepts: Data Preparation, Model Evaluation
1. Chapter 3: Data Mining Concepts: Data Preparation, Model Evaluation
Credits:
- Padhraic Smyth's notes
- the Cook and Swayne book
2. Data Mining Tasks
- EDA/Exploration
  - visualization
- Predictive Modelling
  - the goal is to predict an answer
  - see how independent variables affect an outcome
  - predict that outcome for new cases
- Inference
  - will this drug fight that disease?
- Descriptive Modelling
  - there is no specific outcome of interest
  - describe the data in qualitative ways
  - simple crosstabs of categorical data
  - clustering, density estimation, segmentation
  - includes pattern identification
    - frequent itemsets: beer and diapers?
    - anomaly detection: network intrusion
    - sports rules: when X is in the game, Y scores 30% more
3. Data Mining Tasks (cont.)
- Retrieval by content
  - the user supplies a pattern; from a large dataset, the system retrieves the most relevant answers
  - web search
  - image retrieval
4. Data Preparation
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - quality decisions must be based on quality data
  - a data warehouse needs consistent integration of quality data
  - assessment of quality reflects on confidence in results
5. Preparing Data for Analysis
- Think about your data
  - how is it measured? what does it mean?
- nominal or categorical
  - jersey numbers, ids, colors, simple labels
  - sometimes recoded into integers; careful!
- ordinal
  - rank has meaning; the numeric value not necessarily
  - educational attainment, military rank
- interval valued
  - distances between numeric values have meaning
  - temperature, time
- ratio
  - the zero value has meaning, which means that fractions and ratios are sensible
  - money, age, height
- It might seem obvious what a given data value is, but not always: pain index, movie ratings, etc.
6. Investigate your data carefully!
- Example: lapsed donors to a charity (KDD Cup 1998)
  - made their last donation to PVA 13 to 24 months prior to June 1997
  - 200,000 records (training and test sets)
- Who should get the current mailing?
- What is the cost-effective strategy?
- tcode was an important variable
11. Mixed data
- Many real-world data sets have multiple types of variables, e.g., medical diagnosis data on patients and controls
  - Categorical (Nominal): employment type, ethnic group
  - Ordinal: education level
  - Interval: body temperature, pixel values in a medical image
  - Ratio: income, age, response to drug
- Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)
  - linear regression, neural networks, support vector machines, etc.
  - these models implicitly assume interval-scale data
- Exception: decision trees
  - trees operate by subgrouping variable values at internal nodes
  - can operate effectively on binary, nominal, ordinal, and interval data
12. Tasks in Data Preprocessing
- Data cleaning
  - check for data quality
  - missing data
- Data transformation
  - normalization and aggregation
- Data reduction
  - obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization
  - a combination of reduction and transformation, of particular importance for numerical data
13. Data Cleaning / Quality
- Individual measurements
  - random noise in individual measurements
  - outliers
  - random data entry errors
  - noise in label assignment (e.g., class labels in medical data sets)
  - can be corrected or smoothed out
- Systematic errors
  - e.g., all ages > 99 recorded as 99
  - more individuals aged 20, 30, 40, etc. than expected
- Missing information
  - missing at random
    - questions on a questionnaire that people randomly forget to fill in
  - missing systematically
    - questions that people don't want to answer
    - patients who are too ill for a certain test
14. Missing Data
- Data is not always available
  - e.g., many records have no recorded value for several attributes
  - survey respondents
  - disparate sources of data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - failure to register history or changes of the data
15. How to Handle Missing Data?
- Ignore the tuple: not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious, and perhaps infeasible
- Use a global constant to fill in the missing value, e.g., "unknown": a new class?!
- Use the attribute mean to fill in the missing value
- Use imputation
  - nearest neighbor
  - model based (regression or Bayesian Monte Carlo based)
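Two of the fill-in strategies above can be sketched in a few lines. This is my own illustration, not code from the chapter; the column layout (a list of values, or a list of dicts with hypothetical keys "x" and "y") is chosen only for the example.

```python
# Sketch: mean imputation and 1-nearest-neighbor imputation for a column
# whose missing entries are marked as None.
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def nn_impute(rows, target_col, feature_col):
    """Copy the target value from the complete row whose feature value
    is closest (1-nearest-neighbor imputation on a single feature)."""
    complete = [r for r in rows if r[target_col] is not None]
    filled = []
    for r in rows:
        if r[target_col] is None:
            nearest = min(complete, key=lambda c: abs(c[feature_col] - r[feature_col]))
            r = dict(r, **{target_col: nearest[target_col]})
        filled.append(r)
    return filled

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Mean imputation preserves the attribute mean but shrinks its variance; nearest-neighbor imputation keeps values realistic but depends on having an informative companion feature.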
16. Missing Data
- What do I choose for a given situation?
- What you do depends on
  - the data: how much is missing? are the missing values important?
  - the model: can it handle missing values?
- There is no right answer!
17. Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values (outliers) may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
18. How to Handle Noisy Data?
- Is the suspected outlier from
  - human error? or
  - real data?
- Err on the side of caution: if unsure, use methods that are robust to outliers
19. Data Transformation
- Can help reduce the influence of extreme values
- Putting data on the real line is often convenient
- Variance reduction
  - the log transform is often used for incomes and other highly skewed variables
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - new attributes constructed from the given ones
20. Data Transformation: Standardization
We can standardize data by dividing by the sample standard deviation; this makes all variables equally important. The estimate for the standard deviation of variable x_k is

    sigma_k = sqrt( (1/n) * sum_i (x_ik - xbar_k)^2 )

where xbar_k is the sample mean of x_k.
(When might standardization not be such a good idea?)
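The two most common normalizations from the previous slide can be sketched as follows. This is my own illustration; it uses the population (1/n) standard deviation to match the formula above.

```python
# Sketch: min-max and z-score normalization of a single column.
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Center to mean 0 and scale to standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

print(min_max([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

One answer to the slide's closing question: standardization is a poor idea when the raw scale itself is meaningful (e.g., money), or when a variable's large variance is exactly the signal you want a method like PCA to find.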
21. Dealing with massive data
- What if the data simply does not fit on my computer (or R crashes)?
- Force the data into main memory
  - be careful: you need some overhead for modelling!
  - use scalable algorithms
  - keep up on the literature; this keeps changing!
- Use a database
  - MySQL is a good (and free) one
- Investigate data reduction strategies
22. Data Reduction Strategies
- A warehouse may store terabytes of data; complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results
- Goal: reduce the dimensionality of the data from p to p' < p
23. Data Reduction: Dimension Reduction
- In general, incurs loss of information about x
- If the dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
- e.g., clustering example
  - 100-dimensional data
  - if cluster structure is only present in 2 of the dimensions, the others are just noise
  - if the other 98 dimensions are just noise (relative to the cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
- Dimension reduction can also provide interpretation/insight
  - e.g., for 2d visualization purposes
24. Data Reduction: Dimension Reduction
- Feature selection (i.e., attribute subset selection)
  - select a minimum set of features such that minimal signal is lost
- Heuristic methods (exhaustive search is implausible except for small p)
  - stepwise forward selection
  - stepwise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction
- Can work well, but often gets trapped in local minima, and is often computationally expensive
25. Data Reduction: Principal Components
- One of several projection methods
- Given N data vectors in k dimensions, find c < k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
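The construction above can be sketched directly from the covariance matrix. A minimal illustration (mine, not the chapter's), using eigendecomposition; real implementations usually use the SVD for numerical stability.

```python
# Sketch: PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, c):
    """Project the N x k data matrix X onto its top-c principal components."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:c]    # top-c directions by variance
    components = eigvecs[:, order]           # k x c orthogonal vectors
    return Xc @ components                   # N x c scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
scores = pca(X, 2)
print(scores.shape)  # (100, 2)
```

The first column of `scores` is the highest-variance projection (the "1st principal component" direction pictured on the next slides), the second is the highest-variance direction orthogonal to it.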
26. PCA Example
[figure: scatterplot of x1 vs. x2 showing the direction of the 1st principal component vector (highest-variance projection)]
27. PCA Example
[figure: the same scatterplot, adding the direction of the 2nd principal component vector]
28. Data Reduction: Multidimensional Scaling
- Say we have data in the form of an N x N matrix of dissimilarities
  - 0s on the diagonal
  - symmetric
- Could either be given data in this form, or create such a dissimilarity matrix from our data vectors
- Examples
  - perceptual dissimilarity of N objects in cognitive science experiments
  - string-edit distance between N protein sequences
- MDS: find k-dimensional coordinates for each of the N objects such that Euclidean distances in the embedded space match the set of dissimilarities as closely as possible
29. Multidimensional Scaling (MDS)
- MDS score function ("stress"):

    stress = sum_{i<j} ( delta_ij - d_ij )^2

  where delta_ij is the original dissimilarity between objects i and j, and d_ij is the Euclidean distance between their coordinates in the embedded k-dimensional space
- Often used for visualization, e.g., k = 2, 3
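One concrete way to find such coordinates is classical (metric) MDS, which double-centers the squared dissimilarities and takes the top eigenvectors. This is my own sketch of that standard construction, not the optimization of the stress function above, though for Euclidean dissimilarities it reproduces them exactly.

```python
# Sketch: classical (metric) MDS via double-centering.
import numpy as np

def classical_mds(D, k=2):
    """Embed an N x N symmetric dissimilarity matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # top-k eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0))   # guard against tiny negatives
    return eigvecs[:, order] * L                 # N x k coordinates

# Dissimilarities among three collinear points at positions 0, 3, 7:
D = np.array([[0., 3., 7.],
              [3., 0., 4.],
              [7., 4., 0.]])
X = classical_mds(D, k=1)
# The embedded distances reproduce D (up to sign and shift of the axis).
print(abs(X[0, 0] - X[2, 0]))
```

For non-Euclidean dissimilarities (e.g., string-edit distances), the match is only approximate and iterative stress minimization can do better.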
30. MDS: Example data
[figure: an example dissimilarity matrix]
31. MDS: 2d embedding of face images
[figure: face images placed at their 2d MDS coordinates]
32. Data Reduction: Sampling
- Don't forget about sampling!
- Choose a representative subset of the data
  - simple random sampling may be OK, but beware of skewed variables
- Stratified sampling methods
  - approximate the percentage of each class (or subpopulation of interest) in the overall database
  - used in conjunction with skewed data
- Propensity scores may be useful if the response is unbalanced
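Stratified sampling as described above can be sketched in a few lines. My own illustration: sample the same fraction from each class so the class proportions of the database are preserved, with `class_of` a caller-supplied labeling function.

```python
# Sketch: stratified sampling that preserves class proportions.
import random

def stratified_sample(rows, class_of, frac, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(class_of(r), []).append(r)
    sample = []
    for members in by_class.values():
        k = max(1, round(frac * len(members)))   # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample

# 1% responders, as in the mailing-campaign example later in the chapter:
rows = [("respond", i) for i in range(10)] + [("no", i) for i in range(990)]
s = stratified_sample(rows, class_of=lambda r: r[0], frac=0.1)
print(sum(1 for r in s if r[0] == "respond"), len(s))  # 1 100
```

A simple random 10% sample of these 1000 rows could easily contain zero responders; the stratified version guarantees the rare class survives.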
33. Data Discretization
- Discretization: divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis
- Reduces the information in the data, but sometimes works surprisingly well! (Bagnall and Janacek, KDD 2004)
34. Model Evaluation
35. Model Evaluation
- e.g., pick the best a and b in Y = aX + b
  - how do you define "best"?
- A big difference between data mining and statistics is the focus on predictive performance over finding the best estimators
- The most obvious criterion for predictive performance: squared error, sum_i (y_i - yhat_i)^2
36. Predictive Model Scores
- More generally, score(model) = sum_i L(y_i, yhat_i) for some per-observation loss L
- Assumes all observations are equally important
- Assumes errors are treated equally
  - what if recent cases are more important, or high-revenue ones, etc.?
- Depends on differences rather than values
  - scale matters with squared error
37. Descriptive Model Scores
- If your model assigns probabilities to classes: likelihood based. Pick the model that assigns the highest probability to what actually happened
- Many different scoring rules exist for non-probabilistic models
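Likelihood-based scoring for a binary classifier can be sketched as a sum of log-probabilities (my own illustration of the idea above; logs are used so the product of probabilities becomes a sum).

```python
# Sketch: log-likelihood score for probabilistic binary predictions.
import math

def log_likelihood(predicted_probs, outcomes):
    """predicted_probs[i] is the model's probability that case i is class 1;
    the score sums the log-probability assigned to what actually happened."""
    ll = 0.0
    for p, y in zip(predicted_probs, outcomes):
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll

sharp = log_likelihood([0.9, 0.8, 0.1], [1, 1, 0])   # confident and right
vague = log_likelihood([0.5, 0.5, 0.5], [1, 1, 0])   # maximally uncertain
print(sharp > vague)  # True: the sharper model scores higher
```

Note the asymmetry this scoring creates: a confident wrong prediction (p near 0 for an actual 1) is punished with a very large negative log term.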
38. Scoring Models with Different Complexities
- Complex models can fit the data perfectly!
- score(model) = goodness-of-fit - penalty for complexity
- This is the classic bias/variance tradeoff; the penalty is called regularization and is used to combat overfitting
39. Using Hold-Out Data
- Instead of penalizing complexity, look at performance on hold-out data
- Using the same set of examples for training as well as for evaluation results in an over-optimistic evaluation of model performance
- Need to test performance on data not seen by the modeling algorithm, i.e., data that was not used for model building
40. Data Partition
- Randomly partition the data into a training set and a test set
- Training set: data used to train/build the model
  - estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
- Test set: a set of examples not used for model induction. The model's performance is evaluated on unseen data. Also referred to as out-of-sample data.
- Generalization error: the model's error on the test data
[figure: the data split into a set of training examples and a set of test examples]
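The random partition can be sketched in a few lines (my own illustration; the one-third test fraction anticipates the holdout convention two slides ahead).

```python
# Sketch: random train/test partition of a dataset.
import random

def train_test_split(rows, test_frac=1/3, seed=0):
    rows = rows[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    return rows[n_test:], rows[:n_test]   # (training set, test set)

train, test = train_test_split(list(range(30)))
print(len(train), len(test))  # 20 10
```

Fixing the seed makes the split reproducible; the two pieces are disjoint and together cover the data.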
41. [figure: build the model on the set of training examples, then use the model to predict the outcome on the set of test examples]
42. Holding out data
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - usually one third for testing, the rest for training
- For unbalanced datasets, random samples might not be representative
  - few or no instances of some classes
- Stratified sample
  - make sure that each class is represented with approximately equal proportions in both subsets
43. Evaluation on small data
- What if we have a small data set?
- The chosen 2/3 for training may not be representative
- The chosen 1/3 for testing may not be representative
44. Repeated holdout method
- The holdout estimate can be made more reliable by repeating the process with different subsamples
  - in each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  - the error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method
45. Cross-validation
- The most popular and effective type of repeated holdout is cross-validation
- Cross-validation avoids overlapping test sets
  - first step: the data is split into k subsets of equal size
  - second step: each subset in turn is used for testing and the remainder for training
- This is called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
- The error estimates are averaged to yield an overall error estimate
46. Cross-validation example
- Break up the data into k groups of the same size
- Hold aside one group for testing and use the rest to build the model
- Repeat, holding out each group in turn
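The k-fold procedure can be sketched as a short loop. My own illustration: `fit` and `error` are hypothetical caller-supplied functions (train a model on a list of rows; score a model on a held-out list), and the toy predict-the-mean model exists only to exercise the code.

```python
# Sketch: k-fold cross-validation returning the averaged error estimate.
def k_fold_cv(rows, k, fit, error):
    folds = [rows[i::k] for i in range(k)]   # k disjoint subsets covering the data
    errs = []
    for i in range(k):
        test = folds[i]                      # each subset is the test set in turn
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        errs.append(error(model, test))
    return sum(errs) / k                     # averaged error estimate

# Toy model: predict the training mean; error = mean squared error on the fold.
fit = lambda train: sum(train) / len(train)
mse = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)
print(k_fold_cv([1.0, 2.0, 3.0, 4.0], 2, fit, mse))  # 2.0
```

Setting `k = len(rows)` turns this same loop into the Leave-One-Out procedure described two slides ahead; stratified variants build the folds per class instead of by simple striding.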
47. More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
- Stratification reduces the estimate's variance
- Even better: repeated stratified cross-validation
  - e.g., ten-fold cross-validation is repeated ten times and the results are averaged (further reduces the variance)
48. Leave-One-Out cross-validation
- Leave-One-Out: a particular form of cross-validation
  - set the number of folds to the number of training instances
  - i.e., for n training instances, build the classifier n times
- Makes the best use of the data
- Involves no random subsampling
- Computationally expensive, but good performance
49. Leave-One-Out CV and stratification
- Disadvantage of Leave-One-Out CV: stratification is not possible
  - it guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a random dataset split equally into two classes
  - the best model predicts the majority class
  - 50% accuracy on fresh data
  - but the Leave-One-Out CV estimate is 100% error!
50. Model Performance Evaluation
- Classification models predict what class an observation belongs to
  - e.g., good vs. bad credit risk (Credit), response vs. no response to a direct marketing campaign, etc.
- Classification accuracy rate
  - the proportion of accurate classifications of examples in the test set; e.g., the model predicts the correct class for 70% of test examples
51. Classification Accuracy Rate
Classification accuracy rate = S / N, the proportion of examples accurately classified by the model, where S is the number of examples accurately classified by the model and N is the total number of examples.

Single | No. of cards | Age | Income>50K | Output (Good/Bad risk) | Model's prediction | Correct?
0      | 1            | 28  | 1          | 1                      | 1                  | correct
1      | 2            | 56  | 0          | 0                      | 0                  | correct
0      | 5            | 61  | 1          | 0                      | 1                  | incorrect
0      | 1            | 28  | 1          | 1                      | 1                  | correct
52. Consider the following
- The response rate for a mailing campaign is 1%
- We build a classification model to predict whether or not a customer will respond
- The model's classification accuracy rate is 99%
- Is the model good?
  - 99% do not respond, 1% respond
53. Confusion Matrix for Classification
The model's performance over the test set:

                        Actual class
Predicted class     Respond   Do not respond
Respond                  0          0
Do not respond          10        990

The diagonal (top-left to bottom-right) holds the cases where predicted class = actual class: 990/1000 (99%) accurate predictions, yet the model never predicts "Respond".
54. Evaluation for a Continuous Response
- In a logistic regression example, we predict probabilities that the response = 1
- Classification accuracy depends on the threshold
55. Predicted probabilities
Suppose we use a cutoff of 0.5 on the predicted probabilities (test data):

                      actual outcome
predicted outcome       1     0
1                       8     3
0                       0     9
56. More generally

                      actual outcome
predicted outcome       1     0
1                       a     b
0                       c     d

- misclassification rate = (b + c) / (a + b + c + d)
- recall or sensitivity = a / (a + c): how many of those that are really positive did you predict?
- specificity = d / (b + d)
- precision = a / (a + b): how many of those predicted positive really are?
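The formulas above can be sketched directly from the four cell counts (my own illustration, using the same a/b/c/d layout: rows are predicted, columns are actual).

```python
# Sketch: classification metrics from a 2x2 confusion matrix.
# a = predicted 1 & actual 1, b = predicted 1 & actual 0,
# c = predicted 0 & actual 1, d = predicted 0 & actual 0.
def metrics(a, b, c, d):
    n = a + b + c + d
    return {
        "misclassification": (b + c) / n,
        "sensitivity":       a / (a + c),   # recall: actual positives caught
        "specificity":       d / (b + d),   # actual negatives caught
        "precision":         a / (a + b),   # predicted positives that are right
    }

m = metrics(8, 3, 0, 9)   # the cutoff-0.5 table from the previous slide
print(m["sensitivity"], m["specificity"])  # 1.0 0.75
```

Running it on the mailing-campaign matrix (a=0, b=0, c=10, d=990) would divide by zero for precision: with no predicted positives, precision is simply undefined, another way to see that accuracy alone hid the problem.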
57. Suppose we use a cutoff of 0.5

                      actual outcome
predicted outcome       1     0
1                       8     3
0                       0     9

- sensitivity = 8/8 = 100%
- specificity = 9/(3+9) = 75%
We want both of these to be high.
58. Suppose we use a cutoff of 0.8

                      actual outcome
predicted outcome       1     0
1                       6     2
0                       2    10

- sensitivity = 6/8 = 75%
- specificity = 10/(2+10) = 83%
59. ROC
- Note: with these 20 test observations there are 20 possible thresholds
- ROC computes sensitivity and specificity for all possible thresholds and plots them
- Note: if threshold = minimum, then c = d = 0, so sensitivity = 1 and specificity = 0
- If threshold = maximum, then a = b = 0, so sensitivity = 0 and specificity = 1
60. Area under the curve (AUC) is a common measure of predictive performance
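The threshold sweep described on the previous slide can be sketched end-to-end: compute (1 - specificity, sensitivity) at every threshold, then integrate with the trapezoid rule. My own illustration; library routines do this more efficiently.

```python
# Sketch: ROC points over all thresholds, and AUC via the trapezoid rule.
def roc_points(scores, labels):
    pts = [(1.0, 1.0)]                 # threshold below the minimum: all positive
    for t in sorted(set(scores)):
        a = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        b = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        c = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        d = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        pts.append((b / (b + d), a / (a + c)))   # (1 - specificity, sensitivity)
    pts.append((0.0, 0.0))             # threshold above the maximum: all negative
    return sorted(pts)

def auc(pts):
    """Trapezoid-rule area under the piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   1,   0,   0,   0]            # perfectly separated classes
print(round(auc(roc_points(scores, labels)), 3))  # 1.0
```

Perfectly separated scores give the ideal curve (AUC = 1); shuffling the labels toward independence from the scores pushes the curve toward the diagonal (AUC = 0.5).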
61. Another Look at ROC Curves
[figure: distributions of the test result for patients without the disease and patients with the disease]
62. Threshold
[figure: a threshold on the test result splits the two distributions]
63. Some definitions...
[figure: true positives are the patients with the disease whose test result is above the threshold]
64. False Positives
[figure: false positives are the patients without the disease whose test result is above the threshold]
65. True negatives
[figure: true negatives are the patients without the disease whose test result is below the threshold]
66. False negatives
[figure: false negatives are the patients with the disease whose test result is below the threshold]
67. Moving the Threshold: right
[figure: the threshold moved right along the test result axis]
68. Moving the Threshold: left
[figure: the threshold moved left along the test result axis]
69. ROC curve
[figure: the resulting ROC curve]
70. ROC curve comparison
[figure: a poor test vs. a good test]
71. ROC curve extremes
[figure: best test, the distributions don't overlap at all; worst test, the distributions overlap completely]
72. ROC Curve
- For a given threshold on f(x), you get a point on the ROC curve
- x-axis: 1 - negative class success rate (false alarm rate, 1 - specificity), from 0 to 100%
- y-axis: positive class success rate (hit rate, sensitivity), from 0 to 100%
- The ideal ROC curve hugs the top-left corner (AUC = 1); the random ROC is the diagonal (AUC = 0.5); the actual ROC lies in between: 0 ≤ AUC ≤ 1
73. AUC for ROC curves
[figure: four example ROC curves with AUC = 100%, 90%, 65%, and 50%]
74. Interpretation of AUC
- AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen non-diseased individual: P(Xi ≥ Xj | Di = 1, Dj = 0)
- So we can think of this as a nonparametric distance between disease/non-disease test results
- Equivalent to the Mann-Whitney U-statistic (a nonparametric test of the difference in location between two populations)
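This probabilistic interpretation gives a second, rank-based way to compute AUC, sketched below (my own illustration): count the fraction of (positive, negative) pairs where the positive case scores higher, counting ties as half.

```python
# Sketch: AUC as P(positive score > negative score), the Mann-Whitney form.
def auc_mann_whitney(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_mann_whitney([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0]))  # 1.0
print(auc_mann_whitney([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]))                  # 0.75
```

With no ties, this pairwise count agrees exactly with the trapezoid area under the ROC curve, which is the equivalence to the Mann-Whitney U-statistic the slide refers to.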
75. Lab 3
- Missing data: using visualization to see if the data is missing at random, by plotting missings in the margins
- Multidimensional scaling