1
Chapter 3
Data Mining Concepts: Data Preparation, Model Evaluation
  • Credits
  • Padhraic Smyth notes
  • Cook and Swayne book

2
Data Mining Tasks
  • EDA/Exploration
  • visualization
  • Predictive Modelling
  • goal is to predict an answer
  • see how independent variables affect an outcome
  • predict that outcome for new cases
  • Inference
  • will this drug fight that disease?
  • Descriptive Modelling
  • there is no specific outcome of interest
  • describe the data in qualitative ways
  • simple crosstabs of categorical data
  • clustering, density estimation, segmentation
  • includes pattern identification
  • frequent itemsets - beer and diapers?
  • anomaly detection - network intrusion
  • sports rules - when X is in the game, Y scores
    30% more

3
Data Mining Tasks (cont.)
  • Retrieval by content
  • the user supplies a pattern; the system retrieves
    the most relevant items from a large dataset
  • web search
  • image retrieval

4
Data Preparation
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking
    certain attributes of interest, or containing
    only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes
    or names
  • No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • Data warehouse needs consistent integration of
    quality data
  • Assessment of quality reflects on confidence in
    results

5
Preparing Data for Analysis
  • Think about your data
  • how is it measured, what does it mean?
  • nominal or categorical
  • jersey numbers, ids, colors, simple labels
  • sometimes recoded into integers - careful!
  • ordinal
  • rank has meaning - numeric value not necessarily
  • educational attainment, military rank
  • interval
  • distances between numeric values have meaning
  • temperature, time
  • ratio
  • zero value has meaning - means that fractions and
    ratios are sensible
  • money, age, height
  • It might seem obvious what a given data value is,
    but not always
  • pain index, movie ratings, etc.

6
Investigate your data carefully!
  • Example: lapsed donors to a charity (KDD Cup
    1998)
  • Made their last donation to PVA 13 to 24 months
    prior to June 1997
  • 200,000 records (training and test sets)
  • Who should get the current mailing?
  • What is the cost-effective strategy?
  • tcode was an important variable

7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
Mixed data
  • Many real-world data sets have multiple types of
    variables,
  • e.g., medical diagnosis data on patients and
    controls
  • Categorical (Nominal): employment type, ethnic
    group
  • Ordinal: education level
  • Interval: body temperature, pixel values in
    medical image
  • Ratio: income, age, response to drug
  • Unfortunately, many data analysis algorithms are
    suited to only one type of data (e.g., interval)
  • Linear regression, neural networks, support
    vector machines, etc.
  • These models implicitly assume interval-scale
    data
  • Exception: decision trees
  • Trees operate by subgrouping variable values at
    internal nodes
  • Can operate effectively on binary, nominal,
    ordinal, interval

12
Tasks in Data Preprocessing
  • Data cleaning
  • Check for data quality
  • Missing data
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains a reduced representation that is smaller
    in volume but produces the same or similar
    analytical results
  • Data discretization
  • Combination of reduction and transformation but
    with particular importance, especially for
    numerical data

13
Data Cleaning / Quality
  • Individual measurements
  • Random noise in individual measurements
  • Outliers
  • Random data entry errors
  • Noise in label assignment (e.g., class labels in
    medical data sets)
  • can be corrected or smoothed out
  • Systematic errors
  • E.g., all ages > 99 recorded as 99
  • More individuals aged 20, 30, 40, etc. than
    expected
  • Missing information
  • Missing at random
  • Questions on a questionnaire that people randomly
    forget to fill in
  • Missing systematically
  • Questions that people don't want to answer
  • Patients who are too ill for a certain test

14
Missing Data
  • Data is not always available
  • E.g., many records have no recorded value for
    several attributes,
  • survey respondents
  • disparate sources of data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry
  • history or changes of the data not recorded

15
How to Handle Missing Data?
  • Ignore the tuple: not effective when the
    percentage of missing values per attribute varies
    considerably
  • Fill in the missing value manually: tedious,
    often infeasible
  • Use a global constant to fill in the missing
    value, e.g., "unknown" (effectively a new class?!)
  • Use the attribute mean to fill in the missing
    value
  • Use imputation (see the sketch after this list)
  • nearest neighbor
  • model based (regression or Bayesian Monte Carlo
    based)
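
As an illustration, here is a minimal base-R sketch (not from the slides) of the two simplest strategies above; the data frame don and its columns are made up for the example.

don <- data.frame(age = c(28, 56, NA, 61, 40),
                  income = c(50, NA, 20, 90, NA))

colMeans(is.na(don))                       # fraction missing per attribute

# mean imputation for one numeric attribute
don$income[is.na(don$income)] <- mean(don$income, na.rm = TRUE)

# alternatively, drop incomplete rows (the "ignore the tuple" strategy)
don_complete <- don[complete.cases(don), ]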

16
Missing Data
  • What do I choose for a given situation?
  • What you do depends on
  • the data - how much is missing? are they
    important values?
  • the model - can it handle missing values?
  • there is no right answer!

17
Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values (outliers) may be due
    to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention
  • Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

18
How to Handle Noisy Data?
  • Is the suspected outlier from
  • human error? or
  • real data?
  • Err on the side of caution: if unsure, use
    methods that are robust to outliers

19
Data Transformation
  • Can help reduce influence of extreme values
  • Putting data on the real line is often convenient
  • Variance reduction
  • log-transform often used for incomes and other
    highly skewed variables.
  • Normalization: scale values to fall within a
    small, specified range (see the sketch after this
    list)
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
  • Attribute/feature construction
  • New attributes constructed from the given ones
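
As an illustration, a minimal base-R sketch (not from the slides) of the transformations above, applied to a made-up, highly skewed variable x.

set.seed(1)
x <- rexp(1000, rate = 1e-4)                   # e.g. incomes: highly skewed

x_minmax <- (x - min(x)) / (max(x) - min(x))   # min-max: maps onto [0, 1]
x_z      <- (x - mean(x)) / sd(x)              # z-score: mean 0, sd 1
x_dec    <- x / 10^ceiling(log10(max(abs(x)))) # decimal scaling: |x| < 1
x_log    <- log(x)                             # log-transform reduces skew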

20
Data Transformation Standardization
We can standardize data by dividing by the sample
standard deviation. This makes all variables equally
important.

The estimate for the standard deviation of the k-th
variable:

  σ̂_k = sqrt( (1/n) Σ_i ( x_ik − x̄_k )² )

where x̄_k is the sample mean of the k-th variable.

(When might standardization not be such a good idea?)
21
Dealing with massive data
  • What if the data simply does not fit on my
    computer (or R crashes)?
  • Force the data into main memory
  • be careful, you need some overhead for modelling!
  • use scalable algorithms
  • keep up on the literature, this keeps changing!
  • Use a database
  • MySQL is a good (and free) one
  • Investigate data reduction strategies

22
Data Reduction Strategies
  • A warehouse may store terabytes of data; complex
    data analysis/mining may take a very long time to
    run on the complete data set
  • Data reduction
  • Obtains a reduced representation of the data set
    that is much smaller in volume but yet produces
    the same (or almost the same) analytical results
  • Goal: reduce the dimensionality of the data from
    p variables to p' < p

23
Data Reduction Dimension Reduction
  • In general, incurs loss of information about x
  • If dimensionality p is very large (e.g., 1000s),
    representing the data in a lower-dimensional
    space may make learning more reliable,
  • e.g., clustering example
  • 100 dimensional data
  • if cluster structure is only present in 2 of the
    dimensions, the others are just noise
  • if other 98 dimensions are just noise (relative
    to cluster structure), then clusters will be much
    easier to discover if we just focus on the 2d
    space
  • Dimension reduction can also provide
    interpretation/insight
  • e.g. for 2d visualization purposes

24
Data Reduction Dimension Reduction
  • Feature selection (i.e., attribute subset
    selection)
  • Select a minimum set of features such that
    minimal signal is lost
  • Heuristic methods (exhaustive search implausible
    except for small p)
  • step-wise forward selection
  • step-wise backward elimination
  • combining forward selection and backward
    elimination
  • decision-tree induction
  • can work well, but often gets trapped in local
    minima
  • often computationally expensive

25
Data Reduction Principal Components
  • One of several projection methods
  • Given N data vectors in k dimensions, find c < k
    orthogonal vectors that can be best used to
    represent the data
  • The original data set is reduced to one
    consisting of N data vectors on c principal
    components (reduced dimensions)
  • Each data vector is a linear combination of the c
    principal component vectors
  • Works for numeric data only
  • Used when the number of dimensions is large
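
As an illustration, a minimal base-R sketch (not from the slides) of PCA with prcomp, keeping the first c = 2 principal components of a made-up 10-dimensional data set.

X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)  # 100 cases, 10 variables

pc <- prcomp(X, center = TRUE, scale. = TRUE)  # center/scale, then rotate
summary(pc)                                    # variance explained per component
X_reduced <- pc$x[, 1:2]                       # N x 2 matrix of component scores
plot(X_reduced, xlab = "PC1", ylab = "PC2")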

26
PCA Example
(Figure: scatterplot of x1 vs x2 showing the direction of
the 1st principal component vector, the highest variance
projection)
27
PCA Example
(Figure: the same scatterplot of x1 vs x2, now also showing
the direction of the 2nd principal component vector)
28
Data Reduction Multidimensional Scaling
  • Say we have data in the form of an N x N matrix
    of dissimilarities
  • 0s on the diagonal
  • Symmetric
  • Could either be given data in this form, or
    create such a dissimilarity matrix from our data
    vectors
  • Examples
  • Perceptual dissimilarity of N objects in
    cognitive science experiments
  • String-edit distance between N protein sequences
  • MDS
  • Find k-dimensional coordinates for each of the N
    objects such that Euclidean distances in the
    embedded space match the set of dissimilarities
    as closely as possible

29
Multidimensional Scaling (MDS)
  • MDS score function (stress), e.g. the sum of
    squared differences

      stress = Σ_{i<j} ( δ_ij − d_ij )²

    where δ_ij are the original dissimilarities and
    d_ij are the Euclidean distances in the embedded
    k-dimensional space
  • Often used for visualization, e.g., k = 2, 3
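
As an illustration, a minimal base-R sketch (not from the slides) of classical MDS with cmdscale; the dissimilarity matrix D is made up for the example (symmetric, zeros on the diagonal).

X <- matrix(rnorm(50 * 5), nrow = 50)
D <- dist(X)                       # Euclidean dissimilarities between rows

emb <- cmdscale(D, k = 2)          # N x 2 coordinates in the embedded space
plot(emb, xlab = "MDS 1", ylab = "MDS 2")

# isoMDS() in the MASS package minimizes a Kruskal-type stress instead
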
30
MDS Example data
31
MDS 2d embedding of face images
32
Data Reduction Sampling
  • Don't forget about sampling!
  • Choose a representative subset of the data
  • Simple random sampling may be OK, but beware of
    skewed variables
  • Stratified sampling methods (see the sketch after
    this list)
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
  • Propensity scores may be useful if the response is
    unbalanced
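
As an illustration, a minimal base-R sketch (not from the slides) contrasting a simple random sample with a stratified sample of a made-up data frame d whose class y is rare.

set.seed(2)
d <- data.frame(x = rnorm(10000), y = rbinom(10000, 1, 0.01))

# simple random sample: the rare class may be badly under-represented
srs <- d[sample(nrow(d), 500), ]

# stratified sample: sample within each class to preserve its proportion
idx <- unlist(lapply(split(seq_len(nrow(d)), d$y),
                     function(i) sample(i, round(0.05 * length(i)))))
strat <- d[idx, ]
table(strat$y) / nrow(strat)       # class proportions roughly match d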

33
Data Discretization
  • Discretization
  • divide the range of a continuous attribute into
    intervals (see the sketch after this list)
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis
  • Reduces information in the data, but sometimes
    surprisingly good!
  • Bagnall and Janacek - KDD 2004
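
As an illustration, a minimal base-R sketch (not from the slides) discretizing a made-up continuous attribute with cut(), using equal-width and roughly equal-frequency bins.

set.seed(3)
age <- runif(200, 18, 90)

equal_width <- cut(age, breaks = 5)                        # 5 equal-width bins
equal_freq  <- cut(age, breaks = quantile(age, 0:5 / 5),
                   include.lowest = TRUE)                  # ~equal counts per bin
table(equal_freq)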

34
Model Evaluation
35
Model Evaluation
  • e.g. pick the best a and b in Y = aX + b
  • how do you define "best"?
  • A big difference between Data Mining and
    Statistics is the focus on predictive performance
    over finding the best estimators
  • Most obvious criterion for predictive
    performance: squared error (next slide)

36
Predictive Model Scores
  • More generally, score the model by summing an
    error term over all observations (see the sketch
    after this list)
  • Assumes all observations are equally important
  • assumes errors are treated equally
  • - what about if recent cases are more important,
    or high revenue, etc.?
  • Depends on differences rather than values
  • - scale matters with squared error
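
One common score of this form, consistent with the squared-error remark above, is the sum of squared errors between the observed and predicted values over the n observations:

  \[ S = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

Smaller values of S indicate better predictive performance; weighting the terms or using absolute differences addresses the caveats in the list above.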

37
Descriptive Model Scores
  • If your model assigns probabilities to classes:
  • likelihood based - pick the model that assigns
    the highest probability to what actually happened
  • Many different scoring rules exist for
    non-probabilistic models

38
Scoring Models with Different Complexities
Complex models can fit the data perfectly!

score(model) = goodness-of-fit - penalty for complexity

Classic bias/variance tradeoff: this is called
regularization and is used to combat overfitting.
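
As one concrete instance of such a penalized score (AIC and BIC are not named on the slide, so this is an illustration), the maximized log-likelihood is penalized by the number of parameters p for n observations:

  \[ \mathrm{AIC} = -2\log L(\hat{\theta}) + 2p, \qquad
     \mathrm{BIC} = -2\log L(\hat{\theta}) + p\log n \]

Smaller values are preferred; the second term is the complexity penalty.
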
39
Using Hold-Out Data
  • Instead of penalizing complexity, look at
    performance on hold-out data
  • Using the same set of examples for training as
    well as for evaluation results in an
    overoptimistic evaluation of model performance.
  • Need to test performance on data not seen by
    the modeling algorithm, i.e., data that was not
    used for model building

40
Data Partition
  • Randomly partition data into training and test
    set
  • Training set: data used to train/build the
    model
  • Estimate parameters (e.g., for a linear
    regression), build a decision tree, build an
    artificial neural network, etc.
  • Test set: a set of examples not used for model
    induction. The model's performance is evaluated
    on unseen data. Also referred to as out-of-sample
    data.
  • Generalization error: the model's error on the
    test data

(Figure: the data partitioned into a set of training
examples and a set of test examples)
41
(Figure: the model built on the set of training examples
is used to predict the outcome on the set of test
examples)
42
Holding out data
  • The holdout method reserves a certain amount for
    testing and uses the remainder for training
  • Usually one third for testing, the rest for
    training
  • For unbalanced datasets, random samples might
    not be representative
  • Few or no instances of some classes
  • Stratified sample: make sure that each class is
    represented with approximately equal proportions
    in both subsets (see the sketch after this list)
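
As an illustration, a minimal base-R sketch (not from the slides) of a plain 2/3 - 1/3 train/test split for a made-up data frame d with a binary outcome y.

set.seed(4)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1 - d$x2))

train_idx <- sample(nrow(d), size = round(2/3 * nrow(d)))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
pred <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, 1, 0)
mean(pred == test$y)               # test-set (generalization) accuracy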

43
Evaluation on small data
  • What if we have a small data set?
  • The chosen 2/3 for training may not be
    representative.
  • The chosen 1/3 for testing may not be
    representative.

44
Repeated holdout method
  • Holdout estimate can be made more reliable by
    repeating the process with different subsamples
  • In each iteration, a certain proportion is
    randomly selected for training (possibly with
    stratification)
  • The error rates on the different iterations are
    averaged to yield an overall error rate
  • This is called the repeated holdout method

45
Cross-validation
  • Most popular and effective type of repeated
    holdout is cross-validation
  • Cross-validation avoids overlapping test sets
  • First step: data is split into k subsets of equal
    size
  • Second step: each subset in turn is used for
    testing and the remainder for training
  • This is called k-fold cross-validation
  • Often the subsets are stratified before the
    cross-validation is performed
  • The error estimates are averaged to yield an
    overall error estimate

46
Cross-validation example
  • Break up data into groups of the same size
  • Hold aside one group for testing and use the rest
    to build model
  • Repeat

(Figure: the data broken into equal-size groups, with one
group labeled "Test" held aside in each iteration)
47
More on cross-validation
  • Standard method for evaluation: stratified
    ten-fold cross-validation
  • Why ten? Extensive experiments have shown that
    this is the best choice to get an accurate
    estimate
  • Stratification reduces the estimate's variance
  • Even better: repeated stratified cross-validation
  • E.g. ten-fold cross-validation is repeated ten
    times and results are averaged (reduces the
    variance); see the sketch after this list
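
As an illustration, a minimal base-R sketch (not from the slides) of plain, unstratified 10-fold cross-validation of a logistic regression on a made-up data frame d with outcome y.

set.seed(5)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1 - d$x2))

k <- 10
folds <- sample(rep(1:k, length.out = nrow(d)))   # random fold assignment

cv_acc <- sapply(1:k, function(f) {
  fit  <- glm(y ~ x1 + x2, data = d[folds != f, ], family = binomial)
  prob <- predict(fit, newdata = d[folds == f, ], type = "response")
  mean((prob > 0.5) == d$y[folds == f])           # accuracy on held-out fold
})
mean(cv_acc)                                      # cross-validated accuracy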

48
Leave-One-Out cross-validation
  • Leave-One-Out: a particular form of
    cross-validation
  • Set number of folds to number of training
    instances
  • I.e., for n training instances, build classifier
    n times
  • Makes best use of the data
  • Involves no random subsampling
  • Computationally expensive, but good performance

49
Leave-One-Out-CV and stratification
  • Disadvantage of Leave-One-Out-CV: stratification
    is not possible
  • It guarantees a non-stratified sample because
    there is only one instance in the test set!
  • Extreme example: random dataset split equally
    into two classes
  • Best model predicts the majority class
  • 50% accuracy on fresh data
  • Leave-One-Out-CV estimate is 100% error!

50
Models Performance Evaluation
  • Classification models predict what class an
    observation belongs to.
  • E.g., good vs. bad credit risk (Credit), Response
    vs. no response to a direct marketing campaign,
    etc.
  • Classification Accuracy Rate
  • Proportion of accurate classifications of
    examples in the test set. E.g., the model predicts
    the correct class for 70% of test examples.

51
Classification Accuracy Rate
Classification Accuracy Rate = S/N, the proportion of
examples accurately classified by the model, where
S = number of examples accurately classified by the
model and N = total number of examples.

Single | No. of cards | Age | Income>50K | Output (Good/Bad risk) | Model's prediction | Correct?
  0    |      1       |  28 |     1      |          1             |         1          | correct
  1    |      2       |  56 |     0      |          0             |         0          | correct
  0    |      5       |  61 |     1      |          0             |         1          | incorrect
  0    |      1       |  28 |     1      |          1             |         1          | correct

52
Consider the following
  • Response rate for a mailing campaign is 1%
  • We build a classification model to predict
    whether or not a customer would respond.
  • The model's classification accuracy rate is 99%
  • Is the model good?

99% do not respond
1% respond
53
Confusion Matrix for Classification
The model's performance over the test set:

                              Actual Class
                         Respond   Do not respond
Predicted  Respond           0            0
Class      Do not respond   10          990

Diagonal (left to right): predicted class = actual
class. 990/1000 (99%) accurate predictions.
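
As an illustration, a minimal base-R sketch (not from the slides) that rebuilds this confusion matrix and the accuracy rate for the mailing example.

actual    <- factor(c(rep("respond", 10), rep("no response", 990)),
                    levels = c("respond", "no response"))
predicted <- factor(rep("no response", 1000),
                    levels = c("respond", "no response"))

conf <- table(predicted, actual)   # 2 x 2 confusion matrix
conf
sum(diag(conf)) / sum(conf)        # accuracy = 990/1000 = 99%
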
54
Evaluation for a Continuous Response
  • In a logistic regression example, we predict
    probabilities that the response = 1.
  • Classification accuracy depends on the threshold

55
Predicted probabilities (test data)

Suppose we use a cutoff of 0.5:

                        actual outcome
                          1       0
predicted    1            8       3
outcome      0            0       9
56
More generally

                        actual outcome
                          1       0
predicted    1            a       b
outcome      0            c       d

misclassification rate = (b + c) / (a + b + c + d)

recall or sensitivity (how many of those that are
really positive did you predict?) = a / (a + c)

specificity = d / (b + d)

precision (how many of those predicted positive
really are?) = a / (a + b)
57
Suppose we use a cutoff of 0.5

                        actual outcome
                          1       0
predicted    1            8       3
outcome      0            0       9

sensitivity = 8 / (8 + 0) = 100%
specificity = 9 / (9 + 3) = 75%

We want both of these to be high.
58
Suppose we use a cutoff of 0.8

                        actual outcome
                          1       0
predicted    1            6       2
outcome      0            2      10

sensitivity = 6 / (6 + 2) = 75%
specificity = 10 / (10 + 2) = 83%
59
  • Note: there are 20 possible thresholds
  • ROC computes sensitivity and specificity for all
    possible thresholds and plots them
  • Note: if threshold = minimum,
  • c = d = 0, so sensitivity = 1 and specificity = 0
  • If threshold = maximum,
  • a = b = 0, so sensitivity = 0 and specificity = 1

                        actual outcome
                          1       0
predicted    1            a       b
outcome      0            c       d
60
Area under the curve (AUC) is a common measure
of predictive performance
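
As an illustration, a minimal base-R sketch (not from the slides) that traces out an ROC curve over all thresholds and computes the AUC by the trapezoidal rule, using made-up scores f and 0/1 labels y.

set.seed(6)
y <- rbinom(200, 1, 0.3)
f <- y + rnorm(200)                      # higher scores for the positive class

thr  <- c(Inf, sort(unique(f), decreasing = TRUE))
sens <- sapply(thr, function(t) mean(f[y == 1] >= t))   # hit rate
fpr  <- sapply(thr, function(t) mean(f[y == 0] >= t))   # false alarm rate

plot(fpr, sens, type = "l",
     xlab = "1 - specificity (false alarm rate)",
     ylab = "sensitivity (hit rate)")
abline(0, 1, lty = 2)                    # random ROC, AUC = 0.5

auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)  # trapezoids
auc
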
61
Another Look at ROC Curves
(Figure: distributions of the test result for patients
without the disease and patients with the disease)
62
Threshold
(Figure: a threshold on the test result; results above it
are called positive, results below it negative)
63
Some definitions ...
True Positives: patients with the disease whose test
result is above the threshold
64
False Positives: patients without the disease whose test
result is above the threshold
65
True negatives: patients without the disease whose test
result is below the threshold
66
False negatives: patients with the disease whose test
result is below the threshold
67
Moving the Threshold right
(Figure: fewer results are called positive; false
positives decrease and false negatives increase)
68
Moving the Threshold left
(Figure: more results are called positive; false
positives increase and false negatives decrease)
69
ROC curve
70
ROC curve comparison
(Figure: ROC curves of a poor test and of a good test)
71
ROC curve extremes
Best test: the distributions don't overlap at all.
Worst test: the distributions overlap completely.
72
ROC Curve
For a given threshold on f(x), you get a point on the
ROC curve.

(Figure: ROC curves. Vertical axis: positive class
success rate (hit rate, sensitivity), 0 to 100%.
Horizontal axis: 1 - negative class success rate (false
alarm rate, 1 - specificity), 0 to 100%. Shown are the
ideal ROC curve (AUC = 1), an actual ROC curve, and the
random ROC (AUC = 0.5). Note 0 <= AUC <= 1.)
73
AUC for ROC curves
(Figure: example ROC curves with AUC = 100%, 90%, 65%,
and 50%)
74
Interpretation of AUC
  • AUC can be interpreted as the probability that
    the test result from a randomly chosen diseased
    individual is more indicative of disease than
    that from a randomly chosen nondiseased
    individual: P(X_i > X_j | D_i = 1, D_j = 0)
  • So we can think of this as a nonparametric
    distance between disease/nondisease test results
  • equivalent to the Mann-Whitney U-statistic
    (nonparametric test of difference in location
    between two populations); see the sketch below
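
As an illustration, a minimal base-R sketch (not from the slides) of that equivalence: the Mann-Whitney statistic W counts (up to ties) the pairs in which a diseased score exceeds a nondiseased score, so W / (n1 * n0) equals the AUC.

set.seed(7)
y <- rbinom(200, 1, 0.3)           # disease indicator D
f <- y + rnorm(200)                # test result X

w <- wilcox.test(f[y == 1], f[y == 0])$statistic
as.numeric(w) / (sum(y == 1) * sum(y == 0))   # rank-based AUC estimate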

75
Lab 3
  • Missing data: using visualization to see whether
    the data is missing at random, by plotting missing
    values in the margins
  • Multi-dimensional scaling