Data preprocessing before classification

Provided by: Hsin8

1
Data preprocessing before classification

In Kennedy et al.: Solving data mining problems

2
Outline

Ch.7 Collecting data
Ch.8 Preparing data
Ch.9 Data preprocessing

3
Ch.7 Collecting data
4
Collecting data

Collecting example patterns
Inputs (vectors of independent variables)
Outputs (vectors of dependent variables)
More data is better
Begin with an elementary set of data

5
Collecting data

Choose an appropriate sampling rate for
time-series data.
Make sure the measurement units of the data are
consistent.
Keep non-essential variables out of the input
vector.
Make sure no major structural (systemic) changes
have occurred during collection.

6
Collecting data

How much data is enough?
Train and test using a subset of the data
If performance does not improve when the full
data set is used, the data is sufficient
There are statistical validating methods (Ch.11)
Using simulated data
When it is difficult to collect (sufficient) data
Realistic
Representative
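
The subset test above can be sketched as a small learning curve. This is only an illustration under stated assumptions: the two-class data is simulated, and the nearest-centroid classifier and the names make_data, fit_centroids and accuracy are hypothetical stand-ins, not the method from the book.

```python
import random

def make_data(n, rng):
    # Hypothetical two-class problem: class 0 is centred at 0.0, class 1 at 3.0.
    # Labels alternate so every prefix of the data is balanced.
    return [(rng.gauss(3.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def fit_centroids(train):
    # "Training" here is just computing the mean input value of each class.
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def accuracy(means, test):
    # Predict the class whose centroid is closest, then count correct predictions.
    hits = sum(1 for x, y in test
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(test)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train, test = make_data(400, rng), make_data(500, rng)

# Train on growing subsets and score each model on the same held-out test set.
curve = {n: accuracy(fit_centroids(train[:n]), test) for n in (10, 50, 200, 400)}
for n, acc in curve.items():
    print(n, round(acc, 3))
```

If the accuracy stops improving between the larger subset sizes, that suggests the data already collected is enough for this model.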

7
Ch.8 Preparing data
8
Preparing data

Handling
Missing data
Categorical data
Inconsistent data and outliers

9
Missing data

Discard incomplete example patterns
Manually enter a reasonable, probable, or
expected value
Use a statistic computed from the example
patterns that have that value
Mean, mode
Encode missing values explicitly by creating new
indicator variables
Generate a predictive model to predict each of
the missing data values

10
Categorical data

Ordinal
Convert to a numerical representation in a
straightforward manner
Low, medium, high -> 0, 1, 2
Nominal
One-of-n representation
Encode the input variables as n different binary
inputs, when there are n distinct categories.
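
Both encodings can be sketched in plain Python. The function names encode_ordinal and encode_one_of_n and the example values are assumptions for illustration only.

```python
def encode_ordinal(values, order):
    # Map each level to its position in the given order, e.g. low -> 0, high -> 2.
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

def encode_one_of_n(values):
    # One binary input per distinct category; exactly one of them is 1 per pattern.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sizes = encode_ordinal(["low", "high", "medium"], order=["low", "medium", "high"])
print(sizes)  # [0, 2, 1]

categories, colours = encode_one_of_n(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(colours)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

The ordinal mapping keeps the ordering of the levels, while the one-of-n rows avoid implying any order among nominal categories.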

11
Further processing of one-of-n

When n is too large, reduce the number of inputs
in the new encoding.
Manually
PCA-based reduction
Reduce the one-of-n representation to a one-of-m
representation where m is less than n.
Eigenvalue-based reduction
Output variable-based reduction

12
Inconsistent data and outliers

Removing erroneous data
Identifying inconsistent data
Thresholding, filtering
Outliers
Data points that lie outside of the normal region
of interest in the input space, which may be:
Unusual situations that are correct
Misleading or incorrect measurements

13
Outliers

Ways to spot outliers
Plots: box plot, histogram
Number of S.D. from the mean
Handling outliers
Remove them
Assumption: the region of the input space where the
outliers reside is not of interest
Winsorize them
Convert the values of outliers into the values of
upper or lower thresholds.
Outliers can always be reintroduced into a
satisfactory model to study the changes in the
performance of the model.
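
The standard-deviation test and winsorizing can be sketched as follows. The cutoff of 2 standard deviations, the thresholds 10 and 13, and the sample readings are illustrative assumptions, not values from the book.

```python
from statistics import mean, stdev

def find_outliers(values, max_sd=2.0):
    # Flag values that lie more than max_sd standard deviations from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > max_sd * s]

def winsorize(values, lower, upper):
    # Pull values outside [lower, upper] back to the nearest threshold.
    return [min(max(v, lower), upper) for v in values]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(find_outliers(readings))      # [95]
print(winsorize(readings, 10, 13))  # [10, 12, 11, 13, 12, 11, 10, 12, 13]
```

Note that an extreme value inflates the standard deviation it is measured against, so with a cutoff of 3 the reading 95 would not be flagged in this small sample.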

14
Ben Shabad
15
Ch.9 Data preprocessing
16
Reasons to preprocess data

Reducing noise
Enhancing the signal
Reducing input space
Feature extraction
Normalizing data
Modifying prior probabilities (specific for
classification)

17
Reducing noise

Averaging data values
Thresholding data
Convert numeric format data into categorical
E.g. grey-scale -> monochrome (black-and-white) image
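
Averaging and thresholding can each be sketched in one function. The window size, the cutoff of 128, and the sample values are assumptions chosen for illustration.

```python
def moving_average(values, window=3):
    # Replace each run of `window` consecutive values by its mean to smooth noise.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def threshold(values, cutoff):
    # Convert numeric values to two categories: 1 at or above the cutoff, else 0.
    return [1 if v >= cutoff else 0 for v in values]

print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]

# Grey levels (0-255) reduced to black/white pixels.
print(threshold([12, 200, 130, 90], cutoff=128))  # [0, 1, 1, 0]
```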

18
Reducing input space

Principal component analysis (PCA)
Identify m-dimensional subspace of the
n-dimensional input space
The original n variables are reduced to m variables
that are mutually orthogonal (uncorrelated)
Eliminating correlated input variables
Identify highly correlated input variables by
Statistical correlation tests
Visual inspection of graphed data variables
Seeing if a data variable can be modeled using
one or more others.
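
A statistical correlation test for spotting redundant inputs can be sketched with the Pearson coefficient. The cutoff of 0.95, the function names, and the sample columns are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equally long columns.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, cutoff=0.95):
    # Return every pair of input variables whose |correlation| reaches the cutoff.
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) >= cutoff]

columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same quantity in other units
    "age": [35, 22, 41, 29, 35],
}
pairs = correlated_pairs(columns)
print(pairs)  # [('height_cm', 'height_in')]
```

One variable of each reported pair is a candidate for removal, since it adds little information beyond the other.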

19
Reducing input space

Combining non-correlated input variables
Sensitivity analysis
If variations of a particular input variable
cause large changes in the estimation model
output, the variable is very significant.
Sensitivity analysis prunes input variables based
on information provided by both input and output
data.
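
One simple way to approximate this is to nudge each input of an already fitted model and measure how much the output moves. This is a sketch under assumptions: toy_model is a made-up stand-in for a trained model, and the perturbation size delta is arbitrary.

```python
def sensitivity(model, inputs, delta=0.01):
    # For each input variable, perturb it by delta and record the
    # absolute change in model output per unit of input change.
    base = model(inputs)
    scores = []
    for i in range(len(inputs)):
        nudged = list(inputs)
        nudged[i] += delta
        scores.append(abs(model(nudged) - base) / delta)
    return scores

def toy_model(x):
    # Hypothetical fitted model: the output depends strongly on x[0], weakly on x[1].
    return 5.0 * x[0] + 0.01 * x[1]

scores = sensitivity(toy_model, [1.0, 1.0])
print(scores)  # roughly [5.0, 0.01]
```

Here the second input barely affects the output, so it would be a candidate for pruning. In practice the scores should be averaged over many example patterns rather than taken at a single point.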

20
Normalizing data

Normalizing does not mean transforming to a normal distribution
Some models perform better on normalized data
Distance-based non-parametric algorithms implicitly assume
distances in different directions carry the same
weight (e.g. K-nearest neighbor, KNN)
Backpropagation (BP) and multilayer perceptron
(MLP) models often perform better if all inputs
and outputs are normalized
Avoiding numerical problems

21
Types of normalization

Min-max normalization
It preserves all relationships of the data values
exactly
It would compress the normal range if extreme
values or outliers exist
Z-score normalization
Sigmoidal normalization

22
Other considerations

According to the characteristics of the specific
classifiers being used for modeling
E.g. CHAID uses categorical data directly
Input variables often produce the best modeling
accuracy when they exhibit a uniform or Gaussian
distribution
Add expert knowledge when preprocessing data

23
Get prepared and then go!