Data preprocessing before classification - PowerPoint PPT Presentation

Provided by: Hsin8


1
Data preprocessing before classification
  • In Kennedy et al. Solving data mining problems

2
Outline
  • Ch.7 Collecting data
  • Ch.8 Preparing data
  • Ch.9 Data preprocessing

3
Ch.7 Collecting data
4
Collecting data
  • Collecting example patterns
  • Inputs (vectors of independent variables)
  • Outputs (vectors of dependent variables)
  • More data is better
  • Begin with an elementary set of data

5
Collecting data
  • Choose an appropriate sampling rate for
    time-series data.
  • Make sure the units of measurement are
    consistent.
  • Keep non-essential variables out of the input
    vector.
  • Make sure no major structural (systemic) changes
    have occurred during collection.

6
Collecting data
  • How much data is enough?
  • Train and test using a subset of the data
  • If performance does not improve when the full
    data set is used, the data is sufficient
  • There are statistical validation methods (Ch.11)
  • Using simulated data
  • Useful when it is difficult to collect
    sufficient data
  • Simulated data must be realistic
  • Simulated data must be representative

7
Ch.8 Preparing data
8
Preparing data
  • Handling
  • Missing data
  • Categorical data
  • Inconsistent data and outliers

9
Missing data
  • Discard incomplete example patterns
  • Manually enter a reasonable, probable, or
    expected value
  • Use a statistic computed from the example
    patterns that have that value
  • E.g. mean, mode
  • Encode missing values explicitly by creating new
    indicator variables
  • Build a predictive model to predict each
    missing data value
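
Two of the options above, filling with the mean and adding an indicator variable, can be combined in a few lines. A minimal sketch, assuming missing values are represented as None; the function name impute_with_mean and the sample data are hypothetical.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace None with the mean of the observed values.

    Returns (filled_values, missing_indicator), where the indicator is 1
    wherever the original value was missing.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator

ages = [20, None, 40, 30, None]          # made-up example data
filled, was_missing = impute_with_mean(ages)
print(filled)       # [20, 30, 40, 30, 30]
print(was_missing)  # [0, 1, 0, 0, 1]
```

Keeping the indicator as an extra input lets the model learn whether "missing" is itself informative.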

10
Categorical data
  • Ordinal
  • Convert to a numerical representation in a
    straightforward manner
  • Low, medium, high -> 0, 1, 2
  • Nominal
  • One-of-n representation
  • Encode the input variables as n different binary
    inputs, when there are n distinct categories.
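
Both encodings can be sketched as follows. The category names and the helper names encode_ordinal and one_of_n are illustrative assumptions.

```python
# Ordinal: the order of the levels is meaningful, so map them to integers.
ORDINAL_LEVELS = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(value):
    """Map an ordinal category to its rank."""
    return ORDINAL_LEVELS[value]

def one_of_n(value, categories):
    """Encode a nominal value as n binary inputs, one per distinct category."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]        # n = 3 distinct categories
print(encode_ordinal("medium"))          # 1
print(one_of_n("green", colors))         # [0, 1, 0]
```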

11
Further processing of one-of-n
  • When n is too large, reduce the number of inputs
    in the new encoding.
  • Manually
  • PCA-based reduction
  • Reduce the one-of-n representation to a one-of-m
    representation where m is less than n.
  • Eigenvalue-based reduction
  • Output variable-based reduction
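
As one possible reading of the manual option, rare categories can be merged into a single catch-all category before one-of-n encoding, which turns a one-of-n representation into a one-of-m one. This is only a sketch of that idea, not the PCA-, eigenvalue-, or output-based methods listed above; the min_count cutoff, the "other" label, and the data are assumptions.

```python
from collections import Counter

def merge_rare_categories(values, min_count, other="other"):
    """Replace every category seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

cities = ["NY", "NY", "LA", "LA", "LA", "Reno", "Boise"]  # n = 4 categories
merged = merge_rare_categories(cities, min_count=2)
reduced_categories = sorted(set(merged))
print(reduced_categories)  # ['LA', 'NY', 'other'], so m = 3
```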

12
Inconsistent data and outliers
  • Removing erroneous data
  • Identifying inconsistent data
  • Thresholding, filtering
  • Outliers
  • Data points that lie outside of the normal region
    of interest in the input space, which may be:
  • Unusual situations that are correct
  • Misleading or incorrect measurements

13
Outliers
  • Ways to spot outliers
  • Plots: box plot, histogram
  • Number of standard deviations (S.D.) from the
    mean
  • Handling outliers
  • Remove them
  • Assumption: the region of the input space where
    the outliers reside is not of interest
  • Winsorize them
  • Convert the values of outliers into the values of
    upper or lower thresholds.
  • Outliers can always be reintroduced into a
    satisfactory model to study the change in its
    performance.
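
The S.D. rule and winsorizing can be sketched as below. The 2 S.D. cutoff and the clipping thresholds of 5 and 15 are assumptions chosen for this made-up data set; note that a single extreme value inflates the S.D. itself, so a stricter cutoff can fail to flag it.

```python
from statistics import mean, pstdev

def find_outliers(values, max_sd=2.0):
    """Return the values lying more than max_sd standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > max_sd * sigma]

def winsorize(values, lower, upper):
    """Clip values below `lower` up to `lower` and values above `upper` down to `upper`."""
    return [min(max(v, lower), upper) for v in values]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(find_outliers(readings))     # [100]
print(winsorize(readings, 5, 15))  # the 100 becomes 15, the rest are unchanged
```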

14
Ben Shabad
15
Ch.9 Data preprocessing
16
Reasons to preprocess data
  • Reducing noise
  • Enhancing the signal
  • Reducing input space
  • Feature extraction
  • Normalizing data
  • Modifying prior probabilities (specific for
    classification)

17
Reducing noise
  • Averaging data values
  • Thresholding data
  • Convert numeric data into categorical data
  • E.g. grey-scale -> monochrome (black-and-white)
    image
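
Averaging and thresholding can each be written in a few lines. A minimal sketch: the window size of 3 and the grey-level cutoff of 128 are assumptions, and the helper names are hypothetical.

```python
def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def threshold(values, cutoff):
    """Convert numeric values to two categories: 1 if >= cutoff, else 0."""
    return [1 if v >= cutoff else 0 for v in values]

print(moving_average([3, 6, 9, 12]))     # [6.0, 9.0]
grey_pixels = [12, 200, 130, 90]         # grey levels in 0..255
print(threshold(grey_pixels, 128))       # [0, 1, 1, 0]
```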

18
Reducing input space
  • Principal component analysis (PCA)
  • Identify an m-dimensional subspace of the
    n-dimensional input space
  • The original n variables are reduced to m
    variables that are mutually orthogonal
    (uncorrelated)
  • Eliminating correlated input variables
  • Identify highly correlated input variables by:
  • Statistical correlation tests
  • Visual inspection of graphed data variables
  • Seeing if a data variable can be modeled using
    one or more others.
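
A statistical correlation check of this kind can be sketched with the Pearson coefficient. The 0.95 limit, the column names, and the data are assumptions for illustration; one variable from each flagged pair would be a candidate for removal.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(columns, limit=0.95):
    """Return pairs of column names whose |correlation| is at least `limit`."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= limit
    ]

columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same quantity in other units
    "age": [35, 22, 41, 29, 38],
}
print(correlated_pairs(columns))  # [('height_cm', 'height_in')]
```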

19
Reducing input space
  • Combining non-correlated input variables
  • Sensitivity analysis
  • If variations of a particular input variable
    cause large changes in the estimation model
    output, the variable is very significant.
  • Sensitivity analysis prunes input variables based
    on information provided by both input and output
    data.
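
A simple perturbation-based version of sensitivity analysis is sketched below. It assumes a model that is already fitted and can be called as a function; toy_model, the baseline point, and the step size delta are hypothetical stand-ins.

```python
def sensitivity(model, baseline, delta=0.1):
    """Absolute change in model output when each input is nudged by delta in turn."""
    base_out = model(baseline)
    scores = []
    for i in range(len(baseline)):
        perturbed = list(baseline)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - base_out))
    return scores

def toy_model(x):
    # Stand-in for a fitted model: strong dependence on x[0],
    # weak dependence on x[1], none on x[2].
    return 5.0 * x[0] + 0.1 * x[1] + 0.0 * x[2]

scores = sensitivity(toy_model, [1.0, 1.0, 1.0])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # input indices, most significant first
```

Inputs at the bottom of the ranking are candidates for pruning. In practice the perturbation would be repeated over many baseline points rather than one.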

20
Normalizing data
  • Normalizing here does not mean transforming to a
    normal distribution
  • Some models perform better on normalized data
  • Distance-based non-parametric algorithms (e.g.
    K-nearest neighbor, KNN) implicitly assume that
    distances in different directions carry the same
    weight
  • Backpropagation (BP) and multilayer perceptron
    (MLP) models often perform better if all inputs
    and outputs are normalized
  • Avoiding numerical problems
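
The point about distance-based methods such as KNN can be seen with three made-up records of (age, income). The value ranges used for scaling are assumptions.

```python
from math import dist

def scale(point, lows, highs):
    """Min-max scale each coordinate of a point into [0, 1]."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

a = (30, 50_000)   # query record
b = (60, 50_500)   # very different age, similar income
c = (31, 52_000)   # nearly the same age, slightly higher income

# On raw values, income dominates the distance, so b looks nearer to a than c.
print(dist(a, b) < dist(a, c))      # True

lows, highs = (20, 20_000), (70, 120_000)   # assumed variable ranges
sa, sb, sc = (scale(p, lows, highs) for p in (a, b, c))

# After scaling, both variables carry comparable weight and c is nearer.
print(dist(sa, sc) < dist(sa, sb))  # True
```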

21
Types of normalization
  • Min-max normalization
  • It preserves all relationships of the data values
    exactly
  • It compresses the normal range if extreme
    values or outliers exist
  • Z-score normalization
  • Sigmoidal normalization
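
The three types can be sketched as follows. The sigmoidal form shown, a logistic function applied to the z-score, is one common definition and is an assumption here; the slides do not give the formula.

```python
from math import exp
from statistics import mean, pstdev

def min_max(values):
    """Linearly rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def sigmoidal(values):
    """Squash z-scores through a logistic function into (0, 1)."""
    return [1.0 / (1.0 + exp(-z)) for z in z_score(values)]

values = [1, 2, 3, 4, 100]   # one extreme value
print(min_max(values))       # the first four values are squeezed near 0
print(z_score(values))
print(sigmoidal(values))
```

With the extreme value 100 present, min-max maps 1 to 4 into roughly the bottom 3% of the range, which is the compression effect noted above.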

22
Other considerations
  • Preprocess according to the characteristics of
    the specific classifier being used for modeling
  • E.g. CHAID uses categorical data directly
  • Input variables produce the best modeling
    accuracy when exhibiting a uniform or Gaussian
    distribution
  • Add expert knowledge when preprocessing data

23
Get prepared and then go!