Data preparation: Selection, Preprocessing, and Transformation

1
Data preparation: Selection, Preprocessing, and
Transformation
  • Literature:
  • I.H. Witten and E. Frank, Data Mining, chapters 2
    and 7

2
Fayyad's KDD Methodology
(Diagram of the KDD process omitted: selection,
preprocessing, transformation, data mining,
interpretation/evaluation, leading from data to knowledge)
3
Contents
  • Data Selection
  • Data Preprocessing
  • Data Transformation

4
Data Selection
  • Goal: understanding the data
  • Explore the data:
  • possible attributes
  • their values
  • distribution, outliers

5
Getting to know the data
  • Simple visualization tools are very useful for
    identifying problems (see the sketch below)
  • Nominal attributes: histograms (is the distribution
    consistent with background knowledge?)
  • Numeric attributes: graphs (any obvious
    outliers?)
  • 2-D and 3-D visualizations show dependencies
  • Domain experts need to be consulted
  • Too much data to inspect? Take a sample!
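
A minimal inspection pass in this spirit, assuming the WEKA 3
Java API; the file name weather.arff is a placeholder:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.Arrays;
  import weka.core.AttributeStats;
  import weka.core.Instances;

  public class ExploreData {
    public static void main(String[] args) throws Exception {
      Instances data = new Instances(
          new BufferedReader(new FileReader("weather.arff")));
      // One summary line per attribute: value counts (a textual
      // histogram) for nominal attributes, range and mean for
      // numeric ones, to spot odd distributions and outliers
      for (int i = 0; i < data.numAttributes(); i++) {
        AttributeStats stats = data.attributeStats(i);
        System.out.print(data.attribute(i).name() + ": ");
        if (data.attribute(i).isNominal()) {
          System.out.println(Arrays.toString(stats.nominalCounts));
        } else {
          System.out.println("min=" + stats.numericStats.min
              + " max=" + stats.numericStats.max
              + " mean=" + stats.numericStats.mean);
        }
      }
    }
  }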

6
Data preprocessing
  • Problem: different data sources (e.g. sales
    department, customer billing department, ...)
  • Differences: styles of record keeping,
    conventions, time periods, data aggregation,
    primary keys, errors
  • Data must be assembled, integrated, and cleaned up
  • Data warehouse: a consistent point of access
  • External data may be required (overlay data)
  • Critical: the type and level of data aggregation

7
Data Preprocessing
  • Choose data structure (table, tree or set of
    tables)
  • Choose attributes with enough information
  • Decide on a first representation of the
    attributes (numeric or nominal)
  • Decide on missing values
  • Decide on inaccurate data (cleansing)

8
Attribute types used in practice
  • Most schemes accommodate just two levels of
    measurement: nominal and ordinal
  • Nominal attributes are also called categorical,
    enumerated, or discrete
  • But: enumerated and discrete imply an order
  • Special case: dichotomy (boolean attribute)
  • Ordinal attributes are also called numeric, or
    continuous
  • But: continuous implies mathematical continuity

9
The ARFF format
  • ARFF file for the weather data with some numeric
    features:
  • @relation weather
  • @attribute outlook {sunny, overcast, rainy}
  • @attribute temperature numeric
  • @attribute humidity numeric
  • @attribute windy {true, false}
  • @attribute play? {yes, no}
  • @data
  • sunny, 85, 85, false, no
  • sunny, 80, 90, true, no
  • overcast, 83, 86, false, yes
  • ...

10
Attribute types
  • ARFF supports numeric and nominal attributes
  • Interpretation depends on the learning scheme
  • Numeric attributes are interpreted as:
  • ordinal scales if less-than and greater-than are
    used
  • ratio scales if distance calculations are
    performed
  • (normalization/standardization may be required)
  • Instance-based schemes define a distance between
    nominal values (0 if values are equal, 1
    otherwise; see the sketch below)
  • Integers: nominal, ordinal, or ratio scale?
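
A tiny illustration of that 0/1 convention for nominal
distances; NominalDistance is a hypothetical helper, not part
of any WEKA API:

  public class NominalDistance {
    // Distance between two nominal values: 0 if equal, 1 otherwise
    static double distance(String a, String b) {
      return a.equals(b) ? 0.0 : 1.0;
    }
    public static void main(String[] args) {
      System.out.println(distance("sunny", "sunny")); // 0.0
      System.out.println(distance("sunny", "rainy")); // 1.0
    }
  }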

11
Nominal vs. ordinal
  • Attribute age nominal:
    If age = young and astigmatic = no and
    tear production rate = normal then
    recommendation = soft
    If age = pre-presbyopic and astigmatic = no and
    tear production rate = normal then
    recommendation = soft
  • Attribute age ordinal (e.g. young <
    pre-presbyopic < presbyopic):
    If age ≤ pre-presbyopic and astigmatic = no and
    tear production rate = normal then
    recommendation = soft

12
Missing values
  • Frequently indicated by out-of-range entries
  • Types: unknown, unrecorded, irrelevant
  • Reasons: malfunctioning equipment, changes in
    experimental design, collation of different
    datasets, measurement not possible
  • A missing value may have significance in itself
    (e.g. a missing test in a medical examination)
  • Most schemes assume that is not the case ⇒
    missing may need to be coded as an additional
    value (see the sketch below)
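
When missingness is not significant, simple imputation is the
usual alternative to an extra value. A minimal sketch, assuming
WEKA's unsupervised ReplaceMissingValues filter (which fills in
means for numeric and modes for nominal attributes); the file
name is a placeholder:

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.filters.Filter;
  import weka.filters.unsupervised.attribute.ReplaceMissingValues;

  public class HandleMissing {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("weather.arff");
      ReplaceMissingValues filter = new ReplaceMissingValues();
      filter.setInputFormat(data);  // learn means/modes from the data
      Instances clean = Filter.useFilter(data, filter);
      System.out.println(clean.numInstances()
          + " instances, missing values imputed");
    }
  }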

13
Inaccurate values
  • Reason: the data has not been collected for the
    purpose of mining
  • Result: errors and omissions that don't affect the
    original purpose of the data (e.g. age of customer)
  • Typographical errors in nominal attributes
    ⇒ values need to be checked for consistency
  • Typographical and measurement errors in numeric
    attributes ⇒ outliers need to be identified
  • Errors may be deliberate (e.g. wrong zip codes)
  • Other problems: duplicates, stale data

14
Transformation: Attribute selection
  • Adding a random (i.e. irrelevant) attribute can
    significantly degrade C4.5's performance
  • Problem: attribute selection within the tree is
    based on smaller and smaller amounts of data
  • IBL is also very susceptible to irrelevant
    attributes
  • The number of training instances required increases
    exponentially with the number of irrelevant
    attributes
  • Naïve Bayes doesn't have this problem
  • Relevant attributes can also be harmful

15
Scheme-independent selection
  • Filter approach: assessment is based on general
    characteristics of the data
  • One method: find a subset of attributes that is
    enough to separate all the instances
  • Another method: use a different learning scheme
    (e.g. C4.5, 1R) to select attributes
  • IBL-based attribute weighting techniques can also
    be used (but they can't find redundant attributes)
  • CFS uses correlation-based evaluation of subsets
    (see the sketch below)
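
A minimal sketch of CFS with best-first search, assuming the
WEKA 3 attribute-selection API and a hypothetical weather.arff:

  import weka.attributeSelection.AttributeSelection;
  import weka.attributeSelection.BestFirst;
  import weka.attributeSelection.CfsSubsetEval;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class CfsExample {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("weather.arff");
      data.setClassIndex(data.numAttributes() - 1); // last attribute is the class
      AttributeSelection selector = new AttributeSelection();
      selector.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluation
      selector.setSearch(new BestFirst());        // best-first search of the subset space
      selector.SelectAttributes(data);            // note WEKA's capitalized method name
      System.out.println(selector.toResultsString());
    }
  }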

16
Attribute subsets for weather data
(Diagram of the space of attribute subsets omitted)
17
Searching the attribute space
  • The number of possible attribute subsets is
    exponential in the number of attributes
  • Common greedy approaches: forward selection and
    backward elimination
  • More sophisticated strategies:
  • Bidirectional search
  • Best-first search (can find the optimum solution)
  • Beam search (an approximation to best-first search)
  • Genetic algorithms

18
Scheme-specific selection
  • Wrapper approach: attribute selection is
    implemented as a wrapper around the learning
    scheme (see the sketch below)
  • Evaluation criterion: cross-validation
    performance
  • Time consuming: adds a factor of k² even for
    greedy approaches with k attributes
  • Linearity in k requires a prior ranking of the
    attributes
  • Scheme-specific attribute selection is essential
    for learning decision tables
  • Can be done efficiently for decision tables and
    Naïve Bayes
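
A hedged sketch of the wrapper approach, assuming WEKA's
WrapperSubsetEval with J48 (WEKA's C4.5 implementation) and
cross-validation as the evaluation criterion; the file name is
a placeholder:

  import weka.attributeSelection.AttributeSelection;
  import weka.attributeSelection.GreedyStepwise;
  import weka.attributeSelection.WrapperSubsetEval;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class WrapperExample {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("weather.arff");
      data.setClassIndex(data.numAttributes() - 1);
      WrapperSubsetEval eval = new WrapperSubsetEval();
      eval.setClassifier(new J48()); // the wrapped learning scheme
      eval.setFolds(10);             // 10-fold CV per candidate subset
      AttributeSelection selector = new AttributeSelection();
      selector.setEvaluator(eval);
      selector.setSearch(new GreedyStepwise()); // greedy forward selection
      selector.SelectAttributes(data);
      System.out.println(selector.toResultsString());
    }
  }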

19
Discretizing numeric attributes
  • Can be used to avoid making a normality assumption
    in Naïve Bayes and clustering
  • A simple discretization scheme is used in 1R
  • C4.5 performs local discretization
  • Global discretization can be advantageous because
    it's based on more data
  • The learner can be applied to the discretized
    attribute, or
  • to binary attributes coding the cut points in the
    discretized attribute

20
Unsupervised discretization
  • Unsupervised discretization generates intervals
    without looking at class labels
  • The only possibility when clustering (no class
    labels available)
  • Two main strategies (sketched below):
  • Equal-interval binning
  • Equal-frequency binning (also called histogram
    equalization)
  • Inferior to supervised schemes in classification
    tasks
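
A minimal sketch of both strategies, assuming WEKA's
unsupervised Discretize filter and a hypothetical file:

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.filters.Filter;
  import weka.filters.unsupervised.attribute.Discretize;

  public class UnsupervisedBinning {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("weather.arff");
      Discretize filter = new Discretize();
      filter.setBins(4);                 // four intervals per numeric attribute
      filter.setUseEqualFrequency(true); // true: equal-frequency; false: equal-interval
      filter.setInputFormat(data);
      Instances binned = Filter.useFilter(data, filter);
      System.out.println(binned);        // class labels were never consulted
    }
  }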

21
Entropy-based discretization
  • Supervised method that builds a decision tree
    with pre-pruning on the attribute being
    discretized
  • Entropy is used as the splitting criterion
  • MDLP (minimum description length principle) is
    used as the stopping criterion
  • A state-of-the-art discretization method
  • Application of MDLP:
  • Theory: the splitting point (log2(N-1) bits)
    plus the class distribution in each subset
  • The description length before/after adding the
    splitting point is compared (see the sketch below)
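
WEKA's supervised Discretize filter implements this
entropy/MDL scheme (after Fayyad and Irani); a minimal sketch
with a hypothetical file name:

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.filters.Filter;
  import weka.filters.supervised.attribute.Discretize;

  public class EntropyDiscretize {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("weather.arff");
      data.setClassIndex(data.numAttributes() - 1); // class labels drive the splits
      Discretize filter = new Discretize(); // entropy splitting, MDL stopping criterion
      filter.setInputFormat(data);
      Instances discretized = Filter.useFilter(data, filter);
      System.out.println(discretized);
    }
  }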

22
Example: temperature attribute
(Worked-example table of sorted temperature values with their
class labels omitted)
23
Formula for MDLP
  • N instances
  • k classes and entropy E in the original set
  • k1 classes and entropy E1 in the first subset
  • k2 classes and entropy E2 in the second subset
  • Applying the criterion (written out below) doesn't
    result in any discretization intervals for the
    temperature attribute
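
Written out, the stopping criterion as presented in Witten and
Frank (following Fayyad and Irani) accepts a split only if its
information gain satisfies

  gain > log2(N - 1)/N + [log2(3^k - 2) - k*E + k1*E1 + k2*E2] / N

so the gain must pay for both the cost of encoding the split
point and the change in the cost of encoding the class
distributions.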

24
Other discretization methods
  • The top-down procedure can be replaced by a
    bottom-up method
  • MDLP can be replaced by a chi-squared test
  • Dynamic programming can be used to find the
    optimum k-way split for a given additive criterion
  • Requires time quadratic in the number of instances
    if entropy is used as the criterion
  • Can be done in linear time if error rate is used
    as the evaluation criterion

25
Transformation
  • WEKA provides many filters that can help you
    transform and select your attributes!
  • Use them to build a promising model for the
    caravan data! (a starting-point sketch follows)
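
One hedged starting point, assuming the WEKA 3 API and a
hypothetical caravan.arff export of the caravan data: combine
CFS attribute selection with Naïve Bayes in a single meta
classifier and estimate performance by cross-validation.

  import java.util.Random;
  import weka.attributeSelection.BestFirst;
  import weka.attributeSelection.CfsSubsetEval;
  import weka.classifiers.Evaluation;
  import weka.classifiers.bayes.NaiveBayes;
  import weka.classifiers.meta.AttributeSelectedClassifier;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class CaravanModel {
    public static void main(String[] args) throws Exception {
      Instances data = DataSource.read("caravan.arff"); // hypothetical file name
      data.setClassIndex(data.numAttributes() - 1);
      // Attribute selection and learning bundled so selection is
      // redone inside each cross-validation fold
      AttributeSelectedClassifier model = new AttributeSelectedClassifier();
      model.setEvaluator(new CfsSubsetEval());
      model.setSearch(new BestFirst());
      model.setClassifier(new NaiveBayes());
      Evaluation eval = new Evaluation(data);
      eval.crossValidateModel(model, data, 10, new Random(1));
      System.out.println(eval.toSummaryString());
    }
  }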