Chap. 2 Data Preprocessing

Slides: 31
Provided by: jiaw185

1
Chap. 2 Data Preprocessing
  • Data Mining

2
Why Data Preprocessing?
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking
    certain attributes of interest, or containing
    only aggregate data
  • e.g., occupation = "" (empty value)
  • noisy: containing errors or outliers
  • e.g., Salary = -10
  • inconsistent: containing discrepancies in codes
    or names
  • e.g., gender vs. sex
  • e.g., sex = woman vs. sex = female
  • No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • Data warehouse needs consistent integration of
    quality data

3
Major Tasks
  • Data cleaning
  • Fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or
    files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains reduced representation in volume but
    produces the same or similar analytical results
  • Data discretization
  • Part of data reduction but with particular
    importance, especially for numerical data

4
Major Tasks
5
Data Cleaning
  • Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

6
Missing Data
  • Data is not always available
  • Many tuples have no recorded value for several
    attributes, such as customer income in sales data
  • Missing data may be due to
  • Equipment malfunction
  • Inconsistent with other recorded data and thus
    deleted
  • Data not entered due to misunderstanding
  • Certain data may not be considered important at
    the time of entry
  • Missing data may need to be inferred

7
How to Handle Missing Data?
  • Ignore the tuple
  • usually done when the class label is missing
    (assuming the task is classification)
  • Fill in manually
  • time-consuming
  • Use a global constant
  • Ex: "unknown", 0, or -∞
  • Use the attribute mean
  • Use the attribute mean for all samples of the
    same class
  • Ex: for a customer in the risk_high class → fill in
    the average over risk_high customers
  • Use the most probable value
  • Inference-based such as Bayesian formula or
    decision tree
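
The three automatic fill-in strategies above (global constant, attribute mean, per-class mean) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the records, the `income` attribute, and the helper names are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"risk": "high", "income": 30.0},
    {"risk": "high", "income": None},
    {"risk": "high", "income": 50.0},
    {"risk": "low", "income": 80.0},
    {"risk": "low", "income": None},
]

def fill_constant(rows, attr, constant):
    """Replace every missing value of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean over rows of the same class."""
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    means = {c: mean(vs) for c, vs in observed.items()}
    return [
        {**r, attr: means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

by_class = fill_class_mean(records, "income", "risk")
print(by_class[1]["income"], by_class[4]["income"])
```

The per-class variant fails with a `KeyError` if a class has no observed value at all; a real pipeline would need a fallback for that case.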

8
Noisy Data
  • Noise
  • Random error or variance in a measured variable
  • Incorrect attribute values may be due to
  • Faulty data collection instruments
  • Data entry problems
  • Data transmission problems
  • Inconsistency in naming convention

9
How to Handle Noisy Data?
  • Binning
  • First sort data and partition into (equi-depth)
    bins
  • Then one can smooth by bin means, smooth by bin
    median, smooth by bin boundaries, etc.
  • Clustering
  • Similar values are organized into groups
    (clusters) → detect and remove outliers
  • Combined computer and human inspection
  • Detect suspicious values and have a human check them
  • Regression
  • Smooth by fitting the data to regression
    functions

10
Binning
  • Equal-width (distance) partitioning
  • It divides the data value range into N intervals
    of equal size
  • Outliers may dominate presentation
  • Equal-depth (frequency) partitioning
  • It divides the range into N intervals, each
    containing approximately the same number of samples
  • Example
  • Sorted data for price (in dollars): 4, 8, 9,
    15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means (rounded to integers):
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
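
The price example above can be reproduced with a short sketch. The function names are illustrative, and rounding the bin means to integers is an assumption made to match the figures on the slide (the exact means of bins 2 and 3 are 22.75 and 29.25). Smoothing by bin boundaries, which the deck mentions but does not work through, is included for comparison.

```python
def equal_depth_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace each value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```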

11
Cluster Analysis
12
Regression
13
Data Integration
  • Data integration
  • Combines data from multiple sources into a
    coherent store
  • Schema integration
  • Integrate metadata from different sources
  • Entity identification problem: identify real-world
    entities from multiple data sources
  • Ex: customer_id ≡ cust-No
  • Detecting and resolving data value conflicts
  • For the same real world entity, attribute values
    from different sources are different
  • Possible reasons: different representations,
    different scales
  • Ex: 100 vs. 100.00 dollars

14
Data Integration
  • Redundancy
  • One attribute may be derived from another
    attribute
  • Ex: monthly sales vs. annual sales
  • Detecting redundancy
  • Some redundancy can be detected by correlation
    analysis (how strongly one attribute implies
    the other)
  • r > 0: positively correlated (A increases → B
    increases)
  • r = 0: no linear correlation
  • r < 0: negatively correlated
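
As a sketch of the correlation analysis above, the coefficient r can be computed as Pearson's product-moment coefficient (an assumption; the slide does not name the formula). The sales figures are hypothetical, with the annual value derived as 12 times the monthly one to mimic a redundant attribute.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical data: annual sales derived directly from monthly sales.
monthly = [10.0, 12.0, 9.0, 15.0]
annual = [12 * m for m in monthly]
print(pearson_r(monthly, annual))
```

A coefficient close to +1 or -1 suggests that one of the two attributes is a candidate for removal. The function divides by zero if either attribute is constant.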

15
Data Integration
  • For categorical data, correlation can be
    discovered by the χ² test
  • The larger the χ² value, the more likely the
    variables are related
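
A minimal sketch of the χ² statistic for a contingency table of two categorical attributes follows. It assumes the usual Pearson form, summing (observed - expected)² / expected over all cells, with expected counts taken from the row and column totals. The counts in the example table are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table of observed counts.
observed_counts = [[250, 200], [50, 1000]]
print(round(chi_square(observed_counts), 2))
```

The statistic still has to be compared against the χ² distribution with (rows - 1) × (columns - 1) degrees of freedom before concluding that the attributes are related.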

16
Data Transformation
  • Data transformation
  • Change data to appropriate form
  • Smoothing
  • Remove noise from data (binning, clustering)
  • Aggregation
  • Summarization, data cube construction
  • Generalization
  • Concept hierarchy climbing
  • Normalization
  • Scaled to fall within a small, specified range
    (Ex: -1.0 to 1.0)
  • Attribute/feature construction
  • New attributes constructed from the given ones

17
Normalization
  • min-max normalization
  • z-score normalization (zero-mean normalization)
  • normalization by decimal scaling
  • $v' = v / 10^{j}$, where $j$ is the smallest integer
    such that $\max(|v'|) < 1$
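
The formulas for the first two methods were not captured in this transcript, so the sketch below assumes the standard textbook definitions: min-max maps v to (v - min) / (max - min) × (new_max - new_min) + new_min, and z-score maps v to (v - mean) / standard deviation. Using the population standard deviation (`pstdev`) is a further assumption.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30], -1.0, 1.0))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max and z-score both divide by zero when every value is identical, so constant attributes need separate handling.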
18
Data Reduction
  • Warehouse may store terabytes of data
  • Complex data mining may take a very long time to
    run on the complete data set
  • Data reduction
  • Obtains a reduced representation of the data set
    that is much smaller yet produces the same (or
    almost the same) analytical results
  • Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction
  • Data compression
  • Numerosity reduction
  • Discretization and concept hierarchy generation

19
Data Cube Aggregation
  • Multiple levels of aggregation in data cubes
  • Further reduce the size of data to deal with
  • Ex: monthly data → annual data
  • Reference appropriate levels
  • Use the smallest available cuboid relevant to the
    given task.

20
Dimensionality Reduction
  • Attribute (feature) selection
  • Remove irrelevant or redundant attributes
    (dimensions)
  • Ex: remove telephone no. in customer data
    analysis
  • Step-wise forward selection
  • Step-wise backward elimination
  • Attribute selection using decision-tree
  • Based on the information theory
  • Select best attributes that classify the data
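
One information-theoretic way to pick the "best" attributes is to rank them by information gain, the measure commonly used when growing a decision tree. The sketch below assumes that measure; the toy rows and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of target minus its weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def rank_attributes(rows, attrs, target):
    """Order attributes from most to least informative about target."""
    return sorted(attrs, key=lambda a: information_gain(rows, a, target), reverse=True)

# Hypothetical data: "student" determines the class, "region" is irrelevant.
rows = [
    {"student": "yes", "region": "north", "buys": "yes"},
    {"student": "yes", "region": "south", "buys": "yes"},
    {"student": "no", "region": "north", "buys": "no"},
    {"student": "no", "region": "south", "buys": "no"},
]
print(rank_attributes(rows, ["region", "student"], "buys"))
```

Information gain favors attributes with many distinct values, so identifier-like attributes such as a telephone number are best removed before ranking.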

21
Dimensionality Reduction
Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree with tests on A4, A6, and A1; leaves labeled Class 1 and Class 2]
→ Reduced attribute set: A1, A4, A6
22
Data Compression
  • String compression
  • Typically lossless
  • Only limited manipulation is possible
  • Wavelet transform
  • N-D data vector D → transformed to N-D data
    vector D' (DWT, DFT)
  • Store only a small fraction of the coefficients
    after transformation
  • Typically lossy compression
  • Principal component analysis
  • N-d data vector → projected to k-d data vector
    (N > k)
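
To illustrate lossy compression by keeping only a fraction of the transform coefficients, here is a sketch that uses the simplest wavelet, the Haar transform with unnormalized pairwise averages and differences. Choosing Haar is an assumption (the slide only says DWT), and the input length must be a power of two.

```python
def haar_dwt(data):
    """Full Haar decomposition: [overall average, detail coefficients...]."""
    coeffs = list(data)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = avgs + diffs
        n = half
    return coeffs

def haar_idwt(coeffs):
    """Invert haar_dwt by expanding averages and differences level by level."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [a + d, a - d]
        out[:2 * n] = merged
        n *= 2
    return out

def compress(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    top = set(order[:keep])
    return [c if i in top else 0.0 for i, c in enumerate(coeffs)]

data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_dwt(data)
approx = haar_idwt(compress(coeffs, 4))
print(coeffs)
print(approx)
```

Keeping four of the eight coefficients yields an approximation of the original vector rather than an exact copy, which is why this kind of compression is lossy.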

23
Numerosity Reduction
  • Linear regression
  • Data (x1, y1), (x2, y2), ... are modeled to fit a
    straight line
  • Y = α + βX
  • Uses the least-square method to minimize the
    error
  • Histogram
  • Divide data into buckets and store average (sum)
    for each bucket
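
For the linear-regression case, the least-squares fit of a straight line has a closed form. The sketch below assumes the line is written as Y = alpha + beta × X; the names `alpha` and `beta` are illustrative. Only the two fitted coefficients need to be stored in place of the raw points.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points lying exactly on y = 1 + 2x.
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)
```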

24
Numerosity Reduction
  • Clustering
  • Partition data set into clusters → one can store
    cluster representation only
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • Choices of clustering algorithms → detailed in
    Chapter 8
  • Sampling
  • Choose a subset of the data
  • Simple random sampling → may have very poor
    performance on a skewed data set
  • Stratified sampling → approximate the percentage
    of each class
  • Ex: sampling customers in several age groups
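
Stratified sampling can be sketched by grouping the rows by stratum and drawing the same fraction from each group. The customer tuples, the fixed seed, and the rule of keeping at least one row per stratum are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from every stratum given by key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        # Keep at least one row so that small strata are never lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed data: 90 young customers, 10 senior customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda r: r[0], fraction=0.1)
print(len(sample))
```

A simple random sample of 10 rows from the same data could easily contain no senior customer at all, which is the weakness on skewed data noted above.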

25
Numerosity Reduction
[Figure: raw data reduced by clustering and by stratified sampling]
26
Discretization
  • Discretization
  • Dividing the range of the attribute into
    intervals
  • → Interval labels can be used to replace actual
    data values
  • → Reduce the number of values for a continuous
    attribute
  • Concept hierarchy
  • Defines a discretization
  • Low-level concepts → higher-level concepts
  • Ex: Age (integer) → young, middle-aged, senior
  • (18, 15, 27, 14, 19, 63, 32, ...) → (Y,
    Y, M, Y, Y, S, M, ...)
  • Can be automatically generated based on data
    distribution
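
The age example above can be sketched with a sorted list of cut points. The boundaries 25 and 60 are assumptions chosen only so that the output matches the slide; the deck does not state where the intervals begin and end.

```python
from bisect import bisect_right

CUTS = [25, 60]            # assumed boundaries, not taken from the slides
LABELS = ["Y", "M", "S"]   # young, middle-aged, senior

def discretize(value, cuts=CUTS, labels=LABELS):
    """Map a number to the label of the interval it falls into."""
    return labels[bisect_right(cuts, value)]

ages = [18, 15, 27, 14, 19, 63, 32]
print([discretize(a) for a in ages])
```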

27
Concept Hierarchy Generation for Numerical Data
  • Binning, histogram analysis, clustering analysis
  • Entropy-based discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
    $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as a
    binary discretization
  • The process is recursively applied to the partitions
    obtained until some stopping criterion is met
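
A single step of entropy-based discretization can be sketched as a search over candidate boundaries. The sketch assumes the entropy after partitioning is the size-weighted sum of the class entropies of S1 and S2, and it takes midpoints between adjacent distinct values as candidates for T. The recursion and the stopping criterion are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_boundary(samples):
    """Return (T, entropy after the split) for (value, label) samples."""
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(samples)
        if best is None or e < best[1]:
            best = (t, e)
    return best

# Hypothetical samples: class "a" at low values, class "b" at high values.
samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundary, split_entropy = best_boundary(samples)
print(boundary, split_entropy)
```

The function returns `None` when there are fewer than two distinct values, since no boundary exists in that case.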

28
Concept Hierarchy Generation for Numerical Data
  • Natural Partitioning
  • Use 3-4-5 rule to segment data into natural
    intervals
  • Ex: (213.98, 802.34] by entropy-based
    discretization vs. (200, 800]
  • Check the distinct values at the most significant
    digit
  • 3, 6, 7, 9 distinct values → partition the range
    into 3 intervals
  • 2, 4, 8 distinct values → partition the range
    into 4 intervals
  • 1, 5, or 10 distinct values → partition the range
    into 5 intervals
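
The mapping from distinct values to interval count can be sketched for one level of the rule. This is a simplification under stated assumptions: the range bounds are already rounded at the most significant digit, and every case is split into equal-width intervals (fuller statements of the rule may treat the 7-value case differently).

```python
from math import floor, log10

def n_intervals(distinct):
    """Number of intervals the 3-4-5 rule assigns to a distinct-value count."""
    if distinct in (3, 6, 7, 9):
        return 3
    if distinct in (2, 4, 8):
        return 4
    if distinct in (1, 5, 10):
        return 5
    raise ValueError(f"unsupported count of distinct values: {distinct}")

def natural_partition(low, high):
    """Split (low, high] into equal-width intervals chosen by the rule."""
    unit = 10 ** floor(log10(high - low))   # size of the most significant digit
    distinct = round((high - low) / unit)   # distinct values at that digit
    k = n_intervals(distinct)
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

print(natural_partition(200, 800))
```

For the range (200, 800] the most significant digit covers 6 distinct values, so the range is split into 3 intervals of width 200.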

29
(No Transcript)
30
Concept Hierarchy Generation for Categorical Data
  • Manual
  • Specification of a partial ordering by experts
  • Ex: define (street < city < country)
  • Automatic
  • Specification of a set of attributes, but not of
    their ordering
  • Generate hierarchy based on the number of
    distinct values
  • Ex: select (country, street, city) attributes for
    location → street: 674,339 distinct
    values, city: 3,567 distinct values,
    country: 15 distinct values → generate
    hierarchy (street < city < country)
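
The automatic case amounts to sorting the attributes by their number of distinct values, with the most distinct attribute at the lowest level. A minimal sketch using the counts from the example above (the function name is illustrative):

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from most distinct values (lowest level) to fewest."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)

counts = {"country": 15, "street": 674_339, "city": 3_567}
hierarchy = generate_hierarchy(counts)
print(" < ".join(hierarchy))
```

This is a heuristic and should be reviewed by hand: for example, a weekday attribute has only 7 distinct values but does not sit above month or year in a time hierarchy.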