1
Data Preprocessing
Chapter 2
2
Chapter Objectives
  • Realize the importance of data preprocessing for
    real world data before data mining or
    construction of data warehouses.
  • Get an overview of some data preprocessing issues
    and techniques.

3
The course
[Figure: course roadmap. Multiple data sources (DS) are preprocessed (2, 3)
and loaded into a data warehouse (DW), which supports (4) OLAP and data
mining (DM): (5) Association, (6) Classification, (7) Clustering.
DS = Data source, DW = Data warehouse, DM = Data Mining]
4
- The Chapter
[Figure: overview of the chapter's preprocessing tasks, mapped to sections
2.3 (cleaning), 2.4 (integration and transformation), and 2.5 (reduction)]
5
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

6
- Introduction
  • Introduction
  • Why Preprocess the Data (2.1)
  • Where Preprocess Data
  • Identifying Typical properties of Data (2.2)

7
-- Why Preprocess the Data
  • A well-accepted multi-dimensional view of data
    quality
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility

8
-- Why Preprocess the Data
  • Reasons for data cleaning
  • Incomplete data (missing data)
  • Noisy data (contains errors)
  • Inconsistent data (containing discrepancies)
  • Reason for data integration
  • Data comes from multiple sources
  • Reason for data transformation
  • Some data must be transformed to be used for
    mining
  • Reason for data reduction
  • Performance
  • No quality data → no quality mining results!

9
-- Where Preprocess Data
[Figure: same roadmap as before, with a Staging Database (SD) between the
data sources (DS) and the data warehouse (DW). Data preprocessing is done
here, in the Staging Database.
DS = Data source, DW = Data warehouse, DM = Data Mining, SD = Staging Database]
10
-- Identifying Typical Properties of Data
  • Descriptive data summarization techniques can be
    used to identify the typical properties of the
    data and help decide which data values should be
    treated as noise. For many data preprocessing
    tasks it is useful to know the following measures
    of the data:
  • The central tendency
  • The dispersion

11
--- Measuring the Central Tendency
  • Central tendency measures
  • Mean
  • Median
  • Mode
  • Midrange: (max() + min()) / 2
  • For data mining purposes, we need to know how to
    compute these measures efficiently in large
    databases. It is important to know whether the
    measure is
  • distributive,
  • algebraic, or
  • holistic

12
--- Measuring the Central Tendency
  • Distributive measure: a measure that can be
    computed by partitioning the data, computing the
    measure for each partition, and then merging the
    results to arrive at the measure's value for the
    entire data set.
  • e.g. sum(), count(), max(), min()
  • Algebraic measure: a measure that can be computed
    by applying an algebraic function to one or more
    distributive measures.
  • e.g. avg(), which is sum() / count()
  • Holistic measure: the entire data set is needed to
    compute the measure.
  • e.g. median
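
A minimal Python sketch of the idea (illustrative only, not part of the
original slides): the mean, an algebraic measure, is obtained by merging the
distributive measures sum() and count() computed per partition.

# Sketch: merging per-partition sum() and count() (distributive measures)
# into the overall mean (an algebraic measure).
def partition_stats(partition):
    return sum(partition), len(partition)        # distributive: sum(), count()

def merged_mean(partitions):
    stats = [partition_stats(p) for p in partitions]
    total_sum = sum(s for s, _ in stats)
    total_count = sum(c for _, c in stats)
    return total_sum / total_count               # algebraic: sum() / count()

# Three partitions of one data set
print(merged_mean([[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]))  # ~20.33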

13
--- Measuring the Dispersion
  • Dispersion, or variance, is the degree to which
    numerical data tend to spread.
  • The most common measures are
  • Standard deviation
  • Range: max() − min()
  • Quartiles
  • The five-number summary
  • Interquartile range (IQR)
  • Boxplot analysis

14
---- Quartiles
  • The kth percentile of data sorted in ascending
    order is the value x having the property that k
    percent of the data entries lie at or below x.
  • The first quartile, Q1, is the 25th percentile;
    Q2, the median, is the 50th percentile; and Q3 is
    the 75th percentile.
  • IQR = Q3 − Q1 is a simple measure that gives the
    spread of the middle half of the data.
  • A common rule of thumb for identifying suspected
    outliers is to single out values more than
    1.5 × IQR above Q3 or below Q1.
  • The five-number summary: min, Q1, median, Q3, max
  • Box plots can be plotted based on the five-number
    summary and are useful tools for identifying
    outliers.
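
A small Python sketch (illustrative; quartile conventions vary slightly
between texts) of the five-number summary and the 1.5 × IQR outlier rule:

# Sketch: five-number summary and 1.5 * IQR outlier fences.
def median(sorted_vals):
    n = len(sorted_vals)
    mid = n // 2
    return sorted_vals[mid] if n % 2 else (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def five_number_summary(values):
    v = sorted(values)
    mid = len(v) // 2
    q1 = median(v[:mid])                       # median of the lower half
    q3 = median(v[mid + (len(v) % 2):])        # median of the upper half
    return min(v), q1, median(v), q3, max(v)

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(mn, q1, med, q3, mx, outliers)           # 4 12.0 22.5 27.0 34 []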

15
---- Boxplot Analysis
  • Boxplot
  • Data is represented with a box
  • The ends of the box are at Q1 and Q3,
  • i.e., the height of the box is the IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to the
    minimum and maximum values

[Figure: boxplot with the lowest value, Q1, median, Q3 and highest value
marked; whiskers extend from the box to the lowest and highest values]
16
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

17
- Data Cleaning
  • Importance
  • Data cleaning is the number one problem in data
    warehousing
  • In data cleaning, the following data problems are
    resolved
  • Incomplete data (missing data)
  • Noisy data (contains errors)
  • Inconsistent data (containing discrepancies)

18
-- Missing Data
  • Data is not always available
  • E.g., many tuples have no recorded value for
    several attributes, such as customer income in
    sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry

19
--- How to Handle Missing Data?
  • Fill in the missing value manually (often
    infeasible)
  • Fill in with a global constant such as "unknown"
    or "N/A" (not recommended: the data mining
    algorithm may treat it as a normal value)
  • Fill in with the attribute mean or median
  • Fill in with the class mean or median (classes
    need to be known)
  • Fill in with the most likely value (using
    regression, decision trees, most similar records,
    etc.)
  • Use other attributes to predict the value (e.g. if
    a postcode is missing, use the suburb value)
  • Ignore the record
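
As a sketch (assuming pandas and made-up column names, not part of the
slides), filling missing values with the attribute mean versus the class
mean:

import pandas as pd

# Sketch: two common fills for a missing numeric attribute.
df = pd.DataFrame({"cls":    ["A", "A", "B", "B", "B"],
                   "income": [30000, None, 52000, None, 48000]})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the class (group) mean, when class labels are known
df["income_class_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean"))
print(df)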

20
-- Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values may be due to
  • faulty data collection
  • data entry problems
  • data transmission problems
  • data conversion errors
  • data decay problems
  • technology limitations, e.g. buffer overflow or
    field size limits

21
--- How to Handle Noisy Data?
  • Binning
  • First sort data and partition into
    (equal-frequency) bins, then one can smooth by
    bin means, or by bin median, or by bin
    boundaries, etc.
  • Regression
  • smooth by fitting the data to regression
    functions
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and check by human.

22
--- Binning Methods for Data Smoothing
  • Sorted data for price: 4, 8, 9, 15, 21, 21, 24,
    25, 26, 28, 29, 34
  • Partition into equal-frequency (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
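
A short Python sketch of the smoothing above (illustrative; bin means are
rounded to the nearest integer as in the slide):

# Sketch: equal-frequency binning and smoothing on the price example.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearest boundary
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]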

23
--- Regression
[Figure: scatter plot of points (x, y) with a fitted regression line
y = x + 1; a data point X1 is smoothed to the value Y1 on the line]
24
--- Cluster Analysis
25
-- Inconsistent data
  • Inconsistent data can be due to
  • data entry errors
  • data integration errors (different formats,
    codes, etc.)
  • Handling inconsistent data
  • Important to have data entry verification (check
    both format and values of data entered)
  • Correct with help of external reference data

26
-- Data Cleaning as a Process
  • Data discrepancy detection
  • Use metadata (e.g., domain, range, correlation,
    distribution, DDS)
  • Check field overloading
  • Inconsistent use of codes (e.g. 5/12/2004 and
    12/5/2004)
  • Check uniqueness rule, consecutive rule, and null
    rule
  • Use commercial tools
  • Data scrubbing: use simple domain knowledge
    (e.g., postal code, spell-check) to detect errors
    and make corrections
  • Data auditing: analyze the data to discover rules
    and relationships and to detect violators (e.g.,
    use correlation and clustering to find outliers)

27
--- Properties of Normal Distribution Curve
  • The normal (distribution) curve
  • From µ−σ to µ+σ: contains about 68% of the
    measurements (µ: mean, σ: standard deviation)
  • From µ−2σ to µ+2σ: contains about 95% of them
  • From µ−3σ to µ+3σ: contains about 99.7% of them

28
--- Correlation
[Figure: scatter plots illustrating positive correlation, negative
correlation, and no correlation]
29
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

30
- Data Integration
  • Data integration: combines data from multiple
    sources into a coherent data store
  • Main problems
  • Entity identification problem
  • Identifying real-world entities from multiple data
    sources, e.g., does A.cust-id refer to the same
    attribute as B.cust-#?
  • Redundancy problem
  • An attribute is redundant if it can be derived
    from other attribute(s).
  • Inconsistencies in attribute naming can also cause
    redundancy
  • Solutions
  • Entity identification problems can be resolved
    using metadata
  • Some redundancy problems can also be resolved
    using metadata, and others can be resolved using
    correlation analysis.

31
-- Correlation
[Figure: scatter plots illustrating positive correlation, negative
correlation, and no correlation]
32
--- Correlation Analysis (Numerical Data)
  • Correlation coefficient (also called Pearson's
    product-moment coefficient):
    r(A,B) = Σ((a_i − mean(A)) · (b_i − mean(B))) / (n · σ_A · σ_B)
           = (Σ(a_i · b_i) − n · mean(A) · mean(B)) / (n · σ_A · σ_B)
  • where n is the number of tuples, mean(A) and
    mean(B) are the respective means of A and B, σ_A
    and σ_B are the respective standard deviations of
    A and B, and Σ(a_i · b_i) is the sum of the AB
    cross-product.
  • If r(A,B) > 0, A and B are positively correlated
  • r(A,B) = 0: independent
  • r(A,B) < 0: negatively correlated
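
A minimal Python sketch of the coefficient (using population standard
deviations, matching the formula above; the data is illustrative):

import math

# Sketch: Pearson correlation coefficient for two numeric attributes A and B.
def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))        # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0 (perfect positive correlation)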

33
--- Correlation Analysis (Categorical Data)
  • χ² (chi-square) test:
    χ² = Σ (Observed − Expected)² / Expected
  • The larger the χ² value, the more likely the
    variables are related
  • The cells that contribute the most to the χ²
    value are those whose actual count is very
    different from the expected count

34
--- Chi-Square Calculation An Example
  • χ² (chi-square) calculation (numbers in
    parentheses are expected counts, calculated based
    on the data distribution in the two categories)
  • The result shows that like_science_fiction and
    play_chess are correlated in the group
                             Play chess   Not play chess   Sum (row)
  Like science fiction        250 (90)      200 (360)         450
  Not like science fiction     50 (210)    1000 (840)        1050
  Sum (col.)                  300          1200              1500
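
A Python sketch computing χ² for the table above (expected counts are
row total × column total / grand total):

# Sketch: chi-square statistic for the play_chess / like_science_fiction table.
observed = [[250, 200],     # like science fiction:     plays chess, does not
            [50, 1000]]     # not like science fiction: plays chess, does not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # about 507.9 - a large value, so the attributes are correlated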
35
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

36
- Data Transformation
  • In data transformation, data is transformed or
    consolidated into forms appropriate for mining.
    Data transformation can involve
  • Smoothing: remove noise from the data using
    binning, regression, or clustering.
  • Aggregation: e.g. sales data can be aggregated to
    monthly totals.
  • Generalization: concept hierarchy climbing, e.g.
    cities can be generalized to countries, and ages
    can be generalized to youth, middle-aged, and
    senior.
  • Normalization: attribute data scaled to fall
    within a small, specified range
  • Attribute/feature construction: new attributes
    constructed from the given ones

37
-- Data Transformation: Normalization
  • Min-max normalization to [new_minA, new_maxA]:
    v' = (v − minA) / (maxA − minA) · (new_maxA − new_minA) + new_minA
  • Ex. Let income range [12,000, 98,000] be
    normalized to [0.0, 1.0]. Then 73,000 is mapped to
    (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709
  • Z-score normalization (µ: mean, σ: standard
    deviation):
    v' = (v − µ) / σ
  • Ex. Let µ = 54,000, σ = 16,000. Then 73,000 is
    mapped to (73,000 − 54,000) / 16,000 ≈ 1.19
  • Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer
    such that max(|v'|) < 1
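
A small Python sketch of the three methods (the values reproduce the income
examples above; the decimal-scaling data is illustrative):

# Sketch: min-max, z-score and decimal-scaling normalization.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73000, 12000, 98000))       # ~0.709
print(z_score(73000, 54000, 16000))       # ~1.19
print(decimal_scaling([-986, 120, 917]))  # [-0.986, 0.12, 0.917]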
38
-- Attribute/feature construction
  • Sometimes it is helpful or necessary to construct
    new attributes or features
  • Helpful for understanding and accuracy
  • For example: create an attribute "volume" based on
    the attributes height, depth and width
  • Construction is based on mathematical or logical
    operations
  • Attribute construction can help to discover
    missing information about the relationships
    between data attributes
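
A trivial Python sketch (attribute names are illustrative) of constructing
the volume attribute mentioned above:

# Sketch: constructing a new attribute from existing ones.
records = [{"height": 2.0, "depth": 3.0, "width": 1.5},
           {"height": 1.0, "depth": 4.0, "width": 2.0}]
for r in records:
    r["volume"] = r["height"] * r["depth"] * r["width"]   # constructed attribute
print(records)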

39
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

40
- Data Reduction
  • The data is often too large. Reducing the data
    can improve performance. Data reduction consists
    of reducing the representation of the data set
    while producing the same (or almost the same)
    results.
  • Data Reduction Includes
  • Reducing the number of rows (objects)
  • Reducing the number of attributes (features)
  • Compression
  • Discretization (will be covered in the next
    section)

41
-- Reducing the Number of Rows
  • Aggregation
  • Aggregation of data into a higher concept level.
  • We can have multiple levels of aggregation, e.g.
    weekly, monthly, quarterly, yearly, and so on.
  • For data reduction, use the highest aggregation
    level that is sufficient.
  • Numerosity reduction
  • Data volume can be reduced by choosing
    alternative forms of data representation

42
--- Types of Numerosity reduction
  • Parametric
  • Assume the data fits some model, estimate model
    parameters, store only the parameters, and
    discard the data (except possible outliers)
  • E.g. linear regression: data are modeled to fit
    a straight line
  • Non-parametric
  • Histograms
  • Clustering
  • Sampling

43
---- Reduction with Histograms
  • A popular data reduction technique: divide the
    data into buckets and store a representation of
    each bucket (sum, count, etc.)
  • Histogram types
  • Equal-width: divides the range into N intervals
    of equal size
  • Equal-depth: divides the range into N intervals,
    each containing approximately the same number of
    samples
  • V-optimal: considers all histogram types for a
    given number of buckets and chooses the one with
    the least variance
  • MaxDiff: after sorting the data to be
    approximated, the borders of the buckets are
    placed where adjacent values have the largest
    differences

44
---- Example Histogram
45
---- Reduction with Clustering
  • Partition data into clusters based on closeness
    in space. Retain representatives of clusters
    (centroids) and outliers. Effectiveness depends
    upon the distribution of data. Hierarchical
    clustering is possible (multi-resolution).

[Figure: data points grouped into clusters, each marked by a centroid (x);
a point outside all clusters is labelled as an outlier]
46
---- Reduction with Sampling
  • Allows a large data set to be represented by a
    much smaller random sample (subset) of the data.
  • Will the patterns in the sample represent the
    patterns in the data?
  • How to select a random sample?
  • Simple random sample without replacement (SRSWOR)
  • Simple random sample with replacement (SRSWR)
  • Cluster sample (SRSWOR or SRSWR from clusters)
  • Stratified sample (stratum = group based on an
    attribute value)
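
A Python sketch of the sampling schemes (toy data; sample sizes and the
stratum attribute are illustrative):

import random

data = list(range(100))                              # toy data set

srswor = random.sample(data, 10)                     # without replacement
srswr = [random.choice(data) for _ in range(10)]     # with replacement

# Stratified sample: sample each stratum (group defined by an attribute value)
strata = {"young": list(range(30)), "senior": list(range(30, 100))}
stratified = {k: random.sample(v, max(1, len(v) // 10)) for k, v in strata.items()}
print(srswor, srswr, stratified)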

47
---- Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR
(simple random sample with replacement) drawn from the raw data]
48
---- Sampling Example
[Figure: raw data compared with a cluster/stratified sample]
49
-- Reducing the Number of Attributes
  • Reduce the number of attributes (also called
    dimensions or features).
  • Select a minimum set of attributes (features)
    that is sufficient for the data mining or
    analytical task.
  • Purpose
  • Avoid the curse of dimensionality, which creates a
    sparse data space and poor clusters
  • Reduce the amount of time and memory required by
    data mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant and duplicate
    features or reduce noise

50
--- Reducing the Number of Attributes: Techniques
  • Step-wise forward selection
  • E.g. {} → {A1} → {A1, A3} → {A1, A3, A5}
  • Step-wise backward elimination
  • E.g. {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} →
    {A1, A3, A5}
  • Combining forward selection and backward
    elimination
  • Decision-tree induction (this will be covered in
    Chapter 5).
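
A schematic Python sketch of step-wise forward selection; score() stands in
for whatever evaluation measure the mining task uses (e.g. a classifier's
validation accuracy), and the toy scores below are made up:

# Sketch: step-wise forward attribute selection around a placeholder score().
def forward_selection(attributes, score, max_attrs=3):
    selected = []
    while len(selected) < max_attrs:
        candidates = [a for a in attributes if a not in selected]
        best = max(candidates, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                        # no candidate improves the score: stop
        selected.append(best)            # e.g. {} -> {A1} -> {A1, A3} -> ...
    return selected

# Toy score that prefers attributes A1, A3, A5 (purely illustrative)
useful = {"A1": 3, "A3": 2, "A5": 1}
score = lambda attrs: sum(useful.get(a, 0) for a in attrs)
print(forward_selection(["A1", "A2", "A3", "A4", "A5"], score))  # ['A1', 'A3', 'A5']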

51
-- Data Compression
  • Data compression reduces the size of the data and
    can be used for all sorts of data.
  • Saves storage space.
  • Saves communication time.
  • There is lossless compression and lossy
    compression, e.g. ZIP, Discrete Wavelet Transform
    (DWT), and Principal Component Analysis (PCA).
  • For data mining, data compression is beneficial
    if the mining algorithms can manipulate the
    compressed data directly, without uncompressing
    it. String compression (e.g. ZIP), for example,
    allows only limited manipulation of the data.

52
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Data Discretization and Concept Hierarchy (2.6)

53
- Discretization
  • Three types of attributes
  • Nominal: values from an unordered set, e.g.
    color, profession
  • Ordinal: values from an ordered set, e.g.
    military or academic rank
  • Continuous: numeric values, e.g. integer or real
    numbers
  • Discretization
  • Divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes
  • Reduce data size by discretization
  • Prepare for further analysis

54
-- Discretization and Concept Hierarchy
  • Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an
    attribute
  • Concept hierarchy formation
  • Recursively reduce the data by collecting and
    replacing low level concepts (such as numeric
    values for age) by higher level concepts (such as
    young, middle-aged, or senior)

55
-- Discretization and Concept Hierarchy
Generation for Numeric Data
  • Typical methods (all of them can be applied
    recursively)
  • Binning (covered above)
  • Top-down split, unsupervised,
  • Histogram analysis (covered above)
  • Top-down split, unsupervised
  • Clustering analysis (covered above)
  • Either top-down split or bottom-up merge,
    unsupervised
  • Entropy-based discretization: supervised,
    top-down split
  • Interval merging by χ² analysis: unsupervised,
    bottom-up merge
  • Segmentation by natural partitioning: top-down
    split, unsupervised

56
--- Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the class information entropy after partitioning
    is
    I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)
  • Entropy is calculated based on the class
    distribution of the samples in the set. Given m
    classes, the entropy of S1 is
    Entropy(S1) = − Σ_{i=1..m} p_i · log2(p_i)
  • where p_i is the probability of class i in S1
  • The boundary T that minimizes the entropy function
    over all possible boundaries is selected for the
    binary discretization
  • The process is applied recursively to the
    partitions obtained until some stopping criterion
    is met
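
A Python sketch of one level of this supervised splitting: evaluate every
candidate boundary and keep the one with the lowest weighted entropy (the
age/class data below is illustrative):

import math
from collections import Counter

# Sketch: choose the split boundary T that minimizes I(S, T).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_i = None, float("inf")
    for k in range(1, len(pairs)):
        t = (pairs[k - 1][0] + pairs[k][0]) / 2            # candidate boundary T
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        if not s1 or not s2:
            continue
        i = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if i < best_i:                                     # minimize I(S, T)
            best_t, best_i = t, i
    return best_t, best_i

ages = [23, 25, 30, 35, 46, 50, 60]
cls = ["young", "young", "young", "young", "senior", "senior", "senior"]
print(best_split(ages, cls))   # (40.5, 0.0): the boundary separates the classes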

57
--- Interval Merge by χ² Analysis
  • Merging-based (bottom-up) vs. splitting-based
    methods
  • Merge: find the best neighboring intervals and
    merge them to form larger intervals, recursively
  • ChiMerge
  • Initially, each distinct value of a numeric
    attribute A is considered to be one interval
  • χ² tests are performed on every pair of adjacent
    intervals
  • Adjacent intervals with the lowest χ² values are
    merged, since low χ² values for a pair indicate
    similar class distributions
  • This merge process proceeds recursively until a
    predefined stopping criterion is met (such as a
    significance level, max number of intervals, or
    max inconsistency)

58
--- Segmentation by Natural Partitioning
  • A simple 3-4-5 rule can be used to segment
    numeric data into relatively uniform, natural
    intervals.
  • If an interval covers 3, 6, 7 or 9 distinct
    values at the most significant digit, partition
    the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the
    most significant digit, partition the range into
    4 intervals
  • If it covers 1, 5, or 10 distinct values at the
    most significant digit, partition the range into
    5 intervals
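
A rough Python sketch of the core decision (the full 3-4-5 rule also rounds
the interval ends; this only picks the number of equi-width intervals, and
the ranges are illustrative):

# Sketch: how many equi-width intervals the 3-4-5 rule suggests for a range.
def n_intervals_345(low, high):
    width = high - low
    msd_unit = 10 ** (len(str(int(width))) - 1)   # unit of the most significant digit
    distinct = round(width / msd_unit)            # distinct values at that digit
    if distinct in (3, 6, 7, 9):
        return 3
    if distinct in (2, 4, 8):
        return 4
    return 5                                      # 1, 5 or 10 distinct values

print(n_intervals_345(0, 900))        # 9 at the most significant digit -> 3 intervals
print(n_intervals_345(0, 400))        # 4 -> 4 intervals
print(n_intervals_345(-1000, 4000))   # 5 -> 5 intervals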

59
---- Example of 3-4-5 Rule
[Figure: step-by-step application of the 3-4-5 rule to profit data in the
range (−400, 5,000)]
60
-- Concept Hierarchy Generation for Categorical
Data
  • Categorical data
  • Discrete, finite cardinality, unordered
  • E.g. Geographic location, job category, product
  • Problem: how to impose an order on categorical
    data
  • E.g. organize locations into a hierarchy
  • street < city < province < country

61
-- Concept Hierarchy Generation for Categorical
Data
  • A partial order of attributes, specified at the
    schema level
  • E.g. street < city < state < country
  • A portion of a hierarchy, by explicit data
    grouping
  • E.g. {Jeddah, Riyadh} < Saudi Arabia
  • A set of attributes, with a partial order
    generated automatically from the cardinality of
    the attributes
  • E.g. street < city < state < country
  • Only a partial set of attributes
  • E.g. only street < city is specified, not the
    others; the system automatically fills in the rest

62
--- Automatic Concept Hierarchy Generation
  • Some hierarchies can be automatically generated
    based on the analysis of the number of distinct
    values per attribute in the data set
  • The attribute with the most distinct values is
    placed at the lowest level of the hierarchy
  • Exceptions, e.g., weekday, month, quarter, year
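
A small Python sketch of this heuristic (toy records; the attribute with the
most distinct values goes to the bottom of the hierarchy):

# Sketch: order attributes by their number of distinct values.
records = [
    {"country": "Saudi Arabia", "city": "Jeddah", "street": "King Abdulaziz Rd"},
    {"country": "Saudi Arabia", "city": "Riyadh", "street": "Olaya St"},
    {"country": "Saudi Arabia", "city": "Jeddah", "street": "Palestine St"},
]
distinct = {attr: len({r[attr] for r in records}) for attr in records[0]}
hierarchy = sorted(distinct, key=distinct.get)      # fewest distinct values first
print(" < ".join(reversed(hierarchy)))              # street < city < country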

63
End