1
Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2
Course on Data Mining (581550-4)
Today 22.11.2001
  • Today's subject: KDD Process
  • Next week's program:
  • Lecture: Data mining applications, future, summary
  • Exercise: KDD Process
  • Seminar: KDD Process

3
KDD process
  • Overview
  • Preprocessing
  • Post-processing
  • Summary

4
What is KDD? A process!
  • Aim: the selection and processing of data for
  • the identification of novel, accurate, and useful patterns, and
  • the modeling of real-world phenomena
  • Data mining is a major component of the KDD
    process

5
Typical KDD process
6
Phases of the KDD process (1)
Learning the domain
Creating a target data set
Pre-processing
Data cleaning, integration and transformation
Data reduction and projection
Choosing the DM task
7
Phases of the KDD process (2)
Choosing the DM algorithm(s)
Data mining search
Pattern evaluation and interpretation
Post-processing
Knowledge presentation
Use of discovered knowledge
8
Preprocessing - overview
  • Why data preprocessing?
  • Data cleaning
  • Data integration and transformation
  • Data reduction

9
Why data preprocessing?
  • Aim: to select the data relevant to the mining task at hand
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes or names
  • No quality data, no quality mining results!

10
Measures of data quality
  • accuracy
  • completeness
  • consistency
  • timeliness
  • believability
  • value added
  • interpretability
  • accessibility

11
Preprocessing tasks (1)
  • Data cleaning
  • fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • integration of multiple databases, files, etc.
  • Data transformation
  • normalization and aggregation

12
Preprocessing tasks (2)
  • Data reduction (including discretization)
  • obtains a representation that is much reduced in volume, but produces the same or similar analytical results
  • data discretization is part of data reduction, but of particular importance, especially for numerical data

13
Preprocessing tasks (3)

14
Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data

15
Missing Data
  • Data is not always available
  • Missing data may be due to
  • equipment malfunction
  • inconsistency with other recorded data, leading to deletion
  • data not entered due to misunderstanding
  • certain data may not have been considered important at the time of entry
  • history or changes of the data were not registered
  • Missing data may need to be inferred

16
How to Handle Missing Data? (1)
  • Ignore the tuple
  • usually done when the class label is missing
  • not effective when the percentage of missing values per attribute varies considerably
  • Fill in the missing value manually
  • tedious, possibly infeasible?
  • Use a global constant to fill in the missing value
  • e.g., "unknown"; in effect a new class?!

17
How to Handle Missing Data? (2)
  • Use the attribute mean to fill in the missing value
  • Use the attribute mean of all samples belonging to the same class to fill in the missing value
  • a smarter solution than using the general attribute mean (both mean-based strategies are sketched below)
  • Use the most probable value to fill in the missing value
  • inference-based tools such as decision tree induction or a Bayesian formalism
  • regression
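A minimal sketch of the two mean-based strategies, assuming pandas; the table and column names are hypothetical:

    import pandas as pd

    # Hypothetical table: "income" has missing values, "class" is the class label
    df = pd.DataFrame({"class":  ["a", "a", "b", "b", "b"],
                       "income": [30.0, None, 50.0, None, 70.0]})

    # Fill with the global attribute mean
    df["income_global"] = df["income"].fillna(df["income"].mean())

    # Fill with the class-conditional mean (the smarter variant above)
    df["income_by_class"] = df["income"].fillna(
        df.groupby("class")["income"].transform("mean"))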

18
Noisy Data
  • Noise: random error or variance in a measured variable
  • Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistencies in naming conventions

19
How to Handle Noisy Data?
  • Binning
  • smooth sorted data values by looking at the values around them
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and have them checked by a human
  • Regression
  • smooth by fitting the data to regression functions

20
Binning methods (1)
  • Equal-depth (frequency) partitioning
  • sort the data and partition it into N intervals (bins), each containing approximately the same number of samples
  • smooth by bin means, bin medians, bin boundaries, etc.
  • good data scaling
  • managing categorical attributes can be tricky

21
Binning methods (2)
  • Equal-width (distance) partitioning
  • divide the range into N intervals of equal size (uniform grid)
  • if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  • the most straightforward method
  • outliers may dominate the presentation
  • skewed data is not handled well

22
Equal-depth binning - Example
  • Sorted data for price (in dollars):
  • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equal-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
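The same worked example as a small sketch, assuming numpy and pandas are available:

    import numpy as np
    import pandas as pd

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-depth partitioning: 3 bins, 4 values each
    bins = np.array_split(prices, 3)

    # Smoothing by bin means: replace every value by its bin's (rounded) mean
    by_means = [int(round(np.mean(b))) for b in bins for _ in b]
    print(by_means)   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]

    # Equal-width partitioning for comparison: W = (34 - 4) / 3 = 10
    print(pd.cut(prices, bins=3))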

23
Data Integration (1)
  • Data integration
  • combines data from multiple sources into a
    coherent store
  • Schema integration
  • integrate metadata from different sources
  • entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

24
Data Integration (2)
  • Detecting and resolving data value conflicts
  • for the same real-world entity, attribute values from different sources may differ
  • possible reasons: different representations, different scales, e.g., metric vs. British units

25
Handling Redundant Data
  • Redundant data occur often when multiple databases are integrated
  • the same attribute may have different names in different databases
  • one attribute may be a derived attribute in another table, e.g., annual revenue
  • Redundant data may be detected by correlation analysis (sketched after this list)
  • Careful integration of data from multiple sources
    may
  • help to reduce/avoid redundancies and
    inconsistencies
  • improve mining speed and quality
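A minimal correlation-analysis sketch in pandas; the table and values are hypothetical (annual revenue is trivially derivable from monthly revenue here):

    import pandas as pd

    df = pd.DataFrame({"monthly_revenue": [10, 20, 30, 40],
                       "annual_revenue":  [120, 240, 360, 480]})

    # Pairwise Pearson correlations; coefficients near +/-1 flag
    # candidate redundant attribute pairs
    print(df.corr())   # monthly_revenue vs. annual_revenue: 1.0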

26
Data Transformation
  • Smoothing: remove noise from the data
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Normalization: scale values to fall within a small, specified range, e.g.,
  • min-max normalization
  • normalization by decimal scaling (both sketched below)
  • Attribute/feature construction
  • new attributes constructed from the given ones
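A minimal sketch of the two normalizations named above, on illustrative values:

    import math
    import pandas as pd

    values = pd.Series([20.0, 35.0, 50.0, 65.0, 80.0])

    # Min-max normalization: scale linearly into [0, 1]
    minmax = (values - values.min()) / (values.max() - values.min())

    # Decimal scaling: divide by 10^j, where j is the smallest integer
    # such that max(|v|) / 10^j < 1 (here j = 2)
    j = math.floor(math.log10(values.abs().max())) + 1
    scaled = values / 10 ** j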

27
Data Reduction
  • Data reduction
  • obtains a reduced representation of the data set
    that is much smaller in volume
  • produces the same (or almost the same) analytical
    results as the original data
  • Data reduction strategies
  • dimensionality reduction
  • numerosity reduction
  • discretization and concept hierarchy generation

28
Dimensionality Reduction
  • Feature selection (i.e., attribute subset selection)
  • select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features
  • reduces the number of attributes appearing in the discovered patterns, which makes the patterns easier to understand
  • Heuristic methods (due to the exponential number of choices):
  • step-wise forward selection (sketched below)
  • step-wise backward elimination
  • combining forward selection and backward elimination
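A minimal sketch of step-wise forward selection; `score` stands for any evaluation of a feature subset (e.g., cross-validated accuracy) and is an assumption, not a fixed API:

    def forward_selection(features, score, k):
        """Greedily add the feature that most improves `score`, k times."""
        selected = []
        for _ in range(k):
            remaining = [f for f in features if f not in selected]
            best = max(remaining, key=lambda f: score(selected + [f]))
            selected.append(best)
        return selected

    # Toy usage with a stand-in scorer that prefers late-sorting feature names
    print(forward_selection(["A1", "A2", "A3"], score=lambda s: max(s), k=2))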

29
Dimensionality Reduction - Example
Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree induction tests only A4, A1, and A6 to separate Class 1 from Class 2]
Reduced attribute set: A1, A4, A6
30
Numerosity Reduction
  • Parametric methods
  • assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  • e.g., regression analysis, log-linear models (a regression sketch follows)
  • Non-parametric methods
  • do not assume models
  • e.g., histograms, clustering, sampling
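An illustration of the parametric idea as a regression sketch; the data is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(100.0)
    y = 3.0 * x + 7.0 + rng.normal(0.0, 1.0, size=100)

    # Fit y ~ a*x + b and keep only the two parameters instead of 100 points
    a, b = np.polyfit(x, y, deg=1)
    print(round(a, 1), round(b, 1))   # close to 3.0 and 7.0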

31
Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Some classification algorithms only accept
    categorical attributes

32
Concept Hierarchies
  • Reduce the data by collecting low-level concepts and replacing them with higher-level concepts
  • For example, replace numeric values of the attribute age by the more general values young, middle-aged, or senior (sketched below)
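The age example in pandas; the cut-off ages are illustrative assumptions:

    import pandas as pd

    ages = pd.Series([23, 35, 47, 58, 66, 71])

    # Replace numeric ages by the more general concepts
    concepts = pd.cut(ages, bins=[0, 35, 60, 120],
                      labels=["young", "middle-aged", "senior"])
    print(list(concepts))   # ['young', 'young', 'middle-aged', ...]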

33
Discretization and concept hierarchy generation
for numeric data
  • Binning
  • Histogram analysis
  • Clustering analysis
  • Entropy-based discretization (one split step is sketched after this list)
  • Segmentation by natural partitioning
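One split step of entropy-based discretization, as a rough sketch; in practice the split is applied recursively with a stopping criterion (e.g., MDL):

    import numpy as np

    def entropy(labels):
        """Class entropy in bits."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_cut(values, labels):
        """Cut point minimizing the weighted class entropy of the two halves."""
        order = np.argsort(values)
        v, y = np.asarray(values)[order], np.asarray(labels)[order]
        n = len(v)
        i = min(range(1, n),
                key=lambda j: j * entropy(y[:j]) + (n - j) * entropy(y[j:]))
        return (v[i - 1] + v[i]) / 2

    print(best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))  # 6.5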

34
Concept hierarchy generation for categorical data
  • Specification of a partial ordering of attributes
    explicitly at the schema level by users or
    experts
  • Specification of a portion of a hierarchy by
    explicit data grouping
  • Specification of a set of attributes, but not of
    their partial ordering
  • Specification of only a partial set of attributes

35
Specification of a set of attributes
  • A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

Example: four attributes with 15, 65, 3,567, and 674,339 distinct values form a four-level hierarchy; the attribute with 674,339 distinct values (e.g., street) sits at the lowest level, the one with 15 (e.g., country) at the top.
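A sketch of the heuristic on a tiny hypothetical table:

    import pandas as pd

    df = pd.DataFrame({"country": ["FI", "FI", "SE", "SE"],
                       "city":    ["Helsinki", "Espoo", "Stockholm", "Stockholm"],
                       "street":  ["Main St 1", "Oak Rd 2", "Elm St 3", "Pine Rd 4"]})

    # Fewest distinct values -> top of the hierarchy; most -> lowest level
    hierarchy = df.nunique().sort_values().index.tolist()
    print(hierarchy)   # ['country', 'city', 'street']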
36
Post-processing - overview
  • Why data post-processing?
  • Interestingness
  • Visualization
  • Utilization

37
Why data post-processing? (1)
  • Aim to show the results, or more precisely the
    most interesting findings, of the data mining
    phase to a user/users in an understandable way
  • A possible post-processing methodology
  • find all potentially interesting patterns
    according to some rather loose criteria
  • provide flexible methods for iteratively and
    interactively creating different views of the
    discovered patterns
  • Other more restrictive or focused methodologies
    possible as well

38
Why data post-processing? (2)
  • A post-processing methodology is useful if
  • the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns)
  • there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete)
  • the time required for discovering all potentially interesting patterns is not considerably longer than if the discovery were focused on a small subset of the potentially interesting patterns

39
Are all the discovered patterns interesting?
  • A data mining system/query may generate thousands of patterns, but are they all interesting?
  • Usually NOT!
  • How could we then choose the interesting patterns?
  • => Interestingness

40
Interestingness criteria (1)
  • Some possible criteria for interestingness
  • evidence: statistical significance of the finding?
  • redundancy: similarity between findings?
  • usefulness: does it meet the user's needs/goals?
  • novelty: is it already part of prior knowledge?
  • simplicity: syntactical complexity?
  • generality: how many examples are covered?

41
Interestingness criteria (2)
  • One division of interestingness criteria
  • objective measures, based on the statistics and structure of patterns, e.g.,
  • J-measure: statistical significance
  • certainty factor: support or frequency
  • strength: confidence
  • subjective measures, based on the user's beliefs about the data, e.g.,
  • unexpectedness: "is the found pattern surprising?"
  • actionability: "can I do something with it?"

42
Criticism of Support and Confidence
  • Example (Aggarwal & Yu, PODS'98)
  • among 5000 students
  • 3000 play basketball, 3750 eat cereal
  • 2000 both play basketball and eat cereal
  • the rule "play basketball => eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
  • the rule "play basketball => not eat cereal [20%, 33.3%]" is far more accurate, although it has lower support and confidence

43
Interest
  • Yet another objective measure for interestingness is interest, defined as
      interest(A => B) = P(A and B) / (P(A) * P(B))
  • Properties of this measure
  • takes both P(A) and P(B) into consideration
  • P(A and B) = P(A) * P(B), i.e., interest = 1, if A and B are independent events
  • A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
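Computed for the basketball/cereal example of the previous slide (A = plays basketball, B = eats cereal):

    # P(A), P(B), P(A and B) from the 5000-student example
    p_a, p_b, p_ab = 3000 / 5000, 3750 / 5000, 2000 / 5000

    interest = p_ab / (p_a * p_b)
    print(round(interest, 2))   # 0.89 < 1: negatively correlated, so the
                                # rule "basketball => cereal" is indeed misleading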

44
J-measure
  • The J-measure (in the standard Smyth & Goodman form)
      J(A => B) = P(A) * [ P(B|A) * log2(P(B|A)/P(B)) + (1 - P(B|A)) * log2((1 - P(B|A))/(1 - P(B))) ]
  • is an objective measure for interestingness
  • Properties of the J-measure
  • again, takes both P(A) and P(B) into consideration
  • its value is always between 0 and 1
  • it can be computed using pre-calculated values
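A sketch of the definition above, evaluated on the basketball/cereal example:

    import math

    def j_measure(p_a, p_b, p_b_given_a):
        """J-measure of a rule A => B, with base-2 logarithms."""
        def term(p, q):
            return p * math.log2(p / q) if p > 0 else 0.0
        return p_a * (term(p_b_given_a, p_b) + term(1 - p_b_given_a, 1 - p_b))

    # P(A) = 0.6, P(B) = 0.75, P(B|A) = 2000/3000
    print(round(j_measure(0.6, 0.75, 2000 / 3000), 3))   # ~0.015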

45
Support/Frequency/J-measure
46
Confidence
47
Example: Selection of Interesting Association Rules
  • To reduce the number of association rules that have to be considered, we could, for example, use one of the following selection criteria:
  • frequency and confidence
  • J-measure or interest
  • maximum rule size (whole rule, left-hand side, right-hand side)
  • rule attributes (e.g., templates); a filter along these lines is sketched below
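A hypothetical filter combining such criteria; the rule representation and thresholds are illustrative assumptions:

    # A rule is a tuple (lhs, rhs, frequency, confidence)
    def select_rules(rules, min_freq=0.01, min_conf=0.5, max_size=4):
        return [(lhs, rhs, f, c) for (lhs, rhs, f, c) in rules
                if f >= min_freq and c >= min_conf
                and len(lhs) + len(rhs) <= max_size]

    rules = [({"basketball"}, {"cereal"}, 0.40, 0.667),
             ({"basketball"}, {"no cereal"}, 0.20, 0.333)]
    print(select_rules(rules))   # only the first rule passes min_conf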

48
Example: Problems with the selection of rules
  • A rule can correspond to prior knowledge or expectations
  • how can the background knowledge be encoded into the system?
  • A rule can refer to uninteresting attributes or attribute combinations
  • could this be avoided by enhancing the preprocessing phase?
  • Rules can be redundant
  • redundancy elimination by rule covers etc.

49
Interpretation and evaluation of the results of
data mining
  • Evaluation
  • statistical validation and significance testing
  • qualitative review by experts in the field
  • pilot surveys to evaluate model accuracy
  • Interpretation
  • tree and rule models can be read directly
  • clustering results can be graphed and tabled
  • code can be automatically generated by some
    systems

50
Visualization of Discovered Patterns (1)
  • In some cases, visualization of the results of
    data mining (rules, clusters, networks) can be
    very helpful
  • Visualization is in fact already important in the preprocessing phase, both for selecting the appropriate data and for inspecting it
  • Visualization requires training and practice

51
Visualization of Discovered Patterns (2)
  • Different backgrounds/usages may require different forms of representation
  • e.g., rules, tables, cross-tabulations, or pie/bar charts
  • Concept hierarchies are also important
  • discovered knowledge might be more understandable when represented at a high level of abstraction
  • interactive drill-up/drill-down, pivoting, slicing, and dicing provide different perspectives on the data
  • Different kinds of knowledge require different kinds of representation
  • association, classification, clustering, etc.

52
Visualization
54
Utilization of the results
[Figure: a pyramid of increasing potential to support business decisions, from bottom to top]
  Data Sources (Paper, Files, Information Providers, Database Systems, OLTP) - DBA
  Data Warehouses / Data Marts (OLAP, MDA) - DBA
  Data Exploration (Statistical Analysis, Querying and Reporting) - Data Analyst
  Data Mining (Information Discovery) - Data Analyst
  Data Presentation (Visualization Techniques) - Business Analyst
  Making Decisions - End User
55
Summary
  • Data mining: semi-automatic discovery of interesting patterns from large data sets
  • Knowledge discovery is a process:
  • preprocessing
  • data mining
  • post-processing
  • using and utilizing the knowledge

56
Summary
  • Preprocessing is important in order to get useful
    results!
  • If a loosely defined mining methodology is used,
    post-processing is needed in order to find the
    interesting results!
  • Visualization is useful in pre- and
    post-processing!
  • One has to be able to utilize the discovered knowledge!

57
References KDD Process
  • P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley, Harlow, England, 1996.
  • R.J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
  • D.P. Ballou and G.K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
  • M.S. Chen, J. Han, and P.S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
  • U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
  • T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
  • Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
  • D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
  • D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
  • D. Keim and H.-P. Kriegel. Visualization techniques for mining large databases: a comparison. IEEE Trans. Knowledge and Data Engineering, 8(6), 1996.

58
References KDD Process
  • W. Kloesgen. Explora: A multipattern and multistrategy discovery assistant. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 249-271. AAAI/MIT Press, 1996.
  • M. Klemettinen. A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
  • M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
  • G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
  • G. Piatetsky-Shapiro and W.J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
  • D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
  • T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
  • A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. Knowledge and Data Engineering, 8:970-974, Dec. 1996.
  • D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.

59
References KDD Process
  • Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
  • R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

60
Reminder: Course Organization
Course Evaluation
  • Passing the course: min 30 points
  • home exam: min 13 points (max 30 points)
  • exercises/experiments: min 8 points (max 20 points)
  • at least 3 returned and reported experiments
  • group presentation: min 4 points (max 10 points)
  • Remember also the other requirements
  • attending the lectures (5/7)
  • attending the seminars (4/5)
  • attending the exercises (4/5)

61
Seminar Presentations/Groups 9-10
Visualization and data mining:
D. Keim, H.-P. Kriegel, T. Seidl: "Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.
62
Seminar Presentations/Groups 9-10
Interestingness:
G. Piatetsky-Shapiro, C.J. Matheus: "The Interestingness of Deviations", KDD'94.
63
KDD process
Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.