1
Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2
Course on Data Mining (581550-4)
Today 22.11.2001
  • Today's subject: KDD Process
  • Next week's program:
  • Lecture: Data mining applications, future, summary
  • Exercise: KDD Process
  • Seminar: KDD Process

3
KDD process
  • Overview
  • Preprocessing
  • Post-processing
  • Summary

4
What is KDD? A process!
  • Aim: the selection and processing of data for
  • the identification of novel, accurate, and useful patterns, and
  • the modeling of real-world phenomena
  • Data mining is a major component of the KDD
    process

5
Typical KDD process
6
Phases of the KDD process (1)
Learning the domain
Creating a target data set
Pre-processing
Data cleaning, integration and transformation
Data reduction and projection
Choosing the DM task
7
Phases of the KDD process (2)
Choosing the DM algorithm(s)
Data mining search
Pattern evaluation and interpretation
Post-processing
Knowledge presentation
Use of discovered knowledge
8
Preprocessing - overview
  • Why data preprocessing?
  • Data cleaning
  • Data integration and transformation
  • Data reduction

9
Why data preprocessing?
  • Aim: to select the data relevant to the mining task at hand
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes or names
  • No quality data, no quality mining results!

10
Measures of data quality
  • accuracy
  • completeness
  • consistency
  • timeliness
  • believability
  • value added
  • interpretability
  • accessibility

11
Preprocessing tasks (1)
  • Data cleaning
  • fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • integration of multiple databases, files, etc.
  • Data transformation
  • normalization and aggregation

12
Preprocessing tasks (2)
  • Data reduction (including discretization)
  • obtains a representation that is much reduced in volume, but produces the same or similar analytical results
  • data discretization is part of data reduction, but of particular importance, especially for numerical data

13
Preprocessing tasks (3)

14
Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data

15
Missing Data
  • Data is not always available
  • Missing data may be due to
  • equipment malfunction
  • inconsistency with other recorded data, leading to deletion
  • data not entered due to misunderstanding
  • certain data may not have been considered important at the time of entry
  • history or changes of the data were not registered
  • Missing data may need to be inferred

16
How to Handle Missing Data? (1)
  • Ignore the tuple
  • usually done when the class label is missing
  • not effective when the percentage of missing values per attribute varies considerably
  • Fill in the missing value manually
  • tedious, possibly infeasible?
  • Use a global constant to fill in the missing value
  • e.g., "unknown"; in effect a new class?!

17
How to Handle Missing Data? (2)
  • Use the attribute mean to fill in the missing value
  • Use the attribute mean of all samples belonging to the same class to fill in the missing value
  • a smarter solution than using the general attribute mean (both mean-based strategies are sketched below)
  • Use the most probable value to fill in the missing value
  • inference-based tools such as decision tree induction or a Bayesian formalism
  • regression
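A minimal sketch of the two mean-based strategies, assuming pandas; the table and column names are hypothetical:

    import pandas as pd

    # Hypothetical table: "income" has missing values, "class" is the class label
    df = pd.DataFrame({"class":  ["a", "a", "b", "b", "b"],
                       "income": [30.0, None, 50.0, None, 70.0]})

    # Fill with the global attribute mean
    df["income_global"] = df["income"].fillna(df["income"].mean())

    # Fill with the class-conditional mean (the smarter variant above)
    df["income_by_class"] = df["income"].fillna(
        df.groupby("class")["income"].transform("mean"))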

18
Noisy Data
  • Noise: random error or variance in a measured variable
  • Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistencies in naming conventions

19
How to Handle Noisy Data?
  • Binning
  • smooth sorted data values by looking at the values around them
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and have them checked by a human
  • Regression
  • smooth by fitting the data to regression functions

20
Binning methods (1)
  • Equal-depth (frequency) partitioning
  • sort the data and partition it into N intervals (bins), each containing approximately the same number of samples
  • smooth by bin means, bin medians, bin boundaries, etc.
  • good data scaling
  • managing categorical attributes can be tricky

21
Binning methods (2)
  • Equal-width (distance) partitioning
  • divide the range into N intervals of equal size (uniform grid)
  • if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  • the most straightforward method
  • outliers may dominate the presentation
  • skewed data is not handled well

22
Equal-depth binning - Example
  • Sorted data for price (in dollars):
  • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equal-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
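The same worked example as a small sketch, assuming numpy and pandas are available:

    import numpy as np
    import pandas as pd

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-depth partitioning: 3 bins, 4 values each
    bins = np.array_split(prices, 3)

    # Smoothing by bin means: replace every value by its bin's (rounded) mean
    by_means = [int(round(np.mean(b))) for b in bins for _ in b]
    print(by_means)   # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]

    # Equal-width partitioning for comparison: W = (34 - 4) / 3 = 10
    print(pd.cut(prices, bins=3))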

23
Data Integration (1)
  • Data integration
  • combines data from multiple sources into a
    coherent store
  • Schema integration
  • integrate metadata from different sources
  • entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

24
Data Integration (2)
  • Detecting and resolving data value conflicts
  • for the same real-world entity, attribute values from different sources may differ
  • possible reasons: different representations, different scales, e.g., metric vs. British units

25
Handling Redundant Data
  • Redundant data occur often when multiple databases are integrated
  • the same attribute may have different names in different databases
  • one attribute may be a derived attribute in another table, e.g., annual revenue
  • Redundant data may be detected by correlation analysis (sketched after this list)
  • Careful integration of data from multiple sources
    may
  • help to reduce/avoid redundancies and
    inconsistencies
  • improve mining speed and quality
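A minimal correlation-analysis sketch in pandas; the table and values are hypothetical (annual revenue is trivially derivable from monthly revenue here):

    import pandas as pd

    df = pd.DataFrame({"monthly_revenue": [10, 20, 30, 40],
                       "annual_revenue":  [120, 240, 360, 480]})

    # Pairwise Pearson correlations; coefficients near +/-1 flag
    # candidate redundant attribute pairs
    print(df.corr())   # monthly_revenue vs. annual_revenue: 1.0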

26
Data Transformation
  • Smoothing: remove noise from the data
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Normalization: scale values to fall within a small, specified range, e.g.,
  • min-max normalization
  • normalization by decimal scaling (both sketched below)
  • Attribute/feature construction
  • new attributes constructed from the given ones
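A minimal sketch of the two normalizations named above, on illustrative values:

    import math
    import pandas as pd

    values = pd.Series([20.0, 35.0, 50.0, 65.0, 80.0])

    # Min-max normalization: scale linearly into [0, 1]
    minmax = (values - values.min()) / (values.max() - values.min())

    # Decimal scaling: divide by 10^j, where j is the smallest integer
    # such that max(|v|) / 10^j < 1 (here j = 2)
    j = math.floor(math.log10(values.abs().max())) + 1
    scaled = values / 10 ** j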

27
Data Reduction
  • Data reduction
  • obtains a reduced representation of the data set
    that is much smaller in volume
  • produces the same (or almost the same) analytical
    results as the original data
  • Data reduction strategies
  • dimensionality reduction
  • numerosity reduction
  • discretization and concept hierarchy generation

28
Dimensionality Reduction
  • Feature selection (i.e., attribute subset selection)
  • select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the distribution given the values of all features
  • reduces the number of attributes appearing in the discovered patterns, which makes the patterns easier to understand
  • Heuristic methods (due to the exponential number of choices):
  • step-wise forward selection (sketched below)
  • step-wise backward elimination
  • combining forward selection and backward elimination
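A minimal sketch of step-wise forward selection; `score` stands for any evaluation of a feature subset (e.g., cross-validated accuracy) and is an assumption, not a fixed API:

    def forward_selection(features, score, k):
        """Greedily add the feature that most improves `score`, k times."""
        selected = []
        for _ in range(k):
            remaining = [f for f in features if f not in selected]
            best = max(remaining, key=lambda f: score(selected + [f]))
            selected.append(best)
        return selected

    # Toy usage with a stand-in scorer that prefers late-sorting feature names
    print(forward_selection(["A1", "A2", "A3"], score=lambda s: max(s), k=2))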

29
Dimensionality Reduction - Example
Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree induction tests only A4, A1, and A6 to separate Class 1 from Class 2]
Reduced attribute set: A1, A4, A6
30
Numerosity Reduction
  • Parametric methods
  • assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  • e.g., regression analysis, log-linear models (a regression sketch follows)
  • Non-parametric methods
  • do not assume models
  • e.g., histograms, clustering, sampling
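An illustration of the parametric idea as a regression sketch; the data is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(100.0)
    y = 3.0 * x + 7.0 + rng.normal(0.0, 1.0, size=100)

    # Fit y ~ a*x + b and keep only the two parameters instead of 100 points
    a, b = np.polyfit(x, y, deg=1)
    print(round(a, 1), round(b, 1))   # close to 3.0 and 7.0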

31
Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Some classification algorithms only accept
    categorical attributes

32
Concept Hierarchies
  • Reduce the data by collecting low-level concepts and replacing them with higher-level concepts
  • For example, replace numeric values of the attribute age by the more general values young, middle-aged, or senior (sketched below)
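The age example in pandas; the cut-off ages are illustrative assumptions:

    import pandas as pd

    ages = pd.Series([23, 35, 47, 58, 66, 71])

    # Replace numeric ages by the more general concepts
    concepts = pd.cut(ages, bins=[0, 35, 60, 120],
                      labels=["young", "middle-aged", "senior"])
    print(list(concepts))   # ['young', 'young', 'middle-aged', ...]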

33
Discretization and concept hierarchy generation
for numeric data
  • Binning
  • Histogram analysis
  • Clustering analysis
  • Entropy-based discretization (one split step is sketched after this list)
  • Segmentation by natural partitioning
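One split step of entropy-based discretization, as a rough sketch; in practice the split is applied recursively with a stopping criterion (e.g., MDL):

    import numpy as np

    def entropy(labels):
        """Class entropy in bits."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_cut(values, labels):
        """Cut point minimizing the weighted class entropy of the two halves."""
        order = np.argsort(values)
        v, y = np.asarray(values)[order], np.asarray(labels)[order]
        n = len(v)
        i = min(range(1, n),
                key=lambda j: j * entropy(y[:j]) + (n - j) * entropy(y[j:]))
        return (v[i - 1] + v[i]) / 2

    print(best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))  # 6.5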

34
Concept hierarchy generation for categorical data
  • Specification of a partial ordering of attributes
    explicitly at the schema level by users or
    experts
  • Specification of a portion of a hierarchy by
    explicit data grouping
  • Specification of a set of attributes, but not of
    their partial ordering
  • Specification of only a partial set of attributes

35
Specification of a set of attributes
  • A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

Example: four attributes with 15, 65, 3,567, and 674,339 distinct values form a four-level hierarchy; the attribute with 674,339 distinct values (e.g., street) sits at the lowest level, the one with 15 (e.g., country) at the top.
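A sketch of the heuristic on a tiny hypothetical table:

    import pandas as pd

    df = pd.DataFrame({"country": ["FI", "FI", "SE", "SE"],
                       "city":    ["Helsinki", "Espoo", "Stockholm", "Stockholm"],
                       "street":  ["Main St 1", "Oak Rd 2", "Elm St 3", "Pine Rd 4"]})

    # Fewest distinct values -> top of the hierarchy; most -> lowest level
    hierarchy = df.nunique().sort_values().index.tolist()
    print(hierarchy)   # ['country', 'city', 'street']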
36
Post-processing - overview
  • Why data post-processing?
  • Interestingness
  • Visualization
  • Utilization

37
Why data post-processing? (1)
  • Aim to show the results, or more precisely the
    most interesting findings, of the data mining
    phase to a user/users in an understandable way
  • A possible post-processing methodology
  • find all potentially interesting patterns
    according to some rather loose criteria
  • provide flexible methods for iteratively and
    interactively creating different views of the
    discovered patterns
  • Other more restrictive or focused methodologies
    possible as well

38
Why data post-processing? (2)
  • A post-processing methodology is useful if
  • the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns)
  • there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete)
  • the time required for discovering all potentially interesting patterns is not considerably longer than if the discovery were focused on a small subset of the potentially interesting patterns

39
Are all the discovered patterns interesting?
  • A data mining system/query may generate thousands of patterns, but are they all interesting?
  • Usually NOT!
  • How could we then choose the interesting patterns?
  • => Interestingness

40
Interestingness criteria (1)
  • Some possible criteria for interestingness
  • evidence: statistical significance of the finding?
  • redundancy: similarity between findings?
  • usefulness: does it meet the user's needs/goals?
  • novelty: is it already part of prior knowledge?
  • simplicity: syntactical complexity?
  • generality: how many examples are covered?

41
Interestingness criteria (2)
  • One division of interestingness criteria
  • objective measures, based on the statistics and structure of patterns, e.g.,
  • J-measure: statistical significance
  • certainty factor: support or frequency
  • strength: confidence
  • subjective measures, based on the user's beliefs about the data, e.g.,
  • unexpectedness: "is the found pattern surprising?"
  • actionability: "can I do something with it?"

42
Criticism of Support and Confidence
  • Example (Aggarwal & Yu, PODS'98)
  • among 5000 students
  • 3000 play basketball, 3750 eat cereal
  • 2000 both play basketball and eat cereal
  • the rule "play basketball => eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
  • the rule "play basketball => not eat cereal [20%, 33.3%]" is far more accurate, although it has lower support and confidence

43
Interest
  • Yet another objective measure for interestingness is interest, defined as
      interest(A => B) = P(A and B) / (P(A) * P(B))
  • Properties of this measure
  • takes both P(A) and P(B) into consideration
  • P(A and B) = P(A) * P(B), i.e., interest = 1, if A and B are independent events
  • A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
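Computed for the basketball/cereal example of the previous slide (A = plays basketball, B = eats cereal):

    # P(A), P(B), P(A and B) from the 5000-student example
    p_a, p_b, p_ab = 3000 / 5000, 3750 / 5000, 2000 / 5000

    interest = p_ab / (p_a * p_b)
    print(round(interest, 2))   # 0.89 < 1: negatively correlated, so the
                                # rule "basketball => cereal" is indeed misleading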

44
J-measure
  • The J-measure (in the standard Smyth & Goodman form)
      J(A => B) = P(A) * [ P(B|A) * log2(P(B|A)/P(B)) + (1 - P(B|A)) * log2((1 - P(B|A))/(1 - P(B))) ]
  • is an objective measure for interestingness
  • Properties of the J-measure
  • again, takes both P(A) and P(B) into consideration
  • its value is always between 0 and 1
  • it can be computed using pre-calculated values
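A sketch of the definition above, evaluated on the basketball/cereal example:

    import math

    def j_measure(p_a, p_b, p_b_given_a):
        """J-measure of a rule A => B, with base-2 logarithms."""
        def term(p, q):
            return p * math.log2(p / q) if p > 0 else 0.0
        return p_a * (term(p_b_given_a, p_b) + term(1 - p_b_given_a, 1 - p_b))

    # P(A) = 0.6, P(B) = 0.75, P(B|A) = 2000/3000
    print(round(j_measure(0.6, 0.75, 2000 / 3000), 3))   # ~0.015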

45
Support/Frequency/J-measure
46
Confidence
47
Example: Selection of Interesting Association Rules
  • To reduce the number of association rules that have to be considered, we could, for example, use one of the following selection criteria:
  • frequency and confidence
  • J-measure or interest
  • maximum rule size (whole rule, left-hand side, right-hand side)
  • rule attributes (e.g., templates); a filter along these lines is sketched below
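A hypothetical filter combining such criteria; the rule representation and thresholds are illustrative assumptions:

    # A rule is a tuple (lhs, rhs, frequency, confidence)
    def select_rules(rules, min_freq=0.01, min_conf=0.5, max_size=4):
        return [(lhs, rhs, f, c) for (lhs, rhs, f, c) in rules
                if f >= min_freq and c >= min_conf
                and len(lhs) + len(rhs) <= max_size]

    rules = [({"basketball"}, {"cereal"}, 0.40, 0.667),
             ({"basketball"}, {"no cereal"}, 0.20, 0.333)]
    print(select_rules(rules))   # only the first rule passes min_conf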

48
Example: Problems with the selection of rules
  • A rule can correspond to prior knowledge or expectations
  • how can the background knowledge be encoded into the system?
  • A rule can refer to uninteresting attributes or attribute combinations
  • could this be avoided by enhancing the preprocessing phase?
  • Rules can be redundant
  • redundancy elimination by rule covers etc.

49
Interpretation and evaluation of the results of
data mining
  • Evaluation
  • statistical validation and significance testing
  • qualitative review by experts in the field
  • pilot surveys to evaluate model accuracy
  • Interpretation
  • tree and rule models can be read directly
  • clustering results can be graphed and tabled
  • code can be automatically generated by some
    systems

50
Visualization of Discovered Patterns (1)
  • In some cases, visualization of the results of
    data mining (rules, clusters, networks) can be
    very helpful
  • Visualization is in fact already important in the preprocessing phase, both for selecting the appropriate data and for inspecting it
  • Visualization requires training and practice

51
Visualization of Discovered Patterns (2)
  • Different backgrounds/usages may require different forms of representation
  • e.g., rules, tables, cross-tabulations, or pie/bar charts
  • Concept hierarchies are also important
  • discovered knowledge might be more understandable when represented at a high level of abstraction
  • interactive drill-up/drill-down, pivoting, slicing, and dicing provide different perspectives on the data
  • Different kinds of knowledge require different kinds of representation
  • association, classification, clustering, etc.

52
Visualization
54
Utilization of the results
[Figure: a pyramid of increasing potential to support business decisions, from bottom to top]
  Data Sources (Paper, Files, Information Providers, Database Systems, OLTP) - DBA
  Data Warehouses / Data Marts (OLAP, MDA) - DBA
  Data Exploration (Statistical Analysis, Querying and Reporting) - Data Analyst
  Data Mining (Information Discovery) - Data Analyst
  Data Presentation (Visualization Techniques) - Business Analyst
  Making Decisions - End User
55
Summary
  • Data mining: semi-automatic discovery of interesting patterns from large data sets
  • Knowledge discovery is a process:
  • preprocessing
  • data mining
  • post-processing
  • using and utilizing the knowledge

56
Summary
  • Preprocessing is important in order to get useful
    results!
  • If a loosely defined mining methodology is used,
    post-processing is needed in order to find the
    interesting results!
  • Visualization is useful in pre- and
    post-processing!
  • One has to be able to utilize the discovered knowledge!

57
References KDD Process
  • P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley, Harlow, England, 1996.
  • R.J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
  • D.P. Ballou and G.K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
  • M.S. Chen, J. Han, and P.S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
  • U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
  • T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
  • Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
  • D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
  • D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
  • D. Keim and H.-P. Kriegel. Visualization techniques for mining large databases: a comparison. IEEE Trans. Knowledge and Data Engineering, 8(6), 1996.

58
References KDD Process
  • W. Kloesgen. Explora: A multipattern and multistrategy discovery assistant. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 249-271. AAAI/MIT Press, 1996.
  • M. Klemettinen. A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
  • M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
  • G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
  • G. Piatetsky-Shapiro and W.J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
  • D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
  • T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
  • A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. Knowledge and Data Engineering, 8:970-974, Dec. 1996.
  • D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.

59
References KDD Process
  • Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
  • R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

60
Reminder: Course Organization
Course Evaluation
  • Passing the course: min 30 points
  • home exam: min 13 points (max 30 points)
  • exercises/experiments: min 8 points (max 20 points)
  • at least 3 returned and reported experiments
  • group presentation: min 4 points (max 10 points)
  • Remember also the other requirements
  • attending the lectures (5/7)
  • attending the seminars (4/5)
  • attending the exercises (4/5)

61
Seminar Presentations/Groups 9-10
Visualization and data mining:
D. Keim, H.-P. Kriegel, T. Seidl: "Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.
62
Seminar Presentations/Groups 9-10
Interestingness:
G. Piatetsky-Shapiro, C.J. Matheus: "The Interestingness of Deviations", KDD'94.
63
KDD process
Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.