Data mining and the knowledge discovery process - PowerPoint PPT Presentation

1
Data mining and the knowledge discovery process
  • Summer Course 2006
  • H.H.L.M. Donkers

2
Content
  • Opening / acquaintance
  • What is data mining
  • Data mining methodology
  • Course perspective
  • Course contents

3
Data - Information - Knowledge -
  • Data: symbols
  • Information: data that are processed to be
    useful; provides answers to "who", "what",
    "where", and "when" questions
  • Knowledge: application of data and information;
    answers "how" questions
  • Understanding: appreciation of "why"
  • Wisdom: evaluated understanding (Russell
    Ackoff, http://www.outsights.com/systems/dikw/dikw.htm)

4
Data - Information - Knowledge -
  • http://www.outsights.com/systems/dikw/dikw.htm

5
What is Data Mining: Traditionally
  • Data mining is the extraction of implicit,
    previously unknown, and potentially useful
    information from data.
  • Witten & Frank (2000). Data Mining.

6
What is Data Mining: Traditionally
  • The application of specific algorithms for
    extracting patterns from data; it is one part of
    knowledge discovery from databases.
  • Fayyad (1997). From data mining to knowledge
    discovery in databases.

7
What is Data Mining: Traditionally
  • Data mining is a process, not just a series of
    statistical analyses.
  • SAS Institute (2003). Finding the solution to
    data mining.

8
What is Data Mining: Traditionally
  • Computer Science
      • (Semi-)automated application of algorithms for
        pattern discovery
      • Algorithms developed in the field of Artificial
        Intelligence (machine learning)
      • Part of the process of knowledge discovery
  • Statistics
      • Process of discovering patterns in data
      • (Manual) application of a series of statistical
        techniques (among which machine learning)
      • Incorporates
          • Exploration
          • Sampling
          • Modeling
          • Validation

[Figure: diagram with the labels Data mining, Statistics, Marketing]
9
What is Data Mining: A Fusion
  • An analytic process designed to explore data in
    search of consistent patterns and/or systematic
    relationships between variables, and then to
    validate the findings by applying the detected
    patterns to new subsets of data. The ultimate
    goal is prediction.
  • Statsoft (2003). Data Mining Techniques.
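
The explore-then-validate idea in this definition can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the single-threshold "model", and all function names are hypothetical and not taken from the course material. A pattern is searched for on one subset and then applied to a subset that played no part in the search.

```python
import random

def make_data(n, seed=0):
    """Hypothetical toy data: one feature x; label is 1 when x > 5, with 10% label noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = int(x > 5)
        if rng.random() < 0.1:
            label = 1 - label
        rows.append((x, label))
    return rows

def accuracy(rows, threshold):
    """Fraction of rows whose label is predicted correctly by the rule 'x > threshold'."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

def fit_threshold(rows):
    """Explore: try every observed x as a threshold, keep the one with the best accuracy."""
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=lambda t: accuracy(rows, t))

data = make_data(200)                  # rows are generated in random order
train, test = data[:140], data[140:]   # detect on 'train', validate on 'test'
threshold = fit_threshold(train)
train_acc = accuracy(train, threshold)
test_acc = accuracy(test, threshold)
print(f"threshold={threshold:.2f} train={train_acc:.2f} test={test_acc:.2f}")
```

The accuracy on the held-out rows, not the accuracy on the rows used for the search, is the figure that says something about prediction.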

10
What is Data Mining: A Fusion
  • An information extraction activity whose goal is
    to discover hidden facts contained in databases.
    Using a combination of machine learning,
    statistical analysis, modeling techniques and
    database technology, data mining finds patterns
    and subtle relationships in data and infers rules
    that allow the prediction of future results.
  • Rudjer Boskovic Institute (2001). DMS Tutorial.
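
As an illustration of "infers rules that allow the prediction of future results", here is a minimal sketch of 1R (one-rule), a very simple rule learner. It is a swapped-in example, not necessarily the kind of algorithm the quoted tutorial has in mind, and the small dataset and attribute names are hypothetical. For every attribute, 1R maps each value to its majority class and then keeps the attribute whose rule makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)          # attribute value -> class counts
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy data, invented for this sketch.
rows = [
    {"outlook": "sunny",    "windy": "false", "play": "yes"},
    {"outlook": "sunny",    "windy": "true",  "play": "yes"},
    {"outlook": "sunny",    "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "false", "play": "no"},
    {"outlook": "rainy",    "windy": "true",  "play": "no"},
    {"outlook": "rainy",    "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
]

attr, rule, errors = one_r(rows, ["outlook", "windy"], "play")
print(attr, rule, errors)
```

The inferred rule is a plain lookup table, so predicting a new case is just `rule[new_row[attr]]`.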

11
Data Mining In This Course
  • We use the book by Witten & Frank
  • Computer science (machine learning) approach
  • Emphasis on algorithms for pattern discovery and
    rule extraction
      • What are the underlying models
      • What are the properties of the algorithms
      • When to use (for which tasks)
      • How to apply and to tune
      • How to interpret and assess the results

12
Data Mining Process
  • These algorithms are only part of a process that
    computer scientists call Knowledge Discovery and
    the statisticians call Data Mining
  • The process starts with the recognition of a
    problem and ends with the control of a deployed
    solution
  • The whole process needs to be supported for a
    successful application

13
Methodologies for Data Mining
  • As Data Mining is coming of age, several
    methodologies have been developed, each with
    its own perspective. We will discuss three of
    them:
      • Fayyad et al. (Computer science)
          • E.g., WEKA
      • SEMMA (SAS) (Statistics)
          • SAS Enterprise Miner, R
      • CRISP-DM (SPSS, OHRA, a.o.) (Business)
          • SPSS Clementine

14
Fayyad's KDD Methodology
[Figure: KDD process diagram; only the label "data" survives]
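
A minimal sketch of a KDD-style pipeline, assuming the five steps usually attributed to Fayyad et al. (selection, preprocessing, transformation, data mining, interpretation/evaluation). The records, field names, and the body of every step are hypothetical placeholders; the point is only that data mining is one step in a chain.

```python
def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [{"age": r.get("age"), "spend": r.get("spend")} for r in records]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def transform(records):
    """Transformation: derive a coarser feature (an age band)."""
    return [{"band": "young" if r["age"] < 40 else "older", "spend": r["spend"]}
            for r in records]

def mine(records):
    """Data mining: a trivial 'pattern', the mean spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["band"], []).append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}

def interpret(patterns):
    """Interpretation/evaluation: phrase the pattern so a person can review it."""
    top = max(patterns, key=patterns.get)
    return f"highest mean spend: {top} ({patterns[top]:.1f})"

# Hypothetical raw records, invented for this sketch.
raw = [
    {"id": 1, "age": 25, "spend": 40.0},
    {"id": 2, "age": 52, "spend": 90.0},
    {"id": 3, "age": None, "spend": 10.0},
    {"id": 4, "age": 31, "spend": 60.0},
    {"id": 5, "age": 47, "spend": 70.0},
]

data = raw
for step in (select, preprocess, transform, mine):
    data = step(data)
patterns = data
summary = interpret(patterns)
print(summary)
```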
15
SEMMA Methodology
Supported by the SAS Enterprise Miner environment
16
CRISP-DM Methodology
  • Developed by data-mining companies (SPSS, NCR,
    OHRA, DaimlerChrysler), funded by the European
    Commission
  • Tool-independent / industry-independent
  • Hierarchical process model
      1. Generic phases
      2. Generic tasks
      3. Specific tasks
      4. Task instances
  • Supported by the SPSS Clementine environment
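
The four-level hierarchy can be pictured as nested records. The sketch below is one possible representation, not part of CRISP-DM itself: the class names are made up for this example, and the specific task and task instance shown are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskInstance:            # level 4: what was actually done in one project
    description: str

@dataclass
class SpecificTask:            # level 3: a generic task made concrete for a situation
    name: str
    instances: List[TaskInstance] = field(default_factory=list)

@dataclass
class GenericTask:             # level 2: task that applies to any project
    name: str
    specific_tasks: List[SpecificTask] = field(default_factory=list)

@dataclass
class Phase:                   # level 1: generic phase
    name: str
    generic_tasks: List[GenericTask] = field(default_factory=list)

def walk(phase):
    """Yield one 'phase > generic task > specific task > instance' path per instance."""
    for generic in phase.generic_tasks:
        for specific in generic.specific_tasks:
            for instance in specific.instances:
                yield " > ".join(
                    [phase.name, generic.name, specific.name, instance.description]
                )

phase = Phase("Data preparation", [
    GenericTask("Clean data", [
        SpecificTask("Handle missing values", [                   # hypothetical
            TaskInstance("Replace missing ages by the median"),   # hypothetical
        ]),
    ]),
])

paths = list(walk(phase))
print(paths[0])
```

The upper two levels are the same for every project; the lower two are filled in per situation and per project.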

17
CRISP-DM Methodology
  • Phases: Business understanding, Data understanding,
    Data preparation, Modeling, Evaluation, Deployment
  • Tasks (Business understanding)
      • Business objective
      • Assess situation
      • Data mining goals
      • Project plan
18
CRISP-DM Methodology
  • Tasks (Data understanding)
      • Collect data
      • Describe data
      • Explore data
      • Verify data quality
19
CRISP-DM Methodology
  • Tasks (Data preparation)
      • Select data
      • Clean data
      • Construct data
      • Integrate data
      • Format data
20
CRISP-DM Methodology
  • Tasks (Modeling)
      • Select modeling techniques
      • Design the test
      • Build model
      • Assess model
21
CRISP-DM Methodology
  • Tasks (Evaluation)
      • Evaluate results
      • Review process
      • Determine next steps
22
CRISP-DM Methodology
  • Tasks (Deployment)
      • Plan deployment
      • Plan monitoring and maintenance
      • Final report
      • Review project
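
The phases and tasks from the CRISP-DM slides can be collected into a small lookup table, for example to track progress in a project. This is a sketch only: the task names are copied from the slides, while the table name and the helper functions are hypothetical.

```python
# Phase -> tasks, in the order the slides present them.
CRISP_DM_TASKS = {
    "Business understanding": ["Business objective", "Assess situation",
                               "Data mining goals", "Project plan"],
    "Data understanding": ["Collect data", "Describe data",
                           "Explore data", "Verify data quality"],
    "Data preparation": ["Select data", "Clean data", "Construct data",
                         "Integrate data", "Format data"],
    "Modeling": ["Select modeling techniques", "Design the test",
                 "Build model", "Assess model"],
    "Evaluation": ["Evaluate results", "Review process", "Determine next steps"],
    "Deployment": ["Plan deployment", "Plan monitoring and maintenance",
                   "Final report", "Review project"],
}

def phase_of(task):
    """Return the phase a task belongs to; raise KeyError for an unknown task."""
    for phase, tasks in CRISP_DM_TASKS.items():
        if task in tasks:
            return phase
    raise KeyError(task)

def remaining(done):
    """List (phase, task) pairs not yet in 'done', in phase order."""
    return [(phase, task)
            for phase, tasks in CRISP_DM_TASKS.items()
            for task in tasks
            if task not in done]

print(phase_of("Clean data"))
print(remaining({"Business objective"})[0])
```

In practice the process is iterative, so a real tracker would also have to allow moving back to an earlier phase; this table only records the nominal order.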
23
A Comparison
[Figure: comparison diagram; surviving labels: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment]
24
A Small Poll (July 2002)
Source: http://www.kdnuggets.com/polls/2002/methodology.htm
25
Poll repeated (2004)
Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
26
Course perspective and goal
  • The perspective is from computer science
    (machine learning): Fayyad's approach
  • The emphasis is on techniques for the automated
    discovery of patterns in data and the automated
    extraction of rules (the modeling phase of SEMMA
    and CRISP-DM)
  • The goal is to get acquainted with these
    techniques, so you can use them in the
    methodology of your choice

27
Course contents
  • Data preparation (Tuesday)
      • Selection, preprocessing, transformation
  • Techniques, algorithms and models
      • Decision trees (Monday)
      • Instance-based and Bayesian learning (Wednesday)
      • Neural networks (Wednesday)
      • Association rules (Thursday)
      • Clustering (Thursday)
      • Support Vector Machines (Friday)
  • Evaluation of learned models (Tuesday)

28
Course contents
  • For each technique you learn
      • For which tasks it is suitable
          • Classification, rules, prediction, ...
      • Restrictions on input data (numerical, symbolic,
        etc.)
      • What algorithms are available
      • What parameters should be tuned
      • How to interpret the results
      • How to evaluate the model