Data Mining - PowerPoint PPT Presentation

Description:

collected data on Mars. Johannes Kepler (1571-1630) mined Brahe's data ... interesting subsets. DATA MINING - 10 FEBRUARY 2004 LUC DEHASPE - 2004. Data preparation ... – PowerPoint PPT presentation

1
Data Mining
  • Luc Dehaspe
  • K.U.L. Computer Science Department
  • -
  • Marc Van Hulle
  • K.U.L. Neurophysiology Department

http://toledo.kuleuven.ac.be/
2
Course overview
Data Mining
3
Exercise session
  • Part 1 (L. Dehaspe)
  • 2 × 2.5 h paper-and-pencil sessions
  • application of algorithms
  • Part 2 (M. Van Hulle)
  • hands-on exercises

4
Exam
  • Written exam, closed book
  • Part 1 (Sessions 1-7): 50
  • Coverage
  • Questions RESTRICTED TO CONTENT OF SLIDES
  • Occasional pointers to additional material: I do
    not expect you to study this material
  • Questions
  • One main question: apply/understand an algorithm
    (30)
  • Two smaller questions: explain a concept, compute
    model quality (2 × 10)
  • Part 2 (Sessions 8-14): 50 (explained later by
    Marc Van Hulle)

5
Working definition: data mining
  • tools to search data for patterns and
    relationships that lead to better business
    decisions
  • business: commercial or scientific

6
Overview
  • myths and facts
  • the Data Mining process
  • methods
  • visual
  • non-visual

7
Myths and facts
  • New technology cycle
  • phase 1: hype
  • unrealistic expectations
  • naive users
  • phase 2: frustration
  • phase 3: rejection
  • Alternative: realistic view on vital technology

8
Myth 1: tabula rasa (virgin territory)
  • Data mining methods are fundamentally different
    from previous methods

Fact
  • Underlying ideas often decades old
  • neural networks: 1940
  • k-nearest neighbour: 1950
  • CART (regression trees): 1960
  • Novel
  • integrated applications to general business
    problems
  • more data, more computing power
  • non-academic users

9
Data Mining
  • Myth: the magic bullet

[Diagram labels: Problem, Solution, Meta learning, Task, Data, Performance]
Solution: integration of tools, mixture of old
and new
10
Take home lesson 1
  • Not one optimal method
  • But a portfolio of tools, a mixture of old and new

11
Myth 2: manna from heaven
  • Data mining produces surprising results
  • that will turn your business upside-down
  • without any input of domain expert knowledge
  • without any tuning of the technology

Fact
  • incremental changes rather than revolutionary
  • long term competitive advantage
  • occasional breakthroughs (e.g. the link between
    aspirin and Reye's syndrome)
  • technology: assistant to the domain expert
  • careful selection required of
  • goal
  • technology

12
Take home lesson 2
  • Crucial combination of
  • business (application domain) expertise
  • data mining technology expertise

13
Data Mining process model
  • Definition
  • Link with the scientific method

14
The data mining process
The non-trivial process of finding valid, novel,
potentially useful, and ultimately understandable
patterns in data
  • process: iterative, learn to ask better
    questions
  • valid: patterns can be generalized to new data
  • novel and useful: offer a competitive advantage
  • understandable: contribute to insight in the
    domain

15
Interrogating the database: Look-up queries
Which customers have car insurance?
What is the average toxicity of cadmium chloride?
How many earthquakes occurred last year?
How did HIV patient p123 react to AZT?
16
Interrogating the database: Finding patterns
What is the profile of returning customers?
What is the relation between in vitro activity
and chemical structure?
What is the relation between geological features
and the occurrence of earthquakes?
What is the relation between an HIV patient's
therapy history and response to AZT?
17
Science
18
Science
"The actual discovery of such an explanatory
hypothesis is a process of creation, in which
imagination as well as knowledge is involved."
Irving Copi, Introduction to Logic, 1986
"The formation of hypotheses is the most
mysterious of all the categories of scientific
method. Where they come from, no one knows. A
person is sitting somewhere, minding his own
business, and suddenly - flash! - he understands
something he didn't understand before."
Robert M. Pirsig, Zen and the Art of Motorcycle
Maintenance
  • collect data
  • build hypothesis
  • formulate theory
  • verify hypothesis
19
Evolution of data generation
Data Rich, Knowledge Poor
"Everyone, even the most patient and thorough
investigator, must pick and choose, deciding
which facts to study and which to pass over."
Irving Copi, Introduction to Logic, 1986
[Figure labels: 2000, Data source, Data, Data analyst]
20
The scientific method and Knowledge Discovery in
Databases
  • Data warehousing: collect data
  • Data Mining: build hypothesis, formulate theory
  • Statistics - OLAP: verify hypothesis
[Other figure labels: care, inspiration]
21
Data Mining
  • Definition
  • Extracting or mining knowledge from large
    amounts of data

CRISP-DM process model
22
Data mining in industry
  • An in silico research assistant allowing
    researchers to
  • explore an integrated database
  • for a variety of research purposes (business
    goals)
  • using an optimal selection of data mining
    technologies

[Figure labels: knowledge, pattern]
23
Data Mining process model: CRISP-DM
24
Business understanding
  • Which are the business goals?
  • Translation to data mining problem definition
  • Design of a plan to meet objectives

25
Data understanding
  • First collection of data
  • Becoming familiar with the data
  • Judge data quality
  • Discovery of
  • first insights
  • interesting subsets

26
Data preparation
  • Extract final data set from original set
  • Selection of
  • tables
  • records
  • attributes
  • transformation
  • data cleaning

27
Modelling
  • Selection of modelling techniques
  • calibrating parameters
  • regular backtracking to adapt data to technology
  • (some techniques discussed further on)

28
Evaluation
  • Decide whether to use Data Mining results
  • Verification of all steps
  • Check whether business goals have been met

29
Deployment
  • Organisation and presentation of new insights
  • variable complexity
  • deliver report
  • implement software that allows process to be
    repeated

30
Visual Data Mining methods
  • Pro
  • an image has a broader information bandwidth
    than text
  • (cf. a picture is worth a thousand words)
  • Con
  • problems with representation of 3 dimensions
  • not effective in case of color blindness
  • interpretation gives more information on subject
    than on object
  • cf. stars, clouds, Hermann Rorschach's inkblot test

31
Visual Data Mining methods
  • Error detection

32
Visual Data Mining methods
  • Linkage analysis

33
Visual Data Mining methods
  • Conditional probabilities

34
Visual Data Mining methods
  • Landscapes

35
Visual Data Mining methods
  • Scatter plots

36
Non-visual data mining methods
  • Statistics - OLAP
  • descriptive: average, median, standard deviation,
    distribution
  • hypothesis testing: (observed differences)/(random
    variation)
  • discriminant analysis
  • predictive: regression analysis (linear,
    non-linear)
  • clustering
  • Neural networks
  • Decision trees and rules
  • Conceptual clustering
  • Association rules
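As a rough illustration of the descriptive and hypothesis-testing items above, the following Python sketch computes the average, median, and standard deviation of a small made-up sample, and then a Welch-style t statistic as one common way to express (observed difference)/(random variation). The sample values and the helper name `welch_t` are assumptions for illustration; the slides do not name a specific test.

```python
import math
import statistics


def welch_t(a, b):
    """Observed difference in means divided by its estimated random variation."""
    diff = statistics.mean(a) - statistics.mean(b)
    noise = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return diff / noise


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, for illustration only
summary = {
    "average": statistics.mean(sample),
    "median": statistics.median(sample),
    "std_dev": statistics.pstdev(sample),  # population standard deviation
}

if __name__ == "__main__":
    print(summary)
    print(welch_t([4, 5, 6], [1, 2, 3]))
```

A large absolute value of the ratio suggests that the observed difference is unlikely to be random variation alone; turning it into a p-value requires a reference distribution, which this sketch leaves out.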

37
(Non-)visual Data Mining methods: OLAP - Data cubes
  • Online analytical processing
  • Classical statistical methods + database
    technology
  • real-time calculations
  • powerful visualisation methods

38
Non-Visual Data Mining methods: Regression
39
Non-Visual Data Mining methods: Discriminant
analysis
  • R.A. Fisher, 1936
  • discovers planes that separate classes
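A minimal sketch of how such a separating plane can be found, assuming two classes and two numeric attributes (so the "plane" is a line). It uses Fisher's linear discriminant in its textbook form, w = S^-1 (m1 - m0), with the threshold placed at the projected midpoint of the class means. The function names and the toy points are hypothetical, not taken from the slides.

```python
def mean_vec(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def fisher_discriminant(class0, class1):
    """Return (w, c): points with w . x > c are assigned to class 1."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    w = ((syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det)
    # Threshold: projection of the midpoint between the two class means.
    c = w[0] * (m0[0] + m1[0]) / 2 + w[1] * (m0[1] + m1[1]) / 2
    return w, c


def classify(point, w, c):
    return 1 if w[0] * point[0] + w[1] * point[1] > c else 0


# Toy, linearly separable data (made up for illustration).
class0 = [(1, 2), (2, 1), (2, 3), (3, 2)]
class1 = [(6, 5), (7, 6), (7, 4), (8, 5)]
w, c = fisher_discriminant(class0, class1)

if __name__ == "__main__":
    print(w, c)
    print([classify(p, w, c) for p in class0 + class1])
```

With more than two attributes the same formula applies, but the matrix inverse would normally come from a linear algebra library rather than a hand-written 2x2 inverse.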

40
Non-Visual Data Mining methods: Neural Networks
  • Represent functions with output a discrete value,
    a real value, or a vector
  • Neurobiological motivation
  • Network parameters tuned on the basis of
    input-output examples (backpropagation)
  • e.g. input from sensors
  • camera (face recognition)
  • microphone (speech recognition)
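To make "parameters tuned on the basis of input-output examples" concrete, here is a deliberately small sketch: a single sigmoid neuron whose weights are adjusted by gradient descent to reproduce the logical OR function. This is only the one-neuron special case; backpropagation proper applies the same kind of gradient step through several layers. The learning rate, epoch count, and function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(examples, epochs=2000, lr=0.5):
    """Fit the weights and bias of one sigmoid neuron by batch gradient descent."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in examples:
            # For the cross-entropy loss, the descent direction is (target - output).
            error = target - predict(inputs, weights, bias)
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        weights = [w + lr * g for w, g in zip(weights, grad_w)]
        bias += lr * grad_b
    return weights, bias


# Input-output examples for logical OR.
or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_examples)

if __name__ == "__main__":
    for inputs, target in or_examples:
        print(inputs, target, round(predict(inputs, weights, bias), 3))
```

A single neuron can only learn linearly separable functions such as OR; functions like XOR need at least one hidden layer, which is where backpropagation becomes necessary.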

41
Non-Visual Data Mining methods: Decision trees
42
Non-Visual Data Mining methods: Decision trees
  • Attribute selection
  • information gain
  • how well does an attribute distribute the data
    according to their target class
  • maximal reduction of Entropy
  • Entropy = - p_M log2(p_M) - p_F log2(p_F)
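The attribute-selection criterion above can be sketched in a few lines of Python. The code computes the entropy of a set of class labels and the information gain of an attribute, that is, the reduction in entropy obtained by splitting on it. The example rows (frame, color, and an M/F target) are invented for illustration and only loosely follow the car-buyer example used elsewhere in the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits: - sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


# Hypothetical training examples.
rows = [
    {"frame": "2-Door", "color": "Red", "sex": "M"},
    {"frame": "2-Door", "color": "Blue", "sex": "M"},
    {"frame": "4-Door", "color": "Red", "sex": "F"},
    {"frame": "4-Door", "color": "Blue", "sex": "F"},
]
gain_frame = information_gain(rows, "frame", "sex")
gain_color = information_gain(rows, "color", "sex")

if __name__ == "__main__":
    print(gain_frame, gain_color)
```

In this toy data set the frame attribute separates the two classes perfectly while color tells nothing, so a tree learner using this criterion would split on frame first.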

43
Non-Visual Data Mining methods: Decision rules
  • IF
  • Frame 2-Door AND
  • Engine ? V6 AND
  • Age
  • Cost 30K AND
  • Color Red
  • THEN
  • buyer is highly likely to be male

44
Non-Visual Data Mining methods: Clustering
[Figure: clusters labelled Cholesterol biosynthesis,
Cell cycle, Early response, Signaling and
angiogenesis, Wound healing]
Eisen et al., PNAS 1998
45
Non-Visual Data Mining methods: Conceptual
clustering
  • Groups examples and provides description of each
    group

u: all examples
  A: Age-20
  B: Age 20-40
    b1: Age 20-40 and Frame 2-Door
    b2: Age 20-40 and Frame 4-Door
  C: Age 40-60
  D: Age 60
    d1: Age 60 and Frame 2-Door
    d2: Age 60 and Frame 4-Door
46
Non-Visual Data Mining methods: Association rules
  • IF-THEN rules show relationships
  • e.g. Which products are bought together?
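A small sketch of how "which products are bought together" can be turned into IF-THEN rules. It counts item pairs over a list of shopping baskets and reports, for each sufficiently frequent pair, the rule IF lhs THEN rhs with its support and confidence. Support and confidence are the standard measures for such rules but are not named in the slides; the baskets, the threshold, and the function name `pair_rules` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (lhs, rhs, support, confidence) for all frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: fraction of baskets with lhs that also contain rhs.
            rules.append((lhs, rhs, support, count / item_counts[lhs]))
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]
rules = pair_rules(baskets)

if __name__ == "__main__":
    for lhs, rhs, support, confidence in rules:
        print(f"IF {lhs} THEN {rhs} (support {support:.2f}, confidence {confidence:.2f})")
```

Counting all pairs directly only works for small item sets; practical association rule miners prune the search, which this sketch does not attempt.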

47
Evaluation pitfalls: Post hoc ergo propter hoc
Everyone who drank Stella in the year 1743 is
now dead. Therefore, Stella is fatal.
48
Evaluation pitfalls: Correlation does not imply
causality
  • Palm size correlates with your life expectancy
  • The larger your palm, the shorter you will live,
    on average.

Why?
Women have smaller palms and live 6 years
longer on average
Caution: actions inspired by data mining results!
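The palm-size example can be reproduced with a handful of invented numbers. In the sketch below, palm size has no relation to lifespan within either group, yet the pooled data show a strong negative correlation because sex influences both variables. All values are made up for illustration (palm sizes in arbitrary units, a 6-year gap in mean lifespan to match the slide); they are not real measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# (palm size, lifespan in years): invented values, not real data.
women = [(16, 85), (17, 87), (18, 85)]
men = [(19, 79), (20, 81), (21, 79)]


def corr(group):
    return pearson([p for p, _ in group], [life for _, life in group])


r_women = corr(women)
r_men = corr(men)
r_pooled = corr(women + men)

if __name__ == "__main__":
    print(r_women, r_men, r_pooled)
```

Acting on the pooled correlation (for example, trying to shrink palms) would achieve nothing, which is the point of the warning about actions inspired by data mining results.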
49
Evaluation pitfalls: Hypothesis validation
  • descriptive statistics: 1 hypothesis
  • data mining: 1 hypothesis space
  • much higher probability of random relationships
  • validation on separate data set required
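A short calculation shows why searching a whole hypothesis space raises the probability of random relationships. Assuming, purely for illustration, that each candidate hypothesis is tested independently at a significance level of 0.05 and that none of them is actually true, the chance that at least one looks significant by accident is 1 - (1 - 0.05)^k for k hypotheses. Both the independence assumption and the 0.05 level are assumptions of this sketch, not statements from the slides.

```python
def chance_of_spurious_pattern(n_hypotheses, alpha=0.05):
    """Probability that at least one of n independent true-null tests passes."""
    return 1.0 - (1.0 - alpha) ** n_hypotheses


if __name__ == "__main__":
    for k in (1, 14, 100):
        print(k, round(chance_of_spurious_pattern(k), 3))
```

Under these assumptions a single hypothesis gives a 5% chance of a false alarm, while 14 hypotheses already give better than even odds, which is why a pattern found by search should be confirmed on data that played no part in finding it.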