Machine Learning and Inductive Inference - PowerPoint PPT Presentation

1
Machine Learning and Inductive Inference
  • Hendrik Blockeel, 2001-2002

2
1 Introduction
  • Practical information
  • What is "machine learning and inductive
    inference"?
  • What is it useful for? (some example
    applications)
  • Different learning tasks
  • Data representation
  • Brief overview of approaches
  • Overview of the course

3
Practical information about the course
  • 10 lectures (2h) + 4 exercise sessions (2.5h)
  • Audience with diverse backgrounds
  • Course material
  • Book: Machine Learning (Mitchell, 1997,
    McGraw-Hill)
  • Slides + notes, http://www.cs.kuleuven.ac.be/hendrik/ML/
  • Examination
  • oral exam (20') with written preparation (+/- 2h)
  • 2/3 theory, 1/3 exercises
  • Only topics discussed in lectures / exercises

4
What is machine learning?
  • Study of how to make programs improve their
    performance on certain tasks from their own experience
  • "performance": speed, accuracy, ...
  • "experience": set of previously seen cases
    ("observations")
  • For instance (simple method):
  • experience: taking action A in situation S
    yielded result R
  • situation S arises again
  • if R was undesirable: try something else
  • if R was desirable: try action A again

5
  • This is a very simple example
  • only works if precisely the same situation is
    encountered
  • what if similar situation?
  • Need for generalisation
  • how about choosing another action even if a good
    one is already known? (you might find a better
    one)
  • Need for exploration
  • This course focuses mostly on generalisation or
    inductive inference

6
Inductive inference
  • Reasoning from specific to general
  • e.g. statistics: from a sample, infer properties of
    the population

sample
population
observation: "these dogs are all brown"
hypothesis: "all dogs are brown"
7
  • Note: inductive inference is more general than
    statistics
  • statistics mainly consists of numerical methods
    for inference
  • infer mean, probability distribution, ... of
    population
  • other approaches:
  • find symbolic definition of a concept (concept
    learning)
  • find laws with complicated structure that govern
    the data
  • study induction from a logical, philosophical, ...
    point of view

8
  • Applications of inductive inference
  • Machine learning
  • "sample" of observations = experience
  • generalizing to population = finding patterns in
    the observations that generally hold and may be
    used for future tasks
  • Knowledge discovery (Data mining)
  • "sample" = database
  • generalizing = finding patterns that hold in this
    database and can also be expected to hold on
    similar data not in the database
  • discovered knowledge = comprehensible description
    of these patterns
  • ...

9
What is it useful for?
  • Scientifically: for understanding learning and
    intelligence in humans and animals
  • interesting for psychologists, philosophers,
    biologists, ...
  • More practically:
  • for building AI systems
  • expert systems that improve automatically with
    time
  • systems that help scientists discover new laws
  • also useful outside classical AI-like
    applications
  • when we don't know how to program something
    ourselves
  • when a program should adapt regularly to new
    circumstances
  • when a program should tune itself towards its user

10
Knowledge discovery
  • Scientific knowledge discovery
  • Some toy examples
  • Bacon: rediscovered some laws of physics (e.g.
    Kepler's laws of planetary motion)
  • AM: rediscovered some mathematical theorems
  • More serious recent examples:
  • mining the human genome
  • mining the web for information on genes,
    proteins, ...
  • drug discovery
  • context: robots perform lots of experiments at a
    high rate; this yields lots of data, to be
    studied and interpreted by humans; try to
    automate this process (because humans can't keep
    up with robots)

11
Example: given molecules that are active against
some disease, find out what is common to them;
this is probably the reason for their activity.
12
  • Data mining in databases, looking for
    interesting patterns
  • e.g. for marketing
  • based on data in DB, who should be interested in
    this new product? (useful for direct mailing)
  • study customer behaviour to identify typical
    profiles of customers
  • find out which products in store are often bought
    together
  • e.g. in hospital: help with diagnosis of patients

13
Learning to perform difficult tasks
  • Difficult for humans
  • LEX system: learned how to perform symbolic
    integration of functions
  • or easy for humans, but difficult to program
  • humans can do it, but can't explain how they do
    it
  • e.g.
  • learning to play games (chess, go, ...)
  • learning to fly a plane, drive a car, ...
  • recognising faces

14
Adaptive systems
  • Robots in a changing environment
  • continuously need to adapt their behaviour
  • Systems that adapt to the user
  • based on user modelling
  • observe behaviour of user
  • build model describing this behaviour
  • use model to make the user's life easier
  • e.g. adaptive web pages, intelligent mail
    filters, adaptive user interfaces (e.g.
    intelligent Unix shell), ...

15
Illustration: building a system that learns
checkers
  • Learning = improving on task T, with respect to
    performance measure P, based on experience E
  • In this example:
  • T: playing checkers
  • P: % of games won in world tournament
  • E: games played against self
  • possible problem: is this experience representative
    for the real task?
  • Questions to be answered:
  • exactly what is given, exactly what is learnt,
    what representation + learning algorithm should
    we use

16
  • What do we want to learn?
  • given board situation, which move to make
  • What is given?
  • direct or indirect evidence ?
  • direct: e.g., which moves were good, which were
    bad
  • indirect: consecutive moves in game, outcome of
    the game
  • in our case: indirect evidence
  • direct evidence would require a teacher

17
  • What exactly shall we learn?
  • Choose type of target function
  • ChooseMove: Board → Move ?
  • directly applicable
  • V: Board → ℜ ?
  • indicates quality of state
  • when playing, choose move that leads to best
    state
  • Note: a reasonable definition for V is easy to give:
  • V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) =
    V(e) with e the best state reachable from s when
    playing optimally
  • Not feasible in practice (exhaustive minimax
    search)
  • Let's choose the V function here

18
  • Choose representation for target function
  • set of rules?
  • neural network?
  • polynomial function of numerical board features?
  • Let's choose V = w1·bp + w2·rp + w3·bk + w4·rk + w5·bt + w6·rt
  • bp, rp = number of black / red pieces
  • bk, rk = number of black / red kings
  • bt, rt = number of black / red pieces threatened
  • wi = constants to be learnt from experience
19
  • How to obtain training examples?
  • we need a set of examples <bp, rp, bk, rk, bt,
    rt, V>
  • bp etc. are easy to determine, but how to guess V?
  • we have indirect evidence only!
  • possible method:
  • with V(s) the true target function, V'(s) the learnt
    function, Vt(s) the training value for a state s
  • Vt(s) ← V'(successor(s))
  • adapt V' using the Vt values (making V' and Vt
    converge)
  • hope that V' will converge to V
  • intuitively: V for end states is known; propagate
    V values from later states to earlier states in
    the game

20
  • Training algorithm: how to adapt the weights wi?
  • possible method:
  • look at the error: error(s) = V'(s) - Vt(s)
  • adapt the weights so that the error is reduced
  • e.g. using a gradient descent method
  • for each feature fi: wi ← wi - c · fi · error(s),
    with c some small constant
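A minimal sketch of this weight update in Python (the learning rate c, the feature values and the training value below are illustrative assumptions):

```python
def lms_update(weights, features, v_train, c=0.01):
    """One gradient-descent step on a linear evaluation function.

    weights, features: lists of equal length (w1..w6, f1..f6)
    v_train: training value Vt(s) for this state
    c: small learning-rate constant (assumed value)
    """
    v_hat = sum(w * f for w, f in zip(weights, features))  # current V'(s)
    error = v_hat - v_train                                # error(s) = V'(s) - Vt(s)
    # move each weight against the error gradient
    return [w - c * f * error for w, f in zip(weights, features)]

# Repeated updates on one example drive the error towards 0:
w = [0.0] * 6
for _ in range(200):
    w = lms_update(w, [5, 4, 1, 0, 2, 1], v_train=1.0)
```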

21
Overview of design choices
  • type of training experience: games against self,
    games against expert, table of good moves, ...
  • type of target function: Board → Move, or
    Board → ℜ
  • representation: linear function of 6 features
  • learning algorithm: gradient descent
  • ready!

22
Some issues that influence choices
  • Which algorithms are useful for what type of
    functions?
  • How is learning influenced by:
  • training examples
  • complexity of hypothesis (function)
    representation
  • noise in the data
  • Theoretical limits of learning?
  • Can we help the learner with prior knowledge?
  • Could a system alter its representation itself?

23
Typical learning tasks
  • Concept learning
  • learn a definition of a concept
  • supervised vs. unsupervised
  • Function learning ("predictive modelling")
  • Discrete ("classification") or continuous
    ("regression")
  • Concept = function with boolean result
  • Clustering
  • Finding descriptive patterns

24
Concept learning supervised
  • Given positive (+) and negative (-) examples of a
    concept, infer properties that cause instances to
    be positive or negative (= concept definition)

(figure: instance space X with a region C containing the + examples and the -
examples outside it)

C: X → {true, false}
25
Concept learning unsupervised
  • Given examples of instances
  • Invent reasonable concepts (= clustering)
  • Find definitions for these concepts

(figure: instance space X with examples grouped into clusters C1, C2 and C3)

  • Cf. taxonomy of animals, identification of market
    segments, ...

26
Function learning
  • Generalises over concept learning
  • Learn a function f: X → S where
  • S is a finite set of values: classification
  • S is a continuous range of reals: regression

(figure: instance space X with examples labelled by their f values, e.g. 1.4,
2.1, 2.7, 0.6, 0.9)
27
Clustering
  • Finding groups of instances that are similar
  • May be a goal in itself (unsupervised
    classification)
  • ... but also used for other tasks
  • regression
  • flexible prediction when it is not known in
    advance which properties to predict from which
    other properties

(figure: instance space X with examples falling into visually separate groups)
28
Finding descriptive patterns
  • Descriptive patterns = any kind of patterns, not
    necessarily directly useful for prediction
  • Generalises over predictive modelling (= finding
    predictive patterns)
  • Examples of patterns
  • "fast cars usually cost more than slower cars"
  • "people are never married to more than one person
    at the same time"

29
Representation of data
  • Numerical data: instances are points in ℜ^n
  • Many techniques focus on this kind of data
  • Symbolic data (true/false, black/white/red/blue,
    ...)
  • Can be converted to numeric data
  • Some techniques work directly with symbolic data
  • Structural data
  • Instances have internal structure (graphs, sets,
    cf. molecules)
  • Difficult to convert to simpler format
  • Few techniques can handle these directly

30
Brief overview of approaches
  • Symbolic approaches
  • Version spaces, induction of decision trees,
    induction of rule sets, inductive logic
    programming, ...
  • Numeric approaches
  • neural networks, support vector machines, ...
  • Probabilistic approaches (Bayesian learning)
  • Miscellaneous
  • instance-based learning, genetic algorithms,
    reinforcement learning

31
Overview of the course
  • Introduction (today) (Ch. 1)
  • Concept learning: version spaces (Ch. 2 - brief)
  • Induction of decision trees (Ch. 3)
  • Artificial neural networks (Ch. 4 - brief)
  • Evaluating hypotheses (Ch. 5)
  • Bayesian learning (Ch. 6)
  • Computational learning theory (Ch. 7)
  • Support vector machines (brief)

32
  • Instance-based learning (Ch. 8)
  • Genetic algorithms (Ch. 9)
  • Induction of rule sets + association rules (Ch.
    10)
  • Reinforcement learning (Ch. 13)
  • Clustering
  • Inductive logic programming
  • Combining different models
  • bagging, boosting, stacking, ...

33
2 Version Spaces
  • Recall basic principles from AI course
  • stressing important concepts for later use
  • Difficulties with version space approaches
  • Inductive bias
  • → Mitchell, Ch. 2

34
Basic principles
  • Concept learning as search
  • given hypothesis space H and data set S
  • find all h ∈ H consistent with S
  • this set is called the version space, VS(H,S)
  • How to search in H?
  • enumerate all h in H: not feasible
  • prune search using some generality ordering
  • h1 more general than h2 ⇔ (x ∈ h2 ⇒ x ∈ h1)
  • See Mitchell Chapter 2 for examples

35
An example
  • + belongs to concept, - does not
  • S = set of these + and - examples
  • Assume hypotheses are rectangles
  • I.e., H = set of all rectangles
  • VS(H,S) = set of all rectangles that contain all
    + and no -

36
  • Example of consistent hypothesis: green rectangle

37
  • h1 more general than h2 ⇔ h2 totally inside h1

(figure: h2 inside h1; h3 overlapping h1)
h2 more specific than h1; h3 incomparable with h1
38
Version space boundaries
  • Bound the version space by giving its most specific
    (S) and most general (G) borders
  • S: rectangles that cannot become smaller without
    excluding some +
  • G: rectangles that cannot become larger without
    including some -
  • Any hypothesis h consistent with the data
  • must be more general than some element in S
  • must be more specific than some element in G
  • Thus, G and S completely specify VS

39
Example, continued
  • So what are S and G here?

S = {h1}, G = {h2, h3}
40
Computing the version space
  • Computing G and S is sufficient to know the full
    version space
  • Algorithms in Mitchell's book:
  • Find-S: computes only the S set
  • S is always a singleton in Mitchell's examples
  • Candidate Elimination: computes S and G

41
Candidate Elimination Algorithm: demonstration
with rectangles
  • Algorithm: see Mitchell
  • Representation:
  • Concepts are rectangles
  • Rectangle represented with 2 attributes:
  • <Xmin-Xmax, Ymin-Ymax>
  • Graphical representation
  • hypothesis consistent with data if:
  • all + inside rectangle
  • no - inside rectangle
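The updates to the S border in the demonstration that follows can be sketched as a Find-S-style procedure for interval rectangles (a simplified illustration; the G-set bookkeeping of the full Candidate Elimination algorithm is omitted):

```python
def covers(rect, point):
    """True if rectangle <xmin-xmax, ymin-ymax> contains the point (x, y)."""
    (xmin, xmax, ymin, ymax), (x, y) = rect, point
    return xmin <= x <= xmax and ymin <= y <= ymax

def generalize_s(s, pos_example):
    """Minimally extend rectangle s so it covers a new positive example.

    s is None for the most specific hypothesis <emptyset>, covering nothing.
    """
    x, y = pos_example
    if s is None:
        return (x, x, y, y)          # first +: a point rectangle
    xmin, xmax, ymin, ymax = s
    return (min(xmin, x), max(xmax, x), min(ymin, y), max(ymax, y))

# Positive examples at (3,2) and (5,3), as in the slides that follow:
s = None
for p in [(3, 2), (5, 3)]:
    s = generalize_s(s, p)
# s is now <3-5, 2-3>
```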

42
(figure: empty 6×6 grid for S and G)
  • Start: S = none, G = all

S = <∅,∅>   G = <1-6, 1-6>
43
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S

S = <∅,∅>   G = <1-6, 1-6>
44
(figure: e1 at (3,2))
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1

S = <3-3, 2-2>   G = <1-6, 1-6>
45
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G

S = <3-3, 2-2>   G = <1-6, 1-6>
46
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • note: G now consists of 2 parts
  • each part covers all + and no -

S = <3-3, 2-2>   G = {<1-4, 1-6>, <1-6, 1-3>}
47
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • Example e3 appears, covered by G

S = <3-3, 2-2>   G = {<1-4, 1-6>, <1-6, 1-3>}
48
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • Example e3 appears, covered by G
  • One part of G is affected and reduced

S = <3-3, 2-2>   G = {<3-4, 1-6>, <1-6, 1-3>}
49
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • Example e3 appears, covered by G
  • One part of G is affected and reduced
  • Example e4 appears, not covered by S

S = <3-3, 2-2>   G = {<3-4, 1-6>, <1-6, 1-3>}
50
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • Example e3 appears, covered by G
  • One part of G is affected and reduced
  • Example e4 appears, not covered by S
  • S is extended to cover e4

S = <3-5, 2-3>   G = {<3-4, 1-6>, <1-6, 1-3>}
51
(figure)
  • Start: S = none, G = all
  • Example e1 appears, not covered by S
  • S is extended to cover e1
  • Example e2 appears, covered by G
  • G is changed to avoid covering e2
  • Example e3 appears, covered by G
  • One part of G is affected and reduced
  • Example e4 appears, not covered by S
  • S is extended to cover e4
  • Part of G not covering the new S is removed

S = <3-5, 2-3>   G = <1-6, 1-3>
52
(figure: hypothesis h between S and G)

The current version space contains all rectangles
covering S and covered by G, e.g. h = <2-5, 2-3>
S = <3-5, 2-3>   G = <1-6, 1-3>
53
  • Interesting points:
  • We here use an extended notion of generality
  • In the book: ∅ < value < ?
  • Here e.g.: ∅ < 2-3 < 2-5 < 1-5 < ?
  • We still use a conjunctive concept definition
  • each concept is 1 rectangle
  • this could be extended as well (but complicated)

54
Difficulties with version space approaches
  • Idea of VS provides nice theoretical framework
  • But not very useful for most practical problems
  • Difficulties with these approaches
  • Not very efficient
  • Borders G and S may be very large (may grow
    exponentially)
  • Not noise resistant
  • VS collapses when no consistent hypothesis
    exists
  • often we would like to find the best hypothesis
    in this case
  • in Mitchell's examples: only conjunctive
    definitions
  • We will compare with other approaches...

55
Inductive bias
  • After having seen a limited number of examples,
    we believe we can make predictions for unseen
    cases.
  • From seen cases to unseen cases: the inductive leap
  • Why do we believe this? Is there any guarantee
    this prediction will be correct? What extra
    assumptions do we need to guarantee correctness?
  • Inductive bias = minimal set of extra assumptions
    that guarantees correctness of the inductive leap

56
Equivalence between inductive and deductive
systems

inductive system: training examples + new instance → result (by inductive leap)
deductive system: training examples + new instance + inductive bias → result
(by proof)
57
Definition of inductive bias
  • More formal definition of inductive bias
    (Mitchell)
  • L(x,D) denotes the classification assigned to
    instance x by learner L after training on D
  • The inductive bias of L is any minimal set of
    assertions B such that for any target concept c
    and corresponding training examples D:
  • ∀x ∈ X: (B ∧ D ∧ x) ⊢ L(x,D)

58
Effect of inductive bias
  • Different learning algorithms give different
    results on same dataset because each may have a
    different bias
  • Stronger bias means less learning
  • more is assumed in advance
  • Is learning possible without any bias at all?
  • I.e., pure learning, without any assumptions in
    advance
  • The answer is No.

59
Inductive bias of version spaces
  • Bias of the candidate elimination algorithm: the
    target concept is in H
  • H typically consists of conjunctive concepts
  • in our previous illustration: rectangles
  • H could be extended towards disjunctive concepts
  • Is it possible to use version spaces with H = the
    set of all imaginable concepts, thereby eliminating
    all bias?

60
Unbiased version spaces
  • Let U be the example domain
  • Unbiased: the target concept C can be any subset of U
  • hence, H = 2^U
  • Consider VS(H,D) with D a strict subset of U
  • Assume you see an unseen instance x (x ∈ U \ D)
  • For each h ∈ VS that predicts x ∈ C, there is an
    h' ∈ VS that predicts x ∉ C, and vice versa
  • just take h' = h with the prediction for x flipped;
    since x ∉ D, h and h' are exactly the same w.r.t. D,
    so either both are in VS, or neither of them is

61
  • Conclusion: version spaces without any bias do
    not allow generalisation
  • To be able to make an inductive leap, some bias
    is necessary.
  • We will see many different learning algorithms
    that all differ in their inductive bias.
  • When choosing one in practice, bias should be an
    important criterion
  • unfortunately not always well understood

62
To remember
  • Definition of version space, importance of
    generality ordering for searching
  • Definition of inductive bias, practical
    importance, why it is necessary for learning, how
    it relates inductive systems to deductive systems

63
3 Induction of decision trees
  • What are decision trees?
  • How can they be induced automatically?
  • top-down induction of decision trees
  • avoiding overfitting
  • converting trees to rules
  • alternative heuristics ?
  • a generic TDIDT algorithm ?
  • → Mitchell, Ch. 3

64
What are decision trees?
  • Represent sequences of tests
  • According to outcome of test, perform a new test
  • Continue until the result is known
  • Cf. guessing a person using only yes/no
    questions
  • ask some question
  • depending on answer, ask a new question
  • continue until answer known

65
Example decision tree 1
  • Mitchell's example: Play tennis or not?
    (depending on weather conditions)

(tree: Outlook? Sunny → Humidity? {High → No, Normal → Yes};
 Overcast → Yes; Rainy → Wind? {Strong → No, Weak → Yes})
66
Example decision tree 2
  • Again from Mitchell: a tree for predicting whether
    a C-section is necessary
  • Leaves are not pure here; the ratio pos/neg is
    given

(tree over Fetal_Presentation, Previous_Csection and Primiparous, with leaves
such as [3+, 29-] = .11+ .89-, [8+, 22-] = .27+ .73-, [55+, 35-] = .61+ .39-)
67
Representation power
  • Typically:
  • examples represented by an array of attributes
  • 1 node in the tree tests the value of 1 attribute
  • 1 child node for each possible outcome of the test
  • Leaf nodes assign a classification
  • Note:
  • a tree can represent any boolean function
  • i.e., also disjunctive concepts (cf. the version
    space examples)
  • a tree can allow for noise (non-pure leaves)

68
Representing boolean formulae
  • E.g., A ∨ B:
  • Similarly (try yourself):
  • A ∧ B, A xor B, (A ∧ B) ∨ (C ∧ ¬D ∧ E)
  • M of N (at least M out of N propositions are
    true)
  • What about complexity of the tree vs. complexity
    of the original formula?

(tree for A ∨ B: A? true → true; false → B? {true → true, false → false})
69
Classification, Regression and Clustering trees
  • Classification trees represent a function X → C
    with C discrete (like the decision trees we just
    saw)
  • Regression trees predict numbers in leaves
  • could use a constant (e.g., mean), or linear
    regression model, or
  • Clustering trees just group examples in leaves
  • Most (but not all) research in machine learning
    focuses on classification trees

70
Example decision tree 3 (from study of river
water quality)
  • "Data mining" application
  • Given descriptions of river water samples
  • biological description: occurrence of organisms
    in water (abundance, graded 0-5)
  • chemical description: 16 variables (temperature,
    concentrations of chemicals (NH4, ...))
  • Question: characterize chemical properties of
    water using organisms that occur

71
Clustering tree
abundance(Tubifex sp., 5)?
  yes → T = 0.357111, pH = -0.496808, cond = 1.23151, O2 = -1.09279,
        O2sat = -1.04837, CO2 = 0.893152, hard = 0.988909, NO2 = 0.54731,
        NO3 = 0.426773, NH4 = 1.11263, PO4 = 0.875459, Cl = 0.86275,
        SiO2 = 0.997237, KMnO4 = 1.29711, K2Cr2O7 = 0.97025, BOD = 0.67012
  no → abundance(Sphaerotilus natans, 5)?
    yes → T = 0.0129737, pH = -0.536434, cond = 0.914569, O2 = -0.810187,
          O2sat = -0.848571, CO2 = 0.443103, hard = 0.806137, NO2 = 0.4151,
          NO3 = -0.0847706, NH4 = 0.536927, PO4 = 0.442398, Cl = 0.668979,
          SiO2 = 0.291415, KMnO4 = 1.08462, K2Cr2O7 = 0.850733, BOD = 0.651707
    no → abundance(...)

values are "standardized" (how many standard
deviations above the mean)
72
Top-Down Induction of Decision Trees
  • Basic algorithm for TDIDT (later a more formal
    version)
  • start with the full data set
  • find the test that partitions the examples as well
    as possible
  • good = examples with the same class, or otherwise
    similar examples, are put together
  • for each outcome of the test, create a child node
  • move examples to children according to outcome of
    the test
  • repeat the procedure for each child that is not
    "pure"
  • Main question: how to decide which test is best

73
Finding the best test (for classification trees)
  • For classification trees: find the test for which
    the children are as pure as possible
  • Purity measure borrowed from information theory:
    entropy
  • it is a measure of missing information; more
    precisely, the bits needed to represent the missing
    information, on average, using an optimal encoding
  • Given a set S with instances belonging to class i
    with probability pi: Entropy(S) = -Σ pi log2 pi
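The entropy formula above can be sketched directly in Python (taking class counts rather than probabilities as input, purely for convenience):

```python
import math

def entropy(counts):
    """Entropy(S) = -sum(pi * log2(pi)) over classes with pi > 0.

    counts: number of instances per class, e.g. [9, 5].
    """
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# A pure node carries no missing information; a 50/50 split needs 1 full bit:
e_pure = entropy([14, 0])   # 0.0
e_even = entropy([7, 7])    # 1.0
e_mix = entropy([9, 5])     # about 0.940
```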

74
Entropy
  • Intuitive reasoning:
  • use a shorter encoding for more frequent messages
  • information theory: a message with probability p
    should get -log2(p) bits
  • e.g. A, B, C, D each with 25% probability: 2 bits
    for each (00, 01, 10, 11)
  • if some are more probable, it is possible to do
    better
  • the average number of bits per message is then -Σ pi
    log2 pi

75
Entropy
  • Entropy as a function of p, for 2 classes

76
Information gain
  • Heuristic for choosing a test in a node:
  • choose the test that on average provides most
    information about the class
  • this is the test that, on average, reduces class
    entropy most
  • "on average": the class entropy reduction differs
    according to the outcome of the test
  • expected reduction of entropy = information gain
  • Gain(S,A) = Entropy(S) - Σv |Sv|/|S| Entropy(Sv)

77
Example
  • Assume S has 9 + and 5 - examples; partition
    according to the Wind or Humidity attribute

S = [9+, 5-]
Humidity: High → [3+, 4-], Normal → [6+, 1-]
Wind: Weak → [6+, 2-], Strong → [3+, 3-]
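Putting the entropy and gain formulas together for this example gives a self-contained sketch (the counts are the ones from the slide):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Gain(S,A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv))."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(child) / n * entropy(child) for child in child_counts)

S = [9, 5]                                  # 9 +, 5 -
g_humidity = gain(S, [[3, 4], [6, 1]])      # children: High, Normal
g_wind = gain(S, [[6, 2], [3, 3]])          # children: Weak, Strong
# g_humidity ≈ 0.151, g_wind ≈ 0.048: Humidity is the more informative test
```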
78
  • Assume Outlook was chosen: continue partitioning
    in the child nodes

[9+, 5-], Outlook:
Sunny → [2+, 3-] (?), Overcast → [4+, 0-] (Yes), Rainy → [3+, 2-] (?)
79
Hypothesis space search in TDIDT
  • Hypothesis space H = set of all trees
  • H is searched in a hill-climbing fashion, from
    simple to complex

...
80
Inductive bias in TDIDT
  • Note: for e.g. boolean attributes, H is complete:
    each concept can be represented!
  • given n attributes, we can keep on adding tests
    until all attributes are tested
  • So what about inductive bias?
  • Clearly no restriction bias (H = 2^U), unlike
    cand. elim.
  • Preference bias: some hypotheses in H are
    preferred over others
  • In this case: preference for short trees with
    informative attributes at the top

81
Occam's Razor
  • Preference for simple models over complex models
    is quite generally used in machine learning
  • Similar principle in science: Occam's Razor
  • roughly: do not make things more complicated than
    necessary
  • Reasoning, in the case of decision trees: more
    complex trees have a higher probability of
    overfitting the data set

82
Avoiding Overfitting
  • Phenomenon of overfitting:
  • keep improving a model, making it better and
    better on the training set by making it more
    complicated
  • increases risk of modelling noise and
    coincidences in the data set
  • may actually harm predictive power of the theory on
    unseen cases
  • Cf. fitting a curve with too many parameters

(figure: data points fitted by an over-complex curve)
83
Overfitting example
(figure: + and - examples with an overly complex decision boundary bending
around individual examples)
84
Overfitting: effect on predictive accuracy
  • Typical phenomenon when overfitting:
  • training accuracy keeps increasing
  • accuracy on unseen validation set starts
    decreasing

(figure: accuracy vs. size of tree; accuracy on training data keeps rising
while accuracy on unseen data peaks and then drops — overfitting starts about
there)
85
How to avoid overfitting when building
classification trees?
  • Option 1
  • stop adding nodes to tree when overfitting starts
    occurring
  • need a stopping criterion
  • Option 2
  • don't bother about overfitting when growing the
    tree
  • after the tree has been built, start pruning it
    again

86
Stopping criteria
  • How do we know when overfitting starts?
  • a) use a validation set = data not considered for
    choosing the best test
  • when accuracy goes down on the validation set:
    stop adding nodes to this branch
  • b) use some statistical test
  • significance test: e.g., is the change in class
    distribution still significant? (χ2-test)
  • MDL: minimal description length principle
  • fully correct theory = tree + corrections for
    specific misclassifications
  • minimize size(f.c.t.) = size(tree) +
    size(misclassifications(tree))
  • Cf. Occam's razor

87
Post-pruning trees
  • After learning the tree, start pruning branches
    away
  • For all nodes in the tree:
  • Estimate the effect of pruning the tree at this
    node on predictive accuracy
  • e.g. using accuracy on a validation set
  • Prune the node that gives the greatest improvement
  • Continue until no improvements
  • Note: this pruning constitutes a second search
    in the hypothesis space

88
(figure: accuracy vs. size of tree; pruning moves back down the size axis and
raises accuracy on unseen data)
89
Comparison
  • Advantage of Option 1: no superfluous work
  • But tests may be misleading
  • E.g., validation accuracy may go down briefly,
    then go up again
  • Therefore, Option 2 (post-pruning) is usually
    preferred (though more work, computationally)

90
Turning trees into rules
  • From a tree, a rule set can be derived
  • Each path from root to leaf in the tree = 1
    if-then rule
  • Advantages of such rule sets:
  • may increase comprehensibility
  • can be pruned more flexibly
  • in 1 rule, 1 single condition can be removed
  • vs. tree: when removing a node, the whole subtree
    is removed
  • 1 rule can be removed entirely

91
Rules from trees: example

(tree: Outlook? Sunny → Humidity? {High → No, Normal → Yes};
 Overcast → Yes; Rainy → Wind? {Strong → No, Weak → Yes})

if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
92
Pruning rules
  • Possible method:
  • 1. convert the tree to rules
  • 2. prune each rule independently
  • remove conditions that do not harm the accuracy of
    the rule
  • 3. sort rules (e.g., most accurate rule first)
  • before pruning: each example is covered by 1 rule
  • after pruning: 1 example might be covered by
    multiple rules
  • therefore some rules might contradict each other

93
Pruning rules: example

(tree representing A ∨ B: A? true → true; false → B? {true → true,
false → false})

if A = true then true
if A = false and B = true then true
if A = false and B = false then false

Rules represent A ∨ (¬A ∧ B)
94
Alternative heuristics for choosing tests
  • Attributes with continuous domains (numbers)
  • cannot have a different branch for each possible
    outcome
  • allow, e.g., a binary test of the form Temperature
    < 20
  • Attributes with many discrete values
  • unfair advantage over attributes with few values
  • cf. a question with many possible answers is more
    informative than a yes/no question
  • To compensate: divide gain by the max. potential
    gain SI
  • Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)
  • Split-information: SI(S,A) = -Σ |Si|/|S| log2
    (|Si|/|S|)
  • with i ranging over the different results of test A
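A sketch of the gain-ratio correction (self-contained; the two candidate splits below are made up for illustration, one binary and one many-valued):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """GR(S,A) = Gain(S,A) / SI(S,A), with SI the split information."""
    n = sum(parent_counts)
    sizes = [sum(child) for child in child_counts]
    g = entropy(parent_counts) - sum(
        s / n * entropy(child) for s, child in zip(sizes, child_counts))
    si = entropy(sizes)  # SI(S,A) = -sum(|Si|/|S| * log2(|Si|/|S|))
    return g / si

# An 8-way split into pure singletons has raw gain 1 bit but SI = 3 bits,
# so its gain ratio is damped relative to an equally pure binary split:
S = [4, 4]
many_valued = [[1, 0]] * 4 + [[0, 1]] * 4   # 8 pure, single-example children
binary = [[4, 0], [0, 4]]                   # 2 pure children
```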

95
  • Tests may have different costs
  • e.g. medical diagnosis: blood test, visual
    examination, ... have different costs
  • try to find a tree with low expected cost
  • instead of low expected number of tests
  • alternative heuristics, taking cost into
    account, have been proposed

96
Properties of good heuristics
  • Many alternatives exist
  • ID3 uses information gain or gain ratio
  • CART uses the Gini criterion (not discussed here)
  • Q: Why not simply use accuracy as a criterion?

S = [80-, 20+]
A1: [40-, 0+] and [40-, 20+]
A2: [40-, 10+] and [40-, 10+]
How would (a) accuracy and (b) information gain rate these splits?
97
Heuristics compared
Good heuristics are strictly concave
98
Why concave functions?
(figure: concave entropy curve with points (p1, E1) and (p2, E2) and the chord
between them)

Assume a node with size n, entropy E and proportion
of positives p is split into 2 nodes with n1, E1,
p1 and n2, E2, p2. We have p = (n1/n)p1 + (n2/n)p2,
and the new average entropy E' = (n1/n)E1 + (n2/n)E2
is therefore found by linear interpolation between
(p1,E1) and (p2,E2) at p.
Gain = difference in height between (p, E) and
(p, E').
99
Handling missing values
  • What if the result of the test is unknown for an
    example?
  • e.g. because the value of the attribute is unknown
  • Some possible solutions, when training:
  • guess the value: just take the most common value
    (among all examples, among examples in this node /
    class, ...)
  • assign the example partially to different branches
  • e.g. counts for 0.7 in yes subtree, 0.3 in no
    subtree
  • When using the tree for prediction:
  • assign the example partially to different branches
  • combine predictions of the different branches

100
Generic TDIDT algorithm
?
function TDIDT(E set of examples) returns
tree T' grow_tree(E) T
prune(T') return T function grow_tree(E set
of examples) returns tree T
generate_tests(E) t best_test(T, E) P
partition induced on E by t if
stop_criterion(E, P) then return
leaf(info(E)) else for all Ej in P tj
grow_tree(Ej) return node(t, (j,tj)
101
For classification...
  • prune e.g. reduced-error pruning, ...
  • generate_tests Attrval, Attrltval, ...
  • for numeric attributes generate val
  • best_test Gain, Gainratio, ...
  • stop_criterion MDL, significance test (e.g.
    ?2-test), ...
  • info most frequent class ("mode")
  • Popular systems C4.5 (Quinlan 1993), C5.0
    (www.rulequest.com)

102
For regression...
  • change
  • best_test e.g. minimize average variance
  • info mean
  • stop_criterion significance test (e.g., F-test),
    ...

1,3,4,7,8,12
1,3,4,7,8,12
A1
A2
1,4,12
3,7,8
1,3,7
4,8,12
103
CART
  • Classification and regression trees (Breiman et
    al., 1984)
  • Classification info mode, best_test Gini
  • Regression info mean, best_test variance
  • prune "error complexity" pruning
  • penalty ? for each node
  • the higher ?, the smaller the tree will be
  • optimal ? obtained empirically (cross-validation)

104
n-dimensional target spaces
  • Instead of predicting 1 number, predict vector of
    numbers
  • info mean vector
  • best_test variance (mean squared distance) in
    n-dimensional space
  • stop_criterion F-test
  • mixed vectors (numbers and symbols)?
  • use appropriate distance measure
  • -gt "clustering trees"

105
Clustering tree
abundance(Tubifex sp.,5) ?
yes
no
T 0.357111 pH -0.496808 cond
1.23151 O2 -1.09279 O2sat -1.04837
CO2 0.893152 hard 0.988909 NO2
0.54731 NO3 0.426773 NH4 1.11263 PO4
0.875459 Cl 0.86275 SiO2
0.997237 KMnO4 1.29711 K2Cr2O7 0.97025 BOD
0.67012
abundance(Sphaerotilus natans,5) ?
yes
no
T 0.0129737 pH -0.536434 cond
0.914569 O2 -0.810187 O2sat
-0.848571 CO2 0.443103 hard
0.806137 NO2 0.4151 NO3
-0.0847706 NH4 0.536927 PO4
0.442398 Cl 0.668979 SiO2
0.291415 KMnO4 1.08462 K2Cr2O7 0.850733 BOD
0.651707
abundance(...)
lt- "standardized" values (how many standard
deviations above mean)
106
To Remember
  • Decision trees their representational power
  • Generic TDIDT algorithm and how to instantiate
    its parameters
  • Search through hypothesis space, bias, tree to
    rule conversion
  • For classification trees details on heuristics,
    handling missing values, pruning,
  • Some general concepts overfitting, Occams razor

107
4 Neural networks
  • (Brief summary - studied in detail in other
    courses)
  • Basic principle of artificial neural networks
  • Perceptrons and multi-layer neural networks
  • Properties
  • ? Mitchell, Ch. 4

108
Artificial neural networks
  • Modelled after biological neural systems
  • Complex systems built from very simple units
  • 1 unit neuron
  • has multiple inputs and outputs, connecting the
    neuron to other neurons
  • when input signal sufficiently strong, neuron
    fires (i.,e., propagates signal)

109
  • ANNs consists of
  • neurons
  • connections between them
  • these connections have weights associated with
    them
  • input and output
  • ANNs can learn to associate inputs to outputs by
    adapting the weights
  • For instance (classification)
  • inputs pixels of photo
  • outputs classification of photo (person? tree?
    )

110
Perceptrons
  • Simplest type of neural network
  • Perceptron simulates 1 neuron
  • Fires if sum of (inputs weights) gt some
    threshold
  • Schematically

x1
threshold function Y -1 if Xltt, Y1 otherwise
w1
x2
?
x3
x4
w5
Y
x5
X
computes ? wixi
111
2-input perceptron
  • represent inputs in 2-D space
  • perceptron learns a function of following form
  • if aX bY gt c then 1, else -1
  • i.e., creates linear separation between classes
    and -

1
-1
112
n-input perceptrons
  • In general, perceptrons construct a hyperplane in
    an n-dimensional space
  • one side of hyperplane , other side -
  • Hence, classes must be linearly separable,
    otherwise perceptron cannot learn them
  • E.g. learning boolean functions
  • encode true/false as 1, -1
  • is there a perceptron that encodes 1. A and B?
    2. A or B? 3. A xor B?

113
Multi-layer networks
  • Increase representation power by combining
    neurons in a network

1
-1
output
output layer
-1
-1
hidden layer
1
1
-1
-1
inputs
X
Y
neuron 1
neuron 2
114
  • Sigmoid function instead of crisp threshold
  • changes continuously instead of in 1 step
  • has advantages for training multi-layer networks

x1
w1
x2
?
x3
x4
w5
x5
115
  • Non-linear sigmoid function causes non-linear
    decision surfaces
  • e.g., 5 areas for 5 classes a,b,c,d,e
  • Very powerful representation

e
b
c
d
a
116
  • Note previous network had 2 layers of neurons
  • Layered feedforward neural networks
  • neurons organised in n layers
  • each layer has output from previous layer as
    input
  • neurons fully interconnected
  • successive layers different representations of
    input
  • 2-layer feedforward networks very popular
  • but many other architectures possible!
  • e.g. recurrent NNs

117
  • Example 2-layer net representing ID function
  • 8 input patterns, mapped to same pattern in
    output
  • network converges to binary representation in
    hidden layer

for instance 1 101 2 100 3 011 4 111 5
000 6 010 7 110 8 001
118
Training neural networks
  • Trained by adapting the weights
  • Popular algorithm backpropagation
  • minimizing error through gradient descent
  • principle output error of a layer is attributed
    to
  • 1 weights of connections in that layer
  • adapt these weights
  • 2 inputs of that layer (except if first layer)
  • backpropagate error to these inputs
  • now use same principle to adapt weights of
    previous layer
  • Iterative process, may be slow

119
Properties of neural networks
  • Useful for modelling complex, non-linear
    functions of numerical inputs outputs
  • symbolic inputs/outputs representable using some
    encoding, cf. true/false 1/-1
  • 2 or 3 layer networks can approximate a huge
    class of functions (if enough neurons in hidden
    layers)
  • Robust to noise
  • but risk of overfitting! (because of high
    expressiveness)
  • may happen when training for too long
  • usually handled using e.g. validation sets

120
  • All inputs have some effect
  • cf. decision trees selection of most important
    attributes
  • Explanatory power of ANNs is limited
  • model represented as weights in network
  • no simple explanation why networks makes a
    certain prediction
  • contrast with e.g. trees can give a rule that
    was used

121
  • Hence, ANNs are good when
  • high-dimensional input and output (numeric or
    symbolic)
  • interpretability of model unimportant
  • Examples
  • typical image recognition, speech recognition,
  • e.g. images one input per pixel
  • see http//www.cs.cmu.edu/tom/faces.html for
    illustration
  • less typical symbolic problems
  • cases where e.g. trees would work too
  • performance of networks and trees then often
    comparable

122
To remember
  • Perceptrons, neural networks
  • inspiration
  • what they are
  • how they work
  • representation power
  • explanatory power
Write a Comment
User Comments (0)
About PowerShow.com