1
Decision-Tree Induction and Decision-Rule Induction
Evgueni Smirnov
2
Overview
  • Instances, Classes, Languages, Hypothesis Spaces
  • Decision Trees
  • Decision Rules
  • Evaluation Techniques
  • Intro to Weka

3
Instances and Classes
A class is a set of objects in a world that are
unified by a reason. A reason may be a similar
appearance, structure or function.
friendly robots
Example. The set {children, photos, cat, diplomas}
can be viewed as the class "Most important things
to take out of your apartment when it catches fire."
4
Instances, Classes, Languages
head = square, body = round, smiling = yes,
holding = flag, color = yellow
friendly robots
5
Instances, Classes, Hypothesis Spaces
smiling = yes → friendly robots
head = square, body = round, smiling = yes,
holding = flag, color = yellow
friendly robots
6
The Classification Task
7
Decision Trees for Classification
  • Decision trees
  • Appropriate problems for decision trees
  • Entropy and Information Gain
  • The ID3 algorithm
  • Avoiding Overfitting via Pruning
  • Handling Continuous-Valued Attributes
  • Handling Missing Attribute Values

8
Decision Trees
  • Definition: A decision tree is a tree such that:
  • Each internal node tests an attribute,
  • Each branch corresponds to an attribute value,
  • Each leaf node assigns a classification.

9
Data Set for Playing Tennis
10
Decision Tree For Playing Tennis
Outlook = Sunny    → Humidity = High   → no
                     Humidity = Normal → yes
Outlook = Overcast → yes
Outlook = Rainy    → Windy = False → yes
                     Windy = True  → no
11
When to Consider Decision Trees
  • Each instance is described by attributes with
    discrete values (e.g., Outlook = Sunny, etc.).
  • The classification is over discrete values (e.g.,
    yes/no).
  • It is okay to have disjunctive descriptions:
    each path in the tree represents a disjunction of
    attribute combinations. Any Boolean function can
    be represented!
  • It is okay for the training data to contain
    errors: decision trees are robust to
    classification errors in the training data.
  • It is okay for the training data to contain
    missing values: decision trees can be used even
    if instances have missing attributes.

12
Rules in Decision Trees
If Outlook = Sunny and Humidity = High then Play = no
If Outlook = Sunny and Humidity = Normal then Play = yes
If Outlook = Overcast then Play = yes
If Outlook = Rainy and Windy = False then Play = yes
If Outlook = Rainy and Windy = True then Play = no
13
Decision Tree Induction
  • Basic Algorithm
  • 1. A ← the "best" decision attribute for a node N.
  • 2. Assign A as the decision attribute for the node N.
  • 3. For each value of A, create a new descendant of
    the node N.
  • 4. Sort the training examples to the leaf nodes.
  • 5. IF the training examples are perfectly classified,
    THEN STOP;
  • ELSE iterate over the new leaf nodes.

14
Decision Tree Induction
Splitting on Outlook (Sunny / Overcast / Rain) partitions the
training set as follows:

Outlook = Sunny:
  Outlook  Temp  Hum     Wind    Play
  ------------------------------------
  Sunny    Hot   High    Weak    no
  Sunny    Hot   High    Strong  no
  Sunny    Mild  High    Weak    no
  Sunny    Cool  Normal  Weak    yes
  Sunny    Mild  Normal  Strong  yes

Outlook = Overcast:
  Outlook   Temp  Hum     Wind    Play
  -------------------------------------
  Overcast  Hot   High    Weak    yes
  Overcast  Cool  Normal  Strong  yes

Outlook = Rain:
  Outlook  Temp  Hum     Wind    Play
  ------------------------------------
  Rain     Mild  High    Weak    yes
  Rain     Cool  Normal  Weak    yes
  Rain     Cool  Normal  Strong  no
  Rain     Mild  Normal  Weak    yes
  Rain     Mild  High    Strong  no
15
Entropy
  • Let S be a sample of training examples,
  • p+ is the proportion of positive examples in S, and
  • p- is the proportion of negative examples in S.
  • Then entropy measures the impurity of S:
  • E(S) = -p+ log2 p+ - p- log2 p-
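A minimal Python sketch of this formula (the function name and the "yes"/"no" label encoding are illustrative, not from the slides):

import math

def entropy(labels):
    """Binary entropy of a list of class labels ("yes"/"no")."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(1 for y in labels if y == "yes") / n
    p_neg = 1.0 - p_pos
    result = 0.0
    for p in (p_pos, p_neg):
        if p > 0:                       # 0 * log2(0) is treated as 0
            result -= p * math.log2(p)
    return result

# The playing-tennis data has 9 positive and 5 negative examples:
print(entropy(["yes"] * 9 + ["no"] * 5))    # ≈ 0.940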

16
Entropy Example from the Dataset
17
Information Gain
  • Information Gain is the expected reduction
    in entropy caused by partitioning the instances
    according to a given attribute.
  • Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · E(Sv),
  • where Sv = {s ∈ S | A(s) = v}.
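A short Python sketch of this quantity, reusing the entropy helper from the previous sketch (the dictionary-per-example encoding is illustrative):

from collections import defaultdict

def information_gain(examples, attribute, target="Play"):
    """Gain(S, A): expected reduction in entropy when S is partitioned
    by the values of `attribute`. Each example is a dict of attribute values."""
    total = entropy([e[target] for e in examples])
    partitions = defaultdict(list)
    for e in examples:
        partitions[e[attribute]].append(e[target])
    remainder = sum(len(sv) / len(examples) * entropy(sv)
                    for sv in partitions.values())
    return total - remainder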
18
Example
Splitting on Outlook (Sunny / Overcast / Rain) partitions the
training set as follows:

Outlook = Sunny:
  Outlook  Temp  Hum     Wind   Play
  -----------------------------------
  Sunny    Hot   High    False  No
  Sunny    Hot   High    True   No
  Sunny    Mild  High    False  No
  Sunny    Cool  Normal  False  Yes
  Sunny    Mild  Normal  True   Yes

Outlook = Overcast:
  Outlook   Temp  Hum     Wind    Play
  -------------------------------------
  Overcast  Hot   High    Weak    Yes
  Overcast  Cool  Normal  Strong  Yes

Outlook = Rain:
  Outlook  Temp  Hum     Wind   Play
  -----------------------------------
  Rain     Mild  High    False  Yes
  Rain     Cool  Normal  False  Yes
  Rain     Cool  Normal  True   No
  Rain     Mild  Normal  False  Yes
  Rain     Mild  High    True   No

Which attribute should be tested here (at the Sunny branch)?

Gain(Ssunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0             = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918           = 0.019
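These numbers can be double-checked with the helpers sketched above (the dictionary encoding of Ssunny below is illustrative):

s_sunny = [
    {"Temp": "Hot",  "Hum": "High",   "Wind": "False", "Play": "no"},
    {"Temp": "Hot",  "Hum": "High",   "Wind": "True",  "Play": "no"},
    {"Temp": "Mild", "Hum": "High",   "Wind": "False", "Play": "no"},
    {"Temp": "Cool", "Hum": "Normal", "Wind": "False", "Play": "yes"},
    {"Temp": "Mild", "Hum": "Normal", "Wind": "True",  "Play": "yes"},
]
for attr in ("Hum", "Temp", "Wind"):
    print(attr, round(information_gain(s_sunny, attr), 3))
# prints approximately 0.971, 0.571 and 0.020 for Hum, Temp and Wind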
19
ID3 Algorithm
  • Informally
  • Determine the attribute with the highest
    information gain on the training set.
  • Use this attribute as the root, create a branch
    for each of the values the attribute can have.
  • For each branch, repeat the process with subset
    of the training set that is classified by that
    branch.
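A compact recursive sketch of this procedure, reusing the information_gain helper above (the nested-dict tree representation is an illustrative choice, not from the slides):

from collections import Counter

def id3(examples, attributes, target="Play"):
    """Return a decision tree encoded as {attribute: {value: subtree_or_label}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # all examples agree: make a leaf
        return labels[0]
    if not attributes:                   # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree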

20
Hypothesis Space Search in ID3
  • The hypothesis space is the set of all decision
    trees defined over the given set of attributes.
  • ID3's hypothesis space is a complete space, i.e.,
    the target description is there!
  • ID3 performs a simple-to-complex, hill-climbing
    search through this space.

21
Hypothesis Space Search in ID3
  • The evaluation function is the information gain.
  • ID3 maintains only a single current decision
    tree.
  • ID3 performs no backtracking in its search.
  • ID3 uses all training instances at each step of
    the search.

22
Posterior Class Probabilities
[Figure: the tree annotated with class counts and posterior class
probabilities at its leaves.]

Outlook = Sunny:    2 pos and 3 neg  →  P(pos) = 0.4, P(neg) = 0.6
Outlook = Overcast: 2 pos and 0 neg  →  P(pos) = 1.0, P(neg) = 0.0
Outlook = Rainy, Windy = False: 3 pos and 0 neg  →  P(pos) = 1.0, P(neg) = 0.0
Outlook = Rainy, Windy = True:  0 pos and 2 neg  →  P(pos) = 0.0, P(neg) = 1.0
23
Overfitting
  • Definition: Given a hypothesis space H, a
    hypothesis h ∈ H is said to overfit the training
    data if there exists some hypothesis h' ∈ H such
    that h has a smaller error than h' over the
    training instances, but h' has a smaller error
    than h over the entire distribution of instances.

24
Reasons for Overfitting
[Figure: the playing-tennis decision tree — Outlook with branches
Sunny → Humidity (high = no, normal = yes), Overcast → yes, and
Rainy → Windy (false = yes, true = no).]

  • Noisy training instances. Consider a noisy
    training example:
  • Outlook = Sunny, Temp = Hot, Humidity = Normal,
    Wind = True, PlayTennis = No
  • This instance conflicts with the training instances:
  • Outlook = Sunny, Temp = Cool, Humidity = Normal,
    Wind = False, PlayTennis = Yes
  • Outlook = Sunny, Temp = Mild, Humidity = Normal,
    Wind = True, PlayTennis = Yes

25
Reasons for Overfitting
[Figure: the tree after the noisy instance is added — the
Humidity = Normal branch under Sunny no longer ends in a single
"yes" leaf but grows further tests on Windy and Temp.]

Outlook = Sunny, Temp = Hot, Humidity = Normal,
Wind = True, PlayTennis = No
Outlook = Sunny, Temp = Cool, Humidity = Normal,
Wind = False, PlayTennis = Yes
Outlook = Sunny, Temp = Mild, Humidity = Normal,
Wind = True, PlayTennis = Yes
26
Reasons for Overfitting
  • A small number of instances is associated with the
    leaf nodes. In this case it is possible for
    coincidental regularities to occur that are
    unrelated to the actual target concept.

27
Approaches to Avoiding Overfitting
  • Pre-pruning: stop growing the tree earlier,
    before it reaches the point where it perfectly
    classifies the training data.
  • Post-pruning: allow the tree to overfit the data,
    and then post-prune the tree.

28
Pre-pruning
  • It is difficult to decide when to stop growing
    the tree.
  • A possible scenario is to stop when a leaf
    node gets fewer than m training instances. Here
    is an example for m = 5.

[Figure: splitting on Outlook with m = 5 — the branches Sunny,
Overcast and Rainy receive only a few training instances
(counts 2, 3, 2, 2, 3 in the figure), so growing stops and the
branches become leaves labelled no, ?, and yes.]
29
Validation Set
  • A validation set is a set of instances used to
    evaluate the utility of nodes in decision trees.
    The validation set has to be chosen so that it is
    unlikely to suffer from the same errors or
    fluctuations as the training set.
  • Usually, before pruning, the training data is split
    randomly into a growing set and a validation set.

30
Reduced-Error Pruning
  • Split data into growing and validation sets.
  • Pruning a decision node d consists of
  • removing the subtree rooted at d.
  • making d a leaf node.
  • assigning d the most common classification of the
    training instances associated with d.

[Figure: the unpruned tree — Outlook with branches Sunny → Humidity
(high = no, normal = yes), Overcast → yes, and Rainy → Windy
(false = yes, true = no); the two Humidity leaves cover 3 and 2
instances respectively.]

Accuracy of the tree on the validation set is 90%.
31
Reduced-Error Pruning
  • Split data into growing and validation sets.
  • Pruning a decision node d consists of
  • removing the subtree rooted at d.
  • making d a leaf node.
  • assigning d the most common classification of the
    training instances associated with d.

[Figure: the pruned tree — the Humidity subtree under Sunny is
replaced by a "no" leaf; Overcast → yes; Rainy → Windy
(false = yes, true = no).]

Accuracy of the tree on the validation set is 92.4%.
32
Reduced-Error Pruning
  • Split data into growing and validation sets.
  • Pruning a decision node d consists of
  • removing the subtree rooted at d.
  • making d a leaf node.
  • assigning d the most common classification of the
    training instances associated with d.
  • Do until further pruning is harmful
  • Evaluate impact on validation set of pruning each
    possible node (plus those below it).
  • Greedily remove the one that most improves
    validation set accuracy.

[Figure: the same pruned tree — Sunny → no, Overcast → yes,
Rainy → Windy (false = yes, true = no).]

Accuracy of the tree on the validation set is 92.4%.
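The slides describe a global greedy search over all prunable nodes; the sketch below is a simpler bottom-up variant of the same reduced-error pruning idea, assuming the nested-dict tree representation from the id3 sketch earlier (all names are illustrative):

from collections import Counter

def classify(tree, instance, default="yes"):
    """Walk a nested-dict tree (as built by the id3 sketch); return a class label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(instance[attr], default)
    return tree

def reduced_error_prune(tree, grow, val, target="Play"):
    """Bottom-up variant of reduced-error pruning.
    `grow` / `val` are the growing- and validation-set examples that reach
    this node; a subtree is replaced by a majority-class leaf whenever that
    does not hurt accuracy on the validation examples reaching it."""
    if not isinstance(tree, dict) or not grow:
        return tree                                  # leaf, or nothing to decide on
    attr = next(iter(tree))
    for value, subtree in list(tree[attr].items()):
        tree[attr][value] = reduced_error_prune(
            subtree,
            [e for e in grow if e[attr] == value],
            [e for e in val if e[attr] == value],
            target)
    majority = Counter(e[target] for e in grow).most_common(1)[0][0]
    leaf_hits = sum(e[target] == majority for e in val)
    tree_hits = sum(classify(tree, e) == e[target] for e in val)
    if leaf_hits >= tree_hits:                       # pruning does not hurt: prune
        return majority
    return tree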
33
Reduced Error Pruning Example
34
Rule Post-Pruning
  • Convert tree to equivalent set of rules.
  • Prune each rule independently of others.
  • Sort final rules by their estimated accuracy, and
    consider them in this sequence when classifying
    subsequent instances.


IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...

[Figure: the playing-tennis decision tree from which these rules are
read off, one rule per root-to-leaf path.]
35
Continuous Valued Attributes
  • Create a set of discrete attributes to test the
    continuous ones.
  • Apply information gain in order to choose the
    best attribute.
  • Temperature:  40   48   60   72   80   90
  • PlayTennis:   No   No   Yes  Yes  Yes  No

Candidate thresholds (midpoints where the class changes):
Temp > 54    Temp > 85
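A small sketch of this discretisation step, reusing the entropy helper from earlier (function names are illustrative):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def best_threshold(values, labels):
    """Pick the threshold whose binary split gives the highest information gain."""
    def gain(t):
        left  = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        return entropy(labels) - (len(left) / len(labels)) * entropy(left) \
                               - (len(right) / len(labels)) * entropy(right)
    return max(candidate_thresholds(values, labels), key=gain)

temps = [40, 48, 60, 72, 80, 90]
play  = ["no", "no", "yes", "yes", "yes", "no"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
print(best_threshold(temps, play))         # 54.0 (higher gain than 85.0)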
36
Missing Attribute Values
  • Strategies:
  • Assign the most common value of A among the other
    instances belonging to the same concept.
  • If node n tests the attribute A, assign the most
    common value of A among the other instances sorted to
    node n.
  • If node n tests the attribute A, assign a
    probability to each of the possible values of A.
    These probabilities are estimated from the
    observed frequencies of the values of A, and they
    are used when computing the information gain.
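A tiny sketch of the first strategy, most-common-value imputation (the None encoding of missing values and all names are illustrative):

from collections import Counter

def impute_most_common(examples, attribute, target="Play"):
    """Fill missing values (represented here as None) of `attribute`
    with its most common value among instances of the same class."""
    imputed = []
    for e in examples:
        if e[attribute] is None:
            same_class = [x[attribute] for x in examples
                          if x[target] == e[target] and x[attribute] is not None]
            if same_class:                           # at least one observed value
                e = {**e, attribute: Counter(same_class).most_common(1)[0][0]}
        imputed.append(e)
    return imputed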

37
Summary Points
  • Decision tree learning provides a practical
    method for concept learning.
  • ID3-like algorithms search complete hypothesis
    space.
  • The inductive bias of decision trees is
    preference (search) bias.
  • Overfitting the training data is an important
    issue in decision tree learning.
  • A large number of extensions of the ID3 algorithm
    have been proposed for overfitting avoidance,
    handling missing attributes, handling numerical
    attributes, etc.

38
Learning Decision Rules
  • Decision Rules
  • Basic Sequential Covering Algorithm
  • Learn-One-Rule Procedure
  • Pruning

39
Definition of Decision Rules
Definition: Decision rules are rules with the
following form: if <conditions> then class = C.

Example: If you run the Prism algorithm from Weka
on the weather data you will get the following
set of decision rules:
  if outlook = overcast then PlayTennis = yes
  if humidity = normal and windy = FALSE then PlayTennis = yes
  if temperature = mild and humidity = normal then PlayTennis = yes
  if outlook = rainy and windy = FALSE then PlayTennis = yes
  if outlook = sunny and humidity = high then PlayTennis = no
  if outlook = rainy and windy = TRUE then PlayTennis = no
40
Why Decision Rules?
  • Decision rules are more compact.
  • Decision rules are more understandable.

Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}.
The rules are:
  if X = 1 and Y = 1 then 1
  if Z = 1 and W = 1 then 1
  otherwise 0
41
Why Decision Rules?
42
How to Learn Decision Rules?
  • We can convert trees to rules
  • We can use specific rule-learning methods

43
Sequential Covering Algorithms
function LearnRuleSet(Target, Attrs, Examples, Threshold):
    LearnedRules := {}
    Rule := LearnOneRule(Target, Attrs, Examples)
    while performance(Rule, Examples) > Threshold do
        LearnedRules := LearnedRules ∪ {Rule}
        Examples := Examples \ {examples covered by Rule}
        Rule := LearnOneRule(Target, Attrs, Examples)
    sort LearnedRules according to performance
    return LearnedRules
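The same loop as a Python sketch (the helpers learn_one_rule, performance and covers are assumed to be supplied by the caller; all names are illustrative):

def learn_rule_set(examples, learn_one_rule, performance, covers, threshold):
    """Sequential covering (see the pseudocode above).
    learn_one_rule(examples) -> rule
    performance(rule, examples) -> float
    covers(rule, example) -> bool
    """
    learned = []                                   # list of (rule, score) pairs
    rule = learn_one_rule(examples)
    while examples and performance(rule, examples) > threshold:
        learned.append((rule, performance(rule, examples)))
        examples = [e for e in examples if not covers(rule, e)]
        if not examples:
            break
        rule = learn_one_rule(examples)
    learned.sort(key=lambda rs: rs[1], reverse=True)   # sort by performance
    return [rule for rule, _ in learned]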
44
Illustration
[Figure: a set of positive (+) and negative (-) training instances
in instance space, before any rule has been learned.]
45
Illustration
[Figure: the same instance space at the next step of the sequential
covering process.]
46
Illustration
[Figure: the same instance space at a further step of the sequential
covering process.]
47
Illustration
[Figure: the instance space with the first learned rule covering a
group of positive instances.]

IF A ∧ B THEN pos
48
Illustration
[Figure: the instances covered by the first rule are removed and
covering continues on the remaining examples.]

IF A ∧ B THEN pos
49
Illustration
[Figure: sequential covering illustration (continued).]

IF A ∧ B THEN pos
IF true THEN pos
IF C THEN pos
IF C ∧ D THEN pos
50
Learning One Rule
  • To learn one rule we use one of the strategies
    below
  • Top-down
  • Start with maximally general rule
  • Add literals one by one
  • Bottom-up
  • Start with maximally specific rule
  • Remove literals one by one
  • Combination of top-down and bottom-up
  • Candidate-elimination algorithm.

51
Bottom-up vs. Top-down
Bottom-up: typically more specific rules.

[Figure: the same instance space, contrasting the two search
directions.]

Top-down: typically more general rules.
52
Learning One Rule
  • Bottom-up
  • Example-driven (AQ family).
  • Top-down
  • Generate-then-Test (CN-2).

53
Example of Learning One Rule
54
Heuristics for Learning One Rule
  • When is a rule good?
  • High accuracy.
  • Less important: high coverage.
  • Possible evaluation functions:
  • Relative frequency: nc/n, where nc is the number
    of correctly classified instances and n is the
    number of instances covered by the rule.
  • m-estimate of accuracy: (nc + m·p)/(n + m), where nc
    is the number of correctly classified instances,
    n is the number of instances covered by the rule,
    p is the prior probability of the class predicted
    by the rule, and m is the weight of p.
  • Entropy.
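Both frequency-based evaluation functions are one-liners; a small illustrative sketch (the default m = 2 and the prior 9/14 from the weather data are just example values):

def relative_frequency(n_correct, n_covered):
    """nc / n"""
    return n_correct / n_covered

def m_estimate(n_correct, n_covered, prior, m=2.0):
    """(nc + m*p) / (n + m): shrinks the raw accuracy towards the class prior;
    m controls how strongly."""
    return (n_correct + m * prior) / (n_covered + m)

print(relative_frequency(3, 4))          # 0.75
print(m_estimate(3, 4, prior=9 / 14))    # ≈ 0.714 with m = 2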

55
How to Arrange the Rules
  • The rules are ordered according to the order in which
    they have been learned. This order is used for
    instance classification.
  • The rules are ordered according to their
    accuracy. This order is used for instance
    classification.
  • The rules are not ordered, but there exists a
    strategy for how to apply them (e.g., an
    instance covered by conflicting rules gets the
    classification of the rule that correctly
    classifies more training instances; if an instance
    is not covered by any rule, it gets the
    classification of the majority class represented
    in the training data).

56
Approaches to Avoiding Overfitting
  • Pre-pruning: stop learning the decision rules
    before they reach the point where they perfectly
    classify the training data.
  • Post-pruning: allow the decision rules to overfit
    the training data, and then post-prune the rules.

57
Post-Pruning
  • 1. Split instances into Growing Set and Pruning Set.
  • 2. Learn set SR of rules using Growing Set.
  • 3. Find the best simplification BSR of SR.
  • 4. while Accuracy(BSR, Pruning Set) >
       Accuracy(SR, Pruning Set) do
  •    4.1 SR := BSR
  •    4.2 Find the best simplification BSR of SR.
  • 5. return BSR
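The same loop, mirrored in Python (a sketch; learn_rules, best_simplification and accuracy are assumed caller-supplied helpers):

def post_prune(learn_rules, best_simplification, accuracy, growing_set, pruning_set):
    """Rule post-pruning as in steps 1-5 above (step 1, the data split, is
    assumed to have produced growing_set and pruning_set already)."""
    sr = learn_rules(growing_set)                                    # step 2
    bsr = best_simplification(sr)                                    # step 3
    while accuracy(bsr, pruning_set) > accuracy(sr, pruning_set):    # step 4
        sr = bsr                                                     # 4.1
        bsr = best_simplification(sr)                                # 4.2
    return bsr                                                       # step 5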

58
Incremental Reduced Error Pruning
[Figure: post-pruning vs. incremental reduced error pruning — how
the data sets D1, D2 (D21, D22) and D3 are used in each scheme.]
59
Incremental Reduced Error Pruning
  • 1. Split Training Set into Growing Set and
       Validation Set
  • 2. Learn rule R using Growing Set
  • 3. Prune the rule R using Validation Set
  • 4. if performance(R, Training Set) > Threshold
  •    4.1 Add R to Set of Learned Rules
  •    4.2 Remove from Training Set the instances
       covered by R
  •    4.3 go to 1
  • 5. else return Set of Learned Rules

60
Summary Points
  • Decision rules are easier for human comprehension
    than decision trees.
  • Decision rules have simpler decision boundaries
    than decision trees.
  • Decision rules are learned by sequential covering
    of the training instances.

61
Model Evaluation Techniques
  • Evaluation on the training set: too optimistic.

[Figure: the classifier is trained and evaluated on the same
training set.]
62
Model Evaluation Techniques
  • Hold-out method: the estimate depends on the make-up
    of the test set.

[Figure: the data is split into a training set used to build the
classifier and a test set used to evaluate it.]

  • To improve the precision of the hold-out method,
    it is repeated many times.

63
Model Evaluation Techniques
  • k-fold Cross-Validation: the data is split into k folds;
    each fold serves once as the test set while the remaining
    k-1 folds are used for training, and the k accuracy
    estimates are averaged.

[Figure: the data partitioned into k folds, with the classifier
trained and tested k times.]
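A minimal sketch of the k-fold procedure (train and evaluate stand in for whatever learner is being assessed; names are illustrative):

def k_fold_cv(examples, k, train, evaluate):
    """Average accuracy over k folds.
    train(examples) -> model
    evaluate(model, examples) -> accuracy in [0, 1]
    """
    folds = [examples[i::k] for i in range(k)]     # simple round-robin split
    scores = []
    for i in range(k):
        test = folds[i]
        training = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train(training)
        scores.append(evaluate(model, test))
    return sum(scores) / k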
64
Intro to Weka
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}
@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
...