Title: Rule Induction
1 Rule Induction
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
 Nada Lavrač
 Department of Knowledge Technologies
 Jožef Stefan Institute
 Ljubljana, Slovenia
2 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Predictive vs. Descriptive DM summary
3 Types of DM tasks
 Predictive DM
  Classification (learning of rulesets, decision trees, ...)
  Prediction and estimation (regression)
 Predictive relational DM (RDM, ILP)
 Descriptive DM
 description and summarization
 dependency analysis (association rule learning)
 discovery of properties and constraints
 segmentation (clustering)
 subgroup discovery
 Text, Web and image analysis
4 Predictive vs. descriptive induction
 Predictive induction: inducing classifiers for solving classification and prediction tasks
  Classification rule learning, decision tree learning, ...
  Bayesian classifier, ANN, SVM, ...
  Data analysis through hypothesis generation and testing
 Descriptive induction: discovering interesting regularities in the data, uncovering patterns, ... for solving KDD tasks
  Symbolic clustering, association rule learning, subgroup discovery, ...
  Exploratory data analysis
5 Predictive vs. descriptive induction: A rule learning perspective
 Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
 Descriptive induction: discovers individual rules describing interesting regularities in the data
 Therefore: different goals, different heuristics, different evaluation criteria
6 Supervised vs. unsupervised learning: A rule learning perspective
 Supervised learning: rules are induced from labeled instances (training examples with class assignment)
  usually used in predictive induction
 Unsupervised learning: rules are induced from unlabeled instances (training examples with no class assignment)
  usually used in descriptive induction
 Exception: subgroup discovery
  discovers individual rules describing interesting regularities in the data, induced from labeled examples
7 Subgroups vs. classifiers
 Classifiers
  Classification rules aim at pure subgroups
  A set of rules forms a domain model
 Subgroups
  Rules describing subgroups aim at a significantly higher proportion of positives
  Each rule is an independent chunk of knowledge
 Link: SD can be viewed as a form of cost-sensitive classification
8 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Predictive vs. Descriptive DM summary
9 Predictive DM: Classification
 data are objects, characterized with attributes
 objects belong to different classes (discrete labels)
 given the objects described by attribute values, induce a model to predict the different classes
 decision trees, if-then rules, ...
10 Illustrative example: Contact lenses data
11 Decision tree for contact lenses recommendation
12 Illustrative example: Customer data
13 Induced decision trees
[Figure: two decision trees induced from the customer data, splitting on Income (≤/> 102000), Age (≤/> 58), Gender (female/male) and Age (≤/> 49), with yes/no leaves]
14 Predictive DM: Estimation
 often referred to as regression
 data are objects, characterized with attributes (discrete or continuous); classes of objects are continuous (numeric)
 given objects described with attribute values, induce a model to predict the numeric class value
 regression trees, linear and logistic regression, ANN, kNN, ...
15 Illustrative example: Customer data
16 Customer data: regression tree
[Figure: regression tree splitting on Income (≤/> 108000) and Age (≤/> 42.5), with numeric leaves 12000, 16500 and 26700]
17 Predicting algal biomass: regression tree
[Figure: regression tree splitting on Month (Jan.-June / July-Dec.), Ptot (≤/> 9.34, ≤/> 9.1, ≤/> 5.9) and Si (≤/> 10.1, ≤/> 2.13), with leaves giving predicted biomass as mean ± standard deviation: 2.34±1.65, 4.32±2.07, 1.28±1.08, 2.08±0.71, 2.97±1.09, 0.70±0.34, 1.15±0.21]
18 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Predictive vs. Descriptive DM summary
19 Ruleset representation
 Rule base is a disjunctive set of conjunctive rules
 Standard form of rules: IF Condition THEN Class
  Class IF Conditions
  Class ← Conditions
 Examples:
  IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
  IF Outlook=Overcast THEN PlayTennis=Yes
  IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
 Form of CN2 rules: IF Conditions THEN MajClass [ClassDistr]
 Rule base: {R1, R2, R3, ..., DefaultRule}
20 Classification Rule Learning
 Rule set representation
 Two rule learning approaches
 Learn decision tree, convert to rules
 Learn set/list of rules
 Learning an unordered set of rules
 Learning an ordered list of rules
 Heuristics, overfitting, pruning
21 Decision tree vs. rule learning: Splitting vs. covering
 Splitting (ID3, C4.5, J48, See5)
 Covering (AQ, CN2)
22 PlayTennis: Training examples
23 PlayTennis: Using a decision tree for classification
 Outlook
  Sunny → Humidity: High → No, Normal → Yes
  Overcast → Yes
  Rain → Wind: Strong → No, Weak → Yes
 Is Saturday morning OK for playing tennis?
  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
  PlayTennis = No, because Outlook=Sunny ∧ Humidity=High
24 PlayTennis: Converting a tree to rules
 IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
 IF Outlook=Overcast THEN PlayTennis=Yes
 IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
 IF Outlook=Sunny ∧ Humidity=High THEN PlayTennis=No
 IF Outlook=Rain ∧ Wind=Strong THEN PlayTennis=No
25 Contact lens classification rules
 tear production=reduced → lenses=NONE [S=0, H=0, N=12]
 tear production=normal ∧ astigmatism=no → lenses=SOFT [S=5, H=0, N=1]
 tear production=normal ∧ astigmatism=yes ∧ spect. pre.=myope → lenses=HARD [S=0, H=3, N=2]
 tear production=normal ∧ astigmatism=yes ∧ spect. pre.=hypermetrope → lenses=NONE [S=0, H=1, N=2]
 DEFAULT lenses=NONE
26 Unordered rulesets
 rule Class IF Conditions is learned by first determining Class and then Conditions
 NB: ordered sequence of classes C1, ..., Cn in RuleSet
 But unordered (independent) execution of rules when classifying a new instance: all rules are tried and the predictions of those covering the example are collected; voting is used to obtain the final classification
 if no rule fires, then DefaultClass (majority class in E)
27 Contact lens decision list
 Ordered (order dependent) rules:
 IF tear production=reduced THEN lenses=NONE
 ELSE /* tear production=normal */
  IF astigmatism=no THEN lenses=SOFT
  ELSE /* astigmatism=yes */
   IF spect. pre.=myope THEN lenses=HARD
   ELSE /* spect. pre.=hypermetrope */ lenses=NONE
28 Ordered set of rules: if-then-else decision lists
 rule Class IF Conditions is learned by first determining Conditions and then Class
 Notice: mixed sequence of classes C1, ..., Cn in RuleBase
 But ordered execution when classifying a new instance: rules are sequentially tried and the first rule that fires (covers the example) is used for classification
 Decision list {R1, R2, R3, ..., D}: rules Ri are interpreted as if-then-else rules
 If no rule fires, then DefaultClass (majority class in Ecur)
29 Original covering algorithm (AQ, Michalski 1969, 1986)
 Basic covering algorithm:
  for each class Ci do
   Ei := Pi ∪ Ni (Pi positive, Ni negative examples)
   RuleBase(Ci) := empty
   repeat {learn-set-of-rules}
    learn-one-rule R covering some positive examples and no negatives
    add R to RuleBase(Ci)
    delete from Pi all positive examples covered by R
   until Pi = empty
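A minimal Python sketch of this covering loop (an illustration, not the original AQ code). Examples are assumed to be (attribute-dict, class) pairs, and learn_one_rule is an assumed helper that always returns a rule covering at least one remaining positive example and no negatives:

```python
def covers(rule, example):
    """A rule is a list of (attribute, value) tests; it covers an
    example if all tests hold for the example's attributes."""
    attrs, _ = example
    return all(attrs.get(a) == v for a, v in rule)

def covering(examples, classes, learn_one_rule):
    """Basic covering: per class, learn rules until all positives are covered."""
    rule_base = {}
    for ci in classes:
        pos = [e for e in examples if e[1] == ci]        # Pi
        neg = [e for e in examples if e[1] != ci]        # Ni
        rule_base[ci] = []
        while pos:                                       # until Pi = empty
            r = learn_one_rule(pos, neg)                 # covers positives, no negatives
            rule_base[ci].append(r)
            pos = [e for e in pos if not covers(r, e)]   # delete covered positives
    return rule_base
```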
30 Learning unordered set of rules (CN2, Clark and Niblett)
 RuleBase := empty
 for each class Ci do
  Ei := Pi ∪ Ni, RuleSet(Ci) := empty
  repeat {learn-set-of-rules}
   R := (Class = Ci IF Conditions), Conditions := true
   repeat {learn-one-rule} R := (Class = Ci IF Conditions AND Cond), by general-to-specific beam search for the best R
   until stopping criterion is satisfied (no negatives covered, or Performance(R) < ThresholdR)
   add R to RuleSet(Ci)
   delete from Pi all positive examples covered by R
  until stopping criterion is satisfied (all positives covered, or Performance(RuleSet(Ci)) < ThresholdRS)
 RuleBase := RuleBase ∪ RuleSet(Ci)
31 Learn-one-rule: Greedy vs. beam search
 learn-one-rule by greedy general-to-specific search: at each step select the best descendant, no backtracking
 beam search: maintain a list of k best candidates; at each step, descendants (specializations) of each of these k candidates are generated, and the resulting set is again reduced to the k best candidates
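A sketch of learn-one-rule as general-to-specific beam search, under the same example representation as above. The attribute_values argument (a dict mapping each attribute to its possible values) and the use of plain accuracy as the quality measure are illustrative assumptions:

```python
def covers(rule, example):
    attrs, _ = example
    return all(attrs.get(a) == v for a, v in rule)

def accuracy(rule, pos, neg):
    """p(Cl|Cond) estimated by relative frequency on the covered examples."""
    p = sum(covers(rule, e) for e in pos)
    n = sum(covers(rule, e) for e in neg)
    return p / (p + n) if p + n else 0.0

def learn_one_rule(pos, neg, attribute_values, k=5):
    beam, best = [[]], []                 # start from the empty body (true)
    while beam:
        # specialize every rule in the beam by one attribute=value test
        candidates = [r + [(a, v)]
                      for r in beam
                      for a, vals in attribute_values.items() if a not in dict(r)
                      for v in vals]
        candidates.sort(key=lambda r: accuracy(r, pos, neg), reverse=True)
        beam = candidates[:k]             # keep only the k best candidates
        if beam and accuracy(beam[0], pos, neg) > accuracy(best, pos, neg):
            best = beam[0]
        if best and accuracy(best, pos, neg) == 1.0:   # no negatives covered
            break
    return best
```

With k = 1 this degenerates to the greedy search described on the slide.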
32 Illustrative example: Contact lenses data
33 Learn-one-rule as heuristic search
 Lenses=hard IF true [S=5, H=4, N=15]
 One-condition refinements:
  Lenses=hard IF Astigmatism=no [S=5, H=0, N=7]
  Lenses=hard IF Astigmatism=yes [S=0, H=4, N=8]
  Lenses=hard IF Tear prod.=reduced [S=0, H=0, N=12]
  Lenses=hard IF Tear prod.=normal [S=5, H=4, N=3]
  ...
 Refinements of Tear prod.=normal:
  Lenses=hard IF Tear prod.=normal AND Spect. pre.=myope [S=2, H=3, N=1]
  Lenses=hard IF Tear prod.=normal AND Spect. pre.=hypermetrope [S=3, H=1, N=2]
  Lenses=hard IF Tear prod.=normal AND Astigmatism=yes [S=0, H=4, N=2]
  Lenses=hard IF Tear prod.=normal AND Astigmatism=no [S=5, H=0, N=1]
34 Rule learning summary
 Hypothesis construction: find a set of n rules
  usually simplified by n separate rule constructions
 Rule construction: find a pair (Class, Cond)
  select rule head (class) and construct rule body, or
  construct rule body and assign rule head (in ordered algorithms)
 Body construction: find a set of m features
  usually simplified by adding one feature at a time to the rule body
35 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Predictive vs. Descriptive DM summary
36 Evaluating rules and rulesets
 Predictive evaluation measures: maximizing accuracy, minimizing Error = 1 - Accuracy, avoiding overfitting
 Estimating accuracy (percentage of correct classifications):
  on the training set
  on unseen / testing instances
  cross validation, leave-one-out, ...
 Other measures: comprehensibility (size), information contents (information score), significance, ...
 Other measures of rule interestingness for descriptive induction
37 n-fold cross validation
 A method for accuracy estimation of classifiers
 Partition set D into n disjoint, almost equally-sized folds Ti, where ∪i Ti = D
 for i = 1, ..., n do
  form a training set out of n-1 folds: Di = D \ Ti
  induce classifier Hi from examples in Di
  use fold Ti for testing the accuracy of Hi
 Estimate the accuracy of the classifier by averaging accuracies over the n folds Ti
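A minimal sketch of this procedure, assuming induce trains a classifier on a list of (x, y) pairs and returns a function mapping x to a predicted class (and that the data holds at least n examples):

```python
import random

def cross_validate(data, induce, n=10, seed=0):
    data = data[:]                               # shuffled copy
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]       # n disjoint folds Ti, union = D
    accuracies = []
    for i in range(n):
        test = folds[i]                          # Ti
        train = [e for j, f in enumerate(folds) if j != i for e in f]  # Di = D \ Ti
        h = induce(train)                        # classifier Hi
        accuracies.append(sum(h(x) == y for x, y in test) / len(test))
    return sum(accuracies) / n                   # average accuracy over n folds
```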
38-41 [Figures: 3-fold cross validation, step by step. The dataset D is partitioned into folds T1, T2, T3; each training set Di = D \ Ti is formed, a classifier is induced from Di, and the held-out fold Ti is used for testing]
42 Overfitting and accuracy
 Typical relation between hypothesis size and accuracy
 Question: how to prune optimally?
43 Overfitting
 Consider the error of hypothesis h over
  training data T: ErrorT(h)
  entire distribution D of data: ErrorD(h)
 Hypothesis h ∈ H overfits training data T if there is an alternative hypothesis h' ∈ H such that
  ErrorT(h) < ErrorT(h'), and
  ErrorD(h) > ErrorD(h')
 Prune a hypothesis (decision tree, ruleset) to avoid overfitting T
44 Avoiding overfitting
 Decision trees
  Pre-pruning (forward pruning): stop growing the tree, e.g., when a data split is not statistically significant or too few examples are in a split
  Post-pruning: grow full tree, then post-prune
 Rulesets
  Pre-pruning (forward pruning): stop growing the rule, e.g., when too few examples are covered by the rule
  Post-pruning: construct a full ruleset, then prune
45 Rule post-pruning (Quinlan 1993)
 Very frequently used method, e.g., in C4.5
 Procedure:
  grow a full tree (allowing overfitting)
  convert the tree to an equivalent set of rules
  prune each rule independently of the others
  sort final rules into a desired sequence for use
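A sketch of the "prune each rule independently" step. The greedy condition-dropping and the use of a held-out validation set here are illustrative assumptions (C4.5 itself prunes against a pessimistic accuracy estimate on the training data):

```python
def covers(rule, example):
    attrs, _ = example
    return all(attrs.get(a) == v for a, v in rule)

def rule_accuracy(rule, target, validation):
    covered = [e for e in validation if covers(rule, e)]
    if not covered:
        return 0.0
    return sum(cls == target for _, cls in covered) / len(covered)

def prune_rule(rule, target, validation):
    """Greedily drop conditions while estimated accuracy does not decrease."""
    improved = True
    while improved and rule:
        improved = False
        base = rule_accuracy(rule, target, validation)
        for cond in list(rule):
            shorter = [c for c in rule if c != cond]
            if rule_accuracy(shorter, target, validation) >= base:
                rule, improved = shorter, True
                break
    return rule
```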
46 Performance metrics
 Rule evaluation measures, aimed at avoiding overfitting
 Heuristics for guiding the search
 Heuristics for stopping the search
 Confusion matrix / contingency table for the evaluation of individual rules and rulesets
 Area under ROC evaluation (employing the confusion matrix information)
47 Learn-one-rule: PlayTennis training examples
48 Learn-one-rule as search: PlayTennis example
 Play tennis = yes IF true
 refinements:
  Play tennis = yes IF Wind=weak
  Play tennis = yes IF Wind=strong
  Play tennis = yes IF Humidity=normal
  Play tennis = yes IF Humidity=high
  ...
 refinements of Humidity=normal:
  Play tennis = yes IF Humidity=normal AND Wind=weak
  Play tennis = yes IF Humidity=normal AND Wind=strong
  Play tennis = yes IF Humidity=normal AND Outlook=rain
  Play tennis = yes IF Humidity=normal AND Outlook=sunny
49 Learn-one-rule as heuristic search: PlayTennis example
 Play tennis = yes IF true [9+, 5-] (14)
 refinements:
  Play tennis = yes IF Wind=weak [6+, 2-] (8)
  Play tennis = yes IF Wind=strong [3+, 3-] (6)
  Play tennis = yes IF Humidity=normal [6+, 1-] (7)
  Play tennis = yes IF Humidity=high [3+, 4-] (7)
  ...
 refinements of Humidity=normal:
  Play tennis = yes IF Humidity=normal AND Wind=weak
  Play tennis = yes IF Humidity=normal AND Wind=strong
  Play tennis = yes IF Humidity=normal AND Outlook=rain
  Play tennis = yes IF Humidity=normal AND Outlook=sunny [2+, 0-] (2)
50 Heuristics for learn-one-rule: PlayTennis example
 PlayTennis = yes [9+, 5-] (14)
 PlayTennis = yes ← Wind=weak [6+, 2-] (8)
 PlayTennis = yes ← Wind=strong [3+, 3-] (6)
 PlayTennis = yes ← Humidity=normal [6+, 1-] (7)
 PlayTennis = yes ← Humidity=normal ∧ Outlook=sunny [2+, 0-] (2)
 Estimating accuracy with probability:
  A(Ci ← Cond) = p(Ci | Cond)
 Estimating probability with relative frequency:
  covered positive examples / all covered examples
  [6+, 1-] (7): 6/7 = 0.86; [2+, 0-] (2): 2/2 = 1
51 Probability estimates
 Relative frequency of covered positive examples:
  p(Class | Cond) = n(Class.Cond) / n(Cond)
  problems with small samples
 Laplace estimate:
  p(Class | Cond) = (n(Class.Cond) + 1) / (n(Cond) + k)
  assumes uniform prior distribution over the k classes
 m-estimate:
  p(Class | Cond) = (n(Class.Cond) + m · pa(Class)) / (n(Cond) + m)
  special case: the Laplace estimate, for pa(Class) = 1/k and m = k
  takes into account prior probabilities pa(Class) instead of the uniform distribution
  independent of the number of classes k
  m is domain dependent (more noise, larger m)
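The three estimates side by side, as a small sketch (n_c = covered examples of the class, n = all covered examples):

```python
def relative_frequency(n_c, n):
    return n_c / n

def laplace(n_c, n, k):
    """Laplace estimate: assumes a uniform prior over the k classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, m, p_a):
    """m-estimate: prior probability p_a(Class), weighted by m."""
    return (n_c + m * p_a) / (n + m)

# A rule covering 2 examples, both of the class, in a 2-class problem:
# relative_frequency(2, 2) = 1.0, but laplace(2, 2, 2) = 0.75 -- the
# Laplace estimate is more cautious on small samples. With p_a = 1/k
# and m = k, m_estimate reduces to the Laplace estimate.
```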
52 Learn-one-rule: search heuristics
 Assume two classes (+, -) and learn rules for the + class (Cl). Search for specializations R' of one rule R = Cl ← Cond from the RuleBase.
 Expected classification accuracy: A(R) = p(Cl | Cond)
 Informativity (information needed to specify that an example covered by Cond belongs to Cl): I(R) = -log2 p(Cl | Cond)
 Accuracy gain (increase in expected accuracy): AG(R', R) = p(Cl | Cond') - p(Cl | Cond)
 Information gain (decrease in the information needed): IG(R', R) = log2 p(Cl | Cond') - log2 p(Cl | Cond)
 Weighted measures favoring more general rules: WAG, WIG
  WAG(R', R) = p(Cond')/p(Cond) · (p(Cl | Cond') - p(Cl | Cond))
 Weighted relative accuracy trades off coverage and relative accuracy: WRAcc(R) = p(Cond) · (p(Cl | Cond) - pa(Cl))
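The heuristics above, sketched as functions of covering counts, with probabilities estimated by relative frequencies (p, n: positives/negatives covered by the rule; P, N: all positives/negatives; arguments suffixed 2 belong to the specialization R'):

```python
from math import log2

def acc(p, n):                            # A(R) = p(Cl|Cond)
    return p / (p + n)

def info(p, n):                           # I(R) = -log2 p(Cl|Cond)
    return -log2(acc(p, n))

def acc_gain(p2, n2, p1, n1):             # AG(R', R)
    return acc(p2, n2) - acc(p1, n1)

def info_gain(p2, n2, p1, n1):            # IG(R', R)
    return log2(acc(p2, n2)) - log2(acc(p1, n1))

def weighted_acc_gain(p2, n2, p1, n1):    # WAG(R', R); p(Cond')/p(Cond) reduces to a count ratio
    return (p2 + n2) / (p1 + n1) * acc_gain(p2, n2, p1, n1)

def wracc(p, n, P, N):                    # WRAcc(R) = p(Cond)(p(Cl|Cond) - p(Cl))
    return (p + n) / (P + N) * (acc(p, n) - P / (P + N))
```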
53 What is high accuracy?
 Rule accuracy should be traded off against the default accuracy of the rule Cl ← true
  68% accuracy is OK if 20% of the examples in the training set belong to that class, but bad if 80% do
 Relative accuracy:
  RAcc(Cl ← Cond) = p(Cl | Cond) - p(Cl)
54 Weighted relative accuracy
 If a rule covers a single example, its accuracy is either 0% or 100%
  maximizing relative accuracy tends to produce many overly specific rules
 Weighted relative accuracy:
  WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl))
55 Weighted relative accuracy
 WRAcc is a fundamental rule evaluation measure
 WRAcc can be used if you want to assess both accuracy and significance
 WRAcc can be used if you want to compare rules with different heads and bodies; it is an appropriate measure for use in descriptive induction, e.g., association rule learning
56 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Subgroup discovery
 Association rule learning
 Predictive vs. Descriptive DM summary
57 Descriptive DM
 Often used for preliminary data analysis
 The user gets a feel for the data and its structure
 Aims at deriving descriptions of characteristics of the data
 Visualization and descriptive statistical techniques can be used
58 Descriptive DM
 Description
  Data description and summarization: describe elementary and aggregated data characteristics (statistics, ...)
 Dependency analysis
  describe associations, dependencies, ...
  discovery of properties and constraints
 Segmentation
  Clustering: separate objects into subsets according to distance and/or similarity (clustering, SOM, visualization, ...)
  Subgroup discovery: find unusual subgroups that are significantly different from the majority (deviation detection w.r.t. the overall class distribution)
59 Subgroup Discovery
 Given: a population of individuals and a property of individuals we are interested in
 Find: population subgroups that are statistically most interesting, e.g., are as large as possible and have the most unusual statistical (distributional) characteristics w.r.t. the property of interest
60 Subgroup interestingness
 Interestingness criteria:
  As large as possible
  Class distribution as different as possible from the distribution in the entire data set
  Significant
  Surprising to the user
  Non-redundant
  Simple
  Useful, actionable
61 Classification Rule Learning for Subgroup Discovery: Deficiencies
 Only the first few rules induced by the covering algorithm have sufficient support (coverage)
 Subsequent rules are induced from smaller and strongly biased example subsets (positive examples not covered by previously induced rules), which hinders their ability to detect population subgroups
 Ordered rules are induced and interpreted sequentially as an if-then-else decision list
62 CN2-SD: Adapting CN2 Rule Learning to Subgroup Discovery
 Weighted covering algorithm
 Weighted relative accuracy (WRAcc) search heuristic, with added example weights
 Probabilistic classification
 Evaluation with different interestingness measures
63 CN2-SD: CN2 Adaptations
 General-to-specific (beam) search for the best rules
 Rule quality measure:
  CN2: Laplace: Acc(Class ← Cond) = p(Class | Cond) = (nc + 1) / (nrule + k)
  CN2-SD: Weighted Relative Accuracy: WRAcc(Class ← Cond) = p(Cond) · (p(Class | Cond) - p(Class))
 Weighted covering approach (example weights)
 Significance testing (likelihood ratio statistic)
 Output: unordered rule sets (probabilistic classification)
64 CN2-SD: Weighted Covering
 Standard covering approach:
  covered examples are deleted from the current training set
 Weighted covering approach:
  weights are assigned to examples
  covered positive examples are re-weighted in all covering loop iterations: store the count i of how many times (with how many rules induced so far) a positive example has been covered, w(e, i), with w(e, 0) = 1
65 CN2-SD: Weighted Covering
 Additive weights: w(e, i) = 1/(i + 1)
  w(e, i) is the weight of a positive example e covered i times
 Multiplicative weights: w(e, i) = gamma^i, 0 < gamma < 1
  note: gamma = 1 → finds the same (first) rule again and again; gamma = 0 → behaves like standard CN2
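A sketch of the weighted covering loop with both weighting schemes. Here find_best_rule is an assumed helper that maximizes the weighted WRAcc heuristic (next slide) under the current example weights, and max_rules is an illustrative stopping criterion:

```python
def covers(rule, example):
    attrs, _ = example
    return all(attrs.get(a) == v for a, v in rule)

def weighted_covering(pos, neg, find_best_rule,
                      scheme="additive", gamma=0.7, max_rules=10):
    counts = {id(e): 0 for e in pos}         # i: how many rules cover e so far
    def weight(e):
        i = counts[id(e)]
        return 1.0 / (i + 1) if scheme == "additive" else gamma ** i
    rules = []
    for _ in range(max_rules):
        weights = {id(e): weight(e) for e in pos}
        r = find_best_rule(pos, neg, weights)
        if r is None:
            break
        rules.append(r)
        for e in pos:                        # re-weight instead of deleting
            if covers(r, e):
                counts[id(e)] += 1
    return rules
```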
66 CN2-SD: Weighted WRAcc Search Heuristic
 Weighted relative accuracy (WRAcc) search heuristic, with added example weights
 WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl))
 increased coverage, decreased number of rules, approximately equal accuracy (PKDD-2000)
67 CN2-SD: Weighted WRAcc Search Heuristic
 In the WRAcc computation, probabilities are estimated with relative frequencies, adapted to example weights:
 WRAcc(Cl ← Cond) = p(Cond) · (p(Cl | Cond) - p(Cl)) = n'(Cond)/N' · (n'(Cl.Cond)/n'(Cond) - n'(Cl)/N')
  N' = sum of the weights of all examples
  n'(Cond) = sum of the weights of all covered examples
  n'(Cl.Cond) = sum of the weights of all correctly covered examples
68 Probabilistic classification
 Unlike the ordered case of standard CN2, where rules are interpreted in an IF-THEN-ELSE fashion, in the unordered case and in CN2-SD all rules are tried and all rules that fire are collected
 If a clash occurs, a probabilistic method is used to resolve it
69 Probabilistic classification
 A simplified example:
  class=bird ← legs=2 ∧ feathers=yes [13, 0]
  class=elephant ← size=large ∧ flies=no [2, 10]
  class=bird ← beak=yes [20, 0]
 A two-legged, feathered, large, non-flying animal with a beak?
  all three rules fire; summed distribution [35, 10] → bird!
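A runnable sketch of this voting scheme on the example above; each rule carries the class distribution of the training examples it covers, and the distributions of all firing rules are summed:

```python
def classify(example, rules, default="bird"):
    votes = {}
    for conditions, distribution in rules:
        if all(example.get(a) == v for a, v in conditions):  # rule fires
            for cls, count in distribution.items():
                votes[cls] = votes.get(cls, 0) + count       # sum distributions
    return max(votes, key=votes.get) if votes else default

rules = [
    ([("legs", 2), ("feathers", "yes")], {"bird": 13, "elephant": 0}),
    ([("size", "large"), ("flies", "no")], {"bird": 2, "elephant": 10}),
    ([("beak", "yes")], {"bird": 20, "elephant": 0}),
]
animal = {"legs": 2, "feathers": "yes", "size": "large",
          "flies": "no", "beak": "yes"}
print(classify(animal, rules))   # bird 35 vs. elephant 10 -> 'bird'
```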
70 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Subgroup discovery
 Association rule learning
 Predictive vs. Descriptive DM summary
71 Association Rule Learning
 Rules: X => Y, "if X then Y"
  X, Y: itemsets (conjunctions of items), where items/features are binary-valued attributes
 Transactions: itemsets (records) over items i1, i2, ..., i50, e.g. t1 = (1, 1, 1, ..., 0), t2 = (1, 0, ...), ...
 Example (market basket analysis):
  peanuts ∧ chips => beer ∧ coke (support 0.05, confidence 0.65)
 Support: Sup(X, Y) = |XY| / |D| = p(XY)
 Confidence: Conf(X, Y) = |XY| / |X| = Sup(X, Y) / Sup(X) = p(XY) / p(X) = p(Y | X)
72 Association Rule Learning
 Given: a set of transactions D
 Find: all association rules that hold on the set of transactions with support > MinSup and confidence > MinConf
 Procedure:
  find all large itemsets Z with Sup(Z) > MinSup
  split every large itemset Z into X and Y and compute Conf(X, Y) = Sup(X, Y) / Sup(X); if Conf(X, Y) > MinConf then output X => Y (Sup(X, Y) > MinSup, as XY is large)
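A naive sketch of this procedure over transactions represented as Python sets (the real APRIORI generates and prunes candidate itemsets far more efficiently):

```python
from itertools import combinations

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori_rules(transactions, min_sup=0.05, min_conf=0.65):
    items = set().union(*transactions)
    large, frontier = [], [frozenset([i]) for i in items]
    while frontier:
        frontier = [z for z in frontier if support(z, transactions) >= min_sup]
        large += frontier                        # all large itemsets found so far
        frontier = list({a | b for a in frontier for b in frontier
                         if len(a | b) == len(a) + 1})   # grow by one item
    rules = []
    for z in large:
        for r in range(1, len(z)):
            for x in map(frozenset, combinations(z, r)):   # split Z into X, Y
                conf = support(z, transactions) / support(x, transactions)
                if conf >= min_conf:
                    rules.append((set(x), set(z - x),
                                  support(z, transactions), conf))
    return rules
```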
73 Induced association rules
 Age ≤ 52 ∧ BigSpender = no => Gender = male
 Age ≤ 52 ∧ BigSpender = no => Gender = male ∧ Income ≤ 73250
 Gender = male ∧ Age ≤ 52 ∧ Income ≤ 73250 => BigSpender = no
 ...
74 Association Rule Learning for Classification: APRIORI-C
 Simplified APRIORI-C:
  Discretise numeric attributes; for each discrete attribute with N values create N items
  Run APRIORI
  Collect rules whose right-hand side consists of a single target item, representing a value of the target attribute
75 Association Rule Learning for Classification: APRIORI-C
 Improvements:
  Creating rules Class ← Conditions during search
  Pruning of irrelevant items and itemsets
  Preprocessing: feature subset selection
  Postprocessing: rule subset selection
76 Association Rule Learning for Subgroup Discovery: Advantages
 May be used to create rules of the form Class ← Conditions
 Each rule is an independent chunk of knowledge, with
  high support and coverage (p(Class.Cond) > MinSup, p(Cond) > MinSup)
  high confidence (p(Class | Cond) > MinConf)
 All interesting rules are found (complete search)
 Builds small and easy-to-understand classifiers
 Appropriate for unbalanced class distributions
77 Association Rule Learning for Subgroup Discovery: APRIORI-SD
 Further improvements:
  Create a set of rules Class ← Conditions with APRIORI-C (advantage: exhaustive set of rules above the MinConf and MinSup thresholds)
  Order the set of induced rules by decreasing WRAcc
  Postprocess: rule subset selection by a weighted covering approach
   Take the best rule w.r.t. WRAcc
   Decrease the weights of covered examples
   Reorder the remaining rules and repeat until a stopping criterion is satisfied:
    significance threshold
    WRAcc threshold
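A sketch of this post-processing loop; covers and wracc are assumed helpers (the latter a weighted WRAcc as on slide 67), and the additive 1/(i+1) re-weighting mirrors the weighted covering slides:

```python
def select_rules(rules, examples, covers, wracc, max_rules=10):
    """Pick rules by WRAcc, down-weighting examples covered so far."""
    counts = {id(e): 0 for e in examples}
    selected = []
    while rules and len(selected) < max_rules:
        weights = {id(e): 1.0 / (counts[id(e)] + 1) for e in examples}
        best = max(rules, key=lambda r: wracc(r, examples, weights))
        if wracc(best, examples, weights) <= 0:      # WRAcc threshold
            break
        selected.append(best)
        rules = [r for r in rules if r is not best]  # reordering happens via max()
        for e in examples:
            if covers(best, e):
                counts[id(e)] += 1                   # lower weight next round
    return selected
```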
78 Talk outline
 Predictive vs. Descriptive DM
 Predictive rule induction
 Classification vs. estimation
 Classification rule induction
 Heuristics and rule quality evaluation
 Descriptive rule induction
 Predictive vs. Descriptive DM summary
79 Predictive vs. descriptive induction: Summary
 Predictive induction: induces rulesets acting as classifiers for solving classification and prediction tasks
  Rules are induced from labeled instances
 Descriptive induction: discovers individual rules describing interesting regularities in the data
  Rules are induced from unlabeled instances
 Exception: subgroup discovery
  Discovers individual rules describing interesting regularities in the data, induced from labeled examples
80 Rule induction: Literature
 P. Flach and N. Lavrač: Rule Induction. In: M. Berthold and D. Hand (eds.), Intelligent Data Analysis, Springer.
 See the references to other sources in this book chapter.