From Machine Learning to Inductive Logic Programming: ILP made easy

About This Presentation

Title:

From Machine Learning to Inductive Logic Programming: ILP made easy

Description:

Contents and s in co-operation with Luc De Raedt. of the University of ... identify substructure that causes it to 'dock' on certain other molecules ... – PowerPoint PPT presentation

Number of Views:375

Avg rating:3.0/5.0

Slides: 212

Provided by: hend78

Category:

more less

Transcript and Presenter's Notes

Title: From Machine Learning to Inductive Logic Programming: ILP made easy

1
From Machine Learning to Inductive Logic
ProgrammingILP made easy

Hendrik Blockeel
Katholieke Universiteit Leuven Belgium

2
Contents of this course

Introduction
What is Inductive Logic Programming?
Relationship with other fields
Foundations of ILP
Algorithms
Applications

Contents and slides in co-operation with Luc De
Raedt of the University of Freiburg, Germany
3
1. Introduction

What is inductive logic programming?

4
Introduction What is ILP?

Paradigm for inductive reasoning (reasoning from
specific to general)
Related to
machine learning and data mining
logic programming

5
Inductive reasoning

Reasoning from specific to general
from (specific) observations
to a (general) hypothesis
Studied in
philosophy of science
statistics
...

Distinguish
weak induction all observed tomatoes are red
strong induction all tomatoes are red

Weak induction conclusion is entailed by
(follows deductively from) observations
cannot be wrong
Strong induction conclusion does not follow
deductively from observations
could be wrong!
logic does not provide justification
probability theory may

8
A predicate logic approach

Different kinds of reasoning in first order
predicate logic
Standard example Socrates

Human(Socrates)
Mortal(x) ?Human(x)
9
(No Transcript)
10

Logic programming focuses on deduction
Other types of LP
abductive logic programming (ALP)
inductive logic programming (ILP)
2 questions to be solved
How to perform induction?
How to integrate it in logic programming?

11
Some examples

Learning a definition of member from examples

member(a, a,b,c). member(b,a,b,c). member(3,5
,4,3,2,1). - member(b, 1,2,3). - member(3,
a,b,c).
Examples
12
Some examples

Use of background knowledge
E.g., learning quicksort

qsort(b,c,a, a,b,c). qsort(, )
. qsort(5,3,3,5). - qsort(5,3,5,3). -
qsort(1,3 3). split(L, A, B) -
... append(A,B,C) - ...
13
Some examples

Not only predicate definitions can be learned
e.g. learning constraints

parent(jack,mary). parent(mary,bob). father(jack,m
ary). mother(mary,bob). male(jack). male(bob). fem
ale(mary).
14
Practical applications

Program synthesis
very hard
subtasks debugging, validation,
Machine learning
e.g., learning to play games
Data mining
mining in large amounts of structured data

15
Example Application Mutagenicity Prediction

Given a set of molecules
Some cause mutation in DNA (these are mutagenic),
others dont
Try to distinguish them on basis of molecular
structure
Srinivasan et al., 1994 found structural alert

16
(No Transcript)
17
Example Application Pharmacophore Discovery

Application by Muggleton et al., 1996
Find "pharmacophore" in molecules
identify substructure that causes it to "dock"
on certain other molecules
Molecules described by listing for each atom in
it element, 3-D coordinates, ...
Background defines euclidean distance, ...

Some example molecules (Muggleton et al. 1996)

19
Description of molecules
Background knowledge
... hacc(M,A)- atm(M,A,o,2,_,_,_). hacc(M,A)-
atm(M,A,o,3,_,_,_). hacc(M,A)-
atm(M,A,s,2,_,_,_). hacc(M,A)-
atm(M,A,n,ar,_,_,_). zincsite(M,A)-
atm(M,A,du,_,_,_,_). hdonor(M,A) -
atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)),
!. ...
atm(m1,a1,o,2,3.430400,-3.116000,0.048900). atm(m1
,a2,c,2,6.033400,-1.776000,0.679500). atm(m1,a3,o,
2,7.026500,-2.042500,0.023200). ... bond(m1,a2,a3,
2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6
,a7,du). ...
20
Learning to play strategic games
21
Advantages of ILP

Advantages of using first order predicate logic
for induction
powerful representation formalism for data and
hypotheses (high expressiveness)
ability to express background domain knowledge
ability to use powerful reasoning mechanisms
many kinds of reasoning have been studied in a
first order logic framework

22
Foundations of Inductive Logic Programming
23
Overview

Concept learning the Versionspaces approach
from machine learning
how to search for a concept definition consistent
with examples
based on notion of generality

Notions of generality in ILP
the theta-subsumption ordering
other generality orderings
basic techniques and algorithms
Representation of data
two paradigms learning from implications,
learning from interpretations

25
Concept learning

Given
an instance space
some unknown concept subset of instance space
Task learn concept definition from examples (
labelled instances)
Could be defined extensionally or intensionally
Usually interested in intensional definition
otherwise no generalisation possible

Hypothesis h concept definition
can be represented intensionally h
or extensionally (as set of examples) ext(h)
Hypothesis h covers example e iff e?ext(h)
Given a set of (positive and negative) examples E
ltE, E-gt, h is consistent with E if E?ext(h)
and ext(h)?E- ?

27
Versionspaces

Given a set of instances E and a hypothesis space
H, the versionspace is the set of all h?H
consistent with E
contains all hypotheses in H that might be the
correct target concept
Some inductive algorithms exist that, given H and
E, compute the versionspace VS(H,E)

28
Properties

If target concept c?H, and E contains no noise,
then c?VS(H,E)
If VS(H,E) is singleton one solution
Usually multiple solutions
If H 2I with I instance space
i.e., all possible concepts in H
then no generalisation possible
H is called inductive bias

Usually illustrated with conjunctive concept
definitions
Example from T. Mitchell, 1996 Machine
Learning

Sky AirTemp Humidity Wind Water
Forecast EnjoySport sunny warm normal
strong warm same yes

30
Lattice for Conjunctive Concepts
lt?,?,?,?,?,?gt
ltSunny,?,?,?,?,?gt
lt?,Warm,?,?,?,?gt
lt?,?,?,?,?,Samegt
...
...
...
...
...
...
...
...
...
...
...
...
ltSunny,Warm,Normal,Strong,Warm,Samegt
...
...
lt?, ?, ?, ?, ?, ?gt
31

Concept represented as if-then-rule
ltSunny,Warm,?,?,?,?gt
IF Skysunny AND AirTempwarm THEN
EnjoySportsyes

32
Generality

Central to versionspace algorithms is notion of
generality
h is more general than h ( h ? h ) iff
ext(h)?ext(h)
Properties of VS(H,E) w.r.t. generality
if s?VS(H,E), g?VS(H,E) and g ? h ? s, then
h?VS(H,E)
gt VS can be represented by its borders

33
Candidate Elimination Algorithm

Start with general border G all and specific
border S none
When encountering positive example e
generalise hypotheses in S that do not cover e
throw away hypotheses in G that do not cover e
When encountering negative example e
specialise hypotheses in G that cover e
throw away hypotheses in S that cover e

34
G
lt?,?,?gt
lt?,w,?gt
ltc,?,?gt
lts,?,?gt
ltr,?,?gt
lt?,c,?gt
lt?,?,ngt
lt?,?,dgt
sw?
s?n
sc?
s?d
cw?
c?n
cc?
c?d
rw?
r?n
rc?
r?d
?wn
?cn
?wd
?cd
swn
swd
scn
scd
cwn
cwd
ccn
ccd
rwn
rwd
rcn
rcd
lt?,?,?gt
S
35
ltc,w,ngt
G
lt?,?,?gt
lt?,w,?gt
ltc,?,?gt
lts,?,?gt
ltr,?,?gt
lt?,c,?gt
lt?,?,ngt
lt?,?,dgt
sw?
s?n
sc?
s?d
cw?
c?n
cc?
c?d
rw?
r?n
rc?
r?d
?wn
?cn
?wd
?cd
cwn
swn
swd
scn
scd
cwd
ccn
ccd
rwn
rwd
rcn
rcd
S
lt?,?,?gt
36
ltc,w,ngt ltc,c,dgt -
lt?,?,?gt
G
G
lt?,w,?gt
lt?,?,ngt
ltc,?,?gt
lts,?,?gt
ltr,?,?gt
lt?,c,?gt
lt?,?,dgt
sw?
s?n
sc?
s?d
cw?
c?n
cc?
c?d
rw?
r?n
rc?
r?d
?wn
?cn
?wd
?cd
cwn
swn
swd
scn
scd
cwd
ccn
ccd
rwn
rwd
rcn
rcd
S
lt?,?,?gt
37

Keeping G and S may not be feasible
exponential size
In practice, most inductive concept learners do
not identify VS but just try to find one
hypothesis in VS

38
Importance of generality for induction

Even when not VS itself, but only one element of
it is computed, generality can be used for search
properties allow to prune search space
if h covers negatives, then any g ? h also covers
negatives
if h does not cover some positives, then any s ?
h does not cover those positives either

For concept learning in ILP, we will need a
generality ordering between hypotheses
ILP is not only useful for learning concepts, but
in general for learning theories (e.g.,
constraints)
then we need generality ordering for theories

40
Concept Learning in First Order Logic

Need a notion of generality (cf. versionspaces)
?-subsumption, entailment,
How to specialise / generalise concept
definitions?
operators for specialisation / generalisation
inverse resolution, least general generalisation
under ?-subsumption,

41
Generality of theories

A theory G is more general than a theory S if and
only if G S
G S in every interpretation (set of facts)
for which G is true, S is also true
"G logically implies S"
e.g., "all fruit tastes good" "all apples
taste good" (assuming apples are fruit)

Note talking about theories, not just concepts
(lt-gt versionspaces)
generality of concepts is special case of this
This will allow us to also learn e.g.
constraints, instead of only predicate
definitions ( concept definitions)

43
Deduction, induction and generality

Deduction reasoning from general to specific
is "always correct", truth-preserving
Induction reasoning from specific to general
inverse of deduction
not truth-preserving (falsity-preserving)
there may be statistical evidence

Deductive operators "-" exist that implement (or
approximate)
E.g., resolution (from logic programming)
Inverting these operators yields inductive
operators
basic technique in many inductive logic
programming systems

45
Various frameworks for generality

Depending on form of G and S
1 clause / set of clauses / any first order
theory
Depending on choice of - to invert
theta-subsumption
resolution
implication
Some frameworks much easier than others

46
1) ?-subsumption (Plotkin)

Most often used in ILP
S and G are single clauses
c1 ?-subsumes c2 (denoted c1?? c2 ) if and only
if there exists a variable substitution ? such
that c1? ? c2
to check this, first write clauses as
disjunctions
a,b,c ? d,e,f ? a ? b ? c ? ?d ? ?e ? ?f
then try to replace variables with constants or
other variables

Example
c1 father(X,Y) - parent(X,Y)
c2 father(X,Y) - parent(X,Y), male(X)
for ? c1? ? c2 gt c1 ?-subsumes c2
c3 father(luc,Y) - parent(luc,Y)
for ? X/luc c1? c3 gt c1 ?-subsumes c3
c2 and c3 do not ?-subsume one another

Given facts for parent, male, female,
so-called background knowledge B
Clause produces a set of father facts
answer substitutions for X,Y when body considered
as query
or facts occurring in minimal model of B?clause
set extensional definition of concept father

Property
If
c1 and c2 are definite Horn clauses
c1 ?? c2
Then
facts produced by c2 ? facts produced by c1
(Easy to see from definition ?-subsumption)

Similarity with propositional refinement
IF Sky sunny THEN EnjoySportsyes
To specialise add 1 condition
IF Skysunny AND Humiditylow THEN
EnjoySportsyes
...

In first order logic
c1 father(X,Y) - parent(X,Y)
To specialize find clauses ?-subsumed by c1
father(X,Y) - parent(X,Y), male(X)
father(luc,X) - parent(luc,X)
add literals or instantiate variables

Another (slightly more complicated) example
c1 p(X,Y) - q(X,Y)
c2 p(X,Y) - q(X,Y), q(Y,X)
c3 p(Z,Z) - q(Z,Z)
c4 p(a,a) - q(a,a)
Which clauses ?-subsumed by which?

Properties of ?-subsumption
Sound
if c1 ?-subsumes c2 then c1 c2
Incomplete possibly c1 c2 without c1
?-subsuming c2 (but only for recursive clauses)
c1 p(f(X)) - p(X)
c2 p(f(f(X))) - p(X)
Hence ?-subsumption approximates entailment but
is not the same

Checking whether c1 ?-subsumes c2 is decidable
but NP-complete
Transitive and reflexive, not anti-symmetric
"semi-order" relation
e.g.
f(X,Y) - g(X,Y), g(X,Z)
f(X,Y) - g(X,Y)
both ?-subsume one another

Semi-order generates equivalence classes
partial order on those equivalence classes
equivalence class c1 c2 iff c1 ?? c2 and c2 ??
c1
c1 and c2 are then called syntactic variants
c1 is reduced clause of c2 iff c1 contains
minimal subset of literals of c2 that is still
equivalent with c2
each equivalence class represented by its reduced
clause

If c1 and c2 in different equivalence classes,
either c1 ?? c2 or c2 ?? c1 or neither gt
anti-symmetry gt partial order
Thus, reduced clauses are partially ordered
they form a lattice
properties of this lattice?

57
p(X,Y) - m(X,Y) p(X,Y) - m(X,Y), m(X,Z) p(X,Y)
- m(X,Y), m(X,Z), m(X,U) ...
lgg
p(X,Y) - m(X,Y),r(X) p(X,Y) - m(X,Y),
m(X,Z),r(X) ...
p(X,Y) - m(X,Y),s(X) p(X,Y) - m(X,Y),
m(X,Z),s(X) ...
reduced
p(X,Y) - m(X,Y),s(X),r(X) p(X,Y) - m(X,Y),
m(X,Z),s(X),r(X) ...
glb
58

Least upper bound / greatest lower bound of two
clauses always exists and is unique
Infinite chains c1 ?? c2 ?? c3 ?? ... ?? c exist
h(X) - p(X,Y)
h(X) - p(X,X2), p(X2,Y)
h(X) - p(X,X2), p(X2,X3), p(X3,Y)
...
h(X) - p(X,X)

Looking for good hypothesis traversing this
lattice
can be done top-down, using specialization
operator
or bottom-up, using generalization operator

60
top
Heuristics-based searches (greedy, beam,
exhaustive)
VS
bottom
61
Specialisation operators

Shapiro general-to-specific traversal using
refinement operator ?
?(c) yields set of refinements of c
theory ?(c) c' c' is a maximally general
specialisation of c
practice ?(c) ? c ? l l is a literal ?
c? ? is a substitution

62
daughter(X,Y)
daughter(X,X)
daughter(X,Y) - parent(X,Z)
......
daughter(X,Y) - parent(Y,X)
daughter(X,Y) - female(X)
...
daughter(X,Y)-female(X),female(Y)
daughter(X,Y)-female(X),parent(Y,X)
63

How to traverse hypothesis space so that
no hypotheses are generated more than once?
no hypotheses are skipped?
-gt Many properties of refinement operators
studied in detail

Some properties
globally complete each point in lattice is
reachable from top
locally complete each point directly below c is
in ?(c) (useful for greedy systems)
optimal no point in lattice is reached twice
(useful for exhaustive systems)
minimal, proper,

65
A generalisation operator

For bottom-up search
We discuss one generalisation operator Plotkins
lgg
Starts from 2 clauses and compute least general
generalisation (lgg)
i.e., given 2 clauses, return most specific
single clause that is more general than both of
them

Definition of lgg of terms
(let si, tj denote any term, V a variable)
lgg(f(s1,...,sn), f(t1,...,tn))
f(lgg(s1,t1),...,lgg(sn,tn))
lgg(f(s1,...,sn),g(t1,...,tn)) V
e.g. lgg(a,b) X lgg(f(X),g(Y)) Z
lgg(f(a,b,a),f(c,c,c))f(X,Y,X)

lgg of literals
lgg(p(s1,...,sn),p(t1,...,tn))
p(lgg(s1,t1),...,lgg(sn,tn))
lgg(?p(...), ? p(...)) ? lgg(p(...),p(...))
lgg(p(s1,...,sn),q(t1,...,tn)) is undefined
lgg(p(...), ?p(...)) and lgg(?p(...),p(...)) are
undefined

lgg of clauses
lgg(c1,c2) lgg(l1, l2) l1?c1, l2?c2 and
lgg(l1,l2) defined
Example
f(t,a) - p(t,a), m(t), f(a)
f(j,p) - p(j,p), m(j), m(p)
lgg f(X,Y) - p(X,Y), m(X), m(Z)

Relative lgg (rlgg) (Plotkin 1971)
relative to "background theory" B (assume B is a
set of facts)
rlgg(e1,e2) lgg(e1 - B, e2 - B)
method to compute
change facts into clauses with body B
compute lgg of clauses
remove B, reduce

70
Example Bongard problems

Bongard Russian scientist studying pattern
recognition
Given some pictures, find patterns in them
Simplified version of Bongard problems used as
benchmarks in ILP

71
Examples labelled neg
Examples labelled pos
72

Example 2 simple Bongard problems, find least
general clause that would predict both to be
positive

pos(1). pos(2). contains(1,
o1). contains(2,o3). contains(1,o2). tri
angle(o1). triangle(o3). points(o1,d
own). points(o3,down). circle(o2).
1
2
73

Method 1 represent example by clause compute
lgg of examples

pos(1) - contains(1,o1),
contains(1,o2), triangle(o1),
points(o1,down), circle(o2). pos(2) -
contains(2,o3), triangle(o3),
points(o3,down).
lgg( (pos(1) - contains(1,o1), contains(1,o2),
triangle(o1),
points(o1,down), circle(o2)) , (pos(2) -
contains(2,o3), triangle(o3), points(o3, down)
) pos(X) - contains(X,Y), triangle(Y),
points(Y,down)
74

Method 2 represent class of example by fact,
other properties in background compute rlgg

Examples
Background
pos(1). pos(2).
contains(1,o1). contains(2,o3). contains(1
,o2). triangle(o1).
triangle(o3). points(o1,down).
points(o3,down). circle(o2).
rlgg(pos(1), pos(2)) ? (exercise)
75

?-subsumption ordering used by many ILP systems
top down using refinement operators (many
systems)
bottom up using rlgg (e.g., Golem system,
Muggleton Feng)

Note inverting implication
Given the incompleteness of ?-subsumption, could
we invert implication?
Some problems
lgg under implication not unique e.g., lgg of
p(f(f(f(X))))-p(X) and p(f(f(X)))-p(X) can be
p(f(X))-p(X) or p(f(f(X)))-p(Y)
computationally expensive

77
2) Inverting resolution

Resolution rule for deduction

Propositional
First order
p??q q?r ----------------- p ? r
p(X) ? ?q(X) q(X) ? ?r(X,Y) ----------------
------------------------- p(X) ?
? r(X,Y)
p(a) ? ?q(b) q(X) ? ?r(X,Y) ----------------
------------------------ p(a) ?
?r(b,Y)
p ? q q ? s ----------------- p ? s
X/b
78
Inverting resolution

General resolution rule

2 opposite literals (up to a substitution) li?1
?kj?2
l1 ? ... ? li ? ... ? ln k1 ? ... ?
kj ? ... ? km ------------------------------------
------------------------------------------- (l1
? l2 ? ... ? li-1 ? li1 ? ... ? ln ? k1 ? kj-1 ?
kj1 ... ? km) ?1?2
e.g., p(X) - q(X) and q(X) - r(X,Y) yield
p(X) - r(X,Y) p(X) - q(X) and q(a)
yield p(a).
79

Resolution implements - for sets of clauses
cf. ?-subsumption for single clauses
Inverting it allows to generalize a clausal
theory
Inverse resolution is much more difficult than
resolution itself
different operators defined
no unique results

80
Inverse resolution operators

Some operators related to inverse resolution
(A and B are conjunctions of literals)
absorption
from q-A and p - A,B
infer p - q,B
identification
from p - q,B and p - A,B
infer q - A

q - A
p - q,B
p - A,B
q - A
p - q,B
p - A,B
81

Intra-construction
from p - A,B and p - A,C
infer q - B and p - A,q and q - C
Inter-construction
from p - A,B and q - A,C
infer p - r,B and r - A and q - r,C

q-C
p-A,q
q-B
q-r,C
p-r,B
r - A
inter
intra
p-A,B p-A,C
p-A,B q-A,C
82

With intra- and inter-construction, new
predicates are invented
E.g., apply intra-construction on
grandparent(X,Y) - father(X,Z), father(Z,Y)
grandparent(X,Y) - father(X,Z), mother(Z,Y)
What predicate is invented?

83
Example inverse resolution
m(j)
f(X,Y) - p(X,Y),m(X)
f(j,Y) - p(j,Y)
p(j,m)
f(j,m)
84
grandparent(X,Y) - father(X,Z), parent(Z,Y)
father(X,Y) - male(X), parent(X,Y)
grandparent(X,Y) - male(X), parent(X,Z),
parent(Z,Y)
male(jef)
grandparent(jef,Y) - parent(jef,Z),parent(Z,Y)
parent(jef,an)
grandparent(jef,Y) - parent(an,Y)
parent(an,paul)
grandparent(jef,paul)
85

Properties of inverse resolution
in principle very powerful
- gives rise to huge search space
- result of inverse resolution not unique
e.g., father(j,p)-male(j) and parent(j,p) yields
father(j,p)-male(j),parent(j,p) or
father(X,Y)-male(X),parent(X,Y) or
CIGOL approach (Muggleton Buntine)

We now have some basic operators
?-subsumption-based at single clause level
specialization operator ?
generalization operator lgg of 2 clauses
inverse resolution generalize a set of clauses
These can be used to build ILP systems
top-down using specialization operators
bottom-up using generalization operators

87
Representations

2 main paradigms for learning in ILP
learning from interpretations
learning from entailment
Related to representation of examples
Cf. Bongard examples we saw before

88
Learning from entailment

1 example a fact e (or clause e-B)
Goal
Given examples ltE,E-gt,
Find theory H such that
?e?E B?H - e
?e-?E- B?H - e-

89
pos(1). pos(2). - pos(3).
Examples
contains(1,o1). contains(1,o2). contains(2,o3). tr
iangle(o1). triangle(o3). points(o1,d
own). points(o3,down). circle(o2). contains(3
,o4). circle(o4).
Background
pos(X) - contains(X,Y), triangle(Y),
points(Y,down).
90
Learning from interpretations

Example interpretation (set of facts) e
contains a full description of the example
all information that intuitively belongs to the
example, is represented in the example, not in
background knowledge
Background domain knowledge
general information concerning the domain, not
concerning specific examples

91
Examples
pos(1) - contains(1,o1), contains(1,o2),
triangle(o1), points(o1,down),
circle(o2). pos(2) - contains(2,o3),
triangle(o3), points(o3,down). - pos(3),
contains(3,o4), circle(o4).
Background
polygon(X) - triangle(X). polygon(X) -
square(X).
pos(X) - contains(X,Y), triangle(Y),
points(Y,down).
92
Closed World Assumption made inside
interpretations
Examples
pos contains(o1), contains(o2), triangle(o1),
points(o1,down), circle(o2) pos
contains(o3), triangle(o3), points(o3,down) neg
contains(o4), circle(o4)
Background
polygon(X) - triangle(X). polygon(X) -
square(X).
constraint on pos
?Ycontains(Y),triangle(Y),points(Y,down).
93

Note when learning from interpretations
can dispose of example identifier
but can also use standard format
CWA made for example description
i.e., example description is assumed to be
complete
class of example related to information inside
example background information, NOT to
information in other examples

Because of 3rd property, more limited than
learning from entailment
cannot learn relations between different
examples, nor recursive clauses
but also more efficient
because of 2nd and 3rd property
positive PAC-learnability results (De Raedt and
Deroski, 1994, AIJ), vs. negative results for
learning from entailment

95
Algorithms
96
Rule induction

Most inductive logic programming systems induce
concept definition in form of set of definite
Horn clauses (Prolog program)
Many algorithms similar to propositional
algorithms for learning rule sets
FOIL -gt CN2
Progol -gt AQ

97
FOIL (Quinlan)

Learns single concept, e.g., p(X,Y) - ...
To learn one clause (hill-climbing search)
start with general clause p(X,Y) - true
repeat
add best literal to clause (i.e., literal that
most improves quality of clause)
new literal can also be unification Xc or XY
applying refinement operator under
?-subsumption
until no further improvement

98
Example
father(homer,bart). father(bill,chelsea). -
father(marge,bart). - father(hillary,chelsea). -
father(bart,chelsea). parent(homer,bart). parent
(marge,bart). parent(bill,chelsea). parent(hillary
,chelsea) male(homer). male(bart). male(bill). fem
ale(chelsea). female(marge).
99
father(homer,bart). father(bill,chelsea). -
father(marge,bart). - father(hillary,chelsea). -
father(bart,chelsea). parent(homer,bart). parent
(marge,bart). parent(bill,chelsea). parent(hillary
,chelsea). male(homer). male(bart). male(bill). fe
male(chelsea). female(marge).
father(X,Y) - parent(X,Y). father(X,Y) -
parent(Y,X). father(X,Y) - male(X). father(X,Y)
- male(Y). father(X,Y) - female(X). father(X,Y)
- female(Y).
2,2-
100
father(homer,bart). father(bill,chelsea). -
father(marge,bart). - father(hillary,chelsea). -
father(bart,chelsea). parent(homer,bart). parent
(marge,bart). parent(bill,chelsea). parent(hillary
,chelsea). male(homer). male(bart). male(bill). fe
male(chelsea). female(marge).
father(X,Y) - parent(X,Y). father(X,Y) -
parent(Y,X). father(X,Y) - male(X). father(X,Y)
- male(Y). father(X,Y) - female(X). father(X,Y)
- female(Y).
2,1-
101
father(homer,bart). father(bill,chelsea). -
father(marge,bart). - father(hillary,chelsea). -
father(bart,chelsea). parent(homer,bart). parent
(marge,bart). parent(bill,chelsea). parent(hillary
, chelsea). male(homer). male(bart). male(bill). f
emale(chelsea). female(marge).
father(X,Y) - male(X). father(X,Y) -
male(X), parent(X,Y). father(X,Y) - male(X),
parent(Y,X). father(X,Y) - male(X),
male(Y). father(X,Y) - male(X),
female(X). father(X,Y) - male(X), female(Y).
2,0-
102
Learning multiple clauses the Covering approach

To learn multiple clauses
repeat
learn a single clause c (see previous algorithm)
add c to h
mark positive examples covered by c as covered
until
all positive examples marked covered
or no more good clauses found

103
likes(garfield, lasagne). likes(garfield,
birds). likes(garfield, meat). likes(garfield,
jon). likes(garfield, odie).
likes(garfield, X) - edible(X).
3,0-
104
likes(garfield, lasagne). likes(garfield,
birds). likes(garfield, meat). likes(garfield,
jon). likes(garfield, odie).
(italics previously covered)
likes(garfield, X) - edible(X). likes(garfield,
X) - subject_to_cruelty(X).
2,0-
105
Some pitfalls

Avoiding infinite recursion
when recursive clauses allowed, e.g.,
ancestor(X,Y) - parent(X,Z), ancestor(Z,Y)
avoid learning parent(X,Y) - parent(X,Y)
won't be useful, even though it's 100 correct
Bonus for introduction of new variables
literal may not yield any direct gain, but may
introduce variables that may be useful later

p(X) - q(X) p positives, n negatives
covered refine by adding age p(X) - q(X),
age(X,Y) p positives, n negatives covered -gt no
gain
106
Golem (Muggleton Feng)

Based on rlgg-operator
To build one clause
Look at 2 positive examples, find rlgg,
generalize using yet another example, until no
improvement in quality of clause
bottom-up search
Result very dependent on choice of examples
e.g. what if true theory is p(X) - q(X) , p(X)
- r(X) ?

107

Try this for different couples, pick best clause
found
this reduces dependency on choice of couple (if 1
of them noisy no good clause found)
Remove covered positive examples, restart process
Repeat until no more good clauses found

108

1 limitation of Golem extensional coverage tests
only extensional background knowledge
may go wrong when learning recursive clauses

induces
p(0). p(1). p(2). - p(4).
s(0,1). s(1,2). s(2,3). s(3,4).
p(Y) - s(X,Y), p(X).
H-B checked by running query ?(B? ? H)
extensional coverage test
examples
background
109
Progol (Muggleton)

Top-down approach, but with seed
To find one clause
Start with 1 positive example e
Generate hypothesis space He that contains only
hypotheses that cover at least this one example
first generate most specific clause c that covers
e
He contains every clause more general than c
Perform exhaustive top-down search in He, looking
for clause that maximizes compaction

110

Compaction size(covered examples) -
size(clause)
Repeat process of finding one clause until no
more good ( causing compaction) clauses found
Compaction heuristic in principle allows no
coverage of negatives
can be relaxed (accommodating noise)

111
Generation of bottom clause

Language bias set of all acceptable clauses
(chosen by user)
specification of H (on level of single clauses)
Bottom clause ? for example e most specific
clause in language bias covering e
Constructed using inverse entailment

112

Construction of ?
if B?H e, then B ? ?e ?H
if H is clause, ?H is conjunction of ground
(skolemized) literals
compute ?? all ground literals entailed by B ?
?e
?H must be subset of these
so B ? ?e ?? ?H
hence H ?

113

Some examples (cf. Muggleton, NGC 1995)

?
B
e
anim(X) - pet(X). pet(X) - dog(X).
nice(X) - dog(X).
nice(X) - dog(X), pet(X), anim(X).
hasbeak(X) - bird(X). bird(X) - vulture(X).
hasbeak(tweety).
hasbeak(tweety) bird(tweety) vulture(tweety).
114

Example of (part of) Progol run
learn to classify animals as mammals, reptiles,
...

- generalise(class/2)? Generalising
class(dog,mammal). Most specific clause
is class(A,mammal) - has_milk(A),
has_covering(A,hair), has_legs(A, 4),
homeothermic(A), habitat(A,land). C-28,4,10,0
class(A,mammal). C8,4,0,0 class(A,mammal) -
has_milk(A). C5,3,0,0 class(A,mammal) -
has_covering(A,hair). C-4,4,3,0
class(A,mammal) - homeothermic(A). 4 explored
search nodes f8,p4,n0,h0 Result of search
is class(A,mammal) - has_milk(A).
115

Exhaustive search important to constrain size
of hypothesis space
Strong language bias
specify which predicates to be used in head or
body of clause
specify types and modes of predicates
e.g., allow age(X,Y), Ylt18
but not habitat(X,Y), Ylt18

116

E.g., for "animals" example

put this in head
variable of type "animal"
- modeh(1,class(animal,class))? -
modeb(1,has_milk(animal))? - modeb(1,has_gills(
animal))? - modeb(1,has_covering(animal,coverin
g))? - modeb(1,has_legs(animal,nat))? -
modeb(1,homeothermic(animal))? -
modeb(1,has_eggs(animal))? - modeb(,habitat(an
imal,habitat))?
constant of type "covering"
put this in body
there can be any number of habitats
only one literal of this kind needed
117
Other approaches

Algorithms we have seen up till now are rule
based algorithms
induce theory in the form of a set of rules
(definite Horn clauses)
induce rules one by one
Quite normal, given that logic programs are
essentially sets of rules

118

Still induction of rule sets is only one type of
machine learning
Difference between ILP and propositional
approaches is mainly in representation
Possible to define other learning techniques and
tasks in ILP induction of constraints, induction
of decision trees, Bayesian learning, ...

119
Claudien (De Raedt Bruynooghe)

"Clausal Discovery Engine"
Discovers patterns that hold in set of data
any patterns represented as clauses (not
necessarily Horn clauses)
I.e., finds patterns of a more general kind than
predictive rules
also called descriptive induction

120

Given a hypothesis space
performs an exhaustive top-down search through
the space
returns all clauses that
hold in the data set
are not implied by other clauses found
Strong language bias precise syntactical
description of acceptable clauses

121

Example language bias

parent(X,Y), father(X,Y), mother(X,Y) -
parent(X,Y), father(X,Y), mother(X,Y),
male(X), male(Y), female(X), female(Y)

May result in following clauses being discovered

parent(X,Y) - father(X,Y). parent(X,Y) -
mother(X,Y). - father(X,Y), mother(X,Y). -
male(X), female(X). mother(X,Y) - parent(X,Y),
female(X). ...
122
Claudien algorithm

S ?
Q ?
while Q not empty
pick first clause c from Q
for all (h?b) in ?(c)
if query (b??h) fails (i.e., clause is true in
data)
then
if (h?b) not entailed by clauses in S then add
(h?b) to S
else add (h?b) to Q

123
ICL (De Raedt and Van Laer)

Inductive Constraint Logic
First system to learn from interpretations
Search for constraints on interpretations
distinguishing examples of different classes
Roughly run Claudien on set of examples E
each constraint found will be true for all e,
but probably false for some e-
all constraints together hopefully rule out all e-

124

Search for one constraint
c ?
repeat until c true for all positives
find d in ?(c) so that d holds for as many
positives and as few negatives as possible
c d
add c to h
can also use beam search

125

Search for set of constraints on a class
h
while there are negatives left to be eliminated
find a constraint c
add c to h
Uses same language bias (DLAB) as recent
versions of Claudien
DLAB is advanced form of original Claudien bias

126

Example of DLAB bias specification
min-max ... means at least min and at most max
literals from the list are to be put here
can be nested
allows some nice tricks, e.g.
1-1male(X),female(X)

0-2parent(X,Y), father(X,Y), mother(X,Y) lt--
0-lenparent(X,Y), father(X,Y), mother(X,Y),
male(X), male(Y), female(X), female(Y)
127
Warmr (Dehaspe)

Induces first order association rules
Algorithm similar to APRIORI
Finds frequent patterns
cf. "frequent item sets" in APRIORI context
Pattern conjunction of literals
Uses ?-subsumption lattice over hypothesis space
Constructs association rules from patterns
IF this pattern occurs, THEN that pattern occurs
too

128
The APRIORI algorithm

APRIORI (Agrawal et al.) efficient discovery of
frequent itemsets and association rules
Typical example market basket analysis
which things are often bought together?
Association rule
IF a1, , an THEN an1, anm

129

Association rules should have at least some
minimal
support t(a1anm) / ttrue
how many people buy all these things together?
confidence ta1anm/ta1an
how many people of those buying IF-things also
buy THEN-things?
Minimal support and confidence may be low

130

APRIORI tailored towards using large data sets
efficiency very important
minimize data access
Works in 2 steps
find frequent itemsets
compute association rules from them

131

Observation
if a1an infrequent (below min. support)
then a1an1 also infrequent
adding a condition can only strengthen the
conjunction
Hence
a1,,an can only be frequent if each subset of
it is frequent

132

Leads to levelwise algorithm
first compute frequent singletons
then frequent pairs, triples,
a lot of pruning possible due to previous
observation
itemset of cardinality n is candidate if each
subset of it of cardinality n-1 was frequent in
previous level
need to count only candidates

133
Example
bread
butter
wine
ham
cheese
jam
Bread butter
Bread cheese
Bread jam
Butter cheese
Butter jam
Cheese jam
Bread butter cheese
Bread butter jam
Not a candidate
134
Apriori algorithm
Min_freq min_supportfreq(?) d 0 Q0 ?
/ candidates for level 0 / F ? / frequent
sets / while Qd ? ? do for all S in Qd do find
freq(S) Fd S in Qd freq(S) ?
min_freq F F ? Fd compute Qd1 d
d1 return F
135
Computing candidates
Compute Qd1 from Fd Qd1 ? for
each S in Fd do for each item x not in S do
S S ? x if ?i in S S\i ? Fd
then add S to Qd1
136

Step 2 deriving association rules from frequent
sets
if S ? a ? F and (S?a)/S gt min_confidence
then S -gt S ? a is a valid association rule
has sufficient support and confidence

137
Warmr

Warmr is first-order version of Apriori
Patterns (itemsets) are now conjunctive queries
Frequent patterns what to count?
examples, of course...
Was easy in propositional case
1 example 1 tuple -gt count tuples

138

In first-order case
also easy when learning from interpretations
not so clear when learning from implications
which implications are examples?
indicate this by specifying a key
key unique identification of example
each pattern contains a set of variables that
forms the key

139

Example
assume 100 people in database
person(X) X is the key
count answer substitutions of X, not Y or Z!
person(X), mother(X,Y) 40 examples
mother(X,Y), has_pet(Y,Z) 30 examples
mother(X,Y) ---gt has_pet(Y,Z) support 0.3,
confidence 0.75

140

Remark association rule is NOT a clause
mother(X,Y) ---gt has_pet(Y,Z)
?X (?Ymother(X,Y)) -gt (?YZmother(X,Y),has_pet
(Y,Z))
? mother(X,Y) -gt has_pet(Y,Z)
main difference is occurrence of existentially
quantified variables in conclusion

141

Illustrated on Bongard drawings
1 example 1 drawing
contains(D,Obj) D is the key
Pattern e.g.,
contains(D,X), circle(X), in(X,Y), circle(Y)
Association rule e.g.,
contains(D,X), circle(X),in(X,Y),circle(Y) --gt
contains(D,Z), square(Z)
"drawings that contain a circle inside another
circle usually also contain a square"

142

Warmr also useful for feature construction
Generally applicable method for improving
representation of examples
Given description of example
derive new (propositional) features that describe
the example
add those features to a propositional description
of the example
run a propositional learner

143

For Bongard example
construct features "contains a circle", "contains
a circle inside a triangle", ...
given the correct features, a propositional
representation of examples is possible
Feature construction with ILP general method
for applying propositional machine learning
techniques to structural examples

144
Decision tree induction in ILP

S-CART (Kramer 1996) upgrade of CART
Tilde (Blockeel De Raedt 98) upgrades C4.5
Both induce "first order" or "structural"
decision trees (FOLDTs)
test in node first order literal
may result in true or false -gt binary trees
different nodes may share variables
"real" test in a node conjunction of all
literal in path from root to node

145
Top-down Induction of Decision Trees Algorithm

function TDIDT(E set of examples)
T set of possible tests
t BEST_SPLIT(T, E)
E partition induced on E by t
if STOP_CRIT(E, E) then return leaf(INFO(E))
else
for all Ei in E ti TDIDT(Ei)
return inode(t, (i, ti))

146

Set of possible tests
generated using refinement operator
c conjunction on path from root to node
?(c ) - c literal(s) to be put in node
Other auxiliary functions lt prop. TDIDT
best split using e.g. information gain
stop_crit e.g. significance test
info e.g. most frequent class

147

Known from propositional learning
induction of decision trees is fast
usually yields good results
These properties are inherited by Tilde / S-CART
New results (not inherited from prop. learning)
on expressiveness

148
Example FOLDT
worn(X)
yes
no
irreplaceable(X)
ok
yes
no
sendback
fix
("x Ø worn(x))
gt ok (x worn(x) Ù
irreplaceable(x)) gt
sendback (x"y worn(x) Ù Ø(worn(y) Ù
irreplaceable(y))) gt fix
149
Expressiveness
FOL formula equivalent with tree
("x Øworn(x))
gt ok (x worn(x) Ù
irreplaceable(x)) gt
sendback (x"y worn(x) Ù Ø(worn(y) Ù
irreplaceable(y))) gt fix
Logic program equivalent with tree
a worn(X) b worn(X), irreplaceable(X) ok Ø
a sendback b fix a Ù Ø b
150

Prolog program equivalent with tree, using cuts
(first order decision list)

sendback - worn(X), irreplaceable(X), ! fix -
worn(X), !. ok.
151

FOLDT can be converted to
layered logic program
containing invented predicates
flat Prolog program (using cuts)
Can not be converted to flat logic program

152
Expressiveness
TL
F
F Flat logic programs T decision Trees L
decision Lists

Difference is specific for first-order case
Possible remedies for ILP systems
invent auxiliary predicates
use both " and
induce decision lists

153
Representation with keys
class(e1,fix). worn(e1,gear). worn(e1,chain). clas
s(e2,sendback). worn(e2,engine). worn(e2,chain). c
lass(e3,sendback). worn(e3,control_unit). class(e4
,fix). worn(e4,chain). class(e5,keep).
worn(E,X)?
class(E,keep)
not_replaceable(X)?
class(E,fix)
class(E,sendback)
conversion to Prolog
replaceable(gear). replaceable(chain). not_replace
able(engine). not_replaceable(control_unit).
class(E,sendback) - worn(E,X),
not_replaceable(X), !. class(E,fix) - worn(E,X),
!. class(E, keep).
154
speed(x,s), s gt 120, not job(x, politician), not
(?y knows(x,y), job(y,politician)) gt fine(x,Y)
speed(X,S), Sgt120
yes
no
job(X, politician)
N
yes
no
knows(X, Y)
N
yes
no
job(Y, politician)
Y
yes
no
Y
N
155
Other advantages of FOLDTs

Both classification and regression possible
classification predict class ( learn concept)
regression predict numbers
important not given much attention in ILP
Also clustering to some extent
clustering group similar examples together

156
Many other approaches and applications of ILP
possible...

Combination of ILP and Q-learning
RRL ("relational reinforcement learning")
reinforcement learning in structural domains
First-order equivalent of Bayesian networks
First-order clustering
needs first order distance measures
...

157
Conclusions

Many different approaches exist in Machine
Learning
ILP is in a sense diverging
from concept learning
to other approaches and tasks
Still many new approaches to be tried!

158
Applications of ILP
159
Applications Overview

User modelling
Games
Ecology
Drug design
Natural language
Inductive Database Design

160
User Modelling

Behavioural cloning
build model of users behaviour
simulate users behaviour by means of model
e.g.
learning to fly / drive /
learning to play music
learning to play games (adventure, strategic, )

161

Automatic adaptation of system to user
detect patterns in users actions
use patterns to try to predict users next action
based on predictions, make life easier for user
e.g.
mail system (auto-priority, )
adaptive web pages
intelligent search engines

162
Example Applications

Some applications the Leuven group has looked at
behavioural cloning
learning to play mus ic
learning to play games
automatic adaptation of system to user
adaptive webpages
a learning command shell
intelligent e-mail interface

163
Learning to Play Music

Van Baelen De Raedt, ILP-96
Playing music is difficult
not just playing the notes
but play with feeling
adapt volume, speed,
Midi files provided to learning system
System detects patterns w.r.t. pitch, volume,
speed,
and tries to play music itself

164

Why an ILP approach?
mainly because of time sequences
Results?
Compare computer generated MIDI file with human
generated MIDI file
Computer makes similar mistakes as beginning
player
See ILP-96 proc. for details (LNAI 1314)

165
Adaptive Webpages

Adaplix project (Jacobs et al., 1997-)
Webpage observes actions of user
e.g., which links are followed frequently, time
that is spent on one page,
and adapts itself
within limitations given by page author
change layout of page
move links to different places
add or remove links

166

example site http//adaplix.linux.student.kuleuve
n.ac.be
identify yourself
name, gender, occupation (personnel/student)
based on this info provides customized web page
student project (in Dutch)

167
Intelligent Mailer

Visual Elm (Jacobs, 1996)
Intelligent mail interface
tries to detect which kind of mails are
immediately deleted
immediately read
not deleted, read later
forwarded
based on this, assigns priorities to new mails

168

Predictions
priority assigned to new mails
expected actions delete, forward,
Explanation facility
Several options offered to user
e.g. set priority threshold, only show mails
above threshold
sort mails according to priority

169
(No Transcript)
170
Learning Shell

Jacobs, Dehaspe et al. (1999)
Context Unix command shell, e.g., csh
Each user has profile file
defines configuration for user that makes it
easier to use the shell
usually default profile, unless user changes it
manually

171

Possible to learn profile file?
Observe user
which commands are often used?
which parameters are used with the commands?
Automatically construct better profile from
observations

172

Example of input to ILP system

/ background / command(Id, Command)
- isa(OrigCommand, Command), command(Id,
OrigCommand). isa(emacs, editor). isa(vi,
editor). / observations / command(1,
cd). attribute(1, 1, tex). command(2,
emacs). switch(2, 1, -nw). switch(2, 2,
-q). attribute(2, 1, aaai.tex).
173