Title: From Machine Learning to Inductive Logic Programming: ILP made easy
1. From Machine Learning to Inductive Logic Programming: ILP made easy
- Hendrik Blockeel
- Katholieke Universiteit Leuven Belgium
2. Contents of this course
- Introduction
- What is Inductive Logic Programming?
- Relationship with other fields
- Foundations of ILP
- Algorithms
- Applications
Contents and slides in co-operation with Luc De Raedt of the University of Freiburg, Germany
3. Introduction
- What is inductive logic programming?
4. Introduction: What is ILP?
- Paradigm for inductive reasoning (reasoning from specific to general)
- Related to
- machine learning and data mining
- logic programming
5. Inductive reasoning
- Reasoning from specific to general
- from (specific) observations
- to a (general) hypothesis
- Studied in
- philosophy of science
- statistics
- ...
6.
- Distinguish
- weak induction: all observed tomatoes are red
- strong induction: all tomatoes are red
7.
- Weak induction: the conclusion is entailed by (follows deductively from) the observations - cannot be wrong
- Strong induction: the conclusion does not follow deductively from the observations - could be wrong!
- logic does not provide justification
- probability theory may
8. A predicate logic approach
- Different kinds of reasoning in first order predicate logic
- Standard example: Socrates
  Human(Socrates)
  Mortal(x) ← Human(x)
10.
- Logic programming focuses on deduction
- Other types of LP:
- abductive logic programming (ALP)
- inductive logic programming (ILP)
- 2 questions to be solved:
- How to perform induction?
- How to integrate it in logic programming?
11. Some examples
- Learning a definition of member from examples (negative examples marked with -):

Examples:
member(a, [a,b,c]). member(b, [a,b,c]). member(3, [5,4,3,2,1]).
- member(b, [1,2,3]). - member(3, [a,b,c]).
12. Some examples
- Use of background knowledge
- E.g., learning quicksort

Examples:
qsort([b,c,a], [a,b,c]). qsort([], []). qsort([5,3], [3,5]).
- qsort([5,3], [5,3]). - qsort([1,3], [3]).

Background:
split(L, A, B) :- ...
append(A, B, C) :- ...
13. Some examples
- Not only predicate definitions can be learned: e.g. learning constraints

parent(jack,mary). parent(mary,bob). father(jack,mary). mother(mary,bob).
male(jack). male(bob). female(mary).
14. Practical applications
- Program synthesis
- very hard
- subtasks: debugging, validation, ...
- Machine learning
- e.g., learning to play games
- Data mining
- mining in large amounts of structured data
15. Example Application: Mutagenicity Prediction
- Given a set of molecules
- Some cause mutation in DNA (these are mutagenic), others don't
- Try to distinguish them on the basis of molecular structure
- Srinivasan et al., 1994: found structural alert
17. Example Application: Pharmacophore Discovery
- Application by Muggleton et al., 1996
- Find "pharmacophore" in molecules
- identify the substructure that causes it to "dock" on certain other molecules
- Molecules described by listing for each atom in it: element, 3-D coordinates, ...
- Background defines euclidean distance, ...
18. Some example molecules (Muggleton et al. 1996)
19. Description of molecules

Background knowledge:
...
hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...

Atoms and bonds:
atm(m1,a1,o,2,3.430400,-3.116000,0.048900).
atm(m1,a2,c,2,6.033400,-1.776000,0.679500).
atm(m1,a3,o,2,7.026500,-2.042500,0.023200).
...
bond(m1,a2,a3,2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6,a7,du).
...
20. Learning to play strategic games
21. Advantages of ILP
- Advantages of using first order predicate logic for induction:
- powerful representation formalism for data and hypotheses (high expressiveness)
- ability to express background domain knowledge
- ability to use powerful reasoning mechanisms
- many kinds of reasoning have been studied in a first order logic framework
22. Foundations of Inductive Logic Programming
23. Overview
- Concept learning: the Versionspaces approach
- from machine learning
- how to search for a concept definition consistent with examples
- based on the notion of generality
24.
- Notions of generality in ILP
- the theta-subsumption ordering
- other generality orderings
- basic techniques and algorithms
- Representation of data
- two paradigms: learning from implications, learning from interpretations
25. Concept learning
- Given
- an instance space
- some unknown concept = subset of the instance space
- Task: learn a concept definition from examples (= labelled instances)
- Could be defined extensionally or intensionally
- Usually interested in an intensional definition
- otherwise no generalisation possible
26.
- Hypothesis h = concept definition
- can be represented intensionally: h
- or extensionally (as a set of examples): ext(h)
- Hypothesis h covers example e iff e ∈ ext(h)
- Given a set of (positive and negative) examples E = <E+, E->, h is consistent with E if E+ ⊆ ext(h) and ext(h) ∩ E- = ∅
27. Versionspaces
- Given a set of instances E and a hypothesis space H, the versionspace is the set of all h ∈ H consistent with E
- contains all hypotheses in H that might be the correct target concept
- Some inductive algorithms exist that, given H and E, compute the versionspace VS(H,E)
28. Properties
- If the target concept c ∈ H, and E contains no noise, then c ∈ VS(H,E)
- If VS(H,E) is a singleton: one solution
- Usually multiple solutions
- If H = 2^I with I the instance space
- i.e., all possible concepts are in H
- then no generalisation is possible
- H is called the inductive bias
29.
- Usually illustrated with conjunctive concept definitions
- Example from T. Mitchell, 1996, Machine Learning:

Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
sunny  warm     normal    strong  warm   same      yes
30. Lattice for Conjunctive Concepts
(figure: the generality lattice over conjunctive hypotheses, with the most general hypothesis <?,?,?,?,?,?> at the top; below it hypotheses such as <Sunny,?,?,?,?,?>, <?,Warm,?,?,?,?> and <?,?,?,?,?,Same>; further down fully instantiated hypotheses such as <Sunny,Warm,Normal,Strong,Warm,Same>; and the most specific (empty) hypothesis at the bottom)
31.
- Concept represented as if-then rule
- <Sunny,Warm,?,?,?,?>
- IF Sky=sunny AND AirTemp=warm THEN EnjoySport=yes
32. Generality
- Central to versionspace algorithms is the notion of generality
- h is more general than h' (h ≥ h') iff ext(h) ⊇ ext(h')
- Properties of VS(H,E) w.r.t. generality:
- if s ∈ VS(H,E), g ∈ VS(H,E) and g ≥ h ≥ s, then h ∈ VS(H,E)
- ⇒ VS can be represented by its borders
33. Candidate Elimination Algorithm
- Start with general border G = {"all"} and specific border S = {"none"} (a code sketch follows below)
- When encountering positive example e:
- generalise hypotheses in S that do not cover e
- throw away hypotheses in G that do not cover e
- When encountering negative example e:
- specialise hypotheses in G that cover e
- throw away hypotheses in S that cover e
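The loop above is easy to state in code. Below is a minimal Python sketch for conjunctive hypotheses over attribute tuples; all names are hypothetical, and the specialisation of G on negative examples is simplified to deletion (a full implementation enumerates minimal specialisations using the attribute domains).

    def covers(h, e):
        # h covers e if every attribute is '?' or equals e's value
        return all(a == '?' or a == v for a, v in zip(h, e))

    def min_generalise(h, e):
        # least generalisation of h that covers e: relax mismatches to '?'
        return tuple(a if a == v else '?' for a, v in zip(h, e))

    def candidate_elimination(examples, n_attrs):
        G = [tuple(['?'] * n_attrs)]    # most general border
        S = [None]                      # None = "covers nothing" (most specific)
        for e, positive in examples:
            if positive:
                G = [g for g in G if covers(g, e)]
                S = [tuple(e) if s is None else min_generalise(s, e) for s in S]
            else:
                S = [s for s in S if s is None or not covers(s, e)]
                # simplification: a full version minimally specialises g instead
                G = [g for g in G if not covers(g, e)]
        return G, S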
34.
(figure: the complete hypothesis lattice for a three-attribute example, with the general border G = {<?,?,?>} at the top; intermediate hypotheses such as <s,?,?>, <c,?,?>, <r,?,?>, <?,w,?>, <?,c,?>, <?,?,n>, <?,?,d>; ground hypotheses <s,w,n> ... <r,c,d>; and the specific border S at the bottom)
35.
(figure: the same lattice after processing the positive example <c,w,n>: the specific border S is generalised to <c,w,n>; G remains {<?,?,?>})
36.
(figure: the lattice after the positive example <c,w,n> and the negative example <c,c,d>: the general border G is specialised to {<?,w,?>, <?,?,n>}; S = {<c,w,n>})
37.
- Keeping G and S may not be feasible
- exponential size
- In practice, most inductive concept learners do not identify VS but just try to find one hypothesis in VS
38. Importance of generality for induction
- Even when not VS itself, but only one element of it, is computed, generality can be used for search
- properties allow pruning of the search space:
- if h covers negatives, then any g ≥ h also covers negatives
- if h does not cover some positives, then any s ≤ h does not cover those positives either
39.
- For concept learning in ILP, we will need a generality ordering between hypotheses
- ILP is not only useful for learning concepts, but in general for learning theories (e.g., constraints)
- then we need a generality ordering for theories
40. Concept Learning in First Order Logic
- Need a notion of generality (cf. versionspaces)
- θ-subsumption, entailment, ...
- How to specialise / generalise concept definitions?
- operators for specialisation / generalisation
- inverse resolution, least general generalisation under θ-subsumption, ...
41. Generality of theories
- A theory G is more general than a theory S if and only if G ⊨ S
- G ⊨ S: in every interpretation (set of facts) for which G is true, S is also true
- "G logically implies S"
- e.g., "all fruit tastes good" ⊨ "all apples taste good" (assuming apples are fruit)
42.
- Note: talking about theories, not just concepts (cf. versionspaces)
- generality of concepts is a special case of this
- This will allow us to also learn e.g. constraints, instead of only predicate definitions (= concept definitions)
43. Deduction, induction and generality
- Deduction: reasoning from general to specific
- is "always correct", truth-preserving
- Induction: reasoning from specific to general = inverse of deduction
- not truth-preserving (falsity-preserving)
- there may be statistical evidence
44.
- Deductive operators "⊢" exist that implement (or approximate) ⊨
- E.g., resolution (from logic programming)
- Inverting these operators yields inductive operators
- basic technique in many inductive logic programming systems
45. Various frameworks for generality
- Depending on the form of G and S:
- 1 clause / set of clauses / any first order theory
- Depending on the choice of ⊢ to invert:
- theta-subsumption
- resolution
- implication
- Some frameworks much easier than others
46. 1) θ-subsumption (Plotkin)
- Most often used in ILP
- S and G are single clauses
- c1 θ-subsumes c2 (denoted c1 ≤θ c2) if and only if there exists a variable substitution θ such that c1θ ⊆ c2
- to check this, first write clauses as disjunctions:
- a,b,c ← d,e,f becomes a ∨ b ∨ c ∨ ¬d ∨ ¬e ∨ ¬f
- then try to replace variables with constants or other variables
47.
- Example (a code sketch follows below):
- c1 = father(X,Y) :- parent(X,Y)
- c2 = father(X,Y) :- parent(X,Y), male(X)
- for θ = {} (the empty substitution): c1θ ⊆ c2 ⇒ c1 θ-subsumes c2
- c3 = father(luc,Y) :- parent(luc,Y)
- for θ = {X/luc}: c1θ = c3 ⇒ c1 θ-subsumes c3
- c2 and c3 do not θ-subsume one another
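To make the definition concrete, here is a small Python sketch of a θ-subsumption test (all names hypothetical). Clauses are lists of literal tuples; for brevity the sketch ignores literal polarity, which a full test would track.

    def is_var(t):
        # variables are strings starting with an uppercase letter
        return isinstance(t, str) and t[:1].isupper()

    def match(lit1, lit2, theta):
        # try to extend substitution theta so that lit1*theta == lit2
        if lit1[0] != lit2[0] or len(lit1) != len(lit2):
            return None
        theta = dict(theta)
        for a, b in zip(lit1[1:], lit2[1:]):
            if is_var(a):
                if theta.get(a, b) != b:
                    return None
                theta[a] = b
            elif a != b:
                return None
        return theta

    def theta_subsumes(c1, c2, theta={}):
        # True iff some substitution maps every literal of c1 into c2
        if not c1:
            return True
        for lit in c2:
            t = match(c1[0], lit, theta)
            if t is not None and theta_subsumes(c1[1:], c2, t):
                return True
        return False

    c1 = [('father','X','Y'), ('parent','X','Y')]
    c2 = [('father','X','Y'), ('parent','X','Y'), ('male','X')]
    print(theta_subsumes(c1, c2))   # True: the empty substitution works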
48.
- Given facts for parent, male, female, ...
- so-called background knowledge B
- The clause produces a set of father facts
- answer substitutions for X,Y when the body is considered as a query
- or facts occurring in the minimal model of B ∪ {clause}
- this set = extensional definition of the concept father
49.
- Property:
- If
- c1 and c2 are definite Horn clauses
- c1 ≤θ c2
- Then
- facts produced by c2 ⊆ facts produced by c1
- (Easy to see from the definition of θ-subsumption)
50.
- Similarity with propositional refinement:
- IF Sky=sunny THEN EnjoySport=yes
- To specialise: add 1 condition
- IF Sky=sunny AND Humidity=low THEN EnjoySport=yes
- ...
51.
- In first order logic:
- c1 = father(X,Y) :- parent(X,Y)
- To specialize: find clauses θ-subsumed by c1
- father(X,Y) :- parent(X,Y), male(X)
- father(luc,X) :- parent(luc,X)
- ...
- = add literals or instantiate variables
52.
- Another (slightly more complicated) example:
- c1 = p(X,Y) :- q(X,Y)
- c2 = p(X,Y) :- q(X,Y), q(Y,X)
- c3 = p(Z,Z) :- q(Z,Z)
- c4 = p(a,a) :- q(a,a)
- Which clauses are θ-subsumed by which?
53.
- Properties of θ-subsumption:
- Sound:
- if c1 θ-subsumes c2 then c1 ⊨ c2
- Incomplete: possibly c1 ⊨ c2 without c1 θ-subsuming c2 (but only for recursive clauses)
- c1 = p(f(X)) :- p(X)
- c2 = p(f(f(X))) :- p(X)
- Hence θ-subsumption approximates entailment but is not the same
54.
- Checking whether c1 θ-subsumes c2 is decidable but NP-complete
- Transitive and reflexive, not anti-symmetric
- a "semi-order" relation
- e.g.
- f(X,Y) :- g(X,Y), g(X,Z)
- f(X,Y) :- g(X,Y)
- both θ-subsume one another
55.
- A semi-order generates equivalence classes + a partial order on those equivalence classes
- equivalence class: c1 ~ c2 iff c1 ≤θ c2 and c2 ≤θ c1
- c1 and c2 are then called syntactic variants
- c1 is the reduced clause of c2 iff c1 contains a minimal subset of the literals of c2 that is still equivalent with c2
- each equivalence class is represented by its reduced clause
56.
- If c1 and c2 are in different equivalence classes, either c1 ≤θ c2 or c2 ≤θ c1 or neither ⇒ anti-symmetry ⇒ partial order
- Thus, reduced clauses are partially ordered
- they form a lattice
- properties of this lattice?
57.
(figure: equivalence classes of clauses in the θ-subsumption lattice; the first clause in each class is the reduced one, and lgg / glb connect the classes)

  p(X,Y) :- m(X,Y)  ~  p(X,Y) :- m(X,Y), m(X,Z)  ~  p(X,Y) :- m(X,Y), m(X,Z), m(X,U)  ~ ...   (lgg)

  p(X,Y) :- m(X,Y), r(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), r(X)  ~ ...
  p(X,Y) :- m(X,Y), s(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), s(X)  ~ ...

  p(X,Y) :- m(X,Y), s(X), r(X)  ~  p(X,Y) :- m(X,Y), m(X,Z), s(X), r(X)  ~ ...   (glb)
58.
- The least upper bound / greatest lower bound of two clauses always exists and is unique
- Infinite chains c1 ≥θ c2 ≥θ c3 ≥θ ... ≥θ c exist:
- h(X) :- p(X,Y)
- h(X) :- p(X,X2), p(X2,Y)
- h(X) :- p(X,X2), p(X2,X3), p(X3,Y)
- ...
- h(X) :- p(X,X)
59.
- Looking for a good hypothesis = traversing this lattice
- can be done top-down, using a specialization operator
- or bottom-up, using a generalization operator
60.
(figure: the lattice drawn from top (most general) to bottom (most specific); heuristics-based searches (greedy, beam, exhaustive) traverse it toward the versionspace VS)
61. Specialisation operators
- Shapiro: general-to-specific traversal using a refinement operator ρ (a code sketch follows below)
- ρ(c) yields a set of refinements of c
- theory: ρ(c) = { c' | c' is a maximally general specialisation of c }
- practice: ρ(c) = { c ∪ {l} | l is a literal } ∪ { cθ | θ is a substitution }
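A minimal Python sketch of the "practical" version of ρ (all names hypothetical): refinements are produced by adding one literal or substituting one variable by a constant; variable unification is omitted for brevity.

    def refinements(clause, literals, constants):
        # clause and candidate literals are tuples like ('parent', 'X', 'Z')
        out = []
        for lit in literals:                    # (1) add a literal
            if lit not in clause:
                out.append(clause + [lit])
        variables = {t for lit in clause for t in lit[1:] if t[:1].isupper()}
        for v in sorted(variables):             # (2) apply a substitution v -> c
            for c in constants:
                out.append([(l[0],) + tuple(c if t == v else t for t in l[1:])
                            for l in clause])
        return out

    # refinements of daughter(X,Y):
    print(refinements([('daughter','X','Y')],
                      [('parent','Y','X'), ('female','X')], ['ann']))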
62.
(figure: part of the refinement graph below daughter(X,Y))

daughter(X,Y)
  daughter(X,X)
  daughter(X,Y) :- parent(X,Z)
  daughter(X,Y) :- parent(Y,X)
  daughter(X,Y) :- female(X)
    daughter(X,Y) :- female(X), female(Y)
    daughter(X,Y) :- female(X), parent(Y,X)
  ...
63.
- How to traverse the hypothesis space so that
- no hypotheses are generated more than once?
- no hypotheses are skipped?
- ⇒ Many properties of refinement operators studied in detail
64.
- Some properties:
- globally complete: each point in the lattice is reachable from the top
- locally complete: each point directly below c is in ρ(c) (useful for greedy systems)
- optimal: no point in the lattice is reached twice (useful for exhaustive systems)
- minimal, proper, ...
65. A generalisation operator
- For bottom-up search
- We discuss one generalisation operator: Plotkin's lgg
- Starts from 2 clauses and computes their least general generalisation (lgg)
- i.e., given 2 clauses, return the most specific single clause that is more general than both of them
66.
- Definition of lgg of terms
- (let si, tj denote any term, V a variable)
- lgg(f(s1,...,sn), f(t1,...,tn)) = f(lgg(s1,t1),...,lgg(sn,tn))
- lgg(f(s1,...,sn), g(t1,...,tn)) = V   (f ≠ g)
- e.g. lgg(a,b) = X; lgg(f(X),g(Y)) = Z; lgg(f(a,b,a), f(c,c,c)) = f(X,Y,X)
67.
- lgg of literals:
- lgg(p(s1,...,sn), p(t1,...,tn)) = p(lgg(s1,t1),...,lgg(sn,tn))
- lgg(¬p(...), ¬p(...)) = ¬lgg(p(...), p(...))
- lgg(p(s1,...,sn), q(t1,...,tn)) is undefined
- lgg(p(...), ¬p(...)) and lgg(¬p(...), p(...)) are undefined
68.
- lgg of clauses (a code sketch follows below):
- lgg(c1,c2) = { lgg(l1,l2) | l1 ∈ c1, l2 ∈ c2 and lgg(l1,l2) defined }
- Example:
- f(t,a) :- p(t,a), m(t), f(a)
- f(j,p) :- p(j,p), m(j), m(p)
- lgg: f(X,Y) :- p(X,Y), m(X), m(Z)
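A small Python sketch of lgg on terms (names hypothetical). The key detail is that the same pair of mismatching subterms must always map to the same fresh variable, which is what makes lgg(f(a,b,a), f(c,c,c)) = f(X,Y,X).

    def lgg_term(s, t, table, counter):
        # terms: constants are strings, compound terms are tuples (functor, args...)
        if s == t:
            return s
        if (isinstance(s, tuple) and isinstance(t, tuple)
                and s[0] == t[0] and len(s) == len(t)):
            return (s[0],) + tuple(lgg_term(a, b, table, counter)
                                   for a, b in zip(s[1:], t[1:]))
        if (s, t) not in table:            # one shared variable per distinct pair
            counter[0] += 1
            table[(s, t)] = 'V%d' % counter[0]
        return table[(s, t)]

    def lgg_literal(l1, l2, table, counter):
        # defined only for literals with the same predicate and sign
        if l1[0] != l2[0] or len(l1) != len(l2):
            return None
        return (l1[0],) + tuple(lgg_term(a, b, table, counter)
                                for a, b in zip(l1[1:], l2[1:]))

    print(lgg_term(('f','a','b','a'), ('f','c','c','c'), {}, [0]))
    # ('f', 'V1', 'V2', 'V1')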
69.
- Relative lgg (rlgg) (Plotkin 1971)
- relative to a "background theory" B (assume B is a set of facts)
- rlgg(e1,e2) = lgg(e1 :- B, e2 :- B)
- method to compute:
- change the facts into clauses with body B
- compute the lgg of the clauses
- remove B, reduce
70. Example: Bongard problems
- Bongard: Russian scientist studying pattern recognition
- Given some pictures, find patterns in them
- Simplified versions of Bongard problems are used as benchmarks in ILP
71.
(figure: two groups of drawings, labelled "Examples labelled neg" and "Examples labelled pos")
72.
- Example: 2 simple Bongard problems; find the least general clause that would predict both to be positive

Example 1:            Example 2:
pos(1).               pos(2).
contains(1,o1).       contains(2,o3).
contains(1,o2).       triangle(o3).
triangle(o1).         points(o3,down).
points(o1,down).
circle(o2).
73.
- Method 1: represent each example by a clause; compute the lgg of the examples

pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2).
pos(2) :- contains(2,o3), triangle(o3), points(o3,down).

lgg( (pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2)),
     (pos(2) :- contains(2,o3), triangle(o3), points(o3,down)) )
= pos(X) :- contains(X,Y), triangle(Y), points(Y,down)
74.
- Method 2: represent the class of each example by a fact, other properties in the background; compute the rlgg

Examples:
pos(1). pos(2).

Background:
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3). points(o1,down). points(o3,down). circle(o2).

rlgg(pos(1), pos(2)) = ? (exercise)
75.
- The θ-subsumption ordering is used by many ILP systems
- top down, using refinement operators (many systems)
- bottom up, using rlgg (e.g., the Golem system, Muggleton & Feng)
76.
- Note: inverting implication
- Given the incompleteness of θ-subsumption, could we invert implication?
- Some problems:
- the lgg under implication is not unique: e.g., the lgg of p(f(f(f(X)))) :- p(X) and p(f(f(X))) :- p(X) can be p(f(X)) :- p(X) or p(f(f(X))) :- p(Y)
- computationally expensive
77. 2) Inverting resolution
- Resolution rule for deduction:

Propositional:                    First order:

p ∨ ¬q    q ∨ r                   p(X) ∨ ¬q(X)    q(X) ∨ ¬r(X,Y)
---------------                   ------------------------------
     p ∨ r                              p(X) ∨ ¬r(X,Y)

p ∨ ¬q    q ∨ s                   p(a) ∨ ¬q(b)    q(X) ∨ ¬r(X,Y)
---------------                   ------------------------------   θ = {X/b}
     p ∨ s                              p(a) ∨ ¬r(b,Y)
78. Inverting resolution

Given 2 opposite literals (up to a substitution): liθ1 = ¬kjθ2

l1 ∨ ... ∨ li ∨ ... ∨ ln        k1 ∨ ... ∨ kj ∨ ... ∨ km
---------------------------------------------------------------------------
(l1 ∨ ... ∨ li-1 ∨ li+1 ∨ ... ∨ ln ∨ k1 ∨ ... ∨ kj-1 ∨ kj+1 ∨ ... ∨ km)θ1θ2

e.g., p(X) :- q(X) and q(X) :- r(X,Y) yield p(X) :- r(X,Y)
      p(X) :- q(X) and q(a) yield p(a).
79.
- Resolution implements ⊢ for sets of clauses
- cf. θ-subsumption for single clauses
- Inverting it allows one to generalize a clausal theory
- Inverse resolution is much more difficult than resolution itself
- different operators defined
- no unique results
- Some operators related to inverse resolution
- (A and B are conjunctions of literals)
- absorption
- from q-A and p - A,B
- infer p - q,B
- identification
- from p - q,B and p - A,B
- infer q - A
q - A
p - q,B
p - A,B
q - A
p - q,B
p - A,B
81.
- Intra-construction:
- from p :- A,B and p :- A,C
- infer q :- B and p :- A,q and q :- C
- Inter-construction:
- from p :- A,B and q :- A,C
- infer p :- r,B and r :- A and q :- r,C
(figure: intra-construction drawn over p :- A,B and p :- A,C, yielding q :- B, p :- A,q, q :- C; inter-construction drawn over p :- A,B and q :- A,C, yielding p :- r,B, r :- A, q :- r,C)
82.
- With intra- and inter-construction, new predicates are invented
- E.g., apply intra-construction on
- grandparent(X,Y) :- father(X,Z), father(Z,Y)
- grandparent(X,Y) :- father(X,Z), mother(Z,Y)
- What predicate is invented?
83. Example: inverse resolution

f(X,Y) :- p(X,Y), m(X)      m(j)
--------------------------------
      f(j,Y) :- p(j,Y)      p(j,m)
      ----------------------------
              f(j,m)

(read bottom-up: from f(j,m) and p(j,m), induce f(j,Y) :- p(j,Y); from that and m(j), induce f(X,Y) :- p(X,Y), m(X))
84.
grandparent(X,Y) :- father(X,Z), parent(Z,Y)      father(X,Y) :- male(X), parent(X,Y)
--------------------------------------------------------------------------------------
grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)      male(jef)
--------------------------------------------------------------------
grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)      parent(jef,an)
--------------------------------------------------------------------
grandparent(jef,Y) :- parent(an,Y)      parent(an,paul)
-------------------------------------------------------
grandparent(jef,paul)
85.
- Properties of inverse resolution:
- in principle very powerful
- but gives rise to a huge search space
- and the result of inverse resolution is not unique
- e.g., father(j,p) :- male(j) and parent(j,p) yields father(j,p) :- male(j), parent(j,p) or father(X,Y) :- male(X), parent(X,Y) or ...
- CIGOL approach (Muggleton & Buntine)
86.
- We now have some basic operators:
- θ-subsumption-based, at the single clause level:
- specialization operator ρ
- generalization operator: lgg of 2 clauses
- inverse resolution: generalizes a set of clauses
- These can be used to build ILP systems
- top-down, using specialization operators
- bottom-up, using generalization operators
87. Representations
- 2 main paradigms for learning in ILP:
- learning from interpretations
- learning from entailment
- Related to the representation of examples
- Cf. the Bongard examples we saw before
88. Learning from entailment
- 1 example = a fact e (or a clause e :- B)
- Goal:
- Given examples <E+, E->,
- Find theory H such that
- ∀e ∈ E+: B ∪ H ⊢ e
- ∀e- ∈ E-: B ∪ H ⊬ e-
89.
Examples:
pos(1). pos(2). - pos(3).

Background:
contains(1,o1). contains(1,o2). contains(2,o3). contains(3,o4).
triangle(o1). triangle(o3). points(o1,down). points(o3,down).
circle(o2). circle(o4).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y,down).
90. Learning from interpretations
- Example = interpretation (set of facts) e
- contains a full description of the example
- all information that intuitively belongs to the example is represented in the example, not in background knowledge
- Background = domain knowledge
- general information concerning the domain, not concerning specific examples
91.
Examples:
pos(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2).
pos(2) :- contains(2,o3), triangle(o3), points(o3,down).
- (pos(3) :- contains(3,o4), circle(o4)).

Background:
polygon(X) :- triangle(X). polygon(X) :- square(X).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y,down).
92. Closed World Assumption made inside interpretations

Examples:
pos: {contains(o1), contains(o2), triangle(o1), points(o1,down), circle(o2)}
pos: {contains(o3), triangle(o3), points(o3,down)}
neg: {contains(o4), circle(o4)}

Background:
polygon(X) :- triangle(X). polygon(X) :- square(X).

Constraint on pos:
∃Y: contains(Y), triangle(Y), points(Y,down).
93.
- Note: when learning from interpretations
- can dispose of the example identifier
- but can also use the standard format
- CWA made for the example description
- i.e., the example description is assumed to be complete
- the class of an example is related to information inside the example + background information, NOT to information in other examples
94.
- Because of the 3rd property, more limited than learning from entailment
- cannot learn relations between different examples, nor recursive clauses
- but also more efficient
- because of the 2nd and 3rd properties
- positive PAC-learnability results (De Raedt and Džeroski, 1994, AIJ), vs. negative results for learning from entailment
95. Algorithms
96. Rule induction
- Most inductive logic programming systems induce a concept definition in the form of a set of definite Horn clauses (a Prolog program)
- Many algorithms are similar to propositional algorithms for learning rule sets:
- FOIL -> CN2
- Progol -> AQ
97. FOIL (Quinlan)
- Learns a single concept, e.g., p(X,Y) :- ...
- To learn one clause (hill-climbing search; see the sketch below):
- start with the most general clause p(X,Y) :- true
- repeat
- add the best literal to the clause (i.e., the literal that most improves the quality of the clause)
- a new literal can also be a unification X=c or X=Y
- = applying a refinement operator under θ-subsumption
- until no further improvement
father(homer,bart). father(bill,chelsea). -
father(marge,bart). - father(hillary,chelsea). -
father(bart,chelsea). parent(homer,bart). parent
(marge,bart). parent(bill,chelsea). parent(hillary
,chelsea) male(homer). male(bart). male(bill). fem
ale(chelsea). female(marge).
99.
father(homer,bart). father(bill,chelsea).
- father(marge,bart). - father(hillary,chelsea). - father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

Candidate refinements:
father(X,Y) :- parent(X,Y).      (2+, 2-)
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).
100.
father(homer,bart). father(bill,chelsea).
- father(marge,bart). - father(hillary,chelsea). - father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

father(X,Y) :- parent(X,Y).
father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).      (2+, 1-)
father(X,Y) :- male(Y).
father(X,Y) :- female(X).
father(X,Y) :- female(Y).
101.
father(homer,bart). father(bill,chelsea).
- father(marge,bart). - father(hillary,chelsea). - father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

father(X,Y) :- male(X).
father(X,Y) :- male(X), parent(X,Y).      (2+, 0-)
father(X,Y) :- male(X), parent(Y,X).
father(X,Y) :- male(X), male(Y).
father(X,Y) :- male(X), female(X).
father(X,Y) :- male(X), female(Y).
102. Learning multiple clauses: the Covering approach
- To learn multiple clauses (see the sketch below):
- repeat
- learn a single clause c (see previous algorithm)
- add c to h
- mark positive examples covered by c as covered
- until
- all positive examples are marked covered
- or no more good clauses are found
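In code, the covering loop is just a wrapper around single-clause learning; a minimal sketch, reusing the hypothetical learn_clause and covers from the FOIL sketch above:

    def covering(pos, neg):
        hypothesis, remaining = [], list(pos)
        while remaining:
            body = learn_clause(remaining, neg)   # one clause, as sketched above
            if not body:                          # no more good clauses found
                break
            hypothesis.append(body)
            # keep only the positives the new clause does not cover yet
            remaining = [e for e in remaining if not covers(body, *e)]
        return hypothesis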
103.
likes(garfield, lasagne). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).

likes(garfield, X) :- edible(X).      (3+, 0-)
104.
likes(garfield, lasagne). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
(italics: previously covered)

likes(garfield, X) :- edible(X).
likes(garfield, X) :- subject_to_cruelty(X).      (2+, 0-)
105. Some pitfalls
- Avoiding infinite recursion
- when recursive clauses are allowed, e.g., ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
- avoid learning parent(X,Y) :- parent(X,Y)
- won't be useful, even though it's 100% correct
- Bonus for the introduction of new variables
- a literal may not yield any direct gain, but may introduce variables that are useful later:

p(X) :- q(X)              p positives, n negatives covered
refine by adding age:
p(X) :- q(X), age(X,Y)    p positives, n negatives covered ⇒ no gain
106. Golem (Muggleton & Feng)
- Based on the rlgg operator
- To build one clause:
- Take 2 positive examples, find their rlgg, generalize using yet another example, ... until no improvement in the quality of the clause
- bottom-up search
- Result very dependent on the choice of examples
- e.g. what if the true theory is { p(X) :- q(X), p(X) :- r(X) }?
107.
- Try this for different couples, pick the best clause found
- this reduces the dependency on the choice of couple (if 1 of them is noisy, no good clause is found)
- Remove covered positive examples, restart the process
- Repeat until no more good clauses are found
108.
- 1 limitation of Golem: extensional coverage tests
- only extensional background knowledge
- may go wrong when learning recursive clauses

examples:    p(0). p(1). p(2). - p(4).
background:  s(0,1). s(1,2). s(2,3). s(3,4).
induces:     p(Y) :- s(X,Y), p(X).

extensional coverage test: H :- B is checked by running the query ?- B against the given facts (rather than against the induced theory)
109. Progol (Muggleton)
- Top-down approach, but with a seed
- To find one clause:
- Start with 1 positive example e
- Generate a hypothesis space He that contains only hypotheses covering at least this one example
- first generate the most specific clause c that covers e
- He contains every clause more general than c
- Perform an exhaustive top-down search in He, looking for the clause that maximizes compaction
110.
- Compaction = size(covered examples) - size(clause)
- Repeat the process of finding one clause until no more good (= compaction-causing) clauses are found
- The compaction heuristic in principle allows no coverage of negatives
- can be relaxed (accommodating noise)
- Language bias set of all acceptable clauses
(chosen by user) - specification of H (on level of single clauses)
- Bottom clause ? for example e most specific
clause in language bias covering e - Constructed using inverse entailment
112.
- Construction of ⊥:
- if B ∧ H ⊨ e, then B ∧ ¬e ⊨ ¬H
- if H is a clause, ¬H is a conjunction of ground (skolemized) literals
- compute ¬⊥ = all ground literals entailed by B ∧ ¬e
- ¬H must be a subset of these
- so B ∧ ¬e ⊨ ¬⊥ ⊨ ¬H
- hence H ⊨ ⊥
113.
- Some examples (cf. Muggleton, NGC 1995):

B:  anim(X) :- pet(X).  pet(X) :- dog(X).
e:  nice(X) :- dog(X).
⊥:  nice(X) :- dog(X), pet(X), anim(X).

B:  hasbeak(X) :- bird(X).  bird(X) :- vulture(X).
e:  hasbeak(tweety).
⊥:  hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).
114.
- Example of (part of) a Progol run
- learn to classify animals as mammals, reptiles, ...

generalise(class/2)?
Generalising class(dog,mammal).
Most specific clause is:
  class(A,mammal) :- has_milk(A), has_covering(A,hair), has_legs(A,4),
                     homeothermic(A), habitat(A,land).
C:-28,4,10,0  class(A,mammal).
C:8,4,0,0     class(A,mammal) :- has_milk(A).
C:5,3,0,0     class(A,mammal) :- has_covering(A,hair).
C:-4,4,3,0    class(A,mammal) :- homeothermic(A).
[4 explored search nodes] f=8, p=4, n=0, h=0
Result of search is: class(A,mammal) :- has_milk(A).
115.
- Exhaustive search: important to constrain the size of the hypothesis space
- Strong language bias:
- specify which predicates can be used in the head or body of a clause
- specify types and modes of predicates
- e.g., allow age(X,Y), Y<18
- but not habitat(X,Y), Y<18
put this in head
variable of type "animal"
- modeh(1,class(animal,class))? -
modeb(1,has_milk(animal))? - modeb(1,has_gills(
animal))? - modeb(1,has_covering(animal,coverin
g))? - modeb(1,has_legs(animal,nat))? -
modeb(1,homeothermic(animal))? -
modeb(1,has_eggs(animal))? - modeb(,habitat(an
imal,habitat))?
constant of type "covering"
put this in body
there can be any number of habitats
only one literal of this kind needed
117. Other approaches
- The algorithms we have seen up till now are rule-based algorithms
- induce a theory in the form of a set of rules (definite Horn clauses)
- induce rules one by one
- Quite normal, given that logic programs are essentially sets of rules
118.
- Still, induction of rule sets is only one type of machine learning
- The difference between ILP and propositional approaches is mainly in representation
- Possible to define other learning techniques and tasks in ILP: induction of constraints, induction of decision trees, Bayesian learning, ...
119. Claudien (De Raedt & Bruynooghe)
- "Clausal Discovery Engine"
- Discovers patterns that hold in a set of data
- any patterns represented as clauses (not necessarily Horn clauses)
- I.e., finds patterns of a more general kind than predictive rules
- also called descriptive induction
120.
- Given a hypothesis space:
- performs an exhaustive top-down search through the space
- returns all clauses that
- hold in the data set
- are not implied by other clauses found
- Strong language bias: a precise syntactical description of the acceptable clauses
121.
Template (head literals :- body literals):
parent(X,Y), father(X,Y), mother(X,Y) :- parent(X,Y), father(X,Y), mother(X,Y),
                                         male(X), male(Y), female(X), female(Y)

- May result in the following clauses being discovered:
parent(X,Y) :- father(X,Y).
parent(X,Y) :- mother(X,Y).
:- father(X,Y), mother(X,Y).
:- male(X), female(X).
mother(X,Y) :- parent(X,Y), female(X).
...
122. Claudien algorithm

S := ∅
Q := {most general clause}
while Q not empty:
  pick the first clause c from Q
  for all (h :- b) in ρ(c):
    if the query (b, not h) fails (i.e., the clause is true in the data)
    then
      if (h :- b) is not entailed by clauses in S, then add (h :- b) to S
    else add (h :- b) to Q
123. ICL (De Raedt and Van Laer)
- Inductive Constraint Logic
- First system to learn from interpretations
- Searches for constraints on interpretations distinguishing examples of different classes
- Roughly: run Claudien on a set of examples E
- each constraint found will be true for all e+, but probably false for some e-
- all constraints together hopefully rule out all e-
124.
- Search for one constraint:
- c := the most general clause
- repeat until c is true for all positives:
- find d in ρ(c) so that d holds for as many positives and as few negatives as possible
- c := d
- add c to h
- can also use beam search
125.
- Search for a set of constraints on a class:
- h := ∅
- while there are negatives left to be eliminated:
- find a constraint c
- add c to h
- Uses the same language bias (DLAB) as recent versions of Claudien
- DLAB is an advanced form of the original Claudien bias
- min-max ... means at least min and at most max
literals from the list are to be put here - can be nested
- allows some nice tricks, e.g.
- 1-1male(X),female(X)
0-2parent(X,Y), father(X,Y), mother(X,Y) lt--
0-lenparent(X,Y), father(X,Y), mother(X,Y),
male(X), male(Y), female(X), female(Y)
127. Warmr (Dehaspe)
- Induces first order association rules
- Algorithm similar to APRIORI
- Finds frequent patterns
- cf. "frequent item sets" in the APRIORI context
- Pattern = conjunction of literals
- Uses the θ-subsumption lattice over the hypothesis space
- Constructs association rules from patterns
- IF this pattern occurs, THEN that pattern occurs too
128. The APRIORI algorithm
- APRIORI (Agrawal et al.): efficient discovery of frequent itemsets and association rules
- Typical example: market basket analysis
- which things are often bought together?
- Association rule:
- IF a1, ..., an THEN an+1, ..., an+m
129.
- Association rules should have at least some minimal
- support: |t(a1,...,an+m)| / |t(true)|, where t(...) is the set of transactions containing those items
- how many people buy all these things together?
- confidence: |t(a1,...,an+m)| / |t(a1,...,an)|
- how many of the people buying the IF-things also buy the THEN-things?
- Minimal support and confidence may be low
130.
- APRIORI is tailored towards using large data sets
- efficiency very important
- minimize data access
- Works in 2 steps:
- find frequent itemsets
- compute association rules from them
131.
- Observation:
- if {a1,...,an} is infrequent (below min. support)
- then {a1,...,an+1} is also infrequent
- adding a condition can only strengthen the conjunction
- Hence:
- {a1,...,an} can only be frequent if each subset of it is frequent
132.
- Leads to a levelwise algorithm:
- first compute frequent singletons
- then frequent pairs, triples, ...
- a lot of pruning is possible due to the previous observation:
- an itemset of cardinality n is a candidate only if each subset of it of cardinality n-1 was frequent in the previous level
- need to count only candidates
bread
butter
wine
ham
cheese
jam
Bread butter
Bread cheese
Bread jam
Butter cheese
Butter jam
Cheese jam
Bread butter cheese
Bread butter jam
Not a candidate
134. Apriori algorithm

min_freq := min_support * freq(∅)
d := 0
Q0 := {∅}    /* candidates for level 0 */
F := ∅       /* frequent sets */
while Qd ≠ ∅ do
  for all S in Qd do find freq(S)
  Fd := { S in Qd | freq(S) ≥ min_freq }
  F := F ∪ Fd
  compute Qd+1
  d := d+1
return F
135. Computing candidates (a runnable sketch follows below)

Compute Qd+1 from Fd:
  Qd+1 := ∅
  for each S in Fd do
    for each item x not in S do
      S' := S ∪ {x}
      if ∀i in S': S'\{i} ∈ Fd then add S' to Qd+1
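A compact runnable Python version of the levelwise search, on a toy basket dataset (all data hypothetical):

    from itertools import combinations

    transactions = [{'bread','butter','cheese'}, {'bread','butter','jam'},
                    {'bread','cheese'}, {'wine','cheese'}, {'bread','butter'}]
    min_freq = 2                       # absolute minimum support count

    def freq(S):
        return sum(S <= t for t in transactions)

    items = sorted({x for t in transactions for x in t})
    F, level = [], [frozenset({x}) for x in items]     # level-1 candidates
    while level:
        frequent = [S for S in level if freq(S) >= min_freq]
        F += frequent
        fset = set(frequent)
        # next level: extend by one item; keep only candidates all of whose
        # (d-1)-subsets were frequent (the Apriori pruning step)
        level = list({S | {x} for S in frequent for x in items if x not in S
                      if all(frozenset(c) in fset
                             for c in combinations(S | {x}, len(S)))})
    print(sorted(map(sorted, F)))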
136.
- Step 2: deriving association rules from the frequent sets (see the sketch below)
- if S ∪ {a} ∈ F and freq(S ∪ {a}) / freq(S) > min_confidence
- then S -> S ∪ {a} is a valid association rule
- it has sufficient support and confidence
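Continuing the sketch above, rule derivation is a filter over the frequent sets F (freq, items and transactions as before; thresholds hypothetical):

    min_conf = 0.7
    for S in F:
        for a in items:
            if a not in S and (S | {a}) in set(F):
                sup = freq(S | {a}) / len(transactions)
                conf = freq(S | {a}) / freq(S)
                if conf > min_conf:
                    print(sorted(S), '->', a,
                          'support %.2f confidence %.2f' % (sup, conf))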
137. Warmr
- Warmr is a first-order version of Apriori
- Patterns (itemsets) are now conjunctive queries
- Frequent patterns: what to count?
- examples, of course...
- Was easy in the propositional case:
- 1 example = 1 tuple -> count tuples
- also easy when learning from interpretations
- not so clear when learning from implications
- which implications are examples?
- indicate this by specifying a key
- key unique identification of example
- each pattern contains a set of variables that
forms the key
139- Example
- assume 100 people in database
- person(X) X is the key
- count answer substitutions of X, not Y or Z!
- person(X), mother(X,Y) 40 examples
- mother(X,Y), has_pet(Y,Z) 30 examples
- mother(X,Y) ---gt has_pet(Y,Z) support 0.3,
confidence 0.75
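The counting convention is easy to get wrong; this toy Python fragment (hypothetical data) shows why distinct key bindings, not answer substitutions, are counted:

    mother = {('ann','mary'), ('ann','tom'), ('sue','bob')}
    has_pet = {('mary','rex'), ('tom','tweety')}

    # pattern: mother(X,Y), has_pet(Y,Z) with key X
    answers = [(x, y, z) for (x, y) in mother for (y2, z) in has_pet if y == y2]
    keys = {x for (x, y, z) in answers}
    print(len(answers), len(keys))   # 2 answer substitutions, but only 1 example (ann)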
140.
- Remark: an association rule is NOT a clause
- mother(X,Y) --> has_pet(Y,Z)
- = ∀X: (∃Y: mother(X,Y)) -> (∃Y,Z: mother(X,Y), has_pet(Y,Z))
- ≠ mother(X,Y) -> has_pet(Y,Z)
- the main difference is the occurrence of existentially quantified variables in the conclusion
- 1 example 1 drawing
- contains(D,Obj) D is the key
- Pattern e.g.,
- contains(D,X), circle(X), in(X,Y), circle(Y)
- Association rule e.g.,
- contains(D,X), circle(X),in(X,Y),circle(Y) --gt
contains(D,Z), square(Z) - "drawings that contain a circle inside another
circle usually also contain a square"
142- Warmr also useful for feature construction
- Generally applicable method for improving
representation of examples - Given description of example
- derive new (propositional) features that describe
the example - add those features to a propositional description
of the example - run a propositional learner
143- For Bongard example
- construct features "contains a circle", "contains
a circle inside a triangle", ... - given the correct features, a propositional
representation of examples is possible - Feature construction with ILP general method
for applying propositional machine learning
techniques to structural examples
144Decision tree induction in ILP
- S-CART (Kramer 1996) upgrade of CART
- Tilde (Blockeel De Raedt 98) upgrades C4.5
- Both induce "first order" or "structural"
decision trees (FOLDTs) - test in node first order literal
- may result in true or false -gt binary trees
- different nodes may share variables
- "real" test in a node conjunction of all
literal in path from root to node
145Top-down Induction of Decision Trees Algorithm
- function TDIDT(E set of examples)
- T set of possible tests
- t BEST_SPLIT(T, E)
- E partition induced on E by t
- if STOP_CRIT(E, E) then return leaf(INFO(E))
- else
- for all Ei in E ti TDIDT(Ei)
- return inode(t, (i, ti))
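The same skeleton in Python; a sketch only, with the auxiliary functions passed in as parameters (as on the next slide) and all names hypothetical, including the partition method on a test object:

    def tdidt(E, gen_tests, best_split, stop_crit, info):
        T = gen_tests(E)                     # e.g. refinements rho(c) - c in ILP
        t = best_split(T, E)                 # e.g. by information gain
        P = t.partition(E) if t else None    # dict: outcome -> subset of E
        if t is None or stop_crit(E, P):     # e.g. a significance test
            return ('leaf', info(E))         # e.g. the most frequent class
        return ('node', t,
                {i: tdidt(Ei, gen_tests, best_split, stop_crit, info)
                 for i, Ei in P.items()})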
146.
- Set of possible tests:
- generated using a refinement operator
- c = conjunction on the path from root to node
- ρ(c) - c = literal(s) to be put in the node
- Other auxiliary functions: cf. propositional TDIDT
- best split: using e.g. information gain
- stop_crit: e.g. a significance test
- info: e.g. the most frequent class
147.
- Known from propositional learning:
- induction of decision trees is fast
- usually yields good results
- These properties are inherited by Tilde / S-CART
- New results (not inherited from propositional learning) on expressiveness
worn(X)
yes
no
irreplaceable(X)
ok
yes
no
sendback
fix
("x Ø worn(x))
gt ok (x worn(x) Ù
irreplaceable(x)) gt
sendback (x"y worn(x) Ù Ø(worn(y) Ù
irreplaceable(y))) gt fix
149. Expressiveness

FOL formula equivalent with the tree:
(∀x: ¬worn(x)) => ok
(∃x: worn(x) ∧ irreplaceable(x)) => sendback
(∃x ∀y: worn(x) ∧ ¬(worn(y) ∧ irreplaceable(y))) => fix

Logic program equivalent with the tree:
a :- worn(X).
b :- worn(X), irreplaceable(X).
ok :- not(a).
sendback :- b.
fix :- a, not(b).
150.
- Prolog program equivalent with the tree, using cuts (a first order decision list):

sendback :- worn(X), irreplaceable(X), !.
fix :- worn(X), !.
ok.
151.
- A FOLDT can be converted to:
- a layered logic program
- containing invented predicates
- a flat Prolog program (using cuts)
- It cannot be converted to a flat logic program
TL
F
F Flat logic programs T decision Trees L
decision Lists
- Difference is specific for first-order case
- Possible remedies for ILP systems
- invent auxiliary predicates
- use both " and
- induce decision lists
153Representation with keys
class(e1,fix). worn(e1,gear). worn(e1,chain). clas
s(e2,sendback). worn(e2,engine). worn(e2,chain). c
lass(e3,sendback). worn(e3,control_unit). class(e4
,fix). worn(e4,chain). class(e5,keep).
worn(E,X)?
class(E,keep)
not_replaceable(X)?
class(E,fix)
class(E,sendback)
conversion to Prolog
replaceable(gear). replaceable(chain). not_replace
able(engine). not_replaceable(control_unit).
class(E,sendback) - worn(E,X),
not_replaceable(X), !. class(E,fix) - worn(E,X),
!. class(E, keep).
154.

speed(X,S), S > 120, not job(X,politician), not (∃Y: knows(X,Y), job(Y,politician)) => fine(X)

speed(X,S), S>120?
  yes: job(X,politician)?
         yes: N
         no:  knows(X,Y)?
                yes: job(Y,politician)?
                       yes: N
                       no:  Y
                no:  Y
  no:  N
155. Other advantages of FOLDTs
- Both classification and regression are possible:
- classification: predict a class (= learn a concept)
- regression: predict numbers
- important: not given much attention in ILP
- Also clustering, to some extent
- clustering: group similar examples together
possible...
- Combination of ILP and Q-learning
- RRL ("relational reinforcement learning")
reinforcement learning in structural domains - First-order equivalent of Bayesian networks
- First-order clustering
- needs first order distance measures
- ...
157Conclusions
- Many different approaches exist in Machine
Learning - ILP is in a sense diverging
- from concept learning
- to other approaches and tasks
- Still many new approaches to be tried!
158Applications of ILP
159. Applications Overview
- User modelling
- Games
- Ecology
- Drug design
- Natural language
- Inductive Database Design
- Behavioural cloning
- build model of users behaviour
- simulate users behaviour by means of model
- e.g.
- learning to fly / drive /
- learning to play music
- learning to play games (adventure, strategic, )
161.
- Automatic adaptation of a system to the user:
- detect patterns in the user's actions
- use the patterns to try to predict the user's next action
- based on the predictions, make life easier for the user
- e.g.
- mail system (auto-priority, ...)
- adaptive web pages
- intelligent search engines
162. Example Applications
- Some applications the Leuven group has looked at:
- behavioural cloning
- learning to play music
- learning to play games
- automatic adaptation of a system to the user
- adaptive webpages
- a learning command shell
- an intelligent e-mail interface
- Van Baelen De Raedt, ILP-96
- Playing music is difficult
- not just playing the notes
- but play with feeling
- adapt volume, speed,
- Midi files provided to learning system
- System detects patterns w.r.t. pitch, volume,
speed, - and tries to play music itself
164.
- Why an ILP approach?
- mainly because of time sequences
- Results?
- Compare the computer generated MIDI file with the human generated MIDI file
- The computer makes similar mistakes as a beginning player
- See the ILP-96 proceedings for details (LNAI 1314)
- Adaplix project (Jacobs et al., 1997-)
- Webpage observes actions of user
- e.g., which links are followed frequently, time
that is spent on one page, - and adapts itself
- within limitations given by page author
- change layout of page
- move links to different places
- add or remove links
166.
- example site: http://adaplix.linux.student.kuleuven.ac.be
- identify yourself
- name, gender, occupation (personnel/student)
- based on this info, provides a customized web page
- student project (in Dutch)
- Visual Elm (Jacobs, 1996)
- Intelligent mail interface
- tries to detect which kind of mails are
- immediately deleted
- immediately read
- not deleted, read later
- forwarded
-
- based on this, assigns priorities to new mails
168.
- Predictions:
- priority assigned to new mails
- expected actions: delete, forward, ...
- Explanation facility
- Several options offered to the user
- e.g. set a priority threshold, only show mails above the threshold
- sort mails according to priority
170. Learning Shell
- Jacobs, Dehaspe et al. (1999)
- Context: a Unix command shell, e.g., csh
- Each user has a profile file
- defines a configuration for the user that makes it easier to use the shell
- usually the default profile, unless the user changes it manually
171.
- Possible to learn the profile file?
- Observe the user:
- which commands are often used?
- which parameters are used with the commands?
- Automatically construct a better profile from the observations
/ background / command(Id, Command)
- isa(OrigCommand, Command), command(Id,
OrigCommand). isa(emacs, editor). isa(vi,
editor). / observations / command(1,
cd). attribute(1, 1, tex). command(2,
emacs). switch(2, 1, -nw). switch(2, 2,
-q). attribute(2, 1, aaai.tex).
173.
- Detect relationships (association rules) with the ILP system Warmr
- Examples of rules output by Warmr:

IF command(Id, ls) THEN switch(Id, -l).
IF recentcommand(Id, cd) AND command(Id, ls) THEN nextcommand(Id, editor).
174.
- Some (preliminary) experimental results
- Evaluation criterion: predict the next action of the user
- Actions logged for 10 users
- each log: about 500 commands
- 2 experiments:
- learning from all log files together
- learning from individual log files
175.
- Learning from mixed data:
- predictive accuracy 35% (= fmax, the relative frequency of the most popular command)
- Learning from individual data:
- predictive accuracy 50% (> fmax)
- Conclusion:
- the proposed approach to user modelling in this context shows promise
176. Learning to Play Games
- Strategic games, adventure games, ...
- le