From Machine Learning to Inductive Logic Programming: ILP made easy

1
From Machine Learning to Inductive Logic Programming: ILP made easy
  • Hendrik Blockeel
  • Katholieke Universiteit Leuven Belgium

2
Contents of this course
  • Introduction
  • What is Inductive Logic Programming?
  • Relationship with other fields
  • Foundations of ILP
  • Algorithms
  • Applications

Contents and slides in co-operation with Luc De
Raedt of the University of Freiburg, Germany
3
1. Introduction
  • What is inductive logic programming?

4
Introduction: What is ILP?
  • Paradigm for inductive reasoning (reasoning from
    specific to general)
  • Related to
  • machine learning and data mining
  • logic programming

5
Inductive reasoning
  • Reasoning from specific to general
  • from (specific) observations
  • to a (general) hypothesis
  • Studied in
  • philosophy of science
  • statistics
  • ...

6
  • Distinguish
  • weak induction: "all observed tomatoes are red"
  • strong induction: "all tomatoes are red"

7
  • Weak induction: the conclusion is entailed by
    (follows deductively from) the observations
  • cannot be wrong
  • Strong induction: the conclusion does not follow
    deductively from the observations
  • could be wrong!
  • logic does not provide justification
  • probability theory may

8
A predicate logic approach
  • Different kinds of reasoning in first order
    predicate logic
  • Standard example: Socrates

Human(Socrates)
Mortal(X) :- Human(X)
9
(No Transcript)
10
  • Logic programming focuses on deduction
  • Other types of LP
  • abductive logic programming (ALP)
  • inductive logic programming (ILP)
  • 2 questions to be solved
  • How to perform induction?
  • How to integrate it in logic programming?

11
Some examples
  • Learning a definition of member from examples

Examples:
member(a,[a,b,c]).   member(b,[a,b,c]).   member(3,[5,4,3,2,1]).
- member(b,[1,2,3]).   - member(3,[a,b,c]).     (- marks a negative example)
12
Some examples
  • Use of background knowledge
  • E.g., learning quicksort

qsort([b,c,a],[a,b,c]).   qsort([],[]).   qsort([5,3],[3,5]).
- qsort([5,3],[5,3]).   - qsort([1,3],[3]).
split(L,A,B) :- ...    append(A,B,C) :- ...     (background knowledge)
13
Some examples
  • Not only predicate definitions can be learned
    e.g. learning constraints

parent(jack,mary). parent(mary,bob). father(jack,mary). mother(mary,bob).
male(jack). male(bob). female(mary).
14
Practical applications
  • Program synthesis
  • very hard
  • subtasks debugging, validation,
  • Machine learning
  • e.g., learning to play games
  • Data mining
  • mining in large amounts of structured data

15
Example Application Mutagenicity Prediction
  • Given a set of molecules
  • Some cause mutation in DNA (these are mutagenic),
    others don't
  • Try to distinguish them on the basis of molecular
    structure
  • Srinivasan et al. (1994) found a structural alert

16
(No Transcript)
17
Example Application Pharmacophore Discovery
  • Application by Muggleton et al., 1996
  • Find "pharmacophore" in molecules
  • identify substructure that causes it to "dock"
    on certain other molecules
  • Molecules described by listing for each atom in
    it element, 3-D coordinates, ...
  • Background defines Euclidean distance, ...

18
  • Some example molecules (Muggleton et al. 1996)

19
Description of molecules
Background knowledge
...
hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...
atm(m1,a1,o,2,3.430400,-3.116000,0.048900).
atm(m1,a2,c,2,6.033400,-1.776000,0.679500).
atm(m1,a3,o,2,7.026500,-2.042500,0.023200).
...
bond(m1,a2,a3,2). bond(m1,a5,a6,1). bond(m1,a2,a4,1). bond(m1,a6,a7,du).
...
20
Learning to play strategic games
21
Advantages of ILP
  • Advantages of using first order predicate logic
    for induction
  • powerful representation formalism for data and
    hypotheses (high expressiveness)
  • ability to express background domain knowledge
  • ability to use powerful reasoning mechanisms
  • many kinds of reasoning have been studied in a
    first order logic framework

22
Foundations of Inductive Logic Programming
23
Overview
  • Concept learning: the versionspaces approach
  • from machine learning
  • how to search for a concept definition consistent
    with examples
  • based on notion of generality

24
  • Notions of generality in ILP
  • the theta-subsumption ordering
  • other generality orderings
  • basic techniques and algorithms
  • Representation of data
  • two paradigms: learning from entailment,
    learning from interpretations

25
Concept learning
  • Given
  • an instance space
  • some unknown concept = a subset of the instance space
  • Task: learn the concept definition from examples
    (= labelled instances)
  • Could be defined extensionally or intensionally
  • Usually interested in intensional definition
  • otherwise no generalisation possible

26
  • Hypothesis h = concept definition
  • can be represented intensionally: h
  • or extensionally (as a set of examples): ext(h)
  • Hypothesis h covers example e iff e ∈ ext(h)
  • Given a set of (positive and negative) examples
    E = ⟨E+, E−⟩, h is consistent with E iff E+ ⊆ ext(h)
    and ext(h) ∩ E− = ∅

27
Versionspaces
  • Given a set of instances E and a hypothesis space
    H, the versionspace is the set of all h ∈ H
    consistent with E
  • contains all hypotheses in H that might be the
    correct target concept
  • Some inductive algorithms exist that, given H and
    E, compute the versionspace VS(H,E)

28
Properties
  • If the target concept c ∈ H, and E contains no noise,
    then c ∈ VS(H,E)
  • If VS(H,E) is a singleton: one solution
  • Usually multiple solutions
  • If H = 2^I with I the instance space
  • i.e., all possible concepts are in H
  • then no generalisation is possible
  • H is called the inductive bias

29
  • Usually illustrated with conjunctive concept
    definitions
  • Example from T. Mitchell, 1996, Machine
    Learning

Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
sunny  warm     normal    strong  warm   same      yes
30
Lattice for Conjunctive Concepts
(Figure: the generality lattice, most general at the top, most specific
at the bottom)

         <?,?,?,?,?,?>
  <Sunny,?,?,?,?,?>   <?,Warm,?,?,?,?>   <?,?,?,?,?,Same>   ...
         ...
  <Sunny,Warm,Normal,Strong,Warm,Same>   ...
         <∅,∅,∅,∅,∅,∅>
31
  • Concept represented as an if-then rule
  • <Sunny,Warm,?,?,?,?> =
  • IF Sky=sunny AND AirTemp=warm THEN
    EnjoySport=yes

32
Generality
  • Central to versionspace algorithms is the notion of
    generality
  • h is more general than h' (h ≽ h') iff
    ext(h) ⊇ ext(h')
  • Properties of VS(H,E) w.r.t. generality:
  • if s ∈ VS(H,E), g ∈ VS(H,E) and g ≽ h ≽ s, then
    h ∈ VS(H,E)
  • ⟹ VS can be represented by its borders

33
Candidate Elimination Algorithm
  • Start with the general border G = {most general
    hypothesis} and the specific border S = {most
    specific hypothesis}
  • When encountering a positive example e:
  • generalise hypotheses in S that do not cover e
  • throw away hypotheses in G that do not cover e
  • When encountering a negative example e:
  • specialise hypotheses in G that cover e
  • throw away hypotheses in S that cover e
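The update rules above can be sketched for attribute-vector concepts. This is a minimal illustration under stated assumptions: helper names are mine, attribute domains are finite, and pruning of redundant border members is omitted.

```python
def covers(h, x):
    # h covers instance x iff every non-'?' attribute matches
    # (None stands for the empty concept, which covers nothing)
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general(g, h):
    return all(gv == '?' or gv == hv for gv, hv in zip(g, h))

def minimal_generalisation(s, x):
    # smallest step up the lattice so that s covers the positive x
    return tuple(xv if sv is None else (sv if sv == xv else '?')
                 for sv, xv in zip(s, x))

def minimal_specialisations(g, x, domains):
    # replace one '?' by a concrete value that excludes the negative x
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    S = [tuple([None] * len(domains))]   # specific border
    G = [tuple(['?'] * len(domains))]    # general border
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            S = [minimal_generalisation(s, x) for s in S]
        else:
            S = [s for s in S if not covers(s, x)]
            G = [h for g in G
                 for h in ([g] if not covers(g, x)
                           else minimal_specialisations(g, x, domains))
                 if any(more_general(h, s) for s in S)]
    return S, G
```

On the three-attribute example used on the following slides (value sets {s,c,r}, {w,c}, {n,d}), a positive <c,w,n> followed by a negative <c,c,d> leaves S = {<c,w,n>} and G = {<?,w,?>, <?,?,n>}.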

34
(Figure: the full lattice of conjunctive concepts over three attributes
with value sets {s,c,r}, {w,c} and {n,d}; the general border G starts at
the top element <?,?,?> and the specific border S at the bottom.)
35
Example <c,w,n> labelled positive

(Figure: the same lattice after the positive example <c,w,n>: S is
generalised to cover it, while G = {<?,?,?>} is unchanged.)
36
Examples: <c,w,n> positive, <c,c,d> negative

(Figure: after the negative example <c,c,d>, G is specialised to
{<?,w,?>, <?,?,n>}; S = {<c,w,n>}.)
37
  • Keeping G and S may not be feasible
  • exponential size
  • In practice, most inductive concept learners do
    not identify VS but just try to find one
    hypothesis in VS

38
Importance of generality for induction
  • Even when not VS itself but only one element of
    it is computed, generality can be used for search
  • its properties allow pruning of the search space:
  • if h covers negatives, then any g ≽ h also covers
    those negatives
  • if h does not cover some positives, then any s ≼ h
    does not cover those positives either

39
  • For concept learning in ILP, we will need a
    generality ordering between hypotheses
  • ILP is not only useful for learning concepts, but
    in general for learning theories (e.g.,
    constraints)
  • then we need generality ordering for theories

40
Concept Learning in First Order Logic
  • Need a notion of generality (cf. versionspaces)
  • θ-subsumption, entailment, ...
  • How to specialise / generalise concept
    definitions?
  • operators for specialisation / generalisation
  • inverse resolution, least general generalisation
    under θ-subsumption, ...

41
Generality of theories
  • A theory G is more general than a theory S if and
    only if G ⊨ S
  • G ⊨ S: in every interpretation (set of facts)
    for which G is true, S is also true
  • "G logically implies S"
  • e.g., "all fruit tastes good" ⊨ "all apples
    taste good" (assuming apples are fruit)

42
  • Note: we are talking about theories, not just concepts
    (contrast with versionspaces)
  • generality of concepts is a special case of this
  • This will allow us to also learn e.g.
    constraints, instead of only predicate
    definitions (= concept definitions)

43
Deduction, induction and generality
  • Deduction: reasoning from general to specific
  • is "always correct", truth-preserving
  • Induction: reasoning from specific to general =
    the inverse of deduction
  • not truth-preserving (only falsity-preserving)
  • there may be statistical evidence

44
  • Deductive operators "⊢" exist that implement (or
    approximate) ⊨
  • E.g., resolution (from logic programming)
  • Inverting these operators yields inductive
    operators
  • the basic technique in many inductive logic
    programming systems

45
Various frameworks for generality
  • Depending on the form of G and S:
  • 1 clause / set of clauses / any first order
    theory
  • Depending on the choice of ⊢ to invert:
  • θ-subsumption
  • resolution
  • implication
  • Some frameworks are much easier than others

46
1) θ-subsumption (Plotkin)
  • Most often used in ILP
  • S and G are single clauses
  • c1 θ-subsumes c2 (denoted c1 ≤θ c2) if and only
    if there exists a variable substitution θ such
    that c1θ ⊆ c2
  • to check this, first write the clauses as
    disjunctions:
  • a,b,c ← d,e,f  ≡  a ∨ b ∨ c ∨ ¬d ∨ ¬e ∨ ¬f
  • then try to replace variables with constants or
    other variables
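This definition can be checked by brute force, as a sketch (the encoding is mine: a clause is a set of literals (predicate, args), with a leading ~ on the predicate name for negated body literals, and variables are uppercase strings; the exhaustive enumeration reflects the NP-completeness noted a few slides further):

```python
from itertools import product

def apply_sub(lit, theta):
    pred, args = lit
    return (pred, tuple(theta.get(a, a) for a in args))

def terms_of(clause):
    return {a for _, args in clause for a in args}

def theta_subsumes(c1, c2):
    # try every mapping of c1's variables to terms occurring in c2
    vs = [t for t in terms_of(c1) if t[:1].isupper()]
    candidates = list(terms_of(c2))
    for values in product(candidates, repeat=len(vs)):
        theta = dict(zip(vs, values))
        if all(apply_sub(lit, theta) in c2 for lit in c1):
            return True
    return False
```

On the example of the next slide, c1 θ-subsumes both c2 and c3, while c2 and c3 are incomparable.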

47
  • Example
  • c1 = father(X,Y) :- parent(X,Y)
  • c2 = father(X,Y) :- parent(X,Y), male(X)
  • for θ = {} (the empty substitution): c1θ ⊆ c2 ⟹ c1 θ-subsumes c2
  • c3 = father(luc,Y) :- parent(luc,Y)
  • for θ = {X/luc}: c1θ = c3 ⟹ c1 θ-subsumes c3
  • c2 and c3 do not θ-subsume one another

48
  • Given facts for parent, male, female, ...
  • so-called background knowledge B
  • a clause produces a set of father facts:
  • the answer substitutions for X,Y when its body is
    considered as a query
  • or the facts occurring in the minimal model of B ∪ {clause}
  • this set = the extensional definition of the concept father

49
  • Property
  • If
  • c1 and c2 are definite Horn clauses
  • c1 ≤θ c2
  • Then
  • facts produced by c2 ⊆ facts produced by c1
  • (Easy to see from the definition of θ-subsumption)

50
  • Similarity with propositional refinement
  • IF Sky=sunny THEN EnjoySport=yes
  • To specialise: add 1 condition
  • IF Sky=sunny AND Humidity=low THEN
    EnjoySport=yes
  • ...

51
  • In first order logic
  • c1 = father(X,Y) :- parent(X,Y)
  • To specialize: find clauses θ-subsumed by c1
  • father(X,Y) :- parent(X,Y), male(X)
  • father(luc,X) :- parent(luc,X)
  • add literals or instantiate variables

52
  • Another (slightly more complicated) example
  • c1 = p(X,Y) :- q(X,Y)
  • c2 = p(X,Y) :- q(X,Y), q(Y,X)
  • c3 = p(Z,Z) :- q(Z,Z)
  • c4 = p(a,a) :- q(a,a)
  • Which clauses are θ-subsumed by which?

53
  • Properties of θ-subsumption
  • Sound:
  • if c1 θ-subsumes c2 then c1 ⊨ c2
  • Incomplete: possibly c1 ⊨ c2 without c1
    θ-subsuming c2 (but only for recursive clauses)
  • c1 = p(f(X)) :- p(X)
  • c2 = p(f(f(X))) :- p(X)
  • Hence θ-subsumption approximates entailment but
    is not the same

54
  • Checking whether c1 θ-subsumes c2 is decidable
    but NP-complete
  • Transitive and reflexive, not anti-symmetric
  • a "semi-order" relation
  • e.g.
  • f(X,Y) :- g(X,Y), g(X,Z)
  • f(X,Y) :- g(X,Y)
  • both θ-subsume one another

55
  • A semi-order generates equivalence classes and a
    partial order on those equivalence classes
  • equivalence class: c1 ≡ c2 iff c1 ≤θ c2 and c2 ≤θ c1
  • c1 and c2 are then called syntactic variants
  • c1 is the reduced clause of c2 iff c1 contains a
    minimal subset of the literals of c2 that is still
    equivalent with c2
  • each equivalence class is represented by its reduced
    clause

56
  • If c1 and c2 are in different equivalence classes,
    either c1 ≤θ c2 or c2 ≤θ c1 or neither ⟹
    anti-symmetry ⟹ partial order
  • Thus, reduced clauses are partially ordered
  • they form a lattice
  • properties of this lattice?

57
(Figure: equivalence classes of syntactic variants, ordered by
generality; the leftmost clause of each class is its reduced clause)

p(X,Y) :- m(X,Y)  ≡  p(X,Y) :- m(X,Y), m(X,Z)  ≡  p(X,Y) :- m(X,Y), m(X,Z), m(X,U)  ≡  ...        (lgg)

p(X,Y) :- m(X,Y), r(X)  ≡  p(X,Y) :- m(X,Y), m(X,Z), r(X)  ≡  ...
p(X,Y) :- m(X,Y), s(X)  ≡  p(X,Y) :- m(X,Y), m(X,Z), s(X)  ≡  ...

p(X,Y) :- m(X,Y), s(X), r(X)  ≡  p(X,Y) :- m(X,Y), m(X,Z), s(X), r(X)  ≡  ...                     (glb)
58
  • The least upper bound / greatest lower bound of two
    clauses always exists and is unique
  • Infinite chains c1 ≤θ c2 ≤θ c3 ≤θ ... ≤θ c exist:
  • h(X) :- p(X,Y)
  • h(X) :- p(X,X2), p(X2,Y)
  • h(X) :- p(X,X2), p(X2,X3), p(X3,Y)
  • ...
  • h(X) :- p(X,X)

59
  • Looking for a good hypothesis = traversing this
    lattice
  • can be done top-down, using a specialization
    operator
  • or bottom-up, using a generalization operator

60
(Figure: the lattice between top and bottom is traversed by
heuristics-based searches (greedy, beam, exhaustive) looking for the
versionspace VS.)
61
Specialisation operators
  • Shapiro: general-to-specific traversal using a
    refinement operator ρ
  • ρ(c) yields a set of refinements of c
  • theory: ρ(c) = { c' | c' is a maximally general
    specialisation of c }
  • practice: ρ(c) = { c ∪ {l} | l is a literal } ∪
    { cθ | θ is a substitution }

62
(Figure: part of the refinement graph below daughter(X,Y))

daughter(X,Y)
daughter(X,X)
daughter(X,Y) :- parent(X,Z)
......
daughter(X,Y) :- parent(Y,X)
daughter(X,Y) :- female(X)
...
daughter(X,Y) :- female(X), female(Y)
daughter(X,Y) :- female(X), parent(Y,X)
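A refinement operator in this spirit can be sketched as follows. This is an illustration under assumptions of mine: a clause is (head, body-list) of (predicate, args) literals, variables are uppercase strings, and ρ either unifies two variables or adds one body literal over the existing variables (no new variables, unlike a full operator):

```python
from itertools import product

def refinements(clause, vocabulary):
    head, body = clause
    out = []
    vs = sorted({a for _, args in [head] + body for a in args
                 if a[:1].isupper()})
    # (a) apply a substitution unifying two variables, e.g. {Y/X}
    for i in range(len(vs)):
        for j in range(i + 1, len(vs)):
            theta = {vs[j]: vs[i]}
            def sub(lit, theta=theta):
                return (lit[0], tuple(theta.get(a, a) for a in lit[1]))
            out.append((sub(head), [sub(l) for l in body]))
    # (b) add one literal over the clause's variables
    for pred, arity in vocabulary:
        for args in product(vs, repeat=arity):
            lit = (pred, args)
            if lit not in body:
                out.append((head, body + [lit]))
    return out
```

Applied to daughter(X,Y) with vocabulary {parent/2, female/1}, this reproduces the refinements in the figure, e.g. daughter(X,X) and daughter(X,Y) :- female(X).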
63
  • How to traverse the hypothesis space so that
  • no hypotheses are generated more than once?
  • no hypotheses are skipped?
  • ⟹ Many properties of refinement operators have
    been studied in detail

64
  • Some properties
  • globally complete: each point in the lattice is
    reachable from top
  • locally complete: each point directly below c is
    in ρ(c) (useful for greedy systems)
  • optimal: no point in the lattice is reached twice
    (useful for exhaustive systems)
  • minimal, proper, ...

65
A generalisation operator
  • For bottom-up search
  • We discuss one generalisation operator: Plotkin's
    lgg
  • It starts from 2 clauses and computes their least
    general generalisation (lgg)
  • i.e., given 2 clauses, return the most specific
    single clause that is more general than both of
    them

66
  • Definition of lgg of terms
  • (let si, tj denote any term, V a variable)
  • lgg(f(s1,...,sn), f(t1,...,tn)) =
    f(lgg(s1,t1),...,lgg(sn,tn))
  • lgg(f(s1,...,sn), g(t1,...,tn)) = V
  • e.g. lgg(a,b) = X;  lgg(f(X),g(Y)) = Z;
    lgg(f(a,b,a),f(c,c,c)) = f(X,Y,X)

67
  • lgg of literals
  • lgg(p(s1,...,sn), p(t1,...,tn)) =
    p(lgg(s1,t1),...,lgg(sn,tn))
  • lgg(¬p(...), ¬p(...)) = ¬lgg(p(...),p(...))
  • lgg(p(s1,...,sn), q(t1,...,tn)) is undefined
  • lgg(p(...), ¬p(...)) and lgg(¬p(...), p(...)) are
    undefined

68
  • lgg of clauses
  • lgg(c1,c2) = { lgg(l1,l2) | l1 ∈ c1, l2 ∈ c2 and
    lgg(l1,l2) is defined }
  • Example
  • f(t,a) :- p(t,a), m(t), f(a)
  • f(j,p) :- p(j,p), m(j), m(p)
  • lgg: f(X,Y) :- p(X,Y), m(X), m(Z)
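The three levels of lgg above can be sketched together in Python. The encoding is mine: a compound term is (functor, args), anything else is a constant or variable, and a literal is (predicate, args) with a leading ~ on the predicate name for negated body literals; a shared table maps each pair of differing subterms to one fresh variable, as the definition requires.

```python
def lgg_terms(s, t, memo, counter):
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s[1]) == len(t[1])):
        return (s[0], tuple(lgg_terms(a, b, memo, counter)
                            for a, b in zip(s[1], t[1])))
    if s == t:
        return s
    if (s, t) not in memo:            # same pair -> same fresh variable,
        counter[0] += 1               # so lgg(f(a,b,a),f(c,c,c)) = f(X,Y,X)
        memo[(s, t)] = 'V%d' % counter[0]
    return memo[(s, t)]

def lgg_lits(l1, l2, memo, counter):
    if l1[0] != l2[0] or len(l1[1]) != len(l2[1]):
        return None                   # undefined: different predicate or sign
    return (l1[0], tuple(lgg_terms(a, b, memo, counter)
                         for a, b in zip(l1[1], l2[1])))

def lgg_clauses(c1, c2):
    # pair up every two compatible literals, sharing one variable table
    memo, counter = {}, [0]
    out = set()
    for l1 in c1:
        for l2 in c2:
            g = lgg_lits(l1, l2, memo, counter)
            if g is not None:
                out.add(g)
    return out
```

On the slide's example the clause lgg yields four literals, with the head and the parent literal sharing the same variable pair, as in f(X,Y) :- p(X,Y), m(X), m(Z).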

69
  • Relative lgg (rlgg) (Plotkin, 1971)
  • relative to a "background theory" B (assume B is a
    set of facts)
  • rlgg(e1,e2) = lgg(e1 :- B, e2 :- B)
  • method to compute:
  • change the facts into clauses with body B
  • compute the lgg of the clauses
  • remove B, reduce

70
Example: Bongard problems
  • Bongard: a Russian scientist who studied pattern
    recognition
  • Given some pictures, find patterns in them
  • Simplified versions of Bongard problems are used as
    benchmarks in ILP

71
Examples labelled neg
Examples labelled pos
72
  • Example: 2 simple Bongard problems; find the least
    general clause that would predict both to be
    positive

pos(1). pos(2).
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3).
points(o1,down). points(o3,down). circle(o2).
73
  • Method 1: represent each example by a clause and
    compute the lgg of the examples

pos(1) :- contains(1,o1), contains(1,o2), triangle(o1),
          points(o1,down), circle(o2).
pos(2) :- contains(2,o3), triangle(o3), points(o3,down).

lgg( (pos(1) :- contains(1,o1), contains(1,o2), triangle(o1),
               points(o1,down), circle(o2)),
     (pos(2) :- contains(2,o3), triangle(o3), points(o3,down)) )
= pos(X) :- contains(X,Y), triangle(Y), points(Y,down)
74
  • Method 2: represent the class of each example by a
    fact, other properties in the background, and
    compute the rlgg

Examples:
pos(1). pos(2).

Background:
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3).
points(o1,down). points(o3,down). circle(o2).

rlgg(pos(1), pos(2)) = ? (exercise)
75
  • The θ-subsumption ordering is used by many ILP systems
  • top down, using refinement operators (many
    systems)
  • bottom up, using rlgg (e.g., the Golem system,
    Muggleton & Feng)

76
  • Note: inverting implication
  • Given the incompleteness of θ-subsumption, could
    we invert implication instead?
  • Some problems:
  • the lgg under implication is not unique; e.g., the lgg of
    p(f(f(f(X)))) :- p(X) and p(f(f(X))) :- p(X) can be
    p(f(X)) :- p(X) or p(f(f(X))) :- p(Y)
  • computationally expensive

77
2) Inverting resolution
  • Resolution rule for deduction

Propositional:

p ∨ ¬q     q ∨ r              p ← q     q ← s
----------------              ---------------
      p ∨ r                         p ← s

First order:

p(X) ∨ ¬q(X)    q(X) ∨ ¬r(X,Y)        p(a) ∨ ¬q(b)    q(X) ∨ ¬r(X,Y)    {X/b}
------------------------------        ------------------------------
        p(X) ∨ ¬r(X,Y)                        p(a) ∨ ¬r(b,Y)
78
Inverting resolution
  • General resolution rule
  • (two opposite literals, up to a substitution: liθ1 = ¬kjθ2)

l1 ∨ ... ∨ li ∨ ... ∨ ln        k1 ∨ ... ∨ kj ∨ ... ∨ km
--------------------------------------------------------
(l1 ∨ ... ∨ li-1 ∨ li+1 ∨ ... ∨ ln ∨ k1 ∨ ... ∨ kj-1 ∨ kj+1 ∨ ... ∨ km)θ1θ2

e.g., p(X) :- q(X) and q(X) :- r(X,Y) yield p(X) :- r(X,Y);
      p(X) :- q(X) and q(a) yield p(a).
79
  • Resolution implements ⊢ for sets of clauses
  • cf. θ-subsumption for single clauses
  • Inverting it allows one to generalize a clausal
    theory
  • Inverse resolution is much more difficult than
    resolution itself
  • different operators have been defined
  • no unique results

80
Inverse resolution operators
  • Some operators related to inverse resolution
  • (A and B are conjunctions of literals)
  • absorption:
  • from q :- A and p :- A,B
  • infer p :- q,B
  • identification:
  • from p :- q,B and p :- A,B
  • infer q :- A
81
  • Intra-construction:
  • from p :- A,B and p :- A,C
  • infer q :- B and p :- A,q and q :- C
  • Inter-construction:
  • from p :- A,B and q :- A,C
  • infer p :- r,B and r :- A and q :- r,C
82
  • With intra- and inter-construction, new
    predicates are invented
  • E.g., apply intra-construction on
  • grandparent(X,Y) :- father(X,Z), father(Z,Y)
  • grandparent(X,Y) :- father(X,Z), mother(Z,Y)
  • What predicate is invented?
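Absorption and intra-construction can be sketched on a propositional-style representation (an assumption of mine: a clause is (head, set-of-body-atoms) with atoms as plain strings, so variable bindings are not handled):

```python
def absorption(c1, c2):
    # from c1 = (q :- A) and c2 = (p :- A ∪ B), infer p :- {q} ∪ B
    q, a = c1
    p, body = c2
    if a <= body:
        return (p, (body - a) | {q})
    return None

def intra_construction(c1, c2, new):
    # from p :- A,B and p :- A,C (A = the shared body literals),
    # invent predicate `new`: p :- A,new ; new :- B ; new :- C
    p, b1 = c1
    p2, b2 = c2
    assert p == p2
    a = b1 & b2
    return (p, a | {new}), (new, b1 - a), (new, b2 - a)
```

On the grandparent example above (treating each bound literal as an atom), the invented predicate is exactly a parent/2 definition: new :- father(Z,Y) and new :- mother(Z,Y).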

83
Example: inverse resolution

f(X,Y) :- p(X,Y), m(X)        m(j)
----------------------------------
     f(j,Y) :- p(j,Y)         p(j,m)
     -------------------------------
                f(j,m)

(read bottom-up to invert: from f(j,m) and p(j,m), induce
f(j,Y) :- p(j,Y); with m(j), induce f(X,Y) :- p(X,Y), m(X))
84
grandparent(X,Y) :- father(X,Z), parent(Z,Y)     father(X,Y) :- male(X), parent(X,Y)
------------------------------------------------------------------------------------
grandparent(X,Y) :- male(X), parent(X,Z), parent(Z,Y)     male(jef)
-------------------------------------------------------------------
grandparent(jef,Y) :- parent(jef,Z), parent(Z,Y)     parent(jef,an)
-------------------------------------------------------------------
grandparent(jef,Y) :- parent(an,Y)     parent(an,paul)
------------------------------------------------------
grandparent(jef,paul)
85
  • Properties of inverse resolution
  • + in principle very powerful
  • − gives rise to a huge search space
  • − the result of inverse resolution is not unique
  • e.g., father(j,p) :- male(j) and parent(j,p) yields
    father(j,p) :- male(j), parent(j,p) or
    father(X,Y) :- male(X), parent(X,Y) or ...
  • CIGOL approach (Muggleton & Buntine)

86
  • We now have some basic operators
  • θ-subsumption-based, at the single clause level:
  • specialization operator ρ
  • generalization operator: lgg of 2 clauses
  • inverse resolution: generalizes a set of clauses
  • These can be used to build ILP systems
  • top-down, using specialization operators
  • bottom-up, using generalization operators

87
Representations
  • 2 main paradigms for learning in ILP
  • learning from interpretations
  • learning from entailment
  • Related to representation of examples
  • Cf. Bongard examples we saw before

88
Learning from entailment
  • 1 example = a fact e (or a clause e :- B)
  • Goal:
  • Given examples ⟨E+, E−⟩,
  • Find a theory H such that
  • ∀e+ ∈ E+: B ∪ H ⊨ e+
  • ∀e− ∈ E−: B ∪ H ⊭ e−

89
Examples:
pos(1). pos(2).  − pos(3).

Background:
contains(1,o1). contains(1,o2). contains(2,o3).
triangle(o1). triangle(o3). points(o1,down). points(o3,down).
circle(o2). contains(3,o4). circle(o4).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y,down).
90
Learning from interpretations
  • Example = an interpretation (set of facts) e
  • contains a full description of the example
  • all information that intuitively belongs to the
    example is represented in the example, not in the
    background knowledge
  • Background = domain knowledge
  • general information concerning the domain, not
    concerning specific examples

91
Examples:
pos(1) :- contains(1,o1), contains(1,o2), triangle(o1),
          points(o1,down), circle(o2).
pos(2) :- contains(2,o3), triangle(o3), points(o3,down).
:- pos(3), contains(3,o4), circle(o4).

Background:
polygon(X) :- triangle(X).
polygon(X) :- square(X).

Hypothesis:
pos(X) :- contains(X,Y), triangle(Y), points(Y,down).
92
Closed World Assumption made inside interpretations

Examples:
pos: { contains(o1), contains(o2), triangle(o1), points(o1,down), circle(o2) }
pos: { contains(o3), triangle(o3), points(o3,down) }
neg: { contains(o4), circle(o4) }

Background:
polygon(X) :- triangle(X).
polygon(X) :- square(X).

Constraint on pos:
∃Y: contains(Y), triangle(Y), points(Y,down).
93
  • Note: when learning from interpretations
  • we can dispose of the example identifier
  • but can also use the standard format
  • the CWA is made for the example description
  • i.e., the example description is assumed to be
    complete
  • the class of an example is related to information inside
    the example + background information, NOT to
    information in other examples

94
  • Because of the 3rd property, more limited than
    learning from entailment
  • cannot learn relations between different
    examples, nor recursive clauses
  • but also more efficient
  • because of the 2nd and 3rd properties
  • positive PAC-learnability results (De Raedt and
    Džeroski, 1994, AIJ), vs. negative results for
    learning from entailment

95
Algorithms
96
Rule induction
  • Most inductive logic programming systems induce a
    concept definition in the form of a set of definite
    Horn clauses (a Prolog program)
  • Many algorithms are similar to propositional
    algorithms for learning rule sets
  • FOIL ↔ CN2
  • Progol ↔ AQ

97
FOIL (Quinlan)
  • Learns a single concept, e.g., p(X,Y) :- ...
  • To learn one clause (hill-climbing search):
  • start with the most general clause p(X,Y) :- true
  • repeat
  • add the best literal to the clause (i.e., the literal
    that most improves the quality of the clause)
  • a new literal can also be a unification X=c or X=Y
  • = applying a refinement operator under
    θ-subsumption
  • until no further improvement
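The hill-climbing loop can be sketched on the father example of the next slides. This is a toy illustration, not FOIL itself: candidate body literals are modelled as boolean tests on the (X,Y) pair, and instead of FOIL's information-gain heuristic we simply prefer the literal keeping the most positives and fewest negatives.

```python
parent = {('homer', 'bart'), ('marge', 'bart'),
          ('bill', 'chelsea'), ('hillary', 'chelsea')}
male = {'homer', 'bart', 'bill'}
female = {'chelsea', 'marge'}

pos = [('homer', 'bart'), ('bill', 'chelsea')]
neg = [('marge', 'bart'), ('hillary', 'chelsea'), ('bart', 'chelsea')]

# candidate body literals as boolean tests on the (X,Y) pair
candidates = {
    'parent(X,Y)': lambda x, y: (x, y) in parent,
    'parent(Y,X)': lambda x, y: (y, x) in parent,
    'male(X)':     lambda x, y: x in male,
    'male(Y)':     lambda x, y: y in male,
    'female(X)':   lambda x, y: x in female,
    'female(Y)':   lambda x, y: y in female,
}

def learn_clause(pos, neg):
    body = []
    while neg:   # keep specialising until no negatives are covered
        name, test = max(candidates.items(),
                         key=lambda kv: (sum(kv[1](*e) for e in pos),
                                         -sum(kv[1](*e) for e in neg)))
        body.append(name)
        pos = [e for e in pos if test(*e)]
        neg = [e for e in neg if test(*e)]
    return body, pos

body, covered = learn_clause(pos, neg)
```

On this data the greedy search first picks male(X) (2+, 1−) and then parent(X,Y), ending with father(X,Y) :- male(X), parent(X,Y) covering 2+, 0−.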

98
Example
father(homer,bart). father(bill,chelsea).
− father(marge,bart). − father(hillary,chelsea). − father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).
99
father(homer,bart). father(bill,chelsea).
− father(marge,bart). − father(hillary,chelsea). − father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

father(X,Y) :- parent(X,Y).    father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).        father(X,Y) :- male(Y).
father(X,Y) :- female(X).      father(X,Y) :- female(Y).

(coverage: 2+, 2−)
100
father(homer,bart). father(bill,chelsea).
− father(marge,bart). − father(hillary,chelsea). − father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

father(X,Y) :- parent(X,Y).    father(X,Y) :- parent(Y,X).
father(X,Y) :- male(X).        father(X,Y) :- male(Y).
father(X,Y) :- female(X).      father(X,Y) :- female(Y).

(coverage: 2+, 1−)
101
father(homer,bart). father(bill,chelsea).
− father(marge,bart). − father(hillary,chelsea). − father(bart,chelsea).
parent(homer,bart). parent(marge,bart). parent(bill,chelsea). parent(hillary,chelsea).
male(homer). male(bart). male(bill). female(chelsea). female(marge).

father(X,Y) :- male(X).
father(X,Y) :- male(X), parent(X,Y).    father(X,Y) :- male(X), parent(Y,X).
father(X,Y) :- male(X), male(Y).        father(X,Y) :- male(X), female(X).
father(X,Y) :- male(X), female(Y).

(coverage: 2+, 0−)
102
Learning multiple clauses the Covering approach
  • To learn multiple clauses
  • repeat
  • learn a single clause c (see previous algorithm)
  • add c to h
  • mark positive examples covered by c as covered
  • until
  • all positive examples marked covered
  • or no more good clauses found
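The covering loop above can be sketched generically (the learn_clause interface is a hypothetical stand-in: any single-clause learner returning a clause plus the positives it covers fits):

```python
def covering(pos, neg, learn_clause):
    # learn_clause(pos, neg) -> (clause, positives it covers)
    h = []
    remaining = list(pos)
    while remaining:
        clause, covered = learn_clause(remaining, neg)
        if not covered:          # no more good clauses found
            break
        h.append(clause)
        # mark covered positives: drop them from the remaining set
        remaining = [e for e in remaining if e not in covered]
    return h

# a stand-in single-clause learner, for demonstration only
def toy_learner(pos, neg):
    evens = [e for e in pos if e % 2 == 0]
    if evens:
        return 'covers-evens', evens
    return 'covers-odds', list(pos)
```

With the toy learner and examples [1, 2, 3, 4], two clauses are learned, each covering the positives left over by the previous one.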

103
likes(garfield, lasagne). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).

likes(garfield, X) :- edible(X).        (covers 3+, 0−)
104
likes(garfield, lasagne). likes(garfield, birds). likes(garfield, meat).
likes(garfield, jon). likes(garfield, odie).
(italics = previously covered)

likes(garfield, X) :- edible(X).
likes(garfield, X) :- subject_to_cruelty(X).        (covers 2+, 0−)
105
Some pitfalls
  • Avoiding infinite recursion
  • when recursive clauses are allowed, e.g.,
    ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
  • avoid learning parent(X,Y) :- parent(X,Y)
  • won't be useful, even though it's 100% correct
  • Bonus for introduction of new variables
  • a literal may not yield any direct gain, but may
    introduce variables that are useful later

p(X) :- q(X)              p positives, n negatives covered
refine by adding age:
p(X) :- q(X), age(X,Y)    p positives, n negatives covered ⟹ no gain
106
Golem (Muggleton & Feng)
  • Based on the rlgg operator
  • To build one clause:
  • Look at 2 positive examples, find their rlgg,
    generalize using yet another example, ... until no
    improvement in the quality of the clause
  • bottom-up search
  • Result is very dependent on the choice of examples
  • e.g. what if the true theory is { p(X) :- q(X),
    p(X) :- r(X) }?

107
  • Try this for different couples, pick the best clause
    found
  • this reduces the dependency on the choice of couple (if 1
    of them is noisy, no good clause is found)
  • Remove covered positive examples, restart the process
  • Repeat until no more good clauses are found

108
  • 1 limitation of Golem: extensional coverage tests
  • only extensional background knowledge
  • may go wrong when learning recursive clauses

examples:    p(0). p(1). p(2).  − p(4).
background:  s(0,1). s(1,2). s(2,3). s(3,4).
induces:     p(Y) :- s(X,Y), p(X).

(extensional coverage test: an example e is checked by running e as a
query against B plus the given examples; but the induced recursive
clause, run as a program, also derives the negative p(4))
109
Progol (Muggleton)
  • Top-down approach, but with a seed
  • To find one clause:
  • Start with 1 positive example e
  • Generate a hypothesis space He that contains only
    hypotheses that cover at least this one example
  • first generate the most specific clause c that covers
    e
  • He contains every clause more general than c
  • Perform an exhaustive top-down search in He, looking
    for the clause that maximizes compaction

110
  • Compaction = size(covered examples) −
    size(clause)
  • Repeat the process of finding one clause until no
    more good (= compaction-causing) clauses are found
  • The compaction heuristic in principle allows no
    coverage of negatives
  • can be relaxed (accommodating noise)

111
Generation of the bottom clause
  • Language bias: the set of all acceptable clauses
    (chosen by the user)
  • a specification of H (on the level of single clauses)
  • Bottom clause ⊥ for example e: the most specific
    clause in the language bias covering e
  • Constructed using inverse entailment

112
  • Construction of ⊥
  • if B ∧ H ⊨ e, then B ∧ ¬e ⊨ ¬H
  • if H is a clause, ¬H is a conjunction of ground
    (skolemized) literals
  • compute ¬⊥ = all ground literals entailed by B ∧
    ¬e
  • ¬H must be a subset of these
  • so B ∧ ¬e ⊨ ¬⊥ ⊨ ¬H
  • hence H ⊨ ⊥
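The derivation can be sketched on the tweety example of the next slide, specialised to a chain of unary rules (an assumption to keep the sketch small): from ¬e and each implication body → head, modus tollens yields the negated body atoms, and negating the collected conjunction gives ⊥ as a disjunction.

```python
# background: head :- body pairs over unary predicates
B = [('hasbeak', 'bird'), ('bird', 'vulture')]

def bottom_clause(e_atom, B):
    pred, arg = e_atom
    false_preds = {pred}               # from ¬e
    changed = True
    while changed:
        changed = False
        for head, body in B:
            # modus tollens: body -> head and ¬head(arg) give ¬body(arg)
            if head in false_preds and body not in false_preds:
                false_preds.add(body)
                changed = True
    # negating the conjunction of derived literals gives the disjunction ⊥
    return {(p, arg) for p in false_preds}

bottom = bottom_clause(('hasbeak', 'tweety'), B)
```

For e = hasbeak(tweety) this yields ⊥ = hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).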

113
  • Some examples (cf. Muggleton, NGC 1995)

B:  anim(X) :- pet(X).   pet(X) :- dog(X).
e:  nice(X) :- dog(X).
⊥:  nice(X) :- dog(X), pet(X), anim(X).

B:  hasbeak(X) :- bird(X).   bird(X) :- vulture(X).
e:  hasbeak(tweety).
⊥:  hasbeak(tweety) ∨ bird(tweety) ∨ vulture(tweety).
114
  • Example of (part of) a Progol run
  • learn to classify animals as mammals, reptiles,
    ...

:- generalise(class/2)?
Generalising class(dog,mammal).
Most specific clause is
  class(A,mammal) :- has_milk(A), has_covering(A,hair),
     has_legs(A,4), homeothermic(A), habitat(A,land).
C:-28,4,10,0  class(A,mammal).
C:8,4,0,0     class(A,mammal) :- has_milk(A).
C:5,3,0,0     class(A,mammal) :- has_covering(A,hair).
C:-4,4,3,0    class(A,mammal) :- homeothermic(A).
4 explored search nodes
f=8, p=4, n=0, h=0
Result of search is
  class(A,mammal) :- has_milk(A).
115
  • Exhaustive search: important to constrain the size
    of the hypothesis space
  • Strong language bias:
  • specify which predicates are to be used in the head or
    body of a clause
  • specify types and modes of predicates
  • e.g., allow age(X,Y), Y<18
  • but not habitat(X,Y), Y<18

116
  • E.g., for the "animals" example

:- modeh(1,class(+animal,#class))?            % put this in the head;
                                              %   +animal: variable of type "animal"
:- modeb(1,has_milk(+animal))?                % put this in the body; recall 1:
:- modeb(1,has_gills(+animal))?               %   only one literal of this kind needed
:- modeb(1,has_covering(+animal,#covering))?  % #covering: constant of type "covering"
:- modeb(1,has_legs(+animal,#nat))?
:- modeb(1,homeothermic(+animal))?
:- modeb(1,has_eggs(+animal))?
:- modeb(*,habitat(+animal,#habitat))?        % recall *: there can be any number of habitats
117
Other approaches
  • The algorithms we have seen so far are rule-based
    algorithms
  • they induce a theory in the form of a set of rules
    (definite Horn clauses)
  • they induce rules one by one
  • Quite normal, given that logic programs are
    essentially sets of rules

118
  • Still, induction of rule sets is only one type of
    machine learning
  • The difference between ILP and propositional
    approaches is mainly in the representation
  • It is possible to define other learning techniques and
    tasks in ILP: induction of constraints, induction
    of decision trees, Bayesian learning, ...

119
Claudien (De Raedt & Bruynooghe)
  • "Clausal Discovery Engine"
  • Discovers patterns that hold in a set of data
  • patterns represented as clauses (not
    necessarily Horn clauses)
  • I.e., finds patterns of a more general kind than
    predictive rules
  • also called descriptive induction

120
  • Given a hypothesis space
  • performs an exhaustive top-down search through
    the space
  • returns all clauses that
  • hold in the data set
  • are not implied by other clauses found
  • Strong language bias: a precise syntactical
    description of the acceptable clauses

121
  • Example language bias:

{parent(X,Y), father(X,Y), mother(X,Y)} :-
   {parent(X,Y), father(X,Y), mother(X,Y),
    male(X), male(Y), female(X), female(Y)}

  • May result in the following clauses being discovered:

parent(X,Y) :- father(X,Y).
parent(X,Y) :- mother(X,Y).
:- father(X,Y), mother(X,Y).
:- male(X), female(X).
mother(X,Y) :- parent(X,Y), female(X).
...
122
Claudien algorithm
  • S := ∅
  • Q := {false :- true}
  • while Q not empty
  • pick first clause c from Q
  • for all (h ← b) in ρ(c)
  • if query (b ∧ ¬h) fails (i.e., clause is true in
    data)
  • then
  • if (h ← b) not entailed by clauses in S then add
    (h ← b) to S
  • else add (h ← b) to Q

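The loop above can be mimicked on a propositional toy problem — a minimal sketch with hypothetical data, where the entailment check is approximated by subsumption pruning (skip any clause whose body extends that of an already-found clause with the same head) and refinement just adds an atom to the body:

```python
from itertools import combinations

# Propositional stand-in for Claudien's clausal discovery
# (hypothetical toy data; the real system searches first-order
# clauses under a declarative language bias).
rows = [
    {"female": False, "male": True,  "parent": True},
    {"female": False, "male": True,  "parent": False},
    {"female": True,  "male": False, "parent": True},
]
atoms = sorted(rows[0])

def clause_holds(head, body):
    # "head :- body" is true in the data if every row satisfying
    # the whole body also satisfies the head
    return all(r[head] for r in rows if all(r[b] for b in body))

found = []
for head in atoms:
    for n in range(len(atoms)):
        for body in combinations([a for a in atoms if a != head], n):
            # prune clauses implied by a more general clause already found
            if any(h == head and set(b) <= set(body) for h, b in found):
                continue
            if clause_holds(head, body):
                found.append((head, body))

for h, b in found:
    print(f"{h} :- {', '.join(b) or 'true'}.")
```

On this data the search returns only `parent :- female.`; the redundancy check keeps e.g. `parent :- female, male` out of the answer set, mirroring the "not implied by other clauses found" condition.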
123
ICL (De Raedt and Van Laer)
  • Inductive Constraint Logic
  • First system to learn from interpretations
  • Search for constraints on interpretations
    distinguishing examples of different classes
  • Roughly run Claudien on set of examples E
  • each constraint found will be true for all e,
    but probably false for some e-
  • all constraints together hopefully rule out all e-

124
  • Search for one constraint
  • c := false :- true (initial, maximally specific constraint)
  • repeat until c true for all positives
  • find d in ρ(c) so that d holds for as many
    positives and as few negatives as possible
  • c := d
  • add c to h
  • can also use beam search

125
  • Search for set of constraints on a class
  • h := ∅
  • while there are negatives left to be eliminated
  • find a constraint c
  • add c to h
  • Uses same language bias (DLAB) as recent
    versions of Claudien
  • DLAB is advanced form of original Claudien bias

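The inner and outer loops can be sketched on a propositional stand-in (hypothetical data and conditions; the real ICL refines first-order clauses under the DLAB bias and can use beam search). Here a "constraint" forbids one conjunction of attribute values, and refining it by adding a condition makes it weaker:

```python
# Propositional sketch of ICL's two loops (hypothetical toy data).
# An example satisfies a constraint unless it matches the whole
# forbidden conjunction.
pos = [{"a": 1, "b": 1}, {"a": 1, "b": 0}]   # positive interpretations
neg = [{"a": 0, "b": 0}, {"a": 0, "b": 1}]   # negative interpretations
conds = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]

def satisfied(c, e):
    return not all(e[k] == v for k, v in c)   # empty c forbids everything

def find_constraint(pos, neg):
    c = []                                    # maximally specific constraint
    while not all(satisfied(c, p) for p in pos):
        # greedy refinement: keep most positives, rule out most negatives
        def score(lit):
            d = c + [lit]
            return (sum(satisfied(d, p) for p in pos),
                    -sum(satisfied(d, n) for n in neg))
        c = c + [max((l for l in conds if l not in c), key=score)]
    return c

h = []
remaining = list(neg)
while remaining:                              # covering loop over negatives
    c = find_constraint(pos, remaining)
    h.append(c)
    remaining = [e for e in remaining if satisfied(c, e)]
```

On this data one constraint, "never a = 0", already holds for all positives and rules out both negatives, so the covering loop stops after a single iteration.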
126
  • Example of DLAB bias specification
  • min-max ... means at least min and at most max
    literals from the list are to be put here
  • can be nested
  • allows some nice tricks, e.g.
  • 1-1: [male(X), female(X)]

0-2: [parent(X,Y), father(X,Y), mother(X,Y)] <--
0-len: [parent(X,Y), father(X,Y), mother(X,Y),
        male(X), male(Y), female(X), female(Y)]
127
Warmr (Dehaspe)
  • Induces first order association rules
  • Algorithm similar to APRIORI
  • Finds frequent patterns
  • cf. "frequent item sets" in APRIORI context
  • Pattern conjunction of literals
  • Uses θ-subsumption lattice over hypothesis space
  • Constructs association rules from patterns
  • IF this pattern occurs, THEN that pattern occurs
    too

128
The APRIORI algorithm
  • APRIORI (Agrawal et al.) efficient discovery of
    frequent itemsets and association rules
  • Typical example market basket analysis
  • which things are often bought together?
  • Association rule
  • IF a1, …, an THEN an+1, …, an+m

129
  • Association rules should have at least some
    minimal
  • support: |t(a1 ∧ … ∧ an+m)| / |t(true)|
  • how many people buy all these things together?
  • confidence: |t(a1 ∧ … ∧ an+m)| / |t(a1 ∧ … ∧ an)|
  • how many people of those buying IF-things also
    buy THEN-things?
  • Minimal support and confidence may be low

130
  • APRIORI tailored towards using large data sets
  • efficiency very important
  • minimize data access
  • Works in 2 steps
  • find frequent itemsets
  • compute association rules from them

131
  • Observation
  • if a1 ∧ … ∧ an infrequent (below min. support)
  • then a1 ∧ … ∧ an+1 also infrequent
  • adding a condition can only strengthen the
    conjunction
  • Hence
  • a1, …, an can only be frequent if each subset of
    it is frequent

132
  • Leads to levelwise algorithm
  • first compute frequent singletons
  • then frequent pairs, triples,
  • a lot of pruning possible due to previous
    observation
  • itemset of cardinality n is candidate if each
    subset of it of cardinality n-1 was frequent in
    previous level
  • need to count only candidates

133
Example
[Figure: levelwise lattice over items bread, butter, wine, ham,
cheese, jam — frequent singletons generate candidate pairs
(bread+butter, bread+cheese, bread+jam, butter+cheese, butter+jam,
cheese+jam), which in turn generate candidate triples
(bread+butter+cheese, bread+butter+jam); a triple with an
infrequent pair as subset is not a candidate]
134
Apriori algorithm
min_freq := min_support × freq(true)
d := 0
Qd := {∅}           /* candidates for level 0 */
F := ∅              /* frequent sets */
while Qd ≠ ∅ do
    for all S in Qd do find freq(S)
    Fd := {S in Qd | freq(S) ≥ min_freq}
    F := F ∪ Fd
    compute Qd+1
    d := d + 1
return F
135
Computing candidates
Compute Qd+1 from Fd:
    Qd+1 := ∅
    for each S in Fd do
        for each item x not in S do
            S' := S ∪ {x}
            if ∀i in S': S' \ {i} ∈ Fd
            then add S' to Qd+1
136
  • Step 2 deriving association rules from frequent
    sets
  • if S ∪ {a} ∈ F and freq(S ∪ {a}) / freq(S) > min_confidence
  • then S → a is a valid association rule
  • has sufficient support and confidence

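Both steps can be sketched in a few lines — a minimal Apriori with hypothetical toy transactions, following the levelwise pseudocode and the rule-derivation condition above:

```python
# Minimal Apriori sketch (hypothetical toy transactions).
transactions = [
    {"bread", "butter", "cheese"},
    {"bread", "butter", "jam"},
    {"bread", "cheese"},
    {"wine", "ham"},
]
min_support, min_confidence = 0.5, 0.8
items = sorted(set().union(*transactions))

def freq(S):
    return sum(S <= t for t in transactions) / len(transactions)

# Step 1: levelwise search for frequent itemsets.
frequent = {}
level = {frozenset([i]) for i in items}
while level:
    current = {S for S in level if freq(S) >= min_support}
    frequent.update({S: freq(S) for S in current})
    # candidates for next level: every one-smaller subset must be frequent
    level = {S | {x} for S in current for x in items if x not in S
             if all((S | {x}) - {i} in current for i in S | {x})}

# Step 2: derive association rules S -> a with enough confidence.
rules = set()
for S in frequent:
    for a in S:
        body = S - {a}
        if body and frequent[S] / freq(body) >= min_confidence:
            rules.add((tuple(sorted(body)), a))

print(sorted(rules))
```

On this data the frequent sets are {bread}, {butter}, {cheese}, {bread, butter} and {bread, cheese}; {bread, butter, cheese} is never even counted because its subset {butter, cheese} is infrequent, and the surviving rules are butter → bread and cheese → bread.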
137
Warmr
  • Warmr is first-order version of Apriori
  • Patterns (itemsets) are now conjunctive queries
  • Frequent patterns what to count?
  • examples, of course...
  • Was easy in propositional case
  • 1 example = 1 tuple → count tuples

138
  • In first-order case
  • also easy when learning from interpretations
  • not so clear when learning from implications
  • which implications are examples?
  • indicate this by specifying a key
  • key unique identification of example
  • each pattern contains a set of variables that
    forms the key

139
  • Example
  • assume 100 people in database
  • person(X) X is the key
  • count answer substitutions of X, not Y or Z!
  • person(X), mother(X,Y) 40 examples
  • mother(X,Y), has_pet(Y,Z) 30 examples
  • mother(X,Y) ---> has_pet(Y,Z): support 0.3,
    confidence 0.75

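The key-based counting can be sketched as follows (a hypothetical mini-database, not the 100-person example above): support counts distinct bindings of the key variable X, never of Y or Z.

```python
# Counting by key, as in Warmr: one example = one binding of X.
people = ["ann", "bob", "carl", "dee"]                       # person(X), X is key
mother = {"ann": ["d1", "d2"], "bob": ["d3"], "dee": ["d4"]}  # mother(X, Y)
has_pet = {"d1": ["cat"], "d3": ["dog"]}                      # has_pet(Y, Z)

# keys with an answer for "person(X), mother(X,Y)"
m_keys = {x for x in people if mother.get(x)}
# keys with an answer for "mother(X,Y), has_pet(Y,Z)"
mp_keys = {x for x in people
           if any(has_pet.get(y) for y in mother.get(x, []))}

support = len(mp_keys) / len(people)     # fraction of all examples
confidence = len(mp_keys) / len(m_keys)  # of the mothers, how many qualify
```

Note that ann counts once even though she has two children: multiple answer substitutions for Y and Z do not inflate the count of examples.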
140
  • Remark association rule is NOT a clause
  • mother(X,Y) ---> has_pet(Y,Z) means
  • ∀X: (∃Y: mother(X,Y)) → (∃Y,Z: mother(X,Y) ∧ has_pet(Y,Z))
  • ≠ the clause mother(X,Y) → has_pet(Y,Z)
  • main difference is occurrence of existentially
    quantified variables in conclusion

141
  • Illustrated on Bongard drawings
  • 1 example 1 drawing
  • contains(D,Obj) D is the key
  • Pattern e.g.,
  • contains(D,X), circle(X), in(X,Y), circle(Y)
  • Association rule e.g.,
  • contains(D,X), circle(X), in(X,Y), circle(Y) --->
    contains(D,Z), square(Z)
  • "drawings that contain a circle inside another
    circle usually also contain a square"

142
  • Warmr also useful for feature construction
  • Generally applicable method for improving
    representation of examples
  • Given description of example
  • derive new (propositional) features that describe
    the example
  • add those features to a propositional description
    of the example
  • run a propositional learner

143
  • For Bongard example
  • construct features "contains a circle", "contains
    a circle inside a triangle", ...
  • given the correct features, a propositional
    representation of examples is possible
  • Feature construction with ILP general method
    for applying propositional machine learning
    techniques to structural examples

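The feature-construction step can be sketched as follows (a hypothetical encoding of drawings; a real system would use Warmr's frequent queries as the features). The structural examples become an ordinary attribute-value table a propositional learner can consume:

```python
# Propositionalisation sketch for Bongard-style drawings.
drawings = {
    "d1": {"objects": {"o1": "circle", "o2": "circle"}, "in": [("o1", "o2")]},
    "d2": {"objects": {"o1": "circle", "o2": "square"}, "in": []},
}

# Each feature is a boolean query over one structural example.
features = {
    "contains_circle": lambda d: "circle" in d["objects"].values(),
    "contains_square": lambda d: "square" in d["objects"].values(),
    "circle_in_circle": lambda d: any(d["objects"][a] == "circle"
                                      and d["objects"][b] == "circle"
                                      for a, b in d["in"]),
}

# Flatten: one row of boolean attributes per drawing.
table = {name: {f: test(d) for f, test in features.items()}
         for name, d in drawings.items()}
```

`table` can then be handed to any propositional learner; the structural information survives only through the chosen features, which is why finding good features (e.g. with Warmr) is the crucial step.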
144
Decision tree induction in ILP
  • S-CART (Kramer 1996) upgrade of CART
  • Tilde (Blockeel & De Raedt 98) upgrades C4.5
  • Both induce "first order" or "structural"
    decision trees (FOLDTs)
  • test in node first order literal
  • may result in true or false → binary trees
  • different nodes may share variables
  • "real" test in a node: conjunction of all
    literals on path from root to node
145
Top-down Induction of Decision Trees Algorithm
  • function TDIDT(E: set of examples)
  • T := set of possible tests
  • t := BEST_SPLIT(T, E)
  • E' := partition induced on E by t
  • if STOP_CRIT(E, E') then return leaf(INFO(E))
  • else
  • for all Ei in E' do ti := TDIDT(Ei)
  • return inode(t, {(i, ti)})

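A propositional version of this skeleton, with information gain as BEST_SPLIT and class purity as STOP_CRIT (assumed choices; the helper names mirror the pseudocode, and a first-order system like Tilde would generate tests with a refinement operator instead of a fixed attribute list):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    return -sum(c / len(examples) * math.log2(c / len(examples))
                for c in counts.values())

def tdidt(examples, tests):
    if not examples:
        return None                           # no examples reach this leaf
    classes = {e["class"] for e in examples}
    if len(classes) <= 1 or not tests:        # STOP_CRIT
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    def gain(test):                           # BEST_SPLIT: information gain
        parts = [[e for e in examples if test(e)],
                 [e for e in examples if not test(e)]]
        return entropy(examples) - sum(len(p) / len(examples) * entropy(p)
                                       for p in parts if p)
    name = max(tests, key=lambda k: gain(tests[k]))
    rest = {k: v for k, v in tests.items() if k != name}
    yes = [e for e in examples if tests[name](e)]
    no = [e for e in examples if not tests[name](e)]
    return {name: {True: tdidt(yes, rest), False: tdidt(no, rest)}}

# Toy version of the worn-machine example used later in the slides.
examples = [
    {"worn": False, "irreplaceable": False, "class": "ok"},
    {"worn": True,  "irreplaceable": True,  "class": "sendback"},
    {"worn": True,  "irreplaceable": False, "class": "fix"},
]
tests = {"worn": lambda e: e["worn"],
         "irreplaceable": lambda e: e["irreplaceable"]}
tree = tdidt(examples, tests)
```

The resulting nested dict is the tree worn? / irreplaceable? with leaves ok, sendback and fix.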
146
  • Set of possible tests
  • generated using refinement operator
  • c = conjunction on path from root to node
  • ρ(c) − c = literal(s) to be put in node
  • Other auxiliary functions as in propositional TDIDT
  • best split using e.g. information gain
  • stop_crit e.g. significance test
  • info e.g. most frequent class

147
  • Known from propositional learning
  • induction of decision trees is fast
  • usually yields good results
  • These properties are inherited by Tilde / S-CART
  • New results (not inherited from prop. learning)
    on expressiveness

148
Example FOLDT
worn(X)?
├─ yes: irreplaceable(X)?
│        ├─ yes: sendback
│        └─ no:  fix
└─ no:  ok

(∀x: ¬worn(x)) → ok
(∃x: worn(x) ∧ irreplaceable(x)) → sendback
(∃x ∀y: worn(x) ∧ ¬(worn(y) ∧ irreplaceable(y))) → fix
149
Expressiveness
FOL formula equivalent with tree
(∀x: ¬worn(x)) → ok
(∃x: worn(x) ∧ irreplaceable(x)) → sendback
(∃x ∀y: worn(x) ∧ ¬(worn(y) ∧ irreplaceable(y))) → fix
Logic program equivalent with tree
a :- worn(X).
b :- worn(X), irreplaceable(X).
ok ← ¬a
sendback ← b
fix ← a ∧ ¬b
150
  • Prolog program equivalent with tree, using cuts
    (first order decision list)

sendback :- worn(X), irreplaceable(X), !.
fix :- worn(X), !.
ok.
151
  • FOLDT can be converted to
  • layered logic program
  • containing invented predicates
  • flat Prolog program (using cuts)
  • Can not be converted to flat logic program

152
Expressiveness
[Figure: Venn diagram of expressiveness]
F = Flat logic programs, T = decision Trees, L = decision Lists
  • Difference is specific for first-order case
  • Possible remedies for ILP systems
  • invent auxiliary predicates
  • use both ∀ and ∃
  • induce decision lists

153
Representation with keys
class(e1,fix).      worn(e1,gear). worn(e1,chain).
class(e2,sendback). worn(e2,engine). worn(e2,chain).
class(e3,sendback). worn(e3,control_unit).
class(e4,fix).      worn(e4,chain).
class(e5,keep).

worn(E,X)?
├─ no:  class(E,keep)
└─ yes: not_replaceable(X)?
         ├─ yes: class(E,sendback)
         └─ no:  class(E,fix)

Conversion to Prolog:
replaceable(gear). replaceable(chain).
not_replaceable(engine). not_replaceable(control_unit).
class(E,sendback) :- worn(E,X), not_replaceable(X), !.
class(E,fix) :- worn(E,X), !.
class(E,keep).
154
speed(x,s) ∧ s > 120 ∧ ¬job(x, politician) ∧
¬(∃y: knows(x,y) ∧ job(y, politician)) → fine(x,Y)

speed(X,S), S>120?
├─ no:  N
└─ yes: job(X, politician)?
         ├─ yes: N
         └─ no:  knows(X, Y)?
                  ├─ no:  Y
                  └─ yes: job(Y, politician)?
                           ├─ yes: N
                           └─ no:  Y
155
Other advantages of FOLDTs
  • Both classification and regression possible
  • classification predict class ( learn concept)
  • regression predict numbers
  • important, but not given much attention in ILP
  • Also clustering to some extent
  • clustering group similar examples together

156
Many other approaches and applications of ILP
possible...
  • Combination of ILP and Q-learning
  • RRL ("relational reinforcement learning")
    reinforcement learning in structural domains
  • First-order equivalent of Bayesian networks
  • First-order clustering
  • needs first order distance measures
  • ...

157
Conclusions
  • Many different approaches exist in Machine
    Learning
  • ILP is in a sense diverging
  • from concept learning
  • to other approaches and tasks
  • Still many new approaches to be tried!

158
Applications of ILP
159
Applications Overview
  • User modelling
  • Games
  • Ecology
  • Drug design
  • Natural language
  • Inductive Database Design

160
User Modelling
  • Behavioural cloning
  • build model of user's behaviour
  • simulate user's behaviour by means of model
  • e.g.
  • learning to fly / drive /
  • learning to play music
  • learning to play games (adventure, strategic, )

161
  • Automatic adaptation of system to user
  • detect patterns in user's actions
  • use patterns to try to predict user's next action
  • based on predictions, make life easier for user
  • e.g.
  • mail system (auto-priority, )
  • adaptive web pages
  • intelligent search engines

162
Example Applications
  • Some applications the Leuven group has looked at
  • behavioural cloning
  • learning to play music
  • learning to play games
  • automatic adaptation of system to user
  • adaptive webpages
  • a learning command shell
  • intelligent e-mail interface

163
Learning to Play Music
  • Van Baelen De Raedt, ILP-96
  • Playing music is difficult
  • not just playing the notes
  • but play with feeling
  • adapt volume, speed,
  • Midi files provided to learning system
  • System detects patterns w.r.t. pitch, volume,
    speed,
  • and tries to play music itself

164
  • Why an ILP approach?
  • mainly because of time sequences
  • Results?
  • Compare computer generated MIDI file with human
    generated MIDI file
  • Computer makes similar mistakes as beginning
    player
  • See ILP-96 proc. for details (LNAI 1314)

165
Adaptive Webpages
  • Adaplix project (Jacobs et al., 1997-)
  • Webpage observes actions of user
  • e.g., which links are followed frequently, time
    that is spent on one page,
  • and adapts itself
  • within limitations given by page author
  • change layout of page
  • move links to different places
  • add or remove links

166
  • example site http://adaplix.linux.student.kuleuven.ac.be
  • identify yourself
  • name, gender, occupation (personnel/student)
  • based on this info provides customized web page
  • student project (in Dutch)

167
Intelligent Mailer
  • Visual Elm (Jacobs, 1996)
  • Intelligent mail interface
  • tries to detect which kind of mails are
  • immediately deleted
  • immediately read
  • not deleted, read later
  • forwarded
  • based on this, assigns priorities to new mails

168
  • Predictions
  • priority assigned to new mails
  • expected actions delete, forward,
  • Explanation facility
  • Several options offered to user
  • e.g. set priority threshold, only show mails
    above threshold
  • sort mails according to priority

169
(No Transcript)
170
Learning Shell
  • Jacobs, Dehaspe et al. (1999)
  • Context Unix command shell, e.g., csh
  • Each user has profile file
  • defines configuration for user that makes it
    easier to use the shell
  • usually default profile, unless user changes it
    manually

171
  • Possible to learn profile file?
  • Observe user
  • which commands are often used?
  • which parameters are used with the commands?
  • Automatically construct better profile from
    observations

172
  • Example of input to ILP system

/* background */
command(Id, Command) :-
    isa(OrigCommand, Command),
    command(Id, OrigCommand).
isa(emacs, editor).
isa(vi, editor).

/* observations */
command(1, cd).    attribute(1, 1, tex).
command(2, emacs). switch(2, 1, '-nw').
switch(2, 2, '-q'). attribute(2, 1, 'aaai.tex').
173
  • Detect relationships (association rules) with
    ILP system Warmr
  • Examples of rules output by Warmr

IF command(Id, ls) THEN switch(Id, '-l').
IF recentcommand(Id, cd) AND command(Id, ls)
THEN nextcommand(Id, editor).
174
  • Some (preliminary) experimental results
  • Evaluation criterion predict next action of user
  • Actions logged for 10 users
  • each log about 500 commands
  • 2 experiments
  • learning from all log files together
  • learning from individual log files

175
  • Learning from mixed data
  • predictive accuracy 35% (≈ fmax, relative
    frequency of most popular command)
  • Learning from individual data
  • predictive accuracy 50% (> fmax)
  • Conclusion
  • proposed approach to user modelling in this
    context shows promise

176
Learning to Play Games
  • Strategic games, adventure games,
  • le