1
Advanced Artificial Intelligence, Lecture 3
Learning
  • Bob McKay
  • School of Computer Science and Engineering
  • College of Engineering
  • Seoul National University

2
Outline
  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms

3
References
  • Mitchell, Tom M.: Machine Learning, McGraw-Hill, 1997, ISBN 0-07-115467-1

4
Defining a Learning System (Mitchell)
  • A program is said to learn from experience E
    with respect to some class of tasks T and
    performance measure P, if its performance at
    tasks in T, as measured by P, improves with
    experience E

5
Specifying a Learning System
  • Specifying the task T, the performance P and the
    experience E defines the learning problem.
    Specifying the learning system requires us to
    define
  • Exactly what knowledge is to be learnt
  • How this knowledge is to be represented
  • How this knowledge is to be learnt

6
Specifying What is to be Learnt
  • Usually, the desired knowledge can be represented
    as a target valuation function V: I → D
  • It takes in information about the problem and
    gives back a desired decision
  • Often, it is unrealistic to expect to learn the
    ideal function V
  • All that is required is a good enough
    approximation, V̂: I → D

7
Specifying How Knowledge is to be Represented
  • The function V must be represented symbolically,
    in some language L
  • The language may be a well-known language
  • Boolean expressions
  • Arithmetic functions
  • ...
  • Or for some systems, the language may be defined
    by a grammar

8
Specifying How the Knowledge is to be Learnt
  • If the learning system is to be implemented, we
    must specify an algorithm A, which defines the
    way in which the system is to search the language
    L for an acceptable V
  • That is, we must specify a search algorithm

9
Structure of a Learning System
  • Four modules
  • The Performance System
  • The Critic
  • The Generaliser (or sometimes Specialiser)
  • The Experiment Generator

10
Performance Module
  • This is the system which actually uses the
    function V as we learn it
  • Learning Task
  • Learning to play checkers
  • Performance module
  • System for playing checkers
  • (i.e., it makes the checkers moves)

11
Critic Module
  • The critic module evaluates the performance of
    the current V
  • It produces a set of data from which the system
    can learn further

12
Generaliser/Specialiser Module
  • Takes a set of data and produces a new V for the
    system to run again

13
Experiment Generator
  • Takes the new V
  • Maybe also uses the previous history of the
    system
  • Produces a new experiment for the performance
    system to undertake

14
The Importance of Bias
  • Important theoretical results from learning
    theory (PAC learning) tell us that learning
    without some presuppositions is infeasible.
  • Practical experience, of both machine and human
    learning, confirms this.
  • To learn effectively, we must limit the class of
    Vs.
  • Two approaches are used in machine learning
  • Language bias
  • Search Bias
  • Combined Bias
  • Language and search bias are not mutually
    exclusive; most learning systems feature both

15
Language Bias
  • The language L is restricted so that it cannot
    represent all possible target functions V
  • This is usually on the basis of some knowledge we
    have about the likely form of V
  • It introduces risk
  • Our system will fail if L does not contain an
    acceptable V

16
Search Bias
  • The order in which the system searches L is
    controlled, so that promising areas for V are
    searched first

17
The Downside: No Free Lunches
  • Wolpert and Macready's No Free Lunch Theorem
    states, in effect, that averaged over all
    problems, all biases are equally good (or bad).
  • Conventional view
  • The choice of a learning system cannot be
    universal
  • It must be matched to the problem being solved
  • In most systems, the bias is not explicit
  • The ability to identify the language and search
    biases of a particular system is an important
    aspect of machine learning
  • Some more recent systems permit the explicit and
    flexible specification of both language and
    search biases

18
No Free Lunch: Does it Matter?
  • Alternative view
  • We aren't interested in all problems
  • We are only interested in problems which have
    solutions of less than some bounded complexity
  • (so that we can understand the solutions)
  • The No Free Lunch Theorem may not apply in this
    case

19
Some Dimensions of Learning
  • Induction vs Discovery
  • Guided learning vs learning from raw data
  • Learning How vs Learning That (vs Learning a
    Better That)
  • Stochastic vs Deterministic Symbolic vs
    Subsymbolic
  • Clean vs Noisy Data
  • Discrete vs continuous variables
  • Attribute vs Relational Learning
  • The Importance of Background Knowledge

20
Induction vs Discovery
  • Has the target concept been previously
    identified?
  • Pearson cloud classifications from satellite
    data
  • vs
  • AutoClass and H-R (Hertzsprung-Russell) diagrams
  • AM and prime numbers
  • BACON and Boyle's Law

21
Guided Learning vs Learning from Raw Data
  • Does the learning system require carefully
    selected examples and counterexamples, as in a
    teacher-student situation?
  • (allows fast learning)
  • CIGOL learning sort/merge
  • vs
  • Garvan institute's thyroid data

22
Learning How vs Learning That vs Learning a
Better That
  • Classifying handwritten symbols
  • Distinguishing vowel sounds (Sejnowski & Rosenberg)
  • Learning to fly a (simulated!) plane
  • vs
  • Michalski learning diagnosis of soy diseases
  • vs
  • Mitchell learning about chess forks

23
Stochastic vs DeterministicSymbolic vs
Subsymbolic
  • Classifying handwritten symbols (stochastic,
    subsymbolic)
  • vs
  • Predicting plant distributions (stochastic,
    symbolic)
  • vs
  • Cloud classification (deterministic, symbolic)
  • vs
  • ? (deterministic, subsymbolic)

24
Clean vs Noisy Data
  • Learning to diagnose errors in programs
  • vs
  • Greater gliders in the Coolangubra

25
Discrete vs Continuous Variables
  • Quinlan's chess end games
  • vs
  • Pearson's clouds (eg cloud heights)

26
Attribute vs Relational Learning
  • Predicting plant distributions
  • vs
  • Predicting animal distributions
  • (because plants can't move, they don't care -
    much - about spatial relationships)

27
The Importance of Background Knowledge
  • Learning about faults in a satellite power supply
  • general electric circuit theory
  • knowledge about the particular circuit

28
Generalisation and Learning
  • What do we mean when we say of two propositions,
    S and G, that G is a generalisation of S?
  • Suppose Skippy is a grey kangaroo.
  • We would regard 'Kangaroos are grey' as a
    generalisation of 'Skippy is grey'.
  • In any world in which 'Kangaroos are grey' is
    true, 'Skippy is grey' will also be true.
  • In other words, if G is a generalisation of
    specialisation S, then G is 'at least as true' as
    S,
  • That is, S is true in all states of the world in
    which G is, and perhaps in other states as well.

29
Generalisation and Inference
  • In logic, we assume that if S is true in all
    worlds in which G is, then
  • G → S
  • That is, G is a generalisation of S exactly when
    G implies S
  • So we can think of learning from S as a search
    for a suitable G for which G → S
  • In propositional learning, this is often used as
    a definition
  • G is more general than S if and only if G → S

30
Issues
  • Equating generalisation and logical implication
    is only useful if the validity of an implication
    can be readily computed
  • In the propositional calculus, validity is an
    exponential problem
  • in the predicate calculus, validity is an
    undecidable problem
  • so the definition is not universally useful
  • (although for some parts of logic - e.g. learning
    rules - it is perfectly adequate).

31
A Common Misunderstanding
  • Suppose we have two rules:
  • 1) A ∧ B → G
  • 2) A ∧ B ∧ C → G
  • Clearly, we would want 1) to be a generalisation
    of 2)
  • This is OK with our definition, because
  • ((A ∧ B → G) → (A ∧ B ∧ C → G))
  • is valid
  • But the confusing thing is that ((A ∧ B ∧ C) → (A ∧ B))
    is also valid
  • If you only look at the hypotheses (antecedents) of
    the rules, rather than the whole rules, the
    implication runs the wrong way around
  • Note that some textbooks are themselves confused
    about this

32
Defining Generalisation
  • We could try to define the properties that
    generalisation must satisfy,
  • So let's write down some axioms. We need some
    notation.
  • We will write 'S <G G' as shorthand for 'S is
    less general than G'.
  • Axioms:
  • Transitivity: If A <G B and B <G C then also A
    <G C
  • Antisymmetry: If A <G B then it's not true that B
    <G A
  • Top: there is a unique element, ⊤, for which it
    is always true that A <G ⊤.
  • Bottom: there is a unique element, ⊥, for which
    it is always true that ⊥ <G A.

33
Picturing Generalisation
  • We can draw a 'picture' of a generalisation
    hierarchy satisfying these axioms

34
Specifying Generalisation
  • In a particular domain, the generalisation
    hierarchy may be defined in either of two ways
  • By giving a general definition of what
    generalisation means in that domain
  • Example: our earlier definition in terms of
    implication
  • By directly specifying the specialisation and
    generalisation operators that may be used to
    climb up and down the links in the generalisation
    hierarchy

35
Learning and Generalisation
  • How does learning relate to generalisation?
  • We can view most learning as an attempt to find
    an appropriate generalisation that generalises
    the examples.
  • In noise free domains, we usually want the
    generalisation to cover all the examples.
  • Once we introduce noise, we want the
    generalisation to cover 'enough' examples, and
    the interesting bit is in defining what 'enough'
    is.
  • In our picture of a generalisation hierarchy,
    most learning algorithms can be viewed as methods
    for searching the hierarchy.
  • The examples can be pictured as locations low
    down in the hierarchy, and the learning algorithm
    attempts to find a location that is above all (or
    'enough') of them in the hierarchy, but usually,
    no higher 'than it needs to be'

36
Searching the Generalisation Hierarchy
  • The commonest approaches are
  • generalising search
  • the search is upward from the original examples,
    towards the more general hypotheses
  • specialising search
  • the search is downward from the most general
    hypothesis, towards the more special examples
  • Some algorithms use different approaches.
    Mitchell's version space approach, for example,
    tries to 'home in' on the right generalisation
    from both directions at once.

37
Completeness and Generalisation
  • Many approaches to axiomatising generalisation
    add an extra axiom
  • Completeness: For any set 𝒮 of members of the
    generalisation hierarchy, there is a unique
    'least general generalisation' L, which satisfies
    two properties:
  • 1) for every S in 𝒮, S <G L
  • 2) if any other L' satisfies 1), then L <G L'
  • If this definition is hard to understand, compare
    it with the definition of 'Least Upper Bound' in
    set theory, or of 'Least Common Multiple' in
    arithmetic
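
The arithmetic comparison suggested above can be made concrete: in the divisibility ordering, the least common multiple plays exactly the role of a least general generalisation. A small illustrative sketch in Python (the helper name lcm is ours, not from the lecture):

```python
from functools import reduce
from math import gcd

def lcm(numbers):
    """Least upper bound of a set of positive integers in the divisibility order."""
    return reduce(lambda a, b: a * b // gcd(a, b), numbers)

S = [4, 6, 10]
L = lcm(S)                     # 60
# Property 1): every member of S sits below L, i.e. divides it.
assert all(L % s == 0 for s in S)
# Property 2): any other upper bound L' (a common multiple of S) sits above L.
assert all(m % L == 0 for m in range(1, 1000) if all(m % s == 0 for s in S))
```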

38
Restricting Generalisation
  • Let's go back to our original definition of
    generalisation
  • G generalises S iff G → S
  • In the general predicate calculus case, this
    relation is uncomputable, so it's not very useful
  • One approach to avoiding the problem is to limit
    the implications allowed

39
Generalisation and Substitution
  • Very commonly, the generalisations we want to
    make involve turning a constant into a variable.
  • Say we see a particular black crow, fred, and we
    notice
  • crow(fred) → black(fred)
  • and we may wish to generalise this to
  • ∀X (crow(X) → black(X))
  • Notice that the original proposition can be
    recovered from the generalisation by substituting
    'fred' for the variable 'X'
  • The original is a substitution instance of the
    generalisation
  • So we could define a new, restricted
    generalisation
  • G subsumes S if S is a substitution instance of G
  • This is a special case of our earlier definition,
    because a substitution instance is always implied
    by the original proposition.
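
A substitution itself is easy to picture in code as a mapping from variables to constants; the sketch below is illustrative only (the tuple encoding of literals and the names are our assumptions, not from the lecture):

```python
def apply_substitution(literal, theta):
    """Apply a substitution theta to a (predicate, *arguments) literal."""
    predicate, *args = literal
    return (predicate, *(theta.get(a, a) for a in args))

# The consequent of the generalisation above, with X a variable:
general = ("black", "X")
print(apply_substitution(general, {"X": "fred"}))   # ('black', 'fred')
# The original proposition is recovered: it is a substitution instance.
```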

40
Learning Algorithms
  • For the rest of this lecture, we will work with a
    specific learning dataset (due to Mitchell)
  • Item Sky AirT Hum Wnd Wtr Fcst Enjy
  • 1 Sun Wrm Nml Str Wrm Sam Yes
  • 2 Sun Wrm High Str Wrm Sam Yes
  • 3 Rain Cold High Str Wrm Chng No
  • 4 Sun Wrm High Str Cool Chng Yes
  • First, we look at a really simple algorithm,
    Maximally Specific Learning
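
This dataset is used by the sketches on the following slides; one minimal way to hold it in code (the names ATTRIBUTES and DATASET, and the use of True/False for Yes/No, are our own conventions):

```python
# Mitchell-style EnjoySport training data, using the abbreviations above.
ATTRIBUTES = ("Sky", "AirT", "Hum", "Wnd", "Wtr", "Fcst")

DATASET = [  # (attribute values, enjoyed?)
    (("Sun",  "Wrm",  "Nml",  "Str", "Wrm",  "Sam"),  True),   # item 1
    (("Sun",  "Wrm",  "High", "Str", "Wrm",  "Sam"),  True),   # item 2
    (("Rain", "Cold", "High", "Str", "Wrm",  "Chng"), False),  # item 3
    (("Sun",  "Wrm",  "High", "Str", "Cool", "Chng"), True),   # item 4
]
```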

41
Maximally Specific Learning
  • The learning language consists of sets of tuples,
    representing the values of these attributes
  • A ? represents that any value is acceptable for
    this attribute
  • A particular value represents that only that
    value is acceptable for this attribute
  • A ∅ represents that no value is acceptable for
    this attribute
  • Thus (?, Cold, High, ?, ?, ?) represents the
    hypothesis that water sport is enjoyed only on
    cold, moist days.
  • Note that our language is already heavily biased:
    only conjunctive hypotheses (hypotheses built
    with ∧) are allowed.
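
In code, such a hypothesis can be stored as a tuple whose slots hold either an attribute value, '?' (anything is acceptable) or ∅ (nothing is acceptable). The sketch below is one possible encoding, with None standing in for ∅:

```python
ANY = "?"     # any value is acceptable for this attribute
EMPTY = None  # stands in for ∅: no value is acceptable

def covers(hypothesis, example):
    """True iff this conjunctive hypothesis accepts the example."""
    return all(a is not EMPTY and (a == ANY or a == v)
               for a, v in zip(hypothesis, example))

# ("?", "Cold", "High", "?", "?", "?") covers item 3 of the dataset
# above, but not items 1, 2 or 4.
```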

42
Find-S
  • Find-S is a simple algorithm: its initial
    hypothesis is that water sport is never enjoyed
  • It expands the hypothesis as positive data items
    are noted

43
Running Find-S
  • Initial Hypothesis
  • The most specific hypothesis (water sports are
    never enjoyed)
  • h ← (∅, ∅, ∅, ∅, ∅, ∅)
  • After First Data Item
  • Water sport is enjoyed only under the conditions
    of the first item
  • h ← (Sun, Wrm, Nml, Str, Wrm, Sam)
  • After Second Data Item
  • Water sport is enjoyed only under the common
    conditions of the first two items
  • h ← (Sun, Wrm, ?, Str, Wrm, Sam)

44
Running Find-S
  • After Third Data Item
  • Since this item is negative, it has no effect on
    the learning hypothesis
  • h ← (Sun, Wrm, ?, Str, Wrm, Sam)
  • After Final Data Item
  • Further generalises the conditions encountered
  • h ← (Sun, Wrm, ?, Str, ?, ?)
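
The whole trace can be reproduced with a short sketch of Find-S in the tuple encoding used above (the dataset and the '?' / ∅ conventions are as in the earlier sketches; the function name is ours):

```python
ANY, EMPTY = "?", None

DATASET = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
           (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
           (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
           (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True)]

def find_s(dataset, n_attributes=6):
    h = (EMPTY,) * n_attributes        # most specific: water sport is never enjoyed
    for example, positive in dataset:
        if not positive:
            continue                   # Find-S ignores negative items
        # Minimally generalise h to cover this positive example:
        # ∅ takes the observed value, a clashing value widens to '?'.
        h = tuple(v if a is EMPTY else a if a == v else ANY
                  for a, v in zip(h, example))
    return h

print(find_s(DATASET))   # ('Sun', 'Wrm', '?', 'Str', '?', '?'), as on this slide
```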

45
Discussion
  • We have found the most specific hypothesis
    corresponding to the dataset and the restricted
    (conjunctive) language
  • It is not clear it is the best hypothesis
  • If the best hypothesis is not conjunctive (e.g. if
    we enjoy swimming if it's warm or sunny), it will
    not be found
  • Find-S will not handle noise and inconsistencies
    well.
  • In other languages (not using pure conjunction)
    there may be more than one maximally specific
    hypothesis; Find-S will not work well here

46
Version Spaces
  • One possible improvement on Find-S is to search
    many possible solutions in parallel
  • Consistency
  • A hypothesis h is consistent with a dataset D of
    training examples iff h gives the same answer on
    every element of the dataset as the dataset does
  • Version Space
  • The version space with respect to the language L
    and the dataset D is the set of hypotheses h in
    the language L which are consistent with D
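
Consistency is a one-line check in code; a minimal sketch, reusing the covers test from the earlier sketch:

```python
def covers(h, x):   # as before: '?' matches anything, None (∅) matches nothing
    return all(a is not None and (a == "?" or a == v) for a, v in zip(h, x))

def consistent(h, dataset):
    """h is consistent with D iff it classifies every example exactly as D does."""
    return all(covers(h, x) == label for x, label in dataset)

# The version space is then simply every h in the language L with consistent(h, D).
```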

47
List-then-Eliminate
  • Obvious algorithm
  • The list-then-eliminate algorithm aims to find
    the version space in L for the given dataset D
  • It can thus return all hypotheses which could
    explain D
  • It works by beginning with L as its set of
    hypotheses H
  • As each item d of the dataset D is examined in
    turn, any hypotheses in H which are inconsistent
    with d are eliminated
  • The language L is usually large, and often
    infinite, so this algorithm is computationally
    infeasible as it stands
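
For the small language of this lecture the enumeration is still feasible, so list-then-eliminate can be sketched directly. The attribute domains below are read off the four training items only, so they are an assumption (Mitchell's full attribute domains are larger); the dataset and helpers repeat the earlier sketches so that this one runs on its own:

```python
from itertools import product

DOMAINS = [("Sun", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
           ("Str",), ("Wrm", "Cool"), ("Sam", "Chng")]
DATASET = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
           (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
           (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
           (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True)]

def covers(h, x):
    return all(a is not None and (a == "?" or a == v) for a, v in zip(h, x))

def consistent(h, data):
    return all(covers(h, x) == label for x, label in data)

def all_hypotheses():
    """The whole language L: each slot is '?' or one concrete value,
    plus the single all-∅ hypothesis, which covers nothing."""
    yield (None,) * len(DOMAINS)
    yield from product(*[("?",) + d for d in DOMAINS])

version_space = [h for h in all_hypotheses() if consistent(h, DATASET)]
print(len(version_space))   # 6 hypotheses survive all four items, given these domains
```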

48
Version Space Representation
  • One of the problems with the previous algorithm
    is the representation of the search space
  • We need to represent version spaces efficiently
  • General Boundary
  • The general boundary G with respect to language L
    and dataset D is the set of hypotheses h in L
    which are consistent with D, and for which there
    is no more general hypothesis in L which is
    consistent with D
  • Specific Boundary
  • The specific boundary S with respect to language
    L and dataset D is the set of hypotheses h in L
    which are consistent with D, and for which there
    is no more specific hypothesis in L which is
    consistent with D

49
Version Space Representation 2
  • A version space may be represented by its general
    and specific boundary
  • That is, given the general and specific
    boundaries, the whole version space may be
    recovered
  • The Candidate Elimination Algorithm traces the
    general and specific boundaries of the version
    space as more examples and counter-examples of
    the concept are seen
  • Positive examples are used to generalise the
    specific boundary
  • Negative examples permit the general boundary to
    be specialised.

50
Candidate Elimination Algorithm
  • Set G to the set of most general hypotheses in L
  • Set S to the set of most specific hypotheses in L
  • For each example d in D

51
Candidate Elimination Algorithm
  • If d is a positive example
  • Remove from G any hypothesis inconsistent with
    d
  • For each hypothesis s in S that is not
    consistent with d
  • Remove s from S
  • Add to S all minimal generalisations h of s
    such that h is consistent with d, and some
    member of G is more general than h
  • Remove from S any hypothesis that is more
    general than another hypothesis in S

52
Candidate Elimination Algorithm
  • If d is a negative example
  • Remove from S any hypothesis inconsistent with
    d
  • For each hypothesis g in G that is not
    consistent with d
  • Remove g from G
  • Add to G all minimal specialisations h of g
    such that h is consistent with d, and some
    member of S is more specific than h
  • Remove from G any hypothesis that is less
    general than another hypothesis in G
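
The three slides above can be pulled together into one runnable sketch for the conjunctive language of this lecture. The attribute domains are again read off the four training items, and all names are ours; on Mitchell's four examples it ends with S = {(Sun, Wrm, ?, Str, ?, ?)} and G = {(Sun, ?, ?, ?, ?, ?), (?, Wrm, ?, ?, ?, ?)}, matching the boundaries in Mitchell's worked example:

```python
ANY, EMPTY = "?", None

DOMAINS = [("Sun", "Rain"), ("Wrm", "Cold"), ("Nml", "High"),
           ("Str",), ("Wrm", "Cool"), ("Sam", "Chng")]
DATASET = [(("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
           (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
           (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
           (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True)]

def covers(h, x):
    return all(a is not EMPTY and (a == ANY or a == v) for a, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """h1 covers at least everything h2 covers (conjunctive tuples only)."""
    if EMPTY in h2:                       # a hypothesis with ∅ covers nothing
        return True
    return all(a == ANY or a == b for a, b in zip(h1, h2))

def generalise(s, x):
    """The unique minimal generalisation of s that covers positive example x."""
    return tuple(v if a is EMPTY else a if a == v else ANY for a, v in zip(s, x))

def specialisations(g, x):
    """All minimal specialisations of g that reject negative example x."""
    out = []
    for i, (a, v) in enumerate(zip(g, x)):
        if a == ANY:
            out.extend(g[:i] + (other,) + g[i + 1:]
                       for other in DOMAINS[i] if other != v)
    return out

def candidate_elimination(dataset):
    S = [(EMPTY,) * len(DOMAINS)]         # most specific boundary
    G = [(ANY,) * len(DOMAINS)]           # most general boundary
    for x, positive in dataset:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:
                    h = generalise(s, x)
                    if any(more_general_or_equal(g, h) for g in G):
                        new_S.append(h)
            new_S = list(dict.fromkeys(new_S))
            # drop any member of S more general than another member of S
            S = [s for s in new_S
                 if not any(t is not s and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                else:
                    new_G.extend(h for h in specialisations(g, x)
                                 if any(more_general_or_equal(h, s) for s in S))
            new_G = list(dict.fromkeys(new_G))
            # drop any member of G less general than another member of G
            G = [g for g in new_G
                 if not any(t is not g and more_general_or_equal(t, g) for t in new_G)]
    return S, G

S, G = candidate_elimination(DATASET)
print(S)   # [('Sun', 'Wrm', '?', 'Str', '?', '?')]
print(G)   # [('Sun', '?', '?', '?', '?', '?'), ('?', 'Wrm', '?', '?', '?', '?')]
```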

53
Summary
  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms
  • Find-S
  • Version Spaces
  • List-then-Eliminate
  • Candidate Elimination

54