1
Advanced Artificial Intelligence Lecture 4
Learning Theory
  • Bob McKay
  • School of Computer Science and Engineering
  • College of Engineering
  • Seoul National University

2
Outline
  • Language Identification
  • PAC Learning
  • Vapnik-Chervonenkis Dimension
  • Mistake-Bounded Learning

3
What should a Definition of Learnability Look
Like?
  • First try
  • How easy is it to learn a function f?
  • Easy: build a definition of f into the learning
    algorithm
  • Second try
  • How easy is it to learn a given function f from a
    set of functions F?

4
Language Identification in the Limit
  • Gold (1967)
  • Algorithm identifies language L in the limit if
    there is some K such that after K steps, the
    algorithm always answers L, and L is in fact the
    correct answer.
  • Computability focus
  • can concept be learned at all
  • rather than computational feasibility
  • can concept be learned with reasonable resources
  • Many sub-definitions, the most important being
    whether
  • the algorithm gets positive examples only, or
    negative plus positive examples
  • the algorithm is given the examples in a
    predetermined order, or can ask about specific
    examples
  • A very strict definition, appropriate for a
    noise-free environment and infinite time only

5
What should a Definition of Learnability Look
Like?
  • Third try
  • Add a requirement for polynomial time rather than
    just eventually
  • What's wrong with this?

6
Defining Learnability - Noise Issues
  • You might get a row of misleading instances by
    chance
  • Don't require a guaranteed correct answer, just
    one correct with a given probability
  • You might only see noisy answers for some inputs
  • Don't require the learned function to always be
    correct
  • just correct 'almost everywhere'
  • The examples may not be equally likely to be seen
  • Take the example distribution into account
  • As greater accuracy is required, learning is
    likely to require more examples
  • Learning is required to be polynomial in both
    size of input and required accuracy

7
PAC Learning
  • A set F of Boolean functions is learnable iff
    there is
  • a polynomial p and an algorithm A(F) such that
  • for every f in F, for any distributions D+, D- of
    likelihood of positive and negative samples, and
    for every ε, δ > 0,
  • A halts in time p(S(f), 1/ε, 1/δ) and outputs a
    program g such that
  • with probability at least 1 - δ
  • Σ_{x : g(x)=0} D+(x) < ε
  • Σ_{x : g(x)=1} D-(x) < ε
  • (S(f) is some measure of the actual size of f)
  • Valiant, 'A Theory of the Learnable', 1984: a
    motivational and informal paper, with both
    positive and negative results
  • Pitt and Valiant, 'Computational Limits on
    Learning from Examples': a formal and
    mathematical paper, with further, mainly
    negative, results

8
PAC Learning
  • ε is a measure of how accurate the function g is
    ('approximately correct')
  • δ is a measure of how often g can be wrongly
    chosen ('probably correct')
  • The definition could be rewritten with ε in both
    these roles
  • this is equivalent to the original definition
    anyway
  • but the use of separate ε and δ simplifies the
    derivation of limits on learnability.
  • Variants of the definitions
  • A is allowed access either to positive examples
    only, or both positive and negative examples
  • g is required to produce no false positives
  • g is required to produce no false negatives

9
PAC Learning Results
  • k-CNF
  • Formulas in Conjunctive Normal Form, max k
    literals per conjunct
  • PAC learnable from positive examples only
  • k-DNF
  • Formulas in Disjunctive Normal Form, max k
    literals per disjunct
  • PAC learnable from negative examples only
  • k-term CNF
  • (CNF with at most k conjuncts)
  • Not PAC learnable in polynomial time
  • k-term DNF
  • (DNF with at most k disjuncts)
  • Not PAC learnable in polynomial time

10
PAC Learning Results
  • Virtually all negative results rely on the
    assumption that RP ≠ NP
  • i.e. that some problems solvable in
    non-deterministic polynomial time cannot be
    solved in randomised polynomial time
  • informally, that making the right guesses gives
    an advantage over just making random guesses
  • The above results may seem somewhat surprising,
    since k-CNF includes k-term DNF (and mutatis
    mutandis)
  • There has been a series of PAC-learnability
    results since Valiant's original work, more
    negative than positive
  • This leads to the current emphasis on bias to
    restrict hypothesis space, and background
    knowledge to guide learning

11
Extensions of PAC Learning
  • k-term DNF is not PAC learnable
  • but
  • we can extend the PAC definition to allow f and g
    to belong to different function classes
  • and
  • k-term DNF is PAC learnable by k-CNF!!!
  • The problem is more in expressing the right
    hypothesis than in converging on that hypothesis
  • Pitt and Warmuth, 'Reductions among Prediction
    Problems: On the Difficulty of Predicting
    Automata', 1988
  • Polynomial predictability
  • Essentially the PAC definition, but with the
    hypothesis allowed to belong to an arbitrary
    language

12
PAC Learning and Sample Size
  • PAC learning results are expressed in terms of
    the amount of computation time needed to learn a
    concept
  • Many algorithms require a constant or
    near-constant time to process a sample,
    independent of the number of samples
  • Most (but not all) results regarding polynomial
    time may be translated into results about
    polynomial sample size
  • k-term DNF
  • We mentioned above that k-term DNF is not
    PAC-learnable
  • Nevertheless (see Mitchell) k-term DNF is
    learnable in polynomial sample size
  • The samples just take longer to process, the
    longer the formula is

13
Vapnik-Chervonenkis Dimension: Why?
  • Good estimates of the amount of data needed to
    learn
  • A neutral comparison measure between different
    methods
  • A measure to help avoid over-fitting
  • The underpinning for support vector machine
    learning

14
Reminder: Version Spaces
  • The version space VS_{H,D} is the subset of the
    hypothesis space H which is consistent with the
    learning data D
  • The region of the generalisation hierarchy
  • bounded above by the positive examples
  • bounded below by the negative examples
  • As further examples are added to D, the
    boundaries of the version space contract to
    remain consistent.

15
Reminder: Candidate Elimination
  • Set G to most general hypotheses in L
  • Set S to most specific hypotheses in L
  • For each example d in D
  • If d is a positive example
  • Remove from G any hypothesis inconsistent with d
  • For each hypothesis s in S inconsistent with d
  • Remove s from S
  • Add to S all minimal generalisations h of s such
    that h is consistent with d, and some member of
    G is more general than h
  • Remove from S any hypothesis that is more general
    than another hypothesis in S
  • If d is a negative example
  • Remove from S any hypothesis inconsistent with d
  • For each hypothesis g in G that is not consistent
    with d
  • Remove g from G
  • Add to G all minimal specialisations h of g such
    that h is consistent with d, and some member of S
    is more specific than h
  • Remove from G any hypothesis that is less general
    than another hypothesis in G
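The algorithm above can be sketched in Python for the conjunctive attribute-value hypothesis space of Mitchell's EnjoySport example. This is a minimal sketch under that assumption, not a general implementation; covers(), more_general_or_equal() and the toy data are introduced here purely for illustration.

def covers(h, x):
    """A hypothesis entry is a value, '?' (anything) or None (nothing)."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True iff h1 covers every instance that h2 covers."""
    return all(a == '?' or a == b or b is None for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]        # most specific boundary
    G = [tuple(['?'] * n)]         # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            # minimally generalise the members of S to cover x
            S = [tuple(xv if sv is None else (sv if sv == xv else '?')
                       for sv, xv in zip(s, x)) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimal specialisations of g that exclude x but still
                # lie above some member of S
                for i, gv in enumerate(g):
                    if gv == '?':
                        for v in domains[i]:
                            if v != x[i]:
                                h = g[:i] + (v,) + g[i + 1:]
                                if any(more_general_or_equal(h, s) for s in S):
                                    new_G.append(h)
            G = [g for g in set(new_G)
                 if not any(g2 != g and more_general_or_equal(g2, g)
                            for g2 in new_G)]
    return S, G

# A two-attribute toy run (hypothetical data):
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold')]
data = [(('Sunny', 'Warm'), True), (('Rainy', 'Cold'), False)]
print(candidate_elimination(data, domains))
# S = [('Sunny', 'Warm')]; G contains ('Sunny', '?') and ('?', 'Warm')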

16
True vs Sample Error
  • error_D(h) is the true error rate of the
    hypothesis h
  • error_S(h) is its error rate on the examples seen
    so far
  • Note that error_S(h) = 0 for all hypotheses in
    VS_{H,D}
  • aim: reduce the true error of every hypothesis in
    VS_{H,D} below ε
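Written out, using the standard definitions that match the notation of the following slides (D is the instance distribution, c the target concept, S the training sample):

\mathrm{error}_{\mathcal{D}}(h) \;=\; \Pr_{x \sim \mathcal{D}}\!\bigl[h(x) \neq c(x)\bigr],
\qquad
\mathrm{error}_{S}(h) \;=\; \frac{1}{|S|} \sum_{x \in S} \mathbf{1}\bigl[h(x) \neq c(x)\bigr].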

17
ε-Exhaustion
  • Suppose we wish to learn a target concept c from
    a hypothesis space H, using a set of training
    examples D drawn from c with distribution D
  • VS_{H,D} is ε-exhausted if every hypothesis h in
    VS_{H,D} has error less than ε
  • (∀h ∈ VS_{H,D}) error_D(h) < ε

18
Probability Bounds
  • Suppose we are given a particular sample size m
    (drawn randomly and independently)
  • What is the probability that the version space
    VS_{H,D} has not been ε-exhausted?
  • There is a relatively simple bound: the
    probability is at most |H| e^(-εm)

19
Sample Size, Finite H
  • We would like the probability that we have not
    ε-exhausted VS_{H,D} to be less than δ
  • |H| e^(-εm) < δ
  • Then we need m samples, where
  • m > (ln|H| + ln(1/δ)) / ε
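As a quick illustration (the numbers are examples, not from the slides), the finite-|H| bound can be computed directly; the count |H| = 3^n + 1 below assumes H is the space of conjunctions over n boolean variables and their negations.

import math

def sample_bound_finite_h(h_size, epsilon, delta):
    """Smallest integer m with m > (ln|H| + ln(1/delta)) / epsilon."""
    return math.floor((math.log(h_size) + math.log(1 / delta)) / epsilon) + 1

# e.g. conjunctions over n = 10 boolean variables: |H| = 3**10 + 1
print(sample_bound_finite_h(3**10 + 1, epsilon=0.1, delta=0.05))   # 140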

20
Sample Size, Infinite H
  • For finite hypothesis spaces
  • The formula is very generous
  • For infinite hypothesis spaces
  • Gives no guidance at all
  • We would like a measure of difficulty of a
    hypothesis space giving a bound for infinite
    spaces and a tighter bound for finite spaces.
  • This is what the VC dimension gives us
  • Note that the previous analysis completely
    ignores the structure of the individual
    hypotheses in H, relying on the corresponding
    version space
  • The VC dimension takes into account the
    fine-grained structure of H, and its interaction
    with the individual data items.

21
Shattering
  • Definition: A hypothesis space H shatters a set
    of instances S iff for every dichotomy of S,
    there is a hypothesis h in H consistent with that
    dichotomy
  • Figure 1: 8 hypotheses shattering 3 instances

22
VC Dimension
  • We seek bounds on the number of instances
    required to learn a concept with a given
    fidelity.
  • We would like to know the largest size of S that
    H can shatter
  • the larger S is, the more expressive H is
  • Definition 2: VC(H) is the size of the largest
    subset of the instance space X which H shatters
  • If there is no limit on this size, then VC(H) = ∞
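These two definitions can be checked by brute force; the minimal Python sketch below assumes the hypotheses are supplied as a finite list of boolean predicates (for an infinite H, pass a finite subset that realises every dichotomy H can realise on the given points).

from itertools import product

def shatters(points, hypotheses):
    """True iff every dichotomy of `points` is matched by some hypothesis."""
    for labels in product([False, True], repeat=len(points)):
        if not any(all(h(x) == y for x, y in zip(points, labels))
                   for h in hypotheses):
            return False
    return True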

23
Example 1: Real Intervals
  • X = R and H = the set of closed intervals on R
  • What is VC(H)?
  • Consider the set S = {-1, 1}
  • S is shattered by H
  • so VC(H) is at least 2
  • Consider S = {x1, x2, x3} with x1 < x2 < x3
  • H doesn't shatter S
  • no hypothesis from H can represent {x1, x3}
  • Why not?
  • Suppose Y = [y1, y2] covers {x1, x3}
  • Then clearly, y1 ≤ x1 and y2 ≥ x3
  • So y1 < x2 < y2, so that Y covers x2 as well
  • Thus H cannot shatter any 3-element subset of R,
    from which it follows that VC(H) = 2
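The argument above can be checked mechanically with the shatters() helper from the previous slide. On a finite point set, a closed interval covers exactly a contiguous run of the sorted points, so the finite family of candidate intervals below realises every dichotomy that H can.

def interval_hypotheses(points):
    pts = sorted(points)
    pairs = [(pts[i], pts[j]) for i in range(len(pts)) for j in range(i, len(pts))]
    pairs.append((pts[0] - 2, pts[0] - 1))       # an interval missing every point
    return [lambda x, a=a, b=b: a <= x <= b for a, b in pairs]

print(shatters([-1, 1], interval_hypotheses([-1, 1])))      # True: VC(H) >= 2
print(shatters([0, 1, 2], interval_hypotheses([0, 1, 2])))  # False: this 3-set is not shattered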

24
Example 2: Linear Decisions
  • Let X = R², H = the set of linear decision
    surfaces
  • (two-input perceptron)
  • H shatters any 2-element subset
  • so VC(H) ≥ 2
  • For three element sets
  • if the elements of S are collinear, then H cannot
    shatter them (as above)
  • H can shatter any set of three points which are
    not collinear
  • Thus VC(H) ≥ 3

25
Example 2: Linear Decisions (cont)
  • 4 points not shattered
  • No single linear decision surface can partition
    these points into {(-1,-1), (1,1)} and
    {(-1,1), (1,-1)}
  • But of course, this isn't enough
  • to be sure that VC(H) = 3, we need to know that
    no set of four points can be shattered

26
Example 2: Linear Decisions (cont)
  • If there were such a set of four points
  • No three of them are collinear (see previous)
  • Hence there is an affine transformation of them
    onto {(-1,-1), (-1,1), (1,-1), (1,1)}
  • That transformation would also transform the
    decision surfaces into new linear decision
    surfaces which shatter {(-1,-1), (-1,1), (1,-1),
    (1,1)}
  • contradiction!
  • For linear decision surfaces in R^n, the VC
    dimension is n + 1
  • (i.e. the VC dimension of an n-input perceptron
    is n + 1)

27
Example 3: Conjunctions of Literals
  • Consider conjunctions of literals over n = 3
    boolean variables
  • Represent each instance as a bitstring
  • Consider the set S = {100, 010, 001}
  • Naming the boolean variables A, B, C, we see that
  • the hypothesis set {∅, ¬A, ¬B, ¬C, ¬A¬B, ¬A¬C,
    ¬B¬C, ¬A¬B¬C} shatters S
  • (∅ is the empty conjunction, which is always
    true, hence covers the full set S)
  • Thus VC(H) is at least 3.
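The shattering claim can again be verified with the shatters() helper from the VC-dimension slide, encoding instances as bit-tuples (A, B, C) and each listed hypothesis as a conjunction of negated variables.

def neg_conj(*indices):
    """Conjunction of the negations of the variables at the given indices."""
    return lambda x: all(x[i] == 0 for i in indices)

S3 = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
H3 = [neg_conj(),                                   # the empty conjunction
      neg_conj(0), neg_conj(1), neg_conj(2),        # not-A, not-B, not-C
      neg_conj(0, 1), neg_conj(0, 2), neg_conj(1, 2),
      neg_conj(0, 1, 2)]
print(shatters(S3, H3))                             # True: VC(H) is at least 3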

28
Example 3: Conjunctions of Literals (cont)
  • Can a set of four instances be shattered?
  • The answer is no, though the proof is non-trivial
  • So VC(H) = 3.
  • The proof is more general:
  • if H is the set of boolean conjunctions over up
    to n variables, and X is the set of boolean
    instances, then VC(H) = n.

29
VC Dimension and Hypothesis Space Size
  • From examples 1 and 2
  • The VC dimension can be quite small even when the
    hypothesis space is infinite
  • If we can get learning bounds in terms of the VC
    dimension, these will apply even to infinite
    hypothesis spaces

30
VC Dimension and Minimum Sample Size
  • Recall: for finite spaces, a bound on the number
    m of samples necessary to ε-exhaust VS_{H,D} with
    probability at least 1 - δ is
  • m ≥ (ln|H| + ln(1/δ)) / ε
  • Using the VC dimension, we get
  • m ≥ (1/ε) (4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
  • (Blumer's Theorem)
  • The minimum number of examples is proportional to
    the VC dimension

31
VC Dimension and Sample Size
  • The above is a guaranteed bound. But how lucky
    could we get?
  • Assuming VC(H) ≥ 2, ε < 1/8 and δ < 1/100:
  • For any learner L, there is a situation in which,
    with probability at least δ, L outputs a
    hypothesis having error rate at least ε, if L
    observes fewer training examples than
  • max[ (1/ε) log(1/δ), (VC(H) - 1) / (32ε) ]
  • (Ehrenfeucht's Theorem)
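As an illustration (the parameter values below are examples, not from the slides), the two bounds can be compared directly; note the lower bound's conditions ε < 1/8 and δ < 1/100.

import math

def blumer_upper_bound(vc_dim, epsilon, delta):
    """Sufficient sample size: (1/eps)(4 log2(2/delta) + 8 VC(H) log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

def ehrenfeucht_lower_bound(vc_dim, epsilon, delta):
    """Below this many examples, some target and distribution defeat any learner."""
    return math.ceil(max(math.log(1 / delta) / epsilon,
                         (vc_dim - 1) / (32 * epsilon)))

print(blumer_upper_bound(3, epsilon=0.1, delta=0.005))      # sufficient sample size
print(ehrenfeucht_lower_bound(3, epsilon=0.1, delta=0.005)) # necessary sample size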

32
VC Dimension and Neural Nets
  • The VC dimension of a neural network is
    determined by the number of free parameters in
    the network
  • A free parameter is one (usually a weight) which
    can change independently of the other parameters
    of the network.

33
VC Dimension: Threshold Activation
  • For networks with a threshold activation
    function
  • φ(v) = 1 for v ≥ 0
  • φ(v) = 0 for v < 0
  • the VC dimension is proportional to W log W,
    where W is the total number of free parameters in
    the network

34
VC Dimension: Sigmoid Activation
  • For networks with a sigmoid activation function
  • φ(v) = 1 / (1 + e^(-v))
  • the VC dimension is proportional to W², where W
    is the total number of free parameters in the
    network
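For concreteness (a hypothetical network, with constant factors omitted), W can be counted as the weights plus biases of a fully connected feedforward net, and the two scalings quoted above compared.

import math

def free_parameters(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. [inputs, hidden, outputs]."""
    return sum((layer_sizes[i] + 1) * layer_sizes[i + 1]
               for i in range(len(layer_sizes) - 1))

W = free_parameters([10, 20, 1])   # 10 inputs, 20 hidden units, 1 output
print(W)                           # 241 free parameters
print(W * math.log(W))             # order of the VC dimension, threshold units
print(W ** 2)                      # order of the VC dimension, sigmoid units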

35
Structural Risk Minimisation
  • We would like to find the neural network N with
    the minimum generalisation error v_gen(w) for the
    trained weight vector w.

36
Decision Tree Error Curve
37
Generalisation Error Curve
38
Structural Risk Minimisation
  • There is an upper bound for v_gen(w), given by
  • v_guaranteed(w) = v_train(w) + ε₁(N, VC(N), δ,
    v_train(w))
  • N is the number of training examples, δ is a
    measure of the certainty we want
  • The exact form of ε₁ is complex; most
    importantly, ε₁ increases with VC(N), so the
    guaranteed risk and generalisation error have the
    general form shown in the curves above

39
Structural Risk Algorithm
  • General method for finding the best-generalising
    neural network
  • Define a sequence N1, N2, ... of classifiers
    with monotonically increasing VC dimension
  • Minimise the training error of each
  • Identify the classifier N with the smallest
    guaranteed risk
  • This classifier is the one with the best
    generalising ability for unseen data.
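A schematic sketch of this loop in Python; train(), training_error() and vc_dimension() are hypothetical placeholders for whatever model family is used, and the confidence term below is one common VC-style bound rather than the exact ε₁ of the previous slide.

import math

def guaranteed_risk(train_err, vc_dim, n_examples, delta):
    """Training error plus a generic VC confidence term (a sketch only)."""
    eps = math.sqrt((vc_dim * (math.log(2 * n_examples / vc_dim) + 1)
                     + math.log(4 / delta)) / n_examples)
    return train_err + eps

def structural_risk_minimisation(model_family, data, delta=0.05):
    best, best_bound = None, float('inf')
    for model in model_family:            # N1, N2, ... with increasing VC dimension
        model.train(data)                 # minimise the training error of each
        bound = guaranteed_risk(model.training_error(data),
                                model.vc_dimension(), len(data), delta)
        if bound < best_bound:            # keep the smallest guaranteed risk
            best, best_bound = model, bound
    return best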

40
Varying VC Dimension
  • For fully connected multilayer feedforward
    networks, one simple way to vary VC(N) is to
    monotonically increase the number of neurons in
    one of the hidden layers.

41
Mistake-Bounded Learning
  • In some situations (where we must use the result
    of the learning right from the start) we may be
    more concerned about the number of mistakes we
    make in learning a concept, than about the total
    number of instances required
  • In mistake bound learning, the learner is
    required, after receiving each instance x, to
    give a prediction of c(x)
  • before it is given the real answer
  • Each erroneous value counts as a mistake
  • we are interested in the total number of mistakes
    made before the algorithm converges to c
  • In some ways, an extension of Gold's definition

42
Mistake-Bounded Learning - Example
  • For some algorithms and hypothesis spaces, it is
    possible to derive bounds on the number of
    mistakes which will be made in learning
  • if H is the set of conjunctions formed from any
    subset of n literals and their negations
  • the Find-S algorithm will make at most n + 1
    mistakes in learning a given concept
  • With the same H
  • the candidate-elimination algorithm will make at
    most log₂|H| mistakes
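A small sketch of Find-S in this setting, for conjunctions over n boolean variables: the hypothesis starts as the conjunction of all 2n literals (so it predicts negative everywhere) and is generalised only on positive examples, which is why it can make at most n + 1 mistakes. The target concept and example stream below are hypothetical.

def find_s_mistakes(stream, n):
    h = {(i, v) for i in range(n) for v in (0, 1)}   # all 2n literals x_i, not-x_i
    mistakes = 0
    for x, label in stream:
        prediction = all(x[i] == v for i, v in h)    # predict before seeing the label
        if prediction != label:
            mistakes += 1
        if label:                                    # generalise on positive examples
            h = {(i, v) for i, v in h if x[i] == v}
    return mistakes

target = lambda x: x[0] == 1 and x[1] == 0           # target: A AND not-B over (A, B, C)
stream = [(x, target(x)) for x in
          [(1, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 0)]]
print(find_s_mistakes(stream, 3))                    # 2 mistakes here; the bound is n + 1 = 4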

43
Optimal Mistake Bounds
  • Optimal mistake bounds give an estimate of the
    overall complexity of a hypothesis space
  • The optimal mistake bound opt(H) is the minimum
    over all algorithms of the mistake bound for H
  • Littlestone's Theorem
  • VC(H) ≤ opt(H) ≤ log₂|H|

44
?????