Transcript and Presenter's Notes

Title: CPSC 533 Reinforcement Learning


1
CPSC 533 Reinforcement Learning
Paul Melenchuk Eva Wong Winson Yuen Kenneth Wong
2
Outline
  • Introduction
  • Passive Learning in a Known Environment
  • Passive Learning in an Unknown Environment
  • Active Learning in an Unknown Environment
  • Exploration
  • Learning an Action Value Function
  • Generalization in Reinforcement Learning
  • Genetic Algorithms and Evolutionary Programming
  • Conclusion
  • Glossary

3
Introduction
In which we examine how an agent can learn from
success and failure, reward and punishment.
4
Introduction
  • Learning to ride a bicycle
  • The goal given to the Reinforcement Learning
    system is simply to ride the bicycle without
    falling over
  • The RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right

Photo: http://www.roanoke.com/outdoors/bikepages/bikerattler.html
5
Introduction
  • Learning to ride a bicycle
  • RL system turns the handle bars to the LEFT
  • Result: CRASH!!!
  • Receives negative reinforcement
  • RL system turns the handle bars to the RIGHT
  • Result: CRASH!!!
  • Receives negative reinforcement

6
Introduction
  • Learning to ride a bicycle
  • RL system has learned that the state of being tilted 45 degrees to the right is bad
  • Repeats the trial, this time using 40 degrees to the right
  • By performing enough of these trial-and-error
    interactions with the environment, the RL system
    will ultimately learn how to prevent the bicycle
    from ever falling over

7
Passive Learning in a Known Environment
Passive Learner: A passive learner simply
watches the world going by, and tries to learn
the utility of being in various states. Another
way to think of a passive learner is as an agent
with a fixed policy trying to determine its
benefits.
8
Passive Learning in a Known Environment
In passive learning, the environment generates
state transitions and the agent perceives them.
Consider an agent trying to learn the utilities
of the states shown below
9
Passive Learning in a Known Environment
  • The agent can move North, East, South, or West
  • A run terminates on reaching (4,2) or (4,3)

10
Passive Learning in a Known Environment
The agent is provided with M_ij, a model giving the probability of a transition from state i to state j
11
Passive Learning in a Known Environment
  • the object is to use this information about
    rewards to learn the expected utility U(i)
    associated with each nonterminal state i
  • Utilities can be learned using 3 approaches
  • 1) LMS (least mean squares)
  • 2) ADP (adaptive dynamic programming)
  • 3) TD (temporal difference learning)

12
Passive Learning in a Known Environment
LMS (Least Mean Squares)
Agent makes random runs (sequences of random moves) through the environment:
(1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (4,3)   +1
(1,1) -> (2,1) -> (3,1) -> (3,2) -> (4,2)   -1
13
Passive Learning in a Known Environment
  • LMS
  • Collect statistics on the final payoff for each state
  • (e.g., when in (2,3), how often was +1 reached vs. -1?)
  • The learner computes the average for each state
  • Provably converges to the true expected values (utilities)
  • (Algorithm on page 602, Figure 20.3)

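A minimal Python sketch of the LMS idea, assuming each training run is given as a list of (state, reward) pairs (the helper function and state tuples below are illustrative, not from the slides): every state's utility estimate is the running average of the reward-to-go observed after visiting it.

from collections import defaultdict

def lms_update(utilities, counts, run):
    # run is a list of (state, reward) pairs; in the 4x3 world the reward
    # is non-zero only for the terminal state
    reward_to_go = sum(r for _, r in run)
    for state, reward in run:
        counts[state] += 1
        # incremental running average of the observed reward-to-go
        utilities[state] += (reward_to_go - utilities[state]) / counts[state]
        reward_to_go -= reward

utilities, counts = defaultdict(float), defaultdict(int)
# the +1 training sequence from the previous slide
lms_update(utilities, counts,
           [((1, 1), 0), ((1, 2), 0), ((1, 3), 0),
            ((2, 3), 0), ((3, 3), 0), ((4, 3), 1)])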
14
Passive Learning in a Known Environment
LMS Main Drawback - slow convergence - it takes the agent well over 1000 training sequences to get close to the correct values
15
Passive Learning in a Known Environment
ADP (Adaptive Dynamic Programming) Uses the
value or policy iteration algorithm to calculate
exact utilities of states given an estimated model
16
Passive Learning in a Known Environment
ADP: In general, R(i) is the reward of being in state i (often non-zero for only a few end states), and M_ij is the probability of a transition from state i to state j
17
Passive Learning in a Known Environment
ADP
  • Consider U(3,3)
  • U(3,3) = 0.33 x U(4,3) + 0.33 x U(2,3) + 0.33 x U(3,2)
           = 0.33 x 1.0 + 0.33 x 0.0886 + 0.33 x (-0.4430)
           = 0.2152

18
Passive Learning in a Known Environment
  • ADP
  • makes optimal use of the local constraints on
    utilities of states imposed by the neighborhood
    structure of the environment
  • somewhat intractable for large state spaces

19
Passive Learning in a Known Environment
TD (Temporal Difference Learning) The key is to
use the observed transitions to adjust the values
of the observed states so that they agree with
the constraint equations
20
Passive Learning in a Known Environment
  • TD Learning
  • Suppose we observe a transition from state i to state j, where U(i) = -0.5 and U(j) = +0.5
  • Suggests that we should increase U(i) to make it agree better with its successor
  • Can be achieved using the following updating rule
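The rule, as given in Russell and Norvig (α is the learning rate):

U(i) \leftarrow U(i) + \alpha \, \big( R(i) + U(j) - U(i) \big)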

21
Passive Learning in a Known Environment
  • TD Learning
  • Performance
  • Runs are noisier than LMS, but with smaller error
  • Deals only with states observed during sample runs (not all states, unlike ADP)

22
Passive Learning in an Unknown Environment
The Least Mean Squares (LMS) approach and the Temporal-Difference (TD) approach operate unchanged in an initially unknown environment. The Adaptive Dynamic Programming (ADP) approach adds a step that updates an estimated model of the environment.
23
Passive Learning in an Unknown Environment
ADP Approach
  • The environment model is learned by direct
    observation of transitions
  • The environment model M can be updated by keeping
    track of the percentage of times each state
    transitions to each of its neighbors

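A minimal Python sketch of this bookkeeping (the data structures below are an illustrative choice, not from the slides): count the observed transitions and estimate M_ij as the observed fraction.

from collections import defaultdict

# counts[i][j] = number of observed transitions from state i to state j
counts = defaultdict(lambda: defaultdict(int))

def record_transition(i, j):
    # update the counts and return the current estimate of M_ij
    counts[i][j] += 1
    return counts[i][j] / sum(counts[i].values())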
24
Passive Learning in an Unknown Environment
ADP and TD Approaches
  • The ADP approach and the TD approach are closely
    related
  • Both try to make local adjustments to the utility
    estimates in order to make each state agree
    with its successors

25
Passive Learning in an Unknown Environment
  • Minor differences
  • TD adjusts a state to agree with its observed
    successor
  • ADP adjusts the state to agree with all of the
    successors
  • Important differences
  • TD makes a single adjustment per observed
    transition
  • ADP makes as many adjustments as it needs to
    restore consistency between the utility estimates
    U and the environment model M

26
Passive Learning in an Unknown Environment
  • To make ADP more efficient
  • directly approximate the algorithm for value
    iteration or policy iteration
  • prioritized-sweeping heuristic makes adjustments
    to states whose likely successors have just
    undergone a large adjustment in their own utility
    estimates
  • Advantages of the approximate ADP
  • efficient in terms of computation
  • eliminates the long value iterations that occur in the early stages

27
Active Learning in an Unknown Environment
  • An active agent must consider
  • what actions to take
  • what their outcomes may be
  • how they will affect the rewards received

28
Active Learning in an Unknown Environment
  • Minor changes to passive learning agent
  • environment model now incorporates the
    probabilities of transitions to other states
    given a particular action
  • maximize its expected utility
  • agent needs a performance element to choose an
    action at each step

29
Active Learning in an Unknown Environment
Active ADP Approach
  • need to learn the transition probability M^a_ij (the probability of reaching state j from state i when action a is taken) instead of M_ij
  • the input to the function will include the action
    taken
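The utility equation from Russell and Norvig now maximizes over actions:

U(i) = R(i) + \max_{a} \sum_{j} M^{a}_{ij} \, U(j)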

30
Active Learning in an Unknown Environment
Active TD Approach
  • the model acquisition problem for the TD agent is
    identical to that for the ADP agent
  • the update rule remains unchanged
  • the TD algorithm will converge to the same values
    as ADP as the number of training sequences tends
    to infinity

31
Exploration
Learning also involves the exploration of unknown
areas
Photo: http://www.duke.edu/icheese/cgeorge.html
32
Exploration
  • An agent can benefit from actions in 2 ways
  • immediate rewards
  • received percepts

33
Exploration
Wacky Approach vs. Greedy Approach
(Figure: the 4x3 grid world with a utility estimate shown for each state, e.g. 0.215, 0.089, -0.038, -0.165, -0.418, -0.443, -0.544, -0.772)
34
Exploration
The Bandit Problem
Photos: www.freetravel.net
35
Exploration
The Exploration Function - a simple example
u = expected utility (greed), n = number of times actions have been tried (wacky), R+ = best possible reward
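The simple exploration function from Russell and Norvig returns the optimistic estimate R+ until an action has been tried at least N_e times, and the greedy utility estimate afterwards:

f(u, n) = \begin{cases} R^{+} & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}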
36
Learning an Action-Value Function
What Are Q-Values?
37
Learning an Action-Value Function
The Q-Values Formula
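Q(a, i) denotes the value of doing action a in state i; the Q-values satisfy the equilibrium equation given in Russell and Norvig:

Q(a, i) = R(i) + \sum_{j} M^{a}_{ij} \, \max_{a'} Q(a', j)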
38
Learning an Action-Value Function
The Q-Values Formula Application
- just an adaptation of the active learning equation
39
Learning an Action-Value Function
The TD Q-Learning Update Equation
- requires no model
- calculated after each transition from state i to state j
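The update rule, as given in Russell and Norvig (α is the learning rate):

Q(a, i) \leftarrow Q(a, i) + \alpha \, \big( R(i) + \max_{a'} Q(a', j) - Q(a, i) \big)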
40
Learning an Action-Value Function
The TD Q-Learning Update Equation in Practice
The TD-Gammon System (Tesauro), successor to the program Neurogammon - attempted to learn from self-play and an implicit representation
41
Generalization In Reinforcement Learning
Explicit Representation
  • we have assumed that all the functions learned by the agents (U, M, R, Q) are represented in tabular form
  • explicit representation involves one output value
    for each input tuple.

42
Generalization In Reinforcement Learning
Explicit Representation
  • good for small state spaces, but the time to
    convergence and the time per iteration increase
    rapidly as the space gets larger
  • it may be possible to handle 10,000 states or
    more
  • this suffices for 2-dimensional, maze-like
    environments

43
Generalization In Reinforcement Learning
Explicit Representation
  • Problem: more realistic worlds are out of the question
  • e.g. chess and backgammon are tiny subsets of the real world, yet their state spaces contain on the order of 10^50 to 10^120 states. So it would be absurd to suppose that one must visit all these states in order to learn how to play the game.

44
Generalization In Reinforcement Learning
Implicit Representation
  • Overcomes the problem with explicit representation
  • a form that allows one to calculate the output for any input, but that is much more compact than the tabular form

45
Generalization In Reinforcement Learning
Implicit Representation
  • For example,
  • an estimated utility function for game playing can be represented as a weighted linear function of a set of board features f1, ..., fn
  • U(i) = w1 f1(i) + w2 f2(i) + ... + wn fn(i)

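A minimal Python sketch of such an implicit representation (the feature names and weight values are illustrative, not from the slides): the utility of a state is computed from a few weights rather than looked up in a table.

def utility(state, weights, features):
    # U(i) = w1*f1(i) + w2*f2(i) + ... + wn*fn(i)
    return sum(w * f(state) for w, f in zip(weights, features))

# hypothetical board features and weights, for illustration only
features = [lambda s: s["material"], lambda s: s["mobility"]]
weights = [1.0, 0.1]
print(utility({"material": 3, "mobility": 12}, weights, features))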
46
Generalization In Reinforcement Learning
Implicit Representation
  • The utility function is characterized by n
    weights.
  • A typical chess evaluation function might only
    have 10 weights, so this is enormous compression

47
Generalization In Reinforcement Learning
Implicit Representation
  • the enormous compression achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited
  • the most important aspect: it allows for inductive generalization over input states
  • therefore, such methods are said to perform input generalization

48
  • Mendel is a four-legged spider-like creature
  • he has goals and desires, rather than
    instructions
  • through trial and error, he programs himself to
    satisfy those desires
  • he is born not even knowing how to walk, and he
    has to learn to identify all of the deadly things
    in his environment
  • he has two basic drives: move and avoid pain (negative reinforcement)

Game-playing: Galapagos
49
  • player has no direct control over Mendel
  • player turns various objects on and off and
    activates devices in order to guide him
  • player has to let Mendel die a few times, otherwise he'll never learn
  • each death proves to be a valuable lesson as the
    more experienced Mendel begins to avoid the
    things that cause him pain
  • Developer Anark Software.

Game-playing: Galapagos
50
Generalization In Reinforcement Learning
Input Generalisation
  • The cart-pole problem
  • the task is to balance a long pole upright on top of a moving cart

51
Generalization In Reinforcement Learning
Input Generalisation
  • The cart can be jerked left or right by a controller that observes x, ẋ, θ, and θ̇ (cart position and velocity, pole angle and angular velocity)
  • the earliest work on learning for this problem was carried out by Michie and Chambers (1968)
  • their BOXES algorithm was able to balance the pole for over an hour after only about 30 trials.

52
Generalization In Reinforcement Learning
Input Generalisation
  • The algorithm first discretized the 4-dimensional state space into boxes, hence the name
  • it then ran trials until the pole fell over or
    the cart hit the end of the track.
  • Negative reinforcement was associated with the
    final action in the final box and then propagated
    back through the sequence

53
Generalization In Reinforcement Learning
Input Generalisation
  • The discretization caused some problems when the apparatus was initialized in a different position
  • an improvement: an algorithm that adaptively partitions the state space according to the observed variation in the reward

54
Genetic Algorithms And Evolutionary Programming
  • A genetic algorithm starts with a set of one or more individuals and applies selection and reproduction to evolve individuals that are successful, as measured by a fitness function
  • several choices for the individuals exist, such as
  • - entire agent functions: here the fitness function is a performance measure or reward function, and the analogy to natural selection is greatest

55
Genetic Algorithms And Evolutionary Programming
  • A genetic algorithm simply searches directly in the space of individuals, with the goal of finding one that maximizes the fitness function (a performance measure or reward function)
  • the search is parallel because each individual in the population can be seen as a separate search

56
Genetic Algorithms And Evolutionary Programming
  • - component functions of an agent: here the fitness function is the critic
  • - or anything at all that can be framed as an optimization problem
  • Because the evolutionary process learns an agent function based on occasional rewards, as supplied by the selection function, it can be seen as a form of reinforcement learning

57
Genetic Algorithms And Evolutionary Programming
  • Before we can apply a genetic algorithm to a problem, we need to answer 4 questions
  • 1. What is the fitness function?
  • 2. How is an individual represented?
  • 3. How are individuals selected?
  • 4. How do individuals reproduce?

58
Genetic Algorithms And Evolutionary Programming
What is the fitness function?
  • Depends on the problem, but it is a function that
    takes an individual as input and returns a real
    number as output

59
Genetic Algorithms And Evolutionary Programming
How is an individual represented?
  • In the classic genetic algorithm, an individual is represented as a string over a finite alphabet
  • each element of the string is called a gene
  • in genetic algorithms, the binary alphabet (0, 1) is usually used; the string plays a role analogous to DNA

60
Genetic Algorithms And Evolutionary Programming
How are individuals selected ?
  • The selection strategy is usually randomized,
    with the probability of selection proportional to
    fitness
  • for example, if an individual X scores twice as high as Y on the fitness function, then X is twice as likely to be selected for reproduction as Y
  • selection is done with replacement

61
Genetic Algorithms And Evolutionary Programming
How do individuals reproduce?
  • By cross-over and mutation
  • all the individuals that have been selected for
    reproduction are randomly paired
  • for each pair, a cross-over point is randomly chosen
  • the cross-over point is a number in the range 1 to N, where N is the length of the string

62
Genetic Algorithms And Evolutionary Programming
How do individuals reproduce?
  • For example, with a cross-over point at gene 10, one offspring will get genes 1 through 10 from the first parent, and the rest from the second parent
  • the second offspring will get genes 1 through 10
    from the second parent, and the rest from the
    first
  • however, each gene can be altered by random
    mutation to a different value

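A minimal Python sketch of one reproduction step of the classic genetic algorithm described above (population size, string length, and mutation rate are illustrative choices, not from the slides):

import random

def fitness(individual):
    # placeholder fitness function: number of 1-bits (illustration only)
    return sum(individual)

def select(population):
    # randomized, fitness-proportional selection with replacement
    return random.choices(population, weights=[fitness(i) for i in population], k=2)

def reproduce(parent_a, parent_b, mutation_rate=0.01):
    n = len(parent_a)
    point = random.randint(1, n)  # cross-over point in the range 1 to N
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]

    def mutate(gene):
        # each gene can be altered by random mutation to a different value
        return 1 - gene if random.random() < mutation_rate else gene

    return [mutate(g) for g in child_a], [mutate(g) for g in child_b]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(10)]
parents = select(population)
offspring = reproduce(*parents)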
63
Conclusion
  • Passive Learning in a Known Environment
  • Passive Learning in an Unknown Environment
  • Active Learning in an Unknown Environment
  • Exploration
  • Learning an Action Value Function
  • Generalization in Reinforcement Learning
  • Genetic Algorithms and Evolutionary Programming

64
Resources And Glossary
Information Source: Russell, S. and P. Norvig (1995). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
Additional Information and Glossary of Keywords Available at: http://www.cpsc.ucalgary.ca/paulme/533