1
Reinforcement Learning
  • Rafy Michaeli
  • Assaf Naor
  • Supervisor: Yaakov Engel

FOR MORE INFO...
Visit the project's home page at http://www.technion.ac.il/smily/rl/index.html
2
Project Goals
  • Study the field of Reinforcement Learning (RL)
  • Gain practical experience implementing RL algorithms
  • Examine the influence of various parameters on the performance of RL algorithms

3
Overview
  • Reinforcement Learning
  • In RL problems, an agent (a decision-maker) attempts to control a dynamic system by choosing an action at every time interval

4
Overview, cont.
  • Reinforcement Learning
  • The agent receives feedback with every action it executes

5
Overview, cont.
  • Reinforcement Learning
  • The ultimate goal of the agent is to learn a strategy for selecting actions such that the overall performance is optimized according to a given criterion

6
Overview, cont.
  • The Value function
  • Given a fixed policy π, which determines the action to be performed at each state, this function assigns a value to every state in the state space (the set of all possible states of the system)

7
Overview, cont.
  • The Value function
  • The value of a state is defined as the weighted sum of the reinforcements received when starting at that state and following the given policy to a final state (short-term reinforcements are weighted more strongly than long-term ones)

8
Overview, cont.
  • The Value function
  • Or mathematically
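  • A standard form of this definition, assuming a discount factor γ (0 < γ ≤ 1) and denoting the policy by π and the reinforcement at time t by r_t, is
    $$ V^{\pi}(s) \,=\, E\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ a_{t}=\pi(s_{t}) \right] $$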

9
Overview, cont.
  • The Action Value Function or Q-Function
  • Given a fixed policy π, this function assigns a value to every (state, action) pair in the (state, action) space

10
Overview, cont.
  • The Action Value Function or Q-Function
  • The value of a pair (state s, action a) is
    defined as the weighted sum of reinforcements due
    to executing action a at state s, and then
    following the given policy for selecting actions
    in subsequent states

11
Overview, cont.
  • The Action Value Function or Q-Function
  • Or mathematically
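  • A standard form of this definition, with the same notation and a denoting the action executed first, is
    $$ Q^{\pi}(s,a) \,=\, E\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a,\ a_{t}=\pi(s_{t}) \ \text{for}\ t \ge 1 \right] $$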

12
Overview, cont.
  • The learning algorithm
  • Uses experiences to progressively learn the
    optimal value function, which is the function
    that predicts the best long term outcome an agent
    could receive from a given state

13
Overview, cont.
  • The learning algorithm
  • The agent learns the optimal value function by continually exercising the current, non-optimal estimate of the value function and improving this estimate after every experience

14
Overview, cont.
  • The learning algorithm
  • Given the optimal value function, the agent can then evaluate the optimal policy by performing
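  • One standard form of this greedy selection, written here with the optimal Q-function (an equivalent one-step lookahead over the optimal Value Function can be used when a model of the system is available), is
    $$ \pi^{*}(s) \,=\, \arg\max_{a}\, Q^{*}(s,a) $$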

15
Description
  • Overviewed the field of Learning in general and
    focused on RL algorithms
  • Implemented various RL algorithms on a chosen
    task, aiming to teach the agent the best way to
    perform the task

16
Description, cont.
  • The task of the agent
  • Given a cars initial location and velocity,
    bring it to a desired location with a zero
    velocity, as quickly as possible !

17
Description, cont.
  • The task of the agent
  • System description
  • The car can move either forward or backward
  • The agent can control the car's acceleration at any time interval

18
Description, cont.
  • The task of the agent
  • System description
  • Walls are placed on both sides of the track
  • When the car hits a wall, it bounces back at the same speed it had prior to the collision
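  • A minimal MATLAB sketch of one simulation step consistent with this description, assuming unit-mass, discrete-time double-integrator dynamics and a track of half-length L (the slides do not give the exact equations of motion used in the project):

    function [x, v] = car_step(x, v, a, dt, L)
    % One simulated time interval: the agent chooses the acceleration a.
        v = v + a * dt;            % update the velocity
        x = x + v * dt;            % update the position
        if x >= L                  % hit the right wall:
            x = L;  v = -v;        %   bounce back at the same speed
        elseif x <= -L             % hit the left wall:
            x = -L; v = -v;        %   bounce back at the same speed
        end
    end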

19
Description, cont.
  • A sketch of the system

20
Description, cont.
  • The code was written using MATLAB
  • Performed experiments to determine the influence of different parameters on the Learning algorithm (mainly on convergence and on how fast the system learns the optimal policy)
  • Tested the performance of CMAC as a function
    approximator (tested for both 1D and 2D functions)

21
Implementation issues
  • Function approximators - Representing the Value/Q
    Function
  • Lookup Tables
  • A finite ordered set of elements (a possible implementation would be an array). Each element is uniquely associated with an index, and elements are accessed through their indices.
  • Each region of a continuous state space is mapped to an element of the lookup table. Thus, all states within a region are aggregated into one table element and are therefore assigned the same value.
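  • A minimal MATLAB sketch of a uniform mapping from the continuous (position, velocity) state to a lookup-table cell (the ranges and grid sizes are illustrative only):

    nx = 20;  nv = 20;                 % number of cells per dimension
    xMin = -1; xMax = 1;               % assumed position range
    vMin = -2; vMax = 2;               % assumed velocity range
    V = zeros(nx, nv);                 % the table of state values

    x = 0.37;  v = -0.5;               % an example continuous state
    ix = min(nx, max(1, 1 + floor((x - xMin) / (xMax - xMin) * nx)));
    iv = min(nv, max(1, 1 + floor((v - vMin) / (vMax - vMin) * nv)));
    value = V(ix, iv);                 % every state in this cell shares the value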

22
Implementation issues, cont.
  • Function approximators - Representing the Value/Q
    Function
  • Lookup Tables
  • This mapping from the state space to the Lookup
    Table can be uniform or non-uniform.
  • An example of a uniform mapping of the state
    space to cells

23
Implementation issues, cont.
  • Function approximators - Representing the Value/Q
    Function
  • Cerebellar Model Articulation Controller CMAC
  • Each state activates a specific set of memory locations (features); the arithmetic sum of their values is the value of the stored function at that state.
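  • A minimal MATLAB sketch of CMAC evaluation in one dimension: several shifted tilings each activate one memory location, and the stored function value is the sum of the active weights (sizes and ranges are illustrative only):

    nTilings = 5;                     % number of overlapping tilings
    nTiles   = 12;                    % tiles per tiling
    xMin = -1;  xMax = 1;             % assumed input range
    w = zeros(nTilings, nTiles);      % one weight per (tiling, tile) location

    x = 0.3;                          % example input state
    tileWidth = (xMax - xMin) / nTiles;
    val = 0;
    for k = 1:nTilings
        offset = (k - 1) * tileWidth / nTilings;   % each tiling is shifted
        idx = 1 + floor((x - xMin + offset) / tileWidth);
        idx = min(nTiles, max(1, idx));            % clip to the table
        val = val + w(k, idx);                     % sum of the active weights
    end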

24
A CMAC structure realization
25
Implementation issues, cont.
  • Learning the optimal Value Function
  • We wish to learn the optimal Value Function, from which we can deduce the optimal action policy

26
Implementation issues, cont.
  • Learning the optimal Value Function
  • Our learning algorithm was based on Temporal Difference methods, or TD for short

27
Implementation issues, cont.
  • Learning the optimal Value Function
  • We define the temporal difference as
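  • In standard notation, with the current estimate of the Value Function denoted V̂ and the reinforcement received after the transition denoted r_{t+1}, the temporal difference at time t is usually written as
    $$ \delta_{t} \,=\, r_{t+1} + \gamma\, \hat{V}(s_{t+1}) - \hat{V}(s_{t}) $$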

28
Implementation issues, cont.
  • Learning the optimal Value Function
  • At each time step we update the estimated Value
    Function by calculating
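  • The usual TD(0) form of this update, with learning rate α, is
    $$ \hat{V}(s_{t}) \,\leftarrow\, \hat{V}(s_{t}) + \alpha\, \delta_{t} $$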

29
Implementation issues, cont.
  • Learning the optimal Value Function
  • By definition, the optimal policy satisfies
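  • One standard way to state this is that the optimal policy is greedy with respect to the optimal Value Function, which itself satisfies the Bellman optimality condition:
    $$ V^{*}(s) \,=\, \max_{a}\, E\!\left[\, r(s,a) + \gamma\, V^{*}(s') \right], \qquad \pi^{*}(s) \,=\, \arg\max_{a}\, E\!\left[\, r(s,a) + \gamma\, V^{*}(s') \right] $$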

30
Implementation issues, cont.
  • Learning the optimal Value Function
  • TD(λ) and Eligibility Traces
  • The TD rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0.

31
Implementation issues, cont.
  • Learning the optimal Value Function
  • TD(λ) and Eligibility Traces
  • The general TD(λ) rule is similar to the TD rule given above, with each state's update weighted by its eligibility trace.
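  • A minimal MATLAB sketch of one TD(λ) update with an accumulating eligibility trace over a lookup table (one common form of the rule; the indices, reinforcement, and parameter values below are illustrative only):

    nCells = 400;                      % lookup-table size (illustrative)
    V = zeros(nCells, 1);              % value estimates
    e = zeros(nCells, 1);              % eligibility traces
    alpha = 0.1;  gamma = 1.0;  lambda = 0.7;   % learning parameters
    s = 17;  sNext = 18;  r = -1;      % one observed transition and reinforcement

    delta = r + gamma * V(sNext) - V(s);   % the temporal difference
    e(s)  = e(s) + 1;                      % mark the visited cell as eligible
    V     = V + alpha * delta * e;         % update every cell through its trace
    e     = gamma * lambda * e;            % decay all traces toward zero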

32
Implementation issues, cont.
  • Look-Up Table Implementation
  • We used a Look-Up Table to represent the Value Function, and acquired the optimal policy by applying the TD(λ) algorithm.
  • We used a non-uniform mapping of the state space to cells in the Look-Up Table, which enabled us to keep a rather small number of cells while still having fine quantization around the origin.

33
Implementation issues, cont.
  • CMAC Implementation - 1
  • CMAC is used to represent the Value Function and TD(λ) is the learning algorithm.
  • CMAC Implementation - 2
  • CMAC is used to represent the Q-Function and TD(λ) is the learning algorithm.

34
Implementation issues, cont.
  • System simulation description
  • We simulated each of the three implementations for different values of λ.
  • For each value of λ, we tested the system for different values of the learning rate α.
  • The discount factor γ was taken to be 1 throughout the simulations.

35
Simulation Results
  • We define
  • success rate
  • The percentage of all tries in which the agent has successfully been able to bring the car to its destination with zero velocity

36
Simulation Results
  • We define
  • Average way
  • The average number of time intervals it took the agent to bring the car to its destination

37
Simulation Results
  • Look-Up Table results
  • A common result for all parameter variants is the
    improvement of the success rate and the
    shortening of the average way to goal as learning
    progresses.
  • For a given λ, it is hard to observe any differences between the results for different values of α.
  • As λ increases, the learning process is better, i.e., for a given try number, the results for the success rate and the average way are better.

38
Simulation Results
  • Look-Up Table results
  • It is noted that eventually, in all cases, the success rate reaches 100%, i.e., the agent successfully brought the car to its goal.

39
Look-Up Table performance summary
40
Simulation Results
  • CMAC Q-Function results
  • A common result for all parameter variants is the
    improvement of the success rate and the
    shortening of the average way to goal as learning
    progresses.
  • For a given λ, better results were obtained for a bigger α.
  • As λ increases, the learning process is generally better.

41
Simulation Results
  • CMAC Q-Function results
  • In most cases, a 100% success rate is not reached, though it is reached in some cases.
  • In some cases we can see that the success rate decreases over the tries and then increases again.

42
CMAC Q-Learning performance summary
43
Simulation Results
  • CMAC Value Iteration results
  • The figure ahead shows the results obtained by the CMAC Value Iteration implementation, compared to the results already obtained for the CMAC Q-Learning implementation. The results are for the best pair (λ, α) as obtained from the previous results.

44
Simulation Results
  • CMAC Value Iteration results

A comparison between CMAC Q-Learning and CMAC
Value Iteration performance
45
Simulation Results
  • Learning process examples
  • In Figure 1 we show the process of learning for a specific starting state and set of learning parameters. The figure shows the movement of the car after every few tries, for 150 consecutive time intervals.
  • In Figure 2 we demonstrate the system's ability (at the end of learning) to direct the car to its goal starting from different states.

46
Figure 1: The progress of learning for a specific starting state and set of learning parameters
47
Figure 2: The system's performance from different starting states after try 20, with the chosen learning parameters
48
Conclusions
  • In this project we implemented a family of RL algorithms, TD(λ), with two different function approximators: CMAC and a Look-Up Table.

49
Conclusions
  • We examined the effect of the learning parameters λ and α on the overall performance of the system.
  • In the Look-Up Table implementation, α does not have a significant impact on the results; as λ increases, the success rate increases more rapidly.

50
Conclusions
  • We examined the effect of the learning parameters λ and α on the overall performance of the system.
  • In the CMAC implementation, as λ or α increases, the success rate increases and the average way decreases more rapidly.

51
The End