
Reinforcement Learning

- Rafy Michaeli
- Assaf Naor
- Supervisor: Yaakov Engel

FOR MORE INFO...

Visit the project's home page at http://www.technion.ac.il/smily/rl/index.html

Project Goals

- Study the field of Reinforcement Learning (RL)
- Gain practical experience with implementing RL algorithms
- Examine the influence of various parameters on the performance of RL algorithms

Overview

- Reinforcement Learning
- In RL problems, an agent (a decision-maker) attempts to control a dynamic system by choosing an action at every time interval

Overview, cont.

- Reinforcement Learning
- The agent receives feedback with every action it executes

Overview, cont.

- Reinforcement Learning
- The ultimate goal of the agent is to learn a strategy for selecting actions such that the overall performance is optimized according to a given criterion

Overview, cont.

- The Value function
- Given a fixed policy π, which determines the action to be performed at a given state, this function assigns a value to every state in the state space (all possible states the system can have)

Overview, cont.

- The Value function
- The value of a state is defined as the weighted sum (short-term reinforcements are taken more strongly into account than long-term ones) of the reinforcements received when starting at that state and following the given policy to a final state

Overview, cont.

- The Value function
- Or, mathematically:

  V^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right]

  where γ, 0 ≤ γ ≤ 1, is the discount factor that weights short-term reinforcements more strongly

Overview, cont.

- The Action Value Function, or Q-Function
- Given a fixed policy π, this function assigns a value to every (state, action) pair in the (state, action) space

Overview, cont.

- The Action Value Function, or Q-Function
- The value of a pair (state s, action a) is defined as the weighted sum of reinforcements due to executing action a at state s, and then following the given policy for selecting actions in subsequent states

Overview, cont.

- The Action Value Function, or Q-Function
- Or, mathematically:

  Q^{\pi}(s, a) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a \right]

Overview, cont.

- The learning algorithm
- Uses experiences to progressively learn the optimal value function, which is the function that predicts the best long-term outcome an agent could receive from a given state

Overview, cont.

- The learning algorithm
- The agent learns the optimal value function by continually exercising the current, non-optimal estimate of the value function and improving this estimate after every experience

Overview, cont.

- The learning algorithm
- Given the optimal value function, the agent can then derive the optimal policy by performing

  \pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
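
As an illustration, here is a minimal MATLAB sketch of this step, extracting a greedy policy from a learned Q-table; the matrix and its values are hypothetical, not taken from the project code:

    % Extract the greedy policy from a learned Q-table (hypothetical values).
    % Q is an |S|-by-|A| matrix: Q(s, a) is the value of action a at state s.
    Q = [0.0 1.2; 0.7 0.3; 0.5 0.5];   % 3 states, 2 actions
    [~, piStar] = max(Q, [], 2);       % argmax over actions for every state
    disp(piStar')                      % piStar(s) is the greedy action at state s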

Description

- Surveyed the field of Learning in general and focused on RL algorithms
- Implemented various RL algorithms on a chosen task, aiming to teach the agent the best way to perform the task

Description, cont.

- The task of the agent
- Given a car's initial location and velocity, bring it to a desired location with zero velocity, as quickly as possible!

Description, cont.

- The task of the agent
- System description
- The car can move either forward or backward
- The agent can control the car's acceleration at any time interval

Description, cont.

- The task of the agent
- System description
- Walls are placed on both sides of the track
- When the car hits a wall, it bounces back at the same speed it had prior to the collision (a simulation sketch follows below)
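
A minimal MATLAB sketch of one simulation step consistent with this description; the time interval, wall positions, and the exact bounce rule are assumptions for illustration, not the project's actual dynamics:

    % One simulation step of the car (assumed constants and bounce rule).
    dt = 0.1;  xMin = -2;  xMax = 2;   % time interval and wall positions (assumed)
    x = 1.95;  v = 1.0;  a = 0.5;      % position, velocity, chosen acceleration
    v = v + a * dt;                    % the agent controls the acceleration
    x = x + v * dt;
    if x <= xMin || x >= xMax          % hitting a wall: the car bounces back
        x = min(max(x, xMin), xMax);   % at the same speed it had before
        v = -v;
    end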

Description, cont.

- A sketch of the system

Description, cont.

- The code was written in MATLAB
- Performed experiments to determine the influence of different parameters on the learning algorithm (mainly on convergence and how fast the system learns the optimal policy)
- Tested the performance of CMAC as a function approximator (tested for both 1D and 2D functions)

Implementation issues

- Function approximators - Representing the Value/Q-Function
- Lookup Tables
- A finite ordered set of elements (a possible implementation would be an array). Each element is uniquely associated with an index, and an element is accessed through its index.
- Each region in a continuous state space is mapped to an element of the lookup table. Thus, all states within a region are aggregated into one table element and are therefore assigned the same value.

Implementation issues, cont.

- Function approximators - Representing the Value/Q-Function
- Lookup Tables
- This mapping from the state space to the Lookup Table can be uniform or non-uniform.
- An example of a uniform mapping of the state space to cells is sketched below.
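
A minimal MATLAB sketch of such a uniform mapping for a two-dimensional (position, velocity) state; the bounds and bin counts are assumptions for illustration:

    % Uniform mapping of a continuous state to one lookup-table cell.
    pos = 0.3;  vel = -1.2;                 % an example (position, velocity) state
    posEdges = linspace(-2, 2, 21);         % 20 uniform position bins (assumed)
    velEdges = linspace(-3, 3, 31);         % 30 uniform velocity bins (assumed)
    iPos = discretize(pos, posEdges);       % bin index along each axis
    iVel = discretize(vel, velEdges);
    idx = sub2ind([20 30], iPos, iVel);     % a single index into the table
    V = zeros(20 * 30, 1);                  % the lookup table itself
    V(idx) = V(idx) + 0.1;                  % all states in this region share the entry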

Implementation issues, cont.

- Function approximators - Representing the Value/Q-Function
- Cerebellar Model Articulation Controller (CMAC)
- Each state activates a specific set of memory locations (features). The arithmetic sum of their values is the value of the stored function (a code sketch follows below).

A CMAC structure realization
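
A minimal MATLAB sketch of reading a value out of a CMAC: several overlapping tilings each activate one cell, and the stored value is the arithmetic sum of the active weights. The number of tilings, grid size, and state bounds are assumptions:

    % CMAC read-out: sum the weights of the features activated by a state.
    state = [0.3, -1.2];                   % an example (position, velocity) state
    nTilings = 4;  nBins = 10;             % assumed CMAC layout
    lo = [-2 -3];  hi = [2 3];             % assumed state-space bounds
    w = zeros(nTilings, nBins, nBins);     % one weight grid per tiling
    value = 0;
    for t = 1:nTilings
        offset = (t - 1) / nTilings;       % each tiling is shifted slightly
        idx = floor((state - lo) ./ (hi - lo) * (nBins - 1) + offset) + 1;
        idx = min(max(idx, 1), nBins);     % clip to the grid
        value = value + w(t, idx(1), idx(2));  % sum over the active features
    end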

Implementation issues, cont.

- Learning the optimal Value Function
- We wish to learn the optimal Value Function, from which we can deduce the optimal action policy

Implementation issues, cont.

- Learning the optimal Value Function
- Our learning algorithm was based on methods of Temporal Difference, or TD for short

Implementation issues, cont.

- Learning the optimal Value Function
- We define the temporal difference as

  \delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})

Implementation issues, cont.

- Learning the optimal Value Function
- At each time step we update the estimated Value Function by calculating

  V(s_{t}) \leftarrow V(s_{t}) + \alpha\, \delta_{t}

Implementation issues, cont.

- Learning the optimal Value Function
- By definition, the optimal policy satisfies

  V^{\pi^{*}}(s) \ge V^{\pi}(s) \quad \text{for every policy } \pi \text{ and every state } s

Implementation issues, cont.

- Learning the optimal Value Function
- TD(λ) and Eligibility Traces
- The TD rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0.

Implementation issues, cont.

- Learning the optimal Value Function
- TD(λ) and Eligibility Traces
- The general TD(λ) rule is similar to the TD rule given above, with the update of each state weighted by an eligibility trace; λ is taken to be in the range [0, 1]. A code sketch of one TD(λ) step follows below.
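
A minimal MATLAB sketch of one TD(λ) step over a lookup table with an accumulating eligibility trace; the table size, transition, reward, and parameter values are assumptions for illustration:

    % One TD(lambda) step with an eligibility trace over a lookup table.
    nStates = 600;
    V = zeros(nStates, 1);  e = zeros(nStates, 1);
    alpha = 0.1;  gamma = 1.0;  lambda = 0.8;   % assumed parameter values
    s = 42;  sNext = 43;  r = -1;               % an example transition and reward
    delta = r + gamma * V(sNext) - V(s);        % the temporal difference
    e = gamma * lambda * e;                     % decay every trace
    e(s) = e(s) + 1;                            % mark the state just visited
    V = V + alpha * delta * e;                  % update all states by their traces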

Implementation issues, cont.

- Look-Up Table Implementation
- We used a Look-Up Table to represent the Value Function, and acquired the optimal policy by applying the TD(λ) algorithm.
- We used a non-uniform mapping of the state space to cells in the Look-Up Table, which enabled us to keep a rather small number of cells but still have fine quantization around the origin (a sketch follows below).
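
A minimal MATLAB sketch of such a non-uniform quantization along one axis, with finer cells near the origin and coarser cells far from it; the edge values are assumptions:

    % Non-uniform quantization: fine cells near the origin, coarse far away.
    posEdges = [-2 -1 -0.5 -0.25 -0.1 0 0.1 0.25 0.5 1 2];   % assumed edges
    pos = 0.07;                            % an example position
    iPos = discretize(pos, posEdges);      % index of the containing cell (here, 6)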

Implementation issues, cont.

- CMAC Implementation - 1
- CMAC is used to represent the Value Function and TD(λ) is the learning algorithm.
- CMAC Implementation - 2
- CMAC is used to represent the Q-Function and TD(λ) is the learning algorithm (a sketch of the Q-update follows below).
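
A minimal MATLAB sketch of one Q-function update in a CMAC, where the TD error is spread evenly over the features active for the visited (state, action) pair; all sizes and values are assumptions for illustration:

    % Spread a TD error over the CMAC features active for (state, action).
    nTilings = 4;  nBins = 10;  nActions = 3;
    w = zeros(nTilings, nBins, nBins, nActions);  % weights per tiling and action
    active = [3 5; 3 6; 4 5; 4 6];                % assumed active cell per tiling
    a = 2;  alpha = 0.1;  delta = -0.5;           % example action, step size, TD error
    for t = 1:nTilings
        w(t, active(t,1), active(t,2), a) = ...
            w(t, active(t,1), active(t,2), a) + (alpha / nTilings) * delta;
    end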

Implementation issues, cont.

- System simulation description
- We simulated each of the three implementations for different values of λ.
- For each value of λ we tested the system for different values of the learning rate α.
- The discount factor γ was taken to be 1 throughout the simulations.

Simulation Results

- We define
- Success rate
- The percentage of all tries in which the agent successfully brought the car to its destination with zero velocity

Simulation Results

- We define
- Average way
- The average number of time intervals it took the agent to bring the car to its destination (a computation sketch follows below)
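
A minimal MATLAB sketch of computing these two measures from logged tries; the log format is an assumption, not the project's actual bookkeeping:

    % Success rate and average way from logged tries (assumed log format).
    reached = [false true true false true];   % did try i reach the goal?
    steps   = [150   73   41   150   38];     % time intervals used in try i
    successRate = 100 * mean(reached);        % percentage of successful tries
    averageWay  = mean(steps(reached));       % mean duration of successful tries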

Simulation Results

- Look-Up Table results
- A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses.
- For a given λ, it is hard to observe any differences between the results for different values of α.
- As λ increases, the learning process is better, i.e., for a given try number, the results for the success rate and the average way are better.

Simulation Results

- Look-Up Table results
- It is noted that eventually, in all cases, the success rate reaches 100%, i.e., the agent successfully brought the car to its goal.

Look-Up Table performance summary

Simulation Results

- CMAC Q-Function results
- A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses.
- For a given λ, better results were obtained for a bigger α.
- As λ increases, the learning process is generally better.

Simulation Results

- CMAC Q-Function results
- In most cases a 100% success rate is not reached, though it does reach 100% in some cases.
- In some cases the success rate decreases along the tries and then increases again.

CMAC Q-Learning performance summary

Simulation Results

- CMAC Value Iteration results
- The figure ahead shows the results obtained by the CMAC Value Iteration implementation, compared to the results already obtained for the CMAC Q-Learning implementation. The results are for the best pair (λ, α) as obtained from the previous results.

Simulation Results

- CMAC Value Iteration results

A comparison between CMAC Q-Learning and CMAC Value Iteration performance

Simulation Results

- Learning process examples
- In Figure 1 we show the process of learning for a specific starting state and learning parameters. The figure shows the movement of the car after every few tries, for 150 consecutive time intervals.
- In Figure 2 we demonstrate the system's ability (at the end of learning) to direct the car to its goal starting from different states.

Figure 1: The progress of learning for a specific starting state and learning parameters

Figure 2: The system's performance from different starting states after try 20

Conclusions

- In this project we implemented a family of RL algorithms, TD(λ), with two different function approximators: CMAC and Look-Up Table.

Conclusions

- We examined the effect of the learning parameters λ and α on the overall performance of the system.
- In the Look-Up Table implementation, α does not have a significant impact on the results; as λ increases, the success rate increases more rapidly.

Conclusions

- We examined the effect of the learning parameters λ and α on the overall performance of the system.
- In the CMAC implementation, as λ or α increases, the success rate increases and the average way decreases more rapidly.

The End