Reinforcement Learning: Learning algorithms presentation

1
Reinforcement Learning: Learning algorithms

Yishay Mansour
Tel-Aviv University

2
Outline

Last week
Goal of Reinforcement Learning
Mathematical Model (MDP)
Planning
Value iteration
Policy iteration
This week: Learning Algorithms
Model based
Model Free

3
Planning - Basic Problems
Given a complete MDP model:
Policy evaluation - Given a policy $\pi$, estimate its return.
Optimal control - Find an optimal policy $\pi^*$ (one that maximizes the return from any start state).
4
Planning - Value Functions
$V^\pi(s)$: the expected value starting at state s and following $\pi$.
$Q^\pi(s,a)$: the expected value starting at state s with action a and then following $\pi$.
$V^*(s)$ and $Q^*(s,a)$ are defined using an optimal policy $\pi^*$.
$V^*(s) = \max_\pi V^\pi(s)$
5
Algorithms - optimal control
CLAIM: A policy $\pi$ is optimal if and only if at each state s
$V^\pi(s) = \max_a Q^\pi(s,a)$ (Bellman Eq.)
The greedy policy with respect to $Q^\pi(s,a)$ is
$\pi(s) = \arg\max_a Q^\pi(s,a)$
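As a small illustration of the greedy policy and the Bellman condition above, here is a minimal Python sketch. It assumes a tabular Q function stored as a nested dictionary; the function names and the toy values are hypothetical and not from the slides.

```python
def greedy_policy(q):
    """Return the greedy policy for a Q table given as {state: {action: value}}."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}


def satisfies_bellman(q, v, tol=1e-9):
    """Check that V(s) equals max over a of Q(s, a) at every state."""
    return all(abs(v[s] - max(q[s].values())) <= tol for s in q)


# Toy Q table (made-up numbers, for illustration only).
q = {"s0": {"left": 1.0, "right": 2.5}, "s1": {"left": 0.3, "right": 0.1}}
policy = greedy_policy(q)
```

If V and Q belong to the same policy and the check passes at every state, the claim says that policy is optimal.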
6
MDP - computing optimal policy
1. Linear Programming.
2. Value Iteration method.
3. Policy Iteration method.
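To make the second method concrete, the following is a minimal value iteration sketch in Python. The model representation (dictionaries keyed by state-action pairs), the stopping tolerance, and the two-state example are assumptions chosen for illustration, not taken from the slides.

```python
def value_iteration(states, actions, p, r, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until the values stop changing.

    p[(s, a)] maps next states to probabilities; r[(s, a)] is the expected reward.
    """
    v = {s: 0.0 for s in states}
    while True:
        new_v = {
            s: max(
                r[(s, a)] + gamma * sum(prob * v[s2] for s2, prob in p[(s, a)].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v


# Hypothetical two-state MDP: "go" from s0 pays 1 and ends in the absorbing state s1.
states, actions = ["s0", "s1"], ["stay", "go"]
p = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
r = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 0.0, ("s1", "go"): 0.0}
v_star = value_iteration(states, actions, p, r, gamma=0.5)
```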
7
Planning versus Learning
Tightly coupled in Reinforcement Learning
Goal: maximize return while learning.
8
Example - Elevator Control
Learning (alone): learn the arrival model well.
Planning (alone): given an arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
9
Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).
10
Learning - Model Based
Estimate the model from the observations (both transition probabilities and rewards).
Use the estimated model as if it were the true model, and find an optimal policy for it.
If the estimated model is good, the resulting policy should be close to optimal.
11
Learning - Model Based: off policy

Let the policy run for a long time.
What is "long"?!
Build an observed model:
Transition probabilities
Rewards
Use the observed model to estimate the value of the policy.
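Building the observed model amounts to counting. The sketch below, in Python, estimates transition probabilities and mean rewards from logged samples; the (state, action, reward, next state) tuple format and the function name are assumptions made for this example.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Build empirical transition probabilities and mean rewards.

    transitions is an iterable of (s, a, r, s_next) samples.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r

    p_hat, r_hat = {}, {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        p_hat[sa] = {s_next: c / n for s_next, c in next_counts.items()}
        r_hat[sa] = reward_sum[sa] / n
    return p_hat, r_hat


# Made-up trajectory data, for illustration only.
samples = [("s0", "go", 1.0, "s1"), ("s0", "go", 0.0, "s0"),
           ("s0", "go", 1.0, "s1"), ("s1", "go", 2.0, "s0")]
p_hat, r_hat = estimate_model(samples)
```

The estimated model can then be handed to any planning method as if it were the true model. State-action pairs that were never observed simply do not appear in the estimate.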

12
Learning - Model Based: sample size
Sample size (optimal policy):
Naive: $O(S^2 A \log(S A))$ samples (approximates each transition $\delta(s,a,s')$ well).
Better: $O(S A \log(S A))$ samples (sufficient to approximate the optimal policy). [KS, NIPS'98]
13
Learning - Model Based: on policy

The learner has control over the action.
The immediate goal is to learn a model.
As before:
Build an observed model (transition probabilities and rewards).
Use the observed model to estimate the value of the policy.
Accelerating the learning:
How to reach new places?!

14
Learning - Model Based on policy
[Figure: well-sampled nodes versus relatively unknown nodes]
15
Learning: Policy improvement

Assume that we can perform the following:
Given a policy $\pi$, compute the V and Q functions of $\pi$.
Then we can run policy improvement:
$\pi \leftarrow \mathrm{Greedy}(Q)$
The process converges if the estimates are accurate.

16
Learning: Monte Carlo Methods

Assume we can run in episodes:
Terminating MDP
Discounted return
Simplest: sample the return of state s:
Wait to reach state s,
Compute the return from s,
Average all the returns.

17
Learning: Monte Carlo Methods

First visit:
For each state in the episode,
Compute the return from its first occurrence,
Average the returns.
Every visit:
Might be biased!
Computing the optimal policy:
Run policy iteration.
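The first-visit estimate can be sketched in a few lines of Python. The episode format (a list of (state, reward) pairs, with the reward received on leaving the state) and the function name are assumptions for this illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so g is the discounted return from each step;
        # earlier visits overwrite later ones, leaving the first-visit return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_return[state] = g
        for state, g_first in first_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Two made-up episodes; state "a" is visited twice in the first one.
episodes = [[("a", 1.0), ("b", 0.0), ("a", 2.0)], [("b", 5.0)]]
values = first_visit_mc(episodes, gamma=1.0)
```

An every-visit variant would append the return at every occurrence of a state instead of only the first.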

18
Learning - Model Free: Policy evaluation, TD(0)
An online view: at state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.
Our estimation error is $\Delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
The update: $V_{t+1}(s_t) = V_t(s_t) + \alpha \Delta_t$
Note that for the correct value function we have
$E[r + \gamma V(s') - V(s)] = 0$
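A single TD(0) step can be written directly from the update rule. This Python sketch assumes a tabular value function stored in a dictionary; the function name, learning rate, and discount values are illustrative choices.

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the temporal-difference error."""
    delta = r + gamma * v[s_next] - v[s]  # estimation error for this step
    v[s] += alpha * delta                 # move V(s) a fraction alpha toward the target
    return delta


# Toy example: one observed transition s0 -> s1 with reward 1.
v = {"s0": 0.0, "s1": 1.0}
delta = td0_update(v, "s0", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
```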
19
Learning - Model Free: Optimal Control, off-policy
Learn the Q function online.
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha [r_t + \gamma V_t(s_{t+1}) - Q_t(s_t,a_t)]$, where $V_t(s) = \max_a Q_t(s,a)$
OFF-POLICY: Q-Learning
Any underlying policy selects the actions.
Assumes every state-action pair is performed infinitely often.
Learning rate dependency.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
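The Q-Learning update uses the best action value at the next state, whatever action the behaviour policy actually takes there. A minimal Python sketch, assuming a nested-dictionary Q table and illustrative parameter values:

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update in place for the observed transition."""
    target = r + gamma * max(q[s_next].values())  # value of the greedy action at s_next
    q[s][a] += alpha * (target - q[s][a])


# Toy Q table (made-up numbers, for illustration only).
q = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 1.0, "r": 2.0}}
q_learning_update(q, "s0", "r", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
```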
20
Learning - Model Free: Optimal Control, on-policy
Learn the Q function online.
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha [r_t + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)]$
ON-POLICY: SARSA
$a_{t+1}$ is chosen by the $\epsilon$-greedy policy for $Q_t$.
The policy selects the action!
Need to balance exploration and exploitation.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
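In SARSA the next action comes from the policy being learned, so the update and the action selection go together. The Python sketch below assumes a nested-dictionary Q table; the helper names and the toy numbers are hypothetical.

```python
import random


def epsilon_greedy(q, s, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q[s]))
    return max(q[s], key=q[s].get)


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update in place, using the action actually chosen at s_next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


rng = random.Random(0)
q = {"s0": {"l": 0.0, "r": 0.0}, "s1": {"l": 1.0, "r": 2.0}}
greedy_action = epsilon_greedy(q, "s1", epsilon=0.0, rng=rng)  # epsilon = 0 means pure greedy
# Suppose exploration picked the non-greedy action "l" at s1; SARSA backs up its value.
sarsa_update(q, "s0", "r", r=1.0, s_next="s1", a_next="l", alpha=0.5, gamma=0.9)
```

With the same transition, Q-Learning would back up the value of the greedy action at the next state instead, which is the off-policy versus on-policy difference.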
21
Learning - Model Free: Policy evaluation, TD($\lambda$)
Again: at state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.
Our estimation error is $\Delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
Update every state s:
$V_{t+1}(s) = V_t(s) + \alpha \Delta_t e(s)$
Update of e(s):
When visiting s, it is incremented by 1: $e(s) \leftarrow e(s) + 1$
For all s, it is decayed by $\gamma \lambda$ every step: $e(s) \leftarrow \gamma \lambda e(s)$
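The eligibility-trace bookkeeping can be sketched as one step function in Python. This assumes tabular values and accumulating traces, and it applies the increment before the value update and the decay after it; that ordering is one common convention and an assumption here, as are the function name and parameter values.

```python
def td_lambda_step(v, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one TD(lambda) step in place and return the temporal-difference error."""
    delta = r + gamma * v[s_next] - v[s]
    e[s] += 1.0  # the visited state becomes more eligible
    for state in v:
        v[state] += alpha * delta * e[state]  # every state shares the error via its trace
        e[state] *= gamma * lam               # all traces decay each step
    return delta


# Toy chain s0 -> s1 -> s2 with a reward of 1 on the second transition.
v = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
e = {s: 0.0 for s in v}
td_lambda_step(v, e, "s0", 0.0, "s1", alpha=0.5, gamma=1.0, lam=0.5)
td_lambda_step(v, e, "s1", 1.0, "s2", alpha=0.5, gamma=1.0, lam=0.5)
```

In this example the reward observed after leaving s1 also raises the value of s0, scaled by its decayed trace. TD(0) would have updated s1 only.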
22
Summary
Markov Decision Process: Mathematical Model.
Planning Algorithms.
Learning Algorithms: Model Based, Monte Carlo, TD(0), Q-Learning, SARSA, TD($\lambda$)
