Reinforcement Learning: Learning algorithms

1
Reinforcement Learning: Learning algorithms
  • Yishay Mansour
  • Tel-Aviv University

2
Outline
  • Last week
  • Goal of Reinforcement Learning
  • Mathematical Model (MDP)
  • Planning
  • Value iteration
  • Policy iteration
  • This week: Learning Algorithms
  • Model based
  • Model Free

3
Planning - Basic Problems
Given a complete MDP model:
Policy evaluation - Given a policy π, estimate its return.
Optimal control - Find an optimal policy π* (maximizes the return from any start state).
4
Planning - Value Functions
Vπ(s): the expected return starting at state s and following π.
Qπ(s,a): the expected return starting at state s, taking action a, and then following π.
V*(s) and Q*(s,a) are defined using an optimal policy π*:
V*(s) = max_π Vπ(s)
5
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
Vπ(s) = max_a Qπ(s,a)   (Bellman Eq.)
The greedy policy with respect to Qπ(s,a) is
π(s) = argmax_a Qπ(s,a)
6
MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method
3. Policy Iteration method
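Of the three, value iteration is the easiest to sketch in code. Below is a minimal tabular version, assuming the MDP is given as NumPy arrays P (transition probabilities) and R (expected rewards); the function and argument names are illustrative, not from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration for a known MDP.

    P: array of shape (S, A, S), P[s, a, s2] = probability of moving to s2.
    R: array of shape (S, A), expected immediate reward for (s, a).
    Returns the optimal value function and a greedy (optimal) policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s2 P(s, a, s2) V(s2)
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```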
7
Planning versus Learning
Planning and learning are tightly coupled in Reinforcement Learning.
Goal: maximize the return while learning.
8
Example - Elevator Control
Learning (alone): model the arrival process well.
Planning (alone): given an arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
9
Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control - find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).
10
Learning - Model Based
Estimate the model from the observations (both transition probabilities and rewards).
Use the estimated model as if it were the true model, and find an optimal policy for it.
If the estimated model is good, the resulting value estimates and policy should be good as well.
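A minimal sketch of the estimation step, assuming experience is collected as (s, a, r, s2) tuples over a finite state and action space; all names here are illustrative. The resulting P_hat and R_hat would then be handed to a planner such as the value iteration sketch above, e.g. value_iteration(P_hat, R_hat).

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Build an estimated MDP (P_hat, R_hat) from observed (s, a, r, s2) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1.0
        reward_sum[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)            # number of times (s, a) was tried
    # Empirical transition probabilities; unvisited (s, a) pairs fall back to uniform.
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1.0), 1.0 / n_states)
    # Empirical mean reward per (s, a).
    R_hat = reward_sum / np.maximum(visits[:, :, 0], 1.0)
    return P_hat, R_hat
```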
11
Learning - Model Based: off policy
  • Let the policy run for a long time.
  • What is "long"?!
  • Build an observed model
  • Transition probabilities
  • Rewards
  • Use the observed model to estimate value of the
    policy.

12
Learning - Model Based: sample size
Sample size (for an optimal policy):
Naive: O(|S|² |A| log(|S| |A|)) samples
(approximates each transition d(s,a,s') well).
Better: O(|S| |A| log(|S| |A|)) samples
(sufficient to approximate an optimal policy).
[KS, NIPS 98]
13
Learning - Model Based: on policy
  • The learner has control over the action.
  • The immediate goal is to learn a model
  • As before
  • Build an observed model
  • Transition probabilities and Rewards
  • Use the observed model to estimate value of the
    policy.
  • Accelerating the learning
  • How to reach new places ?!

14
Learning - Model Based: on policy
(Diagram: relatively unknown nodes versus well-sampled nodes.)
15
Learning: Policy improvement
  • Assume that we can perform the following:
  • Given a policy π,
  • compute the V and Q functions of π.
  • Then we can run policy improvement:
  • π ← Greedy(Q)
  • The process converges if the estimates are accurate.
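A minimal sketch of this improvement loop, assuming a caller-supplied evaluate_q(pi) that returns the Q table of the current policy (shape S x A), for instance from a learned model or from Monte Carlo estimates; evaluate_q and the other names are placeholders.

```python
import numpy as np

def policy_iteration(evaluate_q, n_states, max_iters=100):
    """Alternate policy evaluation and greedy improvement.

    evaluate_q(pi) is assumed to return the Q table of pi, shape (n_states, n_actions);
    it could come from a learned model or from Monte Carlo estimates.
    """
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    for _ in range(max_iters):
        Q = evaluate_q(pi)                       # policy evaluation (however it is done)
        pi_new = Q.argmax(axis=1)                # greedy improvement: pi <- Greedy(Q)
        if np.array_equal(pi_new, pi):           # pi is greedy w.r.t. its own Q: stop
            return pi
        pi = pi_new
    return pi
```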

16
Learning: Monte Carlo Methods
  • Assume we can run in episodes
  • Terminating MDP
  • Discounted return
  • Simplest: sample the return of state s
  • Wait to reach state s,
  • Compute the return from s,
  • Average all the returns.
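A minimal sketch of this sampling step, assuming an episode is recorded as a list of (state, reward) pairs; the helper name and the discount parameter are illustrative.

```python
def return_from_state(episode, s, gamma=0.9):
    """Discounted return from the first time state s is reached in a single episode.

    episode: list of (state, reward) pairs in the order they were visited.
    Returns None if s never appears in the episode.
    """
    for i, (state, _) in enumerate(episode):
        if state == s:
            return sum(gamma ** k * r for k, (_, r) in enumerate(episode[i:]))
    return None
```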

17
Learning: Monte Carlo Methods
  • First visit:
  • For each state in the episode,
  • compute the return from its first occurrence,
  • average the returns.
  • Every visit:
  • might be biased!
  • Computing an optimal policy:
  • run policy iteration.
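A minimal first-visit Monte Carlo evaluation sketch under the same episode representation as above; walking each episode backwards makes the first-visit return of every state available in one pass.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation.

    episodes: iterable of episodes, each a list of (state, reward) pairs.
    Returns a dict mapping each state to the average of its first-visit returns.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk backwards so that G is the discounted return from each step;
        # overwriting on earlier visits leaves the return from the first occurrence.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G
        for state, g in first_return.items():
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```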

18
Learning - Model Free: Policy evaluation, TD(0)
An online view: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}.
Our estimation error is Δ_t = r_t + γ V(s_{t+1}) - V(s_t).
The update: V_{t+1}(s_t) = V_t(s_t) + α Δ_t
Note that for the correct value function we have
E[r + γ V(s') - V(s)] = 0
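A minimal sketch of this update, assuming the value function is stored in a tabular structure indexed by state; alpha and gamma stand for α and γ on the slide.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step on a tabular value function V (e.g. a NumPy array or list)."""
    delta = r + gamma * V[s_next] - V[s]   # estimation error Delta_t from the slide
    V[s] += alpha * delta                  # V_{t+1}(s_t) = V_t(s_t) + alpha * Delta_t
    return V
```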
19
Learning - Model Free: Optimal Control, off-policy
Learn the Q function online:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [r_t + γ V_t(s_{t+1}) - Q_t(s_t, a_t)],
where V_t(s) = max_a Q_t(s, a).
OFF POLICY (Q-Learning): any underlying policy selects the actions.
Assumes every state-action pair is performed infinitely often.
Learning rate dependency.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
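A minimal sketch of the off-policy update, assuming Q is a NumPy array of shape (S, A); the target uses the maximum over actions at s', regardless of which policy generated the data.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning step on a tabular Q of shape (S, A)."""
    target = r + gamma * Q[s_next].max()       # V_t(s') = max_a' Q_t(s', a')
    Q[s, a] += alpha * (target - Q[s, a])      # move Q(s, a) toward the target
    return Q
```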
20
Learning - Model Free: Optimal Control, on-policy
Learn the Q function online:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [r_t + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)]
ON POLICY (SARSA): a_{t+1} is chosen by the ε-greedy policy for Q_t.
The policy selects the action! Need to balance exploration and exploitation.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
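A minimal SARSA sketch under the same tabular representation, with a small ε-greedy helper (defined here only for illustration) that would pick a_{t+1} before the update.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Pick a random action with probability epsilon, otherwise the greedy action for Q at s."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA step: the target uses the action actually chosen at s_next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q
```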
21
Learning - Model Free: Policy evaluation, TD(λ)
Again: at state s_t we performed action a_t, received reward r_t and moved to state s_{t+1}.
Our estimation error: Δ = r_t + γ V(s_{t+1}) - V(s_t).
Update every state s:
V_{t+1}(s) = V_t(s) + α Δ e(s)
Update of the eligibility trace e(s):
When visiting s, it is incremented by 1: e(s) = e(s) + 1
For every s, it is decayed by γλ at every step: e(s) = γλ e(s)
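A minimal TD(λ) sketch with an accumulating eligibility trace, assuming V and e are NumPy arrays of the same length; lam stands for λ.

```python
def td_lambda_update(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """One TD(lambda) step with an accumulating eligibility trace e (same shape as V)."""
    delta = r + gamma * V[s_next] - V[s]   # same estimation error as in TD(0)
    e[s] += 1.0                            # visiting s: increment its trace by 1
    V += alpha * delta * e                 # update every state, weighted by its trace
    e *= gamma * lam                       # decay all traces by gamma * lambda
    return V, e
```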
22
Summary
Markov Decision Process: Mathematical Model.
Planning Algorithms.
Learning Algorithms: Model Based, Monte Carlo, TD(0), Q-Learning, SARSA, TD(λ).