Title: Reinforcement Learning: Learning algorithms
1 Reinforcement Learning: Learning algorithms
- Yishay Mansour
- Tel-Aviv University
2 Outline
- Last week
  - Goal of Reinforcement Learning
  - Mathematical Model (MDP)
  - Planning
    - Value iteration
    - Policy iteration
- This week: Learning Algorithms
  - Model based
  - Model free
3 Planning - Basic Problems
Given a complete MDP model.
Policy evaluation - Given a policy $\pi$, estimate its return.
Optimal control - Find an optimal policy $\pi^*$ (one that maximizes the return from any start state).
4 Planning - Value Functions
$V^\pi(s)$: The expected value starting at state s and following $\pi$.
$Q^\pi(s,a)$: The expected value starting at state s with action a and then following $\pi$.
$V^*(s)$ and $Q^*(s,a)$ are defined using an optimal policy $\pi^*$.
$V^*(s) = \max_\pi V^\pi(s)$
5 Algorithms - optimal control
CLAIM: A policy $\pi$ is optimal if and only if at each state s
$V^\pi(s) = \max_a Q^\pi(s,a)$ (Bellman Eq.)
The greedy policy with respect to $Q^\pi(s,a)$ is
$\pi(s) = \arg\max_a Q^\pi(s,a)$
6 MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method.
3. Policy Iteration method.
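As an illustration of the second method, here is a minimal value iteration sketch in Python. The model layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as the expected reward), the discount `gamma`, the tolerance `tol`, and the two-state toy MDP are all assumptions made for this example; they do not come from the slides.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Sketch of value iteration on a known MDP (hypothetical data layout).

    P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = [[R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
              for a in range(len(P[s]))]
             for s in range(n_states)]
        V_new = [max(q_s) for q_s in Q]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < tol:
            # Greedy policy with respect to the final Q estimate.
            policy = [max(range(len(q_s)), key=q_s.__getitem__) for q_s in Q]
            return V_new, policy
        V = V_new


# Hypothetical deterministic MDP: action a always leads to state a,
# and the only reward (1.0) is for taking action 1 in state 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 1)]],
]
R = [[0.0, 0.0], [0.0, 1.0]]
V, policy = value_iteration(P, R)
```

In this toy model the values approach 9 and 10 for states 0 and 1, and the greedy policy picks action 1 in both states.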
7 Planning versus Learning
Tightly coupled in Reinforcement Learning.
Goal: maximize return while learning.
8 Example - Elevator Control
Learning (alone): Model the arrival process well.
Planning (alone): Given an arrival model, build a schedule.
Real objective: Construct a schedule while updating the model.
9 Learning Algorithms
Given access only to actions, perform:
1. Policy evaluation.
2. Control: find an optimal policy.
Two approaches:
1. Model based (Dynamic Programming).
2. Model free (Q-Learning).
10 Learning - Model Based
Estimate the model from the observations (both transition probabilities and rewards).
Use the estimated model as if it were the true model, and find an optimal policy for it.
If the estimated model is good, the policy we find should be a good approximation of the optimal policy.
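One simple way to estimate the model is to count observed transitions and average observed rewards. The sketch below assumes observations arrive as `(s, a, r, s_next)` tuples; the function name `estimate_model` and the sample data are hypothetical.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Empirical model from (s, a, r, s_next) tuples (illustrative sketch).

    Returns P_hat[(s, a)][s_next] (relative frequencies) and
    R_hat[(s, a)] (average observed reward).
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat


# Hypothetical observations for a single state-action pair (0, 'a').
data = [(0, 'a', 1.0, 1), (0, 'a', 0.0, 0), (0, 'a', 1.0, 1), (0, 'a', 0.0, 1)]
P_hat, R_hat = estimate_model(data)
```

The estimated model can then be handed to any planning method as if it were the true model.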
11 Learning - Model Based: off policy
- Let the policy run for a long time.
  - What is long?!
- Build an observed model
  - Transition probabilities
  - Rewards
- Use the observed model to estimate the value of the policy.
12 Learning - Model Based: sample size
Sample size (optimal policy):
Naive: $O(S^2 A \log(SA))$ samples.
(Approximates each transition $\delta(s,a,s')$ well.)
Better: $O(S A \log(SA))$ samples.
(Sufficient to approximate the optimal policy.)
[KS, NIPS 98]
13 Learning - Model Based: on policy
- The learner has control over the action.
  - The immediate goal is to learn a model
- As before
  - Build an observed model
    - Transition probabilities and Rewards
  - Use the observed model to estimate the value of the policy.
- Accelerating the learning
  - How to reach new places?!
14 Learning - Model Based: on policy
[Figure: the states split into well sampled nodes and relatively unknown nodes.]
15 Learning: Policy improvement
- Assume that we can perform:
  - Given a policy $\pi$,
  - Compute the V and Q functions of $\pi$
- Can run policy improvement
  - $\pi = \mathrm{Greedy}(Q)$
- Process converges if estimations are accurate.
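The loop of evaluating a policy and then taking the greedy policy can be sketched as follows. Here exact evaluation on a known toy model stands in for the estimation step, which is an assumption of this example; the model layout (`P[s][a]` as `(probability, next_state)` pairs, `R[s][a]` as expected reward) and the helper names are hypothetical.

```python
def evaluate_q(P, R, policy, gamma=0.9, sweeps=500):
    """Approximate Q of a fixed policy by repeated synchronous backups."""
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s][policy[s]]
             + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
             for s in range(n)]
    return [[R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]
            for s in range(n)]


def greedy(Q):
    """Greedy policy: the highest-valued action in each state."""
    return [max(range(len(q_s)), key=q_s.__getitem__) for q_s in Q]


def policy_iteration(P, R, gamma=0.9):
    policy = [0] * len(P)
    while True:
        Q = evaluate_q(P, R, policy, gamma)
        improved = greedy(Q)
        if improved == policy:  # no change: the policy is greedy for its own Q
            return policy, Q
        policy = improved


# Hypothetical deterministic MDP: action a leads to state a,
# reward 1.0 only for action 1 taken in state 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 1)]],
]
R = [[0.0, 0.0], [0.0, 1.0]]
policy, Q = policy_iteration(P, R)
```

In a learning setting, `evaluate_q` would be replaced by an estimate built from samples, and convergence then depends on how accurate that estimate is.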
16 Learning: Monte Carlo Methods
- Assume we can run in episodes
  - Terminating MDP
  - Discounted return
- Simplest: sample the return of state s
  - Wait to reach state s,
  - Compute the return from s,
  - Average all the returns.
17 Learning: Monte Carlo Methods
- First visit
  - For each state in the episode,
  - Compute the return from first occurrence
  - Average the returns
- Every visit
  - Might be biased!
- Computing optimal policy
  - Run policy iteration.
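The first-visit variant can be sketched in a few lines. The episode format (a list of `(state, reward)` pairs, with the reward received on leaving that state), the function name, and the sample episodes are assumptions made for this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo value estimates (illustrative sketch).

    episodes: list of episodes, each a list of (state, reward) pairs.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Discounted return from every time step, computed backwards.
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns_from[t] = G
        # Record only the return following the first occurrence of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(returns_from[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}


# Hypothetical episodes: state 'A' appears twice in the first one.
episodes = [
    [('A', 1.0), ('B', 0.0), ('A', 2.0)],
    [('B', 3.0)],
]
V = first_visit_mc(episodes)
```

With these episodes, the second occurrence of 'A' is ignored, so the estimate for 'A' is 3.0 and the estimate for 'B' averages 2.0 and 3.0.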
18 Learning - Model Free: Policy evaluation TD(0)
An online view: At state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.
Our estimation error is $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
The update:
$V_{t+1}(s_t) = V_t(s_t) + \alpha A_t$
Note that for the correct value function we have
$E[r + \gamma V(s') - V(s)] = 0$
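A single TD(0) step is short enough to write out directly. This sketch stores the value estimates in a dictionary; the function name `td0_update`, the default step size and discount, and the sample numbers are assumptions of the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update of V[s] in place; returns the estimation error."""
    td_error = r + gamma * V[s_next] - V[s]  # r_t + gamma V(s_{t+1}) - V(s_t)
    V[s] += alpha * td_error
    return td_error


# Hypothetical value table and a single observed transition s0 -> s1.
V = {'s0': 0.0, 's1': 1.0}
err = td0_update(V, 's0', r=0.5, s_next='s1')
```

Here the error is 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so the estimate for `s0` moves from 0.0 to 0.14.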
19 Learning - Model Free: Optimal Control, off-policy
Learn online the Q function.
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha [ r_t + \gamma V_t(s_{t+1}) - Q_t(s_t,a_t) ]$, where $V_t(s) = \max_a Q_t(s,a)$.
OFF POLICY: Q-Learning
- Any underlying policy selects actions.
- Assumes every state-action pair is performed infinitely often.
- Depends on the learning rate.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
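The off-policy character of Q-learning shows up clearly in code: the behavior policy below is uniformly random, yet the update bootstraps from the maximum over actions. This is a sketch under assumptions: the `step(s, a)` environment interface, the constant step size, the fixed seed, and the toy environment are all hypothetical.

```python
import random


def q_learning(step, n_states, n_actions, n_steps=5000,
               alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning driven by a uniformly random behavior policy."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(n_actions)         # behavior policy: uniform random
        r, s_next = step(s, a)
        target = r + gamma * max(Q[s_next])  # uses max_a Q(s_{t+1}, a)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q


def toy_step(s, a):
    """Hypothetical environment: action a leads to state a;
    reward 1.0 only for action 1 taken in state 1."""
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, a


Q = q_learning(toy_step, n_states=2, n_actions=2)
greedy_policy = [max(range(2), key=Q[s].__getitem__) for s in range(2)]
```

A constant step size works here only because the toy environment is deterministic; with random transitions or rewards, a decaying learning rate is the usual choice.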
20 Learning - Model Free: Optimal Control, on-policy
Learn online the Q function.
$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha [ r_t + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t) ]$
ON POLICY: SARSA
- $a_{t+1}$ is chosen by the $\varepsilon$-greedy policy for $Q_t$.
- The policy selects the action!
- Need to balance exploration and exploitation.
Convergence in the limit: GUARANTEED [DW, JJS, S, TS]
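The two ingredients of SARSA, ε-greedy action selection and the on-policy update, can be sketched separately. The function names, default parameters, and the sample Q table are assumptions of this example.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=Q[s].__getitem__)


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update: bootstraps from the action actually selected next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])


# Hypothetical Q table; epsilon=0.0 makes the selection purely greedy
# so that this example is deterministic.
rng = random.Random(0)
Q = [[0.0, 0.0], [0.0, 2.0]]
a_next = epsilon_greedy(Q, 1, epsilon=0.0, rng=rng)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=a_next)
```

Unlike Q-learning, the target uses `Q[s_next][a_next]` for the action the policy really takes, so exploratory actions affect the learned values.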
21 Learning - Model Free: Policy evaluation TD($\lambda$)
Again: At state $s_t$ we performed action $a_t$, received reward $r_t$ and moved to state $s_{t+1}$.
Our estimation error: $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
Update every state s:
$V_{t+1}(s) = V_t(s) + \alpha A_t e(s)$
Update of e(s):
- When visiting s, incremented by 1: $e(s) \leftarrow e(s) + 1$
- For all s, decayed by $\gamma\lambda$ every step: $e(s) \leftarrow \gamma\lambda \, e(s)$
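One step of TD(λ) with accumulating traces can be sketched as below. The dictionary representation, the function name `td_lambda_step`, the parameter defaults, and the three-state example are assumptions of this sketch.

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step with accumulating traces, updating V and e in place."""
    td_error = r + gamma * V[s_next] - V[s]
    e[s] += 1.0                        # the visited state's trace grows by 1
    for x in V:
        V[x] += alpha * td_error * e[x]  # every state moves in proportion to its trace
        e[x] *= gamma * lam              # all traces decay by gamma * lambda
    return td_error


# Hypothetical chain a -> b -> c with a reward of 1.0 on the second step.
V = {'a': 0.0, 'b': 0.0, 'c': 0.0}
e = {'a': 0.0, 'b': 0.0, 'c': 0.0}
td_lambda_step(V, e, 'a', r=0.0, s_next='b')
td_lambda_step(V, e, 'b', r=1.0, s_next='c')
```

After the second step, state `a` also receives credit for the reward, scaled by its decayed trace of 0.72, while under TD(0) only `b` would have been updated.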
22 Summary
Markov Decision Process: Mathematical Model.
Planning Algorithms.
Learning Algorithms:
- Model Based
- Monte Carlo
- TD(0)
- Q-Learning
- SARSA
- TD($\lambda$)