Title: Learning from Scarce Experience
1. Learning from Scarce Experience
Leonid Peshkin, Harvard University
Christian Shelton, Stanford University
2. Learning agent
- a system that has an ongoing interaction with an external environment, for example:
  - household robot
  - factory controller
  - web agent
  - Mars explorer
  - pizza delivery robot
3. Reinforcement learning
- given a connection to the environment (Action out, Observation and Reinforcement in),
- find a behavior that maximizes long-run reinforcement.
4. Outline
- model of an agent learning from its environment
- learning as a stochastic optimization problem
- re-using experience in policy evaluation
- theoretical highlight: sample complexity bounds for likelihood ratio estimation
- empirical highlight: load-unload
5. Interaction loop
Design a learning algorithm from the agent's perspective.
6. Interaction loop
[Figure: agent-environment loop; in state s_t the agent takes action a_t and the world moves to new state s_{t+1}]
Markov decision process (MDP)
7. Model with partial observability
[Figure: POMDP loop; state s_t, observation o_t, action a_t, reward r_t, new state s_{t+1}]
POMDP:
- set of states
- set of actions
- set of observations
8. Model with partial observability
[Figure: POMDP loop; previous state s_{t-1}, observation o_{t-1}, action a_t, reward r_t, state s_t]
POMDP:
- observation function
- world state transition function
- reward function
9. Objective
[Figure: influence diagram; states s_{t-1}, s_t, observation o_{t-1}, action a_t, reward r_t]
10. Objective
[Figure: influence diagram unrolled in time; states s_{t-1}, s_t, s_{t+1}, observations o_{t-1}, o_t, o_{t+1}, actions a_t, a_{t+1}, rewards r_{t-1}, r_t, r_{t+1}]
11. Cumulative reward
[Figure: unrolled influence diagram, as above]
Return(h) = Σ_t r_t
12. Cumulative discounted reward
[Figure: unrolled influence diagram, as above]
Return(h) = Σ_t γ^t r_t
13. Objective
Experience: a history h of observations, actions and rewards.
Maximize the expected return:  Σ_h Pr(h) Return(h)
14. Objective
[Figure: interaction loop; state s_t, observation o_t, action a_t, reward r_t, new state s_{t+1}, with the policy mapping experience to actions]
Maximize the expected return:  Σ_h Pr(h) Return(h)
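Written out as a function of the policy parameters θ (the V(θ) notation is an assumption here; the slides introduce the parameterization μ, θ shortly), the objective is

    V(\theta) \;=\; \sum_{h} \Pr(h \mid \theta)\,\mathrm{Return}(h)
             \;=\; \mathbb{E}_{h \sim \Pr(\cdot \mid \theta)}\!\left[\mathrm{Return}(h)\right]

the expected (discounted) cumulative reward over the histories generated by following the policy.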
15. Policy
[Figure: MDP loop; the policy μ maps state s_t to action a_t, yielding reward r_{t+1} and new state s_{t+1}]
A Markov decision process assumes complete observability.
16. Partial Observability
[Figure: POMDP loop; the policy sees observation o_t rather than state s_t; action a_t, reward r_{t+1}, new state s_{t+1}]
A partially observable Markov decision process assumes partial observability.
17. Policy with memory
[Figure: unrolled diagram; each action a_t may depend on the history of past observations o_{t-1}, o_t, ... and actions a_{t-1}, ...]
18. Reactive policy
[Figure: unrolled diagram; each action a_t depends only on the current observation o_t]
Finding the optimal reactive policy is NP-hard (Papadimitriou, '89).
19. Finite-State Controller
(Meuleau, Peshkin, Kim, Kaelbling, UAI-99)
[Figure: environment POMDP; state s_t, action a_t, reward r_t, new state s_{t+1}, observation o_{t+1}]
20. Finite-State Controller
[Figure: agent FSC with internal states n_t, n_{t+1}, coupled to the environment POMDP (s_t, a_t, r_t, s_{t+1}, o_{t+1}); the resulting history is the experience]
21. Finite-State Controller
[Figure: agent FSC; internal state n_t, next internal state n_{t+1}, action a_t, observation o_{t+1}]
FSC:
- set of controller states
- internal state transition function
- action function
Policy μ, optimal parameters θ
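A minimal Python sketch of such a controller (the tabular parameterization and the convention that the action distribution depends only on the current internal state are assumptions for illustration, not the authors' implementation):

    import numpy as np

    class FiniteStateController:
        """Stochastic finite-state controller: a policy with bounded memory.

        Internal-state transition: Pr(n' | n, o); action selection: Pr(a | n).
        The entries of both tables are the policy parameters theta.
        """

        def __init__(self, n_states, n_obs, n_actions, rng=None):
            self.rng = rng or np.random.default_rng()
            # Unnormalized parameters; rows are normalized into distributions below.
            self.trans = np.ones((n_states, n_obs, n_states))  # Pr(n' | n, o)
            self.act = np.ones((n_states, n_actions))           # Pr(a | n)
            self.n = 0                                           # current internal state

        def reset(self):
            self.n = 0

        def step(self, obs):
            """Update the internal state on observation `obs`, then pick an action."""
            p_next = self.trans[self.n, obs]
            p_next = p_next / p_next.sum()
            self.n = self.rng.choice(len(p_next), p=p_next)
            p_act = self.act[self.n]
            p_act = p_act / p_act.sum()
            return self.rng.choice(len(p_act), p=p_act)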
22. Learning as optimization
- choose a point θ in policy space
- evaluate θ
- improve θ
23. Learning algorithm
- Choose a policy vector θ = <θ1, θ2, ..., θk>.
- Evaluate θ (by following the policy θ several times).
- Improve θ (by changing each θi according to some credit).
24. Policy evaluation
Experience is gathered by sampling; from it, an estimator of the expected return is built.
The policy μ is parameterized by θ.
25. Gradient descent
For i = 1..n:  θ_i ← θ_i + α ∂V(θ)/∂θ_i
(a step along the gradient of the expected return with respect to each parameter)
26. Policy improvement
(Williams, '92)
Optimize the expected return by stochastic gradient descent: the gradient is estimated from sampled experience.
Finds a locally optimal policy.
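The identity behind this (Williams's likelihood-ratio, or REINFORCE, gradient; a standard derivation, written here for a reactive policy in the slides' notation) turns the gradient of the expected return into an expectation that can be estimated by sampling histories:

    \nabla_\theta V(\theta)
      \;=\; \sum_h \mathrm{Return}(h)\, \nabla_\theta \Pr(h \mid \theta)
      \;=\; \mathbb{E}_h\!\left[\, \mathrm{Return}(h)\, \nabla_\theta \log \Pr(h \mid \theta) \,\right]
      \;=\; \mathbb{E}_h\!\left[\, \mathrm{Return}(h) \sum_t \nabla_\theta \log \pi(a_t \mid o_t, \theta) \,\right]

The world's transition and observation probabilities drop out of the gradient because they do not depend on θ.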
27. Algorithm for reactive policy
- Look-up table: one parameter θ_oa per (o, a) pair.
- Action selection: draw a from Pr(a | o, θ).
- Contribution: per-parameter credit used in the update.
28. Algorithm for reactive policy
(Peshkin, Meuleau, Kaelbling, ICML-99)
- Initialize controller weights θ_oa.
- Initialize counters N_o, N_oa and the return R.
- At each time step t:
  a. draw action a_t from Pr(a_t | o_t, θ)
  b. increment N_o and N_oa; accumulate R ← R + γ^t r_t
- Update for all (o, a):
  θ_oa ← θ_oa + α R (N_oa - Pr(a | o, θ) N_o)
- Loop.
The term (N_oa - Pr(a | o, θ) N_o) is the "surprise": how much more often action a followed observation o than the current policy predicts.
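A compact Python sketch of this episodic update (the softmax form of Pr(a | o, θ) and the env.reset()/env.step() interface are assumptions made for illustration; the slide does not show the exact action-selection formula):

    import numpy as np

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def reinforce_reactive(env, n_obs, n_actions, episodes=1000,
                           alpha=0.01, gamma=0.95, seed=0):
        """Episodic likelihood-ratio (REINFORCE-style) update for a reactive
        look-up-table policy, one weight theta[o, a] per (observation, action).

        `env` is assumed to expose reset() -> obs and step(a) -> (obs, r, done),
        a hypothetical interface used here for illustration only.
        """
        rng = np.random.default_rng(seed)
        theta = np.zeros((n_obs, n_actions))

        for _ in range(episodes):
            N_o = np.zeros(n_obs)                  # visits to each observation
            N_oa = np.zeros((n_obs, n_actions))    # times a was taken after o
            R, discount = 0.0, 1.0

            obs, done = env.reset(), False
            while not done:
                p = softmax(theta[obs])            # Pr(a | o, theta): softmax assumed
                a = rng.choice(n_actions, p=p)
                N_o[obs] += 1
                N_oa[obs, a] += 1
                obs, r, done = env.step(a)
                R += discount * r                  # discounted return of the episode
                discount *= gamma

            # "Surprise" credit: how much more often a followed o than predicted.
            probs = np.apply_along_axis(softmax, 1, theta)
            theta += alpha * R * (N_oa - probs * N_o[:, None])

        return theta

Each episode contributes one stochastic estimate of the gradient; averaging many such steps drives θ toward a locally optimal policy, as stated on the previous slides.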
29. Issues
- Learning takes lots of experience.
- We are wasting experience! Could we re-use it?
- Crucial dependence on the complexity of the controller:
  - How to choose the right one (number of states in the FSC)?
  - How to initialize the controller?
  - Combine with supervised learning?
30. Wasting experience
[Figure: a single policy θ shown as a point in the policy space Θ]
31. Wasting experience
[Figure: two policies θ as points in the policy space Θ; experience gathered at one point is discarded when the search moves to the other]
32. Evaluation by re-using data
[Figure: policies θ1, θ2, ..., θk as points in the policy space Θ]
Given experiences under other policies θ1, θ2, ..., θk, evaluate an arbitrary policy θ.
33. Likelihood ratio enables re-use
Experience: a history h of observations, actions and rewards.
The Markov property warrants a factorization of Pr(h | θ), so we can calculate the likelihood ratio between any two policies from the recorded experience alone.
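Concretely (a standard derivation consistent with the slide, written for a reactive policy; for an FSC the policy terms involve the internal state as well): the history probability factors into world terms, which do not depend on θ, and policy terms, which do, so the world terms cancel in the ratio:

    \Pr(h \mid \theta) \;=\;
      \underbrace{\Pr(s_0)\,\prod_t \Pr(o_t \mid s_t)\,\Pr(s_{t+1} \mid s_t, a_t)}_{\text{world terms, independent of } \theta}
      \;\times\; \prod_t \pi(a_t \mid o_t, \theta)

    \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}
      \;=\; \prod_t \frac{\pi(a_t \mid o_t, \theta)}{\pi(a_t \mid o_t, \theta')}

so the ratio is computable from the recorded observations and actions, without knowing the world model.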
34. Likelihood ratio sampling
- Naïve sampling: estimate the expected return from histories drawn under the policy being evaluated.
- Weighted sampling: estimate it from histories drawn under a different policy, weighting each history by the likelihood ratio.
35. Likelihood ratio estimator
Accumulate experiences following a sampling policy π.
- Naïve sampling: average the returns of histories drawn under the evaluated policy itself.
- Weighted sampling: average the returns of histories drawn under π, each multiplied by its likelihood ratio with respect to the evaluated policy.
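A minimal Python sketch of the two estimators (the data layout, with each step storing the behaviour policy's action probability, and the function names are assumptions for illustration):

    import numpy as np

    def naive_estimate(returns):
        """Naive Monte Carlo estimate: average return of histories drawn
        from the very policy being evaluated."""
        return float(np.mean(returns))

    def likelihood_ratio_estimate(histories, target_prob, gamma=0.95):
        """Evaluate a target policy from histories collected under other policies.

        Each history is assumed to be a list of (obs, action, reward, behavior_prob)
        tuples, where behavior_prob = Pr(action | obs) under the policy that
        actually generated the data.  `target_prob(obs, action)` returns
        Pr(action | obs) under the policy we want to evaluate.
        """
        weighted_returns = []
        for h in histories:
            ratio, ret, discount = 1.0, 0.0, 1.0
            for obs, action, reward, behavior_prob in h:
                # World terms cancel; only the policy terms enter the ratio.
                ratio *= target_prob(obs, action) / behavior_prob
                ret += discount * reward
                discount *= gamma
            weighted_returns.append(ratio * ret)
        return float(np.mean(weighted_returns))

Storing the behaviour probability at collection time is what makes later re-use possible: any new policy can be scored offline against the same histories.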
36. Learning revisited
Two problems:
- approximation
- optimization
37. Approximation
(Valiant, '84)
How far is the average return (the estimate) from the expected return? We ask for deviation at most ε with confidence 1 - δ.
How many samples N do we need? It is related to ε, δ and
- the maximal return V_max
- the likelihood ratio bound η
- the complexity K(Θ) of the policy space Θ
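In PAC terms this is an accuracy requirement of the generic form below (only the shape of the guarantee is shown; the slides' exact bound is not reproduced here):

    \Pr\left( \left| \hat{V}_N(\theta) - V(\theta) \right| \le \epsilon \ \text{ for all } \theta \in \Theta \right) \;\ge\; 1 - \delta

where \hat{V}_N(\theta) is the weighted (likelihood ratio) estimate built from N sampled histories; N grows with V_max, η and K(Θ), and shrinks as the required ε and δ are relaxed.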
38. Complexity of the policy space
[Figure: policies θ1, θ2, ..., θk as points in the policy space Θ]
Evaluate an arbitrary policy θ, given experiences under other policies θ1, ..., θk.
39. Complexity of the policy space
K(Θ) = k means that any new policy θ is close to one of the cover-set policies θ1, ..., θk.
[Figure: a cover of the policy space Θ]
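Equivalently, K(Θ) is a covering number (a standard definition written here to match the slide; the distance d between policies and the radius ε are whatever the paper uses, which is an assumption on my part):

    K(\Theta) \;=\; \min\left\{ k \;:\; \exists\, \theta_1, \dots, \theta_k \in \Theta
      \ \text{ such that } \ \forall\, \theta \in \Theta,\ \min_i d(\theta, \theta_i) \le \epsilon \right\}

i.e. the size of the smallest set of policies whose neighbourhoods cover the whole space.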
40. Sample complexity result
- How many samples N do we need? It is related to ε, δ and
  - the maximal return V_max
  - the likelihood ratio bound η
  - the complexity K(Θ) of the policy space Θ
- Given the sample size, calculate the confidence.
- Given the sample size and the confidence, choose the policy class.
41. Comparison of bounds
(Kearns, Ng, Mansour, NIPS-00)
KNM ("reusable trajectories") vs. PM (this work):
- Partial reuse: the KNM estimate is built only on experience consistent with the evaluated policy.
- Fixed sampling policy: all choices are made uniformly at random.
- Linear in the VC(Θ) dimension, which is greater than the covering number K(Θ).
- Exponential dependency on the experience size T in the general case.
42. Sample complexity result
- The proof uses concentration inequalities (McDiarmid, Bernstein): bounds on the deviation of a function from its expectation.
- How far is the weighted average return from the expected return?
43. Theses and Conclusion
- Policy search is often the only option available.
- It performs reasonably well on sample domains.
- Performance depends on the policy encoding.
- Building controllers is an art; it must become a science.
- Complexity-driven choice of controllers.
- Global search by constructive covering of the policy space.