Title: Learning from Scarce Experience
1. Learning from Scarce Experience
Leonid Peshkin, Harvard University
Christian Shelton, Stanford University
2. Learning agent
- a system that has an ongoing interaction with an external environment, for example:
  - household robot
  - factory controller
  - web agent
  - Mars explorer
  - pizza delivery robot
3. Reinforcement learning
- given a connection to the environment (Action out, Observation and Reinforcement in),
- find a behavior that maximizes long-run reinforcement.
4. Outline
- model of an agent learning from its environment
- learning as a stochastic optimization problem
- re-using experience in policy evaluation
- theoretical highlight: sample complexity bounds for likelihood ratio estimation
- empirical highlight: load-unload
5. Interaction loop
Design a learning algorithm from the agent's perspective.
6. Interaction loop
[Figure: agent-environment loop; in state s_t the agent takes action a_t and the world moves to new state s_{t+1}]
Markov decision process (MDP)
7. Model with partial observability
[Figure: POMDP loop; state s_t, observation o_t, action a_t, reward r_t, new state s_{t+1}]
POMDP:
- set of states
- set of actions
- set of observations
8. Model with partial observability
[Figure: POMDP loop; previous state s_{t-1}, observation o_{t-1}, action a_t, reward r_t, state s_t]
POMDP:
- observation function
- world state transition function
- reward function
9. Objective
[Figure: influence diagram; states s_{t-1}, s_t, observation o_{t-1}, action a_t, reward r_t]
10. Objective
[Figure: influence diagram unrolled in time; states s_{t-1}, s_t, s_{t+1}, observations o_{t-1}, o_t, o_{t+1}, actions a_t, a_{t+1}, rewards r_{t-1}, r_t, r_{t+1}]
11. Cumulative reward
[Figure: unrolled influence diagram, as above]
Return(h) = Σ_t r_t
12. Cumulative discounted reward
[Figure: unrolled influence diagram, as above]
Return(h) = Σ_t γ^t r_t
13. Objective
Experience: a history h of observations, actions and rewards.
Maximize the expected return:  Σ_h Pr(h) Return(h)
14. Objective
[Figure: interaction loop; state s_t, observation o_t, action a_t, reward r_t, new state s_{t+1}, with the policy mapping experience to actions]
Maximize the expected return:  Σ_h Pr(h) Return(h)
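Written out as a function of the policy parameters θ (the V(θ) notation is an assumption here; the slides introduce the parameterization μ, θ shortly), the objective is

    V(\theta) \;=\; \sum_{h} \Pr(h \mid \theta)\,\mathrm{Return}(h)
             \;=\; \mathbb{E}_{h \sim \Pr(\cdot \mid \theta)}\!\left[\mathrm{Return}(h)\right]

the expected (discounted) cumulative reward over the histories generated by following the policy.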
15. Policy
[Figure: MDP loop; the policy μ maps state s_t to action a_t, yielding reward r_{t+1} and new state s_{t+1}]
A Markov decision process assumes complete observability.
16. Partial Observability
[Figure: POMDP loop; the policy sees observation o_t rather than state s_t; action a_t, reward r_{t+1}, new state s_{t+1}]
A partially observable Markov decision process assumes partial observability.
17. Policy with memory
[Figure: unrolled diagram; each action a_t may depend on the history of past observations o_{t-1}, o_t, ... and actions a_{t-1}, ...]
18. Reactive policy
[Figure: unrolled diagram; each action a_t depends only on the current observation o_t]
Finding the optimal reactive policy is NP-hard (Papadimitriou, '89).
19. Finite-State Controller
(Meuleau, Peshkin, Kim, Kaelbling, UAI-99)
[Figure: environment POMDP; state s_t, action a_t, reward r_t, new state s_{t+1}, observation o_{t+1}]
20. Finite-State Controller
[Figure: agent FSC with internal states n_t, n_{t+1}, coupled to the environment POMDP (s_t, a_t, r_t, s_{t+1}, o_{t+1}); the resulting history is the experience]
21. Finite-State Controller
[Figure: agent FSC; internal state n_t, next internal state n_{t+1}, action a_t, observation o_{t+1}]
FSC:
- set of controller states
- internal state transition function
- action function
Policy μ, optimal parameters θ
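A minimal Python sketch of such a controller (the tabular parameterization and the convention that the action distribution depends only on the current internal state are assumptions for illustration, not the authors' implementation):

    import numpy as np

    class FiniteStateController:
        """Stochastic finite-state controller: a policy with bounded memory.

        Internal-state transition: Pr(n' | n, o); action selection: Pr(a | n).
        The entries of both tables are the policy parameters theta.
        """

        def __init__(self, n_states, n_obs, n_actions, rng=None):
            self.rng = rng or np.random.default_rng()
            # Unnormalized parameters; rows are normalized into distributions below.
            self.trans = np.ones((n_states, n_obs, n_states))  # Pr(n' | n, o)
            self.act = np.ones((n_states, n_actions))           # Pr(a | n)
            self.n = 0                                           # current internal state

        def reset(self):
            self.n = 0

        def step(self, obs):
            """Update the internal state on observation `obs`, then pick an action."""
            p_next = self.trans[self.n, obs]
            p_next = p_next / p_next.sum()
            self.n = self.rng.choice(len(p_next), p=p_next)
            p_act = self.act[self.n]
            p_act = p_act / p_act.sum()
            return self.rng.choice(len(p_act), p=p_act)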
22. Learning as optimization
- choose a point θ in policy space
- evaluate θ
- improve θ
23. Learning algorithm
- Choose a policy vector θ = <θ1, θ2, ..., θk>.
- Evaluate θ (by following the policy θ several times).
- Improve θ (by changing each θi according to some credit).
24. Policy evaluation
Experience is gathered by sampling; from it, an estimator of the expected return is built.
The policy μ is parameterized by θ.
25. Gradient descent
For i = 1..n:  θ_i ← θ_i + α ∂V(θ)/∂θ_i
(a step along the gradient of the expected return with respect to each parameter)
26. Policy improvement
(Williams, '92)
Optimize the expected return by stochastic gradient descent: the gradient is estimated from sampled experience.
Finds a locally optimal policy.
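The identity behind this (Williams's likelihood-ratio, or REINFORCE, gradient; a standard derivation, written here for a reactive policy in the slides' notation) turns the gradient of the expected return into an expectation that can be estimated by sampling histories:

    \nabla_\theta V(\theta)
      \;=\; \sum_h \mathrm{Return}(h)\, \nabla_\theta \Pr(h \mid \theta)
      \;=\; \mathbb{E}_h\!\left[\, \mathrm{Return}(h)\, \nabla_\theta \log \Pr(h \mid \theta) \,\right]
      \;=\; \mathbb{E}_h\!\left[\, \mathrm{Return}(h) \sum_t \nabla_\theta \log \pi(a_t \mid o_t, \theta) \,\right]

The world's transition and observation probabilities drop out of the gradient because they do not depend on θ.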
27. Algorithm for reactive policy
- Look-up table: one parameter θ_oa per (o, a) pair.
- Action selection: draw a from Pr(a | o, θ).
- Contribution: per-parameter credit used in the update.
28. Algorithm for reactive policy
(Peshkin, Meuleau, Kaelbling, ICML-99)
- Initialize controller weights θ_oa.
- Initialize counters N_o, N_oa and the return R.
- At each time step t:
  a. draw action a_t from Pr(a_t | o_t, θ)
  b. increment N_o and N_oa; accumulate R ← R + γ^t r_t
- Update for all (o, a):
  θ_oa ← θ_oa + α R (N_oa - Pr(a | o, θ) N_o)
- Loop.
The term (N_oa - Pr(a | o, θ) N_o) is the "surprise": how much more often action a followed observation o than the current policy predicts.
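A compact Python sketch of this episodic update (the softmax form of Pr(a | o, θ) and the env.reset()/env.step() interface are assumptions made for illustration; the slide does not show the exact action-selection formula):

    import numpy as np

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def reinforce_reactive(env, n_obs, n_actions, episodes=1000,
                           alpha=0.01, gamma=0.95, seed=0):
        """Episodic likelihood-ratio (REINFORCE-style) update for a reactive
        look-up-table policy, one weight theta[o, a] per (observation, action).

        `env` is assumed to expose reset() -> obs and step(a) -> (obs, r, done),
        a hypothetical interface used here for illustration only.
        """
        rng = np.random.default_rng(seed)
        theta = np.zeros((n_obs, n_actions))

        for _ in range(episodes):
            N_o = np.zeros(n_obs)                  # visits to each observation
            N_oa = np.zeros((n_obs, n_actions))    # times a was taken after o
            R, discount = 0.0, 1.0

            obs, done = env.reset(), False
            while not done:
                p = softmax(theta[obs])            # Pr(a | o, theta): softmax assumed
                a = rng.choice(n_actions, p=p)
                N_o[obs] += 1
                N_oa[obs, a] += 1
                obs, r, done = env.step(a)
                R += discount * r                  # discounted return of the episode
                discount *= gamma

            # "Surprise" credit: how much more often a followed o than predicted.
            probs = np.apply_along_axis(softmax, 1, theta)
            theta += alpha * R * (N_oa - probs * N_o[:, None])

        return theta

Each episode contributes one stochastic estimate of the gradient; averaging many such steps drives θ toward a locally optimal policy, as stated on the previous slides.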
29. Issues
- Learning takes lots of experience.
- We are wasting experience! Could we re-use it?
- Crucial dependence on the complexity of the controller:
  - How to choose the right one (number of states in the FSC)?
  - How to initialize the controller?
  - Combine with supervised learning?
30. Wasting experience
[Figure: a single policy θ shown as a point in the policy space Θ]
31. Wasting experience
[Figure: two policies θ as points in the policy space Θ; experience gathered at one point is discarded when the search moves to the other]
32. Evaluation by re-using data
[Figure: policies θ1, θ2, ..., θk as points in the policy space Θ]
Given experiences under other policies θ1, θ2, ..., θk, evaluate an arbitrary policy θ.
33. Likelihood ratio enables re-use
Experience: a history h of observations, actions and rewards.
The Markov property warrants a factorization of Pr(h | θ), so we can calculate the likelihood ratio between any two policies from the recorded experience alone.
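Concretely (a standard derivation consistent with the slide, written for a reactive policy; for an FSC the policy terms involve the internal state as well): the history probability factors into world terms, which do not depend on θ, and policy terms, which do, so the world terms cancel in the ratio:

    \Pr(h \mid \theta) \;=\;
      \underbrace{\Pr(s_0)\,\prod_t \Pr(o_t \mid s_t)\,\Pr(s_{t+1} \mid s_t, a_t)}_{\text{world terms, independent of } \theta}
      \;\times\; \prod_t \pi(a_t \mid o_t, \theta)

    \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}
      \;=\; \prod_t \frac{\pi(a_t \mid o_t, \theta)}{\pi(a_t \mid o_t, \theta')}

so the ratio is computable from the recorded observations and actions, without knowing the world model.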
34. Likelihood ratio sampling
- Naïve sampling: estimate the expected return from histories drawn under the policy being evaluated.
- Weighted sampling: estimate it from histories drawn under a different policy, weighting each history by the likelihood ratio.
35. Likelihood ratio estimator
Accumulate experiences following a sampling policy π.
- Naïve sampling: average the returns of histories drawn under the evaluated policy itself.
- Weighted sampling: average the returns of histories drawn under π, each multiplied by its likelihood ratio with respect to the evaluated policy.
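A minimal Python sketch of the two estimators (the data layout, with each step storing the behaviour policy's action probability, and the function names are assumptions for illustration):

    import numpy as np

    def naive_estimate(returns):
        """Naive Monte Carlo estimate: average return of histories drawn
        from the very policy being evaluated."""
        return float(np.mean(returns))

    def likelihood_ratio_estimate(histories, target_prob, gamma=0.95):
        """Evaluate a target policy from histories collected under other policies.

        Each history is assumed to be a list of (obs, action, reward, behavior_prob)
        tuples, where behavior_prob = Pr(action | obs) under the policy that
        actually generated the data.  `target_prob(obs, action)` returns
        Pr(action | obs) under the policy we want to evaluate.
        """
        weighted_returns = []
        for h in histories:
            ratio, ret, discount = 1.0, 0.0, 1.0
            for obs, action, reward, behavior_prob in h:
                # World terms cancel; only the policy terms enter the ratio.
                ratio *= target_prob(obs, action) / behavior_prob
                ret += discount * reward
                discount *= gamma
            weighted_returns.append(ratio * ret)
        return float(np.mean(weighted_returns))

Storing the behaviour probability at collection time is what makes later re-use possible: any new policy can be scored offline against the same histories.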
36. Learning revisited
Two problems:
- approximation
- optimization
37. Approximation
(Valiant, '84)
How far is the average return (the estimate) from the expected return? We ask for deviation at most ε with confidence 1 - δ.
How many samples N do we need? It is related to ε, δ and
- the maximal return V_max
- the likelihood ratio bound η
- the complexity K(Θ) of the policy space Θ
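In PAC terms this is an accuracy requirement of the generic form below (only the shape of the guarantee is shown; the slides' exact bound is not reproduced here):

    \Pr\left( \left| \hat{V}_N(\theta) - V(\theta) \right| \le \epsilon \ \text{ for all } \theta \in \Theta \right) \;\ge\; 1 - \delta

where \hat{V}_N(\theta) is the weighted (likelihood ratio) estimate built from N sampled histories; N grows with V_max, η and K(Θ), and shrinks as the required ε and δ are relaxed.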
38. Complexity of the policy space
[Figure: policies θ1, θ2, ..., θk as points in the policy space Θ]
Evaluate an arbitrary policy θ, given experiences under other policies θ1, ..., θk.
39. Complexity of the policy space
K(Θ) = k means that any new policy θ is close to one of the cover-set policies θ1, ..., θk.
[Figure: a cover of the policy space Θ]
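Equivalently, K(Θ) is a covering number (a standard definition written here to match the slide; the distance d between policies and the radius ε are whatever the paper uses, which is an assumption on my part):

    K(\Theta) \;=\; \min\left\{ k \;:\; \exists\, \theta_1, \dots, \theta_k \in \Theta
      \ \text{ such that } \ \forall\, \theta \in \Theta,\ \min_i d(\theta, \theta_i) \le \epsilon \right\}

i.e. the size of the smallest set of policies whose neighbourhoods cover the whole space.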
40. Sample complexity result
- How many samples N do we need? It is related to ε, δ and
  - the maximal return V_max
  - the likelihood ratio bound η
  - the complexity K(Θ) of the policy space Θ
- Given the sample size, calculate the confidence.
- Given the sample size and the confidence, choose the policy class.
41. Comparison of bounds
(Kearns, Ng, Mansour, NIPS-00)
KNM ("reusable trajectories") vs. PM (this work):
- Partial reuse: the KNM estimate is built only on experience consistent with the evaluated policy.
- Fixed sampling policy: all choices are made uniformly at random.
- Linear in the VC(Θ) dimension, which is greater than the covering number K(Θ).
- Exponential dependency on the experience size T in the general case.
42. Sample complexity result
- The proof uses concentration inequalities (McDiarmid, Bernstein): bounds on the deviation of a function from its expectation.
- How far is the weighted average return from the expected return?
43. Theses and Conclusion
- Policy search is often the only option available.
- It performs reasonably well on sample domains.
- Performance depends on the policy encoding.
- Building controllers is an art; it must become a science.
- Complexity-driven choice of controllers.
- Global search by constructive covering of the policy space.