Learning from Scarce Experience

Transcript and Presenter's Notes

1
Learning from Scarce Experience
Leonid Peshkin Harvard University
Christian Shelton Stanford University
2
Learning agent
  • A system that has an ongoing interaction with an
    external environment
  • household robot
  • factory controller
  • web agent
  • Mars explorer
  • pizza delivery robot

3
Reinforcement learning
  • given a connection to the environment
  • find a behavior that maximizes long-run reinforcement
[Figure: agent-environment loop with Action, Observation, and Reinforcement arrows]
4
Outline
  • model of an agent learning from its environment
  • learning as a stochastic optimization problem
  • re-using experience in policy evaluation
  • theoretical highlight: sample complexity bounds for likelihood ratio estimation
  • empirical highlight: the load-unload problem

5
Interaction loop
Design a learning algorithm from the agent's perspective
6
Interaction loop
[Figure: interaction loop: state s_t, action a_t, new state s_{t+1}]
Markov decision process (MDP)
7
Model with partial observability
[Figure: interaction loop with partial observability: state s_t → s_{t+1}, observation o_t, action a_t, reward r_t]
POMDP
  • set of states
  • set of actions
  • set of observations
8
Model with partial observability
[Figure: interaction loop with partial observability: state s_{t-1} → s_t, observation o_{t-1}, action a_t, reward r_t]
POMDP
  • observation function
  • world state transition function
  • reward function
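The POMDP components listed on slides 7-8 can be gathered into a simple container; a minimal Python sketch, with illustrative names and types (the slides themselves give no code):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Minimal POMDP container mirroring slides 7-8; names and types are illustrative.
@dataclass
class POMDP:
    states: List[str]                                    # set of states S
    actions: List[str]                                   # set of actions A
    observations: List[str]                              # set of observations O
    transition: Dict[Tuple[str, str], Dict[str, float]]  # Pr(s' | s, a): world state transition function
    observe: Dict[str, Dict[str, float]]                 # Pr(o | s): observation function
    reward: Callable[[str, str], float]                  # r(s, a): reward function
```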
9
Objective
[Figure: one step of experience: observation o_{t-1}, action a_t, state transition s_{t-1} → s_t, reward r_t]
10
Objective
[Figure: a trajectory of experience: states s_{t-1}, s_t, s_{t+1}, observations o_{t-1}, o_t, o_{t+1}, actions a_t, a_{t+1}, rewards r_{t-1}, r_t, r_{t+1}]
11
Cumulative reward
[Figure: a trajectory of states, observations, actions, and rewards]
Return(h) = Σ_t r_t
12
Cumulative discounted reward
[Figure: the same trajectory, with each reward r_t weighted by γ^t]
Return(h) = Σ_t γ^t r_t
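A minimal Python sketch of the discounted return defined above (the slide shows only the formula):

```python
def discounted_return(rewards, gamma=0.95):
    """Return(h) = sum over t of gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```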
13
Objective
Experience
Σ_h Pr(h) Return(h)
Maximize expected return!
14
Objective
Experience
[Figure: experience h generated by the policy interacting with the environment: s_t → s_{t+1}, o_t, a_t, r_t]
Σ_h Pr(h) Return(h)
Policy
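Written out, the objective of slides 13-14 is the expected return as a function of the policy parameters θ; a sketch in LaTeX, consistent with the notation on later slides:

```latex
V(\theta) \;=\; \sum_{h} \Pr(h \mid \theta)\,\mathrm{Return}(h),
\qquad
\theta^{*} \;=\; \arg\max_{\theta} V(\theta)
```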
15
Policy
[Figure: MDP loop: policy μ maps state s_t to action a_t; the environment returns reward r_{t+1} and new state s_{t+1}]
A Markov decision process assumes complete observability.
16
Partial Observability
[Figure: POMDP loop: the agent receives observation o_t instead of the state; action a_t, reward r_{t+1}, new state s_{t+1}]
A partially observable Markov decision process assumes partial observability.
17
Policy with memory
[Figure: policy with memory: actions a_{t-1}, a_t, a_{t+1} may depend on the whole history of observations o_{t-1}, o_t, o_{t+1}]
18
Reactive policy
[Figure: reactive policy: each action a_t depends only on the current observation o_t]
Finding an optimal reactive policy is NP-hard [Papadimitriou, 89]
19
Finite-State Controller
Meuleau, Peshkin, Kim, Kaelbling UAI-99
[Figure: environment POMDP: state s_t → s_{t+1}, action a_t, observation o_{t+1}, reward r_t]
20
Finite-State Controller
[Figure: agent FSC with internal states n_t → n_{t+1}, coupled to the environment POMDP: state s_t → s_{t+1}, action a_t, observation o_{t+1}, reward r_t]
Experience
21
Finite-State Controller
[Figure: agent FSC: internal state n_t → n_{t+1}, action a_t, observation o_{t+1}]
  • set of controller states
  • internal state transition function
  • action function

Policy μ, optimal parameters θ
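A minimal Python sketch of a stochastic finite-state controller built from the three components listed above; the softmax parameterization and all names are illustrative, not the authors' implementation:

```python
import numpy as np

class FiniteStateController:
    """Stochastic FSC: internal-state transition Pr(n' | n, o) and action function Pr(a | n)."""

    def __init__(self, n_nodes, n_obs, n_actions, rng=None):
        self.rng = rng or np.random.default_rng()
        # Parameters theta: logits of the two conditional distributions.
        self.trans_logits = np.zeros((n_nodes, n_obs, n_nodes))  # (node, observation) -> next node
        self.act_logits = np.zeros((n_nodes, n_actions))         # node -> action
        self.node = 0                                            # current internal state n_t

    @staticmethod
    def _softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def step(self, obs):
        """Advance the internal state on the new observation, then emit an action."""
        p_next = self._softmax(self.trans_logits[self.node, obs])
        self.node = int(self.rng.choice(len(p_next), p=p_next))
        p_act = self._softmax(self.act_logits[self.node])
        return int(self.rng.choice(len(p_act), p=p_act))
```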
22
Learning as optimization
  • Choose a point θ in policy space
  • evaluate θ
  • improve θ

23
Learning algorithm
Choose policy vector θ = <θ1, θ2, ..., θk>
Evaluate θ (by following policy θ several times)
Improve θ (by changing each θi according to some credit)
24
Policy evaluation
[Figure: sampling experience under the policy yields an estimator of its value]
Policy μ is parameterized by θ
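The naive Monte Carlo estimator implied here (and named on slides 34-35), as a sketch: follow policy θ to collect N histories and average their returns.

```latex
\hat{V}(\theta) \;=\; \frac{1}{N} \sum_{n=1}^{N} \mathrm{Return}(h_n),
\qquad h_1, \dots, h_N \text{ sampled by following } \theta
```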
25
Gradient descent
  • Idea
  • Incremental: sweep over samples i = 1..n
  • Stochastic: a single randomly chosen sample i
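The update rules themselves did not survive the transcript; the standard forms these bullets refer to are, as a sketch (α is a step size and the objective decomposes as f = Σ_i f_i):

```latex
\begin{align*}
\text{idea:}        &\quad \theta \leftarrow \theta - \alpha \,\nabla_{\theta} f(\theta) \\
\text{incremental:} &\quad \theta \leftarrow \theta - \alpha \,\nabla_{\theta} f_i(\theta), \quad i = 1, \dots, n \\
\text{stochastic:}  &\quad \theta \leftarrow \theta - \alpha \,\nabla_{\theta} f_i(\theta), \quad i \text{ chosen at random}
\end{align*}
```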
26
Policy improvement
Williams, 92
Optimize by stochastic gradient descent (with sampling)
Finds a locally optimal policy
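The gradient being sampled is the likelihood-ratio (REINFORCE) gradient of the expected return [Williams, 92]; the environment terms drop out of ∇ log Pr(h | θ), leaving only the policy's action probabilities:

```latex
\nabla_{\theta} V(\theta)
  \;=\; \sum_{h} \Pr(h \mid \theta)\,\nabla_{\theta} \log \Pr(h \mid \theta)\,\mathrm{Return}(h)
  \;=\; \mathbb{E}_{h \sim \theta}\!\left[ \mathrm{Return}(h) \sum_{t} \nabla_{\theta} \log \Pr(a_t \mid o_t, \theta) \right]
```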
27
Algorithm for reactive policy
Look-up table: one parameter θ_oa per (o, a) pair
Action selection
Contribution
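The action-selection rule itself was lost from the transcript; a Boltzmann (softmax) rule over the table entries is consistent with the Pr(a_t | o_t, θ) used on the next slide, so as an assumed sketch:

```latex
\Pr(a \mid o, \theta) \;=\; \frac{e^{\theta_{oa}}}{\sum_{a'} e^{\theta_{oa'}}}
```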
28
Algorithm for reactive policy
Peshkin, Meuleau, Kaelbling ICML-99
  • Initialize controller weights θ_oa
  • Initialize counters N_o, N_oa and return R
  • At each time step t:
  • a. draw action a_t from Pr(a_t | o_t, θ)
  • b. increment N_o, N_oa; R ← R + γ^t r_t
  • Update for all (o, a):
  • θ_oa ← θ_oa + α R (N_oa − Pr(a | o, θ) N_o)
  • Loop

The difference (N_oa − Pr(a | o, θ) N_o) is the "surprise" term.
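A minimal runnable Python sketch of this update, applied once per episode; the softmax action selection and the env.reset()/env.step() interface are assumptions for illustration, not the authors' code (the slide may also intend a fully online, per-step variant):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_lookup(env, n_obs, n_actions, episodes=1000, alpha=0.01, gamma=0.95, rng=None):
    """Look-up-table policy search in the spirit of slide 28 (one parameter theta_oa per (o, a))."""
    rng = rng or np.random.default_rng()
    theta = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        N_o = np.zeros(n_obs)                    # counters N_o
        N_oa = np.zeros((n_obs, n_actions))      # counters N_oa
        R, t = 0.0, 0                            # discounted return R
        obs, done = env.reset(), False           # assumed environment interface
        while not done:
            a = int(rng.choice(n_actions, p=softmax(theta[obs])))  # a. draw a_t from Pr(a_t | o_t, theta)
            N_o[obs] += 1                                          # b. increment N_o, N_oa ...
            N_oa[obs, a] += 1
            obs, r, done = env.step(a)                             # assumed to return (o', r, done)
            R += (gamma ** t) * r                                  # ... and R <- R + gamma^t r_t
            t += 1
        pi = np.apply_along_axis(softmax, 1, theta)                # Pr(a | o, theta) for all (o, a)
        theta += alpha * R * (N_oa - pi * N_o[:, None])            # theta_oa += alpha R (N_oa - Pr(a|o,theta) N_o)
    return theta
```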
29
Issues
  • Learning takes lots of experience.
  • We are wasting experience! Could we re-use it?
  • Crucial dependence on the complexity of the controller.
  • How to choose the right one (number of states in the FSC)?
  • How to initialize the controller?
  • Combine with supervised learning?
30
Wasting experience
[Figure: a single evaluated policy θ in policy space Θ]
31
Wasting experience
[Figure: moving to a new policy θ in policy space Θ; the experience gathered under the previous policy goes unused]
32
Evaluation by re-using data
[Figure: policies θ1, θ2, ..., θk scattered in policy space Θ]
Given experiences under other policies θ1, θ2, ..., θk, evaluate an arbitrary policy θ.
33
Likelihood ratio enables re-use
Experience
The Markov property warrants a factored form of Pr(h | θ), so we can calculate the likelihood ratio between policies.
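Spelling this out (the slide's formulas were lost in the transcript): the environment's transition and observation probabilities appear in both numerator and denominator and cancel, so the ratio depends only on the agent's action probabilities. For a reactive policy:

```latex
\frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}
  \;=\; \frac{\prod_{t} \Pr(s_{t+1} \mid s_t, a_t)\,\Pr(o_t \mid s_t)\,\Pr(a_t \mid o_t, \theta)}
             {\prod_{t} \Pr(s_{t+1} \mid s_t, a_t)\,\Pr(o_t \mid s_t)\,\Pr(a_t \mid o_t, \theta')}
  \;=\; \prod_{t} \frac{\Pr(a_t \mid o_t, \theta)}{\Pr(a_t \mid o_t, \theta')}
```

For an FSC, the internal-state transition probabilities also belong to the agent and likewise remain in the ratio.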
34
Likelihood ratio sampling
Naïve sampling
Weighted sampling
Likelihood ratio
35
Likelihood ratio estimator
Accumulate experiences following sampling policy π
Naïve sampling
Likelihood ratio
Weighted sampling
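A minimal Python sketch of the weighted (likelihood-ratio) estimator: each stored experience carries its observation-action pairs and its return, and the return is reweighted by the ratio of policy terms. The data layout and function names are illustrative:

```python
import numpy as np

def lr_estimate(histories, eval_prob, sample_prob):
    """
    Importance-weighted estimate of V(theta) from experience gathered under another policy.

    histories   : list of (steps, ret), where steps is a list of (obs, action) pairs
                  and ret is Return(h)
    eval_prob   : function (obs, action) -> Pr(a | o, theta)  for the policy being evaluated
    sample_prob : function (obs, action) -> Pr(a | o, pi)     for the policy that generated h
    """
    estimates = []
    for steps, ret in histories:
        w = 1.0
        for obs, act in steps:
            # Likelihood ratio: only policy terms; environment terms cancel (slide 33).
            w *= eval_prob(obs, act) / sample_prob(obs, act)
        estimates.append(w * ret)
    return float(np.mean(estimates))
```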
36
Learning revisited
  • Two problems
  • Approximation
  • Optimization

37
Approximation
Valiant, 84
[Figure: deviation ε and confidence δ between the average return and the expected return]
  • How many samples N do we need, related to ε, δ, and:
  • maximal return Vmax
  • likelihood ratio bound h
  • complexity K(Θ) of the policy space Θ
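The PAC-style guarantee (in the sense of [Valiant, 84]) that these quantities enter, shown here only as a sketch of its form; the exact bound from the talk is not reconstructed:

```latex
\Pr\!\left( \sup_{\theta \in \Theta} \bigl|\hat{V}_N(\theta) - V(\theta)\bigr| > \epsilon \right) \;\le\; \delta
\qquad \text{once } N \ge N(\epsilon, \delta, V_{\max}, h, K(\Theta))
```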

38
Complexity of the policy space
[Figure: policies θ1, θ2, ..., θk in policy space Θ]
Evaluate an arbitrary policy θ, given experiences under other policies θ1, ..., θk.
39
Complexity of the policy space
K(Θ) = k means that any new policy θ is close to one of the cover-set policies θ1, ..., θk.
[Figure: a cover of policy space Θ by k policies]
40
Sample complexity result
  • How many samples N do we need, related to ε, δ, and:
  • maximal return Vmax
  • likelihood ratio bound h
  • complexity K(Θ) of the policy space Θ
  • Given the sample size, calculate the confidence
  • Given the sample size and confidence, choose the policy class
41
Comparison of bounds
Kearns, Ng, Mansour, NIPS-00 (KNM) vs. PM
KNM reusable trajectories:
  • partial reuse: the estimate is built only on experience consistent with the evaluated policy
  • fixed sampling policy: all choices are made uniformly at random
  • linear in the VC(Θ) dimension, which is greater than the covering number K(Θ)
  • exponential dependency on experience size T in the general case
42
Sample complexity result
  • The proof uses concentration inequalities (McDiarmid, Bernstein): bounds on the deviation of a function from its expectation
  • How far is the weighted average return from the expected return?
43
Theses and Conclusion
  • Policy search is often the only option available
  • It performs reasonably well on sample domains
  • Performance depends on the policy encoding
  • Building controllers is an art; it must become a science
  • Complexity-driven choice of controllers
  • Global search by constructive covering of the policy space