Reinforcement Learning by Policy Search


1
Reinforcement Learning by Policy Search
Dr. Leonid Peshkin, Harvard University
2
Learning agent
  • A system that has an ongoing interaction with an
    external environment
  • household robot
  • factory controller
  • web agent
  • Mars explorer
  • pizza delivery robot

3
Reinforcement learning
  • given a connection to the environment
  • find a behavior that maximizes long-run reinforcement
[Figure: agent-environment interaction loop with Reinforcement, Action, and Observation signals]
4
Why Reinforcement Learning?
  • Supervision signal is rarely available
  • Reward is easier than behavior for humans to
    specify
  • + for removing dirt
  • - for consuming energy
  • - - for damaging furniture
  • - - - for terrorizing the cat

5
Major Successes
  • Backgammon: Tesauro @ IBM
  • Elevator scheduling: Crites & Barto @ UMass
  • Cellular phone channel allocation: Singh & Bertsekas
  • Space-shuttle scheduling: Zhang & Dietterich
  • Real robots crawling: Kimura & Kobayashi

6
Outline
  • model of agent learning from environment
  • learning as stochastic optimization problem
  • re-using the experience in policy evaluation
  • theoretical highlight
  • sample complexity bounds for likelihood ratio
    estimation.
  • empirical highlight
  • adaptive network routing.

7
Interaction loop
Design a learning algorithm from the agent's perspective
8
Interaction loop
[Figure: interaction loop; the agent in state s_t takes action a_t and moves to new state s_{t+1}]
Markov decision process (MDP)
9
Model with partial observability
[Figure: interaction loop with observation o_t, action a_t, reward r_t; state s_t transitions to new state s_{t+1}]
POMDP
  • set of states
  • set of actions
  • set of observations
10
Model with partial observability
[Figure: interaction loop with observation o_{t-1}, action a_t, reward r_t; state s_{t-1} transitions to s_t]
POMDP
  • observation function
  • world state transition function
  • reward function
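
The POMDP components listed above can be pictured as a small interface. The sketch below is illustrative only (the class and function names are not from the talk), assuming distributions are given as dictionaries mapping outcomes to probabilities.

    # A minimal sketch of the POMDP pieces named on this slide; names and
    # the dictionary-based distribution format are assumptions for illustration.
    import random

    def sample(distribution):
        """Draw an outcome from a dict mapping outcomes to probabilities."""
        outcomes, probs = zip(*distribution.items())
        return random.choices(outcomes, weights=probs)[0]

    class POMDP:
        def __init__(self, states, actions, observations,
                     transition, observe, reward):
            self.states = states                # set of states
            self.actions = actions              # set of actions
            self.observations = observations    # set of observations
            self.transition = transition        # transition(s, a) -> dist over next states
            self.observe = observe              # observe(s) -> dist over observations
            self.reward = reward                # reward(s, a) -> scalar r_t

        def step(self, s, a):
            """One interaction step: sample the new state, an observation, a reward."""
            s_next = sample(self.transition(s, a))
            o = sample(self.observe(s_next))
            r = self.reward(s, a)
            return s_next, o, r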
11
Objective
[Figure: unrolled process with state s_{t-1} → s_t, observation o_{t-1}, action a_t, reward r_t]
12
Objective
[Figure: unrolled process over several steps: states s_{t-1}, s_t, s_{t+1}, observations o_{t-1}, o_t, o_{t+1}, actions a_t, a_{t+1}, rewards r_{t-1}, r_t, r_{t+1}]
13
Cumulative reward
[Figure: unrolled process as above, with the rewards summed]
Return(h) = Σ_t r_t
14
Cumulative discounted reward
[Figure: unrolled process as above, with rewards weighted by γ^{t-1}, γ^t, γ^{t+1}]
Return(h) = Σ_t γ^t r_t
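
As a worked example of the cumulative discounted reward above, a few lines of Python (illustrative; the function name is not from the talk):

    # Return(h) = sum_t gamma^t * r_t for one episode's reward sequence r_0, r_1, ...
    def discounted_return(rewards, gamma=0.95):
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # e.g. discounted_return([0.0, 0.0, 1.0], gamma=0.9) == 0.81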
15
Objective
Experience h = ⟨o_0, a_0, r_0, ..., o_t, a_t, r_t, ...⟩
Σ_h Pr(h) Return(h)
Maximize expected return!
16
Objective
Experience h
Σ_h Pr(h) Return(h)
[Figure: interaction loop with the policy highlighted: observation o_t → action a_t; environment: s_t, a_t → s_{t+1}, r_t]
17
Policy
[Figure: policy μ maps state s_t to action a_t; the environment returns new state s_{t+1} and reward r_{t+1}]
A Markov decision process assumes complete observability
18
Markov Decision Processes
Bellman, 64
  • Good news: many techniques to learn in an MDP
  • value: an indication of potential payoff
  • guaranteed to converge to the best policy
  • Bad news: guaranteed to converge only if
  • the environment is Markov and observable
  • the value can be represented exactly
  • every action is tried in every state infinitely often

19
Partial Observability
[Figure: interaction loop with observation o_t, action a_t, reward r_{t+1}; state s_t transitions to s_{t+1}]
A partially observable Markov decision process (POMDP) assumes partial observability
20
Policy with memory
[Figure: unrolled process in which the action a_t depends on the history of past observations o_{t-1}, o_t and past actions a_{t-1} (a policy with memory)]
21
Reactive policy
[Figure: unrolled process in which the action a_t depends only on the current observation o_t (a reactive policy)]
Finding an optimal reactive policy is NP-hard
Papadimitriou, 89
22
Finite-State Controller
Meuleau, Peshkin, Kim, Kaelbling UAI-99
[Figure: environment (POMDP) with state s_t → s_{t+1}, emitting observation o_{t+1} and reward r_t in response to action a_t]
23
Finite-State Controller
[Figure: agent's finite-state controller with internal state n_t → n_{t+1} interacting with the environment (POMDP) with state s_t → s_{t+1} via action a_t, observation o_{t+1}, and reward r_t]
Experience h
24
Finite-State Controller
[Figure: agent's finite-state controller: internal state n_t → n_{t+1}, observation o_{t+1} in, action a_t out]
FSC
  • set of controller states
  • internal state transition function
  • action function

Policy μ, optimal parameters θ
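
A sketch of such a finite-state controller is below. The soft-max parameterization per (internal state, observation) pair is an assumption chosen for illustration; the cited papers may parameterize the two functions differently.

    # Illustrative stochastic finite-state controller (FSC): an internal-state
    # transition function and an action function, both soft-max in their weights.
    import math, random

    def softmax_choice(weights):
        """Sample an index with probability proportional to exp(weight)."""
        exps = [math.exp(w) for w in weights]
        total = sum(exps)
        return random.choices(range(len(weights)), weights=[e / total for e in exps])[0]

    class FSC:
        def __init__(self, n_nodes, n_obs, n_actions):
            # theta: one weight per (node, observation, next node) and per (node, action)
            self.node_theta = [[[0.0] * n_nodes for _ in range(n_obs)]
                               for _ in range(n_nodes)]
            self.act_theta = [[0.0] * n_actions for _ in range(n_nodes)]
            self.node = 0  # current internal state n_t

        def step(self, observation):
            """Update the internal state from the observation, then pick an action."""
            self.node = softmax_choice(self.node_theta[self.node][observation])
            return softmax_choice(self.act_theta[self.node])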
25
Learning as optimization
  • Choose a point θ in policy space
  • evaluate θ
  • improve θ

26
Learning algorithm
Choose a policy vector θ = ⟨θ_1, θ_2, ..., θ_k⟩
Evaluate θ (by following policy θ several times)
Improve θ (by changing each θ_i according to some credit)
27
Policy evaluation
Estimate the expected return by sampling experience under the policy
Policy μ is parameterized by θ
28
Gradient descent
  • Idea: adjust θ in the direction of the gradient of the expected return
  • Incremental: update each component θ_i, i = 1..n, by its partial derivative
  • Stochastic: replace the exact partial derivative with a sampled estimate
29
Policy improvement
Williams,92
Optimize the expected return  Σ_h Pr(h|θ) Return(h)
Stochastic gradient descent, with the gradient estimated by sampling
Finds a locally optimal policy
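
The stochastic gradient referred to here is the likelihood-ratio (REINFORCE) gradient of Williams, 92. In the talk's notation, and assuming a reactive policy μ(a_t | o_t, θ), it reads:

    % Likelihood-ratio (REINFORCE) gradient of the expected return;
    % the environment terms in Pr(h | theta) do not depend on theta and drop out.
    \nabla_\theta E[\mathrm{Return}]
      = \sum_h \nabla_\theta \Pr(h \mid \theta)\, \mathrm{Return}(h)
      = E\!\left[ \mathrm{Return}(h) \sum_t \nabla_\theta \log \mu(a_t \mid o_t, \theta) \right]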
30
Algorithm for reactive policy
Look-up table: one parameter θ_oa per (o,a) pair
Action selection: Pr(a | o, θ), a soft-max over the weights θ_oa
Contribution: the credit assigned to each θ_oa in the update below
31
Algorithm for reactive policy
Peshkin, Meuleau, Kaelbling ICML-99
  • Initialize controller weights θ_oa
  • Initialize counters N_o, N_oa and the return R
  • At each time step t:
  •   a. draw action a_t from Pr(a_t | o_t, θ)
  •   b. increment N_o, N_oa;  R ← R + γ^t r_t
  • Update, for all (o,a):
  •   θ_oa ← θ_oa + α R (N_oa - Pr(a | o, θ) N_o)
  • Loop

The term (N_oa - Pr(a | o, θ) N_o) is the surprise: how often action a was actually taken under observation o, minus how often the current policy expected it.
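
A compact sketch of this algorithm in Python is below, assuming soft-max action selection Pr(a | o, θ) ∝ exp(θ_oa) and a hypothetical episodic environment interface env.reset() / env.step(a); neither is prescribed by the slide.

    # Illustrative implementation of the reactive-policy update above.
    import math, random
    from collections import defaultdict

    def softmax(weights):
        exps = [math.exp(w) for w in weights]
        z = sum(exps)
        return [e / z for e in exps]

    def run_episode(env, theta, n_actions, alpha=0.01, gamma=0.99):
        """theta[o] is the list of per-action weights for observation o."""
        N_o = defaultdict(float)       # visits per observation
        N_oa = defaultdict(float)      # visits per (observation, action)
        R, discount = 0.0, 1.0
        o, done = env.reset(), False
        while not done:
            a = random.choices(range(n_actions), weights=softmax(theta[o]))[0]  # a. draw a_t
            N_o[o] += 1.0                                                       # b. count visits
            N_oa[(o, a)] += 1.0
            o, r, done = env.step(a)
            R += discount * r                                                   #    R += gamma^t r_t
            discount *= gamma
        for obs, n_o in N_o.items():                    # update all (o, a) pairs
            probs = softmax(theta[obs])
            for act in range(n_actions):
                surprise = N_oa[(obs, act)] - probs[act] * n_o
                theta[obs][act] += alpha * R * surprise
        return R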
32
Slow learning!
Peshkin, Meuleau, Kim, Kaelbling 00;  Meuleau, Peshkin, Kaelbling 99
  • Learning distributed control: simulated soccer
  • Learning with FSCs: pole and cart
[Figure: learning curves showing that a large number of trials is needed]
33
Issues
  • Learning takes lots of experience.
  • We are wasting experience! Could we re-use it?
  • Crucial dependence on the complexity of the controller.
  • How to choose the right one (the number of states in the FSC)?
  • How to initialize the controller?
  • Combine with supervised learning?

34
Wasting experience
[Figure: a single policy θ in policy space Θ]
35
Wasting experience
[Figure: policy space Θ with the previous policy θ and a new policy θ′; experience from one is not used for the other]
36
Evaluation by re-using data
[Figure: policies θ_1, θ_2, ..., θ_k in policy space Θ]
Given experiences under other policies θ_1, θ_2, ..., θ_k, evaluate an arbitrary policy θ
37
Likelihood ratio enables re-use
Experience h = ⟨o_0, a_0, r_0, ..., o_t, a_t, r_t, ...⟩
The Markov property warrants a factorization of Pr(h | θ) into world-dynamics terms and policy terms
We can calculate the policy terms μ(a_t | o_t, θ); the dynamics terms cancel in the ratio
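
The factorization behind this step is the standard one: the world's transition and observation terms do not depend on θ, so they cancel in the ratio. Written out (an expansion of the slide, in the talk's notation for a reactive policy):

    % Markov factorization of the history probability and the resulting ratio.
    \Pr(h \mid \theta)
      = \Pr(s_0) \prod_t \Pr(o_t \mid s_t)\,
        \mu(a_t \mid o_t, \theta)\, \Pr(s_{t+1} \mid s_t, a_t)
    \qquad\Longrightarrow\qquad
    \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}
      = \prod_t \frac{\mu(a_t \mid o_t, \theta)}{\mu(a_t \mid o_t, \theta')}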
38
Likelihood ratio sampling
Naïve sampling: estimate the return of θ from experiences drawn under θ itself
Weighted sampling: estimate the return of θ from experiences drawn under another policy
Likelihood ratio: weight each experience by Pr(h | θ) / Pr(h | sampling policy)
39
Likelihood ratio estimator
Accumulate experiences h_1, ..., h_N following a sampling policy π
Naïve sampling:  V̂(θ) = (1/N) Σ_i Return(h_i), valid only when π = θ
Likelihood ratio:  w_i = Pr(h_i | θ) / Pr(h_i | π)
Weighted sampling:  V̂(θ) = (1/N) Σ_i w_i Return(h_i)
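
In code, the weighted estimator amounts to a few lines. The data layout below (each experience stores its return and the per-step action probabilities recorded under the sampling policy π) is an illustrative assumption.

    # Illustrative likelihood-ratio (importance-weighted) return estimator.
    def likelihood_ratio_estimate(experiences, policy_prob, theta):
        """experiences: list of (steps, ret), where steps is a list of
        (observation, action, prob_under_sampling_policy) triples.
        policy_prob(o, a, theta): Pr(a | o, theta) under the policy to evaluate."""
        total = 0.0
        for steps, ret in experiences:
            w = 1.0
            for o, a, p_sampling in steps:
                w *= policy_prob(o, a, theta) / p_sampling   # likelihood ratio
            total += w * ret                                  # weighted return
        return total / len(experiences)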
40
Outline
  • model of agent learning from environment
  • learning as stochastic optimization problem
  • re-using the samples in policy evaluation
  • theoretical highlight
  • sample complexity bounds for likelihood ratio
    estimation
  • empirical highlight
  • adaptive network routing

41
Learning revisited
  • Two problems
  • Approximation
  • Optimization

42
Approximation
Valiant,84
With probability at least 1 - δ (confidence), the average return deviates from the expected return by at most ε (deviation)
  • How many samples N do we need? It depends on ε, δ and
  • the maximal return V_max
  • the likelihood ratio bound η
  • the complexity K(Θ) of the policy space Θ

43
Complexity of the policy space
[Figure: policies θ_1, θ_2, ..., θ_k in policy space Θ]
Evaluate an arbitrary policy θ, given experiences under other policies θ_1, ..., θ_k
44
Complexity of the policy space
K(Θ) = k means that any new policy θ is close to one of the cover-set policies θ_1, ..., θ_k
[Figure: policy space Θ covered by neighborhoods of θ_1, ..., θ_k]
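
For reference, the standard covering-number definition that matches this informal statement (the metric d on policies is left abstract here, as the slide does not fix one):

    % epsilon-cover and covering number of the policy space Theta.
    \{\theta_1,\dots,\theta_k\} \text{ is an } \varepsilon\text{-cover of } \Theta
      \iff \forall\, \theta \in \Theta \;\; \exists\, i \le k :\; d(\theta, \theta_i) \le \varepsilon,
    \qquad
    K(\Theta) = \min\{\, k : \text{an } \varepsilon\text{-cover of size } k \text{ exists} \,\}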
45
Sample complexity result
Peshkin, Mukherjee COLT-01
  • How many samples N do we need? It depends on ε, δ and
  • the maximal return V_max
  • the likelihood ratio bound η
  • the complexity K(Θ) of the policy space Θ
  • Given the sample size, calculate the confidence
  • Given the sample size and confidence, choose the policy class

46
Comparison of bounds
Kearns, Ng, Mansour NIPS-00 (KNM) vs. Peshkin, Mukherjee COLT-01 (PM)
KNM's reusable trajectories:
  • partial reuse: the estimate is built only on experience consistent with the evaluated policy
  • fixed sampling policy: all choices are made uniformly at random
  • linear in the VC(Θ) dimension, which is greater than the covering number K(Θ)
  • exponential dependency on the experience size T in the general case
47
Sample complexity result
  • The proof uses concentration inequalities: bounds on the deviation of a function from its expectation [McDiarmid, Bernstein]
  • How far is the weighted average return from the expected return?
  • We obtained a better result by bounding the likelihood ratio through the log regret in a sequence-guessing game [Cesa-Bianchi, Lugosi 99]
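
For reference, the bounded-differences (McDiarmid) inequality mentioned above, in its standard form: if changing the i-th of n independent samples changes f by at most c_i, then

    % McDiarmid's bounded-differences inequality (standard statement).
    \Pr\bigl( \lvert f(X_1,\dots,X_n) - E[f(X_1,\dots,X_n)] \rvert \ge \varepsilon \bigr)
      \;\le\; 2 \exp\!\left( \frac{-2\varepsilon^2}{\sum_{i=1}^n c_i^2} \right)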

48
Outline
  • model of agent learning from environment
  • learning as stochastic optimization problem
  • re-using the experience in policy evaluation
  • theoretical highlight
  • sample complexity bounds for likelihood ratio
    estimation.
  • empirical highlight
  • adaptive network routing.

49
Adaptive routing
50
Adaptive routing: a problem
[Figure: a network of nodes with a source S and a target T]
51
Adaptive routing: a problem
  • Observation: the destination node
  • Action: the next node in the route
  • Remove packet: when delivered or defunct
  • Process: takes a unit of time per packet
  • Forward: takes a unit of time per hop
  • Reward: inverse average routing time
  • Shaping: penalize for loops in the route
  • Inject: source and target chosen uniformly, Poisson process with parameter 0.5
52
Adaptive routing: an algorithm
Peshkin, Meuleau, Kaelbling UAI-00
  • Algorithm: distributed gradient descent; a biologically plausible, local, reactive policy
  • Policy: soft-max action-choosing rule, a lookup table for each destination
  • Learning rate and temperature: constant
  • Coordination: identical reinforcement distributed by acknowledgement packets
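
A sketch of the per-node policy described above: each node keeps a lookup table of weights per (destination, neighbour) and forwards with a temperature-controlled soft-max, and the acknowledgement-carried reinforcement drives a local gradient step. Class and method names are illustrative, not from the paper.

    # Illustrative per-node soft-max routing policy with a local REINFORCE-style update.
    import math, random

    class RoutingNode:
        def __init__(self, neighbours, destinations, temperature=1.0):
            self.neighbours = list(neighbours)
            self.temperature = temperature
            # one weight theta[d][n] per (destination, neighbour) pair
            self.theta = {d: {n: 0.0 for n in self.neighbours} for d in destinations}

        def _probs(self, destination):
            exps = [math.exp(self.theta[destination][n] / self.temperature)
                    for n in self.neighbours]
            z = sum(exps)
            return [e / z for e in exps]

        def next_hop(self, destination):
            """Soft-max choice of the neighbour to forward the packet to."""
            return random.choices(self.neighbours, weights=self._probs(destination))[0]

        def reinforce(self, destination, chosen, reward, alpha=0.01):
            """Local gradient step for one decision, using the reinforcement
            carried back by an acknowledgement packet."""
            probs = self._probs(destination)
            for n, p in zip(self.neighbours, probs):
                indicator = 1.0 if n == chosen else 0.0
                grad = (indicator - p) / self.temperature   # d/d theta of log soft-max
                self.theta[destination][n] += alpha * reward * grad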
53
Performance comparison
  • Shortest path, considering loads: the industry's favorite; an optimal (deterministic) policy, but it relies on global information
  • Q-routing: assigns a value, the estimated routing time, to every (destination, action) pair and sends along the estimated best route (deterministically)  Boyan, Littman
  • PolSearch (my personal favorite)
54
Performance comparison
[Figure: performance of the compared policies at network loads 0.5 to 3.0]
55
Regular 6x6 network
56
Modified 6x6 network
57
Performance comparison
[Figure: performance of the compared policies at network loads 0.5 to 3.0]
58
Performance comparison
[Figure: performance of the compared policies at network loads 0.5 to 3.0]
59
Theses and Conclusion
  • Policy search is often the only option available
  • It performs reasonably well on sample domains
  • Performance depends on the policy encoding
  • Building controllers is an art; it must become a science
  • Complexity-driven choice of controllers
  • Global search by constructive covering of the space

60
Why do you need RL?
  • Information Extraction, Web Spider, Index: McCallum @ CMU, Baum @ NECI, Kushmerick @ DublinU
  • Selective Visual Attention: Mahadevan @ UMASS
  • Foveation for Target Detection: Schmidhuber @ IDSIA
  • Vision & NLP: sequential processing via routines

61
Contributions
  • Learning with external memory.
    Peshkin, Meuleau, Kaelbling
    ICML-99
  • Learning with Finite State Controllers.
    Meuleau, Peshkin, Kim,
    Kaelbling UAI-99
  • Equivalence of centralized and distributed gradient ascent; relating optima to Nash equilibria.
    Peshkin, Meuleau, Kaelbling UAI-00
  • Likelihood ratio estimation for experience
    re-use. Peshkin, Shelton ICML-02
  • Sample complexity bounds for likelihood ratio
    estimation. Peshkin, Mukherjee COLT-01
  • Empirical results for several domains
  • adaptive routing Peshkin, Savova IJCNN-02
  • pole balancing
  • load-unload
  • simulated soccer.

62
Acknowledgments
  • I have benefited from technical interactions
    with many people, including
  • Tom Dean, John Tsitsiklis, Nicolas Meuleau,
    Christian Shelton, Kee-Eung Kim, Luis Ortiz,
    Theos Evgeniou, Mike Schwarz
  • and
  • Leslie Kaelbling

63
Multi-Agent Learning
Environment: Partially Observable Markov Game
[Figure: environment state s_t → s_{t+1}, emitting observations o¹_{t+1} and o²_{t+1} to two agents]
64
Multi-Agent Learning
[Figure: two agents, each a finite-state controller with internal states n¹_t → n¹_{t+1} and n²_t, receiving observations o¹_{t+1} and o²_{t+1} from the environment (a Partially Observable Markov Game) with state s_t → s_{t+1}]
65
Multi-Agent Learning
Peshkin, Meuleau, Kaelbling UAI-00
joint gradient descent
h = ⟨o¹_0, o²_0, n¹_0, n²_0, a¹_0, a²_0, r_0, ..., o¹_t, o²_t, n¹_t, n²_t, a¹_t, a²_t, r_t, ...⟩
distributed gradient descent