1
Model-based Bayesian Reinforcement Learning in
Partially Observable Domains
by
Pascal Poupart and Nikos Vlassis
(2008 International Symposium on Artificial
Intelligence and Mathematics)
Presented by Lihan He, ECE, Duke University, Oct 3,
2008
2
Outline
  • Introduction
  • POMDP represented as a dynamic decision network
    (DDN)
  • Partially observable reinforcement learning
      • Belief update
      • Value function and optimal action
  • Partially observable BEETLE
      • Offline policy optimization
      • Online policy execution
  • Conclusion

1/14
3
Introduction
POMDP: partially observable Markov decision
process
  • represented by states, actions and observations,
    together with transition, observation and reward
    models (T, O, R)
  • a sequential decision-making problem

Reinforcement learning for POMDPs: solve the
decision-making problem given feedback from the
environment, when the dynamics of the environment
(T and O) are unknown.
  • the history is given as an action-observation
    sequence
  • model-based: explicitly model the environment
  • model-free: avoid explicitly modeling the
    environment
  • online learning: policy learning and execution at
    the same time
  • offline learning: learn the policy first from
    training data, then execute it without further
    modification

Final objective: learn optimal actions (a policy)
that achieve the best expected reward
2/14
4
Introduction
This paper:
  • Bayesian model-based approach
  • Sets the prior for the belief to a mixture of
    products of Dirichlets
  • The posterior belief is again a mixture of
    products of Dirichlets
  • The value function (its a-functions) is also
    represented by mixtures of products of Dirichlets
  • The number of mixture components increases
    exponentially with the time step
  • The PO-BEETLE algorithm keeps this growth bounded

3/14
5
POMDP and DDN
Redefine the POMDP as a dynamic decision network
(DDN)
X, X' are the state variables at two consecutive
time steps
  • Observations and rewards are subsets of the state
    variables
  • The conditional probability distributions of the
    state variables, Pr(s' | pa(s')), jointly encode
    the transition, observation and reward models
    T, O and R
4/14
6
POMDP and DDN
Given X, S, R, O, A, the edges E, and the dynamics
Pr(s' | pa(s')):
Belief update: the belief over states is revised
after each action and observation
Objective: find a policy that maximizes the
expected total reward
The optimal value function satisfies Bellman's
equation
Value iteration algorithms optimize the value
function by iteratively computing the right-hand
side of Bellman's equation.
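For reference, the belief update and Bellman's equation referred to above take the following standard flat-POMDP form (textbook notation with discount factor γ, not the paper's factored DDN parameterization):

```latex
% Belief update after taking action a and observing o
b^{a,o}(s') \;\propto\; \Pr(o \mid s', a) \sum_{s} \Pr(s' \mid s, a)\, b(s)

% Bellman's equation for the optimal value function
V^*(b) \;=\; \max_{a} \Big[ R(b, a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^*(b^{a,o}) \Big]
```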
5/14
7
POMDP and DDN
For reinforcement learning, assume X, S, R, O, A
and the edges E are known, but the dynamics
Pr(s' | pa(s')) are unknown.
We augment the graph: the unknown dynamics are
included in the graph as random variables, denoted
by the parameters θ.
6/14
8
PORL: belief update
Prior for the belief: a mixture of products of
Dirichlets
The posterior belief (after taking action a and
receiving observation o) is again a mixture of
products of Dirichlets
Problem: the number of mixture components increases
by a factor of |S| at each step (exponential growth
with time)
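A minimal sketch of this update for a toy flat POMDP with unknown transition probabilities. It shows both the conjugate Dirichlet count update and why each old component splits into |S| new ones; the function name, the flat (s, a, s') parameterization and the assumption of a known observation model are illustrative simplifications, not the paper's implementation:

```python
# Each belief component is (weight, hypothesized current state s, counts),
# where counts maps every (s, a, s') triple to a Dirichlet pseudo-count,
# i.e. one Dirichlet per (s, a) row of the unknown transition model.

def update_belief(components, a, o, obs_model, states):
    """One Bayes update after taking action a and observing o.

    Each old component splits into one new component per possible next
    state s', so the number of mixture components grows by a factor of |S|.
    obs_model(o, s_next) is the (assumed known) observation likelihood.
    """
    new_components = []
    for weight, s, counts in components:
        for s_next in states:
            # Predictive transition probability: the mean of the Dirichlet
            # over row (s, a) of the unknown transition model.
            row_total = sum(counts[(s, a, x)] for x in states)
            p_trans = counts[(s, a, s_next)] / row_total
            w_new = weight * p_trans * obs_model(o, s_next)
            if w_new == 0.0:
                continue
            # Conjugate update: add one pseudo-count to the transition we
            # hypothesize just occurred.
            new_counts = dict(counts)
            new_counts[(s, a, s_next)] += 1
            new_components.append((w_new, s_next, new_counts))
    # Renormalize the mixture weights.
    total = sum(w for w, _, _ in new_components)
    return [(w / total, s, c) for w, s, c in new_components]
```

After t steps the unpruned mixture has on the order of |S|^t components, which is the exponential growth noted above.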
7/14
9
PORL: value function and optimal action
The augmented POMDP is hybrid, with discrete state
variables S and continuous model variables θ
Discrete-state POMDP
Continuous-state POMDP [1]
Hybrid-state POMDP
The a-function a(s, θ) can also be represented as
a mixture of products of Dirichlets
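A sketch of the value-function representations that these three cases use, assuming the paper follows Porta et al. [1]; in each case the value at a belief is the maximum, over a set of a-functions, of their inner product with the belief:

```latex
% Discrete state space
V(b) = \max_{\alpha} \sum_{s} \alpha(s)\, b(s)

% Continuous state space (Porta et al. [1])
V(b) = \max_{\alpha} \int \alpha(x)\, b(x)\, dx

% Hybrid state space (s, \theta): discrete states, continuous model parameters
V(b) = \max_{\alpha} \sum_{s} \int_{\theta} \alpha(s, \theta)\, b(s, \theta)\, d\theta
```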
[1] Porta, J. M., Vlassis, N., Spaan, M. T. J., and
Poupart, P. 2006. Point-based value iteration for
continuous POMDPs. Journal of Machine Learning
Research 7:2329-2367.
8/14
10
PORL: value function and optimal action
The point-based backup at a belief b is decomposed
into 3 steps:
1) back up each a-function one step, for every
   action-observation pair
2) find the optimal action for belief b
3) find the corresponding a-function
Problem: the number of mixture components increases
by a factor of |S| at each backup (exponential
growth with time)
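A hedged reconstruction of what these steps typically look like as a point-based backup in the hybrid space (s, θ); the exact factored form used in the paper may differ:

```latex
% 1) back up each a-function one step, for every action-observation pair
\alpha^{a,o}_i(s, \theta) = \sum_{s'} \Pr(s' \mid s, a, \theta)\, \Pr(o \mid s', a, \theta)\, \alpha_i(s', \theta)

% 2) assemble each action's backed-up a-function at b and pick the best action
\alpha^{a}_{b} = R(\cdot, a) + \gamma \sum_{o} \operatorname*{arg\,max}_{\alpha^{a,o}_i} \langle \alpha^{a,o}_i, b \rangle,
\qquad a^*_b = \operatorname*{arg\,max}_{a} \langle \alpha^{a}_{b}, b \rangle

% 3) the corresponding a-function, with the inner product over the hybrid space
\alpha_b = \alpha^{a^*_b}_{b},
\qquad \langle \alpha, b \rangle = \sum_{s} \int_{\theta} \alpha(s, \theta)\, b(s, \theta)\, d\theta
```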
9/14
11
PO-BEETLE: offline policy optimization
Policy learning is performed offline, given
sufficient training data (action-observation
sequences).
10/14
12
PO-BEETLE: offline policy optimization
Keep the number of mixture components of the
a-functions bounded:
Approach 1: approximation using basis functions
Approach 2: approximation by keeping only the
important components
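A minimal sketch of Approach 2 for a mixture stored as (weight, component) pairs: keep the k largest-weight components and renormalize. The weight-based selection criterion is an illustrative assumption; the paper may rank components differently:

```python
def keep_important_components(mixture, k):
    """Truncate a mixture to its k largest-weight components.

    `mixture` is a list of (weight, component) pairs, e.g. the terms of a
    mixture of products of Dirichlets; `component` is treated as opaque.
    """
    # Sort by weight and keep the k heaviest components.
    kept = sorted(mixture, key=lambda wc: wc[0], reverse=True)[:k]
    # Renormalize so the remaining weights still sum to one.
    total = sum(w for w, _ in kept)
    return [(w / total, c) for w, c in kept]
```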
11/14
13
PO-BEETLE: online policy execution
Given the policy, the agent executes it and updates
the belief online.
Keep the number of mixture components of the belief
b bounded:
Approach 1: approximation using importance sampling
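One plausible instantiation of this idea, stated only as an assumption since the slide gives no detail: draw a fixed number of components from a proposal distribution and correct by importance weights, so the reduced mixture approximates the full belief:

```python
import random

def importance_subsample(mixture, k, proposal=None, rng=random):
    """Reduce a mixture of (weight, component) pairs to roughly k components.

    Components are drawn from a proposal distribution q (by default the
    mixture weights themselves) and re-weighted by weight / q, then the
    weights are renormalized: a self-normalized importance-sampling
    approximation of the original mixture.
    """
    weights = [w for w, _ in mixture]
    q = proposal if proposal is not None else weights
    idx = rng.choices(range(len(mixture)), weights=q, k=k)
    resampled = [(weights[i] / q[i], mixture[i][1]) for i in idx]
    total = sum(w for w, _ in resampled)
    return [(w / total, c) for w, c in resampled]
```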
12/14
14
PO-BEETLE: online policy execution
Approach 2: particle filtering, which simultaneously
updates the belief and reduces the number of mixture
components
Sample one updated component (after taking action a
and receiving observation o)
The updated belief is represented by k particles
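A minimal sketch of this step using the same (weight, state, counts) components as the belief-update sketch above: instead of keeping all |S| successors of every particle, each particle samples a single successor, so the belief stays represented by k particles (the names and the known observation model are again illustrative assumptions):

```python
import random

def particle_filter_update(particles, a, o, obs_model, states, rng=random):
    """Update k particles after taking action a and observing o.

    Each particle is (weight, hypothesized state s, Dirichlet counts over
    (s, a, s') triples). Rather than enumerating all |S| successor
    components, each particle samples one successor in proportion to the
    component weights, so the number of particles stays fixed at k.
    """
    new_particles = []
    for weight, s, counts in particles:
        # Unnormalized weight of each candidate successor component.
        row_total = sum(counts[(s, a, x)] for x in states)
        cand_w = [counts[(s, a, s_next)] / row_total * obs_model(o, s_next)
                  for s_next in states]
        if sum(cand_w) == 0.0:
            continue
        # Sample a single successor instead of keeping them all.
        s_next = rng.choices(states, weights=cand_w, k=1)[0]
        new_counts = dict(counts)
        new_counts[(s, a, s_next)] += 1
        new_particles.append((weight * sum(cand_w), s_next, new_counts))
    # Renormalize the particle weights.
    total = sum(w for w, _, _ in new_particles)
    return [(w / total, s, c) for w, s, c in new_particles]
```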
13/14
15
Conclusion
  • Bayesian model-based reinforcement learning in
    partially observable domains
  • The prior belief is a mixture of products of
    Dirichlets
  • The posterior belief is also a mixture of products
    of Dirichlets, with the number of mixture components
    growing exponentially with time
  • The a-functions (which represent the value function)
    are also mixtures of products of Dirichlets whose
    number of components grows exponentially with time
  • The partially observable BEETLE (PO-BEETLE)
    algorithm keeps this growth bounded

14/14