Title: Model-based Bayesian Reinforcement Learning in Partially Observable Domains
Slide 1: Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by Pascal Poupart and Nikos Vlassis
(2008 International Symposium on Artificial Intelligence and Mathematics)
Presented by Lihan He, ECE, Duke University, Oct 3, 2008
Slide 2: Outline
- Introduction
- POMDP represented as a dynamic decision network (DDN)
- Partially observable reinforcement learning
  - Belief update
  - Value function and optimal action
- Partially observable BEETLE
  - Offline policy optimization
  - Online policy execution
- Conclusion
Slide 3: Introduction
POMDP: partially observable Markov decision process
- represented by a tuple of states, actions and observations, together with the transition, observation and reward models (T, O and R)
- a sequential decision-making problem
Reinforcement learning for POMDPs: solve the decision-making problem given feedback from the environment, when the dynamics of the environment (T and O) are unknown.
- the agent is given the action-observation sequence as its history
- model-based: explicitly model the environment
- model-free: avoid explicitly modeling the environment
- online learning: learn and execute the policy at the same time
- offline learning: learn the policy first from training data, then execute it without modifying it
Final objective: learn the optimal actions (a policy) that achieve the best reward.
Slide 4: Introduction
This paper:
- Bayesian model-based approach
- Sets the prior belief to a mixture of products of Dirichlets
- The posterior belief is again a mixture of products of Dirichlets
- The α-functions representing the value function are also mixtures of products of Dirichlets
- The number of mixture components increases exponentially with the time step
- PO-BEETLE algorithm (partially observable BEETLE)
Slide 5: POMDP and DDN
Redefine the POMDP as a dynamic decision network (DDN)
X, X': the state variables at two consecutive time steps
- The observation and reward variables are subsets of the state variables
- The conditional probability distributions Pr(s' | pa_{s'}) of each state variable given its parents jointly encode the transition, observation and reward models T, O and R
Slide 6: POMDP and DDN
Given the variables X, S, R, O, A, the edges E, and the dynamics Pr(s' | pa_{s'}):
Belief update (sketched below)
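In flat POMDP notation (the DDN form expresses Pr(s' | s, a) and Pr(o | s', a) as products of the local distributions Pr(x' | pa_{x'})), the belief update after taking action a and receiving observation o has the standard form

  b^{a,o}(s') = \frac{\Pr(o \mid s', a) \sum_{s} \Pr(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}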
Objective: find a policy (a mapping from beliefs to actions) that maximizes the expected total reward.
The optimal value function satisfies Bellman's equation (sketched below).
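In standard notation (the slide's own formula is not reproduced), with discount factor \gamma and updated belief b^{a,o}:

  V^{*}(b) = \max_{a} \Big[ R(b, a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^{*}(b^{a,o}) \Big]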
Value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation.
Slide 7: POMDP and DDN
For reinforcement learning, assume the variables X, S, R, O, A and the edges E are known, but the dynamics Pr(s' | pa_{s'}) are unknown.
We augment the graph: the unknown dynamics are included in the graph as additional (continuous) variables, denoted by the parameters θ.
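A sketch of what this augmentation means, in flat notation with a single unknown parameter vector θ (the slide presumably attaches one such parameter vector to each conditional distribution of the DDN): the hidden state becomes the pair (s, θ), and the parameters are static, so

  \Pr(s', \theta' \mid s, \theta, a) = \Pr(s' \mid s, a, \theta)\, \delta_{\theta}(\theta')

and the belief b(s, θ) becomes a joint distribution over the state and the unknown dynamics.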
Slide 8: PORL - belief update
Prior for the belief: a mixture of products of Dirichlets.
The posterior belief (after taking action a and receiving observation o) is again a mixture of products of Dirichlets (see the sketch below).
Problem: the number of mixture components increases by a factor of |S| at each step (exponential growth with time).
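A sketch of these two facts in simplified flat notation (an illustrative assumption: the paper works with per-variable CPTs, while here θ_{s',o | s,a} is a single joint transition-observation parameter and c_i(s) are the mixture coefficients):

  b(s, \theta) = \sum_{i} c_{i}(s) \prod_{\bar{s}, \bar{a}} \mathrm{Dir}\big(\theta_{\,\cdot\, \mid \bar{s}, \bar{a}};\, n^{i}_{\,\cdot\, \mid \bar{s}, \bar{a}}\big)

  b^{a,o}(s', \theta) \propto \sum_{s} \theta_{s', o \mid s, a}\; b(s, \theta)

Multiplying a Dirichlet by one of its parameters yields another Dirichlet with the corresponding count incremented by one, so each term is again a product of Dirichlets; the sum over the unknown previous state s is what multiplies the number of components by |S|.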
Slide 9: PORL - value function and optimal action
The augmented POMDP is hybrid, with discrete state variables S and continuous model variables θ.
The value function is represented by α-functions in each case (the standard forms are sketched below):
- discrete-state POMDP
- continuous-state POMDP [1]
- hybrid-state POMDP
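The slide's own formulas are not reproduced; the standard α-function forms of the value function in the three cases, following Porta et al. (2006) for the continuous and hybrid cases, are

  V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s)                                  (discrete state)
  V(b) = \max_{\alpha \in \Gamma} \int \alpha(\theta)\, b(\theta)\, d\theta                  (continuous state)
  V(b) = \max_{\alpha \in \Gamma} \sum_{s} \int \alpha(s, \theta)\, b(s, \theta)\, d\theta   (hybrid state)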
The α-function α(s, θ) can also be represented as a mixture of products of Dirichlets.
1. Porta, J. M.; Vlassis, N.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329-2367.
Slide 10: PORL - value function and optimal action
The Bellman backup at a belief b is decomposed into 3 steps (a sketch in standard notation follows the list):
1) find the optimal action a* for belief b
2) find the corresponding α-function α_b
3) with the supporting α^{a,o}-functions chosen as the best current α-function for each updated belief b^{a,o}
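A sketch of the three steps in flat notation (a standard point-based backup for the hybrid case; the slide's exact formulas are not reproduced, reward is written here as R(s, a), and the definitions are ordered for readability):

  \alpha^{a,o} = \arg\max_{\alpha \in \Gamma} \sum_{s'} \int \alpha(s', \theta)\, b^{a,o}(s', \theta)\, d\theta   (step 3)

  a^{*} = \arg\max_{a} \Big[ R(b, a) + \gamma \sum_{o} \Pr(o \mid b, a) \sum_{s'} \int \alpha^{a,o}(s', \theta)\, b^{a,o}(s', \theta)\, d\theta \Big]   (step 1)

  \alpha_{b}(s, \theta) = R(s, a^{*}) + \gamma \sum_{o} \sum_{s'} \Pr(s', o \mid s, a^{*}, \theta)\, \alpha^{a^{*}, o}(s', \theta)   (step 2)

Each term Pr(s', o | s, a, θ) multiplied by a product of Dirichlets is again a product of Dirichlets, so α_b stays in the same family; the summations are what cause the growth in the number of components noted below.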
Problem: the number of mixture components again increases by a factor of |S| at each backup (exponential growth with time).
Slide 11: PO-BEETLE - offline policy optimization
Policy learning is performed offline, given sufficient training data (action-observation sequences).
Slide 12: PO-BEETLE - offline policy optimization
Keep the number of mixture components of the α-functions bounded:
Approach 1: approximation using basis functions
Approach 2: approximation by keeping the important components (a sketch follows)
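A minimal sketch of Approach 2 under assumptions (the representation and the importance criterion are illustrative, not the authors' code: an α-function is stored as a list of weighted Dirichlet-count components, and importance is measured by the magnitude of the coefficient):

from typing import List, Tuple
import numpy as np

# One mixture component: (coefficient, Dirichlet count vector).
Component = Tuple[float, np.ndarray]

def keep_important_components(mixture: List[Component], k: int) -> List[Component]:
    """Bound an alpha-function to its k components with the largest |coefficient|."""
    ranked = sorted(mixture, key=lambda comp: abs(comp[0]), reverse=True)
    return ranked[:k]

# Example: a 4-component alpha-function truncated to its 2 most important components.
alpha = [
    (0.90, np.array([3.0, 1.0, 2.0])),
    (0.05, np.array([1.0, 1.0, 1.0])),
    (-0.60, np.array([2.0, 4.0, 1.0])),
    (0.01, np.array([5.0, 1.0, 1.0])),
]
print(keep_important_components(alpha, k=2))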
Slide 13: PO-BEETLE - online policy execution
Given the policy, the agent executes it and updates the belief online.
Keep the number of mixture components of the belief b bounded:
Approach 1: approximation using importance sampling (a sketch follows)
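A minimal sketch in the spirit of Approach 1, under assumptions (illustrative representation, not the authors' scheme: the belief is a list of nonnegative weighted Dirichlet-count components, and k components are resampled in proportion to their weights):

from typing import List, Tuple
import numpy as np

# One belief component: (mixture weight, Dirichlet count vector).
Component = Tuple[float, np.ndarray]

def subsample_belief(mixture: List[Component], k: int,
                     rng: np.random.Generator) -> List[Component]:
    """Compress a belief mixture to k components by sampling proportionally to weight."""
    weights = np.array([w for w, _ in mixture])
    probs = weights / weights.sum()
    idx = rng.choice(len(mixture), size=k, replace=True, p=probs)
    # Each sampled component gets equal weight 1/k in the compressed belief.
    return [(1.0 / k, mixture[i][1]) for i in idx]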
Slide 14: PO-BEETLE - online policy execution
Approach 2: particle filtering, which simultaneously updates the belief and reduces the number of mixture components (a sketch follows).
- Sample one updated component per particle (after taking action a and receiving observation o).
- The updated belief is represented by k particles.
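A minimal sketch of Approach 2 under assumptions (illustrative flat parameterization θ_{s',o | s,a} rather than the paper's factored DDN form; the class and array layout are hypothetical): each particle is one mixture component, a state hypothesis plus Dirichlet counts over the unknown dynamics; after (a, o) every particle samples a single successor state and increments one count, so the belief stays at k components.

import numpy as np

class Particle:
    """One belief component: a state hypothesis plus Dirichlet counts."""
    def __init__(self, state: int, counts: np.ndarray):
        self.state = state      # current state hypothesis s
        self.counts = counts    # counts[s, a, s', o]: Dirichlet counts for theta

def particle_filter_update(particles, a, o, rng):
    weights = []
    for p in particles:
        c = p.counts[p.state, a]                 # counts over (s', o) given (s, a)
        pred = c / c.sum()                       # predictive Pr(s', o | s, a)
        weights.append(pred[:, o].sum())         # Pr(o | s, a): particle weight
        # Sample one successor state s' ~ Pr(s' | s, a, o): one updated component.
        probs = pred[:, o] / pred[:, o].sum()
        s_next = rng.choice(len(probs), p=probs)
        p.counts[p.state, a, s_next, o] += 1.0   # Bayesian count update
        p.state = s_next
    # Resample k particles in proportion to their weights.
    w = np.array(weights)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=w)
    return [Particle(particles[i].state, particles[i].counts.copy()) for i in idx]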
Slide 15: Conclusion
- Bayesian model-based reinforcement learning
- The prior belief is a mixture of products of Dirichlets
- The posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time
- The α-functions (associated with the value function) are also represented as mixtures of products of Dirichlets that grow exponentially with time
- Partially observable BEETLE (PO-BEETLE) algorithm.