Title: Model-based Bayesian Reinforcement Learning in Partially Observable Domains
Slide 1: Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by Pascal Poupart and Nikos Vlassis
(2008 International Symposium on Artificial Intelligence and Mathematics)
Presented by Lihan He, ECE, Duke University, Oct 3, 2008
Slide 2: Outline
- Introduction
- POMDP represented as a dynamic decision network (DDN)
- Partially observable reinforcement learning
  - Belief update
  - Value function and optimal action
- Partially observable BEETLE
  - Offline policy optimization
  - Online policy execution
- Conclusion
Slide 3: Introduction
POMDP: partially observable Markov decision process
- represented by a tuple of states, actions and observations, together with the transition, observation and reward models (T, O and R)
- a sequential decision-making problem
Reinforcement learning for POMDPs: solve the decision-making problem given feedback from the environment, when the dynamics of the environment (T and O) are unknown.
- the agent is given the action-observation sequence as its history
- model-based: explicitly model the environment
- model-free: avoid explicitly modeling the environment
- online learning: learn and execute the policy at the same time
- offline learning: learn the policy first from training data, then execute it without modifying it
Final objective: learn the optimal actions (a policy) that achieve the best reward.
Slide 4: Introduction
This paper:
- Bayesian model-based approach
- Sets the prior belief to a mixture of products of Dirichlets
- The posterior belief is again a mixture of products of Dirichlets
- The α-functions representing the value function are also mixtures of products of Dirichlets
- The number of mixture components increases exponentially with the time step
- PO-BEETLE algorithm (partially observable BEETLE)
Slide 5: POMDP and DDN
Redefine the POMDP as a dynamic decision network (DDN)
X, X': the state variables at two consecutive time steps
- The observation and reward variables are subsets of the state variables
- The conditional probability distributions Pr(s' | pa_{s'}) of each state variable given its parents jointly encode the transition, observation and reward models T, O and R
Slide 6: POMDP and DDN
Given the variables X, S, R, O, A, the edges E, and the dynamics Pr(s' | pa_{s'}):
Belief update (sketched below)
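In flat POMDP notation (the DDN form expresses Pr(s' | s, a) and Pr(o | s', a) as products of the local distributions Pr(x' | pa_{x'})), the belief update after taking action a and receiving observation o has the standard form

  b^{a,o}(s') = \frac{\Pr(o \mid s', a) \sum_{s} \Pr(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}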
Objective: find a policy (a mapping from beliefs to actions) that maximizes the expected total reward.
The optimal value function satisfies Bellman's equation (sketched below).
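In standard notation (the slide's own formula is not reproduced), with discount factor \gamma and updated belief b^{a,o}:

  V^{*}(b) = \max_{a} \Big[ R(b, a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^{*}(b^{a,o}) \Big]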
Value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation.
Slide 7: POMDP and DDN
For reinforcement learning, assume the variables X, S, R, O, A and the edges E are known, but the dynamics Pr(s' | pa_{s'}) are unknown.
We augment the graph: the unknown dynamics are included in the graph as additional (continuous) variables, denoted by the parameters θ.
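A sketch of what this augmentation means, in flat notation with a single unknown parameter vector θ (the slide presumably attaches one such parameter vector to each conditional distribution of the DDN): the hidden state becomes the pair (s, θ), and the parameters are static, so

  \Pr(s', \theta' \mid s, \theta, a) = \Pr(s' \mid s, a, \theta)\, \delta_{\theta}(\theta')

and the belief b(s, θ) becomes a joint distribution over the state and the unknown dynamics.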
Slide 8: PORL - belief update
Prior for the belief: a mixture of products of Dirichlets.
The posterior belief (after taking action a and receiving observation o) is again a mixture of products of Dirichlets (see the sketch below).
Problem: the number of mixture components increases by a factor of |S| at each step (exponential growth with time).
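A sketch of these two facts in simplified flat notation (an illustrative assumption: the paper works with per-variable CPTs, while here θ_{s',o | s,a} is a single joint transition-observation parameter and c_i(s) are the mixture coefficients):

  b(s, \theta) = \sum_{i} c_{i}(s) \prod_{\bar{s}, \bar{a}} \mathrm{Dir}\big(\theta_{\,\cdot\, \mid \bar{s}, \bar{a}};\, n^{i}_{\,\cdot\, \mid \bar{s}, \bar{a}}\big)

  b^{a,o}(s', \theta) \propto \sum_{s} \theta_{s', o \mid s, a}\; b(s, \theta)

Multiplying a Dirichlet by one of its parameters yields another Dirichlet with the corresponding count incremented by one, so each term is again a product of Dirichlets; the sum over the unknown previous state s is what multiplies the number of components by |S|.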
Slide 9: PORL - value function and optimal action
The augmented POMDP is hybrid, with discrete state variables S and continuous model variables θ.
The value function is represented by α-functions in each case (the standard forms are sketched below):
- discrete-state POMDP
- continuous-state POMDP [1]
- hybrid-state POMDP
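The slide's own formulas are not reproduced; the standard α-function forms of the value function in the three cases, following Porta et al. (2006) for the continuous and hybrid cases, are

  V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s)                                  (discrete state)
  V(b) = \max_{\alpha \in \Gamma} \int \alpha(\theta)\, b(\theta)\, d\theta                  (continuous state)
  V(b) = \max_{\alpha \in \Gamma} \sum_{s} \int \alpha(s, \theta)\, b(s, \theta)\, d\theta   (hybrid state)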
The α-function α(s, θ) can also be represented as a mixture of products of Dirichlets.
1. Porta, J. M.; Vlassis, N.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329-2367.
Slide 10: PORL - value function and optimal action
The Bellman backup at a belief b is decomposed into 3 steps (a sketch in standard notation follows the list):
1) find the optimal action a* for belief b
2) find the corresponding α-function α_b
3) with the supporting α^{a,o}-functions chosen as the best current α-function for each updated belief b^{a,o}
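A sketch of the three steps in flat notation (a standard point-based backup for the hybrid case; the slide's exact formulas are not reproduced, reward is written here as R(s, a), and the definitions are ordered for readability):

  \alpha^{a,o} = \arg\max_{\alpha \in \Gamma} \sum_{s'} \int \alpha(s', \theta)\, b^{a,o}(s', \theta)\, d\theta   (step 3)

  a^{*} = \arg\max_{a} \Big[ R(b, a) + \gamma \sum_{o} \Pr(o \mid b, a) \sum_{s'} \int \alpha^{a,o}(s', \theta)\, b^{a,o}(s', \theta)\, d\theta \Big]   (step 1)

  \alpha_{b}(s, \theta) = R(s, a^{*}) + \gamma \sum_{o} \sum_{s'} \Pr(s', o \mid s, a^{*}, \theta)\, \alpha^{a^{*}, o}(s', \theta)   (step 2)

Each term Pr(s', o | s, a, θ) multiplied by a product of Dirichlets is again a product of Dirichlets, so α_b stays in the same family; the summations are what cause the growth in the number of components noted below.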
Problem: the number of mixture components again increases by a factor of |S| at each backup (exponential growth with time).
Slide 11: PO-BEETLE - offline policy optimization
Policy learning is performed offline, given sufficient training data (action-observation sequences).
Slide 12: PO-BEETLE - offline policy optimization
Keep the number of mixture components of the α-functions bounded:
Approach 1: approximation using basis functions
Approach 2: approximation by keeping the important components (a sketch follows)
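A minimal sketch of Approach 2 under assumptions (the representation and the importance criterion are illustrative, not the authors' code: an α-function is stored as a list of weighted Dirichlet-count components, and importance is measured by the magnitude of the coefficient):

from typing import List, Tuple
import numpy as np

# One mixture component: (coefficient, Dirichlet count vector).
Component = Tuple[float, np.ndarray]

def keep_important_components(mixture: List[Component], k: int) -> List[Component]:
    """Bound an alpha-function to its k components with the largest |coefficient|."""
    ranked = sorted(mixture, key=lambda comp: abs(comp[0]), reverse=True)
    return ranked[:k]

# Example: a 4-component alpha-function truncated to its 2 most important components.
alpha = [
    (0.90, np.array([3.0, 1.0, 2.0])),
    (0.05, np.array([1.0, 1.0, 1.0])),
    (-0.60, np.array([2.0, 4.0, 1.0])),
    (0.01, np.array([5.0, 1.0, 1.0])),
]
print(keep_important_components(alpha, k=2))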
Slide 13: PO-BEETLE - online policy execution
Given the policy, the agent executes it and updates the belief online.
Keep the number of mixture components of the belief b bounded:
Approach 1: approximation using importance sampling (a sketch follows)
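A minimal sketch in the spirit of Approach 1, under assumptions (illustrative representation, not the authors' scheme: the belief is a list of nonnegative weighted Dirichlet-count components, and k components are resampled in proportion to their weights):

from typing import List, Tuple
import numpy as np

# One belief component: (mixture weight, Dirichlet count vector).
Component = Tuple[float, np.ndarray]

def subsample_belief(mixture: List[Component], k: int,
                     rng: np.random.Generator) -> List[Component]:
    """Compress a belief mixture to k components by sampling proportionally to weight."""
    weights = np.array([w for w, _ in mixture])
    probs = weights / weights.sum()
    idx = rng.choice(len(mixture), size=k, replace=True, p=probs)
    # Each sampled component gets equal weight 1/k in the compressed belief.
    return [(1.0 / k, mixture[i][1]) for i in idx]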
Slide 14: PO-BEETLE - online policy execution
Approach 2: particle filtering, which simultaneously updates the belief and reduces the number of mixture components (a sketch follows).
- Sample one updated component per particle (after taking action a and receiving observation o).
- The updated belief is represented by k particles.
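A minimal sketch of Approach 2 under assumptions (illustrative flat parameterization θ_{s',o | s,a} rather than the paper's factored DDN form; the class and array layout are hypothetical): each particle is one mixture component, a state hypothesis plus Dirichlet counts over the unknown dynamics; after (a, o) every particle samples a single successor state and increments one count, so the belief stays at k components.

import numpy as np

class Particle:
    """One belief component: a state hypothesis plus Dirichlet counts."""
    def __init__(self, state: int, counts: np.ndarray):
        self.state = state      # current state hypothesis s
        self.counts = counts    # counts[s, a, s', o]: Dirichlet counts for theta

def particle_filter_update(particles, a, o, rng):
    weights = []
    for p in particles:
        c = p.counts[p.state, a]                 # counts over (s', o) given (s, a)
        pred = c / c.sum()                       # predictive Pr(s', o | s, a)
        weights.append(pred[:, o].sum())         # Pr(o | s, a): particle weight
        # Sample one successor state s' ~ Pr(s' | s, a, o): one updated component.
        probs = pred[:, o] / pred[:, o].sum()
        s_next = rng.choice(len(probs), p=probs)
        p.counts[p.state, a, s_next, o] += 1.0   # Bayesian count update
        p.state = s_next
    # Resample k particles in proportion to their weights.
    w = np.array(weights)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=w)
    return [Particle(particles[i].state, particles[i].counts.copy()) for i in idx]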
Slide 15: Conclusion
- Bayesian model-based reinforcement learning
- The prior belief is a mixture of products of Dirichlets
- The posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time
- The α-functions (associated with the value function) are also represented as mixtures of products of Dirichlets that grow exponentially with time
- Partially observable BEETLE (PO-BEETLE) algorithm.