Title: Partial Observability
1 Partial Observability
- Dan Bernstein
- Jerod Weinman
- Patrick Deegan
- April 17, 2002
2 Roadmap for This Lecture
- POMDP basics
- Examples and formal model
- Belief states
- Exact algorithms for flat POMDPs with a full model
- Towards RL for large POMDPs
- Belief-state monitoring
- EM for model learning
- The need to go beyond hidden state
3 Partial Observability
- Some problems are difficult to model as MDPs because of limited observations
- Robot navigation
- Diagnosing a disease
- Inspection and maintenance
- Tutoring and designing questionnaires
- Active classification
- Automated driving
4 Partially Observable MDPs
- The POMDP model is a popular one for these kinds of problems
- The agent sees observations that are based on the state
- Just an HMM with actions and rewards
5 Factored POMDPs
- Of course, we can think of state and observation variables, leading to a factored model
- We will return to this later
6 Formal Details
- A POMDP is a tuple ⟨S, A, P, R, Ω, O⟩, where
- S is a finite state set, with initial distribution b0
- A is a finite action set
- P(s'|s,a) is a transition function
- R(s,a) is a reward function
- Ω is a finite observation set
- O(o|s,a,s') is an observation function
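As a concrete sketch of how this tuple might be stored in code (the field names and array layout below are assumptions, reused by the later sketches):

```python
# A minimal sketch of the POMDP tuple <S, A, P, R, Omega, O>.
# Field names and array layout are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    n_obs: int         # |Omega|
    b0: np.ndarray     # initial state distribution, shape (|S|,)
    P: np.ndarray      # P[a, s, s'] = P(s'|s,a), shape (|A|, |S|, |S|)
    R: np.ndarray      # R[s, a] = R(s,a), shape (|S|, |A|)
    O: np.ndarray      # O[a, s, s', o] = O(o|s,a,s'), shape (|A|, |S|, |S|, |Omega|)
```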
7 Formal Details (cont.)
- A policy π : Ω* → A is a mapping from observation histories to actions
- The aim is to find a policy that maximizes expected total reward
- As with MDPs, we can have finite-horizon, infinite-horizon discounted, infinite-horizon average-reward, etc.
8 Facts About POMDPs
- Finite-horizon POMDPs are PSPACE-hard
- Infinite-horizon POMDPs are undecidable
- For some types of infinite-horizon POMDPs, optimal policies don't even exist!
- But people are still interested in solving these things exactly or almost exactly
9 Solving POMDPs Exactly
- It's trickier than solving an MDP
- A popular way to go about it
- POMDP → belief-state MDP
- Perform DP on the belief-state MDP
10 Belief States
- A belief state b is a distribution over states of the underlying MDP
- Each time the agent gets an observation and takes an action, it enters a new belief state
- A belief state is a sufficient statistic (i.e., it satisfies the Markov property)
- Observation sequence → belief state
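A minimal sketch of one belief-update step, using the hypothetical POMDP structure above; the update is b'(s') ∝ Σ_s b(s) P(s'|s,a) O(o|s,a,s'), and the normalizer is Pr(o|b,a):

```python
import numpy as np

def belief_update(pomdp, b, a, o):
    # b'(s') is proportional to sum_s b(s) * P(s'|s,a) * O(o|s,a,s')
    joint = b[:, None] * pomdp.P[a] * pomdp.O[a, :, :, o]   # shape (|S|, |S|)
    b_new = joint.sum(axis=0)                                # sum over s
    norm = b_new.sum()                                       # = Pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return b_new / norm
```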
11 Belief-state MDP
- So from a POMDP, we can construct an equivalent belief-state MDP
- The Bellman equation for this MDP is
  V(b) = max_a [ R(b,a) + γ Σ_o Pr(o|b,a) V(b') ],
  where R(b,a) = Σ_s b(s) R(s,a) and b' is the updated belief after taking a and observing o
- But if the POMDP has n states, the belief-state MDP has an n-dimensional continuous state space!
- Continuous MDPs are hard to solve in general
12 Saved by Sondik!
- Luckily, these special continuous MDPs have nice structure we can exploit
- In particular, value functions are piecewise linear and convex (and thus finitely representable)
- For example, in a two-state POMDP, the value function looks like
[Figure: a piecewise linear, convex value function plotted over beliefs ranging from s0 to s1]
13 Piecewise Linearity and Convexity
[Figure: example policy trees, with action nodes (a0, a1) branching on observations (o0, o1)]
- Each vector in a value function corresponds to a policy tree
- For any belief state, we of course want to pick the best policy tree
- This is how we get piecewise linearity and convexity
14 First Step of Value Iteration
- One-step policy trees are just actions (a0 and a1)
- To do a DP backup, we evaluate every possible two-step tree
[Figure: all two-step policy trees, formed by choosing a root action and attaching a one-step tree to each observation branch]
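A sketch of this backup under the representation assumed earlier: for each action and each way of assigning a previous-stage vector to each observation branch, we build one candidate vector, so there are |A| * |V|^|Omega| candidates before pruning:

```python
import itertools
import numpy as np

def dp_backup(pomdp, alpha_vectors, gamma):
    # Exact DP backup: one candidate vector per (action, assignment of a
    # previous-stage vector to each observation), i.e. per new policy tree.
    new_vectors = []
    n_vec = alpha_vectors.shape[0]
    for a in range(pomdp.n_actions):
        # g[k, o, s] = sum_{s'} P(s'|s,a) O(o|s,a,s') alpha_k(s')
        g = np.einsum("ij,ijo,kj->koi", pomdp.P[a], pomdp.O[a], alpha_vectors)
        for choice in itertools.product(range(n_vec), repeat=pomdp.n_obs):
            vec = pomdp.R[:, a] + gamma * sum(
                g[choice[o], o] for o in range(pomdp.n_obs))
            new_vectors.append(vec)
    return np.array(new_vectors)
```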
15 Pruning
- Some policy trees are dominated and are never useful
- They are pruned and not considered on the next step
- The key is to prune before evaluating (Witness and Incremental Pruning do this)
[Figure: a value function over the belief simplex; the dominated vector is the one to prune]
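Computationally, a vector can be pruned if no belief state strictly prefers it to all the others. Pointwise domination is a cheap sufficient test; the general test is a small linear program (the idea behind Witness-style checks). One possible sketch using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def has_witness(alpha, others, eps=1e-9):
    # True if some belief b strictly prefers `alpha` to every vector in `others`:
    #   max delta  s.t.  b.(alpha - alpha') >= delta for all alpha',
    #                    b in the probability simplex
    if len(others) == 0:
        return True
    n = alpha.shape[0]
    c = np.zeros(n + 1)
    c[-1] = -1.0                                             # maximize delta
    A_ub = np.hstack([others - alpha, np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])    # sum_s b(s) = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.success and -res.fun > eps

def prune(vectors):
    # Keep only vectors that are strictly best somewhere on the belief simplex.
    # Exact duplicates should be removed first, or ties can prune both copies.
    kept = []
    for i, v in enumerate(vectors):
        others = np.array([w for j, w in enumerate(vectors) if j != i])
        if has_witness(v, others):
            kept.append(v)
    return np.array(kept)
```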
16 Value Iteration
- Keep doing backups until the value function doesn't change much anymore
- In the worst case, you end up considering every possible policy tree
- But hopefully you will converge before this happens
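Stitching the earlier sketches together, one possible (simplified) value-iteration loop; checking convergence only at the corners of the belief simplex is a shortcut, not the proper stopping test:

```python
import numpy as np

def pomdp_value_iteration(pomdp, gamma, tol=1e-4, max_iters=100):
    alpha_vectors = np.zeros((1, pomdp.n_states))    # V_0 = 0
    test_beliefs = np.eye(pomdp.n_states)            # corners of the simplex
    for _ in range(max_iters):
        candidates = np.unique(dp_backup(pomdp, alpha_vectors, gamma), axis=0)
        new_vectors = prune(candidates)
        old_vals = (alpha_vectors @ test_beliefs.T).max(axis=0)
        new_vals = (new_vectors @ test_beliefs.T).max(axis=0)
        alpha_vectors = new_vectors
        if np.max(np.abs(new_vals - old_vals)) < tol:
            break
    return alpha_vectors
```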
17 More on Flat POMDP Algorithms
- Policy iteration (Hansen, 1998) for POMDPs uses the same machinery, but it has an explicit policy improvement step
- People have also tried heuristic search, discretization, etc.
- See (Murphy, 2000) for a good survey
18 Back to Factored POMDPs
- With factored POMDPs, things get even hairier
- In fact, even executing a policy can be hard because it requires belief-state monitoring
- In a factored POMDP, a belief state is exponentially large
- So we can't represent it and update it
19 Belief-state Updating
- Note that the monitoring problem is a special case of general Bayes net inference
- But it's not an easy special case; it causes problems for exact algorithms like junction tree
20 Approximate Belief Updating
- Boyen and Koller (1998) have devised a neat approximate algorithm
- State variables x1, ..., xn; belief b(x1, ..., xn)
- That's too hard to handle, but maybe we can handle b(x1)b(x2)···b(xn)
- At each step, the algorithm updates the belief state, and projects it down to be factored
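To illustrate just the projection step on a tiny example (the real algorithm never materializes the joint belief; it interleaves updating and projection using the DBN structure), a sketch:

```python
import numpy as np

def project_to_marginals(joint):
    # Project a joint belief b(x1, ..., xn), stored as an n-dimensional array,
    # onto the fully factored approximation b(x1) b(x2) ... b(xn).
    n = joint.ndim
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i))
                 for i in range(n)]
    factored = marginals[0]
    for m in marginals[1:]:
        factored = np.multiply.outer(factored, m)   # outer product of marginals
    return marginals, factored
```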
21 Approximate Belief Updating
[Figure: after each observation, the true belief state is updated exactly, while the factored belief state is updated and then projected back to factored form]
- Can we say anything about the error due to this
approximation?
22 Error Bound
- Yes, it turns out that the error remains bounded
- Theorem: E[ KL(σ_t || σ̂_t) ] ≤ ε / γ, where
- σ_t is the true belief state
- σ̂_t is the factored approximate belief state
- ε is the error due to projection
- γ is the error decrease due to mixing
23 Computational Complexity
- That's great, but we still need to be able to do the projection efficiently
- We can construct a junction tree from the DBN and use inference to get the marginals
- The junction tree just needs to be set up once at the beginning
24 An RL Algorithm
- Rodriguez, Parr, and Koller (1999) developed an algorithm based on the monitoring scheme
- They use function approximators to represent policies from belief states to actions
- It acts like reinforcement learning, but it isn't true RL because you need a model to do the monitoring
25 Learning a Model Using EM
- We can use EM to learn a model (Boyen and Koller, 1999)
- The previous result deals with forward message passing
- A similar result holds for backward message passing
- EM can be done online because only a finite window of time around the current observation matters
26 Why Is This Still Unsatisfying?
- It learns parameters, but you need to know the structure a priori (independence assumptions, number of values for the hidden variables)
- Where does an agent get this information?
- Does an agent really even want this type of information?
- It would be better if the model were grounded in experience (more direct!?)
27 Getting Rid of Hidden State
- Pretend the observation is the state and do Q-learning (Loch and Singh, 1998; Perkins, 2002) (see the sketch after this list)
- Gradient descent in policy space (Meuleau et al., 1999)
- U-Tree algorithm (McCallum, 1995)
- Predictive state representations (Littman, Sutton, and Singh, 2001)
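A minimal sketch of the first idea above: tabular Q-learning applied directly to observations, with an assumed reset()/step() environment interface:

```python
import numpy as np

def q_learning_on_observations(env, n_obs, n_actions, episodes=1000,
                               alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning that pretends the observation is the state.
    # `env` is assumed to provide reset() -> obs and step(a) -> (obs, reward, done).
    Q = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        o = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection on the current observation
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[o]))
            o_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[o_next]))
            Q[o, a] += alpha * (target - Q[o, a])
            o = o_next
    return Q
```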