Title: Partial Observability
1 Partial Observability
- Dan Bernstein
- Jerod Weinman
- Patrick Deegan
- April 17, 2002
2 Roadmap for This Lecture
- POMDP basics
- Examples and formal model
- Belief states
- Exact algorithms for flat POMDPs with a full model
- Towards RL for large POMDPs
- Belief-state monitoring
- EM for model learning
- The need to go beyond hidden state
3 Partial Observability
- Some problems are difficult to model as MDPs because of limited observations
- Robot navigation
- Diagnosing a disease
- Inspection and maintenance
- Tutoring and designing questionnaires
- Active classification
- Automated driving
4 Partially Observable MDPs
- The POMDP model is a popular one for these kinds of problems
- The agent sees observations that are based on the state
- Just an HMM with actions and rewards
5 Factored POMDPs
- Of course, we can think of state and observation variables, leading to a factored model
- We will return to this later
6 Formal Details
- A POMDP is a tuple ⟨S, A, P, R, Ω, O⟩, where
- S is a finite state set, with initial distribution b0
- A is a finite action set
- P(s'|s,a) is a transition function
- R(s,a) is a reward function
- Ω is a finite observation set
- O(o|s,a,s') is an observation function
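As a concrete sketch of how this tuple might be stored in code (the field names and array layout below are assumptions, reused by the later sketches):

```python
# A minimal sketch of the POMDP tuple <S, A, P, R, Omega, O>.
# Field names and array layout are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    n_obs: int         # |Omega|
    b0: np.ndarray     # initial state distribution, shape (|S|,)
    P: np.ndarray      # P[a, s, s'] = P(s'|s,a), shape (|A|, |S|, |S|)
    R: np.ndarray      # R[s, a] = R(s,a), shape (|S|, |A|)
    O: np.ndarray      # O[a, s, s', o] = O(o|s,a,s'), shape (|A|, |S|, |S|, |Omega|)
```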
7 Formal Details (cont.)
- A policy π : Ω* → A is a mapping from observation histories to actions
- The aim is to find a policy that maximizes expected total reward
- As with MDPs, we can have finite-horizon, infinite-horizon discounted, infinite-horizon average-reward, etc.
8 Facts About POMDPs
- Finite-horizon POMDPs are PSPACE-hard
- Infinite-horizon POMDPs are undecidable
- For some types of infinite-horizon POMDPs, optimal policies don't even exist!
- But people are still interested in solving these things exactly or almost exactly
9 Solving POMDPs Exactly
- It's trickier than solving an MDP
- A popular way to go about it
- POMDP → belief-state MDP
- Perform DP on the belief-state MDP
10 Belief States
- A belief state b is a distribution over states of the underlying MDP
- Each time the agent gets an observation and takes an action, it enters a new belief state
- A belief state is a sufficient statistic (i.e., it satisfies the Markov property)
- Observation sequence → belief state
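A minimal sketch of one belief-update step, using the hypothetical POMDP structure above; the update is b'(s') ∝ Σ_s b(s) P(s'|s,a) O(o|s,a,s'), and the normalizer is Pr(o|b,a):

```python
import numpy as np

def belief_update(pomdp, b, a, o):
    # b'(s') is proportional to sum_s b(s) * P(s'|s,a) * O(o|s,a,s')
    joint = b[:, None] * pomdp.P[a] * pomdp.O[a, :, :, o]   # shape (|S|, |S|)
    b_new = joint.sum(axis=0)                                # sum over s
    norm = b_new.sum()                                       # = Pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return b_new / norm
```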
11 Belief-state MDP
- So from a POMDP, we can construct an equivalent belief-state MDP
- The Bellman equation for this MDP is
  V(b) = max_a [ R(b,a) + γ Σ_o Pr(o|b,a) V(b') ],
  where R(b,a) = Σ_s b(s) R(s,a) and b' is the updated belief after taking a and observing o
- But if the POMDP has n states, the belief-state MDP has an n-dimensional continuous state space!
- Continuous MDPs are hard to solve in general
12 Saved by Sondik!
- Luckily, these special continuous MDPs have nice structure we can exploit
- In particular, value functions are piecewise linear and convex (and thus finitely representable)
- For example, in a two-state POMDP, the value function looks like
[Figure: a piecewise linear, convex value function plotted over beliefs ranging from s0 to s1]
13 Piecewise Linearity and Convexity
[Figure: example policy trees, with action nodes (a0, a1) branching on observations (o0, o1)]
- Each vector in a value function corresponds to a policy tree
- For any belief state, we of course want to pick the best policy tree
- This is how we get piecewise linearity and convexity
14 First Step of Value Iteration
- One-step policy trees are just actions (a0 and a1)
- To do a DP backup, we evaluate every possible two-step tree
[Figure: all two-step policy trees, formed by choosing a root action and attaching a one-step tree to each observation branch]
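A sketch of this backup under the representation assumed earlier: for each action and each way of assigning a previous-stage vector to each observation branch, we build one candidate vector, so there are |A| * |V|^|Omega| candidates before pruning:

```python
import itertools
import numpy as np

def dp_backup(pomdp, alpha_vectors, gamma):
    # Exact DP backup: one candidate vector per (action, assignment of a
    # previous-stage vector to each observation), i.e. per new policy tree.
    new_vectors = []
    n_vec = alpha_vectors.shape[0]
    for a in range(pomdp.n_actions):
        # g[k, o, s] = sum_{s'} P(s'|s,a) O(o|s,a,s') alpha_k(s')
        g = np.einsum("ij,ijo,kj->koi", pomdp.P[a], pomdp.O[a], alpha_vectors)
        for choice in itertools.product(range(n_vec), repeat=pomdp.n_obs):
            vec = pomdp.R[:, a] + gamma * sum(
                g[choice[o], o] for o in range(pomdp.n_obs))
            new_vectors.append(vec)
    return np.array(new_vectors)
```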
15 Pruning
- Some policy trees are dominated and are never useful
- They are pruned and not considered on the next step
- The key is to prune before evaluating (Witness and Incremental Pruning do this)
[Figure: a value function over the belief simplex; the dominated vector is the one to prune]
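Computationally, a vector can be pruned if no belief state strictly prefers it to all the others. Pointwise domination is a cheap sufficient test; the general test is a small linear program (the idea behind Witness-style checks). One possible sketch using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def has_witness(alpha, others, eps=1e-9):
    # True if some belief b strictly prefers `alpha` to every vector in `others`:
    #   max delta  s.t.  b.(alpha - alpha') >= delta for all alpha',
    #                    b in the probability simplex
    if len(others) == 0:
        return True
    n = alpha.shape[0]
    c = np.zeros(n + 1)
    c[-1] = -1.0                                             # maximize delta
    A_ub = np.hstack([others - alpha, np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])    # sum_s b(s) = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.success and -res.fun > eps

def prune(vectors):
    # Keep only vectors that are strictly best somewhere on the belief simplex.
    # Exact duplicates should be removed first, or ties can prune both copies.
    kept = []
    for i, v in enumerate(vectors):
        others = np.array([w for j, w in enumerate(vectors) if j != i])
        if has_witness(v, others):
            kept.append(v)
    return np.array(kept)
```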
16 Value Iteration
- Keep doing backups until the value function doesn't change much anymore
- In the worst case, you end up considering every possible policy tree
- But hopefully you will converge before this happens
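Stitching the earlier sketches together, one possible (simplified) value-iteration loop; checking convergence only at the corners of the belief simplex is a shortcut, not the proper stopping test:

```python
import numpy as np

def pomdp_value_iteration(pomdp, gamma, tol=1e-4, max_iters=100):
    alpha_vectors = np.zeros((1, pomdp.n_states))    # V_0 = 0
    test_beliefs = np.eye(pomdp.n_states)            # corners of the simplex
    for _ in range(max_iters):
        candidates = np.unique(dp_backup(pomdp, alpha_vectors, gamma), axis=0)
        new_vectors = prune(candidates)
        old_vals = (alpha_vectors @ test_beliefs.T).max(axis=0)
        new_vals = (new_vectors @ test_beliefs.T).max(axis=0)
        alpha_vectors = new_vectors
        if np.max(np.abs(new_vals - old_vals)) < tol:
            break
    return alpha_vectors
```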
17 More on Flat POMDP Algorithms
- Policy iteration (Hansen, 1998) for POMDPs uses the same machinery, but it has an explicit policy improvement step
- People have also tried heuristic search, discretization, etc.
- See (Murphy, 2000) for a good survey
18 Back to Factored POMDPs
- With factored POMDPs, things get even hairier
- In fact, even executing a policy can be hard because it requires belief-state monitoring
- In a factored POMDP, a belief state is exponentially large
- So we can't represent it and update it
19 Belief-state Updating
- Note that the monitoring problem is a special case of general Bayes net inference
- But it's not an easy special case; it causes problems for exact algorithms like junction tree
20 Approximate Belief Updating
- Boyen and Koller (1998) have devised a neat approximate algorithm
- State variables x1, ..., xn; belief b(x1, ..., xn)
- That's too hard to handle, but maybe we can handle b(x1)b(x2)···b(xn)
- At each step, the algorithm updates the belief state, and projects it down to be factored
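To illustrate just the projection step on a tiny example (the real algorithm never materializes the joint belief; it interleaves updating and projection using the DBN structure), a sketch:

```python
import numpy as np

def project_to_marginals(joint):
    # Project a joint belief b(x1, ..., xn), stored as an n-dimensional array,
    # onto the fully factored approximation b(x1) b(x2) ... b(xn).
    n = joint.ndim
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i))
                 for i in range(n)]
    factored = marginals[0]
    for m in marginals[1:]:
        factored = np.multiply.outer(factored, m)   # outer product of marginals
    return marginals, factored
```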
21 Approximate Belief Updating
[Figure: after each observation, the true belief state is updated exactly, while the factored belief state is updated and then projected back to factored form]
- Can we say anything about the error due to this
approximation?
22 Error Bound
- Yes, it turns out that the error remains bounded
- Theorem: E[ KL(σ_t || σ̂_t) ] ≤ ε / γ, where
- σ_t is the true belief state
- σ̂_t is the factored approximate belief state
- ε is the error due to projection
- γ is the error decrease due to mixing
23 Computational Complexity
- That's great, but we still need to be able to do the projection efficiently
- We can construct a junction tree from the DBN and use inference to get the marginals
- The junction tree just needs to be set up once at the beginning
24 An RL Algorithm
- Rodriguez, Parr, and Koller (1999) developed an algorithm based on the monitoring scheme
- They use function approximators to represent policies from belief states to actions
- It acts like reinforcement learning, but it isn't true RL because you need a model to do the monitoring
25 Learning a Model Using EM
- We can use EM to learn a model (Boyen and Koller, 1999)
- The previous result deals with forward message passing
- A similar result holds for backward message passing
- EM can be done online because only a finite window of time around the current observation matters
26 Why Is This Still Unsatisfying?
- It learns parameters, but you need to know the structure a priori (independence assumptions, number of values for the hidden variables)
- Where does an agent get this information?
- Does an agent really even want this type of information?
- It would be better if the model were grounded in experience (more direct!?)
27 Getting Rid of Hidden State
- Pretend the observation is the state and do Q-learning (Loch and Singh, 1998; Perkins, 2002) (see the sketch after this list)
- Gradient descent in policy space (Meuleau et al., 1999)
- U-Tree algorithm (McCallum, 1995)
- Predictive state representations (Littman, Sutton, and Singh, 2001)
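A minimal sketch of the first idea above: tabular Q-learning applied directly to observations, with an assumed reset()/step() environment interface:

```python
import numpy as np

def q_learning_on_observations(env, n_obs, n_actions, episodes=1000,
                               alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning that pretends the observation is the state.
    # `env` is assumed to provide reset() -> obs and step(a) -> (obs, reward, done).
    Q = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        o = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection on the current observation
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[o]))
            o_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[o_next]))
            Q[o, a] += alpha * (target - Q[o, a])
            o = o_next
    return Q
```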