Partial Observability

1
Partial Observability
  • Dan Bernstein
  • Jerod Weinman
  • Patrick Deegan
  • April 17, 2002

2
Roadmap for This Lecture
  • POMDP basics
  • Examples and formal model
  • Belief states
  • Exact algorithms for flat POMDPs with full
    model
  • Towards RL for large POMDPs
  • Belief-state monitoring
  • EM for model learning
  • The need to go beyond hidden state

3
Partial Observability
  • Some problems are difficult to model as MDPs
    because of limited observations
  • Robot navigation
  • Diagnosing a disease
  • Inspection and maintenance
  • Tutoring and designing questionnaires
  • Active classification
  • Automated driving

4
Partially Observable MDPs
  • The POMDP model is a popular one for these kinds
    of problems
  • The agent sees observations that are based on the
    state
  • Just an HMM with actions and rewards

5
Factored POMDPs
  • Of course, we can think of state and observation
    variables, leading to a factored model
  • We will return to this later

6
Formal Details
  • A POMDP is a tuple ⟨S, A, P, R, Ω, O⟩ (a minimal
    encoding is sketched below), where
  • S is a finite state set, with initial
    distribution b0
  • A is a finite action set
  • P(s' | s, a) is a transition function
  • R(s, a) is a reward function
  • Ω is a finite observation set
  • O(o | s, a, s') is an observation function
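
As a concrete illustration of this tuple, a small tabular POMDP can be stored as plain arrays. This is only a sketch: the class name, the array layout, and the tiny two-state example are assumptions made here, not part of the lecture.

```python
import numpy as np

# A minimal tabular POMDP container (illustrative; names are not from the slides).
# Indices: s and s2 range over states, a over actions, o over observations.
class TabularPOMDP:
    def __init__(self, P, R, O, b0):
        self.P = P    # P[a, s, s2] = P(s' | s, a)
        self.R = R    # R[s, a]     = expected immediate reward
        self.O = O    # O[a, s2, o] = O(o | a, s')  (the slides also allow dependence on s)
        self.b0 = b0  # initial state distribution

# A tiny two-state, two-action, two-observation example (numbers made up).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],    # transitions under action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                 # R[s, a]
O = np.array([[[0.8, 0.2], [0.2, 0.8]],    # O[a, s2, o] for action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # O[a, s2, o] for action 1
pomdp = TabularPOMDP(P, R, O, b0=np.array([0.5, 0.5]))
```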

7
Formal Details (cont.)
  • A policy π : Ω* → A is a mapping from
    observation histories to actions
  • The aim is to find a policy that maximizes
    expected total reward
  • As with MDPs, we can have finite horizon,
    infinite horizon discounted, infinite horizon
    average reward, etc.

8
Facts About POMDPs
  • Finite-horizon POMDPs are PSPACE-hard
  • Infinite-horizon POMDPs are undecidable
  • For some types of infinite-horizon POMDPs,
    optimal policies don't even exist!
  • But people are still interested in solving these
    things exactly or almost exactly

9
Solving POMDPs Exactly
  • It's trickier than solving an MDP
  • A popular way to go about it:
  • POMDP → belief-state MDP
  • Perform DP on the belief-state MDP

10
Belief States
  • A belief state b is a distribution over states of
    the underlying MDP
  • Each time the agent gets an observation and takes
    an action, it enters a new belief state
  • A belief state is a sufficient statistic (i.e.,
    it satisfies the Markov property)
  • Observation sequence → belief state (the update
    is sketched below)
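
A minimal sketch of that belief update, assuming the array layout from the earlier POMDP sketch (P[a, s, s'] and the simplified O[a, s', o]); this is the standard Bayes-filter update written out, not code from the lecture.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Return the next belief state after taking action a in belief b and observing o.

    b: current belief, shape (S,)
    P: transition probabilities, P[a, s, s2] = P(s' | s, a)
    O: observation probabilities, O[a, s2, o] = O(o | a, s')
    """
    # Predict: distribution over next states before the observation arrives.
    predicted = b @ P[a]                  # predicted[s2] = sum_s b[s] * P(s2 | s, a)
    # Correct: weight by the likelihood of the observation, then renormalize.
    unnormalized = predicted * O[a, :, o]
    return unnormalized / unnormalized.sum()
```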

11
Belief-state MDP
  • So from a POMDP, we can construct an equivalent
    belief-state MDP
  • The Bellman equation for this MDP is
    V(b) = max_a [ R(b, a) + γ Σ_o P(o | b, a) V(b') ],
    where b' is the belief that results from b after
    taking action a and observing o (a one-step
    backup at a single belief is sketched below)
  • But if the POMDP has n states, the belief-state
    MDP has an n-dimensional continuous state space!
  • Continuous MDPs are hard to solve in general
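
A sketch of one Bellman backup at a single belief, spelling out the equation above. It assumes the same array layout as the earlier sketches and takes the current value-function estimate as a callable V; how that estimate is actually represented is exactly what the next slides address.

```python
import numpy as np

def bellman_backup(b, P, R, O, V, gamma=0.95):
    """Return max_a [ R(b,a) + gamma * sum_o P(o | b, a) * V(b') ] at belief b."""
    best = -np.inf
    for a in range(P.shape[0]):
        value = b @ R[:, a]                   # R(b, a) = sum_s b(s) R(s, a)
        predicted = b @ P[a]                  # distribution over next states
        for o in range(O.shape[2]):
            p_o = predicted @ O[a, :, o]      # P(o | b, a)
            if p_o > 0:
                b_next = predicted * O[a, :, o] / p_o
                value += gamma * p_o * V(b_next)
        best = max(best, value)
    return best
```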

12
Saved by Sondik!
  • Luckily, these special continuous MDPs have nice
    structure we can exploit
  • In particular, value functions are piecewise
    linear and convex (and thus finitely
    representable)
  • For example, in a two-state POMDP, the value
    function looks like

[Figure: a piecewise linear, convex value function over the belief simplex, running from belief concentrated on s0 to belief concentrated on s1]
13
Piecewise Linearity and Convexity
[Figure: a collection of policy trees, with action nodes a0/a1 and observation branches o0/o1]

  • Each vector in a value function corresponds to a
    policy tree
  • For any belief state, we of course want to pick
    the best policy tree (see the sketch below)
  • Taking the max over these linear functions is
    how we get piecewise linearity and convexity
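
Concretely, a piecewise linear convex value function is just a finite set of vectors, one per useful policy tree, and evaluating it at a belief is a max over dot products. A minimal sketch (the names alpha_vectors and best_tree are illustrative):

```python
import numpy as np

def value(b, alpha_vectors):
    """Evaluate a piecewise linear convex value function at belief b.

    alpha_vectors: array of shape (num_vectors, S); each row gives the value,
    per starting state, of executing one policy tree.
    """
    return np.max(alpha_vectors @ b)

def best_tree(b, alpha_vectors):
    """Index of the policy tree whose vector achieves the max at b."""
    return int(np.argmax(alpha_vectors @ b))
```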

14
First Step of Value Iteration
  • One-step policy trees are just actions
  • To do a DP backup, we evaluate every possible
    two-step tree (sketched in code below)

[Figure: all possible two-step policy trees, each rooted at an action a0 or a1, with branches for observations o0 and o1 leading to one-step trees]
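
A sketch of that exhaustive DP backup: every new tree is an action plus a choice, for each observation, of which current tree to execute next, so this enumerates |A| * |V|^|Ω| candidate vectors. The array layout follows the earlier sketches and is an assumption of this illustration.

```python
import itertools
import numpy as np

def exhaustive_backup(alphas, P, R, O, gamma=0.95):
    """One exact DP backup: build the alpha vector of every possible next-step policy tree.

    alphas: list of current alpha vectors (each shape (S,)), one per policy tree.
    Returns the (unpruned) list of new alpha vectors.
    """
    num_actions = P.shape[0]
    num_obs = O.shape[2]
    new_alphas = []
    for a in range(num_actions):
        # choice[o] says which current tree to run after observing o.
        for choice in itertools.product(range(len(alphas)), repeat=num_obs):
            alpha = np.array(R[:, a], dtype=float)
            for o in range(num_obs):
                # Expected future value, per starting state s:
                # sum over s2 of P(s2 | s, a) * O(o | a, s2) * alphas[choice[o]][s2]
                alpha += gamma * P[a] @ (O[a, :, o] * alphas[choice[o]])
            new_alphas.append(alpha)
    return new_alphas
```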
15
Pruning
  • Some policy trees are dominated and are never
    useful
  • They are pruned and not considered on the next
    step
  • The key is to prune before evaluating (Witness
    and Incremental Pruning do this)

[Figure: the vectors of a value function over belief space, with one dominated vector labeled "prune this one"]
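
A minimal sketch of the simplest form of pruning: drop any vector that is pointwise dominated by another. This is only a sufficient test; exact algorithms such as Witness and Incremental Pruning additionally use a linear program to keep exactly those vectors that are optimal at some belief, which is not shown here.

```python
import numpy as np

def prune_pointwise(alphas, tol=1e-9):
    """Remove alpha vectors that are dominated componentwise by some other vector.

    A vector can survive this test and still never be the maximizer at any
    belief; the full LP-based test catches those cases.
    """
    kept = []
    for i, a in enumerate(alphas):
        dominated = any(
            j != i and np.all(other >= a - tol) and np.any(other > a + tol)
            for j, other in enumerate(alphas)
        )
        if not dominated:
            kept.append(a)
    return kept
```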
16
Value Iteration
  • Keep doing backups until the value function
    doesn't change much anymore
  • In the worst case, you end up considering every
    possible policy tree
  • But hopefully you will converge before this
    happens

17
More on Flat POMDP Algorithms
  • Policy iteration (Hansen, 1998) for POMDPs uses
    the same machinery, but it has an explicit policy
    improvement step
  • People have also tried heuristic search,
    discretization, etc.
  • See (Murphy, 2000) for a good survey

18
Back to Factored POMDPs
  • With factored POMDPs, things get even hairier
  • In fact, even executing a policy can be hard
    because it requires belief-state monitoring
  • In a factored POMDP, a belief state is
    exponentially large
  • So we can't represent or update it exactly

19
Belief-state Updating
  • Note that the monitoring problem is a special
    case of general Bayes net inference
  • But it's not an easy special case; it causes
    problems for exact algorithms like junction tree


20
Approximate Belief Updating
  • Boyen and Koller (1998) have devised a neat
    approximate algorithm
  • State variables x1, ..., xn; belief b(x1, ...,
    xn)
  • That's too hard to handle, but maybe we can
    handle b(x1) b(x2) ··· b(xn)
  • At each step, the algorithm updates the belief
    state, then projects it down to be factored
    (the projection is sketched below)
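
A sketch of what "project it down to be factored" means, assuming the joint belief over n state variables is available as a full probability table (which is exactly what becomes infeasible for large n; the real algorithm performs the update and projection with the DBN machinery). The function name and the small example are illustrative.

```python
import numpy as np

def project_to_marginals(joint):
    """Project a joint belief (an n-dimensional probability table, one axis per
    state variable) onto the product of its single-variable marginals."""
    n = joint.ndim
    marginals = [joint.sum(axis=tuple(j for j in range(n) if j != i)) for i in range(n)]
    # Rebuild the factored approximation b(x1) b(x2) ... b(xn) as an outer product.
    factored = marginals[0]
    for m in marginals[1:]:
        factored = np.multiply.outer(factored, m)
    return marginals, factored

# Example: a joint belief over two binary variables with correlation the
# factored form cannot capture.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
marginals, factored = project_to_marginals(joint)   # each marginal is [0.5, 0.5]
```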

21
Approximate Belief Updating
[Figure: two parallel timelines; the true belief state is updated after each observation, while the factored belief state is updated after each observation and then projected back to factored form]
  • Can we say anything about the error due to this
    approximation?

22
Error Bound
  • Yes, it turns out that the error remains bounded
  • Theorem (Boyen and Koller, 1998), stated with:
  • bt, the true belief state
  • b̂t, the factored approximate belief state
  • ε, the error introduced by each projection
  • γ, the error decrease due to mixing
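
The equation of the theorem did not survive in this transcript. Up to notation, the Boyen and Koller (1998) bound has roughly the following form, where D is relative entropy (KL divergence); treat this reconstruction as an assumption rather than a quote from the slide.

```latex
\mathbb{E}\left[\, D\!\left( b_t \,\big\|\, \hat{b}_t \right) \right] \;\le\; \frac{\epsilon}{\gamma}
\qquad \text{for all } t
```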

23
Computational Complexity
  • That's great, but we still need to be able to do
    the projection efficiently
  • We can construct a junction tree from the DBN and
    use inference to get the marginals
  • The junction tree just needs to be set up once at
    the beginning

24
An RL Algorithm
  • Rodriguez, Parr, and Koller (1999) developed an
    algorithm based on the monitoring scheme
  • They use function approximators to represent
    policies from belief states to actions
  • It acts like reinforcement learning, but it isn't
    true RL because you need a model to do the
    monitoring

25
Learning a Model Using EM
  • We can use EM to learn a model (Boyen and Koller,
    1999)
  • The previous result deals with forward message
    passing
  • A similar result holds for backward message
    passing
  • EM can be done online because only a finite
    window of time around the current observation
    matters

26
Why Is This Still Unsatisfying?
  • It learns parameters, but you need to know the
    structure a priori (independence assumptions,
    number of values for the hidden variables)
  • Where does an agent get this information?
  • Does an agent really even want this type of
    information?
  • It would be better if the model were grounded in
    experience (more direct!?)

27
Getting Rid of Hidden State
  • Pretend the observation is the state and do
    Q-learning (Loch and Singh, 1998; Perkins, 2002);
    see the sketch after this list
  • Gradient descent in policy space (Meuleau et al.,
    1999)
  • U-Tree algorithm (McCallum, 1995)
  • Predictive state representations (Littman,
    Sutton, and Singh, 2001)
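
A minimal sketch of the first idea on this list: ordinary tabular Q-learning with the most recent observation standing in for the state. The environment interface (reset/step returning integer observation indices) and the hyperparameters are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def q_learning_on_observations(env, num_obs, num_actions,
                               episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning that pretends the observation is the state.

    env is assumed to expose reset() -> obs and step(a) -> (obs, reward, done),
    with integer observation indices (a hypothetical interface).
    """
    Q = np.zeros((num_obs, num_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection on the observation "state".
            a = rng.integers(num_actions) if rng.random() < epsilon else int(np.argmax(Q[obs]))
            next_obs, reward, done = env.step(a)
            target = reward + (0.0 if done else gamma * np.max(Q[next_obs]))
            Q[obs, a] += alpha * (target - Q[obs, a])
            obs = next_obs
    return Q
```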