Title: An Overview of Partially Observable Markov Decision Process
1. An Overview of Partially Observable Markov Decision Process
- Dr. Karl Altenburg
- 18 OCT 02
2. Introduction
- The operation of a UAV (or any automated device) depends on decisions and actions
- The decision of which action to take is typically based on the state of the environment
  - if sense_a_target then strike
  - if sense_a_threat then avoid
- Understandably, given limited sensors, a UAV's view of the state of the environment is only partially complete
  - It can't see through mountains, for example
3. Introduction
- A challenge in the design of intelligent agents, such as a UAV, is to find an effective mapping from environmental states to actions, even if the agent isn't completely sure which environmental state it is in
4. Markov Chains
- Developed by A.A. Markov in 1906
- Markov chains apply to chains of events
  - Repeated trials
  - Outcomes depend only on the outcome of the previous trial
- Each experiment has a finite, fixed number of outcomes called states
- The outcomes may be stochastic, that is, entry into the next state is governed by a probability
5. Markov Chains
- A Markov process may be depicted as a state transition diagram where the nodes are states and the transitions are labeled with the probabilities of moving to the succeeding state
[Figure: two-state transition diagram with states "heads" and "tails"; each of the four transitions (heads→heads, heads→tails, tails→heads, tails→tails) has probability 0.5]
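The slides contain no code, but the coin-flip chain in the figure can be simulated directly. This is a minimal sketch; the function and variable names (TRANSITIONS, step, simulate) are illustrative, not from the presentation.

```python
import random

# Two-state Markov chain from the figure: from either state, the next state is
# "heads" or "tails" with probability 0.5 each.
TRANSITIONS = {
    "heads": {"heads": 0.5, "tails": 0.5},
    "tails": {"heads": 0.5, "tails": 0.5},
}

def step(state: str) -> str:
    """Sample the successor state from the current state's transition distribution."""
    successors, probs = zip(*TRANSITIONS[state].items())
    return random.choices(successors, weights=probs)[0]

def simulate(start: str, n_steps: int) -> list[str]:
    """Run the chain for n_steps, returning the sequence of visited states."""
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

print(simulate("heads", 10))
```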
6. STDs and Agent Specification
- State Transition Diagrams are also used to depict finite state automata, which are often used to describe, and even specify, autonomous agent behavior
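As a minimal sketch of what such a specification might look like, here is a tiny finite state automaton for UAV behavior; the state names, triggers, and transitions are illustrative assumptions based on the Introduction slide, not part of the presentation.

```python
# Hypothetical finite state automaton for a simple UAV behavior.
# (state, trigger) -> next state; triggers echo the Introduction slide's examples.
FSM = {
    ("search", "sense_a_target"):   "strike",
    ("search", "sense_a_threat"):   "avoid",
    ("avoid",  "threat_cleared"):   "search",
    ("strike", "target_destroyed"): "search",
}

def advance(state: str, trigger: str) -> str:
    """Follow the transition for (state, trigger); stay in place if none is defined."""
    return FSM.get((state, trigger), state)

state = "search"
for trigger in ["sense_a_threat", "threat_cleared", "sense_a_target"]:
    state = advance(state, trigger)
    print(trigger, "->", state)
```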
7. FSM and Agents
8. Partially Observable Markov Decision Process
- For a good introduction see:
  - Atrash, A., and Koenig, S. 2001. Probabilistic Planning for Behavior-Based Robotics. GA Tech / AAAI.
- Describes a police robot that has a noisy sensor and the task of:
  - searching rooms for victims
  - avoiding rooms with terrorists
- Avoiding rooms with victims is bad
- Searching rooms with terrorists is worse
9. POMDP
- A POMDP may be described as follows (a small sketch in code follows the list):
  - S: finite set of states
  - O: finite set of observations
  - π: initial state distribution
  - s: current state
  - A(s): set of actions available in state s
  - a: an action
  - p(s' | s, a): transition function
  - q(o | s, a): observation function
  - r(s, a): reward function
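A minimal sketch of the tuple above as a data structure, plus the standard Bayesian belief update; the belief update and all names below are additions for illustration, not part of the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class POMDP:
    S: List[str]                              # finite set of states
    O: List[str]                              # finite set of observations
    pi: Dict[str, float]                      # initial state distribution
    A: Callable[[str], List[str]]             # A(s): actions available in state s
    p: Callable[[str, str, str], float]       # p(s_next, s, a) = Pr(s_next | s, a)
    q: Callable[[str, str, str], float]       # q(o, s, a)      = Pr(o | s, a)
    r: Callable[[str, str], float]            # r(s, a): immediate reward

def belief_update(m: POMDP, b: Dict[str, float], a: str, o: str) -> Dict[str, float]:
    """Standard belief filter: b'(s') is proportional to
    q(o | s', a) * sum over s of p(s' | s, a) * b(s)."""
    unnormalized = {
        s2: m.q(o, s2, a) * sum(m.p(s2, s, a) * b[s] for s in m.S)
        for s2 in m.S
    }
    z = sum(unnormalized.values()) or 1.0
    return {s2: v / z for s2, v in unnormalized.items()}
```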
10. (No transcript)
11. POMDP and Policy Graphs
- Given some observed state of the world, a decision maker must take some action, which results in a new observation and a reward
- The mapping of actions and observations may be done with a policy graph (see the sketch below)
  - Nodes are actions
  - Arcs are observations
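A minimal sketch of executing a policy graph; the node names, actions, and observations are illustrative assumptions loosely based on the police-robot example, not from the slides.

```python
# Hypothetical policy graph: each node carries an action, each arc is labeled by an observation.
ACTION_AT_NODE = {
    "n0": "move_to_next_room",
    "n1": "search_room",
    "n2": "skip_room",
}

# (node, observation) -> next node
EDGES = {
    ("n0", "sense_victim"):    "n1",
    ("n0", "sense_terrorist"): "n2",
    ("n1", "room_cleared"):    "n0",
    ("n2", "room_passed"):     "n0",
}

def execute(start_node: str, observations: list[str]) -> list[str]:
    """Walk the graph: take the current node's action, then follow the arc labeled
    by the next observation (staying put if that arc is undefined)."""
    node, actions_taken = start_node, []
    for obs in observations:
        actions_taken.append(ACTION_AT_NODE[node])
        node = EDGES.get((node, obs), node)
    return actions_taken

print(execute("n0", ["sense_victim", "room_cleared", "sense_terrorist"]))
```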
12. (No transcript)
13. Policy Graph
- The objective of the planner is to derive a policy (graph) that maximizes the average total reward over an infinite planning horizon (we don't know when things will end)
- γ: a discount factor
  - Typically set to slightly less than 1.0 (e.g., 0.9) to ensure that the total discounted reward is always finite (see the bound below)
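As a quick justification (not on the original slide): if the per-step reward is bounded by r_max, the discounted total reward is bounded by a geometric series,

    sum over t ≥ 0 of γ^t · r_t  ≤  r_max / (1 − γ),   for 0 ≤ γ < 1,

so the objective is always finite.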
14. Policy Graph
- Mapping policy graphs to finite state automata is straightforward
- Note:
  - Optimal policy graphs can potentially be large, but often are not
  - Finding an optimal policy graph is PSPACE-complete (see Papadimitriou & Tsitsiklis), and only feasible for small planning tasks
  - There may be unreachable nodes, which can be removed (see the pruning sketch below)
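A minimal sketch of removing unreachable nodes with a breadth-first reachability pass; the graph and node names are illustrative, reusing the earlier hypothetical policy graph plus an orphan node.

```python
from collections import deque

def prune_unreachable(edges: dict[tuple[str, str], str], start: str) -> set[str]:
    """Return the nodes reachable from `start`; any node outside this set can be dropped."""
    reachable, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        for (src, _obs), dst in edges.items():
            if src == node and dst not in reachable:
                reachable.add(dst)
                frontier.append(dst)
    return reachable

edges = {
    ("n0", "sense_victim"):    "n1",
    ("n0", "sense_terrorist"): "n2",
    ("n1", "room_cleared"):    "n0",
    ("n2", "room_passed"):     "n0",
    ("n3", "anything"):        "n0",   # "n3" has no incoming arcs, so it is unreachable
}
print(prune_unreachable(edges, "n0"))  # {'n0', 'n1', 'n2'}
```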
15. (No transcript)
16. POMDP vs. FSM
- POMDPs assume discrete actions, with observations made after each action
- FSMs assume continuous behavior, with triggers that can be observed at any time
- POMDP planners assume every observation can be made in every state
  - Leads to a combinatorial explosion
- FSMs ignore non-trigger observations
- Atomic packaging and abstraction of actions are solutions to these differences
17. Ideas
- "Enter or continue" is very much like our UAV search-or-strike question
- POMDPs appear to be a good fit as a formal model for planning for emergent swarm intelligence
- Work needs to be done to clearly define the sets of states, actions, observations, and rewards