1
Hierarchical Methods for Planning under Uncertainty
  • Thesis Proposal
  • Joelle Pineau
  • Thesis Committee
  • Sebastian Thrun, Chair
  • Matthew Mason
  • Andrew Moore
  • Craig Boutilier, U. of Toronto

2
Integrating robots in living environments
The robot's role:
  • Social interaction
  • Mobile manipulation
  • Intelligent reminding
  • Remote-operation
  • Data collection / monitoring
3
A broad perspective
[Diagram: the robot receives OBSERVATIONS from the user and the world, maintains a belief state over the hidden STATE, and sends ACTIONS back to the world.]
GOAL: Selecting appropriate actions
4
Why is this a difficult problem?
UNCERTAINTY
  • Cause 1: Non-deterministic effects of actions
  • Cause 2: Partial and noisy sensor information
  • Cause 3: Inaccurate model of the world and the user
5
Why is this a difficult problem?
UNCERTAINTY
  • Cause 1: Non-deterministic effects of actions
  • Cause 2: Partial and noisy sensor information
  • Cause 3: Inaccurate model of the world and the user
A solution: Partially Observable Markov Decision Processes (POMDPs)
6
The truth about POMDPs
  • Bad news
  • Finding an optimal POMDP action selection policy
    is computationally intractable for complex
    problems.

7
The truth about POMDPs
  • Bad news
  • Finding an optimal POMDP action selection policy
    is computationally intractable for complex
    problems.
  • Good news
  • Many real-world decision-making problems exhibit
    structure inherent to the problem domain.
  • By leveraging structure in the problem domain, I
    propose an algorithm that makes POMDPs tractable,
    even for large domains.

8
How is it done?
  • Use a Divide-and-conquer approach
  • We decompose a large monolithic problem into a
    collection of loosely-related smaller problems.

9
Thesis statement
Decision-making under uncertainty can be made
tractable for complex problems by exploiting
hierarchical structure in the problem domain.
10
Outline
  • Problem motivation
  • Partially observable Markov decision processes
  • The hierarchical POMDP algorithm
  • Proposed research

11
POMDPs within the family of Markov models
12
What are POMDPs?
Components: a set of states s ∈ S, a set of actions a ∈ A, a set of observations o ∈ O.
[Diagram: a small example POMDP with states S1, S2, S3, actions a1, a2, stochastic transitions (probabilities 0.5, 0.5, and 1), and per-state observation probabilities: S2: Pr(o1)=0.9, Pr(o2)=0.1; S1: Pr(o1)=0.5, Pr(o2)=0.5; S3: Pr(o1)=0.2, Pr(o2)=0.8. The diagram also relates the POMDP to the HMM and the MDP.]
POMDP parameters:
  • Initial belief: b0(s) = Pr(s0 = s)
  • Observation probabilities: O(s,a,o) = Pr(o | s,a)
  • Transition probabilities: T(s,a,s') = Pr(s' | s,a)
  • Rewards: R(s,a)
13
A POMDP example: The tiger problem
Reward Function: R(a=listen) = -1; R(a=open-right, s=tiger-left) = +10; R(a=open-left, s=tiger-left) = -100
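A minimal sketch of this model in code (Python): the rewards below are taken from the slide, while the 0.85 listening accuracy and the reset-after-opening transitions are assumptions borrowed from the classic formulation of the tiger problem, not from this presentation.

STATES = ["tiger-left", "tiger-right"]
ACTIONS = ["listen", "open-left", "open-right"]
OBSERVATIONS = ["growl-left", "growl-right"]

# R(s, a): listening costs -1; opening the correct door yields +10,
# the wrong door -100 (values from the slide).
R = {
    ("tiger-left",  "listen"):      -1,
    ("tiger-right", "listen"):      -1,
    ("tiger-left",  "open-left"):  -100,
    ("tiger-left",  "open-right"):   10,
    ("tiger-right", "open-left"):    10,
    ("tiger-right", "open-right"): -100,
}

def T(s, a, s_next):
    # T(s, a, s'): listening leaves the state unchanged; opening either
    # door resets the tiger to a random side (assumption).
    if a == "listen":
        return 1.0 if s_next == s else 0.0
    return 0.5

def O(s_next, a, o):
    # O(s', a, o): after listening, the growl points to the tiger's true
    # side with probability 0.85 (assumed); other actions are uninformative.
    if a == "listen":
        correct = (s_next == "tiger-left") == (o == "growl-left")
        return 0.85 if correct else 0.15
    return 0.5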
14
What can we do with POMDPs?
  • 1) State tracking
  • After an action, what is the state of the world,
    st ?
  • 2) Computing a policy
  • Which action, aj, should the controller apply
    next?

State tracking is not so hard; computing a policy is very hard!
[Diagram: the world evolves from state st-1 to st under action at-1; the robot's control layer receives observation ot, updates its belief bt-1, and must choose the next action.]
15
The tiger problem: State tracking
[Diagram: initial belief b0 over the belief vector, S1 = tiger-left, S2 = tiger-right.]
16
The tiger problem: State tracking
[Diagram: belief b0 over S1 = tiger-left, S2 = tiger-right; action = listen, observation = growl-left.]
17
The tiger problem: State tracking
[Diagram: after action = listen and observation = growl-left, the belief is updated from b0 to b1, shifting probability toward S1 = tiger-left.]
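The belief update illustrated on these slides is the standard Bayes filter, b'(s') ∝ O(s',a,o) Σ_s T(s,a,s') b(s). A short sketch, reusing the STATES, T, and O definitions from the model sketch above (so the 0.85 accuracy remains an assumption):

def belief_update(b, a, o):
    # b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)
    b_next = {}
    for s_next in STATES:
        predicted = sum(T(s, a, s_next) * b[s] for s in STATES)
        b_next[s_next] = O(s_next, a, o) * predicted
    norm = sum(b_next.values())
    return {s: p / norm for s, p in b_next.items()}

# From the uniform b0, one listen / growl-left step shifts the belief
# toward tiger-left (0.85 vs. 0.15 under the assumed model).
b0 = {"tiger-left": 0.5, "tiger-right": 0.5}
b1 = belief_update(b0, "listen", "growl-left")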
18
Policy Optimization
  • Which action, aj, should the controller apply
    next?
  • In MDPs
  • Policy is a mapping from state to action, π: si → aj
  • In POMDPs
  • Policy is a mapping from belief to action, π: b → aj
  • Recursively calculate expected long-term reward
    for each state/belief
  • Find the action that maximizes the expected
    reward (see the sketch after this list)
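As a toy illustration of mapping a belief to an action, the sketch below does brute-force finite-horizon lookahead over beliefs for the tiger model sketched earlier (it reuses STATES, ACTIONS, OBSERVATIONS, R, T, O, and belief_update). Real POMDP solvers compute value functions over the belief simplex; this enumeration only illustrates the recursion and is not the thesis algorithm.

def plan(b, horizon, gamma=0.95):
    # Return (best_action, expected discounted reward) for belief b.
    best_a, best_v = None, float("-inf")
    for a in ACTIONS:
        v = sum(b[s] * R[(s, a)] for s in STATES)          # immediate reward
        if horizon > 1:
            for o in OBSERVATIONS:
                # Pr(o | b, a) = sum_s' O(s',a,o) * sum_s T(s,a,s') b(s)
                p_o = sum(O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in STATES)
                          for s2 in STATES)
                if p_o > 0:
                    _, v_next = plan(belief_update(b, a, o), horizon - 1, gamma)
                    v += gamma * p_o * v_next
        if v > best_v:
            best_a, best_v = a, v
    return best_a, best_v

# With an uncertain belief the lookahead prefers to listen; once the belief
# is confident enough, it opens the door away from the tiger.
print(plan({"tiger-left": 0.5, "tiger-right": 0.5}, horizon=3))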

19
The tiger problem: Optimal policy
[Diagram: the optimal policy over the belief vector (S1 = tiger-left, S2 = tiger-right): open-right when the belief is concentrated on tiger-left, open-left when it is concentrated on tiger-right, and listen in between.]
20
Complexity of policy optimization
  • Finite-horizon POMDPs are in the worst case doubly
    exponential.
  • Infinite-horizon undiscounted stochastic POMDPs
    are EXPTIME-hard, and may not be decidable.

21
The essence of the problem
  • How can we find good policies for complex POMDPs?
  • Is there a principled way to provide near-optimal
    policies in reasonable time?

22
Outline
  • Problem motivation
  • Partially observable Markov decision processes
  • The hierarchical POMDP algorithm
  • Proposed research

23
A hierarchical approach to POMDP planning
  • Key idea: Exploit hierarchical structure in the
    problem domain to break a problem into many
    related POMDPs.
  • What type of structure?
  • Action set partitioning

[Diagram: an action-set partitioning graph, in which each subtask is treated as an abstract action.]
24
Assumptions
  • Each POMDP controller has a subset of A0.
  • Each POMDP controller has the full state set S0 and
    observation set O0.
  • Each controller includes discriminative reward
    information.
  • We are given the action set partitioning graph.
  • We are given a full POMDP model of the problem,
    {S0, A0, O0, M0}.

25
The tiger problem: An action hierarchy
act → {open-left, investigate}; investigate → {listen, open-right}
Pinvestigate = {S0, Ainvestigate, O0, Minvestigate}, where Ainvestigate = {listen, open-right}
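One lightweight way to write such an action-set partitioning in code is as a mapping from each subtask to its children; the dictionary encoding below is an illustrative choice, not the representation used in the thesis.

# Each key is a subtask/abstract action; leaves are primitive actions.
ACTION_HIERARCHY = {
    "act":         ["open-left", "investigate"],
    "investigate": ["listen", "open-right"],
}

def primitive_actions(subtask, hierarchy=ACTION_HIERARCHY):
    # Flatten a subtask into the primitive actions it can eventually execute.
    actions = []
    for child in hierarchy[subtask]:
        if child in hierarchy:          # the child is itself a subtask
            actions.extend(primitive_actions(child, hierarchy))
        else:                           # the child is a primitive action
            actions.append(child)
    return actions

# primitive_actions("act") -> ["open-left", "listen", "open-right"]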
26
Optimizing the investigate controller
[Diagram: the locally optimal policy of the investigate controller over the belief vector (S1 = tiger-left, S2 = tiger-right): open-right when the belief is concentrated on tiger-left, listen otherwise.]
27
The tiger problem: An action hierarchy
Pact = {S0, Aact, O0, Mact}, where Aact = {open-left, investigate}
But... R(s, a=investigate) is not defined!
[Diagram: act → {open-left, investigate}; investigate → {listen, open-right}.]
28
Modeling abstract actions
Insight: Use the local policy of the corresponding low-level controller.
General form: R(si, ak) = R(si, Policy(controllerk, si))
Example: R(s=tiger-left, ak=investigate) = R(s=tiger-left, Policy(investigate, s=tiger-left)) = R(s=tiger-left, a=open-right) = +10

              open-right   listen   open-left
tiger-left        +10        -1       -100
tiger-right      -100        -1        +10
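A sketch of that construction, assuming the low-level controller's locally optimal policy is available as a simple state-to-action map (the tiger-left entry comes from the slide; the tiger-right entry is assumed to be listen, since open-right would cost -100 there). It reuses STATES and R from the model sketch above.

# Locally optimal policy of the investigate controller at the corner states.
INVESTIGATE_POLICY = {"tiger-left": "open-right", "tiger-right": "listen"}

def abstract_reward(s, controller_policy):
    # R(s, abstract action) = R(s, Policy(controller, s))
    return R[(s, controller_policy[s])]

# Reward entries for the abstract 'investigate' action in the act controller:
# {'tiger-left': 10, 'tiger-right': -1} under the assumptions above.
R_INVESTIGATE = {s: abstract_reward(s, INVESTIGATE_POLICY) for s in STATES}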
29
Optimizing the act controller
[Diagram: the locally optimal policy of the act controller over the belief vector (S1 = tiger-left, S2 = tiger-right), choosing between investigate and open-left.]
30
The complete hierarchical policy
[Diagram: the complete hierarchical policy over the belief vector (S1 = tiger-left, S2 = tiger-right): open-right, listen, open-left.]
31
The complete hierarchical policy
[Diagram: the hierarchical policy over the belief vector (S1 = tiger-left, S2 = tiger-right), overlaid with the optimal policy for comparison.]
32
Results for larger simulation domains
33
Related work on hierarchical methods
  • Hierarchical HMMs
  • Fine et al., 1998.
  • Hierarchical MDPs
  • Dayan & Hinton, 1993; Dietterich, 1998; McGovern et
    al., 1998; Parr & Russell, 1998; Singh, 1992.
  • Loosely-coupled MDPs
  • Boutilier et al., 1997; Dean & Lin, 1995; Meuleau
    et al., 1998; Singh & Cohn, 1998; Wang & Mahadevan,
    1999.
  • Factored state POMDPs
  • Boutilier et al., 1999; Boutilier & Poole, 1996;
    Hansen & Feng, 2000.
  • Hierarchical POMDPs
  • Castanon, 1997; Hernandez-Gardiol & Mahadevan,
    2001; Theocharous et al., 2001;
    Wiering & Schmidhuber, 1997.

34
Outline
  • Problem motivation
  • Partially observable Markov decision processes
  • The hierarchical POMDP algorithm
  • Proposed research

35
Proposed research
  • 1) Algorithmic design
  • 2) Algorithmic analysis
  • 3) Model learning
  • 4) System development and application

36
Research block 1: Algorithmic design
  • Goal 1.1: Developing/implementing the hierarchical
    POMDP algorithm.
  • Goal 1.2: Extending H-POMDP for factorized state
    representation.
  • Goal 1.3: Using state/observation abstraction.
  • Goal 1.4: Planning for controllers with no local
    reward information.

37
Goal 1.3: State/observation abstraction
  • Assumption 2:
  • Each POMDP controller has the full state set S0 and
    observation set O0.
  • Can we reduce the number of states/observations,
    S and O?

38
Goal 1.3: State/observation abstraction
  • Assumption 2:
  • Each POMDP controller has the full state set S0 and
    observation set O0.
  • Can we reduce the number of states/observations,
    S and O?
  • Yes! Each controller only needs a subset of the
    state/observation features.
  • What is the computational speed-up?

39
Goal 1.4: Local controller reward information
  • Assumption 3:
  • Each controller includes some amount of
    discriminative reward information.
  • Can we relax this assumption?

40
Goal 1.4: Local controller reward information
  • Assumption 3:
  • Each controller includes some amount of
    discriminative reward information.
  • Can we relax this assumption?
  • Possibly. Use reward shaping to select a
    policy-invariant reward function.
  • What is the benefit?
  • H-POMDP could solve problems with sparse reward
    functions.

41
Research block 2: Algorithmic analysis
  • Goal 2.1: Evaluating the performance of the H-POMDP
    algorithm.
  • Goal 2.2: Quantifying the loss due to the
    hierarchy.
  • Goal 2.3: Comparing different possible
    decompositions of a problem.

42
Goal 2.1: Performance evaluation
  • How does the hierarchical POMDP algorithm compare
    to:
  • Exact value function methods
  • Sondik, 1971; Monahan, 1982; Littman, 1996;
    Cassandra et al., 1997.
  • Policy search methods
  • Hansen, 1998; Kearns et al., 1999; Ng & Jordan,
    2000; Baxter & Bartlett, 2000.
  • Value approximation methods
  • Parr & Russell, 1995; Thrun, 2000.
  • Belief approximation methods
  • Nourbakhsh, 1995; Koenig & Simmons, 1996;
    Hauskrecht, 2000; Roy & Thrun, 2000.
  • Memory-based methods
  • McCallum, 1996.
  • Consider problems from the POMDP literature and the
    dialogue management domain.

43
Goal 2.2: Quantifying the loss
  • The hierarchical POMDP planning algorithm
    provides an approximately-optimal policy.
  • How near-optimal is the policy?
  • Subject to some (very restrictive) conditions,
    the value function of the top-level controller
    is an upper bound on the value
    of the approximation.
  • Can we loosen the restrictions? Tighten the
    bound?
  • Find a lower bound?

Vtop(b) ≥ Vactual(b)
44
Goal 2.3: Comparing different decompositions
  • Assumption 4:
  • We are given an action set partitioning graph.
  • What makes a good hierarchical action
    decomposition?
  • Comparing decompositions is the first step
    towards automatic decomposition.

[Diagram: two alternative decompositions of an action set containing Replace, Manufacture, Examine, and Inspect.]
45
Research block 3: Model learning
  • Goal 3.1: Automatically generating good action
    hierarchies.
  • Assumption 4: We are given an action set
    partitioning graph.
  • Can we automatically generate a good hierarchical
    decomposition?
  • Maybe. It is being done for hierarchical MDPs.
  • Goal 3.2: Including parameter learning.
  • Assumption 5: We are given a full POMDP model
    of the problem.
  • Can we introduce parameter learning?
  • Yes! Maximum-likelihood parameter optimization
    (Baum-Welch) can be used for POMDPs.

46
Research block 4: System development and application
  • Goal 4.1: Building an extensive dialogue manager

[Diagram: the Dialogue Manager mediates between the User (touchscreen input and messages, speech utterances), a Teleoperation module (remote-control commands, facemail operations), a Reminding module (reminder messages, status information), and a Robot module (robot sensor readings, motion commands).]
47
An implemented scenario
Problem size: |S| = 288, |A| = 14, |O| = 15
State features: RobotLocation, UserLocation, UserStatus, ReminderGoal, UserMotionGoal, UserSpeechGoal
[Map: patient room, robot home, physiotherapy.]
Test subjects: 3 elderly residents in an assisted living facility
48
Contributions
  • Algorithmic contribution: A novel POMDP
    algorithm based on hierarchical structure.
  • Enables the use of POMDPs for much larger problems.
  • Application contribution: Application of POMDPs
    to dialogue management is novel.
  • Allows the design of robust robot behavioural
    managers.

49
Research schedule
  • 1) Algorithmic design/implementation: fall 01
  • 2) Algorithmic analysis: spring/summer 02
  • 3) Model learning: spring/summer/fall 02
  • 4) System development and application: ongoing
  • 5) Thesis writing: fall 02 / spring 03

50
Questions?
51
A simulated robot navigation example
Domain size: |S| = 11, |A| = 6, |O| = 6
52
A dialogue management example
Act (top level): SayTime, plus the subtasks CheckHealth, CheckWeather, Move, Greet, DoMeds, Phone
  • CheckHealth: AskHealth, OfferHelp
  • CheckWeather: AskWeatherTime, SayCurrent, SayToday, SayTomorrow
  • Move: AskGoWhere, GoToRoom, GoToKitchen, GoToFollow, VerifyRoom, VerifyKitchen, VerifyFollow
  • Greet: GreetGeneral, GreetMorning, GreetNight, RespondThanks
  • DoMeds: StartMeds, NextMeds, ForceMeds, QuitMeds
  • Phone: AskCallWho, Call911, CallNurse, CallRelative, Verify911, VerifyNurse, VerifyRelative
Domain size: |S| = 20, |A| = 30, |O| = 27
53
Action hierarchy for implemented scenario
[Diagram: action hierarchy with top-level Act and the subtasks Remind, Assist, Rest, Move, Contact, Inform.]
54
Sondik's parts manufacturing problem
[Diagram: two alternative decompositions (Decomposition 1 and Decomposition 2) over the actions Replace, Manufacture, Examine, Inspect, plus 5 more decompositions.]
55
Manufacturing task results
56
Using state/observation abstraction
[Table: abstracted action and state sets per controller (CheckHealth, DoMeds, Phone), e.g.]
  • CheckHealth. Action set: AskHealth, OfferHelp. State set: ReminderGoal = {none, medsX}, CommunicationGoal = {none, personX}, UserHealth = {good, poor, emergency}
  • Phone. Action set: AskCallWho, CallHelp, CallNurse, CallRelative, VerifyHelp, VerifyNurse, VerifyRelative. State set: CommunicationGoal = {none, nurse, 911, relative}
57
Related work on robot planning and control
  • Manually-scripted dialogue strategies
  • Denecke & Waibel, 1997; Walker et al., 1997.
  • Markov decision processes (MDPs) for dialogue
    management
  • Levin et al., 1997; Fromer, 1998; Walker et al.,
    1998; Goddeau & Pineau, 2000; Singh et al., 2000;
    Walker, 2000.
  • Robot interface
  • Torrance, 1996; Asoh et al., 1999.
  • Classical planning
  • Fikes & Nilsson, 1971; Simmons, 1987;
    McAllester & Rosenblitt, 1991; Penberthy & Weld,
    1992; Kushmerick, 1995; Veloso et al., 1995;
    Smith & Weld, 1998.
  • Execution architectures
  • Firby, 1987; Musliner, 1993; Simmons, 1994;
    Bonasso & Kortenkamp, 1996.

58
Decision-theoretic planning models
59
The tiger problem: Value function solution
[Diagram: the value function V over the belief (from S = tiger-left to S = tiger-right), with linear segments corresponding to open-right, listen, and open-left.]
60
Optimizing the investigate controller
[Diagram: the value function V of the investigate controller over the belief (from S = tiger-left to S = tiger-right), with segments for listen and open-right.]
61
Optimizing the act controller
[Diagram: the value function V of the act controller over the belief (from S = tiger-left to S = tiger-right), with segments for investigate and open-left.]