Transcript and Presenter's Notes

Title: Execution-Time Communication Decisions for Coordination of Multi-Agent Teams


1
Execution-Time Communication Decisions for
Coordination of Multi-Agent Teams
  • Maayan Roth
  • Thesis Defense
  • Carnegie Mellon University
  • September 4, 2007

2
Cooperative Multi-Agent Teams Operating Under
Uncertainty and Partial Observability
  • Cooperative teams
  • Agents work together to achieve team reward
  • No individual motivations
  • Uncertainty
  • Actions have stochastic outcomes
  • Partial observability
  • Agents don't always know the world state

3
Coordinating When Communication is a Limited
Resource
  • Tight coordination
  • One agent's best action choice depends on the
    action choices of its teammates
  • We wish to Avoid Coordination Errors
  • Limited communication
  • Communication costs
  • Limited bandwidth

4
Thesis Question
How can we effectively use communication to
enable the coordination of cooperative
multi-agent teams making sequential decisions
under uncertainty and partial observability?
5
Multi-Agent Sequential Decision Making
6
Thesis Statement
Reasoning about communication decisions at
execution-time provides a more tractable means
for coordinating teams of agents operating under
uncertainty and partial observability.
7
Thesis Contributions
  • Algorithms that
  • Guarantee agents will Avoid Coordination Errors
    (ACE) during decentralized execution
  • Answer the questions of when and what agents
    should communicate

8
Outline
  • Dec-POMDP model
  • Impact of communication on complexity
  • Avoiding Coordination Errors by reasoning over
    Possible Joint Beliefs (ACE-PJB)
  • ACE-PJB-Comm: When should agents communicate?
  • Selective ACE-PJB-Comm: What should agents
    communicate?
  • Avoiding Coordination Errors by executing
    Individual Factored Policies (ACE-IFP)
  • Future directions

9
Dec-POMDP Model
  • Decentralized Partially Observable Markov
    Decision Process
  • Multi-agent extension of single-agent POMDP model
  • Sequential decision-making in domains where
  • Uncertainty in outcome of actions
  • Partial observability - uncertainty about world
    state

10
Dec-POMDP Model
  • M = ⟨α, S, {Ai}i∈m, T, {Ωi}i∈m, O, R⟩
  • α is the number of agents
  • S is the set of possible world states
  • {Ai}i∈m is the set of joint actions, ⟨a1, …, am⟩,
    where ai ∈ Ai
  • T defines transition probabilities over joint
    actions
  • {Ωi}i∈m is the set of joint observations, ⟨ω1, …,
    ωm⟩, where ωi ∈ Ωi
  • O defines observation probabilities over joint
    actions and joint observations
  • R is the team reward function
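
As a rough illustration (not from the thesis), these components could be collected in a single structure; the field names and types below are assumptions:

from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[str, ...]       # <a1, ..., am>, one action per agent
JointObservation = Tuple[str, ...]  # <w1, ..., wm>, one observation per agent

@dataclass
class DecPOMDP:
    num_agents: int                                            # alpha
    states: List[str]                                          # S
    actions: List[List[str]]                                   # Ai for each agent i
    observations: List[List[str]]                              # Omega_i for each agent i
    T: Dict[Tuple[str, JointAction, str], float]               # P(s' | s, a)
    O: Dict[Tuple[str, JointAction, JointObservation], float]  # P(omega | s', a)
    R: Dict[Tuple[str, JointAction], float]                    # team reward R(s, a)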

11
Dec-POMDP Complexity
  • Goal - Compute policy which, for each agent, maps
    its local observation history to an action
  • For all α ≥ 2, a Dec-POMDP with α agents is
    NEXP-complete
  • Agents must reason about the possible actions and
    observations of their teammates

12
Impact of Communication on Complexity [Pynadath
and Tambe, 2002]
  • If communication is free
  • Dec-POMDP reducible to single-agent POMDP
  • Optimal communication policy is to communicate at
    every time step
  • When communication has any cost, Dec-POMDP is
    still intractable (NEXP-complete)
  • Agents must reason about value of information

13
Classifying Communication Heuristics
  • AND- vs. OR-communication [Emery-Montemerlo,
    2005]
  • AND-communication does not replace domain-level
    actions
  • OR-communication does replace domain-level
    actions
  • Initiating communication [Xuan et al., 2001]
  • Tell - Agent decides to tell local information to
    teammates
  • Query - Agent asks a teammate for information
  • Sync - All agents broadcast all information
    simultaneously

14
Classifying Communication Heuristics
  • Does the algorithm consider communication cost?
  • Is the algorithm applicable to
  • General Dec-POMDP domains
  • General Dec-MDP domains
  • Restricted domains
  • Are the agents guaranteed to Avoid Coordination
    Errors?

15
Related Work
[Figure: table classifying related work along the axes
Tell / Query / Sync, AND / OR, Cost, ACE, and Unrestricted]
16
Overall Approach
  • Recall that if communication is free, a Dec-POMDP
    can be treated as a single-agent POMDP
  • 1) At plan-time, pretend communication is free
  • - Generate a centralized policy for the team
  • 2) At execution-time, use communication to enable
    decentralized execution of this policy while
    Avoiding Coordination Errors

17
Outline
  • Dec-POMDP, Dec-MDP models
  • Impact of communication on complexity
  • Avoiding Coordination Errors by reasoning over
    Possible Joint Beliefs (ACE-PJB)
  • ACE-PJB-Comm: When should agents communicate?
  • Selective ACE-PJB-Comm: What should agents
    communicate?
  • Avoiding Coordination Errors by executing
    Individual Factored Policies (ACE-IFP)
  • Future directions

18
Tiger Domain (States, Actions)
  • Two-agent tiger problem [Nair et al., 2003]

Individual actions: ai ∈ {OpenL, OpenR, Listen}.
The robot can open the left door, open the right
door, or listen.
S = {SL, SR}: the tiger is either behind the left
door or behind the right door.
19
Tiger Domain (Observations)
Individual observations: ωi ∈ {HL, HR}. The robot can
hear the tiger behind the left door or hear the tiger
behind the right door.
Observations are noisy and independent.
20
Tiger Domain (Reward)
  • Coordination problem: agents must act together
    for maximum reward

Listen has small cost (-1 per agent)
Both agents opening door with tiger leads to
medium negative reward (-50)
Maximum reward (20) when both agents open door
with treasure
Minimum reward (-100) when only one agent opens
door with tiger
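
For concreteness, the reward structure above can be written out as a small function; the numeric values follow the slides, while the handling of mixed cases not listed there is an assumption (the full definition is in Nair et al., 2003).

def tiger_reward(state: str, joint_action: tuple) -> float:
    """Team reward for the two-agent tiger domain (values from the slides)."""
    tiger_door = "OpenL" if state == "SL" else "OpenR"
    treasure_door = "OpenR" if state == "SL" else "OpenL"
    a1, a2 = joint_action
    if a1 == a2 == "Listen":
        return -2.0    # both listen: -1 per agent
    if a1 == a2 == tiger_door:
        return -50.0   # both agents open the door with the tiger
    if a1 == a2 == treasure_door:
        return 20.0    # both agents open the door with the treasure
    if (a1 == tiger_door) != (a2 == tiger_door):
        return -100.0  # only one agent opens the door with the tiger
    # remaining mixed cases (one agent opens the treasure door while the other
    # listens) are not specified on the slide; listening cost as a placeholder
    return -1.0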
21
Coordination Errors
Reward(⟨OpenR, OpenL⟩) = -100   Reward(⟨OpenL,
OpenL⟩) = -50
[Figure: agents that have each heard HL choose
mismatched actions a1 = OpenR and a2 = OpenL]
Agents Avoid Coordination Errors when each
agent's action is a best response to its
teammates' actions.
22
Avoid Coordination Errors by Reasoning Over
Possible Joint Beliefs (ACE-PJB)
  • Centralized POMDP policy maps joint beliefs to
    joint actions
  • Joint belief (bt) distribution over world
    states
  • Individual agents can't compute the joint belief
  • Don't know what their teammates have observed or
    what action they selected
  • Simplifying assumption
  • What if agents knew the joint action at each
    timestep?
  • Agents would only have to reason about possible
    observations
  • How can this be assured?

23
Ensuring Action Synchronization
  • Agents only allowed to choose actions based on
    information known to all team members
  • At start of execution, agents know
  • b0 - initial distribution over world states
  • a0 - optimal joint action given b0, based on the
    centralized policy
  • At each timestep, each agent computes Lt, the
    distribution of possible joint beliefs
  • Lt = {⟨bt, pt, ωt⟩}
  • ωt - observation history that led to bt
  • pt - likelihood of observing ωt
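
A minimal sketch of how Lt might be grown from Lt-1, assuming the joint action is common knowledge; the model helpers (joint_observations, observation_likelihood, belief_update) are assumed, not taken from the thesis:

def expand_possible_beliefs(L_prev, joint_action, model):
    """Grow the distribution of possible joint beliefs by one step: since the
    joint action is known to every agent, only the joint observations branch."""
    L_next = []
    for belief, prob, history in L_prev:
        for joint_obs in model.joint_observations():
            p_obs = model.observation_likelihood(belief, joint_action, joint_obs)
            if p_obs > 0.0:
                new_belief = model.belief_update(belief, joint_action, joint_obs)
                L_next.append((new_belief, prob * p_obs, history + [joint_obs]))
    return L_next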

24
Possible Joint Beliefs
[Figure: tree of possible joint beliefs. L0 holds the
initial belief b: P(SL) = 0.5, p(b) = 1.0. After joint
action a = ⟨Listen, Listen⟩, branches for ⟨HL,HL⟩,
⟨HR,HR⟩, ⟨HL,HR⟩, and ⟨HR,HL⟩ lead to the leaves of L1,
e.g. b: P(SL) = 0.5, p(b) = 0.21 and b: P(SL) = 0.2,
p(b) = 0.29]
How should agents select actions over joint
beliefs?
25
Q-POMDP Heuristic
  • Select joint action that maximizes expected
    reward over possible joint beliefs
  • Q-MDP [Littman et al., 1995]
  • approximate solution to large POMDP using
    underlying MDP
  • Q-POMDP [Roth et al., 2005]
  • approximate solution to Dec-POMDP using
    underlying single-agent POMDP
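
A minimal sketch of the Q-POMDP selection rule, assuming a helper q_value(belief, a) that returns the centralized POMDP value of joint action a at a belief:

def q_pomdp(L, joint_actions, q_value):
    """Q-POMDP: choose the joint action with the highest expected value over
    all possible joint beliefs (the leaves of L)."""
    def expected_value(a):
        return sum(prob * q_value(belief, a) for belief, prob, _ in L)
    return max(joint_actions, key=expected_value)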

26
Q-POMDP Heuristic
[Figure: the same tree of possible joint beliefs as on
the previous slide]
Choose the joint action by computing expected reward
over all leaves.
Agents will independently select the same joint
action, guaranteeing they avoid coordination
errors, but the action choice is very conservative
(always ⟨Listen, Listen⟩).
ACE-PJB-Comm: communication adds local
observations to the joint belief.
27
ACE-PJB-Comm Example

[Figure: the leaves of L1 (⟨HR,HL⟩, ⟨HL,HL⟩, ⟨HL,HR⟩,
⟨HR,HR⟩), with the nodes consistent with the agent's
own observations circled]
aNC = Q-POMDP(L1) = ⟨Listen, Listen⟩
L' = circled nodes
aC = Q-POMDP(L') = ⟨Listen, Listen⟩
aC = aNC, so don't communicate
28
ACE-PJB-Comm Example

[Figure: agent 1's local observations so far are HL, HL.
L1 has leaves ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩; after
another joint action a = ⟨Listen, Listen⟩, L2 has a leaf
for each pair of joint observations, with the nodes
consistent with agent 1's observations circled]
aNC = Q-POMDP(L2) = ⟨Listen, Listen⟩
L' = circled nodes
aC = Q-POMDP(L') = ⟨OpenR, OpenR⟩
V(aC) - V(aNC) > ε
Agent 1 communicates
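
The communication test illustrated above can be sketched as follows, reusing the q_pomdp sketch from earlier; consistent_with is an assumed helper that checks whether a possible joint belief agrees with the agent's own observation history, and epsilon is the communication threshold:

def should_communicate(L, local_history, joint_actions, q_value, epsilon):
    """Sketch of the ACE-PJB-Comm test: communicate when the joint action the
    team would choose knowing this agent's observations (aC) is worth more
    than epsilon over the action chosen without them (aNC)."""
    def value(leaves, a):
        return sum(p * q_value(b, a) for b, p, _ in leaves)

    a_nc = q_pomdp(L, joint_actions, q_value)            # action without communication
    L_local = [leaf for leaf in L if consistent_with(leaf, local_history)]
    a_c = q_pomdp(L_local, joint_actions, q_value)       # action if observations were shared
    return value(L_local, a_c) - value(L_local, a_nc) > epsilon, a_c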
29
ACE-PJB-Comm Example

[Figure: the same tree of possible joint beliefs, now
pruned to the leaves consistent with agent 1's
communicated observations ⟨HL, HL⟩]
Agent 1 communicates ⟨HL, HL⟩
Q-POMDP(L2) = ⟨OpenR, OpenR⟩
Agents open the right door!
30
ACE-PJB-Comm Results
  • 20,000 trials in 2-Agent Tiger Domain
  • 6 timesteps per trial
  • Agents communicate 49.7% fewer observations using
    ACE-PJB-Comm, and 93.3% fewer messages
  • Difference in expected reward because
    ACE-PJB-Comm is slightly pessimistic about
    outcome of communication

31
Additional Challenges
  • Number of possible joint beliefs grows
    exponentially
  • Use particle filter to model distribution of
    possible joint beliefs
  • ACE-PJB-Comm answers the question of when agents
    should communicate
  • Doesn't deal with what to communicate
  • Agents communicate all observations that they
    haven't previously communicated

32
Selective ACE-PJB-Comm [Roth et al., 2006]
  • Answers what agents should communicate
  • Chooses most valuable subset of observations
  • Hill-climbing heuristic to choose observations
    that push teams towards aC
  • aC - joint action that would be chosen if agent
    communicated all observations
  • See details in thesis document
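
A rough sketch of the hill-climbing idea, reusing the q_pomdp sketch from earlier; filter_consistent is an assumed helper that keeps the joint beliefs in L that agree with a given set of observations:

def build_message(local_history, L, joint_actions, q_value):
    """Greedily add the observation that most widens the gap between aC and
    aNC, until the team's choice over the pruned beliefs matches aC."""
    def value(leaves, a):
        return sum(p * q_value(b, a) for b, p, _ in leaves)

    a_nc = q_pomdp(L, joint_actions, q_value)
    a_c = q_pomdp(filter_consistent(L, local_history), joint_actions, q_value)
    message, remaining = [], list(local_history)
    while remaining and q_pomdp(filter_consistent(L, message), joint_actions, q_value) != a_c:
        def gap(obs):
            leaves = filter_consistent(L, message + [obs])
            return value(leaves, a_c) - value(leaves, a_nc)
        best = max(remaining, key=gap)
        message.append(best)
        remaining.remove(best)
    return message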

33
Selective ACE-PJB-Comm Results
  • 2-Agent Tiger domain
  • Communicates 28.7% fewer observations
  • Same expected reward
  • Slightly more messages

34
Outline
  • Dec-POMDP, Dec-MDP models
  • Impact of communication on complexity
  • Avoiding Coordination Errors by reasoning over
    Possible Joint Beliefs (ACE-PJB)
  • ACE-PJB-Comm: When should agents communicate?
  • Selective ACE-PJB-Comm: What should agents
    communicate?
  • Avoiding Coordination Errors by executing
    Individual Factored Policies (ACE-IFP)
  • Future directions

35
Dec-MDP
  • State is collectively observable
  • One agent can't identify the full state on its own
  • Union of team observations uniquely identifies
    state
  • Underlying problem is an MDP, not a POMDP
  • Dec-MDP has same complexity as Dec-POMDP
  • NEXP-Complete

36
Acting Independently
  • ACE-PJB requires agents to know joint action at
    every timestep
  • Claim: In many multi-agent domains, agents can
    act independently for long periods of time, only
    needing to coordinate infrequently

37
Meeting-Under-Uncertainty Domain
  • Agents must move to goal location and signal
    simultaneously
  • Reward
  • 20 - Both agents signal at goal
  • -50 - Both agents signal at another location
  • -100 - Only one agent signals
  • -1 - Agents move north, south, east, west, or
    stop

38
Factored Representations
  • Represent relationships among state variables
    instead of relationships among states

S = ⟨X0, Y0, X1, Y1⟩. Each agent observes its own
position.
39
Factored Representations
  • Dynamic Decision Network models state variables
    over time
  • at = ⟨East, …⟩

40
Tree-structured Policies
  • Decision tree that branches over state variables
  • A tree-structured joint policy has joint actions
    at the leaves

41
Approach [Roth et al., 2007]
  • Generate tree-structured joint policies for
    underlying centralized MDP
  • Use this joint policy to generate a
    tree-structured individual policy for each agent
  • Execute individual policies

See details in thesis document
42
Context-specific Independence
Claim: In many multi-agent domains, one agent's
individual policy will have large sections where
it is independent of variables that its teammates
observe.
43
Individual Policies
  • One agent's individual policy may depend on state
    features it doesn't observe

44
Avoid Coordination Errors by Executing an
Individual Factored Policy (ACE-IFP)
  • Robot traverses policy tree according to its
    observations
  • If it reaches a leaf, its action is independent
    of its teammates' observations
  • If it reaches a state variable that it does not
    observe directly, it must ask a teammate for the
    current value of that variable
  • The amount of communication needed to execute a
    particular policy corresponds to the amount of
    context-specific independence in that domain
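
A minimal sketch of this traversal; the tree representation and the query_teammate callback are illustrative assumptions:

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class PolicyNode:
    """A leaf holds an individual action; an internal node branches on a variable."""
    action: Optional[str] = None
    variable: Optional[str] = None
    children: Dict[object, "PolicyNode"] = field(default_factory=dict)

def act_with_ifp(node: PolicyNode, local_state: dict, query_teammate) -> str:
    """Traverse the individual factored policy, communicating only when a
    branch variable is not locally observed."""
    while node.action is None:
        if node.variable in local_state:
            value = local_state[node.variable]        # locally observed variable
        else:
            value = query_teammate(node.variable)     # ask a teammate for its value
        node = node.children[value]
    return node.action                                # leaf: independent of teammates' observations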

45
Avoid Coordination Errors by Executing an
Individual Factored Policy (ACE-IFP)
  • Benefits
  • Agents can act independently without reasoning
    about the possible observations or actions of
    their teammates
  • Policy directs agents about when, what, and with
    whom to communicate
  • Drawback
  • In domains with little independence, agents may
    need to communicate a lot

46
Experimental Results
  • In 3x3 domain, executing factored policy required
    less than half as many messages as full
    communication, with same reward
  • Communication usage decreases relative to full
    communication as domain size increases

47
Factored Dec-POMDPs
  • [Hansen and Feng, 2000] looked at factored POMDPs
  • ADD-representations of transition, observation,
    and reward functions
  • Policy is a finite-state controller
  • Nodes are actions
  • Transitions depend on conjunctions of state
    variable assignments
  • To extend to Dec-POMDP, make individual policy a
    finite-state controller among individual actions
  • Somehow combine nodes with the same action
  • Communicate to enable transitions between action
    nodes

48
Future Directions
  • Considering communication cost in ACE-IFP
  • All children of a particular variable may have
    similar values
  • Worst-case cost of mis-coordination?
  • Modeling teammate variables requires reasoning
    about possible teammate actions
  • Extending factoring to Dec-POMDPs

49
Future Directions
  • Knowledge persistence
  • Modeling teammates' variables
  • Can we identify necessary conditions?
  • e.g., "Tell me when you reach the goal."

[Figure: one robot repeatedly asking its teammate
"Are you here yet?"]
50
Contributions
  • Decentralized execution of centralized policies
  • Guarantee that agents will Avoid Coordination
    Errors
  • Make effective use of limited communication
    resources
  • When should agents communicate?
  • What should agents communicate?
  • Demonstrate significant communication savings in
    experimental domains

51
Contributions
[Figure: the related-work classification table (Tell /
Query / Sync, AND / OR, Cost, ACE, Unrestricted),
annotated with the questions When?, What?, and Who?]
52
Thank You!
  • Advisors Reid Simmons, Manuela Veloso
  • Committee Carlos Guestrin, Jeff Schneider,
    Milind Tambe
  • RI Folks Suzanne, Alik, Damion, Doug, Drew,
    Frank, Harini, Jeremy, Jonathan, Kristen, Rachel
    (and many others!)
  • Aba, Ema, Nitzan, Yoel

53
References
  • Roth, M., Simmons, R., and Veloso, M. "Reasoning
    About Joint Beliefs for Execution-Time
    Communication Decisions." In AAMAS, 2005.
  • Roth, M., Simmons, R., and Veloso, M. "What to
    Communicate? Execution-Time Decisions in
    Multi-agent POMDPs." In DARS, 2006.
  • Roth, M., Simmons, R., and Veloso, M. "Exploiting
    Factored Representations for Decentralized
    Execution in Multi-agent Teams." In AAMAS, 2007.
  • Bernstein, D., Zilberstein, S., and Immerman, N.
    "The Complexity of Decentralized Control of
    Markov Decision Processes." In UAI, 2000.
  • Pynadath, D. and Tambe, M. "The Communicative
    Multiagent Team Decision Problem: Analyzing
    Teamwork Theories and Models." In JAIR, 2002.
  • Becker, R., Zilberstein, S., Lesser, V., and
    Goldman, C. "Transition-independent Decentralized
    Markov Decision Processes." In AAMAS, 2003.
  • Nair, R., Roth, M., Yokoo, M., and Tambe, M.
    "Communication for Improving Policy Computation
    in Distributed POMDPs." In IJCAI, 2003.

54
Tiger Domain Details
55
Particle filter representation
  • Each particle is a possible joint belief
  • Each agent maintains two particle filters
  • Ljoint - possible joint team beliefs
  • Lown - possible joint beliefs that are consistent
    with the local observation history
  • Compare action selected by Q-POMDP over Ljoint to
    action selected over Lown and communicate as
    needed
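
One illustrative way to keep the set of possible joint beliefs bounded is a standard resampling step; the exact scheme used in the thesis may differ:

import random

def resample_joint_beliefs(particles, n):
    """Keep the set of possible joint beliefs bounded: each particle is a
    (belief, weight, history) triple, resampled in proportion to its weight."""
    weights = [w for _, w, _ in particles]
    kept = random.choices(particles, weights=weights, k=n)
    return [(b, 1.0 / n, h) for b, _, h in kept]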

56
Related Work: Transition Independence [Becker,
Zilberstein, Lesser, Goldman, 2003]
  • DEC-MDP collective observability
  • Transition independence
  • Local state transitions
  • Each agent observes local state
  • Individual actions only affect local state
    transitions
  • Team connected through joint reward
  • Coverage set algorithm finds optimal policy
    quickly in experimental domains
  • No communication

57
Related Work: COMM-JESP [Nair, Roth, Yokoo,
Tambe, 2004]
  • Add SYNC action to domain
  • If one agent chooses SYNC, all other agents SYNC
  • At SYNC, send entire observation history since
    last SYNC
  • SYNC brings agents to synchronized belief over
    world states
  • Policies indexed by root synchronized belief and
    observation history since last SYNC

[Figure: COMM-JESP policy tree. At t = 0 the synchronized
belief is (SL: 0.5, SR: 0.5) with a = ⟨Listen, Listen⟩;
branches for ω = HL and ω = HR hold beliefs indexed by the
observation history since the last SYNC; after a = SYNC the
agents return to a synchronized belief at t = 2, e.g.
(SL: 0.5, SR: 0.5) or (SL: 0.97, SR: 0.03)]
  • "At most K" heuristic: there must be a SYNC
    within at most K timesteps

58
Related Work: No news is good news [Xuan,
Lesser, Zilberstein, 2000]
  • Applies to transition-independent DEC-MDPs
  • Agents form joint plan
  • plan exact path to be followed to accomplish
    goal
  • Communicate when deviation from plan occurs
  • agent sees it has slipped from optimal path
  • communicates need for re-planning

59
Related Work: BaGA-Comm [Emery-Montemerlo,
2005]
  • Each agent has a type
  • Observation and action history
  • Agents model distribution of possible joint types
  • Choose actions by finding joint type closest to
    own local type
  • Allows coordination errors
  • Communicate if gain in expected reward is greater
    than cost of communication

60
Colorado/Wyoming Domain
  • Robots must meet in the capital, but do not know
    if they are in Colorado or Wyoming
  • Robots receive positive reward of 20 only if
    they SIGNAL simultaneously from correct goal
    location
  • To simplify problem, each robot knows both own
    and teammate position

[Figure: maps of Wyoming and Colorado, with the capital
marked]
61
Colorado/Wyoming Domain
  • Noisy observations: mountain, plain, Pikes Peak,
    Old Faithful
  • Communication can help team reach goal more
    efficiently

[Photos: Pikes Peak and Old Faithful]
62
Build-Message: What to Communicate
  • First, determine if communication is necessary
  • Calculate aC using ACE-PJB-Comm
  • If aC = aNC, do not communicate
  • Greedily build message
  • Hill-climbing towards aC, away from aNC
  • Choose the single observation that most increases
    the difference between the Q-POMDP values of aC and aNC

[Figure: the agent's local observation history: Mt, Pl,
Mt, Pike]
63
Build-Message: What to Communicate
  • Is communication necessary?

[Figure: with observation history Mt, Pl, Mt, Pike,
aNC = ⟨East, South⟩ and aC = ⟨East, West⟩]
aC ≠ aNC, so the agent should communicate
64
Build-Message: What to Communicate
Distribution if the agent communicates its entire
observation history:
[Figure: the resulting distribution over possible joint
beliefs]
aC = ⟨East, West⟩ - toward Denver
65
Build-Message: What to Communicate
aC = ⟨East, West⟩ - toward Denver
  • Pike is the single best observation
  • In this case, Pike is sufficient to change the joint
    action to aC, so the agent communicates only one
    observation
[Figure: message m = ⟨Pike⟩ chosen from the observation
history Mt, Pl, Mt, Pike]
66
Context-specific Independence
  • A variable may be independent of a parent
    variable in some contexts but not others
  • e.g. X2 depends on X3 when X1 has value 1, but
    is independent otherwise
  • Claim - Many multi-agent domains exhibit a large
    amount of context-specific independence

67
Constructing Individual Factored Policies
  • [Boutilier et al., 2000] defined Merge and
    Simplify operations for policy trees
  • We want to construct trees that maximize
    context-specific independence
  • Depends on variable ordering in policy
  • We define Intersect and Independent operations

68
Intersect
  • Find the intersection of the action sets of a
    node's children

1. If all children are leaves, and the action sets
have a non-empty intersection, replace the node
with the intersection
2. If all but one child is a leaf, and all the
actions in the non-leaf child's subtree are
included in the leaf-children's intersection,
replace with the non-leaf child
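
A sketch of these two rules applied bottom-up; the policy-tree representation is an assumption made for illustration:

class Node:
    """Policy-tree node: a leaf holds a set of acceptable actions; an internal
    node branches on a state variable."""
    def __init__(self, actions=None, variable=None, children=None):
        self.actions = actions          # set of actions (leaves only)
        self.variable = variable        # branch variable (internal nodes only)
        self.children = children or {}  # value -> child Node

    def is_leaf(self):
        return self.variable is None

def subtree_actions(node):
    """All actions appearing anywhere below this node."""
    if node.is_leaf():
        return set(node.actions)
    return set().union(*(subtree_actions(c) for c in node.children.values()))

def intersect(node):
    """Apply the two Intersect rules bottom-up and return the simplified node."""
    if node.is_leaf():
        return node
    node.children = {v: intersect(c) for v, c in node.children.items()}
    leaves = [c for c in node.children.values() if c.is_leaf()]
    non_leaves = [c for c in node.children.values() if not c.is_leaf()]
    common = set.intersection(*(set(c.actions) for c in leaves)) if leaves else set()
    # Rule 1: all children are leaves sharing a non-empty action intersection
    if not non_leaves and common:
        return Node(actions=common)
    # Rule 2: all but one child is a leaf, and every action in the non-leaf
    # child's subtree lies in the leaf children's intersection
    if len(non_leaves) == 1 and leaves and subtree_actions(non_leaves[0]) <= common:
        return non_leaves[0]
    return node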
69
Independent
  • An individual action is Independent in a
    particular leaf of a policy tree if it is optimal
    when paired with any action its teammate could
    perform at that leaf

[Figure: two example policy-tree leaves: in the first,
action a is independent for agent 1; in the second,
agent 1 has no independent actions]
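
A sketch of this test for the two-agent case; it assumes the set of optimal joint actions at the leaf is given, and that the actions the teammate "could perform" are those appearing in that set:

def independent_actions(optimal_joint_actions, agent_index):
    """Actions of agent `agent_index` that are optimal no matter which action
    the teammate performs at this leaf (two-agent case).
    optimal_joint_actions is the set of optimal joint actions (a1, a2)."""
    teammate = 1 - agent_index
    teammate_actions = {ja[teammate] for ja in optimal_joint_actions}
    my_actions = {ja[agent_index] for ja in optimal_joint_actions}

    def joint(mine, theirs):
        return (mine, theirs) if agent_index == 0 else (theirs, mine)

    return {a for a in my_actions
            if all(joint(a, b) in optimal_joint_actions for b in teammate_actions)}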
70
Generate Individual Policies
  • Generate a tree-structured joint policy
  • For each agent
  • Reorder variables in joint policy so that
    variables local to this agent are near the root
  • For each leaf in the policy, find the Independent
    actions
  • Break ties among remaining joint actions
  • Convert joint actions to individual actions
  • Intersect and Simplify

71
Why Break Ties?
  • Ensure agents select the same optimal joint
    action to prevent mis-coordination