Title: Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
1. Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
- Maayan Roth
- Thesis Defense
- Carnegie Mellon University
- September 4, 2007
2. Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability
- Cooperative teams
  - Agents work together to achieve a team reward
  - No individual motivations
- Uncertainty
  - Actions have stochastic outcomes
- Partial observability
  - Agents don't always know the world state
3. Coordinating When Communication is a Limited Resource
- Tight coordination
  - One agent's best action choice depends on the action choices of its teammates
  - We wish to Avoid Coordination Errors
- Limited communication
  - Communication costs
  - Limited bandwidth
4. Thesis Question
How can we effectively use communication to
enable the coordination of cooperative
multi-agent teams making sequential decisions
under uncertainty and partial observability?
5. Multi-Agent Sequential Decision Making
6. Thesis Statement
Reasoning about communication decisions at
execution-time provides a more tractable means
for coordinating teams of agents operating under
uncertainty and partial observability.
7. Thesis Contributions
- Algorithms that
  - Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution
  - Answer the questions of when and what agents should communicate
8. Outline
- Dec-POMDP model
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
9. Dec-POMDP Model
- Decentralized Partially Observable Markov Decision Process
- Multi-agent extension of the single-agent POMDP model
- Sequential decision-making in domains where
  - There is uncertainty in the outcome of actions
  - There is partial observability: uncertainty about the world state
10. Dec-POMDP Model
- M = ⟨m, S, {A_i}_{i∈m}, T, {Ω_i}_{i∈m}, O, R⟩
  - m is the number of agents
  - S is the set of possible world states
  - {A_i}_{i∈m} is the set of joint actions ⟨a_1, …, a_m⟩, where a_i ∈ A_i
  - T defines transition probabilities over joint actions
  - {Ω_i}_{i∈m} is the set of joint observations ⟨ω_1, …, ω_m⟩, where ω_i ∈ Ω_i
  - O defines observation probabilities over joint actions and joint observations
  - R is the team reward function
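To make the tuple concrete, here is a minimal sketch of the model as a Python container; the field names and callable signatures are illustrative assumptions, not notation from the thesis itself.

```python
# Minimal sketch of the Dec-POMDP tuple as a Python container.
# Signatures are assumptions: T, O, and R are indexed by joint actions
# (tuples of individual actions) and, for O, by joint observations.
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]       # <a_1, ..., a_m>, with a_i in A_i
JointObservation = Tuple[str, ...]  # <w_1, ..., w_m>, with w_i in Omega_i

@dataclass
class DecPOMDP:
    m: int                                  # number of agents
    S: List[str]                            # possible world states
    A: List[List[str]]                      # A_i: individual actions per agent
    T: Callable[[str, JointAction, str], float]               # P(s' | s, a)
    Omega: List[List[str]]                  # Omega_i: individual observations
    O: Callable[[str, JointAction, JointObservation], float]  # P(w | s', a)
    R: Callable[[str, JointAction], float]                    # team reward
```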
11. Dec-POMDP Complexity
- Goal: compute a policy which, for each agent, maps its local observation history to an action
- For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete
- Agents must reason about the possible actions and observations of their teammates
12. Impact of Communication on Complexity [Pynadath and Tambe, 2002]
- If communication is free
  - The Dec-POMDP is reducible to a single-agent POMDP
  - The optimal communication policy is to communicate at every time step
- When communication has any cost, the Dec-POMDP is still intractable (NEXP-complete)
  - Agents must reason about the value of information
13. Classifying Communication Heuristics
- AND- vs. OR-communication [Emery-Montemerlo, 2005]
  - AND-communication does not replace domain-level actions
  - OR-communication does replace domain-level actions
- Initiating communication [Xuan et al., 2001]
  - Tell: an agent decides to tell local information to its teammates
  - Query: an agent asks a teammate for information
  - Sync: all agents broadcast all information simultaneously
14. Classifying Communication Heuristics
- Does the algorithm consider communication cost?
- Is the algorithm applicable to
  - General Dec-POMDP domains?
  - General Dec-MDP domains?
  - Restricted domains?
- Are the agents guaranteed to Avoid Coordination Errors?
15. Related Work
[Figure: table classifying related work along the dimensions Tell, Query, Sync, AND, OR, Cost, Unrestricted, and ACE]
16. Overall Approach
- Recall: if communication is free, a Dec-POMDP can be treated like a single-agent POMDP
- 1) At plan-time, pretend communication is free
  - Generate a centralized policy for the team
- 2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors
17. Outline
- Dec-POMDP, Dec-MDP models
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
18. Tiger Domain (States, Actions)
- Two-agent tiger problem [Nair et al., 2003]
- Individual actions: a_i ∈ {OpenL, OpenR, Listen}. A robot can open the left door, open the right door, or listen.
- States: S = {SL, SR}. The tiger is either behind the left door or behind the right door.
19. Tiger Domain (Observations)
- Individual observations: ω_i ∈ {HL, HR}. A robot can hear the tiger behind the left door or hear the tiger behind the right door.
- Observations are noisy and independent.
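Because the later slides turn on how joint observations shift the team's belief, here is a worked Bayes update for this domain. The 0.7 hearing accuracy is an assumption chosen to reproduce the leaf likelihoods (p = 0.29 and p = 0.21) shown on the Possible Joint Beliefs slide.

```python
# Worked joint belief update for the two-agent tiger problem.
# Assumption: each Listen observation is correct with probability 0.7;
# agents' observations are independent given the state.
P_HEAR_CORRECT = 0.7

def update_belief(b_SL, joint_obs):
    """Return (posterior P(SL), likelihood of joint_obs) after <Listen, Listen>."""
    p_obs_SL, p_obs_SR = 1.0, 1.0
    for w in joint_obs:  # observations are independent given the state
        p_obs_SL *= P_HEAR_CORRECT if w == "HL" else 1 - P_HEAR_CORRECT
        p_obs_SR *= P_HEAR_CORRECT if w == "HR" else 1 - P_HEAR_CORRECT
    p_obs = p_obs_SL * b_SL + p_obs_SR * (1 - b_SL)
    return p_obs_SL * b_SL / p_obs, p_obs

print(update_belief(0.5, ("HL", "HL")))  # (~0.84, 0.29): strong evidence for SL
print(update_belief(0.5, ("HL", "HR")))  # (0.5, 0.21): contradictory, no change
```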
20. Tiger Domain (Reward)
- Coordination problem: agents must act together for maximum reward
- Listen has a small cost (-1 per agent)
- Both agents opening the door with the tiger leads to a medium negative reward (-50)
- Maximum reward (+20) when both agents open the door with the treasure
- Minimum reward (-100) when only one agent opens the door with the tiger
21. Coordination Errors
Reward(⟨OpenR, OpenL⟩) = -100; Reward(⟨OpenL, OpenL⟩) = -50
[Figure: after the observation history HL, HL, HL, agent 1 selects a1 = OpenR while agent 2 selects a2 = OpenL]
Agents Avoid Coordination Errors when each agent's action is a best response to its teammates' actions.
22. Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB)
- The centralized POMDP policy maps joint beliefs to joint actions
  - Joint belief (b_t): distribution over world states
- Individual agents can't compute the joint belief
  - They don't know what their teammates have observed or what actions they selected
- Simplifying assumption
  - What if agents knew the joint action at each timestep?
  - Agents would only have to reason about possible observations
  - How can this be assured?
23. Ensuring Action Synchronization
- Agents are only allowed to choose actions based on information known to all team members
- At the start of execution, agents know
  - b_0: initial distribution over world states
  - a_0: optimal joint action given b_0, based on the centralized policy
- At each timestep, each agent computes L_t, the distribution of possible joint beliefs
  - Each element of L_t is a tuple ⟨b_t, p_t, ω_t⟩
  - ω_t: observation history that led to b_t
  - p_t: likelihood of observing ω_t
24. Possible Joint Beliefs
[Figure: belief tree. Root (L0): b: P(SL) = 0.5, p(b) = 1.0. After a = ⟨Listen, Listen⟩, the joint observations ⟨HL,HL⟩, ⟨HR,HR⟩, ⟨HL,HR⟩, ⟨HR,HL⟩ lead to the leaves of L1, e.g. b: P(SL) = 0.2, p(b) = 0.29 and b: P(SL) = 0.5, p(b) = 0.21]
How should agents select actions over joint beliefs?
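A sketch of how each agent could grow this tree by one step, assuming the joint action is known (as ensured on the previous slide). Here `update` (the standard POMDP belief update) and `obs_likelihood` (P(ω | b, a)) are hypothetical helpers.

```python
# Sketch of one expansion step of the tree of possible joint beliefs.
# Omega is the list of per-agent observation sets.
import itertools

def expand(L_t, a, Omega, update, obs_likelihood):
    """L_t: list of (b, p, history) leaves; a: the known joint action."""
    L_next = []
    for b, p, history in L_t:
        for w in itertools.product(*Omega):   # every joint observation
            p_w = obs_likelihood(b, a, w)
            if p_w > 0:
                L_next.append((update(b, a, w), p * p_w, history + [w]))
    return L_next
```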
25. Q-POMDP Heuristic
- Select the joint action that maximizes expected reward over the possible joint beliefs
- Q-MDP [Littman et al., 1995]
  - approximates the solution to a large POMDP using the underlying MDP
- Q-POMDP [Roth et al., 2005]
  - approximates the solution to a Dec-POMDP using the underlying single-agent POMDP
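In code, the Q-POMDP choice might look like the following sketch, where `Q` is the centralized single-agent POMDP value function Q(b, a) (hypothetical here) and `L_t` holds the ⟨b, p, ω⟩ leaves:

```python
# Sketch of Q-POMDP action selection: pick the joint action that
# maximizes expected value over the possible joint beliefs in L_t.
def q_pomdp(L_t, joint_actions, Q):
    return max(joint_actions,
               key=lambda a: sum(p * Q(b, a) for b, p, _ in L_t))
```

Because every agent evaluates this over the same shared L_t, all agents independently arrive at the same joint action.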
26. Q-POMDP Heuristic
[Figure: the belief tree from slide 24; the joint action is chosen by computing expected reward over all leaves]
Agents will independently select the same joint action, guaranteeing that they avoid coordination errors...
...but the action choice is very conservative (always ⟨Listen, Listen⟩).
ACE-PJB-Comm: communication adds local observations to the joint belief.
27. ACE-PJB-Comm Example
[Figure: joint belief tree L1 with leaves for ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩; the nodes consistent with the agent's own observations are circled]
aNC = Q-POMDP(L1) = ⟨Listen, Listen⟩
L = circled nodes
aC = Q-POMDP(L) = ⟨Listen, Listen⟩
aC = aNC, so don't communicate
28. ACE-PJB-Comm Example
[Figure: agent 1 has observed HL, HL. After a = ⟨Listen, Listen⟩ at each step, L2 contains a leaf for each two-step joint observation history, from ⟨HL,HL⟩⟨HL,HL⟩ through ⟨HL,HR⟩⟨HR,HR⟩; the nodes consistent with agent 1's observations are circled]
aNC = Q-POMDP(L2) = ⟨Listen, Listen⟩
L = circled nodes
aC = Q-POMDP(L) = ⟨OpenR, OpenR⟩
V(aC) - V(aNC) > ε, so agent 1 communicates
29. ACE-PJB-Comm Example
[Figure: the same joint belief tree L2, pruned to the histories consistent with the communicated observations]
Agent 1 communicates ⟨HL, HL⟩
Q-POMDP(L2) = ⟨OpenR, OpenR⟩
Agents open the right door!
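The communication test illustrated in this example can be sketched as follows, reusing the q_pomdp sketch above. L_joint is the full tree of possible joint beliefs, L_own keeps only the circled leaves consistent with the agent's own observations, and ε is the communication threshold; all names are illustrative.

```python
# Hedged sketch of the ACE-PJB-Comm test: communicate only when sharing
# local observations would shift the joint action by more than epsilon
# in expected value.
def expected_value(a, L, Q):
    total = sum(p for _, p, _ in L)
    return sum(p * Q(b, a) for b, p, _ in L) / total   # renormalize over L

def should_communicate(L_joint, L_own, joint_actions, Q, epsilon):
    a_nc = q_pomdp(L_joint, joint_actions, Q)  # action if no one communicates
    a_c = q_pomdp(L_own, joint_actions, Q)     # action if local obs were shared
    return expected_value(a_c, L_own, Q) - expected_value(a_nc, L_own, Q) > epsilon
```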
30. ACE-PJB-Comm Results
- 20,000 trials in the 2-Agent Tiger Domain
  - 6 timesteps per trial
- Agents communicate 49.7% fewer observations using ACE-PJB-Comm, and send 93.3% fewer messages
- The difference in expected reward arises because ACE-PJB-Comm is slightly pessimistic about the outcome of communication
31. Additional Challenges
- The number of possible joint beliefs grows exponentially
  - Use a particle filter to model the distribution of possible joint beliefs
- ACE-PJB-Comm answers the question of when agents should communicate
  - It doesn't address what to communicate
  - Agents communicate all observations that they haven't previously communicated
32. Selective ACE-PJB-Comm [Roth et al., 2006]
- Answers what agents should communicate
- Chooses the most valuable subset of observations
  - Hill-climbing heuristic to choose observations that push the team toward aC
  - aC: the joint action that would be chosen if the agent communicated all of its observations
- See details in the thesis document
33. Selective ACE-PJB-Comm Results
- 2-Agent Tiger domain
- Communicates 28.7% fewer observations
- Same expected reward
- Slightly more messages
34. Outline
- Dec-POMDP, Dec-MDP models
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
35. Dec-MDP
- State is collectively observable
  - One agent can't identify the full state on its own
  - The union of the team's observations uniquely identifies the state
- The underlying problem is an MDP, not a POMDP
- A Dec-MDP has the same complexity as a Dec-POMDP
  - NEXP-complete
36. Acting Independently
- ACE-PJB requires agents to know the joint action at every timestep
- Claim: in many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently
37. Meeting-Under-Uncertainty Domain
- Agents must move to the goal location and signal simultaneously
- Reward
  - +20: both agents signal at the goal
  - -50: both agents signal at another location
  - -100: only one agent signals
  - -1: agents move north, south, east, west, or stop
38. Factored Representations
- Represent relationships among state variables instead of relationships among states
- S = ⟨X0, Y0, X1, Y1⟩; each agent observes its own position
39. Factored Representations
- A Dynamic Decision Network models state variables over time
[Figure: DDN for the joint action a_t = ⟨East, ·⟩]
40. Tree-structured Policies
- A decision tree that branches over state variables
- A tree-structured joint policy has joint actions at the leaves
41. Approach [Roth et al., 2007]
- Generate a tree-structured joint policy for the underlying centralized MDP
- Use this joint policy to generate a tree-structured individual policy for each agent
- Execute the individual policies
- See details in the thesis document
42. Context-specific Independence
Claim: in many multi-agent domains, one agent's individual policy will have large sections where it is independent of the variables that its teammates observe.
43. Individual Policies
- One agent's individual policy may depend on state features it doesn't observe
44. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
- The robot traverses the policy tree according to its observations (see the sketch below)
- If it reaches a leaf, its action is independent of its teammates' observations
- If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable
- The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in the domain
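A minimal sketch of this traversal, under assumed types: an internal policy-tree node tests one state variable and branches on its value, a leaf holds an individual action, and `ask_teammate` stands in for the query mechanism.

```python
# Sketch of decentralized execution of an individual factored policy.
def act(node, local_obs, ask_teammate):
    """local_obs: dict of locally observed variables; ask_teammate:
    callable that queries whichever teammate observes a variable."""
    while not node.is_leaf():
        if node.variable in local_obs:       # observed locally: no message
            value = local_obs[node.variable]
        else:                                # communicate: query a teammate
            value = ask_teammate(node.variable)
        node = node.children[value]
    return node.action  # at a leaf: independent of teammates' observations
```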
45. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
- Benefits
  - Agents can act independently, without reasoning about the possible observations or actions of their teammates
  - The policy directs agents about when, what, and with whom to communicate
- Drawback
  - In domains with little independence, agents may need to communicate a lot
46. Experimental Results
- In the 3x3 domain, executing the factored policy required less than half as many messages as full communication, with the same reward
- Communication usage decreases relative to full communication as domain size increases
47. Factored Dec-POMDPs
- [Hansen and Feng, 2000] looked at factored POMDPs
  - ADD representations of the transition, observation, and reward functions
  - The policy is a finite-state controller
    - Nodes are actions
    - Transitions depend on conjunctions of state-variable assignments
- To extend this to Dec-POMDPs, make each individual policy a finite-state controller over individual actions
  - Somehow combine nodes with the same action
  - Communicate to enable transitions between action nodes
48. Future Directions
- Considering communication cost in ACE-IFP
  - All children of a particular variable may have similar values
  - What is the worst-case cost of mis-coordination?
- Modeling teammate variables requires reasoning about possible teammate actions
- Extending factoring to Dec-POMDPs
49. Future Directions
- Knowledge persistence
  - Modeling teammates' variables
  - Can we identify necessary conditions?
  - e.g. "Tell me when you reach the goal."
[Figure: an agent repeatedly asking "Are you here yet?"]
50. Contributions
- Decentralized execution of centralized policies
  - Guarantee that agents will Avoid Coordination Errors
- Make effective use of limited communication resources
  - When should agents communicate?
  - What should agents communicate?
- Demonstrate significant communication savings in experimental domains
51. Contributions
[Figure: the related-work table from slide 15 (Tell, Query, Sync, AND, OR, Cost, Unrestricted, ACE), annotated with the When?, What?, and Who? questions addressed by this thesis]
52. Thank You!
- Advisors: Reid Simmons, Manuela Veloso
- Committee: Carlos Guestrin, Jeff Schneider, Milind Tambe
- RI folks: Suzanne, Alik, Damion, Doug, Drew, Frank, Harini, Jeremy, Jonathan, Kristen, Rachel (and many others!)
- Aba, Ema, Nitzan, Yoel
53. References
- Roth, M., Simmons, R., and Veloso, M. "Reasoning About Joint Beliefs for Execution-Time Communication Decisions." In AAMAS, 2005.
- Roth, M., Simmons, R., and Veloso, M. "What to Communicate? Execution-Time Decisions in Multi-agent POMDPs." In DARS, 2006.
- Roth, M., Simmons, R., and Veloso, M. "Exploiting Factored Representations for Decentralized Execution in Multi-agent Teams." In AAMAS, 2007.
- Bernstein, D., Zilberstein, S., and Immerman, N. "The Complexity of Decentralized Control of Markov Decision Processes." In UAI, 2000.
- Pynadath, D. and Tambe, M. "The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models." In JAIR, 2002.
- Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. "Transition-independent Decentralized Markov Decision Processes." In AAMAS, 2003.
- Nair, R., Roth, M., Yokoo, M., and Tambe, M. "Communication for Improving Policy Computation in Distributed POMDPs." In IJCAI, 2003.
54. Tiger Domain Details
55. Particle filter representation
- Each particle is a possible joint belief
- Each agent maintains two particle filters
  - L_joint: possible joint team beliefs
  - L_own: possible joint beliefs that are consistent with the local observation history
- Compare the action selected by Q-POMDP over L_joint to the action selected over L_own, and communicate as needed
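One propagation step of this approximation might look like the following sketch, where each particle is one possible joint belief; `sample_obs` (draws a joint observation ω ~ P(ω | b, a)) and `update` (the POMDP belief update) are hypothetical helpers.

```python
# Sketch of one particle-filter step over possible joint beliefs.
import random

def propagate(particles, a, sample_obs, update, n_particles):
    new_particles = []
    for _ in range(n_particles):
        b = random.choice(particles)   # resample a possible joint belief
        w = sample_obs(b, a)           # sample one joint observation
        new_particles.append(update(b, a, w))
    return new_particles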
56. Related Work: Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
- Dec-MDP: collective observability
- Transition independence
  - Local state transitions
  - Each agent observes its local state
  - Individual actions only affect local state transitions
- The team is connected through the joint reward
- The coverage set algorithm finds the optimal policy quickly in experimental domains
- No communication
57. Related Work: COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
- Add a SYNC action to the domain
- If one agent chooses SYNC, all other agents SYNC
- At SYNC, agents send their entire observation histories since the last SYNC
- SYNC brings agents to a synchronized belief over world states
- Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Figure: belief tree. t = 0: (SL: 0.5, SR: 0.5); a = ⟨Listen, Listen⟩; branches ω = HL and ω = HR lead to beliefs such as (SL (HL): 0.7225, SR (HL): 0.0225); after a = SYNC, the t = 2 synchronized beliefs are (SL: 0.97, SR: 0.03) or (SL: 0.5, SR: 0.5)]
- "At most K" heuristic: there must be a SYNC within at most K timesteps
58. Related Work: No News is Good News [Xuan, Lesser, Zilberstein, 2000]
- Applies to transition-independent Dec-MDPs
- Agents form a joint plan
  - They plan the exact path to be followed to accomplish the goal
- Agents communicate when a deviation from the plan occurs
  - An agent sees that it has slipped from the optimal path
  - It communicates the need for re-planning
59. Related Work: BaGA-Comm [Emery-Montemerlo, 2005]
- Each agent has a type
  - Its observation and action history
- Agents model the distribution of possible joint types
- They choose actions by finding the joint type closest to their own local type
  - Allows coordination errors
- Communicate if the gain in expected reward is greater than the cost of communication
60. Colorado/Wyoming Domain
- Robots must meet in the capital, but do not know whether they are in Colorado or Wyoming
- Robots receive a positive reward (+20) only if they SIGNAL simultaneously from the correct goal location
- To simplify the problem, each robot knows both its own and its teammate's position
[Figure: maps of Wyoming and Colorado with the capital marked]
61. Colorado/Wyoming Domain
- Noisy observations: mountain, plain, Pikes Peak, Old Faithful
- Communication can help the team reach the goal more efficiently
[Figure: Pikes Peak and Old Faithful landmarks]
62. Build-Message: What to Communicate
- First, determine whether communication is necessary
  - Calculate aC using ACE-PJB-Comm
  - If aC = aNC, do not communicate
- Greedily build the message (see the sketch below)
  - Hill-climb toward aC, away from aNC
  - Choose the single observation that most increases the difference between the Q-POMDP values of aC and aNC
[Figure: local observation history Mt, Pl, Mt, Pike]
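A sketch of the greedy loop just described; `value_if_sent(a, msg)` is a hypothetical helper giving the Q-POMDP value of joint action a over the joint beliefs consistent with having communicated the observations in msg (full details are in the thesis document).

```python
# Greedy sketch of Build-Message. a_c / a_nc are the joint actions
# with and without full communication, as computed by ACE-PJB-Comm.
def build_message(local_obs, a_c, a_nc, value_if_sent):
    message, remaining = [], list(local_obs)
    while remaining:
        # Hill-climb: add the observation that most separates a_c from a_nc.
        best = max(remaining,
                   key=lambda w: value_if_sent(a_c, message + [w]) -
                                 value_if_sent(a_nc, message + [w]))
        message.append(best)
        remaining.remove(best)
        if value_if_sent(a_c, message) > value_if_sent(a_nc, message):
            break   # message already shifts the team's choice to a_c
    return message
```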
63. Build-Message: What to Communicate
- Is communication necessary?
[Figure: local observation history Mt, Pl, Mt, Pike]
aNC = ⟨east, south⟩
aC = ⟨east, west⟩
aC ≠ aNC, so the agent should communicate
64. Build-Message: What to Communicate
[Figure: the distribution over joint beliefs if the agent communicates its entire observation history Mt, Pl, Mt, Pike]
aC = ⟨east, west⟩ (toward Denver)
65. Build-Message: What to Communicate
aC = ⟨east, west⟩ (toward Denver)
- PIKE is the single best observation
- In this case, PIKE is sufficient to change the joint action to aC, so the agent communicates only one observation
m = ⟨Pike⟩
66. Context-specific Independence
- A variable may be independent of a parent variable in some contexts but not others
  - e.g., X2 depends on X3 when X1 has value 1, but is independent of X3 otherwise
- Claim: many multi-agent domains exhibit a large amount of context-specific independence
67. Constructing Individual Factored Policies
- [Boutilier et al., 2000] defined Merge and Simplify operations for policy trees
- We want to construct trees that maximize context-specific independence
  - This depends on the variable ordering in the policy
- We define Intersect and Independent operations
68. Intersect
- Find the intersection of the action sets of a node's children
1. If all children are leaves and their action sets have a non-empty intersection, replace the node with that intersection
2. If all but one child is a leaf, and all the actions in the non-leaf child's subtree are included in the leaf children's intersection, replace the node with the non-leaf child
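A sketch of these two rules on hypothetical policy-tree types, where a Leaf carries the set of equally optimal individual actions and an internal Node carries the children it branches to over one state variable:

```python
# Sketch of the Intersect operation; Leaf/Node are assumed types.
from dataclasses import dataclass, field
from typing import List, Set, Union

@dataclass
class Leaf:
    actions: Set[str]            # equally optimal actions at this leaf

@dataclass
class Node:
    variable: str                # state variable tested at this node
    children: List[Union["Node", Leaf]] = field(default_factory=list)

def actions_in(tree):
    """All actions appearing anywhere in a (sub)tree."""
    if isinstance(tree, Leaf):
        return set(tree.actions)
    return set.union(*(actions_in(c) for c in tree.children))

def intersect(tree):
    if isinstance(tree, Leaf):
        return tree
    tree.children = [intersect(c) for c in tree.children]
    leaves = [c for c in tree.children if isinstance(c, Leaf)]
    inner = [c for c in tree.children if not isinstance(c, Leaf)]
    common = set.intersection(*(c.actions for c in leaves)) if leaves else set()
    if not inner and common:
        return Leaf(common)      # rule 1: all children are leaves
    if len(inner) == 1 and leaves and actions_in(inner[0]) <= common:
        return inner[0]          # rule 2: fold into the single non-leaf child
    return tree
```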
69. Independent
- An individual action is Independent at a particular leaf of a policy tree if it is optimal when paired with any action its teammate could perform at that leaf
[Figure: one leaf where action a is independent for agent 1, and another where agent 1 has no independent actions]
70. Generate Individual Policies
- Generate a tree-structured joint policy
- For each agent
  - Reorder variables in the joint policy so that the variables local to this agent are near the root
  - For each leaf in the policy, find the Independent actions
  - Break ties among the remaining joint actions
  - Convert joint actions to individual actions
  - Intersect and Simplify
71. Why Break Ties?
- Ensure agents select the same optimal joint action, to prevent mis-coordination