Title: Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
1. Execution-Time Communication Decisions for Coordination of Multi-Agent Teams
- Maayan Roth
- Thesis Defense
- Carnegie Mellon University
- September 4, 2007
2. Cooperative Multi-Agent Teams Operating Under Uncertainty and Partial Observability
- Cooperative teams
  - Agents work together to achieve a team reward
  - No individual motivations
- Uncertainty
  - Actions have stochastic outcomes
- Partial observability
  - Agents don't always know the world state
3. Coordinating When Communication is a Limited Resource
- Tight coordination
  - One agent's best action choice depends on the action choices of its teammates
  - We wish to Avoid Coordination Errors
- Limited communication
  - Communication costs
  - Limited bandwidth
4. Thesis Question
How can we effectively use communication to
enable the coordination of cooperative
multi-agent teams making sequential decisions
under uncertainty and partial observability?
5. Multi-Agent Sequential Decision Making
6. Thesis Statement
Reasoning about communication decisions at
execution-time provides a more tractable means
for coordinating teams of agents operating under
uncertainty and partial observability.
7. Thesis Contributions
- Algorithms that
  - Guarantee agents will Avoid Coordination Errors (ACE) during decentralized execution
  - Answer the questions of when and what agents should communicate
8. Outline
- Dec-POMDP model
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
9. Dec-POMDP Model
- Decentralized Partially Observable Markov Decision Process
- Multi-agent extension of the single-agent POMDP model
- Sequential decision-making in domains where
  - There is uncertainty in the outcome of actions
  - There is partial observability: uncertainty about the world state
10. Dec-POMDP Model
- M = ⟨m, S, {A_i}_{i∈m}, T, {Ω_i}_{i∈m}, O, R⟩
  - m is the number of agents
  - S is the set of possible world states
  - {A_i}_{i∈m} is the set of joint actions ⟨a_1, …, a_m⟩, where a_i ∈ A_i
  - T defines transition probabilities over joint actions
  - {Ω_i}_{i∈m} is the set of joint observations ⟨ω_1, …, ω_m⟩, where ω_i ∈ Ω_i
  - O defines observation probabilities over joint actions and joint observations
  - R is the team reward function
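To make the tuple concrete, here is a minimal sketch of the model as a Python container; the field names and callable signatures are illustrative assumptions, not notation from the thesis itself.

```python
# Minimal sketch of the Dec-POMDP tuple as a Python container.
# Signatures are assumptions: T, O, and R are indexed by joint actions
# (tuples of individual actions) and, for O, by joint observations.
from dataclasses import dataclass
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]       # <a_1, ..., a_m>, with a_i in A_i
JointObservation = Tuple[str, ...]  # <w_1, ..., w_m>, with w_i in Omega_i

@dataclass
class DecPOMDP:
    m: int                                  # number of agents
    S: List[str]                            # possible world states
    A: List[List[str]]                      # A_i: individual actions per agent
    T: Callable[[str, JointAction, str], float]               # P(s' | s, a)
    Omega: List[List[str]]                  # Omega_i: individual observations
    O: Callable[[str, JointAction, JointObservation], float]  # P(w | s', a)
    R: Callable[[str, JointAction], float]                    # team reward
```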
11. Dec-POMDP Complexity
- Goal: compute a policy which, for each agent, maps its local observation history to an action
- For all m ≥ 2, a Dec-POMDP with m agents is NEXP-complete
- Agents must reason about the possible actions and observations of their teammates
12. Impact of Communication on Complexity [Pynadath and Tambe, 2002]
- If communication is free
  - The Dec-POMDP is reducible to a single-agent POMDP
  - The optimal communication policy is to communicate at every time step
- When communication has any cost, the Dec-POMDP is still intractable (NEXP-complete)
  - Agents must reason about the value of information
13. Classifying Communication Heuristics
- AND- vs. OR-communication [Emery-Montemerlo, 2005]
  - AND-communication does not replace domain-level actions
  - OR-communication does replace domain-level actions
- Initiating communication [Xuan et al., 2001]
  - Tell: an agent decides to tell local information to its teammates
  - Query: an agent asks a teammate for information
  - Sync: all agents broadcast all information simultaneously
14. Classifying Communication Heuristics
- Does the algorithm consider communication cost?
- Is the algorithm applicable to
  - General Dec-POMDP domains?
  - General Dec-MDP domains?
  - Restricted domains?
- Are the agents guaranteed to Avoid Coordination Errors?
15. Related Work
[Figure: table classifying related work along the dimensions Tell, Query, Sync, AND, OR, Cost, Unrestricted, and ACE]
16. Overall Approach
- Recall: if communication is free, a Dec-POMDP can be treated like a single-agent POMDP
- 1) At plan-time, pretend communication is free
  - Generate a centralized policy for the team
- 2) At execution-time, use communication to enable decentralized execution of this policy while Avoiding Coordination Errors
17. Outline
- Dec-POMDP, Dec-MDP models
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
18. Tiger Domain (States, Actions)
- Two-agent tiger problem [Nair et al., 2003]
- Individual actions: a_i ∈ {OpenL, OpenR, Listen}. A robot can open the left door, open the right door, or listen.
- States: S = {SL, SR}. The tiger is either behind the left door or behind the right door.
19. Tiger Domain (Observations)
- Individual observations: ω_i ∈ {HL, HR}. A robot can hear the tiger behind the left door or hear the tiger behind the right door.
- Observations are noisy and independent.
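Because the later slides turn on how joint observations shift the team's belief, here is a worked Bayes update for this domain. The 0.7 hearing accuracy is an assumption chosen to reproduce the leaf likelihoods (p = 0.29 and p = 0.21) shown on the Possible Joint Beliefs slide.

```python
# Worked joint belief update for the two-agent tiger problem.
# Assumption: each Listen observation is correct with probability 0.7;
# agents' observations are independent given the state.
P_HEAR_CORRECT = 0.7

def update_belief(b_SL, joint_obs):
    """Return (posterior P(SL), likelihood of joint_obs) after <Listen, Listen>."""
    p_obs_SL, p_obs_SR = 1.0, 1.0
    for w in joint_obs:  # observations are independent given the state
        p_obs_SL *= P_HEAR_CORRECT if w == "HL" else 1 - P_HEAR_CORRECT
        p_obs_SR *= P_HEAR_CORRECT if w == "HR" else 1 - P_HEAR_CORRECT
    p_obs = p_obs_SL * b_SL + p_obs_SR * (1 - b_SL)
    return p_obs_SL * b_SL / p_obs, p_obs

print(update_belief(0.5, ("HL", "HL")))  # (~0.84, 0.29): strong evidence for SL
print(update_belief(0.5, ("HL", "HR")))  # (0.5, 0.21): contradictory, no change
```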
20. Tiger Domain (Reward)
- Coordination problem: agents must act together for maximum reward
- Listen has a small cost (-1 per agent)
- Both agents opening the door with the tiger leads to a medium negative reward (-50)
- Maximum reward (+20) when both agents open the door with the treasure
- Minimum reward (-100) when only one agent opens the door with the tiger
21. Coordination Errors
Reward(⟨OpenR, OpenL⟩) = -100; Reward(⟨OpenL, OpenL⟩) = -50
[Figure: after the observation history HL, HL, HL, agent 1 selects a1 = OpenR while agent 2 selects a2 = OpenL]
Agents Avoid Coordination Errors when each agent's action is a best response to its teammates' actions.
22. Avoid Coordination Errors by Reasoning Over Possible Joint Beliefs (ACE-PJB)
- The centralized POMDP policy maps joint beliefs to joint actions
  - Joint belief (b_t): distribution over world states
- Individual agents can't compute the joint belief
  - They don't know what their teammates have observed or what actions they selected
- Simplifying assumption
  - What if agents knew the joint action at each timestep?
  - Agents would only have to reason about possible observations
  - How can this be assured?
23. Ensuring Action Synchronization
- Agents are only allowed to choose actions based on information known to all team members
- At the start of execution, agents know
  - b_0: initial distribution over world states
  - a_0: optimal joint action given b_0, based on the centralized policy
- At each timestep, each agent computes L_t, the distribution of possible joint beliefs
  - Each element of L_t is a tuple ⟨b_t, p_t, ω_t⟩
  - ω_t: observation history that led to b_t
  - p_t: likelihood of observing ω_t
24. Possible Joint Beliefs
[Figure: belief tree. Root (L0): b: P(SL) = 0.5, p(b) = 1.0. After a = ⟨Listen, Listen⟩, the joint observations ⟨HL,HL⟩, ⟨HR,HR⟩, ⟨HL,HR⟩, ⟨HR,HL⟩ lead to the leaves of L1, e.g. b: P(SL) = 0.2, p(b) = 0.29 and b: P(SL) = 0.5, p(b) = 0.21]
How should agents select actions over joint beliefs?
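A sketch of how each agent could grow this tree by one step, assuming the joint action is known (as ensured on the previous slide). Here `update` (the standard POMDP belief update) and `obs_likelihood` (P(ω | b, a)) are hypothetical helpers.

```python
# Sketch of one expansion step of the tree of possible joint beliefs.
# Omega is the list of per-agent observation sets.
import itertools

def expand(L_t, a, Omega, update, obs_likelihood):
    """L_t: list of (b, p, history) leaves; a: the known joint action."""
    L_next = []
    for b, p, history in L_t:
        for w in itertools.product(*Omega):   # every joint observation
            p_w = obs_likelihood(b, a, w)
            if p_w > 0:
                L_next.append((update(b, a, w), p * p_w, history + [w]))
    return L_next
```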
25. Q-POMDP Heuristic
- Select the joint action that maximizes expected reward over the possible joint beliefs
- Q-MDP [Littman et al., 1995]
  - approximates the solution to a large POMDP using the underlying MDP
- Q-POMDP [Roth et al., 2005]
  - approximates the solution to a Dec-POMDP using the underlying single-agent POMDP
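In code, the Q-POMDP choice might look like the following sketch, where `Q` is the centralized single-agent POMDP value function Q(b, a) (hypothetical here) and `L_t` holds the ⟨b, p, ω⟩ leaves:

```python
# Sketch of Q-POMDP action selection: pick the joint action that
# maximizes expected value over the possible joint beliefs in L_t.
def q_pomdp(L_t, joint_actions, Q):
    return max(joint_actions,
               key=lambda a: sum(p * Q(b, a) for b, p, _ in L_t))
```

Because every agent evaluates this over the same shared L_t, all agents independently arrive at the same joint action.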
26. Q-POMDP Heuristic
[Figure: the belief tree from slide 24; the joint action is chosen by computing expected reward over all leaves]
Agents will independently select the same joint action, guaranteeing that they avoid coordination errors...
...but the action choice is very conservative (always ⟨Listen, Listen⟩).
ACE-PJB-Comm: communication adds local observations to the joint belief.
27. ACE-PJB-Comm Example
[Figure: joint belief tree L1 with leaves for ⟨HL,HL⟩, ⟨HL,HR⟩, ⟨HR,HL⟩, ⟨HR,HR⟩; the nodes consistent with the agent's own observations are circled]
aNC = Q-POMDP(L1) = ⟨Listen, Listen⟩
L = circled nodes
aC = Q-POMDP(L) = ⟨Listen, Listen⟩
aC = aNC, so don't communicate
28. ACE-PJB-Comm Example
[Figure: agent 1 has observed HL, HL. After a = ⟨Listen, Listen⟩ at each step, L2 contains a leaf for each two-step joint observation history, from ⟨HL,HL⟩⟨HL,HL⟩ through ⟨HL,HR⟩⟨HR,HR⟩; the nodes consistent with agent 1's observations are circled]
aNC = Q-POMDP(L2) = ⟨Listen, Listen⟩
L = circled nodes
aC = Q-POMDP(L) = ⟨OpenR, OpenR⟩
V(aC) - V(aNC) > ε, so agent 1 communicates
29. ACE-PJB-Comm Example
[Figure: the same joint belief tree L2, pruned to the histories consistent with the communicated observations]
Agent 1 communicates ⟨HL, HL⟩
Q-POMDP(L2) = ⟨OpenR, OpenR⟩
Agents open the right door!
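The communication test illustrated in this example can be sketched as follows, reusing the q_pomdp sketch above. L_joint is the full tree of possible joint beliefs, L_own keeps only the circled leaves consistent with the agent's own observations, and ε is the communication threshold; all names are illustrative.

```python
# Hedged sketch of the ACE-PJB-Comm test: communicate only when sharing
# local observations would shift the joint action by more than epsilon
# in expected value.
def expected_value(a, L, Q):
    total = sum(p for _, p, _ in L)
    return sum(p * Q(b, a) for b, p, _ in L) / total   # renormalize over L

def should_communicate(L_joint, L_own, joint_actions, Q, epsilon):
    a_nc = q_pomdp(L_joint, joint_actions, Q)  # action if no one communicates
    a_c = q_pomdp(L_own, joint_actions, Q)     # action if local obs were shared
    return expected_value(a_c, L_own, Q) - expected_value(a_nc, L_own, Q) > epsilon
```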
30. ACE-PJB-Comm Results
- 20,000 trials in the 2-Agent Tiger Domain
  - 6 timesteps per trial
- Agents communicate 49.7% fewer observations using ACE-PJB-Comm, and send 93.3% fewer messages
- The difference in expected reward arises because ACE-PJB-Comm is slightly pessimistic about the outcome of communication
31. Additional Challenges
- The number of possible joint beliefs grows exponentially
  - Use a particle filter to model the distribution of possible joint beliefs
- ACE-PJB-Comm answers the question of when agents should communicate
  - It doesn't address what to communicate
  - Agents communicate all observations that they haven't previously communicated
32. Selective ACE-PJB-Comm [Roth et al., 2006]
- Answers what agents should communicate
- Chooses the most valuable subset of observations
  - Hill-climbing heuristic to choose observations that push the team toward aC
  - aC: the joint action that would be chosen if the agent communicated all of its observations
- See details in the thesis document
33. Selective ACE-PJB-Comm Results
- 2-Agent Tiger domain
- Communicates 28.7% fewer observations
- Same expected reward
- Slightly more messages
34. Outline
- Dec-POMDP, Dec-MDP models
- Impact of communication on complexity
- Avoiding Coordination Errors by reasoning over Possible Joint Beliefs (ACE-PJB)
  - ACE-PJB-Comm: When should agents communicate?
  - Selective ACE-PJB-Comm: What should agents communicate?
- Avoiding Coordination Errors by executing Individual Factored Policies (ACE-IFP)
- Future directions
35. Dec-MDP
- State is collectively observable
  - One agent can't identify the full state on its own
  - The union of the team's observations uniquely identifies the state
- The underlying problem is an MDP, not a POMDP
- A Dec-MDP has the same complexity as a Dec-POMDP
  - NEXP-complete
36. Acting Independently
- ACE-PJB requires agents to know the joint action at every timestep
- Claim: in many multi-agent domains, agents can act independently for long periods of time, only needing to coordinate infrequently
37. Meeting-Under-Uncertainty Domain
- Agents must move to the goal location and signal simultaneously
- Reward
  - +20: both agents signal at the goal
  - -50: both agents signal at another location
  - -100: only one agent signals
  - -1: agents move north, south, east, west, or stop
38. Factored Representations
- Represent relationships among state variables instead of relationships among states
- S = ⟨X0, Y0, X1, Y1⟩; each agent observes its own position
39. Factored Representations
- A Dynamic Decision Network models state variables over time
[Figure: DDN for the joint action a_t = ⟨East, ·⟩]
40. Tree-structured Policies
- A decision tree that branches over state variables
- A tree-structured joint policy has joint actions at the leaves
41. Approach [Roth et al., 2007]
- Generate a tree-structured joint policy for the underlying centralized MDP
- Use this joint policy to generate a tree-structured individual policy for each agent
- Execute the individual policies
- See details in the thesis document
42. Context-specific Independence
Claim: in many multi-agent domains, one agent's individual policy will have large sections where it is independent of the variables that its teammates observe.
43. Individual Policies
- One agent's individual policy may depend on state features it doesn't observe
44. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
- The robot traverses the policy tree according to its observations (see the sketch below)
- If it reaches a leaf, its action is independent of its teammates' observations
- If it reaches a state variable that it does not observe directly, it must ask a teammate for the current value of that variable
- The amount of communication needed to execute a particular policy corresponds to the amount of context-specific independence in the domain
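A minimal sketch of this traversal, under assumed types: an internal policy-tree node tests one state variable and branches on its value, a leaf holds an individual action, and `ask_teammate` stands in for the query mechanism.

```python
# Sketch of decentralized execution of an individual factored policy.
def act(node, local_obs, ask_teammate):
    """local_obs: dict of locally observed variables; ask_teammate:
    callable that queries whichever teammate observes a variable."""
    while not node.is_leaf():
        if node.variable in local_obs:       # observed locally: no message
            value = local_obs[node.variable]
        else:                                # communicate: query a teammate
            value = ask_teammate(node.variable)
        node = node.children[value]
    return node.action  # at a leaf: independent of teammates' observations
```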
45. Avoid Coordination Errors by Executing an Individual Factored Policy (ACE-IFP)
- Benefits
  - Agents can act independently, without reasoning about the possible observations or actions of their teammates
  - The policy directs agents about when, what, and with whom to communicate
- Drawback
  - In domains with little independence, agents may need to communicate a lot
46. Experimental Results
- In the 3x3 domain, executing the factored policy required less than half as many messages as full communication, with the same reward
- Communication usage decreases relative to full communication as domain size increases
47. Factored Dec-POMDPs
- [Hansen and Feng, 2000] looked at factored POMDPs
  - ADD representations of the transition, observation, and reward functions
  - The policy is a finite-state controller
    - Nodes are actions
    - Transitions depend on conjunctions of state-variable assignments
- To extend this to Dec-POMDPs, make each individual policy a finite-state controller over individual actions
  - Somehow combine nodes with the same action
  - Communicate to enable transitions between action nodes
48. Future Directions
- Considering communication cost in ACE-IFP
  - All children of a particular variable may have similar values
  - What is the worst-case cost of mis-coordination?
- Modeling teammate variables requires reasoning about possible teammate actions
- Extending factoring to Dec-POMDPs
49. Future Directions
- Knowledge persistence
  - Modeling teammates' variables
  - Can we identify necessary conditions?
  - e.g. "Tell me when you reach the goal."
[Figure: an agent repeatedly asking "Are you here yet?"]
50. Contributions
- Decentralized execution of centralized policies
  - Guarantee that agents will Avoid Coordination Errors
- Make effective use of limited communication resources
  - When should agents communicate?
  - What should agents communicate?
- Demonstrate significant communication savings in experimental domains
51. Contributions
[Figure: the related-work table from slide 15 (Tell, Query, Sync, AND, OR, Cost, Unrestricted, ACE), annotated with the When?, What?, and Who? questions addressed by this thesis]
52. Thank You!
- Advisors: Reid Simmons, Manuela Veloso
- Committee: Carlos Guestrin, Jeff Schneider, Milind Tambe
- RI folks: Suzanne, Alik, Damion, Doug, Drew, Frank, Harini, Jeremy, Jonathan, Kristen, Rachel (and many others!)
- Aba, Ema, Nitzan, Yoel
53. References
- Roth, M., Simmons, R., and Veloso, M. "Reasoning About Joint Beliefs for Execution-Time Communication Decisions." In AAMAS, 2005.
- Roth, M., Simmons, R., and Veloso, M. "What to Communicate? Execution-Time Decisions in Multi-agent POMDPs." In DARS, 2006.
- Roth, M., Simmons, R., and Veloso, M. "Exploiting Factored Representations for Decentralized Execution in Multi-agent Teams." In AAMAS, 2007.
- Bernstein, D., Zilberstein, S., and Immerman, N. "The Complexity of Decentralized Control of Markov Decision Processes." In UAI, 2000.
- Pynadath, D. and Tambe, M. "The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models." In JAIR, 2002.
- Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. "Transition-independent Decentralized Markov Decision Processes." In AAMAS, 2003.
- Nair, R., Roth, M., Yokoo, M., and Tambe, M. "Communication for Improving Policy Computation in Distributed POMDPs." In IJCAI, 2003.
54. Tiger Domain Details
55. Particle filter representation
- Each particle is a possible joint belief
- Each agent maintains two particle filters
  - L_joint: possible joint team beliefs
  - L_own: possible joint beliefs that are consistent with the local observation history
- Compare the action selected by Q-POMDP over L_joint to the action selected over L_own, and communicate as needed
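One propagation step of this approximation might look like the following sketch, where each particle is one possible joint belief; `sample_obs` (draws a joint observation ω ~ P(ω | b, a)) and `update` (the POMDP belief update) are hypothetical helpers.

```python
# Sketch of one particle-filter step over possible joint beliefs.
import random

def propagate(particles, a, sample_obs, update, n_particles):
    new_particles = []
    for _ in range(n_particles):
        b = random.choice(particles)   # resample a possible joint belief
        w = sample_obs(b, a)           # sample one joint observation
        new_particles.append(update(b, a, w))
    return new_particles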
56. Related Work: Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
- Dec-MDP: collective observability
- Transition independence
  - Local state transitions
  - Each agent observes its local state
  - Individual actions only affect local state transitions
- The team is connected through the joint reward
- The coverage set algorithm finds the optimal policy quickly in experimental domains
- No communication
57. Related Work: COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
- Add a SYNC action to the domain
- If one agent chooses SYNC, all other agents SYNC
- At SYNC, agents send their entire observation histories since the last SYNC
- SYNC brings agents to a synchronized belief over world states
- Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Figure: belief tree. t = 0: (SL: 0.5, SR: 0.5); a = ⟨Listen, Listen⟩; branches ω = HL and ω = HR lead to beliefs such as (SL (HL): 0.7225, SR (HL): 0.0225); after a = SYNC, the t = 2 synchronized beliefs are (SL: 0.97, SR: 0.03) or (SL: 0.5, SR: 0.5)]
- "At most K" heuristic: there must be a SYNC within at most K timesteps
58. Related Work: No News is Good News [Xuan, Lesser, Zilberstein, 2000]
- Applies to transition-independent Dec-MDPs
- Agents form a joint plan
  - They plan the exact path to be followed to accomplish the goal
- Agents communicate when a deviation from the plan occurs
  - An agent sees that it has slipped from the optimal path
  - It communicates the need for re-planning
59. Related Work: BaGA-Comm [Emery-Montemerlo, 2005]
- Each agent has a type
  - Its observation and action history
- Agents model the distribution of possible joint types
- They choose actions by finding the joint type closest to their own local type
  - Allows coordination errors
- Communicate if the gain in expected reward is greater than the cost of communication
60. Colorado/Wyoming Domain
- Robots must meet in the capital, but do not know whether they are in Colorado or Wyoming
- Robots receive a positive reward (+20) only if they SIGNAL simultaneously from the correct goal location
- To simplify the problem, each robot knows both its own and its teammate's position
[Figure: maps of Wyoming and Colorado with the capital marked]
61. Colorado/Wyoming Domain
- Noisy observations: mountain, plain, Pikes Peak, Old Faithful
- Communication can help the team reach the goal more efficiently
[Figure: Pikes Peak and Old Faithful landmarks]
62. Build-Message: What to Communicate
- First, determine whether communication is necessary
  - Calculate aC using ACE-PJB-Comm
  - If aC = aNC, do not communicate
- Greedily build the message (see the sketch below)
  - Hill-climb toward aC, away from aNC
  - Choose the single observation that most increases the difference between the Q-POMDP values of aC and aNC
[Figure: local observation history Mt, Pl, Mt, Pike]
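A sketch of the greedy loop just described; `value_if_sent(a, msg)` is a hypothetical helper giving the Q-POMDP value of joint action a over the joint beliefs consistent with having communicated the observations in msg (full details are in the thesis document).

```python
# Greedy sketch of Build-Message. a_c / a_nc are the joint actions
# with and without full communication, as computed by ACE-PJB-Comm.
def build_message(local_obs, a_c, a_nc, value_if_sent):
    message, remaining = [], list(local_obs)
    while remaining:
        # Hill-climb: add the observation that most separates a_c from a_nc.
        best = max(remaining,
                   key=lambda w: value_if_sent(a_c, message + [w]) -
                                 value_if_sent(a_nc, message + [w]))
        message.append(best)
        remaining.remove(best)
        if value_if_sent(a_c, message) > value_if_sent(a_nc, message):
            break   # message already shifts the team's choice to a_c
    return message
```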
63. Build-Message: What to Communicate
- Is communication necessary?
[Figure: local observation history Mt, Pl, Mt, Pike]
aNC = ⟨east, south⟩
aC = ⟨east, west⟩
aC ≠ aNC, so the agent should communicate
64. Build-Message: What to Communicate
[Figure: the distribution over joint beliefs if the agent communicates its entire observation history Mt, Pl, Mt, Pike]
aC = ⟨east, west⟩ (toward Denver)
65. Build-Message: What to Communicate
aC = ⟨east, west⟩ (toward Denver)
- PIKE is the single best observation
- In this case, PIKE is sufficient to change the joint action to aC, so the agent communicates only one observation
m = ⟨Pike⟩
66. Context-specific Independence
- A variable may be independent of a parent variable in some contexts but not others
  - e.g., X2 depends on X3 when X1 has value 1, but is independent of X3 otherwise
- Claim: many multi-agent domains exhibit a large amount of context-specific independence
67. Constructing Individual Factored Policies
- [Boutilier et al., 2000] defined Merge and Simplify operations for policy trees
- We want to construct trees that maximize context-specific independence
  - This depends on the variable ordering in the policy
- We define Intersect and Independent operations
68. Intersect
- Find the intersection of the action sets of a node's children
1. If all children are leaves and their action sets have a non-empty intersection, replace the node with that intersection
2. If all but one child is a leaf, and all the actions in the non-leaf child's subtree are included in the leaf children's intersection, replace the node with the non-leaf child
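A sketch of these two rules on hypothetical policy-tree types, where a Leaf carries the set of equally optimal individual actions and an internal Node carries the children it branches to over one state variable:

```python
# Sketch of the Intersect operation; Leaf/Node are assumed types.
from dataclasses import dataclass, field
from typing import List, Set, Union

@dataclass
class Leaf:
    actions: Set[str]            # equally optimal actions at this leaf

@dataclass
class Node:
    variable: str                # state variable tested at this node
    children: List[Union["Node", Leaf]] = field(default_factory=list)

def actions_in(tree):
    """All actions appearing anywhere in a (sub)tree."""
    if isinstance(tree, Leaf):
        return set(tree.actions)
    return set.union(*(actions_in(c) for c in tree.children))

def intersect(tree):
    if isinstance(tree, Leaf):
        return tree
    tree.children = [intersect(c) for c in tree.children]
    leaves = [c for c in tree.children if isinstance(c, Leaf)]
    inner = [c for c in tree.children if not isinstance(c, Leaf)]
    common = set.intersection(*(c.actions for c in leaves)) if leaves else set()
    if not inner and common:
        return Leaf(common)      # rule 1: all children are leaves
    if len(inner) == 1 and leaves and actions_in(inner[0]) <= common:
        return inner[0]          # rule 2: fold into the single non-leaf child
    return tree
```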
69. Independent
- An individual action is Independent at a particular leaf of a policy tree if it is optimal when paired with any action its teammate could perform at that leaf
[Figure: one leaf where action a is independent for agent 1, and another where agent 1 has no independent actions]
70. Generate Individual Policies
- Generate a tree-structured joint policy
- For each agent
  - Reorder variables in the joint policy so that the variables local to this agent are near the root
  - For each leaf in the policy, find the Independent actions
  - Break ties among the remaining joint actions
  - Convert joint actions to individual actions
  - Intersect and Simplify
71. Why Break Ties?
- Ensure agents select the same optimal joint action, to prevent mis-coordination