1
Reasoning in Uncertain Adversarial Environments
in Agent/Multiagent Systems
  • Praveen Paruchuri
  • University of Southern California

Guidance Committee: Milind Tambe, Leana
Golubchik, Gaurav S. Sukhatme, Sarit Kraus, Stacy
Marsella, Fernando Ordonez
2
Motivation: The Prediction Game
  • A UAV (Unmanned Aerial Vehicle)
  • Flies between the 4 regions
  • Can you predict the UAV's flight pattern?
  • Pattern 1:
  • 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
  • Pattern 2:
  • 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, (as
    generated by a 4-sided die)
  • Could you predict Pattern 2 even if given 100 of
    its numbers?
  • Randomization decreases predictability
  • Increases security

3
Problem Definition
  • Problem: Provide security for an agent/agent team
    acting in uncertain adversarial environments.
  • Assumptions for the agent/agent team:
  • Acting in uncertain adversarial environments
  • Adversary is unobservable
  • Adversary's actions/capabilities or payoffs are
    unknown or difficult to model explicitly
  • Assumptions for the adversary:
  • Can see the agent's state or belief state
  • Knows the agent's plan/policy
  • Exploits the action predictability

4
Solution Technique
  • Technique developed:
  • Intentional policy randomization for security
  • MDP/POMDP framework to handle sequential decision
    making under environment uncertainty
  • MDP: Markov Decision Process
  • POMDP: Partially Observable MDP
  • Increase security => solve a multi-criteria problem
    for agents:
  • Maximize action unpredictability (policy
    randomization)
  • Maintain reward above a threshold (quality
    constraints)

5
Domains
  • Scheduled activities at airports, like security
    checks, refueling, etc.
  • Observed by terrorists
  • Randomization of schedules helpful
  • UAV/UAV team patrolling a humanitarian mission
  • Adversary monitors the UAV schedule to disrupt the
    mission
  • Can disrupt food supplies, harm refugees, shoot
    down UAVs, etc.
  • Randomize the UAV patrol policy (more domains in
    report)

6
My Contributions
  • Two main contributions
  • Single-Agent Case:
  • Formulate as a non-linear program with an
    entropy-based metric
  • Convert to a linear program called BRLP (Binary
    Search for Randomization LP)
  • Randomize single-agent policies with reward >
    threshold
  • Multi-Agent Case: RDR (Rolling Down
    Randomization)
  • Randomized policies for decentralized POMDPs
  • Threshold on team reward

7
Related work
  • Randomized policies in literature
  • Stochastic Games
  • Rock-Paper-Scissors game, Littman 1994
  • CMDPs
  • Constrained MDPs, Altman 1999
  • Privacy
  • Act on information while maintaining privacy,
    Otterloo 2005
  • Security via randomization
  • Patrol units for security, Carroll 2005
  • Randomized patrol sentry vehicles, Lewis 2005
  • Randomized Police Patrol, Billante 2003

8
Plan for Talk
  • Achieved Contribution versus Expected Contribution

[Chart] Goal: increase security for an agent/agent team
acting in uncertain, adversarial domains. The chart maps
the contributions along two dimensions, adversary model
(none vs. partial) and communication (none vs. limited),
onto the frameworks used: MDP-based single agent,
Dec-POMDP-based agent team, and Dec-MDP-based agent team.
9
MDP-based single agent case
  • An MDP is a tuple ⟨S, A, P, R⟩
  • S: set of states
  • A: set of actions
  • P: transition function
  • R: reward function
  • Basic terms used:
  • x(s,a): expected number of times action a is taken
    in state s
  • Policy (as a function of the MDP flows; sketched
    below)
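
A minimal sketch of the flow-based policy definition, assuming the
standard dual (flow) formulation of an MDP with x(s,a) as defined
above:

  \pi(s,a) = \frac{x(s,a)}{\sum_{a'} x(s,a')}

i.e., the probability of taking action a in state s is that action's
share of the total expected flow out of s.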

10
Entropy: Measure of randomness
  • Randomness or information content is quantified
    using entropy (Shannon 1948)
  • Entropy for an MDP (sketched below):
  • Additive entropy: add the entropies of each state
  • Weighted entropy: weigh each state by its
    contribution to the total flow,
  • where alpha_j is the initial flow of the system
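
A sketch of the two entropy measures in LaTeX, assuming the flow-based
policy \pi(s,a) = x(s,a) / \sum_{a'} x(s,a') from the previous slide
(the exact normalization on the original slide is not recoverable and
may differ):

  H_A(x) = -\sum_{s} \sum_{a} \pi(s,a) \log \pi(s,a)

  H_W(x) = -\sum_{s} \frac{\sum_{a} x(s,a)}{\sum_{j} \alpha_j}
           \sum_{a} \pi(s,a) \log \pi(s,a)

The weighted version scales each state's entropy by that state's share
of the flow, with \alpha_j the initial flow into state j.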

11
Tradeoff: Reward vs. Entropy
  • Non-linear program: maximize entropy, keep reward
    above a threshold (sketched below)
  • Objective (entropy) is non-linear
  • BRLP (Binary Search for Randomization LP)
  • Linear program
  • No explicit entropy calculation; entropy enters
    only as a function of the flows
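
A sketch of the non-linear program, assuming the dual (flow) LP
constraints of an MDP, shown without a discount factor; the reward
threshold symbol E_min is illustrative:

  \max_{x} \; H_W(x)
  \text{s.t.} \;\; \sum_{a} x(j,a) - \sum_{s}\sum_{a} P(j \mid s,a)\, x(s,a) = \alpha_j \quad \forall j
  \qquad\;\;\; \sum_{s}\sum_{a} R(s,a)\, x(s,a) \ge E_{min}
  \qquad\;\;\; x(s,a) \ge 0

The entropy objective is non-linear in the flows x, which is what BRLP
avoids by searching over linear programs instead.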

12
BRLP
  • Given an input policy and a target reward
  • Poly-time convergence to within a small tolerance
    of the target reward
  • Monotonicity: entropy decreases or stays constant
    with increasing reward
  • Control through the parameter beta (a binary-search
    sketch follows below)
  • Input can be any high-entropy policy
  • One such input is the uniform policy
  • Equal probability for all actions out of all
    states
  • Controls the final policy structure
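
A minimal Python sketch of the binary search, assuming a hypothetical
helper solve_brlp_lp(beta, mdp, input_flows) that solves the LP on the
next slide and returns the achieved reward and flows (the helper's name
and signature are illustrative, not from the original slides):

  def brlp(mdp, input_flows, target_reward, tol=1e-3):
      # beta = 1 pins the flows to the high-entropy input policy;
      # beta = 0 recovers the deterministic max-reward policy.
      lo, hi = 0.0, 1.0
      while hi - lo > tol:
          beta = (lo + hi) / 2.0
          reward, flows = solve_brlp_lp(beta, mdp, input_flows)
          if reward >= target_reward:
              lo = beta   # target still met: allow more randomization
          else:
              hi = beta   # too much reward lost: back off
      return solve_brlp_lp(lo, mdp, input_flows)

The search returns the largest beta (most randomization) whose optimal
reward still meets the target.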

13
LP for Binary Search
  • Policy expressed as a function of beta and the
    high-entropy input policy's flows
  • Linear program (a sketch follows below)
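
A sketch of the LP solved for each value of beta, assuming the
flow-based formulation above, with \bar{x}(s,a) the flows of the
high-entropy input policy (the constraints on the original slide were
an image, so this follows the standard BRLP construction and may differ
in detail):

  \max_{x} \;\; \sum_{s}\sum_{a} R(s,a)\, x(s,a)
  \text{s.t.} \;\; \sum_{a} x(j,a) - \sum_{s}\sum_{a} P(j \mid s,a)\, x(s,a) = \alpha_j \quad \forall j
  \qquad\;\;\; x(s,a) \ge \beta\, \bar{x}(s,a) \quad \forall s, a

Raising \beta pushes every flow toward the input policy's flows (more
entropy, less reward), so the binary search picks the largest \beta
whose optimal reward still reaches the target.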

14
BRLP in Action
[Figure] Binary search on beta: 0 corresponds to the
deterministic max-reward policy, 1 to the max-entropy input
policy; the example shows the search reaching the target
reward around beta = 0.5.
15
Results (Averaged over 10 MDPs)
  • For a given reward threshold:
  • Highest entropy: the Expected Entropy method, with
    about a 10% average gain over BRLP
  • Fastest: BRLP, with a 7-fold average speedup over
    the Expected Entropy method
  • These results are statistically significant
    (t-tests performed)

16
Multi-Agent Case: Problem
  • Maximize entropy for agent teams subject to a
    reward threshold
  • For the agent team:
  • Decentralized POMDP framework used
  • Agents know the initial joint belief state
  • No communication possible between agents
  • For the adversary:
  • Can calculate the agents' belief state
  • Knows the agents' policy
  • Exploits the action predictability
  • For Dec-POMDPs:
  • A deterministic policy maps each observation
    history to an action.
  • A randomized policy maps each observation history
    to a probability distribution over actions (see the
    sketch below).
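
A minimal Python sketch of the two policy types; the type names
ObsHistory and Action are illustrative placeholders, not from the
original slides:

  import random
  from typing import Dict, Tuple

  ObsHistory = Tuple[str, ...]   # sequence of observations received so far
  Action = str

  # Deterministic policy: each observation history maps to exactly one action.
  DeterministicPolicy = Dict[ObsHistory, Action]

  # Randomized policy: each observation history maps to an action distribution.
  RandomizedPolicy = Dict[ObsHistory, Dict[Action, float]]

  def sample_action(policy: RandomizedPolicy, history: ObsHistory) -> Action:
      # Draw an action according to the distribution at this history.
      actions, probs = zip(*policy[history].items())
      return random.choices(actions, weights=probs, k=1)[0]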

17
Policy Trees: Deterministic vs. Randomized
[Figure] Side-by-side policy trees: a deterministic policy
tree (one action per observation-history node) and a
randomized policy tree (an action distribution per node).
18
RDR: Rolling Down Randomization
  • Input:
  • Best (local or global) deterministic policy
  • Percent of reward loss allowed
  • d parameter: sets the number of turns each agent
    gets
  • Ex: d = 0.5 => number of steps = 1/d = 2
  • Each agent gets one turn (for the 2-agent case)
  • A single-agent MDP problem is solved at each step
  • For agent 1's turn:
  • Fix the policies of the other agents (agent 2)
  • Find a randomized policy that
  • Maximizes joint entropy (w1 * Entropy(agent 1) +
    w2 * Entropy(agent 2))
  • Maintains joint reward above the threshold
  • (A sketch of the alternation follows below)
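
A minimal Python sketch of the RDR alternation for two agents, assuming
hypothetical helpers best_deterministic_joint_policy and
maximize_entropy_with_fixed_teammate (names and signatures are
illustrative; the real algorithm solves the single-agent problem over
the extended states introduced on the following slides):

  def rdr(dec_pomdp, reward_loss_percent, d=0.5):
      # Start from the best (local or global) deterministic joint policy.
      policies, max_reward = best_deterministic_joint_policy(dec_pomdp)
      steps = int(1 / d)                     # d = 0.5 -> 2 steps, one turn per agent
      loss_per_step = reward_loss_percent / steps
      threshold = max_reward
      for step in range(steps):
          agent = step % 2                   # agents take turns
          threshold -= max_reward * loss_per_step / 100.0
          # With the other agent's policy fixed, re-randomize this agent's policy:
          # maximize joint entropy subject to joint reward >= threshold.
          policies[agent] = maximize_entropy_with_fixed_teammate(
              dec_pomdp, agent, policies, threshold)
      return policies

With a 20% allowed reward loss and d = 0.5, this reproduces the next
slide's example: agent 1's turn is constrained to 90% of the max reward
and agent 2's turn to 80%.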

19
RDR with d = 0.5
[Figure] Agent 1's turn: maximize joint entropy subject to
joint reward >= 90% of the max reward. Agent 2's turn:
maximize joint entropy subject to joint reward >= 80% of
the max reward.
20
RDR details
  • For a single agent, the belief state is a
    sufficient statistic
  • Not sufficient for the multi-agent case
  • For the multi-agent case:
  • With a deterministic policy (other agents'
    policies fixed):
  • Reason about the current world state and the
    observation histories of the other agents
  • With a randomized policy:
  • Reason about the current world state and the
    action and observation histories of the other
    agents
  • Define an extended state: the current world state
    together with the other agents' histories
  • The joint belief state is a probability
    distribution over extended states.

21
New Transition and Observation Functions
  • Transition function for the deterministic case
  • Transition function for RDR (randomized case)
  • The observation function remains the same
  • (A sketch of these functions follows below)
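
A sketch of what the extended-state transition might look like in the
two-agent case, assuming agent 2's policy is the one held fixed and
writing an extended state as e = \langle s, \bar{\omega}_2 \rangle
(world state plus agent 2's observation history); the notation is
illustrative rather than recovered from the original slide. For a
deterministic fixed policy \pi_2:

  P'(\langle s', \bar{\omega}_2 \omega_2' \rangle \mid \langle s, \bar{\omega}_2 \rangle, a_1)
    = P(s' \mid s, a_1, \pi_2(\bar{\omega}_2)) \cdot O_2(\omega_2' \mid s', a_1, \pi_2(\bar{\omega}_2))

For a randomized fixed policy, the extended state also carries agent
2's action, and the right-hand side is weighted by
\pi_2(a_2 \mid \bar{\omega}_2) for each possible a_2. Agent 1's own
observation function is untouched, which is why only the transition
function needs to be redefined.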

22
Belief Update Rule
  • Original belief update rule
  • New belief update rule (a sketch follows below)
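
For reference, the standard single-agent POMDP belief update, which the
original rule follows in form, is:

  b'(s') \propto O(\omega \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)

The new rule (a sketch, under the same assumptions as the previous
slide) applies the identical form over extended states, using the RDR
transition function P' and the unchanged observation function, so the
joint belief stays a distribution over extended states after every
action-observation step.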

23
Experimental Results: Reward Threshold vs. Weighted Entropy
(averaged over 10 instances)
24
Future Work: Dec-MDPs with Bandwidth Constraints
  • Agent teams can communicate
  • Limited bandwidth assumed
  • Bandwidth modeled as a shared resource constraint
  • Typical policy randomization formulation (a sketch
    follows below)
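
A sketch of what the formulation with a bandwidth constraint might look
like, reusing the flow notation from the single-agent slides; the cost
c(s,a) (expected bandwidth used when acting in s) and the budget B are
illustrative symbols, not from the original slide:

  \max_{x} \;\; H_W(x)
  \text{s.t.} \;\; \text{flow-conservation constraints on } x
  \qquad\;\;\; \sum_{s}\sum_{a} R(s,a)\, x(s,a) \ge E_{min} \quad \text{(reward threshold)}
  \qquad\;\;\; \sum_{s}\sum_{a} c(s,a)\, x(s,a) \le B \quad \text{(expected bandwidth use)}

This mirrors the EMTDP idea from the "Other Work" slide: keep the
optimization objective while bounding expected consumption of a shared
resource.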

25
Dec-POMDPs with Bandwidth Constraints
  • RDR assumes no communication
  • Communication in Dec-POMDPs:
  • With deterministic policies (Nair et al., 2004):
  • Communicate observation histories
  • Compress belief states and increase expected
    rewards
  • With randomized policies:
  • Can communicate observation histories
  • Or communicate the action taken
  • Optimization problem: optimal allocation of
    bandwidth between observation histories and
    actions s.t. expected reward is maximized.

26
Incorporating Adversary Models
  • Real-world situation:
  • It is known that the adversary cannot target the
    leftmost region of the mission
  • No UAV patrolling is needed there
  • Such prior knowledge should be incorporated while
    randomizing the policy.
  • Partial adversary models: a standard framework
    needs to be developed.
  • Plan optimally against the modeled part of the
    adversary
  • Plan for the unmodeled part as well.

27
Timeline
[Timeline chart] From Proposal (March 06) to Defense
(March 07), with intermediate milestones: Dec-MDP/POMDP
with communication constraints, experimental evaluation,
incorporating adversarial models, a second experimental
evaluation, and thesis writing (dates shown: March 06,
April 06, June 06, August 06, November 06, Feb 07,
March 07).
28
Other Work
  • Self-interest vs. team-interest:
  • Electric Elves domain
  • Teamwork with resource constraints:
  • Developed the EMTDP framework
  • Maximize expected team reward while bounding
    expected resource consumption
  • CRLP (Convex Combination for Randomization)
    algorithm:
  • Heuristic algorithm for single-agent policy
    randomization
  • Communication issues in Dec-MDPs:
  • Communication increases security

29
Summary
  • Intentional randomization as the main focus
  • Single-agent case:
  • BRLP algorithm introduced
  • Multi-agent case:
  • RDR algorithm introduced
  • Multi-criteria problem solved that
  • Maximizes entropy
  • Maintains reward > threshold

30
Thank You
  • Any comments/questions?
