Learning in networks (and other asides)


1
Learning in networks (and other asides)
  • A preliminary investigation and some comments
  • Yu-Han Chang
  • Joint work with Tracey Ho and Leslie Kaelbling
  • AI Lab, MIT
  • NIPS Multi-agent Learning Workshop, Whistler, BC
    2002

2
Networks: a multi-agent system
  • Graphical games [Kearns, Ortiz, Guestrin, …]
  • Real networks, e.g. a LAN [Boyan, Littman, …]
  • Mobile ad-hoc networks [Johnson, Maltz, …]

3
Mobilized ad-hoc networks
  • Mobile sensors, tracking agents, …
  • Generally a distributed system that wants to
    optimize some global reward function

4
Learning
  • Nash equilibrium is the phrase of the day, but is
    it a good solution?
  • Other equilibria, i.e. refinements of NE
  • Can we do better than Nash Equilibrium?
  • (Game playing approach)
  • Perhaps we want to just learn some good policy in
    a distributed manner. Then what?
  • (Distributed problem solving)

5
What are we studying?
                     Known world                  Learning
   Single-agent      Decision Theory, Planning    RL, NDP
   Multiple agents   Game Theory                  Stochastic games, Learning in games
6
Part I: Learning
[Diagram: the single-agent reinforcement learning loop — the Learning Algorithm chooses Actions in the World (State) and receives back Rewards and Observations/Sensations.]
7
Learning to act in the world
[Diagram: the same learning loop, but the environment now contains other agents (possibly learning) in addition to the World; our Actions, Rewards, and Observations/Sensations all pass through this joint environment.]
8
A simple example
  • The problem: Prisoner's Dilemma
  • Possible solutions: the space of policies
  • The solution metric: Nash equilibrium

                            Player 2's actions
                            Cooperate    Defect
  Player 1's   Cooperate    1, 1         -2, 2
  actions      Defect       2, -2        -1, -1

(The matrix is the world/state; the payoff entries are the rewards.)
9
That Folk Theorem
  • For discount factors close to 1, any individually
    rational payoffs are feasible (and are Nash) in
    the infinitely repeated game

          Coop.    Defect
  Coop.   1, 1     -2, 2
  Defect  2, -2    -1, -1

[Plot: the feasible payoff region in the (R1, R2) plane, spanned by the outcomes (1,1), (-2,2), (2,-2), and (-1,-1), with the safety value marked.]
10
Better policies: Tit-for-Tat
  • Expand our notion of policies to include maps
    from past history to actions
  • Our choice of action now depends on previous
    choices (i.e. non-stationary)

Tit-for-Tat policy:
  ( · , Defect )    → Defect
  ( · , Cooperate ) → Cooperate
where the history argument is the last period's play.
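As a throwaway illustration (not from the slides), a reactive policy is just a function from the last period's play to an action; a minimal Python sketch:

    # Tit-for-Tat as a reactive (history-length-1) policy:
    # ( . , Defect ) -> Defect, ( . , Cooperate ) -> Cooperate.
    COOPERATE, DEFECT = "Cooperate", "Defect"

    def tit_for_tat(last_opponent_action=None):
        if last_opponent_action is None:   # no history yet: open by cooperating
            return COOPERATE
        return DEFECT if last_opponent_action == DEFECT else COOPERATE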
11
Types of policies and their consequences
  • Stationary: 1 → A_t
  • At best, leads to the same outcome as the single-shot Nash equilibrium against rational opponents
  • Reactionary: ( h_{t-1} ) → A_t
  • Tit-for-Tat achieves the best outcome in the Prisoner's Dilemma
  • Finite memory: ( h_{t-n}, …, h_{t-2}, h_{t-1} ) → A_t
  • May be useful against more complex opponents or in more complex games
  • Algorithmic: ( h_1, h_2, …, h_{t-2}, h_{t-1} ) → A_t
  • Makes use of the entire history of actions as it learns over time

12
Classifying our policy space
  • We can classify a learning algorithm's potential power by the amount of history its policies can use
  • Stationary H0: 1 → A_t
  • Reactionary H1: ( h_{t-1} ) → A_t
  • Behavioral/Finite memory Hn: ( h_{t-n}, …, h_{t-2}, h_{t-1} ) → A_t
  • Algorithmic/Infinite memory H∞: ( h_1, h_2, …, h_{t-2}, h_{t-1} ) → A_t

13
Classifying our belief space
  • It's also important to quantify our belief space, i.e. our assumptions about what types of policies the opponent is capable of playing
  • Stationary B0
  • Reactionary B1
  • Behavioral/Finite memory Bn
  • Infinite memory/Arbitrary B∞

14
A Simple Classification
        B0                          B1            Bn             B∞
  H0    Minimax-Q, Nash-Q, Corr-Q                                Bully
  H1                                                             Godfather
  Hn
  H∞    (WoLF) PHC, Fictitious      Q1-learning   Qt-learning?   ???
        Play, Q-learning (JAL)
15
A Classification
        B0                          B1            Bn             B∞
  H0    Minimax-Q, Nash-Q, Corr-Q                                Bully
  H1                                                             Godfather
  Hn
  H∞    (WoLF) PHC, Fictitious      Q1-learning   Qt-learning?   ???
        Play, Q-learning (JAL)
16
H∞ × B0: Stationary opponent
  • Since the opponent is stationary, this case reduces the world to an MDP; hence we can apply any traditional reinforcement learning method
  • Policy hill-climbing (PHC) [Bowling & Veloso, 02]
  • Estimates the gradient in the action space and follows it towards the local optimum
  • Fictitious play [Robinson, 51; Fudenberg & Levine, 95] (sketched after this list)
  • Plays a stationary best response to the statistical frequency of the opponent's play
  • Q-learning (JAL) [Watkins, 89; Claus & Boutilier, 98]
  • Learns Q-values of states and possibly joint actions
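A minimal sketch of the fictitious-play best response, assuming a known reward matrix R (the array names and counting scheme are illustrative, not from the slides):

    import numpy as np

    def fictitious_play_action(R, opponent_counts):
        """Best-respond to the empirical frequency of the opponent's play.

        R[i, j]: our reward when we play row i and the opponent plays column j.
        opponent_counts[j]: how many times column j has been observed so far.
        """
        freq = opponent_counts / opponent_counts.sum()  # assumed-stationary policy
        return int(np.argmax(R @ freq))                 # row with best expected reward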

17
A Classification
        B0                          B1            Bn             B∞
  H0    Minimax-Q, Nash-Q, Corr-Q                                Bully
  H1                                                             Godfather
  Hn
  H∞    (WoLF) PHC, Fictitious      Q1-learning   Qt-learning?   ???
        Play, Q-learning (JAL)
18
H0 × B∞: My enemy's pretty smart
  • Bully [Littman & Stone, 01]
  • Tries to force the opponent to conform to the preferred outcome by choosing to play only some part of the game matrix

  • The 'Chicken' game (Hawk-Dove)
  • Undesirable Nash eq.

                              Them
                       Cooperate (Swerve)   Defect (Drive)
  Us  Cooperate
      (Swerve)             1, 1                -2, 2
      Defect
      (Drive)              2, -2               -5, -5
19
Achieving 'perfection'
  • Can we design a learning algorithm that will perform well in all circumstances?
  • Prediction
  • Optimization
  • But this is not possible! [Nachbar, 95; Binmore, 89]
  • Universal consistency (Exp3 [Auer et al., 02], smoothed fictitious play [Fudenberg & Levine, 95]) does provide a way out, but it merely guarantees that we'll do almost as well as any stationary policy that we could have used (Exp3 is sketched below)
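For concreteness, a minimal sketch of Exp3 (Auer et al., 02); `reward_of` is a stand-in for the environment, and rewards are assumed to lie in [0, 1]:

    import math, random

    def exp3(n_actions, gamma, reward_of, horizon):
        """Exp3: exponential weights with importance-weighted reward estimates,
        guaranteeing low regret against the best single action in hindsight."""
        w = [1.0] * n_actions
        for _ in range(horizon):
            total = sum(w)
            p = [(1 - gamma) * wi / total + gamma / n_actions for wi in w]
            a = random.choices(range(n_actions), weights=p)[0]
            x = reward_of(a)                                   # observed reward in [0, 1]
            w[a] *= math.exp(gamma * (x / p[a]) / n_actions)   # boost the chosen action
        return w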

20
A reasonable goal?
  • Can we design an algorithm in H∞ × Bn, or in a subclass of H∞ × B∞, that will do well?
  • It should always try to play a best response to any given opponent strategy
  • Against a fully rational opponent, it should thus learn to play a Nash equilibrium strategy
  • It should try to guarantee that we'll never do too badly
  • One possible approach: given knowledge about the opponent, model its behavior and exploit its weaknesses (play a best response)
  • Let's start by constructing a player that plays well against PHC players in 2x2 games

21
2x2 Repeated Matrix Games
  • We choose row i to play
  • Opponent chooses column j to play
  • We receive reward r_ij; they receive c_ij (expected payoffs under mixed strategies are sketched after the table)

          Left          Right
  Up      r_11, c_11    r_12, c_12
  Down    r_21, c_21    r_22, c_22
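Under mixed strategies x (over rows) and y (over columns), the expected payoffs are Σ_ij x_i y_j r_ij for us and Σ_ij x_i y_j c_ij for them; a one-line sketch with illustrative numpy arrays:

    import numpy as np

    def expected_payoffs(R, C, x, y):
        """Expected rewards for us (R) and them (C) under mixed strategies x, y."""
        return x @ R @ y, x @ C @ y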
22
Iterated gradient ascent
  • System dynamics for 2x2 matrix games take one
    of two forms

[Two phase plots: Player 1's probability for Action 1 vs. Player 2's probability for Action 1, showing the two possible forms of the dynamics.] [Singh, Kearns & Mansour, 00]
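A hedged sketch of the gradient dynamics behind those plots: each player ascends the gradient of its expected payoff with respect to its own probability of playing Action 1 (the step size `eta` is an illustrative choice, not from the slides):

    def iga_step(alpha, beta, R, C, eta=0.01):
        """One step of gradient ascent in a 2x2 game (after Singh, Kearns &
        Mansour, 00). alpha, beta: probabilities that players 1 and 2 put on
        Action 1; R, C: row player's and column player's payoff matrices."""
        u1 = R[0][0] - R[0][1] - R[1][0] + R[1][1]
        u2 = C[0][0] - C[0][1] - C[1][0] + C[1][1]
        d_alpha = beta * u1 + (R[0][1] - R[1][1])   # dV1/d(alpha)
        d_beta = alpha * u2 + (C[1][0] - C[1][1])   # dV2/d(beta)
        clip = lambda p: min(1.0, max(0.0, p))      # stay in the unit square
        return clip(alpha + eta * d_alpha), clip(beta + eta * d_beta)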
23
Can we do better and actually win?
  • Singh et al. show that we can achieve Nash payoffs
  • But is this a best response? We can do better:
  • Exploit while winning
  • Deceive and bait while losing

Matching pennies:

                Them
                Heads     Tails
  Us  Heads     -1, 1     1, -1
      Tails     1, -1     -1, 1
24
A winning strategy against PHC
  • If winning: play probability 1 on the current preferred action, in order to maximize rewards while winning
  • If losing: play a deceiving policy until we are ready to take advantage of them again

[Phase plot: the probability we play Heads against the probability the opponent plays Heads, both axes from 0 to 1, showing the cyclic exploit-then-deceive trajectory.]
25
Formally, PHC does:
  • Keeps and updates Q-values
  • Updates its policy by hill-climbing (both updates sketched below)
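The update equations on this slide were images in the original deck; what follows is a minimal sketch of the standard PHC updates from Bowling & Veloso (02), with renormalization standing in for their exact projection onto the probability simplex:

    def phc_update(Q, pi, s, a, r, s_next, alpha=0.1, gamma=0.9, delta=0.01):
        """One PHC step: an ordinary Q-learning update, then move the mixed
        policy a small step (delta) toward the greedy action."""
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
        greedy = max(Q[s], key=Q[s].get)
        n = len(pi[s])
        for act in pi[s]:                      # hill-climb toward the greedy action
            step = delta if act == greedy else -delta / (n - 1)
            pi[s][act] = min(1.0, max(0.0, pi[s][act] + step))
        z = sum(pi[s].values())                # renormalize (simplified projection)
        for act in pi[s]:
            pi[s][act] /= z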

26
PHC-Exploiter
  • Updates policy differently if winning vs. losing

If we are winning: …
Otherwise, we are losing: …
27
PHC-Exploiter
  • Updates policy differently if winning vs. losing

If we are winning: …
Otherwise, we are losing: …
28
PHC-Exploiter
  • Updates policy differently if winning vs. losing

If we are winning: …
Otherwise, we are losing: …
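The winning/losing updates above were also images in the original deck; this is a loose, hedged reconstruction of the idea only (the exact winning test and deceiving policy differ in detail): when the estimated matchup is favorable, collapse onto the best response; otherwise fall back to PHC-style play, which walks the opponent around the cycle.

    import numpy as np

    def phc_exploiter_policy(R, pi_hat_opp, pi_phc, game_value=0.0):
        """pi_hat_opp: our estimate of the opponent's mixed policy;
        pi_phc: the mixed policy ordinary PHC would play right now."""
        expected = R @ pi_hat_opp                  # expected reward of each of our rows
        if expected.max() > game_value:            # "winning": exploit deterministically
            pi = np.zeros_like(expected, dtype=float)
            pi[int(np.argmax(expected))] = 1.0     # probability 1 on the best response
            return pi
        return pi_phc                              # "losing": deceive / rebuild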
29
But we don't have complete information
  • Estimate the opponent's policy π_2 at each time period
  • Estimate the opponent's learning rate δ_2 (both estimates sketched below)

[Timeline: estimates are computed over successive windows of length w, at times t - 2w, t - w, and t.]
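A sketch of the windowing idea, assuming `history` is the list of observed opponent actions (the estimator details are illustrative):

    from collections import Counter

    def estimate_opponent(history, w, actions):
        """Estimate the opponent's current policy from the last window of
        length w, and its learning rate from the drift between successive
        windows. Assumes at least 2w observations."""
        recent = Counter(history[-w:])             # play during [t - w, t)
        older = Counter(history[-2 * w:-w])        # play during [t - 2w, t - w)
        pi_now = {a: recent[a] / w for a in actions}
        pi_before = {a: older[a] / w for a in actions}
        # per-step policy change: our stand-in for the opponent's learning rate
        rate = sum(abs(pi_now[a] - pi_before[a]) for a in actions) / (2 * w)
        return pi_now, rate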
30
Ideally we'd like to see this
[Plot: an idealized trajectory cycling between the winning and losing regions.]
31
With our approximations
32
And indeed we're doing well
[Plot: the observed trajectory, again cycling between losing and winning phases.]
33
Knowledge (beliefs) is useful
  • Using our knowledge about the opponent, we've demonstrated one case in which we can achieve better-than-Nash rewards
  • In general, we'd like algorithms that can guarantee Nash payoffs against fully rational players but can exploit bounded players (such as a PHC)

34
So what do we want from learning?
  • Best response / adaptive: exploit the opponent's weaknesses; essentially, always try to play a best response
  • Regret minimization: we'd like to be able to look back and not regret our actions; we wouldn't say to ourselves, "Gosh, why didn't I choose to do that instead?"

35
A next step
  • Expand the comparison class in universally
    consistent (regret-minimization) algorithms to
    include richer spaces of possible strategies
  • For example, the comparison class could include a
    best-response player to a PHC
  • Could also include all t-period strategies

36
Part II
  • What if we're cooperating?

37
What if we're cooperating?
  • Nash equilibrium is not the most useful concept in cooperative scenarios
  • We simply want to find, in a distributed fashion, the globally (perhaps approximately) optimal solution
  • This happens to be a Nash equilibrium, but it's not really the point of NE to address this scenario
  • Distributed problem solving rather than game playing
  • May also deal with modeling emergent behaviors

38
Mobilized ad-hoc networks
  • Ad-hoc networks are limited in connectivity
  • Mobilized nodes can significantly improve
    connectivity

39
Network simulator
40
Connectivity bounds
  • Static ad-hoc networks have loose bounds of the following form:
  • Given n nodes uniformly distributed i.i.d. in a disk of area A, each with range r(n) (of the form given below),
  • the graph is connected almost surely as n → ∞ iff ξ_n → ∞
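If this is the Gupta & Kumar (98) critical-range result, as the phrasing suggests (an assumption on my part, stated for reference), the range takes the form:

    \pi r(n)^2 = \frac{A\,(\log n + \xi_n)}{n},
    \qquad
    \Pr[\text{connected}] \to 1 \text{ as } n \to \infty
    \iff \xi_n \to \infty .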

41
Connectivity bounds
  • Allowing mobility can improve these loose bounds, as the table below shows
  • Can we achieve this, or even do significantly better?

  Fraction mobile   Required range
  1/2               r_{n/2}
  2/3               r_{n/3}
  k/(k+1)           r_{n/(k+1)}
42
Many challenges
  • Routing
  • Dynamic environment: neighbor nodes move in and out of range; sources and receivers may also be moving
  • Limited bandwidth: channel allocation, limited buffer sizes
  • Moving
  • What is the globally optimal configuration?
  • What is the globally optimal trajectory of
    configurations?
  • Can we learn a good policy using only local
    knowledge?

43
Routing
  • Q-routing [Boyan & Littman, 93]
  • Applied simple Q-learning to the static network routing problem under congestion (update rule sketched after this list)
  • Actions: forward the packet to a particular neighbor node
  • States: the current packet's intended receiver
  • Reward: estimated time to arrival at the receiver
  • Performed well by learning to route packets around congested areas
  • Direct application of Q-routing to the mobile ad-hoc network case
  • Adaptations to the highly dynamic nature of mobilized ad-hoc networks
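A minimal sketch of the Q-routing update from Boyan & Littman; the table layout `Q[x][d][y]` (a name chosen here for illustration) estimates the time for node x to deliver a packet bound for destination d via neighbor y:

    def q_routing_update(Q, x, y, d, queue_time, transit_time, eta=0.5):
        """After x forwards a d-bound packet to neighbor y, y reports its best
        remaining-time estimate; x nudges its own estimate toward the total."""
        remaining = min(Q[y][d].values())          # y's best estimate to reach d
        target = queue_time + transit_time + remaining
        Q[x][d][y] += eta * (target - Q[x][d][y])

    def next_hop(Q, x, d):
        """Greedy routing: forward to the neighbor with the smallest estimate."""
        return min(Q[x][d], key=Q[x][d].get)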

44
Movement: An RL approach
  • What should our actions be?
  • North, South, East, West, Stay Put
  • Explore, Maintain connection, Terminate
    connection, etc.
  • What should our states be?
  • Local information about nodes, locations, and
    paths
  • Summarized local information
  • Globally shared statistics
  • Policy search? Mixture of experts?

45
Macros, options, complex actions
  • Allow the nodes (agents) to utilize complex
    actions rather than simple N, S, E, W type
    movements
  • Actions might take varying amounts of time
  • Agents can re-evaluate whether to continue to do
    the action or not at each time step
  • If the state hasn't really changed, then naturally the same action will be chosen again

46
Example action: 'plug' (sketched after the steps below)
  1. Sniff packets in neighborhood
  2. Identify path (source, receiver pair) with
    longest average hops
  3. Move to that path
  4. Move along this path until a long hop is
    encountered
  5. Insert yourself into the path at this point,
    thereby decreasing the average hop distance
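As a sketch only, the five steps might look like this; `node`, `path`, and every method named here are hypothetical interfaces, not the simulator's API:

    def plug(node):
        paths = node.sniff_neighborhood_packets()              # 1. observe local traffic
        path = max(paths, key=lambda p: p.average_hop_length)  # 2. longest average hops
        node.move_to(path)                                     # 3. reach that path
        hop = node.follow_until_long_hop(path)                 # 4. find a long hop
        node.insert_between(hop.sender, hop.receiver)          # 5. plug in, shortening it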

47
Some notion of state
  • The state space could be huge, so we choose certain features to parameterize it
  • Connectivity, average hop distance, … (a minimal example follows this list)
  • Actions should change the world state
  • Exploring will hopefully lead to connectivity; plugging will lead to smaller average hops; …
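For instance (an illustrative sketch; the feature set and method names are assumptions), the parameterized state might be as small as:

    def state_features(node):
        """Summarize a node's local view as a coarse feature tuple."""
        return (node.is_connected(),                       # connectivity flag
                round(node.average_hop_distance(), 1))     # discretized avg hops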

48
Experimental results
  Number of    Range      Theoretical       Empirical fraction
  nodes                   fraction mobile   mobile required
  25           2 r_n      —                 —
  25           r_n        1/2               0.21
  50           1.7 r_n    —                 —
  50           0.85 r_n   1/2               0.25
  100          1.7 r_n    —                 —
  100          0.85 r_n   1/2               0.19
  200          1.6 r_n    —                 —
  200          0.8 r_n    1/2               0.17
  400          1.6 r_n    —                 —
  400          0.8 r_n    1/2               0.14
49
Seems to work well
50
Pretty pictures
51
Pretty pictures
52
Pretty pictures
53
Pretty pictures
54
Many things to play with
  • Lossy transmissions
  • Transmission interference
  • Existence of opponents, jamming signals
  • Self-interested nodes
  • More realistic simulations (ns2)
  • Learning different agent roles or optimizing the
    individual complex actions
  • Interaction between route learning and movement
    learning

55
Three yardsticks
  • Non-cooperative case: we want to play our best response to the observed play of the world; we want to learn about the opponent
  • Minimize regret
  • Play our best response

56
Three yardsticks
  1. Non-cooperative case: we want to play our best response to the observed play of the world
  2. Cooperative case: approximate a global optimum using only local information or less computation

57
Three yardsticks
  1. Non-cooperative case: we want to play our best response to the observed play of the world
  2. Cooperative case: approximate a global optimum in a distributed manner
  3. Skiing case: 17 cm of fresh powder last night and it's still snowing. More snow is better. Who can argue with that?

58
The End