Title: Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games
1 Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games
- XiaoFeng Wang and Tuomas Sandholm
- Carnegie Mellon University
2 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
3 Coordination Games
- A coordination game typically possesses multiple Nash equilibria, some of which may be Pareto dominated by others.
- Assumption: players (self-interested agents) prefer Nash equilibria to any other steady state (for example, a best-response loop).
- Objective: play a Nash equilibrium that is not Pareto dominated by any other Nash equilibrium.
- Why are coordination games important?
  - Whenever an individual agent cannot achieve its goal without interacting with others, coordination problems can arise.
  - Studying coordination games helps us understand how to achieve win-win outcomes in interactions and how to avoid getting stuck in undesirable equilibria.
- Examples: team games, Battle-of-the-Sexes, and minimum-effort games.
4 Team Games
- In a team game, all agents receive the same expected rewards.
- Team games are the simplest form of coordination games.
- Why are team games important?
  - A team game can have multiple Nash equilibria, only some of which are optimal. This captures the important properties of a general category of coordination games, so studying team games gives us an easy start without losing important generality.
5 Coordination Markov Games
- Markov decision process (MDP)
  - The environment is modeled as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.
- A coordination Markov game
  - A combination of an MDP and coordination games: a set of self-interested agents choose a joint action a ∈ A to determine the state transition so as to maximize their own payoffs. Team Markov games are an example.
- Relation between Markov games and repeated stage games
  - A joint Q-function maps a state-joint-action pair (s, a) to the tuple of discounted long-term rewards the individual agents receive by taking joint action a at state s and then following a joint strategy π.
  - Q(s, ·) can be viewed as a stage game in which agent i receives payoff Qi(s, a) (a component of the tuple Q(s, a)) when joint action a is taken by all agents at state s. We call such a game a state game.
  - A subgame-perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of Nash equilibria of a sequence of coordination state games.
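As a minimal illustration of the state-game view (function and variable names are my own, not from the slides), fixing a state s turns the joint Q-function of a 2-agent team game into a payoff matrix over joint actions:

```python
import itertools

# Hypothetical sketch: extract the "state game" Q(s, .) of a 2-agent team
# game. Q maps (state, joint_action) -> common expected payoff.
def state_game(Q, s, actions_1, actions_2):
    """Return Q(s, .) as a dict {(a1, a2): common payoff}."""
    return {(a1, a2): Q[(s, (a1, a2))]
            for a1, a2 in itertools.product(actions_1, actions_2)}
```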
6 Reinforcement Learning (RL)
- Objective of reinforcement learning
  - Find a strategy π: S → A that maximizes an agent's discounted long-term payoffs without knowledge of the environment model (reward structure and transition probabilities).
- Model-based reinforcement learning
  - Learn the reward structure and transition probabilities, then compute the Q-function from them.
- Model-free reinforcement learning
  - Learn the Q-function directly.
- Learning policy
  - Interleave learning with execution of the learnt policy.
  - A GLIE (greedy in the limit with infinite exploration) policy guarantees convergence to an optimal policy in a single-agent MDP.
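To make the model-based route concrete, here is a minimal single-agent sketch (class and method names are assumptions for illustration): the model is estimated from counts, Q is recomputed by value iteration, and an epsilon-greedy policy with decaying epsilon gives the GLIE property.

```python
import random
from collections import defaultdict

# Minimal model-based RL sketch for a single-agent MDP (not the paper's
# exact algorithm): estimate rewards/transitions from counts, solve for Q
# by value iteration, and act with a GLIE (decaying epsilon-greedy) policy.
class ModelBasedRL:
    def __init__(self, states, actions, gamma=0.95):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.n = defaultdict(int)        # visit counts n(s, a)
        self.n_sas = defaultdict(int)    # transition counts n(s, a, s')
        self.r_sum = defaultdict(float)  # cumulative reward for (s, a)
        self.Q = defaultdict(float)

    def update_model(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def value_iteration(self, sweeps=50):
        for _ in range(sweeps):
            for s in self.states:
                for a in self.actions:
                    if self.n[(s, a)] == 0:
                        continue  # no data yet for this pair
                    r_hat = self.r_sum[(s, a)] / self.n[(s, a)]
                    exp_next = sum(
                        self.n_sas[(s, a, s2)] / self.n[(s, a)]
                        * max(self.Q[(s2, a2)] for a2 in self.actions)
                        for s2 in self.states)
                    self.Q[(s, a)] = r_hat + self.gamma * exp_next

    def act(self, s, t):
        eps = 1.0 / (1 + t)  # decaying exploration probability => GLIE
        if random.random() < eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])
```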
7 RL in a Coordination Markov Game
- Objective
  - Without knowing the game structure, each agent i tries to find an optimal individual strategy πi: S → Ai that maximizes the sum of its discounted long-term payoffs.
- Difficulties
  - The two layers of learning (learning the game structure and learning a strategy) are interdependent when learning a general Markov game: on one hand, the strategy is determined from the Q-function; on the other hand, the Q-function is learnt with respect to the joint strategy the agents take.
- RL in team Markov games
  - Team Markov games simplify the learning problem: off-policy learning of the game structure, plus learning coordination over the individual state games.
  - In a team Markov game, the combination of the individual agents' optimal policies is an optimal Nash equilibrium of the game.
  - Although simple, this is trickier than it appears.
8 Research Issues
- How can agents play an optimal Nash equilibrium in an unknown team Markov game?
- How can the results be extended to a more general category of coordination stage games and Markov games?
9 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
10 Setting
- Agents make decisions independently and concurrently.
- There is no communication between agents.
- Agents independently receive reward signals with the same expected values.
- The environment model is unknown.
- Agents' actions are fully observable.
- Objective: find an optimal joint policy π: S → ×i Ai that maximizes the sum of discounted long-term rewards.
11 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
12 Coordination over a known game
- A team game may have multiple optimal NE. Without coordination, agents do not know how to play.

Claus and Boutilier's stage game:

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10

- Solutions
  - Lexicographic conventions (Boutilier)
    - Problem: sometimes the mechanism designer is unable or unwilling to impose an order.
  - Learning
    - Each agent treats the others as nonstrategic players and best-responds to the empirical distribution of the others' previous plays, e.g., fictitious play or adaptive play.
    - Problem: the learning process may converge to a sub-optimal NE, usually a risk-dominant NE.
13 Coordination over an unknown game
- An unknown game structure and noisy payoffs make coordination even more difficult.
- Receiving noisy rewards independently, agents may hold different views of the game at any particular moment. In this case, even a lexicographic convention does not work.
14 Problems
- Against a known game
  - By solving the game, agents can identify all the NE but do not know how to play.
  - By myopic play (learning), agents can learn to play a consistent NE, which however may not be optimal.
- Against an unknown game
  - Agents may not be able to identify the optimal NE before the game structure fully converges.
15 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
16 Optimal Adaptive Learning
- Basic ideas
  - Over a known game: eliminate the sub-optimal NE and then use myopic play (learning) to learn how to play.
  - Over an unknown game: estimate the NE of the game before the game structure converges; interleave the learning of coordination with the learning of the game structure.
- Learning layers
  - Learning of coordination: biased adaptive play against virtual games.
  - Learning of game structure: construction of virtual games with an ε-bound over a model-based RL algorithm.
17 Virtual games
- A virtual game (VG) is derived from a team state game Q(s, ·) as follows: if a is an optimal NE in Q(s, ·), then VG(s, a) = 1; otherwise, VG(s, a) = 0.
- Virtual games eliminate all the strictly sub-optimal NE of the original game. This is nontrivial when there are more than two players.
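A sketch of the construction for a single state (names and data layout are assumptions): the optimal joint actions get payoff 1 in the VG and everything else gets 0.

```python
# Hypothetical sketch of virtual-game construction for a team state game.
# Q_s maps each joint action to the common payoff Q(s, a); optimal joint
# actions get 1 in the VG, all others 0, so strictly sub-optimal NE of the
# original state game are no longer strict NE of the VG.
def build_virtual_game(Q_s):
    q_max = max(Q_s.values())
    return {a: 1 if Q_s[a] == q_max else 0 for a in Q_s}

# Example: Claus and Boutilier's stage game. The VG assigns 1 only to the
# two optimal NE (A0,B0) and (A2,B2); the sub-optimal NE (A1,B1) gets 0.
penalty = {('A0','B0'): 10,   ('A0','B1'): 0, ('A0','B2'): -100,
           ('A1','B0'): 0,    ('A1','B1'): 5, ('A1','B2'): 0,
           ('A2','B0'): -100, ('A2','B1'): 0, ('A2','B2'): 10}
vg = build_virtual_game(penalty)   # 1 at (A0,B0) and (A2,B2), 0 elsewhere
```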
18 Adaptive Play
- Adaptive play (AP)
  - Each agent has a memory of limited size that holds the m most recently observed plays.
  - To choose an action, agent i randomly draws k samples (without replacement) from its memory to build an empirical model of the others' joint strategy.
  - For example, if a reduced joint action profile a−i (all individual actions but i's) appears K(a−i) times in the samples, agent i treats the probability of that profile as K(a−i)/k.
  - Agent i chooses the action that best responds to this distribution.
- Previous work (Peyton Young) shows that AP converges to a strict NE in any weakly acyclic game.
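One AP decision might look as follows (a sketch under assumed names: `history` holds the others' parts of the m most recent joint plays, and `payoff(ai, a_minus_i)` looks up agent i's payoff):

```python
import random
from collections import Counter

# Hypothetical sketch of a single adaptive-play decision for agent i.
def adaptive_play_action(history, k, my_actions, payoff, rng=random):
    samples = rng.sample(history, k)        # k of the m remembered profiles
    counts = Counter(samples)               # empirical model of a_{-i}
    def expected(ai):                       # expected payoff of action ai
        return sum(n / k * payoff(ai, a_mi) for a_mi, n in counts.items())
    best = max(expected(ai) for ai in my_actions)
    # best-respond, breaking ties uniformly at random
    return rng.choice([ai for ai in my_actions if expected(ai) == best])
```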
19 Weakly Acyclic Games and Biased Sets
- Weakly acyclic games (WAG)
  - In a weakly acyclic game, there exists a best-response path from any strategy profile to a strict NE.
- Many virtual games are WAGs; however, not all VGs are WAGs.
  - Some VGs have only weak NE, which do not constitute absorbing states.
- Weakly acyclic games w.r.t. a biased set (WAGB)
  - A game in which there exist best-response paths from any profile to an NE in a set D (called the biased set).
20 Biased Adaptive Play
- Biased adaptive play (BAP)
  - Similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set.
- Bias rules
  - If agent i's k samples all contain the same a−i, and that a−i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, then B will stick to this action.
- Biased adaptive play guarantees convergence to an optimal NE for any VG constructed over a team game, with the biased set containing all the optimal NE.
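A sketch of the bias rule (all names are assumptions; `biased_set` holds NE as pairs of agent i's part and the others' part, and `samples` are the k sampled reduced profiles):

```python
import random

# Hypothetical sketch of a BAP decision: stick with the most recent best
# response when all k samples agree on an a_{-i} that belongs to some NE
# in the biased set D; otherwise fall back to plain adaptive play.
def bap_action(samples, biased_set, my_actions, payoff,
               last_best_response, rng=random):
    same = all(a == samples[0] for a in samples)
    in_biased_NE = any(samples[0] == ne_minus_i
                       for (_, ne_minus_i) in biased_set)
    if same and in_biased_NE and last_best_response is not None:
        return last_best_response           # biased: replay recent BR
    def expected(ai):                       # empirical expected payoff
        return sum(payoff(ai, a) for a in samples) / len(samples)
    best = max(expected(ai) for ai in my_actions)
    return rng.choice([ai for ai in my_actions if expected(ai) == best])
```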
21 Constructing a VG over an unknown game
- Basic idea
  - Use a slowly decreasing bound (called the ε-bound) to find all optimal NE. Specifically:
    - At state s and time t, a joint action a is ε-optimal for the state game if Qt(s, a) ≥ maxa′ Qt(s, a′) − εt.
    - A virtual game VGt is constructed over these ε-optimal joint actions.
    - If limt→∞ εt = 0 and εt decreases more slowly than the Q-function converges, then VGt converges to VG.
- The construction of the ε-bound depends on the RL algorithm used to learn the game structure. Over a model-based reinforcement learning algorithm, we prove that the bound N^(b−0.5), for any 0 < b < 0.5, meets this condition, where N is the minimal number of samples made up to time t.
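In code, the bound and the resulting ε-optimal action set are tiny (a sketch; `Q_s` is the current estimate of Q(s, ·) as in the earlier sketches):

```python
# Hypothetical sketch of the epsilon-bound and the epsilon-optimal set.
# N is the minimal visit count over (s, a) pairs up to time t; the bound
# N**(b - 0.5), 0 < b < 0.5, shrinks to zero more slowly than the Q
# estimates converge, so VG_t eventually equals the true VG.
def epsilon_bound(N, b=0.25):
    return float('inf') if N == 0 else N ** (b - 0.5)

def epsilon_optimal_actions(Q_s, eps):
    """Joint actions a with Q_t(s, a) >= max_a' Q_t(s, a') - eps."""
    q_max = max(Q_s.values())
    return {a for a, q in Q_s.items() if q >= q_max - eps}
```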
22 The Algorithm
- Learning of coordination
  - For each state, construct VGt according to the ε-optimal actions.
  - Following a GLIE learning policy, use BAP to choose best-response actions over VGt with the exploitation probability.
- Learning of game structure
  - Use model-based RL to update the Q-function.
  - Update the ε-bound with the minimal number of samples, and find the ε-optimal actions with the bound.
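Putting the pieces together, one OAL decision at state s might look like this (a sketch that reuses `epsilon_bound`, `epsilon_optimal_actions`, and `bap_action` from the earlier sketches; the `agent` object and its attributes are assumptions for illustration only):

```python
import random

# Hypothetical sketch of one OAL step: build VG_t from the epsilon-optimal
# actions of the learnt Q(s, .), then play BAP over VG_t under GLIE.
def oal_step(agent, s, t, history, k, b=0.25):
    Q_s = {a: agent.Q[(s, a)] for a in agent.joint_actions}  # learnt Q(s,.)
    N = min(agent.n[(s, a)] for a in agent.joint_actions)    # min samples
    eps_opt = epsilon_optimal_actions(Q_s, epsilon_bound(N, b))
    vg = {a: 1 if a in eps_opt else 0 for a in Q_s}          # VG_t
    if random.random() < 1.0 / (1 + t):                      # GLIE explore
        return random.choice(agent.my_actions)
    samples = random.sample(history, k)                      # k of m records
    return bap_action(samples, agent.biased_set, agent.my_actions,
                      lambda ai, a_mi: vg[agent.join(ai, a_mi)],
                      agent.last_best_response)
```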
23 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
24 Flowchart of the Proof
- Theorem 1: BAP converges over a WAGB.
- Lemma 2: nonstationary Markov chains.
- Theorem 3: BAP with GLIE converges over a WAGB.
- Lemma 4: any VG is a WAGB.
- Theorem 5: convergence rate of the model-based RL.
- Theorem 6: the VG can be learnt with the ε-bound w.p.1.
- Main Theorem: OAL converges to an optimal NE w.p.1.
25 Modeling BAP as a Markov chain
- Stationary Markov chain model
  - States
    - An initial state is composed of the m initial joint actions the agents observed: h0 = (a1, a2, …, am).
    - The definition of the other states is inductive: the successor state h′ of a state h is obtained by deleting the oldest joint action and appending the newly observed one.
    - A state (a, a, …, a) is an individual absorbing state if a ∈ D or a is a strict NE. All individual absorbing states are clustered into a unique absorbing state.
  - Transitions
    - The probability p(h, h′) that state h transits to h′ is positive if and only if the newest joint action a = (a1, a2, …, an) in h′ is composed of individual actions ai, each of which best responds to some k samples drawn from h.
    - Since the distribution with which an agent samples its memory is independent of time, the transition probability between any two states does not change over time; therefore, the Markov chain is stationary.
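A sketch of the history state and the absorbing-state test (names are assumptions; here D and the strict NE are represented simply as sets of joint actions):

```python
from collections import deque

# Hypothetical sketch of the Markov chain's state: h holds the m most
# recent joint actions; a new observation shifts the window.
def step_history(h, new_joint_action, m):
    window = deque(h, maxlen=m)     # dropping the oldest entry if full
    window.append(new_joint_action)
    return tuple(window)

# h is individually absorbing when all m entries equal a single joint
# action that is a strict NE or lies in the biased set D.
def is_absorbing(h, strict_NE, D):
    a = h[0]
    return all(x == a for x in h) and (a in strict_NE or a in D)
```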
26 Convergence over a known game
- Theorem 1: Let L(a) be the shortest length of a best-response path from joint action a to an NE in D, and let LG = maxa L(a). If m ≥ k(LG + 2), then BAP over a WAGB converges to either an NE in D or a strict NE w.p.1.
- Nonstationary Markov chain model
  - With a GLIE learning policy, at any moment an agent has some probability of experimenting (exploring actions other than the estimated best response), and this exploration probability diminishes with time. Therefore, we can model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix Pt. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. Clearly, GLIE guarantees that Pt → P as t → ∞.
  - In the stationary Markov chain model, we have only one absorbing state (composed of several individual absorbing states). Theorem 1 says that such a Markov chain is ergodic, with only one stationary distribution, given m ≥ k(LG + 2). With nonstationary Markov chain theory, we obtain the following theorem:
- Theorem 3: With m ≥ k(LG + 2), BAP with GLIE converges to either an NE in D or a strict NE w.p.1.
27 Determining the length of a best-response path
- In a team game, LG is no more than n (the number of agents). A figure (omitted here) illustrates this: each box represents an individual action of an agent, with marked boxes representing individual actions contained in an NE. The remaining agents can move the joint action to an NE by switching their individual actions one after another; each switch is a best response given that the others stick to their individual actions.
- Lemma 4: The VG of any team game is a WAGB w.r.t. the set of optimal NE, with LVG ≤ n.
28 Learning the virtual games
- First, we assess the convergence rate of the model-based RL algorithm.
- Then, we construct the sufficient condition for the ε-bound over the convergence-rate lemma.
29 Main Theorem
- Theorem 7: In any team Markov game among n agents, if (1) m ≥ k(n + 1) and (2) the ε-bound satisfies Lemma 6, then the OAL algorithm converges to an optimal NE w.p.1.
- General idea of the proof
  - By Lemma 6, the probability of the event E that VGt = VG for the rest of the play after time t converges to 1 as t grows.
  - Starting from a time t, conditioned on the event E, the agents play BAP with GLIE over a known game, which converges to an optimal NE w.p.1 by Theorem 3.
  - Combining these two convergence processes, we obtain the convergence result.
30 Example: 2-agent game

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10
31 Example: 3-agent game
32 Example: multiple stage games
33 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
34 Extension: general ideas
- Classic game theory tells us how to solve a game, i.e., how to identify the fixed points of introspection. However, it says less about how to play a game.
- Standard ways to play a game
  - Solve the game first and play an NE strategy (strategic play).
    - Problems: (1) with multiple NE, agents may not know how to play; (2) it might be computationally expensive.
  - Assume the others take stationary strategies and best respond to that belief (myopic play).
    - Problem: myopic strategies may lead agents to play a sub-optimal (Pareto dominated) NE.
- The idea generalized from OAL: Partially Myopic and Partially Strategic (PMPS) play.
  - Biased action selection: strategically lead the others to play a stable strategy.
  - Virtual games: compute the NE first and then eliminate the sub-optimal NE.
  - Adaptive play: myopically adjust the best-response strategy w.r.t. the agent's observations.
35 Extension: beyond self-play
- Problem
  - OAL only guarantees convergence to an optimal NE in self-play, i.e., when all players are OAL agents. Can agents find optimal coordination when only some of them play OAL? Let's consider the simplest case: two agents, one a JAL or IL player (Claus and Boutilier 98), the other an OAL player.
- A straightforward way to enforce optimal coordination
  - Two players, one of whom is an opinionated player (a leader) who leads the play; the other is a learner.
  - If the other player is either a JAL or an IL player, convergence to the optimal NE is guaranteed.
  - But what if the other player is also a leader? More seriously, how should the leader play if it does not know the type of the other player?

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10
36 New Bias Rules
- Original bias rules
  - If agent i's k samples all contain the same a−i, and that a−i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, then B will stick to this action.
- New bias rules (see the sketch below)
  - If agent i has multiple best-response actions w.r.t. its k samples, it chooses the one included in an optimal NE in the VG. If there are several such choices, it chooses the one played most recently.
- Difference between the old and the new rules
  - The old rules bias the action selection only when the others' joint strategy is included in an optimal NE; otherwise, the agent just randomizes over its best-response actions.
  - The new rules always bias the agent's action selection.
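A sketch of the new rule (argument names are assumptions: `best_responses` are agent i's best responses to its k samples, `vg_optimal_NE_actions` the individual actions of i appearing in some optimal NE of the VG, and `recent_plays` i's own play history, oldest first):

```python
import random

# Hypothetical sketch of the new bias rule: among the best responses,
# prefer one that appears in some optimal NE of the VG; break remaining
# ties in favor of the most recently played such action.
def new_bap_action(best_responses, vg_optimal_NE_actions, recent_plays,
                   rng=random):
    preferred = [a for a in best_responses if a in vg_optimal_NE_actions]
    if not preferred:
        return rng.choice(best_responses)
    for a in reversed(recent_plays):        # scan most recent first
        if a in preferred:
            return a
    return rng.choice(preferred)            # none played yet: randomize
```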
37 Example
- The new rules preserve the convergence properties in n-agent team Markov games.
38 Extension: Beyond Team Games
- How can the ideas of PMPS play be extended to general coordination games?
- To simplify the setting, we now consider a category of coordination stage games with the following properties:
  - The games have at least one pure-strategy NE.
  - Agents have compatible preferences for some of these NE over any other steady states (such as mixed-strategy NE or best-response loops).
- Let's consider two situations: perfect monitoring and imperfect monitoring.
  - Perfect monitoring: agents can observe the others' actions and payoffs.
  - Imperfect monitoring: agents observe only the others' actions.
- In both cases, the agents may have no information about the game structure.
39 Perfect Monitoring
- Follows the same idea as OAL.
- Algorithm
  - Learning of coordination
    - Compute all the NE of the estimated game.
    - Find all the NE that are ε-dominated. For example, a strategy profile (a, b) is ε-dominated by (a′, b′) if Q(a) < Q(a′) − ε and Q(b) ≤ Q(b′) + ε. (A sketch of this filtering step follows the list.)
    - Construct a VG containing all the NE that are not ε-dominated, setting the other values in the VG to zero (without loss of generality, suppose the agents normalize their payoffs to values between zero and one).
    - With GLIE exploration, run BAP over the VG.
  - Learning of game structure
    - Observe the others' payoffs and update the sample means of the agents' expected payoffs in the game matrix.
    - Compute an ε-bound in the same way as in OAL.
- The learning over the coordination stage games discussed here is conjectured to converge w.p.1 to an NE that is not Pareto dominated.
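The ε-dominance filter might be sketched as follows (a reconstruction from the slide's garbled condition, generalized to n agents and not the paper's definitive test: a profile is dropped if some other NE makes one agent strictly better off by more than ε while making no agent worse off by more than ε):

```python
# Hypothetical sketch of epsilon-dominance filtering under perfect
# monitoring. NE is a list of pure-strategy Nash equilibria (joint
# actions); payoff[i][a] is agent i's estimated payoff at joint action a,
# normalized to [0, 1].
def filter_eps_dominated(NE, payoff, n_agents, eps):
    def dominated(a, by):
        strictly_worse = any(payoff[i][a] < payoff[i][by] - eps
                             for i in range(n_agents))
        never_better = all(payoff[i][a] <= payoff[i][by] + eps
                           for i in range(n_agents))
        return strictly_worse and never_better
    return [a for a in NE
            if not any(dominated(a, b) for b in NE if b != a)]
```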
40 Imperfect Monitoring
- In general, it is difficult to eliminate sub-optimal NE without knowing the others' payoffs. Let's consider the simplest case: two learning agents with at least one common interest (a strategy profile that maximizes both agents' payoffs).
- For this game, agents can learn to play an optimal NE with a modified version of OAL (with new bias rules).
- Bias rules: (1) each agent randomizes its action selection whenever the payoff of its best-response actions is zero over the virtual game; (2) each agent biases its action to its recent best response if all of its k samples contain the same individual action of the other agent, more than m − k recorded joint actions have this property, and the agent has multiple best responses giving it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.
- In this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.
41 Example

          A0     A1     A2
  B0     1, 0   0, 0   0, 1
  B1     0, 0   1, 1   0, 0
  B2     0, 1   0, 0   1, 0
42 Conclusions and Future Work
- In this research, we study RL techniques that let agents play an optimal NE (one not Pareto dominated by other NE) in coordination games when the environment model is unknown beforehand.
- We start with team games and propose the OAL algorithm, the first algorithm that guarantees convergence to an optimal NE in any team Markov game.
- We further generalize the basic ideas of OAL and propose a new approach to learning in games, called partially myopic and partially strategic (PMPS) play.
- We extend PMPS play beyond self-play and beyond team games. Some of the results can be extended to Markov games.
- In future research, we will further explore the application of PMPS play in coordination games. In particular, we will study how to eliminate sub-optimal NE in imperfect-monitoring environments.