Title: Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games
1 Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games
- XiaoFeng Wang and Tuomas Sandholm
- Carnegie Mellon University
2 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
3 Coordination Games
- A coordination game typically possesses multiple Nash equilibria, some of which may be Pareto dominated by others.
- Assumption: players (self-interested agents) prefer Nash equilibria to any other steady state (for example, a best-response loop).
- Objective: play a Nash equilibrium that is not Pareto dominated by any other Nash equilibrium.
- Why are coordination games important?
  - Whenever an individual agent cannot achieve its goal without interacting with others, coordination problems can arise.
  - Studying coordination games helps us understand how to achieve win-win outcomes in interactions and how to avoid getting stuck in undesirable equilibria.
- Examples: team games, Battle-of-the-Sexes, and minimum-effort games.
4 Team Games
- In a team game, all agents receive the same expected rewards.
- Team games are the simplest form of coordination games.
- Why are team games important?
  - A team game can have multiple Nash equilibria, only some of which are optimal. This captures the important properties of a general category of coordination games, so studying team games gives us an easy start without losing important generality.
5 Coordination Markov Games
- Markov decision process (MDP)
  - The environment is modeled as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.
- A coordination Markov game
  - A combination of an MDP and coordination games: a set of self-interested agents choose a joint action a ∈ A to determine the state transition so as to maximize their own payoffs. Team Markov games are an example.
- Relation between Markov games and repeated stage games
  - A joint Q-function maps a state-joint-action pair (s, a) to the tuple of discounted long-term rewards the individual agents receive by taking joint action a at state s and then following a joint strategy π.
  - Q(s, ·) can be viewed as a stage game in which agent i receives payoff Qi(s, a) (a component of the tuple Q(s, a)) when joint action a is taken by all agents at state s. We call such a game a state game.
  - A subgame-perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of Nash equilibria of a sequence of coordination state games.
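As a minimal illustration of the state-game view (function and variable names are my own, not from the slides), fixing a state s turns the joint Q-function of a 2-agent team game into a payoff matrix over joint actions:

```python
import itertools

# Hypothetical sketch: extract the "state game" Q(s, .) of a 2-agent team
# game. Q maps (state, joint_action) -> common expected payoff.
def state_game(Q, s, actions_1, actions_2):
    """Return Q(s, .) as a dict {(a1, a2): common payoff}."""
    return {(a1, a2): Q[(s, (a1, a2))]
            for a1, a2 in itertools.product(actions_1, actions_2)}
```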
6 Reinforcement Learning (RL)
- Objective of reinforcement learning
  - Find a strategy π: S → A that maximizes an agent's discounted long-term payoffs without knowledge of the environment model (reward structure and transition probabilities).
- Model-based reinforcement learning
  - Learn the reward structure and transition probabilities, then compute the Q-function from them.
- Model-free reinforcement learning
  - Learn the Q-function directly.
- Learning policy
  - Interleave learning with execution of the learnt policy.
  - A GLIE (greedy in the limit with infinite exploration) policy guarantees convergence to an optimal policy in a single-agent MDP.
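To make the model-based route concrete, here is a minimal single-agent sketch (class and method names are assumptions for illustration): the model is estimated from counts, Q is recomputed by value iteration, and an epsilon-greedy policy with decaying epsilon gives the GLIE property.

```python
import random
from collections import defaultdict

# Minimal model-based RL sketch for a single-agent MDP (not the paper's
# exact algorithm): estimate rewards/transitions from counts, solve for Q
# by value iteration, and act with a GLIE (decaying epsilon-greedy) policy.
class ModelBasedRL:
    def __init__(self, states, actions, gamma=0.95):
        self.states, self.actions, self.gamma = states, actions, gamma
        self.n = defaultdict(int)        # visit counts n(s, a)
        self.n_sas = defaultdict(int)    # transition counts n(s, a, s')
        self.r_sum = defaultdict(float)  # cumulative reward for (s, a)
        self.Q = defaultdict(float)

    def update_model(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def value_iteration(self, sweeps=50):
        for _ in range(sweeps):
            for s in self.states:
                for a in self.actions:
                    if self.n[(s, a)] == 0:
                        continue  # no data yet for this pair
                    r_hat = self.r_sum[(s, a)] / self.n[(s, a)]
                    exp_next = sum(
                        self.n_sas[(s, a, s2)] / self.n[(s, a)]
                        * max(self.Q[(s2, a2)] for a2 in self.actions)
                        for s2 in self.states)
                    self.Q[(s, a)] = r_hat + self.gamma * exp_next

    def act(self, s, t):
        eps = 1.0 / (1 + t)  # decaying exploration probability => GLIE
        if random.random() < eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])
```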
7 RL in a Coordination Markov Game
- Objective
  - Without knowing the game structure, each agent i tries to find an optimal individual strategy πi: S → Ai that maximizes the sum of its discounted long-term payoffs.
- Difficulties
  - The two layers of learning (learning the game structure and learning a strategy) are interdependent when learning a general Markov game: on one hand, the strategy is determined from the Q-function; on the other hand, the Q-function is learnt with respect to the joint strategy the agents take.
- RL in team Markov games
  - Team Markov games simplify the learning problem: off-policy learning of the game structure, plus learning coordination over the individual state games.
  - In a team Markov game, the combination of the individual agents' optimal policies is an optimal Nash equilibrium of the game.
  - Although simple, this is trickier than it appears.
8 Research Issues
- How can agents play an optimal Nash equilibrium in an unknown team Markov game?
- How can the results be extended to a more general category of coordination stage games and Markov games?
9 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
10 Setting
- Agents make decisions independently and concurrently.
- There is no communication between agents.
- Agents independently receive reward signals with the same expected values.
- The environment model is unknown.
- Agents' actions are fully observable.
- Objective: find an optimal joint policy π: S → ×i Ai that maximizes the sum of discounted long-term rewards.
11 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
12 Coordination over a known game
- A team game may have multiple optimal NE. Without coordination, agents do not know how to play.

Claus and Boutilier's stage game:

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10

- Solutions
  - Lexicographic conventions (Boutilier)
    - Problem: sometimes the mechanism designer is unable or unwilling to impose an order.
  - Learning
    - Each agent treats the others as nonstrategic players and best-responds to the empirical distribution of the others' previous plays, e.g., fictitious play or adaptive play.
    - Problem: the learning process may converge to a sub-optimal NE, usually a risk-dominant NE.
13 Coordination over an unknown game
- An unknown game structure and noisy payoffs make coordination even more difficult.
- Receiving noisy rewards independently, agents may hold different views of the game at any particular moment. In this case, even a lexicographic convention does not work.
14 Problems
- Against a known game
  - By solving the game, agents can identify all the NE but do not know how to play.
  - By myopic play (learning), agents can learn to play a consistent NE, which however may not be optimal.
- Against an unknown game
  - Agents may not be able to identify the optimal NE before the game structure fully converges.
15 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
16 Optimal Adaptive Learning
- Basic ideas
  - Over a known game: eliminate the sub-optimal NE and then use myopic play (learning) to learn how to play.
  - Over an unknown game: estimate the NE of the game before the game structure converges; interleave the learning of coordination with the learning of the game structure.
- Learning layers
  - Learning of coordination: biased adaptive play against virtual games.
  - Learning of game structure: construction of virtual games with an ε-bound over a model-based RL algorithm.
17 Virtual games
- A virtual game (VG) is derived from a team state game Q(s, ·) as follows: if a is an optimal NE in Q(s, ·), then VG(s, a) = 1; otherwise, VG(s, a) = 0.
- Virtual games eliminate all the strictly sub-optimal NE of the original game. This is nontrivial when there are more than two players.
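A sketch of the construction for a single state (names and data layout are assumptions): the optimal joint actions get payoff 1 in the VG and everything else gets 0.

```python
# Hypothetical sketch of virtual-game construction for a team state game.
# Q_s maps each joint action to the common payoff Q(s, a); optimal joint
# actions get 1 in the VG, all others 0, so strictly sub-optimal NE of the
# original state game are no longer strict NE of the VG.
def build_virtual_game(Q_s):
    q_max = max(Q_s.values())
    return {a: 1 if Q_s[a] == q_max else 0 for a in Q_s}

# Example: Claus and Boutilier's stage game. The VG assigns 1 only to the
# two optimal NE (A0,B0) and (A2,B2); the sub-optimal NE (A1,B1) gets 0.
penalty = {('A0','B0'): 10,   ('A0','B1'): 0, ('A0','B2'): -100,
           ('A1','B0'): 0,    ('A1','B1'): 5, ('A1','B2'): 0,
           ('A2','B0'): -100, ('A2','B1'): 0, ('A2','B2'): 10}
vg = build_virtual_game(penalty)   # 1 at (A0,B0) and (A2,B2), 0 elsewhere
```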
18 Adaptive Play
- Adaptive play (AP)
  - Each agent has a memory of limited size that holds the m most recently observed plays.
  - To choose an action, agent i randomly draws k samples (without replacement) from its memory to build an empirical model of the others' joint strategy.
  - For example, if a reduced joint action profile a−i (all individual actions but i's) appears K(a−i) times in the samples, agent i treats the probability of that profile as K(a−i)/k.
  - Agent i chooses the action that best responds to this distribution.
- Previous work (Peyton Young) shows that AP converges to a strict NE in any weakly acyclic game.
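One AP decision might look as follows (a sketch under assumed names: `history` holds the others' parts of the m most recent joint plays, and `payoff(ai, a_minus_i)` looks up agent i's payoff):

```python
import random
from collections import Counter

# Hypothetical sketch of a single adaptive-play decision for agent i.
def adaptive_play_action(history, k, my_actions, payoff, rng=random):
    samples = rng.sample(history, k)        # k of the m remembered profiles
    counts = Counter(samples)               # empirical model of a_{-i}
    def expected(ai):                       # expected payoff of action ai
        return sum(n / k * payoff(ai, a_mi) for a_mi, n in counts.items())
    best = max(expected(ai) for ai in my_actions)
    # best-respond, breaking ties uniformly at random
    return rng.choice([ai for ai in my_actions if expected(ai) == best])
```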
19 Weakly Acyclic Games and Biased Sets
- Weakly acyclic games (WAG)
  - In a weakly acyclic game, there exists a best-response path from any strategy profile to a strict NE.
- Many virtual games are WAGs; however, not all VGs are WAGs.
  - Some VGs have only weak NE, which do not constitute absorbing states.
- Weakly acyclic games w.r.t. a biased set (WAGB)
  - A game in which there exist best-response paths from any profile to an NE in a set D (called the biased set).
20 Biased Adaptive Play
- Biased adaptive play (BAP)
  - Similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set.
- Bias rules
  - If agent i's k samples all contain the same a−i, and that a−i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, then B will stick to this action.
- Biased adaptive play guarantees convergence to an optimal NE for any VG constructed over a team game, with the biased set containing all the optimal NE.
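A sketch of the bias rule (all names are assumptions; `biased_set` holds NE as pairs of agent i's part and the others' part, and `samples` are the k sampled reduced profiles):

```python
import random

# Hypothetical sketch of a BAP decision: stick with the most recent best
# response when all k samples agree on an a_{-i} that belongs to some NE
# in the biased set D; otherwise fall back to plain adaptive play.
def bap_action(samples, biased_set, my_actions, payoff,
               last_best_response, rng=random):
    same = all(a == samples[0] for a in samples)
    in_biased_NE = any(samples[0] == ne_minus_i
                       for (_, ne_minus_i) in biased_set)
    if same and in_biased_NE and last_best_response is not None:
        return last_best_response           # biased: replay recent BR
    def expected(ai):                       # empirical expected payoff
        return sum(payoff(ai, a) for a in samples) / len(samples)
    best = max(expected(ai) for ai in my_actions)
    return rng.choice([ai for ai in my_actions if expected(ai) == best])
```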
21 Constructing a VG over an unknown game
- Basic idea
  - Use a slowly decreasing bound (called the ε-bound) to find all optimal NE. Specifically:
    - At state s and time t, a joint action a is ε-optimal for the state game if Qt(s, a) ≥ maxa′ Qt(s, a′) − εt.
    - A virtual game VGt is constructed over these ε-optimal joint actions.
    - If limt→∞ εt = 0 and εt decreases more slowly than the Q-function converges, then VGt converges to VG.
- The construction of the ε-bound depends on the RL algorithm used to learn the game structure. Over a model-based reinforcement learning algorithm, we prove that the bound N^(b−0.5), for any 0 < b < 0.5, meets this condition, where N is the minimal number of samples made up to time t.
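In code, the bound and the resulting ε-optimal action set are tiny (a sketch; `Q_s` is the current estimate of Q(s, ·) as in the earlier sketches):

```python
# Hypothetical sketch of the epsilon-bound and the epsilon-optimal set.
# N is the minimal visit count over (s, a) pairs up to time t; the bound
# N**(b - 0.5), 0 < b < 0.5, shrinks to zero more slowly than the Q
# estimates converge, so VG_t eventually equals the true VG.
def epsilon_bound(N, b=0.25):
    return float('inf') if N == 0 else N ** (b - 0.5)

def epsilon_optimal_actions(Q_s, eps):
    """Joint actions a with Q_t(s, a) >= max_a' Q_t(s, a') - eps."""
    q_max = max(Q_s.values())
    return {a for a, q in Q_s.items() if q >= q_max - eps}
```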
22 The Algorithm
- Learning of coordination
  - For each state, construct VGt according to the ε-optimal actions.
  - Following a GLIE learning policy, use BAP to choose best-response actions over VGt with the exploitation probability.
- Learning of game structure
  - Use model-based RL to update the Q-function.
  - Update the ε-bound with the minimal number of samples, and find the ε-optimal actions with the bound.
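Putting the pieces together, one OAL decision at state s might look like this (a sketch that reuses `epsilon_bound`, `epsilon_optimal_actions`, and `bap_action` from the earlier sketches; the `agent` object and its attributes are assumptions for illustration only):

```python
import random

# Hypothetical sketch of one OAL step: build VG_t from the epsilon-optimal
# actions of the learnt Q(s, .), then play BAP over VG_t under GLIE.
def oal_step(agent, s, t, history, k, b=0.25):
    Q_s = {a: agent.Q[(s, a)] for a in agent.joint_actions}  # learnt Q(s,.)
    N = min(agent.n[(s, a)] for a in agent.joint_actions)    # min samples
    eps_opt = epsilon_optimal_actions(Q_s, epsilon_bound(N, b))
    vg = {a: 1 if a in eps_opt else 0 for a in Q_s}          # VG_t
    if random.random() < 1.0 / (1 + t):                      # GLIE explore
        return random.choice(agent.my_actions)
    samples = random.sample(history, k)                      # k of m records
    return bap_action(samples, agent.biased_set, agent.my_actions,
                      lambda ai, a_mi: vg[agent.join(ai, a_mi)],
                      agent.last_best_response)
```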
23 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
24 Flowchart of the Proof
- Theorem 1: BAP converges over a WAGB.
- Lemma 2: nonstationary Markov chains.
- Theorem 3: BAP with GLIE converges over a WAGB.
- Lemma 4: any VG is a WAGB.
- Theorem 5: convergence rate of the model-based RL.
- Theorem 6: the VG can be learnt with the ε-bound w.p.1.
- Main Theorem: OAL converges to an optimal NE w.p.1.
25 Modeling BAP as a Markov chain
- Stationary Markov chain model
  - States
    - An initial state is composed of the m initial joint actions the agents observed: h0 = (a1, a2, …, am).
    - The definition of the other states is inductive: the successor state h′ of a state h is obtained by deleting the oldest joint action and appending the newly observed one.
    - A state (a, a, …, a) is an individual absorbing state if a ∈ D or a is a strict NE. All individual absorbing states are clustered into a unique absorbing state.
  - Transitions
    - The probability p(h, h′) that state h transits to h′ is positive if and only if the newest joint action a = (a1, a2, …, an) in h′ is composed of individual actions ai, each of which best responds to some k samples drawn from h.
    - Since the distribution with which an agent samples its memory is independent of time, the transition probability between any two states does not change over time; therefore, the Markov chain is stationary.
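A sketch of the history state and the absorbing-state test (names are assumptions; here D and the strict NE are represented simply as sets of joint actions):

```python
from collections import deque

# Hypothetical sketch of the Markov chain's state: h holds the m most
# recent joint actions; a new observation shifts the window.
def step_history(h, new_joint_action, m):
    window = deque(h, maxlen=m)     # dropping the oldest entry if full
    window.append(new_joint_action)
    return tuple(window)

# h is individually absorbing when all m entries equal a single joint
# action that is a strict NE or lies in the biased set D.
def is_absorbing(h, strict_NE, D):
    a = h[0]
    return all(x == a for x in h) and (a in strict_NE or a in D)
```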
26 Convergence over a known game
- Theorem 1: Let L(a) be the shortest length of a best-response path from joint action a to an NE in D, and let LG = maxa L(a). If m ≥ k(LG + 2), then BAP over a WAGB converges to either an NE in D or a strict NE w.p.1.
- Nonstationary Markov chain model
  - With a GLIE learning policy, at any moment an agent has some probability of experimenting (exploring actions other than the estimated best response), and this exploration probability diminishes with time. Therefore, we can model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix Pt. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. Clearly, GLIE guarantees that Pt → P as t → ∞.
  - In the stationary Markov chain model, we have only one absorbing state (composed of several individual absorbing states). Theorem 1 says that such a Markov chain is ergodic, with only one stationary distribution, given m ≥ k(LG + 2). With nonstationary Markov chain theory, we obtain the following theorem:
- Theorem 3: With m ≥ k(LG + 2), BAP with GLIE converges to either an NE in D or a strict NE w.p.1.
27 Determining the length of a best-response path
- In a team game, LG is no more than n (the number of agents). A figure (omitted here) illustrates this: each box represents an individual action of an agent, with marked boxes representing individual actions contained in an NE. The remaining agents can move the joint action to an NE by switching their individual actions one after another; each switch is a best response given that the others stick to their individual actions.
- Lemma 4: The VG of any team game is a WAGB w.r.t. the set of optimal NE, with LVG ≤ n.
28 Learning the virtual games
- First, we assess the convergence rate of the model-based RL algorithm.
- Then, we construct the sufficient condition for the ε-bound over the convergence-rate lemma.
29 Main Theorem
- Theorem 7: In any team Markov game among n agents, if (1) m ≥ k(n + 1) and (2) the ε-bound satisfies Lemma 6, then the OAL algorithm converges to an optimal NE w.p.1.
- General idea of the proof
  - By Lemma 6, the probability of the event E that VGt = VG for the rest of the play after time t converges to 1 as t grows.
  - Starting from a time t, conditioned on the event E, the agents play BAP with GLIE over a known game, which converges to an optimal NE w.p.1 by Theorem 3.
  - Combining these two convergence processes, we obtain the convergence result.
30 Example: 2-agent game

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10
31 Example: 3-agent game
32 Example: multiple stage games
33 Outline
- Introduction
- Settings
- Coordination Difficulties
- Optimal Adaptive Learning
- Convergence Proof
- Extension Beyond Self-play
- Extension Beyond Team Games
- Conclusion and Future Work
34 Extension: general ideas
- Classic game theory tells us how to solve a game, i.e., how to identify the fixed points of introspection. However, it says less about how to play a game.
- Standard ways to play a game
  - Solve the game first and play an NE strategy (strategic play).
    - Problems: (1) with multiple NE, agents may not know how to play; (2) it might be computationally expensive.
  - Assume the others take stationary strategies and best respond to that belief (myopic play).
    - Problem: myopic strategies may lead agents to play a sub-optimal (Pareto dominated) NE.
- The idea generalized from OAL: Partially Myopic and Partially Strategic (PMPS) play.
  - Biased action selection: strategically lead the others to play a stable strategy.
  - Virtual games: compute the NE first and then eliminate the sub-optimal NE.
  - Adaptive play: myopically adjust the best-response strategy w.r.t. the agent's observations.
35 Extension: beyond self-play
- Problem
  - OAL only guarantees convergence to an optimal NE in self-play, i.e., when all players are OAL agents. Can agents find optimal coordination when only some of them play OAL? Let's consider the simplest case: two agents, one a JAL or IL player (Claus and Boutilier 98), the other an OAL player.
- A straightforward way to enforce optimal coordination
  - Two players, one of whom is an opinionated player (a leader) who leads the play; the other is a learner.
  - If the other player is either a JAL or an IL player, convergence to the optimal NE is guaranteed.
  - But what if the other player is also a leader? More seriously, how should the leader play if it does not know the type of the other player?

         A0    A1    A2
  B0     10     0  -100
  B1      0     5     0
  B2   -100     0    10
36 New Bias Rules
- Original bias rules
  - If agent i's k samples all contain the same a−i, and that a−i is part of at least one NE in D, the agent chooses its most recent best response to that profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response is B0, then B will stick to this action.
- New bias rules (see the sketch below)
  - If agent i has multiple best-response actions w.r.t. its k samples, it chooses the one included in an optimal NE in the VG. If there are several such choices, it chooses the one played most recently.
- Difference between the old and the new rules
  - The old rules bias the action selection only when the others' joint strategy is included in an optimal NE; otherwise, the agent just randomizes over its best-response actions.
  - The new rules always bias the agent's action selection.
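A sketch of the new rule (argument names are assumptions: `best_responses` are agent i's best responses to its k samples, `vg_optimal_NE_actions` the individual actions of i appearing in some optimal NE of the VG, and `recent_plays` i's own play history, oldest first):

```python
import random

# Hypothetical sketch of the new bias rule: among the best responses,
# prefer one that appears in some optimal NE of the VG; break remaining
# ties in favor of the most recently played such action.
def new_bap_action(best_responses, vg_optimal_NE_actions, recent_plays,
                   rng=random):
    preferred = [a for a in best_responses if a in vg_optimal_NE_actions]
    if not preferred:
        return rng.choice(best_responses)
    for a in reversed(recent_plays):        # scan most recent first
        if a in preferred:
            return a
    return rng.choice(preferred)            # none played yet: randomize
```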
37 Example
- The new rules preserve the convergence properties in n-agent team Markov games.
38 Extension: Beyond Team Games
- How can the ideas of PMPS play be extended to general coordination games?
- To simplify the setting, we now consider a category of coordination stage games with the following properties:
  - The games have at least one pure-strategy NE.
  - Agents have compatible preferences for some of these NE over any other steady states (such as mixed-strategy NE or best-response loops).
- Let's consider two situations: perfect monitoring and imperfect monitoring.
  - Perfect monitoring: agents can observe the others' actions and payoffs.
  - Imperfect monitoring: agents observe only the others' actions.
- In both cases, the agents may have no information about the game structure.
39 Perfect Monitoring
- Follows the same idea as OAL.
- Algorithm
  - Learning of coordination
    - Compute all the NE of the estimated game.
    - Find all the NE that are ε-dominated. For example, a strategy profile (a, b) is ε-dominated by (a′, b′) if Q(a) < Q(a′) − ε and Q(b) ≤ Q(b′) + ε. (A sketch of this filtering step follows the list.)
    - Construct a VG containing all the NE that are not ε-dominated, setting the other values in the VG to zero (without loss of generality, suppose the agents normalize their payoffs to values between zero and one).
    - With GLIE exploration, run BAP over the VG.
  - Learning of game structure
    - Observe the others' payoffs and update the sample means of the agents' expected payoffs in the game matrix.
    - Compute an ε-bound in the same way as in OAL.
- The learning over the coordination stage games discussed here is conjectured to converge w.p.1 to an NE that is not Pareto dominated.
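The ε-dominance filter might be sketched as follows (a reconstruction from the slide's garbled condition, generalized to n agents and not the paper's definitive test: a profile is dropped if some other NE makes one agent strictly better off by more than ε while making no agent worse off by more than ε):

```python
# Hypothetical sketch of epsilon-dominance filtering under perfect
# monitoring. NE is a list of pure-strategy Nash equilibria (joint
# actions); payoff[i][a] is agent i's estimated payoff at joint action a,
# normalized to [0, 1].
def filter_eps_dominated(NE, payoff, n_agents, eps):
    def dominated(a, by):
        strictly_worse = any(payoff[i][a] < payoff[i][by] - eps
                             for i in range(n_agents))
        never_better = all(payoff[i][a] <= payoff[i][by] + eps
                           for i in range(n_agents))
        return strictly_worse and never_better
    return [a for a in NE
            if not any(dominated(a, b) for b in NE if b != a)]
```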
40 Imperfect Monitoring
- In general, it is difficult to eliminate sub-optimal NE without knowing the others' payoffs. Let's consider the simplest case: two learning agents with at least one common interest (a strategy profile that maximizes both agents' payoffs).
- For this game, agents can learn to play an optimal NE with a modified version of OAL (with new bias rules).
- Bias rules: (1) each agent randomizes its action selection whenever the payoff of its best-response actions is zero over the virtual game; (2) each agent biases its action to its recent best response if all of its k samples contain the same individual action of the other agent, more than m − k recorded joint actions have this property, and the agent has multiple best responses giving it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.
- In this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.
41 Example

          A0     A1     A2
  B0     1, 0   0, 0   0, 1
  B1     0, 0   1, 1   0, 0
  B2     0, 1   0, 0   1, 0
42 Conclusions and Future Work
- In this research, we study RL techniques that let agents play an optimal NE (one not Pareto dominated by other NE) in coordination games when the environment model is unknown beforehand.
- We start with team games and propose the OAL algorithm, the first algorithm that guarantees convergence to an optimal NE in any team Markov game.
- We further generalize the basic ideas of OAL and propose a new approach to learning in games, called partially myopic and partially strategic (PMPS) play.
- We extend PMPS play beyond self-play and beyond team games. Some of the results can be extended to Markov games.
- In future research, we will further explore the application of PMPS play in coordination games. In particular, we will study how to eliminate sub-optimal NE in imperfect-monitoring environments.