Transcript and Presenter's Notes

Title: Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games


1
Reinforcement Learning to Play an Optimal Nash
Equilibrium in Coordination Markov Games
  • XiaoFeng Wang and Tuomas Sandholm
  • Carnegie Mellon University

2
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

3
Coordination Games
  • Coordination Games
  • A coordination game typically possesses multiple
    Nash equilibria, some of which might be Pareto
    dominated by some of the others.
  • Assumption: Players (self-interested agents)
    prefer Nash equilibria to any other steady
    states (for example, a best-response loop).
  • Objective: to play a Nash equilibrium that is
    not Pareto dominated by any other Nash equilibrium.
  • Why are coordination games important?
  • Whenever an individual agent cannot achieve its
    goal without interacting with others,
    coordination problems can arise.
  • Studying coordination games helps us understand
    how to achieve win-win outcomes in interactions
    and avoid getting stuck in undesirable equilibria.
  • Examples: team games, Battle-of-the-Sexes, and
    minimum-effort games.

4
Team Games
  • Team Games
  • In a team game, agents receive the same expected
    rewards.
  • Team Games are the simplest form of coordination
    games
  • Why are team games important?
  • A team game can have multiple Nash equilibria,
    only some of which are optimal. This captures the
    important properties of a general category of
    coordination games, so studying team games gives
    us an easy start without losing important
    generality.

5
Coordination Markov Games
  • Markov decision process
  • The environment is modeled as a set of states S. A
    decision-maker (agent) drives the state changes
    to maximize the sum of its discounted
    long-term payoffs.
  • A coordination Markov game
  • Combination of an MDP and coordination games: a set
    of self-interested agents chooses a joint action
    a ∈ A to determine the state transition so as to
    maximize their own payoffs. For example, team
    Markov games.
  • Relation between Markov games and repeated stage
    games
  • A joint Q-function maps a state-joint-action pair
    (s, a) to the tuple of the sums of discounted
    long-term rewards the individual agents receive by
    taking joint action a at state s and then
    following a joint strategy π (see the formula
    after this list).
  • Q(s, ·) can be viewed as a stage game in which
    agent i receives payoff Q_i(s, a) (a component
    of the tuple Q(s, a)) when joint action a is
    taken by all agents at state s. We call
    such a game a state game.
  • A Subgame Perfect Nash equilibrium (SPNE) of a
    coordination Markov game is composed of the Nash
    equilibria of a sequence of coordination state
    games.
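For reference, the joint Q-function described above can be written in a standard Bellman form. This is a sketch under assumed notation (discount factor γ, reward function R, transition kernel P), not text from the slides:

\[
Q^{\pi}(s,a) \;=\; \mathbb{E}\!\left[\,\sum_{t \ge 0} \gamma^{t} r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a,\ \pi \right]
\;=\; R(s,a) \;+\; \gamma \sum_{s' \in S} P(s' \mid s,a)\, Q^{\pi}\!\bigl(s', \pi(s')\bigr)
\]

In a team game all agents share this function, so the state game Q(s, ·) is itself a team stage game whose optimal NE are its payoff-maximizing joint actions.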

6
Reinforcement Learning (RL)
  • Objective of reinforcement learning
  • Find a strategy π: S → A that maximizes an agent's
    discounted long-term payoffs without knowledge
    of the environment model (reward structure and
    transition probabilities).
  • Model-based reinforcement learning
  • Learn the reward structure and transition
    probabilities, then compute the Q-function.
  • Model-free reinforcement learning
  • Learn the Q-function directly.
  • Learning policy
  • Interleave learning with execution of the learnt
    policy.
  • A GLIE policy (greedy in the limit with infinite
    exploration) guarantees convergence to an optimal
    policy in a single-agent MDP (a sketch follows).
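Below is a minimal, illustrative Python sketch of a model-based learner and a GLIE exploration rule for a single-agent MDP, as described on this slide. The class and function names, the discount factor, and the 1/t exploration schedule are assumptions for illustration, not the authors' implementation.

import random
from collections import defaultdict

GAMMA = 0.95  # assumed discount factor

class ModelBasedRL:
    """Estimate rewards and transitions from samples, then back up Q by value iteration."""
    def __init__(self, states, actions):
        self.states, self.actions = list(states), list(actions)
        self.n = defaultdict(int)                            # visit counts N(s, a)
        self.r_sum = defaultdict(float)                      # summed rewards per (s, a)
        self.trans = defaultdict(lambda: defaultdict(int))   # counts N(s, a, s')
        self.Q = defaultdict(float)                          # unvisited pairs default to 0

    def observe(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.r_sum[(s, a)] += r
        self.trans[(s, a)][s_next] += 1

    def update_Q(self, sweeps=50):
        for _ in range(sweeps):
            for s in self.states:
                for a in self.actions:
                    if self.n[(s, a)] == 0:
                        continue
                    r_hat = self.r_sum[(s, a)] / self.n[(s, a)]
                    next_val = sum(c / self.n[(s, a)] *
                                   max(self.Q[(s2, a2)] for a2 in self.actions)
                                   for s2, c in self.trans[(s, a)].items())
                    self.Q[(s, a)] = r_hat + GAMMA * next_val

def glie_action(q_at_s, actions, t):
    """Greedy in the limit with infinite exploration: explore with probability ~ 1/t."""
    if random.random() < 1.0 / (t + 1):
        return random.choice(list(actions))
    return max(actions, key=lambda a: q_at_s[a])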

7
RL in a Coordination Markov Game
  • Objective
  • Without knowing the game structure, an agent i
    tries to find an optimal individual strategy
    π_i: S → A_i that maximizes the sum of its
    discounted long-term payoffs.
  • Difficulties
  • Two layers of learning (learning the game
    structure and learning the strategy) are
    interdependent when learning a general Markov
    game: on one hand, the strategy is determined
    from the Q-function; on the other hand, the
    Q-function is learnt with respect to the joint
    strategy the agents take.
  • RL in team Markov games
  • Team Markov games simplify the learning problem:
    off-policy learning of the game structure, and
    learning coordination over the individual state
    games.
  • In a team Markov game, the combination of the
    individual agents' optimal policies is an optimal
    Nash equilibrium of the game.
  • Although simple, this is trickier than it appears
    to be.

8
Research Issues
  • How to play an optimal Nash equilibrium in an
    unknown team Markov game?
  • How to extend the results to a more general
    category of coordination stage games and Markov
    games?

9
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

10
Setting
  • Agents make decisions independently and
    concurrently.
  • No communication between agents.
  • Agents independently receive reward signals with
    the same expected values.
  • The environment model is unknown.
  • Agents' actions are fully observable.
  • Objective: find an optimal joint policy
    π: S → ×_i A_i to maximize the sum of discounted
    long-term rewards.

11
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

12
Coordination over a known game
  • A team game may have multiple optimal NE. Without
    coordination, agents do not know how to play.

            A0    A1    A2
    B0      10     0  -100
    B1       0     5     0
    B2    -100     0    10

Claus and Boutilier's stage game
  • Solutions
  • Lexicographic conventions (Boutilier)
  • Problem: sometimes the mechanism designer is
    unable or unwilling to impose an ordering.
  • Learning
  • Each agent treats the others as nonstrategic
    players and best responds to the empirical
    distribution of the others' previous plays, e.g.,
    fictitious play or adaptive play.
  • Problem: the learning process may converge to a
    sub-optimal NE, usually a risk-dominant NE (see
    the sketch below).
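The sketch below illustrates the point about myopic learning: it runs plain fictitious play on the stage game above. Depending on the initial play, it can settle on the safe, risk-dominant NE (B1, A1) with payoff 5 instead of an optimal NE with payoff 10. This is a toy illustration, not the authors' experiment.

import random

# Claus & Boutilier's stage game: rows = B's action, columns = A's action (team payoff)
PAYOFF = [[10, 0, -100],
          [0, 5, 0],
          [-100, 0, 10]]

def fictitious_play(rounds=1000, seed=0):
    random.seed(seed)
    count_a = [0, 0, 0]   # empirical counts of A's past actions (B's belief)
    count_b = [0, 0, 0]   # empirical counts of B's past actions (A's belief)
    b, a = random.randrange(3), random.randrange(3)
    for _ in range(rounds):
        count_a[a] += 1
        count_b[b] += 1
        # each player best responds to the other's empirical action frequencies
        b = max(range(3), key=lambda i: sum(count_a[j] * PAYOFF[i][j] for j in range(3)))
        a = max(range(3), key=lambda j: sum(count_b[i] * PAYOFF[i][j] for i in range(3)))
    return b, a

print(fictitious_play())  # may return (1, 1): the sub-optimal but risk-dominant NE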

13
Coordination over an unknown game
  • Unknown game structure and noisy payoffs make
    coordination even more difficult.
  • Independently receiving noisy rewards, agents may
    hold different views of the game at a particular
    moment. In this case, even a lexicographic
    convention does not work.

14
Problems
  • Against a known game
  • By solving the game, agents can identify all the
    NE but do not know how to play.
  • By myopic play (learning), agents can learn to
    play a consistent NE, which however may not be
    optimal.
  • Against an unknown game
  • Agents might not be able to identify the optimal
    NE before the game structure estimate fully
    converges.

15
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

16
Optimal Adaptive Learning
  • Basic ideas
  • Over a known game: eliminate the sub-optimal NE
    and then use myopic play (learning) to learn how
    to play.
  • Over an unknown game: estimate the NE of the game
    before the game structure converges. Interleave
    learning of coordination with learning of the game
    structure.
  • Learning layers
  • Learning of coordination: Biased Adaptive Play
    over virtual games.
  • Learning of game structure: construction of
    virtual games with an ε-bound over a model-based
    RL algorithm.

17
Virtual games
  • A virtual game (VG) is derived from a team state
    game Q(s, ·) as follows:
  • If a is an optimal NE in Q(s, ·), VG(s, a) = 1.
    Otherwise, VG(s, a) = 0.
  • Virtual games eliminate all the strictly
    sub-optimal NE of the original game. This is
    nontrivial when the number of players is more
    than 2 (see the sketch below).
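A minimal sketch of the VG construction for a known team state game, assuming Q(s, ·) is represented as a dictionary from joint actions to the common payoff (names and representation are illustrative). With noisy estimates, the exact-equality test below is what the ε-bound of the later slides relaxes.

def virtual_game(q_s):
    """Map every payoff-maximizing joint action of a team state game to 1, others to 0."""
    best = max(q_s.values())
    return {joint: int(value == best) for joint, value in q_s.items()}

# toy 2x2 team state game with two optimal NE:
q_s = {("a0", "b0"): 10, ("a0", "b1"): 0,
       ("a1", "b0"): 0,  ("a1", "b1"): 10}
print(virtual_game(q_s))  # (a0, b0) and (a1, b1) map to 1, the rest to 0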

18
Adaptive Play
  • Adaptive play (AP)
  • Each agent has a limited memory that holds the m
    most recently observed plays.
  • To choose an action, an agent i randomly draws k
    samples (without replacement) to build an
    empirical model of the others' joint strategy.
  • For example, if a reduced joint action profile
    a_{-i} (all individual actions but i's) appears
    K(a_{-i}) times in the samples, agent i treats the
    probability of that profile as K(a_{-i})/k.
  • Agent i chooses an action that best responds to
    this distribution.
  • Previous work (Peyton Young) shows that AP
    converges to a strict NE in any weakly acyclic
    game (a sketch of one AP step follows).
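A sketch of one adaptive-play step for agent i, under assumed data structures (opponent profiles stored as hashable tuples, and payoff given as a function of agent i's action and the others' profile); it is illustrative rather than the exact algorithm from Young's work.

import random

def adaptive_play_action(history, my_actions, payoff, k):
    """history: the m most recently observed opponent profiles a_{-i}
    payoff(my_action, others_profile) -> float
    """
    sample = random.sample(history, k)            # k draws without replacement
    counts = {}
    for others in sample:                         # empirical model of the others
        counts[others] = counts.get(others, 0) + 1
    def expected(my_a):                           # expected payoff under that model
        return sum(c * payoff(my_a, others) for others, c in counts.items()) / k
    best = max(expected(a) for a in my_actions)
    # plain AP breaks ties randomly; BAP (next slides) changes this tie-breaking
    return random.choice([a for a in my_actions if expected(a) == best])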

19
Weakly Acyclic Games and Biased Set
  • Weakly acyclic games (WAG)
  • In a weakly acyclic game, there exists a
    best-response path from any strategy profile to a
    strict NE.
  • Many virtual games are WAGs
  • However, not all VGs are WAGs.
  • Some VGs only have weak NE, which do not
    constitute absorbing states.
  • Weakly acyclic game w.r.t. a biased set (WAGB)
  • A game in which there exist best-response paths
    from any strategy profile to an NE in a set D
    (called the biased set).

20
Biased Adaptive Play
  • Biased adaptive play (BAP)
  • Similar to AP except that an agent biases its
    action selection when it detects that it is
    playing an NE in the biased set.
  • Biased rules
  • For an agent i: if its k samples all contain the
    same a_{-i}, and that a_{-i} is included in at
    least one NE in D, the agent chooses its most
    recent best response to that profile. For
    example, if B's samples show that A keeps
    playing A0 and B's most recent best response is
    B0, then B will stick to this action.
  • Biased adaptive play guarantees convergence to an
    optimal NE for any VG constructed over a team
    game, with the biased set containing all the
    optimal NE (see the sketch below).
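A sketch of the biased rule on top of adaptive play, written for a virtual game with 0/1 entries. Representing the biased set as (own action, others' profile) pairs and passing in the most recent best response are simplifying assumptions.

import random

def bap_action(history, my_actions, vg_payoff, biased_set, k, last_best_response):
    """history: the m most recent opponent profiles a_{-i}
    vg_payoff(my_a, others): 0/1 entry of the virtual game
    biased_set: optimal NE, each stored as a pair (my_a, others_profile)
    """
    sample = random.sample(history, k)
    # bias rule: if every sampled a_{-i} is identical and belongs to some NE in D,
    # stick with the most recent best response to it
    if len(set(sample)) == 1 and any(sample[0] == ne[1] for ne in biased_set):
        if last_best_response is not None:
            return last_best_response
    # otherwise behave like ordinary adaptive play over the virtual game
    def score(my_a):
        return sum(vg_payoff(my_a, others) for others in sample)
    best = max(score(a) for a in my_actions)
    return random.choice([a for a in my_actions if score(a) == best])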

21
Construct VG over an unknown game
  • Basic ideas
  • Use a slowly decreasing bound (called the ε-bound)
    to find all optimal NE. Specifically:
  • At state s and time t, a joint action a is
    ε-optimal for the state game if
    Q_t(s, a) + ε_t ≥ max_{a'} Q_t(s, a').
  • A virtual game VG_t is constructed over these
    ε-optimal joint actions.
  • If lim_{t→∞} ε_t = 0 and ε_t decreases more slowly
    than the Q-function converges, then VG_t converges
    to VG.
  • The construction of the ε-bound depends on the RL
    algorithm used to learn the game structure. Over
    a model-based reinforcement learning algorithm,
    we prove that the bound ε_t = N^{b-0.5} meets the
    condition for any 0 < b < 0.5, where N is the
    minimal number of samples made up to time t (see
    the sketch below).
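A sketch of the ε-bound and the resulting VG_t over an estimated state game Q_t(s, ·), using the same dictionary representation as before; the exponent b and all names are illustrative assumptions.

def epsilon_bound(n_min, b=0.25):
    """Slowly decreasing bound; n_min is the minimal number of samples so far."""
    assert 0 < b < 0.5
    return n_min ** (b - 0.5) if n_min > 0 else float("inf")

def epsilon_optimal_actions(q_s, eps):
    """Joint actions a with Q_t(s, a) + eps >= max_a' Q_t(s, a')."""
    best = max(q_s.values())
    return {a for a, v in q_s.items() if v + eps >= best}

def virtual_game_t(q_s, eps):
    """VG_t: 1 on the eps-optimal joint actions, 0 elsewhere."""
    opt = epsilon_optimal_actions(q_s, eps)
    return {a: int(a in opt) for a in q_s}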

22
The Algorithm
  • Learning of coordination
  • For each state, construct VG_t from the
    ε-optimal actions.
  • Following a GLIE learning policy, use BAP to
    choose best-response actions over VG_t with the
    exploitation probability.
  • Learning of game structure
  • Use model-based RL to update the Q-function.
  • Update the ε-bound with the minimal number of
    samples, and find the ε-optimal actions with the
    bound (a toy end-to-end run follows).
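To show how the two layers fit together, here is a self-contained toy run of an OAL-style loop on the noisy two-agent stage game from the earlier slide (a single state, so the model-based part reduces to sample means). The bias rule is simplified, and the noise level, k, m, b, and the horizon are all assumptions; this is a sketch of the idea, not the authors' implementation.

import random

PAYOFF = [[10, 0, -100], [0, 5, 0], [-100, 0, 10]]  # common (team) payoff
ACTIONS = (0, 1, 2)
K, M, B = 5, 15, 0.25

def run(T=4000, seed=1):
    random.seed(seed)
    n = [[0] * 3 for _ in range(3)]       # sample counts per joint action
    q = [[0.0] * 3 for _ in range(3)]     # sample-mean estimates of Q
    hist = []                             # last M observed joint actions
    last_br = [None, None]                # each agent's most recent best response
    a0, a1 = random.choice(ACTIONS), random.choice(ACTIONS)
    for t in range(1, T + 1):
        # learning of game structure: noisy team reward, running mean
        r = PAYOFF[a0][a1] + random.gauss(0, 1)
        n[a0][a1] += 1
        q[a0][a1] += (r - q[a0][a1]) / n[a0][a1]
        hist = (hist + [(a0, a1)])[-M:]
        # virtual game VG_t over the epsilon-optimal joint actions
        n_min = min(min(row) for row in n)
        eps = n_min ** (B - 0.5) if n_min > 0 else float("inf")
        best = max(max(row) for row in q)
        vg = {(i, j): int(q[i][j] + eps >= best) for i in ACTIONS for j in ACTIONS}

        def choose(agent):
            # GLIE: explore with vanishing probability (also while memory fills up)
            if random.random() < 1.0 / t or len(hist) < M:
                return random.choice(ACTIONS)
            sample = random.sample(hist, K)
            others = [pair[1 - agent] for pair in sample]
            def val(x):  # sampled VG payoff of playing x against the others' plays
                return sum(vg[(x, o)] if agent == 0 else vg[(o, x)] for o in others)
            top = max(val(x) for x in ACTIONS)
            brs = [x for x in ACTIONS if val(x) == top]
            # simplified bias: if the other agent looks locked onto one action,
            # stick with our most recent best response when it is still a BR
            if len(set(others)) == 1 and last_br[agent] in brs:
                return last_br[agent]
            last_br[agent] = random.choice(brs)
            return last_br[agent]

        a0, a1 = choose(0), choose(1)
    return a0, a1

print(run())  # typically settles on (0, 0) or (2, 2); the simplified rule carries no guarantee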

23
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

24
Flowchart of the Proof
  • Theorem 1: BAP converges over a WAGB
  • Lemma 2: Nonstationary Markov chains
  • Theorem 3: BAP with GLIE converges over a WAGB
  • Lemma 4: Any VG is a WAGB
  • Theorem 5: Convergence rate of the model-based RL
  • Theorem 6: The VG can be learnt with the ε-bound w.p.1
  • Main Theorem: OAL converges to an optimal NE w.p.1
25
Model BAP as a Markov chain
  • Stationary Markov chain model
  • State
  • An initial state is composed of the m initial
    joint actions the agents observed:
    h_0 = (a^1, a^2, …, a^m).
  • The definition of the other states is inductive:
    the successor state h' of a state h is obtained by
    deleting the oldest joint action and adding the
    newly observed joint action to the tuple.
  • Absorbing states: (a, a, …, a) is an individual
    absorbing state if a ∈ D or a is a strict NE. All
    individual absorbing states are clustered into a
    unique absorbing state.
  • Transition
  • The probability p_{h,h'} that a state h transits to
    h' is positive if and only if the newly added joint
    action a = (a_1, a_2, …, a_n) in h' is composed of
    individual actions a_i, each of which best responds
    to some k samples drawn from h.
  • Since the distribution an agent uses to sample
    its memory is independent of time, the transition
    probability between any two states does not
    change with time. Therefore, the Markov chain
    is stationary.

26
Convergence over a known game
  • Theorem 1: Let L(a) be the length of the shortest
    best-response path from joint action a to an NE
    in D, and let L_G = max_a L(a). If m ≥ k(L_G + 2),
    BAP over a WAGB converges to either an NE in D or
    a strict NE w.p.1.
  • Nonstationary Markov Chain Model
  • With a GLIE learning policy, at any moment an
    agent has some probability of experimenting
    (exploring actions other than the estimated
    best response). The exploration probability
    diminishes with time. Therefore, we can model
    BAP with GLIE over a WAGB as a nonstationary
    Markov chain with transition matrix P_t. Let P be
    the transition matrix of the stationary Markov
    chain for BAP over the same WAGB. Clearly, GLIE
    guarantees that P_t → P as t → ∞.
  • In the stationary Markov chain model, we have only
    one absorbing state (composed of several
    individual absorbing states). Theorem 1 says
    that such a Markov chain is ergodic, with only
    one stationary distribution, given m ≥ k(L_G + 2).
    With nonstationary Markov chain theory, we
    get the following theorem:
  • Theorem 2: With m ≥ k(L_G + 2), BAP with GLIE
    converges to either an NE in D or a strict NE
    w.p.1.

27
Determine the length of best-response path
  • In a team game, L_G is no more than n (the number
    of agents). The figure (omitted here) illustrates
    this: each box represents an individual action of
    an agent, and a marked box represents an
    individual action contained in an NE. The agents
    whose individual actions are not yet part of the
    NE can move the joint action to the NE by
    switching their individual actions one after the
    other. Each switch is a best response given that
    the others stick to their individual actions.
  • Lemma 4: The VG of any team game is a WAGB w.r.t.
    the set of optimal NE, with L_VG ≤ n.

28
Learning the virtual games
  • First, we assess the convergence rate of the
    model-based RL algorithm.
  • Then, we construct the sufficient condition for
    the ε-bound from the convergence-rate lemma.

29
Main Theorem
  • Theorem 7: In any team Markov game among n agents,
    if (1) m ≥ k(n + 1) and (2) the ε-bound satisfies
    Lemma 6, then the OAL algorithm converges to an
    optimal NE w.p.1.
  • General ideas of the proof
  • With Lemma 6, the probability of the event E that
    VG_t = VG for the rest of the play after time t
    converges to 1 as t grows.
  • Starting from a time t, conditioned on the event
    E, agents play BAP with GLIE over a known game,
    which converges to an optimal NE w.p.1 according
    to Theorem 3.
  • Combining these two convergence processes, we get
    the convergence result.

30
Example: 2-agent game

            A0    A1    A2
    B0      10     0  -100
    B1       0     5     0
    B2    -100     0    10
31
Example: 3-agent game
32
Example: Multiple stage games
33
Outline
  • Introduction
  • Settings
  • Coordination Difficulties
  • Optimal Adaptive Learning
  • Convergence Proof
  • Extension Beyond Self-play
  • Extension Beyond Team Games
  • Conclusions and Future Work

34
Extension: general ideas
  • Classic game theory tells us how to solve a
    game, i.e., how to identify the fixed points of
    introspection. However, it says much less about
    how to play a game.
  • Standard ways to play a game
  • Solve the game first and play an NE strategy
    (strategic play).
  • Problems: (1) with multiple NE, agents sometimes
    may not know how to play; (2) it might be
    computationally expensive.
  • Assume that the others take stationary strategies
    and best respond to that belief (myopic play).
  • Problem: myopic strategies may lead agents to
    play a sub-optimal (Pareto-dominated) NE.
  • The idea generalized from OAL: Partially Myopic
    and Partially Strategic (PMPS) play.
  • Biased action selection: strategically lead the
    other agents to play a stable strategy.
  • Virtual games: compute the NE first and then
    eliminate the sub-optimal NE.
  • Adaptive play: myopically adjust the best-response
    strategy w.r.t. the agent's observations.

35
Extension Beyond self-play
  • Problem
  • OAL only guarantees convergence to an optimal NE
    in self-play, that is, when all players are OAL
    agents. Can agents find the optimal coordination
    when only some of them play OAL? Let's consider
    the simplest case: two agents, one a JAL or IL
    player (Claus and Boutilier 98) and the other an
    OAL player.
  • A straightforward way to enforce the optimal
    coordination
  • Two players, one of whom is an opinionated
    player who leads the play.
  • Leader
  • Learner
  • If the other is either a JAL or an IL player,
    convergence to the optimal NE is guaranteed.
  • What if the other is also a leader agent?
    More seriously, how should an agent play if the
    leader does not know the type of the other player?

            A0    A1    A2
    B0      10     0  -100
    B1       0     5     0
    B2    -100     0    10
36
New Biased Rules
  • Original biased rules
  • For an agent i: if its k samples all contain the
    same a_{-i}, and that a_{-i} is included in at
    least one NE in D, the agent chooses its most
    recent best response to that profile. For
    example, if B's samples show that A keeps
    playing A0 and B's most recent best response is
    B0, then B will stick to this action.
  • New biased rules
  • If an agent i has multiple best-response actions
    w.r.t. its k samples, it chooses one included
    in an optimal NE in the VG. If there exist several
    such choices, it chooses the one that has been
    played most recently.
  • Difference between the old and the new rules
  • The old rules bias the action selection only when
    the others' joint strategy is included in an
    optimal NE; otherwise the agent just randomizes
    its choice of best-response actions.
  • The new rules always bias the agent's
    action selection (see the sketch below).
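A sketch of the new tie-breaking rule in isolation, with assumed inputs: the agent's current best responses, the set of its own actions that appear in some optimal NE of the VG, and its own play history.

import random

def new_bias_choice(best_responses, actions_in_optimal_ne, my_recent_plays):
    """Among the best responses, prefer actions that occur in an optimal NE of the
    VG; among those, prefer the one the agent itself played most recently."""
    preferred = [a for a in best_responses if a in actions_in_optimal_ne]
    candidates = preferred or list(best_responses)
    for a in reversed(my_recent_plays):   # most recent first
        if a in candidates:
            return a
    return random.choice(candidates)      # nothing played yet: break ties randomly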

37
Example
  • The new rules preserve the convergence properties
    in n-agent team Markov games.

38
Extension Beyond Team Games
  • How to extend the ideas of PMPS play to general
    coordination games?
  • To simplify the setting, we now consider a
    category of coordination stage games with the
    following properties:
  • These games have at least one pure-strategy NE.
  • Agents have compatible preferences for some of
    these NE over any other steady states (such as
    mixed-strategy NE or best-response loops).
  • Let's consider two situations: perfect monitoring
    and imperfect monitoring.
  • Perfect monitoring: agents can observe the others'
    actions and payoffs.
  • Imperfect monitoring: agents only observe the
    others' actions.
  • The agents may have no information about the game
    structure.

39
Perfect Monitoring
  • Follow the same idea as OAL.
  • Algorithm
  • Learning of coordination
  • Compute all the NE of the estimated game.
  • Find all the NE that are ε-dominated. For
    example, a strategy profile (a, b) is ε-dominated
    by (a', b') if Q_1(a, b) < Q_1(a', b') - ε and
    Q_2(a, b) ≤ Q_2(a', b') + ε.
  • Construct a VG that contains all the NE not
    ε-dominated, setting the other entries of the VG
    to zero (without loss of generality, suppose that
    agents normalize their payoffs to values between
    zero and one).
  • With GLIE exploration, run BAP over the VG.
  • Learning of game structure
  • Observe the others' payoffs and update the sample
    means of the agents' expected payoffs in the game
    matrix.
  • Compute an ε-bound in the same way as in OAL.
  • Learning over the coordination stage games we
    discussed is conjectured to converge w.p.1 to an
    NE that is not Pareto dominated (see the sketch
    below).
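A sketch of the ε-domination filter for the perfect-monitoring case, generalized to n agents: an NE is discarded if some other NE is better by more than ε for at least one agent and not worse by more than ε for any agent. Computing the NE themselves is assumed to have been done already, and payoffs are assumed normalized to [0, 1] as on the slide.

def undominated_ne(ne_payoffs, eps):
    """ne_payoffs: dict mapping each pure NE (a joint action) to the tuple of the
    agents' estimated payoffs. Returns the NE that are not eps-dominated."""
    def eps_dominated(p, q):
        return (all(qi >= pi - eps for pi, qi in zip(p, q)) and
                any(qi > pi + eps for pi, qi in zip(p, q)))
    return {a for a, p in ne_payoffs.items()
            if not any(eps_dominated(p, q) for other, q in ne_payoffs.items() if other != a)}

# toy usage: the second NE eps-dominates the first
ne = {("a0", "b0"): (0.4, 0.5), ("a1", "b1"): (0.9, 0.9)}
print(undominated_ne(ne, eps=0.1))  # {('a1', 'b1')}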

40
Imperfect Monitoring
  • In general, it is difficult to eliminate
    sub-optimal NE without knowing the others'
    payoffs. Let's consider the simplest case: two
    learning agents with at least one common interest
    (a strategy profile that maximizes both agents'
    payoffs).
  • For this game, agents can learn to play an
    optimal NE with a modified version of OAL (with
    new biased rules).
  • Biased rules: (1) Each agent randomizes its
    action selection whenever the payoff of its
    best-response actions is zero over the virtual
    game. (2) Each agent biases its action to its most
    recent best response if all its k samples contain
    the same individual action of the other agent,
    more than m - k recorded joint actions have this
    property, and the agent has multiple best
    responses that give it payoff 1 w.r.t. its k
    samples. Otherwise, it randomly chooses a
    best-response action.
  • In this type of coordination stage game, the
    learning process is conjectured to converge to an
    optimal NE. The result can be extended to Markov
    games.

41
Example
            A0      A1      A2
    B0    1, 0    0, 0    0, 1
    B1    0, 0    1, 1    0, 0
    B2    0, 1    0, 0    1, 0
42
Conclusions and Future Work
  • In this research, we study RL techniques that let
    agents play an optimal NE (one not Pareto
    dominated by any other NE) in coordination games
    when the environment model is unknown beforehand.
  • We start our research with team games and propose
    the OAL algorithm, the first algorithm that
    guarantees convergence to an optimal NE in any
    team Markov game.
  • We further generalize the basic ideas in OAL and
    propose a new approach for learning in games,
    called partially myopic and partially strategic
    play.
  • We extend PMPS play beyond self-play and team
    games. Some of the results can be extended to
    Markov games.
  • In future research, we will further explore the
    application of PMPS play in coordination games.
    In particular, we will study how to eliminate
    sub-optimal NE in imperfect-monitoring
    environments.