Transcript and Presenter's Notes

Title: Analyzing Value Iteration and Policy Iteration Algorithms on MDPs


1
Analyzing Value Iteration and Policy Iteration
Algorithms on MDPs
  • Omid Madani

2
Dynamic Decision Making
[Diagram: a decision maker chooses actions and receives rewards and observations from the system]
3
Markov Decision Processes (MDPs)
  • Rich theory and applications
  • Contributions from several fields
  • My work: computational aspects, studied using
    computational complexity theory

4
MDPs in Context
[Diagram: a hierarchy of related problems: Partially Observable MDPs, Linear Programs, MDP games, MDPs, deterministic MDPs, shortest paths]
5
MDP Contributions
  • As a framework
  • probabilistic planning
  • reinforcement learning
  • combinatorial problems
  • As an application
  • robot navigation
  • maintenance (highway, engine)
  • system management (fisheries, etc.)
  • system performance analysis
  • ...

6
Value and Policy Iteration Algorithms for MDPs
  • These algorithms can be described as
  • Simple
  • Iterative
  • Dynamic programming
  • Very fast in practice, widely used
  • Many varieties

7
Value and Policy Iteration Algorithms for MDPs
  • Local search (some)
  • Distributed (some)
  • Special case of the simplex algorithm for linear
    programming (some)
  • Newton's method for finding the zero of a function
    (some)
  • A challenge to analyze!

8
Talk Overview
  • The DMDP problem and value iteration
  • Analysis (shows polynomial convergence)
  • Policy Iteration
  • MDP games
  • Summary

9
Deterministic MDP
  • Deterministic MDP problem: given a weighted
    directed graph, find a cycle of highest average
    weight (average reward); a brute-force check is
    sketched below
  • There is an eigenvalue interpretation
  • Optimal policy: every vertex (state) has a path to
    the best reachable cycle

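To make the problem statement concrete, here is a minimal brute-force sketch (exponential, purely illustrative, and not the talk's algorithm); the graph representation and the example rewards below are assumptions, not the exact graph from the slides.

```python
from itertools import permutations

def max_mean_cycle_bruteforce(vertices, rewards):
    """Enumerate every simple cycle of a tiny graph and return the highest
    average edge reward.  `rewards` maps (u, v) -> edge reward."""
    best = float("-inf")
    for k in range(1, len(vertices) + 1):            # candidate cycle length
        for order in permutations(vertices, k):      # candidate vertex sequence
            edges = list(zip(order, order[1:] + (order[0],)))
            if all(e in rewards for e in edges):     # only keep real cycles
                best = max(best, sum(rewards[e] for e in edges) / k)
    return best

# Hypothetical 3-vertex graph (in the spirit of the talk's example, not identical to it):
R = {("u", "w"): 10, ("w", "u"): 4, ("u", "v"): 6, ("v", "w"): 0, ("w", "w"): 6}
print(max_mean_cycle_bruteforce(["u", "v", "w"], R))   # 7.0, from the u-w cycle
```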
10
The DMDP
[Figure: a weighted directed graph on vertices u, v, w with edge rewards 6, 4, 0, 8, 10, 6]
Highest average reward is 7.
11
Optimal Policy
[Figure: the same graph with the optimal policy's edge choices highlighted; every vertex follows a path into the optimal cycle]
12
Value Iteration
  • 1. Assign an initial value to each vertex
  • 2. Repeat: each vertex picks its maximum-valued
    out-edge and takes the corresponding value,
    where the value of an edge e = (u, v) is the
    reward of e plus the value of v (a minimal
    sketch follows the figure)

[Figure: an edge from u to v, illustrating the edge-value definition]
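A minimal sketch of one sweep of this update, assuming the edge-dictionary representation of the previous sketch; the function name and the synchronous update order are my own choices, not the talk's.

```python
def value_iteration_step(values, rewards):
    """One synchronous sweep: every vertex picks its maximum-valued out-edge,
    where the value of edge (u, v) is reward(u, v) plus the value of v."""
    new_values, policy = {}, {}
    for (u, v), r in rewards.items():
        cand = r + values[v]
        if u not in new_values or cand > new_values[u]:
            new_values[u], policy[u] = cand, v       # remember the chosen out-edge
    return new_values, policy
```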
13
Value Iteration Example
  • Initialize with some values

[Figure: slides 13-18 step through successive value-iteration sweeps on the example graph; vertex values grow each iteration while the edge choices settle on the optimal cycle (a code trace follows below)]
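The numbers of this worked example are largely lost to figure extraction; as an illustration only, the sweep above can be iterated on the hypothetical graph R from the first sketch. The values keep growing, but the edge choices settle on the optimal cycle, and the long-run growth per step equals the optimal average reward (7 for that graph).

```python
values = {"u": 2.0, "v": 3.0, "w": 0.0}        # arbitrary initial values
for t in range(6):
    new_values, policy = value_iteration_step(values, R)
    print(t, policy, new_values)               # edge choices settle on the optimal cycle
    values = new_values
```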
19
Bounding the Time to Convergence
  • Known: value iteration converges to an optimal
    policy
  • Previous bound: pseudo-polynomial
  • This work: O(n²) iterations to converge to
    an optimal cycle
  • The proof uses
  • the history of edge choices
  • a mapping to a zero-average-reward graph

20
Motivation
  • Motivated by the analysis of policy iteration
  • Value iteration is simple and used frequently to
    solve MDPs
  • Pseudo-polynomial on most MDPs
  • Efficient on simpler deterministic problems such
    as shortest-path problems

21
Mapping to Mean-Zero
  • Subtracting a constant from all edge rewards changes
    neither the optimal cycle nor the behavior of
    value iteration
  • Subtract the maximum average reward from all edge
    rewards (sketched below)
  • Analyze the resulting mean-zero problem (i.e., the
    optimal cycle has average reward zero)

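The shift itself is one line, shown here under the same edge-dictionary convention as before; the value 7 is the example graph's maximum average reward from the earlier slide.

```python
mu_star = 7.0                                      # maximum average reward of the example graph
shifted = {e: r - mu_star for e, r in R.items()}   # mean-zero instance: optimal cycle averages 0
```

Since every vertex's update loses the same constant per sweep, the edge chosen at each step is unchanged, which is what the next slides illustrate.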
22
Mapping to Mean-Zero
[Figure: the example graph with 7 subtracted from every edge reward: 6 → -1, 4 → -3, 0 → -7, 8 → 1, 10 → 3, 6 → -1]
23-30
Value Iteration Behaves Identically
[Figure: slides 23-30 repeat the value-iteration sweeps on the shifted graph; at every iteration each vertex picks the same out-edge as in the original graph, and the shifted values are the original values reduced by 7 per iteration elapsed]
31
Another View Histories of Edge Choices
t 5 4 3 2
1 0
2
7
6
7
6
7
.
u
3
1
3
3
0
3
2
3
-7
.
v
-3
-1
6
2
6
5
6
5
-1
.
w
32
Histories of Edge Choices
t 5 4 3 2
1 0
2
7
6
7
6
7
.
u
3
3
0
3
2
3
.
v
6
2
6
5
6
5
.
w
u
w
u
w
History of vertex u at time 3 time 4
u
w
u
v
w
33
Sequence of Values of a Vertex
[Table: the value table again, with the rows of u and w highlighted]
  • The sequence of values of u: ⟨2, 7, 6, 7, 6, 7, ...⟩
  • The sequence of values of w: ⟨6, 2, 6, 5, 6, 5, ...⟩
34
Mean-Zero Properties
  • Removing cycles from histories does not reduce
    value (the total reward of any cycle is non-positive);
    see the sketch after the figure
  • Example: the value of u at time 4 is 6, and the
    history of u is u→w→u→v→w

[Figure: removing the u-w-u cycle leaves u→v→w; removing the w-u-v-w cycle instead leaves u→w, and the value of u at time 1 is 7]
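A small sketch of the cycle-removal step on a history walk, assuming the walk is stored as a plain list of vertices; the function name is illustrative.

```python
def remove_first_cycle(walk):
    """Cut the first repeated-vertex cycle out of a history walk, e.g.
    ['u', 'w', 'u', 'v', 'w'] -> ['u', 'v', 'w'] (the u-w-u cycle is removed)."""
    first_seen = {}
    for i, x in enumerate(walk):
        if x in first_seen:                    # walk[first_seen[x]:i] is a cycle
            return walk[:first_seen[x]] + walk[i:]
        first_seen[x] = i
    return walk                                # no cycle to remove
```

In the mean-zero instance the removed cycle has total reward at most 0, so the shortened history is worth at least as much, which is the property stated above.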
35
Removing Cycles Doesn't Hurt
[Figure: a history walk through u, v, w containing a cycle c; removing the cycle leaves a shorter walk with at least the same value]
36
Mean-Zero Properties
  • All vertices obtain their highest values within the
    first n iterations
  • The 2nd-highest value is also well defined, and
    is obtained within the first 2n iterations
  • The kth-highest value is obtained within the first
    k·n iterations

37
Mean-Zero Properties
  • Once all the high values arrive at the optimal
    cycle, the optimal cycle is formed permanently

Value iteration converges in O(n²) iterations.
38
Lower Bound (easy)
  • A lower bound on the number of iterations until the
    optimal cycle is fixed in all policies seen
  • Edge rewards are zero unless indicated
  • Initial vertex values are zero
[Figure: a graph with edge rewards -20, 10, and -1 (all other edges zero), with the optimal cycle through u marked]
39
Lower Bound (harder)
  • A larger lower bound on the number of iterations until
    the optimal cycle is formed for the first time
  • All vertices begin with the same initial value,
    except s, which begins with 0

[Figure: a graph with edge rewards -1, -2, -3, -9, 0, -1, with the optimal cycle marked]
  • Convergence to the optimal policy is
    pseudo-polynomial!

40
From O(n²) to O(n)
(Using the History to Speed Up)
  • Whenever a vertex on the optimal cycle receives its
    highest value, its history walk can contain only the
    optimal cycle
  • Keep track of the edge choices for the first n
    iterations, and compute the average reward of the
    cycles formed in the history walks (sketched below)
  • Takes O(n) iterations, but is not space efficient
    (O(n²) space)

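A sketch of that bookkeeping, assuming the per-sweep `policy` dictionaries from the earlier value-iteration sketch are recorded; following a vertex's history walk through them and reporting the first cycle it closes is then a short loop. The data structures here are illustrative, not the talk's.

```python
def cycle_in_history(start, recorded_policies, rewards):
    """Follow the history walk of `start` through the recorded edge choices and,
    if the walk revisits a vertex, return the average reward of that cycle."""
    walk, prefix = [start], [0.0]          # prefix[i] = total reward of the first i edges
    first_seen = {start: 0}
    x = start
    for pol in recorded_policies:          # one recorded edge-choice map per iteration
        y = pol[x]
        prefix.append(prefix[-1] + rewards[(x, y)])
        walk.append(y)
        if y in first_seen:                # the walk just closed a cycle
            i = first_seen[y]
            return (prefix[-1] - prefix[i]) / (len(walk) - 1 - i)
        first_seen[y] = len(walk) - 1
        x = y
    return None                            # no cycle within the recorded horizon
```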
41
Keep Summary of History to Save Space
  • After the first n iterations, each vertex keeps track
    of super edges as well (a sketch follows this list)
  • A super edge has 3 parameters: actual length, end
    vertex, and total reward along the corresponding
    history
  • Cyclic super edge: its start and end vertex are equal
  • At least one cyclic super edge corresponding to
    an optimal cycle is created in the second n
    iterations
  • Needs only 2n iterations, with linear space

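One possible representation of such a summary, sketched with illustrative names (the slide only specifies the three parameters):

```python
from dataclasses import dataclass

@dataclass
class SuperEdge:
    """Summary of a stretch of a history walk."""
    length: int          # number of original edges the super edge stands for
    end_vertex: str      # where the summarized stretch ends
    total_reward: float  # total reward accumulated along that stretch

    def is_cyclic(self, start_vertex: str) -> bool:
        return self.end_vertex == start_vertex   # cyclic: start and end coincide

    def average_reward(self) -> float:
        return self.total_reward / self.length
```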
42
From O(n) to Sub-linear?
  • The worst-case constructions are very contrived
  • They suggest that with random initial values, and
    possibly with random rewards and graph structure,
    convergence is much faster
  • Results on random sparse graphs, where rewards
    are chosen in [0,1]
  • The expected length of optimal cycles grows
    sub-linearly, possibly O(log n)

43
Experiments
  • The number of iterations required to converge to the
    optimal cycle is sub-linear, possibly O(log n) as
    well
  • Both value iteration algorithms are very
    efficient
  • Another study finds a policy iteration algorithm
    fastest among several algorithms tested on
    various graph types (Dasdan et al.)

44
Summary
  • The behavior of value iteration is identical on any
    graph and on the version with all edge rewards shifted
  • In the mean-zero case, highest values exist and arrive
    at the vertices of the optimal cycle in O(n)
    time
  • Once they do, no vertex in the optimal cycle
    switches away
  • Value iteration converges to an optimal cycle on
    any graph in O(n²) time
  • Extensions reduce this to O(n)

45
Policy Iteration
  • After each value-iteration sweep, add a self-arc to
    every vertex with reward equal to the average reward
    of the best cycle found in the current policy
    (sketched below)
  • Only better cycles are formed, hence it's a
    policy iteration algorithm
  • The convergence bound on value iteration applies
  • The lower bound doesn't apply...

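A sketch of that modification, reusing the earlier sweep; how the talk identifies the best cycle in the current policy is not spelled out here, so the helper below simply follows each vertex's chosen edge until a cycle closes. All names are illustrative.

```python
def best_cycle_in_policy(policy, rewards):
    """Best average reward among the cycles reached by following `policy`."""
    best = float("-inf")
    for start in policy:
        walk, seen = [start], {start: 0}
        x = start
        while policy[x] not in seen:           # walk until we are about to revisit a vertex
            x = policy[x]
            seen[x] = len(walk)
            walk.append(x)
        cyc = walk[seen[policy[x]]:] + [policy[x]]
        mean = sum(rewards[(a, b)] for a, b in zip(cyc, cyc[1:])) / (len(cyc) - 1)
        best = max(best, mean)
    return best

def policy_iteration_sweep(values, rewards):
    """One sweep, then add to every vertex a self-arc whose reward is the
    best average cycle reward in the current policy."""
    values, policy = value_iteration_step(values, rewards)
    g = best_cycle_in_policy(policy, rewards)
    augmented = dict(rewards)
    for u in policy:
        augmented[(u, u)] = max(augmented.get((u, u), float("-inf")), g)
    return values, policy, augmented
```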
46
Limiting Value Iteration
  • After the discovery of each new cycle, use it, but
    also restart value iteration with the old initial
    values
  • Call each restart a phase
  • Result: the level of activity (edge changes)
    goes down from one phase to the next

47-49
[Figure: slides 47-49 trace the value sequences of u, v, and w through phases 1-4 of the limiting value iteration; fewer edge choices change in each successive phase, and in phase 4 no vertex changes at all (u stays at 7, v at 3, w at 6)]
50
Summary
  • Policy iteration
  • From one phase to the next, vertices have
    identical behavior, except that
  • at least one vertex becomes less active in some
    iteration
  • After a bounded number of phases, in some iteration
    no vertex remains active, and we are done
  • Starting each phase with a fixed value for all
    vertices helps the analysis
  • Open problems
  • Tighten the bound
  • Extend to other policy iteration algorithms
    (relax the requirement that the start values be
    identical)

51
Other Algorithms
  • m edges (actions), and W the largest reward
    magnitude
  • Previous algorithms: Karp, Orlin and Ahuja, and
    Tarjan, Young and Orlin: O(mn log n)
  • Value iteration, and value iteration using
    histories (a textbook sketch of Karp's dynamic
    program follows)

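For comparison with the classical algorithms cited above, here is a textbook-style sketch of Karp's dynamic program, written for the maximum-average-reward convention used in the talk and assuming a strongly connected graph; it reuses the hypothetical graph R from the first sketch and is not the talk's value-iteration approach.

```python
def max_mean_cycle_karp(vertices, rewards):
    """Karp-style dynamic program.  D[k][v] is the best total reward of a walk of
    exactly k edges from a fixed source to v (-inf if none); the maximum mean
    cycle value is  max_v min_{k<n} (D[n][v] - D[k][v]) / (n - k)."""
    n, src, NEG = len(vertices), vertices[0], float("-inf")
    D = [{v: NEG for v in vertices} for _ in range(n + 1)]
    D[0][src] = 0.0
    for k in range(1, n + 1):                          # walks of exactly k edges
        for (u, v), r in rewards.items():
            if D[k - 1][u] > NEG:
                D[k][v] = max(D[k][v], D[k - 1][u] + r)
    best = NEG
    for v in vertices:
        if D[n][v] > NEG:
            ratios = [(D[n][v] - D[k][v]) / (n - k) for k in range(n) if D[k][v] > NEG]
            if ratios:
                best = max(best, min(ratios))
    return best

print(max_mean_cycle_karp(["u", "v", "w"], R))         # 7.0 on the hypothetical example
```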
52
MDPs in Context
[Diagram: the same hierarchy of problems: Partially Observable MDPs, Linear Programs, MDP games, MDPs, deterministic MDPs, shortest paths]
53
Mean-Payoff Games (Two-Player DMDPs)
  • Two players, Min and Max, each having control
    over a portion of the vertices
  • Min wants to reach a cycle with minimum average
    reward; Max wants maximum average reward
  • Optimal policies exist in the minimax sense
    (a sketch of the two-player update follows the figure)

[Figure: a small game graph with edge rewards 2, 2, 2, 3, -5, -5, -5, 3, 1, 0, 1, 0 and the minimax-optimal cycle marked]
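To make the two-player setting concrete, here is a minimal sketch of one sweep of the value-iteration update for such a game (the kind of update Shapley-style value iteration repeats): vertices owned by Max take their best out-edge, vertices owned by Min their worst. The ownership set and all names are assumptions for illustration.

```python
def game_value_iteration_step(values, rewards, max_vertices):
    """One synchronous sweep of two-player value iteration: Max vertices pick the
    maximum-valued out-edge, Min vertices pick the minimum-valued one."""
    new_values = {}
    for (u, v), r in rewards.items():
        cand = r + values[v]
        if u not in new_values:
            new_values[u] = cand
        elif u in max_vertices:
            new_values[u] = max(new_values[u], cand)
        else:
            new_values[u] = min(new_values[u], cand)
    return new_values
```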
54
Algorithms
  • Optimize-optimize (both players greedily optimize) is
    not correct

[Figure: a counterexample game graph on which the optimize-optimize scheme fails]
55
Correcting the Algorithm
  • One player has to play conservatively while the
    other optimizes
  • The conservative player performs a value-iteration
    computation with proper state values
  • The algorithm where both players play conservatively
    is incorrect too!

56
MDP Games
  • Two players, zero-sum
  • Varieties
  • Deterministic, stochastic
  • Alternating, simultaneous
  • Others: partial observability, different objectives
    (e.g. average reward, total reward)
  • The stochastic alternating game has a complexity
    motivation
  • Other applications in multi-agent settings,
    reinforcement learning, and in the worst-case
    analysis of online problems

57
Algorithms
  • Anne Condon considers a number of algorithms
  • She shows several are incorrect
  • Value iteration (given by Shapley, 1953) is correct,
    but pseudo-polynomial
  • A policy iteration algorithm (conservative-optimize,
    Hoffman and Karp) is correct, but no bound better
    than exponential is known
  • No linear programming formulation is known; a
    quadratic programming formulation is given by
    Condon

58
[Diagram: the hierarchy revisited with games: Partially Observable MDPs, simultaneous games, Linear Programs, alternating stochastic games, MDPs, deterministic games, MDP(2), deterministic MDPs, shortest paths]
59
Summary
  • Many algorithms are in the value/policy iteration
    family
  • I show several are polynomial
  • Challenging but rewarding to analyze
  • In the process, new algorithms with various
    attributes are discovered (and proof techniques)
  • Many open problems remain (e.g. extend and
    tighten the analyses), plus empirical studies

60
Work at UA
  • Interested in machine learning
  • How can we learn complex concepts?
  • How can we build systems with many learning
    components?
  • Lots of other interesting questions, from basics
    of learning (theory) to applications