Title: Analyzing Value Iteration and Policy Iteration Algorithms on MDPs
1. Analyzing Value Iteration and Policy Iteration Algorithms on MDPs
2. Dynamic Decision Making
[Figure: a decision maker sends actions to a system and receives rewards and observations in return.]
3. Markov Decision Processes (MDPs)
- Rich theory and applications
- Contributions from several fields
- My work: computational aspects, using computational complexity theory
4. MDPs in Context
[Diagram: a containment hierarchy, from shortest paths up through deterministic MDPs, MDPs, and linear programs, with MDP games and partially observable MDPs beyond.]
5. MDP Contributions
- As a framework:
  - probabilistic planning
  - reinforcement learning
  - combinatorial problems
- As applications:
  - robot navigation
  - maintenance (highways, engines)
  - system management (fisheries, etc.)
  - system performance analysis
6. Value and Policy Iteration Algorithms for MDPs
- These algorithms can be described as:
  - simple
  - iterative
  - dynamic programming
- Very fast in practice, widely used
- Many varieties
7. Value and Policy Iteration Algorithms for MDPs
- Local search (some)
- Distributed (some)
- Special case of the simplex algorithm for linear programming (some)
- Newton's method for finding the zero of a function (some)
- A challenge to analyze!
8. Talk Overview
- The DMDP problem and value iteration
- Analysis (shows polynomial convergence)
- Policy iteration
- MDP games
- Summary
9. Deterministic MDPs
- The deterministic MDP problem: given a weighted directed graph, find a cycle of highest average weight (average reward)
- There is an eigenvalue interpretation
- Optimal policy: every vertex (state) has a path to the best reachable cycle
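As a direct (if exponential) reading of the problem statement, the sketch below enumerates every simple cycle of a tiny digraph and returns the best average reward. The graph is hypothetical; it is chosen only so that, as in the slides' running example, the highest average reward is 7.

```python
from itertools import permutations

def max_mean_cycle(vertices, edges):
    """Brute-force maximum average-reward cycle; fine for tiny graphs only.

    `edges` maps a pair (u, v) to the reward of that directed edge.
    """
    best = None
    for k in range(1, len(vertices) + 1):
        # Every simple cycle appears as some ordering of a vertex subset.
        for cyc in permutations(vertices, k):
            pairs = [(cyc[i], cyc[(i + 1) % k]) for i in range(k)]
            if all(p in edges for p in pairs):
                mean = sum(edges[p] for p in pairs) / k
                best = mean if best is None else max(best, mean)
    return best

# Hypothetical instance: the cycle u -> v -> u has mean (4 + 10) / 2 = 7.
E = {('u', 'v'): 4, ('v', 'u'): 10, ('v', 'w'): 6,
     ('w', 'u'): 0, ('u', 'w'): 6}
print(max_mean_cycle(['u', 'v', 'w'], E))  # -> 7.0
```

Enumeration takes time exponential in the number of vertices; the polynomial algorithms discussed later (value iteration, Karp-style dynamic programming) are the practical route.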
10. The DMDP
[Figure: a three-vertex graph on u, v, w with edge rewards 6, 4, 0, 8, 10, and 6.]
- Highest average reward is 7.
11. Optimal Policy
[Figure: the same graph with each vertex's chosen out-edge highlighted, directing every vertex toward the best cycle.]
12. Value Iteration
- 1. Assign a value to each vertex
- 2. Repeat: each vertex picks the maximum-valued out-edge and the corresponding value, where the value of an edge e = (u, v) is the reward of e plus the value of v
[Figure: an edge (u, v) illustrating the update.]
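The two-step loop above can be sketched directly. The graph below is a hypothetical three-vertex instance whose best cycle has average reward 7, echoing the running example; the slides give only the update rule, not this code.

```python
def value_iteration(edges, values, steps):
    """Value iteration for a deterministic MDP.

    `edges` maps each vertex u to a list of (v, reward) out-edges.  Each
    step, every vertex picks its maximum-valued out-edge, where the value
    of edge (u, v) is its reward plus the current value of v.
    """
    for _ in range(steps):
        values = {u: max(r + values[v] for v, r in outs)
                  for u, outs in edges.items()}
    return values

# Hypothetical three-vertex instance with optimal average reward 7.
E = {'u': [('v', 4), ('w', 6)],
     'v': [('u', 10), ('w', 6)],
     'w': [('u', 0)]}
print(value_iteration(E, {'u': 0, 'v': 0, 'w': 0}, 4))
```

After enough steps, each value grows by about 7 (the optimal average reward) per iteration, which is how the optimal cycle eventually reveals itself.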
13. Value Iteration Example
- Initialize with some values
[Figure: the example graph with an initial value attached to each vertex.]
14-18. Value Iteration Example (continued)
[Figures: five further iterations on the example graph; the vertex values keep growing, increasing by about 7 (the optimal average reward) per iteration.]
19. Bounding the Time to Convergence
- Known: value iteration converges to an optimal policy
- Previous bound: pseudo-polynomial
- This work: O(n²) iterations to converge to an optimal cycle
- The proof utilizes:
  - the history of edge choices
  - a mapping to a zero-average-reward graph
20. Motivation
- Motivated by analyzing policy iteration
- Value iteration is simple and frequently used to solve MDPs
- Pseudo-polynomial on most MDPs
- Efficient on simpler deterministic problems such as shortest-path problems
21. Mapping to Mean-Zero
- Subtracting a constant from all edge rewards changes neither the optimal cycle nor the behavior of value iteration
- Subtract the maximum average reward from all edge rewards
- Analyze the resulting mean-zero problem (i.e., the optimal cycle has average reward zero)
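The shift-invariance claim is easy to check numerically. The sketch below uses a hypothetical graph whose maximum average reward is 7, runs value iteration on the original and on the shifted rewards, and confirms that at step t every value differs by exactly 7t, so the maximizing edge choices coincide.

```python
def vi_trace(edges, values, steps):
    """Run value iteration, recording the value dict after every step."""
    trace = []
    for _ in range(steps):
        values = {u: max(r + values[v] for v, r in outs)
                  for u, outs in edges.items()}
        trace.append(values)
    return trace

# Hypothetical graph whose maximum average reward is 7 (cycle u -> v -> u).
E = {'u': [('v', 4), ('w', 6)], 'v': [('u', 10), ('w', 6)], 'w': [('u', 0)]}
shift = 7
E0 = {u: [(v, r - shift) for v, r in outs] for u, outs in E.items()}

orig = vi_trace(E, {'u': 0, 'v': 0, 'w': 0}, 5)
zero = vi_trace(E0, {'u': 0, 'v': 0, 'w': 0}, 5)
# At step t every value on the shifted graph is lower by exactly t * shift,
# so the same out-edge attains the maximum on both graphs.
for t, (a, b) in enumerate(zip(orig, zero), start=1):
    assert all(a[x] - b[x] == t * shift for x in a)
```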
22. Mapping to Mean-Zero
[Figure: the example graph with 7 subtracted from each edge reward: 6 becomes -1, 4 becomes -3, 0 becomes -7, 8 becomes 1, 10 becomes 3, 6 becomes -1.]
23-30. Value Iteration Behaves Identically
[Figures: eight iterations shown side by side on the original and mean-zero graphs; both make the same edge choices, and at iteration t each original value exceeds its mean-zero counterpart by exactly 7t.]
31. Another View: Histories of Edge Choices
[Table: the values of u, v, and w at times t = 0 through 5, together with the edges chosen at each step.]
32. Histories of Edge Choices
[Table: the same values.] The history of vertex u at time 3 is the walk u-w-u-w; at time 4 it is u-w-u-v-w.
33. Sequence of Values of a Vertex
[Table: each vertex's values over time, e.g. vertex u takes the sequence <2, 7, 6, 7, 6, 7, ...> and vertex w the sequence <6, 2, 6, 5, 6, 5, ...>.]
34. Mean-Zero Properties
- Removing cycles from histories does not reduce value (as the total reward of any cycle is non-positive)
- Example: the value of u at time 4 is 6, and the history of u is u-w-u-v-w
- Removing the u-w-u cycle leaves u-v-w
- Removing the w-u-v-w cycle instead leaves u-w, where the value of u at time 1 is 7
35. Removing Cycles Doesn't Hurt
[Figure: a history walk through u, v, and w containing a cycle c; removing the cycle leaves a shorter walk whose total reward is no smaller.]
36. Mean-Zero Properties
- All vertices obtain their highest values in the first n iterations
- The 2nd-highest value is also well defined, and is obtained in the first 2n iterations
- The kth-highest value is obtained in at most kn iterations
37. Mean-Zero Properties
- Once all the highest values arrive at the optimal cycle, the optimal cycle is formed permanently
- Value iteration converges in O(n²) iterations
38. Lower Bound (easy)
- It takes ... iterations until the optimal cycle is fixed in all policies seen
- Edge rewards are zero, unless indicated
- Initial vertex values are zero
[Figure: a construction with edge rewards -20, 10, and -1; vertex u feeds into the optimal cycle.]
39. Lower Bound (harder)
- It takes ... iterations until the optimal cycle is formed for the first time
- Vertices begin with ..., except s, which begins with 0
[Figure: the optimal cycle; edge rewards include -1, -2, -3, -9, 0, and -1.]
- Convergence to an optimal policy is pseudo-polynomial!
40. From O(n²) to O(n) (Using the History to Speed Up)
- The history walk of a vertex on the optimal cycle can only contain the optimal cycle whenever the vertex receives its highest value
- Keep track of edge choices for the first n iterations, and compute the average reward of the cycles formed in the history walks
- Takes O(n) iterations, but is not space efficient (... space)
41. Keep a Summary of the History to Save Space
- After the first n iterations, each vertex keeps track of super edges as well
- A super edge has 3 parameters: actual length, end vertex, and total reward along the corresponding history
- Cyclic super edge: the start and end vertex are equal
- At least one cyclic super edge corresponding to an optimal cycle is created in the second n iterations
- Only 2n iterations are needed, with linear space
42. From ... to ...
- The worst-case constructions are very contrived
- They suggest that with random initial values, and possibly with random rewards and graph structure, convergence is much faster
- Results on random sparse graphs, where rewards are chosen in [0, 1]
- The expected length of optimal cycles grows sub-linearly, possibly ...
43. Experiments
- The number of iterations required to converge to the optimal cycle is sub-linear, possibly O(log n) as well
- Both value iteration algorithms are very efficient
- Another study finds a policy iteration algorithm fastest among several algorithms tested on various graph types (Dasdan et al.)
44. Summary
- The behavior of value iteration is identical on any graph and on a version of it with the edge rewards shifted
- In the mean-zero case, the highest values exist and arrive at the vertices of the optimal cycle in ... time
- Once they do, no vertex in the optimal cycle switches away
- Value iteration converges to an optimal cycle on any graph in O(n²) time
- Extensions reduce this to O(n)
45. Policy Iteration
- After each value iteration step, add a self-arc to every vertex, with reward equal to the average reward of the best cycle found in the current policy
- Only better cycles are formed, hence it's a policy iteration algorithm
- The convergence bound for value iteration applies
- The lower bound doesn't apply...
46. Limiting Value Iteration
- After the discovery of each new cycle, use it, but also restart value iteration with the old initial values
- Call each restart a phase
- Result: the level of activity (edge changes) goes down from one phase to the next
47-49. Limiting Value Iteration Example
[Figures: value traces of u, v, and w across phases 1-4; with each phase fewer vertices change their edge choices, until in phase 4 the values stay constant (u = 7, v = 3, w = 6).]
50. Summary
- Policy iteration:
  - From one phase to the next, vertices have identical behavior, except
  - at least one vertex becomes less active in some iteration
  - In at most ... phases, in some iteration no vertex remains active, and we are done
  - Starting each phase with a fixed value for the vertices helps the analysis
- Open problems:
  - Tighten the bound
  - Extend to other policy iteration algorithms (relax the requirement that the start values be identical)
51. Other Algorithms
- m edges (actions), and W the largest number
- Previous algorithms: Karp O(nm); Orlin and Ahuja; Young, Tarjan, and Orlin O(mn log n)
- Value iteration ...; value iteration using histories ...
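For reference, Karp's dynamic program (cited above for the minimum mean cycle) adapts directly to the maximum mean cycle in O(nm) time. A sketch, on a hypothetical edge list, initializing the length-0 walks to zero everywhere (which plays the role of a zero-reward super-source):

```python
def karp_max_mean(vertices, edges):
    """Maximum mean cycle via Karp's dynamic program, O(n * m).

    `edges` is a list of (u, v, reward) triples.  d[k][v] is the maximum
    total reward of a walk with exactly k edges ending at v.  Karp's
    theorem (max form): the answer is
    max over v of min over k of (d[n][v] - d[k][v]) / (n - k).
    """
    n = len(vertices)
    NEG = float('-inf')
    d = [{v: (0 if k == 0 else NEG) for v in vertices} for k in range(n + 1)]
    for k in range(1, n + 1):
        for u, v, r in edges:
            if d[k - 1][u] > NEG:
                d[k][v] = max(d[k][v], d[k - 1][u] + r)
    best = NEG
    for v in vertices:
        if d[n][v] == NEG:
            continue
        best = max(best, min((d[n][v] - d[k][v]) / (n - k)
                             for k in range(n) if d[k][v] > NEG))
    return best

# Hypothetical edge list; the best cycle u -> v -> u has mean 7.
E = [('u', 'v', 4), ('v', 'u', 10), ('u', 'w', 6),
     ('v', 'w', 6), ('w', 'u', 0)]
print(karp_max_mean(['u', 'v', 'w'], E))  # -> 7.0
```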
52. MDPs in Context
[Diagram: the same containment hierarchy as before, from shortest paths through deterministic MDPs, MDPs, and linear programs to MDP games and partially observable MDPs.]
53. Mean-Payoff Games (Two-Player DMDPs)
- Two players, Min and Max, each having control over a portion of the vertices
- Min wants to reach a cycle with minimum average reward
- Optimal policies exist in the minimax sense
[Figure: a small game graph with edge rewards 2, 3, -5, 1, and 0, with its minimax cycle highlighted.]
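The minimax update underlying these games is a small change to the one-player rule: each vertex's owner chooses. A hypothetical sketch of a single step (as the next slides note, naively iterating this with both players optimizing is not by itself a correct algorithm):

```python
def game_vi_step(edges, owner, values):
    """One minimax value-iteration step for a mean-payoff game.

    Vertices owned by Max take the best-valued out-edge; vertices owned
    by Min take the worst.  Sketch of the update rule only.
    """
    new = {}
    for u, outs in edges.items():
        choices = [r + values[v] for v, r in outs]
        new[u] = max(choices) if owner[u] == 'Max' else min(choices)
    return new

# Hypothetical two-vertex game: 'a' belongs to Max, 'b' to Min.
E = {'a': [('b', 2), ('a', 0)], 'b': [('a', -5), ('b', 3)]}
who = {'a': 'Max', 'b': 'Min'}
print(game_vi_step(E, who, {'a': 0, 'b': 0}))  # {'a': 2, 'b': -5}
```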
54. Algorithms
- Optimize-optimize is not correct
[Figure: a counterexample game graph with edge rewards 2, 3, -5, 1, and 0.]
55. Correcting the Algorithm
- One player has to play conservatively while the other optimizes
- The conservative player performs a value iteration computation with proper state values
- The algorithm where both play conservatively is incorrect too!
56. MDP Games
- Two players, zero-sum
- Varieties:
  - deterministic, stochastic
  - alternating, simultaneous
  - other: partial observability, different objectives (e.g. average reward, total reward)
- The stochastic alternating game has a complexity motivation
- Other applications arise in multi-agent settings, in reinforcement learning, and in worst-case analysis of online problems
57. Algorithms
- Anne Condon considers a number of algorithms
- She shows several are incorrect
- Value iteration (given by Shapley, 1953) is correct, but pseudo-polynomial
- A policy iteration algorithm (conservative-optimize, due to Hoffman and Karp) is correct, but no bound better than exponential is known
- No linear programming formulation is known; a quadratic programming formulation is given by Condon
58. [Diagram: the hierarchy extended with games: shortest paths, deterministic MDPs, deterministic games, MDPs, alternating stochastic games (MDP(2)), simultaneous games, linear programs, and partially observable MDPs.]
59. Summary
- Many algorithms are in the value/policy iteration family
- I show several are polynomial
- Challenging, but rewarding, to analyze
- In the process, new algorithms with various attributes (and proof techniques) are discovered
- Many open problems remain (e.g., extending and tightening the analyses), plus empirical studies
60. Work at UA
- Interested in machine learning
- How can we learn complex concepts?
- How can we build systems with many learning components?
- Lots of other interesting questions, from the basics of learning (theory) to applications