Title: Analyzing Value Iteration and Policy Iteration Algorithms on MDPs
1. Analyzing Value Iteration and Policy Iteration Algorithms on MDPs
2. Dynamic Decision Making
[Figure: a decision maker sends actions to a system and receives rewards and observations in return.]
3. Markov Decision Processes (MDPs)
- Rich theory and applications
- Contributions from several fields
- My work: computational aspects, using computational complexity theory
4. MDPs in Context
[Diagram: a containment hierarchy, from shortest paths up through deterministic MDPs, MDPs, and linear programs, with MDP games and partially observable MDPs beyond.]
5. MDP Contributions
- As a framework:
  - probabilistic planning
  - reinforcement learning
  - combinatorial problems
- As applications:
  - robot navigation
  - maintenance (highways, engines)
  - system management (fisheries, etc.)
  - system performance analysis
6. Value and Policy Iteration Algorithms for MDPs
- These algorithms can be described as:
  - simple
  - iterative
  - dynamic programming
- Very fast in practice, widely used
- Many varieties
7. Value and Policy Iteration Algorithms for MDPs
- Local search (some)
- Distributed (some)
- Special case of the simplex algorithm for linear programming (some)
- Newton's method for finding the zero of a function (some)
- A challenge to analyze!
8. Talk Overview
- The DMDP problem and value iteration
- Analysis (shows polynomial convergence)
- Policy iteration
- MDP games
- Summary
9. Deterministic MDPs
- The deterministic MDP problem: given a weighted directed graph, find a cycle of highest average weight (average reward)
- There is an eigenvalue interpretation
- Optimal policy: every vertex (state) has a path to the best reachable cycle
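As a direct (if exponential) reading of the problem statement, the sketch below enumerates every simple cycle of a tiny digraph and returns the best average reward. The graph is hypothetical; it is chosen only so that, as in the slides' running example, the highest average reward is 7.

```python
from itertools import permutations

def max_mean_cycle(vertices, edges):
    """Brute-force maximum average-reward cycle; fine for tiny graphs only.

    `edges` maps a pair (u, v) to the reward of that directed edge.
    """
    best = None
    for k in range(1, len(vertices) + 1):
        # Every simple cycle appears as some ordering of a vertex subset.
        for cyc in permutations(vertices, k):
            pairs = [(cyc[i], cyc[(i + 1) % k]) for i in range(k)]
            if all(p in edges for p in pairs):
                mean = sum(edges[p] for p in pairs) / k
                best = mean if best is None else max(best, mean)
    return best

# Hypothetical instance: the cycle u -> v -> u has mean (4 + 10) / 2 = 7.
E = {('u', 'v'): 4, ('v', 'u'): 10, ('v', 'w'): 6,
     ('w', 'u'): 0, ('u', 'w'): 6}
print(max_mean_cycle(['u', 'v', 'w'], E))  # -> 7.0
```

Enumeration takes time exponential in the number of vertices; the polynomial algorithms discussed later (value iteration, Karp-style dynamic programming) are the practical route.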
10. The DMDP
[Figure: a three-vertex graph on u, v, w with edge rewards 6, 4, 0, 8, 10, and 6.]
- Highest average reward is 7.
11. Optimal Policy
[Figure: the same graph with each vertex's chosen out-edge highlighted, directing every vertex toward the best cycle.]
12. Value Iteration
- 1. Assign a value to each vertex
- 2. Repeat: each vertex picks the maximum-valued out-edge and the corresponding value, where the value of an edge e = (u, v) is the reward of e plus the value of v
[Figure: an edge (u, v) illustrating the update.]
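The two-step loop above can be sketched directly. The graph below is a hypothetical three-vertex instance whose best cycle has average reward 7, echoing the running example; the slides give only the update rule, not this code.

```python
def value_iteration(edges, values, steps):
    """Value iteration for a deterministic MDP.

    `edges` maps each vertex u to a list of (v, reward) out-edges.  Each
    step, every vertex picks its maximum-valued out-edge, where the value
    of edge (u, v) is its reward plus the current value of v.
    """
    for _ in range(steps):
        values = {u: max(r + values[v] for v, r in outs)
                  for u, outs in edges.items()}
    return values

# Hypothetical three-vertex instance with optimal average reward 7.
E = {'u': [('v', 4), ('w', 6)],
     'v': [('u', 10), ('w', 6)],
     'w': [('u', 0)]}
print(value_iteration(E, {'u': 0, 'v': 0, 'w': 0}, 4))
```

After enough steps, each value grows by about 7 (the optimal average reward) per iteration, which is how the optimal cycle eventually reveals itself.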
13. Value Iteration Example
- Initialize with some values
[Figure: the example graph with an initial value attached to each vertex.]
14-18. Value Iteration Example (continued)
[Figures: five further iterations on the example graph; the vertex values keep growing, increasing by about 7 (the optimal average reward) per iteration.]
19. Bounding the Time to Convergence
- Known: value iteration converges to an optimal policy
- Previous bound: pseudo-polynomial
- This work: O(n²) iterations to converge to an optimal cycle
- The proof utilizes:
  - the history of edge choices
  - a mapping to a zero-average-reward graph
20. Motivation
- Motivated by analyzing policy iteration
- Value iteration is simple and frequently used to solve MDPs
- Pseudo-polynomial on most MDPs
- Efficient on simpler deterministic problems such as shortest-path problems
21. Mapping to Mean-Zero
- Subtracting a constant from all edge rewards changes neither the optimal cycle nor the behavior of value iteration
- Subtract the maximum average reward from all edge rewards
- Analyze the resulting mean-zero problem (i.e., the optimal cycle has average reward zero)
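The shift-invariance claim is easy to check numerically. The sketch below uses a hypothetical graph whose maximum average reward is 7, runs value iteration on the original and on the shifted rewards, and confirms that at step t every value differs by exactly 7t, so the maximizing edge choices coincide.

```python
def vi_trace(edges, values, steps):
    """Run value iteration, recording the value dict after every step."""
    trace = []
    for _ in range(steps):
        values = {u: max(r + values[v] for v, r in outs)
                  for u, outs in edges.items()}
        trace.append(values)
    return trace

# Hypothetical graph whose maximum average reward is 7 (cycle u -> v -> u).
E = {'u': [('v', 4), ('w', 6)], 'v': [('u', 10), ('w', 6)], 'w': [('u', 0)]}
shift = 7
E0 = {u: [(v, r - shift) for v, r in outs] for u, outs in E.items()}

orig = vi_trace(E, {'u': 0, 'v': 0, 'w': 0}, 5)
zero = vi_trace(E0, {'u': 0, 'v': 0, 'w': 0}, 5)
# At step t every value on the shifted graph is lower by exactly t * shift,
# so the same out-edge attains the maximum on both graphs.
for t, (a, b) in enumerate(zip(orig, zero), start=1):
    assert all(a[x] - b[x] == t * shift for x in a)
```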
22. Mapping to Mean-Zero
[Figure: the example graph with 7 subtracted from each edge reward: 6 becomes -1, 4 becomes -3, 0 becomes -7, 8 becomes 1, 10 becomes 3, 6 becomes -1.]
23-30. Value Iteration Behaves Identically
[Figures: eight iterations shown side by side on the original and mean-zero graphs; both make the same edge choices, and at iteration t each original value exceeds its mean-zero counterpart by exactly 7t.]
31. Another View: Histories of Edge Choices
[Table: the values of u, v, and w at times t = 0 through 5, together with the edges chosen at each step.]
32. Histories of Edge Choices
[Table: the same values.] The history of vertex u at time 3 is the walk u-w-u-w; at time 4 it is u-w-u-v-w.
33. Sequence of Values of a Vertex
[Table: each vertex's values over time, e.g. vertex u takes the sequence <2, 7, 6, 7, 6, 7, ...> and vertex w the sequence <6, 2, 6, 5, 6, 5, ...>.]
34. Mean-Zero Properties
- Removing cycles from histories does not reduce value (as the total reward of any cycle is non-positive)
- Example: the value of u at time 4 is 6, and the history of u is u-w-u-v-w
- Removing the u-w-u cycle leaves u-v-w
- Removing the w-u-v-w cycle instead leaves u-w, where the value of u at time 1 is 7
35. Removing Cycles Doesn't Hurt
[Figure: a history walk through u, v, and w containing a cycle c; removing the cycle leaves a shorter walk whose total reward is no smaller.]
36. Mean-Zero Properties
- All vertices obtain their highest values in the first n iterations
- The 2nd-highest value is also well defined, and is obtained in the first 2n iterations
- The kth-highest value is obtained in at most kn iterations
37. Mean-Zero Properties
- Once all the highest values arrive at the optimal cycle, the optimal cycle is formed permanently
- Value iteration converges in O(n²) iterations
38. Lower Bound (easy)
- It takes ... iterations until the optimal cycle is fixed in all policies seen
- Edge rewards are zero, unless indicated
- Initial vertex values are zero
[Figure: a construction with edge rewards -20, 10, and -1; vertex u feeds into the optimal cycle.]
39. Lower Bound (harder)
- It takes ... iterations until the optimal cycle is formed for the first time
- Vertices begin with ..., except s, which begins with 0
[Figure: the optimal cycle; edge rewards include -1, -2, -3, -9, 0, and -1.]
- Convergence to an optimal policy is pseudo-polynomial!
40. From O(n²) to O(n) (Using the History to Speed Up)
- The history walk of a vertex on the optimal cycle can only contain the optimal cycle whenever the vertex receives its highest value
- Keep track of edge choices for the first n iterations, and compute the average reward of the cycles formed in the history walks
- Takes O(n) iterations, but is not space efficient (... space)
41. Keep a Summary of the History to Save Space
- After the first n iterations, each vertex keeps track of super edges as well
- A super edge has 3 parameters: actual length, end vertex, and total reward along the corresponding history
- Cyclic super edge: the start and end vertex are equal
- At least one cyclic super edge corresponding to an optimal cycle is created in the second n iterations
- Only 2n iterations are needed, with linear space
42. From ... to ...
- The worst-case constructions are very contrived
- They suggest that with random initial values, and possibly with random rewards and graph structure, convergence is much faster
- Results on random sparse graphs, where rewards are chosen in [0, 1]
- The expected length of optimal cycles grows sub-linearly, possibly ...
43. Experiments
- The number of iterations required to converge to the optimal cycle is sub-linear, possibly O(log n) as well
- Both value iteration algorithms are very efficient
- Another study finds a policy iteration algorithm fastest among several algorithms tested on various graph types (Dasdan et al.)
44. Summary
- The behavior of value iteration is identical on any graph and on a version of it with the edge rewards shifted
- In the mean-zero case, the highest values exist and arrive at the vertices of the optimal cycle in ... time
- Once they do, no vertex in the optimal cycle switches away
- Value iteration converges to an optimal cycle on any graph in O(n²) time
- Extensions reduce this to O(n)
45. Policy Iteration
- After each value iteration step, add a self-arc to every vertex, with reward equal to the average reward of the best cycle found in the current policy
- Only better cycles are formed, hence it's a policy iteration algorithm
- The convergence bound for value iteration applies
- The lower bound doesn't apply...
46. Limiting Value Iteration
- After the discovery of each new cycle, use it, but also restart value iteration with the old initial values
- Call each restart a phase
- Result: the level of activity (edge changes) goes down from one phase to the next
47-49. Limiting Value Iteration Example
[Figures: value traces of u, v, and w across phases 1-4; with each phase fewer vertices change their edge choices, until in phase 4 the values stay constant (u = 7, v = 3, w = 6).]
50. Summary
- Policy iteration:
  - From one phase to the next, vertices have identical behavior, except
  - at least one vertex becomes less active in some iteration
  - In at most ... phases, in some iteration no vertex remains active, and we are done
  - Starting each phase with a fixed value for the vertices helps the analysis
- Open problems:
  - Tighten the bound
  - Extend to other policy iteration algorithms (relax the requirement that the start values be identical)
51. Other Algorithms
- m edges (actions), and W the largest number
- Previous algorithms: Karp O(nm); Orlin and Ahuja; Young, Tarjan, and Orlin O(mn log n)
- Value iteration ...; value iteration using histories ...
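For reference, Karp's dynamic program (cited above for the minimum mean cycle) adapts directly to the maximum mean cycle in O(nm) time. A sketch, on a hypothetical edge list, initializing the length-0 walks to zero everywhere (which plays the role of a zero-reward super-source):

```python
def karp_max_mean(vertices, edges):
    """Maximum mean cycle via Karp's dynamic program, O(n * m).

    `edges` is a list of (u, v, reward) triples.  d[k][v] is the maximum
    total reward of a walk with exactly k edges ending at v.  Karp's
    theorem (max form): the answer is
    max over v of min over k of (d[n][v] - d[k][v]) / (n - k).
    """
    n = len(vertices)
    NEG = float('-inf')
    d = [{v: (0 if k == 0 else NEG) for v in vertices} for k in range(n + 1)]
    for k in range(1, n + 1):
        for u, v, r in edges:
            if d[k - 1][u] > NEG:
                d[k][v] = max(d[k][v], d[k - 1][u] + r)
    best = NEG
    for v in vertices:
        if d[n][v] == NEG:
            continue
        best = max(best, min((d[n][v] - d[k][v]) / (n - k)
                             for k in range(n) if d[k][v] > NEG))
    return best

# Hypothetical edge list; the best cycle u -> v -> u has mean 7.
E = [('u', 'v', 4), ('v', 'u', 10), ('u', 'w', 6),
     ('v', 'w', 6), ('w', 'u', 0)]
print(karp_max_mean(['u', 'v', 'w'], E))  # -> 7.0
```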
52. MDPs in Context
[Diagram: the same containment hierarchy as before, from shortest paths through deterministic MDPs, MDPs, and linear programs to MDP games and partially observable MDPs.]
53. Mean-Payoff Games (Two-Player DMDPs)
- Two players, Min and Max, each having control over a portion of the vertices
- Min wants to reach a cycle with minimum average reward
- Optimal policies exist in the minimax sense
[Figure: a small game graph with edge rewards 2, 3, -5, 1, and 0, with its minimax cycle highlighted.]
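The minimax update underlying these games is a small change to the one-player rule: each vertex's owner chooses. A hypothetical sketch of a single step (as the next slides note, naively iterating this with both players optimizing is not by itself a correct algorithm):

```python
def game_vi_step(edges, owner, values):
    """One minimax value-iteration step for a mean-payoff game.

    Vertices owned by Max take the best-valued out-edge; vertices owned
    by Min take the worst.  Sketch of the update rule only.
    """
    new = {}
    for u, outs in edges.items():
        choices = [r + values[v] for v, r in outs]
        new[u] = max(choices) if owner[u] == 'Max' else min(choices)
    return new

# Hypothetical two-vertex game: 'a' belongs to Max, 'b' to Min.
E = {'a': [('b', 2), ('a', 0)], 'b': [('a', -5), ('b', 3)]}
who = {'a': 'Max', 'b': 'Min'}
print(game_vi_step(E, who, {'a': 0, 'b': 0}))  # {'a': 2, 'b': -5}
```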
54. Algorithms
- Optimize-optimize is not correct
[Figure: a counterexample game graph with edge rewards 2, 3, -5, 1, and 0.]
55. Correcting the Algorithm
- One player has to play conservatively while the other optimizes
- The conservative player performs a value iteration computation with proper state values
- The algorithm where both play conservatively is incorrect too!
56. MDP Games
- Two players, zero-sum
- Varieties:
  - deterministic, stochastic
  - alternating, simultaneous
  - other: partial observability, different objectives (e.g. average reward, total reward)
- The stochastic alternating game has a complexity motivation
- Other applications arise in multi-agent settings, in reinforcement learning, and in worst-case analysis of online problems
57. Algorithms
- Anne Condon considers a number of algorithms
- She shows several are incorrect
- Value iteration (given by Shapley, 1953) is correct, but pseudo-polynomial
- A policy iteration algorithm (conservative-optimize, due to Hoffman and Karp) is correct, but no bound better than exponential is known
- No linear programming formulation is known; a quadratic programming formulation is given by Condon
58. [Diagram: the hierarchy extended with games: shortest paths, deterministic MDPs, deterministic games, MDPs, alternating stochastic games (MDP(2)), simultaneous games, linear programs, and partially observable MDPs.]
59. Summary
- Many algorithms are in the value/policy iteration family
- I show several are polynomial
- Challenging, but rewarding, to analyze
- In the process, new algorithms with various attributes (and proof techniques) are discovered
- Many open problems remain (e.g., extending and tightening the analyses), plus empirical studies
60. Work at UA
- Interested in machine learning
- How can we learn complex concepts?
- How can we build systems with many learning components?
- Lots of other interesting questions, from the basics of learning (theory) to applications