Title: Dynamic Programming Applications
1 Dynamic Programming Applications
2 Preview
- Last time: Deterministic DP.
- Today: Intro to Stochastic DP. MDPs.
- Backward induction. Principle of optimality.
- Open-loop policies. Interchange argument.
- Ex: Scheduling
- Next time:
- Structured, monotone policies.
- Ex: Sequential assignment.
3 So who's counting?
- 5 rounds → a 5-digit number
- Digit outcomes: 0, 1, ..., 9
- After each round, place the outcome digit into one of the remaining slots
- Maximize the final number (see the sketch below)
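As a preview of backward induction, the sketch below plays this game optimally in Python. The slide does not specify the digit distribution, so a uniform draw over 0..9 is assumed, and the five slots are encoded by their place values 1, 10, ..., 10000.

```python
from functools import lru_cache
from statistics import mean

DIGITS = range(10)  # assumed: each round's digit is drawn uniformly from 0..9

@lru_cache(maxsize=None)
def value(slots):
    """Expected final number, by backward induction, when the place values in
    `slots` (a tuple drawn from 1, 10, ..., 10000) are still unfilled."""
    if not slots:
        return 0.0
    # Average over this round's digit; for each digit, put it into the remaining
    # slot that maximizes digit * place_value + expected value of the rest.
    return mean(
        max(d * s + value(tuple(p for p in slots if p != s)) for s in slots)
        for d in DIGITS
    )

print(value((1, 10, 100, 1000, 10000)))  # expected final number under optimal play
```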
4 Dynamical System w/ noise
5 Model ingredients
- Horizon
- States. State sets. Terminal states.
- Actions. Action sets.
- Randomness
- Transition
- Reward per stage
- Objective
6 Dynamic systems w/ noise
- Discrete-time, continuous-state system with noise
- States, actions: x_t ∈ S_t, u_t ∈ U_t(x_t)
- Random disturbances (noise) w_t characterized by a distribution P_t(· | x_t, u_t), independent of w_{t-1}, ..., w_0
- Transitions: x_{t+1} = f_t(x_t, u_t, w_t)
- Cost per stage: g_t(x_t, u_t, w_t)
- Terminal stage: g_N(x_N)
7 Discrete-state, finite-state
- State variable x_t ∈ D, discrete/finite; X_t = (x_0, x_1, ..., x_t)
- Decision u_t based on the past history U_{t-1} = (u_0, u_1, ..., u_{t-1})
- H_t = (X_t, U_{t-1}) is the observed history at time t
- Markov dynamics (stochastic plant equation): P(x_{t+1} | H_t) = P(x_{t+1} | x_t, u_t)
- MDP: state evolution / transition probabilities p_t,ij(u) = P(x_{t+1} = j | x_t = i, u_t = u)
- Cost at stage t from decision u: g_t(i, u)
8 Policy
- Policy: π = {π_0(·), ..., π_{N-1}(·)}
- Admissible policy: u_t = π_t(x_t) ∈ U_t(x_t)
- Markov policy: π_t(·): S_t → U_t
- History-dependent policy: π_t(·): H_t → U_t
- Randomized policy: π_t(·): H_t → P(U_t)
- Optimal policy π*: admissible and optimizes the objective
9 Objective
- How to compare rewards?
- V_π(x_0) = (..., g_t(x_t, π_t(x_t), w_t), ..., g_N(x_N)) is a random vector
- G_π(x_0) = g_N(x_N) + Σ_t g_t(x_t, π_t(x_t), w_t) is a random variable
- J_π(x_0) = E_w[G_π(x_0)] is an expected value
- or E_w[U(G_π(x_0))] or E_w[U(V_π(x_0))] for expected utility
- Optimal policy π*: admissible and minimizes cost
- J*(x_0) = J_{π*}(x_0) = inf_π J_π(x_0)
- This is the optimal value (cost) function
10 System works forward
- Choose an admissible policy π = {π_0(·), ..., π_{N-1}(·)}
- Typical stage t:
- Controller observes x_t and decides u_t = π_t(x_t)
- Generate the disturbance w_t ~ P_t(· | x_t, u_t)
- Incur the cost g_t(x_t, u_t, w_t) and add it to previous costs
- Generate the next state via the plant equation x_{t+1} = f_t(x_t, u_t, w_t)
- If this is the last stage (t = N-1), add the terminal cost g_N(x_N) and the process ends. Otherwise t is incremented. (A sketch of this loop follows.)
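The forward loop above translates almost line for line into code. Below is a minimal sketch; the policy, the plant functions f_t, the stage costs g_t, the terminal cost, and the disturbance sampler are placeholders to be supplied by the model, not part of the slides.

```python
def simulate(x0, policy, f, g, gN, sample_w, N):
    """One forward run of the system under a Markov policy.
    policy[t](x) -> u_t, sample_w[t](x, u) -> draw of w_t ~ P_t(.|x, u),
    f[t](x, u, w) -> x_{t+1}, g[t](x, u, w) -> stage cost, gN(x) -> terminal cost."""
    x, total = x0, 0.0
    for t in range(N):
        u = policy[t](x)          # controller observes x_t and decides u_t
        w = sample_w[t](x, u)     # generate the disturbance w_t
        total += g[t](x, u, w)    # incur the stage cost, add to previous costs
        x = f[t](x, u, w)         # plant equation: x_{t+1} = f_t(x_t, u_t, w_t)
    return total + gN(x)          # add the terminal cost g_N(x_N)
```

Averaging the returned value over many independent runs gives a Monte Carlo estimate of J_π(x_0) from slide 9.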
11 Policy Evaluation Algorithm
- Given a policy π:
- 1. Let t = N and J_{π,N}(x_N) = g_N(x_N) for all x_N ∈ S
- 2. If t = 0, STOP; else go to 3
- 3. Let t ← t-1 and compute
- Control: J_t(x_t) = E_w[ g_t(x_t, π_t(x_t), w_t) + J_{t+1}(f_t(x_t, π_t(x_t), w_t)) ]
- MDP: J_t(x_t) = g_t(x_t, π_t(x_t)) + Σ_{x_{t+1}} p(x_{t+1} | x_t, π_t(x_t)) J_{t+1}(x_{t+1})
- 4. Go to 2
Q1: What if the policy is history dependent? Q2: Evaluate the computational requirement. (A sketch of the MDP case follows.)
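For the MDP branch of step 3, a minimal sketch of the backward recursion is given below. The array layout (P[t][u] an |S|×|S| matrix of p_t,ij(u), g[t][i][u] the stage costs, gN the terminal cost vector) is an assumption made for the sketch, not notation from the slides.

```python
import numpy as np

def evaluate_policy(P, g, gN, policy):
    """Backward policy evaluation for a finite-horizon, finite MDP.
    P[t][u][i][j] = p_t,ij(u); g[t][i][u] = g_t(i, u); gN[i] = g_N(i);
    policy[t][i] = action chosen in state i at stage t.  Returns J_0."""
    N = len(P)
    J = np.asarray(gN, dtype=float)          # J_N(x_N) = g_N(x_N)
    for t in reversed(range(N)):             # t = N-1, ..., 0
        Jt = np.empty_like(J)
        for i in range(J.size):
            u = policy[t][i]
            # J_t(i) = g_t(i, u) + sum_j p_t,ij(u) * J_{t+1}(j)
            Jt[i] = g[t][i][u] + np.dot(P[t][u][i], J)
        J = Jt
    return J
```

Each stage of this loop costs O(|S|^2) operations for a fixed Markov policy, so the whole evaluation is O(N·|S|^2).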
12 The DP Algorithm
- For every initial state x_0, the optimal cost J*(x_0) of the basic problem equals J_0(x_0), where
- J_t(x_t) = min_{u_t ∈ U_t(x_t)} E_w[ g_t(x_t, u_t, w_t) + J_{t+1}(f_t(x_t, u_t, w_t)) ]
- J_N(x_N) = g_N(x_N)
- If u_t* = π_t*(x_t) minimizes the RHS above, then π* = {π_0*(·), ..., π_{N-1}*(·)} is an optimal policy.
- This is the backward induction algorithm (see the sketch below).
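With the same (assumed) array layout as the policy-evaluation sketch above, backward induction only adds the minimization over actions at each state.

```python
import numpy as np

def backward_induction(P, g, gN):
    """DP algorithm for a finite-horizon, finite MDP (same data layout as above).
    Returns J_0 (optimal cost from every initial state) and a minimizing policy."""
    N, nS = len(P), len(gN)
    J = np.asarray(gN, dtype=float)                  # J_N = g_N
    policy = []
    for t in reversed(range(N)):
        nU = len(P[t])
        # Q[i, u] = g_t(i, u) + sum_j p_t,ij(u) * J_{t+1}(j)
        Q = np.array([[g[t][i][u] + np.dot(P[t][u][i], J) for u in range(nU)]
                      for i in range(nS)])
        policy.insert(0, Q.argmin(axis=1))           # u_t* = pi_t*(i), the minimizer
        J = Q.min(axis=1)                            # J_t(i) = min_u Q[i, u]
    return J, policy
```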
13 The Principle of Optimality
- Let π* = {π_0*, ..., π_{N-1}*} be an optimal policy.
- Subproblem i: we are at state x_i at time i and wish to minimize the cost-to-go from time i to time N.
- Result: the truncated policy {π_i*, ..., π_{N-1}*} is optimal for this subproblem.
- An optimal policy is such that, whatever the initial state and initial decision, the remaining decisions must be optimal with respect to the resulting state.
14 Looking at Optimality
- Optimality Criterion
- If a policy is optimal for the optimally truncated problem at any stage, then that policy is optimal for every stage.
- Principle of Optimality
- If a policy is optimal for a given stage, then it must also be optimal for the next.
15 Principle of optimality
- There exists a policy that is optimal in every state.
- Can history information do better?
- Can randomized policies do better?
- What is the key assumption?
16 Counterexample
- U(path i→j) = the 2nd-largest arc value on that path
[Figure: directed graph with numbered nodes and arc values used in the counterexample]
J*(1→5) = J(1-2-3-5) = 5, with J(2-3-5) = 1. However, J(2-4-5) = 3, so the tail 2-3-5 is not optimal for the subproblem starting at node 2: the principle of optimality fails for this non-additive objective.
17 Peculiar example
- States, actions: S = [0,1], U = [0,1]
- Noise: w_t ~ Uniform[0,1]
- Transitions: x_{t+1} = w_t
- Reward: g_t(x_t, u_t, w_t) = u_t
- Discount: δ < 1
- Optimal maximizing policy?
18 Mending the P.O.
- Restrict to states of non-zero probability!
- OR
- Assume a distribution p on initial states
- A policy is p-optimal for a stage if it maximizes the conditional expected value of the returns for that stage over all strategies w.p. 1
- Now the P.O. holds for p-optimal policies
- Is it possible that no optimal policies exist but p-optimal ones do?
19 Peculiar example
- States: S = [0,1]
- Actions: U(x) = [0,1] for x ≠ 1, U(1) = (0,1)
- Noise: w_t ~ Uniform[0,1]
- Transitions: x_{t+1} = w_t
- Reward: g_t(x_t, u_t, w_t) = u_t
- Discount: δ < 1
- p-optimal: π_a with u_t(x) = 1 for x ≠ 1, u_t(1) = a
- No optimal policy exists (the limit of π_a as a → 1 is infeasible!)
20 When does the P.O. work?
- Noise spaces D_0, ..., D_{N-1} are finite or countable (this implies the resulting state sets are too), and all expectations are finite for admissible policies, OR
- The state set is finite or countable, and the action sets are
- finite for every state, OR
- compact for every state, with costs continuous and uniformly bounded and transitions uniformly continuous, OR
- compact for every state, with costs upper semi-continuous and uniformly bounded and transitions lower semi-continuous
21 Open vs. closed loop
- Open-loop: select all controls u_0, ..., u_{N-1} at time 0.
- The policy does not depend on the current state.
- Closed-loop: select a policy π = {π_0, ..., π_{N-1}} where the control at time t, u_t = π_t(x_t), uses information about the current state.
- This extra information may be used to achieve lower costs. The cost reduction is the value of information.
- Example: Monty Hall
22 Sequencing: interchange
- N tasks with values V_i and success probabilities P_i, i = 0, ..., N-1.
- Once a task fails, the machine breaks.
- In what order should the tasks be scheduled?
- Objective: maximize the total reward from completed tasks.
23 Sequencing: interchange
- Task sequence L_N = (i_0, i_1, ..., i_{N-1})
- V(i_0, i_1, ..., i_{N-1}) = P_{i_0} V_{i_0} + P_{i_0} V(i_1, ..., i_{N-1})
  = P_{i_0} V_{i_0} + P_{i_0} P_{i_1} V_{i_1} + ... + (P_{i_0} ··· P_{i_{N-1}}) V_{i_{N-1}}
24 The interchange argument
- The problem has a structure such that there exists an optimal open-loop policy.
- Argument: start with an optimal sequence and argue that exchanging the order of two consecutive controls cannot decrease the expected cost (see the sketch after this list).
- This is a necessary optimality condition.
- In certain cases it is sufficient to characterize the solution!
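Below is a minimal numerical illustration of the interchange argument for the task-sequencing problem of slides 22-23 (a maximization). For small random instances it checks that the order obtained by sorting on the index P_i·V_i / (1 - P_i), which is what the adjacent-swap comparison suggests, matches a brute-force search over all orders; the index formula is derived here from the swap condition and is not stated on the slides.

```python
import itertools
import random

def expected_reward(order, P, V):
    """E[total reward]: task i_k pays V_{i_k} only if tasks i_0, ..., i_k all succeed."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= P[i]          # probability the machine is still working
        total += survive * V[i]
    return total

random.seed(0)
for _ in range(200):
    n = 5
    P = [random.uniform(0.1, 0.9) for _ in range(n)]
    V = [random.uniform(0.0, 10.0) for _ in range(n)]
    best = max(itertools.permutations(range(n)),
               key=lambda o: expected_reward(o, P, V))
    # Adjacent swap: i before j iff P_i V_i / (1 - P_i) >= P_j V_j / (1 - P_j)
    ranked = sorted(range(n), key=lambda i: P[i] * V[i] / (1 - P[i]), reverse=True)
    assert expected_reward(ranked, P, V) >= expected_reward(best, P, V) - 1e-9
```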
25 Scheduling
- Arranging tasks in order, subject to feasibility constraints, in such a way as to secure performance that is economic in some sense:
- trains over a railroad
- operations in construction / building
- production in an oil refinery
- jobs on processors, etc.
26 Scheduling
- Process N jobs sequentially on one machine
- Job i requires a random time T_i (T_1, ..., T_N independent)
- If job i is completed at time t, the reward is R_i(t)
- R_i(t) = a^t R_i (a ∈ (0,1) is a discount factor)
- Find the order that maximizes total expected reward.
27 Structure
- State k = (t_k, S_k)
- the time t_k at which the kth job has just been completed
- the set S_k of jobs that remain to be processed
- Cost-to-go: J_k(t_k, S_k) = max_{i ∈ S_k} E[ a^{t_k + T_i} R_i + J_{k+1}(t_k + T_i, S_k - {i}) ]
- Markov property: optimization of the future jobs is independent of the completion times of previous jobs
- An open-loop policy is optimal.
- Optimal policy: a job schedule (i_0, i_1, ..., i_{N-1}) (see the sketch below).
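Since E[a^(t_k + T_i)] = a^(t_k)·E[a^(T_i)] and the T_i are independent, the expected reward of a fixed schedule depends on each job only through q_i = E[a^(T_i)], which is why an open-loop order suffices. The sketch below finds the best order by enumeration for a small, made-up instance (the values of q_i and R_i are hypothetical).

```python
import itertools

def expected_discounted_reward(order, q, R):
    """E[sum_k a^{C_k} R_{i_k}] for a fixed order, where C_k is the completion time
    of the k-th scheduled job and q_i = E[a^{T_i}]; by independence the factors multiply."""
    total, disc = 0.0, 1.0
    for i in order:
        disc *= q[i]          # E[a^{T_{i_0} + ... + T_{i_k}}] = q_{i_0} * ... * q_{i_k}
        total += disc * R[i]
    return total

# Hypothetical data: q_i = E[a^{T_i}] and rewards R_i for four jobs.
q = [0.9, 0.7, 0.8, 0.6]
R = [1.0, 5.0, 2.0, 4.0]
best = max(itertools.permutations(range(len(q))),
           key=lambda o: expected_discounted_reward(o, q, R))
print(best, expected_discounted_reward(best, q, R))
```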
28 More scheduling
- Scheduling tasks on processors in series (Bertsekas, p. 170)
- Scheduling tasks on simultaneous processors (Weiss and Pinedo; Weber)
- Pinedo (1995). Scheduling: Theory, Algorithms, and Systems. Prentice Hall, NJ.
- Multiarmed bandit problems
- Project ideas ...