Title: Dynamic Programming Applications
1 Dynamic Programming Applications
2 Preview
- Last time: Deterministic DP.
- Today: Intro to Stochastic DP. MDPs.
- Backward induction. Principle of optimality.
- Open-loop policies. Interchange argument.
- Ex: Scheduling
- Next time:
- Structured, monotone policies.
- Ex: Sequential assignment.
3 So who's counting?
- 5 rounds → a 5-digit number
- Digit outcomes: 0, 1, ..., 9
- After each round, place the outcome digit into one of the remaining slots
- Maximize the final number (see the sketch below)
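As a preview of backward induction, the sketch below plays this game optimally in Python. The slide does not specify the digit distribution, so a uniform draw over 0..9 is assumed, and the five slots are encoded by their place values 1, 10, ..., 10000.

```python
from functools import lru_cache
from statistics import mean

DIGITS = range(10)  # assumed: each round's digit is drawn uniformly from 0..9

@lru_cache(maxsize=None)
def value(slots):
    """Expected final number, by backward induction, when the place values in
    `slots` (a tuple drawn from 1, 10, ..., 10000) are still unfilled."""
    if not slots:
        return 0.0
    # Average over this round's digit; for each digit, put it into the remaining
    # slot that maximizes digit * place_value + expected value of the rest.
    return mean(
        max(d * s + value(tuple(p for p in slots if p != s)) for s in slots)
        for d in DIGITS
    )

print(value((1, 10, 100, 1000, 10000)))  # expected final number under optimal play
```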
4 Dynamical System w/ noise
5 Model ingredients
- Horizon
- States. State sets. Terminal states.
- Actions. Action sets.
- Randomness
- Transition
- Reward per stage
- Objective
6 Dynamic systems w/ noise
- Discrete-time, continuous-state system with noise
- States, actions: x_t ∈ S_t, u_t ∈ U_t(x_t)
- Random disturbances (noise) w_t characterized by a distribution P_t(· | x_t, u_t), independent of w_{t-1}, ..., w_0
- Transitions: x_{t+1} = f_t(x_t, u_t, w_t)
- Cost per stage: g_t(x_t, u_t, w_t)
- Terminal stage: g_N(x_N)
7 Discrete-state, finite-state
- State variable x_t ∈ D, discrete/finite; X_t = (x_0, x_1, ..., x_t)
- Decision u_t based on the past history U_{t-1} = (u_0, u_1, ..., u_{t-1})
- H_t = (X_t, U_{t-1}) is the observed history at time t
- Markov dynamics (stochastic plant equation): P(x_{t+1} | H_t) = P(x_{t+1} | x_t, u_t)
- MDP: state evolution / transition probabilities p_t,ij(u) = P(x_{t+1} = j | x_t = i, u_t = u)
- Cost at stage t from decision u: g_t(i, u)
8 Policy
- Policy: π = {π_0(·), ..., π_{N-1}(·)}
- Admissible policy: u_t = π_t(x_t) ∈ U_t(x_t)
- Markov policy: π_t(·): S_t → U_t
- History-dependent policy: π_t(·): H_t → U_t
- Randomized policy: π_t(·): H_t → P(U_t)
- Optimal policy π*: admissible and optimizes the objective
9 Objective
- How to compare rewards?
- V_π(x_0) = (..., g_t(x_t, π_t(x_t), w_t), ..., g_N(x_N)) is a random vector
- G_π(x_0) = g_N(x_N) + Σ_t g_t(x_t, π_t(x_t), w_t) is a random variable
- J_π(x_0) = E_w[G_π(x_0)] is an expected value
- or E_w[U(G_π(x_0))] or E_w[U(V_π(x_0))] for expected utility
- Optimal policy π*: admissible and minimizes cost
- J*(x_0) = J_{π*}(x_0) = inf_π J_π(x_0)
- This is the optimal value (cost) function
10 System works forward
- Choose an admissible policy π = {π_0(·), ..., π_{N-1}(·)}
- Typical stage t:
- Controller observes x_t and decides u_t = π_t(x_t)
- Generate the disturbance w_t ~ P_t(· | x_t, u_t)
- Incur the cost g_t(x_t, u_t, w_t) and add it to previous costs
- Generate the next state via the plant equation x_{t+1} = f_t(x_t, u_t, w_t)
- If this is the last stage (t = N-1), add the terminal cost g_N(x_N) and the process ends. Otherwise t is incremented. (A sketch of this loop follows.)
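The forward loop above translates almost line for line into code. Below is a minimal sketch; the policy, the plant functions f_t, the stage costs g_t, the terminal cost, and the disturbance sampler are placeholders to be supplied by the model, not part of the slides.

```python
def simulate(x0, policy, f, g, gN, sample_w, N):
    """One forward run of the system under a Markov policy.
    policy[t](x) -> u_t, sample_w[t](x, u) -> draw of w_t ~ P_t(.|x, u),
    f[t](x, u, w) -> x_{t+1}, g[t](x, u, w) -> stage cost, gN(x) -> terminal cost."""
    x, total = x0, 0.0
    for t in range(N):
        u = policy[t](x)          # controller observes x_t and decides u_t
        w = sample_w[t](x, u)     # generate the disturbance w_t
        total += g[t](x, u, w)    # incur the stage cost, add to previous costs
        x = f[t](x, u, w)         # plant equation: x_{t+1} = f_t(x_t, u_t, w_t)
    return total + gN(x)          # add the terminal cost g_N(x_N)
```

Averaging the returned value over many independent runs gives a Monte Carlo estimate of J_π(x_0) from slide 9.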
11 Policy Evaluation Algorithm
- Given a policy π:
- 1. Let t = N and J_{π,N}(x_N) = g_N(x_N) for all x_N ∈ S
- 2. If t = 0, STOP; else go to 3
- 3. Let t ← t-1 and compute
- Control: J_t(x_t) = E_w[ g_t(x_t, π_t(x_t), w_t) + J_{t+1}(f_t(x_t, π_t(x_t), w_t)) ]
- MDP: J_t(x_t) = g_t(x_t, π_t(x_t)) + Σ_{x_{t+1}} p(x_{t+1} | x_t, π_t(x_t)) J_{t+1}(x_{t+1})
- 4. Go to 2
Q1: What if the policy is history dependent? Q2: Evaluate the computational requirement. (A sketch of the MDP case follows.)
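For the MDP branch of step 3, a minimal sketch of the backward recursion is given below. The array layout (P[t][u] an |S|×|S| matrix of p_t,ij(u), g[t][i][u] the stage costs, gN the terminal cost vector) is an assumption made for the sketch, not notation from the slides.

```python
import numpy as np

def evaluate_policy(P, g, gN, policy):
    """Backward policy evaluation for a finite-horizon, finite MDP.
    P[t][u][i][j] = p_t,ij(u); g[t][i][u] = g_t(i, u); gN[i] = g_N(i);
    policy[t][i] = action chosen in state i at stage t.  Returns J_0."""
    N = len(P)
    J = np.asarray(gN, dtype=float)          # J_N(x_N) = g_N(x_N)
    for t in reversed(range(N)):             # t = N-1, ..., 0
        Jt = np.empty_like(J)
        for i in range(J.size):
            u = policy[t][i]
            # J_t(i) = g_t(i, u) + sum_j p_t,ij(u) * J_{t+1}(j)
            Jt[i] = g[t][i][u] + np.dot(P[t][u][i], J)
        J = Jt
    return J
```

Each stage of this loop costs O(|S|^2) operations for a fixed Markov policy, so the whole evaluation is O(N·|S|^2).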
12 The DP Algorithm
- For every initial state x_0, the optimal cost J*(x_0) of the basic problem equals J_0(x_0), where
- J_t(x_t) = min_{u_t ∈ U_t(x_t)} E_w[ g_t(x_t, u_t, w_t) + J_{t+1}(f_t(x_t, u_t, w_t)) ]
- J_N(x_N) = g_N(x_N)
- If u_t* = π_t*(x_t) minimizes the RHS above, then π* = {π_0*(·), ..., π_{N-1}*(·)} is an optimal policy.
- This is the backward induction algorithm (see the sketch below).
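With the same (assumed) array layout as the policy-evaluation sketch above, backward induction only adds the minimization over actions at each state.

```python
import numpy as np

def backward_induction(P, g, gN):
    """DP algorithm for a finite-horizon, finite MDP (same data layout as above).
    Returns J_0 (optimal cost from every initial state) and a minimizing policy."""
    N, nS = len(P), len(gN)
    J = np.asarray(gN, dtype=float)                  # J_N = g_N
    policy = []
    for t in reversed(range(N)):
        nU = len(P[t])
        # Q[i, u] = g_t(i, u) + sum_j p_t,ij(u) * J_{t+1}(j)
        Q = np.array([[g[t][i][u] + np.dot(P[t][u][i], J) for u in range(nU)]
                      for i in range(nS)])
        policy.insert(0, Q.argmin(axis=1))           # u_t* = pi_t*(i), the minimizer
        J = Q.min(axis=1)                            # J_t(i) = min_u Q[i, u]
    return J, policy
```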
13 The Principle of Optimality
- Let π* = {π_0*, ..., π_{N-1}*} be an optimal policy.
- Subproblem i: we are at state x_i at time i and wish to minimize the cost-to-go from time i to time N.
- Result: the truncated policy {π_i*, ..., π_{N-1}*} is optimal for this subproblem.
- An optimal policy is such that, whatever the initial state and initial decision, the remaining decisions must be optimal with respect to the resulting state.
14 Looking at Optimality
- Optimality Criterion
- If a policy is optimal for the optimally truncated problem at any stage, then that policy is optimal for every stage.
- Principle of Optimality
- If a policy is optimal for a given stage, then it must also be optimal for the next.
15 Principle of optimality
- There exists a policy that is optimal in every state.
- Can history information do better?
- Can randomized policies do better?
- What is the key assumption?
16 Counterexample
- U(path i→j) = the 2nd-largest arc value on that path
[Figure: directed graph with numbered nodes and arc values used in the counterexample]
J*(1→5) = J(1-2-3-5) = 5, with J(2-3-5) = 1. However, J(2-4-5) = 3, so the tail 2-3-5 is not optimal for the subproblem starting at node 2: the principle of optimality fails for this non-additive objective.
17 Peculiar example
- States, actions: S = [0,1], U = [0,1]
- Noise: w_t ~ Uniform[0,1]
- Transitions: x_{t+1} = w_t
- Reward: g_t(x_t, u_t, w_t) = u_t
- Discount: δ < 1
- Optimal maximizing policy?
18 Mending the P.O.
- Restrict to states of non-zero probability!
- OR
- Assume a distribution p on initial states
- A policy is p-optimal for a stage if it maximizes the conditional expected value of the returns for that stage over all strategies w.p. 1
- Now the P.O. holds for p-optimal policies
- Is it possible that no optimal policies exist but p-optimal ones do?
19 Peculiar example
- States: S = [0,1]
- Actions: U(x) = [0,1] for x ≠ 1, U(1) = (0,1)
- Noise: w_t ~ Uniform[0,1]
- Transitions: x_{t+1} = w_t
- Reward: g_t(x_t, u_t, w_t) = u_t
- Discount: δ < 1
- p-optimal: π_a with u_t(x) = 1 for x ≠ 1, u_t(1) = a
- No optimal policy exists (the limit of π_a as a → 1 is infeasible!)
20 When does the P.O. work?
- Noise spaces D_0, ..., D_{N-1} are finite or countable (this implies the resulting state sets are too), and all expectations are finite for admissible policies, OR
- The state set is finite or countable, and the action sets are
- finite for every state, OR
- compact for every state, with costs continuous and uniformly bounded and transitions uniformly continuous, OR
- compact for every state, with costs upper semi-continuous and uniformly bounded and transitions lower semi-continuous
21 Open vs. closed loop
- Open-loop: select all controls u_0, ..., u_{N-1} at time 0.
- The policy does not depend on the current state.
- Closed-loop: select a policy π = {π_0, ..., π_{N-1}} where the control at time t, u_t = π_t(x_t), uses information about the current state.
- This extra information may be used to achieve lower costs. The cost reduction is the value of information.
- Example: Monty Hall
22 Sequencing: interchange
- N tasks with values V_i and success probabilities P_i, i = 0, ..., N-1.
- Once a task fails, the machine breaks.
- In what order should the tasks be scheduled?
- Objective: maximize the total reward from completed tasks.
23 Sequencing: interchange
- Task sequence L_N = (i_0, i_1, ..., i_{N-1})
- V(i_0, i_1, ..., i_{N-1}) = P_{i_0} V_{i_0} + P_{i_0} V(i_1, ..., i_{N-1})
  = P_{i_0} V_{i_0} + P_{i_0} P_{i_1} V_{i_1} + ... + (P_{i_0} ··· P_{i_{N-1}}) V_{i_{N-1}}
24 The interchange argument
- The problem has a structure such that there exists an optimal open-loop policy.
- Argument: start with an optimal sequence and argue that exchanging the order of two consecutive controls cannot decrease the expected cost (see the sketch after this list).
- This is a necessary optimality condition.
- In certain cases it is sufficient to characterize the solution!
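Below is a minimal numerical illustration of the interchange argument for the task-sequencing problem of slides 22-23 (a maximization). For small random instances it checks that the order obtained by sorting on the index P_i·V_i / (1 - P_i), which is what the adjacent-swap comparison suggests, matches a brute-force search over all orders; the index formula is derived here from the swap condition and is not stated on the slides.

```python
import itertools
import random

def expected_reward(order, P, V):
    """E[total reward]: task i_k pays V_{i_k} only if tasks i_0, ..., i_k all succeed."""
    total, survive = 0.0, 1.0
    for i in order:
        survive *= P[i]          # probability the machine is still working
        total += survive * V[i]
    return total

random.seed(0)
for _ in range(200):
    n = 5
    P = [random.uniform(0.1, 0.9) for _ in range(n)]
    V = [random.uniform(0.0, 10.0) for _ in range(n)]
    best = max(itertools.permutations(range(n)),
               key=lambda o: expected_reward(o, P, V))
    # Adjacent swap: i before j iff P_i V_i / (1 - P_i) >= P_j V_j / (1 - P_j)
    ranked = sorted(range(n), key=lambda i: P[i] * V[i] / (1 - P[i]), reverse=True)
    assert expected_reward(ranked, P, V) >= expected_reward(best, P, V) - 1e-9
```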
25 Scheduling
- Arranging tasks in order, subject to feasibility constraints, in such a way as to secure performance that is economic in some sense:
- trains over a railroad
- operations in construction / building
- production in an oil refinery
- jobs on processors, etc.
26 Scheduling
- Process N jobs sequentially on one machine
- Job i requires a random time T_i (T_1, ..., T_N independent)
- If job i is completed at time t, the reward is R_i(t)
- R_i(t) = a^t R_i (a ∈ (0,1) is a discount factor)
- Find the order that maximizes total expected reward.
27 Structure
- State k = (t_k, S_k)
- the time t_k at which the kth job has just been completed
- the set S_k of jobs that remain to be processed
- Cost-to-go: J_k(t_k, S_k) = max_{i ∈ S_k} E[ a^{t_k + T_i} R_i + J_{k+1}(t_k + T_i, S_k - {i}) ]
- Markov property: optimization of the future jobs is independent of the completion times of previous jobs
- An open-loop policy is optimal.
- Optimal policy: a job schedule (i_0, i_1, ..., i_{N-1}) (see the sketch below).
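Since E[a^(t_k + T_i)] = a^(t_k)·E[a^(T_i)] and the T_i are independent, the expected reward of a fixed schedule depends on each job only through q_i = E[a^(T_i)], which is why an open-loop order suffices. The sketch below finds the best order by enumeration for a small, made-up instance (the values of q_i and R_i are hypothetical).

```python
import itertools

def expected_discounted_reward(order, q, R):
    """E[sum_k a^{C_k} R_{i_k}] for a fixed order, where C_k is the completion time
    of the k-th scheduled job and q_i = E[a^{T_i}]; by independence the factors multiply."""
    total, disc = 0.0, 1.0
    for i in order:
        disc *= q[i]          # E[a^{T_{i_0} + ... + T_{i_k}}] = q_{i_0} * ... * q_{i_k}
        total += disc * R[i]
    return total

# Hypothetical data: q_i = E[a^{T_i}] and rewards R_i for four jobs.
q = [0.9, 0.7, 0.8, 0.6]
R = [1.0, 5.0, 2.0, 4.0]
best = max(itertools.permutations(range(len(q))),
           key=lambda o: expected_discounted_reward(o, q, R))
print(best, expected_discounted_reward(best, q, R))
```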
28 More scheduling
- Scheduling tasks on processors in series (Bertsekas, p. 170)
- Scheduling tasks on simultaneous processors (Weiss and Pinedo; Weber)
- Pinedo (1995). Scheduling: Theory, Algorithms, and Systems. Prentice Hall, NJ.
- Multiarmed bandit problems
- Project ideas ...