1
Dynamic Programming Applications
  • Lecture 2

2
Preview
  • Last time: Deterministic DP.
  • Today: Intro to stochastic DP. MDPs.
  • Backward induction. Principle of optimality.
  • Open-loop policies. Interchange argument.
  • Ex: Scheduling.
  • Next time:
  • Structured, monotone policies.
  • Ex: Sequential assignment.

3
So who's counting?
  • 5 rounds → a 5-digit number
  • Digit outcomes: 0, 1, ..., 9
  • After each round, place the outcome digit into one of the
    remaining slots
  • Maximize the final number (a backward-induction sketch follows below)
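A minimal backward-induction sketch for this game (Python); the slot place values 10^4, ..., 1 and the uniform digit distribution are assumptions made for illustration:

    from functools import lru_cache

    SLOT_VALUES = (10**4, 10**3, 10**2, 10, 1)  # assumed: 5 slots, most significant first
    DIGITS = range(10)                          # assumed: each round's digit is uniform on 0..9

    @lru_cache(maxsize=None)
    def value(open_slots):
        # Expected final number when the slots in open_slots are still empty.
        if not open_slots:
            return 0.0
        total = 0.0
        for d in DIGITS:  # average over the random digit of this round
            # place d in the open slot maximizing digit * place value + optimal future value
            total += max(d * SLOT_VALUES[s] + value(tuple(x for x in open_slots if x != s))
                         for s in open_slots)
        return total / len(DIGITS)

    print(value((0, 1, 2, 3, 4)))  # optimal expected value of the final 5-digit number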

4
Dynamical System w/noise
5
Model ingredients
  • Horizon
  • States. State sets. Terminal states.
  • Actions. Action sets.
  • Randomness
  • Transition
  • Reward per stage
  • Objective

6
Dynamic systems w/noise
  • Discrete-time, continuous-state system with noise
  • States and actions: x_t ∈ S_t, u_t ∈ U_t(x_t)
  • Random disturbances (noise) w_t with distribution P_t(· | x_t, u_t), independent of w_{t-1}, ..., w_0
  • Transitions: x_{t+1} = f_t(x_t, u_t, w_t)
  • Cost per stage: g_t(x_t, u_t, w_t)
  • Terminal cost: g_N(x_N)

7
Discrete-state finite-state
  • State var. x_t ∈ D, discrete/finite; history X_t = (x_0, x_1, ..., x_t)
  • Decision u_t based on the past history; U_{t-1} = (u_0, u_1, ..., u_{t-1})
  • H_t = (X_t, U_{t-1}) is the observed history at time t
  • Markov dynamics (stochastic plant equation):
  • P(x_{t+1} | H_t) = P(x_{t+1} | x_t, u_t)  ⇒  MDP
  • State evolution / transition probabilities:
  • p_t,ij(u) = P(x_{t+1} = j | x_t = i, u_t = u)
  • Cost at stage t from decision u: g_t(i, u). (A small data-layout sketch follows below.)
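A minimal sketch of one way to lay out these ingredients for a tiny finite MDP (all numbers are illustrative assumptions); the same layout is reused in the evaluation and backward-induction sketches below:

    import numpy as np

    S, U, N = 2, 2, 3   # 2 states, 2 actions, horizon 3 (illustrative)

    # P[t][u] is the S x S matrix with entries p_t,ij(u) = P(x_{t+1}=j | x_t=i, u_t=u)
    P = [[np.array([[0.8, 0.2], [0.3, 0.7]]),
          np.array([[0.5, 0.5], [0.1, 0.9]])] for _ in range(N)]

    # g[t][u][i] = stage cost g_t(i, u); gN[i] = terminal cost g_N(i)
    g  = [[np.array([1.0, 2.0]), np.array([0.5, 3.0])] for _ in range(N)]
    gN = np.array([0.0, 1.0])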

8
Policy
  • Policy: π = {π_0(·), ..., π_{N-1}(·)}
  • Admissible policy: u_t = π_t(x_t) ∈ U_t(x_t)
  • Markov policy: π_t(·): S_t → U_t
  • History-dependent policy: π_t(·): H_t → U_t
  • Randomized policy: π_t(·): H_t → P(U_t)
  • Optimal policy π*: admissible and optimizes the objective

9
Objective
  • How to compare rewards?
  • V_π(x_0) = (..., g_t(x_t, π_t(x_t), w_t), ..., g_N(x_N)): a random vector
  • G_π(x_0) = g_N(x_N) + Σ_t g_t(x_t, π_t(x_t), w_t): a random variable
  • J_π(x_0) = E_w[G_π(x_0)]: expected value
  • or E_w[U(G_π(x_0))] or E_w[U(V_π(x_0))]: expected utility
  • Optimal policy π*: admissible and minimizes the cost
  • J*(x_0) = J_{π*}(x_0) = inf_π J_π(x_0)
  • This is the optimal value (cost) function

10
System works forward
  • Choose an admissible policy π = {π_0(·), ..., π_{N-1}(·)}
  • Typical stage t:
  • Controller observes x_t and decides u_t = π_t(x_t)
  • Generate the disturbance w_t ~ P_t(· | x_t, u_t)
  • Incur the cost g_t(x_t, u_t, w_t) and add it to the previous costs
  • Generate the next state via the plant eq.: x_{t+1} = f_t(x_t, u_t, w_t)
  • If this is the last stage (t = N-1), add the terminal cost g_N(x_N) and the process ends; otherwise t is incremented. (A simulation sketch follows below.)
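A minimal sketch of this forward loop (Python); the plant f, stage cost g, terminal cost gN, disturbance sampler and policy are hypothetical placeholders supplied by the caller:

    import random

    def simulate(x0, policy, f, g, gN, sample_w, N, rng=random.Random(0)):
        # Run the system forward for N stages under the given policy; return the total cost.
        x, total = x0, 0.0
        for t in range(N):
            u = policy[t](x)              # controller observes x_t, applies u_t = pi_t(x_t)
            w = sample_w(t, x, u, rng)    # draw disturbance w_t ~ P_t(. | x_t, u_t)
            total += g(t, x, u, w)        # incur stage cost g_t(x_t, u_t, w_t)
            x = f(t, x, u, w)             # plant equation: x_{t+1} = f_t(x_t, u_t, w_t)
        return total + gN(x)              # add terminal cost g_N(x_N)

Averaging many such runs estimates J_π(x_0) from the previous slide.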

11
Policy Evaluation Algorithm
  • Given a policy π:
  • 1. Let t = N and J_{π,N}(x_N) = g_N(x_N) for all x_N ∈ S
  • 2. If t = 0, STOP; else go to 3
  • 3. Let t ← t - 1 and compute
  • Control:
  • J_t(x_t) = E_w[ g_t(x_t, π_t(x_t), w_t) + J_{t+1}(f_t(x_t, π_t(x_t), w_t)) ]
  • MDP:
  • J_t(x_t) = g_t(x_t, π_t(x_t)) + Σ_{x_{t+1}} p(x_{t+1} | x_t, π_t(x_t)) J_{t+1}(x_{t+1})
  • 4. Go to 2

Q1: What if the policy is history dependent? Q2: Evaluate the computational requirement. (A sketch of this recursion follows below.)
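A minimal sketch of the MDP form of this recursion, assuming the array layout from the slide-7 sketch (P[t][u], g[t][u], gN are hypothetical names):

    import numpy as np

    def evaluate_policy(P, g, gN, policy):
        # P[t][u]: S x S transition matrix, g[t][u]: S-vector of stage costs,
        # gN: S-vector of terminal costs, policy[t][x]: action used in state x at time t.
        # Returns J[t][x], the expected cost-to-go under the policy.
        N, S = len(P), len(gN)
        J = [np.zeros(S) for _ in range(N + 1)]
        J[N] = np.asarray(gN, dtype=float)
        for t in range(N - 1, -1, -1):            # backward in time
            for x in range(S):
                u = policy[t][x]
                J[t][x] = g[t][u][x] + P[t][u][x] @ J[t + 1]
        return J

With a Markov policy the work is O(N·S²) (Q2); a history-dependent policy would need cost-to-go tables indexed by the whole history H_t, whose size grows exponentially in t (Q1).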
12
The DP Algorithm
  • For every initial state x_0, the optimal cost J*(x_0) of the basic problem equals J_0(x_0), where
  • J_t(x_t) = min_{u_t ∈ U_t(x_t)} E_w[ g_t(x_t, u_t, w_t) + J_{t+1}(f_t(x_t, u_t, w_t)) ]
  • J_N(x_N) = g_N(x_N)
  • If u_t = π_t*(x_t) minimizes the RHS above, then π* = {π_0*(·), ..., π_{N-1}*(·)} is an optimal policy.
  • This is the backward induction algorithm. (A sketch follows below.)

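A minimal sketch of backward induction for the finite MDP case, with the same hypothetical data layout as above, now minimizing over actions and recording a Markov policy:

    import numpy as np

    def backward_induction(P, g, gN):
        # Returns the optimal cost-to-go J[t][x] and a minimizing Markov policy pi[t][x].
        N, S = len(P), len(gN)
        J = [np.zeros(S) for _ in range(N + 1)]
        J[N] = np.asarray(gN, dtype=float)
        pi = [np.zeros(S, dtype=int) for _ in range(N)]
        for t in range(N - 1, -1, -1):
            for x in range(S):
                # Q-values: g_t(x,u) + sum_j p_t,xj(u) * J_{t+1}(j), one per action u
                q = [g[t][u][x] + P[t][u][x] @ J[t + 1] for u in range(len(P[t]))]
                pi[t][x] = int(np.argmin(q))
                J[t][x] = q[pi[t][x]]
        return J, pi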
13
The Principle of Optimality
  • Let π* = {π_0*, ..., π_{N-1}*} be an optimal policy.
  • Subproblem i: we are at state x_i at time i and wish to minimize the cost-to-go from time i to time N.
  • Result: the truncated policy {π_i*, ..., π_{N-1}*} is optimal for this subproblem.
  • An optimal policy is such that, whatever the initial state and initial decision, the remaining decisions must be optimal with respect to the resulting state.

14
Looking at Optimality
  • Optimality Criterion:
  • If a policy is optimal for the optimally truncated problem at any stage, then that policy is optimal at every stage.
  • Principle of Optimality:
  • If a policy is optimal for a given stage, then it must also be optimal for the next one.

15
Principle of optimality
  • There exists a policy that is optimal in every state.
  • Can history-dependent policies do better?
  • Can randomized policies do better?
  • What is the key assumption?

16
Counterexample
  • U(path i→j) = 2nd largest arc on that path

(Figure: a network on nodes 1-5 with arc values, showing paths from node 1 to node 5.)
J(1-5) = J(1-2-3-5) = 5, with J(2-3-5) = 1. However,
J(2-4-5) = 3, so 2-3-5 is not optimal for the subproblem.
17
Peculiar example
  • States and actions: S = [0,1], U = [0,1]
  • Noise: w_t ~ U[0,1]
  • Transitions: x_{t+1} = w_t
  • Reward: g_t(x_t, u_t, w_t) = u_t
  • Discount: d < 1
  • Optimal maximizing policy?

18
Mending the P.O.
  • Restrict to states of non-zero probability!
  • OR
  • Assume a distribution p on initial states
  • A policy is p-optimal for a stage if it maximizes the conditional expected value of the returns for that stage over all strategies, w.p. 1
  • Now the P.O. holds for p-optimal policies
  • Is it possible that no optimal policies exist but
    p-optimal ones do?

19
Peculiar example
  • States: S = [0,1]
  • Actions: U(x) = [0,1] for x ≠ 1; U(1) = (0,1)
  • Noise: w_t ~ U[0,1]
  • Transitions: x_{t+1} = w_t
  • Reward: g_t(x_t, u_t, w_t) = u_t
  • Discount: d < 1
  • p-optimal: π_a with u_t(x) = 1 for x ≠ 1 and u_t(1) = a
  • No optimal policy exists (the limit π_a → π* is infeasible!)

20
When does the P.O. work?
  • Noise spaces D_0, ..., D_{N-1} are finite or countable (this implies the resulting state sets are too), and all expectations are finite for admissible policies; OR
  • The state set is finite or countable, and the action sets are
  • finite for every state; OR
  • compact for every state, and costs are continuous and uniformly bounded, and transitions are uniformly continuous; OR
  • compact for every state, and costs are upper semi-continuous and uniformly bounded, and transitions are lower semi-continuous

21
Open vs. closed loop
  • Open-loop: select all controls u_0, ..., u_{N-1} at time 0.
  • The policy does not depend on the current state.
  • Closed-loop: select a policy π = {π_0, ..., π_{N-1}} where the control at time t, u_t = π_t(x_t), uses information about the current state.
  • This extra information may be used to achieve lower costs. Cost reduction = value of information.
  • Example: Monty Hall (see the sketch below)
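A minimal Monty Hall sketch (Python) contrasting the two: the open-loop player keeps the time-0 choice, the closed-loop player switches after the host's opened door reveals information; the trial count and seed are arbitrary:

    import random

    def monty_hall(trials=100_000, rng=random.Random(0)):
        stay_wins = switch_wins = 0
        for _ in range(trials):
            prize, pick = rng.randrange(3), rng.randrange(3)
            # host opens a door that is neither the pick nor the prize (the extra information)
            opened = next(d for d in range(3) if d not in (prize, pick))
            switched = next(d for d in range(3) if d not in (pick, opened))
            stay_wins += (pick == prize)        # open-loop: the time-0 choice is final
            switch_wins += (switched == prize)  # closed-loop: act on the revealed information
        return stay_wins / trials, switch_wins / trials

    print(monty_hall())  # roughly (1/3, 2/3); the gap is the value of information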

22
Sequencing interchange
  • N tasks with values V_i and success probabilities P_i, i = 0, ..., N-1.
  • Once a task fails, the machine breaks.
  • In what order should the tasks be scheduled?
  • Objective: maximize the expected total reward from completed tasks.

23
Sequencing interchange
  • Task sequence L_N = (i_0, i_1, ..., i_{N-1})
  • V(i_0, i_1, ..., i_{N-1}) = P_{i_0} V_{i_0} + P_{i_0} V(i_1, ..., i_{N-1})
    = P_{i_0} V_{i_0} + P_{i_0} P_{i_1} V_{i_1} + ... + (P_{i_0} ... P_{i_{N-1}}) V_{i_{N-1}}

24
The interchange argument
  • The problem has a structure such that there exists an optimal open-loop policy.
  • Argument: start with an optimal sequence and argue that exchanging the order of two consecutive controls cannot decrease the expected cost.
  • This is a necessary optimality condition.
  • In certain cases it is sufficient to characterize the solution! (A sketch for the sequencing example follows below.)
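A minimal sketch of the interchange argument on the sequencing example from the previous slides: comparing two adjacent tasks i, j in an optimal sequence gives the rule "schedule in decreasing order of P_i V_i / (1 - P_i)"; the numerical data and the brute-force check are illustrative assumptions:

    from itertools import permutations

    def expected_reward(order, P, V):
        # V(i0,...,iN-1) = P_i0 V_i0 + P_i0 P_i1 V_i1 + ... (as on slide 23)
        total, survive = 0.0, 1.0
        for i in order:
            survive *= P[i]
            total += survive * V[i]
        return total

    P = [0.9, 0.5, 0.7]   # illustrative success probabilities
    V = [1.0, 10.0, 3.0]  # illustrative task values

    # Adjacent interchange: i before j iff P_i V_i + P_i P_j V_j >= P_j V_j + P_j P_i V_i,
    # i.e. iff P_i V_i / (1 - P_i) >= P_j V_j / (1 - P_j); sort by that index.
    index_order = sorted(range(len(P)), key=lambda i: P[i] * V[i] / (1 - P[i]), reverse=True)

    best = max(permutations(range(len(P))), key=lambda o: expected_reward(o, P, V))
    print(index_order, list(best))  # the index order matches the brute-force optimum here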

25
Scheduling
  • Arranging tasks in order, subject to feasibility constraints, in such a way as to secure performance that is economical in some sense.
  • trains over a railroad
  • operations in building construction
  • production in an oil refinery
  • jobs on processors, etc.

26
Scheduling
  • Process N jobs sequentially on one machine
  • Job i requires a random time T_i (T_1, ..., T_N independent)
  • If job i is completed at time t → reward R_i(t)
  • R_i(t) = a^t R_i (a ∈ (0,1) is a discount factor)
  • Find the order that maximizes the total expected reward.

27
Structure
  • State at stage k: (t_k, S_k)
  • t_k: the time at which the k-th job has just been completed
  • S_k: the set of jobs that remain to be processed
  • Cost-to-go:
  • J_k(t_k, S_k) = max_{i ∈ S_k} E[ a^{t_k + T_i} R_i + J_{k+1}(t_k + T_i, S_k \ {i}) ]
  • Markov property: the optimization over future jobs is independent of the completion times of previous jobs
  • An open-loop policy is optimal.
  • Optimal policy = a job schedule (i_0, i_1, ..., i_{N-1}). (An index sketch follows below.)

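The same adjacent-interchange comparison applied to this model suggests ordering the jobs by the index E[a^{T_i}] R_i / (1 - E[a^{T_i}]); a minimal sketch, with the job data as illustrative assumptions:

    def schedule(jobs):
        # jobs: list of (R_i, q_i) pairs with q_i = E[a**T_i].
        # Swapping adjacent jobs i, j compares q_i R_i + q_i q_j R_j with q_j R_j + q_j q_i R_i,
        # which reduces to comparing q_i R_i / (1 - q_i) with q_j R_j / (1 - q_j).
        return sorted(range(len(jobs)),
                      key=lambda i: jobs[i][1] * jobs[i][0] / (1 - jobs[i][1]),
                      reverse=True)

    print(schedule([(5.0, 0.8), (7.0, 0.6), (2.0, 0.9)]))  # illustrative (R_i, E[a**T_i]) data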
28
More scheduling
  • Scheduling tasks on processors in series (Bertsekas, p. 170)
  • Scheduling tasks on simultaneous processors (Weiss and Pinedo, Weber)
  • Pinedo (1995). Scheduling: Theory, Algorithms, and Systems. Prentice Hall, N.J.
  • Multiarmed bandit problems
  • Project ideas ...