Markov Decision Processes: A Survey

1
Markov Decision Processes: A Survey
  • Martin L. Puterman

2
Outline
  • Example - Airline Meal Planning
  • MDP Overview and Applications
  • Airline Meal Planning Models and Results
  • MDP Theory and Computation
  • Bayesian MDPs and Censored Models
  • Reinforcement Learning
  • Concluding Remarks

3
Airline Meal Planning
  • Goal: Get the right number of meals on each
    flight
  • Why is this hard?
  • Meal preparation lead times
  • Load uncertainty
  • Last minute uploading capacity constraints
  • Why is this important to an airline?
  • 500 flights per day × 365 days × $5/meal =
    $912,500

4

5
How Significant is the Problem?
6
The Meal Planning Decision Process
  • At several key decision points up to 3 hours
    before departure, the meal planner observes
    reservations and meals allocated and adjusts
    allocated meal quantity.
  • Hourly in the last three hours, adjustments are
    made but the cost of adjustment is significantly
    higher and limited by delivery van capacity and
    uploading logistics.

7
Meal Planning Timeline

8
Airline Meal Planning
  • Operational goal: develop a meal planning
    strategy that minimizes expected total overage,
    underage and operational costs

A Meal Planning Strategy specifies at each
decision point the number of extra meals to
prepare or deliver for any observed meal
allocation and reservation quantity.
9
Why is Finding an Optimal Meal Planning Strategy
Challenging?
  • 6 decision points
  • 108 passengers
  • 108 possible actions
  • One strategy requires 108 × 108 × 6 = 69,984
    order quantities.
  • There are 7,558,272 strategies to consider.
  • Demand must be forecasted.

10
Airline Meal Planning Characteristics
  • A similar decision is made at several time points
  • There are costs associated with each decision
  • The decision has future consequences
  • The overall cost depends on several events
  • There is uncertainty about the future

11
What is a Markov decision process?
  • A mathematical representation of a sequential
    decision making problem in which
  • A system evolves through time.
  • A decision maker controls it by taking actions at
    pre-specified points of time.
  • Actions incur immediate costs or accrue immediate
    rewards and affect the subsequent system state.

12
MDP Overview

13
Markov Decision Processes are also known as
  • MDPs
  • Dynamic Programs
  • Stochastic Dynamic Programs
  • Sequential Decision Processes
  • Stochastic Control Problems

14
Early Historical Perspective
  • Massé - Reservoir Control (1940s )
  • Wald - Sequential Analysis (1940s )
  • Bellman - Dynamic Programming (1950s)
  • Arrow, Dvoretzky, Wolfowitz, Kiefer, Karlin -
    Inventory (1950s)
  • Howard (1960) - Finite State and Action Models
  • Blackwell (1962) - Theoretical Foundation
  • Derman, Ross, Denardo, Veinott (1960s) - Theory
    - USA
  • Dynkin, Krylov, Shiryaev, Yushkevich (1960s) -
    Theory - USSR

15
Basic Model Ingredients
  • Decision epochs {0, 1, 2, ..., N} or [0, N] or
    {0, 1, 2, ...} or [0, ∞)
  • State Space S (generic state s)
  • Action Sets As (generic action a)
  • Rewards rt(s,a)
  • Transition probabilities pt(j|s,a)
  • A model is called stationary if rewards and
    transition probabilities are independent of t
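As a concrete illustration of these ingredients, here is a minimal sketch in Python; the two-state equipment example and its numbers are invented for illustration only and assume a stationary model.

  # A tiny stationary MDP encoded directly from the ingredients above.
  # The two-state "repair" example is invented for illustration only.
  S = ["working", "broken"]                                    # state space S
  A = {"working": ["run"], "broken": ["repair", "replace"]}    # action sets A_s

  # rewards r(s, a)
  r = {("working", "run"): 10.0,
       ("broken", "repair"): -4.0,
       ("broken", "replace"): -10.0}

  # transition probabilities p(j | s, a)
  p = {("working", "run"):     {"working": 0.9, "broken": 0.1},
       ("broken", "repair"):   {"working": 0.6, "broken": 0.4},
       ("broken", "replace"):  {"working": 1.0, "broken": 0.0}}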

16
System Evolution
[Diagram: at decision epoch t the system is in state st, action at is
taken and reward rt(st,at) is received; at decision epoch t+1 the system
is in state st+1, action at+1 is taken and reward rt+1(st+1,at+1) is
received.]
17
Another Perspective
[Diagram: a tree of states s1-s4 linked by actions a1 and a2, showing the
possible transitions from each state under each action.]
18
Yet Another Perspective: An Event Timeline
[Timeline: May 15 - place June order; June 1 - May order arrives, ship
product to DCs; June 10 - May sales data arrive, prepare July forecast;
June 15 - place July order.]
19
Some Variants on the Basic Model
  • There may be a continuum of states and/or actions
  • Decisions may be made in continuous time
  • Rewards and transition rates may change over time
  • System state may be not observable
  • Some model parameters may not be known

20
Derived Quantities
  • Decision Rules dt(s)
  • Policies, Strategies or Plans π = (d1, d2, ...)
    or π = (d1, d2, ..., dN)
  • Stochastic Processes (Xt, Yt) and expectations Esπ
  • Value Functions vtπ(s), vλπ(s), gπ
  • Value functions differ from immediate rewards:
    they represent the value, starting in a state, of
    all future events

21
Objective
  • Identify a policy that maximizes either the
  • expected total reward (finite or infinite
    horizon)
  • vπ(s) = Esπ [ Σt≥0 rt(Xt,Yt) ]
  • expected discounted reward
  • expected long run average reward
  • expected utility
  • possibly subject to constraints on system
    performance
22
The Bellman Equation
  • MDP computation and theory focus on solving the
    optimality (Bellman) equation, which for infinite
    horizon discounted models is
    v(s) = max a∈As { r(s,a) + λ Σ j∈S p(j|s,a) v(j) }

This can also be expressed as
v = Tv or Bv = 0
where (Tv)(s) denotes the right-hand side above and Bv = Tv - v.
v(s) is the value function of the MDP
23
Some Theoretical Issues
  • When does an optimal policy with nice structure
    exist?
  • Markov or Stationary Policy
  • (s,S) or Control Limit Policy
  • When do computational algorithms converge, and
    how fast?
  • What properties do solutions of the optimality
    equation have?

24
Computing Optimal Policies
  • Why?
  • Implementation
  • Gold Standard for Heuristics
  • Basic Principle - Transform multi-period problem
    into a sequence of one-period problems.
  • Why is computation difficult in practice?
  • Curse of Dimensionality

25
Computational Methods
  • Finite Horizon Models
  • Backward Induction (Dynamic Programming)
  • Infinite Horizon Models
  • Value Iteration
  • Policy Iteration
  • Modified Policy Iteration
  • Linear Programming
  • Neuro-Dynamic Programming/Reinforcement learning
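Below is a minimal sketch of backward induction for the finite horizon case (Python, reusing the hypothetical dict-based S, A, r, p encoding sketched after the model ingredients; stationary rewards and transitions are assumed for brevity).

  # Backward induction (dynamic programming) for a finite-horizon MDP (sketch).
  # Uses the dict-based S, A, r, p encoding sketched earlier; N is the horizon.
  def backward_induction(S, A, r, p, N, terminal_value=None):
      v = {s: (terminal_value or {}).get(s, 0.0) for s in S}   # v_N(s)
      policy = []                                              # decision rules d_0, ..., d_{N-1}
      for _ in range(N):                                       # stationary data: repeat N times
          d, v_next = {}, {}
          for s in S:
              # one-period problem: immediate reward plus expected future value
              best_a, best_val = max(
                  ((a, r[(s, a)] + sum(p[(s, a)].get(j, 0.0) * v[j] for j in S))
                   for a in A[s]),
                  key=lambda pair: pair[1])
              d[s], v_next[s] = best_a, best_val
          policy.insert(0, d)
          v = v_next
      return v, policy   # value of the initial epoch and the optimal decision rules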

26
Infinite Horizon Computation
  • Iterative algorithms work as follows
  • Approximate the value function by v(s)
  • Select a new decision rule that attains the
    maximum in the Bellman equation for the current
    v(s)
  • Re-approximate the value function
  • Approximation methods
  • exact - policy iteration
  • iterative - value iteration and modified policy
    iteration
  • simulation based - reinforcement learning


27
Applications (A to N)
  • Airline Meal Planning
  • Behaviourial Ecology
  • Capacity Expansion
  • Decision Analysis
  • Equipment Replacement
  • Fisheries Management
  • Gambling Systems
  • Highway Pavement Repair
  • Inventory Control
  • Job Seeking Strategies
  • Knapsack Problems
  • Learning
  • Medical Treatment
  • Network Control

28
Applications (O to Z)
  • Option Pricing
  • Project Selection
  • Queueing System Control
  • Robotic Motion
  • Scheduling
  • Tetris
  • User Modeling
  • Vision (Computer)
  • Water Resources
  • X-Ray Dosage
  • Yield Management
  • Zebra Hunting

29
Coffee, Tea or ? A Markov Decision Process
Model for Airline Meal Provisioning J. Goto,
M.E. Lewis and MLP
  • Decision Epochs T = {1, ..., 5}
  • 0 - Departure time
  • 1-3: 1, 2 and 3 Hours Pre-Departure
  • 4: 6 Hours Pre-Departure
  • 5: 36 Hours Pre-Departure
  • States (l,q): 0 ≤ l ≤ Booking Limit, 0 ≤ q ≤
    Capacity
  • Actions (Meal quantity after delivery)
  • At,(l,q) = {0, 1, ..., Plane Capacity}, t = 3,4,5
  • At,(l,q) = {q - van capacity, ..., q + van
    capacity}, t = 1,2

30
Markov Decision Process Formulation
  • Costs (depending on t)
  • rt((l,q),a) = meal cost + return penalty + late
    delivery charge + shortage cost

Transition Probabilities
pt((l',q') | (l,q), a) = pt(l'|l)  if a = q'
                       = 0         if a ≠ q'
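A sketch of how this cost and transition structure could be encoded (Python; the load-evolution table load_trans and the individual cost-component functions are hypothetical placeholders, not the values used in the paper).

  # Transition kernel and cost for the meal-planning MDP (sketch).
  # State is (l, q): l = reservations (load), q = meals currently allocated.
  # Action a = meal quantity after the delivery decision; the load evolves randomly.
  def transition_prob(t, state, a, next_state, load_trans):
      l, q = state
      l_next, q_next = next_state
      if q_next != a:                              # meals after delivery are set by the action
          return 0.0
      return load_trans[t][l].get(l_next, 0.0)     # pt(l'|l): random load evolution

  def cost(t, state, a, meal_cost, return_penalty, late_charge, expected_shortage):
      # rt((l,q),a) = meal cost + return penalty + late delivery charge + shortage cost
      return (meal_cost(t, state, a) + return_penalty(t, state, a)
              + late_charge(t, state, a) + expected_shortage(t, state, a))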
31
An Optimal Decision Rule
[Figure: the optimal decision rule at decision epoch 1 - meal quantity as
a function of passenger load, with last-minute adjustments by van close
to departure.]
32
Empirical Performance
33
Overage versus Shortage
  • Evaluate the model over a range of terminal costs
  • Observe the relationship of average overage and
    proportion of flights short-catered
  • 55 flight number / aircraft capacity combinations
    (evaluated separately)

34
Overage versus Shortage
  • Performance of optimal policies

35
Information Acquisition

36
Information Acquisition and Optimization
  • Objective: Investigate the tradeoff between
    acquiring information and optimal policy choice
  • Examples
  • Harpaz, Lee and Winkler (1982) study output
    decisions of a competitive firm in a market with
    random demand in which the demand distribution is
    unknown.
  • Braden and Oren (1994) study dynamic pricing
    decisions of a firm in a market with unknown
    consumer demand curves.
  • Lariviere and Porteus (1999) and Ding, Puterman
    and Bisi (2002) study order decisions of a
    censored newsvendor with unobservable lost sales
    and unknown demand distributions.
  • Key result - it is optimal to experiment

37
Bayesian Newsvendor Model
  • Newsvendor cost structure (c - cost, h - salvage
    value, p - penalty cost)
  • Demand assumptions
  • positive continuous
  • i.i.d. sample from f(x|θ) with unknown θ
  • prior on θ is π1(θ)
  • Assume first that demand is observable
  • Demand = Sales + observed lost sales

38
  • Time Line of Events

[Timeline: at epoch 1, set order quantity y1 and then observe demand x1;
at epoch 2, set y2 and then observe x2; and so on.]
39
Demand Updating
  • After observing demand xn, the belief about θ is
    updated by Bayes' rule:
    πn+1(θ) = f(xn|θ) πn(θ) / ∫ f(xn|θ') πn(θ') dθ'
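A minimal numerical sketch of this update (Python, discretizing θ on a grid; the exponential demand family in the example is an arbitrary illustrative choice, not the one from the slides).

  import numpy as np

  # Bayesian demand-parameter updating on a discretized grid (sketch).
  # theta_grid: candidate parameter values; prior: pi_n evaluated on the grid.
  def update_posterior(theta_grid, prior, x_obs, density):
      likelihood = density(x_obs, theta_grid)      # f(x_n | theta) at each grid point
      posterior = likelihood * prior
      return posterior / posterior.sum()           # normalize to obtain pi_{n+1}

  # Example with an (illustrative) exponential demand model f(x|theta) = theta exp(-theta x)
  theta_grid = np.linspace(0.01, 2.0, 200)
  prior = np.ones_like(theta_grid) / len(theta_grid)    # flat prior pi_1
  posterior = update_posterior(theta_grid, prior, x_obs=3.2,
                               density=lambda x, th: th * np.exp(-th * x))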
40
Bayesian Newsvendor Model
  • Bayesian MDP Formulation
  • At decision epoch n (n = 1, 2, ..., N)
  • States
  • all probability distributions on the
    unknown parameter
  • Actions
  • Costs
  • Transition Prob

41
Bayesian Newsvendor Model
  • The Optimality Equations
  • for n = 1, ..., N with the boundary condition
  • Key Observation
  • The transition probabilities are independent
    of the actions. So the problem can be reduced to
    a sequence of single-period problems.

42
  • The Bayesian Newsvendor Policy
  • The BMDP reduces to a sequence of single-period,
    two-step problems.
  • Demand distribution parameter updating
  • Cost minimization
  • where Mn is the CDF of the predictive demand
    density mn
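A sketch of the cost-minimization step (Python). The critical ratio (p - c)/(p - h) is the textbook newsvendor fractile and is an assumption here; the exact ratio should be taken from the cost formulation above.

  import numpy as np

  # Single-period cost minimization of the Bayesian Newsvendor policy (sketch).
  # Orders the critical-fractile quantile of the predictive demand distribution M_n.
  def bayesian_newsvendor_order(x_grid, predictive_density, c, h, p):
      ratio = (p - c) / (p - h)                    # assumed critical ratio
      cdf = np.cumsum(predictive_density)
      cdf = cdf / cdf[-1]                          # M_n on the demand grid
      idx = min(np.searchsorted(cdf, ratio), len(x_grid) - 1)
      return x_grid[idx]                           # y_n = M_n^{-1}(ratio)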

43
Bayesian Newsvendor with Unobservable Lost Sales
  • Model Set-up
  • Same as fully observable case but unmet demand is
    lost and unobservable
  • Question
  • Is the Bayesian Newsvendor policy optimal?

44
Bayesian Newsvendor with Unobservable Lost Sales
  • Demand is censored by the order quantity yn.
  • xn < yn: demand exactly observed
  • xn ≥ yn: demand censored (only sales yn observed)
  • Demand updating is different in this case

45
Bayesian Newsvendor with Unobservable Lost Sales
  • Set yn-1, then observe one of the following:
  • xn-1 = 0, with probability mn-1(0)
  • xn-1 = 1, with probability mn-1(1)
  • ...
  • xn-1 ≥ yn-1 (censored), with probability
    1 - Mn-1(yn-1)
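A sketch of the censored updating step (Python, continuing the grid-based update from the Demand Updating slide; using the survival probability for censored observations is the standard censored-data Bayes update and is assumed to match the model here).

  # Posterior updating with censored demand observations (sketch).
  # If demand was fully observed, use the density f(x|theta); if sales hit the
  # order quantity y (censoring), use the survival probability P(X >= y | theta).
  def update_censored(theta_grid, prior, obs, censored, density, cdf):
      if censored:
          likelihood = 1.0 - cdf(obs, theta_grid)   # obs = y_{n-1}, demand >= y_{n-1}
      else:
          likelihood = density(obs, theta_grid)     # obs = exact demand x_{n-1}
      posterior = likelihood * prior
      return posterior / posterior.sum()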
46
Bayesian Newsvendor with Unobservable Lost Sales
  • Model Formulation
  • States, Actions, Costs As above
  • Transition Probabilities

The Optimality Equations
47
The Key Result
Bayesian Newsvendor with Unobservable Lost Sales
  • The optimal order quantities are at least the
    Bayesian Newsvendor quantities if f(x|θ) is
    increasing in θ in the likelihood ratio order.
  • In this model, decisions in separate periods are
    interrelated through the optimality equation.
  • This means that it is optimal to trade off
    short-term optimality for learning.
  • Question: What is an upper bound on yn?

48
Bayesian Newsvendor with Unobservable Lost Sales
Solving the optimality equations gives, for period N, the single-period
(Bayesian Newsvendor) solution; for n = 1, ..., N-1, the solution has the
same form but with a policy-dependent penalty cost p(yn) in place of p.
The proof of the key result is based on showing p(yn) > p for n < N.
49
Bayesian Newsvendor with Unobservable Lost Sales
  • Some comments
  • The extra penalty can be interpreted as the
    marginal expected value of information at
    decision epoch n.
  • Numerical results show small improvements when
    using the optimal policy as opposed to the
    Bayesian Newsvendor policy.
  • We have extended this to a two level supply chain

50
Partially Observed MDPs
  • In POMDPs, the system state is not observable.
  • The decision maker receives a signal y which is
    related to the system state by q(y|s,a).
  • Analysis is based on using Bayes' Theorem to
    estimate the distribution of the system state
    given the signal (as sketched below)
  • Similar to Bayesian MDPs described above
  • the posterior state distribution is a sufficient
    statistic for decision making
  • State space is a continuum
  • Early work by Smallwood and Sondik (1972)
  • Applications
  • Medical diagnosis and treatment
  • Equipment repair
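A minimal sketch of this belief update (Python; following the slide's notation q(y|s,a), with the signal assumed to depend on the post-transition state).

  # POMDP belief update (sketch): given belief b over states, action a taken,
  # and signal y received, compute the posterior belief b' by Bayes' theorem:
  #   b'(s') is proportional to q(y | s', a) * sum_s p(s' | s, a) * b(s)
  def belief_update(b, a, y, p, q, states):
      predicted = {s2: sum(p[(s, a)].get(s2, 0.0) * b[s] for s in states)
                   for s2 in states}
      unnorm = {s2: q[(s2, a)].get(y, 0.0) * predicted[s2] for s2 in states}
      total = sum(unnorm.values())
      return {s2: unnorm[s2] / total for s2 in states}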

51
Reinforcement Learning and Neuro-Dynamic
Programming

52
Neuro-Dynamic Programming or Reinforcement
Learning
  • A different way to think about dynamic
    programming
  • Basis in artificial intelligence research
  • Mimics learning behavior of animals
  • Developed to
  • Solve problems with high dimensional state spaces
    and/or
  • Solve problems in which the system is described
    by a simulator (as opposed to a mathematical
    model)
  • NDP/RL refers to a collection of methods
    combining Monte Carlo methods with MDP concepts

53
Reinforcement Learning
  • Mimics learning by experimenting in real life
  • learn by interacting with the environment
  • goal is long term
  • uncertainty may be present (task must be repeated
    many times to obtain its value)
  • Trades off between exploration and exploitation
  • Key focus is estimating value function ( v(s) or
    Q(s,a) )
  • Start with guess of value function
  • Carry out task and observe immediate outcome
    (reward and transition)
  • Update value function

54
Reinforcement Learning - Example
  • Playing Tic-Tac-Toe (Sutton and Barto, 2000)
  • You know the rules of the game but not the
    opponent's strategy (assumed fixed over time but
    with a random component)
  • Approach
  • List possible system states
  • Start with initial guess of probability of
    winning in each state
  • Observe current state (s) and choose action that
    will move you to state with highest probability
    of winning
  • Observe state (s') after opponent plays
  • Revise value in state s based on value in s'.
  • Player might not always choose best action but
    try other actions to learn about different
    states.

55
Reinforcement Learning - Example
  • Observations about Tic-Tac-Toe Example
  • The state space has 3^9 possible configurations
  • Writing down a mathematical model for the game is
    challenging; simulating it is easy.
  • Goal is to maximize the probability of winning;
    there is no immediate reward
  • Possible updating mechanism using observations
  • vnew(s) = vold(s) + α ( vold(s') - vold(s) )
  • α is a step-size parameter
  • vold(s') - vold(s) is a temporal difference
  • The subsequent state s' depends on the player's
    action.

56
Reinforcement Learning
  • Problems can be classified in two ways

57
Temporal Difference Updating
  • No-model example, discounted case - based on
    Q(s,a)
  • Algorithm (policy specified)
  • Start system in s, choose action a and observe s'
    and a'
  • Update Q(s,a) ← Q(s,a) + α ( r + λ Q(s',a') -
    Q(s,a) )
  • Repeat, replacing (s,a) by (s',a')
  • Algorithm (no policy specified) (Q-Learning)
  • Start system in s, choose action a and observe s'
  • Update Q(s,a) ← Q(s,a) + α ( r + λ max a'∈A
    Q(s',a') - Q(s,a) )
  • Repeat, replacing s by s'
  • Issues include choosing α and stopping criteria
    (see the sketch below)
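A minimal tabular Q-learning sketch (Python; env.step(s, a) returning (reward, next state) is a hypothetical simulator interface, and epsilon-greedy exploration is one common way to trade off exploration and exploitation).

  import random
  from collections import defaultdict

  # Tabular Q-learning (sketch). env.step(s, a) -> (reward, next_state) is an
  # assumed simulator interface; actions(s) lists the actions available in s.
  def q_learning(env, actions, start_state, episodes=1000,
                 alpha=0.1, lam=0.95, epsilon=0.1, horizon=100):
      Q = defaultdict(float)                        # Q(s, a), initialized to 0
      for _ in range(episodes):
          s = start_state
          for _ in range(horizon):
              # epsilon-greedy: mostly exploit, occasionally explore
              if random.random() < epsilon:
                  a = random.choice(actions(s))
              else:
                  a = max(actions(s), key=lambda act: Q[(s, act)])
              r, s_next = env.step(s, a)
              target = r + lam * max(Q[(s_next, act)] for act in actions(s_next))
              Q[(s, a)] += alpha * (target - Q[(s, a)])   # temporal-difference update
              s = s_next
      return Q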

58
RL Function Approximation
  • High-dimensionality addressed by
  • replacing v(s) or Q(s,a) by a linear
    representation Σi wi φi(s,a)
  • and then applying the Q-learning algorithm,
    updating the weights wi at each iteration, or
  • approximating v(s) or Q(s,a) by a neural network
  • Issue: choose basis functions φi(s,a) to
    reflect problem structure (see the sketch below)
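A sketch of the corresponding weight update for the linear representation (Python; the feature function phi(s, a), which returns a fixed-length list of basis-function values, is a hypothetical placeholder to be chosen to reflect problem structure).

  # Q-learning with a linear approximation Q(s,a) = sum_i w_i * phi_i(s,a) (sketch).
  def q_approx(w, phi, s, a):
      return sum(wi * fi for wi, fi in zip(w, phi(s, a)))

  def linear_q_update(w, phi, s, a, r, s_next, actions, alpha=0.05, lam=0.95):
      # temporal difference computed with the current approximation
      target = r + lam * max(q_approx(w, phi, s_next, a2) for a2 in actions(s_next))
      delta = target - q_approx(w, phi, s, a)
      # move each weight in the direction of its feature, scaled by the TD error
      return [wi + alpha * delta * fi for wi, fi in zip(w, phi(s, a))]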

59
RL Applications
  • Backgammon
  • Checker Player
  • Robot Control
  • Elevator Dispatching
  • Dynamic Telecommunications Channel Allocation
  • Job Shop Scheduling
  • Supply Chain Management

60
Neuro-Dynamic Programming / Reinforcement Learning
"It is unclear which algorithms and parameter
settings will work on a particular problem, and
when a method does work, it is still unclear
which ingredients are actually necessary for
success. As a result, applications often require
trial and error in a long process of parameter
tweaking and experimentation." - Van Roy, 2002
61
Conclusions

62
Concluding Comments
  • MDPs provide an elegant formal framework for
    sequential decision making
  • They are widely applicable
  • They can be used to compute optimal policies
  • They can be used as a baseline to evaluate
    heuristics
  • They can be used to determine structural results
    about optimal policies
  • Recent research is addressing The Curse of
    Dimensionality

63
Some References
  • Bertsekas, D.P. and Tsitsiklis, J.N.,
    Neuro-Dynamic Programming, Athena, 1996.
  • Feinberg, E.A. and Shwartz, A., Handbook of Markov
    Decision Processes: Methods and Applications,
    Kluwer, 2002.
  • Puterman, M.L. Markov Decision Processes, Wiley,
    1994.
  • Sutton, R.S. and Barto, A.G. Reinforcement
    Learning, MIT, 2000.