Online Sampling for Markov Decision Processes

- Bob Givan
- Joint work w/ E. K. P. Chong, H. Chang, G. Wu

Electrical and Computer Engineering, Purdue University

Markov Decision Process (MDP)

- Ingredients
- System state x in state space X
- Control action a in A(x)
- Reward R(x,a)
- State-transition probability P(x,y,a)
- Find a control policy to maximize the objective function
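The ingredients above can be made concrete in code. A minimal sketch on a toy two-state chain; the states, actions, and rewards here are illustrative, not from the talk's domains:

```python
import random

STATES = [0, 1]

def actions(x):
    """A(x): actions available in state x."""
    return ["stay", "move"]

def reward(x, a):
    """R(x,a): one-step reward (state 1 pays off if we stay there)."""
    return 1.0 if (x == 1 and a == "stay") else 0.0

def transition_prob(x, y, a):
    """P(x,y,a): probability of moving from x to y under action a."""
    if a == "stay":
        return 1.0 if y == x else 0.0
    return 1.0 if y == 1 - x else 0.0

def step(x, a, rng=random):
    """Sample a next state according to P(x,.,a)."""
    r = rng.random()
    acc = 0.0
    for y in STATES:
        acc += transition_prob(x, y, a)
        if r < acc:
            return y
    return STATES[-1]
```

A control policy maps each state to one of `actions(x)`; the objective function aggregates the rewards along the resulting trajectory.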

Optimal Policies

- Policy: mapping from state and time to actions
- Stationary policy: mapping from state to actions
- Goal: a policy maximizing the objective function
- VH(x0) = max Obj[R(x0,a0), ..., R(xH-1,aH-1)]
- where the max is over all policies u = (u0, ..., uH-1)
- For large H, a0 is independent of H (with an ergodicity assumption)
- Stationary optimal action a0 for H → ∞ via receding-horizon control

Q Values

- Fix a large H and focus on the finite-horizon reward
- Define Q(x,a) = R(x,a) + E[VH-1(y)]
- Utility of action a at state x
- Name: the Q-value of action a at state x
- Key identities (Bellman's equations)
- VH(x) = maxa Q(x,a)
- u0(x) = argmaxa Q(x,a)
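The Bellman identities can be evaluated exactly on a small example. A hedged sketch on an illustrative two-state MDP (not the talk's scheduling domains):

```python
# Exact finite-horizon Q-values and values on a toy two-state MDP:
#   Q(x,a) = R(x,a) + E[V_{h-1}(y)]   and   V_h(x) = max_a Q(x,a)
STATES = [0, 1]
ACTIONS = ["stay", "move"]

def R(x, a):
    return 1.0 if (x == 1 and a == "stay") else 0.0

def P(x, y, a):
    same = 1.0 if y == x else 0.0
    return same if a == "stay" else 1.0 - same

def Q(x, a, h):
    """Q-value of action a at state x with h steps to go."""
    if h == 0:
        return 0.0
    return R(x, a) + sum(P(x, y, a) * V(y, h - 1) for y in STATES)

def V(x, h):
    """V_h(x) = max_a Q(x,a) with h steps to go."""
    if h == 0:
        return 0.0
    return max(Q(x, a, h) for a in ACTIONS)

def optimal_action(x, h):
    """u0(x) = argmax_a Q(x,a)."""
    return max(ACTIONS, key=lambda a: Q(x, a, h))
```

With two steps to go from state 0, the recursion correctly prefers "move" (reaching the rewarding state) over "stay".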

Solution Methods

- Recall
- u0(x) = argmaxa Q(x,a)
- Q(x,a) = R(x,a) + E[VH-1(y)]
- Problem: the Q-value depends on the optimal policy
- The state space is extremely large (often continuous)
- Two-pronged solution approach
- Apply a receding-horizon method
- Estimate Q-values via simulation/sampling

Methods for Q-value Estimation

- Previous work by other authors
- Unbiased sampling (exact Q-value): Kearns et al., IJCAI-99
- Policy rollout (lower bound): Bertsekas and Castanon, 1999
- Our techniques
- Hindsight optimization (upper bound)
- Parallel rollout (lower bound)

Expectimax Tree for V

Unbiased Sampling

Unbiased Sampling (Contd)

- For a given desired accuracy, how large should the sampling width and depth be?
- Answered by Kearns, Mansour, and Ng (1999)
- Requires prohibitive sampling width and depth
- e.g., C ≈ 10^8, Hs > 60 to distinguish the best and worst policies in our scheduling domain
- We evaluate with smaller width and depth

How to Look Deeper?

Policy Rollout

Policy Rollout in Equations

- Write VH^u(y) for the value of following policy u
- Recall Q(x,a) = R(x,a) + E[VH-1(y)]
- = R(x,a) + E[maxu VH-1^u(y)]
- Given a base policy u, use
- R(x,a) + E[VH-1^u(y)]
- as a lower-bound estimate of the Q-value
- The resulting policy is PI(u), given infinite sampling
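A hedged sketch of the rollout estimate: take action a once, then follow the fixed base policy u to the horizon, and average sampled returns. The two-state MDP and the base policy below are illustrative, not the talk's domains:

```python
import random

def R(x, a):
    return 1.0 if (x == 1 and a == "stay") else 0.0

def step(x, a, rng):
    return x if a == "stay" else 1 - x   # deterministic for brevity

def base_policy(x):
    return "stay"                        # the base policy u

def rollout_Q(x, a, horizon, n_samples, rng):
    """Monte-Carlo estimate of R(x,a) + E[V^u_{H-1}(y)]."""
    total = 0.0
    for _ in range(n_samples):
        ret, y = R(x, a), step(x, a, rng)
        for _ in range(horizon - 1):
            b = base_policy(y)
            ret += R(y, b)
            y = step(y, b, rng)
        total += ret
    return total / n_samples

rng = random.Random(0)
```

Acting greedily with respect to these estimates at every state implements one policy-improvement step, PI(u).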

Policy Rollout (Contd)

Parallel Policy Rollout

- Generalization of policy rollout, due to Chang, Givan, and Chong, 2000
- Given a set U of base policies, use
- R(x,a) + E[maxu∈U VH-1^u(y)]
- as an estimate of the Q-value
- More accurate estimate than policy rollout
- Still gives a lower bound to the true Q-value
- Still gives a policy no worse than any in U
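The max over U sits inside the expectation: for each sampled next state y, simulate every base policy from y and keep the best return. A hedged sketch on an illustrative two-state MDP:

```python
import random

def R(x, a):
    return 1.0 if (x == 1 and a == "stay") else 0.0

def step(x, a, rng):
    return x if a == "stay" else 1 - x   # deterministic for brevity

def policy_return(x, policy, horizon, rng):
    """Sampled return of following `policy` from x for `horizon` steps."""
    ret = 0.0
    for _ in range(horizon):
        a = policy(x)
        ret += R(x, a)
        x = step(x, a, rng)
    return ret

def parallel_rollout_Q(x, a, horizon, base_policies, n_samples, rng):
    """Estimate R(x,a) + E[max over u in U of V^u_{H-1}(y)]."""
    total = 0.0
    for _ in range(n_samples):
        y = step(x, a, rng)
        total += max(policy_return(y, u, horizon - 1, rng)
                     for u in base_policies)
    return R(x, a) + total / n_samples

U = [lambda x: "stay", lambda x: "move"]   # the base-policy set U
rng = random.Random(0)
```

Because the max is taken per sampled future rather than once overall, the estimate dominates every single-policy rollout while remaining a lower bound on the true Q-value.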

Hindsight Optimization Tree View

Hindsight Optimization Equations

- Swap max and expectation in the expectimax tree
- Solve each offline optimization problem
- O(kC f(H)) time
- where f(H) is the offline problem complexity
- Jensen's inequality implies an upper bound
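Swapping max and expectation means: sample C complete random futures, solve each resulting deterministic offline problem to optimality, and average the optimal values. A hedged sketch, with brute-force offline solving and an illustrative coin-flip noise model:

```python
import itertools
import random

ACTIONS = ["stay", "move"]

def R(x, a):
    return 1.0 if (x == 1 and a == "stay") else 0.0

def next_state(x, a, noise):
    """Deterministic transition once the random `noise` bit is fixed."""
    y = x if a == "stay" else 1 - x
    return 1 - y if noise else y     # noise flips the outcome

def offline_value(x, noise_seq):
    """Best achievable return once the whole future noise is known."""
    best = 0.0
    H = len(noise_seq)
    for plan in itertools.product(ACTIONS, repeat=H):  # brute force
        ret, y = 0.0, x
        for a, w in zip(plan, noise_seq):
            ret += R(y, a)
            y = next_state(y, a, w)
        best = max(best, ret)
    return best

def hindsight_value(x, horizon, n_futures, rng):
    """Average of per-future optimal values: upper-bounds V_H(x)."""
    futures = [[rng.random() < 0.1 for _ in range(horizon)]
               for _ in range(n_futures)]
    return sum(offline_value(x, w) for w in futures) / n_futures

rng = random.Random(0)
```

Since max is convex, Jensen's inequality gives E[max(...)] ≥ max(E[...]), so the hindsight value upper-bounds the true value; the offline solver here is brute force, standing in for the domain-specific f(H) algorithm.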

Hindsight Optimization (Contd)

Application to Example Problems

- Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to
- Multi-class deadline scheduling
- Random early dropping
- Congestion control

Basic Approach

- Traffic model provides a stochastic description of possible future outcomes
- Method
- Formulate network decision problems as POMDPs by incorporating the traffic model
- Solve the belief-state MDP online using sampling (choose the time scale to allow for computation time)

Domain 1: Deadline Scheduling

Objective: minimize weighted loss

Domain 2: Random Early Dropping

Objective: minimize delay without sacrificing throughput

Domain 3: Congestion Control

Traffic Modeling

- A Hidden Markov Model (HMM) for each source
- Note: the state is hidden, so the model is only partially observed

Deadline Scheduling Results

- Non-sampling policies
- EDF: earliest deadline first
- Deadline sensitive, class insensitive
- SP: static priority
- Deadline insensitive, class sensitive
- CM: current minloss (Givan et al., 2000)
- Deadline and class sensitive
- Minimizes weighted loss for the current packets

Deadline Scheduling Results

- Objective: minimize weighted loss
- Comparison
- Non-sampling policies
- Unbiased sampling (Kearns et al.)
- Hindsight optimization
- Rollout with CM as base policy
- Parallel rollout
- Results due to H. S. Chang

Deadline Scheduling Results

Random Early Dropping Results

- Objective: minimize delay subject to a throughput loss-tolerance
- Comparison
- Candidate policies: RED and buffer-k
- KMN sampling
- Rollout of buffer-k
- Parallel rollout
- Hindsight optimization
- Results due to H. S. Chang

Random Early Dropping Results

Congestion Control Results

- MDP objective: minimize weighted sum of throughput, delay, and loss-rate
- Fairness is hard-wired
- Comparisons
- PD-k (proportional-derivative with target queue length k)
- Hindsight optimization
- Rollout of PD-k, parallel rollout
- Results due to G. Wu, in progress

Congestion Control Results

Results Summary

- Unbiased sampling cannot cope
- Parallel rollout wins in 2 domains
- Not always equal to simple rollout of one base policy
- Hindsight optimization wins in 1 domain
- Simple policy rollout is the cheapest method
- Poor in domain 1
- Strong in domain 2 with the best base policy, but how to find this policy?
- So-so in domain 3 with any base policy

Talk Summary

- Case study of MDP sampling methods
- New methods offering practical improvements
- Parallel policy rollout
- Hindsight optimization
- Systematic methods for using traffic models to help make network control decisions
- Feasibility of real-time implementation depends on the problem timescale

Ongoing Research

- Apply to other control problems (different timescales)
- Admission/access control
- QoS routing
- Link bandwidth allotment
- Multiclass connection management
- Problems arising in proxy-services
- Diagnosis and recovery

Ongoing Research (Contd)

- Alternative traffic models
- Multi-timescale models
- Long-range dependent models
- Closed-loop traffic
- Fluid models
- Learning traffic model online
- Adaptation to changing traffic conditions


Receding-horizon Control

- For a large horizon H, the policy is stationary
- At each time, if the state is x, then apply action
- u(x) = argmaxa Q(x,a)
- = argmaxa [R(x,a) + E[VH-1(y)]]
- Compute an estimate of the Q-value at each time
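The receding-horizon loop can be sketched as follows: re-solve for the greedy action over a finite horizon H at every step, apply it, and repeat. Here the Q-values are computed exactly on an illustrative two-state MDP; in the talk's domains they would be estimated by sampling instead:

```python
STATES = [0, 1]
ACTIONS = ["stay", "move"]

def R(x, a):
    return 1.0 if (x == 1 and a == "stay") else 0.0

def P(x, y, a):
    same = 1.0 if y == x else 0.0
    return same if a == "stay" else 1.0 - same

def Q(x, a, h):
    if h == 0:
        return 0.0
    return R(x, a) + sum(P(x, y, a) * V(y, h - 1) for y in STATES)

def V(x, h):
    return 0.0 if h == 0 else max(Q(x, a, h) for a in ACTIONS)

def receding_horizon_action(x, H):
    """u(x) = argmax_a [R(x,a) + E[V_{H-1}(y)]], recomputed each step."""
    return max(ACTIONS, key=lambda a: Q(x, a, H))

# Closed-loop simulation: the controller re-plans at every time step.
x, total = 0, 0.0
for _ in range(3):
    a = receding_horizon_action(x, H=4)
    total += R(x, a)
    x = x if a == "stay" else 1 - x
```

Only the first action of each H-step plan is ever executed; the next step re-plans from the newly observed state.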


Domain 3: Congestion Control

(Diagram: high-priority and best-effort traffic share a bottleneck node)

- Resources: bandwidth and buffer
- Objective: optimize throughput, delay, loss, and fairness
- High-priority traffic
- Open-loop controlled
- Low-priority traffic
- Closed-loop controlled
