1
Online Sampling for Markov Decision Processes
  • Bob Givan
  • Joint work w/ E. K. P. Chong, H. Chang, G. Wu

Electrical and Computer Engineering Purdue
University
2
Markov Decision Process (MDP)
  • Ingredients
  • System state x in state space X
  • Control action a in A(x)
  • Reward R(x,a)
  • State-transition probability P(x,y,a)
  • Find a control policy that maximizes the objective function
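The ingredients above can be sketched directly in code. A minimal sketch using a hypothetical two-state chain; all state, action, and reward names here are illustrative, not from the talk:

```python
import random

# Toy two-state chain (illustrative, not from the talk).
X = ["good", "bad"]                                  # state space X
A = {"good": ["stay", "risk"], "bad": ["repair"]}    # admissible actions A(x)

R = {("good", "stay"): 1.0,                          # reward R(x, a)
     ("good", "risk"): 2.0,
     ("bad", "repair"): 0.0}

P = {("good", "stay"): {"good": 0.9, "bad": 0.1},    # P(x, y, a) as {y: prob}
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def step(x, a, rng=random):
    """Sample a next state y ~ P(x, ., a)."""
    ys = list(P[(x, a)])
    return rng.choices(ys, weights=[P[(x, a)][y] for y in ys])[0]

y = step("good", "risk")
```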

3
Optimal Policies
  • Policy: a mapping from state and time to actions
  • Stationary policy: a mapping from state to actions
  • Goal: a policy maximizing the objective function
  • VH(x0) = max Obj[ R(x0,a0), ..., R(xH-1,aH-1) ]
  • where the max is over all policies u = (u0, ..., uH-1)
  • For large H, a0 is independent of H (with an ergodicity assumption)
  • Stationary optimal action a0 for H → ∞ via receding-horizon control

4
Q Values
  • Fix a large H, focus on finite-horizon reward
  • Define Q(x,a) = R(x,a) + E[ VH-1(y) ]
  • Utility of action a at state x.
  • Name: Q-value of action a at state x.
  • Key identities (Bellman's equations):
  • VH(x) = maxa Q(x,a)
  • u0(x) = argmaxa Q(x,a)
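The two identities can be checked by direct recursion. A sketch on a hypothetical two-state chain (names are illustrative):

```python
# Finite-horizon Bellman recursion on a toy chain:
#   Q_H(x,a) = R(x,a) + E[ V_{H-1}(y) ],   V_H(x) = max_a Q_H(x,a).
A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def V(x, H):
    """V_H(x) = max_a Q_H(x, a), with V_0 = 0."""
    return 0.0 if H == 0 else max(Q(x, a, H) for a in A[x])

def Q(x, a, H):
    """Q_H(x, a) = R(x, a) + sum_y P(x, y, a) * V_{H-1}(y)."""
    return R[(x, a)] + sum(p * V(y, H - 1) for y, p in P[(x, a)].items())

def greedy(x, H):
    """u_0(x) = argmax_a Q(x, a)."""
    return max(A[x], key=lambda a: Q(x, a, H))
```

For horizon 2 on this chain, Q(good, stay) = 1 + 0.9·V1(good) = 2.8 while Q(good, risk) = 2 + 0.5·V1(good) = 3.0, so the greedy action at "good" is "risk".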

5
Solution Methods
  • Recall:
  • u0(x) = argmaxa Q(x,a)
  • Q(x,a) = R(x,a) + E[ VH-1(y) ]
  • Problem: the Q-value depends on the optimal policy.
  • The state space is extremely large (often continuous)
  • Two-pronged solution approach:
  • Apply a receding-horizon method
  • Estimate Q-values via simulation/sampling

6
Methods for Q-value Estimation
  • Previous work by other authors:
  • Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
  • Policy rollout (lower bound) [Bertsekas & Castanon, 1999]
  • Our techniques:
  • Hindsight optimization (upper bound)
  • Parallel rollout (lower bound)

7
Expectimax Tree for V
8
Unbiased Sampling
9
Unbiased Sampling (Cont'd)
  • For a given desired accuracy, how large should the sampling width and depth be?
  • Answered by Kearns, Mansour, and Ng (1999)
  • Requires prohibitive sampling width and depth
  • e.g. C ≈ 10^8, Hs > 60 to distinguish the best and worst policies in our scheduling domain
  • We evaluate with smaller width and depth
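The Kearns-Mansour-Ng estimator expands a sampled expectimax tree of width C and depth Hs. A sketch on a hypothetical two-state chain, with a far smaller width and depth than the prohibitive values above:

```python
import random

# Toy chain (illustrative names, not from the talk).
A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def step(x, a, rng):
    ys = list(P[(x, a)])
    return rng.choices(ys, weights=[P[(x, a)][y] for y in ys])[0]

def V_hat(x, depth, width, rng):
    """Sampled expectimax value: max over actions of the sampled Q estimate."""
    return 0.0 if depth == 0 else max(Q_hat(x, a, depth, width, rng) for a in A[x])

def Q_hat(x, a, depth, width, rng):
    """Average V_hat over `width` sampled next states; cost grows exponentially in depth."""
    samples = [V_hat(step(x, a, rng), depth - 1, width, rng) for _ in range(width)]
    return R[(x, a)] + sum(samples) / width

rng = random.Random(0)
q = Q_hat("good", "risk", depth=3, width=8, rng=rng)
```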

10
How to Look Deeper?
11
Policy Rollout
12
Policy Rollout in Equations
  • Write VHu(y) for the value of following policy u
  • Recall Q(x,a) = R(x,a) + E[ VH-1(y) ]
  •               = R(x,a) + E[ maxu VH-1u(y) ]
  • Given a base policy u, use
  • R(x,a) + E[ VH-1u(y) ]
  • as a lower-bound estimate of the Q-value.
  • The resulting policy is PI(u), given infinite sampling
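A Monte-Carlo sketch of this lower-bound estimate on a hypothetical two-state chain: simulate the base policy u from each sampled next state and average (the chain and the "cautious" base policy are illustrative):

```python
import random

A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def step(x, a, rng):
    ys = list(P[(x, a)])
    return rng.choices(ys, weights=[P[(x, a)][y] for y in ys])[0]

def simulate(x, u, H, rng):
    """Total H-step reward of following base policy u from x: estimates V_H^u(x)."""
    total = 0.0
    for _ in range(H):
        a = u(x)
        total += R[(x, a)]
        x = step(x, a, rng)
    return total

def q_rollout(x, a, u, H, n, rng):
    """Estimate R(x,a) + E[ V_{H-1}^u(y) ] from n simulated trajectories."""
    return R[(x, a)] + sum(simulate(step(x, a, rng), u, H - 1, rng)
                           for _ in range(n)) / n

cautious = lambda x: "stay" if x == "good" else "repair"   # a base policy u
rng = random.Random(1)
q = q_rollout("good", "risk", cautious, H=5, n=200, rng=rng)
```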

13
Policy Rollout (Cont'd)
14
Parallel Policy Rollout
  • Generalization of policy rollout, due to Chang, Givan, and Chong, 2000
  • Given a set U of base policies, use
  • R(x,a) + E[ maxu∈U VH-1u(y) ]
  • as an estimate of the Q-value
  • More accurate estimate than policy rollout
  • Still gives a lower bound on the true Q-value
  • Still gives a policy no worse than any in U
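A sketch of the generalization: for each sampled next state, keep the best simulated return over the base-policy set U (the toy chain and the two base policies are hypothetical):

```python
import random

A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def step(x, a, rng):
    ys = list(P[(x, a)])
    return rng.choices(ys, weights=[P[(x, a)][y] for y in ys])[0]

def simulate(x, u, H, rng):
    """Total H-step reward of following base policy u from x."""
    total = 0.0
    for _ in range(H):
        a = u(x)
        total += R[(x, a)]
        x = step(x, a, rng)
    return total

def q_parallel_rollout(x, a, U, H, n, rng):
    """Estimate R(x,a) + E[ max_{u in U} V_{H-1}^u(y) ]: per sampled next
    state y, take the max of the base policies' simulated returns."""
    total = 0.0
    for _ in range(n):
        y = step(x, a, rng)
        total += max(simulate(y, u, H - 1, rng) for u in U)
    return R[(x, a)] + total / n

U = [lambda x: A[x][0], lambda x: A[x][-1]]   # two hypothetical base policies
rng = random.Random(2)
q = q_parallel_rollout("good", "risk", U, H=5, n=100, rng=rng)
```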

15
Hindsight Optimization Tree View
16
Hindsight Optimization Equations
  • Swap Max and Exp in the expectimax tree.
  • Solve each offline optimization problem in
  • O(k C f(H)) time,
  • where f(H) is the offline problem complexity
  • Jensen's inequality implies the estimates are upper bounds
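A sketch of the swapped-order estimate on a hypothetical two-state chain: fix the randomness (one uniform draw per step), solve the now-deterministic problem exactly in hindsight, and average over scenarios. By Jensen's inequality, the average of per-scenario maxima upper-bounds the max of averages. The brute-force hindsight solver stands in for the offline problem of complexity f(H):

```python
import random

A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def next_state(x, a, r):
    """Deterministic transition once the uniform draw r in [0,1) is fixed."""
    acc = 0.0
    for y, p in P[(x, a)].items():
        acc += p
        if r < acc:
            return y
    return y

def hindsight_value(x, draws):
    """Exact offline optimum for one fixed scenario (brute force over action sequences)."""
    if not draws:
        return 0.0
    return max(R[(x, a)] + hindsight_value(next_state(x, a, draws[0]), draws[1:])
               for a in A[x])

def q_hindsight(x, a, H, n, rng):
    """Upper-bound Q estimate: average the per-scenario hindsight optima."""
    total = 0.0
    for _ in range(n):
        draws = [rng.random() for _ in range(H)]
        total += R[(x, a)] + hindsight_value(next_state(x, a, draws[0]), draws[1:])
    return total / n

q = q_hindsight("good", "stay", H=4, n=50, rng=random.Random(3))
```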

17
Hindsight Optimization (Cont'd)
18
Application to Example Problems
  • Apply unbiased sampling, policy rollout, parallel
    rollout, and hindsight optimization to
  • Multi-class deadline scheduling
  • Random early dropping
  • Congestion control

19
Basic Approach
  • The traffic model provides a stochastic description of possible future outcomes
  • Method:
  • Formulate network decision problems as POMDPs by incorporating the traffic model
  • Solve the belief-state MDP online using sampling (choose the time-scale to allow for computation time)

20
Domain 1: Deadline Scheduling
Objective: Minimize weighted loss
21
Domain 2: Random Early Dropping
Objective: Minimize delay without sacrificing throughput
22
Domain 3: Congestion Control
23
Traffic Modeling
  • A Hidden Markov Model (HMM) for each source
  • Note: the state is hidden, so the model is partially observed

24
Deadline Scheduling Results
  • Non-sampling policies:
  • EDF: earliest deadline first.
  • Deadline-sensitive, class-insensitive.
  • SP: static priority.
  • Deadline-insensitive, class-sensitive.
  • CM: current minloss [Givan et al., 2000]
  • Deadline- and class-sensitive.
  • Minimizes weighted loss for the current packets.
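The two simple comparison policies can be sketched as selection rules over queued packets. The packet representation (class weight, deadline) is hypothetical; CM's weighted-loss minimization is more involved and is omitted here:

```python
# Hypothetical packet tuples: (class_weight, deadline). Not the talk's data structures.
packets = [(1.0, 5), (3.0, 9), (1.0, 2)]

def edf(pkts):
    """Earliest deadline first: deadline-sensitive, class-insensitive."""
    return min(pkts, key=lambda p: p[1])

def sp(pkts):
    """Static priority: class-sensitive, deadline-insensitive."""
    return max(pkts, key=lambda p: p[0])
```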

25
Deadline Scheduling Results
  • Objective: minimize weighted loss
  • Comparison:
  • Non-sampling policies
  • Unbiased sampling (Kearns et al.)
  • Hindsight optimization
  • Rollout with CM as base policy
  • Parallel rollout
  • Results due to H. S. Chang

26
Deadline Scheduling Results
27
Deadline Scheduling Results
28
Deadline Scheduling Results
29
Random Early Dropping Results
  • Objective: minimize delay subject to a throughput loss-tolerance
  • Comparison:
  • Candidate policies: RED and buffer-k
  • KMN sampling
  • Rollout of buffer-k
  • Parallel rollout
  • Hindsight optimization
  • Results due to H. S. Chang.

30
Random Early Dropping Results
31
Random Early Dropping Results
32
Congestion Control Results
  • MDP objective: minimize a weighted sum of throughput, delay, and loss-rate
  • Fairness is hard-wired
  • Comparisons:
  • PD-k (proportional-derivative control with target queue length k)
  • Hindsight optimization
  • Rollout of PD-k, parallel rollout
  • Results due to G. Wu, in progress

33
Congestion Control Results
34
Congestion Control Results
35
Congestion Control Results
36
Congestion Control Results
37
Results Summary
  • Unbiased sampling cannot cope
  • Parallel rollout wins in 2 domains
  • Not always equal to simple rollout of one base
    policy
  • Hindsight optimization wins in 1 domain
  • Simple policy rollout is the cheapest method
  • Poor in domain 1
  • Strong in domain 2 with the best base policy, but how do we find this policy?
  • So-so in domain 3 with any base policy

38
Talk Summary
  • Case study of MDP sampling methods
  • New methods offering practical improvements
  • Parallel policy rollout
  • Hindsight optimization
  • Systematic methods for using traffic models to
    help make network control decisions
  • Feasibility of real-time implementation depends
    on problem timescale

39
Ongoing Research
  • Apply to other control problems (different
    timescales)
  • Admission/access control
  • QoS routing
  • Link bandwidth allotment
  • Multiclass connection management
  • Problems arising in proxy-services
  • Diagnosis and recovery

40
Ongoing Research (Cont'd)
  • Alternative traffic models
  • Multi-timescale models
  • Long-range dependent models
  • Closed-loop traffic
  • Fluid models
  • Learning traffic model online
  • Adaptation to changing traffic conditions

41
Congestion Control (Cont'd)
42
Congestion Control Results
43
Hindsight Optimization (Cont'd)
44
Policy Rollout (Cont'd)
[Figure: performance of the rollout policy vs. the base policy]
45
Receding-horizon Control
  • For a large horizon H, the optimal policy is stationary.
  • At each time, if the state is x, then apply action
  • u(x) = argmaxa Q(x,a)
  •      = argmaxa [ R(x,a) + E[ VH-1(y) ] ]
  • Compute an estimate of the Q-value at each time.
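The receding-horizon loop can be sketched as follows, using an exact finite-horizon Q recursion as the per-step estimator on a hypothetical two-state chain (in practice the estimate would come from sampling):

```python
import random

A = {"good": ["stay", "risk"], "bad": ["repair"]}
R = {("good", "stay"): 1.0, ("good", "risk"): 2.0, ("bad", "repair"): 0.0}
P = {("good", "stay"): {"good": 0.9, "bad": 0.1},
     ("good", "risk"): {"good": 0.5, "bad": 0.5},
     ("bad", "repair"): {"good": 0.8, "bad": 0.2}}

def step(x, a, rng):
    ys = list(P[(x, a)])
    return rng.choices(ys, weights=[P[(x, a)][y] for y in ys])[0]

def V(x, H):
    return 0.0 if H == 0 else max(Q(x, a, H) for a in A[x])

def Q(x, a, H):
    return R[(x, a)] + sum(p * V(y, H - 1) for y, p in P[(x, a)].items())

def receding_horizon(x0, H, T, rng):
    """At each step apply u(x) = argmax_a Q(x,a), recomputed at the new state."""
    xs, total = [x0], 0.0
    for _ in range(T):
        x = xs[-1]
        a = max(A[x], key=lambda a: Q(x, a, H))
        total += R[(x, a)]
        xs.append(step(x, a, rng))
    return xs, total

xs, total = receding_horizon("good", H=3, T=10, rng=random.Random(4))
```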

46
Congestion Control (Cont'd)
47
Domain 3: Congestion Control
[Diagram: high-priority and best-effort traffic sharing a bottleneck node]
  • Resources: bandwidth and buffer
  • Objective: optimize throughput, delay, loss, and fairness
  • High-priority traffic:
  • Open-loop controlled
  • Low-priority traffic:
  • Closed-loop controlled

48
Congestion Control Results
49
Congestion Control Results
50
Congestion Control Results
51
Congestion Control Results