# Online Sampling for Markov Decision Processes

1
Online Sampling for Markov Decision Processes
• Bob Givan
• Joint work w/ E. K. P. Chong, H. Chang, G. Wu

Electrical and Computer Engineering, Purdue University
2
Markov Decision Process (MDP)
• Ingredients:
• System state x in state space X
• Control action a in A(x)
• Reward R(x,a)
• State-transition probability P(x,y,a)
• Find a control policy to maximize the objective function (a minimal interface is sketched below)

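To make the ingredients concrete, here is a minimal Python sketch of this interface (the method names `actions`, `reward`, and `sample_next` are illustrative, not from the talk). The sampling methods later in the talk only ever draw next states from a simulator, so transitions are exposed as a sampler rather than an explicit P(x,y,a) table:

```python
from typing import Hashable, List, Protocol

State = Hashable
Action = Hashable

class MDP(Protocol):
    """Illustrative interface for the MDP ingredients above."""

    def actions(self, x: State) -> List[Action]:
        """Admissible control actions A(x) at state x."""
        ...

    def reward(self, x: State, a: Action) -> float:
        """One-step reward R(x, a)."""
        ...

    def sample_next(self, x: State, a: Action) -> State:
        """Draw a next state y distributed as P(x, y, a)."""
        ...
```
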
3
Optimal Policies
• Policy: a mapping from state and time to actions
• Stationary policy: a mapping from state to actions
• Goal: a policy maximizing the objective function
• $V_H(x_0) = \max_u \mathrm{Obj}[R(x_0,a_0), \ldots, R(x_{H-1},a_{H-1})]$, where the max is over all policies $u = (u_0, \ldots, u_{H-1})$
• For large H, $a_0$ is independent of H (with an ergodicity assumption)
• Stationary optimal action $a_0$ for $H \to \infty$ via receding-horizon control

4
Q Values
• ⇒ Fix a large H and focus on the finite-horizon reward
• Define $Q(x,a) = R(x,a) + E[V_{H-1}(y)]$
• Utility of action a at state x
• Name: the Q-value of action a at state x
• Key identities (Bellman's equations; see the sketch below):
• $V_H(x) = \max_a Q(x,a)$
• $u_0(x) = \arg\max_a Q(x,a)$

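As a worked illustration of these identities (not from the slides), the following sketch computes $V_H$ and the greedy first action $u_0$ by exact backward induction, assuming the MDP is small enough to enumerate as dictionaries; the next slide notes why this is usually infeasible:

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

def bellman_backup(
    states: List[State],
    actions: Dict[State, List[Action]],
    reward: Dict[Tuple[State, Action], float],
    trans: Dict[Tuple[State, Action], List[Tuple[State, float]]],
    horizon: int,
) -> Tuple[Dict[State, float], Dict[State, Action]]:
    """Exact backward induction: V_h(x) = max_a [R(x,a) + E V_{h-1}(y)].

    Returns V_H and the greedy first action u_0(x) = argmax_a Q_H(x,a).
    Feasible only when the state space can be enumerated.
    """
    v = {x: 0.0 for x in states}  # V_0 = 0
    policy: Dict[State, Action] = {}
    for _ in range(horizon):
        q = {
            (x, a): reward[(x, a)] + sum(p * v[y] for y, p in trans[(x, a)])
            for x in states
            for a in actions[x]
        }
        v = {x: max(q[(x, a)] for a in actions[x]) for x in states}
        policy = {x: max(actions[x], key=lambda a, x=x: q[(x, a)])
                  for x in states}
    return v, policy
```
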
5
Solution Methods
• Recall:
• $u_0(x) = \arg\max_a Q(x,a)$
• $Q(x,a) = R(x,a) + E[V_{H-1}(y)]$
• Problem: the Q-value depends on the optimal policy
• Problem: the state space is extremely large (often continuous)
• Two-pronged solution approach:
• Apply a receding-horizon method
• Estimate Q-values via simulation/sampling

6
Methods for Q-value Estimation
• Previous work by other authors:
• Unbiased sampling (exact Q-value) [Kearns et al., IJCAI-99]
• Policy rollout (lower bound) [Bertsekas and Castañon, 1999]
• Our techniques:
• Hindsight optimization (upper bound)
• Parallel rollout (lower bound)

7
Expectimax Tree for V
8
Unbiased Sampling
9
Unbiased Sampling (Cont'd)
• For a given desired accuracy, how large should the sampling width and depth be?
• Answered by Kearns, Mansour, and Ng (1999)
• Requires prohibitive sampling width and depth
• e.g., C ≈ 10^8 and H_s > 60 to distinguish the best and worst policies in our scheduling domain
• We evaluate with smaller width and depth (estimator sketched below)

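A minimal sketch of the Kearns, Mansour, and Ng recursive estimator, assuming the `MDP` simulator interface sketched earlier (here `width` plays the role of C and `depth` the role of H_s); the nested recursion is what makes the required width and depth prohibitive:

```python
def sampled_q(mdp, x, a, depth: int, width: int) -> float:
    """Kearns/Mansour/Ng-style unbiased estimate of Q(x, a).

    Draws `width` next states and recurses for `depth` remaining
    levels; at each sampled state the best action is chosen by the
    same estimator.  Cost grows as O((num_actions * width) ** depth),
    which is why the C and H_s quoted above are prohibitive.
    """
    if depth == 0:
        return mdp.reward(x, a)  # no lookahead left: immediate reward only
    total = 0.0
    for _ in range(width):
        y = mdp.sample_next(x, a)
        total += max(sampled_q(mdp, y, b, depth - 1, width)
                     for b in mdp.actions(y))
    return mdp.reward(x, a) + total / width
```
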
10
How to Look Deeper?
11
Policy Rollout
12
Policy Rollout in Equations
• Write $V_H^u(y)$ for the value of following policy u
• Recall $Q(x,a) = R(x,a) + E[V_{H-1}(y)] = R(x,a) + E[\max_u V_{H-1}^u(y)]$
• Given a base policy u, use
• $R(x,a) + E[V_{H-1}^u(y)]$
• as a lower-bound estimate of the Q-value
• The resulting policy is PI(u), one step of policy improvement on u, given infinite sampling (see the sketch below)

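A sketch of this estimate in code, assuming the `MDP` simulator interface from earlier (`num_trajectories` controls the sampling accuracy and is not a quantity from the slides):

```python
def simulate(mdp, policy, x, steps: int) -> float:
    """Total reward of one trajectory following `policy` for `steps` steps."""
    ret = 0.0
    for _ in range(steps):
        a = policy(x)
        ret += mdp.reward(x, a)
        x = mdp.sample_next(x, a)
    return ret

def rollout_q(mdp, x, a, base_policy, horizon: int,
              num_trajectories: int) -> float:
    """Lower-bound estimate R(x,a) + E[V^u_{H-1}(y)]: take action a once,
    then follow the base policy u for the remaining H-1 steps."""
    total = 0.0
    for _ in range(num_trajectories):
        y = mdp.sample_next(x, a)
        total += simulate(mdp, base_policy, y, horizon - 1)
    return mdp.reward(x, a) + total / num_trajectories

def improved_policy(mdp, base_policy, horizon: int, num_trajectories: int):
    """Act greedily w.r.t. rollout_q; with enough samples this is PI(u)."""
    def u_prime(x):
        return max(mdp.actions(x),
                   key=lambda a: rollout_q(mdp, x, a, base_policy,
                                           horizon, num_trajectories))
    return u_prime
```
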
13
Policy Rollout (Cont'd)
14
Parallel Policy Rollout
• Generalization of policy rollout, due to Chang, Givan, and Chong, 2000
• Given a set U of base policies, use
• $R(x,a) + E[\max_{u \in U} V_{H-1}^u(y)]$
• as an estimate of the Q-value
• More accurate estimate than policy rollout
• Still gives a lower bound on the true Q-value
• Still gives a policy no worse than any in U (see the sketch below)

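A sketch of the parallel-rollout estimate, reusing the `simulate` helper from the rollout sketch above; the key difference from simple rollout is that the max over base policies sits inside the expectation:

```python
def parallel_rollout_q(mdp, x, a, base_policies, horizon: int,
                       num_samples: int) -> float:
    """Estimate R(x,a) + E[max_{u in U} V^u_{H-1}(y)].

    For each sampled next state y, every base policy is simulated
    from y and the best one is kept, so the estimate dominates
    rollout of any single policy while remaining a lower bound on
    the true Q-value.
    """
    total = 0.0
    for _ in range(num_samples):
        y = mdp.sample_next(x, a)
        # One trajectory per policy is a crude estimate of V^u_{H-1}(y);
        # averaging several per policy reduces the noise in the max.
        total += max(simulate(mdp, u, y, horizon - 1)
                     for u in base_policies)
    return mdp.reward(x, a) + total / num_samples
```
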
15
Hindsight Optimization Tree View
16
Hindsight Optimization Equations
• Swap Max and Exp in the expectimax tree
• Solve each offline optimization problem
• $O(kC \cdot f(H))$ time, where f(H) is the offline problem complexity
• Jensen's inequality implies the estimates are upper bounds (see the sketch below)

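A sketch of the swap, with two hypothetical hooks that the slides leave domain-specific: `sample_future`, which fixes all random outcomes over the horizon, and `solve_offline`, the deterministic offline optimizer whose complexity is the f(H) above:

```python
def hindsight_q(x, a, horizon: int, num_futures: int,
                sample_future, solve_offline) -> float:
    """Upper-bound estimate of Q(x, a) by swapping Max and Exp.

    Each sampled `future` fixes all randomness over the horizon, so
    the inner problem becomes deterministic and can be optimized
    offline with full hindsight.  Averaging the per-future optima
    estimates E[max ...] >= max E[...], so by Jensen's inequality
    this over-estimates the true Q-value.
    """
    total = 0.0
    for _ in range(num_futures):
        future = sample_future(x, a, horizon)  # one complete random future
        total += solve_offline(x, a, future)   # optimal value knowing it
    return total / num_futures
```
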
17
Hindsight Optimization (Cont'd)
18
Application to Example Problems
• Apply unbiased sampling, policy rollout, parallel rollout, and hindsight optimization to:
• Multi-class deadline scheduling
• Random early dropping
• Congestion control

19
Basic Approach
• The traffic model provides a stochastic description of possible future outcomes
• Method:
• Formulate network decision problems as POMDPs by incorporating the traffic model
• Solve the belief-state MDP online using sampling (choose a time-scale that allows for computation time)

20
Domain 1: Deadline Scheduling
Objective: minimize weighted loss
21
Domain 2: Random Early Dropping
Objective: minimize delay without sacrificing throughput
22
Domain 3: Congestion Control
23
Traffic Modeling
• A Hidden Markov Model (HMM) for each source
• Note: the state is hidden, so the model is only partially observed (a belief-update sketch follows)

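As a concrete illustration (the dict-of-dicts HMM layout is an assumption, not from the talk), one forward step of belief tracking for a source; this belief over hidden model states is the state that the belief-state MDP of slide 19 feeds to the sampling methods:

```python
def belief_update(belief, trans, emit, obs):
    """One HMM forward step: b'(s') ∝ emit[s'][obs] * sum_s b(s) * trans[s][s'].

    `belief` maps hidden state -> probability; `trans[s][s']` and
    `emit[s][o]` are the HMM's transition and observation tables.
    """
    unnorm = {
        s2: emit[s2][obs] * sum(belief[s] * trans[s][s2] for s in belief)
        for s2 in belief
    }
    z = sum(unnorm.values())
    if z == 0.0:  # observation impossible under the model; keep old belief
        return dict(belief)
    return {s2: p / z for s2, p in unnorm.items()}
```
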
24
• Non-sampling policies:
• EDF: earliest deadline first
• Deadline sensitive, class insensitive
• SP: static priority
• Deadline insensitive, class sensitive
• CM: current minloss [Givan et al., 2000]
• Deadline and class sensitive
• Minimizes weighted loss for the current packets (EDF and SP are sketched below)

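For illustration, the two simple orderings in code (the `Packet` fields are assumptions for the sketch; CM's weighted-loss optimization is more involved and not sketched):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    deadline: int   # time slots until the packet expires
    weight: float   # class weight, i.e. the cost of losing it

def edf_order(packets: List[Packet]) -> List[Packet]:
    """EDF: serve the earliest deadline first (ignores class weight)."""
    return sorted(packets, key=lambda p: p.deadline)

def sp_order(packets: List[Packet]) -> List[Packet]:
    """SP: serve the highest-weight class first (ignores deadlines)."""
    return sorted(packets, key=lambda p: -p.weight)
```
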
25
Deadline Scheduling Results
• Objective: minimize weighted loss
• Comparison:
• Non-sampling policies
• Unbiased sampling (Kearns et al.)
• Hindsight optimization
• Rollout with CM as base policy
• Parallel rollout
• Results due to H. S. Chang

26
27
28
29
Random Early Dropping Results
• Objective: minimize delay subject to a throughput loss-tolerance
• Comparison:
• Candidate policies: RED and buffer-k
• KMN-sampling
• Rollout of buffer-k
• Parallel rollout
• Hindsight optimization
• Results due to H. S. Chang.

30
Random Early Dropping Results
31
Random Early Dropping Results
32
Congestion Control Results
• MDP objective: minimize a weighted sum of throughput, delay, and loss-rate
• Fairness is hard-wired
• Comparisons:
• PD-k (proportional-derivative control with target queue size k)
• Hindsight optimization
• Rollout of PD-k, and parallel rollout
• Results due to G. Wu, in progress

33
Congestion Control Results
34
Congestion Control Results
35
Congestion Control Results
36
Congestion Control Results
37
Results Summary
• Unbiased sampling cannot cope
• Parallel rollout wins in 2 domains
• Not always equal to simple rollout of one base policy
• Hindsight optimization wins in 1 domain
• Simple policy rollout is the cheapest method
• Poor in domain 1
• Strong in domain 2 with the best base policy, but how do we find that policy?
• So-so in domain 3 with any base policy

38
Talk Summary
• Case study of MDP sampling methods
• New methods offering practical improvements
• Parallel policy rollout
• Hindsight optimization
• Systematic methods for using traffic models to
help make network control decisions
• Feasibility of real-time implementation depends
on problem timescale

39
Ongoing Research
• Apply to other control problems (different
timescales)
• QoS routing
• Link bandwidth allotment
• Multiclass connection management
• Problems arising in proxy-services
• Diagnosis and recovery

40
Ongoing Research (Cont'd)
• Alternative traffic models
• Multi-timescale models
• Long-range dependent models
• Closed-loop traffic
• Fluid models
• Learning traffic model online
• Adaptation to changing traffic conditions

41
Congestion Control (Cont'd)
42
Congestion Control Results
43
Hindsight Optimization (Cont'd)
44
Policy Rollout (Cont'd)
(figure: policy performance as a function of the base policy)
45
Receding-horizon Control
• For a large horizon H, the optimal policy is stationary
• At each time, if the state is x, apply the action
• $u(x) = \arg\max_a Q(x,a) = \arg\max_a \{R(x,a) + E[V_{H-1}(y)]\}$
• Compute an estimate of the Q-values at each time (loop sketched below)

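A sketch of the control loop, assuming the earlier `MDP` simulator interface; `q_estimate` can be any of the estimators sketched above (e.g. `rollout_q` with its extra arguments bound via `functools.partial`):

```python
def receding_horizon(mdp, x, q_estimate, horizon: int, num_steps: int):
    """Receding-horizon control: at every step, re-estimate the H-step
    Q-values from the current state and execute only the greedy first
    action, yielding a stationary policy."""
    visited = []
    for _ in range(num_steps):
        a = max(mdp.actions(x),
                key=lambda b: q_estimate(mdp, x, b, horizon))
        visited.append((x, a))
        x = mdp.sample_next(x, a)  # in deployment: observe the real next state
    return visited
```
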
46
Congestion Control (Contd)
47
Domain 3: Congestion Control
(figure: high-priority and best-effort traffic sharing a bottleneck node)
• Resources: bandwidth and buffer
• Objective: optimize throughput, delay, loss, and fairness
• High-priority traffic: open-loop controlled
• Low-priority traffic: closed-loop controlled

48
Congestion Control Results
49
Congestion Control Results
50
Congestion Control Results
51
Congestion Control Results