Hierarchical Reinforcement Learning

1
Hierarchical Reinforcement Learning
  • Gideon Maillette de Buy Wenniger
  • Recent Advances in Hierarchical Reinforcement
    Learning
  • Andrew G. Barto
  • Sridhar Mahadevan
  • Hierarchical Reinforcement Learning with the MAXQ
    Value Function Decomposition
  • Thomas G. Dietterich

2
Reinforcement Learning, Formulas
  • Value function, discounted reward
  • Future expected discounted reward
  • Bellman equations
  • Optimal values (the standard forms are written out below)
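The slides showed these formulas as images; for reference, the standard definitions (with discount factor γ and policy π) are:

    V^\pi(s) = E_\pi [ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s ]
    Q^\pi(s,a) = E_\pi [ r_0 + \gamma V^\pi(s_1) \mid s_0 = s, a_0 = a ]

  Bellman equation for a fixed policy:

    V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma V^\pi(s') ]

  Optimal values (Bellman optimality equations):

    V^*(s) = \max_a \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma V^*(s') ]
    Q^*(s,a) = \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') ]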

3
Value Iteration
  • Dynamic programming update rules
  • Q-learning update rule (off-policy)
  • Sarsa update rule (on-policy); the standard forms of all three are sketched below
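A sketch of the standard update rules (not reproduced from the slides, which showed them as images; α is the learning rate, γ the discount factor):

  Value iteration (dynamic programming):
    V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma V_k(s') ]

  Q-learning (off-policy):
    Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]

  Sarsa (on-policy, with a' the action actually taken in s'):
    Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma Q(s',a') - Q(s,a) ]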

4
Extension: SMDPs
  • Extension of the MDP framework
  • The amount of time between decisions is a random variable (discrete or continuous)
  • Necessary for operations that take multiple timesteps
  • A random variable denotes the waiting time
  • New formulas (a standard discrete-time form is sketched below)
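Assuming the discrete-time case, a standard SMDP Q-learning update (the slides gave the exact formulas as images) is:

    Q(s,a) \leftarrow Q(s,a) + \alpha [ R + \gamma^{\tau} \max_{a'} Q(s',a') - Q(s,a) ]

  where τ is the (random) number of timesteps the decision took and
  R = r_1 + \gamma r_2 + ... + \gamma^{\tau - 1} r_\tau is the discounted reward accumulated while it ran.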

5
Approaches to Hierarchical Reinforcement Learning
  • Idea of the macro-operator
  • A sequence of actions that can be invoked as a single action
  • Macros can call other macros (see the sketch after this list)
  • Hierarchical policies are an extension of the macro idea
  • Specify termination conditions
  • Partial policies / temporally extended actions
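As a minimal illustration of the macro idea (not from the papers; the Macro class and the env.step interface are hypothetical), a macro is just a sequence of primitive actions and/or other macros that can be invoked like a single action:

  class Macro:
      """A sequence of primitive actions and/or other macros,
      invokable as if it were one (temporally extended) action."""
      def __init__(self, steps):
          self.steps = steps  # list of primitive actions or Macro instances

      def execute(self, env):
          """Run each step in order; nested macros recurse.
          Returns the (undiscounted) reward accumulated along the way."""
          total_reward = 0.0
          for step in self.steps:
              if isinstance(step, Macro):
                  total_reward += step.execute(env)
              else:
                  _, reward = env.step(step)  # hypothetical environment API
                  total_reward += reward
          return total_reward

  # A macro built from primitives, and one built from another macro:
  go_north_twice = Macro(["north", "north"])
  to_corner = Macro([go_north_twice, "east", "east"])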

6
Options approach
  • Simplest case: Markov options
  • Stationary stochastic policy
  • Termination condition
  • Input set (initiation set)
  • Semi-Markov options: option policies may depend on the history since the option was called
  • Expanding each option into primitive actions gives a flat policy, which is non-Markovian
  • (the usual option notation is written out below)
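In the notation of Sutton, Precup and Singh (as used in the Barto and Mahadevan survey; written out here since the slides showed it as an image), a Markov option is a triple

    o = \langle I, \pi, \beta \rangle, \quad I \subseteq S, \quad \pi : S \times A \rightarrow [0,1], \quad \beta : S \rightarrow [0,1]

  with input (initiation) set I, stationary stochastic policy π, and termination condition β, where β(s) is the probability that the option terminates in state s. A primitive action is the special case of an option that always lasts exactly one timestep.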

7
Adapted value functions, update rules
  • The event of option o being initiated in state s at time t
  • The semi-Markov policy that follows o until it terminates, after k timesteps, and then continues according to the original policy
  • Analogous value-iteration step
  • Analogous Q-learning step (a standard form is sketched below)
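A standard form of the SMDP Q-learning update for options (the exact equations were shown as images; this follows the usual conventions): after option o, started in s, terminates in s' having run k timesteps and accumulated discounted reward r = r_1 + γ r_2 + ... + γ^{k-1} r_k,

    Q(s,o) \leftarrow Q(s,o) + \alpha [ r + \gamma^{k} \max_{o' \in O_{s'}} Q(s',o') - Q(s,o) ]

  where O_{s'} is the set of options available in s'. The value-iteration analogue replaces the sampled quantities with expectations over the option's model.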

8
MAXQ motivation
9
MAXQ value decomposition
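This slide presented the decomposition graphically. In Dietterich's notation, the value of acting in subtask i splits into the value of the chosen child subtask a and a completion term:

    Q(i, s, a) = V(a, s) + C(i, s, a)

    V(i, s) = \max_a Q(i, s, a)    for composite subtasks
    V(i, s) = E\{ r \mid s, i \}   for primitive actions

  where C(i, s, a) is the expected discounted reward for completing subtask i after child a has finished. Unrolling the recursion, the value of the root task decomposes into a sum of V and C terms along the path from the root down to a primitive action.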
10
MAXQ
  • Decompose the task into a hierarchically organized set of subtasks M0, M1, ..., Mn
  • The subtasks have to be solved to complete the root task M0
  • Assign a local reward for completing a subtask
  • When a subtask is called it runs until it, or a subtask higher in the hierarchy, completes
  • Use deterministic completion states
  • Assign reward depending on the completion state
  • Recursive instead of hierarchical optimality

11
Optimalities
12
MAXQ - continued
  • The lowest level of the hierarchy gives primitive actions and direct rewards
  • Use these in combination with local rewards to implement learning
  • Find a recursively optimal policy (vs. hierarchical optimality)
  • Enable state abstractions
  • Speed up learning by minimizing the number of states
  • Proof of convergence to recursive optimality
  • Possibility of executing the policy non-hierarchically

13
MAXQ graph
14
Results MAXQ
15
Topics for Future Research
  • 1. Compact representations
  • 2. Learning task Hierarchies
  • 3. Dynamic Abstraction
  • 4. Large Applications

16
Conclusions
  • Two main approaches:
  • Extend the state space with macros
  • Limit the state space using a hierarchy
  • Macros - the problem is how to learn good macros autonomously. Either suboptimal performance (a limited action space) or no real gain (an extended action space), but they might speed up learning significantly.
  • MAXQ - making use of a programmer-defined hierarchical decomposition makes state spaces smaller and learning faster. The problem is the effort needed from the programmer.

17
The MAXQ algorithm
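The slide showed Dietterich's pseudocode as an image. Below is a minimal Python sketch of the recursive MAXQ-0 learning routine; the helpers is_primitive, children, terminated, choose_action and the env.step interface are hypothetical stand-ins, and pseudo-rewards / the full MAXQ-Q variant are omitted:

  from collections import defaultdict

  V = defaultdict(float)   # V[(subtask, state)]          - values of primitive actions
  C = defaultdict(float)   # C[(subtask, state, child)]   - completion values
  ALPHA, GAMMA = 0.1, 0.95

  def q_value(i, s, a):
      # Decomposed action value: Q(i, s, a) = V(a, s) + C(i, s, a)
      return value(a, s) + C[(i, s, a)]

  def value(i, s):
      # V(i, s): stored directly for primitives, max over children otherwise
      if is_primitive(i):                      # hypothetical predicate
          return V[(i, s)]
      return max(q_value(i, s, a) for a in children(i, s))  # hypothetical children()

  def maxq_0(i, s, env):
      """Run subtask i from state s, learning as we go.
      Returns (number of primitive steps taken, resulting state)."""
      if is_primitive(i):
          s2, r = env.step(i)                  # hypothetical environment API
          V[(i, s)] += ALPHA * (r - V[(i, s)])
          return 1, s2
      steps = 0
      while not terminated(i, s):              # hypothetical termination test
          a = choose_action(i, s)              # e.g. epsilon-greedy over q_value
          n, s2 = maxq_0(a, s, env)
          # Move C(i, s, a) toward the discounted value of the best
          # continuation of subtask i from the state where child a finished.
          best = max(q_value(i, s2, a2) for a2 in children(i, s2))
          C[(i, s, a)] += ALPHA * (GAMMA ** n * best - C[(i, s, a)])
          steps += n
          s = s2
      return steps, s

Calling maxq_0(root_task, start_state, env) over many episodes would, under the conditions Dietterich states (suitably decaying learning rates and exploration), converge to a recursively optimal policy.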