Transcript and Presenter's Notes

Title: Transfer in Variable-Reward Hierarchical Reinforcement Learning


1
Transfer in Variable-Reward Hierarchical Reinforcement Learning
Hui Li, March 31, 2006
2
Overview
  • Multi-criteria reinforcement learning
  • Transfer in variable-reward hierarchical
    reinforcement learning
  • Results
  • Conclusions

3
Multi-criteria reinforcement learning
Reinforcement Learning
  • Definition
  • Reinforcement learning is the process by which the agent learns an
    approximately optimal policy through trial-and-error interactions
    with the environment.

4
  • Goal
  • The agent's goal is to maximize the cumulative amount of reward it
    receives over the long run.
  • A new value function -- the average-adjusted sum of rewards (bias)

Average reward (gain) per time step under a given policy π
  • Bellman equation
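A standard form of this equation (the slide shows it only as an image, so the notation here is assumed rather than copied) is

    h(s) = \max_a \big[ r(s,a) - \rho + \sum_{s'} P(s'|s,a)\, h(s') \big]

where h(s) is the bias (average-adjusted value) of state s and \rho is the gain.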

5
  • H-learning
  • a model-based version of average-reward reinforcement learning
(Figure: the gain update, combining the old estimate ρ_old with the new observation r_s(a), using a learning rate α, 0 < α < 1, to produce ρ_new.)
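A plausible reconstruction of this update, consistent with the labels above and with H-learning as described in [7] (the slide's exact equation is not shown in the transcript), is

    \rho_{new} \leftarrow (1 - \alpha)\,\rho_{old} + \alpha\,\big( r_s(a) + h(s') - h(s) \big),

applied when a greedy action a is executed in state s and leads to state s'.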
  • R-learning
  • a model-free version of average-reward reinforcement learning
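For reference, R-learning's update rules in their standard form (see [6]; the slide's own equations appear only as images) are

    R(s,a) \leftarrow R(s,a) + \beta\,\big[ r(s,a) - \rho + \max_{a'} R(s',a') - R(s,a) \big]
    \rho \leftarrow \rho + \alpha\,\big[ r(s,a) + \max_{a'} R(s',a') - \max_{a''} R(s,a'') - \rho \big]

where the \rho update is applied only when a greedy action is taken.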

6
Multi-criteria reinforcement learning
In many situations, it is natural to express the objective as making
appropriate tradeoffs between different kinds of rewards.
  • Goals
  • Eating food
  • Guarding food
  • Minimizing the number of steps it walks

Buridan's donkey problem
7
Weighted optimization criterion
The weight vector w represents the importance of each kind of reward.
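Concretely, the scalarized reward is the weighted sum of the reward components (a sketch in the vector notation implied by the slides, not the exact formula shown):

    r_w(s,a) = w \cdot \vec{r}(s,a) = \sum_i w_i\, r_i(s,a).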
If the weight vector w is static and never changes over time, then the
problem reduces to ordinary reinforcement learning with a scalar reward.
If the weight vector varies from time to time, learning a policy for
each weight vector from scratch is very inefficient.
8
Since the MDP model is a linear transformation, the average reward ρ^π
and the average-adjusted reward h^π(s) are linear in the reward weights
for a given policy π.
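In symbols (a restatement of this linearity, with the vector notation assumed here): writing \vec{\rho}^\pi and \vec{h}^\pi(s) for the vector-valued average reward and bias of policy \pi,

    \rho^\pi(w) = w \cdot \vec{\rho}^\pi, \qquad h^\pi(s; w) = w \cdot \vec{h}^\pi(s).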
9
  • Each line represents the weighted average reward of a policy πk
  • Solid lines represent the active weighted average rewards
  • Dotted lines represent the inactive weighted average rewards
  • Dark line segments represent the best average reward for each
    weight vector

10
The key idea
Π: the set of all stored policies.
Only those policies whose weighted average rewards are active (i.e.,
best for some weight vector) are stored.
Update equations
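A minimal Python sketch of this idea (function and variable names are illustrative assumptions, not the paper's implementation): each stored policy keeps its vector-valued average reward, and for a new weight vector the agent initializes from the stored policy with the highest weighted average reward rather than learning from scratch.

    import numpy as np

    def best_stored_policy(stored_policies, w):
        """Pick the stored policy whose weighted average reward w . rho is largest.

        stored_policies: list of (policy, rho_vector) pairs, where rho_vector
        is the vector-valued average reward earned by that policy.
        """
        return max(stored_policies, key=lambda pr: float(np.dot(w, pr[1])))

    def is_active(candidate_rho, stored_policies, weight_samples):
        """A policy is kept ('active') only if it beats every stored policy
        for at least one sampled weight vector."""
        return any(
            all(np.dot(w, candidate_rho) >= np.dot(w, rho)
                for _, rho in stored_policies)
            for w in weight_samples
        )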
12
Variable-reward hierarchical reinforcement
learning
  • The original MDP M is split into sub-SMDPs M0, …, Mn, each sub-SMDP
    representing a subtask
  • Solving the root task M0 solves the entire MDP M
  • The task hierarchy is represented as a directed acyclic graph known
    as the task graph
  • A local policy πi for the subtask Mi is a mapping from the states to
    the child tasks of Mi
  • A hierarchical policy π for the whole task is an assignment of a
    local policy πi to each subtask Mi
  • The objective is to learn an optimal policy that optimizes the policy
    for each subtask, assuming that its children's policies are optimized

13
(Figure: the RTS domain map, showing the forest, home base, enemy base, goldmine, and peasants.)
14
Two kinds of subtasks (see the sketch after this list)
  • Composite subtasks
  • Root: the whole task
  • Harvest: the goal is to harvest wood or gold
  • Deposit: the goal is to deposit a resource into the home base
  • Attack: the goal is to attack the enemy base
  • Primitive subtasks: primitive actions
  • north, south, east, west
  • pick a resource, put a resource
  • attack the enemy base
  • idle
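A minimal Python sketch of how the task graph and a hierarchical policy from the previous slides could be represented (the dictionary names, the assignment of primitive actions to composite subtasks, and the constant placeholder local policies are illustrative assumptions, not taken from the paper):

    # Task graph: each composite subtask lists the child subtasks it may call.
    task_graph = {
        "Root":    ["Harvest", "Deposit", "Attack"],
        "Harvest": ["north", "south", "east", "west", "pick"],
        "Deposit": ["north", "south", "east", "west", "put"],
        "Attack":  ["north", "south", "east", "west", "attack", "idle"],
    }

    # A hierarchical policy assigns a local policy (state -> child task)
    # to every composite subtask; these constant lambdas are placeholders.
    hierarchical_policy = {
        "Root":    lambda state: "Harvest",
        "Harvest": lambda state: "north",
        "Deposit": lambda state: "east",
        "Attack":  lambda state: "attack",
    }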

15
SMDP (semi-MDP)
An SMDP is a tuple <S, A, P, r, t>
  • S, A, P, r are defined the same as in an MDP
  • t(s, a) is the execution time of taking action a in state s
  • Bellman equation of the SMDP for average-reward learning
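A standard form of this equation (the slide shows it only as an image, so the notation is assumed here) is

    h(s) = \max_a \big[ r(s,a) - \rho\, t(s,a) + \sum_{s'} P(s'|s,a)\, h(s') \big],

where the gain \rho is charged for the duration t(s,a) of the chosen action.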

A subtask is a tuple <Bi, Ai, Gi>
  • Bi: state abstraction function, which maps a state s in the original
    MDP into an abstract state in Mi
  • Ai: the set of subtasks that can be called by Mi
  • Gi: termination predicate

16
The value function decomposition satisfies the following set of Bellman
equations (shown on the slide as images).
At the root, we only store the average-adjusted reward.
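For reference, the discounted-reward MAXQ decomposition from [1], which this hierarchical formulation builds on, takes the form

    V_i(s) = \max_a Q_i(s,a), \qquad Q_i(s,a) = V_a(s) + C_i(s,a),

where C_i(s,a) is the completion function of subtask M_i after child a finishes; the average-reward variant used here works with average-adjusted (bias) values in place of discounted values.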
18
Results
  • Learning curves for a test reward weight after having seen 0, 1,
    2, …, 10 previous training weight vectors
  • Negative transfer: learning based on one previous weight is worse
    than learning from scratch.

19
Transfer ratio = F_Y / F_{Y/X}
  • F_Y is the area between the learning curve and its optimal value for
    the target problem Y with no prior learning experience on X.
  • F_{Y/X} is the area between the learning curve and its optimal value
    for problem Y given prior training on X.
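A small Python sketch of how this ratio could be computed from sampled learning curves (the function names and the trapezoidal approximation are assumptions for illustration, not the paper's evaluation code):

    import numpy as np

    def regret_area(rewards, optimal_value, dt=1.0):
        # Area between the learning curve and its optimal value,
        # approximated by the trapezoidal rule over evenly spaced points.
        gaps = optimal_value - np.asarray(rewards, dtype=float)
        return float(np.trapz(gaps, dx=dt))

    def transfer_ratio(curve_no_prior, curve_with_prior, optimal_value):
        # F_Y / F_{Y/X}: area without prior training on X divided by the
        # area with prior training on X.
        f_y = regret_area(curve_no_prior, optimal_value)
        f_y_given_x = regret_area(curve_with_prior, optimal_value)
        return f_y / f_y_given_x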

20
Conclusions
  • This paper showed that hierarchical task structure can accelerate
    transfer across variable-reward MDPs more than in the flat MDP
    setting
  • This hierarchical task structure also facilitates multi-agent
    learning

21
References
[1] T. Dietterich. Hierarchical Reinforcement Learning with the MAXQ
Value Function Decomposition. Journal of Artificial Intelligence
Research, 13:227-303, 2000.
[2] N. Mehta and P. Tadepalli. Multi-Agent Shared Hierarchy
Reinforcement Learning. ICML Workshop on Richer Representations in
Reinforcement Learning, 2005.
[3] S. Natarajan and P. Tadepalli. Dynamic Preferences in Multi-Criteria
Reinforcement Learning. In Proceedings of ICML-05, 2005.
[4] N. Mehta, S. Natarajan, P. Tadepalli and A. Fern. Transfer in
Variable-Reward Hierarchical Reinforcement Learning. NIPS Workshop on
Transfer Learning, 2005.
[5] A. Barto and S. Mahadevan. Recent Advances in Hierarchical
Reinforcement Learning. Discrete Event Dynamic Systems, 2003.
[6] S. Mahadevan. Average Reward Reinforcement Learning: Foundations,
Algorithms, and Empirical Results. Machine Learning, 22:169-196, 1996.
[7] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement
Learning. Artificial Intelligence, 1998.