Title: Investigations on Automatic Behavior-based System Design: A Survey on Hierarchical Reinforcement Learning
1 Investigations on Automatic Behavior-based System Design: A Survey on Hierarchical Reinforcement Learning
- Amir massoud Farahmand
- Majid Nili Ahmadabadi, Babak N. Araabi, Caro Lucas
- www.SoloGen.net
- SoloGen_at_SoloGen.net
2 A non-uniform Outline
- Brief History of AI
- Challenges and Requirements of Robotic Applications
- Behavior-based Approach to AI
- The Problem of Behavior-based System Design
- MDP and Standard Reinforcement Learning Framework
- A Survey on Hierarchical Reinforcement Learning
- Behavior-based System Design
- Learning in BBS
- Structure Learning
- Behavior Learning
- Behavior Evolution and Hierarchy Learning in Behavior-based Systems
3 Happy birthday to Artificial Intelligence
- 1941 Konrad Zuse, Germany: general-purpose computer
- 1943 Britain (Turing and others): Colossus, for decoding
- 1945 ENIAC, US; John von Neumann a consultant
- 1956 The Logic Theorist on JOHNNIAC (Newell, Shaw, and Simon)
- 1956 Dartmouth Conference organized by John McCarthy (inventor of LISP)
- The term Artificial Intelligence coined at Dartmouth, intended as a two-month, ten-man study!
4 Happy birthday to AI (2)
- "It is not my aim to surprise or shock you, but the simplest way I can summarize is to say that there are now in the world machines that think, that learn and that create. Moreover, their ability to do these things is going to increase rapidly until …" (Herb Simon, 1957)
- Unfortunately, Simon was too optimistic!
5 What has AI done for us?
- Rather good OCR (Optical Character Recognition) and speech recognition software
- Robots make cars in all advanced countries
- Reasonable machine translation is available for a large range of foreign web pages
- Systems land 200-ton jumbo jets unaided every few minutes
- Search systems like Google are not perfect but are very effective at information retrieval
- Computer games and auto-generated cartoons are advancing at an astonishing rate and have huge markets
- Deep Blue beat Kasparov in 1997. The world Go champion is a computer.
- Medical expert systems can outperform doctors in many areas of diagnosis (but we aren't allowed to find out easily!)
6 AI: What is it?
- What is AI?
- Different definitions:
- "The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular" (Boden)
- "The study of intelligence independent of its embodiment in humans, animals or machines" (McCarthy)
- "AI is the study of how to do things which at the moment people do better" (Rich & Knight)
- "AI is the science of making machines do things that would require intelligence if done by men" (Minsky) (fast arithmetic?)
- Is it definable?!
- Turing test, Weak and Strong AI, and …
7 AI: Basic assumption
- Symbol System Hypothesis: it is possible to construct a universal symbol system that thinks
- Strong Symbol System Hypothesis: the only way a system can think is through symbolic processing
- Happy birthday, Symbolic (Traditional, Good Old-Fashioned) AI
8 Symbolic AI: Methods
- Knowledge representation (Abstraction)
- Search
- Logic and deduction
- Planning
- Learning
9 Symbolic AI: Was it efficient?
- Chess? OK!
- Block-worlds? OK!
- Daily life problems? Not OK!
- Robots? Not OK!
- Commonsense? Not OK!
10 Symbolic AI and Robotics
[Figure: functional decomposition pipeline: sensors → world modelling → … → motor control → actuators]
- Functional decomposition
- Sequential flow
- Correct perception is assumed to be delivered by vision research, in some good-and-happy day to come!
- Get a logic-based or formal description of percepts
- Apply search operators, logical inference, or planning operators
11 Challenges and Requirements of Robotic Systems
- Challenges
- Sensor and Effector Uncertainty
- Partial Observability
- Non-Stationarity
- Requirements
- (among many others)
- Multi-goal
- Robustness
- Multiple Sensors
- Scalability
- Automatic design
- Adaptation (Learning/Evolution)
12 Behavior-based approach to AI
- Behavioral (activity) decomposition instead of functional decomposition
- Behavior: Sensor → Action (a direct link between perception and action)
- Situatedness
- Embodiment
- Intelligence as Emergence of …
13 Behavioral decomposition
[Figure: sensors feed parallel behavior layers (locomote, avoid obstacles, explore, build maps, manipulate the world) that drive the actuators]
14 Situatedness
- No world modelling and abstraction
- No planning
- No sequence of operations on symbols
- Direct link between sensors and actions
- Motto: "The world is its own best model"
15 Embodiment
- Only an embodied agent is validated as one that can deal with the real world.
- Only through physical grounding can any internal symbolic system be given meaning.
16 Emergence as a Route to Intelligence
- Emergence: the interaction of simple systems that results in something more than the sum of those systems
- Intelligence as the emergent outcome of the dynamical interaction of behaviors with the world
17 Behavior-based design
- Robust
- not sensitive to the failure of a particular part of the system
- no need for precise perception, as there is no modelling
- Reactive: fast response, as there is no long route from perception to action
- No representation
18 A Simple problem
- Goal: build a mobile robot controller that collects balls from the field and moves them home
- What we have:
- Differential-drive mobile robot
- 8 sonar sensors
- Vision system that detects balls and home
19 Basic design
[Figure: behavior stack: avoid obstacles, move toward ball, move toward home, exploration]
20 A Simple Shot
21 ?
- How should we DESIGN a behavior-based system?!
22 Behavior-based System Design Methodologies
- Hand Design
- Common almost everywhere.
- Complicated, maybe even infeasible, in complex problems
- Even if it is possible to find a working system, it is probably not optimal.
- Evolution
- Good solutions can be found
- Biologically plausible
- Time consuming
- Not fast at producing new solutions
- Learning
- Biologically plausible
- Learning is essential for the life-time survival of the agent.
23 The Importance of Adaptation (Learning/Evolution)
- Unknown environment/body
- The exact model of the environment/body is not known
- Non-stationary environment/body
- Changing environments (offices, houses, streets, and almost everywhere)
- Aging
- cannot be remedied by evolution very easily
- The designer may not know how to benefit from every aspect of her agent/environment
- Let the agent learn it by itself (learning as optimization)
- etc.
24 Different Learning Methods
25 Reinforcement Learning
- The agent senses the state of the environment
- The agent chooses an action
- The agent receives a reward from an internal/external critic
- The agent learns to maximize its received rewards through time (see the sketch below).
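To make this loop concrete, here is a minimal tabular sketch in Python (not from the original slides). The `env.reset()`/`env.step()` interface and all names are illustrative assumptions; the update rule is standard Q-learning, shown here ahead of its formal introduction on slide 30.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps=0.1):
    """Choose a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def run_episode(env, Q, actions, alpha=0.1, gamma=0.95):
    """One pass of the sense-act-reward loop, learning as we go."""
    state = env.reset()                              # sense the state
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions)   # choose an action
        next_state, reward, done = env.step(action)  # critic returns a reward
        # learn to maximize rewards through time (Q-learning update)
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next
                                       - Q[(state, action)])
        state = next_state

Q = defaultdict(float)   # tabular action-value estimates, initially zero
```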
26 Reinforcement Learning
- Inspired by psychology
- Thorndike, Skinner, Hull, Pavlov, …
- Very successful applications:
- Games (Backgammon)
- Control
- Robotics
- Elevator Scheduling
- …
- Well-defined mathematical formulation:
- Markov Decision Problems
27 Markov Decision Problems
- Markov processes formulate a wide range of dynamical systems
- Finding an optimal solution of an objective function
- Stochastic Dynamic Programming
- Planning: known environment
- Learning: unknown environment
28 MDP
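The formulas on this slide did not survive extraction. As a standard reconstruction (textbook form, not recovered from the original), an MDP is a tuple

\[
\langle S, A, P, R \rangle, \qquad P(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}, \qquad R(s, a),
\]

and the optimal value function satisfies the Bellman optimality equation

\[
V^*(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \Big].
\]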
29 Reinforcement Learning Revisited (1)
- A very important machine learning method
- An approximate online solution of MDPs
- Monte Carlo methods
- Stochastic approximation
- Function approximation
30 Reinforcement Learning Revisited (2)
- Q-Learning and SARSA are among the most important solution methods of RL; their standard update rules are shown below.
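The update rules themselves were lost with the slide graphics; for reference, their standard forms are

\[
\text{Q-Learning:} \quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big],
\]
\[
\text{SARSA:} \quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big].
\]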
31 Some Simple Examples
[Figure: 1D grid world: map of the environment, policy, and value function]
32 Some Simple Examples
[Figure: 2D grid world: map, value function, policy, and value function (3D view)]
33 Some Simple Examples
[Figure: 2D grid world: map, value function, value function (3D view), and policy]
34 Curses of DP
- It is not easy to use DP (and RL) in robotic tasks.
- Curse of Modeling
- RL solves this problem
- Curse of Dimensionality (e.g., robotic tasks have a very big state space)
- Approximating the value function:
- Neural networks
- Fuzzy approximation
- Hierarchical Reinforcement Learning
35 A Sample of Learning in a Robot
Hajime Kimura and Shigenobu Kobayashi, "Reinforcement Learning using Stochastic Gradient Algorithm and its Application to Robots," Transactions of the Institute of Electrical Engineers of Japan, Vol. 119, No. 8, 1999 (in Japanese!).
36 Hierarchical Reinforcement Learning
37 ATTENTION
- Hierarchical reinforcement learning methods are not specially designed for behavior-based systems.
- Covering them at this depth in this presentation should not be read as a claim that they are closely related to behavior-based system design.
38 Hierarchical RL (1)
- Use some kind of hierarchy in order to:
- Learn faster
- Update fewer values (smaller storage dimension)
- Incorporate a priori knowledge from the designer
- Increase reusability
- Have a more meaningful structure than a mere Q-table
39 Hierarchical RL (2)
- Is there any unified meaning of hierarchy?
- NO!
- Different methods:
- Temporal abstraction
- State abstraction
- Behavioral decomposition
- …
40 Hierarchical RL (3)
- Feudal Q-Learning (Dayan, Hinton)
- Options (Sutton, Precup, Singh)
- MaxQ (Dietterich)
- HAM (Russell, Parr, Andre)
- ALisp (Andre, Russell)
- HexQ (Hengst)
- Weakly-Coupled MDPs (Bernstein, Dean; Lin; …)
- Structure Learning in SSA (Farahmand, Nili)
- Behavior Learning in SSA (Farahmand, Nili)
41 Feudal Q-Learning
- Divide each task into a few smaller sub-tasks
- A state abstraction method
- Different layers of managers
- Each manager takes orders from its super-manager and gives orders to its sub-managers
42 Feudal Q-Learning
- Principles of Feudal Q-Learning
- Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up.
- Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards (sub-managers do not know the task the super-manager has set the manager) and upwards (a super-manager does not know what choices its manager has made to satisfy its command).
43 Feudal Q-Learning
44 Feudal Q-Learning
45 Options: Introduction
- People make decisions at different time scales
- Traveling example
- People perform actions with different time scales
- Kicking a ball
- Becoming a soccer player
- It is desirable to have a method that supports such temporally extended actions over different time scales
46 Options: Concept
- Macro-actions
- The temporal abstraction method of Hierarchical RL
- Options are temporally extended actions, each consisting of a set of primitive actions
- Example:
- Primitive actions: walking N/S/W/E
- Options: go to the door, corner, table; go straight
- Options can be open-loop or closed-loop
- Semi-Markov Decision Process theory (Puterman)
47 Options: Formal Definitions
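The slide's formulas were lost in extraction; in the standard formulation of Sutton, Precup, and Singh, an option is a triple

\[
o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad \mathcal{I} \subseteq S, \quad \pi : S \times A \to [0, 1], \quad \beta : S \to [0, 1],
\]

where \(\mathcal{I}\) is the initiation set (the states where the option may start), \(\pi\) is its internal policy, and \(\beta(s)\) is the probability of terminating in state \(s\).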
48 Options: Rise of SMDP!
49 Options: Value function
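Again as a standard reconstruction (the original equations are missing): for a policy over options \(\mu\), the value of taking option \(o\) in state \(s\) is

\[
Q^{\mu}(s, o) = E\big[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V^{\mu}(s_{t+k}) \;\big|\; o \text{ initiated in } s_t = s \,\big],
\]

where \(k\) is the (random) duration of the option.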
50 Options: Bellman-like optimality condition
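The Bellman-like optimality condition over options, in its standard form:

\[
V^{*}_{O}(s) = \max_{o \in O_s} E\big[\, r + \gamma^{k}\, V^{*}_{O}(s') \;\big|\; o \text{ executed from } s \,\big],
\]

with \(r\) the cumulative discounted reward accrued while \(o\) runs, \(k\) its duration, and \(s'\) the state where it terminates.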
51 Options: A simple example
52 Options: A simple example
53 Options: A simple example
54 Interrupting Options
- An option's policy is followed until the option terminates.
- This condition is somewhat unnecessary:
- you may change your decision in the middle of executing your previous decision.
- Interruption Theorem: Yes! Interrupting is better!
55 Interrupting Options: An example
56 Options: Other issues
- Intra-option model and value learning
- Learning each option
- Defining sub-goal reward functions
- Generating new options
- Intrinsically Motivated RL
57 MaxQ
- MaxQ: Value Function Decomposition
- Loosely related to Feudal Q-Learning
- Decomposes the value function over a hierarchical structure
58 MaxQ
59 MaxQ: Value decomposition
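In Dietterich's notation (the slide's own equations were lost), the value of doing subtask \(a\) in state \(s\) in the context of parent task \(i\) decomposes as

\[
Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a),
\]

where \(V^{\pi}(a, s)\) is the expected reward accumulated while executing \(a\), and the completion function \(C^{\pi}(i, s, a)\) is the expected reward for completing task \(i\) after \(a\) finishes.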
60 MaxQ: Existence theorem
- Recursively optimal policies.
- There may be many recursively optimal policies with different value functions.
- A recursively optimal policy is not necessarily an optimal policy.
- If H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. M and H have the same value.
61 MaxQ: Learning
- Theorem: If M is an MDP, H is a stationary macro hierarchy, the policy is GLIE (Greedy in the Limit with Infinite Exploration), and the common convergence conditions hold (bounded V and C; the sum of the learning rates α is ∞), then, with probability 1, algorithm MaxQ-0 will converge!
62 MaxQ
- Faster learning: all-states updating
- Similar to the all-goals updating of Kaelbling
63 MaxQ
64 MaxQ: State abstraction
- Advantages:
- Memory reduction
- The needed exploration is reduced
- Increased reusability, as a subtask does not depend on its parents higher up
- Is it possible?!
65 MaxQ: State abstraction
- Exact preservation of the value function
- Approximate preservation
66 MaxQ: State abstraction
- Does it converge?
- It has not been formally proved yet.
- What can we do if we want to use an abstraction that violates Theorem 3?
- Reward function decomposition:
- design a reward function that reinforces the responsible parts of the architecture.
67 MaxQ: Other issues
- Undesired terminal states
- Non-hierarchical execution (polling execution):
- Better performance
- Computationally intensive
68 Return of BBS (Episode II): Automatic Design
69 Learning in Behavior-based Systems
- There are a few works on behavior-based learning
- Mataric, Mahadevan, Maes, and ...
- but there is no deep investigation of it (especially no mathematical formulation)!
- And most of them use flat architectures.
70 Learning in Behavior-based Systems
- There are different learning methods with different viewpoints, but we have concentrated on Reinforcement Learning.
- Agent: Did I perform it correctly?!
- Tutor: Yes/No! (or 0.3)
71 Learning in Behavior-based Systems
- We have divided learning in BBS into two parts:
- Structure Learning
- How should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?
- Behavior Learning
- How should each behavior behave (when we do not have the necessary toolbox)?
72 Structure Learning: Assumptions
- Structure learning in the Subsumption Architecture, as a good sample of BBS
- Purely parallel case
- We know B1, B2, …, but we do not know how to arrange them in the architecture
- we know how to avoid obstacles, pick an object, stop, move forward, and turn, but we don't know which behavior should be superior to the others.
73 Structure Learning
[Figure: behavior toolbox: build maps, explore, manipulate the world, locomote, avoid obstacles]
The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
74 Structure Learning
[Figure: the behavior toolbox again: build maps, explore, manipulate the world, locomote, avoid obstacles]
75 Structure Learning
[Figure: behaviors placed in the architecture]
1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!
76 Structure Learning
[Figure: the same arrangement]
The tutor (environment) punishes explore for being in that place in the structure.
77 Structure Learning
[Figure: the revised arrangement]
explore is not a very good behavior for the highest position in the structure, so it is replaced by avoid obstacles.
78 Structure Learning: Challenging Issues
- Representation: How should the agent represent the knowledge gathered during learning?
- Sufficient (the concept space should be covered by the hypothesis space)
- Tractable (small hypothesis space)
- Well-defined credit assignment
- Hierarchical Credit Assignment: How should the agent assign credit to different behaviors and layers in its architecture?
- If the agent receives a reward/punishment, how should we reward/punish the structure of the agent?
- Learning: How should the agent update its knowledge when it receives the reinforcement signal?
79 Structure Learning: Overcoming Challenging Issues
- Decomposing the behavior of a multi-agent system into simpler components may sharpen our view of the problem under investigation: we decompose the value function of the agent into simpler elements.
- The structure itself provides many clues.
80 Structure Learning: Value Function Decomposition
- Each structure has a value with respect to the reinforcement signal it receives.
- The objective is to find a structure T with a high value.
- We have decomposed the value function into simpler components that enable the agent to benefit from previous interaction with the environment.
81 Structure Learning: Value Function Decomposition
- It is possible to decompose the total system's value into the value of each behavior in each layer.
- We call this the Zero-Order (ZO) method.
Don't read the following equations!
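The equations themselves did not survive this transcript. As a rough sketch of the idea only (my notation, not necessarily the authors'), a zero-order decomposition of the value of a structure \(T\) that assigns behavior \(B_T(l)\) to layer \(l\) would be additive over layers:

\[
V(T) \approx \sum_{l=1}^{L} v\big(B_T(l),\, l\big),
\]

where \(v(B, l)\) is the zero-order value of behavior \(B\) sitting in layer \(l\).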
82 Structure Learning: Value Function Decomposition (Zero-Order Method)
- It stores the value of a behavior being in a specific layer.
ZO value table in the agent's mind:
Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
83 Structure Learning: Credit Assignment (Zero-Order Method)
- The controlling behavior is the only behavior held responsible for the current reinforcement signal.
- An appropriate ZO value-table update method is available (see the sketch below).
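A minimal sketch of what such an update might look like in Python; the slides do not give the exact rule, so the exponential-average form and all names here are assumptions:

```python
# zo_value[(behavior, layer)]: zero-order value of a behavior in a layer
def update_zo(zo_value, controlling, layer, reward, alpha=0.1):
    """Zero-order credit assignment: only the controlling behavior is
    held responsible for the current reinforcement signal."""
    key = (controlling, layer)
    zo_value[key] += alpha * (reward - zo_value[key])
```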
84 Structure Learning: Value Function Decomposition and Credit Assignment: Another Method (First Order)
- It stores the value of the relative order of behaviors
- How good/bad is it if B1 is placed higher than B2?!
- V(avoid obstacles > explore) = 0.8
- V(explore > avoid obstacles) = -0.3
- Sorry! Not that easy (and informative) to show graphically!!
- Credits are assigned to all (controlling, activated) pairs of behaviors.
- If the agent receives a reward while B1 is controlling and B3 and B5 are activated, the pairs (B1 > B3) and (B1 > B5) are credited (see the sketch below).
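A matching sketch for the first-order table (again an assumed update form, for illustration only):

```python
# fo_value[(b_high, b_low)]: value of placing b_high above b_low
def update_fo(fo_value, controlling, activated, reward, alpha=0.1):
    """First-order credit assignment: credit every (controlling,
    activated) pair, i.e. the relative order 'controlling > activated'."""
    for b in activated:
        key = (controlling, b)
        fo_value[key] += alpha * (reward - fo_value[key])
```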
85 Structure Learning: Experiment: Multi-Robot Object Lifting
- A group of three robots wants to lift an object using only their own local sensors
- No central control
- No communication
- Local sensors
- Objectives:
- Reaching a prescribed height
- Keeping the tilt angle small
86 Structure Learning: Experiment: Multi-Robot Object Lifting
[Figure: behavior toolbox for lifting: Push More, Hurry Up, Stop, Slow Down, Don't Go Fast; which arrangement?!]
87 Structure Learning: Experiment: Multi-Robot Object Lifting
88 Structure Learning: Experiment: Multi-Robot Object Lifting
[Figure: sample plot of the height of each robot after sufficient learning]
89 Structure Learning: Experiment: Multi-Robot Object Lifting
[Figure: sample plot of the tilt angle of the object after sufficient learning]
90 Behavior Learning
- The assumption of having a working behavior repertoire may not be practical in every situation
- Partial knowledge of the designer about the problem leads to suboptimal solutions
- Assumptions:
- The input and output spaces of each behavior are known (S and A).
- Fixed structure
91 Behavior Learning
92 Behavior Learning
[Figure: explore computes a1 = B1(s1); avoid obstacles computes a2 = B2(s2)]
How should each behavior behave when the system is in state s?!
93 Behavior Learning: Challenging Issues
- Hierarchical Behavior Credit Assignment: How should the agent assign credit to different behaviors in its architecture?
- If the agent receives a reward/punishment, how should we reward/punish the behaviors of the agent?
- Multi-agent Credit Assignment Problem
- Cooperation between Behaviors: How should we design behaviors so that they can cooperate with each other?
- Learning: How should the agent update its knowledge when it receives the reinforcement signal?
94 Behavior Learning: Value Function Decomposition
- The value function of the agent can be decomposed into simpler behavior-level components.
95 Behavior Learning: Hierarchical Behavior Credit Assignment
- Augmenting the action space of behaviors with No Action
- Cooperation between behaviors:
- each behavior knows whether there exists a better behavior among the lower behaviors
- if so, do not suppress them!
- We developed a multi-agent credit assignment framework for logically expressible teams.
96 Behavior Learning: Hierarchical Behavior Credit Assignment
97 Behavior Learning: Optimality Condition and Value Updating
98 Concurrent Behavior and Structure Learning
- We have divided the BBS learning task into two separate processes:
- Structure Learning
- Behavior Learning
- Concurrent behavior and structure learning is also possible
99 Concurrent Behavior and Structure Learning
1. Initialize the learning parameters
2. Interact with the environment and receive the reinforcement signal
3. Update the estimates of the structure and behavior value functions
4. Update the architecture according to the new estimates, then go back to step 2 (see the skeleton below)
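As a skeleton of this loop (every name here is an illustrative placeholder, not the authors' code; the concrete update rules are supplied by the caller):

```python
def concurrent_learning(env, structure, behaviors, update_structure,
                        update_behaviors, rearrange, n_episodes=1000):
    """Skeleton of the loop above: interact, update both value
    estimates, then rebuild the architecture from the new estimates.
    Step 1 (initializing learning parameters) is done by the caller."""
    for _ in range(n_episodes):
        # step 2: interact and receive the reinforcement signal
        reward, trace = env.run(structure, behaviors)
        # step 3: update structure and behavior value functions
        update_structure(trace, reward)
        update_behaviors(trace, reward)
        # step 4: update the architecture according to the new estimates
        structure = rearrange(structure)
    return structure
```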
100 Behavior and Structure Learning: Experiment: Multi-Robot Object Lifting
101 Behavior and Structure Learning: Experiment: Multi-Robot Object Lifting
102 Austin Villa Robot Soccer Team
N. Kohl and P. Stone, "Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion," IEEE International Conference on Robotics and Automation (ICRA), 2004.
103 Austin Villa Robot Soccer Team
Initial Gait (Kohl and Stone, ICRA 2004)
104 Austin Villa Robot Soccer Team
During the Training Process (Kohl and Stone, ICRA 2004)
105 Austin Villa Robot Soccer Team
Fastest Final Result (Kohl and Stone, ICRA 2004)
106 Artificial Evolution
- A computational framework inspired by natural evolution.
- Natural Selection (survival of the fittest)
- Reproduction
- Crossover
- Mutation
107 Artificial Evolution
- A good (fit) individual survives the various hazards and difficulties of its lifetime and can find a mate and reproduce.
- Its useful genetic information is passed to its offspring.
- If two fit parents mate with each other, their offspring is probably better than both of them.
108 Artificial Evolution
- Artificial Evolution is used as a method of optimization that:
- Does not need explicit knowledge of the objective function
- Does not need objective function derivatives
- Is less prone to getting stuck in local minima/maxima
- In contrast with gradient-based searches
109 Artificial Evolution
110 Artificial Evolution
111 Artificial Evolution: A General Scheme
1. Initialize the population
2. Calculate the fitness of each individual
3. Select the best individuals
4. Mate the best individuals, then go back to step 2 (see the sketch below)
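This scheme is easy to write down concretely. Below is a minimal, generic genetic algorithm in Python for bit-string genomes; it is an illustrative toy that follows the four steps above, not the controller representation used later in the talk:

```python
import random

def crossover(a, b):
    """One-point crossover of two equal-length genomes."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, p=0.05):
    """Flip each bit with probability p."""
    return [g if random.random() > p else 1 - g for g in genome]

def evolve(fitness, genome_len=20, pop_size=50, n_gen=100):
    """Generic GA loop matching the scheme above: initialize the
    population, evaluate fitness, select the best, and mate them."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)       # evaluate and rank
        parents = pop[:pop_size // 2]             # select best individuals
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children                  # next generation
    return max(pop, key=fitness)

# usage: maximize the number of ones ("one-max" toy problem)
best = evolve(fitness=sum)
```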
112 Artificial Evolution in Robotics
- Artificial Evolution as an approach to automatically designing the controller of a situated agent.
- Evolving controller: a neural network
113 Artificial Evolution in Robotics
- The objective function is not very well-defined in robotic tasks.
- The dynamics of the whole system (agent/environment) are too complex to compute derivatives of the objective function.
114 Artificial Evolution in Robotics
- Evolution is very time consuming.
- In most cases we do not actually have a population of robots, so we use a single robot instead of a population (which takes much more time).
- Implementation on a real physical robot may damage the robot before a suitable controller has evolved.
115 Artificial Evolution in Robotics: Simulated/Physical Robot
- Evolve from the first generation on the physical robot.
- Too expensive
- Simulate the robot and evolve an appropriate controller in the simulated world, then transfer the final solution to the physical robot.
- The dynamics of physical and simulated robots differ.
- After evolving a controller on the simulated robot, continue the evolution on the physical system too.
116 Artificial Evolution in Robotics
117 Artificial Evolution in Robotics
118 Artificial Evolution in Robotics
Best individual of generation 45, born after 35 hours.
Floreano, D. and Mondada, F., "Automatic Creation of an Autonomous Agent: Genetic Evolution of a Neural Network Driven Robot," in D. Cliff, P. Husbands, J.-A. Meyer, and S. Wilson (Eds.), From Animals to Animats III, Cambridge, MA: MIT Press, 1994.
119 Artificial Evolution in Robotics
25 generations (a few days).
D. Floreano, S. Nolfi, and F. Mondada, "Co-Evolution and Ontogenetic Change in Competing Robots," Robotics and Autonomous Systems, to appear, 1999.
120 Artificial Evolution in Robotics
J. Urzelai, D. Floreano, M. Dorigo, and M. Colombetti, "Incremental Robot Shaping," Connection Science, 10, 341-360, 1998.
121 Hybrid Evolution/Learning in Robots
- Evolution is slow
- but can find very good solutions
- Learning is fast (and more flexible during the lifetime)
- but may get stuck in local maxima of the fitness function.
- We may use both evolution and learning.
122 Hybrid Evolution/Learning in Robots
- You may remember that in the structure learning method we assumed there is a set of working behaviors.
- To develop behaviors, we used learning.
- Now we want to use evolution instead.
123 Behavior Evolution and Hierarchy Learning in BBS
- Behavior Generation
- Co-evolution
- Slow
- Structure Organization
- Learning
- Memetically Biased Initial Structure
124 Behavior Evolution and Hierarchy Learning in BBS
- Fitness function: how do we calculate the fitness of each behavior?
- Fitness Sharing
- Uniform
- Value-based
- Genetic Operators
- Mutation
- Crossover
125 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
126 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
127 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
128 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
129 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
130 Behavior Evolution and Hierarchy Learning in BBS: Experiment: Multi-Robot Object Lifting
131 Conclusions, Ongoing Research, and Future Work
- A rather complete and mathematical investigation of the automatic design of behavior-based systems
- Structure Learning
- Behavior Learning
- Concurrent Behavior and Structure Learning
- Behavior Evolution and Structure Learning
- Memetic Bias
- Good results in two different domains
- Multi-robot Object Lifting
- An Abstract Problem
132 Conclusions, Ongoing Research, and Future Work
- However, many steps remain before fully automated agent design:
- Extending to a multi-step formulation
- How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
- Applying structure learning methods to more general architectures, e.g., MaxQ.
- The problem of reinforcement signal design:
- designing a good reinforcement signal is not easy at all.
133 Questions?!