Apprentissage par Renforcement / Reinforcement Learning

Transcript and Presenter's Notes

1
Apprentissage par Renforcement / Reinforcement Learning
  • Kenji Doya
  • doya@atr.co.jp
  • ATR Human Information Science Laboratories
  • CREST, Japan Science and Technology Corporation

2
Outline
  • Introduction to Reinforcement Learning (RL)
  • Markov decision process (MDP)
  • Current topics
  • RL in Continuous Space and Time
  • Model-free and model-based approaches
  • Learning to Stand Up
  • Discrete plans and continuous control
  • Modular Decomposition
  • Multiple model-based RL (MMRL)

3
Learning to Walk (Doya & Nakano, 1985)
  • Action: cycle of 4 postures
  • Reward: speed sensor output
  • Multiple solutions: creeping, jumping, ...

4
Markov Decision Process (MDP)
  • Environment
  • dynamics P(s'|s,a)
  • reward P(r|s,a)
  • Agent
  • policy P(a|s)
  • Goal: maximize cumulative future rewards (interaction loop sketched below)
  • E[ r(t+1) + γ r(t+2) + ... ]
  • 0 ≤ γ < 1: discount factor
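As a rough illustration of this setup, here is a minimal sketch of the agent-environment interaction loop and the discounted return it tries to maximize. The `env` and `policy` interfaces are hypothetical placeholders, not part of the original presentation.

```python
# Minimal sketch of one MDP episode (hypothetical env/policy interfaces).
def run_episode(env, policy, gamma=0.9, max_steps=1000):
    s = env.reset()                     # initial state
    ret = 0.0                           # discounted return E[r(t+1) + gamma*r(t+2) + ...]
    for t in range(max_steps):
        a = policy(s)                   # sample a ~ P(a|s)
        s_next, r, done = env.step(a)   # environment: P(s'|s,a), P(r|s,a)
        ret += (gamma ** t) * r
        s = s_next
        if done:
            break
    return ret
```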

5
Value Function and TD error
  • State value function
  • V(s) = E[ r(t+1) + γ r(t+2) + ... | s(t)=s, P(a|s) ]
  • 0 ≤ γ < 1: discount factor
  • Consistency condition
  • δ(t) = r(t) + γ V(s(t)) - V(s(t-1)) = 0
  • new estimate - old estimate
  • Dual role of the temporal difference (TD) error δ(t)
  • Reward prediction: δ(t) ≈ 0 on average
  • Action selection: δ(t) > 0 means better than average (TD(0) update sketched below)
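A minimal tabular sketch of the TD(0) update implied by the consistency condition above; the learning rate `alpha` is an added assumption, not from the slides.

```python
from collections import defaultdict

V = defaultdict(float)  # tabular state values, initialized to 0

def td0_update(V, s_prev, r, s, alpha=0.1, gamma=0.9):
    """delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1)); move V(s(t-1)) toward the new estimate."""
    delta = r + gamma * V[s] - V[s_prev]
    V[s_prev] += alpha * delta
    return delta
```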

6
Example: Navigation
  • Reward field

7
Actor-Critic Architecture
(Block diagram: the environment provides state s and reward r; the critic V(s) computes the TD error δ, which trains the actor P(a|s) that selects action a.)
  • Critic: future reward prediction
  • update value: ΔV(s(t-1)) ∝ δ(t)
  • Actor: action reinforcement
  • increase P(a(t-1)|s(t-1)) if δ(t) > 0 (both updates sketched below)
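A sketch of the two updates in this architecture, assuming a tabular critic and a softmax actor over action preferences; the learning rates and the softmax parameterization are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

V = defaultdict(float)                            # critic: state values
prefs = defaultdict(lambda: defaultdict(float))   # actor: action preferences per state

def actor_critic_update(s_prev, a_prev, r, s, alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    delta = r + gamma * V[s] - V[s_prev]          # TD error from the critic
    V[s_prev] += alpha_v * delta                  # critic: improve reward prediction
    prefs[s_prev][a_prev] += alpha_p * delta      # actor: reinforce a(t-1) if delta > 0
    return delta

def sample_action(s, actions):
    z = np.array([prefs[s][a] for a in actions])
    p = np.exp(z - z.max())
    p /= p.sum()                                  # softmax policy P(a|s)
    return actions[np.random.choice(len(actions), p=p)]
```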

8
Q Learning
  • Action value function
  • Q(s,a) = E[ r(t+1) + γ r(t+2) + ... | s(t)=s, a(t)=a, P(a|s) ]
           = E[ r(t+1) + γ V(s(t+1)) | s(t)=s, a(t)=a ]
  • Action selection
  • a(t) = argmax_a Q(s(t),a) with prob. 1 - ε
  • Update (two target forms, sketched below)
  • Q(s(t),a(t)) ← r(t+1) + γ max_a Q(s(t+1),a)
  • Q(s(t),a(t)) ← r(t+1) + γ Q(s(t+1),a(t+1))
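A tabular sketch of both update targets listed above (the max-based Q-learning target and the next-action, SARSA-style target), together with ε-greedy action selection; the learning rate `alpha` is an added assumption.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)] -> action value

def epsilon_greedy(s, actions, eps=0.1):
    """a = argmax_a Q(s,a) with probability 1 - eps, otherwise a random action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Target: r(t+1) + gamma * max_a' Q(s(t+1), a')."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Target: r(t+1) + gamma * Q(s(t+1), a(t+1))."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```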

9
Dynamic Programming and RL
  • Dynamic Programming
  • given models P(s'|s,a) and P(r|s,a)
  • off-line solution of the Bellman equation (value-iteration sketch below)
  • V(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s' V(s') P(s'|s,a) ]
  • Reinforcement Learning
  • on-line learning with the TD error
  • δ(t) = r(t) + γ V(s(t)) - V(s(t-1))
  • ΔV(s(t-1)) = α δ(t)
  • ΔQ(s(t-1),a(t-1)) = α δ(t)
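For the DP side, a sketch of value iteration on the Bellman equation, assuming the models are given as dictionaries P[(s,a)] = {s': prob} and R[(s,a)] = expected reward; those data structures are illustrative, not from the slides.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Off-line solution of V(s) = max_a [ R(s,a) + gamma * sum_s' V(s') P(s'|s,a) ]."""
    V = {s: 0.0 for s in states}
    while True:
        max_change = 0.0
        for s in states:
            v_new = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                        for a in actions)
            max_change = max(max_change, abs(v_new - V[s]))
            V[s] = v_new
        if max_change < tol:
            return V
```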

10
Model-free and Model-based RL
  • Model-free: e.g., learn action values
  • Q(s,a) ← r(s,a) + γ Q(s',a')
  • a = argmax_a Q(s,a)
  • Model-based: learn a forward model P(s'|s,a)
  • action selection (sketched below)
  • a = argmax_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ]
  • simulation: learn V(s) and/or Q(s,a) off-line
  • dynamic programming: solve the Bellman equation
  • V(s) = max_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ]
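A sketch of the model-based action selection above, reusing the same model representation as in the value-iteration sketch (P[(s,a)] as a next-state distribution, R[(s,a)] as an expected reward); both interfaces are assumptions for illustration.

```python
def greedy_model_based_action(s, actions, P, R, V, gamma=0.9):
    """a = argmax_a [ E[R(s,a)] + gamma * sum_s' V(s') P(s'|s,a) ]."""
    return max(actions,
               key=lambda a: R[(s, a)] + gamma * sum(p * V[s2]
                                                     for s2, p in P[(s, a)].items()))
```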

11
Current Topics
  • Convergence proofs
  • with function approximators
  • Learning with hidden states: POMDP
  • estimate belief states
  • reactive, stochastic policy
  • parameterized finite-state policies
  • Hierarchical architectures
  • learn to select fixed sub-modules
  • train sub-modules
  • both

12
Partially Observable Markov Decision Process
(POMDP)
  • Update the belief state (Bayes filter sketched below)
  • observation model P(o|s) is not the identity
  • belief state b = (P(s1), P(s2), ...): real-valued
  • P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i)
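A sketch of the belief-state update above as a discrete Bayes filter; the dictionary-based model interfaces (`P_trans`, `P_obs`) are illustrative assumptions.

```python
def belief_update(b, a, o, P_trans, P_obs, states):
    """b'(s_k) proportional to P(o|s_k) * sum_i P(s_k|s_i, a) * b(s_i), then normalized.
    P_trans[(s, a)] is a dict {s_next: prob}; P_obs[(s, o)] = P(o|s)."""
    b_new = {}
    for s_k in states:
        predicted = sum(P_trans[(s_i, a)].get(s_k, 0.0) * b[s_i] for s_i in states)
        b_new[s_k] = P_obs[(s_k, o)] * predicted
    norm = sum(b_new.values())
    return {s: v / norm for s, v in b_new.items()}
```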

13
Tiger Problem (Kaelbling et al., 1998)
  • state: the tiger is on the left or the right
  • action: left, right, listen
  • observation: noisy, with 15% error
  • policy tree / finite-state policy

14
Outline
  • Introduction to Reinforcement Learning (RL)
  • Markov decision process (MDP)
  • Current topics
  • RL in Continuous Space and Time
  • Model-free and model-based approaches
  • Learning to Stand Up
  • Discrete plans and continuous control
  • Modular Decomposition
  • Multiple model-based RL (MMRL)

15
Why Continuous?
  • Analog control problems
  • discretization → poor control performance
  • how to discretize?
  • Better theoretical properties
  • differential algorithms
  • use of local linear models

16
Continuous TD learning
  • Dynamics
  • Value function
  • TD error
  • Discount factor
  • Gradient Policy
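The equations on this slide appear only as images in the transcript. As a rough stand-in, here is an Euler-discretized sketch of the continuous-time TD error used by Doya (2000), δ(t) = r(t) - V(x(t))/τ + dV(x(t))/dt, where the time constant τ plays the role of the discount factor; the discretization and parameter names are assumptions.

```python
def continuous_td_error(r_t, V_t, V_prev, dt, tau=1.0):
    """delta(t) = r(t) - V(x(t))/tau + dV/dt, with dV/dt approximated by (V_t - V_prev)/dt."""
    return r_t - V_t / tau + (V_t - V_prev) / dt
```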

17
On-line Learning of State Value
  • state x = (angle, angular velocity)
  • V(x)

18
Example: Cart-Pole Swing-Up
  • Reward: height of the tip of the pole
  • Punishment: crashing into the wall

19
Fast Learning by Internal Models
  • Pole balancing (Stefan Schaal, USC)
  • Forward model of pole dynamics
  • Inverse model of arm dynamics

20
Internal Models for Planning
  • Devil sticking (Chris Atkeson, CMU)

21
Outline
  • Introduction to Reinforcement Learning (RL)
  • Markov decision process (MDP)
  • Current topics
  • RL in Continuous Space and Time
  • Model-free and model-based approaches
  • Learning to Stand Up
  • Discrete plans and continuous control
  • Modular Decomposition
  • Multiple model-based RL (MMRL)

22
Need for Hierarchical Architecture
  • Performance of control
  • Many high-precision sensors and actuators
  • Prohibitively long time for learning
  • Speed of learning
  • Search in low-dimensional, low-resolution space

23
Learning to Stand Up (Morimoto & Doya, 1998)
  • Reward: height of the head
  • Punishment: tumble
  • State: pitch and joint angles, and their derivatives
  • Simulation: many thousands of trials to learn

24
Hierarchical Architecture
  • Upper level
  • discrete state/time
  • kinematics
  • action: subgoals
  • reward: total task
  • Lower level
  • continuous state/time
  • dynamics
  • action: motor torque
  • reward: achieving subgoals

(Diagram: the upper level learns Q(S,A) over a sequence of subgoals; the lower level learns V(s) and a control law a = g(s) for each subgoal.)
25
Learning in Simulation
  • Upper level: subgoals
  • Lower level: control
  • (Figures: behavior early in learning and after 700 trials)
26
Learning with Real Hardware (Morimoto & Doya, 2001)
  • after simulation
  • after 100 physical trials
  • Adaptation by lower control modules

27
Outline
  • Introduction to Reinforcement Learning (RL)
  • Markov decision process (MDP)
  • Current topics
  • RL in Continuous Space and Time
  • Model-free and model-based approaches
  • Learning to Stand Up
  • Discrete plans and continuous control
  • Modular Decomposition
  • Multiple model-based RL (MMRL)

28
Modularity in Motor Learning
  • Fast De-adaptation and Re-adaptation
  • switching rather than re-learning
  • Combination of Learned Modules
  • serial/parallel/sigmoidal mixture

29
Soft Switching of Adaptive Modules
  • Hard switching based on prediction errors (Narendra et al., 1995)
  • can result in sub-optimal task decomposition when the prediction models are initially poor
  • Soft switching by a softmax of prediction errors (Wolpert and Kawato, 1998)
  • can use annealing for optimal decomposition (Pawelzik et al., 1996)

30
Responsibility by Competition
  • predict state change
  • responsibility
  • weight output/learning
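A sketch of the responsibility computation: each module predicts the next state, and a softmax of the (negative, scaled) squared prediction errors assigns responsibilities that gate both the output and the learning of each module. The scale parameter `sigma` is an assumption (it can be annealed, as noted on the previous slide).

```python
import numpy as np

def responsibilities(predicted_next_states, observed_next_state, sigma=1.0):
    """lambda_i proportional to exp(-||x' - x'_hat_i||^2 / (2*sigma^2)), normalized over modules.
    predicted_next_states: array of shape (n_modules, state_dim)."""
    errors = np.sum((predicted_next_states - observed_next_state) ** 2, axis=1)
    weights = np.exp(-errors / (2.0 * sigma ** 2))
    return weights / weights.sum()   # used to weight module outputs and learning rates
```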

31
Multiple Linear Quadratic Controllers
  • Linear dynamic models
  • Quadratic reward models
  • Value functions
  • Action outputs
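A sketch of how the modules' action outputs could be blended by responsibility, assuming each module holds a locally linear feedback law u_i = u0_i - K_i (x - x0_i) (for example, from a linear-quadratic design); the dictionary layout and field names are illustrative assumptions.

```python
import numpy as np

def mmrl_action(x, modules, lam):
    """u = sum_i lambda_i * u_i(x), with u_i a locally linear feedback law per module."""
    u = np.zeros_like(modules[0]["u0"], dtype=float)
    for lam_i, m in zip(lam, modules):
        u_i = m["u0"] - m["K"] @ (x - m["x0"])   # local linear controller around x0_i
        u += lam_i * u_i
    return u
```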

32
Swing-up control of a pendulum
  • Red: module 1, Green: module 2

33
Non-linearity and Non-stationarity
  • Specialization by predictability in space and time

34
Swing-up control of an Acrobot
  • Reward: height of the center of mass
  • Linearized around four fixed points

35
Swing-up motions
  • R = 0.001 vs. R = 0.002

36
Module switching
  • trajectories x(t) for R = 0.001 and R = 0.002
  • responsibilities λ_i: a symbol-like representation
  • 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4

37
Stand Up by Multiple Modules
  • Seven locally linear models

38
Segmentation of Observed Trajectory
  • Predicted motor output
  • Predicted state change
  • Predicted responsibility

39
Imitation of Acrobot Swing-up
  • q1(0) = π/12, q1(0) = π/6, q1(0) = π/12 (imitation)

40
Outline
  • Introduction to Reinforcement Learning (RL)
  • Markov decision process (MDP)
  • Current topics
  • RL in Continuous Space and Time
  • Model-free and model-based approaches
  • Learning to Stand Up
  • Discrete plans and continuous control
  • Modular Decomposition
  • Multiple model-based RL (MMRL)

41
Future Directions
  • Autonomous learning agents
  • Tuning of meta-parameters
  • Design of rewards
  • Selection of necessary/sufficient state coding
  • Neural mechanisms of RL
  • Dopamine neurons: encoding the TD error
  • Basal ganglia: value-based action selection
  • Cerebellum: internal models
  • Cerebral cortex: modular decomposition

42
What is Reward for a robot?
  • Should be grounded by
  • Self-preservation: self-recharging
  • Self-reproduction: copying the control program
  • Cyber Rodent

43
The Cyber Rodent Project
  • Learning mechanisms under realistic constraints
    of self-preservation and self-reproduction
  • acquisition of task-oriented internal
    representation
  • metalearning algorithms
  • constraints of finite time and energy
  • mechanisms for collaborative behaviors
  • roles of communication
  • abstract/emotional, concrete/symbolic
  • gene exchange rules for evolution

44
Input/Output
  • Sensory
  • CCD camera
  • range sensor
  • IR proximity x8
  • acceleration/gyro sensor
  • microphone x2
  • Motor
  • two wheels
  • jaw
  • R/G/B LED
  • speaker

45
Computation/Communication
  • CPU: Hitachi SH-4
  • FPGA image processor
  • I/O modules
  • Communication
  • IR port
  • wireless LAN
  • Software
  • learning/evolution
  • dynamic simulation