Exploration and Apprenticeship Learning in Reinforcement Learning

1
Exploration and Apprenticeship Learning in
Reinforcement Learning
  • Pieter Abbeel and Andrew Y. Ng
  • Stanford University

2
Overview
  • Reinforcement learning in systems with unknown
    dynamics.
  • Algorithms such as E3 (Kearns and Singh, 2002)
    learn the dynamics by using exploration policies.
  • Aggressive exploration is dangerous for many
    systems.
  • We show that in apprenticeship learning, when we
    have a teacher demonstration of the task, this
    explicit exploration step is unnecessary and
    instead we can just use exploitation policies.

3
Reinforcement learning formalism
  • Markov Decision Process (MDP):
    (S, A, Psa, H, s0, R).
  • Policy π : S → A.
  • Utility of a policy π:
    U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ].
  • Goal: find the policy π that maximizes U(π)
    (a Monte Carlo sketch of this quantity follows below).

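To make the utility definition concrete, here is a minimal Monte Carlo sketch of U(π). It is an illustration only, not from the slides: `policy`, `step` (a sampler from Psa), `R`, `s0`, and `H` are hypothetical stand-ins for the MDP components above.

```python
def estimate_utility(policy, step, R, s0, H, n_rollouts=1000):
    """Monte Carlo estimate of U(pi) = E[ sum_{t=0}^{H} R(s_t) | pi ].

    Hypothetical stand-ins (not defined in the slides):
      policy(s) -> action, step(s, a) -> next state drawn from P_sa,
      R(s) -> reward, s0 the start state, H the horizon.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        ret = R(s)                  # reward at t = 0
        for _ in range(H):
            s = step(s, policy(s))  # sample s_{t+1} from P_sa
            ret += R(s)
        total += ret
    return total / n_rollouts
```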
4
Motivating example
[Diagram: two routes to an accurate dynamics model Psa: a textbook model /
specification, or collecting flight data and learning the model from data.]
How to fly the helicopter for data collection? How to
ensure that the entire flight envelope is covered by
the data collection process?
5
Learning the dynamical model
  • State-of-the-art: E3 algorithm, Kearns and Singh
    (2002). (And its variants/extensions: Kearns
    and Koller, 1999; Kakade, Kearns and Langford,
    2003; Brafman and Tennenholtz, 2002.)

[Flowchart: E3 loop: NO → explore, YES → exploit.]
6
Aggressive manual exploration
7
Learning the dynamical model
  • State-of-the-art: E3 algorithm, Kearns and Singh
    (2002). (And its variants/extensions: Kearns
    and Koller, 1999; Kakade, Kearns and Langford,
    2003; Brafman and Tennenholtz, 2002.)

Exploration policies are impractical: they do not
even try to perform well.
Can we avoid explicit exploration and just
exploit?
[Flowchart: same E3 loop: NO → explore, YES → exploit.]
8
Apprenticeship learning of the model
[Diagram: the apprenticeship-learning loop.
Expert human pilot flight → trajectory data (a1, s1, a2, s2, a3, s3, ...)
→ learn Psa → dynamics model Psa
→ reinforcement learning: max E[R(s0) + ... + R(sH)] → control policy π
→ autonomous flight → more trajectory data (a1, s1, a2, s2, a3, s3, ...)
→ learn Psa again.
Annotations: number of iterations? duration? performance?]
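The loop in the diagram can be written out as a short sketch. This is only an illustration under assumptions: `fit_model`, `solve_mdp`, `run_policy`, and `good_enough` are hypothetical helpers standing in for the boxes above, not an implementation from the paper.

```python
def apprenticeship_model_learning(teacher_trajectories, fit_model, solve_mdp,
                                  run_policy, good_enough, max_iters=50):
    """Sketch of the loop in the diagram above (hypothetical helper names):
      fit_model(data)     -> dynamics model P_sa fit to all trajectories so far
      solve_mdp(model)    -> exploitation policy maximizing E[R(s0)+...+R(sH)]
                             in the model
      run_policy(policy)  -> one real-world trajectory flown with the policy
      good_enough(policy) -> True once real-world performance is acceptable
    """
    data = list(teacher_trajectories)   # start from the expert demonstration
    policy = None
    for _ in range(max_iters):
        model = fit_model(data)         # learn P_sa from all data so far
        policy = solve_mdp(model)       # pure exploitation; no exploration step
        if good_enough(policy):
            break
        data.append(run_policy(policy)) # a poor run visits badly modeled states,
                                        # giving exactly the data the model lacks
    return policy
```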
9
Typical scenario
  • Initially all state-action pairs are
    inaccurately modeled.

[Figure: grid of state-action pairs; legend: accurately vs. inaccurately
modeled state-action pairs.]
10
Typical scenario (2)
  • Teacher demonstration.

[Figure: same grid; legend adds state-action pairs frequently vs. not
frequently visited by the teacher's policy.]
11
Typical scenario (3)
  • First exploitation policy.

[Figure: same grid; legend adds state-action pairs frequently visited by
the first exploitation policy.]
12
Typical scenario (4)
  • Second exploitation policy.

[Figure: same grid; legend adds state-action pairs frequently visited by
the second exploitation policy.]
13
Typical scenario (5)
  • Third exploitation policy.

[Figure: same grid; legend: state-action pairs frequently visited by the
third exploitation policy, and frequently vs. not frequently visited by
the teacher's policy.]
  • Model accurate for exploitation policy.
  • Model accurate for teacher's policy.
  • Exploitation policy better than teacher in model.

Also better than teacher in real world.
Done.
14
Two dynamics models
  • Discrete dynamics:
  • Finite S and A.
  • Dynamics Psa are described by state transition
    probabilities P(s'|s,a).
  • Learn dynamics from data using maximum
    likelihood (a counting sketch follows below).
  • Continuous, linear dynamics:
  • Continuous-valued states and actions (S ⊆ R^{n_S},
    A ⊆ R^{n_A}).
  • s_{t+1} = G φ(s_t) + H a_t + w_t.
  • Estimate G, H from data using linear regression.
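As a minimal sketch of the maximum-likelihood estimate in the discrete case, the snippet below turns transition counts into estimates of P(s'|s,a). The `transitions` input (a list of (s, a, s') triples) is an assumption of this sketch, not a data format from the slides.

```python
from collections import Counter, defaultdict

def ml_discrete_dynamics(transitions):
    """Maximum-likelihood estimate of P(s' | s, a) from observed triples.

    `transitions` is a hypothetical list of (s, a, s_next) tuples.
    Returns a dict mapping (s, a) -> {s_next: estimated probability}.
    """
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    P = {}
    for (s, a), c in counts.items():
        total = sum(c.values())
        P[(s, a)] = {s_next: n / total for s_next, n in c.items()}
    return P
```

For (s, a) pairs never visited, the estimate is simply left undefined here; the point of the approach is that exploitation policies keep adding data for exactly the pairs they end up needing.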

15
Performance guarantees
To perform as well as the teacher, it suffices:
  • Let any ε, δ > 0 be given.
  • Theorem. To obtain U(π) ≥ U(π_T) - ε within
    N = O(poly(1/ε, 1/δ, H, Rmax, ·)) iterations
    (a poly number of iterations)
    with probability 1 - δ, it suffices that
  • Nteacher = Ω(poly(1/ε, 1/δ, H, Rmax, ·))
    (a poly number of teacher demonstrations), and
  • Nexploit = Ω(poly(1/ε, 1/δ, H, Rmax, ·))
    (a poly number of trials with each exploitation
    policy).
  • Take-home message: so long as a demonstration is
    available, it is not necessary to explicitly
    explore; it suffices to only exploit.

Here · stands for |S|, |A| in the discrete case, and for
n_S, n_A, ||G||_Fro, ||H||_Fro in the continuous case.
16
Proof idea
  • From initial pilot demonstrations, our
    model/simulator Psa will be accurate for the part
    of the state space (s,a) visited by the pilot.
  • Our model/simulator will correctly predict the
    helicopter's behavior under the pilot's policy
    π_T.
  • Consequently, there is at least one policy
    (namely π_T) that looks capable of flying the
    helicopter well in our simulation.
  • Thus, each time we solve the MDP using the
    current model/simulator Psa, we will find a
    policy that successfully flies the helicopter
    according to Psa.
  • If, on the actual helicopter, this policy fails
    to fly the helicopter---despite the model Psa
    predicting that it should---then it must be
    visiting parts of the state space that are
    inaccurately modeled.
  • Hence, we get useful training data to improve the
    model. This can happen only a small number of
    times.

17
Learning with non-IID samples
  • IID = independent and identically distributed.
  • Our algorithm:
  • All future states depend on current state.
  • Exploitation policies depend on states visited.
  • States visited depend on past exploitation
    policies.
  • Exploitation policies depend on past exploitation
    policies.
  • Very complicated non-IID sample generating
    process.
  • Standard learning theory/convergence bounds
    (e.g., Hoeffding inequalities) cannot be used in
    our setting.
  • Martingales, Azuma's inequality, optional
    stopping theorem (a numerical illustration of
    Azuma's inequality follows below).
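Azuma's inequality is the concentration tool invoked above. The snippet below only illustrates the inequality itself on the simplest bounded-difference martingale (a ±1 random walk); it is not the paper's actual martingale argument.

```python
import math
import random

def azuma_bound(n, c, t):
    """Azuma's inequality: P(|Z_n - Z_0| >= t) <= 2 exp(-t^2 / (2 n c^2))
    for any martingale with bounded differences |Z_k - Z_{k-1}| <= c."""
    return 2.0 * math.exp(-t * t / (2.0 * n * c * c))

def empirical_tail(n, t, trials=20000):
    """Empirical tail probability for a +/-1 random walk (c = 1),
    the simplest bounded-difference martingale."""
    hits = sum(abs(sum(random.choice((-1, 1)) for _ in range(n))) >= t
               for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    n, t = 200, 30
    print("Azuma bound :", azuma_bound(n, 1.0, t))   # about 0.21
    print("Empirical   :", empirical_tail(n, t))     # noticeably smaller
```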

18
Related Work
  • Schaal & Atkeson, 1994: open-loop policy as
    starting point for devil-sticking; slow
    exploration of state space.
  • Smart & Kaelbling, 2000: model-free Q-learning,
    initial updates based on teacher.
  • Supervised learning of a policy from
    demonstration, e.g.,
    Sammut et al. (1992); Pomerleau (1989);
    Kuniyoshi et al. (1994); Amit & Mataric (2002).
  • Apprenticeship learning for unknown reward
    function (Abbeel & Ng, 2004).

19
Conclusion
  • Reinforcement learning in systems with unknown
    dynamics.
  • Algorithms such as E3 (Kearns and Singh, 2002)
    learn the dynamics by using exploration policies,
    which are dangerous/impractical for many systems.
  • We show that this explicit exploration step is
    unnecessary in apprenticeship learning, when we
    have an initial teacher demonstration of the
    task. We attain near-optimal performance
    (compared to the teacher) simply by repeatedly
    executing "exploitation policies" that try to
    maximize rewards.
  • In finite-state MDPs, our algorithm scales
    polynomially in the number of states; in
    continuous-state linearly parameterized dynamical
    systems, it scales polynomially in the dimension
    of the state space.

20
End of talk, additional slides for poster after
this
21
Samples from teacher
  • Dynamics model: s_{t+1} = G φ(s_t) + H a_t + w_t.
  • Parameter estimates after k samples (a least-squares
    sketch follows below):
    (G^(k), H^(k)) = arg min_{G,H} loss^(k)(G, H)
                   = arg min_{G,H} Σ_{t=0}^{k} || s_{t+1} - (G φ(s_t) + H a_t) ||².
  • Consider
    Z^(k) = loss^(k)(G, H) - E[ loss^(k)(G, H) ].
  • Then
    E[ Z^(k) | history up to time k-1 ] = Z^(k-1).
  • Thus Z^(0), Z^(1), ... is a martingale sequence.
  • Using Azuma's inequality (a standard martingale
    result) we prove convergence.

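The least-squares problem above has a standard closed-form solution; the sketch below solves it with numpy. The array shapes and the function name are assumptions of this illustration rather than anything specified in the slides.

```python
import numpy as np

def fit_linear_dynamics(phis, actions, next_states):
    """Least-squares estimate of (G, H) in s_{t+1} = G phi(s_t) + H a_t + w_t.

    Hypothetical inputs: phis is (k, d_phi) with rows phi(s_t),
    actions is (k, n_A), next_states is (k, n_S) with rows s_{t+1}.
    Minimizes sum_t || s_{t+1} - (G phi(s_t) + H a_t) ||^2.
    """
    X = np.hstack([phis, actions])                 # (k, d_phi + n_A)
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    d_phi = phis.shape[1]
    G = W[:d_phi].T                                # (n_S, d_phi)
    H = W[d_phi:].T                                # (n_S, n_A)
    return G, H
```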
22
Samples from exploitation policies
  • Consider
    Z^(k) = exp( loss^(k)(G*, H*) - loss^(k)(G, H) ),
    where G*, H* denote the true parameters.
  • Then
    E[ Z^(k) | history up to time k-1 ] = Z^(k-1).
  • Thus Z^(0), Z^(1), ... is a martingale sequence.
  • Using the optional stopping theorem (a standard
    martingale result; a small illustration follows
    below) we prove the true parameters G*, H*
    outperform G, H with high probability for all
    k = 0, 1, ...
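For intuition about the optional stopping theorem used here, the snippet below checks that E[Z_τ] ≈ E[Z_0] for a simple martingale (a ±1 random walk) and a bounded stopping time. It illustrates the theorem only, not the paper's specific argument.

```python
import random

def stopped_walk(stop_level=10, max_steps=100):
    """Run a +/-1 random walk (a martingale with Z_0 = 0) and stop at the
    first time |Z| >= stop_level, or at max_steps (a bounded stopping time)."""
    z = 0
    for _ in range(max_steps):
        z += random.choice((-1, 1))
        if abs(z) >= stop_level:
            break
    return z

if __name__ == "__main__":
    trials = 100000
    avg = sum(stopped_walk() for _ in range(trials)) / trials
    # Optional stopping: E[Z_tau] = E[Z_0] = 0, so avg should be close to 0.
    print("E[Z_tau] estimate:", avg)
```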