Transcript and Presenter's Notes

Title: Reinforcement Learning: Dealing with Complexity and Safety in RL


1
Reinforcement Learning: Dealing with Complexity
and Safety in RL
  • Subramanian Ramamoorthy
  • School of Informatics
  • 27 March, 2012

2
(Why) Isn't RL Deployed More Widely?
  • Very interesting discussion at
    http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning,
    maintained by Satinder Singh
  • Negative views/myths: RL is hard due to
    dimensionality, partial observability, function
    approximation, etc.
  • Positive view: There is no getting away from the
    fact that RL is the proper statement of the
    agent's problem. So, the question is really one
    of how to solve it!

3
A Provocative Claim
  • "The (PO)MDP frameworks are fundamentally
    broken, not because they are insufficiently
    powerful representations, but because they are
    too powerful. We submit that, rather than
    generalizing these models, we should be
    specializing them if we want to make progress on
    solving real problems in the real world."
  • T. Lane, W.D. Smart, "Why (PO)MDPs Lose for
    Spatial Tasks and What to Do About It", ICML
    Workshop on Rich Representations for RL, 2005.

4
What is the Issue? (Lane et al.)
  • In our efforts to formalize the notion of
    learning control, we have striven to construct
    ever more general and, putatively, powerful
    models. By the mid-1990s we had (with a little
    bit of blatant borrowing from the Operations
    Research community) arrived at the (PO)MDP
    formalism (Puterman, 1994) and grounded our RL
    methods in it (Sutton & Barto, 1998; Kaelbling
    et al., 1996; Kaelbling et al., 1998).
  • These models are mathematically elegant, have
    enabled precise descriptions and analysis of a
    wide array of RL algorithms, and are incredibly
    general. We argue, however, that their very
    generality is a hindrance in many practical
    cases.
  • In their generality, these models have discarded
    the very qualities (metric, topology, scale,
    etc.) that have proven to be so valuable for
    many, many science and engineering disciplines.

5
What is Missing in POMDPs?
  • POMDPs do not describe natural metrics in the
    environment
  • When driving, we know both global and local
    distances
  • POMDPs do not natively recognize differences
    between scales
  • Uncertainty in control is entirely different from
    uncertainty in routing
  • POMDPs conflate properties of the environment
    with properties of the agent
  • Roads and buildings behave differently from cars
    and pedestrians; we need to generalize over them
    differently
  • POMDPs are defined in a global coordinate frame,
    often discrete!
  • We may need many different representations in
    real problems

6
Specific Insight 1
  • The metric of a space imposes a speed limit on
    the agent: the agent cannot transition to
    arbitrary points in the environment in a single
    step (stated formally after this list).
  • Consequences:
  • The agent can neglect large parts of the state
    space when planning.
  • More importantly, however, this result implies
    that control experience can be generalized across
    regions of the state space.
  • If the agent learns a good policy for one bounded
    region of the state space, and it can find a
    second bounded region that is homeomorphic to the
    first, that policy can be transferred to the
    second region.
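
Stated formally, with a metric d on the state space
and a per-step displacement bound Δ (the symbols are
ours, not the slide's):

    \[
      d(s_t, s_{t+1}) \le \Delta \;\;\text{for every transition}
      \quad\Longrightarrow\quad
      s_{t+k} \in B(s_t, k\Delta) \;\;\text{for all } k \ge 0 .
    \]

Over a k-step horizon the agent can therefore only
occupy the ball of radius kΔ around its current
state, and everything outside that ball can be
ignored while planning.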

(Figure caption) Metric envelope bound for
point-to-point navigation in an open-space gridworld
environment. The outer region is the elliptical
envelope that contains 90% of the trajectory
probability mass. The inner, darker region is the
set of states occupied by an agent in a total of
10,000 steps of experience (319 trajectories from
bottom to top).
7
Insight 2: Manifold Representations
  • Informally, a manifold representation models the
    domain of the value function using a set of
    overlapping local regions, called charts.
  • Each chart has a local coordinate frame, is a
    (topological) disk, and has a (local) Euclidean
    distance metric. The collection of charts and
    their overlap regions is called a manifold.
  • We can embed partial value functions (and other
    models) on these charts, and combine them, using
    the theory of manifolds, to provide a global
    value function (or model).

(Figure annotation) 13 equivalence classes; if you
consider rotational symmetry, only 4 classes.
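
A minimal sketch of how the charts described on this
slide could be combined into a global value estimate.
The Chart structure, the weighting scheme, and the
blending rule below are illustrative assumptions, not
Lane and Smart's actual construction.

    # Hedged sketch: blend local (per-chart) value estimates into a global one.
    from dataclasses import dataclass
    from typing import Callable, List, Optional, Tuple

    State = Tuple[float, float]

    @dataclass
    class Chart:
        contains: Callable[[State], bool]        # does this chart cover the state?
        weight: Callable[[State], float]         # blending weight inside the chart
        local_value: Callable[[State], float]    # partial value function on the chart

    def global_value(charts: List[Chart], s: State) -> Optional[float]:
        """Weighted blend of the partial value functions of all charts covering s."""
        covering = [c for c in charts if c.contains(s)]
        if not covering:
            return None                          # state not covered by the manifold
        total = sum(c.weight(s) for c in covering)
        return sum(c.weight(s) * c.local_value(s) for c in covering) / total

In the overlap regions the estimates of neighbouring
charts are averaged, which is one simple way of
stitching local solutions into a single global value
function.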
8
  • What Makes Some POMDP Problems Easy to
    Approximate?
  • David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007

9
Understanding Why PBVI Works
  • Point-based algorithms have been surprisingly
    successful in computing approximately optimal
    solutions for POMDPs.
  • What are the belief-space properties that allow
    some POMDP problems to be approximated
    efficiently, explaining the point-based
    algorithms' success?

10
Hardness of POMDPs
  • Intractability due to the curse of dimensionality
  • The size of the belief space grows exponentially
    with the size of the state space, S
  • But, in recent years, good progress has been made
    in sampling the belief space and approximating
    solutions
  • Hsu et al. refer to solutions of POMDPs with
    hundreds of states computed in seconds
  • Tag problem: a robot needs to search for and tag a
    moving target (whose position is unobserved except
    when the robot bumps into it); the belief space is
    870-dimensional
  • Solved using point-based value iteration (PBVI)
    methods in under 1 minute
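
To see why dimensionality bites, recall that the
belief space is the probability simplex over states
(a standard fact; the notation is ours):

    \[
      B \;=\; \Big\{\, b \in \mathbb{R}^{|S|} \;:\; b(s) \ge 0,\;
      \textstyle\sum_{s \in S} b(s) = 1 \,\Big\},
    \]

a set of dimension |S| - 1, so covering it at
resolution δ takes on the order of (1/δ)^{|S|-1}
balls; hence the exponential blow-up, and hence the
interest in covering only the reachable part of B.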

11
Initial Observation
  • Many point-based algorithms only explore a subset
    of the belief space B, the reachable space R(b0)
  • The reachable space contains all points reachable
    from a given initial belief point b0 under
    arbitrary sequences of actions and observations
  • Is the reason for PBVI's success that the
    reachable space is small?
  • Not always: Tag has an approximately
    860-dimensional reachable space.
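
For reference, the points of R(b0) are exactly the
beliefs generated by iterating the standard Bayesian
belief update from b0 over all action-observation
sequences (T and O below denote the transition and
observation probabilities; the notation is ours):

    \[
      b_{a,o}(s') \;=\;
      \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}
           {\Pr(o \mid b, a)},
      \qquad
      \Pr(o \mid b, a) \;=\; \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s).
    \]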

12
Covering Number
  • The covering number of a space is the minimum
    number of balls of a given size needed to cover
    the space fully (stated formally after this list)
  • Hsu et al. show that an approximately optimal
    POMDP solution can be computed in time polynomial
    in the covering number of R(b0)
  • The covering number also reveals that the belief
    space for Tag behaves more like the union of some
    29-dimensional spaces rather than an
    870-dimensional space, as the robot's position is
    fully observed.
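
In symbols (our notation), the δ-covering number of a
set X, here X = R(b0), is

    \[
      C(\delta) \;=\; \min\Big\{\, |P| \;:\; P \subseteq B,\;\;
      X \subseteq \bigcup_{p \in P} B(p, \delta) \,\Big\},
    \]

where B(p, δ) is the ball of radius δ centred at p;
the approximation result above runs in time
polynomial in this quantity for the reachable space.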

13
Further Questions
  • Is it possible to compute an approximate solution
    efficiently under the weaker condition of having
    a small covering number for an optimal reachable
    space R*(b0), which contains only points in B
    reachable from b0 under an optimal policy?
  • Unfortunately, this problem is NP-hard. The
    problem remains NP-hard even if the optimal
    policies have a compact piecewise-linear
    representation using α-vectors.
  • However, given a suitable set of points that
    cover R*(b0) well, a good approximate solution
    can be computed in polynomial time.
  • Using sampling to approximate an optimal
    reachable space, and not just the reachable
    space, may be a promising approach in practice.

14
Lyapunov Design for Safe Reinforcement Learning
  • Theodore J. Perkins and Andrew G. Barto, JMLR
    2002

15
Dynamical Systems
  • Dynamical systems can be described by states and
    the evolution of states over time
  • The evolution of states is constrained by the
    dynamics of the system
  • In other words, dynamical systems are mappings
    from the current state to the next state
  • If the mapping is a contraction, the state will
    eventually converge to a fixed point
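
A tiny illustration of the last point, using an
assumed contraction map f(x) = 0.5x + 1 (contraction
factor 0.5, fixed point x = 2); the map is our
example, not one from the slides.

    # Hedged sketch: iterating a contraction mapping converges to its fixed point.
    def f(x: float) -> float:
        return 0.5 * x + 1.0      # contracts distances by a factor of 0.5

    x = 10.0                      # arbitrary initial state
    for _ in range(30):           # repeatedly apply the dynamics
        x = f(x)

    print(x)                      # ~2.0, the unique fixed point of f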

16
Reinforcement Learning: Traditional Methods
  • The target or goal state may not be a natural
    attractor
  • Hypothesis: learning is easier if the target is a
    fixed point, e.g., TD-Gammon
  • People have tried to embed domain knowledge in
    various ways
  • Known good actions are specified
  • Sub-goals are explicitly specified

17
Key Idea
  • Use Lyapunov functions to constrain action
    selection
  • This forces the RL agent to move towards the goal
  • e.g., in a gridworld, the goal is reached in a
    finite number of steps if action selection is
    Lyapunov-constrained (a minimal sketch follows
    below)
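
A minimal sketch of the idea for a gridworld, taking
the Manhattan distance to the goal as the Lyapunov
function; the environment, the action filter, and the
random tie-breaking are illustrative assumptions, not
Perkins and Barto's construction.

    # Hedged sketch: restrict the agent to actions that strictly decrease a
    # Lyapunov function L (here the Manhattan distance to the goal). Every
    # allowed step reduces L by 1, so the goal is reached in at most L(start) steps.
    import random

    GOAL = (4, 4)
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def lyapunov(state):
        return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

    def safe_actions(state):
        """Only actions that strictly decrease the Lyapunov function are allowed."""
        return [a for a, (dx, dy) in ACTIONS.items()
                if lyapunov((state[0] + dx, state[1] + dy)) < lyapunov(state)]

    state, steps = (0, 0), 0
    while state != GOAL:
        a = random.choice(safe_actions(state))   # any learner restricted to this set
        dx, dy = ACTIONS[a]
        state = (state[0] + dx, state[1] + dy)
        steps += 1

    print(steps)   # always lyapunov((0, 0)) = 8 steps here, whichever safe action is picked

The learner is still free to explore within the safe
set; the Lyapunov constraint alone guarantees that
every trial terminates at the goal.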

18
Problem Setup
  • Deterministic dynamical system
  • Evolution according to an MDP
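
A standard way to write the deterministic setup named
in the bullets (our notation, not necessarily the
slide's):

    \[
      s_{t+1} \;=\; f(s_t, a_t), \qquad a_t \in A(s_t),
      \qquad r_t \;=\; r(s_t, a_t),
    \]

i.e. the next state and reward are deterministic
functions of the current state and action, with
learning otherwise proceeding as in a standard MDP.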

19
Lyapunov Functions
  • Generalized energy functions
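
The defining property, in a form consistent with its
use for safe control (our notation; G is the goal
set, f the deterministic dynamics, Δ > 0 a fixed
descent amount); this is a summary sketch rather than
the paper's exact statement:

    \[
      L(s) \;\ge\; 0 \;\;\text{for all } s, \qquad
      L(f(s, a)) \;\le\; L(s) - \Delta
      \;\;\text{for every allowed action } a \text{ and every } s \notin G .
    \]

Restricting the agent to such actions guarantees that
the goal set is reached within at most L(s_0)/Δ
steps, whatever the learner does in between.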

20
Pendulum Problem
21
Results 1
  • A_EA and A_All had shorter trials than A_Const
  • A_EA outperformed A_All, especially at fine
    resolutions of discretization
  • A_EA trial times seemed independent of binning
  • A_Const alone never worked
  • Note: the theorem guarantees that A_EA
    monotonically increases energy.

22
Results 2
  • (1) A_EA, G2; (2) A_All, G2; (3) A_Const, G2;
    (4) A_All, sat. LQR, G1

23
Stochastic Case
24
Results: Stochastic Case
25
Some Open Questions
  • How can you improve performance using less
    sophisticated primitive actions?
  • Perkins and Barto use deep intuition to design
    local control laws, e.g., to avoid undesired
    gravity-control equilibria. How do we deal with
    this when the dynamics are less well understood?
  • Stochastic cases have rather weak guarantees. How
    can they be improved?