Reinforcement Learning: Dealing with Complexity and Safety in RL

1
Reinforcement Learning: Dealing with Complexity
and Safety in RL

Subramanian Ramamoorthy
School of Informatics
27 March, 2012

2
(Why) Isn't RL Deployed More Widely?

Very interesting discussion at
http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning,
maintained by Satinder Singh
Negative views/myths: RL is hard due to
dimensionality, partial observability, function
approximation, etc.
Positive view: There is no getting away from the
fact that RL is the proper statement of the
agent's problem. So, the question is really one
of how to solve it!

3
A Provocative Claim

"The (PO)MDP frameworks are fundamentally
broken, not because they are insufficiently
powerful representations, but because they are
too powerful. We submit that, rather than
generalizing these models, we should be
specializing them if we want to make progress on
solving real problems in the real world."
T. Lane, W.D. Smart, "Why (PO)MDPs Lose for
Spatial Tasks and What to Do About It", ICML
Workshop on Rich Representations for RL, 2005.

4
What is the Issue? (Lane et al.)

In our efforts to formalize the notion of
learning control, we have striven to construct
ever more general and, putatively, powerful
models. By the mid-1990s we had (with a little
bit of blatant borrowing from the Operations
Research community) arrived at the (PO)MDP
formalism (Puterman, 1994) and grounded our RL
methods in it (Sutton & Barto, 1998; Kaelbling et
al., 1996; Kaelbling et al., 1998).
These models are mathematically elegant, have
enabled precise descriptions and analysis of a
wide array of RL algorithms, and are incredibly
general. We argue, however, that their very
generality is a hindrance in many practical
cases.
In their generality, these models have discarded
the very qualities (metric, topology, scale,
etc.) that have proven to be so valuable for
many, many science and engineering disciplines.

5
What is Missing in POMDPs?

POMDPs do not describe natural metrics in
environment
When driving, we know both global and local
distances
POMDPs do not natively recognize differences
between scales
Uncertainty in control is entirely different from
uncertainty in routing
POMDPs conflate properties of the environment
with properties of the agent
Roads and buildings behave differently from cars
and pedestrians; we need to generalize over them
differently
POMDPs are defined in a global coordinate frame,
often discrete!
We may need many different representations in
real problems

6
Specific Insight 1

The metric of a space imposes a speed limit on the
agent: the agent cannot transition to arbitrary
points in the environment in a single step.

Consequences:
The agent can neglect large parts of the state space
when planning.
More importantly, however, this result implies
that control experience can be generalized across
regions of the state space.
If the agent learns a good policy for one bounded
region of the state space, and it can find a
second bounded region that is homeomorphic to the
first, that policy can be carried over to the
second region.

Figure: Metric envelope bound for point-to-point
navigation in an open-space gridworld
environment. The outer region is the elliptical
envelope that contains 90% of the trajectory
probability mass. The inner, darker region is
the set of states occupied by an agent in a
total of 10,000 steps of experience (319
trajectories from bottom to top).
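
The speed-limit idea can be illustrated with a minimal sketch. It assumes a hypothetical open gridworld with four-connected moves (not the exact environment from Lane and Smart) and collects the cells an agent could occupy within k steps; everything outside that set can be ignored when planning over a k-step horizon.

```python
from collections import deque

def reachable_within(start, k, width, height):
    """Cells reachable from `start` in at most k four-connected moves.

    The grid, its size, and the move set are illustrative assumptions.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (x, y), depth = frontier.popleft()
        if depth == k:
            continue  # the metric "speed limit": no further expansion
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and (nx, ny) not in seen:
                seen.add((nx, ny))
                frontier.append(((nx, ny), depth + 1))
    return seen

# Hypothetical 21 x 21 open grid, agent in the centre, 3-step horizon.
cells = reachable_within((10, 10), 3, 21, 21)
fraction = len(cells) / (21 * 21)
```

In this toy setting only a small fraction of the 441 cells needs to be considered, which is the sense in which the metric lets the agent neglect most of the state space.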
7
Insight 2: Manifold Representations

Informally, a manifold representation models the
domain of the value function using a set of
overlapping local regions, called charts.
Each chart has a local coordinate frame, is a
(topological) disk, and has a (local) Euclidean
distance metric. The collection of charts and
their overlap regions is called a manifold.
We can embed partial value functions (and other
models) on these charts, and combine them, using
the theory of manifolds, to provide a global
value function (or model).
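
As a rough sketch of how chart-local value functions might be combined, the example below blends two overlapping one-dimensional charts with normalized tent-shaped weights (a partition of unity). The blending rule, the `Chart` class, and the constant local value functions are all assumptions chosen for illustration; the slides do not specify how the combination is carried out.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Chart:
    lo: float                              # left edge in global coordinates
    hi: float                              # right edge in global coordinates
    local_value: Callable[[float], float]  # value in local coordinate u = x - lo

    def weight(self, x: float) -> float:
        """Tent-shaped bump: positive inside the chart, zero at and beyond its edges."""
        if not (self.lo <= x <= self.hi):
            return 0.0
        return min(x - self.lo, self.hi - x)

def global_value(charts: List[Chart], x: float) -> float:
    """Blend the local value functions with normalized chart weights."""
    weights = [c.weight(x) for c in charts]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is not covered by the interior of any chart")
    return sum(w * c.local_value(x - c.lo) for w, c in zip(weights, charts)) / total

# Two hypothetical charts that overlap on [1, 2].
chart_a = Chart(lo=-0.5, hi=2.0, local_value=lambda u: 1.0)
chart_b = Chart(lo=1.0, hi=3.5, local_value=lambda u: 3.0)
charts = [chart_a, chart_b]
```

Outside the overlap the global value equals the single covering chart's local value; inside the overlap it moves smoothly from one chart's estimate to the other's.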

13 equivalence classes. If rotational symmetry is
considered, only 4 classes.
8

What Makes Some POMDP Problems Easy to
Approximate?
David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007

9
Understanding Why PBVI Works

Point-based algorithms have been surprisingly
successful in computing approximately optimal
solutions for POMDPs.
What are the belief-space properties that allow
some POMDP problems to be approximated
efficiently, explaining the point-based
algorithms' success?

10
Hardness of POMDPs

Intractability due to the curse of dimensionality
Size of belief space grows exponentially with
the number of states, |S|
But, in recent years, good progress has been made
in sampling the belief space and approximating
solutions
Hsu et al. refer to solutions to a POMDP with
hundreds of states in seconds
Tag problem: a robot needs to search for a moving
tag (whose position is unobserved except when
the robot bumps into it), 870-dim belief space
Solved using PBVI methods in < 1 minute

11
Initial Observation

Many point-based algorithms only explore a subset
of the belief space B, namely R(b0), the
reachable space
The reachable space contains all points reachable
from a given initial belief point b0 under
arbitrary sequences of actions and observations
Is the reason for PBVI's success that the reachable
space is small?
Not always: Tag has an approx. 860-dim reachable
space.
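
The reachable space can be made concrete with a small sketch: starting from b0, apply the standard Bayesian belief update for every action and observation pair and collect the distinct beliefs found up to a fixed depth. The two-state model below (a "listen" action and a "reset" action) and the rounding used for de-duplication are hypothetical choices made only for illustration.

```python
def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b[s]."""
    n = len(b)
    unnorm = [
        O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnorm)
    if z == 0.0:
        return None  # observation o is impossible after action a
    return tuple(p / z for p in unnorm)

def reachable_beliefs(b0, T, O, depth, digits=6):
    """Beliefs reachable from b0 within `depth` steps (rounded to de-duplicate)."""
    def key(b):
        return tuple(round(p, digits) for p in b)
    n_actions, n_obs = len(T), len(O[0][0])
    seen = {key(b0)}
    frontier = [tuple(b0)]
    for _ in range(depth):
        next_frontier = []
        for b in frontier:
            for a in range(n_actions):
                for o in range(n_obs):
                    b2 = belief_update(b, a, o, T, O)
                    if b2 is not None and key(b2) not in seen:
                        seen.add(key(b2))
                        next_frontier.append(b2)
        frontier = next_frontier
    return seen

# Hypothetical model. Action 0 ("listen") keeps the state and observes it with
# 85% accuracy; action 1 ("reset") randomizes the state and is uninformative.
T = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]      # T[a][s][s']
O = [[[0.85, 0.15], [0.15, 0.85]], [[0.5, 0.5], [0.5, 0.5]]]  # O[a][s'][o]
one_step = reachable_beliefs((0.5, 0.5), T, O, depth=1)
```

Point-based methods sample from a set like this one instead of discretizing the whole belief simplex.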

12
Covering Number

The covering number of a space is the minimum number
of balls of a given size needed to cover the
space fully
Hsu et al. show that an approximately optimal
POMDP solution can be computed in time polynomial
in the covering number of R(b0)
The covering number also reveals that the belief
space for Tag behaves more like the union of some
29-dimensional spaces than an
870-dimensional space, as the robot's position is
fully observed.
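
A rough feel for the covering number of a sampled belief set can be obtained with a greedy cover, sketched below. This yields only an upper bound on the covering number of the sampled points (with centers restricted to the samples), not the exact quantity analysed by Hsu et al.; the L1 distance, the radius, and the sample beliefs are assumptions made for illustration.

```python
def l1(p, q):
    """L1 distance between two belief vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def greedy_cover(points, delta):
    """Choose centers so that every point lies within `delta` of some center.

    A point becomes a new center only if no existing center already covers it,
    so len(result) upper-bounds the delta-covering number of `points`.
    """
    centers = []
    for p in points:
        if all(l1(p, c) > delta for c in centers):
            centers.append(p)
    return centers

# Hypothetical sample: beliefs in a 3-state problem that happen to lie on a
# one-dimensional segment of the simplex.
beliefs = [(i / 100, 1 - i / 100, 0.0) for i in range(101)]
centers = greedy_cover(beliefs, delta=0.2)
```

Although these beliefs live in a three-state simplex, they lie on a line, so a handful of balls covers them; the same effect, at larger scale, is what makes Tag behave like a low-dimensional problem.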

13
Further Questions

Is it possible to compute an approximate solution
efficiently under the weaker condition of having
a small covering number for an optimal reachable
space R*(b0), which contains only points in B reachable
from b0 under an optimal policy?
Unfortunately, this problem is NP-hard. The
problem remains NP-hard, even if the optimal
policies have a compact piecewise-linear
representation using α-vectors.
However, given a suitable set of points that
cover R*(b0) well, a good approximate solution
can be computed in polynomial time.
Using sampling to approximate an optimal
reachable space, and not just the reachable
space, may be a promising approach in practice.

14
Lyapunov Design for Safe Reinforcement Learning

Theodore J. Perkins and Andrew G. Barto, JMLR
2002

15
Dynamical Systems

Dynamical systems can be described by states and
evolution of states over time
The evolution of states is constrained by
dynamics of the system
In other words, dynamical systems are mappings
from current state to next state
If the mapping is a contraction, the state will
eventually converge to a fixed point
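
A minimal numerical sketch of the last point: iterating a contraction mapping drives the state to its fixed point. The map f(x) = 0.5 x + 1 is an arbitrary example chosen here; it has contraction factor 0.5 and fixed point x = 2.

```python
def iterate(f, x0, steps):
    """Return the trajectory x0, f(x0), f(f(x0)), ... of length steps + 1."""
    trajectory = [x0]
    x = x0
    for _ in range(steps):
        x = f(x)
        trajectory.append(x)
    return trajectory

def f(x):
    # Example contraction: |f(x) - f(y)| = 0.5 * |x - y|
    return 0.5 * x + 1.0

trajectory = iterate(f, 10.0, 50)
errors = [abs(x - 2.0) for x in trajectory]  # distance to the fixed point
```

The distance to the fixed point halves at every step, so the state converges from any initial condition.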

16
Reinforcement Learning Traditional Methods

The target or goal state may not be a natural
attractor
Hypothesis: Learning is easier if the target is a
fixed point, e.g., TD-Gammon
People have tried to embed domain knowledge in
various ways
Known good actions are specified
Sub-goals are explicitly specified

17
Key Idea

Use Lyapunov functions to constrain action
selection
This forces the RL agent to move towards the goal,
e.g., in a grid world the goal is reached in a finite
number of steps if actions are Lyapunov-constrained
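
The grid-world example can be sketched as follows, assuming Manhattan distance to the goal as the Lyapunov function and an open grid with four moves. This is an illustrative toy, not the construction used by Perkins and Barto: the agent picks at random, but only among actions that strictly decrease the Lyapunov function, so even a purely exploratory policy reaches the goal in a bounded number of steps.

```python
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def lyapunov(state, goal):
    """Candidate Lyapunov function: Manhattan distance to the goal (an assumption)."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def step(state, action, size):
    """Deterministic move; the agent stays in place if it would leave the grid."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), size - 1)
    y = min(max(state[1] + dy, 0), size - 1)
    return (x, y)

def descending_actions(state, goal, size):
    """Actions whose successor state has a strictly smaller Lyapunov value."""
    current = lyapunov(state, goal)
    return [a for a in ACTIONS if lyapunov(step(state, a, size), goal) < current]

def run_episode(start, goal, size, rng):
    """Random exploration restricted to descending actions; returns the step count."""
    state, steps = start, 0
    while state != goal:
        action = rng.choice(descending_actions(state, goal, size))
        state = step(state, action, size)
        steps += 1
    return steps

steps_taken = run_episode((0, 0), (4, 3), size=6, rng=random.Random(0))
```

Each allowed action lowers the Lyapunov value by exactly one here, so the episode length equals the initial distance regardless of which allowed actions are chosen; a learning agent is free to optimize within the constrained action set.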

18
Problem Setup

Deterministic dynamical system
Evolution according to MDP,

19
Lyapunov Functions

Generalized energy functions

20
Pendulum Problem
21
Results 1

A_EA and A_All had shorter trials than A_Const
A_EA outperformed A_All, especially at fine
resolutions of discretization
A_EA trial times seemed independent of binning
A_Const alone never worked
Note: Theorem guarantees that A_EA monotonically
increases energy.

22
Results 2

Legend: (1) A_EA, G2; (2) A_All, G2; (3) A_Const, G2;
(4) A_All sat LQR, G1

23
Stochastic Case
24
Results Stochastic Case
25
Some Open Questions

How can you improve performance using less
sophisticated primitive actions?
Perkins and Barto use deep intuition to design
local laws, e.g., to avoid undesired
gravity-control equilibria. How do we deal with
this when the dynamics are less well understood?
Stochastic cases have rather weak guarantees. How
can they be improved?
