Title: Reinforcement Learning: Dealing with Complexity and Safety in RL
1. Reinforcement Learning: Dealing with Complexity and Safety in RL
- Subramanian Ramamoorthy
- School of Informatics
- 27 March, 2012
2. (Why) Isn't RL Deployed More Widely?
- Very interesting discussion at http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning, maintained by Satinder Singh
- Negative views/myths: RL is hard due to dimensionality, partial observability, function approximation, etc.
- Positive view: There is no getting away from the fact that RL is the proper statement of the agent's problem. So the question is really one of how to solve it!
3. A Provocative Claim
- "The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world."
- T. Lane, W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.
4. What is the Issue? (Lane et al.)
- In our efforts to formalize the notion of learning control, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant borrowing from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).
- These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.
- In their generality, these models have discarded the very qualities (metric, topology, scale, etc.) that have proven to be so valuable for many, many science and engineering disciplines.
5. What is Missing in POMDPs?
- POMDPs do not describe natural metrics in the environment
  - When driving, we know both global and local distances
- POMDPs do not natively recognize differences between scales
  - Uncertainty in control is entirely different from uncertainty in routing
- POMDPs conflate properties of the environment with properties of the agent
  - Roads and buildings behave differently from cars and pedestrians; we need to generalize over them differently
- POMDPs are defined in a global coordinate frame, often discrete!
  - We may need many different representations in real problems
6. Specific Insight 1
- The metric of a space imposes a "speed limit" on the agent: the agent cannot transition to arbitrary points in the environment in a single step.
- Consequences:
  - The agent can neglect large parts of the state space when planning.
  - More importantly, however, this result implies that control experience can be generalized across regions of the state space.
  - If the agent learns a good policy for one bounded region of the state space, and it can find a second bounded region that is homeomorphic to the first, then the policy can be carried over to the second region.
Metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).
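
The "speed limit" idea can be illustrated with a minimal sketch. It assumes an open gridworld with four unit moves per step; the function name `reachable_within` and the grid size are hypothetical, chosen only for illustration. The point is that after k steps the agent can only be inside a small neighbourhood of its start, so a planner may ignore the rest of the state space.

```python
from collections import deque


def reachable_within(start, steps, width, height):
    """Return the set of grid cells reachable from `start` in at most `steps` moves."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # assumed action set: unit moves
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        d = dist[(x, y)]
        if d == steps:
            continue  # speed limit reached: do not expand further
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in dist:
                dist[nxt] = d + 1
                queue.append(nxt)
    return set(dist)


# A 21 x 21 grid has 441 cells, but only 25 are reachable in 3 steps.
cells = reachable_within((10, 10), 3, 21, 21)
print(len(cells), "of", 21 * 21, "cells are reachable")
```

With stochastic dynamics the envelope would be probabilistic, as in the figure, but the same locality argument applies.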
7. Insight 2: Manifold Representations
- Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.
- Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric. The collection of charts and their overlap regions is called a manifold.
- We can embed partial value functions (and other models) on these charts, and combine them, using the theory of manifolds, to provide a global value function (or model).
13 equivalence classes. If you consider rotational symmetry, only 4 classes.
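
One way to picture "combining partial value functions on charts" is a weighted blend over the charts that cover a point. The sketch below is a one-dimensional toy under stated assumptions: the `Chart` class, the linear `bump` weight, and the constant local value functions are all hypothetical, and normalized-weight blending is a standard construction rather than necessarily the method used in the cited work.

```python
def bump(t):
    """Weight that is 1 at the chart centre (t = 0) and falls linearly to 0 at |t| = 1."""
    return max(0.0, 1.0 - abs(t))


class Chart:
    """A 1-D chart: an interval with its own local coordinate and local value function."""

    def __init__(self, centre, radius, local_value):
        self.centre = centre
        self.radius = radius
        self.local_value = local_value  # function of the LOCAL coordinate

    def to_local(self, x):
        return (x - self.centre) / self.radius

    def weight(self, x):
        return bump(self.to_local(x))


def global_value(charts, x):
    """Blend the local value functions of all charts covering x, using normalized weights."""
    weights = [c.weight(x) for c in charts]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is not covered by any chart")
    blended = sum(w * c.local_value(c.to_local(x)) for w, c in zip(weights, charts))
    return blended / total


# Two overlapping charts with different (constant) local values.
charts = [Chart(0.0, 1.0, lambda t: 1.0), Chart(1.5, 1.0, lambda t: 3.0)]
print(global_value(charts, 0.0))   # only the first chart covers x = 0
print(global_value(charts, 0.75))  # overlap region: equal weights, so the average
```

Inside a single chart the global value equals that chart's local value; in the overlap it interpolates between the two.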
8.
- What Makes Some POMDP Problems Easy to Approximate?
- David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007
9. Understanding Why PBVI Works
- Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.
- What are the belief-space properties that allow some POMDP problems to be approximated efficiently, explaining the point-based algorithms' success?
10. Hardness of POMDPs
- Intractability due to the curse of dimensionality
- Size of the belief space grows exponentially with the number of states, |S|
- But, in recent years, good progress has been made in sampling the belief space and approximating solutions
- Hsu et al. refer to solutions to a POMDP with hundreds of states in seconds
- Tag problem: a robot needs to search for a moving tag (whose position is unobserved except when the robot bumps into it), 870-dim space
- Solved using PBVI methods in < 1 minute
11. Initial Observation
- Many point-based algorithms only explore a subset R(b0) of the belief space B, the reachable space
- The reachable space contains all points reachable from a given initial belief point b0 under arbitrary sequences of actions and observations
- Is the reason for PBVI's success that the reachable space is small?
- Not always: Tag has an approx. 860-dim reachable space.
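
The reachable space can be made concrete with a small sketch: starting from b0, apply the standard Bayes belief update for every action and observation, and collect the distinct beliefs. The array layout (`T[a][s][s2]` for transitions, `Z[a][s2][o]` for observations), the rounding used to merge near-identical beliefs, and the two-state toy model are assumptions for illustration only.

```python
def belief_update(b, a, o, T, Z):
    """Bayes filter: b'(s2) is proportional to Z[a][s2][o] * sum_s T[a][s][s2] * b[s]."""
    n = len(b)
    unnorm = [
        Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnorm)
    if total == 0.0:
        return None  # observation o cannot occur after action a from belief b
    return tuple(p / total for p in unnorm)


def reachable_beliefs(b0, T, Z, depth, ndigits=6):
    """Enumerate beliefs reachable from b0 within `depth` action/observation steps."""
    start = tuple(round(p, ndigits) for p in b0)
    reached = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for b in frontier:
            for a in range(len(T)):
                for o in range(len(Z[a][0])):
                    b2 = belief_update(b, a, o, T, Z)
                    if b2 is None:
                        continue
                    b2 = tuple(round(p, ndigits) for p in b2)
                    if b2 not in reached:
                        reached.add(b2)
                        next_frontier.add(b2)
        frontier = next_frontier
    return reached


# Toy model (hypothetical): 2 states, 1 action that leaves the state unchanged,
# and a sensor that reports the true state with probability 0.8.
T = [[[1.0, 0.0], [0.0, 1.0]]]
Z = [[[0.8, 0.2], [0.2, 0.8]]]
reached = reachable_beliefs((0.5, 0.5), T, Z, depth=1)
print(sorted(reached))
```

Exhaustive enumeration grows quickly with depth, which is why point-based methods sample from this set instead.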
12. Covering Number
- The covering number of a space is the minimum number of balls of a given size needed to cover the space fully
- Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0)
- The covering number also reveals that the belief space for Tag behaves more like the union of some 29-dimensional spaces rather than an 870-dimensional space, as the robot's position is fully observed.
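
Computing an exact covering number is hard in general, but a greedy pass over sampled belief points gives a valid cover and hence an upper bound for that sample. The sketch below assumes L1 distance between beliefs and uses a hypothetical helper `greedy_cover`; it illustrates the definition and is not the construction used by Hsu et al.

```python
def l1(p, q):
    """L1 distance between two belief vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def greedy_cover(points, radius):
    """Return centres such that every point is within `radius` of some centre.

    A point becomes a new centre only if no existing centre already covers it,
    so len(result) upper-bounds the covering number of the sampled set.
    """
    centres = []
    for p in points:
        if all(l1(p, c) > radius for c in centres):
            centres.append(p)
    return centres


# Five sampled beliefs over two states, clustered around three locations.
samples = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.5, 0.5)]
centres = greedy_cover(samples, radius=0.25)
print(len(centres), "balls of radius 0.25 cover the sample")  # 3
```

A small cover relative to the number of samples suggests that the beliefs concentrate on a low-dimensional structure, which is the effect described for Tag.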
13. Further Questions
- Is it possible to compute an approximate solution efficiently under the weaker condition of having a small covering number for an optimal reachable space R*(b0), which contains only points in B reachable from b0 under an optimal policy?
- Unfortunately, this problem is NP-hard. The problem remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors.
- However, given a suitable set of points that cover R*(b0) well, a good approximate solution can be computed in polynomial time.
- Using sampling to approximate an optimal reachable space, and not just the reachable space, may be a promising approach in practice.
14. Lyapunov Design for Safe Reinforcement Learning
- Theodore J. Perkins and Andrew G. Barto, JMLR 2002
15. Dynamical Systems
- Dynamical systems can be described by states and the evolution of states over time
- The evolution of states is constrained by the dynamics of the system
- In other words, dynamical systems are mappings from current state to next state
- If the mapping is a contraction, the state will eventually converge to a fixed point
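
The last point can be checked numerically with a minimal sketch. The map f(x) = 0.5x + 1 is a hypothetical example chosen because it is a contraction (it halves distances) with fixed point x = 2; the helper `iterate` is likewise illustrative.

```python
def iterate(f, x0, tol=1e-10, max_steps=1000):
    """Apply x <- f(x) repeatedly until successive states differ by less than `tol`."""
    x = x0
    for step in range(max_steps):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt, step + 1
        x = nxt
    raise RuntimeError("did not converge within max_steps")


def f(x):
    # |f(x) - f(y)| = 0.5 * |x - y|, so f is a contraction; its fixed point solves x = 0.5x + 1.
    return 0.5 * x + 1.0


for start in (10.0, -5.0):
    fixed_point, steps = iterate(f, start)
    print(start, "->", round(fixed_point, 6), "after", steps, "steps")
```

Both starting states end up at the same point, which is what makes a fixed point a natural attractor.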
16. Reinforcement Learning: Traditional Methods
- The target or goal state may not be a natural attractor
- Hypothesis: Learning is easier if the target is a fixed point, e.g., TD-Gammon
- People have tried to embed domain knowledge in various ways
  - Known good actions are specified
  - Sub-goals are explicitly specified
17. Key Idea
- Use Lyapunov functions to constrain action selection
- This forces the RL agent to move towards the goal
- e.g., in a grid world, the goal is reached in a finite number of steps if action selection is Lyapunov-constrained
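
The grid-world example can be sketched as follows. The assumptions are loud ones: an open grid, Manhattan distance to the goal as the Lyapunov function, and a random choice among the permitted actions standing in for whatever the learning agent would pick. All names are hypothetical; this is not the construction from Perkins and Barto, only an illustration of constraining action selection.

```python
import random

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def lyapunov(state, goal):
    """Assumed Lyapunov function: Manhattan distance to the goal."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


def step(state, action, size):
    """Deterministic move, clipped at the grid boundary."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), size - 1)
    y = min(max(state[1] + dy, 0), size - 1)
    return (x, y)


def safe_actions(state, goal, size):
    """Keep only actions that strictly decrease the Lyapunov function."""
    here = lyapunov(state, goal)
    return [a for a in ACTIONS if lyapunov(step(state, a, size), goal) < here]


def run_episode(start, goal, size, rng):
    """Choose arbitrarily among safe actions; return the number of steps to the goal."""
    state, steps = start, 0
    while state != goal:
        action = rng.choice(safe_actions(state, goal, size))
        state = step(state, action, size)
        steps += 1
    return steps


# Whatever the agent picks among the safe actions, it needs exactly L(start) = 7 steps.
print(run_episode((0, 0), (3, 4), 10, random.Random(0)))  # 7
```

The agent is still free to learn which safe action is best; the constraint only removes choices that move it away from the goal.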
18. Problem Setup
- Deterministic dynamical system
- Evolution according to MDP,
19. Lyapunov Functions
- Generalized energy functions
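
A descent condition of the following kind is one common way to make "generalized energy function" precise. It is a sketch under assumptions: the symbols L (the Lyapunov function), T (the goal set), and Δ (a minimum decrease per step) are introduced here for illustration and do not appear on the slide.

```latex
L : S \to \mathbb{R}, \qquad L(s) > 0 \ \text{ for all } s \notin T,
\qquad
L(s) - L(s') \;\ge\; \Delta \;>\; 0
\quad \text{whenever } s \notin T \text{ and } s' \text{ is the successor of } s .
```

Under such a condition the system started at s_0 must enter T within at most \lceil L(s_0)/\Delta \rceil steps, since L cannot decrease by Δ more often than that while staying positive.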
20. Pendulum Problem
21. Results 1
- AEA and AAll had shorter trials than AConst
- AEA outperformed AAll, especially at fine resolutions of discretization
- AEA trial times seemed independent of binning
- AConst alone never worked
- Note: the theorem guarantees that AEA monotonically increases energy.
22. Results 2
- Legend: (1) AEA, G2; (2) AAll, G2; (3) AConst, G2; (4) AAll sat LQR, G1
23. Stochastic Case
24. Results: Stochastic Case
25. Some Open Questions
- How can you improve performance using less sophisticated primitive actions?
- Perkins and Barto use deep intuition to design local laws, e.g., to avoid undesired gravity-control equilibria. How do we deal with this when the dynamics are less well understood?
- Stochastic cases have rather weak guarantees. How can they be improved?