Decision Making in Intelligent Systems, Lecture 9
1
Decision Making in Intelligent Systems, Lecture 9
  • BSc course Kunstmatige Intelligentie 2008
  • Bram Bakker
  • Intelligent Systems Lab Amsterdam
  • Informatics Institute
  • Universiteit van Amsterdam
  • bram@science.uva.nl

2
Overview of this lecture
  • Last lecture!
  • Illustrate trade-offs and issues that arise in real applications
  • Illustrate the use of domain knowledge
  • Describe some RL topics that the UvA is working on

3
Final exam May 2007
  • Time/place still unknown (to me)!
  • Check the website

http://staff.science.uva.nl/bram/DMIS/
4
Case study 1: TD-Gammon
Tesauro 1992, 1994, 1995, ...
  • White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
  • Objective is to advance all pieces to points 19-24
  • 30 pieces and 24 locations imply an enormous number of configurations
  • Effective branching factor of 400

5
A Few Details
  • Reward: 0 at all times except when the game is won, when it is 1
  • Episodic (game = episode), undiscounted
  • Gradient-descent TD(λ) with a multi-layer neural network
  • weights initialized to small random numbers
  • backpropagation of TD error
  • four input units for each point: unary encoding of the number of white pieces, plus other features
  • Learning during self-play (see the sketch below)
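A minimal sketch (ours, not Tesauro's code) of the gradient-descent TD(λ) update with eligibility traces, using a linear value function for brevity; TD-Gammon used a multi-layer network trained the same way, with backpropagation supplying the gradient:

import numpy as np

def td_lambda_episode(features, rewards, w, alpha=0.1, lam=0.7, gamma=1.0):
    # features: feature vectors x_0..x_{T-1} for the positions visited;
    # rewards: r_1..r_T, all 0 except the final win/loss outcome;
    # the value after the terminal position is taken to be 0.
    e = np.zeros_like(w)                       # eligibility trace
    for t, r in enumerate(rewards):
        v = w @ features[t]
        v_next = w @ features[t + 1] if t + 1 < len(features) else 0.0
        delta = r + gamma * v_next - v         # TD error
        e = gamma * lam * e + features[t]      # trace accumulates the gradient of v
        w += alpha * delta * e                 # gradient-descent TD(lambda) step
    return w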

6
Multi-layer Neural Network
7
Summary of TD-Gammon Results
8
Samuel's Checkers Player
Arthur Samuel 1959, 1967
  • Minimax to determine the backed-up score of a position
  • Rote learning: save each board configuration encountered together with its backed-up score (sketched below)
  • Learning similar to the TD algorithm
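A toy sketch of the rote-learning idea (names are ours; minimax_score stands in for Samuel's search procedure):

rote_memory = {}  # board configuration -> backed-up score

def backed_up_score(board, depth, minimax_score):
    # Save each board configuration encountered together with its
    # backed-up minimax score, so deeper search results are reused.
    if board not in rote_memory:
        rote_memory[board] = minimax_score(board, depth)
    return rote_memory[board]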

9
Samuel's Backups
10
The Basic Idea
". . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."
A. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," 1959
11
More Samuel Details
  • Did not include explicit rewards
  • Instead used the so-called piece advantage feature
  • No special treatment of terminal positions
  • Generalization method produced better-than-average play: tricky but beatable
  • Supervised mode: book learning

12
The Acrobot
Spong 1994
13
Acrobot Learning Curves for Sarsa(λ)
14
Typical Acrobot Learned Behavior
15
Elevator Dispatching
Crites and Barto 1996
16
Control Strategies
  • Zoning: divide the building into zones; park in zone when idle. Robust in heavy traffic.
  • Search-based methods: greedy or non-greedy. Receding Horizon control.
  • Rule-based methods: expert systems/fuzzy logic from human experts
  • Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)
  • Adaptive/Learning methods: NNs for prediction, parameter-space search using simulation, DP on a simplified model, non-sequential RL

17
The Elevator Model (from Lewis, 1991)
Discrete Event System: continuous time, asynchronous elevator operation
  • Parameters:
  • Floor Time (time to move one floor at max speed): 1.45 secs.
  • Stop Time (time to decelerate, open and close doors, and accelerate again): 7.19 secs.
  • Turn Time (time needed by a stopped car to change direction): 1 sec.
  • Load Time (the time for one passenger to enter or exit a car): a random variable ranging from 0.6 to 6.0 secs, with a mean of 1 sec.
  • Car Capacity: 20 passengers

Traffic Profile: Poisson arrivals with rates changing every 5 minutes; down-peak (see the config sketch below)
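Collected as a simulation config, the slide's parameters might look as follows (a sketch; class and function names are ours, the values come from the slide):

import random
from dataclasses import dataclass

@dataclass
class ElevatorModel:
    floor_time: float = 1.45              # secs to move one floor at max speed
    stop_time: float = 7.19               # secs to decelerate, open/close doors, re-accelerate
    turn_time: float = 1.0                # secs for a stopped car to change direction
    load_time_range: tuple = (0.6, 6.0)   # per-passenger load time in secs, mean about 1
    car_capacity: int = 20                # passengers

def next_arrival_gap(rate_per_minute):
    # Poisson arrivals: inter-arrival gaps are exponential; the rate is
    # re-read from the (down-peak) traffic profile every 5 minutes.
    return random.expovariate(rate_per_minute)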
18
State Space
  • 18 hall call buttons: 2^18 combinations
  • positions and directions of cars: 18^4 (rounding to nearest floor)
  • motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6^4
  • 40 car buttons: 2^40
  • Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving the elapsed time since each hall button was pushed; we discretize these.
  • Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states
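Rough arithmetic behind these numbers (our illustration; the raw product over-counts because many combinations are unreachable, which is why the slide's figure of about 10^22 is a conservative one):

import math

raw = (2 ** 18) * (18 ** 4) * (6 ** 4) * (2 ** 40)
print(f"raw product of the factors above ~ 10^{math.log10(raw):.1f}")
# prints ~ 10^25.6; constraints and unreachable combinations bring the
# conservative estimate down to roughly 10^22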
19
Actions
  • When moving (halfway between floors):
  • stop at next floor
  • continue past next floor
  • When stopped at a floor:
  • go up
  • go down

20
Constraints
  • Standard:
  • A car cannot pass a floor if a passenger wants to get off there
  • A car cannot change direction until it has serviced all onboard passengers traveling in the current direction
  • Special heuristics:
  • Don't stop at a floor if another car is already stopping, or is stopped, there
  • Don't stop at a floor unless someone wants to get off there
  • Given a choice, always move up

(a sketch of these constraints as an action filter follows)
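A sketch of how these constraints act as an action filter; the car/state predicates are hypothetical helpers, not from the original simulator:

def allowed_actions(car, state):
    if car.moving:
        if car.passenger_wants_off_at_next_floor(state):
            return ["stop"]                    # cannot pass that floor
        can_stop = (state.someone_wants_service(car.next_floor)
                    and not state.other_car_stopping(car.next_floor))
        return ["stop", "continue"] if can_stop else ["continue"]
    # stopped: cannot reverse until all onboard passengers traveling in
    # the current direction are serviced; given a choice, move up
    if car.onboard_passengers_in_current_direction(state):
        return [car.direction]
    return ["up"]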
21
Performance Criteria
Minimize:
  • Average wait time
  • Average system time (wait + travel time)
  • % waiting > T seconds (e.g., T = 60)
  • Average squared wait time (to encourage fast and fair service)

22
Average Squared Wait Time
  • Instantaneous cost: the sum of squared wait times of the currently waiting passengers
  • Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

    Σ_{t=0..∞} γ^t r_t   becomes   ∫_0^∞ e^(−βτ) c_τ dτ
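A numerical sketch of that integral between two decision events (our quadrature illustration; β, dt and the names are ours):

import numpy as np

def discounted_wait_cost(t1, t2, arrival_times, beta=0.01, dt=0.1):
    # Approximates integral_{t1}^{t2} e^(-beta*tau) * c_tau dtau, where
    # c_tau = sum over waiting passengers p of (tau - a_p)^2, with a_p
    # passenger p's arrival time (i.e., squared wait at time tau).
    taus = np.arange(t1, t2, dt)
    a = np.asarray(arrival_times)
    c = np.array([((tau - a) ** 2).sum() for tau in taus])
    return float(np.sum(np.exp(-beta * taus) * c) * dt)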
23
Algorithm
24
Neural Networks
47 inputs, 20 sigmoid hidden units, 1 or 2 output units
Inputs:
  • 9 binary: state of each hall down button
  • 9 real: elapsed time of each hall down button, if pushed
  • 16 binary, one on at a time: position and direction of the car making the decision
  • 10 real: location/direction of the other cars ("footprint")
  • 1 binary: at highest floor with a waiting passenger?
  • 1 binary: at floor with longest-waiting passenger?
  • 1 bias unit: constant 1

(assembling this vector is sketched below)
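A sketch of assembling that 47-dimensional input vector (layout per the slide; the helper arguments are hypothetical):

import numpy as np

def build_input(hall_down9, elapsed9, car_pos_dir16, other_cars10,
                at_highest_waiting, at_longest_waiting):
    x = np.concatenate([
        np.asarray(hall_down9, float),     # 9 binary: hall down buttons
        np.asarray(elapsed9, float),       # 9 real: elapsed button times
        np.asarray(car_pos_dir16, float),  # 16 binary, one-hot: deciding car
        np.asarray(other_cars10, float),   # 10 real: other cars' footprint
        [float(at_highest_waiting)],       # 1 binary
        [float(at_longest_waiting)],       # 1 binary
        [1.0],                             # bias unit, constant 1
    ])
    assert x.shape == (47,)
    return x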

25
Elevator Results
26
Dynamic Channel Allocation
Singh and Bertsekas 1997
27
Summary
  • RL can lead to successful applications
  • Background knowledge is important
  • Learning directly in the real world is rarely possible
  • You need a more or less accurate simulation
  • Function approximation (e.g. neural networks) is important for dealing with large state spaces

28
Frontier Dimensions
  • Smart function approximation
  • Non-Markov case
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations
  • Modularity and hierarchies
  • Learning and planning at several different levels
  • Theory of options, MAXQ
  • Multi-agent RL

29
Adaptive resolution function approximation
  • Learn, in your state(-action) space:
  • where you can generalize over many states (coarse-grained view)
  • where you must distinguish between states (fine-grained view)
  • Learn based on experienced rewards
  • Returns guide the formation of boundaries between regions in state(-action) space (see the sketch below)
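A minimal 1-D sketch of the idea (thresholds and structure are ours): keep return statistics per region and split a region when the returns observed inside it disagree too much:

import statistics

class Region:
    def __init__(self, low, high):
        self.low, self.high = low, high   # 1-D interval for simplicity
        self.returns = []                 # returns observed in this region
        self.children = None

    def observe(self, s, g, split_var=1.0, min_samples=20):
        if self.children is not None:     # descend to the finest region
            self.children[s >= (self.low + self.high) / 2].observe(s, g)
            return
        self.returns.append(g)
        if (len(self.returns) >= min_samples
                and statistics.pvariance(self.returns) > split_var):
            mid = (self.low + self.high) / 2   # need a fine-grained view here
            self.children = (Region(self.low, mid), Region(mid, self.high))

    def value(self):                      # coarse value over the region
        return statistics.fmean(self.returns) if self.returns else 0.0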

30
Frontier Dimensions
  • Smart function approximation
  • Non-Markov case
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations
  • Modularity and hierarchies
  • Learning and planning at several different levels
  • Theory of options, MAXQ
  • Multi-agent RL

31
Architectures: NNs, fully observable MDPs
Direct value function approximation
Actor-critic
32
Architectures: partially observable case
Direct value function approximation
Actor-critic
33
Long Short-Term Memory (LSTM)
The memory cells can learn to remember relevant information from the time series of inputs for long periods of time (e.g. Hochreiter & Schmidhuber, 1997; Bakker, 2001)
34
Frontier Dimensions
  • Smart function approximation
  • Non-Markov case
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations
  • Modularity and hierarchies
  • Learning and planning at several different levels
  • Theory of options, MAXQ
  • Multi-agent RL

35
Hierarchical methods
[Diagram: a tree of policies, from a HIGH-level policy for the overall task, through policies for subtasks, down to LOW-level policies for the lowest-level subtasks]
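One common formalization of such a policy hierarchy is the options framework (Sutton, Precup & Singh); a minimal sketch of its pieces (our own rendering):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    initiation: Callable[[Any], bool]    # states where the subtask may start
    policy: Callable[[Any], Any]         # the subtask's own low-level policy
    termination: Callable[[Any], float]  # probability of finishing in a state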
36
Frontier Dimensions
  • Smart function approximation
  • Non-Markov case
  • Partially Observable MDPs (POMDPs)
  • Bayesian approach: belief states
  • construct state from sequence of observations
  • Modularity and hierarchies
  • Learning and planning at several different levels
  • Theory of options, MAXQ
  • Multi-agent RL

37
Multi-agent RL
  • Structure the overall task such that a team of agents (rather than a single agent) can solve it
  • Task must be decomposable in this way
  • Find a way of distributing rewards between agents
  • Agents must be rewarded for good contributions to the overall team
  • Agents must not be rewarded for bad contributions or selfish behavior

38
Traffic simulator
39
Approach
  • Model-based multi-agent reinforcement learning
  • Traffic behavior model is estimated online (maximum likelihood model; sketched below)
  • Value function/policy is estimated online using approximate real-time dynamic programming
  • Each traffic-light junction makes a locally optimal decision by using the value function and sensing the local cars around the junction
  • Recently an explicit coordination mechanism was added by means of coordination graphs and max-plus
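A tabular sketch of the "maximum likelihood model estimated online" (class and method names are ours): count observed transitions and normalize:

from collections import defaultdict

class MLTransitionModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1       # one observed transition

    def prob(self, s, a, s_next):
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0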

40
3D visualisation of the traffic simulator
Thanks to Matthijs Amelink
41
Multi-agent example: RoboCup simulation league
  • Kok & Vlassis (2002-2006)

42
Studying for the final exam
  • Literature:
  • Sutton & Barto (1998) book
  • Kaelbling, Littman, & Cassandra (1998), "Planning and acting in partially observable stochastic domains," AI journal article (website), Sections 1-3.
  • Questions will test:
  • General insight into important issues
  • Ability to apply the mathematics of RL
  • Use the slides!
  • Open book
  • Try answering some questions at the end of each chapter

43
Final exam May 2008
  • ? May
  • Place: ?
  • This information will be posted on the DMIS website

http://staff.science.uva.nl/bram/DMIS/