Title: Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking
1. Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking
Authors: Vikram Krishnamurthy, Robin Evans
Presented by: Shihao Ji, Duke University Machine Learning Group, June 10, 2005
2. Outline
- Motivation
- Overview
- Multiarmed Bandits
- HMM Multiarmed Bandits
- Experimental Results
3. Motivation
- An electronically scanned array (ESA) has only one steerable beam.
- The coordinates of each target evolve according to a finite-state Markov chain.
- Question: which single target should the tracker choose to observe at each time instant in order to optimize a specified cost function?
4. Overview: How does it work?
5. Multiarmed Bandits
- The Model
- One has $N$ parallel projects, indexed $i = 1, 2, \ldots, N$, and at each instant of discrete time can work on only a single project. Let the state of project $i$ at time $k$ be denoted $x_k^{(i)}$. If one works on project $i$ at time $k$, then one pays an immediate expected cost of $c(x_k^{(i)}, i)$. The state changes to $x_{k+1}^{(i)}$ by a Markov transition rule (which may depend upon $i$, but not upon $k$), while the states of the projects one has not touched remain unchanged: $x_{k+1}^{(j)} = x_k^{(j)}$ for $j \neq i$. The problem is how to allocate one's effort over projects sequentially in time so as to minimize the expected total discounted cost.
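A minimal simulation sketch of this setup (all names and parameters are illustrative, not from the paper): $N$ projects, each a finite-state Markov chain; working on one project per step accrues its discounted cost and advances only its chain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bandit: N projects, each an S-state Markov chain with
# its own transition matrix and per-state cost.
N, S, beta = 3, 4, 0.9
transitions = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]
costs = [rng.uniform(0.0, 1.0, size=S) for _ in range(N)]
states = [rng.integers(S) for _ in range(N)]

def work_on(i, k):
    """Work on project i at time k: pay its discounted cost and
    advance only its chain; untouched projects stay frozen."""
    cost = beta**k * costs[i][states[i]]
    states[i] = rng.choice(S, p=transitions[i][states[i]])
    return cost

# A (suboptimal) random policy, for illustration only.
total = sum(work_on(rng.integers(N), k) for k in range(100))
print("discounted cost of a random policy:", total)
```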
6. Gittins Index
- The simplest non-trivial problem of this kind, now a classic.
- It had no essential solution until the work of Gittins and his co-workers.
- They proved that to each project $i$ one can attach an index $\gamma^{(i)}(x_k^{(i)})$ such that the optimal action at time $k$ is to work on the project for which the current index is smallest. The index is calculated by solving the problem of allocating one's effort optimally between project $i$ and a standard project which yields a constant cost.
- Gittins' result thus reduces the case of general $N$ to the case $N = 2$.
7. HMM Multiarmed Bandits
- The standard multiarmed bandit problem involves a fully observed finite-state Markov chain and is simply an MDP with a rich structure.
- In multitarget tracking, due to measurement noise at the sensor, the states are only partially observable. Thus, the multitarget tracking problem must be formulated as a multiarmed bandit involving HMMs (with the HMM filter used to estimate the information state).
- It could be solved by brute force as a POMDP, but that involves a Markov chain of much higher (enormous) dimension.
- The bandit assumption decouples the problem.
8. Bandit Assumption
- The information state of the currently observed target, say target $p$, is updated by the HMM filter:
$$\pi_{k+1}^{(p)} = \frac{B^{(p)}(y_{k+1}^{(p)})\, A^{(p)\prime}\, \pi_k^{(p)}}{\mathbf{1}^{\prime}\, B^{(p)}(y_{k+1}^{(p)})\, A^{(p)\prime}\, \pi_k^{(p)}}$$
- For the other $P - 1$ unobserved targets, the information states are kept frozen:
$$\pi_{k+1}^{(q)} = \pi_k^{(q)} \quad \text{if target } q \text{ is not observed.}$$
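A sketch of this update in Python (a minimal illustration, assuming each target p has transition matrix A[p] and observation matrix B[p] with B[p][i, y] = P(y | state e_i); the names are mine, not the paper's):

```python
import numpy as np

def update_info_states(info_states, A, B, observed, y):
    """Bandit-assumption update: run the HMM filter only on the
    observed target and freeze every other information state."""
    new_states = list(info_states)
    pi = info_states[observed]
    unnormalized = B[observed][:, y] * (A[observed].T @ pi)
    new_states[observed] = unnormalized / unnormalized.sum()
    # pi_{k+1}^{(q)} = pi_k^{(q)} for every unobserved target q
    return new_states
```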
9. Why is it Valid?
- Slow dynamics: slowly moving targets have a bandit structure. Their transition matrices are nearly the identity,
$$A^{(p)} = I + \epsilon Q^{(p)},$$
where $Q^{(p)}$ is a generator matrix (nonnegative off-diagonal entries, rows summing to zero) and $\epsilon > 0$ is small.
- Decoupling approximation: without the bandit assumption, the optimal solution is intractable. The bandit model is perhaps the only reasonable approximation that leads to a computationally tractable solution.
- Reinitialization: a compromise. Reinitialize the HMM multiarmed bandit at regular intervals with updated estimates from all targets.
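A slow transition matrix of this form can be constructed as follows (the particular generator Q below is made up for illustration):

```python
import numpy as np

def slow_transition_matrix(Q, eps):
    """A = I + eps * Q, a valid stochastic matrix provided Q has
    nonnegative off-diagonal entries, rows summing to zero, and eps
    is small enough that the diagonal of A stays nonnegative."""
    A = np.eye(Q.shape[0]) + eps * Q
    assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
    return A

Q = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 1.0, -1.0]])
A = slow_transition_matrix(Q, eps=0.05)  # nearly the identity: a slow target
```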
10. Some Details
- Finite-state Markov assumption:
$$x_k^{(p)} \in \{e_1, \ldots, e_{N_p}\}$$
denotes the quantized distance of the $p$th target from the base station, and the target distance evolves according to a finite-state Markov chain.
- Cost structure:
The instantaneous cost $c(e_i, p)$ typically depends on the distance of the $p$th target to the base station; i.e., a target close to the base station poses a greater threat and is given higher priority by the tracking algorithm.
- Objective function: choose the scheduling policy $\mu$ to minimize the expected total discounted cost
$$J_\mu = \mathbb{E}\left\{ \sum_{k=0}^{\infty} \beta^k\, c\big(x_k^{(u_k)}, u_k\big) \right\},$$
where $u_k \in \{1, \ldots, P\}$ is the target observed at time $k$ and $0 \le \beta < 1$ is the discount factor.
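As a concrete (made-up) example of such a cost structure, the per-state cost can decrease with the quantized distance, and the expected immediate cost under information state $\pi$ is $c^{\prime}\pi$:

```python
import numpy as np

# Hypothetical threat-based cost: distance bin 0 is closest to the
# base station, so it carries the largest cost (highest priority).
def distance_cost_vector(num_bins):
    return np.linspace(1.0, 0.1, num_bins)

def expected_cost(cost_vec, pi):
    """c' pi: expected immediate cost of a target with info state pi."""
    return float(cost_vec @ pi)

c = distance_cost_vector(5)
pi = np.array([0.5, 0.2, 0.1, 0.1, 0.1])   # mass near the base station
print(expected_cost(c, pi))                 # relatively high cost
```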
11. Optimal Solution
- Under the bandit assumption, the optimal solution has an indexable (decoupling) rule; that is, the optimization can be decoupled into $P$ independent optimization problems.
- For each target $p$, there is a function (the Gittins index) $\gamma^{(p)}(\pi_k^{(p)})$, computed by POMDP algorithms; see the next slide.
- The optimal scheduling policy at time $k$ is to steer the beam toward the target with the smallest Gittins index:
$$u_k = \arg\min_{p \in \{1, \ldots, P\}} \gamma^{(p)}\big(\pi_k^{(p)}\big).$$
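Once the per-target index functions are available, the scheduler itself is a one-line argmin (a sketch; gittins here stands for hypothetical precomputed index functions):

```python
import numpy as np

def schedule_beam(info_states, gittins):
    """Steer the beam to the target with the smallest Gittins index.
    gittins[p] maps the information state of target p to its index."""
    values = [gittins[p](pi) for p, pi in enumerate(info_states)]
    return int(np.argmin(values))
```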
12. Gittins Index
- For an arbitrary multiarmed bandit problem, the Gittins index can be calculated by solving an associated infinite-horizon discounted control problem called the "return to state."
- For target $p$, given information state $\pi_k^{(p)}$ at time $k$, there are two actions:
- 1) Continue, which incurs a cost $c(\pi_k^{(p)}, p)$, and the information state evolves according to the HMM filter.
- 2) Restart, which moves the information state to a fixed $\pi^{0}$, incurs a cost $c(\pi^{0}, p)$, and then evolves according to the HMM filter.
13. Gittins Index (cont.)
- The Gittins index $\gamma^{(p)}(\pi)$ of the information state $\pi$ of target $p$ is given by
$$\gamma^{(p)}(\pi) = V^{(p)}(\pi, \pi),$$
where $V^{(p)}(\pi, \pi^{0})$ satisfies the Bellman equation
$$V^{(p)}(\pi, \pi^{0}) = \min\Big\{ c^{(p)\prime}\pi + \beta \sum_{y} V^{(p)}\big(T^{(p)}(\pi, y), \pi^{0}\big)\, \sigma^{(p)}(\pi, y),\ \ c^{(p)\prime}\pi^{0} + \beta \sum_{y} V^{(p)}\big(T^{(p)}(\pi^{0}, y), \pi^{0}\big)\, \sigma^{(p)}(\pi^{0}, y) \Big\},$$
with $T^{(p)}(\pi, y)$ the HMM filter update and $\sigma^{(p)}(\pi, y)$ the probability of observing $y$ under $\pi$.
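A minimal value-iteration sketch of this return-to-state computation for a single two-state target, discretizing the scalar information state on a grid (all model parameters below are illustrative; the paper instead solves the equivalent POMDP exactly, as on the next slide):

```python
import numpy as np

# Illustrative two-state target: pi is summarized by p = P(x = e_1).
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])       # transition matrix
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])       # B[i, y] = P(y | x = e_i)
c = np.array([1.0, 0.2])         # per-state cost
beta = 0.9

def filter_step(pi, y):
    """Return T(pi, y) and sigma(pi, y) for the HMM filter."""
    un = B[:, y] * (A.T @ pi)
    return un / un.sum(), un.sum()

def gittins_index(p0, n_grid=101, n_iter=200):
    """Value iteration for V(., pi0) on a grid; gamma(pi0) = V(pi0, pi0)."""
    grid = np.linspace(0.0, 1.0, n_grid)
    V = np.zeros(n_grid)
    for _ in range(n_iter):
        def branch(p):  # one-step cost-to-go if we continue from p
            pi = np.array([p, 1.0 - p])
            total = c @ pi
            for y in (0, 1):
                T, sigma = filter_step(pi, y)
                total += beta * sigma * np.interp(T[0], grid, V)
            return total
        restart = branch(p0)  # the restart branch is the same at every pi
        V = np.array([min(branch(p), restart) for p in grid])
    return float(np.interp(p0, grid, V))

print(gittins_index(0.5))
```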
14. POMDP Solver
- Defining new parameters (see eq. (15) of the paper), the return-to-state problem becomes a standard POMDP.
- It can then be solved by any standard POMDP solver, such as Sondik's algorithm, the Witness algorithm, or Incremental Pruning, or by suboptimal (approximate) algorithms.
15. Experimental Results