1
Combining Exploration and Exploitation in
Landmine Detection
Lihan He, Shihao Ji and Lawrence Carin ECE, Duke
University
2
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

3
Landmine detection (1)
  • Landmine detection
  • By robots instead of human beings
  • Underlying model controlling the robot: POMDP
  • Multiple sensors
  • A single sensor is sensitive to only certain types
    of objects
  • EMI sensor: conductivity
  • GPR sensor: dielectric properties
  • Seismic sensor: mechanical properties
  • Multiple complementary sensors improve
    detection performance

4
Landmine detection (2)
  • Statement of the problem
  • Given a minefield where some landmines and
    clutter are buried underground
  • Two types of sensors are available: an EMI sensor
    and a GPR sensor
  • Each sensing action has a cost
  • Correct / incorrect declarations have a
    corresponding reward / penalty

How can we develop a strategy to effectively find
the landmines in this minefield with minimal
cost?
Underlying questions:
  • How to optimally choose sensing positions in the
    field, so as to use as few sensing points as
    possible to find the landmines?
  • How to optimally choose sensors in each sensing
    position?
  • When to sense and when to declare?

5
Landmine detection (3)
  • Solution sketch

A partially observable Markov decision process
(POMDP) model is built to solve this problem,
since it provides an approach to selecting actions
(sensor deployment, sensing positions and
declarations) optimally based on maximal reward /
minimal cost.
  • Lifelong learning
  • The robot learns the model at the same time as it
    moves and senses in the minefield (combining
    exploration and exploitation)
  • The model is updated based on the exploration process.

6
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

7
POMDP (1)
POMDP = HMM + controllable actions + rewards
A POMDP is a model of an agent interacting
synchronously with its environment. The agent
takes as input the observations of the
environment, estimates the state according to the
observed information, and then generates as
output actions based on its policy. Through these
repeated observation-action loops, the agent seeks
maximal reward or, equivalently, minimal cost.
8
POMDP (2)
A POMDP model is defined by the tuple ⟨S, A, T,
R, Ω, O⟩
  • S is a finite set of discrete states of the
    environment.
  • A is a finite set of discrete actions.
  • Ω is a finite set of discrete observations
    providing noisy state information.
  • T: S × A → Π(S) is the state transition
    probability,
  • the probability of transitioning from state s
    to s′ when taking action a.
  • O: S × A → Π(Ω) is the observation function,
  • the probability of receiving observation o after
    taking action a, landing in state s′.
  • R: S × A → ℝ, where R(s, a) is the expected reward the
    agent receives by taking action a in state s.

9
POMDP (3)
Belief state b
  • The agent's belief about which state it is currently
    in
  • A probability distribution over all states in S
  • A summary of past information
  • Updated at each step by Bayes' rule (formula below),
  • based on the latest action and observation and
    the previous belief state
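In standard form (the slide's own equation is not reproduced here), the Bayes
update of the belief after taking action a and receiving observation o is

\[
b'(s') = \frac{O(s',a,o)\sum_{s\in S} T(s,a,s')\,b(s)}{\Pr(o\mid b,a)},
\qquad
\Pr(o\mid b,a) = \sum_{s'\in S} O(s',a,o)\sum_{s\in S} T(s,a,s')\,b(s).
\]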

10
POMDP (4)
Policy
  • A mapping from belief states to actions
  • Telling the agent which action it should take given
    the current belief state.

Optimal policy
  • Maximizes the expected discounted reward over the
    horizon length: the immediate reward plus the
    discounted future reward (see the equations below)
  • V(b) is piecewise linear and convex in the belief
    state (Sondik, 1971)
  • Represent V(b) by a set of |S|-dimensional
    vectors α1, ..., αm
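In standard notation (the slide's own equations are not reproduced here), the
n-step optimal value function and its vector representation are

\[
V_n(b) = \max_{a\in A}\Big[\sum_{s\in S} b(s)\,R(s,a)
  + \gamma\sum_{o\in\Omega}\Pr(o\mid b,a)\,V_{n-1}(b^{a,o})\Big],
\qquad
V(b) = \max_{1\le i\le m}\ \alpha_i\cdot b,
\]

where \gamma is the discount factor and b^{a,o} is the belief that results
from b after action a and observation o.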

11
POMDP (5)
Policy learning
  • Solve for the vectors α1, ..., αm
  • Point-based value iteration (PBVI) algorithm
  • Iteratively updates the vectors α and the values V for
    a set of sampled belief points.

The value one step from the horizon is computed first;
the (n+1)-step result is then computed from the
n-step result.
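The backup equations themselves were slide images; the standard PBVI backup
(Pineau et al., 2003), which matches the description above, is

\[
\Gamma^{a,*}:\ \alpha^{a,*}(s) = R(s,a),
\qquad
\Gamma^{a,o}:\ \alpha_i^{a,o}(s) = \gamma\sum_{s'\in S}
  T(s,a,s')\,O(s',a,o)\,\alpha_i(s'),\ \ \forall\,\alpha_i\in V_n,
\]
\[
\Gamma_b^{a} = \Gamma^{a,*} + \sum_{o\in\Omega}\arg\max_{\alpha\in\Gamma^{a,o}}(\alpha\cdot b),
\qquad
V_{n+1} = \Big\{\arg\max_{\Gamma_b^{a},\,a\in A}(\Gamma_b^{a}\cdot b)\ :\ b\in B\Big\},
\]

where B is the set of sampled belief points.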
12
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

13
Model definition (1)
  • Feature extraction: EMI sensor

[Figure: EMI model fit to the sensor measurements;
model parameters extracted by a nonlinear fitting
method]
14
Model definition (2)
  • Feature extraction: GPR sensor

[Figure: GPR waveform, time vs. down-track position]
  • Raw moments: energy features
  • Central moments: variance and asymmetry of the
    wave

15
Model definition (3)
  • Definition of the observations O
  • Definition of the states S

16
Model definition (4)
  • Estimate S and O
  • Variational Bayesian (VB) expectation-maximization
    (EM) method
  • for model selection.
  • Bayesian learning
  • Criterion: compare model evidence (marginal
    likelihood; see below)
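In standard form (the slide's own expression is not reproduced here), the
model evidence compared across candidate models M is

\[
p(D\mid M) = \int p(D\mid\theta,M)\,p(\theta\mid M)\,d\theta,
\qquad
M^{*} = \arg\max_{M}\ p(D\mid M),
\]

which VB-EM approximates with a lower bound on \log p(D\mid M).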

17
Model definition (5)
  • Estimate S and O

Candidate models
HMMs with two sets of observations
S = 1, 5, 9, ...; O = 2, 3, 4, ...
18
Model definition (6)
  • Estimate S and O

[Figure panels: Estimate S; Estimate O]
19
Model definition (7)
  • Specification of action A

10 sensing actions allow movements in 4
directions:
  1  Stay, GPR sensing     2  South, GPR sensing
  3  North, GPR sensing    4  West, GPR sensing
  5  East, GPR sensing     6  Stay, EMI sensing
  7  South, EMI sensing    8  North, EMI sensing
  9  West, EMI sensing     10 East, EMI sensing
Declaration actions declare as one type of
target:
  11 Declare as metal mine       12 Declare as plastic mine
  13 Declare as Type-1 clutter   14 Declare as Type-2 clutter
  15 Declare as clean
20
Model definition (8)
  • Estimate T

Across all 5 types of mines and clutter, a total
of 29 states are defined.
[Figure: state layouts for the metal mine, plastic mine,
type-1 clutter, type-2 clutter and clean]
21
Model definition (9)
  • Estimate T
  • Stay actions do not cause a state transition:
    identity matrix
  • Other sensing actions cause state transitions,
    computed by elementary geometric probability
  • Declaration actions reset the problem:
    uniform distribution over states

where a = "walk south and then sense with
EMI" or "walk south and then sense with GPR",
d = the distance traveled in a single step by the robot, and
s1, s2, s3 and s4 denote the 4 borders of state
5, as well as their respective area metrics.
(An illustrative form of this geometric computation is sketched below.)
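The transition expression that this "where" clause annotates was a slide
image. Purely as an illustrative sketch of an elementary geometric-probability
computation of this kind (not the authors' exact formula), for a move of
length d toward border s_i, assuming the robot's position is uniform over the
cell of state 5,

\[
\Pr(s_5\to\text{neighbor across } s_i \mid a) \approx \frac{d\,|s_i|}{\mathrm{area}(s_5)},
\qquad
\Pr(s_5\to s_5\mid a) = 1 - \frac{d\,|s_i|}{\mathrm{area}(s_5)},
\]

where |s_i| is the length of border s_i.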
22
Model definition (10)
  • Estimate T
  • Assume each mine or clutter item is buried separately
  • State transitions happen only within the states
    of a target when the robot moves
  • "Clean" is a bridge between the targets.

[Figure: metal-mine states connected to other targets
through the clean state (state 29)]
23
Model definition (11)
  • Estimate T
  • State transition matrix: block diagonal
  • Model expansion: add more diagonal blocks, each
    one a target

24
Model definition (12)
  • Specification of reward R

Sensing: -1
  Each sensing action (either EMI or GPR) has a cost of -1
Correct declaration: +10
  Correctly declare a target
Partially correct declaration: +5
  Confused between different types of landmines, or
  confused between different types of clutter
Incorrect declaration: large penalty
  Missing (declare as clean or clutter when it is
  a landmine): -100
  False alarm (declare as a landmine when it is
  clean or clutter): -50
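As a worked example of this reward structure, the following Python sketch
encodes the table above. Only the numeric values (-1, +10, +5, -100, -50)
come from the slide; the state and action labels are hypothetical placeholders.

# Illustrative encoding of the reward table above (labels are hypothetical).
SENSING_ACTIONS = {
    "stay_gpr", "south_gpr", "north_gpr", "west_gpr", "east_gpr",
    "stay_emi", "south_emi", "north_emi", "west_emi", "east_emi",
}
MINES = {"metal_mine", "plastic_mine"}
CLUTTER = {"type1_clutter", "type2_clutter"}

def reward(target: str, action: str) -> float:
    """R(s, a): expected reward for taking `action` when the true target is `target`."""
    if action in SENSING_ACTIONS:
        return -1.0                    # every sensing action (EMI or GPR) costs 1
    declared = action.removeprefix("declare_")
    if declared == target:
        return 10.0                    # correct declaration
    if declared in MINES and target in MINES:
        return 5.0                     # confused between mine types
    if declared in CLUTTER and target in CLUTTER:
        return 5.0                     # confused between clutter types
    if target in MINES:
        return -100.0                  # miss: a landmine declared as clean or clutter
    if declared in MINES:
        return -50.0                   # false alarm: clean or clutter declared as a landmine
    return -50.0                       # remaining clutter/clean confusions (value assumed, not on the slide)

For instance, reward("metal_mine", "declare_plastic_mine") returns 5
(confusion between mine types), while reward("clean", "declare_metal_mine")
returns -50 (false alarm).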
25
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

26
Lifelong learning (1)
Model-based algorithm
  • No training data available in advance
  • Learn the POMDP model by a Bayesian approach during
    the exploration / exploitation process
  • Assume a rough model is given, but some model
    parameters are uncertain
  • An oracle is available, which can provide exact
    information about target label, size and position,
    but using the oracle is expensive
Criteria to use the oracle
  1. Policy selects the oracle query action
  2. Agent finds new observations (new knowledge)
  3. After sensing a lot, the agent still cannot make
     a decision (too difficult)
27
Lifelong learning (2)
An oracle query includes three steps:
  1. Measure data from both sensors on a grid
  2. The true target label is revealed
  3. Build the target model based on the measured data
Two learning approaches:
  1. Model expansion (more target types are considered)
  2. Model hyper-parameter update
28
Lifelong learning (3)
Dirichlet distribution
  • A distribution over the parameters of a multinomial
    distribution.
  • A conjugate prior to the multinomial
    distribution.
  • We can put a Dirichlet prior on each
    state-action pair in the transition
    probability and the observation
    function

With variables p = (p_1, ..., p_k) and parameters u = (u_1, ..., u_k):

\[
\mathrm{Dir}(p_1,\ldots,p_k;\ u_1,\ldots,u_k) =
\frac{\Gamma\!\big(\sum_{i=1}^{k} u_i\big)}{\prod_{i=1}^{k}\Gamma(u_i)}
\prod_{i=1}^{k} p_i^{\,u_i - 1},
\qquad p_i \ge 0,\ \ \sum_{i=1}^{k} p_i = 1.
\]
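Conjugacy is what makes the hyper-parameter update in the algorithm below a
simple count update: after n_i outcomes of type i have been observed, the
posterior is again Dirichlet with

\[
u_i \leftarrow u_i + n_i .
\]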
29

Lifelong learning (4)
Algorithm
  • 1. Imperfect model M0, containing "clean" and
    some mine or clutter types, with the
    corresponding S and O (S and O can be expanded
    in the learning process)
  • 2. Oracle query is one possible action
  • 3. Set the learning rate
  • 4. Set the Dirichlet priors according to the imperfect
    model M0 (see the sketch below)

[Tables: Dirichlet prior parameters for each unknown
transition probability and each unknown observation
probability]
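A minimal Python sketch of steps 3-5, assuming the prior pseudo-counts are the
rough model M0 scaled by a confidence factor (one common choice; the slides'
exact initialization is not preserved):

import numpy as np

rng = np.random.default_rng(0)

def init_dirichlet_priors(T0, O0, confidence=20.0, eps=1e-6):
    """Dirichlet pseudo-counts for every (state, action) row of T and O.

    T0: (|S|, |A|, |S|) rough transition model taken from M0
    O0: (|S|, |A|, |Omega|) rough observation model taken from M0
    Scaling M0 by `confidence` is an assumption, not a value from the slides;
    `eps` keeps every pseudo-count strictly positive, as the Dirichlet requires.
    """
    return confidence * T0 + eps, confidence * O0 + eps

def sample_model(alpha_T, alpha_O):
    """Step 5: draw one candidate POMDP (T, O) by sampling every
    (state, action) row from its Dirichlet."""
    T = np.apply_along_axis(rng.dirichlet, -1, alpha_T)
    O = np.apply_along_axis(rng.dirichlet, -1, alpha_O)
    return T, O

# e.g. sample N = 5 candidate models, each solved offline for its own policy:
# alpha_T, alpha_O = init_dirichlet_priors(T0, O0)
# models = [sample_model(alpha_T, alpha_O) for _ in range(5)]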
30

Lifelong learning (4)
Algorithm
  5. Sample N models, solve their policies
  6. Initialize the weights wi = 1/N
  7. Initialize the history h
  8. Initialize the belief state b0 for each model
  9. Run the experiment. At each time step (a code
     sketch of this loop follows below):
  • a. Compute the optimal action for each model: ai =
    πi(bi) for i = 1, ..., N
  • b. Pick an action a according to the weights:
    p(ai) = wi
  • c. If one of the three query conditions is met
    (exploration):
  • (1) Sense the current local area on a grid
  • (2) The current target label is revealed
  • (3) Build the sub-model for the current target and
    compute the hyper-parameters.
  • If the target is a new target type,
  • expand the model by including the new target type
    as a diagonal block
  • else (the target is an existing target type),
  • update the Dirichlet parameters of the current
    target type (next page)
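A compressed sketch of step 9 in Python (reusing rng and sample_model from the
sketch above). Every sampled model proposes an action, one model is picked in
proportion to its weight, and its action is executed; the environment hook,
the query test for the three criteria, and the oracle-driven model update are
passed in as callables because the slides only name them. This is illustrative,
not the authors' code.

def belief_update(T, O, b, a, o):
    """Bayes update: b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') b(s)."""
    b_new = O[:, a, o] * (b @ T[:, a, :])
    return b_new / b_new.sum()

def run_step(models, policies, beliefs, weights, env_step, query_needed, oracle_update):
    """One iteration of step 9 for N sampled models (parallel lists)."""
    proposals = [policies[i](beliefs[i]) for i in range(len(models))]  # a_i = pi_i(b_i)
    i = rng.choice(len(models), p=weights)                             # pick a model: p(a_i) = w_i
    a = proposals[i]
    o, r = env_step(a)                                                 # sense or declare in the field
    if query_needed(a, o, beliefs):                                    # any of the three query criteria
        oracle_update(models, policies, a, o)                          # expand model / update Dirichlet counts
    for j, (T, O) in enumerate(models):                                # Bayes-update every model's belief
        beliefs[j] = belief_update(T, O, beliefs[j], a, o)
    return a, o, r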

31

Lifelong learning (4)
Algorithm
32
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

33
Results (1)
  • Data description

minefields of 1.6 × 1.6 m²; sensing on a
spatial grid of 2 cm by 2 cm; two sensors: EMI
and GPR
  • Robot navigation

search almost everywhere to avoid missing
landmines; active sensing to minimize the cost
Basic path: lanes
The basic path restrains the robot from
moving across the lanes; the robot takes
actions to determine its sensing positions within
the lanes.
34
Results (2)
  • Offline-learning approach: performance summary

                                                    Minefield 1   Minefield 2   Minefield 3
Ground truth: number of mines (metal + plastic)     5 (3 + 2)     7 (4 + 3)     7 (4 + 3)
Ground truth: number of clutter (metal + nonmetal)  21 (18 + 3)   57 (34 + 23)  29 (23 + 6)
Detection result: number of mines missed            1             1             2
Detection result: number of false alarms            2             2             2

Metal clutter: soda can, shell, nail, quarter,
penny, screw, lead, rod, ball bearing
Nonmetal clutter: rock, bag of wet sand, bag of dry
sand, CD
35
Results (3)
  • Offline-learning approach: Minefield 1

[Figures: detection result and ground truth for
Minefield 1; P = plastic mine, M = metal mine,
other marks = clutter]
1 missing, 2 false alarms
36
Results (4)
  • Sensor deployment

Plastic mine: GPR sensor
Metal mine: EMI sensor
Clean / center of a mine: sensed only a few times
(2-3 times in general)
Interface of mine/clean: sensed many times
37
Red rectangular regions: oracle queries; other marks:
declarations
Results (5)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 1

Ground truth
Initial learning from Minefield 1
38
Results (6)
  • Lifelong-learning approach compared with offline
    learning

Difference between the parameters of the model
learned by lifelong learning and the model
learned by offline learning (where training data are
given in advance). The three big error drops
correspond to adding new targets into the model.
39
Red rectangular regions: oracle queries; other marks:
declarations
Results (7)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 2

Sensing Minefield 2 after the model was learned
from Minefield 1
Ground truth
40
Red rectangular regions: oracle queries; other marks:
declarations
Results (8)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 3

Ground truth