1
Combining Exploration and Exploitation in
Landmine Detection
Lihan He, Shihao Ji and Lawrence Carin ECE, Duke
University
2
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

3
Landmine detection (1)
  • Landmine detection
  • By robots instead of human beings
  • Underlying model controlling the robot: POMDP
  • Multiple sensors
  • A single sensor is sensitive to only certain types
    of objects
  • EMI sensor: conductivity
  • GPR sensor: dielectric properties
  • Seismic sensor: mechanical properties
  • Multiple complementary sensors improve
    detection performance

4
Landmine detection (2)
  • Statement of the problem
  • Given a minefield where some landmines and
    clutter are buried underground
  • Two types of sensors are available: an EMI sensor
    and a GPR sensor
  • Each sensing action has a cost
  • Correct / incorrect declarations have a
    corresponding reward / penalty

How can we develop a strategy to effectively find
the landmines in this minefield with minimal
cost?
Underlying questions:
  • How to optimally choose sensing positions in the
    field, so as to use as few sensing points as
    possible to find the landmines?
  • How to optimally choose sensors in each sensing
    position?
  • When to sense and when to declare?

5
Landmine detection (3)
  • Solution sketch

A partially observable Markov decision process
(POMDP) model is built to solve this problem,
since it provides an approach to selecting actions
(sensor deployment, sensing positions and
declarations) optimally based on maximal reward /
minimal cost.
  • Lifelong learning
  • The robot learns the model at the same time as it
    moves and senses in the minefield (combining
    exploration and exploitation)
  • The model is updated based on the exploration process.

6
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

7
POMDP (1)
POMDP = HMM + controllable actions + rewards
A POMDP is a model of an agent interacting
synchronously with its environment. The agent
takes as input the observations of the
environment, estimates the state according to the
observed information, and then generates as
output actions based on its policy. Through these
repeated observation-action loops, the agent seeks
maximal reward or, equivalently, minimal cost.
8
POMDP (2)
A POMDP model is defined by the tuple ⟨S, A, T,
R, Ω, O⟩
  • S is a finite set of discrete states of the
    environment.
  • A is a finite set of discrete actions.
  • Ω is a finite set of discrete observations
    providing noisy state information.
  • T: S × A → Π(S) is the state transition
    probability,
  • the probability of transitioning from state s
    to s′ when taking action a.
  • O: S × A → Π(Ω) is the observation function,
  • the probability of receiving observation o after
    taking action a, landing in state s′.
  • R: S × A → ℝ, where R(s, a) is the expected reward the
    agent receives by taking action a in state s.

9
POMDP (3)
Belief state b
  • The agent's belief about which state it is currently
    in
  • A probability distribution over all states in S
  • A summary of past information
  • Updated at each step by Bayes' rule (formula below),
  • based on the latest action and observation and
    the previous belief state
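In standard form (the slide's own equation is not reproduced here), the Bayes
update of the belief after taking action a and receiving observation o is

\[
b'(s') = \frac{O(s',a,o)\sum_{s\in S} T(s,a,s')\,b(s)}{\Pr(o\mid b,a)},
\qquad
\Pr(o\mid b,a) = \sum_{s'\in S} O(s',a,o)\sum_{s\in S} T(s,a,s')\,b(s).
\]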

10
POMDP (4)
Policy
  • A mapping from belief states to actions
  • Telling the agent which action it should take given
    the current belief state.

Optimal policy
  • Maximizes the expected discounted reward over the
    horizon length: the immediate reward plus the
    discounted future reward (see the equations below)
  • V(b) is piecewise linear and convex in the belief
    state (Sondik, 1971)
  • Represent V(b) by a set of |S|-dimensional
    vectors α1, ..., αm
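In standard notation (the slide's own equations are not reproduced here), the
n-step optimal value function and its vector representation are

\[
V_n(b) = \max_{a\in A}\Big[\sum_{s\in S} b(s)\,R(s,a)
  + \gamma\sum_{o\in\Omega}\Pr(o\mid b,a)\,V_{n-1}(b^{a,o})\Big],
\qquad
V(b) = \max_{1\le i\le m}\ \alpha_i\cdot b,
\]

where \gamma is the discount factor and b^{a,o} is the belief that results
from b after action a and observation o.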

11
POMDP (5)
Policy learning
  • Solve for the vectors α1, ..., αm
  • Point-based value iteration (PBVI) algorithm
  • Iteratively updates the vectors α and the values V for
    a set of sampled belief points.

The value one step from the horizon is computed first;
the (n+1)-step result is then computed from the
n-step result.
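The backup equations themselves were slide images; the standard PBVI backup
(Pineau et al., 2003), which matches the description above, is

\[
\Gamma^{a,*}:\ \alpha^{a,*}(s) = R(s,a),
\qquad
\Gamma^{a,o}:\ \alpha_i^{a,o}(s) = \gamma\sum_{s'\in S}
  T(s,a,s')\,O(s',a,o)\,\alpha_i(s'),\ \ \forall\,\alpha_i\in V_n,
\]
\[
\Gamma_b^{a} = \Gamma^{a,*} + \sum_{o\in\Omega}\arg\max_{\alpha\in\Gamma^{a,o}}(\alpha\cdot b),
\qquad
V_{n+1} = \Big\{\arg\max_{\Gamma_b^{a},\,a\in A}(\Gamma_b^{a}\cdot b)\ :\ b\in B\Big\},
\]

where B is the set of sampled belief points.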
12
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

13
Model definition (1)
  • Feature extraction: EMI sensor

[Figure: EMI model fit to the sensor measurements;
model parameters extracted by a nonlinear fitting
method]
14
Model definition (2)
  • Feature extraction: GPR sensor

[Figure: GPR waveform, time vs. down-track position]
  • Raw moments: energy features
  • Central moments: variance and asymmetry of the
    wave

15
Model definition (3)
  • Definition of the observations O
  • Definition of the states S

16
Model definition (4)
  • Estimate S and O
  • Variational Bayesian (VB) expectation-maximization
    (EM) method
  • for model selection.
  • Bayesian learning
  • Criterion: compare model evidence (marginal
    likelihood; see below)
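In standard form (the slide's own expression is not reproduced here), the
model evidence compared across candidate models M is

\[
p(D\mid M) = \int p(D\mid\theta,M)\,p(\theta\mid M)\,d\theta,
\qquad
M^{*} = \arg\max_{M}\ p(D\mid M),
\]

which VB-EM approximates with a lower bound on \log p(D\mid M).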

17
Model definition (5)
  • Estimate S and O

Candidate models
HMMs with two sets of observations
S = 1, 5, 9, ...; O = 2, 3, 4, ...
18
Model definition (6)
  • Estimate S and O

[Figure panels: Estimate S; Estimate O]
19
Model definition (7)
  • Specification of action A

10 sensing actions allow movements in 4
directions:
  1  Stay, GPR sensing     2  South, GPR sensing
  3  North, GPR sensing    4  West, GPR sensing
  5  East, GPR sensing     6  Stay, EMI sensing
  7  South, EMI sensing    8  North, EMI sensing
  9  West, EMI sensing     10 East, EMI sensing
Declaration actions declare as one type of
target:
  11 Declare as metal mine       12 Declare as plastic mine
  13 Declare as Type-1 clutter   14 Declare as Type-2 clutter
  15 Declare as clean
20
Model definition (8)
  • Estimate T

Across all 5 types of mines and clutter, a total
of 29 states are defined.
[Figure: state layouts for the metal mine, plastic mine,
type-1 clutter, type-2 clutter and clean]
21
Model definition (9)
  • Estimate T
  • Stay actions do not cause a state transition:
    identity matrix
  • Other sensing actions cause state transitions,
    computed by elementary geometric probability
  • Declaration actions reset the problem:
    uniform distribution over states

where a = "walk south and then sense with
EMI" or "walk south and then sense with GPR",
d = the distance traveled in a single step by the robot, and
s1, s2, s3 and s4 denote the 4 borders of state
5, as well as their respective area metrics.
(An illustrative form of this geometric computation is sketched below.)
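The transition expression that this "where" clause annotates was a slide
image. Purely as an illustrative sketch of an elementary geometric-probability
computation of this kind (not the authors' exact formula), for a move of
length d toward border s_i, assuming the robot's position is uniform over the
cell of state 5,

\[
\Pr(s_5\to\text{neighbor across } s_i \mid a) \approx \frac{d\,|s_i|}{\mathrm{area}(s_5)},
\qquad
\Pr(s_5\to s_5\mid a) = 1 - \frac{d\,|s_i|}{\mathrm{area}(s_5)},
\]

where |s_i| is the length of border s_i.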
22
Model definition (10)
  • Estimate T
  • Assume each mine or clutter item is buried separately
  • State transitions happen only within the states
    of a target when the robot moves
  • "Clean" is a bridge between the targets.

[Figure: metal-mine states connected to other targets
through the clean state (state 29)]
23
Model definition (11)
  • Estimate T
  • State transition matrix: block diagonal
  • Model expansion: add more diagonal blocks, each
    one a target

24
Model definition (12)
  • Specification of reward R

Sensing: -1
  Each sensing action (either EMI or GPR) has a cost of -1
Correct declaration: +10
  Correctly declare a target
Partially correct declaration: +5
  Confused between different types of landmines, or
  confused between different types of clutter
Incorrect declaration: large penalty
  Missing (declare as clean or clutter when it is
  a landmine): -100
  False alarm (declare as a landmine when it is
  clean or clutter): -50
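As a worked example of this reward structure, the following Python sketch
encodes the table above. Only the numeric values (-1, +10, +5, -100, -50)
come from the slide; the state and action labels are hypothetical placeholders.

# Illustrative encoding of the reward table above (labels are hypothetical).
SENSING_ACTIONS = {
    "stay_gpr", "south_gpr", "north_gpr", "west_gpr", "east_gpr",
    "stay_emi", "south_emi", "north_emi", "west_emi", "east_emi",
}
MINES = {"metal_mine", "plastic_mine"}
CLUTTER = {"type1_clutter", "type2_clutter"}

def reward(target: str, action: str) -> float:
    """R(s, a): expected reward for taking `action` when the true target is `target`."""
    if action in SENSING_ACTIONS:
        return -1.0                    # every sensing action (EMI or GPR) costs 1
    declared = action.removeprefix("declare_")
    if declared == target:
        return 10.0                    # correct declaration
    if declared in MINES and target in MINES:
        return 5.0                     # confused between mine types
    if declared in CLUTTER and target in CLUTTER:
        return 5.0                     # confused between clutter types
    if target in MINES:
        return -100.0                  # miss: a landmine declared as clean or clutter
    if declared in MINES:
        return -50.0                   # false alarm: clean or clutter declared as a landmine
    return -50.0                       # remaining clutter/clean confusions (value assumed, not on the slide)

For instance, reward("metal_mine", "declare_plastic_mine") returns 5
(confusion between mine types), while reward("clean", "declare_metal_mine")
returns -50 (false alarm).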
25
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

26
Lifelong learning (1)
Model-based algorithm
  • No training data available in advance
  • Learn the POMDP model by a Bayesian approach during
    the exploration / exploitation process
  • Assume a rough model is given, but some model
    parameters are uncertain
  • An oracle is available, which can provide exact
    information about target label, size and position,
    but using the oracle is expensive
Criteria to use the oracle
  1. Policy selects the oracle query action
  2. Agent finds new observations (new knowledge)
  3. After sensing a lot, the agent still cannot make
     a decision (too difficult)
27
Lifelong learning (2)
An oracle query includes three steps:
  1. Measure data from both sensors on a grid
  2. The true target label is revealed
  3. Build the target model based on the measured data
Two learning approaches:
  1. Model expansion (more target types are considered)
  2. Model hyper-parameter update
28
Lifelong learning (3)
Dirichlet distribution
  • A distribution over the parameters of a multinomial
    distribution.
  • A conjugate prior to the multinomial
    distribution.
  • We can put a Dirichlet prior on each
    state-action pair in the transition
    probability and the observation
    function

With variables p = (p_1, ..., p_k) and parameters u = (u_1, ..., u_k):

\[
\mathrm{Dir}(p_1,\ldots,p_k;\ u_1,\ldots,u_k) =
\frac{\Gamma\!\big(\sum_{i=1}^{k} u_i\big)}{\prod_{i=1}^{k}\Gamma(u_i)}
\prod_{i=1}^{k} p_i^{\,u_i - 1},
\qquad p_i \ge 0,\ \ \sum_{i=1}^{k} p_i = 1.
\]
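Conjugacy is what makes the hyper-parameter update in the algorithm below a
simple count update: after n_i outcomes of type i have been observed, the
posterior is again Dirichlet with

\[
u_i \leftarrow u_i + n_i .
\]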
29

Lifelong learning (4)
Algorithm
  • 1. Imperfect model M0, containing "clean" and
    some mine or clutter types, with the
    corresponding S and O (S and O can be expanded
    in the learning process)
  • 2. Oracle query is one possible action
  • 3. Set the learning rate
  • 4. Set the Dirichlet priors according to the imperfect
    model M0 (see the sketch below)

[Tables: Dirichlet prior parameters for each unknown
transition probability and each unknown observation
probability]
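A minimal Python sketch of steps 3-5, assuming the prior pseudo-counts are the
rough model M0 scaled by a confidence factor (one common choice; the slides'
exact initialization is not preserved):

import numpy as np

rng = np.random.default_rng(0)

def init_dirichlet_priors(T0, O0, confidence=20.0, eps=1e-6):
    """Dirichlet pseudo-counts for every (state, action) row of T and O.

    T0: (|S|, |A|, |S|) rough transition model taken from M0
    O0: (|S|, |A|, |Omega|) rough observation model taken from M0
    Scaling M0 by `confidence` is an assumption, not a value from the slides;
    `eps` keeps every pseudo-count strictly positive, as the Dirichlet requires.
    """
    return confidence * T0 + eps, confidence * O0 + eps

def sample_model(alpha_T, alpha_O):
    """Step 5: draw one candidate POMDP (T, O) by sampling every
    (state, action) row from its Dirichlet."""
    T = np.apply_along_axis(rng.dirichlet, -1, alpha_T)
    O = np.apply_along_axis(rng.dirichlet, -1, alpha_O)
    return T, O

# e.g. sample N = 5 candidate models, each solved offline for its own policy:
# alpha_T, alpha_O = init_dirichlet_priors(T0, O0)
# models = [sample_model(alpha_T, alpha_O) for _ in range(5)]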
30

Lifelong learning (4)
Algorithm
  5. Sample N models, solve their policies
  6. Initialize the weights wi = 1/N
  7. Initialize the history h
  8. Initialize the belief state b0 for each model
  9. Run the experiment. At each time step (a code
     sketch of this loop follows below):
  • a. Compute the optimal action for each model: ai =
    πi(bi) for i = 1, ..., N
  • b. Pick an action a according to the weights:
    p(ai) = wi
  • c. If one of the three query conditions is met
    (exploration):
  • (1) Sense the current local area on a grid
  • (2) The current target label is revealed
  • (3) Build the sub-model for the current target and
    compute the hyper-parameters.
  • If the target is a new target type,
  • expand the model by including the new target type
    as a diagonal block
  • else (the target is an existing target type),
  • update the Dirichlet parameters of the current
    target type (next page)
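A compressed sketch of step 9 in Python (reusing rng and sample_model from the
sketch above). Every sampled model proposes an action, one model is picked in
proportion to its weight, and its action is executed; the environment hook,
the query test for the three criteria, and the oracle-driven model update are
passed in as callables because the slides only name them. This is illustrative,
not the authors' code.

def belief_update(T, O, b, a, o):
    """Bayes update: b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') b(s)."""
    b_new = O[:, a, o] * (b @ T[:, a, :])
    return b_new / b_new.sum()

def run_step(models, policies, beliefs, weights, env_step, query_needed, oracle_update):
    """One iteration of step 9 for N sampled models (parallel lists)."""
    proposals = [policies[i](beliefs[i]) for i in range(len(models))]  # a_i = pi_i(b_i)
    i = rng.choice(len(models), p=weights)                             # pick a model: p(a_i) = w_i
    a = proposals[i]
    o, r = env_step(a)                                                 # sense or declare in the field
    if query_needed(a, o, beliefs):                                    # any of the three query criteria
        oracle_update(models, policies, a, o)                          # expand model / update Dirichlet counts
    for j, (T, O) in enumerate(models):                                # Bayes-update every model's belief
        beliefs[j] = belief_update(T, O, beliefs[j], a, o)
    return a, o, r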

31

Lifelong learning (4)
Algorithm
32
Outline
  • Introduction
  • Partially observable Markov decision processes
    (POMDPs)
  • Model definition (offline learning)
  • Lifelong-learning algorithm
  • Experimental results

33
Results (1)
  • Data description

minefields of 1.6 × 1.6 m²; sensing on a
spatial grid of 2 cm by 2 cm; two sensors: EMI
and GPR
  • Robot navigation

search almost everywhere to avoid missing
landmines; active sensing to minimize the cost
Basic path: lanes
The basic path restrains the robot from
moving across the lanes; the robot takes
actions to determine its sensing positions within
the lanes.
34
Results (2)
  • Offline-learning approach: performance summary

                                                    Minefield 1   Minefield 2   Minefield 3
Ground truth: number of mines (metal + plastic)     5 (3 + 2)     7 (4 + 3)     7 (4 + 3)
Ground truth: number of clutter (metal + nonmetal)  21 (18 + 3)   57 (34 + 23)  29 (23 + 6)
Detection result: number of mines missed            1             1             2
Detection result: number of false alarms            2             2             2

Metal clutter: soda can, shell, nail, quarter,
penny, screw, lead, rod, ball bearing
Nonmetal clutter: rock, bag of wet sand, bag of dry
sand, CD
35
Results (3)
  • Offline-learning approach: Minefield 1

[Figures: detection result and ground truth for
Minefield 1; P = plastic mine, M = metal mine,
other marks = clutter]
1 missing, 2 false alarms
36
Results (4)
  • Sensor deployment

Plastic mine: GPR sensor
Metal mine: EMI sensor
Clean / center of a mine: sensed only a few times
(2-3 times in general)
Interface of mine/clean: sensed many times
37
Red rectangular regions: oracle queries; other marks:
declarations
Results (5)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 1

Ground truth
Initial learning from Minefield 1
38
Results (6)
  • Lifelong-learning approach compared with offline
    learning

Difference between the parameters of the model
learned by lifelong learning and the model
learned by offline learning (where training data are
given in advance). The three big error drops
correspond to adding new targets into the model.
39
Red rectangular regions: oracle queries; other marks:
declarations
Results (7)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 2

Sensing Minefield 2 after the model was learned
from Minefield 1
Ground truth
40
Red rectangular regions: oracle queries; other marks:
declarations
Results (8)
Red: metal mine; Pink: plastic mine; Yellow:
clutter 1; Cyan: clutter 2; Blue: clean
  • Lifelong-learning approach: Minefield 3

Ground truth