Design Lessons from an Unattended Ground Sensor System* presentation

1
Design Lessons from an Unattended Ground Sensor
System

Lewis Girod
CS 294-1
23 Sept 2003

Center for Embedded Networked Sensing,
girod_at_cs.ucla.edu
Work done at Sensoria Corp, supported by DARPA/ATO
contract DAAE30-00-C-1055.
Sensoria team: R. Costanza, J. Elson, L. Girod,
W. Kaiser, D. McIntire, W. Merrill, F. Newberg,
G. Rava, B. Schiffer, K. Sohrabi
2
Introduction
SPEC (J. Hill) 4MHz/8bit, 3K/0K

Networked embedded systems come in a variety of
sizes
RAM is a primary tradeoff
Costs power, cost to shutdown
Enables greater complexity
Enables development while postponing optimization

Mica2 (Berkeley/Xbow) 8MHz/8bit, 4K/128K
MK2 (UCLA/NESL) 40MHz/16bit, 136K/1M
SHM (Sensoria) 300MIPSFP/32bit, 64M/32M
3
Motivation for EmStar

EmStar is a run time environment designed for
Linux-based distributed embedded systems
Useful facilities (process health/respawn,
logging, emulation)
Common APIs (neighbor discovery, link interface,
etc)
Designed for larger memory footprint (avoids hard
limits)
Many of the ideas and motivations for EmStar
derived from our experience with SHM
Modularity, robustness to module failure
System transparency at low cost to developers
Some parts of EmStar are used in SHM elsewhere
Time Synch service
Audio Server

4
System Objectives and Design
5
Objectives

Unattended Ground Sensor (UGS) System
Fully autonomous operation
Ad-hoc deployment
Scaling unit: 50 nodes
All operations and protocols local to a 50-node
region
No global operations or context required

Fancy graphics taken from the official SHM
website
6
Adaptive and Self-Configuring

Self-localizing without GPS: acoustic ranging
Build map of relative locations
Adaptive/Resilient to environmental conditions
e.g. wind, sunny days, background noise
Self-assembled data network
TDMA MAC layer
Typically 10-hop diameter with 100 nodes
Adaptive/Resilient to RF environment

In this application, GPS is avoided for security
reasons. In other applications, obstructions
and foliage can be an issue
7
Maintain coverage via Actuation

Vigilant units detect failed unit(s)
Remaining units autonomously move in to maintain
coverage

8
Demonstration Requirements

200x50m outdoor field
100 nodes, 10m spacing
Sunny afternoon
85F, 20 MPH wind
No preconfigured state
GPS-free relative geolocation to 0.25m
Detect downed nodes, move to maintain coverage
within 1 min

9
SHM Project Design Choices: Optimize for rapid
development

Concurrent HW/SW development
Compressed schedule
Aggressive scaling milestones
Logistical problems with debugging system of 100
nodes in 200x50m field
Complex software required
150K lines C code
30 processes
100 IPC channels
Power is not the driving constraint
Continuous vigilance, rapid response are project
requirements
System lifetime target < 1 day

10
System Configuration

300 MIPS RISC processor with FPU
64M RAM / 32M Flash
Two 50kbps 2.4GHz data radios, TDMA,
frequency-hopping, star-topology MAC, 63 hopping
patterns
4 channels full-duplex audio
3-axis magnetometer / accelerometer
2 mobility units, with integrated thrusters
Linux 2.4 kernel
Optional wired ethernet (for Devel/Debug only)

11
Results: Acoustic Ranging

Ground truth was hand-surveyed, +/- 0.5m
Ranges not temperature compensated in demo
Ranges with angles are more accurate
Angle from TDOA of two or more ranges, must be
consistent
Bug discovered after the fact, caused large errors

12
Results: Radio Utilization

Graph shows traffic at three bases over a
complete run
Initial spikes
Tree formation
Lots of ranging
Quiescent rates
Heartbeats to detect down nodes
Maintenance of trees and location
Reaction to dynamics

13
Challenges in Implementation

Dealing with a dynamic environment
Adapt to wind, weather, RF connectivity
Dealing with noise
Rejecting outliers from timesync, ranges, angles
Filtering neighbor connectivity, insignificant
changes to range/angle
Dealing with failure
Node failure
Infrequent crashes (e.g. FP exceptions from
transient bad data)
Fault tolerance at process boundaries, avoid
ripple effect
Dealing with complexity
Cross-layer integration vs. modularity... or both?
What is the right set of primitives for
coordination?

14
Case Study: Acoustic Ranging
15
Basic TOF Ranging

Basic idea
Sender emits a characteristic acoustic signal
Receiver correlates received time series with
time-offsets of reference signal to find peak
offset
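
A minimal sketch of the correlation step described above, in pure Python. The sample rate, the speed of sound, the short reference code, and the function names are all assumptions for illustration; the slides do not state the actual signal or parameters used.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 C
SAMPLE_RATE_HZ = 48000      # hypothetical codec sample rate


def best_offset(received, reference):
    """Return the sample offset where the reference best matches the received series."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(received) - len(reference) + 1):
        # Correlation score of the reference against the received window at this lag.
        score = sum(r * received[lag + i] for i, r in enumerate(reference))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def range_from_offset(lag_samples, emit_sample=0):
    """Convert a peak offset (in samples after emission) into a distance in metres."""
    return (lag_samples - emit_sample) / SAMPLE_RATE_HZ * SPEED_OF_SOUND_M_S


# Hypothetical 8-sample code, received attenuated after a 1400-sample delay.
reference = [1, -1, 1, 1, -1, -1, 1, -1]
received = [0.0] * 1400 + [0.5 * x for x in reference] + [0.0] * 100
lag = best_offset(received, reference)
distance_m = range_from_offset(lag)
```

The conversion only works if sender and receiver agree on when sample 0 was emitted, which is why the later slides spend so much effort on time synchronization.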

16
Basic AOA Estimation

16 possible paths
First pick best speaker
Then estimate angle from TDOA of one or more
consistent ranges
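
As a rough illustration of estimating an angle from a TDOA, the sketch below uses the standard far-field relation for a single microphone pair. The far-field assumption, the microphone spacing, and the rule of rejecting physically impossible pairs are assumptions made here, not details taken from the slides.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value


def bearing_from_tdoa(tdoa_s, mic_spacing_m):
    """Far-field bearing in radians from broadside for one microphone pair.

    Returns None when the TDOA exceeds what the spacing allows, which is one
    simple way to discard an inconsistent pair.
    """
    ratio = SPEED_OF_SOUND_M_S * tdoa_s / mic_spacing_m
    if abs(ratio) > 1.0:
        return None
    return math.asin(ratio)


# Hypothetical 0.1 m spacing; a path difference of half the spacing gives 30 degrees.
angle = bearing_from_tdoa(0.5 * 0.1 / SPEED_OF_SOUND_M_S, 0.1)
rejected = bearing_from_tdoa(0.001, 0.1)
```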

17
Acoustic Ranging, Version 1

First cut implemented explicit cluster
coordination protocol
Lots of error cases to handle, hard to handle all
efficiently
Very timing sensitive (sync)
Did not scale past 20 nodes
Can't range across clusters
Best acoustic neighbors may be in other clusters
MLat merging algorithm is error prone
Overuse of flooding
Soft state reflood of cluster MLats and
orientation data

18
V2: Decomposing AR
19
Audio Sample Server

Continuously samples audio, integrates to
Timesync
Eliminates error-prone Synchronized start
Enables acquisition of overlapped sample sets
Buffers past N seconds, exposes buffered
interface
Data access can be triggered after the fact
relaxes timing constraints on trigger message
Can process overlapping chirps by requesting
overlapping retrievals, rather than having to
pick one and ignore other
Enables access to audio device from multiple apps
Ranging can coexist with acoustic comm subsystem
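
The buffered interface can be pictured as a ring buffer addressed by absolute sample index, so a request that arrives after the chirp can still retrieve it, and two requests may overlap. This is only a sketch; the class and method names are hypothetical and are not the Audio Server's actual API.

```python
class SampleBuffer:
    """Keeps the most recent `capacity` samples, addressed by absolute sample index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0] * capacity
        self.next_index = 0  # absolute index of the next sample to be written

    def append(self, samples):
        for s in samples:
            self.data[self.next_index % self.capacity] = s
            self.next_index += 1

    def read(self, start, count):
        """Return samples [start, start + count), or None if they are not buffered."""
        oldest = max(0, self.next_index - self.capacity)
        if start < oldest or start + count > self.next_index:
            return None
        return [self.data[i % self.capacity] for i in range(start, start + count)]


buf = SampleBuffer(capacity=8)
buf.append(range(12))       # continuous sampling; the oldest samples are overwritten
first = buf.read(5, 4)      # retrieval triggered after the fact
second = buf.read(7, 4)     # overlapping retrieval of a second chirp
too_late = buf.read(2, 3)   # request arrived after the data left the buffer
```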

Acoustic comm was developed as a backup channel
to be used in event of RF jamming
20
Inter-node Timesync: RBS

Key idea
Receiver latency more deterministic than sender
Thus, common receivers of a sender can be synched
by correlating the reception times of the sender's
broadcasts
It's your only option if you don't control the
MAC

For sender sync, senders must be in some other
sender's broadcast domain
21
TimeSync Service

Inter-node Sync: Implementation of RBS
Computes conversion params among all nodes in
each cluster
CH does computation, reports parameters to CMs
Intra-node Sync: Codec sample clocks
Clock pairs reported by audio server
Map time of DMA interrupt to sample number
Outlier rejection and linear fit to find offset
and skew estimate
Yields more consistent result than synch start
Multihop Time Conversion
Graph of sync relations through system
Conversion from one element to another requires
path through graph. Gaussian error at each step,
so total error grows as sqrt(hops)
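
The "outlier rejection and linear fit" step can be sketched as a least-squares line through the clock pairs, refit after dropping points with large residuals. The specific rejection rule here (drop residuals above three times the median residual) is an assumption; the slides do not say which rule was used.

```python
def fit_line(pairs):
    """Least-squares fit y = skew * x + offset over (x, y) clock pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    skew = sxy / sxx
    return skew, mean_y - skew * mean_x


def robust_fit(pairs, k=3.0):
    """Fit once, drop pairs whose residual exceeds k times the median, then refit."""
    skew, offset = fit_line(pairs)
    residuals = [abs(y - (skew * x + offset)) for x, y in pairs]
    median = sorted(residuals)[len(residuals) // 2]
    kept = [p for p, r in zip(pairs, residuals) if r <= k * median]
    return fit_line(kept)


# Hypothetical clock pairs: skew 1.0001, offset 5.0, with one corrupted sample.
pairs = [(float(x), 1.0001 * x + 5.0) for x in range(10)]
pairs[4] = (4.0, pairs[4][1] + 50.0)
skew, offset = robust_fit(pairs)
```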

22
Hop-by-hop Time Conversion

Problem
Nodes have ability to convert within cluster but
not outside
Could continually broadcast conversion
parameters BUT
They are continuously varying
Large amount of data to transmit across network
Solution: Integrate time conversion with routing
Routing layer knows about packets that contain
timestamps
Convert timestamps en route
At cluster boundaries
At destination node
Integrated with flooding
Can fail if sync graph does not match route

Unclear what the right API is here; we simply
added code to flooding.
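
One way to picture the hop-by-hop scheme: each hop applies the conversion it knows for the next node, and the packet is dropped when a hop has no sync relation. The data layout (a dict of per-link skew and offset) and the function names are assumptions for illustration only.

```python
def convert(timestamp, params):
    """Apply one linear clock conversion (skew, offset)."""
    skew, offset = params
    return skew * timestamp + offset


def convert_along_route(timestamp, route, sync_graph):
    """Convert a timestamp hop by hop; return None if a hop lacks a sync relation."""
    for src, dst in zip(route, route[1:]):
        params = sync_graph.get((src, dst))
        if params is None:
            return None  # the sync graph does not cover this route
        timestamp = convert(timestamp, params)
    return timestamp


# Hypothetical sync relations between neighbouring nodes A, B, C.
sync_graph = {("A", "B"): (1.0, 10.0), ("B", "C"): (1.0, -3.0)}
t_at_c = convert_along_route(100.0, ["A", "B", "C"], sync_graph)
dropped = convert_along_route(100.0, ["A", "C"], sync_graph)
```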
23
Reliable State Synchronization

Problem
Need to reliably broadcast the latest range data
to N-hop away nodes, so they can build a
consistent coordinate system
Should have reasonable latency and low overhead
V1 addressed this problem with periodic refresh
Cluster heads retransmit Mlat tables every 15
seconds
Problems: Traffic load from redundant sends,
latency on msg loss
Traffic load forced new protocol
Send a hash when there was no change since last
refresh
If the hash has not been seen, request full
version
But, still has 15 second latency on lost data
V2 introduced a Reliable State Sync protocol
(RSS)
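
The interim hash-based refresh can be sketched as below. The choice of SHA-1 over a JSON encoding, and the class and method names, are assumptions; the slides only say that a hash is sent when nothing has changed and that the full version is requested when the hash is unknown.

```python
import hashlib
import json


def table_hash(table):
    """Digest of a table, using a canonical JSON encoding (an assumed encoding)."""
    encoded = json.dumps(table, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


class Receiver:
    def __init__(self):
        self.tables = {}  # digest -> table already received in full

    def on_full(self, table):
        self.tables[table_hash(table)] = table

    def on_hash(self, digest):
        """Return True when the full table must be requested."""
        return digest not in self.tables


receiver = Receiver()
table = {"n1-n2": 10.2, "n1-n3": 7.5}  # hypothetical range table
needs_full_before = receiver.on_hash(table_hash(table))
receiver.on_full(table)
needs_full_after = receiver.on_hash(table_hash(table))
```

Note that a receiver that misses a refresh still waits for the next period before it can notice, which is the 15 second latency the slide points out.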

24
RSS Design

Semantics
Reliably converges on latest published state
Does not guarantee client sees every transition
Robust and Efficient, structurally similar to
SRM/wb
Based on reliable transfer of a sequenced log of
diffs.
Pruning of the log is done with awareness of log
semantics (replaced or deleted keys are pruned)
Per-source forwarding trees (MST of connectivity
graph)
Local repair, up to complete history, from
upstream neighbor
New or restarted nodes will download all active
flows from upstream neighbor
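
A small sketch of a sequenced diff log with semantic pruning. Names are hypothetical, and keeping a delete marker in the log after pruning is an assumption made here so that receivers holding the old value still learn of the deletion.

```python
class DiffLog:
    """Sequenced log of key/value diffs; superseded entries for a key are pruned."""

    def __init__(self):
        self.entries = []  # (seq, key, value); value None marks a deletion
        self.next_seq = 1

    def publish(self, key, value):
        # Prune any older entry for this key, since it has been replaced or deleted.
        self.entries = [e for e in self.entries if e[1] != key]
        self.entries.append((self.next_seq, key, value))
        self.next_seq += 1

    def since(self, seq):
        """Entries a downstream neighbour needs if it has seen everything up to seq."""
        return [e for e in self.entries if e[0] > seq]


def apply_diffs(state, diffs):
    for _, key, value in diffs:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


log = DiffLog()
log.publish("range:7", 12.1)
log.publish("range:9", 4.0)
log.publish("range:7", 12.3)
log.publish("range:9", None)

# A node that restarted downloads the whole (pruned) history.
new_node = apply_diffs({}, log.since(0))
# A node that had seen sequence numbers 1 and 2 repairs from there.
stale_node = apply_diffs({"range:7": 12.1, "range:9": 4.0}, log.since(2))
```

Both nodes converge on the latest state, although the restarted node never sees the intermediate values, matching the stated semantics.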

25
RSS API

Node X publishes current state as Key-Value Pairs
Diffs are reliably broadcast N hops away from X
Each node within N hops of X eventually sees the
data X published
API presents each node's KVPs in its own
namespace
Caveat: transmission latency, loss, edge of
hopcount can cause transient inconsistencies

[Figure: State Sync Bus connecting nodes 1, 3, 2,
4. Three nodes hold the entries 1:A1, 1:B2, 2:A3,
2:C4; one node holds only 2:A3, 2:C4.]
Note: 2-hop publish from 1 doesn't reach 4
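
The hop-limited, per-publisher namespace behaviour can be sketched as follows. The chain topology 1-3-2-4, the reading of the figure entries as node 1 publishing A=1, B=2 and node 2 publishing A=3, C=4, and all function names are assumptions for illustration; the sketch ignores latency and loss.

```python
from collections import deque


def hops_from(source, links):
    """Breadth-first hop counts from source over a list of undirected links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def visible_state(published, links, max_hops):
    """Per-node view: {node: {publisher: kvps}} for publishers within max_hops."""
    views = {}
    for source, kvps in published.items():
        for node, d in hops_from(source, links).items():
            if d <= max_hops:
                views.setdefault(node, {})[source] = dict(kvps)
    return views


links = [(1, 3), (3, 2), (2, 4)]                          # assumed chain topology
published = {1: {"A": 1, "B": 2}, 2: {"A": 3, "C": 4}}    # assumed reading of the figure
views = visible_state(published, links, max_hops=2)
```

Because each publisher has its own namespace, key A from node 1 and key A from node 2 do not collide.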
26
Putting it back together: AR V2
27
AR V1 Event Diagram

But, there are many error cases
REQ lost?
ACK lost?
Bcast lost to some receivers?
Bcast delayed in queue?
Bcast lost to sender?
CMs that join two clusters may be busy ranging in
the other cluster.
Inaccurate codec sync start?
Interference in acoustic channel?
Reply from sender lost?
Reply from receiver(s) lost?
How long to wait for stragglers?
CH failure loses all ranges for cluster

Cluster Head (Coordinator)
Cluster Member (Sender)
Cluster Member (Receiver)
mlatd (CH)
AR (CH)
AR (CM)
AR (CM)
Reliability challenges: The sender is the
linchpin; an error in sender sync affects all
ranges to receivers, and replies from receivers
can't be interpreted without the sender reply.
If connectivity to sender is bad and the
broadcast is lost, all receivers waste CPU on a
useless correlation. Implementing reliable
reporting is made more complex because retransmitted
receiver replies must be matched to a past sender.
If not enough data for cluster mlat, request
ranging to specific missing cluster members
Send Range REQ to first CM in round robin order,
check busy
ACK with preferred start time
Bcast Range Start, specify code
Timestamp msg arrival. Sender delays before
starting to ensure rough sync, and reports exact
time offset from bcast to codec start
Acoustic Signal
Run correlation, report time offset from bcast to
detection in data

Big complexity increase to
Range across clusters
Coordinate adjacent clusters
Do regional mlats
Average multiple sync bcasts

CH waits for stragglers
Report new ranges and notify mlat when round
robin thru CMs completes
28
AR V2 Event Diagram

What can go wrong here?
Collisions in acoustic channel.
Flooded message delayed beyond audio buffering
(16 seconds).
Flooded message dropped for lack of sync
relations along route.
Node restart causes ranges to/from that node to
be dropped.

Continuous Sampling
Continuous Sync Maintenance
Waiting for chirp notification
Waiting for chirp request
Waiting for enough data to compute mlat
mlatd
ar_send
ar_recv
syncd
audiod
If not enough data for mlat, request chirp and
wait for a while

Key design points
Encapsulate timing critical parts, no timing
constraints on reliability.
If a receiver can't sync to sender it won't
attempt correlation.

Chirp audio (audiod on remote node records it)
Acoustic Signal
Flood Chirp notification message, with hop-by-hop
conversion at flood layer
Retrieve samples from buffer and correlate in
separate thread
Publish new range to N hop away neighbors
Try mlat again with new data in separate thread
29
Key Observations

No coordination required
Simplifying transport abstractions
Continuous operation and service model

30
Key Observations

No coordination required
If mlatd doesn't have enough data it triggers
chirping to start generating more data
Exponential backoff on chirping with reset when
data is lost.
Simplicity of system lets designer focus on these
details
ar_send and ar_recv are slaves to request and
notify messages.
Transparently, ar_recv can receive overlapping
triggers and buffer the data for correlation
Priority scheme decides the best order to process
queued correlations, based on past
success/failure and RF hopcount
Simplifying transport abstractions
Continuous operation and service model
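
The chirp-request policy ("exponential backoff on chirping with reset when data is lost") could look like the sketch below. The base interval, the cap, and the class name are hypothetical values chosen for illustration.

```python
class ChirpBackoff:
    """Doubles the wait between chirp requests; resets when range data is lost."""

    def __init__(self, base_s=1.0, max_s=64.0):
        self.base_s = base_s
        self.max_s = max_s
        self.interval_s = base_s

    def next_interval(self):
        """Return the wait before the next chirp request, then back off further."""
        current = self.interval_s
        self.interval_s = min(self.interval_s * 2, self.max_s)
        return current

    def reset(self):
        """Called when data is lost, e.g. a neighbour restarts and its ranges are dropped."""
        self.interval_s = self.base_s


backoff = ChirpBackoff()
waits = [backoff.next_interval() for _ in range(4)]
backoff.reset()
after_reset = backoff.next_interval()
```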

31
Key Observations

No coordination required
Simplifying transport abstractions
Flooding takes care of delivering a local time
State Sync provides consistency for data input to
mlatd
Efficiently supports a potentially large number
of keys (1000), enabling full regional mlat at
each node (no merging)
Mlat takes 10-15min, sync is consistent on that
timescale
Failure of one node only loses range data for
that node
Continuous operation and service model

32
Key Observations

No coordination required
Simplifying transport abstractions
Continuous operation and service model
Eliminates many inconsistencies and corner cases
Reduces the number of states or modes
Simplifies interfaces to services
Recovery from faults without coordination: just
wait for stuff to start working again
Service model supports multiple apps concurrently

33
The Catch

Of course, the catch is power consumption
Continuous operation can be wasteful
Modularity can be less efficient than cross-layer
integration
Interesting questions
How much is gained by fine-grained shutdown, plus
the added coordination overhead, relative to more
coarse grained shutdown and periods of continuous
operation?
For instance, the AR system could shut down after
generating an initial map, and only wake up when
something moves.

34
The End!
For more information on EmStar,
see http://cvs.cens.ucla.edu/emstar/
35
Design Evolution

Initial design strategy: shortest path first
Modular decomposition according to best guess at
time
Making a full-blown, generalized service is much
more work than a one-off feature, so the tradeoff
is considered case by case
Problem: As more is learned these tradeoffs fit
more poorly
Unmanageable complexity to address problems
Redesign
Factor out common components
Plan for known scaling problems
Remaining modules are of manageable complexity,
yet usually achieve a more complete and correct
implementation
More sophisticated inter-module dependencies

36
Radios

Each node has two radios
TDMA, frequency hopping radios
63 hopping patterns
Each radio can lock to one pattern
Patterns are independent channels
Bases on same pattern tend to be desynchronized
Base/Remote (star) topology
Base synchronizes TDMA cycle, remotes join

37
TDMA Slot Scheme

Each frame contains 1 transmit slot for the base
and 1 transmit slot for each remote
Slot size implies MTU
Frame size is a constant
Base slot size is fixed: 70-byte MTU
Number of remotes inversely proportional to remote MTU
Practical MTU (40 bytes) implies 8-node clusters

[Figure: TDMA frame diagram. Labels: 14ms, Base,
S, Base Slot, Remote 1, Remote 2, Remote 3]
38
Packet Transfer

Broadcast capability
Base can use its slot to send a broadcast to all
remotes, or a unicast to a single remote
Remotes can send only unicasts to base
Link layer retransmission
MAC implements link layer ACKs for unicast
messages, and configurable retransmission

39
Breach Healing
In this application, healing is intended only
to address breaches created by dying nodes, not
preexisting breaches. Other algorithms might also
be useful, e.g. density maintenance, but were not
implemented here.
