Title: Design Lessons from an Unattended Ground Sensor System*
1. Design Lessons from an Unattended Ground Sensor System
- Lewis Girod
- CS 294-1
- 23 Sept 2003

Center for Embedded Networked Sensing, girod_at_cs.ucla.edu
Work done at Sensoria Corp, supported by DARPA/ATO contract DAAE30-00-C-1055.
Sensoria team: R. Costanza, J. Elson, L. Girod, W. Kaiser, D. McIntire, W. Merrill, F. Newberg, G. Rava, B. Schiffer, K. Sohrabi

2. Introduction
- Networked embedded systems come in a variety of sizes
- RAM is a primary tradeoff
  - Costs power, cost to shutdown
  - Enables greater complexity
  - Enables development while postponing optimization

Example platforms (CPU, RAM/flash):
- SPEC (J. Hill): 4MHz/8bit, 3K/0K
- Mica2 (Berkeley/Xbow): 8MHz/8bit, 4K/128K
- MK2 (UCLA/NESL): 40MHz/16bit, 136K/1M
- SHM (Sensoria): 300 MIPS with FPU/32bit, 64M/32M

3. Motivation for EmStar
- EmStar is a run-time environment designed for Linux-based distributed embedded systems
  - Useful facilities (process health/respawn, logging, emulation)
  - Common APIs (neighbor discovery, link interface, etc.)
  - Designed for larger memory footprint (avoids hard limits)
- Many of the ideas and motivations for EmStar derived from our experience with SHM
  - Modularity, robustness to module failure
  - System transparency at low cost to developers
- Some parts of EmStar are used in SHM and elsewhere
  - Time Synch service
  - Audio Server
 
4. System Objectives and Design

5. Objectives
- Unattended Ground Sensor (UGS) System
  - Fully autonomous operation
  - Ad-hoc deployment
- Scaling unit: 50 nodes
  - All operations and protocols local to 50-node region
  - No global operations or context required
 
Fancy graphics taken from the official SHM 
website 
6. Adaptive and Self-Configuring
- Self-localizing without GPS: acoustic ranging
  - Build map of relative locations
  - Adaptive/resilient to environmental conditions (e.g. wind, sunny days, background noise)
- Self-assembled data network
  - TDMA MAC layer
  - Typically 10-hop diameter with 100 nodes
  - Adaptive/resilient to RF environment

In this application, GPS is avoided for security reasons. In other applications, obstructions and foliage can be an issue.

7. Maintain coverage via Actuation
- Vigilant units detect failed unit(s)
- Remaining units autonomously move in to maintain coverage

8. Demonstration Requirements
- 200x50m outdoor field
  - 100 nodes, 10m spacing
- Sunny afternoon
  - 85F, 20 MPH wind
- No preconfigured state
- GPS-free relative geolocation to 0.25m
- Detect downed nodes, move to maintain coverage within 1 min

9. SHM Project Design Choices: Optimize for rapid development
- Concurrent HW/SW development
  - Compressed schedule
  - Aggressive scaling milestones
  - Logistical problems with debugging a system of 100 nodes in a 200x50m field
- Complex software required
  - 150K lines C code
  - 30 processes
  - 100 IPC channels
- Power is not the driving constraint
  - Continuous vigilance, rapid response are project requirements
  - System lifetime target < 1 day
 
10. System Configuration
- 300 MIPS RISC processor with FPU
- 64M RAM / 32M Flash
- Two 50kbps 2.4GHz data radios: TDMA, frequency-hopping, star-topology MAC, 63 hopping patterns
- 4 channels full-duplex audio
- 3-axis magnetometer / accelerometer
- 2 mobility units, with integrated thrusters
- Linux 2.4 kernel
- Optional wired ethernet (for Devel/Debug only)
 
11. Results: Acoustic Ranging
- Ground truth was hand-surveyed, +/- 0.5m
- Ranges not temperature compensated in demo
- Ranges with angles are more accurate
  - Angle from TDOA of two or more ranges, must be consistent
- Bug discovered after the fact, caused large errors
 
12. Results: Radio Utilization
- Graph shows traffic at three bases over a complete run
- Initial spikes
  - Tree formation
  - Lots of ranging
- Quiescent rates
  - Heartbeats to detect down nodes
  - Maintenance of trees and location
  - Reaction to dynamics
 
13. Challenges in Implementation
- Dealing with a dynamic environment
  - Adapt to wind, weather, RF connectivity
- Dealing with noise
  - Rejecting outliers from timesync, ranges, angles
  - Filtering neighbor connectivity, insignificant changes to range/angle
- Dealing with failure
  - Node failure
  - Infrequent crashes (e.g. FP exceptions from transient bad data)
  - Fault tolerance at process boundaries, avoid ripple effect
- Dealing with complexity
  - Cross-layer integration vs. modularity... or both?
  - What is the right set of primitives for coordination?

14. Case Study: Acoustic Ranging

15. Basic TOF Ranging
- Basic idea
  - Sender emits a characteristic acoustic signal
  - Receiver correlates the received time series with time-offset copies of the reference signal to find the peak offset
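The correlation step can be sketched in a few lines. This is a minimal illustration, not the SHM implementation: the pseudo-random reference code, the 48 kHz sample rate, the 343 m/s speed of sound, and the assumption that sample 0 of the recording coincides with the emit time are all placeholders chosen for the example.

```python
import random

SPEED_OF_SOUND_M_S = 343.0  # assumed value (air at roughly 20 C)

def best_lag(received, reference):
    """Return the sample offset where `reference` correlates best with `received`."""
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(received) - len(reference) + 1):
        # Dot product of the reference with the received window at this offset.
        score = sum(received[offset + i] * r for i, r in enumerate(reference))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

def tof_range_m(received, reference, sample_rate_hz):
    """Convert the peak offset to metres, assuming received[0] is the emit time."""
    return best_lag(received, reference) / sample_rate_hz * SPEED_OF_SOUND_M_S

# Synthetic check: bury a +/-1 code in noise at a known delay of 140 samples.
rng = random.Random(0)
reference = [rng.choice((-1.0, 1.0)) for _ in range(64)]
received = [rng.gauss(0.0, 0.2) for _ in range(400)]
for i, sample in enumerate(reference):
    received[140 + i] += sample

estimated_lag = best_lag(received, reference)
estimated_range = tof_range_m(received, reference, 48_000)
```

A real receiver would use an FFT-based correlation for speed; the brute-force loop is kept here only because it mirrors the slide's description directly.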
16. Basic AOA Estimation
- 16 possible paths
- First pick best speaker
- Then estimate angle from TDOA of one or more consistent ranges
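One common way to turn a time difference of arrival into a bearing is the far-field approximation sketched below. The slides do not give the actual geometry or formula used, so the microphone spacing, the speed of sound, and the function name `bearing_from_tdoa` are assumptions for illustration; the consistency check simply rejects a TDOA that is physically impossible for the given spacing.

```python
import math

def bearing_from_tdoa(tdoa_s, mic_spacing_m, speed_of_sound=343.0):
    """Far-field bearing (degrees from broadside) for one microphone pair.

    Returns None when the TDOA is inconsistent with the spacing, i.e. the
    implied path difference is longer than the microphone baseline.
    """
    ratio = speed_of_sound * tdoa_s / mic_spacing_m
    if abs(ratio) > 1.0:
        return None
    return math.degrees(math.asin(ratio))

# A source 30 degrees off broadside of a hypothetical 0.2 m microphone pair.
example_tdoa = 0.2 * math.sin(math.radians(30.0)) / 343.0
example_bearing = bearing_from_tdoa(example_tdoa, 0.2)
```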
17. Acoustic Ranging, Version 1
- First cut implemented explicit cluster coordination protocol
  - Lots of error cases to handle, hard to handle all efficiently
  - Very timing sensitive (sync)
- Did not scale past 20 nodes
  - Can't range across clusters
  - Best acoustic neighbors may be in other clusters
  - MLat merging algorithm is error prone
- Overuse of flooding
  - Soft state reflood of cluster MLats and orientation data

18. V2: Decomposing AR

19. Audio Sample Server
- Continuously samples audio, integrates to Timesync
  - Eliminates error-prone synchronized start
  - Enables acquisition of overlapped sample sets
- Buffers past N seconds, exposes buffered interface
  - Data access can be triggered after the fact, which relaxes timing constraints on the trigger message
  - Can process overlapping chirps by requesting overlapping retrievals, rather than having to pick one and ignore the other
- Enables access to the audio device from multiple apps
  - Ranging can coexist with acoustic comm subsystem

Acoustic comm was developed as a backup channel to be used in the event of RF jamming.
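The "buffer the past N seconds, retrieve after the fact" idea can be illustrated with a ring buffer addressed by absolute sample index. This is only a sketch: the class name `SampleBuffer` and its methods are hypothetical and are not the audio server's real interface.

```python
from collections import deque
from itertools import islice

class SampleBuffer:
    """Keeps the most recent `capacity` samples, addressed by absolute index."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)
        self.next_index = 0  # absolute index of the next sample to arrive

    def append(self, chunk):
        chunk = list(chunk)
        self.samples.extend(chunk)  # the oldest samples fall off automatically
        self.next_index += len(chunk)

    def retrieve(self, start, count):
        """Return samples [start, start + count), or None if not all buffered."""
        oldest = self.next_index - len(self.samples)
        if start < oldest or start + count > self.next_index:
            return None
        lo = start - oldest
        return list(islice(self.samples, lo, lo + count))

buf = SampleBuffer(capacity=20)
buf.append(range(30))                # samples 10..29 remain buffered
first_chirp = buf.retrieve(12, 5)    # requested after the samples were recorded
second_chirp = buf.retrieve(14, 5)   # overlaps the first retrieval
too_old = buf.retrieve(5, 3)         # already overwritten
```

Because retrieval is decoupled from recording, two overlapping requests both succeed, which is the property the slide relies on for overlapping chirps.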
20. Inter-node Timesync: RBS
- Key idea
  - Receiver latency is more deterministic than sender latency
  - Thus, common receivers of a sender can be synched by correlating the reception times of the sender's broadcasts
  - It's your only option if you don't control the MAC

For sender sync, senders must be in some other sender's broadcast domain.
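As a rough illustration of the receiver-receiver idea, the sketch below estimates the offset between two receivers' clocks from the times at which each heard the same broadcasts. The function name and the sample values are made up for the example, and clock skew is ignored here for brevity.

```python
from statistics import mean

def rbs_offset(rx_times_a, rx_times_b):
    """Estimate (clock B - clock A) from reception times of the same broadcasts.

    Element i of each list is that receiver's local timestamp for broadcast i.
    The sender's own clock never enters the computation.
    """
    return mean(tb - ta for ta, tb in zip(rx_times_a, rx_times_b))

# Three broadcasts heard by both receivers; B's clock runs about 5 s ahead of A's.
offset_b_minus_a = rbs_offset([10.0, 20.0, 30.0], [15.1, 24.9, 35.0])
```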
21. TimeSync Service
- Inter-node sync: implementation of RBS
  - Computes conversion params among all nodes in each cluster
  - Cluster head (CH) does computation, reports parameters to cluster members (CMs)
- Intra-node sync: codec sample clocks
  - Clock pairs reported by audio server
  - Map time of DMA interrupt to sample number
  - Outlier rejection and linear fit to find offset and skew estimate
  - Yields more consistent result than synch start
- Multihop time conversion
  - Graph of sync relations through system
  - Conversion from one element to another requires a path through the graph. Gaussian error at each step, so total error grows as sqrt(hops)
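The "outlier rejection and linear fit" step might look like the following sketch. The rejection rule used here (repeatedly drop the single worst point until all residuals fall under a threshold) is an assumption, since the slides do not say which rule was used; `multihop_sigma` just restates the sqrt(hops) growth for independent per-hop errors.

```python
import math

def linear_fit(pairs):
    """Least-squares fit y = skew * x + offset over (x, y) clock pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    skew = sxy / sxx
    return skew, mean_y - skew * mean_x

def robust_fit(pairs, max_residual):
    """Drop the worst-fitting pair until every residual is within max_residual."""
    kept = list(pairs)
    while len(kept) > 2:
        skew, offset = linear_fit(kept)
        residuals = [abs(y - (skew * x + offset)) for x, y in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] <= max_residual:
            break
        del kept[worst]
    return linear_fit(kept)

def multihop_sigma(per_hop_sigma, hops):
    """Standard deviation after `hops` independent Gaussian conversion errors."""
    return per_hop_sigma * math.sqrt(hops)

# Ten clock pairs with skew 1.0001 and offset 5.0, plus one corrupted sample.
clock_pairs = [(float(i), 1.0001 * i + 5.0) for i in range(10)]
clock_pairs[4] = (4.0, clock_pairs[4][1] + 50.0)
fitted_skew, fitted_offset = robust_fit(clock_pairs, max_residual=0.01)
```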
22. Hop-by-hop Time Conversion
- Problem
  - Nodes have ability to convert within cluster but not outside
  - Could continually broadcast conversion parameters, BUT
    - They are continuously varying
    - Large amount of data to transmit across network
- Solution: integrate time conversion with routing
  - Routing layer knows about packets that contain timestamps
  - Convert timestamps en route
    - At cluster boundaries
    - At destination node
  - Integrated with flooding
  - Can fail if sync graph ≠ route

Unclear what the right API is here; we simply added code to flooding.
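A minimal sketch of converting a timestamp en route is shown below. The data layout (a dictionary of per-hop skew and offset pairs keyed by node pair) and the function name are assumptions made for the example; the point is only that each forwarding hop applies its local conversion, and that conversion fails when the route uses a hop with no sync relation.

```python
def convert_along_route(timestamp, route, sync_relations):
    """Convert `timestamp` from the first node's timebase to the last node's.

    route: list of node ids in forwarding order.
    sync_relations: {(a, b): (skew, offset)} meaning t_b = skew * t_a + offset.
    Returns None if some hop on the route has no known sync relation.
    """
    for src, dst in zip(route, route[1:]):
        relation = sync_relations.get((src, dst))
        if relation is None:
            return None  # the sync graph does not cover this routing hop
        skew, offset = relation
        timestamp = skew * timestamp + offset
    return timestamp

relations = {("A", "B"): (1.0, 5.0), ("B", "C"): (1.0, -2.0)}
converted = convert_along_route(100.0, ["A", "B", "C"], relations)
dropped = convert_along_route(100.0, ["A", "C"], relations)
```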
23. Reliable State Synchronization
- Problem
  - Need to reliably broadcast the latest range data to N-hop away nodes, so they can build a consistent coordinate system
  - Should have reasonable latency and low overhead
- V1 addressed this problem with periodic refresh
  - Cluster heads retransmit Mlat tables every 15 seconds
  - Problems: traffic load from redundant sends, latency on msg loss
- Traffic load forced new protocol
  - Send a hash when there was no change since last refresh
  - If the hash has not been seen, request full version
  - But, still has 15 second latency on lost data
- V2 introduced a Reliable State Sync protocol (RSS)
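The hash-based refresh described above can be sketched as follows. The class names, the message tuples, and the use of SHA-256 over a JSON encoding are all illustrative assumptions; the slides do not specify the hash or the message format.

```python
import hashlib
import json

def table_digest(table):
    """Stable digest of a table (assumed encoding: sorted-key JSON, SHA-256)."""
    encoded = json.dumps(table, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

class ClusterHead:
    def __init__(self, table):
        self.table = dict(table)
        self.last_sent = None

    def refresh(self):
        """Send the full table when it changed, otherwise only its hash."""
        digest = table_digest(self.table)
        if digest == self.last_sent:
            return ("hash", digest)
        self.last_sent = digest
        return ("full", dict(self.table))

class Listener:
    def __init__(self):
        self.seen = {}

    def receive(self, message):
        """Return True when the full table must be requested."""
        kind, payload = message
        if kind == "full":
            self.seen[table_digest(payload)] = payload
            return False
        return payload not in self.seen

head = ClusterHead({"n1-n2": 9.8})
first, second = head.refresh(), head.refresh()
healthy, lossy = Listener(), Listener()
healthy_needs_full = [healthy.receive(first), healthy.receive(second)]
lossy_needs_full = lossy.receive(second)  # this listener missed the full send
```

The lossy listener only discovers the gap at the next refresh, which is the 15 second latency the slide points out.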
24. RSS Design
- Semantics
  - Reliably converges on latest published state
  - Does not guarantee client sees every transition
- Robust and efficient, structurally similar to SRM/wb
  - Based on reliable transfer of a sequenced log of diffs
  - Pruning of the log is done with awareness of log semantics (replaced or deleted keys are pruned)
  - Per-source forwarding trees (MST of connectivity graph)
  - Local repair, up to complete history, from upstream neighbor
  - New or restarted nodes will download all active flows from upstream neighbor
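To make the "sequenced log of diffs with semantic pruning" concrete, here is a small sketch. It is one possible reading, not the RSS implementation: in particular, keeping a deletion marker as the latest entry for a deleted key is an assumption made so that nodes which already saw the key still converge.

```python
class DiffLog:
    """Sequenced log of key/value diffs; only the latest diff per key is kept."""

    def __init__(self):
        self.next_seq = 1
        self.log = []  # entries are (seq, key, value); value None marks a delete

    def _append(self, key, value):
        # Semantic pruning: an older diff for the same key is now redundant.
        self.log = [entry for entry in self.log if entry[1] != key]
        self.log.append((self.next_seq, key, value))
        self.next_seq += 1

    def publish(self, key, value):
        self._append(key, value)

    def delete(self, key):
        self._append(key, None)

    def diffs_after(self, seq):
        """Diffs a downstream node needs if it has applied everything up to seq."""
        return [entry for entry in self.log if entry[0] > seq]

def apply_diffs(state, diffs):
    for _, key, value in diffs:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state

log = DiffLog()
log.publish("r12", 4.1)
log.publish("r13", 7.0)
early = apply_diffs({}, log.diffs_after(0))     # node that has seen seq 1-2
log.publish("r12", 4.3)                         # replaces r12, old diff pruned
log.delete("r13")                               # marker replaces the r13 diff
early = apply_diffs(early, log.diffs_after(2))  # incremental repair
late = apply_diffs({}, log.diffs_after(0))      # restarted node, full download
```

Both nodes end in the same state even though the late node never saw the intermediate transitions, matching the "converges on latest published state" semantics.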
25. RSS API
- Node X publishes current state as Key-Value Pairs
  - Diffs are reliably broadcast N hops away from X
  - Each node within N hops of X eventually sees the data X published
  - API presents each node's KVPs in its own namespace
- Caveat: transmission latency, loss, edge of hopcount can cause transient inconsistencies

[Figure: State Sync Bus connecting nodes 1, 3, 2, 4. Three of the nodes hold the entries "1 A1, 1 B2, 2 A3, 2 C4"; the remaining node holds only "2 A3, 2 C4".]
Note: 2-hop publish from 1 doesn't reach 4.
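The hop-limited, per-publisher namespace can be modeled with a breadth-first search over the connectivity graph. The line topology 1-3-2-4 and the key/value contents below are a guess at what the figure shows, and `visible_state` is a hypothetical helper rather than the RSS API.

```python
from collections import deque

def hops_from(source, links):
    """Breadth-first hop counts from `source` over an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def visible_state(published, links, max_hops):
    """For each node, the KVPs it eventually sees, namespaced by publisher."""
    views = {node: {} for node in links}
    for source, kvps in published.items():
        for node, hops in hops_from(source, links).items():
            if hops <= max_hops:
                views[node][source] = dict(kvps)
    return views

links = {1: [3], 3: [1, 2], 2: [3, 4], 4: [2]}              # assumed topology
published = {1: {"A": 1, "B": 2}, 2: {"A": 3, "C": 4}}      # assumed contents
views = visible_state(published, links, max_hops=2)
```

Node 4 is three hops from node 1 in this assumed topology, so its view contains only node 2's namespace.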
26. Putting it back together: AR V2

27. AR V1 Event Diagram
- But, there are many error cases
  - REQ lost?
  - ACK lost?
  - Bcast lost to some receivers?
  - Bcast delayed in queue?
  - Bcast lost to sender?
  - CMs join two clusters; may be busy ranging in other cluster
  - Inaccurate codec sync start?
  - Interference in acoustic channel?
  - Reply from sender lost?
  - Reply from receiver(s) lost?
  - How long to wait for stragglers?
  - CH failure loses all ranges for cluster
 
Diagram participants: Cluster Head (Coordinator) running mlatd (CH) and AR (CH); Cluster Member (Sender) running AR (CM); Cluster Member (Receiver) running AR (CM).

Event sequence:
- If not enough data for cluster mlat, request ranging to specific missing cluster members
- Send Range REQ to first CM in round-robin order, check busy
- ACK with preferred start time
- Bcast Range Start, specify code
- Timestamp msg arrival. Sender delays before starting to ensure rough sync, and reports exact time offset from bcast to codec start
- Acoustic signal
- Run correlation, report time offset from bcast to detection in data
- CH waits for stragglers
- Report new ranges and notify mlat when round robin through CMs completes

Reliability challenges: The sender is the linchpin: an error in sender sync affects all ranges to receivers, and replies from receivers can't be interpreted without the sender reply. If connectivity to the sender is bad and the broadcast is lost, all receivers waste CPU on a useless correlation. Implementing reliable reporting is made more complex because retransmitted receiver replies must be matched to a past sender.

- Big complexity increase to
  - Range across clusters
  - Coordinate adjacent clusters
  - Do regional mlats
  - Average multiple sync bcasts

28. AR V2 Event Diagram
- What can go wrong here?
  - Collisions in acoustic channel
  - Flooded message delayed beyond audio buffering (16 seconds)
  - Flooded message dropped for lack of sync relations along route
  - Node restart causes ranges to/from that node to be dropped

Diagram participants: mlatd, ar_send, ar_recv, syncd, audiod.
States shown: Continuous Sampling; Continuous Sync Maintenance; Waiting for chirp notification; Waiting for chirp request; Waiting for enough data to compute mlat.

Event sequence:
- If not enough data for mlat, request chirp and wait for a while
- Chirp audio (audiod on remote node records it): acoustic signal
- Flood chirp notification message, with hop-by-hop conversion at flood layer
- Retrieve samples from buffer and correlate in separate thread
- Publish new range to N-hop away neighbors
- Try mlat again with new data in separate thread

- Key design points
  - Encapsulate timing-critical parts; no timing constraints on reliability
  - If a receiver can't sync to the sender it won't attempt correlation

29. Key Observations
- No coordination required
- Simplifying transport abstractions
- Continuous operation and service model
 
30. Key Observations
- No coordination required
  - If mlatd doesn't have enough data it triggers chirping to start generating more data
  - Exponential backoff on chirping with reset when data is lost
  - Simplicity of system lets designer focus on these details
  - ar_send and ar_recv are slaves to request and notify messages
  - Transparently, ar_recv can receive overlapping triggers and buffer the data for correlation
  - Priority scheme decides the best order to process queued correlations, based on past success/failure and RF hopcount
- Simplifying transport abstractions
- Continuous operation and service model
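The "exponential backoff on chirping with reset when data is lost" policy could be sketched like this. The class name, the initial delay, and the cap are hypothetical values for illustration; the slides do not state the actual timing parameters.

```python
class ChirpBackoff:
    """Doubles the wait between chirp requests, up to a cap, until reset."""

    def __init__(self, initial_s=1.0, max_s=64.0):
        self.initial_s = initial_s
        self.max_s = max_s
        self.delay_s = initial_s

    def next_delay(self):
        """Return the current wait and double it for the following request."""
        delay = self.delay_s
        self.delay_s = min(self.delay_s * 2.0, self.max_s)
        return delay

    def reset(self):
        """Call when range data is lost, so chirping resumes quickly."""
        self.delay_s = self.initial_s

backoff = ChirpBackoff(initial_s=1.0, max_s=8.0)
waits = [backoff.next_delay() for _ in range(5)]
backoff.reset()
after_reset = backoff.next_delay()
```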
 
31. Key Observations
- No coordination required
- Simplifying transport abstractions
  - Flooding takes care of delivering a local time
  - State Sync provides consistency for data input to mlatd
  - Efficiently supports a potentially large number of keys (1000), enabling full regional mlat at each node (no merging)
  - Mlat takes 10-15min, sync is consistent on that timescale
  - Failure of one node only loses range data for that node
- Continuous operation and service model
 
32. Key Observations
- No coordination required
- Simplifying transport abstractions
- Continuous operation and service model
  - Eliminates many inconsistencies and corner cases
  - Reduces the number of states or modes
  - Simplifies interfaces to services
  - Recovery from faults without coordination: just wait for stuff to start working again
  - Service model supports multiple apps concurrently
 
33. The Catch
- Of course, the catch is power consumption
  - Continuous operation can be wasteful
  - Modularity can be less efficient than cross-layer integration
- Interesting questions
  - How much is gained by fine-grained shutdown, plus the added coordination overhead, relative to more coarse-grained shutdown and periods of continuous operation?
  - For instance, the AR system could shut down after generating an initial map, and only wake up when something moves.

34. The End!
For more information on EmStar, see http://cvs.cens.ucla.edu/emstar/

35. Design Evolution
- Initial design strategy: shortest path first
  - Modular decomposition according to best guess at time
  - Making a full-blown, generalized service is much more work than a one-off feature, so tradeoff considered case by case
- Problem: as more is learned these tradeoffs fit more poorly
  - Unmanageable complexity to address problems
- Redesign
  - Factor out common components
  - Plan for known scaling problems
  - Remaining modules are of manageable complexity, yet usually achieve a more complete and correct implementation
  - More sophisticated inter-module dependencies
 
36. Radios
- Each node has two radios
- TDMA, frequency hopping radios
  - 63 hopping patterns
  - Each radio can lock to one pattern
  - Patterns are independent channels
  - Bases on same pattern tend to be desynchronized
- Base/Remote (star) topology
  - Base synchronizes TDMA cycle, remotes join
 
37. TDMA Slot Scheme
- Each frame contains 1 transmit slot for the base and 1 transmit slot for each remote
- Slot size implies MTU
  - Frame size is a constant
  - Base slot size is fixed: 70 byte MTU
  - Number of remotes inversely proportional to remote MTU
  - Practical MTU (40 bytes) → 8 node clusters

[Figure: TDMA frame layout, labeled 14ms, showing Base, S, Base Slot, Remote 1, Remote 2, Remote 3.]

38. Packet Transfer
- Broadcast capability
  - Base can use its slot to send a broadcast to all remotes, or a unicast to a single remote
  - Remotes can send only unicasts to base
- Link layer retransmission
  - MAC implements link layer ACKs for unicast messages, and configurable retransmission

39. Breach Healing
In this application, healing is intended only 
to address breaches created by dying nodes, not 
preexisting breaches. Other algorithms might also 
be useful, e.g. density maintenance, but were not 
implemented here.