Quality Aware Sensor Database (QUASAR) Project
- Sharad Mehrotra
- Department of Information and Computer Science
- University of California, Irvine
Supported in part by a collaborative NSF ITR grant entitled "Real-time data capture, analysis, and querying of dynamic spatio-temporal events," in collaboration with UCLA, U. Maryland, and U. Chicago
Talk Outline
- Quasar Project
- motivation and background
- data collection and archival components
- query processing
- tracking application using QUASAR framework
- challenges and ongoing work
- Brief overview of other research projects
- MARS Project - incorporating similarity retrieval and refinement over structured and semi-structured data to aid interactive data analysis/mining
- Database as a Service (DAS) Project - supporting the application service provider model for data management
Emerging Computing Infrastructure
- Generational advances to computing infrastructure
- sensors will be everywhere
- Emerging applications with limitless possibilities
- real-time monitoring and control, analysis
- New challenges
- limited bandwidth and energy
- highly dynamic systems
- System architectures are due for an overhaul
- at all levels of the system: OS, middleware, databases, applications
(Figure: in-body, in-cell, in-vitro spaces; instrumented wide-area spaces)
Impact on Data Management
- Traditional data management
- client-server architecture
- efficient approaches to data storage and querying
- query shipping versus data shipping
- data changes with explicit updates
- Emerging challenge
- data producers must be considered as first-class entities
- sensors generate continuously changing, highly dynamic data
- sensors may store, process, and communicate data
Data Management Architecture Issues
(Figure: producer cache and server exchanging data/query requests and results)
- Where to store data?
- Do not store -- stream model
- not suitable if we wish to archive data for future analysis or if data is too important to lose
- at the producers
- limited storage, network, and compute resources
- at the servers
- server may not be able to cope with high data production rates; may lead to data staleness and/or wasted resources
- Where to compute?
- at the client, server, or data producers
Quasar Architecture
- Hierarchical architecture
- data flows from producers to server to clients periodically
- queries flow the other way
- If the client cache does not suffice, then
- the query is routed to the appropriate server
- If the server cache does not suffice, then access current data at the producer
- This is a logical architecture -- producers could also be clients.
(Figure: query flow from client cache to server cache/archive; data flow upward from producer cache)
Quasar Observations and Approach
- Applications can tolerate errors in sensor data
- applications may not require exact answers
- small errors in location during tracking, or in the answer to a query, may be OK
- data cannot be precise due to measurement errors, transmission delays, etc.
- Communication is the dominant cost
- limited wireless bandwidth; a source of major energy drain
- Quasar approach
- exploit application error tolerance to reduce communication between producer and server
- Two approaches
- minimize resource usage given quality constraints
- maximize quality given resource constraints
Quality-based Data Collection Problem
(Figure: sensor time series pn, pn−1, …, p1 flowing from producer to server)
- Let P = ⟨p1, p2, …, pn⟩ be a sequence of environmental measurements (a time series) generated by the producer, where n = now
- Let S = ⟨s1, s2, …, sn⟩ be the server-side representation of the sequence
- A within-ε quality data collection protocol guarantees that
- for all i, error(pi, si) < ε
- ε is derived from the application's quality tolerance
Simple Data Collection Protocol
- Sensor logic (at time step n)
- let p = last value sent to server
- if error(pn, p) > ε
- send pn to server
- Server logic (at time step n)
- if a new update pn was received at step n
- sn = pn
- else
- sn = last update sent by sensor
- guarantees that the maximum error at the server is less than or equal to ε
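As a concrete illustration, the protocol above can be sketched as follows (class names and the callback are illustrative, not from the QUASAR implementation; abs() stands in for the slide's error() function):

```python
class Sensor:
    """Producer side: transmit only when the last-sent value drifts beyond eps."""
    def __init__(self, eps, send):
        self.eps = eps          # error tolerance epsilon
        self.last_sent = None   # last value shipped to the server
        self.send = send        # callback delivering an update to the server

    def observe(self, p):
        # Send p only if it deviates from the last sent value by more than eps.
        if self.last_sent is None or abs(p - self.last_sent) > self.eps:
            self.last_sent = p
            self.send(p)

class Server:
    """Server side: repeat the last received update when no message arrives."""
    def __init__(self):
        self.value = None

    def receive(self, p):
        self.value = p

    def estimate(self):
        return self.value
```

Because the sensor suppresses any update within eps of the last transmission, the server's estimate is always within eps of the true reading.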
Exploiting Prediction Models
- Producer and server agree upon a prediction model (M, θ)
- Let spred_i be the predicted value at time i based on (M, θ)
- Sensor logic (at time step n)
- if error(pn, spred_n) > ε
- send pn to server
- Server logic (at time step n)
- if a new update pn was received at step n
- sn = pn
- else
- sn = spred_n based on model (M, θ)
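A minimal sketch of this model-based variant, using a shared linear predictor as an illustrative stand-in for an agreed (M, θ):

```python
# Sensor and server share the same prediction function; the sensor transmits
# only on model violation, and the server otherwise uses the prediction.
def make_model(base_value, slope, base_step):
    # Illustrative linear model: spred(n) = base_value + slope * (n - base_step)
    return lambda n: base_value + slope * (n - base_step)

eps = 1.0
model = make_model(0.0, 2.0, 0)    # shared (M, theta): predicts 2*n

sent = []                          # messages actually put on the wire
server_view = []                   # the server-side sequence S

for n, p in enumerate([0.1, 2.3, 3.9, 9.0, 8.2]):
    pred = model(n)
    if abs(p - pred) > eps:        # sensor logic: send on model violation
        sent.append((n, p))
        server_view.append(p)      # server logic: use the fresh update...
    else:
        server_view.append(pred)   # ...else fall back to the prediction
```

With a well-fitting model, most steps trigger no transmission at all, which is the communication saving the slide describes.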
Challenges in Prediction
- Simple versus complex models?
- complex, more accurate models require more parameters (that will need to be transmitted)
- the goal is to minimize communication, not necessarily to find the best prediction
- How is a model M generated?
- static -- one out of a fixed set of models
- dynamic -- dynamically learn a model from the data
- When should a model M or its parameters θ be changed?
- immediately on model violation
- too aggressive -- a violation may be a temporary phenomenon
- never changed
- too conservative -- data rarely follows a single model
Challenges in Prediction (cont.)
- Who does the model update?
- Server
- long-haul prediction models are possible, since the server maintains history
- might not predict recent behavior well, since the server does not know the exact sensor sequence -- it has only samples
- extra communication to inform the producer
- Producer
- better knowledge of recent history
- long-haul models are not feasible, since the producer does not keep the full history
- producers share the computation load
- Both
- server looks for new models; sensor performs parameter fitting given existing models.
Archiving Sensor Data
- Often sensor-based applications are built with only the real-time utility of time-series data in mind.
- Values at time instants << n are discarded.
- Archiving such data consists of maintaining the entire S sequence, or an approximation thereof.
- Importance of archiving
- discovering large-scale patterns
- once-only phenomena, e.g., earthquakes
- discovering events detected post facto by rewinding the time series
- future uses of the data that may not be known while it is being collected
Problem Formulation
- Let P = ⟨p1, p2, …, pn⟩ be the sensor time series
- Let S = ⟨s1, s2, …, sn⟩ be the server-side representation
- A within-ε_archive quality data archival protocol guarantees that
- error(pi, si) < ε_archive
- Trivial solution: modify the collection protocol to collect data at a quality guarantee of min(ε_archive, ε_collect)
- then the prediction model by itself will provide a within-ε_archive data stream that can be archived
- Better solutions are possible, since
- archived data is not needed for immediate access by real-time or forecasting applications (such as monitoring, tracking)
- compression can be used to reduce data transfer
Data Archival Protocol
(Figure: the sensor's memory buffer holds pn, pn−1, …; sensor updates flow to the server for data collection, and a compressed representation flows separately for archiving. Processing at the sensor is exploited to reduce communication cost and hence battery drain.)
- The sensor compresses the observed time series p[1..n] and sends a lossy compression to the server
- At time n
- p[1..n−nlag] is at the server in compressed form: s[1..n−nlag] is within-ε_archive
- s[n−nlag+1..n] is estimated via a predictive model (M, θ)
- the collection protocol guarantees that this remains within-ε_collect
- s[n+1..] can be predicted, but its quality is not guaranteed (these values are in the future, so the sensor has not observed them)
Piecewise Constant Approximation (PCA)
- Given a time series Sn = s[1..n], a piecewise constant approximation of it is a sequence
- PCA(Sn) = ⟨(c1, e1), …, (ck, ek)⟩
- that allows us to estimate sj as
- ŝj = ci if j is in [e(i−1)+1, ei]
- ŝj = c1 if j ≤ e1
(Figure: value-versus-time plot of constant segments c1, c2, c3, c4 ending at times e1, e2, e3, e4)
Online Compression using PCA
- Goal: given a stream of sensor values, generate a within-ε_archive PCA representation of the time series
- Approach (PMC-midrange)
- maintain m, M as the minimum/maximum values of the samples observed since the last segment
- on processing pn, update m and M if needed
- if M − m > 2·ε_archive, output the segment ((m + M)/2, n)
(Figure: example run with ε_archive = 1.5 over time steps 1-5)
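The PMC-midrange pass above can be sketched as a one-pass function (an illustrative rendering, not the project's code):

```python
def pmc_mr(samples, eps):
    """One-pass PMC-midrange: emit (midrange, end_index) segments such that
    every sample in a segment is within eps of the segment's midrange.
    Only O(1) state (the running min/max) is kept between samples."""
    segments = []
    m = M = None                      # running min / max of the open segment
    for i, p in enumerate(samples):
        lo = p if m is None else min(m, p)
        hi = p if M is None else max(M, p)
        if hi - lo > 2 * eps:         # p cannot join: close the segment at i-1
            segments.append(((m + M) / 2, i - 1))
            m, M = p, p               # start a fresh segment with p
        else:
            m, M = lo, hi
    if m is not None:                 # flush the final open segment
        segments.append(((m + M) / 2, len(samples) - 1))
    return segments
```

A segment is closed exactly when the incoming sample would stretch the spread past 2·eps, so the midrange of the closed segment is within eps of every sample it covers.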
Online Compression using PCA (cont.)
- PMC-MR
- guarantees that each segment compresses the corresponding time-series segment to within-ε_archive
- requires O(1) storage
- is instance optimal
- no other PCA representation with fewer segments can meet the within-ε_archive constraint
- Variant of PMC-MR
- PMC-MEAN, which takes the mean of the samples seen so far instead of the midrange.
Improving PMC using Prediction
- Observation: prediction models guarantee a within-ε_collect version of the time series at the server even before the compressed time series arrives from the producer.
- Can the prediction model be exploited to reduce the overhead of compression?
- If ε_archive > ε_collect, no additional effort is required for archival --> simply archive the predicted model.
- Approach
- define an error time series Ei = pi − spred_i
- compress E[1..n] to within-ε_archive instead of compressing p[1..n]
- the archive contains the prediction parameters and the compressed error time series
- the within-ε_archive version of E, together with (M, θ), can be used to reconstruct a within-ε_archive version of P
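Server-side reconstruction can be sketched as follows, assuming a hypothetical predict function and a list of (value, end_index) error segments as produced by a PMC-style compressor; adding the prediction back to a within-ε error value yields a within-ε sample value:

```python
def reconstruct(predict, err_segments, n):
    """Rebuild samples 0..n-1 from the archived prediction model and the
    compressed error series. err_segments is a list of (value, end_index)
    pairs covering the whole index range in order."""
    out = []
    seg = iter(err_segments)
    val, end = next(seg)
    for i in range(n):
        if i > end:                  # advance to the segment covering index i
            val, end = next(seg)
        out.append(predict(i) + val) # prediction plus archived error estimate
    return out
```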
Combining Compression and Prediction (Example)
Estimating Time Series Values
- Historical samples (before n−nlag) are maintained at the server within-ε_archive
- Recent samples (between n−nlag+1 and n) are maintained by the sensor and predicted at the server.
- If an application requires ε_q precision, then
- if ε_q ≥ ε_collect, it must wait in case a parameter refresh is en route
- if ε_q ≥ ε_archive but ε_q < ε_collect, it may probe the sensor or wait for a compressed segment
- otherwise, only probing meets the precision
- For future samples (after n), immediate probing is not available as an option
Experiments
- Data sets
- Synthetic random walk
- x1 = 0 and xi = xi−1 + si, where si is drawn uniformly from [−1, 1]
- Oceanographic buoy data
- environmental attributes (temperature, salinity, wind speed, etc.) sampled at 10-minute intervals from a buoy in the Pacific Ocean (Tropical Atmosphere Ocean Project, Pacific Marine Environmental Laboratory)
- GPS data collected using iPAQs
- Experiments to test
- compression performance of PMC
- benefits of model selection
- query accuracy over compressed data
- benefits of the prediction/compression combination
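The synthetic random-walk series above can be generated as follows (the seed is an illustrative choice for reproducibility):

```python
import random

def random_walk(n, seed=0):
    """Generate x1 = 0 and xi = x(i-1) + si, si uniform in [-1, 1]."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(xs[-1] + rng.uniform(-1.0, 1.0))
    return xs
```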
Compression Performance
K/n ratio = number of segments / number of samples
Query Performance over Compressed Data
How many sensors have values > v? (mean selectivity 50%)
Impact of Model Selection
- Objects moved at approximately constant speed (plus measurement noise)
- Three models used
- loc_n = c
- loc_n = c + v·t
- loc_n = c + v·t + 0.5·a·t²
- Parameters v, a were estimated at the sensor over a moving window of 5 samples
K/n ratio = number of segments / number of samples.
ε_pred is the localization tolerance in meters
Combining Prediction with Compression
K/n ratio = number of segments / number of samples
GPS Mobility Data from Mobile Clients (iPAQs)
QUASAR client time series
- Latitude time series: 1800 samples
- Compressed time series (PMC-MR, ICDE 2003): accuracy of 100 m, 130 segments
Query Processing in Quasar
- Problem definition
- Given
- sensor time series with quality guarantees captured at the server
- a query with a specified quality tolerance
- Return
- query results incurring the least cost
- Techniques depend upon
- the nature of the queries
- cost measures
- resource consumption -- energy, communication, I/O
- query response time
Aggregate Queries
Example (sensor values 2, 7, 6): min(Q) = 2, max(Q) = 7, count(Q) = 3, sum(Q) = 2 + 7 + 6 = 15, avg(Q) = 15/3 = 5
Processing Aggregate Queries (minimizing producer probes)
- Let S = ⟨s1, s2, …, sn⟩ be the set of sensors that meet the query criteria, where si.high = spred_i(t) + ε_pred_i and si.low = spred_i(t) − ε_pred_i
- MIN query
- c = min_j(sj.high)
- b = c − ε_query
- probe all sensors where sj.low < b
- in the figure, only s1 and s3 will be probed
- SUM query
- select a minimal subset S' ⊆ S such that
- (sum of ε_pred_i over si in S') ≥ (sum of ε_pred_i over si in S) − ε_query
- if ε_query = 15, only s1 will be probed
(Figure: sensor intervals s1..s5 with cutoff values a, b, c and per-sensor tolerances)
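The MIN-query probe selection can be sketched as follows, representing each sensor by its server-side interval [low, high] (names are illustrative):

```python
def min_query_probes(intervals, eps_query):
    """Return the indices of sensors that must be probed to answer a MIN
    query within eps_query. intervals is a list of (low, high) bounds on
    each sensor's current value as known at the server."""
    c = min(high for low, high in intervals)  # the true MIN is certainly <= c
    b = c - eps_query
    # Any sensor whose lower bound is below b could still change the answer
    # by more than the tolerance, so it must be probed.
    return [i for i, (low, high) in enumerate(intervals) if low < b]
```

Sensors whose whole interval lies above b cannot pull the minimum down by more than ε_query, so skipping them is safe.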
Minimizing Cost at the Server
- The error tolerance of queries can be exploited to reduce processing at the server.
- Key idea
- use a multi-resolution index structure (MRA-tree) for processing aggregate queries at the server.
- An MRA-tree is a modified multi-dimensional index tree (R-tree, quadtree, hybrid tree, etc.)
- a non-leaf node contains (for each of its subtrees) four aggregates: MIN, MAX, COUNT, SUM
- a leaf node contains the actual data points (sensor models)
MRA-Tree Data Structure
(Figure: spatial view and tree-structure view of nodes A-G)
MRA-Tree Node Structure
(Figure: non-leaf node with probe pointers, each costing 2 messages; leaf node with disk-page pointers, each costing 1 I/O)
Node Classification
- Two sets of nodes
- NP (partial contribution to the query)
- NC (complete contribution)
Aggregate Queries using the MRA-Tree
- Initialize NP with the root
- At each iteration, remove one node N from NP and, for each child Nchild of N:
- discard, if Nchild is disjoint from Q
- insert into NP if Q is contained in or partially overlaps Nchild
- insert into NC if Q contains Nchild (we only need to maintain agg(NC))
- compute the best estimate based on the contents of NP and NC
(Figure: query region Q overlapping node N)
MIN (and MAX)
Example: min(NC) = min{4, 5} = 4 and min(NP) = min{3, 9} = 3, so
L = min(min(NC), min(NP)) = 3 and H = min(NC) = 4; hence the interval I = [3, 4]
Estimate (lower bound): E(min(Q)) = L = 3
(Figure: tree nodes with subtree minima 3, 4, 5, 9)
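The interval computation in this example can be sketched directly (nc_mins and np_mins are the subtree minima of the NC and NP nodes):

```python
def min_interval(nc_mins, np_mins):
    """Bound the true MIN of the query region from the MRA-tree aggregates.
    NC nodes lie entirely inside the query; NP nodes only partially overlap it."""
    L = min(nc_mins + np_mins)  # optimistic: every NP minimum might be inside Q
    H = min(nc_mins)            # pessimistic: no NP point falls inside Q
    return L, H                 # the true min(Q) lies in [L, H]
```

Traversal then refines this interval by expanding NP nodes until [L, H] is narrow enough for the query's tolerance.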
MRA-Tree Traversal
- Progressive answer refinement until NP is exhausted
- Greedy, priority-based local decision for the next node to explore, based on
- cost (1 I/O or 2 messages)
- benefit (expected reduction in answer uncertainty)
(Figure: traversal over tree nodes A-G)
Adaptive Tracking of Mobile Objects in Sensor Networks
- Tracking architecture
- a network of wireless acoustic sensors arranged as a grid, transmitting via a base station to a server
- a track of the mobile object is generated at the base station or server
- Objective
- track a mobile object at the server such that the track deviates from the real trajectory within a user-defined error threshold ε_track, with minimum communication overhead.
Sensor Model
- Wireless sensors: battery operated, energy constrained
- Operate on the received acoustic waveforms
- Signal attenuation of the target object is given by Is(t) = P / (4π r²)
- P = source object power
- r = distance of the object from the sensor
- Is(t) = intensity reading at time t at the i-th sensor
- Ith = intensity threshold at the i-th sensor
Sensor States
- S0 Monitor (processor on, sensor on, radio off)
- shift to S1 if intensity is above threshold
- S1 Active (processor on, sensor on, radio on)
- send intensity readings to the base station
- on receiving a message from the BS containing an error tolerance, shift to S2
- S2 Quasi-active (processor on, sensor on, radio intermittent)
- send an intensity reading to the BS if the error from the previous reading exceeds the error threshold
- the Quasar collection approach is used in the quasi-active state
Server-Side Protocol
- Server maintains
- a list of sensors in the active/quasi-active state
- a history of their intensity readings over a period of time
- Server-side protocol
- convert track quality to a relative intensity error at the sensors
- send the relative intensity error to a sensor in the active state (S1), shifting it to quasi-active (S2)
- triangulate using n sensor readings at discrete time intervals.
Basic Triangulation Algorithm (using 3 sensor readings)
P = source object power, Ii = intensity reading at the i-th sensor:
(x − x1)² + (y − y1)² = P / (4π I1)
(x − x2)² + (y − y2)² = P / (4π I2)
(x − x3)² + (y − y3)² = P / (4π I3)
Solving, we get (x, y) = f(x1, x2, x3, y1, y2, y3, P, I1, I2, I3)
- More complex approaches that amalgamate more than three sensor readings are possible
- based on numerical methods -- they do not provide a closed-form equation between sensor readings and the tracked location!
- The server can use simple triangulation to convert track quality into sensor intensity quality tolerances, and a more complex approach to track.
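One way to solve this system is to subtract the equations pairwise, which cancels the quadratic terms and leaves a 2x2 linear system in (x, y); a sketch, not the project's implementation:

```python
import math

def triangulate(sensors, P):
    """Locate the source from three (xi, yi, Ii) readings, where each reading
    satisfies (x - xi)^2 + (y - yi)^2 = P / (4*pi*Ii). Subtracting equation 1
    from equations 2 and 3 gives a linear system, solved by Cramer's rule."""
    (x1, y1, I1), (x2, y2, I2), (x3, y3, I3) = sensors
    r2 = [P / (4 * math.pi * I) for I in (I1, I2, I3)]  # squared distances
    # Eq1 - Eq2:  2(x2-x1) x + 2(y2-y1) y = r1^2 - r2^2 + x2^2-x1^2 + y2^2-y1^2
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = r2[0] - r2[1] + x2**2 - x1**2 + y2**2 - y1**2
    # Eq1 - Eq3, analogously
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = r2[0] - r2[2] + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1          # nonzero when sensors are not collinear
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
```

This only works when the three sensors are not collinear; with noisy intensities, a least-squares fit over more readings (the "more complex approaches" above) is preferable.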
Adaptive Tracking: Mapping Track Quality to Sensor Readings
- Claim 1 (power constant)
- let Ii be the intensity value of sensor i
- if the intensity errors satisfy the slide's bound (formula lost in extraction), then track quality is guaranteed to be within ε_track
- where C is a constant derived from the known locations of the sensors and the power of the object
- Claim 2 (power varies between Pmin and Pmax)
- if the corresponding bound holds, then track quality is guaranteed to be within ε_track, where C' = C/P² is a constant
- The above constraint is a conservative estimate; better bounds are possible
(Figure: intensity traces I2, I3 over [ti, t(i+1)] with tolerances ΔI2, ΔI3, and the resulting ε_track region in the X-Y plane, in meters)
Adaptive Tracking: Prediction to Improve Performance
- Communication overhead is further reduced by exploiting the predictability of the object being tracked
- Static prediction: sensor and server agree on a set of prediction models
- only two models used: stationary and constant velocity
- Who predicts: a sensor-based mobility prediction protocol
- every sensor by default follows the stationary model
- based on the history of its readings, it may change to the constant-velocity model (the number of readings is limited by the sensor memory size)
- it informs the server of the model switch
Actual Track versus Track under Adaptive Tracking (error tolerance 20 m)
- A restricted random motion: the object starts at (0, d) and moves from one node to another randomly chosen node until it walks out of the grid.
Energy Savings due to Adaptive Tracking
- total energy consumption over all sensor nodes for the random mobility model with varying ε_track (track error)
- significant energy savings using the adaptive precision protocol over non-adaptive tracking (the constant line in the graph)
- for a random model, prediction does not work well!
Energy Consumption with Distance from the BS
- total energy consumption over all sensor nodes for the random mobility model with varying base-station distance from the sensor grid
- as the base station moves away, one can expect energy consumption to increase, since transmission cost varies as d^n (n ≥ 2)
- the adaptive precision algorithm gives better results with increasing base-station distance
Challenges and Ongoing Work
- Ongoing work
- supporting a larger class of SQL queries
- supporting continuous monitoring queries
- a larger class of sensors (e.g., video sensors)
- better approaches to model fitting/switching in prediction
- In the future
- distributed Quasar architecture
- optimizing quality given resource constraints
- supporting applications with real-time constraints
- dealing with failures
The DAS Project
- Goals
- support Database as a Service on the Internet
- Collaboration
- IBM (Dr. Bala Iyer)
- UCI (Gene Tsudik)
Supported in part by an NSF ITR grant entitled "Privacy in Database as a Service" and by the IBM Corporation
Software as a Service
- Get
- what you need
- when you need it
- Pay for
- what you use
- Don't worry about
- how to deploy, implement, maintain, or upgrade
Software as a Service: Why?
- Driving forces
- faster, cheaper, more accessible networks
- virtualization in server and storage technologies
- established e-business infrastructures
- Already in the market
- ERP and CRM (many examples)
- more horizontal storage services, disaster-recovery services, e-mail services, rent-a-spreadsheet services, etc.
- Sun ONE, Oracle Online Services, Microsoft .NET My Services, etc.
- Advantages
- reduced cost to the client
- pay for what you use, not for the hardware, software infrastructure, or personnel to deploy, maintain, and upgrade
- reduced overall cost
- cost amortization across users
- better service
- leveraging experts across organizations
Better service, for cheaper
Database as a Service
- Why?
- most organizations need DBMSs
- DBMSs are extremely complex to deploy, set up, and maintain
- they require skilled DBAs, at high cost
What Do We Want to Do?
(Figure: client connecting to a server at an application service provider)
- Database as a Service (DAS) model
- DB management is transferred to the service provider for
- backup, administration, restoration, space management, upgrades, etc.
- use the database as a service provided by an ASP
- use the SW, HW, and human resources of the ASP instead of your own
BUT...
Challenges
- Economic/business model?
- how to charge for the service, what kinds of service guarantees can be offered, costing of guarantees, liability of the service provider
- Powerful interfaces to support a complete application development environment
- user interface for SQL, support for embedded SQL programming, support for user-defined interfaces, etc.
- Scalability in the web environment
- overheads due to network latency (data proxies?)
- Privacy and security
- protecting data at the service provider from intruders and attacks
- protecting clients from misuse of data by the service provider
- ensuring result integrity
Data Privacy from the Service Provider
(Figure: user data stored as an encrypted database at an untrusted application service provider's server site)
- The problem: we do not trust the service provider with sensitive information!
- Fact 1: theft of intellectual property due to database vulnerabilities costs American businesses $103 billion annually
- Fact 2: 45% of those attacks are conducted by insiders! (CSI/FBI Computer Crime and Security Survey, 2001)
- encrypt the data and store it
- but still be able to run queries over the encrypted data
- do most of the work at the server
System Architecture
(Figure: the client site translates the original query into a server-side query over encrypted data; the service provider returns encrypted results, which the client decrypts and filters into the actual results)
NetDB2 Service
- Developed in collaboration with IBM
- Deployed on the Internet about 2 years ago
- Used by 15 universities and more than 2500 students to help teach database classes
- Currently offered through the IBM Scholars Program
MARS Project
- Goals: integration of similarity retrieval and query refinement over structured and semi-structured databases to help interactive data analysis/mining
Supported in part by an NSF CAREER award, an NSF grant entitled "Learning digital behavior," and a KDD grant entitled "Mining events and entities over large spatio-temporal data sets"
Similarity Search in Databases (SR)
Example: Alice searches for "Honda sedan, inexpensive, after 1994, around LA"
Query Refinement (QR)
(Figure: refined results)
Why are SR and QR Important?
- Most queries are similarity searches
- especially in exploratory data analysis tasks (e.g., catalog search)
- users have only a partial idea of their information need
- Existing search technologies (text retrieval, SQL) do not provide appropriate support for SR and (almost) no support for QR
- users must artificially convert similarity queries into keyword searches or exact-match queries
- good mappings are difficult or not feasible
- lack of good knowledge of the underlying data or its structure
- exact match may be meaningless for certain data types (e.g., images, text, multimedia)
Similarity Access and Interactive Mining Architecture
(Figure: a search client issues queries/feedback and receives ranked results via a query session manager; a similarity query processor (similarity operators, ranking rules, scores table) runs over an ORDBMS; a refinement manager applies history-based and feedback-based refinement methods (feedback table, answer table); a query log manager/miner maintains the query log. Dashed arrows denote logging, solid arrows denote processing.)
MARS Challenges
- Learning queries from
- user interactions
- user profiles
- the past history of other users
- Efficient implementation of
- similarity queries
- refined queries
- Role of similarity queries in
- OLAP
- interactive data mining
Similarity Query Processor
- Query-Session Manager: parses the query; checks query validity; generates the schema for support tables; maintains the session registry
- Similarity Query Processor: executes the query on the ORDBMS; ranks results (e.g., can exclude already-seen tuples); logs the query (query or top-k)
- Refinement Manager: maintains a registry of query refinement policies (content/collaborative); generates the scores table; identifies and invokes intra-predicate refiners
- Query Log Manager/Miner: maintains the query log (initial-final pair, top-k results, complete trajectory); computes query-query similarity (can have multiple policies); performs query clustering
Text Search Technologies (AltaVista, Verity, Vality, Infoseek)
- Approach: convert enterprise structured data into a searchable text index.
- Strengths: supports ranked retrieval; can handle missing data, synonyms, and data-entry errors
- Limitations: cannot capture the semantics of relationships in the data; cannot capture the semantics of non-text fields (e.g., multimedia); limited support for refinement or preferences in current systems; cannot express similarity queries over structured or semi-structured data (e.g., price, location)
(Figure: Movies, Actors, and Directors tables; example query "Al Pacino acted in a movie directed by Francis Ford Coppola")
SQL-based Search Technologies (Oracle, Informix, DB2, Mercado)
- Approach: translate the similarity query into an exact SQL query.
- Strengths: supports structured as well as semi-structured data; support for arbitrary data types; scalable attribute-based lookup
- Limitations: translation is difficult or not possible; it is difficult to guess the right ranges, causing near misses; not feasible for non-numeric fields; cannot rank answers based on relevance; does not account for user preference or query refinement
Example:
select * from user_car_catalog
where model = 'Honda Accord' and year between 1993 and 1995
and dist(90210) < 50 and price < 5000