Quality Aware Sensor Database (QUASAR) Project: Transcript and Presenter's Notes
1
Quality Aware Sensor Database (QUASAR) Project
  • Sharad Mehrotra
  • Department of Information and Computer Science
  • University of California, Irvine

Supported in part by a collaborative NSF ITR grant entitled "Real-time data
capture, analysis, and querying of dynamic spatio-temporal events," in
collaboration with UCLA, U. Maryland, and U. Chicago
2
Talk Outline
  • Quasar Project
  • motivation and background
  • data collection and archival components
  • query processing
  • tracking application using QUASAR framework
  • challenges and ongoing work
  • Brief overview of other research projects
  • MARS Project - incorporating similarity retrieval
    and refinement over structured and
    semi-structured data to aid interactive data
    analysis/mining
  • Database as a Service (DAS) Project - supporting
    the application service provider model for data
    management

3
Emerging Computing Infrastructure
In-body, in-cell, in-vitro spaces
  • Generational advances to computing infrastructure
  • sensors will be everywhere
  • Emerging applications with limitless
    possibilities
  • real-time monitoring and control, analysis
  • New challenges
  • limited bandwidth and energy
  • highly dynamic systems
  • System architectures are due for an overhaul
  • at all levels of the system: OS, middleware, databases, applications

Instrumented wide-area spaces
4
Impact to Data Management
  • Traditional data management
  • client-server architecture
  • efficient approaches to data storage and querying
  • query shipping versus data shipping
  • data changes with explicit update
  • Emerging Challenge
  • data producers must be considered as first
    class entities
  • sensors generate continuously changing, highly dynamic data
  • sensors may store, process, and communicate data

5
Data Management Architecture Issues
[Figure: producers with local caches exchange data/query requests and results with the server]
  • Where to store data?
  • Do not store -- stream model
  • not suitable if we wish to archive data for
    future analysis or if data is too important to
    lose
  • at the producers
  • limited storage, network, compute resources
  • at the servers
  • server may not be able to cope with high data
    production rates. May lead to data staleness
    and/or wasted resources
  • Where to compute?
  • At the client, server, data producers

6
Quasar Architecture
  • Hierarchical architecture
  • data flows from producers to server to clients
    periodically
  • queries flow the other way
  • If the client cache does not suffice, then
  • the query is routed to the appropriate server
  • If the server cache does not suffice, then access current data at the producer
  • This is a logical architecture -- producers could also be clients.

[Figure: queries flow from the client cache to the server cache/archive and on to the producer cache; data flows in the opposite direction]
7
Quasar Observations and Approach
  • Applications can tolerate errors in sensor data
  • applications may not require exact answers
  • small errors in location during tracking or
    error in answer to query result may be OK
  • data cannot be precise due to measurement errors,
    transmission delays, etc.
  • Communication is the dominant cost
  • limited wireless bandwidth, source of major
    energy drain
  • Quasar Approach
  • exploit application error tolerance to reduce
    communication between producer and server
  • Two approaches
  • Minimize resource usage given quality constraints
  • Maximize quality given resource constraints

8
Quality-based Data Collection Problem
Sensor time series: pn, pn-1, …, p1
  • Let P = <p1, p2, …, pn> be a sequence of environmental measurements (a time series) generated by the producer, where n = now
  • Let S = <s1, s2, …, sn> be the server-side representation of the sequence
  • A within-ε quality data collection protocol guarantees that
  • for all i, error(pi, si) < ε
  • ε is derived from the application quality tolerance

9
Simple Data Collection Protocol
Sensor time series: pn, pn-1, …, p1
  • Sensor logic (at time step n):
  • Let p = last value sent to the server
  • if error(pn, p) > ε
  • send pn to the server
  • Server logic (at time step n):
  • If a new update pn was received at step n
  • sn = pn
  • Else
  • sn = last update sent by the sensor
  • guarantees that the maximum error at the server is less than or equal to ε (a code sketch follows below)

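A minimal Python sketch of this simple collection protocol, assuming error(a, b) = |a - b| and an abstract send() callback; the class and variable names are illustrative, not from the slides.

EPSILON = 0.5   # application-derived quality tolerance (example value)

class Sensor:
    def __init__(self, epsilon, send):
        self.epsilon = epsilon
        self.send = send            # delivers a value to the server
        self.last_sent = None

    def observe(self, p_n):
        # Transmit only when the deviation from the last reported value exceeds epsilon.
        if self.last_sent is None or abs(p_n - self.last_sent) > self.epsilon:
            self.last_sent = p_n
            self.send(p_n)

class Server:
    def __init__(self):
        self.s_n = None             # server-side representation s_n

    def receive(self, p_n):
        self.s_n = p_n              # new update received: adopt it

    def current(self):
        return self.s_n             # otherwise keep the last update (error <= epsilon)

# Usage: wire the two ends together and replay a few readings.
server = Server()
sensor = Sensor(EPSILON, server.receive)
for reading in (10.0, 10.2, 10.4, 11.3, 11.4):
    sensor.observe(reading)
    print(reading, server.current())
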
10
Exploiting Prediction Models
  • Producer and server agree upon a prediction model (M, θ)
  • Let spred_i be the predicted value at time i based on (M, θ)
  • Sensor logic (at time step n):
  • if error(pn, spred_n) > ε
  • send pn to the server
  • Server logic (at time step n):
  • If a new update pn was received at step n
  • sn = pn
  • Else
  • sn = spred_n based on the model (M, θ) (a sketch of this variant follows below)

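A sketch of the prediction-based variant; a constant-velocity predictor stands in for the agreed model (M, θ) purely as an illustration, and the message format is an assumption.

class ConstantVelocityModel:
    # Illustrative (M, theta): predicts value + rate * (n - last_update_step).
    def __init__(self, value=0.0, rate=0.0, step=0):
        self.value, self.rate, self.step = value, rate, step

    def predict(self, n):
        return self.value + self.rate * (n - self.step)

class PredictiveSensor:
    def __init__(self, epsilon, send):
        self.epsilon, self.send = epsilon, send
        self.model = ConstantVelocityModel()
        self.prev = None

    def observe(self, n, p_n):
        # Transmit only when the shared prediction drifts by more than epsilon.
        if abs(p_n - self.model.predict(n)) > self.epsilon:
            rate = 0.0 if self.prev is None else p_n - self.prev
            self.model = ConstantVelocityModel(p_n, rate, n)
            self.send(n, p_n, rate)          # ship the refreshed parameters
        self.prev = p_n

class PredictiveServer:
    def __init__(self):
        self.model = ConstantVelocityModel()

    def receive(self, n, value, rate):
        self.model = ConstantVelocityModel(value, rate, n)

    def estimate(self, n):
        # Between updates, s_n = spred_n, which stays within epsilon of the true reading.
        return self.model.predict(n)
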
11
Challenges in Prediction
  • Simple versus complex models?
  • Complex and more accurate models require more
    parameters (that will need to be transmitted).
  • The goal is to minimize communication, not necessarily to produce the best prediction
  • How is a model M generated?
  • static -- one out of a fixed set of models
  • dynamic -- dynamically learn a model from data
  • When should the model M or the parameters θ be changed?
  • immediately on model violation
  • too aggressive -- a violation may be a temporary phenomenon
  • never changed
  • too conservative -- data rarely follows a single
    model

12
Challenges in Prediction (cont.)
  • who does the model update?
  • Server
  • Long-haul prediction models possible, since
    server maintains history
  • might not predict recent behavior well, since the server does not know the exact sensor sequence (it has only samples)
  • extra communication to inform the producer
  • Producer
  • better knowledge of recent history
  • long haul models not feasible since producer does
    not have history
  • producers share computation load
  • Both
  • server looks for new models, sensor performs
    parameter fitting given existing models.

13
Archiving Sensor Data
  • Often sensor-based applications are built with
    only the real-time utility of time series data.
  • Values at time instants << n are discarded.
  • Archiving such data consists of maintaining the
    entire S sequence, or an approximation thereof.
  • Importance of archiving
  • Discovering large-scale patterns
  • Once-only phenomena, e.g., earthquakes
  • Discovering events detected post facto by
    rewinding the time series
  • Future uses of the data that may not be known while it is being collected

14
Problem Formulation
  • Let P = <p1, p2, …, pn> be the sensor time series
  • Let S = <s1, s2, …, sn> be the server-side representation
  • A within-ε_archive quality data archival protocol guarantees that
  • error(pi, si) < ε_archive
  • Trivial solution: modify the collection protocol to collect data with a quality guarantee of min(ε_archive, ε_collect)
  • then the prediction model by itself provides a within-ε_archive quality data stream that can be archived.
  • Better solutions possible since
  • archived data not needed for immediate access by
    real-time or forecasting applications (such as
    monitoring, tracking)
  • compression can be used to reduce data transfer

15
Data Archival Protocol
[Figure: the sensor memory buffer holds pn, pn-1, …; the sensor sends updates for data collection and a compressed representation for archiving. Processing at the sensor is exploited to reduce communication cost and hence battery drain.]
  • The sensor compresses the observed time series p_1..n and sends a lossy compression to the server
  • At time n:
  • p_1..n-nlag is at the server in compressed form s_1..n-nlag (within-ε_archive)
  • s_n-nlag+1..n is estimated via the predictive model (M, θ)
  • the collection protocol guarantees that this remains within-ε_collect
  • s_n+1.. can be predicted, but its quality is not guaranteed (because it lies in the future and the sensor has not yet observed these values)

16
Piecewise Constant Approximation (PCA)
  • Given a time series S_n = s_1..n, a piecewise constant approximation of it is a sequence
  • PCA(S_n) = <(c_i, e_i)>
  • that allows us to estimate s_j as
  • ŝ_j = c_i if j in [e_{i-1}+1, e_i]
  •       c_1 if j ≤ e_1

[Figure: a time series (value vs. time) approximated by constant segments c1..c4 with segment endpoints e1..e4]
17
Online Compression using PCA
  • Goal: given a stream of sensor values, generate a within-ε_archive PCA representation of the time series
  • Approach (PMC-Midrange):
  • Maintain m, M as the minimum/maximum values of the samples observed since the last segment
  • On processing pn, update m and M if needed
  • if M - m > 2ε_archive, output a segment ((m+M)/2, n) (a code sketch follows below)

[Figure: example segmentation with ε_archive = 1.5]
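A sketch of PMC-Midrange as described above; here each segment is closed at the last sample that still satisfies the 2·ε_archive bound, and segments are returned as (c_i, e_i) pairs.

def pmc_mr(samples, eps_archive):
    # Piecewise-constant midrange compression with O(1) working state (m, M).
    segments = []
    m = M = None                      # running min / max since the last segment
    for i, p in enumerate(samples, start=1):
        lo = p if m is None else min(m, p)
        hi = p if M is None else max(M, p)
        if hi - lo > 2 * eps_archive:
            # Close the previous segment at sample i-1 with its midrange value.
            segments.append(((m + M) / 2, i - 1))
            m, M = p, p
        else:
            m, M = lo, hi
    if m is not None:
        segments.append(((m + M) / 2, len(samples)))
    return segments

# Example with eps_archive = 1.5, as in the figure above.
print(pmc_mr([1.0, 2.0, 4.5, 5.0, 1.0], 1.5))   # [(1.5, 2), (4.75, 4), (1.0, 5)]
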
18
Online Compression using PCA
  • PMC-MR
  • guarantees that each segment compresses the corresponding time series segment to within-ε_archive
  • requires O(1) storage
  • is instance optimal
  • no other PCA representation with fewer segments can meet the within-ε_archive constraint
  • Variant of PMC-MR:
  • PMC-MEAN, which takes the mean of the samples seen so far instead of the midrange.

19
Improving PMC using Prediction
  • Observation: prediction models guarantee a within-ε_collect version of the time series at the server even before the compressed time series arrives from the producer.
  • Can the prediction model be exploited to reduce the overhead of compression?
  • If ε_archive > ε_collect, no additional effort is required for archival --> simply archive the predicted model.
  • Approach (sketched below):
  • Define an error time series E_i = pi - spred_i
  • Compress E_1..n to within-ε_archive instead of compressing p_1..n
  • The archive contains the prediction parameters and the compressed error time series
  • The within-ε_archive version of E plus (M, θ) can be used to reconstruct a within-ε_archive version of P

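Combining the two is then a small step: archive the compressed error series rather than the raw series. The sketch below reuses pmc_mr() from the earlier sketch; the predictions list stands in for the shared spred values.

def archive_errors(samples, predictions, eps_archive):
    # E_i = p_i - spred_i, compressed to within eps_archive.
    errors = [p - s_pred for p, s_pred in zip(samples, predictions)]
    return pmc_mr(errors, eps_archive)

def reconstruct(segments, predictions):
    # Predictions + compressed errors give back a within-eps_archive version of P.
    rebuilt, j = [], 0
    for i, s_pred in enumerate(predictions, start=1):
        while i > segments[j][1]:
            j += 1
        rebuilt.append(s_pred + segments[j][0])
    return rebuilt
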
20
Combining Compression and Prediction (Example)
21
Estimating Time Series Values
  • Historical samples (before n - nlag) are maintained at the server within-ε_archive
  • Recent samples (between n - nlag + 1 and n) are maintained by the sensor and predicted at the server
  • If an application requires ε_q precision, then (see the decision sketch below)
  • if ε_q ≥ ε_collect, it must wait for a bounded time in case a parameter refresh is en route
  • if ε_q ≥ ε_archive but ε_q < ε_collect, it may probe the sensor or wait for a compressed segment
  • otherwise, only probing meets the precision
  • For future samples (after n), immediate probing is not available as an option

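The decision logic above can be summarized in a few lines; the function name and the exact waiting rule are illustrative only.

def choose_source(eps_q, eps_collect, eps_archive, sample_time, now, n_lag):
    # Decide how to answer a request for the value at sample_time with precision eps_q.
    if sample_time > now:
        return "predict only (future value; probing not available)"
    if eps_q >= eps_collect:
        return "use the predicted value (wait for any in-flight parameter refresh)"
    if eps_q >= eps_archive:
        if sample_time <= now - n_lag:
            return "read the compressed archive"
        return "probe the sensor or wait for the compressed segment"
    return "probe the sensor (only probing meets this precision)"
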
22
Experiments
  • Data sets
  • Synthetic Random-Walk
  • x1 = 0 and xi = xi-1 + s, where s is drawn uniformly from [-1, 1] (a generator sketch follows below)
  • Oceanographic Buoy Data
  • Environmental attributes (temperature, salinity,
    wind-speed, etc.) sampled at 10min intervals from
    a buoy in the Pacific Ocean (Tropical Atmosphere
    Ocean Project, Pacific Marine Environment
    Laboratory)
  • GPS data collected using IPAQs
  • Experiments to test
  • Compression Performance of PMC
  • Benefits of Model Selection
  • Query Accuracy over Compressed Data
  • Benefits of Prediction/Compression Combination

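The synthetic random walk is straightforward to reproduce; a sketch (seed and length are arbitrary).

import random

def random_walk(n, seed=0):
    # x_1 = 0, x_i = x_{i-1} + s with s drawn uniformly from [-1, 1].
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(n - 1):
        xs.append(xs[-1] + rng.uniform(-1.0, 1.0))
    return xs

print(random_walk(5))
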
23
Compression Performance
K/n ratio = number of segments / number of samples
24
Query Performance Over Compressed Data
How many sensors have values > v? (Mean selectivity 50%)
25
Impact of Model Selection
  • Objects moved at approximately constant speed (plus measurement noise)
  • Three models used:
  • loc_n = c
  • loc_n = c + v·t
  • loc_n = c + v·t + 0.5·a·t²
  • Parameters v, a were estimated at the sensor over a moving window of 5 samples (a fitting sketch follows below)

K/n ratio = number of segments / number of samples.
ε_pred is the localization tolerance in meters
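A sketch of how the three candidate models might be fitted over the 5-sample moving window; the finite-difference estimates of v and a are assumptions, not the slides' exact method.

def fit_models(window, dt=1.0):
    # window: the last 5 location samples; t below is measured from the newest sample.
    c = window[-1]
    v = (window[-1] - window[0]) / ((len(window) - 1) * dt)     # average velocity
    v_first = (window[1] - window[0]) / dt
    v_last = (window[-1] - window[-2]) / dt
    a = (v_last - v_first) / ((len(window) - 2) * dt)           # average acceleration
    return {
        "loc = c":                    lambda t: c,
        "loc = c + v*t":              lambda t: c + v * t,
        "loc = c + v*t + 0.5*a*t^2":  lambda t: c + v * t + 0.5 * a * t * t,
    }

models = fit_models([0.0, 1.1, 1.9, 3.2, 4.0])
print({name: round(m(2), 2) for name, m in models.items()})   # predictions 2 steps ahead
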
26
Combining Prediction with Compression
K/n ratio = number of segments / number of samples
27
GPS Mobility Data from Mobile Clients (iPAQs)
QUASAR client time series
Latitude time series: 1800 samples
Compressed time series (PMC-MR, ICDE 2003): accuracy of 100 m, 130 segments
28
Query Processing in Quasar
  • Problem Definition
  • Given
  • sensor time series with quality-guarantees
    captured at the server
  • A query with a specified quality-tolerance
  • Return
  • query results incurring least cost
  • Techniques depend upon
  • nature of queries
  • Cost measures
  • resource consumption -- energy, communication,
    I/O
  • query response time

29
Aggregate Queries
min(Q) = 2, max(Q) = 7, count(Q) = 3, sum(Q) = 2 + 7 + 6 = 15, avg(Q) = 15/3 = 5
[Figure: query region Q over the sensor set S covering the values 2, 7, and 6]
30
Processing Aggregate Queries (minimizing producer probes)
Let S = <s1, s2, …, sn> be the set of sensors that meet the query criteria, where si.high = spred_i(t) + ε_pred_i and si.low = spred_i(t) - ε_pred_i
  • MIN query:
  • c = min_j(sj.high)
  • b = c - ε_query
  • Probe all sensors where sj.low < b
  • only s1 and s3 will be probed
  • SUM query (a probe-selection sketch follows below):
  • select a minimal subset S' ⊆ S such that
  • Σ_{si in S'} ε_pred_i ≥ Σ_{si in S} ε_pred_i - ε_query
  • If ε_query = 15, only s1 will be probed

[Figure: uncertainty intervals of sensors s1..s5 against thresholds a, b, c for the MIN query; per-sensor tolerances in the SUM example: s1 = 10, s2 = 2, s3 = 5, s4 = 3, s5 = 5]
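A sketch of the two probe-selection rules; each sensor is summarized by its predicted value and tolerance, the dict containers are assumptions, and the tolerances in the SUM example are read off the reconstructed figure above.

def min_query_probes(bounds, eps_query):
    # bounds: sensor_id -> (low, high) = (spred - eps_pred, spred + eps_pred)
    c = min(high for _, high in bounds.values())
    b = c - eps_query
    # A sensor can change the answer by more than eps_query only if its interval dips below b.
    return [sid for sid, (low, _) in bounds.items() if low < b]

def sum_query_probes(eps_pred, eps_query):
    # Probe a minimal-count subset whose tolerances cover the excess uncertainty.
    excess = sum(eps_pred.values()) - eps_query
    probes, covered = [], 0.0
    for sid, eps in sorted(eps_pred.items(), key=lambda kv: -kv[1]):
        if covered >= excess:
            break
        probes.append(sid)
        covered += eps
    return probes

# SUM example: tolerances s1=10, s2=2, s3=5, s4=3, s5=5 with eps_query = 15.
print(sum_query_probes({"s1": 10, "s2": 2, "s3": 5, "s4": 3, "s5": 5}, 15))   # ['s1']
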
31
Minimizing Cost at Server
  • Error tolerance of queries can be exploited to
    reduce processing at server.
  • Key Idea
  • Use a multi-resolution index structure (MRA-tree)
    for processing aggregate queries at server.
  • An MRA-tree is a modified multi-dimensional index tree (R-tree, quadtree, hybrid tree, etc.)
  • A non-leaf node contains (for each of its subtrees) four aggregates: MIN, MAX, COUNT, SUM
  • A leaf node contains the actual data points
    (sensor models)

32
MRA Tree Data Structure
[Figure: spatial view and tree-structure view of an MRA-tree over regions A..G]
33
MRA-Tree Node Structure
[Figure: non-leaf and leaf node structure; probe pointers each cost 2 messages, disk page pointers each cost 1 I/O]
34
Node Classification
  • Two sets of nodes
  • NP (partial contribution to the query)
  • NC (complete contribution)

35
Aggregate Queries using MRA Tree
  • Initialize NP with the root
  • At each iteration, remove one node N from NP and, for each child Nchild of N:
  • discard it if Nchild is disjoint from Q
  • insert it into NP if Q is contained in or partially overlaps Nchild
  • insert it into NC if Q contains Nchild (we only need to maintain agg(NC))
  • compute the best estimate based on the contents of NP and NC (a traversal sketch follows below)

[Figure: node N partially overlapping query region Q]
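A 1-D sketch of one refinement step; the Node fields, the omitted greedy priority choice, and the restriction to the MIN aggregate are simplifications. The lower/upper bounds mirror the interval computation on the next slide.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    lo: float                        # 1-D extent of the node's region
    hi: float
    min_agg: float                   # MIN aggregate over the subtree (one of the four)
    children: List["Node"] = field(default_factory=list)

def overlap(node, q_lo, q_hi):
    if node.hi < q_lo or node.lo > q_hi:
        return "disjoint"
    if q_lo <= node.lo and node.hi <= q_hi:
        return "contained"           # Q fully contains the node
    return "partial"

def expand_once(NP, NC, q_lo, q_hi):
    node = NP.pop()                  # greedy cost/benefit choice omitted for brevity
    for child in node.children:
        r = overlap(child, q_lo, q_hi)
        if r == "contained":
            NC.append(child)         # only its aggregate is needed from here on
        elif r == "partial":
            NP.append(child)
    # Best MIN estimate: lower bound uses both sets, upper bound uses NC only.
    lower = min([n.min_agg for n in NP + NC], default=float("inf"))
    upper = min([n.min_agg for n in NC], default=float("inf"))
    return lower, upper
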
36
MIN (and MAX)
Interval: min(NC) = min{4, 5} = 4; min(NP) = min{3, 9} = 3
L = min{min(NC), min(NP)} = 3; H = min(NC) = 4; hence I = [3, 4]
Estimate (lower bound): E(min(Q)) = L = 3
[Figure: query region overlapping an NP node with values 3 and 9 and NC nodes with values 4 and 5]
37
MRA Tree Traversal
  • Progressive answer refinement until NP is
    exhausted
  • Greedy priority-based local decision for next
    node to be explored based on
  • Cost (1 I/O or 2 messages)
  • Benefit (Expected Reduction in answer
    uncertainty)

[Figure: traversal over MRA-tree nodes A..G]
38
Adaptive Tracking of mobile objects in sensor
networks
  • Tracking Architecture
  • A network of wireless acoustic sensors arranged
    as a grid transmitting via a base station to
    server
  • A track of the mobile object generated at the
    base station or server
  • Objective
  • Track a mobile object at the server such that the track deviates from the real trajectory by at most a user-defined error threshold ε_track, with minimum communication overhead.

39
Sensor Model
  • Wireless sensors: battery operated, energy constrained
  • Operate on the received acoustic waveforms
  • Signal attenuation of the target object is given by Is(t) = P / (4π r²)
  • P = source object power
  • r = distance of the object from the sensor
  • Is(t) = intensity reading at time t at the ith sensor
  • Ith = intensity threshold at the ith sensor

40
Sensor States
  • S0 Monitor (processor on, sensor on, radio off)
  • shift to S1 if the intensity is above the threshold
  • S1 Active (processor on, sensor on, radio on)
  • send intensity readings to the base station
  • on receiving a message from the BS containing the error tolerance, shift to S2
  • S2 Quasi-active (processor on, sensor on, radio intermittent)
  • send an intensity reading to the BS only if the error from the previous reading exceeds the error threshold
  • the Quasar collection approach is used in the quasi-active state (a state-machine sketch follows below)

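A sketch of the three-state sensor logic; the radio interface, the threshold handling, and the omitted transition back to monitoring are assumptions.

MONITOR, ACTIVE, QUASI_ACTIVE = "S0", "S1", "S2"

class TrackingSensor:
    def __init__(self, intensity_threshold, radio):
        self.state = MONITOR
        self.i_th = intensity_threshold
        self.radio = radio            # object with a send(value) method
        self.eps = None               # relative intensity tolerance from the server
        self.last_sent = None

    def on_sample(self, intensity):
        if self.state == MONITOR and intensity > self.i_th:
            self.state = ACTIVE       # radio on, start reporting
        if self.state == ACTIVE:
            self.last_sent = intensity
            self.radio.send(intensity)
        elif self.state == QUASI_ACTIVE:
            # Report only when the change since the last report exceeds the tolerance.
            if abs(intensity - self.last_sent) > self.eps:
                self.last_sent = intensity
                self.radio.send(intensity)

    def on_tolerance(self, eps):
        # Message from the base station carrying the error tolerance: S1 -> S2.
        self.state = QUASI_ACTIVE
        self.eps = eps
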
41
Server side protocol
  • The server maintains
  • a list of sensors in the active/quasi-active state
  • a history of their intensity readings over a period of time
  • Server-side protocol:
  • convert the track quality into a relative intensity error at the sensors
  • send the relative intensity error to a sensor in state S1 (moving it to the quasi-active state S2)
  • triangulate using n sensor readings at discrete time intervals

42
Basic Triangulation Algorithm (using 3 sensor
readings)
P = source object power, Ii = intensity reading at the ith sensor
(x - x1)² + (y - y1)² = P / (4π I1)
(x - x2)² + (y - y2)² = P / (4π I2)
(x - x3)² + (y - y3)² = P / (4π I3)
Solving, we get (x, y) = f(x1, x2, x3, y1, y2, y3, P, I1, I2, I3)
  • More complex approaches that amalgamate more than three sensor readings are possible
  • based on numerical methods -- they do not provide a closed-form equation between the sensor readings and the tracked location!
  • The server can use simple triangulation to convert track quality into sensor intensity quality tolerances, and a more complex approach to track (a triangulation sketch follows below).

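A sketch of the closed-form three-sensor solution: subtracting pairs of the circle equations above eliminates the quadratic terms and leaves a 2-by-2 linear system. Sensor positions and the test point below are made up.

import math

def triangulate(positions, intensities, power):
    (x1, y1), (x2, y2), (x3, y3) = positions
    d2 = [power / (4 * math.pi * i) for i in intensities]    # squared distances r_i^2
    # Subtract equation 1 from equations 2 and 3 to eliminate x^2 + y^2.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d2[0] - d2[1] + (x2 ** 2 - x1 ** 2) + (y2 ** 2 - y1 ** 2)
    b2 = d2[0] - d2[2] + (x3 ** 2 - x1 ** 2) + (y3 ** 2 - y1 ** 2)
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det

# Example: object at (3, 4), unit source power, sensors at three known positions.
pos = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
P = 1.0
readings = [P / (4 * math.pi * ((3.0 - x) ** 2 + (4.0 - y) ** 2)) for x, y in pos]
print(triangulate(pos, readings, P))    # approximately (3.0, 4.0)
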
43
Adaptive Tracking: mapping track quality to sensor reading tolerances
  • Claim 1 (power constant):
  • Let Ii be the intensity value of sensor i
  • If … then the track quality is guaranteed to be within ε_track
  • where … and C is a constant derived from the known locations of the sensors and the power of the object
  • Claim 2 (power varies between Pmin and Pmax):
  • If … then
  • the track quality is guaranteed to be within ε_track, where C' = C / P² is a constant
  • The above constraint is a conservative estimate. Better bounds are possible

[Figure: intensity time series I2 and I3 with tolerances ΔI2 and ΔI3 over the interval [t_i, t_{i+1}], and the resulting track within ε_track in X (m) vs. Y (m) coordinates]
44
Adaptive Tracking: prediction to improve performance
  • Communication overhead is further reduced by exploiting the predictability of the object being tracked
  • Static prediction: sensor and server agree on a set of prediction models
  • only 2 models used: stationary and constant velocity
  • Who predicts: a sensor-based mobility prediction protocol
  • every sensor by default follows the stationary model
  • based on its history of readings, a sensor may change to the constant-velocity model (the number of readings is limited by sensor memory size)
  • it informs the server of the model switch

45
Actual track versus the track produced by adaptive tracking (error tolerance = 20 m)
  • A restricted random motion: the object starts at (0, d) and moves from one node to another randomly chosen node until it walks out of the grid.

46
Energy Savings due to Adaptive Tracking
  • Total energy consumption over all sensor nodes for the random mobility model with varying ε_track (track error).
  • Significant energy savings using the adaptive precision protocol over non-adaptive tracking (the constant line in the graph).
  • For a random motion model, prediction does not work well!

47
Energy consumption with Distance from BS
  • Total energy consumption over all sensor nodes for the random mobility model with varying base-station distance from the sensor grid.
  • As the base station moves away, one can expect energy consumption to increase, since transmission cost varies as d^n (n ≥ 2).
  • The adaptive precision algorithm gives better results with increasing base-station distance.

48
Challenges and Ongoing Work
  • Ongoing Work
  • Supporting a larger class of SQL queries
  • Supporting continuous monitoring queries
  • Larger class of sensors (e.g., video sensors)
  • Better approaches to model fitting/switching in
    prediction
  • In the future
  • distributed Quasar architecture
  • optimizing quality given resource constraints
  • supporting applications with real-time
    constraints
  • dealing with failures

49
The DAS Project
  • Goals
  • Support Database as a Service on the Internet
  • Collaboration
  • IBM (Dr. Bala Iyer)
  • UCI (Gene Tsudik)

Supported in part by an NSF ITR grant entitled "Privacy in Database as a
Service" and by the IBM Corporation
50
Software as a Service
  • Get
  • what you need
  • when you need
  • Pay
  • what you use
  • Don't worry
  • how to deploy, implement, maintain, upgrade

51
Software As a Service Why?
  • Driving Forces
  • Faster, cheaper, more accessible networks
  • Virtualization in server and storage technologies
  • Established e-business infrastructures
  • Already in Market
  • ERP and CRM (many examples)
  • More horizontal: storage services, disaster recovery services, e-mail services, rent-a-spreadsheet services, etc.
  • Sun ONE, Oracle Online Services, Microsoft .NET My Services, etc.
  • Advantages
  • reduced cost to client
  • pay for what you use and not for hardware,
    software infrastructure or personnel to deploy,
    maintain, upgrade
  • reduced overall cost
  • cost amortization across users
  • Better service
  • leveraging experts across organizations

Better Service for Cheaper
52
Database As a Service
  • Why?
  • Most organizations need DBMSs
  • DBMSs are extremely complex to deploy, set up, and maintain
  • they require skilled DBAs at high cost

53
What do we want to do?
Server
Application Service Provider (ASP)
  • Database as a Service (DAS) Model
  • DB management transferred to service provider for
  • backup, administration, restoration, space
    management, upgrades etc.
  • use the database as a service provided by an
    ASP
  • use SW, HW, human resources of ASP, instead of
    your own

BUT.
54
Challenges
  • Economic/business model?
  • How to charge for service, what kind of service
    guarantees can be offered, costing of
    guarantees, liability of service provider.
  • Powerful interfaces to support complete
    application development environment
  • User Interface for SQL, support for embedded SQL
    programming, support for user defined interfaces,
    etc.
  • Scalability in the web environment
  • overheads due to network latency (data proxies?)
  • Privacy and Security
  • Protecting data at service providers from
    intruders and attacks.
  • Protecting clients from misuse of data by service
    providers
  • Ensuring result integrity

55
Data privacy from service provider
[Figure: user data is stored as an encrypted user database at the untrusted application service provider's server site]
  • The problem: we do not trust the service provider with sensitive information!
  • Fact 1: theft of intellectual property due to database vulnerabilities costs American businesses $103 billion annually
  • Fact 2: 45% of those attacks are conducted by insiders! (CSI/FBI Computer Crime and Security Survey, 2001)
  • encrypt the data and store it
  • but still be able to run queries over the encrypted data
  • do most of the work at the server

56
System Architecture
[Figure: system architecture. At the client site, the original query is translated into a client-side query and a server-side query; the service provider at the server site executes the server-side query over the encrypted data and returns encrypted results, which the client decrypts and filters to produce the actual results.]
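Purely as an illustration of this query-splitting idea (not necessarily the project's actual scheme): the client can store encrypted rows together with coarse labels that the server filters on, then decrypt and apply the exact predicate locally. The toy XOR "cipher" below only marks where real encryption would go.

import hashlib, json

BUCKET_WIDTH = 1000   # coarseness of the server-visible labels (illustrative)

def bucket(value):
    # Server-visible label revealing only which coarse range the value falls in.
    return hashlib.sha256(str(value // BUCKET_WIDTH).encode()).hexdigest()[:8]

def encrypt(row, key):   # placeholder; a real deployment would use a proper cipher
    return bytes(b ^ key for b in json.dumps(row).encode())

def decrypt(blob, key):
    return json.loads(bytes(b ^ key for b in blob).decode())

# Client side: upload encrypted rows plus bucket labels for the queryable field.
key = 0x5A
rows = [{"model": "Accord", "price": 4500}, {"model": "Civic", "price": 7200}]
store = [(bucket(r["price"]), encrypt(r, key)) for r in rows]     # held by the server

# Query "price < 5000": the client sends only the candidate bucket labels.
wanted = {bucket(p) for p in range(0, 5000, BUCKET_WIDTH)}
candidates = [blob for label, blob in store if label in wanted]   # server-side work
results = [r for r in (decrypt(c, key) for c in candidates) if r["price"] < 5000]
print(results)    # client-side post-filtering yields the exact answer
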
57
NetDB2 Service
  • Developed in collaboration with IBM
  • Deployed on the Internet about 2 years ago
  • Has been used by 15 universities and more than 2500 students to help teach database classes
  • Currently offered through IBM Scholars Program

58
MARS Project
  • Goals: integration of similarity retrieval and query refinement over structured and semi-structured databases to help interactive data analysis/mining

Supported in part by an NSF CAREER award, an NSF grant entitled "Learning
digital behavior," and a KDD grant entitled "Mining events and entities over
large spatio-temporal data sets"
59
Similarity Search in Databases (SR)
Alice: "Honda sedan, inexpensive, after 1994, around LA"
60
Query Refinement (QR)
Refined Results
61
Why are SR and QR important?
  • Most queries are similarity searches
  • Especially in exploratory data analysis tasks (e.g., catalog search)
  • Users have only a partial idea of their
    information need
  • Existing Search technologies (text retrieval,
    SQL) do not provide appropriate support for SR
    and (almost) no support for QR.
  • Users must artificially convert similarity
    queries to keyword-searches or exact-match
    queries
  • Good mappings difficult or not feasible
  • Lack of good knowledge of the underlying data or
    its structure
  • Exact-match may be meaningless for certain data
    types (e.g., images, text, multimedia)

62
Similarity Access and Interactive Mining
Architecture
[Figure: similarity access and interactive mining architecture. A search client issues queries/feedback and receives ranked results through a query session manager. The similarity query processor (similarity operators, types, ranking rules, scores table) runs over an ORDBMS/database; the refinement manager applies history-based and feedback-based refinement methods using a feedback table, schemes, and an answer table; the query log manager/miner maintains the query log. Dashed arrows denote logging; solid arrows denote processing.]
63
MARS Challenges...
  • Learning queries from
  • user interactions
  • user profiles
  • past history of other users
  • Efficient implementation of
  • similarity queries
  • refined queries
  • Role of similarity queries in
  • OLAP
  • interactive data mining

64
  • Similarity Query Processor
  • executes the query on the ORDBMS
  • ranks results (e.g., can exclude already-seen tuples, etc.)
  • logs the query (query or top-k)
  • Query-Session Manager
  • parses the query
  • checks query validity
  • generates schemas for the support tables
  • maintains the sessions registry
  • Refinement Manager
  • maintains a registry of query refinement policies (content/collaborative)
  • generates the scores table
  • identifies and invokes intra-predicate refiners
  • Query Log Manager/Miner
  • maintains the query log: initial-final pair, top-k results, complete trajectory
  • query-query similarity (can have multiple policies)
  • query clustering
65
Text Search Technologies (AltaVista, Verity, Vality, Infoseek)
Approach: convert enterprise structured data into a searchable text index.
Strengths: support ranked retrieval; can handle missing data, synonyms, data entry errors.
Limitations: cannot capture the semantics of relationships in the data; cannot capture the semantics of non-text fields (e.g., multimedia); limited support for refinement or preferences in current systems; cannot express similarity queries over structured or semi-structured data (e.g., price, location).
[Figure: Movies, Actors, and Directors tables illustrating the query "Al Pacino acted in a movie directed by Francis Ford Coppola"]
66
SQL-based Search Technologies (Oracle, Informix, DB2, Mercado)
Approach: translate the similarity query into an exact SQL query.
Limitations: translation is difficult or not possible; it is difficult to guess the right ranges, which causes near misses; not feasible for non-numeric fields; cannot rank answers based on relevance; does not account for user preference or query refinement.
Strengths: support structured as well as semi-structured data; support for arbitrary data types; scalable attribute-based lookup.
SELECT * FROM user_car_catalog
WHERE model = 'Honda Accord'
  AND year BETWEEN 1993 AND 1995
  AND dist(90210) < 50
  AND price < 5000