Title: Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data
- Reynold Cheng, Yuni Xia, Sunil Prabhakar,
- Rahul Shah and Jeffrey Scott Vitter
- Department of Computer Science
- Purdue University
2Sensor-based Applications
[Figure: sensors in the external environment (e.g., temperature, moving objects, hazardous materials) report over a network channel to a database system; the user sends queries and receives results.]
3Data Uncertainty
- Due to limited network bandwidth and battery power, readings are only sampled at discrete times
- The value of the entity being monitored (e.g., temperature, location) is changing
- The database stores old values (sampling uncertainty)
- Query results can be incorrect!
4Data Uncertainty and Query Incorrectness
- Which room's temperature is between 10°F and 25°F?
- Database answer: y
- Correct answer: x
[Figure: recorded temperatures (x0, y0) and current temperatures (x1, y1) of rooms x and y on a 0-30 °F axis.]
5Querying Bounded Values
- Both x and y may be answers
- Which one has the better chance?
- Measurement error is also a source of uncertainty
[Figure: recorded temperature and uncertainty for the current temperature of T1 and T2 on a 0-30 °F axis.]
6Imprecise Answers
- In general, sensor uncertainty does not allow us to get an exact answer
- The answer is imprecise rather than exact
- It is possible to attach confidence to answers, e.g., probability values
- A probabilistic query returns answers with probabilistic guarantees
7Probabilistic Queries
- Which room's temperature is between 10°F and 25°F?
- (T1, 10%), (T2, 80%)
[Figure: recorded temperature and uncertainty for the current temperature of T1 and T2 on a 0-30 °F axis.]
8Outline
- Modeling Sensor Uncertainty
- Probability Threshold Queries
- Probability Threshold Indexing
- Variance-based Clustering
- Experimental Results
9Sampling Uncertainty
- The value of the external entity is sampled at discrete times
- Can produce incorrect results
- Bounded by dead-reckoning update (Wolfson et al., 99)
10Dead-Reckoning Update
- Each sensor keeps track of the difference (d) between its current value and the value last sent
- Send an update to the database when d > deviation threshold
[Figure: difference d between the last value and the current value.]
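A minimal sketch of the update rule above, assuming a scalar reading. The class name `DeadReckoningSensor` and the sample readings are hypothetical and only illustrate when an update would be sent.

```python
from dataclasses import dataclass


@dataclass
class DeadReckoningSensor:
    """Hypothetical sensor that reports only when its value drifts too far."""
    threshold: float   # deviation threshold
    last_sent: float   # value most recently sent to the database

    def observe(self, current: float) -> bool:
        """Return True if this reading triggers an update to the database."""
        d = abs(current - self.last_sent)   # difference d from the slide
        if d > self.threshold:
            self.last_sent = current        # the database now holds this value
            return True
        return False


sensor = DeadReckoningSensor(threshold=2.0, last_sent=20.0)
readings = [20.5, 21.5, 23.0, 24.0, 26.0]
updates = [r for r in readings if sensor.observe(r)]
print(updates)
```

Between updates the database value can be off by at most the threshold, which is what bounds the sampling uncertainty.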
11Measurement Uncertainty
- Measurement error (Pfoser & Jensen, 99)
- Due to inherent imprecision in hardware, e.g., GPS
- Less serious than sampling error (Trajcevski et al., 02)
12Database Model
13Interval Uncertainty
[Figure: uncertainty interval U_i(t) = [l_i(t), u_i(t)] around the recorded value T_i.z(t).]
- Example: U_i(t) is the interval bounding all values within a distance of (t - t_update) × r of T_i.z
  - t_update: time at which T_i.z was last updated
  - r: current rate of change of T_i.z
- Example: dead-reckoning update (Wolfson et al., 99)
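The first example can be sketched as follows, assuming the interval is symmetric around the recorded value. The function name `uncertainty_interval` and the numbers are illustrative only.

```python
def uncertainty_interval(z, t_update, t, r):
    """Interval of values within (t - t_update) * r of the recorded value z.

    z        : recorded value T_i.z
    t_update : time of the last update
    t        : current (query) time
    r        : current rate of change of the value
    """
    if t < t_update:
        raise ValueError("query time precedes the last update")
    radius = (t - t_update) * r
    return (z - radius, z + radius)


# Recorded 20.0 at time 100; four time units later, with rate 0.5 per unit.
lo, hi = uncertainty_interval(z=20.0, t_update=100.0, t=104.0, r=0.5)
print(lo, hi)
```

The interval collapses to the recorded value at the moment of an update and widens linearly until the next one.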
14Probabilistic Uncertainty
[Figure: uncertainty pdf f_i(x) of T_i.z over the uncertainty interval [L_i, R_i].]
- Wolfson et al. (1999) proposed f_i(x) as a Gaussian distribution for a moving object on a route
- Deshpande et al. (2004) discussed the parametrization of Gaussian distributions in sensor networks
- Can be extended to n dimensions
15Outline
- Modeling Sensor Uncertainty
- Probability Threshold Queries
- Probability Threshold Indexing
- Variance-based Clustering
- Experimental Results
16Probabilistic Queries
- Which room's temperature is between 10°F and 25°F?
- (T1, 10%), (T2, 80%)
[Figure: recorded temperature and uncertainty for the current temperature of T1 and T2 on a 0-30 °F axis.]
17Probabilistic Queries
- Drawback: costly integration operations
- In practice, the user is only concerned with results that have sufficiently high probability values
  - e.g., return the ids of sensors with temperature over 30°F with probability ≥ 0.7
18Probability Threshold Queries (PTQ)
- INPUT: [a, b] and p, where a, b, p ∈ ℝ and 0 < p ≤ 1
- OUTPUT: every T_i whose probability p_i that T_i.z is inside [a, b] satisfies p_i ≥ p
- The actual value of p_i is not returned
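For concreteness, the probability p_i in the definition can be written out using the uncertainty pdf f_i(x) and the uncertainty interval [L_i, R_i] from the Probabilistic Uncertainty slide. The uniform case and the numbers in the worked example are illustrative assumptions, not values from the talk.

```latex
% Probability that T_i.z lies in [a, b]:
p_i = \int_{\max(a,\,L_i)}^{\min(b,\,R_i)} f_i(x)\,dx
\quad\text{(and } p_i = 0 \text{ when } [a,b] \cap [L_i,R_i] = \emptyset\text{)}

% Uniform pdf, f_i(x) = 1/(R_i - L_i) on [L_i, R_i], when the intervals overlap:
p_i = \frac{\min(b, R_i) - \max(a, L_i)}{R_i - L_i}

% Worked example: [L_i, R_i] = [0, 20], [a, b] = [10, 25], p = 0.7
p_i = \frac{20 - 10}{20 - 0} = 0.5 < 0.7 \quad\Rightarrow\quad T_i \text{ is not returned}
```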
19Interval Indexing
- Interval indexing handles containment, overlap and stabbing queries
- Manolopoulos et al. (2000) proposed an efficient interval tree for range queries
- Arge & Vitter (1996) and Kanellakis et al. (1996) mapped 1D interval queries to 2D queries
20Solving PTQ with Interval Indexes
- Use interval indexes to find intervals that overlap [a, b]
- For each object retrieved, evaluate its probability of being within [a, b]
- Return intervals with probability ≥ p
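The three steps can be sketched as a filter-and-refine loop. This is only an illustration under stated assumptions: a linear scan stands in for the interval index, every pdf is uniform, and the function names and sample intervals are hypothetical.

```python
def overlap_probability(L, R, a, b):
    """P(value in [a, b]) for a uniform pdf on [L, R], with L < R."""
    return max(0.0, min(b, R) - max(a, L)) / (R - L)


def ptq_filter_refine(intervals, a, b, p):
    """Return (ids satisfying the PTQ, number of candidates examined)."""
    # Step 1 (filter): a linear scan stands in for the interval index.
    candidates = [(i, L, R) for i, (L, R) in intervals.items()
                  if L <= b and R >= a]
    # Steps 2-3 (refine): compute each probability and apply the threshold.
    answers = [i for i, L, R in candidates
               if overlap_probability(L, R, a, b) >= p]
    return answers, len(candidates)


intervals = {
    "T1": (0.0, 20.0),    # overlaps the query, probability 0.5
    "T2": (12.0, 22.0),   # inside the query, probability 1.0
    "T3": (24.0, 64.0),   # barely overlaps, probability 0.025
    "T4": (40.0, 50.0),   # no overlap, filtered out in step 1
}
answers, examined = ptq_filter_refine(intervals, a=10.0, b=25.0, p=0.7)
print(answers, examined)
```

In this toy run three candidates are retrieved and evaluated, but only one meets the threshold.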
21The Problem of Interval Indexes
- Current interval indexes do not consider probabilities during search
- Many irrelevant objects (probability < p) may be retrieved from the interval index
22Our Solutions
- Probability Threshold Indexing (PTI): 1D interval R-tree with uncertainty
- Variance-based Clustering: transform intervals to 2D points and index based on variance
23Outline
- Modeling Sensor Uncertainty
- Probability Threshold Queries
- Probability Threshold Indexing
- Variance-based Clustering
- Experimental Results
24Pruning in a 1D R-Tree
- Some intervals in the MBR may satisfy Q
- Need to retrieve the contents of MBR
25x-bounds in a PTI Node
[Figure: left-0.2-bound and right-0.2-bound of a PTI node, with probability labels 0.2 and 0.8.]
26x-bounds in a PTI Node
[Figure: left-0-bound and right-0-bound of a PTI node.]
27Pruning with x-bounds
[Figure: an MBR with its left-0.2-bound and right-0.2-bound.]
- An MBR is not further retrieved if
  - Q does not cut the left and right x-bounds
  - p > x
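One way to read the pruning rule, sketched for uniform pdfs only. The assumption here is that an x-bound is a position such that every interval in the node has at most probability x on its far side. The helper names `x_bounds` and `can_prune` are hypothetical.

```python
def x_bounds(intervals, x):
    """Left- and right-x-bounds of a node holding uniform-pdf intervals.

    For a uniform pdf on [L, R], at most a fraction x of the probability lies
    left of L + x*(R - L), and at most x lies right of R - x*(R - L).
    """
    left = min(L + x * (R - L) for L, R in intervals)
    right = max(R - x * (R - L) for L, R in intervals)
    return left, right


def can_prune(intervals, x, a, b, p):
    """True if no interval in the node can reach probability p for [a, b]."""
    left, right = x_bounds(intervals, x)
    # The query stays entirely on the far side of one x-bound.
    query_outside = b <= left or a >= right
    # Each interval then has probability at most x in [a, b], so p > x rules it out.
    return query_outside and p > x


node = [(0.0, 10.0), (2.0, 12.0), (5.0, 25.0)]
print(x_bounds(node, 0.2))
print(can_prune(node, 0.2, a=0.0, b=1.5, p=0.5))   # query left of the left-0.2-bound
print(can_prune(node, 0.2, a=0.0, b=1.5, p=0.1))   # threshold too low to prune
print(can_prune(node, 0.2, a=0.0, b=6.0, p=0.5))   # query cuts the left-0.2-bound
```

The query [0, 1.5] overlaps the node's MBR, so a plain R-tree would descend into it, while the x-bound test rejects the node without reading its contents.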
28Implementation of PTI
29Drawback of PTI
- Extra overhead in storing x-bounds
- Does not distinguish between small and large intervals
[Figure: left-0.2-bound and right-0.2-bound of a node.]
30Outline
- Modeling Sensor Uncertainty
- Probability Threshold Queries
- Probability Threshold Indexing
- Variance-based Clustering
- Experimental Results
31Mapping intervals to 2D-space
- Each 1D interval [L_i, R_i] can be mapped to a point (x, y) in 2D space
  - L_i → x
  - R_i → y
- y ≥ x: mapped points lie above the line x = y
32The PTQ-Uniform Problem
[Figure: uncertainty intervals, each with a uniform pdf.]
332D View of PTQ-Uniform
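[Figure: 1D view (uniform pdf) and 2D view of query Q = [a, b] with p = 0.75; axes x = L_i and y = R_i, with the line x = y.]
Points (x, y) that satisfy the query:
- a < x < y < b: intervals in [a, b]
- y(1 - p) + xp ≥ a: intervals containing a
- x(1 - p) + yp ≤ b: intervals containing b
- b - a ≥ p(y - x): intervals containing [a, b]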
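As a sanity check, the linear constraints of the 2D view can be compared with a direct probability computation for a uniform pdf. Each constraint follows from requiring overlap / (y - x) ≥ p. The function names and test points below are illustrative assumptions.

```python
def qualifies_2d(x, y, a, b, p):
    """Test point (x, y) = (L_i, R_i), x < y, against the PTQ-Uniform region."""
    if y < a or x > b:               # no overlap with [a, b]
        return False
    if a <= x and y <= b:            # interval inside [a, b]: probability 1
        return True
    if x <= a and y >= b:            # interval contains [a, b]
        return b - a >= p * (y - x)
    if x <= a:                       # interval contains a but not b
        return y * (1 - p) + x * p >= a
    return x * (1 - p) + y * p <= b  # interval contains b but not a


def qualifies_direct(x, y, a, b, p):
    """Same test, computed from the overlap length of a uniform pdf on [x, y]."""
    overlap = max(0.0, min(b, y) - max(a, x))
    return overlap / (y - x) >= p


a, b, p = 10.0, 25.0, 0.75
points = [(12.0, 20.0), (0.0, 50.0), (5.0, 26.0), (9.0, 24.0),
          (0.0, 20.0), (20.0, 30.0), (22.0, 25.5), (30.0, 40.0)]
results = [qualifies_2d(x, y, a, b, p) for x, y in points]
print(results)
```

Because every boundary is a straight line in (x, y), the qualifying region is a polygon that a 2D index can be queried with.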
34Clustering of 2D points
- When 2D points are clustered, small and large intervals are separated
- Points in the same vicinity have similar means and variances
[Figure: 2D plot of points (L_i, R_i) relative to the line x = y, annotated with the mean and variance of [L_i, R_i], showing a cluster of large intervals and a cluster of smaller intervals.]
35Answering PTQ-Uniform with 2D R-Tree
- Construct a 2D R-tree over the 2D points
- Perform a trapezoidal range query over the 2D R-tree
- Since points with similar means and variances are clustered together, this is better than PTI
36Variance-based Clustering
- Can be extended to other pdfs
- Variance-based clustering is an uncertainty indexing technique based on a 2D R-tree
- Each item is indexed based on its mean and variance
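A small illustration of the indexing key, assuming a uniform pdf, for which the mean is (L + R)/2 and the variance is (R - L)²/12. The function name `mean_variance_uniform` is hypothetical.

```python
def mean_variance_uniform(L, R):
    """Mean and variance of a uniform pdf on [L, R]."""
    return (L + R) / 2.0, (R - L) ** 2 / 12.0


small = mean_variance_uniform(9.0, 11.0)    # short interval centred on 10
large = mean_variance_uniform(0.0, 20.0)    # long interval centred on 10
print(small, large)
```

Both intervals share the same mean, but their variances differ by a factor of 100, so a 2D index on (mean, variance) keeps them in different regions.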
37Variance-based Clustering
- For uniform and Gaussian distributions, range queries over 2D points can be constructed
- For arbitrary pdfs, a well-defined range query may be infeasible
- In those cases, place x-bounds in each 2D R-tree node for pruning
38Theoretical Results
- It is not possible to create a linear-space index that gives logarithmic query times for PTQs in the worst case
- For most cases, any space-partitioning data structure, e.g., a 2D R-tree, suffices
- A PTQ with a fixed threshold and uniform distribution can be answered in logarithmic time with a linear-space structure
39Outline
- Modeling Sensor Uncertainty
- Probability Threshold Queries
- Probability Threshold Indexing
- Variance-based Clustering
- Experimental Results
40Performance Comparison
- Compare number of I/Os between
- 1D R-tree on intervals only
- PTI (1D R-tree with probability thresholds)
- 2D variance-based clustering (called Extensive)
41Simulation Model
- 100K uncertain data items, with length uniformly distributed in [0, 10000] and a uniform uncertainty pdf
- 10K PTQs, with the length of [a, b] normally distributed and p ∈ [0.1, 1]
- Each PTI node contains five x-bounds, where x ∈ {0.1, 0.3, 0.5, 0.7, 0.9}
42Scalability of Indexes
- Both PTI and Extensive outperform the R-tree
- Answering PTQ with the R-tree requires more computation
- Extensive needs about 50% fewer I/Os than PTI
43Effect of Query Probability Threshold
- The R-tree does not benefit from an increasing value of p
- When p is 0.5, Extensive is 4 times better than PTI
44Other Important Results
- Classification of probabilistic queries based on answer type and operators
  - different evaluation algorithms
- Moving-object probabilistic nearest-neighbor queries
- Quality metrics of probabilistic results
- Heuristics for improving answer quality
45Future Work
- Study probabilistic threshold constraints for other queries, such as nearest neighbors and joins
- Study the indexing of other uncertain data types, e.g., fuzzy data and sets
- Study other kinds of constraints on probabilistic queries, e.g., answers with top-k probability values
46Conclusions
- Based on the pdf information of uncertain intervals, PTI places tighter bounds in 1D R-tree nodes.
- Variance-based clustering uses a 2D R-tree to avoid placing intervals of extreme sizes together.
- The concept of these indexes can be extended to multiple dimensions.
- Contact Reynold Cheng (ckcheng_at_cs.purdue.edu) for details
47References
- [AEM92] P. K. Agarwal, D. Eppstein, and J. Matoušek. Dynamic half-space reporting, geometric optimization, and minimum spanning trees. In FOCS, pages 80-89, 1992.
- [AV96] L. Arge and J. S. Vitter. On dynamic interval management in external memory (extended abstract). In FOCS, pages 560-569, 1996.
- [CP03] R. Cheng and S. Prabhakar. Managing uncertainty in sensor databases. In SIGMOD Record, Dec 2003.
- [CKP03] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proc. of the ACM SIGMOD, 2003.
48References
- [KRVV96] P. C. Kanellakis, S. Ramaswamy, D. Vengroff, and J. S. Vitter. Indexing for data models with constraints and classes. In J. Comp. Syst. Sci., 52(3):589-612, 1996.
- [MTT00] Y. Manolopoulos, Y. Theodoridis, and V. J. Tsotras. Chapter 4: Access methods for intervals. In Advanced Database Indexing, Kluwer, 2000.
- [WSCY99] O. Wolfson, P. Sistla, S. Chamberlain, and Y. Yesha. Updating and querying databases that track mobile units. Distributed and Parallel Databases, 7(3), 1999.
- [DGMHH04] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-Driven Data Acquisition in Sensor Networks. In VLDB, 2004.
49Related Work Interval Indexing
- [AV96, KRVV96] discuss the idea of mapping intervals to points in 2D space. The transformation of 1D stabbing queries and range queries to two-sided orthogonal queries in 2D space is also presented.
- [MTT00] proposes an efficient interval tree to facilitate the execution of intersection queries over intervals.
- [CKP04] proposes an indexing scheme for the constantly-growing uncertainty of moving objects.
50Related Work 2D Range Queries
- A comprehensive survey on geometric range searching can be found in [M94].
- For half-space queries,
  - [F81] discusses lower bounds.
  - [AEM92] presents optimal structures.
- For simplex queries,
  - [C89] derives the lower bounds.
  - [GRUY97] modifies the R-tree to answer simplex queries.
51Related Work Uncertainty Indexing
- Few works have addressed the indexing of uncertain data that involves probability computation.
- [CKP04] proposes an indexing scheme for the constantly-growing uncertainty of moving objects.
- [LMPS03] discusses an extension of the TPR-tree to index trajectories of moving objects, where each point in the trajectory has a rectangular uncertainty bound.
52Related Work Probabilistic Queries
- [CKP03] proposes an uncertainty model for constantly-evolving data. It also presents classification, evaluation and quality issues of different types of probabilistic queries.
- For moving object uncertainty,
  - [WSCY99] study probabilistic range queries.
  - [CKP04] study probabilistic nearest neighbor queries.
- [CP03] proposes computation strategies for evaluating PTQ, but does not discuss the indexing of uncertain data.
53Searching a R-tree
- A 1D R-tree can be used to index uncertainty intervals
- Each R-tree node has a minimum bounding rectangle (MBR), which encompasses all intervals in the subtree of that node
- Starting from the root, only children with MBRs that overlap [a, b] are further followed
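The descent can be sketched with a toy two-level tree. This is an assumption-laden illustration rather than a real R-tree: the `Node` class is hypothetical, MBRs are recomputed on demand instead of being stored, and there is no insertion or node splitting.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical 1D R-tree node: leaves hold intervals, inner nodes hold children."""
    intervals: list = field(default_factory=list)   # (id, L, R) tuples, leaves only
    children: list = field(default_factory=list)    # child nodes, inner nodes only

    def mbr(self):
        """Smallest interval enclosing everything in this subtree."""
        if self.children:
            boxes = [c.mbr() for c in self.children]
        else:
            boxes = [(L, R) for _, L, R in self.intervals]
        return min(lo for lo, _ in boxes), max(hi for _, hi in boxes)


def overlaps(lo, hi, a, b):
    return lo <= b and hi >= a


def search(node, a, b):
    """Return ids of intervals overlapping [a, b], following only overlapping MBRs."""
    if not node.children:
        return [i for i, L, R in node.intervals if overlaps(L, R, a, b)]
    hits = []
    for child in node.children:
        lo, hi = child.mbr()
        if overlaps(lo, hi, a, b):      # skip subtrees whose MBR misses the query
            hits.extend(search(child, a, b))
    return hits


leaf1 = Node(intervals=[("T1", 0.0, 20.0), ("T2", 12.0, 22.0)])
leaf2 = Node(intervals=[("T3", 24.0, 64.0), ("T4", 40.0, 50.0)])
leaf3 = Node(intervals=[("T5", 70.0, 80.0)])
root = Node(children=[leaf1, leaf2, leaf3])
result = search(root, 10.0, 25.0)
print(result)
```

The search never looks at probabilities, so every overlapping interval is returned regardless of how little of its pdf falls inside [a, b].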
54Complexity of PTQU
- Half-space queries report the set of points that satisfy ax + by ≥ c
- PTQU is at least as hard as half-space queries, which require O(n^(1/3)) operations [F81] using a linear-space index
- Simplex queries report the set of points that satisfy a list of constraints a_i x + b_i y ≥ c_i
- PTQU is a special case of simplex queries, where the query time is O(n^ε) using a linear-space structure [AEM92]
55Details of Variance-based Clustering
- The exact indexing technique depends on the form of the pdf
- For regular pdfs, e.g., Gaussian and uniform, a node can be pruned without the extra overhead of PTI
- For arbitrary pdfs, a PTI table is needed in each node to facilitate pruning