Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data

1
Efficient Indexing Methods for Probabilistic
Threshold Queries over Uncertain Data
  • Reynold Cheng, Yuni Xia, Sunil Prabhakar,
  • Rahul Shah and Jeffrey Scott Vitter
  • Department of Computer Science
  • Purdue University

2
Sensor-based Applications
[Figure: sensors monitor the external environment (e.g., temperature, moving objects, hazardous materials) and report over a network channel to the database system; the user sends queries and receives results]
3
Data Uncertainty
  • Due to limited network bandwidth and battery
    power, readings are sampled only
  • The value of the entity being monitored (e.g.,
    temperature, location) is changing
  • The database stores old values (sampling
    uncertainty)
  • Query results can be incorrect!

4
Data Uncertainty and Query Incorrectness
[Figure: recorded vs. current temperature of rooms x and y on a 0 to 30 °F axis; x0, y0 are the recorded values and x1, y1 the current values]
  • Which room's temperature is between 10°F and 25°F?
  • Database: y
  • Correct answer: x
5
Querying Bounded Values
[Figure: recorded temperature and uncertainty interval for the current temperature of T1 and T2 on a 0 to 30 °F axis]
  • Both x and y may be answers
  • Which one has the better chance?
  • Measurement error is also a source of uncertainty
6
Imprecise Answers
  • In general, sensor uncertainty does not allow us
    to get an exact answer.
  • The answer is imprecise rather than exact.
  • It is possible to attach confidence to answers, e.g.,
    probability values
  • A probabilistic query returns answers with
    probabilistic guarantees

7
Probabilistic Queries
[Figure: recorded temperature and uncertainty interval for the current temperature of T1 and T2 on a 0 to 30 °F axis]
  • Which rooms' temperatures are between 10°F and 25°F?
  • (T1, 10%), (T2, 80%)
8
Outline
  • Modeling Sensor Uncertainty
  • Probability Threshold Queries
  • Probability Threshold Indexing
  • Variance-based Clustering
  • Experimental Results

9
Sampling Uncertainty
  • The value of the external entity is sampled at
    discrete time
  • Can produce incorrect results
  • Bounded by dead-reckoning update (Wolfson et al.,
    99)

10
Dead-Reckoning Update
  • Each sensor keeps track of the difference (d)
    between its current value and the value last sent
  • Send an update to the database when d > deviation
    threshold

[Figure: d is the gap between the last value sent and the current value]
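
The update rule on this slide can be sketched in a few lines. This is only an illustration: the class name `DeadReckoningSensor`, the threshold of 2.0, and the sample readings are assumptions, not part of the presentation.

```python
class DeadReckoningSensor:
    """Hypothetical sensor that reports only when its drift exceeds a threshold."""

    def __init__(self, initial_value: float, threshold: float):
        self.last_sent = initial_value  # value the database currently holds
        self.threshold = threshold      # deviation threshold from the slide

    def observe(self, current_value: float) -> bool:
        """Return True if this reading triggers an update to the database."""
        d = abs(current_value - self.last_sent)
        if d > self.threshold:
            self.last_sent = current_value
            return True
        return False


sensor = DeadReckoningSensor(initial_value=20.0, threshold=2.0)
sent = [sensor.observe(v) for v in (20.5, 21.9, 22.5, 23.0, 25.0)]
print(sent)  # which of the five readings were transmitted
```

Between updates the database only knows that the true value lies within the threshold of the last value sent, which is what bounds the sampling uncertainty.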
11
Measurement Uncertainty
  • Measurement Error (Pfoser & Jensen, 99)
  • Due to inherent imprecision in hardware e.g., GPS
  • Less serious than sampling error (Trajcevski et
    al., 02)

12
Database Model
13
Interval Uncertainty
[Figure: uncertain interval Ui(t), with bounds li(t) and ui(t), around the recorded value Ti.z(t)]
  • Example: Ui(t) is the interval bounding all values
    within a distance of (t - t_update) · r of Ti.z
  • t_update: time that Ti.z was last updated
  • r: current rate of change of Ti.z
  • Example: dead-reckoning update (Wolfson 99)
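
As a rough illustration of the first example, the interval can be computed directly from the last recorded value, the elapsed time, and the rate of change. The function name and the numbers are hypothetical, and the sketch assumes the interval is symmetric around Ti.z.

```python
def uncertainty_interval(z: float, t: float, t_update: float, r: float):
    """Interval of all values within (t - t_update) * r of the recorded value z."""
    radius = (t - t_update) * r
    return (z - radius, z + radius)


# Recorded value 20.0, last updated at time 10, queried at time 15,
# with an assumed rate of change of 0.5 units per time step.
lo, hi = uncertainty_interval(z=20.0, t=15.0, t_update=10.0, r=0.5)
print(lo, hi)
```

The interval widens as time passes since the last update and collapses to a point when a fresh reading arrives.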

14
Probabilistic Uncertainty
[Figure: uncertainty pdf fi(x) of Ti.z over the uncertainty interval [Li, Ri]]
  • Wolfson et al. (1999) proposed fi(x) as Gaussian
    distribution for a moving object on a route
  • Deshpande et al. (2004) discussed parametrization
    of Gaussian distribution in sensor networks
  • Can be extended to n dimensions

15
Outline
  • Modeling Sensor Uncertainty
  • Probability Threshold Queries
  • Probability Threshold Indexing
  • Variance-based Clustering
  • Experimental Results

16
Probabilistic Queries
[Figure: recorded temperature and uncertainty interval for the current temperature of T1 and T2 on a 0 to 30 °F axis]
  • Which rooms' temperatures are between 10°F and 25°F?
  • (T1, 10%), (T2, 80%)
17
Probabilistic Queries
  • Drawback: costly integration operations
  • In practice, the user is only concerned with
    results with sufficiently high probability values
  • e.g., return ids of sensors with temperature over
    30°F where probability ≥ 0.7

18
Probability Threshold Queries (PTQ)
  • INPUT: [a,b] and p, where a, b, p ∈ ℝ and
    0 < p ≤ 1
  • OUTPUT: every Ti where the probability pi that Ti.z
    is inside [a,b] satisfies pi ≥ p
  • The actual value of pi is not returned
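
A minimal sketch of this definition, assuming every item has a uniform uncertainty pdf over its interval. The function names and the two room intervals are made up so that the probabilities come out to 10% and 80% for the query [10, 25].

```python
def uniform_probability(L: float, R: float, a: float, b: float) -> float:
    """Probability that a value uniform on [L, R] falls inside [a, b]."""
    if R <= L:
        raise ValueError("expected L < R")
    overlap = min(R, b) - max(L, a)
    return max(0.0, overlap) / (R - L)


def ptq(items: dict, a: float, b: float, p: float) -> list:
    """Return only the ids whose probability meets the threshold p."""
    return [name for name, (L, R) in items.items()
            if uniform_probability(L, R, a, b) >= p]


rooms = {"T1": (1.0, 11.0), "T2": (21.0, 26.0)}  # hypothetical intervals
print(ptq(rooms, a=10.0, b=25.0, p=0.7))
```

Note that `ptq` returns ids only, matching the last bullet: the probability values themselves are not part of the answer.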

19
Interval Indexing
  • Interval indexing handles containment, overlap
    and stabbing queries
  • Manolopoulos et al. (2000) proposed an efficient
    interval tree for range queries
  • Arge & Vitter (1996) and Kanellakis et al. (1996)
    mapped 1D interval queries to 2D queries

20
Solving PTQ with Interval Indexes
  • Use interval indexes to find intervals that
    overlap [a,b]
  • For each object retrieved, evaluate its
    probability of being within [a,b]
  • Return intervals with probability ≥ p
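
The three steps can be sketched as a filter-and-refine loop. A linear scan stands in for the interval index here, the pdf is assumed uniform, and the data set is invented for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def overlaps(iv: Interval, a: float, b: float) -> bool:
    return iv[0] <= b and iv[1] >= a


def prob_uniform(iv: Interval, a: float, b: float) -> float:
    L, R = iv
    return max(0.0, min(R, b) - max(L, a)) / (R - L)


def filter_and_refine(data: Dict[str, Interval], a: float, b: float, p: float):
    # Step 1: overlap test (the part an interval index would accelerate).
    candidates: List[str] = [k for k, iv in data.items() if overlaps(iv, a, b)]
    # Steps 2 and 3: evaluate each candidate's probability, keep those >= p.
    answers = [k for k in candidates if prob_uniform(data[k], a, b) >= p]
    return candidates, answers


data = {"A": (0.0, 100.0), "B": (40.0, 60.0), "C": (55.0, 95.0), "D": (200.0, 300.0)}
candidates, answers = filter_and_refine(data, a=50.0, b=60.0, p=0.5)
print(candidates, answers)
```

In this toy run the overlap step returns three candidates but only one survives the probability check, so two probability evaluations are wasted.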

21
The Problem of Interval Indexes
  • Current Interval indexes do not consider
    probabilities during search
  • Many irrelevant objects (probability < p) may be
    retrieved from the interval index

22
Our Solutions
  • Probability Threshold Indexing (PTI): 1D interval
    R-tree with uncertainty
  • Variance-based Clustering: transform
    intervals to 2D points and index based on variance

23
Outline
  • Modeling Sensor Uncertainty
  • Probability Threshold Queries
  • Probability Threshold Indexing
  • Variance-based Clustering
  • Experimental Results

24
Pruning in a 1D R-Tree
  • Some intervals in the MBR may satisfy Q
  • Need to retrieve the contents of MBR

25
x-bounds in a PTI Node
[Figure: a PTI node with its left-0.2-bound and right-0.2-bound; each interval has probability at most 0.2 of lying beyond a bound]
26
x-bounds in a PTI Node
[Figure: a PTI node with its left-0-bound and right-0-bound]
27
Pruning with x-bounds
[Figure: query Q relative to the left-0.2-bound and right-0.2-bound of an MBR]
  • An MBR is not further retrieved if
  • Q does not cut the left and right x-bounds
  • p > x
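
The pruning rule above can be sketched as follows. The sketch assumes uniform pdfs and takes a left-x-bound to be the leftmost position such that no interval in the node has more than x of its probability mass to the left of it (and symmetrically for the right-x-bound); the helper names and the sample node are hypothetical.

```python
def x_bounds(intervals, x):
    """Left and right x-bounds of a node holding uniform-pdf intervals."""
    left = min(L + x * (R - L) for L, R in intervals)
    right = max(R - x * (R - L) for L, R in intervals)
    return left, right


def can_prune(intervals, a, b, p, x):
    """True if the node cannot contain any answer to the PTQ ([a, b], p)."""
    left, right = x_bounds(intervals, x)
    query_outside = b < left or a > right  # Q lies wholly beyond an x-bound
    return p > x and query_outside


node = [(0.0, 10.0), (2.0, 12.0), (5.0, 20.0)]
print(can_prune(node, a=0.0, b=1.5, p=0.3, x=0.2))  # Q left of the left-0.2-bound
print(can_prune(node, a=0.0, b=1.5, p=0.1, x=0.2))  # threshold too low to prune
print(can_prune(node, a=1.0, b=6.0, p=0.3, x=0.2))  # Q cuts into the node
```

When the query lies wholly beyond an x-bound, every interval in the node has probability at most x of falling inside it, so a threshold p greater than x rules the whole node out.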

28
Implementation of PTI
29
Drawback of PTI
  • Extra overhead in storing x-bounds
  • Doesn't distinguish small and large intervals

[Figure: a node with its left-0.2-bound and right-0.2-bound]
30
Outline
  • Modeling Sensor Uncertainty
  • Probability Threshold Queries
  • Probability Threshold Indexing
  • Variance-based Clustering
  • Experimental Results

31
Mapping intervals to 2D-space
  • Each 1D interval [Li,Ri] can be mapped to a point
    (x,y) in 2D space
  • Li → x
  • Ri → y
  • y ≥ x: mapped points lie above the x = y line

32
The PTQ-Uniform Problem
[Figure: three intervals, each with a uniform pdf]
33
2D View of PTQ-Uniform
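[Figure: 1D view (uniform pdf) and 2D view of a query Q = [a,b] with p = 0.75; axes x = Li and y = Ri, with all points above the line x = y. The qualifying region is bounded by:]
  • a < x < y < b: intervals inside [a,b]
  • y(1-p) + xp ≥ a: intervals containing a
  • x(1-p) + yp ≤ b: intervals containing b
  • b-a ≥ p(y-x): intervals containing [a,b]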
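
A small sketch of how these linear constraints decide membership for a mapped point (x, y), assuming a uniform pdf. The function name and the case ordering are illustrative choices, not taken from the slides.

```python
def satisfies_ptq_uniform(x: float, y: float, a: float, b: float, p: float) -> bool:
    """Check a mapped point (x, y) = (Li, Ri) against the PTQ ([a, b], p)."""
    if y < a or x > b:            # interval does not overlap the query
        return False
    if a <= x and y <= b:         # interval inside [a, b]: probability is 1
        return True
    if x <= a and y >= b:         # interval contains [a, b]
        return b - a >= p * (y - x)
    if x < a:                     # interval contains a only
        return y * (1 - p) + x * p >= a
    return x * (1 - p) + y * p <= b   # interval contains b only


print(satisfies_ptq_uniform(4, 8, 5, 12, 0.75))  # overlap 3 of length 4
print(satisfies_ptq_uniform(3, 8, 5, 12, 0.75))  # overlap 3 of length 5
```

Each branch is a linear inequality in x and y, which is why the qualifying region in the 2D view has straight edges.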
34
Clustering of 2D points
[Figure: 2D points (Li,Ri) above the line x = y, annotated with the mean and variance of [Li,Ri]; one cluster of large intervals and one cluster of smaller intervals]
  • When 2D points are clustered, small and large
    intervals are separated
  • Points in the same vicinity have similar means
    and variances
35
Answering PTQ-Uniform with 2D R-Tree
  • Construct a 2D R-tree over 2D points
  • Perform a trapezoidal range query over the 2D
    R-tree
  • Since points with similar means and variances are
    clustered together, it is better than PTI

36
Variance-based Clustering
  • Can be extended to other pdfs
  • Variance-based clustering is an uncertainty
    indexing technique based on 2D R-tree
  • Each item is indexed based on its mean and
    variance
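
For a uniform pdf, the two indexing keys follow from the standard formulas for the mean and variance of a uniform distribution on [L, R]. The sketch below only computes the keys for two made-up intervals; it does not build the R-tree.

```python
def mean_variance_uniform(L: float, R: float):
    """Mean and variance of a uniform distribution on [L, R]."""
    return (L + R) / 2.0, (R - L) ** 2 / 12.0


small = mean_variance_uniform(10.0, 12.0)  # short interval, tiny variance
large = mean_variance_uniform(0.0, 60.0)   # long interval, large variance
print(small, large)
```

Intervals of very different lengths get very different variance keys, so a 2D index over these keys tends to keep them in separate nodes.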

37
Variance-based Clustering
  • For uniform and Gaussian distributions, range
    queries over 2D points can be constructed
  • For arbitrary pdfs, a well-defined range query
    may be infeasible
  • In those cases, place x-bounds in each 2D R-tree
    node for pruning

38
Theoretical Results
  • Not possible to create a linear space index that
    gives logarithmic query times for PTQs in the
    worst case
  • For most cases, any space-partitioning data
    structure e.g., 2D R-tree suffices
  • PTQ with fixed threshold and uniform distribution
    can be answered in logarithmic time with a linear
    structure

39
Outline
  • Modeling Sensor Uncertainty
  • Probability Threshold Queries
  • Probability Threshold Indexing
  • Variance-based Clustering
  • Experimental Results

40
Performance Comparison
  • Compare number of I/Os between
  • 1D R-tree on intervals only
  • PTI (1D R-tree with probability thresholds)
  • 2D variance-based clustering (called Extensive)

41
Simulation Model
  • 100K uncertain data items, with length uniformly
    distributed in [0,10000] and a uniform uncertainty
    pdf
  • 10K PTQs, with the length of [a,b] normally
    distributed and p ∈ [0.1,1]
  • Each PTI node contains five x-bounds, where x ∈
    {0.1, 0.3, 0.5, 0.7, 0.9}

42
Scalability of Indexes
  • Both PTI and Extensive outperform R-tree
  • Answering PTQ with R-tree requires more
    computation
  • Extensive needs about 50% fewer I/Os than PTI

43
Effect of Query Probability Threshold
  • R-tree does not benefit from the increasing value
    of p
  • When p is 0.5, Extensive is 4 times better than
    PTI

44
Other Important Results
  • Classification of probabilistic queries based on
    answer type and operators
  • different evaluation algorithms
  • Moving-objects probabilistic nearest-neighbor
    queries
  • Quality metrics of probabilistic results
  • Heuristics for improving answer quality

45
Future Work
  • Study probabilistic threshold constraints for
    other queries, such as nearest neighbors and
    joins
  • Study the indexing of other uncertain data types
    e.g., fuzzy data and sets
  • Study other kinds of constraints on probabilistic
    queries e.g., answers with top-k probability
    values

46
Conclusions
  • Based on the pdf information of uncertain
    intervals, PTI places tighter bounds in 1D R-tree
    nodes.
  • Variance-based clustering uses a 2D R-tree to
    avoid placing intervals of extreme sizes
    together.
  • The concept of these indexes can be extended to
    multiple dimensions.
  • Contact Reynold Cheng (ckcheng_at_cs.purdue.edu) for
    details

47
References
  • [AEM92] Pankaj K. Agarwal, David Eppstein, and
    Jiří Matoušek. Dynamic half-space reporting,
    geometric optimization, and minimum spanning
    trees. In FOCS, pages 80-89, 1992.
  • [AV96] L. Arge and J. S. Vitter. On dynamic
    interval management in external memory (extended
    abstract). In FOCS, pages 560-569, 1996.
  • [CP03] R. Cheng and S. Prabhakar. Managing
    uncertainty in sensor databases. In SIGMOD
    Record, Dec 2003.
  • [CKP03] R. Cheng, D. Kalashnikov, and S.
    Prabhakar. Evaluating probabilistic queries over
    imprecise data. In Proc. of the ACM SIGMOD, 2003.

48
References
  • [KRVV96] P. C. Kanellakis, S. Ramaswamy, D.
    Vengroff, and J. S. Vitter. Indexing for data
    models with constraints and classes. J. Comp.
    Syst. Sci., 52(3):589-612, 1996.
  • [MTT00] Y. Manolopoulos, Y. Theodoridis, and V.
    J. Tsotras. Chapter 4: Access methods for
    intervals. In Advanced Database Indexing, Kluwer,
    2000.
  • [WSCY99] O. Wolfson, P. Sistla, S. Chamberlain,
    and Y. Yesha. Updating and querying databases
    that track mobile units. Distributed and Parallel
    Databases, 7(3), 1999.
  • [DGMHH04] A. Deshpande, C. Guestrin, S. Madden,
    J. Hellerstein, and W. Hong. Model-Driven Data
    Acquisition in Sensor Networks. In VLDB, 2004.

49
Related Work: Interval Indexing
  • [AV96] and [KRVV96] discuss the idea of mapping
    intervals to points in 2D space. The
    transformation of 1D stabbing queries and range
    queries to two-sided orthogonal queries in 2D
    space is also presented.
  • [MTT00] proposes an efficient interval tree to
    facilitate the execution of intersection queries
    over intervals.
  • [CKP04] proposes an indexing scheme for the
    constantly-growing uncertainty of moving objects.

50
Related Work: 2D Range Queries
  • A comprehensive survey on geometric range
    searching can be found in [M94].
  • For half-space queries,
  • [F81] discusses lower bounds.
  • [AEM92] presents optimal structures.
  • For simplex queries,
  • [C89] derives the lower bounds.
  • [GRUY97] modifies the R-tree to answer simplex queries

51
Related Work: Uncertainty Indexing
  • Few works have addressed the issue of indexing
    uncertain data that involves probability
    computation.
  • [CKP04] proposes an indexing scheme for the
    constantly-growing uncertainty of moving objects.
  • [LMPS03] discusses an extension of the TPR-tree
    to index trajectories of moving objects, where
    each point in the trajectory has a rectangular
    uncertainty bound.

52
Related Work: Probabilistic Queries
  • [CKP03] proposes an uncertainty model for
    constantly-evolving data. It also presents
    classification, evaluation and quality issues of
    different types of probabilistic queries.
  • For moving object uncertainty,
  • [WSCY99] studies probabilistic range queries.
  • [CKP04] studies probabilistic nearest neighbor
    queries.
  • [CP03] proposes computation strategies for
    evaluating PTQ, but does not discuss the indexing
    of uncertain data.

53
Searching a R-tree
  • A 1D R-tree can be used to index uncertainty
    intervals
  • Each R-tree node has a minimum bounding rectangle
    (MBR), which encompasses all intervals in the
    subtree of that node
  • Starting from the root, only children with MBRs
    that overlap [a,b] are further followed
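
The traversal described above can be sketched with a hand-built two-level tree. The `Node` class and the sample intervals are hypothetical, and node splitting and insertion are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]


@dataclass
class Node:
    mbr: Interval                                             # bounds everything below this node
    children: List["Node"] = field(default_factory=list)      # internal nodes only
    intervals: List[Interval] = field(default_factory=list)   # leaf nodes only


def search(node: Node, a: float, b: float) -> List[Interval]:
    """Collect stored intervals overlapping [a, b], skipping disjoint subtrees."""
    lo, hi = node.mbr
    if hi < a or lo > b:
        return []                 # MBR misses the query: prune this subtree
    hits = [iv for iv in node.intervals if iv[0] <= b and iv[1] >= a]
    for child in node.children:
        hits.extend(search(child, a, b))
    return hits


leaf1 = Node(mbr=(0, 12), intervals=[(0, 10), (2, 12)])
leaf2 = Node(mbr=(30, 50), intervals=[(30, 40), (35, 50)])
root = Node(mbr=(0, 50), children=[leaf1, leaf2])
result = search(root, 11, 32)
print(result)
```

The search prunes on MBR overlap alone; it says nothing about how much probability each returned interval places inside the query range.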

54
Complexity of PTQU
  • Half-space queries report the set of points that
    satisfy ax + by ≥ c
  • PTQU is at least as hard as half-space queries,
    which require O(n^(1/3)) operations [F81] using a
    linear-space index
  • Simplex queries report the set of points that
    satisfy a list of constraints a_i x + b_i y ≥ c_i
  • PTQU is a special case of simplex queries, where
    query time is O(n^ε) using a linear structure
    [AEM92].
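
To see why a PTQU condition has the half-space form, consider the case of a uniform pdf on an interval [x, y] that contains a but not b. The worked step below covers only that case and assumes y > x; x and y denote the interval endpoints, as in the 2D mapping.

```latex
\begin{align*}
\Pr\bigl[T_i.z \in [a,b]\bigr] = \frac{y-a}{y-x} &\ge p \\
\iff\quad y - a &\ge p\,(y - x) \\
\iff\quad (1-p)\,y + p\,x &\ge a
\end{align*}
```

The last line is a single linear constraint on the point (x, y), so each case of the query corresponds to a half-plane and their combination to a simplex-style query.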

55
Details of Variance-based Clustering
  • The exact indexing technique depends on the form
    of pdf
  • For regular sets, e.g., Gaussian and uniform pdf,
    can prune a node without the extra overhead of
    PTI
  • For arbitrary pdfs, need a PTI table in each node
    to facilitate pruning