MultiDimensional Range Searching - PowerPoint PPT Presentation

Slides: 53
Provided by: csU45

1
Multi-Dimensional Range Searching
Committee: Subhash Suri (Chair), Amr El Abbadi,
Teofilo Gonzalez
  • Amit Bhosle
  • Department of Computer Science
  • University of California Santa Barbara

2
What I Will Cover Today
  • Introduction: problem statement / applications
  • Some classical solutions
  • A lower bound by Chazelle in orthogonal range
    searching
  • Indexing schemes in the context of range searching
  • Some indexing structures (R-trees and Box trees)

3
What is Range Searching ?
  • Preprocess a set P of objects for efficiently
    answering queries.
  • Typically, P is a collection of geometric objects
    (points, rectangles, polygons) in $\mathbb{R}^d$.
  • Query range Q: d-rectangles, balls, halfspaces,
    simplices, etc.
  • Either count all objects in $P \cap Q$ or report
    the objects themselves.

4
Example: Points in $\mathbb{R}^2$
[Figure: a planar point set with two query ranges, Q1 and Q2]
5
Why Study Range Searching ?
  • Applications in several fields
  • Databases
  • Spatial databases (G.I.S.)
  • Computer Graphics
  • Robotics
  • Algorithmic tool (example ?? )
  • And more..

6
Some Classical Approaches
  • Gridding or bucketing
      • Simple and independent of query shape.
      • Good only for a uniform distribution of the input;
        can be quite bad for skewed data.
  • Range Trees
      • Good query time and space in low dimensions.
      • Poor in higher dimensions: $O(\log^d n)$ query time.
  • kD-Trees
      • Linear space in all dimensions.
      • Query time becomes almost linear in high
        dimensions: $O(n^{1-1/d})$.

7
The Grid Approach
[Figure: a grid laid over the point set, with the divisions of each axis numbered 1, 2, ..., k]
8
Grids (Contd.)
  • Either the queries should be aligned with the
    grid or result is approximate.
  • Cell sizes need not be uniform and can be adapted
    to data distribution.
  • $O(k^{d-1})$ query time, $O(nd + k^d)$ preprocessing
    time and $O(n + k^d)$ space (k is the number of
    divisions of each axis).
  • Error decreases as k increases, but space
    requirement increases.
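
The bucketing idea can be sketched as follows. This is a minimal illustration, not the construction analysed on the slide: it assumes 2-D points in the unit square, uses a hypothetical `UniformGrid` class, and answers reporting queries exactly by filtering the points of every cell the query touches (so its query cost depends on the number of touched cells and their contents, not on the $O(k^{d-1})$ bound above).

```python
from collections import defaultdict

class UniformGrid:
    """k x k uniform grid over [lo, hi]^2 (illustrative; names are assumptions)."""

    def __init__(self, points, k, lo=0.0, hi=1.0):
        self.k, self.lo, self.width = k, lo, (hi - lo) / k
        self.cells = defaultdict(list)
        for p in points:
            self.cells[(self._index(p[0]), self._index(p[1]))].append(p)

    def _index(self, v):
        # Clamp so that coordinates on or beyond the border fall in an edge cell.
        return min(self.k - 1, max(0, int((v - self.lo) / self.width)))

    def report(self, x1, x2, y1, y2):
        """Report all points in the closed box [x1, x2] x [y1, y2]."""
        out = []
        for cx in range(self._index(x1), self._index(x2) + 1):
            for cy in range(self._index(y1), self._index(y2) + 1):
                # Filtering makes the answer exact even for unaligned queries.
                out.extend(p for p in self.cells.get((cx, cy), ())
                           if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
        return out

pts = [(((i * 37) % 101) / 101, ((i * 53) % 97) / 97) for i in range(200)]
grid = UniformGrid(pts, k=8)
print(len(grid.report(0.2, 0.7, 0.1, 0.5)))
```

With skewed data most points land in a few cells, and the filtering step degenerates towards a linear scan, which is the weakness noted for this approach.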

9
Range Trees
  • 1-D case: build a balanced binary tree using the
    points' coordinates as the keys.

[Figure: balanced binary tree on the keys 2, 4, 5, 7, 8, 12, 15, 19; the labels 6 and 17 mark the query range]
Counting?
Optimal time/space
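
A minimal sketch of 1-D range counting and reporting. As an assumption made for brevity, a sorted Python list stands in for the balanced binary tree; binary search on it visits the same $O(\log n)$ keys a tree search would. The function names are illustrative.

```python
from bisect import bisect_left, bisect_right

def count_in_range(sorted_keys, lo, hi):
    """Number of keys k with lo <= k <= hi, found with two binary searches."""
    return bisect_right(sorted_keys, hi) - bisect_left(sorted_keys, lo)

def report_in_range(sorted_keys, lo, hi):
    """The keys k with lo <= k <= hi, in sorted order."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]

keys = sorted([2, 4, 5, 7, 8, 12, 15, 19])
print(count_in_range(keys, 6, 17))   # 4
print(report_in_range(keys, 6, 17))  # [7, 8, 12, 15]
```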
10
Range Trees (Contd.)
  • d dimensions:
      • Build a 1-D range tree on the first dimension.
      • Each internal node points to another tree built
        recursively on the remaining d-1 dimensions for
        the points in its subtree.

[Figure: a two-level range tree on the points P1, P2, P3, P4]
11
Range Trees (Contd.)
[Figure: a two-level range tree on the points P1 through P8]
12
Range Trees (Contd.)
  • Searching by the first dimension gives us $O(\log n)$
    subtrees which together contain the output
    point(s).
  • Search the remaining d-1 dimensions recursively
    among these.
  • $O(\log^d n + k)$ query time, $O(n \log^{d-1} n)$
    preprocessing time and space.
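
The two-level search can be sketched for d = 2 as follows. This is an illustrative sketch under assumptions: the class name `RangeTree2D` is hypothetical, the associated structure at each node is a sorted list of y-values rather than a second tree, the input is non-empty, and the naive construction re-sorts at every level, so it does not meet the preprocessing bound stated above.

```python
from bisect import bisect_left, bisect_right

class RangeTree2D:
    """Node of a 2-D range tree: split on x, keep the subtree's y-values sorted."""

    def __init__(self, points):
        pts = sorted(points)                      # order by x (ties by y)
        self.xmin, self.xmax = pts[0][0], pts[-1][0]
        self.ys = sorted(p[1] for p in pts)       # associated 1-D structure on y
        self.left = self.right = None
        if len(pts) > 1:
            mid = len(pts) // 2
            self.left = RangeTree2D(pts[:mid])
            self.right = RangeTree2D(pts[mid:])

    def count(self, x1, x2, y1, y2):
        """Count points in the closed box [x1, x2] x [y1, y2]."""
        if x2 < self.xmin or self.xmax < x1:      # subtree misses the x-range
            return 0
        if x1 <= self.xmin and self.xmax <= x2:   # canonical subtree: search y only
            return bisect_right(self.ys, y2) - bisect_left(self.ys, y1)
        # Partial overlap in x implies xmin < xmax, so this node is internal.
        return (self.left.count(x1, x2, y1, y2)
                + self.right.count(x1, x2, y1, y2))

pts = [((i * 37) % 101, (i * 53) % 97) for i in range(200)]
tree = RangeTree2D(pts)
print(tree.count(10, 60, 20, 80))
```

Each query decomposes the x-range into canonical subtrees and performs one binary search on y in each of them.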

13
kD-Trees (k-dimensional Trees)
  • 1-D tree: split along the median point and
    recursively build subtrees for the left and right
    sets.
  • Higher dimensions: same approach, but cycle
    through the dimensions. Or, select the next
    dimension as the one with the widest spread.
  • Efficiency of query processing drops as the
    dimension increases (becomes almost linear).
    However, the space requirement remains linear:
    $O(n \cdot d)$.
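
A minimal sketch of the construction and of an orthogonal range query, assuming the variant that cycles through the dimensions and stores one point per node. The dictionary layout and function names are illustrative choices, not taken from the slides.

```python
def build_kdtree(points, depth=0):
    """Build a kD-tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],                                   # median point
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),          # coords <= median
        "right": build_kdtree(pts[mid + 1:], depth + 1),     # coords >= median
    }

def range_query(node, lo, hi, out=None):
    """Report all points p with lo[j] <= p[j] <= hi[j] in every dimension j."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
        out.append(p)
    if lo[axis] <= p[axis]:        # the left side may still contain answers
        range_query(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:        # the right side may still contain answers
        range_query(node["right"], lo, hi, out)
    return out

pts = [((i * 37) % 101, (i * 53) % 97, (i * 29) % 89) for i in range(300)]
root = build_kdtree(pts)
print(len(range_query(root, (10, 10, 10), (60, 70, 80))))
```

The tree stores each point once, which is where the linear space bound comes from; the query cost grows because a box can cut through many cells.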

14
kD-Trees (Contd.)
[Figure: a kD-tree on the points a through o, shown as a subdivision of the plane and as a tree]
15
kD-Trees (Contd.)
  • Query complexity: how many cells can a query box
    intersect?

Let us consider a facet of the query and the 4 cells
obtained after two levels of splitting (2-D case):
  • Any axis-parallel line can intersect at most 2 of
    these 4 cells.
  • Each of these 4 cells contains exactly n/4 points.
  • $Q(n) = 2\,Q(n/4) + 1$
  • $Q(n) = O(n^{1/2})$
  • i.e., a query is answered in $O(n^{1-1/d} + m)$ time, where m
    is the output size.
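
The recurrence for the 2-D case can be unrolled as follows (a worked derivation, assuming $Q(1) = O(1)$ and n a power of 4):

```latex
\begin{align*}
Q(n) &= 2\,Q(n/4) + 1 \\
     &= 4\,Q(n/16) + 2 + 1 \\
     &= 2^{j}\,Q(n/4^{j}) + \sum_{i=0}^{j-1} 2^{i}
        && \text{after } j \text{ steps} \\
     &= 2^{\log_4 n}\,Q(1) + \bigl(2^{\log_4 n} - 1\bigr)
        && \text{with } j = \log_4 n \\
     &= O\!\left(n^{1/2}\right)
        && \text{since } 2^{\log_4 n} = n^{1/2}.
\end{align*}
```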

16
Summary of Classical Solutions
  • Classical solutions, though good for small
    dimension space, do not perform well in higher
    dimensions.
  • Updates (inserts/deletes) are expensive (we did
    not discuss them).
  • Desired properties of the data structure:
      • Near-linear size.
      • Query time $O(k + f(n))$, where k is the output size
        and f is a very slowly increasing function.
      • Preprocessing time is not as important as the above
        two.

17
Lower Bounds in Orthogonal Range Searching
  • Bernard Chazelle, Princeton
  • Proved that for the range reporting problem,
    $O(k + \mathrm{polylog}(n))$ query time requires
    $\Omega(n (\log n/\log\log n)^{d-1})$ space on a pointer
    machine.
  • The lower bound holds only for pointer algorithms.
      • These algorithms need an explicit pointer to
        an object to access it; e.g., they cannot use the
        coordinates of the points for indexing into a
        structure.
      • Algorithms based on range trees and kD-trees fall
        in this class of algorithms.

18
Models of Computation
Memory access rules differ for pointer and RAM
machines
[Figure: block diagram of a machine with an input device, a central processing unit containing a control unit, memory, and an output device]
19
Chazelle's Lower Bound
[Figure: the data structure drawn as a digraph with a marked root]
Data structure: a digraph $G = (V, E)$ of bounded
out-degree.
G has a representative node for each input point
and some other internal nodes.
For a query q, the algorithm non-deterministically
traverses G, adds/deletes some (internal) nodes
and edges, and produces a set W(q) of nodes which
is a superset of the nodes representing the answer points.
20
Chazelle's Lower Bound (Contd.)
Range Trees and kD-Trees are some such data
structures.
[Figure: a range tree (points P1 to P4) and a kD-tree (cells a to g), each viewed as a digraph]
21
Chazelle's Lower Bound (Contd.)
Desired query time: $O(k + \mathrm{polylog}(n))$ (k is the
output size).
[Figure: the digraph G with a marked root]
For the query time to be linear in k, the nodes
for the answer set should be close to each
other. This should hold for any query q.
22
Chazelle's Lower Bound (Contd.)
  • If we have a set of queries, $S = \{q_1, q_2, \ldots, q_s\}$,
    such that
      • $|P \cap q_i| \ge \log^b n$ (each range has many
        points)
      • $|P \cap q_i \cap q_j| \le 1$ for $i \ne j$ (no two ranges
        share many points)
  • Then each $q_i$ has a representative subset of
    nodes in G which is compact (and has many edges).
    $\Rightarrow$ G has many edges.
  • By the bounded out-degree condition on G, it
    consequently has many vertices, and thus requires
    a large amount of memory.

23
Chazelle's Lower Bound (Contd.)
  • If S exhibits the desired properties, then he
    shows that for a query time of
      • $|W(q)| \le a\,(k + \log^b n)$,
      • $|V| > |S| \cdot \log^b n / 2^{16a+4}$ (long
        algebraic proof).
  • Recall that W(q) is the output set produced for a
    query q.
  • The point set and the queries are generated as
    follows:
      • Let $n = m^{\lambda}$,
      • where $m = \lceil 2\log^b p \rceil$ and
      • $\lambda = \lfloor \log p / (1 + b\log\log p) \rfloor$ for
        some large integer p.
  • If p is large enough, $m \ge \log^b n$ and
    $\lambda \approx \log n / (1 + b\log\log n)$.

24
Bad Input Set and Queries
  • Define the point set as $P = \{(\rho_m(i), i) : 0 \le i < n\}$.
  • $\rho_m(i)$: write i in base m over $\lambda$ digits and
    reverse the digits.
  • Consider a tree T which encodes the x-coordinates
    of the points in P (take their m-ary
    representation).
  • Each node has m children labeled 0, 1, 2, ..., m-1.

[Figure: the tree T; every node has children 0, 1, 2, ..., m-1, and the levels correspond to the 1st digit, 2nd digit, ..., i-th digit]
Height of T is $\lambda$ (one level for each digit).
25
Generating S
[Figure: a node of T with children 0, 1, 2, ..., m-1]
  • A node at depth r is associated with $m^{\lambda-r}$
    points: those points whose m-ary representation
    has as its first r digits the labels on the path from
    the root to this node.
  • Sort them by the y-coordinate and split them
    into groups of m points.
  • Total no. of groups: $\sum_r (\text{nodes at level } r) \cdot (m^{\lambda-r}/m)$
  • Each group can be enclosed in a query box.
  • Total of $\lambda\, m^{\lambda-1}$ queries.

26
E.g., $n = 3^3$
27 queries
For $n = m^{\lambda}$, we have $\lambda\, m^{\lambda-1}$ queries
27
Indexing Perspective
[Figure: the data stored on disk in blocks]
Data is too large to be stored in memory and has
to be stored on the disk, in chunks of size B,
possibly with repetition. (B, the block size, is
the unit of data transfer from disk in one read.)
Storage redundancy: maximum number of copies of
a data item.
A query regarding items satisfying
some criteria is answered by retrieving blocks
from the disk such that the contained points form
a superset of the answer.
Access overhead: ratio
of the no. of blocks retrieved to the minimum no. of
blocks required to answer the query.
28
Indexing Perspective (Contd.)
[Figure: the data covered by blocks with redundancy 1, and then with redundancy 2]
  • Better overhead if queries have the same aspect
    ratio as our blocks.
  • Else, far more blocks have to be retrieved than the
    bare minimum!
  • Idea: have blocks of several different aspect
    ratios.
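
The effect of the aspect ratio on access overhead can be illustrated with a small calculation. The sketch assumes an integer grid instance, blocks that are origin-aligned tiles of a single shape (redundancy 1), B = 64, and hypothetical function names.

```python
from math import ceil

def blocks_touched(query, block_w, block_h):
    """Tiles of size block_w x block_h (aligned at the origin) intersected by
    the query rectangle (x1, x2, y1, y2), given with inclusive integer bounds."""
    x1, x2, y1, y2 = query
    cols = x2 // block_w - x1 // block_w + 1
    rows = y2 // block_h - y1 // block_h + 1
    return cols * rows

def access_overhead(query, block_w, block_h):
    """Blocks retrieved divided by the minimum number of blocks needed."""
    x1, x2, y1, y2 = query
    size = (x2 - x1 + 1) * (y2 - y1 + 1)
    minimum = ceil(size / (block_w * block_h))
    return blocks_touched(query, block_w, block_h) / minimum

square_query = (0, 7, 0, 7)    # 8 x 8 = 64 points
thin_query = (0, 63, 0, 0)     # 64 x 1 = 64 points
print(access_overhead(square_query, 8, 8))   # 1.0
print(access_overhead(thin_query, 8, 8))     # 8.0
print(access_overhead(thin_query, 64, 1))    # 1.0
```

The thin query costs 8 square blocks but only one block of matching shape, which motivates storing blocks of several aspect ratios.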

29
Indexing - Limitations
  • A query can have any aspect ratio.
  • Not possible to have blocks of every aspect ratio
    with limited memory.
  • Have blocks with sufficient no. of different
    aspect ratios so that any aspect ratio can be
    approximated (Hellerstein, Koutsoupias,
    Papadimitriou).

30
Overhead for Redundancy r
  • Choose blocks so that any aspect ratio can be
    approximated.
  • Blocks will have the shape $B^x \times B^{1-x}$.
  • Let $x = (2i-1)/(2r)$, $i = 1, 2, \ldots, r$.
  • Store all such blocks.
  • Redundancy is r, since there are r shapes for the
    blocks and an input point can be present in only
    1 block of a particular shape.
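
The block shapes for a given redundancy can be enumerated directly. The sketch assumes B is a power of two, $B = 2^{\texttt{log2\_B}}$, with 2r dividing `log2_B` so that all side lengths are integers; the function name is hypothetical.

```python
from fractions import Fraction

def block_shapes(log2_B, r):
    """Shapes B^x x B^(1-x) for x = (2i-1)/(2r), i = 1..r, where B = 2**log2_B."""
    shapes = []
    for i in range(1, r + 1):
        x = Fraction(2 * i - 1, 2 * r)
        exponent = x * log2_B
        if exponent.denominator != 1:
            raise ValueError("2r must divide log2_B for integer side lengths")
        e = int(exponent)
        shapes.append((2 ** e, 2 ** (log2_B - e)))
    return shapes

print(block_shapes(12, 3))   # [(4, 1024), (64, 64), (1024, 4)]
```

For B = 4096 and r = 3 every shape holds exactly B points, and each point lies in one block per shape, giving redundancy 3.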

31
Overhead for Fixed Redundancy
  • If the query is aligned with the blocks, then let k
    blocks suffice.
  • We can easily form a query which will require $2k + 2$
    blocks to be covered.

[Figure: a query q covered by k aligned blocks, and a misaligned query q]
They achieve $k = B^{1/(2r)}$.
32
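Lower Bound on Access Overhead for r = 1
  • Access overhead $a = \Omega(B^{1-1/d})$.
  • d = 2: use only $B \times 1$ and $1 \times B$ queries ($2n^2/B$
    total queries).
  • Let a block $s \in S$ intersect x horizontal and y vertical
    lines.

  • $x \cdot y \ge B$
  • $x + y \ge 2B^{1/2}$
  • i.e., s intersects at least $2B^{1/2}$ of the above
    queries.
  • Block-query intersecting pairs: at least $(n^2/B) \cdot 2B^{1/2}$
  • $\Rightarrow$ average no. of blocks a query intersects: at least
    $B^{1/2}$

[Figure: the n × n point grid with a block s that spans x horizontal and y vertical lines]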
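
The step from $x \cdot y \ge B$ to $x + y \ge 2B^{1/2}$ is the AM-GM inequality. A small numeric check, restricted to integer shapes and using a hypothetical function name:

```python
def min_strip_queries(B):
    """Smallest x + y over integer shapes with x rows and y columns that can
    hold B points (x * y >= B). Returns (x + y, x, y)."""
    best = None
    for x in range(1, B + 1):
        y = -(-B // x)                 # smallest integer y with x * y >= B
        if best is None or x + y < best[0]:
            best = (x + y, x, y)
    return best

print(min_strip_queries(64))   # (16, 8, 8)
```

For B = 64 the minimum is attained by the 8 × 8 block, matching $2B^{1/2} = 16$; any other shape meets more of the strip queries.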
33
Indexing Structures
  • R-trees as indexing structures: an extension of
    B-trees to multiple dimensions.

Node degrees: internal node between t and
2t; root between 2 and 2t.
Input objects are associated with leaves, and all
leaves are at the same level. Each internal node
stores the smallest bounding box of the objects
in its subtree.
34
R-Trees
[Figure: an R-tree with bounding boxes A, B, C, H over the objects p through z, drawn in the plane and as a tree]
Performance is measured as the number of disk
accesses required to answer a query.
35
Lower Bounds on R-trees
  • Query processing in bounding box hierarchies.
  • Very similar to query processing in kD-trees.
  • Crossing number as a measure of the efficiency of a
    bounding box hierarchy: the smaller, the better!
  • There is a collection of n d-rectangles such that,
    for any R-tree T of min-degree t, there is a
    query box intersecting $\Omega((n/t)^{1-1/d})$ nodes of T
    and none of the input d-rectangles (Pankaj
    Agarwal et al.).

36
R-tree Efficiency
  • The bounding box of any t squares hits at least
    $2(t^{1/2}-1)$ queries.
  • Total bounding-box/query intersections: $\Omega(n/t^{1/2})$.
  • Total queries: $2(n^{1/2}-1) = O(n^{1/2})$ $\Rightarrow$ some query
    intersects at least $\Omega((n/t)^{1/2})$ bounding boxes.
  • In general, an empty query box intersects
    $\Omega((n/t)^{1-1/d})$ bounding boxes of the R-tree.

[Figure: the input rectangles and the query boxes]
37
Good Box-trees and Conversion to Good R-trees
  • Pankaj Agarwal, et al.
  • kD-trees for rectangle intersection queries

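[Figure: an input rectangle with corners (a, b) and (c, d), and a query rectangle with corners (x1, y1) and (x2, y2)]
The rectangles intersect iff $(c,d) \ge (x_1,y_1)$ and
$(a,b) \le (x_2,y_2)$, i.e. $(-a,-b,c,d) \ge (-x_2,-y_2,x_1,y_1)$,
where the comparisons are coordinate-wise.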
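
The reduction from rectangle intersection to a dominance test on 4-D points can be written out as a short check. The sketch assumes closed rectangles given as (lower-left, upper-right) corner pairs; all function names are illustrative.

```python
def rects_intersect(rect, query):
    """Direct test: closed rectangles given as ((a, b), (c, d)) corner pairs."""
    (a, b), (c, d) = rect
    (x1, y1), (x2, y2) = query
    return a <= x2 and x1 <= c and b <= y2 and y1 <= d

def to_point(rect):
    """Map the input rectangle to the 4-D point (-a, -b, c, d)."""
    (a, b), (c, d) = rect
    return (-a, -b, c, d)

def query_corner(query):
    """Map the query rectangle to the 4-D corner (-x2, -y2, x1, y1)."""
    (x1, y1), (x2, y2) = query
    return (-x2, -y2, x1, y1)

def dominates(p, q):
    """True if p >= q in every coordinate."""
    return all(pi >= qi for pi, qi in zip(p, q))

rect = ((1, 1), (4, 3))
query = ((3, 2), (6, 5))
print(rects_intersect(rect, query))                       # True
print(dominates(to_point(rect), query_corner(query)))     # True
```

The query thus becomes a range (here a quadrant) in the 4-D point space, which is what lets a kD-tree on points answer it.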
38
kD-Trees to Box Trees
  • Trivial to verify that the original problem of
    range searching on rectangles is now a problem of
    range searching on points.
  • Build a kD-tree on these points: $O(n^{1-1/(2d)} + k)$
    query time.
  • Convert to a box-tree as follows:
      • replace each point in the leaves of the kD-tree
        with the corresponding d-rectangle
      • at each internal node, store the bounding box of
        its children.
  • Careful analysis shows that the query time is
    actually $O(n^{1-1/d} + k\log n)$.

39
Box Tree Analysis
  • What is a visited node? A node is said to be
    visited if the query algorithm continues to its
    child nodes.
  • Two types:
      • the input boxes in the subtree of a visited node
        v include one or more output boxes (at most k such
        nodes at each level).
      • all boxes stored in the subtree of v are disjoint
        from the query Q (not many such nodes can be visited).

40
Box Tree Analysis
[Figure: a query box Q with two input boxes a and b on different sides of it]
Not all input boxes in the subtree of v can be separated from Q by the
same hyperplane (otherwise the bounding box stored at v would miss Q and
v would not be visited). Thus there are at least 2
hyperplanes through facets of Q, each separating an input box in the
subtree of v from Q.
In 2d-dimensional space, the points representing a and b lie on
opposite sides of the hyperplane through a
facet of Q. Thus, this hyperplane intersects the
cell representing v. The other hyperplane also
intersects the cell of v. Thus their
intersection, which is a (2d-2)-flat, also
intersects the cell of v.
41
Box Tree Analysis (Contd.)
  • By the property of kD-trees, there can be at most
    $O(2^{i(2d-2)/(2d)}) = O(2^{i(1-1/d)})$ such cells at depth i.
  • The height of the tree is $O(\log n)$ (kD-trees are
    perfectly balanced).
  • Thus the total number of visited nodes for a query is
    $\sum_i (k + 2^{i(1-1/d)}) = O(k\log n + n^{1-1/d})$.
  • Using a slightly modified construction of the box
    tree, they reduce the query time to $O(k + n^{1-1/d})$.
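
The bound on the number of visited nodes follows from summing over the levels of the tree. A worked version of the sum, assuming $d \ge 2$, a tree of height $\log_2 n$, and at most k nodes of the first type per level:

```latex
\begin{align*}
\sum_{i=0}^{\log_2 n} \left( k + 2^{\,i(1-1/d)} \right)
  &= k\,(\log_2 n + 1) + \sum_{i=0}^{\log_2 n} \left( 2^{\,1-1/d} \right)^{i} \\
  &= k\,(\log_2 n + 1)
     + \frac{2^{(1-1/d)(\log_2 n + 1)} - 1}{2^{\,1-1/d} - 1} \\
  &= O\!\left( k \log n + n^{1-1/d} \right),
\end{align*}
```

since the geometric series has ratio $2^{1-1/d} > 1$ and is dominated by its last term, $2^{(1-1/d)\log_2 n} = n^{1-1/d}$.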

42
Avenues for Further Research
  • Lower bounds suggest that no data structure might
    be possible which scales well in high-dimension
    space for an entirely generic set of inputs and
    queries.
  • Interesting assumptions about the input objects
    and queries might result in better performance.
  • Agarwal et al. showed that R-trees do not have
    good worst-case performance even if the input is a
    set of hypercubes.

43
Further Research
  • What if queries are also hypercubes or have O(1)
    aspect ratio ?
  • The lower bounds do not hold in these cases,
    both for R-trees and for indexing.
  • Mark de Berg et al. constructed box trees with
    polylog query time for collision checking in
    industrial installations.

44
  • Thank You

45
Junk
46
2-d Case
[Figure: the query boxes and an input rectangle in the plane]
47
kD-Trees (Contd.)
[Figure: a kD-tree on the points a through g, shown as a subdivision of the plane and as a tree]
O(n) size data structure. O(n log n) construction
time.
48
Indexing Perspective
  • Hellerstein, Koutsoupias, Papadimitriou
  • Efficiency of an indexing scheme for a database:
      • Storage redundancy: how many copies of a data
        item are stored.
      • Access overhead: how many times more blocks
        than necessary a query retrieves.
  • An indexing problem is defined in the context of
    a workload.
  • A workload consists of
      • a domain (e.g. $\mathbb{R}^d$),
      • a subset of the domain called the instance (e.g. a
        set of points in $\mathbb{R}^d$), and
      • a set of subsets of the instance, the set of
        queries (e.g. d-rectangles).

49
Range Searching as Indexing Workloads
  • Range queries in $\mathbb{R}^2$
      • Domain: $D = \mathbb{R}^2$
      • Instance: $I = \{(i,j) : 1 \le i, j \le n\}$
      • Query: $Q_{a,b,c,d} = \{(i,j) : a \le i \le b,\ c \le j \le d\}$
      • one query for each quadruple (a, b, c, d)
        with $1 \le a \le b \le n$ and $1 \le c \le d \le n$
  • Indexing schemes
      • A collection $S = \{s_1, s_2, \ldots, s_s\}$ of blocks,
        $s_i \subseteq I$.
      • A query retrieves a set of blocks which cover it
        (possibly retrieving more blocks than necessary).

50
Access Overhead for fixed Storage Redundancy
If we have blocks with the same aspect ratio as
the query, then we get the best overhead.
But a query can have any aspect ratio, and it is not
possible to have blocks in S of all possible
aspect ratios (storage redundancy is fixed at r).
51
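Overhead when r = 1 (Contd.)
  • d = 3
  • Consider $B \times 1 \times 1$, $1 \times B \times 1$ and $1 \times 1 \times B$ queries.
  • Let a block $s \in S$ span x, y and z grid lines along the
    three directions.

$x \cdot y \cdot z \ge B$ $\Rightarrow$ no. of queries intersected by s: $xy + yz + zx \ge 3B^{2/3}$.
No. of blocks: $n^3/B$.
Block-query intersecting pairs: at least $3B^{2/3} \cdot n^3/B$.
No. of queries: $3n^3/B$.
Thus, a query intersects at least $B^{2/3}$ blocks on average.
In d dimensions, the overhead is $\Omega(B^{1-1/d})$.
[Figure: a block with extents x, y and z along the three axes]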