R-Trees: A Dynamic Index Structure For Spatial Searching (Antonin Guttman)

1
R-Trees: A Dynamic Index Structure For Spatial Searching
Antonin Guttman
2
Introduction
  • Range queries in multiple dimensions:
  • Computer Aided Design (CAD)
  • Geo-data applications
  • Supports spatial data objects (boxes).
  • The index structure is dynamic.

3
R-Tree
  • Balanced (similar to B tree)
  • I is an n-dimensional rectangle of the form (I0,
    I1, ..., In-1) where Ii is a closed interval
    [a, b] ⊆ [-∞, ∞].
  • Leaf node index entries: (I, tuple_id)
  • Non-leaf node entries: (I, child_ptr)
  • M is the maximum number of entries per node.
  • m ≤ M/2 is the minimum number of entries per node.
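The node and entry structure above can be sketched in Python; all names here are illustrative, not from Guttman's paper, and M = 4 is an arbitrary example value:

```python
# Hypothetical sketch of R-tree nodes: each index entry pairs an
# n-dimensional rectangle I = (I0, ..., In-1) with either a tuple id
# (leaf entry) or a child pointer (non-leaf entry).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (low, high) range per dimension

@dataclass
class Entry:
    rect: Rect
    child: Optional["Node"] = None   # set for non-leaf entries
    tuple_id: Optional[int] = None   # set for leaf entries

@dataclass
class Node:
    leaf: bool
    entries: List[Entry] = field(default_factory=list)

M = 4            # maximum entries per node (illustrative)
m = M // 2       # minimum entries per node, m <= M/2

leaf = Node(leaf=True,
            entries=[Entry(rect=((0.0, 1.0), (0.0, 2.0)), tuple_id=42)])
```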

4
Invariants
  1. Every leaf (non-leaf) has between m and M records
    (children) except for the root.
  2. Root has at least two children unless it is a
    leaf.
  3. For each leaf (non-leaf) entry, I is the smallest
    rectangle that contains the data objects
    (children).
  4. All leaves appear at the same level.

5
Example (part 1)
6
Example (part 2)
7
Searching
  • Given a search rectangle S ...
  • Start at root and locate all child nodes whose
    rectangle I intersects S (via linear search).
  • Search the subtrees of those child nodes.
  • When you get to the leaves, return entries whose
    rectangles intersect S.
  • Searches may require inspecting several paths.
  • Worst case running time is not so good ...
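The search procedure above can be sketched as follows (illustrative representation: a node is an (is_leaf, entries) pair, not the paper's disk-page layout):

```python
# Minimal sketch of R-tree search: descend every subtree whose covering
# rectangle intersects the query rectangle S, so several paths may be visited.

def intersects(a, b):
    """True if rectangles a and b overlap in every dimension (closed ranges)."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def search(node, S):
    is_leaf, entries = node
    hits = []
    for rect, payload in entries:
        if intersects(rect, S):
            if is_leaf:
                hits.append(payload)              # payload is a tuple id
            else:
                hits.extend(search(payload, S))   # payload is a child node
    return hits

# A tiny two-level tree: the root covers two leaves.
leaf1 = (True, [(((0, 1), (0, 1)), "a"), (((2, 3), (2, 3)), "b")])
leaf2 = (True, [(((5, 6), (5, 6)), "c")])
root = (False, [(((0, 3), (0, 3)), leaf1), (((5, 6), (5, 6)), leaf2)])

print(search(root, ((0, 2), (0, 2))))  # → ['a', 'b']
```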

8
[Figure: a search rectangle S overlapping rectangle R16 in the example tree]
9
Insertion
  • Insertion is done at the leaves
  • Where to put a new index entry E with rectangle R?
  • Start at root.
  • Go down the tree by choosing child whose
    rectangle needs the least enlargement to include
    R. In case of a tie, choose child with smallest
    area.
  • If there is room in the correct leaf node, insert
    it. Otherwise split the node (to be continued
    ...)
  • Adjust the tree ...
  • If the root was split into nodes N1 and N2,
    create new root with N1 and N2 as children.
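The descent rule (least enlargement, ties by smallest area) can be sketched like this; the helper names are made up for illustration:

```python
# Sketch of the insertion descent rule: pick the child whose rectangle needs
# the least area enlargement to include the new rectangle R, breaking ties
# by smallest area.

def area(rect):
    a = 1.0
    for lo, hi in rect:
        a *= hi - lo
    return a

def enlarge(rect, R):
    """Smallest rectangle containing both rect and R."""
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(rect, R))

def choose_child(child_rects, R):
    def cost(rect):
        # primary key: enlargement needed; tie-break: smaller area
        return (area(enlarge(rect, R)) - area(rect), area(rect))
    return min(range(len(child_rects)), key=lambda i: cost(child_rects[i]))

children = [((0, 10), (0, 10)), ((20, 21), (20, 21))]
R = ((21, 22), (21, 22))
print(choose_child(children, R))  # → 1 (the small nearby child)
```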

10
Adjusting the tree
  1. Let N be the leaf node. If there was a split, then NN
    is the other node.
  2. If N is the root, stop. Otherwise let P be N's parent
    and EN be its entry for N. Adjust the rectangle of
    EN to tightly enclose N.
  3. If NN exists, add entry ENN to P. ENN points to
    NN and its rectangle tightly encloses NN.
  4. If necessary, split P.
  5. Set N = P and go to step 2.
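The tightening in step 2 is just a bounding-box recomputation, which can be sketched as (illustrative helper, not the paper's code):

```python
# Sketch of AdjustTree step 2: the parent entry's rectangle is recomputed
# to tightly enclose all of the child's entry rectangles.

def tight_bbox(rects):
    """Smallest rectangle enclosing all given rectangles."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

child_entries = [((0, 1), (0, 1)), ((3, 5), (2, 4))]
print(tight_bbox(child_entries))  # → ((0, 5), (0, 4))
```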

11
Deletion
  1. Find the entry to delete and remove it from the
    appropriate leaf L.
  2. Set N = L and Q = ∅. (Q is the set of eliminated nodes.)
  3. If N is the root, go to step 6. Let P be N's parent
    and EN be the entry that points to N. If N has
    fewer than m entries, delete EN from P and add N
    to Q.
  4. If N has at least m entries, then set the
    rectangle of EN to tightly enclose N.
  5. Set N = P and repeat from step 3.
  6. Reinsert entries from eliminated leaves. Insert
    non-leaf entries higher up so that all leaves are
    at the same level.
  7. If the root has one child, make the child the new root.
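The underflow test in steps 3-4 can be sketched in isolation; the flat (name, entry-count) representation below is purely illustrative:

```python
# Sketch of the underflow rule in deletion: a node with fewer than m
# entries is removed from its parent and queued for later reinsertion;
# a node with at least m entries merely has its parent entry tightened.

m = 2  # illustrative minimum fill

def condense_step(node_entries, eliminated):
    """Return the entries kept at this level; underfull nodes go to `eliminated`."""
    kept = []
    for name, count in node_entries:
        if count < m:
            eliminated.append(name)   # its entries will be reinserted later
        else:
            kept.append((name, count))
    return kept

eliminated = []
kept = condense_step([("N1", 1), ("N2", 3)], eliminated)
```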

12
Why Reinsert?
  • Nodes can be merged with sibling whose area will
    increase the least, or entries can be
    redistributed.
  • In any case, nodes may need to be split.
  • Reinsertion is easier to implement.
  • Reinsertion refines the spatial structure of the
    tree.
  • Entries to be reinserted are likely to be in
    memory because their pages are visited during the
    search to find the index to delete.

13
Other Operations
  • To update an entry, delete it, modify
    it, and reinsert it.
  • Search for objects completely contained in
    rectangle R.
  • Search for objects that contain a rectangle.
  • Range deletion.

14
Splitting Nodes
  • Problem: Divide M+1 entries among two nodes so
    that it is unlikely that the nodes are needlessly
    examined during a search.
  • Solution: Minimize the total area of the covering
    rectangles for both nodes.
  • Exponential algorithm.
  • Quadratic algorithm.
  • Linear time algorithm.

15
Splitting Nodes Exhaustive Search
  • Try all possible combinations.
  • Optimal results!
  • Bad running time!

16
Splitting Nodes Quadratic Algorithm
  1. Find the pair of entries E1 and E2 that maximizes
    area(J) - area(E1) - area(E2), where J is the
    covering rectangle of E1 and E2.
  2. Put E1 in one group, E2 in the other.
  3. If one group has M-m+1 entries, put the remaining
    entries into the other group and stop. If all
    entries have been distributed, then stop.
  4. For each entry E, calculate d1 and d2, where di is
    the minimum area increase in the covering rectangle
    of Group i when E is added.
  5. Find the E with maximum |d1 - d2| and add E to the
    group whose area will increase the least.
  6. Repeat from step 3.
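The steps above can be sketched as follows (an illustrative sketch with m = 1 for brevity; the stopping rule is expressed as "one group must take all remaining entries to reach its minimum"):

```python
# Sketch of the quadratic split: PickSeeds maximizes the dead area
# area(J) - area(E1) - area(E2); PickNext repeatedly assigns the entry
# with the largest preference |d1 - d2| to the cheaper group.
from itertools import combinations

def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def cover(a, b):
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(a, b))

def quadratic_split(rects, m=1):
    # PickSeeds: the worst pair to keep together.
    s1, s2 = max(combinations(range(len(rects)), 2),
                 key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                               - area(rects[p[0]]) - area(rects[p[1]]))
    groups = [[s1], [s2]]
    covers = [rects[s1], rects[s2]]
    rest = [i for i in range(len(rects)) if i not in (s1, s2)]
    while rest:
        # If a group needs every remaining entry to reach m, give it all.
        for g in (0, 1):
            if len(groups[g]) + len(rest) == m:
                groups[g] += rest
                return groups
        # PickNext: entry with the greatest difference in enlargement cost.
        def d(i, g):
            return area(cover(covers[g], rects[i])) - area(covers[g])
        i = max(rest, key=lambda i: abs(d(i, 0) - d(i, 1)))
        g = 0 if d(i, 0) < d(i, 1) else 1
        groups[g].append(i)
        covers[g] = cover(covers[g], rects[i])
        rest.remove(i)
    return groups

# Two clusters of rectangles: the split should separate them.
rects = [((0, 1), (0, 1)), ((0, 1), (1, 2)),
         ((9, 10), (9, 10)), ((8, 9), (9, 10))]
print(quadratic_split(rects))
```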

17
Greedy continued
  • Algorithm is quadratic in M.
  • Linear in number of dimensions.
  • But not optimal.

18
Splitting Nodes Linear Algorithm
  • Along each dimension, find the entry with the
    highest low side and the entry with the lowest
    high side; their separation is the distance
    between those two sides.
  • Normalize each separation by the width of the
    entire set along that dimension.
  • Put the two entries with the largest normalized
    separation into different groups.
  • Randomly, but evenly divide the rest of the
    entries between the two groups.
  • Algorithm is linear, almost no attempt at
    optimality.
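The seed-picking part of the linear algorithm can be sketched like this (illustrative code; the random even distribution of the remaining entries is omitted):

```python
# Sketch of linear PickSeeds: along each dimension find the entry with the
# highest low side and the entry with the lowest high side, normalize their
# separation by the total width, and seed the groups with the most
# separated pair.

def linear_pick_seeds(rects):
    best, seeds = -1.0, (0, 1)
    dims = len(rects[0])
    for d in range(dims):
        highest_low = max(range(len(rects)), key=lambda i: rects[i][d][0])
        lowest_high = min(range(len(rects)), key=lambda i: rects[i][d][1])
        width = (max(r[d][1] for r in rects)
                 - min(r[d][0] for r in rects)) or 1.0
        sep = (rects[highest_low][d][0] - rects[lowest_high][d][1]) / width
        if sep > best and highest_low != lowest_high:
            best, seeds = sep, (lowest_high, highest_low)
    return seeds

rects = [((0, 1), (0, 1)), ((9, 10), (0, 1)), ((4, 5), (0, 1))]
print(linear_pick_seeds(rects))  # → (0, 1): the extreme pair along x
```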

19
Performance Tests
  • CENTRAL circuit cell (1057 rectangles)
  • Measure performance on last 10 inserts.
  • Search used randomly generated rectangles that
    match about 5% of the data.
  • Delete every 10th data item.

20
Performance
21
  • With linear-time splitting, inserts spend very
    little time doing splits.
  • Increasing m reduces splitting (and insertion)
    cost because when a group becomes too full, the
    rest of the entries are assigned to the other
    group.
  • As expected, most of the space is taken up by the
    leaves.

22
Performance
23
  • Deletion cost is affected by the size of m. For large m:
  • More nodes become underfull.
  • More reinserts take place.
  • More possible splits.
  • Running time is pretty bad for m = M/2.
  • Search is relatively insensitive to splitting
    algorithm. Smaller values of m reduce average
    number of entries per node, so less time is spent
    on search in the node (?).

24
Space Efficiency
  • Stricter node fill produces smaller index.
  • For very small m, linear algorithm balances
    nodes. Other algorithms tend to produce
    unbalanced groups which are likely to split,
    wasting more space.

25
Conclusions
  • Linear time splitting algorithm is almost as good
    as the others.
  • A low node-fill requirement reduces space
    utilization but does not perform significantly
    worse than stricter node-fill requirements.
  • R-tree can be added to relational databases.

26
The R*-tree
  • An Efficient and Robust Access Method for Points
    and Rectangles
  • Norbert Beckmann, Hans-Peter Kriegel,
    Ralf Schneider, Bernhard Seeger

27
Greene's Split Algorithm
  • Split
  • GS1: call ChooseAxis to determine the axis
    perpendicular to the split.
  • GS2: call Distribute.
  • ChooseAxis
  • CA1: Find the pair of entries E1 and E2 that
    maximizes area(J) - area(E1) - area(E2), where J
    is the covering rectangle.
  • CA2: For each dimension di, find the normalized
    separation ni by dividing the distance between E1
    and E2 by the length along di of the covering
    rectangle for all the nodes.
  • CA3: Return the dimension i for which ni is
    largest.

28
Greene's Split (cont.)
  • Distribute
  • D1: Sort entries by low value along the chosen
    dimension.
  • D2: Assign the first (M+1) div 2 entries to one
    group and the last (M+1) div 2 entries to
    the other group.
  • D3: If (M+1) is odd, assign the remaining entry
    to the group whose covering rectangle will be
    increased by the smallest amount.
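The Distribute step can be sketched as follows (illustrative code; three entries stand in for the M+1 entries of an overfull node):

```python
# Sketch of Greene's Distribute: sort by low value along the chosen axis,
# hand the first and last halves to opposite groups, and give an odd
# leftover entry to the group it enlarges least.

def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def cover(a, b):
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(a, b))

def bbox(idxs, rects):
    b = rects[idxs[0]]
    for i in idxs[1:]:
        b = cover(b, rects[i])
    return b

def greene_distribute(rects, axis):
    order = sorted(range(len(rects)), key=lambda i: rects[i][axis][0])
    half = len(rects) // 2
    g1, g2 = order[:half], order[len(rects) - half:]
    if len(rects) % 2:                       # one entry left in the middle
        mid = order[half]
        grow1 = area(cover(bbox(g1, rects), rects[mid])) - area(bbox(g1, rects))
        grow2 = area(cover(bbox(g2, rects), rects[mid])) - area(bbox(g2, rects))
        (g1 if grow1 <= grow2 else g2).append(mid)
    return g1, g2

rects = [((0, 1), (0, 1)), ((2, 3), (0, 1)), ((8, 9), (0, 1))]
g1, g2 = greene_distribute(rects, axis=0)
print(g1, g2)  # the middle entry joins the nearer left group
```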

29
Introduction
  • R-trees use heuristics to minimize the areas of
    the enclosing rectangles of their nodes.
  • Why?
  • Why not ...
  • minimize overlap of rectangles?
  • minimize margin (sum of length on each dimension)
    of each rectangle (i.e. make it as square as
    possible)?
  • optimize storage utilization?
  • all of the above?

30
Minimizing Covering Rectangle
  • Dead space is the area covered by the covering
    rectangle which is not covered by the enclosing
    rectangles.
  • Minimizing dead space reduces the number of paths
    to traverse during a search, especially if no
    data matches the search.

31
Minimizing Overlap
  • Also reduces number of paths to be traversed
    during a search, especially when there is data
    that matches the search criteria.

32
Minimizing Margin
  • Minimizing margin produces square-like
    rectangles.
  • Squares are easier to pack so this tends to
    produce smaller covering rectangles in higher
    levels.

33
Storage Utilization
  • Reduces height of tree, so searches are faster.
  • Searches with large query rectangles benefit
    because there are more matches per page.

34
Problems with Guttman's Quadratic Split
  • Distributing entries during a split favors the
    larger rectangle, since it will probably need the
    least enlargement to add an additional item.
  • When one group gets M-m+1 entries, all the rest
    are put in the other node.

35
Problems with Greene Split
  • The correct dimension is not always chosen;
    splitting along another dimension can sometimes
    improve performance.
  • Tests show that Greene split can give slightly
    better results than quadratic split but in some
    cases performs much worse.

36
When Greene's split goes bad ...
Overfull Node
Greene Split
Correct Split
37
R*-tree ChooseSubtree
  • Let E1, ..., Ep be the rectangles of the entries of
    the current node.
  • ChooseSubtree(level) finds the best place to
    insert a new entry at the appropriate level.
  • CS1: Set N to be the root.
  • CS2: If N is at the correct level, return N.
  • CS3: If N's children are leaves, choose the entry
    whose overlap cost will increase the least. If
    N's children are not leaves, choose the entry whose
    rectangle will need the least enlargement.
  • CS4: Set N to be the child whose entry was
    selected and repeat from CS2.
  • Ties are broken by choosing the entry whose rectangle
    needs the least enlargement; after that, the
    rectangle with the smallest area.
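The leaf-level overlap rule (CS3 with its tie-breaks) can be sketched as follows; the helper names are illustrative:

```python
# Sketch of the R*-tree leaf-level rule: pick the entry whose overlap with
# its sibling entries grows least when enlarged to include the new
# rectangle R, tie-breaking by least area enlargement.

def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def cover(a, b):
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(a, b))

def overlap_area(a, b):
    v = 1.0
    for (l1, h1), (l2, h2) in zip(a, b):
        w = min(h1, h2) - max(l1, l2)
        if w <= 0:
            return 0.0
        v *= w
    return v

def total_overlap(rect, others):
    return sum(overlap_area(rect, o) for o in others)

def choose_entry(entry_rects, R):
    def cost(i):
        others = [r for j, r in enumerate(entry_rects) if j != i]
        grown = cover(entry_rects[i], R)
        return (total_overlap(grown, others)
                - total_overlap(entry_rects[i], others),
                area(grown) - area(entry_rects[i]))
    return min(range(len(entry_rects)), key=cost)

entries = [((0, 4), (0, 4)), ((3, 6), (3, 6))]
R = ((5, 6), (5, 6))
print(choose_entry(entries, R))  # → 1: R fits inside entry 1 at zero cost
```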

38
ChooseSubtree analysis
  • The only difference from Guttmans algorithm is
    to calculate overlap cost for leaves. This
    creates slightly better insert performance.
  • Cost is quadratic for leaves, but tradeoffs (for
    accuracy) are possible
  • sort the rectangles in N in increasing order
    based on area enlargement. Calculate which of the
    first p entries needs smallest increase in
    overlap cost and return that entry.
  • For 2 dimensions, p32 yields good results.
  • CPU cost is higher but number of disk accesses
    are decreased.
  • Improves retrieval for queries with small query
    rectangles on data composed of small, non-uniform
    distribution of small rectangles or points.

39
Optimizing Splits
  • For each dimension, entries are sorted by low
    value, and then by high value.
  • Each sort yields M-2m+2 distributions. In the
    kth distribution (1 ≤ k ≤ M-2m+2), the first group
    has the first (m-1)+k entries.
  • We also have the following measures (Ri is the
    bounding rectangle for group i):
  • area-value = area[R1] + area[R2]
  • margin-value = margin[R1] + margin[R2]
  • overlap-value = area[R1 ∩ R2]

40
Optimizing Splits
  • Split
  • S1: call ChooseSplitAxis to find the axis
    perpendicular to the split.
  • S2: call ChooseSplitIndex to find the best
    distribution. Use this distribution to create the
    two groups.
  • ChooseSplitAxis
  • CSA1: for each dimension, compute the sum of
    margin-values over the distributions produced.
  • CSA2: return the dimension that has the minimum sum.
  • ChooseSplitIndex
  • CSI1: for the chosen split axis, choose the
    distribution with minimum overlap-value. Break
    ties by choosing the distribution with minimum
    area-value.
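The two slides above can be sketched together (an illustrative simplification: only the sort by low value is generated, where the paper also sorts by high value):

```python
# Sketch of the R*-tree split: score all M-2m+2 distributions per axis
# (first group takes the first (m-1)+k sorted entries), choose the axis
# with minimum margin sum, then the distribution with minimum
# overlap-value, ties by area-value.

def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def margin(r):
    return sum(hi - lo for lo, hi in r)

def overlap_area(a, b):
    v = 1.0
    for (l1, h1), (l2, h2) in zip(a, b):
        w = min(h1, h2) - max(l1, l2)
        if w <= 0:
            return 0.0
        v *= w
    return v

def bbox(rects):
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def distributions(rects, axis, m):
    order = sorted(rects, key=lambda r: r[axis][0])
    for k in range(1, len(rects) - 2 * m + 2):
        yield order[:m - 1 + k], order[m - 1 + k:]

def rstar_split(rects, m=2):
    dims = len(rects[0])
    axis = min(range(dims), key=lambda d: sum(
        margin(bbox(g1)) + margin(bbox(g2))
        for g1, g2 in distributions(rects, d, m)))
    return min(distributions(rects, axis, m), key=lambda gs: (
        overlap_area(bbox(gs[0]), bbox(gs[1])),
        area(bbox(gs[0])) + area(bbox(gs[1]))))

rects = [((0, 1), (0, 9)), ((1, 2), (0, 9)),
         ((7, 8), (0, 9)), ((8, 9), (0, 9))]
g1, g2 = rstar_split(rects)
print(g1, g2)  # the two tall strips on the left vs. the two on the right
```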

41
Analyzing Splits
  • The split algorithm was chosen based on performance
    and not on any particular theory.
  • Split is O(n log n) per dimension (dominated by
    sorting).
  • m = 40% of M yields good performance (the same value
    of m is also near-optimal for Guttman's quadratic
    split algorithm).

42
Forced Reinsert
  • Splits improve local organization of tree.
  • Can the improvement be made less local?
  • Hint: during delete, merging underfilled nodes
    does very little to improve tree structure, while
    experimental results show that delete with
    reinsert improves query performance.
  • Since inserts tend to happen more frequently than
    deletes, why not perform reinsert during inserts?

43
R* Insert
  • InsertData
  • ID1: call Insert with the leaf level as the
    parameter.
  • Insert(level)
  • I1: call ChooseSubtree(level) to find the node N
    (at the appropriate level) into which we place
    the new entry.
  • I2: if there is room in N, insert the new entry;
    otherwise call OverflowTreatment with N's level
    as the parameter.
  • I3: if OverflowTreatment caused a split,
    propagate OverflowTreatment up the tree (if
    necessary).
  • I4: if the root was split, create a new root.
  • I5: adjust all covering rectangles on the insertion
    path.

44
R* Insert (cont.)
  • OverflowTreatment(level)
  • OT1: If the level is not the root level and this is
    the first OverflowTreatment for this level during the
    insertion of one rectangle, call Reinsert with level
    as the parameter. Otherwise call Split.
  • Reinsert(level)
  • RI1: Sort the entries Ei of N in decreasing order
    of the distance from the center of Ei to the
    center of N's bounding rectangle.
  • RI2: Remove the first p entries of N and adjust
    N's bounding rectangle.
  • RI3: call Insert(level) on the p entries in
    reversed sort order (close reinsert).
  • Experimentally, a good value of p is 30% of M.
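The Reinsert ordering (RI1-RI3) can be sketched like this; the entry selection is the paper's rule, the surrounding representation is illustrative:

```python
# Sketch of forced reinsert: sort an overfull node's entries by distance
# from their centers to the node's center, remove the p farthest, and
# reinsert them closest-first ("close reinsert").

def center(r):
    return tuple((lo + hi) / 2 for lo, hi in r)

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bbox(rects):
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def pick_reinsert(rects, p):
    c = center(bbox(rects))
    order = sorted(rects, key=lambda r: dist2(center(r), c), reverse=True)
    removed = order[:p]
    removed.reverse()          # reinsert the closer of the removed first
    return removed, order[p:]

rects = [((0, 2), (0, 2)), ((4, 5), (4, 5)), ((9, 10), (9, 10))]
to_reinsert, kept = pick_reinsert(rects, p=1)
print(to_reinsert)  # the entry farthest from the node's center
```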

45
Insert Analysis
  • Experimentally, R* insert reduces the number of
    splits that have to be performed.
  • Space utilization is increased.
  • Close reinsert tends to favor the original node.
    Outer rectangles may be inserted elsewhere,
    making the original node more quadratic.
  • Forced reinsert can reduce overlap between
    neighboring nodes.

46
Misc.
  • The R*-tree is mainly optimized for search
    performance. As an unexpected side effect, insert
    performance is also improved.
  • Delete algorithm remains unchanged (and untested)
    but should improve because it depends on search
    and insert.

47
Test Data
  • (F1) Uniform: 100,000 rectangles.
  • (F2) Cluster: centers are distributed into 640
    clusters of about 1,600 objects each.
  • (F3) Parcel: decompose the unit square into 100,000
    rectangles and increase the area of each rectangle
    by a factor of 2.5.
  • (F4) Real-Data: 120,576 rectangles from
    elevation lines in cartography data.
  • (F5) Gaussian: centers follow a 2-dimensional
    independent Gaussian distribution.
  • (F6) Mixed-Uniform: 99,000 uniformly distributed
    small rectangles and 1,000 uniformly distributed
    large rectangles.

48
Performance
  • 6 different data distributions, including
    real-life cartography data.
  • Rectangle intersection query.
  • Point query.
  • Rectangle enclosure query.
  • Spatial Joins (map overlay).

49
Typical Performance Data
50
Spatial Join
  • Test files
  • (SJ1) 1000 random rectangles from (F3) join (F4)
  • (SJ2) 7500 random rectangles from (F3) join 7,536
    rectangles from elevation lines.
  • (SJ3) Self-join of 20,000 random rectangles from
    (F3)

51
Point Access Method
  • The R*-tree had its biggest wins for small query
    rectangles. What about points?

52
Conclusions
  • The R*-tree is even better than GRID in read-mostly
    environments with 2-D point data.
  • The R*-tree is robust even for bad data
    distributions.
  • The R*-tree reduces the number of splits and is more
    space efficient than the other R-tree variants.
  • The R*-tree outperforms all other R-tree variants in
    page I/O.

53
Problems
  • No test data for more than two dimensions.
  • R*-tree algorithms are linear in the number of
    dimensions.
  • The R*-tree isn't O(N log N).
  • CPU cost is not calculated.
  • What about queries that retrieve lots of data?
  • What if not all dimensions are specified?
  • Linear scan performance?

54
When Is Nearest Neighbor Meaningful?
  • Kevin Beyer, Jonathan Goldstein, Raghu
    Ramakrishnan, Uri Shaft

55
Introduction
  • Nearest Neighbor (NN): given a set of data points in
    m-dimensional space and a query point, find the
    data point closest to the query point.
  • A heuristic for similarity queries over images,
    sequences, video, and shapes is to convert the data
    to multidimensional points and find the NN.
  • But in many cases, as dimensionality increases,
    d(FN)/d(NN) → 1, where
    d(FN) is the distance to the farthest neighbor and
    d(NN) is the distance to the nearest neighbor.

56
When is NN Meaningful?
57
NN Errors
  • Conversion to NN is a heuristic.
  • NN is frequently solved approximately.

58
Instability
  • For a given ε > 0, a NN query is unstable if the
    distance from the query point to most data points
    is less than (1 + ε) times the distance to the
    nearest neighbor.
  • In many cases, for any ε > 0, as the number of
    dimensions increases, the probability that a
    query is unstable converges to 1.

59
Instability Theorem
  • p is a constant, dm is a distance metric on m
    dimensions, Pm,1, ..., Pm,n are n independent
    m-dimensional data points, Qm is an m-dimensional
    query point, and E[X] is the expectation of X.
  • Theorem: if
    lim (m→∞) Var(dm(Pm,1, Qm)^p) / E[dm(Pm,1, Qm)^p]^2 = 0
  • then for all ε > 0,
    lim (m→∞) Pr[d(FN) ≤ (1 + ε) d(NN)] = 1.

60
Bad Distributions
  • The condition of the theorem holds for
    independent and identically distributed (IID)
    data which is commonly used for
    high-dimensional index testing.
  • Even many distributions with correlated
    dimensions have this problem.
  • Distributions where the variance in distance of
    each added dimension converges to 0 can also have
    this problem. Example: the ith component comes
    from Uniform(0, 1/i).
  • Usually degradation is very bad with just 10-20
    dimensions.
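The concentration effect behind these claims is easy to reproduce; the following stdlib-only simulation (illustrative parameters, fixed seed) shows d(FN)/d(NN) drifting toward 1 for IID uniform data as dimensionality grows:

```python
# Small simulation of NN instability: for IID uniform data, the ratio of
# farthest-neighbor to nearest-neighbor distance from a random query point
# approaches 1 as the number of dimensions increases.
import random

def fn_nn_ratio(n_points, dims, rng):
    pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    q = [rng.random() for _ in range(dims)]
    d2 = [sum((x - y) ** 2 for x, y in zip(p, q)) for p in pts]
    return (max(d2) / min(d2)) ** 0.5   # d(FN) / d(NN)

rng = random.Random(0)
low = fn_nn_ratio(500, 2, rng)     # low dimension: NN is well separated
high = fn_nn_ratio(500, 200, rng)  # high dimension: ratio near 1
print(low, high)
```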

61
Meaningful Distributions
  • Classification applications: data is clustered
    and the query point must fall in one of the
    clusters. A meaningful NN query will return all
    points in the cluster.
  • Implicit low dimensionality: for example, when
    all points lie on a line or in a plane.

62
Test on Artificial Data
  • Uniform: IID uniform workload over (0,1).
  • Recursive: data point Xm = (X1, ..., Xm).
  • Take independent random variables U1, ..., Um where
    Ui comes from a uniform distribution over (0, ).
  • X1 = U1 and, for 2 ≤ i ≤ m, Xi = Ui + (Xi-1 / 2).
  • Two degrees of freedom:
  • Let a1, a2, ... and b1, b2, ... be constants in
    (-1, 1).
  • Let U1, U2 be independent and uniform in (0,1).
  • Xi = ai·U1 + bi·U2

63
Results
64
Real Data: k Nearest Neighbor
For k = 1, 15% of queries had half the points
within a factor of 3
65
NN Processing
  • For IID and other distributions where NN becomes
    meaningless, sequential scans may easily
    outperform fancy high-dimensional indexing
    schemes (like the R-tree).
  • High-dimensional indexing schemes may still be
    useful, but they should be evaluated on meaningful
    workloads.

66
Conclusions
  • For NN applications, check that your distribution
    is meaningful.
  • For NN processing, test algorithms on meaningful
    distributions.
  • Compare your algorithm's performance against
    linear scans.