iMinMax(θ) Indexing the Edges: A simple and yet efficient approach to high-dimensional indexing

Slides: 34
Provided by: NUS16
Transcript and Presenter's Notes

1
Access Methods for Advanced Database Applications
2
Applications
  • Geographic Information Systems / Spatial DB
  • Text databases
  • XML databases
  • Data warehouse
  • High-dimensional databases (image, scientific)
  • Time series
  • Sequence databases (genomic databases)
  • Main memory database systems

3
Why New Indexes?
  • A most effective mechanism to prune the search
  • Orders of magnitude difference between I/O and
    CPU cost
  • Increasing data size
  • Increasing complexity of data and search

4
Memory System
[Figure: levels of the memory system: CPU die (CPU, registers, L1 cache), L2 cache, main memory, hard disk]
5
Memory Hierarchy
6
Improvement in Performance
[Figure: relative performance (log scale, 1 to 10000) against year (1980 to 2000); CPU improves at 60%/yr, DRAM at 10%/yr]
7
Design Principles
  • Simple in design
  • Efficient in disk access/CPU time
  • Not necessarily in conflict with simplicity!
  • Ease of integration into existing DBMS
  • Built on top of a mature index (e.g. B-tree,
    R-tree)?
  • Reuse all the well-tested concurrency control, etc.

8
Spatial Databases
  • Spatial Objects
  • Points: spatial locations, e.g. feature vectors
  • Lines: sets of points, e.g. roads, coastlines
  • Polygons: sets of points, e.g. buildings, lakes
  • Data Types
  • Point: a spatial data object with no extension
  • no size or volume
  • Region: a spatial object with a location and a
    boundary that defines the extension

9
Spatial Queries
  • Range queries: "Find all cities within 50 km of
    Madras"
  • Nearest neighbor queries: "Find the 5 cities that
    are nearest to Madras"
  • "Find the 10 images most similar to this image"
  • Spatial join queries: "Find pairs of cities
    within 200 km of each other"

10
More Examples
  • Range Query: "Find me data points that satisfy
    the conditions x1 < A1 < y1, x2 < A2 < y2"
  • Spatial Query: "Find me buildings that are
    adjacent to the Railway Stations"
  • Nearest Neighbour Query: "Find me the nearest
    fire station to Clementi Ave. 3"

11
Applications
  • Geographical Information Systems (GIS): dealing
    extensively with spatial data, e.g. map systems,
    resource management systems
  • Computer-aided design and manufacturing
    (CAD/CAM): dealing mainly with surface data, e.g.
    design systems
  • Multimedia databases: storing and manipulating
    characteristics of MM objects

12
Representation of Spatial Objects
  • Testing on real objects is expensive
  • Minimum Bounding Box/Rectangle
  • How to test if 2-d rectangles intersect?
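
One common answer to the question on this slide, sketched here in Python: two axis-aligned rectangles intersect exactly when their extents overlap on every axis. The `Rect` type and the `intersects` name are illustrative assumptions, not taken from the slides, and rectangles are treated as closed (touching edges count as intersecting).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Rect, b: Rect) -> bool:
    """True if the closed rectangles a and b share at least one point."""
    # The rectangles overlap only if their intervals overlap on both axes.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


print(intersects(Rect(0, 0, 2, 2), Rect(1, 1, 3, 3)))  # overlapping boxes
print(intersects(Rect(0, 0, 1, 1), Rect(2, 2, 3, 3)))  # disjoint boxes
```

The test costs four comparisons in 2-d, which is why an index can afford to filter on bounding boxes before testing the real objects.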

13
Approaches to Multi-Dimensional Indexing
  • Data Partitioning
  • R-tree, R*-tree, X-tree, Skd-tree, SS-tree,
    TV-tree, M-tree
  • Space Partitioning
  • Buddy-tree, R+-tree, Grid File, KDB-tree
  • Mapping

14
R-trees
15
R-trees
  • Range Query
  • Insert
  • Node splitting
  • Optimization
  • Coverage
  • Overlap
  • Delete
  • Variants: R*-tree
  • R+-tree, buddy-tree

16
Space Filling Curves
  • Assumption: attribute values can be represented with
    some fixed # of bits
  • Space: domain on each dimension has 2^k values
  • Linearize the domain
  • Each point can be represented by a single
    dimensional value

17
Z-ordering
[Figure: Z-order curve over a 4 x 4 grid; both axes labelled 00, 01, 10, 11]
18
Z-ordering
  • The z-value is obtained by interleaving the bits.
  • E.g. X = 01, Y = 11
  • z-value = 0111 = 7
  • Clustering effect on X-Y and z-values can be
    indexed using B-trees
  • Range queries problematic?
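
The bit interleaving can be sketched as follows. The function name `z_value` is an assumption made for illustration, and the bit order (the X bit first at each position) is inferred from the slide's example, where X = 01 and Y = 11 give 0111.

```python
def z_value(x: int, y: int, k: int) -> int:
    """Interleave the k-bit values x and y, taking the x bit first at each position."""
    z = 0
    for i in range(k - 1, -1, -1):       # walk from the most significant bit down
        z = (z << 1) | ((x >> i) & 1)    # next bit of x
        z = (z << 1) | ((y >> i) & 1)    # next bit of y
    return z


z = z_value(0b01, 0b11, k=2)
print(format(z, "04b"), z)  # 0111 7
```

Because each point maps to a single integer, the z-values can be stored in an ordinary B-tree. The difficulty hinted at in the last bullet is that a rectangular range in X-Y generally maps to several disjoint intervals of z-values rather than one.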

19
Hilbert Curve
[Figure: Hilbert curve over an 8 x 8 grid; both axes labelled with 3-bit values 000 to 111]
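
As a rough illustration of how a cell is mapped to its position along a Hilbert curve, here is a sketch of the widely used iterative conversion. The function name `hilbert_index` is an assumption, and the orientation of the curve it produces may differ from the one drawn on the slide.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of 2, and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)     # which quadrant, in curve order
        # Rotate or flip the quadrant so the sub-curve is in standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# List the cells of a 4 x 4 grid in curve order.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, *c))
print(cells)
```

Unlike the Z-order, consecutive positions on this curve are always neighbouring cells, which tends to keep nearby points closer together in the linear order.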
20
Grid Files
  • Based on extendible hashing
  • Design principle: any point query can be answered
    in at most 2 disk accesses.
  • Two structures: a k-dimensional array (the directory)
    and k 1-dimensional arrays (the linear scales)

21
Extendible Hashing
  • Situation: a hash bucket (primary page) becomes
    full. Why not re-organize the file by doubling the
    # of buckets?
  • Reading and writing all pages is expensive!
  • Idea: use a directory of pointers to buckets;
    double the # of buckets by doubling the directory,
    splitting just the bucket that overflowed!
  • Directory is much smaller than the file, so doubling
    it is much cheaper. Only one page of data entries
    is split. No overflow page!
  • Trick lies in how the hash function is adjusted!

22
Example
[Figure: directory with global depth 2 pointing to four data pages]
  00 -> Bucket A (local depth 2): 4, 12, 32, 16
  01 -> Bucket B (local depth 2): 1, 5, 21, 13
  10 -> Bucket C (local depth 2): 10
  11 -> Bucket D (local depth 2): 15, 7, 19
  • Directory is an array of size 4.
  • To find the bucket for r, take the last "global depth"
    # of bits of h(r); we denote r by h(r).
  • If h(r) = 5 = binary 101, it is in the bucket
    pointed to by 01.
  • Insert: if the bucket is full, split it (allocate
    a new page, re-distribute).
  • If necessary, double the directory. (As we will
    see, splitting a bucket does not always require
    doubling; we can tell by comparing global depth
    with local depth for the split bucket.)
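
The lookup rule on this slide (take the last "global depth" bits of h(r) and follow that directory entry) can be sketched as below. The bucket contents are the ones shown in the slide's example; the names `directory`, `buckets` and `find_bucket` are assumptions made for illustration.

```python
GLOBAL_DEPTH = 2

# Data pages from the slide's example, keyed by bucket name.
buckets = {
    "A": [4, 12, 32, 16],
    "B": [1, 5, 21, 13],
    "C": [10],
    "D": [15, 7, 19],
}

# Directory slot i holds the bucket for hash values whose last two bits are i
# (slots 00, 01, 10, 11 in that order).
directory = ["A", "B", "C", "D"]


def find_bucket(h: int) -> str:
    """Return the name of the bucket that holds hash value h."""
    mask = (1 << GLOBAL_DEPTH) - 1       # 0b11 for global depth 2
    return directory[h & mask]


print(find_bucket(5))   # 5 = binary 101, last two bits 01
```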

23
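Insert h(r) = 20 (Causes Doubling)
[Figure, left panel: global depth still 2. Bucket A is full, so it is split into Bucket A (32, 16) and Bucket A2 (4, 12, 20), the "split image" of Bucket A. Buckets B, C and D are unchanged.]
[Figure, right panel: the directory after doubling, global depth 3]
  000 -> Bucket A (local depth 3): 32, 16
  001 -> Bucket B (local depth 2): 1, 5, 21, 13
  010 -> Bucket C (local depth 2): 10
  011 -> Bucket D (local depth 2): 15, 7, 19
  100 -> Bucket A2 (local depth 3, "split image" of Bucket A): 4, 12, 20
  101 -> Bucket B
  110 -> Bucket C
  111 -> Bucket D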
24
Points to Note
  • 20 = binary 10100. Last 2 bits (00) tell us r
    belongs in A or A2. Last 3 bits needed to tell
    which.
  • Global depth of directory: max # of bits needed
    to tell which bucket an entry belongs to.
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.
  • When does bucket split cause directory doubling?
  • Before insert, local depth of bucket = global
    depth. Insert causes local depth to become >
    global depth; directory is doubled by copying it
    over and "fixing" the pointer to the split image page.
    (Use of least significant bits enables efficient
    doubling via copying of directory!)
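
The insert procedure described in these points can be sketched as a small in-memory model. This is a minimal illustration under stated assumptions, not the presenter's implementation: the class and method names are invented, hash values are inserted directly (no hash function), buckets hold 4 entries as in the slides' example, and deletion and duplicate handling are left out.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.entries = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity
        # One bucket per directory slot to start with.
        self.directory = [Bucket(global_depth, capacity)
                          for _ in range(1 << global_depth)]

    def _slot(self, h):
        """Directory slot = last global_depth bits of h."""
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < bucket.capacity:
            bucket.entries.append(h)
            return
        # Full bucket. The directory doubles only when local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # doubling via copying
            self.global_depth += 1
        # Split the bucket on one more bit and redistribute its entries.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth, self.capacity)
        new_bit = 1 << (bucket.local_depth - 1)
        old_entries = bucket.entries
        bucket.entries = [e for e in old_entries if not e & new_bit]
        image.entries = [e for e in old_entries if e & new_bit]
        # "Fix" the directory pointers that should now lead to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        # Retry. Note: many identical hash values would recurse without end,
        # which is the duplicate problem the slides mention.
        self.insert(h)


table = ExtendibleHash()
for value in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    table.insert(value)
table.insert(20)  # Bucket A is full and its local depth equals the global depth
print(table.global_depth)
print(table.directory[0b000].entries, table.directory[0b100].entries)
```

After the last insert the model matches the slide: the directory has 8 slots, slot 000 holds 32 and 16, and slot 100 holds the split image with 4, 12 and 20, while the other buckets are each shared by two slots.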

25
Directory Doubling
  • Why use least significant bits in directory?
  • Allows for doubling via copying!

[Figure: a directory doubling from 2-bit entries (00, 01, 10, 11) to 3-bit entries (000 to 111), tracing where 6 = 110 ends up; left panel uses the least significant bits, right panel the most significant bits]
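
The difference between the two addressing schemes can be seen in a few lines. This is an illustrative sketch with invented names (`double_lsb`, `double_msb`): with least significant bits the doubled directory is just the old one followed by a copy of itself, whereas with most significant bits every old entry has to be duplicated in place.

```python
def double_lsb(directory):
    """Least significant bits: new slots i and i + len(directory) share a bucket."""
    return directory + directory


def double_msb(directory):
    """Most significant bits: old slot i becomes the adjacent new slots 2i and 2i + 1."""
    return [bucket for bucket in directory for _ in range(2)]


old = ["A", "B", "C", "D"]      # slots 00, 01, 10, 11
print(double_lsb(old))  # ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
print(double_msb(old))  # ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']
```

In the first case the existing half of the directory is untouched and the new half is a block copy, which is the efficiency the slide refers to.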
26
Comments on Extendible Hashing
  • If directory fits in memory, equality search is
    answered with one disk access; else two.
  • 100MB file, 100 bytes/rec, 4K pages: contains
    1,000,000 records (as data entries) and 25,000
    directory elements; chances are high that the
    directory will fit in memory.
  • Directory grows in spurts, and, if the
    distribution of hash values is skewed, directory
    can grow large.
  • Multiple entries with same hash value cause
    problems!
  • Delete: if removal of a data entry makes a bucket
    empty, it can be merged with its "split image". If
    each directory element points to the same bucket as
    its split image, can halve the directory.
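
The figures in the second bullet can be checked with a little arithmetic. The sketch below assumes 1 MB = 10^6 bytes, a 4096-byte page, completely full pages, and one directory element per data page; the 8-byte pointer size used for the final estimate is purely an assumption and does not come from the slides.

```python
FILE_BYTES = 100 * 10**6     # 100 MB, taking 1 MB = 10^6 bytes (assumption)
RECORD_BYTES = 100
PAGE_BYTES = 4096            # "4K" page
POINTER_BYTES = 8            # assumed size of one directory element

records = FILE_BYTES // RECORD_BYTES              # data entries in the file
records_per_page = PAGE_BYTES // RECORD_BYTES     # whole records that fit on a page
pages = records // records_per_page               # data pages, one directory element each
directory_bytes = pages * POINTER_BYTES

print(records, records_per_page, pages, directory_bytes)
```

Under these assumptions the directory is a few hundred kilobytes, which supports the slide's remark that it is likely to fit in memory.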

27
Summary on Extendible Hashing
  • Hash-based indexes: best for equality searches,
    cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by
    splitting a full bucket when a new data entry is
    to be added to it. (Duplicates may require
    overflow pages.)
  • Directory to keep track of buckets; doubles
    periodically.
  • Can get large with skewed data; additional I/O if
    this does not fit in main memory.

28
Grid Files
29
Scales, Directory, Bucket
  • Data structures
  • Linear scales
  • Directory: an array whose elements are in one-to-one
    correspondence with the grid cells; each entry
    points to a data bucket
  • data buckets
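
How the three structures cooperate on a point query can be sketched in 2-d as follows. This is a simplified in-memory model with invented names (`GridFile2D`, `point_query`); in a real grid file the directory and buckets live on disk, which is where the "at most 2 disk accesses" come from, as marked in the comments.

```python
from bisect import bisect_right


class GridFile2D:
    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale        # sorted split points on x (linear scale, in memory)
        self.y_scale = y_scale        # sorted split points on y (linear scale, in memory)
        self.directory = directory    # directory[i][j] -> bucket id for grid cell (i, j)
        self.buckets = buckets        # bucket id -> list of (x, y) points

    def cell(self, x, y):
        """Use the linear scales to find the grid cell containing (x, y)."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def point_query(self, x, y):
        i, j = self.cell(x, y)
        bucket_id = self.directory[i][j]            # access 1: directory entry
        return (x, y) in self.buckets[bucket_id]    # access 2: data bucket


# A 2 x 2 grid split at x = 50 and y = 50. Both cells with x < 50 share bucket A.
grid = GridFile2D(
    x_scale=[50],
    y_scale=[50],
    directory=[["A", "A"], ["B", "C"]],
    buckets={"A": [(10, 20), (30, 80)], "B": [(70, 10)], "C": [(90, 90)]},
)
print(grid.point_query(30, 80), grid.point_query(60, 60))
```

Several cells may point to the same bucket, as the two cells sharing bucket A do here, so the directory can be refined without splitting every data page.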

30
Splitting and Merging
31
Splitting and Merging
32
Grid Files...
  • Repetitive splitting by halving
  • Merging based on buddy system
  • Regions are represented as (cx, cy, dx, dy)
  • Point queries: cx - dx < qx < cx + dx,
    cy - dy < qy < cy + dy
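
The region representation and the halving step can be sketched as below, reading (cx, cy) as the centre of a region and dx, dy as its half-widths. That reading is inferred from the inequalities on the slide; the function names `contains` and `halve_x` are assumptions.

```python
def contains(region, qx, qy):
    """Point-query test from the slide: strict inequalities on both axes."""
    cx, cy, dx, dy = region
    return cx - dx < qx < cx + dx and cy - dy < qy < cy + dy


def halve_x(region):
    """Split a region into left and right halves along the x axis."""
    cx, cy, dx, dy = region
    half = dx / 2
    return (cx - half, cy, half, dy), (cx + half, cy, half, dy)


whole = (50.0, 50.0, 50.0, 50.0)        # covers 0..100 on both axes
left, right = halve_x(whole)
print(left, right)
print(contains(left, 10, 20), contains(right, 10, 20))
```

Because every split halves a region, two regions can be merged again only if they are the two halves of the same parent, which is the buddy-system rule mentioned above. With strict inequalities a point lying exactly on a split line belongs to neither half, so an implementation would need a convention for boundary points.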

33
Grid Files...
[Figure: grid file regions labelled A to F, showing a region's centre (cx, cy), its extents dx and dy, and a query point (qx, qy)]