iMinMax(θ): Indexing the Edges. A simple and yet efficient approach to high-dimensional indexing - PowerPoint PPT Presentation

Slides: 34

Provided by: NUS16

1
Access Methods for Advanced Database
Applications
2
Applications

Geographic Information Systems / Spatial DB
Text databases
XML databases
Data warehouse
High-dimensional databases (image, scientific)
Time series
Sequence databases (genomic databases)
Main memory database systems

3
Why New Indexes?

A most effective mechanism to prune the search
Order of magnitude of difference between I/O and
CPU cost
Increasing data size
Increasing complexity of data and search

4
Memory System
CPU Die
CPU
Registers
L1 Cache
L2 Cache
Main Memory
Harddisk
5
Memory Hierarchy
6
Improvement in Performance
[Figure: relative performance (log scale, 1 to 10000) from 1980 to 2000; CPU improves about 60%/yr, DRAM about 10%/yr]
7
Design Principles

Simple in design
Efficient in disk access/CPU time
Not necessarily contradicting the simplicity!
Ease of integration into existing DBMS
Built on top of a mature index (e.g. B-tree, R-tree)?
Reuse all the well-tested concurrency control etc.

8
Spatial Databases

Spatial Objects
Points: spatial locations, e.g. feature vectors
Lines: sets of points, e.g. roads, coastlines
Polygons: sets of points, e.g. buildings, lakes
Data Types
Point: a spatial data object with no extension (no size or volume)
Region: a spatial object with a location and a boundary that defines the extension

9
Spatial Queries

Range queries: 'Find all cities within 50 km of Madras?'
Nearest neighbor queries: 'Find the 5 cities that are nearest to Madras?' 'Find the 10 images most similar to this image?'
Spatial join queries: 'Find pairs of cities within 200 km of each other?'

10
More Examples

Range Query: 'Find me data points that satisfy the conditions x1 < A1 < y1, x2 < A2 < y2?'
Spatial Query: 'Find me buildings that are adjacent to the Railway Stations?'
Nearest Neighbour Query: 'Find me the nearest fire station to Clementi Ave. 3?'
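
As a point of reference for the query types above, here is a minimal sketch of a range query and a nearest neighbour query answered by a plain linear scan. The function names and the sample points are made up for illustration; an index exists precisely to avoid scanning every point like this.

```python
import math

def range_query(points, low, high):
    """Return every point p with low[i] < p[i] < high[i] on each dimension i."""
    return [p for p in points
            if all(lo < v < hi for lo, v, hi in zip(low, p, high))]

def nearest_neighbours(points, query, k=1):
    """Return the k points closest to `query` by Euclidean distance."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

# Hypothetical 2-d data set.
points = [(1, 1), (2, 5), (4, 4), (7, 2), (8, 8)]

in_range = range_query(points, low=(0, 0), high=(5, 5))
nearest = nearest_neighbours(points, query=(6, 3), k=2)
print(in_range)
print(nearest)
```

Both functions touch every point, so their cost grows linearly with the data size regardless of how selective the query is.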

11
Applications

Geographical Information Systems (GIS): dealing extensively with spatial data, e.g. map systems, resource management systems
Computer-aided design and manufacturing (CAD/CAM): dealing mainly with surface data, e.g. design systems
Multimedia databases: storing and manipulating characteristics of MM objects

12
Representation of Spatial Objects

Testing on real objects is expensive
Minimum Bounding Box/Rectangle
How to test if 2-d rectangles intersect?
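
One common answer to the question above, sketched under the assumption that each rectangle is stored as (xmin, ymin, xmax, ymax) and that touching edges count as intersecting: two rectangles intersect exactly when their extents overlap on every axis.

```python
def rectangles_intersect(a, b):
    """a and b are (xmin, ymin, xmax, ymax) tuples.

    Returns True if the rectangles share at least one point
    (boundaries included).
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    overlap_x = ax1 <= bx2 and bx1 <= ax2   # x extents overlap
    overlap_y = ay1 <= by2 and by1 <= ay2   # y extents overlap
    return overlap_x and overlap_y

print(rectangles_intersect((0, 0, 2, 2), (1, 1, 3, 3)))
print(rectangles_intersect((0, 0, 2, 2), (3, 0, 4, 2)))
```

The test costs four comparisons, which is why bounding boxes are checked first and the exact geometry only for the candidates that pass.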

13
Approaches to Multi-Dimensional Indexing
R-trees

Data Partitioning
R-tree, R*-tree, X-tree, Skd-tree, SS-tree, TV-tree, M-tree
Space Partitioning
Buddy-tree, R+-tree, Grid File, KDB-tree
Mapping

14
R-trees
15
R-trees

Range Query
Insert
Node splitting
Optimization
Coverage
Overlap
Delete
Variants: R*-tree, R+-tree, buddy-tree
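
The range query listed above can be sketched as a recursive descent that only follows entries whose bounding rectangle intersects the query. This is a minimal illustration, not the slides' implementation: the dict-based node layout, the hand-built two-level tree, and the function names are assumptions, and insertion, node splitting and deletion are left out.

```python
def intersects(a, b):
    """Rectangles are (xmin, ymin, xmax, ymax); boundaries count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_search(node, query, out=None):
    """Collect the ids of all leaf entries whose MBR intersects `query`."""
    if out is None:
        out = []
    for mbr, child in node["entries"]:
        if intersects(mbr, query):
            if node["leaf"]:
                out.append(child)               # child is an object id
            else:
                range_search(child, query, out)  # child is a subtree
    return out

# Hypothetical tree: a root with two leaves, each entry is (MBR, payload).
leaf1 = {"leaf": True, "entries": [((1, 1, 2, 2), "a"), ((3, 3, 4, 4), "b")]}
leaf2 = {"leaf": True, "entries": [((6, 6, 7, 7), "c"), ((8, 1, 9, 2), "d")]}
root = {"leaf": False, "entries": [((1, 1, 4, 4), leaf1), ((6, 1, 9, 7), leaf2)]}

hits = range_search(root, (2.5, 2.5, 6.5, 6.5))
print(hits)
```

When the MBRs of sibling nodes overlap, a query may have to descend into several subtrees, which is why the slide lists coverage and overlap as the quantities to optimize.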

16
Space Filling Curves

Assumption: attribute values can be represented with some fixed # of bits
Space domain on each dimension: 2^k values
Linearize the domain
Each point can be represented by a single-dimensional value

17
Z-ordering
[Figure: Z-order curve over a 4 x 4 grid; both axes are labelled 00, 01, 10, 11]
18
Z-ordering

The z-value is obtained by interleaving the bits.
E.g. X = 01, Y = 11: z-value = 0111 = 7
Clustering effect on X-Y; z-values can be indexed using B-trees
Range queries problematic?
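
The bit interleaving can be sketched as follows. The function name and the choice to place the X bit before the Y bit at each position are assumptions, chosen so that the result agrees with the slide's example (X = 01, Y = 11 gives 0111).

```python
def z_value(x, y, bits):
    """Interleave the `bits` low-order bits of x and y, x bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):    # from most to least significant bit
        z = (z << 1) | ((x >> i) & 1)    # next bit of x
        z = (z << 1) | ((y >> i) & 1)    # next bit of y
    return z

z = z_value(0b01, 0b11, bits=2)
print(format(z, "04b"), z)
```

The resulting integer is an ordinary one-dimensional key, so it can be stored in a B-tree without changing the index structure.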

19
Hilbert Curve
[Figure: Hilbert curve over an 8 x 8 grid; both axes are labelled 000 to 111]
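
The slide shows only the curve itself. As an illustration, the sketch below uses the widely known iterative conversion from grid coordinates to a position along the Hilbert curve; it is not taken from the slides, and the function name and the orientation of the curve are assumptions.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position d along the Hilbert curve, 0 <= d < n * n."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Visiting order of the cells of a 4 x 4 grid.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: hilbert_index(4, p[0], p[1]))
print(order[:4])
```

Unlike the Z-order, consecutive positions on this curve always belong to neighbouring grid cells, which is the reason it tends to cluster better.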
20
Grid Files

Based on extendible hashing
Design principle: any point query can be answered in at most 2 disk accesses.
Two structures: a k-dimensional array and k 1-dimensional arrays

21
Extendible Hashing

Situation: Hash bucket (primary page) becomes full. Why not re-organize the file by doubling the # of buckets?
Reading and writing all pages is expensive!
Idea: Use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed!
Directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!
Trick lies in how the hash function is adjusted!

22
Example
[Figure: a directory with global depth 2 (entries 00, 01, 10, 11) pointing to four data pages, each with local depth 2]
Bucket A (00): 4, 12, 32, 16
Bucket B (01): 1, 5, 21, 13
Bucket C (10): 10
Bucket D (11): 15, 7, 19

Directory is array of size 4.
To find bucket for r, take last 'global depth' # bits of h(r); we denote r by h(r).
If h(r) = 5 = binary 101, it is in bucket pointed to by 01.

Insert: If bucket is full, split it (allocate new page, re-distribute).

If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)
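
The lookup and insert rules can be sketched as follows. This is a simplified in-memory illustration rather than the slides' implementation: the class and method names are assumptions, buckets are Python lists instead of disk pages, keys are already hash values, and there is no handling of many duplicates of the same hash value (which would keep splitting forever).

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        # The last `global_depth` bits of h select the directory entry.
        return h & ((1 << self.global_depth) - 1)

    def lookup(self, h):
        return h in self.directory[self._slot(h)].keys

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        # Bucket is full. Double the directory only if this bucket already
        # uses as many bits as the directory does.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # doubling via copying
            self.global_depth += 1
        # Split: entries whose newly examined bit is 1 move to the split image.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        pending, bucket.keys = bucket.keys + [h], []
        for k in pending:
            self.insert(k)  # re-distribute between the bucket and its image

# Rebuild the four-bucket example from the slide, then insert 20.
table = ExtendibleHash(global_depth=2, capacity=4)
for h in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    table.insert(h)
table.insert(20)
print(table.global_depth, sorted(table.directory[0].keys), sorted(table.directory[4].keys))
```

Inserting 20 finds the bucket for 00 full with local depth equal to the global depth, so the directory doubles and only that one bucket is split; the other three buckets are untouched and are now each referenced by two directory entries.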

23
Insert h(r) = 20 (Causes Doubling)
[Figure, left: Bucket A is split while the directory still has global depth 2]
Bucket A (local depth 3): 32, 16
Bucket B (local depth 2): 1, 5, 21, 13
Bucket C (local depth 2): 10
Bucket D (local depth 2): 15, 7, 19
Bucket A2 (local depth 3, 'split image' of Bucket A): 4, 12, 20
[Figure, right: the directory after doubling, with global depth 3 and entries 000 to 111; entry 000 points to Bucket A and entry 100 to Bucket A2, the bucket contents are the same as on the left]
24
Points to Note

20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which.
Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to.
Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
When does bucket split cause directory doubling?
Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and 'fixing' pointer to split image page. (Use of least significant bits enables efficient doubling via copying of directory!)

25
Directory Doubling

Why use least significant bits in directory?
Allows for doubling via copying!

[Figure: a depth-2 directory (00, 01, 10, 11) doubled to depth 3 (000 to 111), tracing the entry 6 = 110, once using the least significant bits and once using the most significant bits]
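
A small sketch of why the least significant bits make doubling a plain copy. The bucket labels and the 3-bit hash width are made up for illustration.

```python
# Depth-2 directory; slot i holds the bucket for hash values ending in / starting with i.
old = ["A", "B", "C", "D"]

# Least significant bits: slots i and i + 4 share their last 2 bits,
# so the doubled directory is the old one followed by a copy of itself.
lsb_doubled = old + old

# Most significant bits (of a 3-bit hash): slots 2i and 2i + 1 share their
# first 2 bits, so every old entry has to be duplicated in place.
msb_doubled = [old[i >> 1] for i in range(8)]

h = 0b110  # the value 6 used on the slide
print(lsb_doubled[h], old[h & 0b11])   # same bucket before and after doubling
print(msb_doubled[h], old[h >> 1])     # same bucket before and after doubling
```

With least significant bits the existing entries keep their positions and the new half is appended; with most significant bits every entry after the first moves, so the directory has to be rewritten rather than extended.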
26
Comments on Extendible Hashing

If directory fits in memory, equality search answered with one disk access; else two.
A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.
Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
Multiple entries with same hash value cause problems!
Delete: If removal of data entry makes bucket empty, can be merged with 'split image'. If each directory element points to same bucket as its split image, can halve directory.
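
The figures in the 100MB example can be reproduced with a few lines of arithmetic. The sketch assumes 1 MB = 10^6 bytes, completely full pages, and one directory element per data page; these assumptions are not stated on the slide.

```python
file_bytes = 100 * 10**6      # 100 MB file
record_bytes = 100            # bytes per record
page_bytes = 4 * 1024         # 4K page

records = file_bytes // record_bytes            # number of data entries
records_per_page = page_bytes // record_bytes   # whole records that fit in a page
pages = records // records_per_page             # data pages, one directory element each

print(records, records_per_page, pages)
```

A directory of 25,000 pointers is far smaller than the 100MB file itself, which supports the claim that it is likely to fit in memory.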

27
Summary on Extendible Hashing

Hash-based indexes: best for equality searches, cannot support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
Directory to keep track of buckets, doubles periodically.
Can get large with skewed data; additional I/O if this does not fit in main memory.

28
Grid Files
29
Scales, Directory, Bucket

Data structures
Linear scales
Directory: an array whose elements are in one-to-one correspondence with the grid cells; each entry points to a data bucket
Data buckets
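
How the three structures cooperate on a point query can be sketched as below. The 2-d layout, the split values, the bucket names and their contents are all made up for illustration; in a real grid file the directory and the buckets live on disk, while the scales are small enough to stay in memory.

```python
from bisect import bisect_right

# Linear scales: split points per dimension.
# x is cut at 10 and 20 (3 intervals), y is cut at 5 (2 intervals).
scales = [[10, 20], [5]]

# Directory: one entry per grid cell (3 x 2); neighbouring cells may share a bucket.
directory = [["b0", "b1"],
             ["b0", "b2"],
             ["b3", "b3"]]

# Data buckets.
buckets = {
    "b0": [(3, 2), (12, 4)],
    "b1": [(4, 9)],
    "b2": [(15, 7)],
    "b3": [(25, 1), (22, 8)],
}

def cell_of(point):
    """Use the scales to turn a point into grid-cell coordinates."""
    return tuple(bisect_right(scale, v) for scale, v in zip(scales, point))

def point_query(point):
    i, j = cell_of(point)
    bucket_id = directory[i][j]         # first access: the directory entry
    return point in buckets[bucket_id]  # second access: the data bucket

print(cell_of((15, 7)), point_query((15, 7)), point_query((15, 8)))
```

The lookup reads one directory entry and one bucket, which is where the bound of at most 2 disk accesses per point query comes from.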

30
Splitting and Merging
31
Splitting and Merging
32
Grid Files...