Hash-Based Indexes - PowerPoint PPT Presentation

Title: Hash-Based Indexes
Slides: 19
Provided by: RaghuRamak241
Learn more at: http://web.cs.wpi.edu

Transcript and Presenter's Notes
1
Hash-Based Indexes
  • Chapter 11

2
Introduction
  • Hash-based indexes are best for equality
    selections. They cannot support range searches.
  • Static and dynamic hashing techniques exist.
  • Trade-offs similar to ISAM vs. B+ trees.

3
Static Hashing
  • # primary pages fixed, allocated sequentially,
    never de-allocated; overflow pages if needed.
  • h(k) mod N = bucket to which data entry with key
    k belongs. (N = # of buckets)

[Figure: key -> h -> h(key) mod N selects one of the
 N primary bucket pages (0 ... N-1); overflow pages
 are chained off a primary page as needed.]
4
Static Hashing
  • Buckets contain data entries (alternatives (1) to
    (3) possible).
  • Hash function works on search key field of record
    r.
  • h(k) mod N = bucket to which data entry with key
    k belongs, with N = # of buckets.
  • h() must distribute values over range 0 ... N-1.
  • h(key) = (a * key + b) usually works well;
  • a and b are constants.
  • Lots known about how to tune h.
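The scheme above can be sketched in a few lines of Python. This is an illustrative toy, not the book's code; the constants `A`, `B` and `N` are arbitrary choices, and a plain list stands in for a primary page plus its overflow chain.

```python
# Minimal sketch of static hashing: N buckets fixed at creation,
# h(key) = (a * key + b), bucket chosen by h(key) mod N.

N = 4                    # # of primary buckets, fixed for the file's lifetime
A, B = 31, 7             # tuning constants (arbitrary for this sketch)

def h(key: int) -> int:
    return A * key + B

buckets = [[] for _ in range(N)]   # each list = primary page + overflow chain

def insert(key: int) -> None:
    buckets[h(key) % N].append(key)     # a long list models a long overflow chain

def search(key: int) -> bool:
    return key in buckets[h(key) % N]   # equality search probes exactly one chain

for k in (4, 12, 32, 16, 5):
    insert(k)
print(search(12), search(9))   # True False
```

Note that a range query such as "all keys between 10 and 20" gains nothing here: matching keys are scattered across buckets, which is why hash indexes only help equality selections.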

5
Static Hashing Cons
  • # primary pages fixed -> fixed space -> static
    structure.
  • Long overflow chains can develop and degrade
    performance.
  • Fixed # of buckets is the problem!
  • Rehashing can be done -> not good for search.
  • In practice, instead use overflow chains.
  • Or, employ dynamic techniques to fix this problem:
  • Extendible Hashing, or
  • Linear Hashing.

6
Extendible Hashing
  • Situation: Bucket (primary page) becomes full.
  • Solution: Why not re-organize file by doubling
    # of buckets?
  • Reading and writing all pages is expensive!
  • Idea: Use directory of pointers to buckets
    instead of buckets:
  • double # of buckets by doubling the directory,
  • splitting just the bucket that overflowed!
  • Discussion:
  • Directory much smaller than file, so doubling
    it is much cheaper. Only one page of data
    entries is split.
  • No overflow pages ever.
  • Trick lies in how hash function is adjusted!

7
Example
  • Directory is array of size 4.
  • To find bucket for r, take last global depth
    # bits of h(r).
  • Examples:
  • If h(r) = 5 = binary 101, it is in bucket
    pointed to by 01.
  • If h(r) = 4 = binary 100, it is in bucket
    pointed to by 00.

[Figure: GLOBAL DEPTH = 2; DIRECTORY entries 00, 01,
 10, 11 point to the DATA PAGES:
 Bucket A (LOCAL DEPTH 2): 4, 12, 32, 16
 Bucket B (LOCAL DEPTH 2): 1, 5, 21, 13
 Bucket C (LOCAL DEPTH 2): 10
 Bucket D (LOCAL DEPTH 2): 15, 7, 19]
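The "last global depth bits" lookup can be sketched directly. This is a hypothetical fragment that takes h(r) = r (the identity), as the slide's example effectively does, and uses bucket labels in place of page pointers:

```python
# Sketch: locate a bucket from the last global-depth bits of the hash value.

global_depth = 2
directory = ["A", "B", "C", "D"]   # directory[i] = bucket for bit pattern i

def bucket_for(hash_value: int) -> str:
    index = hash_value & ((1 << global_depth) - 1)   # keep last global_depth bits
    return directory[index]

print(bucket_for(5))   # 5 = binary 101, last 2 bits = 01 -> "B"
print(bucket_for(4))   # 4 = binary 100, last 2 bits = 00 -> "A"
```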
8
Example: Insertion
  • To find bucket for r, take last global depth
    # bits of h(r).
  • If h(r) = 5 = binary 101, r is in bucket pointed
    to by 01.

[Figure: same file as before — GLOBAL DEPTH = 2;
 DIRECTORY entries 00, 01, 10, 11 point to the DATA
 PAGES:
 Bucket A (LOCAL DEPTH 2): 4, 12, 32, 16
 Bucket B (LOCAL DEPTH 2): 1, 5, 21, 13
 Bucket C (LOCAL DEPTH 2): 10
 Bucket D (LOCAL DEPTH 2): 15, 7, 19]

  • Insert: If bucket is full, split it (allocate
    new page, re-distribute).
  • Splitting may
  • double the directory, or
  • simply link in a new page.
  • To tell: compare global depth with local depth
    for the split bucket.
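The insert-and-split logic above can be sketched as a small Python class. This is a simplified illustration, not the textbook's implementation: it assumes h(r) = r, a bucket capacity of 4 entries, and an in-memory list for the directory.

```python
# Sketch of extendible-hashing insert (simplified: h(r) = r, capacity 4).

CAPACITY = 4

class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.items = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 2
        self.directory = [Bucket(2) for _ in range(4)]  # indexed by last bits

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.items) < CAPACITY:
            bucket.items.append(key)
            return
        # Full bucket: if local depth == global depth, double the directory
        # first (copy it over); otherwise just link in a new page.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Allocate the split image and redistribute on one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        old = bucket.items
        bucket.items = [k for k in old if not (k & bit)]
        image.items = [k for k in old if k & bit]
        # Fix directory entries whose extra bit is 1 to point at the image.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = image
        self.insert(key)   # retry; may split again if the data is skewed

# Reproduce the slides' example: fill the file, then insert 20.
eh = ExtendibleHash()
for k in (32, 4, 12, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20):
    eh.insert(k)
print(eh.global_depth)                  # 3 (inserting 20 doubled the directory)
print(sorted(eh.directory[0].items))    # [16, 32]     (Bucket A)
print(sorted(eh.directory[4].items))    # [4, 12, 20]  (split image A2)
```

Inserting 20 hits the full bucket for pattern 00 whose local depth equals the global depth, so the directory doubles and only that one page is split, exactly as on the next slides.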

9
Insert: h(r) = 6 = binary 110

10
Insert: h(r) = 20

11
Insert: h(r) = 20

12
Insert: h(r) = 20

[Figure: Before the split — GLOBAL DEPTH = 2;
 Bucket A (LOCAL DEPTH 2): 32, 16, 4, 12 is full.
 After inserting 20 — GLOBAL DEPTH = 3; DIRECTORY
 entries 000 ... 111:
 Bucket A (LOCAL DEPTH 3): 32, 16
 Bucket B (LOCAL DEPTH 2): 1, 5, 21, 13
 Bucket C (LOCAL DEPTH 2): 10
 Bucket D (LOCAL DEPTH 2): 15, 7, 19
 Bucket A2 (LOCAL DEPTH 3): 20, 4, 12
 ('split image' of Bucket A)]
13
Points to Note
  • 20 = binary 10100.
  • Last 2 bits (00) tell us r belongs in A or A2.
  • Last 3 bits needed to tell which.
  • Global depth of directory: Max # of bits needed
    to tell which bucket an entry belongs to.
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.

14
More Points to Note
  • Global depth of directory: Max # of bits needed
    to tell which bucket an entry belongs to.
  • Local depth of a bucket: # of bits used to
    determine if an entry belongs to this bucket.
  • When does bucket split cause directory doubling?
  • Before insert, local depth of bucket = global
    depth.
  • Insert causes local depth to become > global
    depth;
  • directory is doubled by copying it over and
    fixing pointers to split image pages.
  • Why not first several bits?
  • Use of least significant bits enables efficient
    doubling via copying of directory!
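The "doubling via copying" point is easy to see in code. With least-significant-bit indexing, entry i and entry i + 2^d agree on their last d bits, so right after doubling they must point to the same bucket; the new directory is literally two copies of the old one. A small illustrative sketch (bucket labels stand in for page pointers):

```python
# Why least-significant bits allow directory doubling by copying.

global_depth = 2
directory = ["A", "B", "C", "D"]   # indexed by the last 2 bits of h(r)

# Doubling: entries i and i + 2**global_depth share their last
# global_depth bits, so the new directory is two copies of the old.
directory = directory + directory
global_depth += 1

for i in range(len(directory)):
    print(format(i, "03b"), "->", directory[i])
```

With most-significant-bit indexing the old entries would have to be interleaved into the new directory, so doubling could not be a simple copy.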

15
Directory Doubling
  • Why use least significant bits in directory?
  • Allows for doubling via copying!

[Figure: Insert h(r) = 6 = binary 110 into a directory
 with GLOBAL DEPTH 2 (entries 00, 01, 10, 11), which
 doubles it to GLOBAL DEPTH 3. With least significant
 bits, the new directory (000 ... 111) is two copies
 of the old one, so entries 000 and 100, 001 and 101,
 etc. initially share buckets. With most significant
 bits, the old entries must be interleaved, so
 doubling is not a simple copy.
 Least Significant vs. Most Significant]
16
Extendible Hashing: Delete
  • If removal of data entry makes bucket empty, it
    can be merged with its split image.
  • If each directory element points to the same
    bucket as its split image, can halve directory.
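The halving condition can be checked mechanically: entry i and entry i + 2^(d-1) are split images of each other (they differ only in the most significant of the d index bits), and the directory may shrink exactly when every such pair points to the same bucket. A hypothetical helper, using labels for buckets:

```python
# Sketch: the directory can be halved exactly when every entry points to
# the same bucket as its split image (the entry at i + len/2).

def can_halve(directory):
    half = len(directory) // 2
    return all(directory[i] == directory[i + half] for i in range(half))

print(can_halve(["A", "B", "A", "B"]))   # True: can shrink back to ["A", "B"]
print(can_halve(["A", "B", "A", "C"]))   # False: entries 01 and 11 differ
```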

17
Comments on Extendible Hashing
  • If directory fits in memory, equality search
    answered with one disk access; else two.
  • 100 MB file, 100 bytes/rec, 4K pages contains
    1,000,000 records (as data entries) and 25,000
    directory elements; chances are high that
    directory will fit in memory.
  • Directory grows in spurts.
  • If the distribution of hash values is skewed,
    directory can grow large.
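The slide's numbers follow from simple arithmetic. A back-of-the-envelope check (using decimal units to match the slide's round figures; the 4-bytes-per-pointer directory size is an added assumption):

```python
# Back-of-the-envelope for the slide's sizing example.

file_bytes   = 100 * 10**6   # 100 MB file (decimal, as on the slide)
record_bytes = 100           # 100 bytes per record / data entry
page_bytes   = 4 * 10**3     # "4K" page

records          = file_bytes // record_bytes      # 1,000,000 data entries
entries_per_page = page_bytes // record_bytes      # 40 entries per bucket page
bucket_pages     = records // entries_per_page     # 25,000 buckets = directory size
dir_bytes        = bucket_pages * 4                # ~100 KB at 4 bytes/pointer
print(records, bucket_pages, dir_bytes)            # 1000000 25000 100000
```

At roughly 100 KB, the directory comfortably fits in memory, so most equality searches cost a single disk access.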

18
Summary
  • Hash-based indexes best for equality searches;
    cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by
    splitting a full bucket when a new data entry is
    to be added to it.
  • A large # of duplicates may require overflow
    pages.
  • Directory to keep track of buckets doubles
    periodically.
  • Can get large with skewed data; additional I/O if
    it does not fit in main memory.