CS 541 Hashing - PowerPoint PPT Presentation

1 / 65
About This Presentation
Title:

CS 541 Hashing

Description:

and cut directory if possible (Reverse insert procedure) Fall 2002. Chris Clifton - CS541 ... (Not bad if directory in memory) Directory doubles in size (Now it ... – PowerPoint PPT presentation

Number of Views:82
Avg rating:3.0/5.0
Slides: 66
Provided by: clif8
Category:

less

Transcript and Presenter's Notes

Title: CS 541 Hashing


1
CS 541Hashing
  • October 9, 2002

2
Hashing
  • key ? h(key)


Buckets (typically 1 disk block)
. . .
3
  • Two alternatives

. . .
records
(1) key ? h(key)
. . .
4
Two alternatives
record
(2) key ? h(key)
key 1
Index
  • Alt (2) for secondary search key

5
Example hash function
  • Key x1 x2 xn n byte character string
  • Have b buckets
  • h add x1 x2 .. xn
  • compute sum modulo b

6
  • ? This may not be best function
  • ? Read Knuth Vol. 3 if you really need to
    select a good function.

Good hash ? Expected number of
function keys/bucket is the same for all
buckets
7
Within a bucket
  • Do we keep keys sorted?
  • Yes, if CPU time critical
  • Inserts/Deletes not too frequent

8
Next example to illustrate inserts,
overflows, deletes
  • h(K)

9
EXAMPLE 2 records/bucket
  • INSERT
  • h(a) 1
  • h(b) 2
  • h(c) 1
  • h(d) 0

0 1 2 3
h(e) 1
10
EXAMPLE deletion
Deleteef
0 1 2 3
a
b
d
c
c
e
f
g
11
Rule of thumb
  • Try to keep space utilization
  • between 50 and 80
  • Utilization keys used
  • total keys that fit
  • If
  • If 80, overflows significant depends on how
    good hash function is on keys/bucket

12
Indiana Center for Database SystemsSeminar
Wednesdays 1130, CS111Today Defining Privacy
for Data Mining
  • Chris Clifton
  • clifton_at_cs.purdue.edu
  • This is joint work with Murat Kantarcioglu and
    Jaideep Vaidya

13
CS 541Hashing
  • October 16, 2002

14
Hashing
  • key ? h(key)


Buckets (typically 1 disk block)
. . .
15
How do we cope with growth?
  • Overflows and reorganizations
  • Dynamic hashing

16
Extensible hashing two ideas
  • (a) Use i of b bits output by hash function
  • b
  • h(K) ?
  • use i ? grows over time.

00110101
17
  • (b) Use directory
  • h(K)i to bucket

. . .
. . .
18
Example h(k) is 4 bits 2 keys/bucket
1
  • i

0001
1
1001
1100
Insert 1010
19
Example continued
i
2
00 01 10 11
1
0001
0111
1001
1010
Insert 0111 0000
1100
20
Example continued
2
0000
0001
i
2
00 01 10 11
2
0111
Insert 1001
21
Extensible hashing deletion
  • No merging of blocks
  • Merge blocks and cut directory if possible
  • (Reverse insert procedure)

22
Deletion example
  • Run thru insert example in reverse!

23
Extensible hashing
Summary
  • Can handle growing files
  • - with less wasted space
  • - with no full reorganizations


24
Linear hashing
  • Another dynamic hashing scheme

25
Example b4 bits, i 2, 2 keys/bucket
  • insert 0101

Future growth buckets
0101
0000
1111
1010
  • 00 01 10 11

m 01 (max used block)
26
Example b4 bits, i 2, 2 keys/bucket
Future growth buckets
0101
0000
1111
1010
  • 00 01 10 11

m 01 (max used block)
27
Example Continued How to grow beyond this?
i 2
1111
1010
0101
0000
0101
  • 00 01 10 11

. . .
m 11 (max used block)
28
? When do we expand file?
  • Keep track of used slots
  • total of slots

U
  • If U threshold then increase m
  • (and maybe i )

29
Linear Hashing
Summary
  • Can handle growing files
  • - with less wasted space
  • - with no full reorganizations
  • No indirection like extensible hashing



30
Example BAD CASE
  • Very full
  • Very empty Need to move
  • m here
  • Would waste
  • space...

31
Summary
  • Hashing
  • - How it works
  • - Dynamic hashing
  • - Extensible
  • - Linear

32
CS 541Hashing
  • October 18, 2002

33
Next
  • Indexing vs Hashing
  • Index definition in SQL
  • Multiple key access

34
Indexing vs Hashing
  • Hashing good for probes given key
  • e.g., SELECT
  • FROM R
  • WHERE R.A 5

35
Indexing vs Hashing
  • INDEXING (Including B Trees) good for
  • Range Searches
  • e.g., SELECT
  • FROM R
  • WHERE R.A 5

36
Index definition in SQL
  • Create index name on rel (attr)
  • Create unique index name on rel (attr)

defines candidate key
  • Drop INDEX name

37
  • CANNOT SPECIFY TYPE OF INDEX
  • (e.g. B-tree, Hashing, )
  • OR PARAMETERS
  • (e.g. Load Factor, Size of Hash,...)
  • ... at least in SQL...

Note
38
  • ATTRIBUTE LIST ? MULTIKEY INDEX
  • (next)
  • e.g., CREATE INDEX foo ON R(A,B,C)

Note
39

Multi-key Index
  • Motivation Find records where
  • DEPT Toy AND SAL 50k

40
Strategy I
  • Use one index, say Dept.
  • Get all Dept Toy records and
    check their salary

I1
41
Strategy II
  • Use 2 Indexes Manipulate Pointers
  • Toy Sal
  • 50k

42
Strategy III
  • Multiple Key Index
  • One idea

I2
I3
I1
43
Example
  • Example
  • Record
  • Dept
  • Index
  • Salary
  • Index

10k
15k
17k
21k
NameJoe DEPTSales SAL15k
12k
15k
15k
19k
44
For which queries is this index good?
  • Find RECs Dept Sales SAL20k
  • Find RECs Dept Sales SAL 20k
  • Find RECs Dept Sales
  • Find RECs SAL 20k

45
Interesting application
  • Geographic Data
  • DATA

y
x
. . .
46
Queries
  • What city is at ?
  • What is within 5 miles from ?
  • Which is closest point to ?

47
Example

48
Queries
  • Find points with Yi 20
  • Find points with Xi
  • Find points close to i
  • Find points close to b

49
  • Many types of geographic index structures have
    been suggested
  • Quad Trees
  • R Trees

50
Two more types of multi key indexes
  • Grid
  • Partitioned hash

51
Grid Index
  • Key 2
  • X1 X2 Xn
  • V1
  • V2
  • Key 1
  • Vn

To records with key1V3, key2X2
52
CLAIM
  • Can quickly find records with
  • key 1 Vi ? Key 2 Xj
  • key 1 Vi
  • key 2 Xj
  • And also ranges.
  • E.g., key 1 ? Vi ? key 2

53
  • ? But there is a catch with Grid Indexes!
  • How is Grid Index stored on disk?
  • Problem
  • Need regularity so we can compute position of
    entry

54
Solution Use Indirection
  • Buckets
  • V1
  • V2
  • V3 Grid only
  • V4 contains
  • pointers to
  • buckets
  • Buckets

X1 X2 X3
-- -- --
-- -- --
-- -- --
-- -- --
-- -- --
55
With indirection
  • Grid can be regular without wasting space
  • We do have price of indirection

56
Can also index grid on value ranges
  • Salary Grid

0-20K
1
20K-50K
2
50K-
3
8
Linear Scale
1
2
3
Toy
Sales
Personnel
57
Grid files
  • Good for multiple-key search
  • Space, management overhead (nothing is
    free)
  • Need partitioning ranges that evenly split keys


-
-
58
Partitioned hash function
  • Idea
  • Key1 Key2

010110 1110010
h1
h2
59
EX
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • ,

Insert
60
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Dept. Sales ? Sal40k







61
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Sal30k







62
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Dept. Sales







63
Summary
  • Post hashing discussion
  • - Indexing vs. Hashing
  • - SQL Index Definition
  • - Multiple Key Access
  • - Multi Key Index
  • Variations Grid, Geo Data
  • - Partitioned Hash

64
Reading Chapter 5
  • Skim the following sections
  • 5.3.6, 5.3.7, 5.3.8
  • 5.4.2, 5.4.3, 5.4.4
  • Read the rest

65
The BIG picture.
  • Chapters 2 3 Storage, records, blocks...
  • Chapter 4 5 Access Mechanisms - Indexes
  • - B trees
  • - Hashing
  • - Multi key
  • Chapter 6 7 Query Processing

NEXT
Write a Comment
User Comments (0)
About PowerShow.com