Data organization in main memory or disk

Slides: 37

Provided by: Euri5

1
Hashing

Data organization in main memory or on disk:
sequential, binary trees, ...
The location of a key depends on other keys =>
unnecessary key comparisons to find a key
Question: can a key be found with a single comparison?
Hashing: the location of a record is computed
using its key only
Fast for random accesses, slow for range queries

2
Hash Table

Hash function: transforms keys to array indices

3
h(key) = key mod 1000
4
Good Hash Functions

1. Uniform: distributes keys evenly in space
2. Perfect: two records cannot occupy the same location
3. Order preserving
Difficult to find such hash functions
Property 2 is the most essential
Most functions are no better than
h(key) = key mod m
Hash collision: two different keys hash to the same position

5
Collision Resolution

Open Addressing (rehashing): compute a new position
to store the key in the table (no extra space)
linear probing
double hashing
Separate Chaining: lists of keys mapped to the
same position (uses extra space)

6
Open Addressing

Computes a new address to store the key if its position is
occupied (rehashing)
if the new address is occupied too, compute another one, until
an empty position is found
primary hash function: i = h(key)
rehash function: rh(i) = rh(h(key))
hash sequence: (h_0, h_1, h_2, ...) = (h(key), rh(h(key)),
rh(rh(h(key))), ...)
To find a key, follow the same hash sequence
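
The insert and search procedure described above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names, the use of `None` for an empty position, and the key 93 used to occupy a slot are assumptions; the functions h and rh are the ones from the example on the next slide.

```python
from typing import Callable, List, Optional

def insert(table: List[Optional[int]], key: int,
           h: Callable[[int], int], rh: Callable[[int], int]) -> int:
    """Store key at the first empty position of its hash sequence."""
    i = h(key)
    for _ in range(len(table)):      # give up after m probes
        if table[i] is None:
            table[i] = key
            return i
        i = rh(i)                    # position occupied: rehash
    raise RuntimeError("no empty position found")

def search(table: List[Optional[int]], key: int,
           h: Callable[[int], int], rh: Callable[[int], int]) -> Optional[int]:
    """Follow the same hash sequence that insert used."""
    i = h(key)
    for _ in range(len(table)):
        if table[i] is None:         # empty position: key is not in the table
            return None
        if table[i] == key:
            return i
        i = rh(i)
    return None

m = 100

def h(key: int) -> int:
    return key % m

def rh(i: int) -> int:
    return (i + 1) % m

table: List[Optional[int]] = [None] * m
insert(table, 93, h, rh)             # hypothetical key that occupies position 93
pos = insert(table, 193, h, rh)      # 93 is taken, so 193 moves on
print(pos)
```

The loop bound of m probes is a design choice of this sketch; whether m probes actually cover the whole table depends on the rehash function.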

7
Example

i = h(key) = key mod 100
rh(i) = (i + 1) mod 100
key = 193
i = h(193) = 93
rh(i) = (93 + 1) mod 100 = 94
If position 93 is occupied, key 193 will occupy position 94
8
Problem 1: Locate Empty Positions

No empty position can be found:
the table is full
check on the number of empty positions
the hash function fails to find an empty position
although the table is not full!
i = h(key) = key mod 1000
rh(i) = (i + 200) mod 1000 => checks only 5
positions on a table of 1000 positions
rh(i) = (i + 1) mod 1000: successive positions
rh(i) = (i + c) mod 1000, where GCD(c, m) = 1
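
A short check of the claim above: a rehash step c on a table of size m reaches m / GCD(c, m) distinct positions. The function name `positions_visited` and the step value 7 are illustrative assumptions; 200 and 1 are the steps from the slide.

```python
from math import gcd

def positions_visited(start: int, c: int, m: int) -> int:
    """Count distinct positions reached by rh(i) = (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return len(seen)

m = 1000
for c in (200, 1, 7):
    # each row: step, GCD(c, m), number of positions the sequence can reach
    print(c, gcd(c, m), positions_visited(0, c, m))
```

Only the steps with GCD(c, m) = 1 can reach every position of the table.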

9
Problem 2: Primary Clustering

Different keys that hash into different addresses
compete with each other in successive rehashes
i = h(key) = key mod 100
rh(i) = (i + 1) mod 100
keys 1990, 1991, 1992, 1993, 1994 => 94

10
Problem 3: Secondary Clustering

Different keys which hash to the same hash value
have the same rehash sequence
i = h(key) = key mod 10
rh(i, j) = (i + j) mod 10
key 23: h(23) = 3
rh: 4, 6, 9, 3, ...
key 13: h(13) = 3
rh: 4, 6, 9, 3, ...
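
The two sequences above can be reproduced with a few lines. This sketch assumes that j is the number of the rehash (1, 2, 3, ...) and that i is the previous position, which is the reading consistent with the sequence 4, 6, 9, 3; the function name is hypothetical.

```python
def rehash_sequence(key: int, m: int = 10, count: int = 4) -> list:
    """Positions produced by rh(i, j) = (i + j) mod m for j = 1..count."""
    i = key % m                  # i = h(key)
    seq = []
    for j in range(1, count + 1):
        i = (i + j) % m          # j-th rehash, starting from the previous position
        seq.append(i)
    return seq

print(rehash_sequence(23))
print(rehash_sequence(13))
```

Because the sequence depends only on h(key), any two keys with the same hash value follow identical positions.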

11
Linear Probing

Store the key into the next free position
h_0 = h(key), usually h_0 = key mod m
h_i = (h_(i-1) + 1) mod m, i >= 1

S = 22, 35, 301, 99, 102, 452
12
Observation 1: Number of Probes

Different insertion sequences => different hash
sequences
S1 = 11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53 => 28
probes
S2 = 53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11 => 30
probes

h(key) = key mod 13
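
The probe count for an insertion sequence can be obtained with a small linear probing routine. This is a sketch under one stated convention: every position examined during an insertion, including the one where the key is finally stored, counts as one probe. The function name `total_probes` is an assumption. Under this convention the routine gives 28 for S1; other counting conventions give different totals.

```python
def total_probes(keys, m):
    """Insert keys with linear probing and count every position examined."""
    table = [None] * m
    probes = 0
    for key in keys:
        i = key % m                  # h_0 = key mod m
        probes += 1
        while table[i] is not None:  # occupied: try the next position
            i = (i + 1) % m
            probes += 1
        table[i] = key
    return probes, table

S1 = [11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53]
probes, table = total_probes(S1, 13)
print(probes)
print(table)
```

The routine assumes the number of keys does not exceed m; otherwise the inner loop would never find a free position.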
13
Observation 2

Deletions are not easy
i = h(key) = key mod 10
rh(i) = (i + 1) mod 10
Action: delete(65) and search(5)
Problem: search will stop at the
empty position and will not find 5
Solution:
mark the position as deleted rather than empty
the marked position can be reused
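
The solution above can be sketched with a special marker for deleted positions. The names `EMPTY`, `DELETED`, and the helper functions are assumptions of this sketch, as is the starting state (65 stored at position 5 and 5 pushed to position 6), which is the situation the slide's delete(65) and search(5) example suggests.

```python
EMPTY = None
DELETED = object()   # marker for a position whose key was removed

def probe(table, key):
    """Yield the linear probing sequence of key: h(key), rh(h(key)), ..."""
    m = len(table)
    i = key % m
    for _ in range(m):
        yield i
        i = (i + 1) % m

def insert(table, key):
    for i in probe(table, key):
        if table[i] is EMPTY or table[i] is DELETED:  # marked positions are reused
            table[i] = key
            return i
    raise RuntimeError("table is full")

def search(table, key):
    for i in probe(table, key):
        if table[i] is EMPTY:        # only a truly empty position ends the search
            return None
        if table[i] is not DELETED and table[i] == key:
            return i
    return None

def delete(table, key):
    i = search(table, key)
    if i is not None:
        table[i] = DELETED           # mark, do not empty

table = [EMPTY] * 10
insert(table, 65)                    # stored at position 5
insert(table, 5)                     # position 5 is occupied, stored at 6
delete(table, 65)
found = search(table, 5)
print(found)
```

This insert does not check whether the key already exists further along the sequence; a fuller implementation would search first and only then reuse a marked position.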

14
Observation 3

Linear probing tends to create long sequences of
occupied positions
the longer a sequence is, the longer it tends to
become
P: probability to use a position in the cluster

15
Observation 4

Linear probing suffers from both primary and
secondary clustering
Solution: double hashing
uses two hash functions h1, h2 and a
rehashing function rh

16
Double Hashing

Two hash functions and a rehashing function
primary hash function: i = h1(key) = key mod m
secondary hash function: h2(key)
rehashing function: rh(i, key) = (i + h2(key)) mod m
h2(m, key) is some function of m and key
helps rh in computing random positions in the
hash table
h2 is computed once for each key!

17
Example of Double Hashing

hash functions
h1(key) = key mod m
q = (key div m) mod m, used to define h2(key)
rehash function
rh(i, key) = (i + h2(key)) mod m

18
Example (continued)

m = 10, key = 23
h1(23) = 3, h2(23) = 2
rh(3, 2) = (3 + 2) mod 10 = 5
rehash sequence: 5, 7, 9, 1, ...
m = 10, key = 13
h1(13) = 3, h2(13) = 1, rh(3, 1) = (3 + 1) mod 10 = 4
rehash sequence: 4, 5, 6, ...
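
The two rehash sequences above can be generated as follows. In this sketch h2(key) is taken to be q = (key div m) mod m, which matches both worked examples; the fallback to a step of 1 when q is 0 is an assumption added so that the sequence always moves, and the function names are hypothetical.

```python
def h1(key: int, m: int) -> int:
    return key % m

def h2(key: int, m: int) -> int:
    q = (key // m) % m
    return q if q != 0 else 1    # assumption: step of 1 when q is 0

def rehash_sequence(key: int, m: int, count: int) -> list:
    """First `count` positions produced by rh(i, key) = (i + h2(key)) mod m."""
    step = h2(key, m)            # computed once for each key
    i = h1(key, m)
    seq = []
    for _ in range(count):
        i = (i + step) % m
        seq.append(i)
    return seq

print(rehash_sequence(23, 10, 4))
print(rehash_sequence(13, 10, 3))
```

Keys 23 and 13 share the primary position 3 but follow different rehash sequences, which is how double hashing avoids secondary clustering.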

19
Performance of Open Addressing

Distinguish between
successful and
unsuccessful search
Assume a series of probes to random positions
independent events
load factor: λ = n/m
λ: probability to probe an occupied position
each position has the same probability P = 1/m

20
Unsuccessful Search

The hash sequence is exhausted
let u be the expected number of probes
u equals the expected length of the hash sequence
P(k): probability to search k positions in the
hash sequence

22
independent events
u increases with the load factor λ => performance drops as λ
increases
23
Successful Search

The hash sequence is not exhausted
the number of probes to find a key equals the
number of probes s at the time the key was
inserted plus 1
the load factor λ was smaller at that time
consider all values of λ

u: equivalent to unsuccessful search
approximation
increases with λ
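
The formulas on these slides did not survive in the transcript. As a stand-in, the sketch below evaluates the standard textbook results for open addressing under the same assumptions (independent probes to random positions, load factor n/m): 1 / (1 - load) expected probes for an unsuccessful search, and (1 / load) * ln(1 / (1 - load)) for a successful one, the latter obtained by averaging the unsuccessful cost over all load factors up to the current one. These may not be the exact expressions the presenter used, and the function names are assumptions.

```python
import math

def unsuccessful_probes(load: float) -> float:
    """Expected probes when the hash sequence is exhausted (0 <= load < 1)."""
    return 1.0 / (1.0 - load)

def successful_probes(load: float) -> float:
    """Average of the insertion-time cost over all load factors up to `load`."""
    return (1.0 / load) * math.log(1.0 / (1.0 - load))

for load in (0.25, 0.5, 0.75, 0.9):
    print(load, round(unsuccessful_probes(load), 3), round(successful_probes(load), 3))
```

Both quantities grow with the load factor, and the unsuccessful search is the more expensive of the two at every load.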
24
Performance

The performance drops as the load factor λ increases
the higher the value of λ is, the higher the
probability of collisions
Unsuccessful search is more expensive than
successful search
unsuccessful search exhausts the hash sequence

25
Experimental Results
26
Performance on Full Table
27
Separate Chaining

Keys hashing to the same hash value are stored in
separate lists
one list per hash position
can store more than m records
easy to implement
the keys in each list can be ordered

28
h(key) = key mod m
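
A minimal sketch of separate chaining with h(key) = key mod m, using one Python list per hash position. The class name, method names, and the sample keys are assumptions made for illustration.

```python
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one list per hash position

    def insert(self, key: int) -> None:
        chain = self.chains[key % self.m]
        if key not in chain:                   # keep each key once
            chain.append(key)

    def search(self, key: int) -> bool:
        return key in self.chains[key % self.m]

    def delete(self, key: int) -> None:
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)                  # no special marker is needed

t = ChainedHashTable(3)
for key in (3, 4, 7, 10, 13):                  # hypothetical keys, more than m of them
    t.insert(key)
t.delete(7)
print(t.chains)
```

The table holds more records than it has positions, and a deletion simply removes the key from its list.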
29
Performance of Separate Chaining

Depends on the average chain size
insertions are independent events
let P(c,n,m) be the probability that a position has
been selected c times after n insertions on a
table of size m
P(c,n,m): probability that the chain has length c
=> binomial distribution

p = 1/m: success case, q = 1 - p: failure case
30
=> P(c,n,m) ≈ (1/c!) λ^c e^(-λ), where λ = n/m
=> Poisson distribution
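
The step from the binomial to the Poisson distribution can be checked numerically. The sketch below compares the two for a table of m = 1000 positions after n = 900 insertions; these values and the function names are assumptions chosen for illustration.

```python
from math import comb, exp, factorial

def binomial(c: int, n: int, m: int) -> float:
    """Probability that one position is selected c times in n insertions."""
    p = 1.0 / m
    return comb(n, c) * p**c * (1 - p)**(n - c)

def poisson(c: int, load: float) -> float:
    """Poisson approximation with parameter load = n/m."""
    return load**c * exp(-load) / factorial(c)

n, m = 900, 1000
for c in range(5):
    # each row: chain length, binomial probability, Poisson approximation
    print(c, round(binomial(c, n, m), 4), round(poisson(c, n / m), 4))
```

The approximation is close when m is large and p = 1/m is small, which is the usual situation for a hash table.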
31
Unsuccessful Search

The entire chain is searched
the average number of comparisons equals its
average length u

32
Successful Search

Not the whole chain is searched
the average number of comparisons equals the
length s of the chain at the time the key was
inserted plus 1
the performance at the time a key was inserted
equals that of unsuccessful search!

33
Performance

The performance drops with the length of the
chains
worst case: all keys are stored in a single chain
worst case performance: O(N)
unsuccessful search performs better than
successful search! Why?
no problem with deletions!

34
Coalesced Hashing

The hash sequence is implemented as a linked list
within the hash table
no rehash function
the next hash position is the next available
position in the linked list
extra space for the list

h(key) = key mod 10, keys: 19, 29, 49, 59
35
initially avail = 9, h(key) = key mod 10, keys:
14, 29, 34, 28, 42, 39, 84, 38
[Figure: the table at initialization, with avail marked; the link column holds the lists of
rehashing positions and the list of empty positions]
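
A minimal sketch of coalesced hashing for the keys above. It simplifies the slide's design in one respect, stated here as an assumption: instead of keeping the empty positions in a linked list, `avail` starts at the last position and is moved downward until it finds an empty one. The positions this sketch assigns may therefore differ from the ones in the slide's figure. Class and method names are hypothetical.

```python
class CoalescedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.keys = [None] * m
        self.link = [None] * m     # next position in the same hash sequence
        self.avail = m - 1         # candidate empty position, scanned downward

    def _next_empty(self) -> int:
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1
        if self.avail < 0:
            raise RuntimeError("table is full")
        return self.avail

    def insert(self, key: int) -> int:
        i = key % self.m
        if self.keys[i] is None:           # home position is free
            self.keys[i] = key
            return i
        while self.link[i] is not None:    # walk to the end of the list
            i = self.link[i]
        j = self._next_empty()
        self.keys[j] = key
        self.link[i] = j                   # append the new position to the list
        return j

    def search(self, key: int):
        i = key % self.m
        while i is not None:
            if self.keys[i] == key:
                return i
            i = self.link[i]               # no rehash function: follow the links
        return None

t = CoalescedHashTable(10)
positions = [t.insert(k) for k in (14, 29, 34, 28, 42, 39, 84, 38)]
print(positions)
print(t.search(38), t.search(99))
```

Lists that start at different home positions can merge into one, as happens here for the keys hashing to 4 and to 8, which is where the name "coalesced" comes from.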
36
Performance of Coalesced Hashing
