Title: Data organization in main memory or disk
1Hashing
- Data organization in main memory or disk
  - sequential, binary trees, ...
- The location of a key depends on other keys => unnecessary key comparisons to find a key
- Question: find a key with a single comparison
- Hashing: the location of a record is computed using its key only
  - fast for random accesses
  - slow for range queries
2Hash Table
- Hash function: transforms keys to array indices
3 h(key) = key mod 1000
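A minimal sketch of the idea, assuming integer keys and a table of 1000 positions. The names `h` and `table` are illustrative only, and collisions are deliberately ignored here.

```python
def h(key: int, m: int = 1000) -> int:
    """Map an integer key to an array index in the range 0..m-1."""
    return key % m

# Store each key at the position computed from the key alone.
# Assumption: no two keys collide (collisions are not handled in this sketch).
table = [None] * 1000
for key in (1234, 98765, 42):
    table[h(key)] = key
```

Looking up key 1234 then needs only one computation, `h(1234)`, and one comparison against `table[234]`.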
4Good Hash Functions
- Uniform: distribute keys evenly in space
- Perfect: two records cannot occupy the same location, or
- Order preserving
- Difficult to find such hash functions
  - Property 2 is the most essential
  - Most functions are no better than h(key) = key mod m
  - Hash collision: two different keys are mapped to the same location
5Collision Resolution
- Open addressing (rehashing): compute a new position to store the key in the table (no extra space)
  - linear probing
  - double hashing
- Separate chaining: lists of keys mapped to the same position (uses extra space)
6Open Addressing
- Computes a new address to store the key if its position is occupied (rehashing)
  - if occupied too, compute a new address, until an empty position is found
  - primary hash function: i = h(key)
  - rehash function: rh(i) = rh(h(key))
  - hash sequence: (h_0, h_1, h_2, ...) = (h(key), rh(h(key)), rh(rh(h(key))), ...)
- To find a key, follow the same hash sequence
7Example
- i = h(key) = key mod 100
- rh(i) = (i + 1) mod 100
- key = 193
  - i = h(193) = 93
  - rh(i) = (93 + 1) mod 100 = 94
- If position 93 is already occupied, key 193 will occupy position 94
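The hash sequence of this example can be sketched as a generator. The helper name `hash_sequence` is an assumption made for illustration; h and rh are the functions given above.

```python
from itertools import islice

def hash_sequence(key, h, rh):
    """Yield h(key), rh(h(key)), rh(rh(h(key))), ... without end."""
    i = h(key)
    while True:
        yield i
        i = rh(i)

h = lambda key: key % 100       # primary hash function from the example
rh = lambda i: (i + 1) % 100    # rehash function from the example

first_three = list(islice(hash_sequence(193, h, rh), 3))
print(first_three)  # [93, 94, 95]
```

Insertion stores the key at the first empty position of this sequence; search follows the same sequence.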
8Problem 1 Locate Empty Positions
- No empty position can be found
  - the table is full
    - check on the number of empty positions
  - the hash function fails to find an empty position although the table is not full!!
    - i = h(key) = key mod 1000
    - rh(i) = (i + 200) mod 1000 => checks only 5 positions on a table of 1000 positions
    - rh(i) = (i + 1) mod 1000: checks successive positions
    - rh(i) = (i + c) mod 1000 where GCD(c, m) = 1: checks all m positions
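A small check of the GCD condition, as a sketch. The function name `positions_visited` is illustrative; it counts how many distinct positions rh(i) = (i + c) mod m can reach, which is m / GCD(c, m).

```python
from math import gcd

def positions_visited(start, c, m):
    """Return the set of positions reached by repeatedly applying rh(i) = (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return seen

m = 1000
for c in (200, 1, 7):
    # The number of positions reached equals m // gcd(c, m).
    print(c, len(positions_visited(0, c, m)), m // gcd(c, m))
```

With c = 200 only 5 positions are ever probed, while c = 1 and c = 7 (both relatively prime to 1000) reach all 1000.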
9Problem 2 Primary Clustering
- Different keys that hash into different addresses compete with each other in successive rehashes
  - i = h(key) = key mod 100
  - rh(i) = (i + 1) mod 100
  - keys 1990, 1991, 1992, 1993, 1994 => position 94
10Problem 3 Secondary Clustering
- Different keys which hash to the same hash value have the same rehash sequence
  - i = h(key) = key mod 10
  - rh(i, j) = (i + j) mod 10
  - key 23: h(23) = 3
    - rh: 4, 6, 9, 3, ...
  - key 13: h(13) = 3
    - rh: 4, 6, 9, 3, ...
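The two identical rehash sequences can be reproduced with a short sketch. It assumes that i is the previous position and j = 1, 2, 3, ... is the rehash count, which is the reading that yields 4, 6, 9, 3; the function name is illustrative.

```python
def rehash_sequence(key, m=10, length=4):
    """Return the first `length` rehash positions, using rh(i, j) = (i + j) mod m."""
    i = key % m                      # primary position h(key)
    seq = []
    for j in range(1, length + 1):   # j is the rehash count (assumed interpretation)
        i = (i + j) % m
        seq.append(i)
    return seq

print(rehash_sequence(23))  # [4, 6, 9, 3]
print(rehash_sequence(13))  # [4, 6, 9, 3]
```

Both keys start at position 3, and since the rehash depends only on i and j, they follow exactly the same path.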
11Linear Probing
- Store the key into the next free position
  - h_0 = h(key), usually h_0 = key mod m
  - h_i = (h_{i-1} + 1) mod m, i >= 1
S = {22, 35, 301, 99, 102, 452}
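A minimal sketch of insertion with linear probing for the set S. The table size m = 10 is an assumption (the size is not given here), and `insert_linear` is an illustrative name.

```python
def insert_linear(table, key):
    """Insert key using h_0 = key mod m and h_i = (h_{i-1} + 1) mod m; return its position."""
    m = len(table)
    i = key % m
    for _ in range(m):               # at most m probes
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % m
    raise OverflowError("hash table is full")

table = [None] * 10                  # assumed table size m = 10
positions = [insert_linear(table, k) for k in (22, 35, 301, 99, 102, 452)]
print(positions)  # [2, 5, 1, 9, 3, 4]
```

Keys 102 and 452 both hash to position 2, which is taken by 22, so they move on to positions 3 and 4.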
12Observation 1
- Different insertion sequences => different hash sequences
  - S1 = {11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53} => 28 probes
  - S2 = {53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11} => 30 probes
h(key) = key mod 13
13Observation 2
- Deletions are not easy
  - i = h(key) = key mod 10
  - rh(i) = (i + 1) mod 10
- Action: delete(65) and search(5)
- Problem: search will stop at the empty position and will not find 5
- Solution:
  - mark the position as deleted rather than empty
  - the marked position can be reused
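The "mark as deleted" solution can be sketched as follows. The class and marker names are assumptions for illustration; `insert` assumes the key is not already in the table.

```python
EMPTY, DELETED = object(), object()      # two distinct markers

class LinearProbingTable:
    def __init__(self, m=10):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        """Yield the hash sequence of key: h(key), then rh(i) = (i + 1) mod m."""
        i = key % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m

    def insert(self, key):
        # Assumes key is not already stored; a deleted position may be reused.
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key
                return i
        raise OverflowError("hash table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:   # only a truly empty position stops the search
                return None
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED      # mark instead of emptying
        return i

t = LinearProbingTable(10)
t.insert(65)             # stored at position 5
t.insert(5)              # position 5 is taken, stored at position 6
t.delete(65)
found = t.search(5)      # skips the deleted mark at 5 and finds key 5 at 6
```

Had delete(65) reset position 5 to empty, search(5) would have stopped there and missed the key.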
14Observation 3
- Linear probing tends to create long sequences of occupied positions
  - the longer a sequence is, the longer it tends to become
- P: probability to use a position in the cluster
15Observation 4
- Linear probing suffers from both primary and secondary clustering
- Solution: double hashing
  - uses two hash functions h1, h2 and a rehashing function rh
16Double Hashing
- Two hash functions and a rehashing function
  - primary hash function: i = h1(key) = key mod m
  - secondary hash function: h2(key)
  - rehashing function: rh(i, key) = (i + h2(key)) mod m
- h2(m, key) is some function of m and key
  - helps rh in computing random positions in the hash table
  - h2 is computed once for each key!
17Example of Double Hashing
- hash function
  - h1(key) = key mod m
  - q = (key div m) mod m, used to define h2(key)
- rehash function
  - rh(i, key) = (i + h2(key)) mod m
18Example (continued)
- m = 10, key = 23
  - h1(23) = 3, h2(23) = 2
  - rh(3, 2) = (3 + 2) mod 10 = 5
  - rehash sequence: 5, 7, 9, 1, ...
- m = 10, key = 13
  - h1(13) = 3, h2(13) = 1, rh(3, 1) = (3 + 1) mod 10 = 4
  - rehash sequence: 4, 5, 6, ...
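The two rehash sequences can be reproduced with a short sketch. It assumes h2(key) = q = (key div m) mod m, which matches h2(23) = 2 and h2(13) = 1; the fallback step of 1 when q = 0 is an assumption added here so that the sequence always moves.

```python
def double_hash_sequence(key, m=10, length=4):
    """Return the first `length` rehash positions under double hashing."""
    h1 = key % m
    q = (key // m) % m
    h2 = q if q != 0 else 1      # assumed fallback: a step of 0 would never move
    seq = []
    i = h1
    for _ in range(length):
        i = (i + h2) % m         # rh(i, key) = (i + h2(key)) mod m
        seq.append(i)
    return seq

print(double_hash_sequence(23))            # [5, 7, 9, 1]
print(double_hash_sequence(13, length=3))  # [4, 5, 6]
```

Keys 23 and 13 collide at position 3 but then follow different sequences, which is what removes secondary clustering.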
19Performance of Open Addressing
- Distinguish between
  - successful and
  - unsuccessful search
- Assume a series of probes to random positions
  - independent events
  - load factor: λ = n/m
  - λ: probability to probe an occupied position
  - each position has the same probability P = 1/m
20Unsuccessful Search
- The hash sequence is exhausted
  - let u be the expected number of probes
  - u equals the expected length of the hash sequence
  - P(k): probability to search k positions in the hash sequence
22independent events
u increases with λ => performance drops as λ increases
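The dependence of u on λ can be made explicit with the standard textbook derivation, given here as a sketch under the stated assumptions (independent probes to random positions, each occupied with probability λ). It may not match the original slide's formula exactly. A search that examines exactly k positions meets k - 1 occupied positions and then an empty one:

```latex
P(k) = \lambda^{k-1}(1-\lambda), \qquad
u = \sum_{k \ge 1} k\,P(k)
  = (1-\lambda)\sum_{k \ge 1} k\,\lambda^{k-1}
  = \frac{1}{1-\lambda}, \qquad 0 \le \lambda < 1 .
```

For example, λ = 0.5 gives u = 2 probes and λ = 0.9 gives u = 10 probes.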
23Successful Search
- The hash sequence is not exhausted
  - the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1
  - λ was less at that time
  - consider all values of λ
- u: equivalent to unsuccessful search
- approximation: s increases with λ
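A sketch of the usual approximation behind this argument, assuming the textbook model in which an unsuccessful search at load factor x costs 1/(1 - x) probes. The cost of finding a key is averaged over all load factors from 0 to λ at which it could have been inserted. The original slide's formula is not preserved and may differ by a constant term:

```latex
s \approx \frac{1}{\lambda}\int_{0}^{\lambda} \frac{dx}{1-x}
  = \frac{1}{\lambda}\,\ln\frac{1}{1-\lambda}, \qquad 0 < \lambda < 1 .
```

For example, λ = 0.5 gives s ≈ 1.39 and λ = 0.9 gives s ≈ 2.56, well below the corresponding unsuccessful costs of 2 and 10.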
24Performance
- The performance drops as λ increases
  - the higher the value of λ is, the higher the probability of collisions
- Unsuccessful search is more expensive than successful search
  - unsuccessful search exhausts the hash sequence
25Experimental Results
26Performance on Full Table
27Separate Chaining
- Keys hashing to the same hash value are stored in separate lists
  - one list per hash position
  - can store more than m records
  - easy to implement
  - the keys in each list can be ordered
28 h(key) = key mod m
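A minimal sketch of separate chaining with h(key) = key mod m. Python lists stand in for the linked lists, and the class name is illustrative.

```python
class ChainedTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one list per hash position

    def insert(self, key):
        chain = self.chains[key % self.m]
        if key not in chain:                   # keep keys unique
            chain.append(key)

    def search(self, key):
        return key in self.chains[key % self.m]

    def delete(self, key):
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)                  # no deletion marks are needed
            return True
        return False

t = ChainedTable(5)
for key in range(12):                          # 12 records on a table of size m = 5
    t.insert(key)
print(t.chains[2])  # [2, 7]
```

The table holds more than m records, and deleting a key simply removes it from its chain.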
29Performance of Separate Chaining
- Depends on the average chain size
  - insertions are independent events
  - let P(c,n,m) = probability that a position has been selected c times after n insertions on a table of size m
  - P(c,n,m) = probability that the chain has length c => binomial distribution
- $P(c,n,m) = \binom{n}{c} p^c q^{n-c}$, where p = 1/m (success case) and q = 1 - p (failure case)
30 => $P(c,n,m) \approx \frac{1}{c!}\,\lambda^c e^{-\lambda}$ => Poisson distribution
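The quality of the Poisson approximation can be checked numerically. This is a sketch with illustrative function names; the values n = 900 and m = 1000 (λ = 0.9) are chosen here as an example.

```python
from math import comb, exp, factorial

def binomial_pmf(c, n, m):
    """Probability that a given position is selected c times in n insertions (p = 1/m)."""
    p = 1.0 / m
    return comb(n, c) * p**c * (1.0 - p)**(n - c)

def poisson_pmf(c, lam):
    """Poisson approximation (1/c!) * lam^c * e^(-lam), with lam = n/m."""
    return lam**c * exp(-lam) / factorial(c)

n, m = 900, 1000                    # example values, load factor 0.9
for c in range(4):
    print(c, binomial_pmf(c, n, m), poisson_pmf(c, n / m))
```

For a large table the two columns agree closely, which is why the chain-length distribution can be treated as Poisson with parameter λ = n/m.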
31Unsuccessful Search
- The entire chain is searched
  - the average number of comparisons equals its average length u
32Successful Search
- Not the whole chain is searched
  - the average number of comparisons equals the length s of the chain at the time the key was inserted plus 1
  - the performance at the time a key was inserted equals that of unsuccessful search!
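The argument can be written out as a short derivation. This is a sketch using the standard textbook averaging, assuming each new key is appended to its chain and every position is equally likely. The average chain length on a table of size m with n keys is n/m, and the i-th inserted key found on average (i - 1)/m keys already in its chain:

```latex
u = \frac{n}{m} = \lambda, \qquad
s = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
  = 1 + \frac{n-1}{2m} \approx 1 + \frac{\lambda}{2} .
```

For λ = 1 this gives about 1 comparison for an unsuccessful search and about 1.5 for a successful one.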
33Performance
- The performance drops with the length of the chains
  - worst case: all keys are stored in a single chain
  - worst case performance: O(N)
- Unsuccessful search performs better than successful search!! WHY?
- No problem with deletions!!
34Coalesced Hashing
- The hash sequence is implemented as a linked list within the hash table
  - no rehash function
  - the next hash position is the next available position in the linked list
  - extra space for the list
h(key) = key mod 10; keys: 19, 29, 49, 59
35 initially avail = 9; h(key) = key mod 10; keys: 14, 29, 34, 28, 42, 39, 84, 38
[Figure: the table after initialization, with the avail pointer. The table holds the lists of rehashing positions and the list of empty positions.]
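One possible realization of coalesced hashing for these keys, as a sketch. It assumes that avail starts at position 9 and moves downward to the next empty position, and that a colliding key is linked to the end of the list starting at its hash position. The figure's exact layout may differ, and all names are illustrative.

```python
class CoalescedTable:
    def __init__(self, m=10):
        self.m = m
        self.keys = [None] * m
        self.link = [-1] * m          # -1 marks the end of a list
        self.avail = m - 1            # candidate empty position, scanned downward

    def search(self, key):
        i = key % self.m
        if self.keys[i] is None:
            return None
        while i != -1:                # follow the linked list inside the table
            if self.keys[i] == key:
                return i
            i = self.link[i]
        return None

    def insert(self, key):
        i = key % self.m
        if self.keys[i] is None:      # hash position is free
            self.keys[i] = key
            return i
        while True:                   # walk to the end of the list
            if self.keys[i] == key:
                return i
            if self.link[i] == -1:
                break
            i = self.link[i]
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1           # find the next empty position
        if self.avail < 0:
            raise OverflowError("hash table is full")
        j = self.avail
        self.keys[j] = key
        self.link[i] = j              # append the new position to the list
        return j

t = CoalescedTable(10)
positions = [t.insert(k) for k in (14, 29, 34, 28, 42, 39, 84, 38)]
print(positions)  # [4, 9, 8, 7, 2, 6, 5, 3]
```

Note how key 28 hashes to position 8, which already holds 34 from another list, so the two lists coalesce into one.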
36Performance of Coalesced Hashing
- Unsuccessful search (probes/search)
- Successful search (probes/search)