Hash Tables 1 - PowerPoint PPT Presentation

Provided by: cisJu
1
Hash Tables 1
2
Dictionary
  • Dictionary
    • Dynamic-set data structure for storing items indexed by keys.
    • Supports the operations Insert, Search, and Delete.
  • Applications
    • Symbol table of a compiler.
    • Memory-management tables in operating systems.
    • Large-scale distributed systems.
  • Hash Tables
    • An effective way of implementing dictionaries.
    • A generalization of ordinary arrays.

3
Direct-address Tables
  • Direct-address tables are ordinary arrays.
  • They facilitate direct addressing.
  • The element whose key is k is obtained by indexing into the kth position of the array.
  • Applicable when we can afford to allocate an array with one position for every possible key, i.e., when the universe of keys U is small.
  • Dictionary operations can be implemented to take O(1) time.
  • Details in Sec. 11.1.
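The O(1) operations above can be sketched in a few lines. This is a minimal illustrative Python sketch (the class and method names are my own, not from the slides): one array slot per possible key in {0, …, m-1}.

```python
class DirectAddressTable:
    # Direct addressing: one array slot for every possible key 0..m-1.
    def __init__(self, m):
        self.slots = [None] * m

    def insert(self, key, value):   # O(1): index directly by key
        self.slots[key] = value

    def search(self, key):          # O(1)
        return self.slots[key]

    def delete(self, key):          # O(1)
        self.slots[key] = None

t = DirectAddressTable(10)
t.insert(7, "seven")
print(t.search(7))  # seven
```

The cost is space: the array must be as large as the whole universe U, which is exactly why hashing is introduced next.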

4
Hash Tables
  • Notation
  • U: universe of all possible keys.
  • K: set of keys actually stored in the dictionary.
  • |K| = n.
  • When U is very large, arrays are not practical.
  • |K| << |U|.
  • Use a table of size proportional to |K|: the hash table.
  • However, we lose the direct-addressing ability.
  • Define functions that map keys to slots of the hash table.

5
Hash Tables
  • Let U be the universe of keys and let the hash table be an array of size m. A hash function h is a function from U to {0, ..., m-1}, that is, h: U → {0, ..., m-1}.

[Figure: keys k1, ..., k6 drawn from U (universe of keys) mapped into slots 0-7 of the table: h(k2) = 2, h(k1) = h(k3) = 3, h(k6) = 5, h(k4) = 7.]
6
Hash Tables Example
For example, if we hash keys in the range 0-1000 into a hash
table with 5 entries and use h(key) = key mod 5,
we get the following sequence of events:
Insert 21 → slot 1.
Insert 54 → slot 4.
Inserting any further key congruent to 4 mod 5 (e.g., 24) gives a collision at array entry 4.
7
Hashing
  • Hash function h: a mapping from U to the slots of a hash table T[0..m-1].
  • h: U → {0, 1, ..., m-1}
  • With arrays, key k maps to slot A[k].
  • With hash tables, key k maps, or "hashes", to slot T[h(k)].
  • h(k) is the hash value of key k.

8
Hashing
[Figure: keys k1, ..., k5 from K (actual keys), a subset of U (universe of keys), hashed into slots 0 to m-1; h(k2) = h(k5) produces a collision.]
9
Issues with Hashing
  • Multiple keys can hash to the same slot: collisions are possible.
  • Design hash functions such that collisions are minimized.
  • But avoiding collisions altogether is impossible.
  • Design collision-resolution techniques.
  • Search will cost Θ(n) time in the worst case.
  • However, all operations can be made to have an expected complexity of Θ(1).

10
Methods of Resolution
  • Chaining
  • Store all elements that hash to the same slot in
    a linked list.
  • Store a pointer to the head of the linked list in
    the hash table slot.
  • Open Addressing
  • All elements stored in hash table itself.
  • When collisions occur, use a systematic
    (consistent) procedure to store elements in free
    slots of the table.

[Figure: chained hash table with slots 0 to m-1; keys k1, ..., k8 stored in linked lists hanging off their slots.]
11
Collision Resolution by Chaining
[Figure: collision resolution by chaining: h(k1) = h(k4), h(k2) = h(k5) = h(k6), h(k3) = h(k7); each group of colliding keys shares one slot, marked X, from 0 to m-1.]
12
Collision Resolution by Chaining
[Figure: the same table with the linked lists shown explicitly: k1 and k4 in one chain, k2, k5, and k6 in another, k3 and k7 in a third, and k8 alone in its slot.]
13
Hashing with Chaining
  • What is the running time to insert/search/delete?
  • Insert: takes O(1) time to compute the hash function and insert at the head of the linked list.
  • Search: proportional to the maximum linked-list length.
  • Delete: same as search.
  • Therefore, in the unfortunate event that we have a bad hash function, all n keys may hash to the same table entry, giving an O(n) run time!
  • So how can we create a good hash function?

14
Hashing with Chaining
  • Dictionary Operations
  • Chained-Hash-Insert(T, x)
  • Insert x at the head of list T[h(key[x])].
  • Worst-case complexity: O(1).
  • Chained-Hash-Delete(T, x)
  • Delete x from the list T[h(key[x])].
  • Worst-case complexity: proportional to the length of the list with singly linked lists; O(1) with doubly linked lists.
  • Chained-Hash-Search(T, k)
  • Search for an element with key k in list T[h(k)].
  • Worst-case complexity: proportional to the length of the list.
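The three chained operations can be sketched as a small Python class. This is a minimal sketch, not the book's pseudocode; the names and the division-method hash are my own choices, and chains are plain Python lists of (key, value) pairs.

```python
class ChainedHashTable:
    # Collision resolution by chaining: each slot holds a list of
    # (key, value) pairs whose keys hash to that slot.
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        # Division-method hash (an assumption; any h: U -> {0..m-1} works).
        return key % self.m

    def insert(self, key, value):
        # O(1): prepend to the head of the chain.
        self.table[self._h(key)].insert(0, (key, value))

    def search(self, key):
        # Proportional to the chain length.
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # Proportional to the chain length (singly linked behavior).
        chain = self.table[self._h(key)]
        self.table[self._h(key)] = [(k, v) for k, v in chain if k != key]

t = ChainedHashTable(5)
t.insert(21, "a")
t.insert(54, "b")
t.insert(24, "c")   # 24 mod 5 == 4: collides with 54, chained in slot 4
print(t.search(54))  # b
```

Insert is O(1) because it never scans the chain; search and delete walk the chain, which is where the Θ(1 + α) expected cost comes from.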

15
Analysis on Chained-Hash-Search
  • Load factor α = n/m: average number of keys per slot.
  • m: number of slots.
  • n: number of elements stored in the hash table.
  • Worst-case complexity: Θ(n), plus the time to compute h(k).
  • The average case depends on how h distributes keys among the m slots.
  • Assume:
  • Simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes to.
  • O(1) time to compute h(k).
  • Time to search for an element with key k is proportional to the length of T[h(k)].
  • Expected length of a linked list = load factor α = n/m.

16
Expected Cost of an Unsuccessful Search
Theorem: An unsuccessful search takes expected time Θ(1 + α).
  • Proof:
  • Any key not already in the table is equally likely to hash to any of the m slots.
  • To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)], whose expected length is α.
  • Adding the time to compute the hash function, the total time required is Θ(1 + α).

17
Expected Cost of a Successful Search
Theorem: A successful search takes expected time Θ(1 + α).
  • Proof:
  • The probability that a list is searched is proportional to the number of elements it contains.
  • Assume that the element being searched for is equally likely to be any of the n elements in the table.
  • The number of elements examined during a successful search for an element x is 1 more than the number of elements that appear before x in x's list.
  • These are the elements inserted after x was inserted.
  • Goal:
  • Find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.

18
Expected Cost of a Successful Search
Theorem: A successful search takes expected time Θ(1 + α).
  • Proof (contd.):
  • Let xi be the i-th element inserted into the table, and let ki = key[xi].
  • Define indicator random variables Xij = I{h(ki) = h(kj)}, for all i, j.
  • Simple uniform hashing ⇒ Pr{h(ki) = h(kj)} = 1/m ⇒ E[Xij] = 1/m.
  • The expected number of elements examined in a successful search is
    E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} Xij)],
    where the inner sum Σ_{j=i+1..n} Xij counts the elements inserted after xi into the same slot as xi.
19
Proof Contd.
(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} E[Xij])   (linearity of expectation)
= (1/n) Σ_{i=1..n} (1 + (n - i)/m)
= 1 + (n - 1)/(2m)
= 1 + α/2 - α/(2n).
Expected total time for a successful search
= time to compute the hash function + time to search
= O(2 + α/2 - α/(2n)) = O(1 + α).
20
Expected Cost Interpretation
  • If n = O(m), then α = n/m = O(m)/m = O(1).
  • ⇒ Searching takes constant time on average.
  • Insertion is O(1) in the worst case.
  • Deletion takes O(1) worst-case time when lists are doubly linked.
  • Hence, all dictionary operations take O(1) time on average with hash tables with chaining.

21
Good Hash Functions
  • Satisfy the assumption of simple uniform hashing.
  • It is not possible to satisfy the assumption in practice.
  • We often use heuristics, based on the domain of the keys, to create a hash function that performs well.
  • Regularity in the key distribution should not affect uniformity: the hash value should be independent of any patterns that might exist in the data.
  • E.g., suppose each key is drawn independently from U according to a probability distribution P. Then we want
    Σ_{k: h(k) = j} P(k) = 1/m for j = 0, 1, ..., m-1.
  • An example is the division method.

22
Keys as Natural Numbers
  • Hash functions assume that the keys are natural numbers.
  • When they are not, we have to interpret them as natural numbers.
  • Example: interpret a character string as an integer expressed in some radix notation. Suppose the string is "CLRS":
  • ASCII values: C = 67, L = 76, R = 82, S = 83.
  • There are 128 basic ASCII values.
  • So, "CLRS" = 67·128³ + 76·128² + 82·128¹ + 83·128⁰ = 141,764,947.
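The radix interpretation above is just Horner's rule applied to the character codes. A short Python sketch (the function name is my own):

```python
def string_to_key(s, radix=128):
    # Interpret a character string as an integer in the given radix,
    # using each character's ASCII code as one digit (Horner's rule).
    key = 0
    for ch in s:
        key = key * radix + ord(ch)
    return key

print(string_to_key("CLRS"))  # 141764947
```

Any hash function for natural numbers (division method, multiplication method) can then be applied to the resulting integer.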

23
Division Method
  • Map a key k into one of the m slots by taking the remainder of k divided by m. That is,
  • h(k) = k mod m
  • Example: m = 31 and k = 78 ⇒ h(k) = 16.
  • Advantage: fast, since it requires just one division operation.
  • Disadvantage: we have to avoid certain values of m.
  • Don't pick certain values, such as m = 2^p,
  • or the hash won't depend on all the bits of k.
  • Good choice for m:
  • primes not too close to a power of 2 (or 10) are good.

24
Multiplication Method
  • If 0 < A < 1, h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA - ⌊kA⌋)⌋,
  • where "kA mod 1" means the fractional part of kA, i.e., kA - ⌊kA⌋.
  • Disadvantage: slower than the division method.
  • Advantage: the value of m is not critical.
  • m is typically chosen as a power of 2, i.e., m = 2^p, which makes the implementation easy.
  • Example: m = 1000, k = 123, A ≈ 0.6180339887...
  • h(k) = ⌊1000 · (123 · 0.6180339887 mod 1)⌋
  • = ⌊1000 · 0.0181806...⌋ = 18.

25
Multiplication Mthd. Implementation
  • Choose m = 2^p, for some integer p.
  • Let the word size of the machine be w bits.
  • Assume that k fits into a single word (k takes w bits).
  • Let s be an integer with 0 < s < 2^w (s takes w bits).
  • Restrict A to be of the form s/2^w.
  • Then k · s = k · A · 2^w = r1 · 2^w + r0.
  • r1 holds the integer part of kA (⌊kA⌋) and r0 holds the fractional part of kA (kA mod 1 = kA - ⌊kA⌋), scaled by 2^w.
  • We don't care about the integer part of kA,
  • so we just use r0 and forget about r1.

26
Multiplication Mthd Implementation
[Figure: the w-bit key k multiplied by s = A·2^w gives a 2w-bit product r1·2^w + r0; h(k) is extracted as the p most significant bits of r0, the part just right of the binary point.]
  • We want ⌊m (kA mod 1)⌋. We could get that by shifting r0 to the left by p = lg m bits and then taking the p bits that were shifted to the left of the binary point.
  • But we don't need to shift: just take the p most significant bits of r0.
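The word-level recipe above uses only integer multiplication, masking, and a shift. A Python sketch under the stated assumptions (w = 32, m = 2^9 = 512 to match the k = 123456 example on a later slide; s = ⌊A·2^32⌋ = 2654435769 for Knuth's A, which is my instantiation, not a value given in the slides):

```python
def hash_mul_bits(k, s, w=32, p=9):
    # Multiplication method with m = 2^p, A = s/2^w, using only integer ops.
    # k*s = r1*2^w + r0; h(k) is the p most significant bits of r0.
    mask = (1 << w) - 1
    r0 = (k * s) & mask        # low word: fractional part of k*A, scaled by 2^w
    return r0 >> (w - p)       # p most significant bits of r0

s = 2654435769                 # floor(((sqrt(5) - 1) / 2) * 2**32)
print(hash_mul_bits(123456, s))  # 2
```

No division and no floating point are needed, which is why this variant is the one used in practice when m is a power of 2.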

27
How to choose A?
  • How do we choose A?
  • The multiplication method works with any legal value of A.
  • But it works better with some values than with others, depending on the keys being hashed.
  • Knuth suggests using A ≈ (√5 - 1)/2 ≈ 0.618.

28
Multiplication Method
  • We choose m to be a power of 2 (m = 2^p) and A of the form s/2^w.
  • For example, with k = 123456 and m = 512 (so p = 9), h(k) is the top 9 bits of the fractional part of kA.

29
Multiplication Method Implementation
30
Drawback of Chaining
  • Drawbacks of Separate Chaining
  • The new operator takes a long time to allocate memory in some languages.
  • We are basically using two data structures: an array and linked lists.
  • Therefore, separate-chained hash tables, although useful, are not used widely.

31
Open Addressing
  • Open addressing means that when a collision occurs at a certain location, we try alternate locations until an empty location is found.
  • As opposed to separate chaining, we now maintain only one table (array); there are no associated lists at each array index.
  • Alternate locations are found by using a collision resolution strategy, denoted by a function f().

32
Hash Functions Using Collision Resolution
Strategy
  • Using a collision resolution strategy, the hash function gets modified to hi(x):
  • hi(x) = (hash(x) + f(i)) mod tableSize.
  • Here:
  • hi(x): new hash function
  • hash(x): old hash function, probably something like hash(x) = x mod tableSize
  • f(i): collision resolution strategy

33
Collision Resolution Strategy (contd.)
  • i denotes the number of attempts made by the collision resolution strategy. When a collision occurs and we try to find an empty location (using the collision resolution strategy) for the first time, i = 1. If this first attempt fails, we try to find an empty location a second time, at which point i = 2, and so on.
  • You must have noticed that the collision resolution strategy should be a function of i (the number of the attempt). That is why the collision resolution function is denoted f(i).

34
Hash Tables and Collision Resolution
  • Some characteristics of hash tables with open-addressing collision resolution:
  • All data goes inside the table, so a larger table is required.
  • λ ≤ 0.5 for open addressing.
  • We will now investigate different collision resolution strategies. In other words, we will take various functions for f(i) and see how the hash table performs.

35
Collision Resolution Strategy 1 Linear Probing
  • In linear probing f is linear: f(i) = i.
  • This means that when there is a collision, we try successive locations starting from the location of the collision until we find an empty location.

36
Linear Probing Example
  • Example: insert the following data into a hash table using linear probing as the collision resolution strategy. Assume tableSize = 10.
  • 17 26 38 9 7 66 11
  • Unless otherwise stated, we will assume that the original hash function is
  • hash(x) = x mod tableSize = x mod 10
  • Since we are using linear probing, we have f(i) = i.
  • Let us now compute hi(x) for each of the input data and place them inside the array.

37
Linear Probing Example (contd.)
  • h0(17) = hash(17) + f(0) = (17 mod 10) + 0 = 7 (remember that f(0) = 0).
  • Location 7 is currently empty, so there is no collision and 17 is entered into the table.
  • Similarly, 26, 38, and 9 do not create any collisions and are entered into the table.
  • The diagram of the table after these four insertions is shown on the next slide.

38
Linear Probing Example (contd.)
[Figure: table after inserting 17, 26, 38, 9: index 6 → 26, 7 → 17, 8 → 38, 9 → 9.]
39
Linear Probing Example (contd.)
  • The next data item to be inserted is 7.
  • h0(7) = hash(7) + f(0) = 7 mod 10 = 7. Index 7 of the array is already occupied by 17.
  • So we have a collision, and we have to use the collision resolution strategy to find an empty location to insert 7.
  • Since this is our first attempt to find an empty location, i = 1.
  • Since we are using linear probing, f(i) = i, so f(1) = 1 and h1(7) = (hash(7) + 1) mod 10 = (7 + 1) mod 10 = 8 mod 10 = 8.

40
Linear Probing Example (contd.)
  • However, location 8 is already occupied by 38
  • So we have to use collision resolution once
    again, now with i2
  • Since we are using linear probing, f(i)i, so
    f(2) 2 and h2(7) (hash(7) 2) mod 10 (7
    2) mod 10 9 mod 10 9

41
Linear Probing Example (contd.)
  • However, location 9 is also occupied by 9
  • So we have to use collision resolution once
    again, now with i3
  • Since we are using linear probing, f(i)i, so
    f(3) 3 and h3(7) (hash(7) 3) mod 10 (7
    3) mod 10 10 mod 10 0
  • Location 0 is empty and so, we insert 7 at index 0

42
Linear Probing Example (contd.)
  • The next data to be inserted is 66
  • h0(66) hash(66) f(0) 66 mod 10 6
  • Location 6 is already occupied by 26. So we get a
    collision
  • We have to use the collision resolution strategy
    with linear probing as we did while inserting 7
  • Solve this as we did on the last few slides with
    insert of 7

43
Linear Probing Example (contd.)
  • 66 will collide 5 times and will get inserted at location 1.
  • Next, we have to insert 11.
  • Once again we get a collision, but 11 can be inserted after the first collision (at location 2).
  • Verify the insertion of 11 into the hash table as we did in the example before.
  • The diagram of the hash table after all the inserts is given on the next slide.

44
Linear Probing Example (contd.)
[Figure: table after all insertions: index 0 → 7, 1 → 66, 2 → 11, 6 → 26, 7 → 17, 8 → 38, 9 → 9.]
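The whole worked example can be reproduced with a short Python sketch of linear probing (the function name is mine; it returns the slot used, and raises if the table is full):

```python
def linear_probe_insert(table, x):
    # Linear probing: h_i(x) = (hash(x) + i) mod tableSize, f(i) = i,
    # with hash(x) = x mod tableSize.
    size = len(table)
    for i in range(size):
        slot = (x % size + i) % size
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("table is full")

table = [None] * 10
for x in (17, 26, 38, 9, 7, 66, 11):
    linear_probe_insert(table, x)
print(table)  # [7, 66, 11, None, None, None, 26, 17, 38, 9]
```

The printed table matches the slides: 7 lands at index 0 after probing 7, 8, 9; 66 lands at index 1 after five collisions; 11 lands at index 2 after one.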
45
Drawbacks of Linear Probing
  • The time to find an empty cell can be quite large. For example, we had to inspect 4 locations before we found an empty location to insert 7. The same problem was encountered while inserting 66.
  • The hash table can be relatively empty while pockets of occupied cells form. For example, when we inserted 66 the lower part of the hash table was full but the upper part was entirely empty.
  • Primary clustering: several attempts are required to resolve a collision. For example, 66 collided as many as 5 times while being inserted.

46
Collision Resolution Strategy 2 Quadratic
Probing
  • In quadratic probing f(i) = i². All other techniques remain similar to linear probing.
  • Example: insert the following data into a hash table using quadratic probing as the collision resolution strategy. Assume tableSize = 10.
  • 17 26 38 9 7 66 11
  • As in the previous example, 17, 26, 38, and 9 get inserted without any collisions.

47
Quadratic Probing Example
  • When we try to insert 7, hash(7) = 7 mod 10 = 7. Location 7 is already occupied, so we get a collision.
  • We now try to find an empty location using the collision resolution strategy of quadratic probing. Since we are trying to find an empty location for the first time, i = 1.

48
Quadratic Probing Example (contd.)
  • Since we are using quadratic probing now,
    f(i)i2, so f(1) 12 1 and h1(7) (hash(7)
    1) mod 10 (7 1) mod 10 8 mod 10 8
  • Location 8 is already occupied, so we try another
    collision resolution now with i2
  • Since we are using quadratic probing now,
    f(i)i2, so f(2) 22 4 and h2(7) (hash(7)
    4) mod 10 (7 4) mod 10 11 mod 10 1

49
Quadratic Probing Example (contd.)
  • Location 1 is empty, so we insert 7 there.
  • Notice that we had far fewer collisions while inserting 7 with quadratic probing than we had with linear probing.

50
Quadratic Probing Example (contd.)
  • Let us now try to insert the next data item, 66.
  • We get a collision at location 6, and we use quadratic probing to find an empty cell.
  • The cells probed by quadratic probing are:
  • with i = 1, location 7: a collision again;
  • with i = 2, location 0: empty, so 66 is inserted here.
  • Once again, notice that we had fewer collisions than with linear probing.

51
Quadratic Probing Example (contd.)
  • Please solve the insertion of 11 by yourself
  • The diagram of the hash table after all the data
    has been inserted is given on the next slide

52
Quadratic Probing Example (contd.)
[Figure: table after all insertions: index 0 → 66, 1 → 7, 2 → 11, 6 → 26, 7 → 17, 8 → 38, 9 → 9.]
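Changing f(i) from i to i² is a one-line change to the linear-probing sketch. A minimal Python version (the function name is mine; note that, unlike linear probing, this loop can fail to find a free slot even when the table is not full):

```python
def quadratic_probe_insert(table, x):
    # Quadratic probing: h_i(x) = (hash(x) + i*i) mod tableSize, f(i) = i^2.
    size = len(table)
    for i in range(size):
        slot = (x % size + i * i) % size
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 10
for x in (17, 26, 38, 9, 7, 66, 11):
    quadratic_probe_insert(table, x)
print(table)  # [66, 7, 11, None, None, None, 26, 17, 38, 9]
```

The result matches the slides: 7 now lands at index 1 (two probes), 66 at index 0 (two probes), and 11 at index 2.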
53
Quadratic Probing Problem 1
  • There is no guarantee of finding an empty cell once the table is more than half full (see the proof on page 92; this proof is not required for the exam).

54
Quadratic Probing Problem 2
  • Standard deletion cannot be used.
  • To understand this, let us see how we find 66 in the hash table given on slide 25.
  • hash(66) = 66 mod 10 = 6.
  • Location 6 contains 26, which is not the data we are looking for.
  • This means that
  • either 66 is not in the hash table at all,
  • or 66 got stored somewhere else when we used quadratic probing to find an empty location while inserting it.

55
Quadratic Probing Problem 2 (contd.)
  • Since we just solved this example, we know that the second option is what actually happened.
  • However, the find routine does not know this.
  • So the find method has to visit each location that might have been visited by quadratic probing while inserting 66.

56
Quadratic Probing Problem 2 (contd.)
  • These locations are at a distance of 1, 4, 9, 16, 25, ... (everything being mod tableSize) from location 6 (the value returned by hash(66)).
  • Notice that these distances are i² from location 6, because quadratic probing uses f(i) = i² for i = 1, 2, 3, and so on.
  • So the find method looks at location (6 + 1) mod 10 = 7 and does not find 66.
  • Next, the find method looks at location (6 + 4) mod 10 = 10 mod 10 = 0 and finds 66.

57
Quadratic Probing Problem 2 (contd.)
  • However, what would have happened if we had deleted 26 first and then tried to find 66?
  • Since hash(66) = 66 mod 10 = 6, and location 6 is empty, the find method would have (wrongly) concluded that 66 is not in the table: the first location where 66 could have been inserted (location 6) is free, so 66 must never have been inserted.

58
Quadratic Probing Problem 2 (contd.)
  • The solution is to use a technique called lazy delete.
  • In lazy delete, along with each location we maintain a tag that is initially cleared.
  • When there is a collision while inserting, the tag is set.
  • Then quadratic (or some other) probing is used to locate an empty cell and insert the data.

59
Quadratic Probing Problem 2 (contd.)
  • With lazy delete, when we insert 66 we get a collision at location 6 (occupied by 26), and we set the tag for location 6.
  • Later on, if we delete 26, the tag remains set.
  • Now, when find encounters an empty location at location 6, it checks whether the tag is set.
  • Since the tag is set, find knows that there is other data that should have been in location 6 but got bumped to another location by the collision resolution strategy.

60
Quadratic Probing Problem 3
  • Suppose that a collision occurs while inserting at location x.
  • Then the locations that will be probed using quadratic probing are (x + 1) mod 10, (x + 4) mod 10, (x + 9) mod 10, (x + 16) mod 10, (x + 25) mod 10, and so on.

61
Quadratic Probing Problem 3 (contd.)
  • Let us substitute a value for x, say x = 5.
  • The successive locations probed by quadratic probing until an empty location is found are:
  • (5 + 1) mod 10 = 6
  • (5 + 4) mod 10 = 9
  • (5 + 9) mod 10 = 14 mod 10 = 4
  • (5 + 16) mod 10 = 21 mod 10 = 1
  • (5 + 25) mod 10 = 30 mod 10 = 0
  • (5 + 36) mod 10 = 41 mod 10 = 1
  • (5 + 49) mod 10 = 54 mod 10 = 4
  • (5 + 64) mod 10 = 69 mod 10 = 9
  • (5 + 81) mod 10 = 86 mod 10 = 6, and so on
  • Notice that some locations (1, 4, 6, 9) are probed repeatedly.
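The repeating probe sequence is easy to verify mechanically. A one-function Python sketch (the function name is mine):

```python
def quad_probes(x, table_size, attempts):
    # Slots visited by quadratic probing from slot x: (x + i^2) mod size.
    return [(x + i * i) % table_size for i in range(1, attempts + 1)]

print(quad_probes(5, 10, 9))  # [6, 9, 4, 1, 0, 1, 4, 9, 6]
```

With tableSize = 10, only the four residues 1, 4, 6, 9 (besides 0) ever appear after the first few probes, which is the clustering the slide describes; a prime tableSize avoids this short cycle.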

62
Secondary Clustering
  • This problem is called secondary clustering.
  • Secondary clustering: elements that hash to the same location always probe the same set of cells.
  • This is solved by the last collision resolution strategy we are going to study: double hashing.

63
Collision Resolution Strategy 3 Double Hashing
  • Here the probing function is f(i) = i · hash2(x).
  • hash2(x) is called the secondary hash function.
  • However, a bad choice of hash2(x) can really make matters worse.
  • Let us assume that tableSize = 10, as in the previous examples.

64
Bad Choice for hash2(x)
  • For example, suppose hash2(x) = x mod 7 and we try to insert 7.
  • hash(7) = 7 mod 10 = 7. Suppose that location 7 is already occupied, so there is a collision.
  • Now we use our collision resolution strategy with i = 1, f(i) = i · hash2(x). So f(1) = 1 · hash2(7) = 1 · (7 mod 7) = 0.
  • Therefore, h1(7) = 7.
  • In fact, f(2) also equals 0, and so h2(7) = 7.
  • So we are not going anywhere: we repeatedly probe location 7.

65
Good Choice for hash2(x)
  • An example of a good secondary hash function is hash2(x) = R - (x mod R), where R is a prime number < tableSize.
  • If tableSize = 10 (as in our example), R = 7 is a good choice.

66
Double Hashing Example
  • Example: insert the following data into a hash table using double hashing as the collision resolution strategy:
  • 89 18 49 58 69
  • 89 and 18 do not create any collisions and get inserted at locations 9 and 8 respectively.
  • h0(49) = (hash(49) + f(0)) mod 10 = (49 mod 10 + 0) mod 10 = 9. Location 9 is already occupied, so we get a collision.

67
Double Hashing Example (contd.)
  • hash2(49) = 7 - (49 mod 7) = 7 - 0 = 7.
  • So,
  • h1(49) = (hash(49) + f(1)) mod 10
  • = (49 mod 10 + 1 · hash2(49)) mod 10
  • = (9 + 7) mod 10 = 16 mod 10 = 6.
  • Location 6 is empty, and 49 is inserted there.

68
Double Hashing Example (contd.)
  • 58 and 69 also collide when we try to insert them, and each collision is resolved at the first attempt (with i = 1) using double hashing.
  • Verify that hash2(58) = 7 - (58 mod 7) = 7 - 2 = 5 and that 58 gets inserted at location 3.
  • Verify that hash2(69) = 7 - (69 mod 7) = 7 - 6 = 1 and that 69 gets inserted at location 0.
  • The hash table after all insertions is shown on the next slide.

69
Double Hashing Example Figure
[Figure: table after all insertions: index 0 → 69, 3 → 58, 6 → 49, 8 → 18, 9 → 89.]
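The double-hashing example can be checked with one more variant of the probing sketch. This assumes the key list 89, 18, 49, 58, 69 and the slide's hash2(x) = R - (x mod R) with R = 7 (the function name is mine):

```python
def double_hash_insert(table, x, R=7):
    # Double hashing: h_i(x) = (hash(x) + i * hash2(x)) mod tableSize,
    # with hash(x) = x mod tableSize and hash2(x) = R - (x mod R).
    size = len(table)
    h2 = R - (x % R)
    for i in range(size):
        slot = (x % size + i * h2) % size
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 10
for x in (89, 18, 49, 58, 69):
    double_hash_insert(table, x)
print(table)  # [69, None, None, 58, None, None, 49, None, 18, 89]
```

Because hash2 gives each key its own step size, keys that collide at the same slot follow different probe sequences, which is what defeats secondary clustering.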
70
Double Hashing Problem
  • To understand the problem, let us suppose that we are inserting 23 into the hash table on the last slide.
  • We get a collision at position 3, which is already occupied by 58.
  • Since we are using double hashing, hash2(23) = 7 - (23 mod 7) = 7 - 2 = 5.
  • So h1(23) = (hash(23) + 1 · hash2(23)) mod 10 = (3 + 1 · 5) mod 10 = 8 mod 10 = 8.
  • Position 8 is also occupied.

71
Double Hashing Problem (contd.)
  • So we try to find an empty space again using
    double hashing
  • h2(23) (hash(23) 2hash2(23)) mod 10 (3
    2 5) mod 10 13 mod 10 3
  • Location 3 is already occupied
  • We again try to find an empty space using double
    hashing
  • h3(23) (hash(23) 3hash2(23)) mod 10 (3
    3 5) mod 10 18 mod 10 8
  • Location 8 is already occupied and had already
    been probed while doing h1(23)

72
Double Hashing Problem (contd.)
  • In fact, if you try further attempts with i = 4, 5, 6 and so on, you will see that locations 3 and 8 get probed over and over.
  • The reason for this is that tableSize = 10 is not prime.
  • The solution to this problem is to make tableSize prime (e.g., 11 is a good choice for tableSize).

73
Double Hashing Ideal Secondary Hash Function
  • A properly selected secondary hash function hash2(x) ensures that the expected number of probes is close to that of a random collision resolution strategy.

74
Double Hashing vs. Linear and Quadratic Probing
  • Compared to double hashing, linear and quadratic probing are faster, because f(i) = i · hash2(x) takes longer to compute than f(i) = i or f(i) = i².

75
Rehashing
  • Rehashing tells us what to do when the hash table gets full.
  • Instead of waiting for the hash table to get completely full, it is more efficient to rehash when the table is about 70% or 80% full.
  • The most common rehashing technique is to construct a new table of approximately double the size of the original hash table.
  • Since the new table has a different size, tableSize gets a new value, and so a new hash function, hash(x) = x mod new_tableSize, has to be defined.

76
Rehashing Example
  • Example: insert 13, 15, 6, 24, 23 into an initially empty hash table. Assume tableSize = 7 and use linear probing for collision resolution. (The table is drawn in the book on pages 198-199; please see it.)
  • Since tableSize = 7, hash(x) = x mod 7.
  • After 23 is inserted, the hash table is over 70% full (5 of 7 slots occupied).
  • Rehash: new table size = 7 × 2 = 14. But 14 is not a prime number.
  • So we select the prime number closest to and greater than 14, i.e., 17, as the new tableSize.
  • The new hash function is now hash(x) = x mod 17.
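The rehashing procedure can be sketched end to end. This is an illustrative Python sketch (the helper names `insert_lp`, `next_prime`, and `rehash` are mine); it reinserts the old entries in index order, so the final layout assumes that order:

```python
def insert_lp(table, x):
    # Linear-probing insert with hash(x) = x mod len(table).
    size = len(table)
    i = x % size
    while table[i] is not None:
        i = (i + 1) % size
    table[i] = x

def next_prime(n):
    # Smallest prime >= n (trial division; fine for small table sizes).
    def is_prime(v):
        return v > 1 and all(v % d for d in range(2, int(v ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def rehash(table):
    # New size: the prime closest to and greater than double the old size.
    new_table = [None] * next_prime(2 * len(table) + 1)
    for x in table:              # O(N): every element is reinserted
        if x is not None:
            insert_lp(new_table, x)
    return new_table

table = [None] * 7
for x in (13, 15, 6, 24, 23):
    insert_lp(table, x)
table = rehash(table)            # new tableSize is 17
print({i: v for i, v in enumerate(table) if v is not None})
```

Running this reproduces the slide's setup: 7 × 2 = 14 is rounded up to the prime 17, and all five keys are reinserted under hash(x) = x mod 17.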

77
Rehashing (contd.)
  • All the data from the original table has to be inserted into the new table at new locations given by the new hash function (see page 199 of the book for the diagram).
  • Rehashing is a costly operation, and it happens frequently when the hash table is small and there are a lot of insertions.
  • The time required for rehashing is O(N), since N elements need to be rehashed from the original hash table into the new one.
  • Spread over the insertions, it therefore adds only a constant amortized cost to each insertion.

78
Other Rehashing Techniques
  • Rehash when the table is half full.
  • Rehash as soon as an insertion fails.
  • Rehash beyond a certain load factor λ.
  • Technique 2 above gives the best results, since performance degrades as λ increases.

79
Advantages of Rehashing
  • Frees the programmer from worrying about tableSize while inserting data.
  • Hash tables cannot be made arbitrarily large to start with in complex programs.
  • Rehashing can be used for other data structures as well (e.g., queues).