1. Hash Table and Hashing
- The dictionary structures (binary search trees, AVL trees, B-trees) discussed so far assume that we can only work with the input keys by comparing them. No other operation is considered.
- In practice, it is often possible to process the input keys to make insert/search/delete operations run faster!
- For example,
  - integers consist of digits;
  - strings consist of letters.
- We can process a key to map it to an integer, and use that integer as an array index.
2. Basic ideas
- In general,
  - the universe of keys is U = {u0, u1, ..., uN-1};
  - for each key, it should be relatively easy to compute its corresponding index.
- Supported operations:
  - Find.
  - Insert.
  - Delete. (Deletions may be unnecessary in some applications.)
3. Basic ideas
- Given a key, compute its corresponding index into a hash table.
- The index is computed using a hash function:
  hash_function(key) → index
[Figure: a key is fed to the hash function, which produces an index into the hash table.]
4. Basic ideas
- Unlike a binary search tree, AVL tree or B-tree, the following operations are hard to implement:
  - minimum and maximum,
  - successor and predecessor,
  - report data within a given range, and
  - list out the data in order.
5. Example Applications
- Compilers use hash tables (symbol tables) to keep track of declared variables.
- On-line spell checkers. After prehashing the entire dictionary, one can check each word in constant time and print out the misspelled words in order of their appearance in the document.
6. Unrealistic Hashing: bit vector
- Universe of keys U = {u0, u1, ..., uN-1}.
- Find(ui): test entry[i] → O(1) time
- Insert(ui): set entry[i] to 1 → O(1) time
- Delete(ui): set entry[i] to 0 → O(1) time
7. Unrealistic Hashing: bit vector
- Features:
  - Each operation takes constant time.
  - Simple implementation.
- The scheme wastes too much space if the universe of keys is large compared with the actual number of elements to be stored.
- For example, consider your student id. If we treat it as an 8-digit integer, then the universe size is 10^8, but we only have about 7000 students → around (10^8 - 7000) entries will be wasted.
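The bit-vector scheme above can be sketched in a few lines of C. This is a minimal illustration, not the slides' code: the universe size of 1000 and the function names are assumptions (a real 8-digit student-id universe would need 10^8 entries, which is exactly the space waste described).

```c
/* Direct-address ("bit vector") table: one flag per possible key.
 * UNIVERSE = 1000 is an assumed, deliberately small universe;
 * with 8-digit ids it would have to be 10^8 entries. */
enum { UNIVERSE = 1000 };
static char present[UNIVERSE];   /* present[u] != 0 iff key u is stored */

void da_insert(int u) { present[u] = 1; }      /* O(1) */
void da_delete(int u) { present[u] = 0; }      /* O(1) */
int  da_find  (int u) { return present[u]; }   /* O(1) */
```

Each operation is a single array access, which is why all three run in constant time.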
8. Hashing
- Let U = {K0, K1, ..., KN-1} be the universe of keys.
- Let T[0..m-1] be an array representing the hash table, where m is much smaller than N.
- The soul of the hashing scheme is the hash function
  h: key universe → {0, ..., m-1}
  that maps each key in the universe to an integer between 0 and m-1.
- For each key K, h(K) is called the hash value of K, and K is supposed to be stored at T[h(K)].
9Hashing
0
h(K1) ? index1
index1
h(K2) ? index2
index2
h(K3) ? index2
Two (or more) keys may get into the same
location !
m-1
Hash Table
10. Hashing
- What do we do if two (or more) keys have the same hash value?
- There are two aspects to this:
  - We should design the hash function so that it spreads the keys uniformly among the entries of T. This decreases the likelihood that two keys have the same hash value.
  - Nevertheless, since N > m, we still need a solution for when this event happens.
11. Example of a bad hash function
- Suppose that our keys are strings of letters.
- Let's say a letter equals its ASCII value:
  - Key = c_{n-1} c_{n-2} ... c_0
  - For example, 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122.
- Our hash function:
  h(K) = (c_{n-1} + c_{n-2} + ... + c_0) mod m
  (This simply adds the string's character values up and takes the modulus by m, where m is the size of the hash table.)
12. Example of a bad hash function
- Our hash function: h(K) = (c_{n-1} + ... + c_0) mod m.
- This hash function gives the same result for any permutation of the string:
  h("ABC") = h("CBA") = h("ACB")
- Keys cannot be spread uniformly → so it is not a good idea!
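The flaw is easy to demonstrate in C. The table size m = 101 is an assumed value for this sketch; the function name is hypothetical.

```c
/* The additive hash from the slide: sum of character codes, mod m.
 * M = 101 is an assumed table size for the demo. */
enum { M = 101 };

unsigned bad_hash(const char *s) {
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;  /* positions of letters are ignored */
    return sum % M;
}
```

Because addition is commutative, every permutation of a string collides, so anagram-heavy key sets pile up in a few slots.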
13. Improving the hash function
- We can improve the hash function so that the letters contribute differently according to their positions:
  h(K) = (c_{n-1}·r^{n-1} + c_{n-2}·r^{n-2} + ... + c_0·r^0) mod m
  (each character is weighted by r raised to its position)
- r is the radix:
  - Integers: r = 10
  - Bit strings: r = 2
  - Strings: r = 128
14. Improving the hash function
- We need to be careful about overflow, since we may raise the base to a large power.
- We can do all computation in modular arithmetic, taking the modulus at each step. For example:

  sum = 0;
  for (int j = n - 1; j >= 0; j--)
      sum = (sum * r + c[j]) % m;   /* Horner's rule; modulus at each step */

- In general, the hash table size m is chosen to be a prime number.
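The loop above can be packaged as a self-contained function. This is a sketch under assumed parameters: r = 128 and the prime m = 101 are illustration values, and the function name is hypothetical. Scanning the string left to right with Horner's rule computes the same value as summing c_j·r^j, while keeping every intermediate result below m·r.

```c
/* Positional (radix) string hash: (c_{n-1}*r^{n-1} + ... + c_0) mod m,
 * evaluated by Horner's rule with a reduction mod m at each step so the
 * running sum never overflows.  R = 128, M = 101 are assumed values. */
enum { R = 128, M = 101 };

unsigned radix_hash(const char *s) {
    unsigned sum = 0;
    for (; *s; s++)
        sum = (sum * R + (unsigned char)*s) % M;  /* modulus at each step */
    return sum;
}
```

Unlike the additive hash, permutations of a string now (usually) land in different slots.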
15. Collision Handling
- When the hash values of two keys are the same (i.e., the two keys need to be stored in the same location), we have a collision.
- Collisions can happen even if we design our hash function carefully, since N is much larger than m.
- There are two approaches to resolving collisions:
  - separate chaining -- we can convert the hash table into a table of linked lists;
  - open addressing -- we can relocate the key to a different entry in case of collision.
16. Collision Handling: Separate Chaining
- The hash table becomes a table of linked lists.
- To insert a key K, we compute h(K). If T[h(K)] contains a null pointer, we initialize this entry to point to a linked list that contains K alone. If T[h(K)] is a non-empty list, we just insert K at the beginning of this list.
17. Separate Chaining
- To delete a key K, we compute h(K), then search for K within the list at T[h(K)], and delete K if it is found.
- Assume that we will be storing n keys. Then we should make m the next prime number larger than n. If the hash function works well, the number of keys in each linked list will be a small constant. Therefore, we expect that each search, insert, and delete operation can be done in constant time.
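The chaining scheme of the last two slides can be sketched as follows. This is a minimal illustration, not the slides' code: integer keys, a tiny table of size M = 7, and h(K) = K mod M are all assumptions made for the demo.

```c
#include <stdlib.h>

/* Separate chaining: the table is an array of linked-list heads.
 * M = 7 and h(K) = K mod M are assumed for this sketch. */
enum { M = 7 };

struct node { int key; struct node *next; };
static struct node *table[M];          /* all heads start as NULL */

static int h(int k) { return ((k % M) + M) % M; }

void chain_insert(int k) {             /* insert at the head of the list */
    struct node *p = malloc(sizeof *p);
    p->key = k;
    p->next = table[h(k)];
    table[h(k)] = p;
}

int chain_find(int k) {                /* walk the list at T[h(k)] */
    for (struct node *p = table[h(k)]; p; p = p->next)
        if (p->key == k) return 1;
    return 0;
}

void chain_delete(int k) {             /* unlink the first matching node */
    for (struct node **pp = &table[h(k)]; *pp; pp = &(*pp)->next)
        if ((*pp)->key == k) {
            struct node *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
}
```

Inserting at the head keeps insertion O(1); find and delete cost the length of one chain, which stays a small constant if the keys are spread well.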
18. Collision Handling: Open Addressing
- Separate chaining has the disadvantage of using linked lists: memory allocation during linked-list manipulation slows the program down.
- An alternative method is to relocate the key K to be inserted if it collides with an existing key. That is, we store K at an entry different from h(K).
- Two issues arise:
  - What is the relocation scheme?
  - How do we search for K later?
- There are two common methods for resolving a collision in open addressing:
  - Linear probing
  - Double hashing
19. Linear Probing
- Insertion:
  - Let K be the new key to be inserted. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i) mod m;
    - if T[L] is empty, then we put K there and stop.
  - If we cannot find an empty entry to put K in, it means that the table is full and we should report an error.
20. Linear Probing
- Searching:
  - Let K be the key to be searched for. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i) mod m;
    - if T[L] contains K, then we are done and we can stop;
    - if T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, it would have been placed in T[L] by our insertion strategy.)
  - If we have not found K at the end of the for-loop, we have scanned the entire table and so we can report that K is not in the table.
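The insertion and search loops above translate directly into C. This is a sketch under assumed parameters: integer keys, table size M = 11, h(K) = K mod M, and a separate `used` flag array to mark empty slots.

```c
/* Linear probing: probe slots (h(K) + i) mod M for i = 0..M-1.
 * M = 11 and h(K) = K mod M are assumed for this sketch. */
enum { M = 11 };
static int  T[M];
static char used[M];                  /* used[L] != 0 iff T[L] holds a key */

static int h(int k) { return ((k % M) + M) % M; }

int lp_insert(int k) {                /* returns 0 on success, -1 if full */
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (!used[L]) { T[L] = k; used[L] = 1; return 0; }
    }
    return -1;                        /* table full: report an error */
}

int lp_find(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (!used[L]) return 0;       /* empty slot: K cannot be further on */
        if (T[L] == k) return 1;
    }
    return 0;                         /* scanned the entire table */
}
```

Note that search stops early at the first empty slot, exactly as argued on the slide: insertion would have placed the key there.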
21. Linear Probing: Example
[Worked example omitted in this version of the notes.]
22. Linear Probing: Primary Clustering
- We call a block of contiguously occupied table entries a cluster.
- On average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K is proportional to half the size of that cluster: the larger the cluster, the slower the performance.
[Figure: a cluster of contiguously occupied entries in the hash table.]
23. Linear Probing: Primary Clustering
- Linear probing has the following disadvantages:
  - Once h(K) falls into a cluster, this cluster will definitely grow in size by one. This may worsen the performance of insertions in the future.
  - If two clusters are separated by only one entry, then inserting one key into a cluster can merge the two clusters together, so the cluster size can increase drastically with a single insertion.
  - This means that the performance of insertion (and searching) can deteriorate drastically after a single insertion.
24. Double Hashing
- To alleviate the problem of primary clustering, when resolving a collision we should examine alternative positions in a more random fashion. To this end, we work with two hash functions, h and h2.
- Insertion:
  - Let K be the new key to be inserted. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i·h2(K)) mod m;
    - if T[L] is empty, then we put K there and stop.
  - If we cannot find an empty entry to put K in, it means that the table is full and we should report an error.
25. Double Hashing
- Searching:
  - Let K be the key to be searched for. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i·h2(K)) mod m;
    - if T[L] contains K, then we are done and we can stop;
    - if T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, it would have been placed in T[L] by our insertion strategy.)
  - If we have not found K at the end of the for-loop, we have scanned the entire table and so we can report that K is not in the table.
26. Double Hashing: Choice of h2
- For any key K, h2(K) must be relatively prime to the table size m. Otherwise, we will only be able to examine a fraction of the table entries.
  - For example, if h(K) = 0 and h2(K) = m/2, then we can only examine the entries T[0] and T[m/2], and nothing else!
- One solution is to make m prime, choose a prime r smaller than m, and set
  h2(K) = r - (K mod r).
27. Double Hashing: Example
[Worked example omitted in this version of the notes.]
28. Deletion in Open Addressing
- We cannot simply delete a key under open-addressing strategies.
- Otherwise, suppose that the table stores three keys K1, K2 and K3 that have identical probe sequences: K1 is stored at h(K1), K2 at the second probe location, and K3 at the third probe location.
- If K2 is to be deleted and we make the slot containing K2 empty, then when we search for K3, we will find an empty slot before finding K3.
- So we will report that K3 does not exist in the table!
29. Deletion in Open Addressing
- Instead, we add an extra bit to each entry to indicate whether the key stored there has been deleted.
- Such a delete bit serves two purposes:
  - searching: we should NOT stop there;
  - insertion: that position is logically empty (though the deleted key is still in the hash table), so we can overwrite this entry.
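The delete-bit idea can be sketched as follows. This is an illustration under assumptions: linear probing, integer keys, M = 11, and h(K) = K mod M; the "extra bit" is represented as a three-way slot state (FREE / OCCUPIED / DELETED), which is equivalent to a used flag plus a delete bit.

```c
/* Lazy deletion for open addressing (linear probing assumed).
 * A DELETED slot is skipped by search but reusable by insert. */
enum { M = 11 };
enum state { FREE, OCCUPIED, DELETED };

static int        T[M];
static enum state st[M];              /* all slots start FREE */

static int h(int k) { return ((k % M) + M) % M; }

int od_insert(int k) {                /* 0 on success, -1 if full */
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] != OCCUPIED) {      /* FREE or DELETED may be overwritten */
            T[L] = k; st[L] = OCCUPIED; return 0;
        }
    }
    return -1;
}

int od_find(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] == FREE) return 0;  /* stop only at a never-used slot */
        if (st[L] == OCCUPIED && T[L] == k) return 1;
        /* DELETED: keep probing past it */
    }
    return 0;
}

void od_delete(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] == FREE) return;    /* not present */
        if (st[L] == OCCUPIED && T[L] == k) { st[L] = DELETED; return; }
    }
}
```

With the K1/K2/K3 scenario from slide 28, deleting K2 marks its slot DELETED rather than FREE, so a later search still probes past it and finds K3.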