1. Hash Table and Hashing
- The dictionary structures (binary search trees, AVL trees, B-trees) discussed so far assume that we can only work with the input keys by comparing them. No other operation is considered.
- In practice, it is often possible to process the input keys to make insert/search/delete operations run faster!
- For example,
  - integers consist of digits;
  - strings consist of letters.
- We can process a key to map it to an integer, and use that integer as an array index.
2. Basic ideas
- In general,
  - the universe of keys is U = {u0, u1, ..., uN-1};
  - for each key, it should be relatively easy to compute its corresponding index.
- Supported operations:
  - Find.
  - Insert.
  - Delete. (Deletions may be unnecessary in some applications.)
3. Basic ideas
- Given a key, compute its corresponding index into a hash table.
- The index is computed using a hash function:
  hash_function(key) → index
[Figure: a key is fed to the hash function, which produces an index into the hash table.]
4. Basic ideas
- Unlike a binary search tree, AVL tree or B-tree, the following operations are hard to implement:
  - minimum and maximum,
  - successor and predecessor,
  - report data within a given range, and
  - list out the data in order.
5. Example Applications
- Compilers use hash tables (symbol tables) to keep track of declared variables.
- On-line spell checkers. After prehashing the entire dictionary, one can check each word in constant time and print out the misspelled words in order of their appearance in the document.
6. Unrealistic Hashing: bit vector
- Universe of keys U = {u0, u1, ..., uN-1}.
- Find(ui): test entry[i] → O(1) time
- Insert(ui): set entry[i] to 1 → O(1) time
- Delete(ui): set entry[i] to 0 → O(1) time
7. Unrealistic Hashing: bit vector
- Features:
  - Each operation takes constant time.
  - Simple implementation.
- The scheme wastes too much space if the universe of keys is large compared with the actual number of elements to be stored.
- For example, consider your student id. If we treat it as an 8-digit integer, then the universe size is 10^8, but we only have about 7000 students → around (10^8 - 7000) entries will be wasted.
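The bit-vector scheme above can be sketched in a few lines of C. This is a minimal illustration, not the slides' code: the universe size of 1000 and the function names are assumptions (a real 8-digit student-id universe would need 10^8 entries, which is exactly the space waste described).

```c
/* Direct-address ("bit vector") table: one flag per possible key.
 * UNIVERSE = 1000 is an assumed, deliberately small universe;
 * with 8-digit ids it would have to be 10^8 entries. */
enum { UNIVERSE = 1000 };
static char present[UNIVERSE];   /* present[u] != 0 iff key u is stored */

void da_insert(int u) { present[u] = 1; }      /* O(1) */
void da_delete(int u) { present[u] = 0; }      /* O(1) */
int  da_find  (int u) { return present[u]; }   /* O(1) */
```

Each operation is a single array access, which is why all three run in constant time.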
8. Hashing
- Let U = {K0, K1, ..., KN-1} be the universe of keys.
- Let T[0..m-1] be an array representing the hash table, where m is much smaller than N.
- The soul of the hashing scheme is the hash function
  h: key universe → {0, ..., m-1}
  that maps each key in the universe to an integer between 0 and m-1.
- For each key K, h(K) is called the hash value of K, and K is supposed to be stored at T[h(K)].
9Hashing
0
h(K1) ? index1
index1
h(K2) ? index2
index2
h(K3) ? index2
Two (or more) keys may get into the same
location !
m-1
Hash Table
10. Hashing
- What do we do if two (or more) keys have the same hash value?
- There are two aspects to this:
  - We should design the hash function so that it spreads the keys uniformly among the entries of T. This decreases the likelihood that two keys have the same hash value.
  - Nevertheless, since N > m, we still need a solution for when this event happens.
11. Example of a bad hash function
- Suppose that our keys are strings of letters.
- Let's say a letter equals its ASCII value:
  - Key = c_{n-1} c_{n-2} ... c_0
  - For example, 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122.
- Our hash function:
  h(K) = (c_{n-1} + c_{n-2} + ... + c_0) mod m
  (This simply adds the string's character values up and takes the modulus by m, where m is the size of the hash table.)
12. Example of a bad hash function
- Our hash function: h(K) = (c_{n-1} + ... + c_0) mod m.
- This hash function gives the same result for any permutation of the string:
  h("ABC") = h("CBA") = h("ACB")
- Keys cannot be spread uniformly → so it is not a good idea!
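The flaw is easy to demonstrate in C. The table size m = 101 is an assumed value for this sketch; the function name is hypothetical.

```c
/* The additive hash from the slide: sum of character codes, mod m.
 * M = 101 is an assumed table size for the demo. */
enum { M = 101 };

unsigned bad_hash(const char *s) {
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;  /* positions of letters are ignored */
    return sum % M;
}
```

Because addition is commutative, every permutation of a string collides, so anagram-heavy key sets pile up in a few slots.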
13. Improving the hash function
- We can improve the hash function so that the letters contribute differently according to their positions:
  h(K) = (c_{n-1}·r^{n-1} + c_{n-2}·r^{n-2} + ... + c_0·r^0) mod m
  (each character is weighted by r raised to its position)
- r is the radix:
  - Integers: r = 10
  - Bit strings: r = 2
  - Strings: r = 128
14. Improving the hash function
- We need to be careful about overflow, since we may raise the base to a large power.
- We can do all computation in modular arithmetic, taking the modulus at each step. For example:

  sum = 0;
  for (int j = n - 1; j >= 0; j--)
      sum = (sum * r + c[j]) % m;   /* Horner's rule; modulus at each step */

- In general, the hash table size m is chosen to be a prime number.
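The loop above can be packaged as a self-contained function. This is a sketch under assumed parameters: r = 128 and the prime m = 101 are illustration values, and the function name is hypothetical. Scanning the string left to right with Horner's rule computes the same value as summing c_j·r^j, while keeping every intermediate result below m·r.

```c
/* Positional (radix) string hash: (c_{n-1}*r^{n-1} + ... + c_0) mod m,
 * evaluated by Horner's rule with a reduction mod m at each step so the
 * running sum never overflows.  R = 128, M = 101 are assumed values. */
enum { R = 128, M = 101 };

unsigned radix_hash(const char *s) {
    unsigned sum = 0;
    for (; *s; s++)
        sum = (sum * R + (unsigned char)*s) % M;  /* modulus at each step */
    return sum;
}
```

Unlike the additive hash, permutations of a string now (usually) land in different slots.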
15. Collision Handling
- When the hash values of two keys are the same (i.e., the two keys need to be stored in the same location), we have a collision.
- Collisions can happen even if we design our hash function carefully, since N is much larger than m.
- There are two approaches to resolving collisions:
  - separate chaining -- we can convert the hash table into a table of linked lists;
  - open addressing -- we can relocate the key to a different entry in case of collision.
16. Collision Handling: Separate Chaining
- The hash table becomes a table of linked lists.
- To insert a key K, we compute h(K). If T[h(K)] contains a null pointer, we initialize this entry to point to a linked list that contains K alone. If T[h(K)] is a non-empty list, we just insert K at the beginning of this list.
17. Separate Chaining
- To delete a key K, we compute h(K), then search for K within the list at T[h(K)], and delete K if it is found.
- Assume that we will be storing n keys. Then we should make m the next prime number larger than n. If the hash function works well, the number of keys in each linked list will be a small constant. Therefore, we expect that each search, insert, and delete operation can be done in constant time.
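The chaining scheme of the last two slides can be sketched as follows. This is a minimal illustration, not the slides' code: integer keys, a tiny table of size M = 7, and h(K) = K mod M are all assumptions made for the demo.

```c
#include <stdlib.h>

/* Separate chaining: the table is an array of linked-list heads.
 * M = 7 and h(K) = K mod M are assumed for this sketch. */
enum { M = 7 };

struct node { int key; struct node *next; };
static struct node *table[M];          /* all heads start as NULL */

static int h(int k) { return ((k % M) + M) % M; }

void chain_insert(int k) {             /* insert at the head of the list */
    struct node *p = malloc(sizeof *p);
    p->key = k;
    p->next = table[h(k)];
    table[h(k)] = p;
}

int chain_find(int k) {                /* walk the list at T[h(k)] */
    for (struct node *p = table[h(k)]; p; p = p->next)
        if (p->key == k) return 1;
    return 0;
}

void chain_delete(int k) {             /* unlink the first matching node */
    for (struct node **pp = &table[h(k)]; *pp; pp = &(*pp)->next)
        if ((*pp)->key == k) {
            struct node *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
}
```

Inserting at the head keeps insertion O(1); find and delete cost the length of one chain, which stays a small constant if the keys are spread well.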
18. Collision Handling: Open Addressing
- Separate chaining has the disadvantage of using linked lists: memory allocation during linked-list manipulation slows the program down.
- An alternative method is to relocate the key K to be inserted if it collides with an existing key. That is, we store K at an entry different from h(K).
- Two issues arise:
  - What is the relocation scheme?
  - How do we search for K later?
- There are two common methods for resolving a collision in open addressing:
  - Linear probing
  - Double hashing
19. Linear Probing
- Insertion:
  - Let K be the new key to be inserted. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i) mod m;
    - if T[L] is empty, then we put K there and stop.
  - If we cannot find an empty entry to put K in, it means that the table is full and we should report an error.
20. Linear Probing
- Searching:
  - Let K be the key to be searched for. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i) mod m;
    - if T[L] contains K, then we are done and we can stop;
    - if T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, it would have been placed in T[L] by our insertion strategy.)
  - If we have not found K at the end of the for-loop, we have scanned the entire table and so we can report that K is not in the table.
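The insertion and search loops above translate directly into C. This is a sketch under assumed parameters: integer keys, table size M = 11, h(K) = K mod M, and a separate `used` flag array to mark empty slots.

```c
/* Linear probing: probe slots (h(K) + i) mod M for i = 0..M-1.
 * M = 11 and h(K) = K mod M are assumed for this sketch. */
enum { M = 11 };
static int  T[M];
static char used[M];                  /* used[L] != 0 iff T[L] holds a key */

static int h(int k) { return ((k % M) + M) % M; }

int lp_insert(int k) {                /* returns 0 on success, -1 if full */
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (!used[L]) { T[L] = k; used[L] = 1; return 0; }
    }
    return -1;                        /* table full: report an error */
}

int lp_find(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (!used[L]) return 0;       /* empty slot: K cannot be further on */
        if (T[L] == k) return 1;
    }
    return 0;                         /* scanned the entire table */
}
```

Note that search stops early at the first empty slot, exactly as argued on the slide: insertion would have placed the key there.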
21. Linear Probing: Example
[Worked example omitted in this version of the notes.]
22. Linear Probing: Primary Clustering
- We call a block of contiguously occupied table entries a cluster.
- On average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K is proportional to half the size of that cluster: the larger the cluster, the slower the performance.
[Figure: a cluster of contiguously occupied entries in the hash table.]
23. Linear Probing: Primary Clustering
- Linear probing has the following disadvantages:
  - Once h(K) falls into a cluster, this cluster will definitely grow in size by one. This may worsen the performance of insertions in the future.
  - If two clusters are separated by only one entry, then inserting one key into a cluster can merge the two clusters together, so the cluster size can increase drastically with a single insertion.
  - This means that the performance of insertion (and searching) can deteriorate drastically after a single insertion.
24. Double Hashing
- To alleviate the problem of primary clustering, when resolving a collision we should examine alternative positions in a more random fashion. To this end, we work with two hash functions, h and h2.
- Insertion:
  - Let K be the new key to be inserted. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i·h2(K)) mod m;
    - if T[L] is empty, then we put K there and stop.
  - If we cannot find an empty entry to put K in, it means that the table is full and we should report an error.
25. Double Hashing
- Searching:
  - Let K be the key to be searched for. We compute h(K).
  - For i = 0 to m-1:
    - compute L = (h(K) + i·h2(K)) mod m;
    - if T[L] contains K, then we are done and we can stop;
    - if T[L] is empty, then K is not in the table and we can stop too. (If K were in the table, it would have been placed in T[L] by our insertion strategy.)
  - If we have not found K at the end of the for-loop, we have scanned the entire table and so we can report that K is not in the table.
26. Double Hashing: Choice of h2
- For any key K, h2(K) must be relatively prime to the table size m. Otherwise, we will only be able to examine a fraction of the table entries.
  - For example, if h(K) = 0 and h2(K) = m/2, then we can only examine the entries T[0] and T[m/2], and nothing else!
- One solution is to make m prime, choose a prime r smaller than m, and set
  h2(K) = r - (K mod r).
27. Double Hashing: Example
[Worked example omitted in this version of the notes.]
28. Deletion in Open Addressing
- We cannot simply delete a key under open-addressing strategies.
- Otherwise, suppose that the table stores three keys K1, K2 and K3 that have identical probe sequences: K1 is stored at h(K1), K2 at the second probe location, and K3 at the third probe location.
- If K2 is to be deleted and we make the slot containing K2 empty, then when we search for K3, we will find an empty slot before finding K3.
- So we will report that K3 does not exist in the table!
29. Deletion in Open Addressing
- Instead, we add an extra bit to each entry to indicate whether the key stored there has been deleted.
- Such a delete bit serves two purposes:
  - searching: we should NOT stop there;
  - insertion: that position is logically empty (though the deleted key is still in the hash table), so we can overwrite this entry.
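The delete-bit idea can be sketched as follows. This is an illustration under assumptions: linear probing, integer keys, M = 11, and h(K) = K mod M; the "extra bit" is represented as a three-way slot state (FREE / OCCUPIED / DELETED), which is equivalent to a used flag plus a delete bit.

```c
/* Lazy deletion for open addressing (linear probing assumed).
 * A DELETED slot is skipped by search but reusable by insert. */
enum { M = 11 };
enum state { FREE, OCCUPIED, DELETED };

static int        T[M];
static enum state st[M];              /* all slots start FREE */

static int h(int k) { return ((k % M) + M) % M; }

int od_insert(int k) {                /* 0 on success, -1 if full */
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] != OCCUPIED) {      /* FREE or DELETED may be overwritten */
            T[L] = k; st[L] = OCCUPIED; return 0;
        }
    }
    return -1;
}

int od_find(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] == FREE) return 0;  /* stop only at a never-used slot */
        if (st[L] == OCCUPIED && T[L] == k) return 1;
        /* DELETED: keep probing past it */
    }
    return 0;
}

void od_delete(int k) {
    for (int i = 0; i < M; i++) {
        int L = (h(k) + i) % M;
        if (st[L] == FREE) return;    /* not present */
        if (st[L] == OCCUPIED && T[L] == k) { st[L] = DELETED; return; }
    }
}
```

With the K1/K2/K3 scenario from slide 28, deleting K2 marks its slot DELETED rather than FREE, so a later search still probes past it and finds K3.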