1
Hash Table and Hashing
  • The dictionary structures (binary search trees,
    AVL trees, B-trees) discussed so far assume that
    we can only work with the input keys by comparing
    them. No other operation is considered.
  • In practice, we can often process the input key
    itself to make insert/search/delete operations
    run faster!
  • For example:
  • Integers consist of digits.
  • Strings consist of letters.
  • We can process a key to map it to an integer,
    and use that integer as an array index.

2
Basic ideas
  • In general,
  • Universe of keys U = {u0, u1, ..., uN-1}.
  • For each key, it should be relatively easy to
    compute its corresponding index.
  • Support operations
  • Find.
  • Insert.
  • Delete. Deletions may be unnecessary in some
    applications.

3
Basic ideas
  • Given a key, compute its corresponding index into
    a hash table.
  • Index is computed using a hash function.

[Figure: key → hash_function(key) → index into the hash table.]
4
Basic ideas
  • Unlike a binary search tree, AVL tree, or B-tree,
    the following operations are hard to implement:
  • minimum and maximum,
  • successor and predecessor,
  • report data within a given range, and
  • list out the data in order.

5
Example Applications
  • Compilers use hash tables (symbol tables) to keep
    track of declared variables.
  • On-line spell checkers. After prehashing the
    entire dictionary, one can check each word in
    constant time and print out the misspelled words
    in the order of their appearance in the document.

6
Unrealistic Hashing bit vector
  • Universe of keys U = {u0, u1, ..., uN-1}.
  • Find(ui): test entry[i] → O(1) time
  • Insert(ui): set entry[i] to 1 → O(1) time
  • Delete(ui): set entry[i] to 0 → O(1) time
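The bit-vector scheme above can be sketched in C++ as follows (a minimal illustration; the name BitVectorSet and the use of integer keys are assumptions, not from the slides):

```cpp
#include <vector>

// Direct addressing: the key itself is the array index, so every
// operation is a single array access -> O(1) time.
struct BitVectorSet {
    std::vector<bool> entry;                       // one bit per key in U
    explicit BitVectorSet(int N) : entry(N, false) {}
    bool find(int u)   const { return entry[u]; }  // test entry[u]
    void insert(int u) { entry[u] = true; }        // set entry[u] to 1
    void remove(int u) { entry[u] = false; }       // set entry[u] to 0
};
```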

7
Unrealistic Hashing bit vector
  • Features
  • Each operation takes constant time.
  • Simple implementation.
  • The scheme wastes too much space if the universe
    of keys is very large compared with the actual
    number of elements to be stored.
  • For example, consider your student id. If we
    treat it as an 8-digit integer, then the
    universe size is 10^8, but we only have about 7000
    students → around (10^8 - 7000) entries will be
    wasted.

8
Hashing
  • Let U = {K0, K1, ..., KN-1} be the universe of keys.
  • Let T[0 .. m-1] be an array representing the hash
    table, where m is much smaller than N.
  • The soul of the hashing scheme is the hash
    function
  • h : U → {0, 1, ..., m-1}
  • that maps each key in the universe to an integer
    between 0 and m-1.
  • For each key K, h(K) is called the hash value of
    K, and K is supposed to be stored at T[h(K)].

9
Hashing
[Figure: hash table T[0 .. m-1]; h(K1) → index1,
h(K2) → index2, h(K3) → index2.]
Two (or more) keys may get into the same
location!
10
Hashing
  • What do we do if two (or more) keys have the same
    hash values?
  • There are two aspects to this.
  • We should design the hash function such that it
    spreads the keys uniformly among the entries of
    T. This will decrease the likelihood that two
    keys have the same hash values.
  • Nevertheless, since N > m, we still need a
    solution when this event happens.

11
Example of a bad hash function
  • Suppose that our keys are strings of letters.
  • Let's say a letter equals its ASCII value:
  • Key = cn-1 cn-2 . . . c0
  • For example, 'A' = 65, 'Z' = 90, 'a' = 97, 'z' = 122.
  • Our hash function:
  • h(K) = (cn-1 + cn-2 + ... + c0) mod m
  • (This simply adds the string's character values up
    and takes the modulus by m, where m is the size
    of the hash table.)

12
Example of a bad hash function
  • Our hash function adds up character values, so it
    gives the same result for any permutation of the
    string:
  • h("ABC") = h("CBA") = h("ACB")
  • Keys cannot be spread uniformly → so it is not a
    good idea!
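A quick demonstration of the permutation problem (a sketch; bad_hash is an illustrative name, and keys are assumed to be ASCII strings):

```cpp
#include <string>

// Bad hash: sum of character codes mod m. Because addition is
// commutative, every permutation of a string collides.
int bad_hash(const std::string& key, int m) {
    int sum = 0;
    for (char c : key)
        sum = (sum + c) % m;   // order of characters is irrelevant
    return sum;
}
```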

13
Improving the hash function
  • We can improve the hash function so that the
    letters contribute differently according to their
    positions:
  • h(K) = (cn-1*r^(n-1) + ... + c1*r + c0) mod m
  • r is the radix:
  • Integers: r = 10
  • Bit strings: r = 2
  • Strings: r = 128
  • Each character cj is weighted by r^j, i.e., by its
    position in the string.
14
Improving the hash function
  • Need to be careful about overflow, since we may
    raise the radix to a large power.
  • We can do all the computation in modular
    arithmetic, taking the modulus at each step.
  • In general, the hash table size m is chosen to be
    a prime number.
  • For example:

sum = 0; power = 1;                  // power holds r^j mod m
for (int j = 0; j < n; j++) {
    sum = (sum + c[j] * power) % m;  // take the modulus at each step
    power = (power * r) % m;
}
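The positional scheme can be packaged as a complete function (a sketch under the slide's conventions: r = 128 for character strings and a prime table size m; the name radix_hash is illustrative):

```cpp
#include <string>

// Positional hash: h(K) = (sum of c_j * r^j) mod m, keeping every
// intermediate value reduced mod m so nothing overflows.
int radix_hash(const std::string& key, int r, int m) {
    long long sum = 0, power = 1;          // power = r^j mod m
    for (char c : key) {
        sum = (sum + c * power) % m;       // modulus at each step
        power = (power * r) % m;
    }
    return static_cast<int>(sum);
}
```

Unlike the sum-only hash, permutations of a string now generally hash to different slots.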
15
Collision Handling
  • When the hash values of two keys are the same
    (i.e., the two keys need to be stored in the same
    location), we have a collision.
  • Collisions could happen even if we design our
    hash function carefully since N is much larger
    than m.
  • There are two approaches to resolve collisions.
  • separate chaining -- we can convert the hash
    table to a table of linked lists
  • open addressing -- we can relocate the key to a
    different entry in case of collision

16
Collision Handling Separate Chaining
  • A hash table becomes a table of linked lists.
  • To insert a key K, we compute h(K). If T[h(K)]
    contains a null pointer, we initialize this entry
    to point to a linked list that contains K alone.
    If T[h(K)] is a non-empty list, we just insert K
    at the beginning of this list.

17
Separate Chaining
  • To delete a key K, we compute h(K), then search
    for K within the list at T[h(K)], and delete K if
    it is found.
  • Assume that we will be storing n keys. Then we
    should make m the next prime number larger than
    n. If the hash function works well, the number of
    keys in each linked list will be a small constant.
    Therefore, we expect each search, insert, and
    delete operation to take constant time.
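A minimal separate-chaining sketch in C++ (ChainedTable and the use of integer keys with h(K) = K mod m are illustrative assumptions, not the slides' notation):

```cpp
#include <list>
#include <vector>

// Separate chaining: T is a table of linked lists; all keys that
// collide at h(K) share the list stored at T[h(K)].
struct ChainedTable {
    std::vector<std::list<int>> T;
    explicit ChainedTable(int m) : T(m) {}
    int h(int key) const { return key % (int)T.size(); }
    void insert(int key) { T[h(key)].push_front(key); }  // prepend: O(1)
    bool find(int key) const {
        for (int k : T[h(key)])          // scan only one chain
            if (k == key) return true;
        return false;
    }
    void remove(int key) { T[h(key)].remove(key); }
};
```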

18
Collision Handling Open Addressing
  • Separate chaining has the disadvantage of using
    linked lists. Memory allocation in linked list
    manipulation will slow down the program.
  • An alternative method is to relocate the key K to
    be inserted if it collides with an existing key.
    That is, we store K at an entry different from
    h(K).
  • Two issues arise.
  • What is the relocation scheme?
  • How to search for K later?
  • There are two common methods for resolving a
    collision in open addressing
  • Linear probing
  • Double hashing

19
Linear Probing
  • Insertion
  • Let K be the new key to be inserted. We compute
    h(K).
  • For i = 0 to m-1:
  • compute L = (h(K) + i) mod m
  • if T[L] is empty, then we put K there and stop.
  • If we cannot find an empty entry to put K, it
    means that the table is full and we should report
    an error.

20
Linear Probing
  • Searching
  • Let K be the key to be searched. We compute
    h(K).
  • For i = 0 to m-1:
  • compute L = (h(K) + i) mod m
  • If T[L] contains K, then we are done and we can
    stop.
  • If T[L] is empty, then K is not in the table and
    we can stop too. (If K were in the table, it
    would have been placed in T[L] by our insertion
    strategy.)
  • If we cannot find K at the end of the for-loop,
    we have scanned the entire table and so we can
    report that K is not in the table.
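The insertion and search procedures above can be sketched together (illustrative names; integer keys with h(K) = K mod m, and -1 marking an empty slot, are assumptions):

```cpp
#include <vector>
#include <stdexcept>

// Linear probing: on a collision at h(K), try (h(K)+1) mod m,
// (h(K)+2) mod m, ... until an empty slot (or K itself) is found.
struct LinearProbeTable {
    std::vector<int> T;                  // -1 marks an empty slot
    int m;
    explicit LinearProbeTable(int size) : T(size, -1), m(size) {}
    int h(int key) const { return key % m; }
    void insert(int key) {
        for (int i = 0; i < m; i++) {
            int L = (h(key) + i) % m;
            if (T[L] == -1) { T[L] = key; return; }
        }
        throw std::runtime_error("table full");   // no empty entry found
    }
    bool find(int key) const {
        for (int i = 0; i < m; i++) {
            int L = (h(key) + i) % m;
            if (T[L] == key) return true;
            if (T[L] == -1) return false; // empty slot: key is absent
        }
        return false;                     // scanned the entire table
    }
};
```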

21
Linear Probing - Example
22
Linear Probing Primary Clustering
  • We call a block of contiguously occupied table
    entries a cluster.
  • On average, when we insert a new key K, we may
    hit the middle of a cluster.
  • Therefore, the time to insert K would be
    proportional to half the size of a cluster. That
    is, the larger the cluster, the slower the
    performance.

[Figure: a cluster of contiguously occupied entries in the
hash table.]
23
Linear Probing Primary Clustering
  • Linear probing has the following disadvantages
  • Once h(K) falls into a cluster, this cluster will
    definitely grow in size by one. Thus, this may
    worsen the performance of insertion in the
    future.
  • If two clusters are separated by only one entry,
    then inserting one key can merge the two clusters
    together.
  • The cluster size can increase drastically by a
    single insertion.
  • This means that the performance of insertion
    (searching) can deteriorate drastically after a
    single insertion.

24
Double Hashing
  • To alleviate the problem of primary clustering,
    when resolving a collision, we should examine
    alternative positions in a more random fashion.
    To this end, we work with two hash functions h
    and h2.
  • Insertion
  • Let K be the new key to be inserted. We compute
    h(K).
  • For i = 0 to m-1:
  • compute L = (h(K) + i * h2(K)) mod m
  • if T[L] is empty, then we put K there and stop.
  • If we cannot find an empty entry to put K, it
    means that the table is full and we should report
    an error.

25
Double Hashing
  • Searching
  • Let K be the key to be searched. We compute
    h(K).
  • For i = 0 to m-1:
  • compute L = (h(K) + i * h2(K)) mod m
  • if T[L] contains K, then we are done and we can
    stop.
  • If T[L] is empty, then K is not in the table and
    we can stop too. (If K were in the table, it
    would have been placed in T[L] by our insertion
    strategy.)
  • If we cannot find K at the end of the for-loop,
    we have scanned the entire table and so we can
    report that K is not in the table.

26
Double Hashing Choice of h2
  • For any key K, h2(K) must be relatively prime to
    the table size m. Otherwise, we will only be able
    to examine a fraction of the table entries.
  • For example, if h(K) = 0 and h2(K) = m/2, then we
    can only examine the entries T[0] and T[m/2], and
    nothing else!
  • One solution is to make m prime, choose r to be a
    prime smaller than m, and set
  • h2(K) = r - (K mod r).
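A small sketch of this choice of h2 (assuming integer keys and h(K) = K mod m purely for illustration); with m prime, the step size h2(K) is coprime to m, so the probe sequence visits every slot:

```cpp
#include <set>

// Second hash: h2(K) = r - (K mod r), with r a prime smaller than m,
// so the step size lies in 1..r and is coprime to a prime m.
int h2(int key, int r) { return r - (key % r); }

// Collect the slots visited by (h(K) + i*h2(K)) mod m for i = 0..m-1.
std::set<int> probe_sequence(int key, int m, int r) {
    std::set<int> visited;
    int start = key % m;          // h(K) = K mod m, for illustration
    int step = h2(key, r);
    for (int i = 0; i < m; i++)
        visited.insert((start + i * step) % m);
    return visited;
}
```

With m = 11 and r = 7, every key's probe sequence covers all 11 slots; a step like m/2 would cycle through only two of them.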

27
Double Hashing Example
28
Deletion in Open Addressing
  • We cannot just delete a key in open addressing
    strategies.
  • Otherwise, suppose that the table stores three
    keys K1, K2 and K3 that have identical probe
    sequences. Suppose that K1 is stored at h(K1),
    K2 is stored at the second probe location and K3
    is stored at the third probe location.
  • If K2 is to be deleted and we make the slot
    containing K2 empty, then when we search for K3,
    we will find an empty slot before finding K3.
  • So, we will report that K3 does not exist in the
    table !!!

29
Deletion in Open Addressing
  • Instead, we add an extra bit to each entry to
    indicate whether the key stored there has been
    deleted or not.
  • This delete bit serves two purposes:
  • searching: we should NOT stop there
  • insertion: that position is logically empty
    (though the deleted key is still in the hash
    table), so we can overwrite this entry
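The delete-bit scheme can be sketched as follows (illustrative; linear probing and integer keys with h(K) = K mod m are assumptions):

```cpp
#include <vector>

// Open addressing with a per-slot state: EMPTY, OCCUPIED, or DELETED.
// Search continues past DELETED slots; insertion may reuse them.
enum Slot { EMPTY, OCCUPIED, DELETED };

struct ProbeTable {
    std::vector<int>  key;
    std::vector<Slot> state;
    int m;
    explicit ProbeTable(int size)
        : key(size), state(size, EMPTY), m(size) {}
    int h(int k) const { return k % m; }
    void insert(int k) {
        for (int i = 0; i < m; i++) {
            int L = (h(k) + i) % m;
            if (state[L] != OCCUPIED) {          // EMPTY or DELETED: reuse
                key[L] = k; state[L] = OCCUPIED; return;
            }
        }
    }
    bool find(int k) const {
        for (int i = 0; i < m; i++) {
            int L = (h(k) + i) % m;
            if (state[L] == EMPTY) return false; // truly empty: stop
            if (state[L] == OCCUPIED && key[L] == k) return true;
            // DELETED: do NOT stop; keep probing
        }
        return false;
    }
    void remove(int k) {
        for (int i = 0; i < m; i++) {
            int L = (h(k) + i) % m;
            if (state[L] == EMPTY) return;
            if (state[L] == OCCUPIED && key[L] == k) {
                state[L] = DELETED;              // mark, don't empty
                return;
            }
        }
    }
};
```

This reproduces the K1/K2/K3 scenario from the slide: deleting the middle key of a shared probe sequence no longer hides the key behind it.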