Title: Sets and Maps and Hashing
1Sets and Maps (and Hashing)
2Chapter Objectives
- To understand the Java Map and Set interfaces and how to use them
- To learn about hash codes and how they are used to facilitate efficient search and retrieval
- To study two forms of hash tables, open addressing and chaining, and to understand their relative benefits and performance tradeoffs
3Chapter Objectives
- To learn how to implement both hash table forms
- To be introduced to the implementation of Maps and Sets
- To see how two earlier applications can be more easily implemented using Map objects for data storage
4Review of Sets
- A set is unordered and has no duplicate elements
- Suppose A = {1, 3, 5, 7, 9, 11} and B = {2, 3, 5, 7, 11, 13}
- Then
- A ∪ B = {1, 2, 3, 5, 7, 9, 11, 13}
- A ∩ B = {3, 5, 7, 11}
- A - B = {1, 9}
- B - A = {2, 13}
- If C = {3, 5, 9}, then C ⊂ A
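The set operations above can be tried out with the Java collections classes. This is a minimal sketch, assuming java.util.TreeSet as the Set implementation so that the output is sorted; the class name SetReview and its helper methods are illustrative and not part of the slides.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class SetReview {
    // Each helper copies its first argument, so the input sets are not modified.
    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);      // keep everything that is in a or in b
        return result;
    }

    static Set<Integer> intersection(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);   // keep only the elements that are also in b
        return result;
    }

    static Set<Integer> difference(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);   // drop the elements that are in b
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> a = new TreeSet<>(Arrays.asList(1, 3, 5, 7, 9, 11));
        Set<Integer> b = new TreeSet<>(Arrays.asList(2, 3, 5, 7, 11, 13));
        Set<Integer> c = new TreeSet<>(Arrays.asList(3, 5, 9));

        System.out.println(union(a, b));        // [1, 2, 3, 5, 7, 9, 11, 13]
        System.out.println(intersection(a, b)); // [3, 5, 7, 11]
        System.out.println(difference(a, b));   // [1, 9]
        System.out.println(difference(b, a));   // [2, 13]
        System.out.println(a.containsAll(c));   // true, so C is a subset of A
    }
}
```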
5Sets and the Set Interface
- The part of the Collection hierarchy that relates to sets
- Includes three interfaces, two abstract classes, and two actual classes
6The Set Abstraction
- A set is a collection that contains no duplicate elements
- And at most one null element
- In a set, the index of an element is meaningless
- If s is a set,
- s.contains("apple") returns true or false
- s.indexOf("apple") makes no sense
- s.get(i) is also nonsensical
7The Set Abstraction
- Operations on sets include
- Testing for membership
- Adding (inserting) elements
- Removing elements
- Union
- Intersection
- Difference
- Subset
8The Set Interface and Methods
- Has required methods for
- Testing set membership
- Testing for an empty set
- Determining set size
- Creating an iterator over the set
- Two optional methods
- To add an element
- To remove an element
- Constructors enforce no duplicate members, and
- the add method does not allow a duplicate item
10Comparison of Lists and Sets
- Duplicate elements
- OK in a list
- Not allowed in sets: Set.add returns false if you try to insert a duplicate element
- Get method
- List has a get method
- A set has no get method (index is meaningless)
- Iterators
- Lists have iterators
- Can also iterate through the elements in a set
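The difference in how lists and sets treat duplicates can be seen directly from the return value of add. This is a small sketch using java.util.ArrayList and java.util.HashSet; the class ListVersusSet and the helper secondAddSucceeds are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

class ListVersusSet {
    // Adds the same item twice and reports what the second call to add returned.
    static boolean secondAddSucceeds(Collection<String> c, String item) {
        c.add(item);
        return c.add(item);
    }

    public static void main(String[] args) {
        // A list accepts the duplicate, a set rejects it.
        System.out.println(secondAddSucceeds(new ArrayList<>(), "apple")); // true
        System.out.println(secondAddSucceeds(new HashSet<>(), "apple"));   // false

        // A set has no get(i), but its elements can still be visited with an iterator.
        HashSet<String> set = new HashSet<>();
        set.add("apple");
        set.add("pear");
        for (String s : set) {
            System.out.println(s); // iteration order is not guaranteed
        }
    }
}
```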
11Maps
- A map relates one set to another set
- A map is a set of ordered pairs (x, y)
- Where x is the key and y is the value (element)
- For example, the following is a map
- {(J, Jane), (B, Bill), (B2, Bill), (S, Sam), (B1, Bob)}
12Maps
- A map is a set of ordered pairs (x, y)
- Where x is the key and y is the value (element)
- Keys must be unique
- But values need not be unique (the mapping can be many-to-one, not necessarily 1-to-1)
- Each key maps to a particular value (element)
- Or, you might say the key corresponds to the value
- Maps are used for very efficient storage and retrieval of information in tables
- The key is used like an index into a list
- But the key does not need to be an integer
13Maps
- Suppose we have the map
- {(J, Jane), (B, Bill), (B2, Bill), (S, Sam), (B1, Bob)}
- And it is stored in aMap
- Then
- What does aMap.get("B2") return?
- "Bill"
- What does aMap.get("Bill") return?
- null, since nothing in aMap has the key "Bill"
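The two lookups can be checked with a short program. This is a sketch that assumes java.util.HashMap as the Map implementation; the class MapLookup and the method buildMap are illustrative names, not taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class MapLookup {
    // Builds the example map of (key, value) pairs used on this slide.
    static Map<String, String> buildMap() {
        Map<String, String> aMap = new HashMap<>();
        aMap.put("J", "Jane");
        aMap.put("B", "Bill");
        aMap.put("B2", "Bill"); // a value may repeat, a key may not
        aMap.put("S", "Sam");
        aMap.put("B1", "Bob");
        return aMap;
    }

    public static void main(String[] args) {
        Map<String, String> aMap = buildMap();
        System.out.println(aMap.get("B2"));   // Bill
        System.out.println(aMap.get("Bill")); // null, because "Bill" is a value, not a key
    }
}
```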
14Map Interface
15Hash Tables
- For maps, we want to access an entry by its key, not its value
- A hash table is used for such access
- For efficiency, we want to access an element directly by its key
- As opposed to searching for the key value in an array
- Using a hash table, we can retrieve an item in constant time on average, and in linear time in the worst case
- That is, O(1) is expected, but O(n) is the worst case
16Hash Codes and Index Calculation
- Hashing idea
- Transform an item's key value into an integer
- Then use this integer as a numeric index
17Hash Code Index Example
- Suppose we want to store the number of occurrences of each Unicode character in a file
- A Java char is 16 bits, so there are 65,536 possible character values
- What to do?
- Could create an array of size 65,536 and store the count of character i in array element i
- This will work, but
- very inefficient for a small file
- Suppose the file only has 100 characters!
- Is there a better way?
18Hash Code Index Calculation
- Suppose we want to store the number of occurrences of each Unicode character in a file
- A Java char is 16 bits, so there are 65,536 possible character values
- File of 100 characters
- Use a hash code for each character
- But how to compute the hash code?
- Could do the following
- Create an array of size 200 and compute the index as index = uniChar % 200
- Good since it uses less space
- Bad if there are collisions
- 2 or more characters in the file hash to the same index
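The index calculation and the collision problem can be sketched as follows. The table size of 200 comes from the slide; the class CharIndex, its methods, and the choice of the character '\u0129' (code 297) as a colliding example are assumptions made for illustration.

```java
class CharIndex {
    static final int TABLE_SIZE = 200;

    // Maps a character code to a table index in the range 0..199.
    static int index(char uniChar) {
        return uniChar % TABLE_SIZE;
    }

    // Counts occurrences per table index; characters that collide share one counter.
    static int[] count(String text) {
        int[] counts = new int[TABLE_SIZE];
        for (int i = 0; i < text.length(); i++) {
            counts[index(text.charAt(i))]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(index('a'));           // 97
        System.out.println(index('\u0129'));      // 297 % 200 = 97, a collision with 'a'
        System.out.println(count("a\u0129")[97]); // 2, the two counts are mixed together
    }
}
```

Without some way to resolve the collision, the counter at index 97 cannot tell the two characters apart.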
19Methods for Generating Hash Codes
- Usually, keys consist of strings of letters and/or digits
- The number of possible key values is much larger than the table size
- Generating a good hash code is something of an art
- Some experimentation and trial and error may be required
- Desirable properties of a hash function?
- A random (uniform) distribution of values
- Relatively simple function
- Efficient to compute
- Collisions can always occur, so what should we do?
20Java HashCode Method
- For strings, could simply sum the int values of all characters
- Will return the same hash code for "sign" and "sing"
- The Java API algorithm accounts for the position of the characters as follows
- String.hashCode() returns the integer calculated by the formula s[0] × 31^(n-1) + s[1] × 31^(n-2) + ... + s[n-1], where s[i] is the ith character of the string and n is the length of the string
- "Cat" will have a hash code of 'C' × 31^2 + 'a' × 31 + 't'
- Since 31 is a prime number, there are fewer collisions
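Both hashing schemes can be compared in a few lines. This is a sketch: sumHash and positionalHash are hypothetical helper names, and positionalHash evaluates the formula by repeated multiplication by 31 (Horner's rule), which gives the same int result as the sum of powers.

```java
class StringHash {
    // Naive scheme: add up the character codes, ignoring their positions.
    static int sumHash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Position-sensitive scheme: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
    static int positionalHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("sign") == sumHash("sing"));               // true
        System.out.println(positionalHash("sign") == positionalHash("sing")); // false
        System.out.println(positionalHash("Cat"));                            // 67510
        System.out.println(positionalHash("Cat") == "Cat".hashCode());        // true
    }
}
```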
21Open Addressing
- We consider two ways to organize hash tables
- Open addressing
- Chaining
- For open addressing, linear probing can be used to deal with collisions
- Start at the table element whose index is computed from the key's hash code
- If that element contains an item with a different key, increment the index by one
- Keep incrementing until you find the key or a null entry
- Null indicates the element is not in the table
22Open Addressing Algorithm
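The linear probing search described on the previous slide can be sketched as follows. This is one possible version, not necessarily the algorithm shown in the original slide: it assumes a String[] table, uses hashCode modulo the table length as the starting index, and also stops if the probe returns to its starting point. The class and method names are hypothetical.

```java
class LinearProbeSearch {
    // Returns the index of key in the table, or -1 if the key is not present.
    static int find(String[] table, String key) {
        int start = Math.floorMod(key.hashCode(), table.length);
        int index = start;
        while (table[index] != null) {            // a null entry ends the search
            if (table[index].equals(key)) {
                return index;                      // found the key
            }
            index = (index + 1) % table.length;    // next slot, wrapping around
            if (index == start) {
                return -1;                         // every slot was probed
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "A", "F" and "K" have hash codes 65, 70 and 75, so all start at index 0.
        String[] table = {"A", "F", null, null, null};
        System.out.println(find(table, "A")); // 0
        System.out.println(find(table, "F")); // 1, found after one extra probe
        System.out.println(find(table, "K")); // -1, the search stops at the null in slot 2
    }
}
```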
23Table Wraparound and Search Termination
- As the index increases, it must wrap around to the start of the table (circular array)
- Leads to the potential of an infinite loop
- How do you know when to stop searching if the table is full and you have not found the correct value?
- Stop when the index for the next probe is the same as the starting index computed from the object's hash code, or
- Ensure that the table is never full by increasing its size after an insertion if its occupancy rate exceeds a specified threshold (a sparser table has fewer collisions)
24Open Addressing Example
- Suppose we have the following values and hash
codes
25Open Addressing Example
- Suppose we use hashCode % 5 to create the hash table
- Using open addressing
31Open Addressing Example
- Suppose we use hashCode % 11 to create the hash table
- Using open addressing
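The two tables can be reproduced with a short program. This is a sketch under stated assumptions: the keys are the five names that appear in the chaining example (Tom, Dick, Harry, Sam, Pete), they are inserted in that order, their hash codes come from String.hashCode(), and the keys are distinct. The class OpenAddressingExample and the method build are illustrative names.

```java
import java.util.Arrays;

class OpenAddressingExample {
    // Inserts each key using linear probing.
    // Assumes keys.length <= tableSize, so a free slot always exists.
    static String[] build(String[] keys, int tableSize) {
        String[] table = new String[tableSize];
        for (String key : keys) {
            int index = Math.floorMod(key.hashCode(), tableSize);
            while (table[index] != null) {
                index = (index + 1) % tableSize; // collision: try the next slot
            }
            table[index] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = {"Tom", "Dick", "Harry", "Sam", "Pete"};
        System.out.println(Arrays.toString(build(names, 5)));
        // [Dick, Sam, Pete, Harry, Tom]
        System.out.println(Arrays.toString(build(names, 11)));
        // [null, null, null, Tom, null, Dick, Sam, Pete, null, null, Harry]
    }
}
```

With a table of size 5 only Tom and Harry land on their own index and the other three keys are displaced by collisions; with size 11 only Sam has to be moved.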
37Hash Table Operations
- Iterating through a hash table gives entries in arbitrary order
- Deleting from a hash table
- Cannot just insert a null. Why not?
- Null is used for the stopping/not found condition
- Can insert a dummy value to mark the entry as deleted
- So, removing does not improve search time
- Reducing collisions
- Expand the size of the hash table, and rehash the elements
- Tradeoff between table size and search efficiency
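Deletion with a dummy value can be sketched as follows. This is a minimal illustration, not a complete hash table: the class ProbeTableWithDelete and the DELETED marker are hypothetical, deleted slots are never reused, and the code assumes that at least one null slot always remains so that every probe terminates.

```java
class ProbeTableWithDelete {
    // Dummy object stored in place of a removed key; it never equals a String key.
    private static final Object DELETED = new Object();
    private final Object[] table;

    ProbeTableWithDelete(int capacity) {
        table = new Object[capacity];
    }

    // Returns the index of key, or of the first null slot if key is absent.
    // DELETED slots are skipped, so keys stored beyond them stay reachable.
    private int locate(String key) {
        int i = Math.floorMod(key.hashCode(), table.length);
        while (table[i] != null && !key.equals(table[i])) {
            i = (i + 1) % table.length;
        }
        return i;
    }

    void add(String key) {
        table[locate(key)] = key;
    }

    boolean contains(String key) {
        return table[locate(key)] != null;
    }

    void remove(String key) {
        int i = locate(key);
        if (table[i] != null) {
            table[i] = DELETED; // mark the slot instead of storing null
        }
    }

    public static void main(String[] args) {
        ProbeTableWithDelete t = new ProbeTableWithDelete(5);
        t.add("A"); // "A", "F", "K" all start at index 0 and fill slots 0, 1, 2
        t.add("F");
        t.add("K");
        t.remove("F");
        System.out.println(t.contains("F")); // false
        System.out.println(t.contains("K")); // true, the probe passes the DELETED slot
    }
}
```

If remove stored null in slot 1 instead, the search for "K" would stop there and wrongly report that the key is missing.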
38Reducing Collisions by Quadratic Probing
- Linear probing tends to form clusters of keys in the table, causing longer search chains
- Quadratic probing can reduce the effect of clustering
- Increments form a quadratic series
- Disadvantages?
- More work to calculate the next index (multiplication, addition, and modular division)
- Not all table elements are examined when looking for an insertion index
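The second disadvantage can be demonstrated by listing the slots a quadratic probe visits. This sketch assumes the probe sequence (start + k × k) % tableSize for k = 0, 1, 2, ..., which is one common form of the quadratic series; the class QuadraticProbe and the method visited are illustrative names.

```java
import java.util.Set;
import java.util.TreeSet;

class QuadraticProbe {
    // Collects the distinct slots reached by the first tableSize probes.
    static Set<Integer> visited(int start, int tableSize) {
        Set<Integer> slots = new TreeSet<>();
        for (int k = 0; k < tableSize; k++) {
            slots.add((start + k * k) % tableSize);
        }
        return slots;
    }

    public static void main(String[] args) {
        // Only 6 of the 11 slots are ever examined when starting from index 0.
        System.out.println(visited(0, 11)); // [0, 1, 3, 4, 5, 9]
    }
}
```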
39Chaining
- Chaining is an alternative to open addressing
- Each table element references a linked list that contains all of the items that hash to the same table index
- The linked list is often called a bucket
- The approach is sometimes called bucket hashing
- Only items that hash to the same table index will be examined when looking for an object
40Chaining
- Recall hashCode % 5
- Chaining creates a linked list for each table index, holding all the keys that collide there
- In this example
- Linked list for Tom, Dick, Sam
- Another linked list for Harry and Pete
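The two linked lists can be reproduced in code. This is a sketch that assumes the hash codes come from String.hashCode() and that the names are inserted in the order Tom, Dick, Harry, Sam, Pete; the class ChainingExample and the method build are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainingExample {
    // Builds a table in which every index holds a linked list (bucket) of keys.
    static List<List<String>> build(String[] keys, int tableSize) {
        List<List<String>> table = new ArrayList<>();
        for (int i = 0; i < tableSize; i++) {
            table.add(new LinkedList<>());
        }
        for (String key : keys) {
            int index = Math.floorMod(key.hashCode(), tableSize);
            table.get(index).add(key); // colliding keys share the same bucket
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = {"Tom", "Dick", "Harry", "Sam", "Pete"};
        List<List<String>> table = build(names, 5);
        for (int i = 0; i < table.size(); i++) {
            System.out.println(i + ": " + table.get(i));
        }
        // 3: [Harry, Pete]
        // 4: [Tom, Dick, Sam]
        // all other buckets are empty
    }
}
```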
42Chaining
- Plusses?
- Conceptually simple
- Minimizes table size
- Good search efficiency
- Minuses?
- Overhead of linked lists (more storage)
- More complex (perhaps)
43Performance of Hash Tables
- Load factor is the number of filled cells divided by the table size
- Load factor has the greatest effect on performance
- The lower the load factor, the better the performance
- Why?
- Less chance of collision in a sparsely populated table
- But the smaller the load factor, the more wasted space
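One way to keep the load factor low is to grow the table and rehash whenever a threshold is exceeded. The sketch below assumes open addressing with linear probing, a threshold of 0.75, an initial capacity of 5, and growth to 2 × capacity + 1; all of these choices and the class name RehashingTable are illustrative rather than prescribed by the slides.

```java
class RehashingTable {
    private static final double LOAD_THRESHOLD = 0.75; // assumed threshold
    private String[] table = new String[5];            // assumed initial capacity
    private int numKeys = 0;

    double loadFactor() {
        return (double) numKeys / table.length;
    }

    int capacity() {
        return table.length;
    }

    // Linear probing: index of key, or of the first null slot if key is absent.
    private static int locate(String[] slots, String key) {
        int i = Math.floorMod(key.hashCode(), slots.length);
        while (slots[i] != null && !slots[i].equals(key)) {
            i = (i + 1) % slots.length;
        }
        return i;
    }

    void add(String key) {
        int i = locate(table, key);
        if (table[i] != null) {
            return;                 // key is already present
        }
        table[i] = key;
        numKeys++;
        if (loadFactor() > LOAD_THRESHOLD) {
            rehash();               // the table never becomes completely full
        }
    }

    boolean contains(String key) {
        return table[locate(table, key)] != null;
    }

    // Allocates a larger table and reinserts every key at its new index.
    private void rehash() {
        String[] old = table;
        table = new String[2 * old.length + 1];
        for (String key : old) {
            if (key != null) {
                table[locate(table, key)] = key;
            }
        }
    }

    public static void main(String[] args) {
        RehashingTable t = new RehashingTable();
        for (String name : new String[] {"Tom", "Dick", "Harry", "Sam", "Pete"}) {
            t.add(name);
        }
        System.out.println(t.capacity());      // 11, the table grew after the fourth key
        System.out.println(t.contains("Sam")); // true
    }
}
```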
45Maps and Hashing
- Maps use hash tables!
- Hashing converts the key into an index
- The index is the place where the corresponding value is stored
- Makes it possible to search efficiently
- Recall, O(1), on average
- Without having an (explicit) index
- Of course, there is some additional overhead
46Implementing a Hash Table
48Implementation of Maps and Sets
- Class Object implements methods hashCode and equals, so every class can access these methods unless it overrides them
- Object.equals compares two objects based on their addresses, not their contents
- Object.hashCode calculates an object's hash code based on its identity (typically its address), not its contents
- Java recommends that if you override the equals method, then you should also override the hashCode method, so that equal objects have equal hash codes
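The recommendation can be illustrated with a small class. This is a sketch: Person and its fields name and id are hypothetical, and java.util.Objects.hash is just one convenient way to combine the fields into a hash code.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Person {
    private final String name;
    private final int id;

    Person(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Two Person objects are equal when their contents match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Person)) {
            return false;
        }
        Person p = (Person) other;
        return id == p.id && name.equals(p.name);
    }

    // Uses the same fields as equals, so equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        Set<Person> people = new HashSet<>();
        people.add(new Person("Jane", 1));
        // A different object with the same contents is found in the set.
        System.out.println(people.contains(new Person("Jane", 1))); // true
    }
}
```

If hashCode were not overridden, the second Person object would usually hash to a different bucket and contains would return false even though equals reports the objects as equal.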
49Implementing HashSetOpen
50Implementing Java Map and Set Interfaces
- The Java API uses a hash table to implement both the Map and Set interfaces
- The task of implementing the two interfaces is simplified by the inclusion of abstract classes AbstractMap and AbstractSet in the Collection hierarchy
51Nested Interface Map.Entry
- One requirement on the key-value pairs for a Map object is that they implement the interface Map.Entry<K, V>, which is a nested interface of interface Map
- An implementer of the Map interface must contain an inner class that provides code for the Map.Entry methods getKey, getValue, and setValue
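A class that implements Map.Entry might look like the following. This is a sketch of a standalone class for readability, whereas a real map implementation would normally declare it as an inner class; the name PairEntry is hypothetical.

```java
import java.util.Map;

class PairEntry<K, V> implements Map.Entry<K, V> {
    private final K key; // the key never changes once the entry is created
    private V value;

    PairEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public K getKey() {
        return key;
    }

    @Override
    public V getValue() {
        return value;
    }

    // Replaces the value and returns the one that was stored before.
    @Override
    public V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    public static void main(String[] args) {
        PairEntry<String, String> entry = new PairEntry<>("J", "Jane");
        System.out.println(entry.getKey());         // J
        System.out.println(entry.setValue("Joan")); // Jane
        System.out.println(entry.getValue());       // Joan
    }
}
```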
52Additional Applications of Maps
- Can implement the phone directory using a map
53Additional Applications of Maps
- Huffman Coding Problem
- Use a map for creating an array of elements and replacing each input character by its bit string code in the output file
- Frequency table
- The key will be the input character
- The value is the character code string
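As a first step toward the Huffman application, a map can count how often each character occurs. This sketch assumes that the frequency table maps each input character to its number of occurrences; building the Huffman tree and the code strings is not shown. The class FrequencyTable and the method build are illustrative names.

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyTable {
    // Maps each character in the text to the number of times it occurs.
    static Map<Character, Integer> build(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            // Start at 1 for a new character, otherwise add 1 to the old count.
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(build("hello")); // {e=1, h=1, l=2, o=1}
    }
}
```

Unlike an array of 65,536 counters, the map only holds entries for characters that actually appear in the input.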
54Chapter Review
- The Set interface describes an abstract data type that supports the same operations as a mathematical set
- The Map interface describes an abstract data type that enables a user to access information corresponding to a specified key
- A hash table uses hashing to transform an item's key into a table index so that insertions, retrievals, and deletions can be performed in expected O(1) time
- A collision occurs when two keys map to the same table index
- In open addressing, linear probing is often used to resolve collisions
55Chapter Review
- The best way to avoid collisions is to keep the table load factor relatively low by rehashing when the load factor reaches a value such as 0.75
- In open addressing, you can't physically remove an element from the table when you delete it; instead, you must mark it as deleted
- A set view of a hash table can be obtained through method entrySet
- Two Java API implementations of the Map (Set) interface are HashMap (HashSet) and TreeMap (TreeSet)
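The last two points can be seen together in a short program. This is a sketch: the class MapViews and the method describe are hypothetical, and the sample entries are borrowed from the earlier map example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MapViews {
    // Walks the set view returned by entrySet and lists each key=value pair.
    static String describe(Map<String, String> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put("J", "Jane");
        hashMap.put("B", "Bill");
        hashMap.put("S", "Sam");

        // TreeMap keeps its entries sorted by key; HashMap makes no ordering promise.
        Map<String, String> treeMap = new TreeMap<>(hashMap);

        System.out.println(describe(hashMap)); // same pairs, order not guaranteed
        System.out.println(describe(treeMap)); // B=Bill J=Jane S=Sam
    }
}
```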