Transcript and Presenter's Notes

Title: Hashing


1
Hashing
2
The Problem
  • Want a data structure with good insertion, deletion, and lookup times (all O(1)!)

3
Idea
  • Use an array!
  • To enter an item, first find the array index:
  • index = hashFunc(myItem)
  • Then put the item in the appropriate place:
  • myArr[index] = myItem

4
Running times
  • Assuming that hashFunc is quick, insertion is
    O(1) (because array insertion is)
  • Deletion is quick: you hash, and set myArr[index] = None
  • Lookup is quick:
  • if (myArr[hashFunc(testItem)] == testItem)
  • it's in there!
  • else
  • it's not in there
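
A minimal Python sketch of the two slides above, assuming a fixed table size and using Python's built-in hash() as the "quick" hash function (collisions are ignored for now; they come up later in the deck):

    TABLE_SIZE = 11
    my_arr = [None] * TABLE_SIZE

    def hash_func(item):
        return hash(item) % TABLE_SIZE      # assumed to be quick

    def insert(item):                       # O(1): hash, then one array write
        my_arr[hash_func(item)] = item

    def delete(item):                       # O(1): hash, then clear the cell
        my_arr[hash_func(item)] = None

    def lookup(test_item):                  # O(1): hash, then one comparison
        return my_arr[hash_func(test_item)] == test_item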

5
Problems
  • Finding a good hash function
  • Deciding how big to make the table
  • Collisions: two different items might hash to the same index

6
Implementation
  • Often you want to store more than a KEY in a
    hashtable
  • Idea: make a new kind of key that has the old key AND a value
  • newkey = (oldkey, value)
  • define a new hash function:
  • int newfunc( (oldkey, value) )
  • return hashfunc(oldkey)
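
A Python sketch of the same idea (the helper names here are assumptions, not from the slide): entries are (oldkey, value) pairs, and the new hash function only looks at the old key:

    TABLE_SIZE = 11
    table = [None] * TABLE_SIZE

    def new_func(entry):                    # entry is an (oldkey, value) pair
        oldkey, _value = entry
        return hash(oldkey) % TABLE_SIZE    # hash depends only on oldkey

    def put(oldkey, value):
        table[new_func((oldkey, value))] = (oldkey, value)

    def get(oldkey):
        entry = table[hash(oldkey) % TABLE_SIZE]
        return entry[1] if entry and entry[0] == oldkey else None

    put("apple", 3)
    print(get("apple"))                     # 3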

7
Java version
  • Previous idea is so common that we define a
    HashMap abstract data type, which takes in
    key-value pairs, and stores them in a hashtable.

8
  • public class HashMap<K,V> implements Map<K,V>
  • public HashMap(int capacity)
  • public int hashValue(K key)
  • public int size()
  • public boolean isEmpty()
  • public Iterable<K> keys()
  • public Iterable<V> values()
  • public V get(K key)
  • public V put(K key, V value)
  • public V remove(K key)

9
Python
  • It's built in!
  • creating:
  • my_dict = {'a': 1, 'b': 2, 'c': 3}
  • editing:
  • my_dict['a'] = 10
  • searching:
  • result = my_dict.get('a', None)
  • print(result)  # 10
  • result = my_dict.get('f', None)
  • print(result)  # None

10
Collision handling
  • Chaining: each array entry is a linked list rather than a single item.
  • If the lists are generally short, then lookup is still fast (a chaining sketch follows this slide)
  • Other hacks:
  • if this cell is full, put item in the next open
    cell
  • Requires searching through several cells to find open space
  • But several is usually two or less
  • If cell is full, put item in some well-defined
    other cell
  • Have each array entry contain a HashTable
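
A chaining sketch in Python (short lists stand in for linked lists; the function names are assumptions):

    TABLE_SIZE = 11
    buckets = [[] for _ in range(TABLE_SIZE)]    # one (hopefully short) list per slot

    def insert(item):
        chain = buckets[hash(item) % TABLE_SIZE]
        if item not in chain:                    # scan the chain to avoid duplicates
            chain.append(item)

    def lookup(item):
        return item in buckets[hash(item) % TABLE_SIZE]

    def delete(item):
        chain = buckets[hash(item) % TABLE_SIZE]
        if item in chain:
            chain.remove(item)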

11
Good hash functions
  • Can design hash function based on your knowledge
    of data
  • but data has a way of changing its patterns over
    time
  • Can use randomness as your friend
  • Get a hash function that's very likely, on average, to be pretty good

12
Multiplication mod 7
      ×    0    1    2    3    4    5    6
      0    0    0    0    0    0    0    0
      1    0    1    2    3    4    5    6
      2    0    2    4    6    8   10   12
      3    0    3    6    9   12   15   18
      4    0    4    8   12   16   20   24
      5    0    5   10   15   20   25   30
      6    0    6   12   18   24   30   36

13
Multiplication mod 7
      ×    0    1    2    3    4    5    6
      0    0    0    0    0    0    0    0
      1    0    1    2    3    4    5    6
      2    0    2    4    6    1    3    5
      3    0    3    6    2    5    1    4
      4    0    4    1    5    2    6    3
      5    0    5    3    1    6    4    2
      6    0    6    5    4    3    2    1
  • There's a 1 in every row
  • All numbers in each row are different
  • Given a ≠ 0 and b, the equation
  • ax = b
  • has only one solution, mod 7.

14
Generalization
  • All this happens because 7 is prime
  • If you write out a multiplication table mod 6, you find that 2 · 3 = 0, for instance.
  • When the base is prime, the nonzero part of the multiplication table has nice properties
  • Key: Given a ≠ 0 and b, the equation
  • ax = b
  • has only one solution, mod p.
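
A quick Python check of these claims (an illustrative sketch): for a prime modulus every nonzero row of the multiplication table is a permutation of 0..m-1, while a composite modulus such as 6 fails:

    def rows_are_permutations(m):
        # For each nonzero a, the row a*0, a*1, ..., a*(m-1), reduced mod m,
        # should contain every residue exactly once.
        return all(
            sorted((a * x) % m for x in range(m)) == list(range(m))
            for a in range(1, m)
        )

    print(rows_are_permutations(7))   # True  (7 is prime)
    print(rows_are_permutations(6))   # False (e.g. 2 * 3 = 0 mod 6)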

15
Universal Hashing, by example
  • Consider the problem of hashing IP addresses to small integers, say 1 through 250, more or less.
  • Idea: pick four numbers, a1, a2, a3, a4, at random.
  • To hash the IP address x1.x2.x3.x4, compute
  • a1x1 + a2x2 + a3x3 + a4x4

16
hashing, cont.
  • a1x1 + a2x2 + a3x3 + a4x4 is too big, in general
  • Reduce mod n.
  • But don't use n = 250; instead use n = 257, because that's prime!
  • That means you might as well pick the ai as numbers mod 257 (i.e., between 0 and 256).
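
A sketch of this hash in Python: the coefficients a1..a4 are drawn once, at random, mod 257 (the sample address and names are made up):

    import random

    P = 257                                        # prime, just above 250
    a = [random.randrange(P) for _ in range(4)]    # a1..a4, chosen once

    def hash_ip(ip):
        x = [int(octet) for octet in ip.split(".")]
        return sum(ai * xi for ai, xi in zip(a, x)) % P

    print(hash_ip("128.148.32.12"))                # some bucket in 0..256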

17
Probability of collision
  • If (x1, ..., x4) and (y1, ..., y4) are two distinct IP addresses, and we hash both of them, how likely are we to get a collision?
  • Let's assume x4 and y4 differ.

18
Probability of Collision (2)
  • To collide, we need
  • a1x1 + a2x2 + a3x3 + a4x4 ≡ a1y1 + a2y2 + a3y3 + a4y4 (mod 257)
  • Rewrite:
  • a1(x1 - y1) + a2(x2 - y2) + a3(x3 - y3) ≡ -a4(x4 - y4) (mod 257)

19
Collision probability (3)
  • First compute the left-hand side:
  • b = a1(x1 - y1) + a2(x2 - y2) + a3(x3 - y3)
  • To have a collision, we'd need the right-hand side to be b:
  • -a4(x4 - y4) ≡ b (mod 257)
  • Now (x4 - y4) is nonzero, so there's only one value of a4 you can multiply by to get b, as we saw with our experiments mod 7.
  • That means that the probability of collision is 1/257, because a4 was chosen uniformly at random

20
Status
  • We had no control over the data items
  • Instead, we chose at random from a set H of hash functions, the ones defined by formulas like
  • f(x1, x2, x3, x4) = a1x1 + a2x2 + a3x3 + a4x4 mod 257
  • For any two distinct data items x and y, exactly |H|/n of the hash functions in H send x and y to the same bucket (n = number of buckets). So prob. of collision is 1/n.
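
An empirical check of that 1/n claim (a sketch; the two addresses are arbitrary, differing only in the last octet):

    import random

    P = 257
    x = (128, 148, 32, 12)
    y = (128, 148, 32, 13)               # differs from x only in x4

    def h(a, v):
        return sum(ai * vi for ai, vi in zip(a, v)) % P

    trials = 100_000
    hits = sum(
        h(a, x) == h(a, y)
        for a in ([random.randrange(P) for _ in range(4)] for _ in range(trials))
    )
    print(hits / trials)                 # should be close to 1/257 (about 0.004)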

21
Universal Hashing
  • Universal hashing consists of picking a set H of
    hash functions with the property on the previous
    slide, and then using one of them at random

22
Universal Hashing Procedure
  • Pick a table size, p, that's a prime about twice the expected number of items to hash
  • Treat each key as being a k-tuple of integers mod p.
  • Pick a1, ..., ak randomly from 0 to p - 1
  • Use h(x) = Σi ai·xi (mod p) as your hash function!
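
A sketch of the whole procedure for string keys (the byte-level view of the key and the constants are assumptions, not from the slides):

    import random

    def make_universal_hash(p, key_length):
        # p: prime table size; one random coefficient per byte position.
        a = [random.randrange(p) for _ in range(key_length)]

        def h(key):
            xs = key.encode("utf-8")[:key_length]   # the key as a k-tuple of ints mod p
            return sum(ai * xi for ai, xi in zip(a, xs)) % p

        return h

    h = make_universal_hash(p=197, key_length=16)
    print(h("hashing"), h("tables"))                # two buckets in 0..196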

23
Hashing Application
  • decoration of nodes in a tree or graph
  • For each new decoration (like height), make a
    hashtable.
  • Replace n.height = 4 with hTable.put(n, 4), and u = n.height with u = hTable.get(n)
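
In Python the same trick is just a dict keyed by the node object (a sketch; Node is a stand-in class):

    class Node:
        pass                          # some tree or graph node type

    n = Node()
    height = {}                       # one hashtable per decoration

    height[n] = 4                     # instead of n.height = 4
    u = height.get(n)                 # instead of u = n.height
    print(u)                          # 4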

24
Analysis of Hashing
  • We'll analyze for a table with 2n positions, and n items in it.
  • In general, look at k positions, n items
  • Average bucket size is s = n/k
  • In our case, s = 0.5.

25
Simple Uniform Hashing
  • To make analysis easy, we assume that the hash function is nice: each element is equally likely to fall in any one of the available slots.
  • This is called the simple uniform hashing
    assumption

26
Analysis Types
  • We can look at worst-case analysis, which is pretty bad
  • We can look at expected performance, which is
    pretty good

27
Worst case
  • All items hash to one location: the chain for that location has n items in it.
  • Insertion: O(n)
  • Have to look through the whole list to be sure the item isn't already there
  • Lookup: O(n)
  • Delete: O(n)
  • Clearly we don't love hashing for its worst-case performance!

28
Expected Performance
  • In a chaining-based hashtable, an unsuccessful search takes time O(1 + s)
  • Total list length = n
  • Number of lists = k
  • Average list length = n/k = s
  • Searching an average list is O(n/k); we pay one unit for hashing, so...
  • Performance is O(1 + s)
  • For the k = 2n case, O(1)
  • Same result for a successful search
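
A small simulation of the O(1 + s) claim under simple uniform hashing (a sketch; it charges one unit for hashing plus the length of the scanned chain for an unsuccessful search):

    import random

    n, k = 1000, 2000                    # n items in k = 2n slots, so s = 0.5
    buckets = [[] for _ in range(k)]
    for _ in range(n):
        buckets[random.randrange(k)].append(object())

    trials = 10_000
    cost = sum(1 + len(buckets[random.randrange(k)]) for _ in range(trials))
    print(cost / trials)                 # close to 1 + s = 1.5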

29
Universal Hashing Performance
  • If we hash n items into k slots, and n ≤ k, then the expected number of collisions involving a key, x, is less than 1.
  • Idea: for any two keys y and z, let Cyz be 1 if h(y) = h(z), 0 otherwise
  • i.e., Cyz = 1 if y and z collide
  • The universal hashing assumption says
  • E[Cyz] = 1/k

30
Universal Hashing Performance (2)
  • Let Cx be the total number of collisions
    involving key x in a hash table T.
  • Cx = Cxa + Cxb + ..., where a, b, ... are all the keys in the table, i.e.,
  • Cx = Σ Cxy,
  • where the sum is over all keys y in T except x.
  • E[Cx] = E[Σ Cxy] = Σ E[Cxy] = (n - 1)/k

31
Summary
  • Expected performance of a hashtable with universal hashing: O(1) for everything, as long as the table is only half full.

32
Misc facts
  • Even though the expected bucket size is small
    (0.5), the expected size of the fullest bucket is
    considerably larger.
  • For a half-full table, the probability of a collision somewhere, i.e., the probability that some bucket has size ≥ 2, is
  • 1 - (2n)! / (n! (2n)^n)
  • For n = 2, prob = 25%
  • For n = 10, prob ≈ 93%
  • For n = 20, prob ≈ 99%
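
Those figures can be recomputed directly from the formula (a quick illustrative check):

    from math import prod

    def collision_probability(n):
        # 1 - P(no collision) when n items go into 2n buckets uniformly,
        # i.e. 1 - (2n)! / (n! * (2n)^n)
        return 1 - prod((2 * n - i) / (2 * n) for i in range(n))

    for n in (2, 10, 20):
        print(n, collision_probability(n))    # 0.25, ~0.93, ~0.997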

33
Misc Facts (2)
  • For n items hashed into a table of size k, with the simple uniform hashing assumption, the fraction of chains with p elements is about
  • E(p, n, k) = (1/p!) (n/k)^p e^(-n/k)
  • For our case (k = 2n), this is
  • E(p) = (1/p!) 2^(-p) e^(-1/2) ≈ 2.2/(2^p p!)
  • Examples:
  • E(2) ≈ 27%   E(3) ≈ 4.5%   E(4) ≈ 0.5%
  • Almost all buckets have no more than 4 elements!

34
  • In fact (1981!) expected number of items in
    fullest chain is O(log(n)) when we place n items
    in an n-slot hashtable.

35
Clever trick
  • Multiple-choice hashing!
  • Instead of one hash function, choose d of them
  • Hash into each of the d resulting positions.
  • Actually place the item into the least-full chain
  • Expected fullest bin has about
  • 1 + log(log(n))/log(d)
  • items
  • Probably not useful in practice for hashing
  • Useful in other contexts: load balancing, data allocation to disks, routing
  • Idea due in part to Brown's own Prof. Eli Upfal!
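
A small balls-into-bins simulation of the idea (a sketch): compare the fullest bin with one random choice per item versus d = 2 choices.

    import random

    def fullest_bin(n, d):
        # Place n items into n bins; each item looks at d random bins
        # and goes into whichever is currently least full.
        bins = [0] * n
        for _ in range(n):
            candidates = [random.randrange(n) for _ in range(d)]
            best = min(candidates, key=lambda b: bins[b])
            bins[best] += 1
        return max(bins)

    n = 100_000
    print(fullest_bin(n, 1))    # grows like log n / log log n: typically around 8 here
    print(fullest_bin(n, 2))    # grows like log log n / log d: typically around 4 here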