Transcript and Presenter's Notes

Title: Hashing


1
Hashing
2
The Problem
  • Want a data structure with good insertion, deletion, and lookup times (all O(1)!)

3
Idea
  • Use an array!
  • To enter an item, first find the array index:
  • index = hashFunc(myItem)
  • Then put the item in the appropriate place:
  • myArr[index] = myItem

4
Running times
  • Assuming that hashFunc is quick, insertion is
    O(1) (because array insertion is)
  • Deletion is quick: you hash, and set myArr[index] = None
  • Lookup is quick:
  • if (myArr[hashFunc(testItem)] == testItem)
  • it's in there!
  • else
  • it's not in there
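
A minimal Python sketch of the two slides above, assuming a fixed table size and using Python's built-in hash() as the "quick" hash function (collisions are ignored for now; they come up later in the deck):

    TABLE_SIZE = 11
    my_arr = [None] * TABLE_SIZE

    def hash_func(item):
        return hash(item) % TABLE_SIZE      # assumed to be quick

    def insert(item):                       # O(1): hash, then one array write
        my_arr[hash_func(item)] = item

    def delete(item):                       # O(1): hash, then clear the cell
        my_arr[hash_func(item)] = None

    def lookup(test_item):                  # O(1): hash, then one comparison
        return my_arr[hash_func(test_item)] == test_item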

5
Problems
  • Finding a good hash function
  • Deciding how big to make the table
  • Collisions: two different items might hash to the same index

6
Implementation
  • Often you want to store more than a KEY in a
    hashtable
  • Idea: make a new kind of key that has the old key AND a value
  • newkey = (oldkey, value)
  • define a new hash function:
  • int newfunc( (oldkey, value) )
  • return hashfunc(oldkey)
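
A Python sketch of the same idea (the helper names here are assumptions, not from the slide): entries are (oldkey, value) pairs, and the new hash function only looks at the old key:

    TABLE_SIZE = 11
    table = [None] * TABLE_SIZE

    def new_func(entry):                    # entry is an (oldkey, value) pair
        oldkey, _value = entry
        return hash(oldkey) % TABLE_SIZE    # hash depends only on oldkey

    def put(oldkey, value):
        table[new_func((oldkey, value))] = (oldkey, value)

    def get(oldkey):
        entry = table[hash(oldkey) % TABLE_SIZE]
        return entry[1] if entry and entry[0] == oldkey else None

    put("apple", 3)
    print(get("apple"))                     # 3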

7
Java version
  • Previous idea is so common that we define a
    HashMap abstract data type, which takes in
    key-value pairs, and stores them in a hashtable.

8
  • public class HashMap<K,V> implements Map<K,V>
  • public HashMap(int capacity)
  • public int hashValue(K key)
  • public int size()
  • public boolean isEmpty()
  • public Iterable<K> keys()
  • public Iterable<V> values()
  • public V get(K key)
  • public V put(K key, V value)
  • public V remove(K key)

9
Python
  • It's built in!
  • creating:
  • my_dict = {'a': 1, 'b': 2, 'c': 3}
  • editing:
  • my_dict['a'] = 10
  • searching:
  • result = my_dict.get('a', None)
  • print(result)  # 10
  • result = my_dict.get('f', None)
  • print(result)  # None

10
Collision handling
  • Chaining: each array entry is a linked list rather than a single item.
  • If the lists are generally short, then lookup is still fast (a chaining sketch follows this slide)
  • Other hacks:
  • if this cell is full, put item in the next open
    cell
  • Requires searching through several cells to find open space
  • But several is usually two or less
  • If cell is full, put item in some well-defined
    other cell
  • Have each array entry contain a HashTable
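
A chaining sketch in Python (short lists stand in for linked lists; the function names are assumptions):

    TABLE_SIZE = 11
    buckets = [[] for _ in range(TABLE_SIZE)]    # one (hopefully short) list per slot

    def insert(item):
        chain = buckets[hash(item) % TABLE_SIZE]
        if item not in chain:                    # scan the chain to avoid duplicates
            chain.append(item)

    def lookup(item):
        return item in buckets[hash(item) % TABLE_SIZE]

    def delete(item):
        chain = buckets[hash(item) % TABLE_SIZE]
        if item in chain:
            chain.remove(item)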

11
Good hash functions
  • Can design hash function based on your knowledge
    of data
  • but data has a way of changing its patterns over
    time
  • Can use randomness as your friend
  • Get a hash function that's very likely, on average, to be pretty good

12
Multiplication mod 7
      ×    0    1    2    3    4    5    6
      0    0    0    0    0    0    0    0
      1    0    1    2    3    4    5    6
      2    0    2    4    6    8   10   12
      3    0    3    6    9   12   15   18
      4    0    4    8   12   16   20   24
      5    0    5   10   15   20   25   30
      6    0    6   12   18   24   30   36

13
Multiplication mod 7
      ×    0    1    2    3    4    5    6
      0    0    0    0    0    0    0    0
      1    0    1    2    3    4    5    6
      2    0    2    4    6    1    3    5
      3    0    3    6    2    5    1    4
      4    0    4    1    5    2    6    3
      5    0    5    3    1    6    4    2
      6    0    6    5    4    3    2    1
  • There's a 1 in every row
  • All numbers in each row are different
  • Given a ≠ 0 and b, the equation
  • ax = b
  • has only one solution, mod 7.

14
Generalization
  • All this happens because 7 is prime
  • If you write out a multiplication table mod 6, you find that 2 · 3 = 0, for instance.
  • When the base is prime, the nonzero part of the multiplication table has nice properties
  • Key: Given a ≠ 0 and b, the equation
  • ax = b
  • has only one solution, mod p.
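
A quick Python check of these claims (an illustrative sketch): for a prime modulus every nonzero row of the multiplication table is a permutation of 0..m-1, while a composite modulus such as 6 fails:

    def rows_are_permutations(m):
        # For each nonzero a, the row a*0, a*1, ..., a*(m-1), reduced mod m,
        # should contain every residue exactly once.
        return all(
            sorted((a * x) % m for x in range(m)) == list(range(m))
            for a in range(1, m)
        )

    print(rows_are_permutations(7))   # True  (7 is prime)
    print(rows_are_permutations(6))   # False (e.g. 2 * 3 = 0 mod 6)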

15
Universal Hashing, by example
  • Consider the problem of hashing IP addresses to small integers, say 1 through 250, more or less.
  • Idea: pick four numbers, a1, a2, a3, a4, at random.
  • To hash the IP address x1.x2.x3.x4, compute
  • a1x1 + a2x2 + a3x3 + a4x4

16
hashing, cont.
  • a1x1 + a2x2 + a3x3 + a4x4 is too big, in general
  • Reduce mod n.
  • But don't use n = 250; instead use n = 257, because that's prime!
  • That means you might as well pick the ai as numbers mod 257 (i.e., between 0 and 256).
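
A sketch of this hash in Python: the coefficients a1..a4 are drawn once, at random, mod 257 (the sample address and names are made up):

    import random

    P = 257                                        # prime, just above 250
    a = [random.randrange(P) for _ in range(4)]    # a1..a4, chosen once

    def hash_ip(ip):
        x = [int(octet) for octet in ip.split(".")]
        return sum(ai * xi for ai, xi in zip(a, x)) % P

    print(hash_ip("128.148.32.12"))                # some bucket in 0..256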

17
Probability of collision
  • If (x1, ..., x4) and (y1, ..., y4) are two distinct IP addresses, and we hash both of them, how likely are we to get a collision?
  • Let's assume x4 and y4 differ.

18
Probability of Collision (2)
  • To collide, we need
  • a1x1 + a2x2 + a3x3 + a4x4 ≡ a1y1 + a2y2 + a3y3 + a4y4 (mod 257)
  • Rewrite:
  • a1(x1 - y1) + a2(x2 - y2) + a3(x3 - y3) ≡ -a4(x4 - y4) (mod 257)

19
Collision probability (3)
  • First compute the left-hand side:
  • b = a1(x1 - y1) + a2(x2 - y2) + a3(x3 - y3)
  • To have a collision, we'd need the right-hand side to be b:
  • -a4(x4 - y4) ≡ b (mod 257)
  • Now (x4 - y4) is nonzero, so there's only one value of a4 you can multiply by to get b, as we saw with our experiments mod 7.
  • That means that the probability of collision is 1/257, because a4 was chosen uniformly at random

20
Status
  • We had no control over the data items
  • Instead, we chose at random from a set H of hash functions, the ones defined by formulas like
  • f(x1, x2, x3, x4) = a1x1 + a2x2 + a3x3 + a4x4 mod 257
  • For any two distinct data items x and y, exactly |H|/n of the hash functions in H send x and y to the same bucket (n = number of buckets). So prob. of collision is 1/n.
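
An empirical check of that 1/n claim (a sketch; the two addresses are arbitrary, differing only in the last octet):

    import random

    P = 257
    x = (128, 148, 32, 12)
    y = (128, 148, 32, 13)               # differs from x only in x4

    def h(a, v):
        return sum(ai * vi for ai, vi in zip(a, v)) % P

    trials = 100_000
    hits = sum(
        h(a, x) == h(a, y)
        for a in ([random.randrange(P) for _ in range(4)] for _ in range(trials))
    )
    print(hits / trials)                 # should be close to 1/257 (about 0.004)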

21
Universal Hashing
  • Universal hashing consists of picking a set H of
    hash functions with the property on the previous
    slide, and then using one of them at random

22
Universal Hashing Procedure
  • Pick a table size, p, that's a prime about twice the expected number of items to hash
  • Treat each key as being a k-tuple of integers mod p.
  • Pick a1, ..., ak randomly from 0 to p - 1
  • Use h(x) = Σi ai·xi (mod p) as your hash function!
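
A sketch of the whole procedure for string keys (the byte-level view of the key and the constants are assumptions, not from the slides):

    import random

    def make_universal_hash(p, key_length):
        # p: prime table size; one random coefficient per byte position.
        a = [random.randrange(p) for _ in range(key_length)]

        def h(key):
            xs = key.encode("utf-8")[:key_length]   # the key as a k-tuple of ints mod p
            return sum(ai * xi for ai, xi in zip(a, xs)) % p

        return h

    h = make_universal_hash(p=197, key_length=16)
    print(h("hashing"), h("tables"))                # two buckets in 0..196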

23
Hashing Application
  • decoration of nodes in a tree or graph
  • For each new decoration (like height), make a
    hashtable.
  • Replace n.height = 4 with hTable.put(n, 4), and u = n.height with u = hTable.get(n)
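
In Python the same trick is just a dict keyed by the node object (a sketch; Node is a stand-in class):

    class Node:
        pass                          # some tree or graph node type

    n = Node()
    height = {}                       # one hashtable per decoration

    height[n] = 4                     # instead of n.height = 4
    u = height.get(n)                 # instead of u = n.height
    print(u)                          # 4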

24
Analysis of Hashing
  • We'll analyze for a table with 2n positions, and n items in it.
  • In general, look at k positions, n items
  • Average bucket size is s = n/k
  • In our case, s = 0.5.

25
Simple Uniform Hashing
  • To make analysis easy, we assume that the hash function is nice: each element is equally likely to fall in any one of the available slots.
  • This is called the simple uniform hashing
    assumption

26
Analysis Types
  • We can look at worst-case analysis, which is pretty bad
  • We can look at expected performance, which is
    pretty good

27
Worst case
  • All items hash to one location: the chain for that location has n items in it.
  • Insertion: O(n)
  • Have to look through the whole list to be sure the item isn't already there
  • Lookup: O(n)
  • Delete: O(n)
  • Clearly we don't love hashing for its worst-case performance!

28
Expected Performance
  • In a chaining-based hashtable, an unsuccessful search takes time O(1 + s)
  • Total list length = n
  • Number of lists = k
  • Average list length = n/k = s
  • Searching an average list is O(n/k); we pay one unit for hashing, so...
  • Performance is O(1 + s)
  • For the k = 2n case, O(1)
  • Same result for a successful search
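
A small simulation of the O(1 + s) claim under simple uniform hashing (a sketch; it charges one unit for hashing plus the length of the scanned chain for an unsuccessful search):

    import random

    n, k = 1000, 2000                    # n items in k = 2n slots, so s = 0.5
    buckets = [[] for _ in range(k)]
    for _ in range(n):
        buckets[random.randrange(k)].append(object())

    trials = 10_000
    cost = sum(1 + len(buckets[random.randrange(k)]) for _ in range(trials))
    print(cost / trials)                 # close to 1 + s = 1.5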

29
Universal Hashing Performance
  • If we hash n items into k slots, and n ≤ k, then the expected number of collisions involving a key, x, is less than 1.
  • Idea: for any two keys y and z, let Cyz be 1 if h(y) = h(z), 0 otherwise
  • i.e., Cyz = 1 if y and z collide
  • The universal hashing assumption says
  • E[Cyz] = 1/k

30
Universal Hashing Performance (2)
  • Let Cx be the total number of collisions
    involving key x in a hash table T.
  • Cx = Cxa + Cxb + ..., where a, b, ... are all the keys in the table, i.e.,
  • Cx = Σ Cxy,
  • where the sum is over all keys y in T except x.
  • E[Cx] = E[Σ Cxy] = Σ E[Cxy] = (n - 1)/k

31
Summary
  • Expected performance of a hashtable with universal hashing: O(1) for everything, as long as the table is only half full.

32
Misc facts
  • Even though the expected bucket size is small
    (0.5), the expected size of the fullest bucket is
    considerably larger.
  • For a half-full table, the probability of a collision somewhere, i.e., the probability that some bucket has size ≥ 2, is
  • 1 - (2n)! / (n! (2n)^n)
  • For n = 2, prob = 25%
  • For n = 10, prob ≈ 93%
  • For n = 20, prob ≈ 99%
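
Those figures can be recomputed directly from the formula (a quick illustrative check):

    from math import prod

    def collision_probability(n):
        # 1 - P(no collision) when n items go into 2n buckets uniformly,
        # i.e. 1 - (2n)! / (n! * (2n)^n)
        return 1 - prod((2 * n - i) / (2 * n) for i in range(n))

    for n in (2, 10, 20):
        print(n, collision_probability(n))    # 0.25, ~0.93, ~0.997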

33
Misc Facts (2)
  • For n items hashed into a table of size k, with the simple uniform hashing assumption, the fraction of chains with p elements is about
  • E(p, n, k) = (1/p!) (n/k)^p e^(-n/k)
  • For our case (k = 2n), this is
  • E(p) = (1/p!) 2^(-p) e^(-1/2) ≈ 2.2/(2^p p!)
  • Examples:
  • E(2) ≈ 27%   E(3) ≈ 4.5%   E(4) ≈ 0.5%
  • Almost all buckets have no more than 4 elements!

34
  • In fact (1981!) expected number of items in
    fullest chain is O(log(n)) when we place n items
    in an n-slot hashtable.

35
Clever trick
  • Multiple-choice hashing!
  • Instead of one hash function, choose d of them
  • Hash into each of the d resulting positions.
  • Actually place the item into the least-full chain
  • Expected fullest bin has about
  • 1 + log(log(n))/log(d)
  • items
  • Probably not useful in practice for hashing
  • Useful in other contexts: load balancing, data allocation to disks, routing
  • Idea due in part to Brown's own Prof. Eli Upfal!
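
A small balls-into-bins simulation of the idea (a sketch): compare the fullest bin with one random choice per item versus d = 2 choices.

    import random

    def fullest_bin(n, d):
        # Place n items into n bins; each item looks at d random bins
        # and goes into whichever is currently least full.
        bins = [0] * n
        for _ in range(n):
            candidates = [random.randrange(n) for _ in range(d)]
            best = min(candidates, key=lambda b: bins[b])
            bins[best] += 1
        return max(bins)

    n = 100_000
    print(fullest_bin(n, 1))    # grows like log n / log log n: typically around 8 here
    print(fullest_bin(n, 2))    # grows like log log n / log d: typically around 4 here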