HASHING - PowerPoint PPT Presentation

About This Presentation
Title:

HASHING

Description:

Using balanced trees (2-3, 2-3-4, red-black, and AVL trees) we can implement table operations (retrieval, insertion, and deletion) efficiently.

Slides: 33
Provided by: Ilya97

Transcript and Presenter's Notes

Title: HASHING


1
HASHING
  • Using balanced trees (2-3, 2-3-4, red-black, and
    AVL trees) we can implement table operations
    (retrieval, insertion, and deletion)
    efficiently → O(logN)
  • Can we find a data structure with which we can
    perform these table operations better than with
    balanced search trees? → O(1)
  • YES → HASH TABLES
  • In hash tables, we have an array (index 0..n-1)
    and an address calculator (hash function) which
    maps a search key into an array index between 0
    and n-1.

2
(Figure: the hash function acts as an address calculator, mapping a search key into a location in the hash table)
3
Hashing
  • A hash function tells us where to place an item
    in an array called a hash table. This method is
    known as hashing.
  • A hash function maps a search key into an integer
    between 0 and n-1.
  • We can have different hash functions.
  • Ex. h(x) = x mod n if x is an integer
  • The hash function is designed for the search keys
    depending on the data types of these search keys
    (int, string, ...)
  • Collisions occur when the hash function maps more
    than one item into the same array location.
  • We have to resolve these collisions using some
    mechanism.
  • A perfect hash function maps each search key into
    a unique location of the hash table.
  • A perfect hash function is possible only if we
    know all the search keys in advance.
  • In practice (we do not know all the search keys),
    a hash function can map more than one key into
    the same location (collision).
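The modular hash function mentioned above can be sketched in a few lines of Python; the table size of 11 is just an illustrative choice:

```python
# A minimal sketch of the hash function h(x) = x mod n for integer keys,
# using a hypothetical table size of 11.
def h(x, table_size=11):
    """Map an integer search key into an array index 0..table_size-1."""
    return x % table_size

print(h(13))  # 13 mod 11 = 2
print(h(2))   # 2 mod 11 = 2 -> collides with 13
```

Two different keys (13 and 2) landing on index 2 is exactly the collision the slides discuss next.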

4
Hash Function
  • We can design different hash functions.
  • But a good hash function should
  • be easy and fast to compute,
  • place items evenly throughout the hash table.
  • We will consider only hash functions that operate
    on integers.
  • If the key is not an integer, we map it into an
    integer first, and then apply the hash function.
  • The hash table size should be prime.
  • By selecting the table size as a prime number, we
    may place items evenly throughout the hash table,
    and we may reduce the number of collisions.

5
Hash Functions -- Selecting Digits
  • If the search keys are big integers (Ex.
    nine-digit numbers), we can select certain digits
    and combine them to create the address.
  • h(033475678) = 37 selecting the 2nd and 5th
    digits (table size is 100)
  • h(023455678) = 25
  • Digit selection is not a good hash function
    because it does not place items evenly throughout
    the hash table.
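As a sketch, digit selection for the nine-digit keys above (positions counted from 1, left to right) might look like this:

```python
def digit_select(key, positions=(2, 5)):
    """Hash a nine-digit key by concatenating selected digits.

    positions are 1-based, counted from the left; with two digits
    selected the results fall in 0..99 (table size 100)."""
    s = f"{key:09d}"              # keep leading zeros, e.g. '033475678'
    return int("".join(s[p - 1] for p in positions))

print(digit_select(33475678))  # digits 2 and 5 of 033475678 -> 37
print(digit_select(23455678))  # digits 2 and 5 of 023455678 -> 25
```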

6
Hash Functions Folding
  • Folding: selecting all digits and adding them
  • h(033475678) = 0+3+3+4+7+5+6+7+8 = 43
  • 0 ≤ h(nine-digit search key) ≤ 81
  • We can also select groups of digits and add
    these groups.

7
Hash Functions Modula Arithmetic
  • Modula arithmetic provides a simple and effective
    hash function.
  • We will use modula arithmetic as our hash
    function in the rest of our discussions.
  • h(x) x mod tableSize
  • The table size should be prime.
  • Some prime numbers 7,11, 13, ..., 101, ...

8
Hash Functions Converting Character String into
An Integer
  • If our search keys are strings, first we have to
    convert the string into an integer, and then apply
    a hash function which is designed to operate on
    integers to this integer value to compute the
    address.
  • We can use ASCII codes of characters in the
    conversion.
  • Consider the string NOTE; assign 1 (00001) to
    A, ....
  • N is 14 (01110), O is 15 (01111), T is 20
    (10100), E is 5 (00101)
  • Concatenate the four binary numbers to get a new
    binary number
  • 01110 01111 10100 00101 → 474,757
  • then apply x mod tableSize
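The 5-bits-per-letter conversion above can be sketched with bit shifts; appending each letter's code is the same as concatenating the binary strings:

```python
def string_to_int(s):
    """Encode each letter as 1..26 (A=1, B=2, ...) in 5 bits and
    concatenate the codes into one integer."""
    value = 0
    for ch in s:
        value = (value << 5) | (ord(ch) - ord('A') + 1)
    return value

x = string_to_int("NOTE")
print(x)       # 474757
print(x % 11)  # then apply h(x) = x mod tableSize
```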

9
Collision Resolution
  • There are two general approaches to collision
    resolution in hash tables
  • Open Addressing: each entry holds one item
  • Chaining: each entry can hold more than one item
  • Buckets hold a certain number of items

10
A Collision
  • Table size is 101

11
Open Addressing
  • During an attempt to insert a new item into a
    table, if the hash function indicates a location
    in the hash table that is already occupied, we
    probe for some other empty (or open) location in
    which to place the item. The sequence of locations
    that we examine is called the probe sequence.
  • If a scheme uses this approach, we say that
    it uses open addressing.
  • There are different open-addressing schemes
  • Linear Probing
  • Quadratic Probing
  • Double Hashing

12
Open Addressing Linear Probing
  • In linear probing, we search the hash table
    sequentially starting from the original hash
    location.
  • If a location is occupied, we check the next
    location
  • We wrap around from the last table location to
    the first table location if necessary.

13
Linear Probing - Example
  • Example
  • Table size is 11 (0..10)
  • Hash function: h(x) = x mod 11
  • Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
  • 20 mod 11 = 9
  • 30 mod 11 = 8
  • 2 mod 11 = 2
  • 13 mod 11 = 2 → 2+1 = 3
  • 25 mod 11 = 3 → 3+1 = 4
  • 24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5
  • 10 mod 11 = 10
  • 9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0

0 9
1
2 2
3 13
4 25
5 24
6
7
8 30
9 20
10 10
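The insertion steps above can be sketched as a short linear-probing routine; running it on the same keys reproduces the table shown:

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing; empty slots hold None.

    Assumes the table is not full, so the loop terminates."""
    n = len(table)
    i = key % n
    while table[i] is not None:
        i = (i + 1) % n           # check the next location, wrapping around
    table[i] = key

table = [None] * 11
for key in [20, 30, 2, 13, 25, 24, 10, 9]:
    linear_probe_insert(table, key)
print(table)  # [9, None, 2, 13, 25, 24, None, None, 30, 20, 10]
```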
14
Linear Probing Clustering Problem
  • One of the problems with linear probing is that
    table items tend to cluster together in the hash
    table.
  • This means that the table contains groups of
    consecutively occupied locations.
  • This phenomenon is called primary clustering.
  • Clusters can get close to one another and merge
    into a larger cluster.
  • Thus, one part of the table might be quite
    dense, even though another part has relatively
    few items.
  • Primary clustering causes long probe searches and
    therefore decreases the overall efficiency.

15
Open Addressing Quadratic Probing
  • The primary clustering problem can be almost
    eliminated if we use the quadratic probing scheme.
  • In quadratic probing,
  • we start from the original hash location i
  • if a location is occupied, we check the locations
    i+1², i+2², i+3², i+4², ...
  • We wrap around from the last table location to
    the first table location if necessary.

16
Quadratic Probing - Example
  • Example
  • Table size is 11 (0..10)
  • Hash function: h(x) = x mod 11
  • Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
  • 20 mod 11 = 9
  • 30 mod 11 = 8
  • 2 mod 11 = 2
  • 13 mod 11 = 2 → 2+1² = 3
  • 25 mod 11 = 3 → 3+1² = 4
  • 24 mod 11 = 2 → 2+1², 2+2² = 6
  • 10 mod 11 = 10
  • 9 mod 11 = 9 → 9+1², (9+2²) mod 11,
    (9+3²) mod 11 = 7

0
1
2 2
3 13
4 25
5
6 24
7 9
8 30
9 20
10 10
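The quadratic probe sequence h, h+1², h+2², ... can be sketched the same way; on the same keys it reproduces the table shown above:

```python
def quadratic_probe_insert(table, key):
    """Insert key by probing h, h+1^2, h+2^2, ... (mod table size).

    Assumes a free slot is reachable along this sequence."""
    n = len(table)
    h = key % n
    k = 0
    while table[(h + k * k) % n] is not None:
        k += 1
    table[(h + k * k) % n] = key

table = [None] * 11
for key in [20, 30, 2, 13, 25, 24, 10, 9]:
    quadratic_probe_insert(table, key)
print(table)  # [None, None, 2, 13, 25, None, 24, 9, 30, 20, 10]
```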
17
Open Addressing Double Hashing
  • Double hashing also reduces clustering.
  • In linear probing and quadratic probing, the
    probe increments are independent of the key.
  • We can select the increments used during probing
    using a second hash function. The second hash
    function h2 should satisfy
  • h2(key) ≠ 0
  • h2 ≠ h1
  • We first probe the location h1(key).
  • If the location is occupied, we probe the
    locations h1(key)+h2(key), h1(key)+2·h2(key),
    ...

18
Double Hashing - Example
  • Example
  • Table size is 11 (0..10)
  • Hash functions: h1(x) = x mod 11
  • h2(x) = 7 − (x mod 7)
  • Insert keys: 58, 14, 91
  • 58 mod 11 = 3
  • 14 mod 11 = 3 → 3+7 = 10
  • 91 mod 11 = 3 → 3+7, (3+2·7) mod 11 = 6

0
1
2
3 58
4
5
6 91
7
8
9
10 14
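The double-hashing probe sequence h1, h1+h2, h1+2·h2, ... with the slide's two hash functions can be sketched as:

```python
def double_hash_insert(table, key):
    """Insert key by probing h1, h1+h2, h1+2*h2, ... (mod table size)."""
    n = len(table)
    h1 = key % n
    h2 = 7 - (key % 7)            # second hash function; never 0
    k = 0
    while table[(h1 + k * h2) % n] is not None:
        k += 1
    table[(h1 + k * h2) % n] = key

table = [None] * 11
for key in [58, 14, 91]:
    double_hash_insert(table, key)
print(table)  # 58 at index 3, 14 at index 10, 91 at index 6
```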
19
Open Addressing Retrieval Deletion
  • In open addressing, to find an item with a given
    key
  • we probe the locations (in the same order as for
    insertion) until we find the desired item or we
    reach an empty location.
  • Deletions in open addressing cause complications.
  • We CANNOT simply delete an item from the hash
    table, because the newly emptied (deleted)
    locations would cause retrievals to stop
    prematurely, incorrectly indicating failure.
  • Solution: we have to have three kinds of
    locations in a hash table: Occupied, Empty,
    Deleted.
  • A deleted location is treated as an occupied
    location during retrieval and insertion.
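The three-state scheme above can be sketched as follows, using linear probing for the probe sequence; `EMPTY` and `DELETED` are sentinel markers:

```python
# Three slot states: a slot holds EMPTY, DELETED, or an actual key.
EMPTY, DELETED = object(), object()

def find(table, key):
    """Return the index of key, or None; only a truly EMPTY slot
    stops the search -- DELETED slots are probed past."""
    n = len(table)
    i = key % n
    for _ in range(n):
        if table[i] is EMPTY:
            return None           # key cannot lie further along the sequence
        if table[i] is not DELETED and table[i] == key:
            return i
        i = (i + 1) % n
    return None

def delete(table, key):
    i = find(table, key)
    if i is not None:
        table[i] = DELETED        # mark the slot instead of truly emptying it

table = [EMPTY] * 11
table[2], table[3] = 2, 13        # 2 and 13 both hash to 2 (mod 11)
delete(table, 2)
print(find(table, 13))  # 3 -- still found, despite the deleted slot at 2
```

If `delete` had written `EMPTY` instead of `DELETED`, the search for 13 would have stopped at slot 2 and wrongly reported failure.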

20
Separate Chaining
  • Another way to resolve collisions is to change
    the structure of the hash table.
  • In open addressing, each location of the hash
    table holds only one item.
  • We can define a hash table so that each location
    is itself an array called a bucket; we store
    the items which hash into a location in that
    array.
  • Problem: what should the size of a bucket be?
  • A better approach is to design the hash table as
    an array of linked lists; this collision
    resolution method is known as separate chaining.
  • In separate chaining, each entry (of the hash
    table) is a pointer to a linked list (the chain)
    of the items that the hash function has mapped
    into that location.
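A minimal separate-chaining table can be sketched with Python lists standing in for the linked chains:

```python
class ChainedHashTable:
    """Separate chaining: an array of chains, one per table location."""

    def __init__(self, size=11):
        self.chains = [[] for _ in range(size)]

    def insert(self, key):
        self.chains[key % len(self.chains)].append(key)

    def contains(self, key):
        return key in self.chains[key % len(self.chains)]

    def delete(self, key):
        chain = self.chains[key % len(self.chains)]
        if key in chain:
            chain.remove(key)     # deletion is simple: unlink from the chain

t = ChainedHashTable()
for key in [20, 2, 13, 24]:
    t.insert(key)
print(t.chains[2])  # [2, 13, 24] -- all three keys hash to location 2
```

Note how deletion needs no special markers here, unlike open addressing.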

21
Separate Chaining
22
Hashing - Analysis
  • An analysis of the average-case efficiency of
    hashing involves the load factor α, which is
    the ratio of the current number of items in
    the table to the table size.
  • α = (current number of items) / tableSize
  • The load factor measures how full a hash table
    is.
  • The hash table should not be filled too much, so
    that we get better performance from hashing.
  • Unsuccessful searches generally require more time
    than successful searches.
  • In average-case analyses, we assume that the hash
    function uniformly distributes the keys in the
    hash table.

23
Linear Probing Analysis
  • For linear probing, the approximate average
    number of comparisons (probes) that a search
    requires is as follows:
  • ½ [1 + 1/(1−α)] for a successful search
  • ½ [1 + 1/(1−α)²] for an unsuccessful search
  • As the load factor increases, the number of
    collisions increases, causing increased search
    times.
  • To maintain efficiency, it is important to
    prevent the hash table from filling up.
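As a sketch, the standard linear-probing estimates, ½[1 + 1/(1−α)] probes for a successful search and ½[1 + 1/(1−α)²] for an unsuccessful one, can be evaluated for a few load factors to show how quickly costs grow as the table fills:

```python
def linear_success(a):
    """Approximate probes for a successful search, linear probing."""
    return 0.5 * (1 + 1 / (1 - a))

def linear_fail(a):
    """Approximate probes for an unsuccessful search, linear probing."""
    return 0.5 * (1 + 1 / (1 - a) ** 2)

for a in (0.25, 0.5, 0.75, 0.9):
    print(a, round(linear_success(a), 2), round(linear_fail(a), 2))
```

At α = 0.5 a successful search needs about 1.5 probes, while at α = 0.9 an unsuccessful one needs roughly 50.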

24
Linear Probing Analysis -- Example
  • What is the average number of probes for a
    successful search and an unsuccessful search for
    this hash table?
  • Hash function: h(x) = x mod 11
  • Successful search:
  • 20: 9 -- 30: 8 -- 2: 2 -- 13: 2, 3 --
    25: 3, 4
  • 24: 2, 3, 4, 5 -- 10: 10 -- 9: 9, 10, 0
  • Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
  • Unsuccessful search:
  • We assume that the hash function uniformly
    distributes the keys.
  • 0: 0, 1 -- 1: 1 -- 2: 2, 3, 4, 5, 6 -- 3:
    3, 4, 5, 6
  • 4: 4, 5, 6 -- 5: 5, 6 -- 6: 6 -- 7: 7 -- 8:
    8, 9, 10, 0, 1
  • 9: 9, 10, 0, 1 -- 10: 10, 0, 1
  • Avg. probes for US =
    (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11

0 9
1
2 2
3 13
4 25
5 24
6
7
8 30
9 20
10 10
25
Quadratic Probing Double Hashing Analysis
  • For quadratic probing and double hashing, the
    approximate average number of comparisons
    (probes) that a search requires is as follows:
  • (1/α) ln[1/(1−α)] for a successful search
  • 1/(1−α) for an unsuccessful search
  • On average, both methods require fewer
    comparisons than linear probing.

26
Separate Chaining
  • For separate chaining, the approximate average
    number of comparisons (probes) that a search
    requires is as follows:
  • 1 + α/2 for a successful search
  • α for an unsuccessful search
  • Separate chaining is the most efficient collision
    resolution scheme.
  • But it requires more storage: we need storage
    for the pointer fields.
  • We can easily perform the deletion operation
    using the separate-chaining scheme. Deletion is
    very difficult in open addressing.

27
The relative efficiency of four
collision-resolution methods
28
What Constitutes a Good Hash Function
  • A hash function should be easy and fast to
    compute.
  • A hash function should scatter the data evenly
    throughout the hash table.
  • How well does the hash function scatter random
    data?
  • How well does the hash function scatter
    non-random data?
  • Two general principles:
  • The hash function should use the entire key in
    the calculation.
  • If a hash function uses modulo arithmetic, the
    table size should be prime.

29
Hash Table versus Search Trees
  • In most operations, the hash table performs
    better than search trees.
  • But traversing the data in a hash table in
    sorted order is very difficult.
  • For such operations, the hash table is not a
    good choice.
  • Ex. finding all the items in a certain range.

30
Data with Multiple Organizations
  • A single data organization does not support all
    operations efficiently.
  • We may need multiple organizations of the data to
    get efficient implementations for all operations.
  • One organization will be used for certain
    operations, and the other organizations will be
    used for the other operations.

31
Data with Multiple Organizations (cont.)
32
Data with Multiple Organizations (cont.)