Csci 2111: Data and File Structures, Week 10, Lectures 1

Provided by: N205

1
Csci 2111: Data and File Structures, Week 10,
Lectures 1 & 2

Hashing
2
Motivation

Sequential Searching can be done in O(N) access
time, meaning that the number of seeks grows in
proportion to the size of the file.
B-Trees improve on this greatly, providing
O(log_k N) access, where k is a measure of the
leaf size (i.e., the number of records that can
be stored in a leaf).
What we would like to achieve, however, is O(1)
access, which means that no matter how big a
file grows, access to a record always takes the
same small number of seeks.
Static Hashing techniques can achieve such
performance provided that the file does not
grow over time.

3
What is Hashing?

A Hash function is a function h(K) which
transforms a key K into an address.
Hashing is like indexing in that it involves
associating a key with a relative record address.
Hashing, however, is different from indexing in
two important ways:
With hashing, there is no obvious connection between the key and the location.
With hashing, two different keys may be transformed to the same address.

4
Collisions

When two different keys produce the same address,
there is a collision. The keys involved are
called synonyms.
Coming up with a hashing function that avoids
collisions altogether is extremely difficult. It
is best to simply find ways to deal with them.
Possible solutions:
Spread out the records.
Use extra memory.
Put more than one record at a single address.

5
A Simple Hashing Algorithm

Step 1: Represent the key in numerical form.
Step 2: Fold and add.
Step 3: Divide by a prime number and use the
remainder as the address.
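
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not fix the key width, the way characters are paired, or the divisors, so the 12-character padding, the pairing of two character codes into one number, the intermediate modulus 19937, and the 101-address file are all choices made here for the example.

```python
def fold_and_add_hash(key: str, num_addresses: int = 101, width: int = 12) -> int:
    """Fold-and-add hash sketch (all constants are illustrative assumptions)."""
    # Step 1: represent the key numerically, padded to a fixed even width.
    padded = key.upper().ljust(width)[:width]
    total = 0
    # Step 2: fold the key into two-character pieces and add them up.
    for i in range(0, width, 2):
        pair = ord(padded[i]) * 100 + ord(padded[i + 1])
        # Keep the running sum small; 19937 is an assumed prime bound.
        total = (total + pair) % 19937
    # Step 3: divide by a prime (the assumed file size) and keep the remainder.
    return total % num_addresses


print(fold_and_add_hash("LOWELL"))  # prints 46
```

Using a prime for the final division tends to spread the remainders more evenly than a number with many small factors.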

6
Hashing Functions and Record Distributions

Records can be distributed among addresses in
different ways: there may be (a) no synonyms
(uniform distribution), (b) only synonyms (worst
case), or (c) a few synonyms (happens with random
distributions).
Purely uniform distributions are difficult to
obtain and may not be worth searching for.
Random distributions can be easily derived, but
they are not perfect since they may generate a
fair number of synonyms.
We want better hashing methods.

7
Some Other Hashing Methods

Though there is no hash function that guarantees
better-than-random distributions in all cases, by
taking into consideration the keys that are
being hashed, certain improvements are possible.
Here are some methods that are potentially better
than random:
Examine keys for a pattern
Fold parts of the key
Divide the key by a number
Square the key and take the middle
Radix transformation
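
Two of the methods listed above, sketched under assumptions. The slides only name the methods, so the function names, the choice of two middle digits, base 11, and a 99-address file are hypothetical values picked for illustration.

```python
def mid_square_hash(key: int, digits: int = 2) -> int:
    """Square the key and take the middle `digits` digits (assumed width)."""
    squared = str(key * key)
    start = max(0, (len(squared) - digits) // 2)
    return int(squared[start:start + digits])


def radix_transform_hash(key: int, base: int = 11, num_addresses: int = 99) -> int:
    """Rewrite the key in another base, read the digits as a decimal
    number, and divide by the (assumed) number of addresses."""
    digits = []
    n = key
    while n > 0:
        digits.append(n % base)
        n //= base
    value = 0
    for d in reversed(digits):
        # A digit of 10 or more simply carries into the next decimal place.
        value = value * 10 + d
    return value % num_addresses


print(mid_square_hash(453))       # 453 * 453 = 205209, middle digits: 52
print(radix_transform_hash(453))  # 453 is 382 in base 11, and 382 mod 99 = 85
```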

8
Predicting the Distribution of Records

When using a random distribution, we can use a
number of mathematical tools to obtain
conservative estimates of how our hashing
function is likely to behave.
Using the Poisson function
$$p(x) = \frac{(r/N)^x \, e^{-(r/N)}}{x!}$$
applied to hashing, we can conclude the following:
in general, if there are N addresses and r
records, then the expected number of addresses
with x records assigned to them is N p(x).
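
A minimal sketch of the Poisson estimate, assuming r is the number of records and N the number of addresses. The function names are made up for this example.

```python
import math


def poisson(x: int, r: int, n: int) -> float:
    """Probability that a given address receives exactly x records
    when r records are hashed randomly into n addresses."""
    ratio = r / n
    return ratio ** x * math.exp(-ratio) / math.factorial(x)


def expected_addresses(x: int, r: int, n: int) -> float:
    """Expected number of addresses with exactly x records: N * p(x)."""
    return n * poisson(x, r, n)
```

For example, `expected_addresses(0, r, n)` estimates how many addresses stay empty.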

9
Predicting Collisions for a Full File

Suppose you have a hashing function that you
believe will distribute records randomly and you
want to store 10,000 records in 10,000 addresses.
How many addresses do you expect to have no
records assigned to them?
How many addresses should have one, two, and
three records assigned respectively?
How can we reduce the number of overflow records?
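
As a worked sketch of the first two questions, assume the Poisson model with r = N = 10,000, so r/N = 1 and p(x) = e^{-1}/x!. The values are rounded to whole addresses.

```latex
\begin{align*}
N\,p(0) &= 10000 \cdot \frac{e^{-1}}{0!} \approx 3679 \\
N\,p(1) &= 10000 \cdot \frac{e^{-1}}{1!} \approx 3679 \\
N\,p(2) &= 10000 \cdot \frac{e^{-1}}{2!} \approx 1839 \\
N\,p(3) &= 10000 \cdot \frac{e^{-1}}{3!} \approx 613
\end{align*}
```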

10
Increasing Memory Space I

Reducing collisions can be done by choosing a
good hashing function or using extra memory.
The question asked here is how much extra memory
should be used to obtain a given rate of
collision reduction?
Definition: Packing density refers to the ratio
of the number of records to be stored (r) to the
number of available spaces (N).
The packing density gives a measure of the amount
of space in a file that is used.

11
Increasing Memory Space II

The Poisson Distribution allows us to predict the
number of collisions that are likely to
occur given a certain packing density. We use the
Poisson Distribution to answer the following
questions:
How many addresses should have no records
assigned to them?
How many addresses should have exactly one record
assigned (no synonym)?
How many addresses should have one record plus
one or more synonyms?
Assuming that only one record can be assigned to
each home address, how many overflow records can
be expected?
What percentage of records should be overflow
records?
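
The five questions above can be answered numerically. The sketch below assumes a random (Poisson) distribution and one record per home address; the function name and the dictionary keys are invented for this example.

```python
import math


def packing_density_report(r: int, n: int) -> dict:
    """Poisson estimates for r records hashed into n single-record addresses."""
    density = r / n
    p0 = math.exp(-density)            # address receives no record
    p1 = density * math.exp(-density)  # address receives exactly one record
    empty = n * p0
    exactly_one = n * p1
    with_synonyms = n * (1 - p0 - p1)  # one record plus one or more synonyms
    # Every non-empty address keeps one record at home; the rest overflow.
    overflow = r - (exactly_one + with_synonyms)
    return {
        "empty_addresses": empty,
        "exactly_one": exactly_one,
        "with_synonyms": with_synonyms,
        "overflow_records": overflow,
        "overflow_percent": 100 * overflow / r,
    }


report = packing_density_report(500, 1000)  # packing density 0.5 (assumed example)
```

With these assumed figures, roughly 21% of the records end up as overflow records.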

12
Collision Resolution by Progressive Overflow

How do we deal with records that cannot fit into
their home address? A simple approach:
Progressive Overflow, or Linear Probing.
If a key, k1, hashes into the same address, a1,
as another key, k2, then look for the first
available address, a2, following a1 and place k1
in a2. If the end of the address space is
reached, then wrap around to its beginning.
When searching for a key that is not in the
file, an empty address will be reached if the
address space is not full; otherwise the search
will come back to where it began.
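
A minimal in-memory sketch of progressive overflow. The toy hash function `home_address` (sum of character codes) is an assumption for the example, and a Python list stands in for the file's address space.

```python
from typing import List, Optional


def home_address(key: str, size: int) -> int:
    # Hypothetical toy hash: sum of character codes modulo the table size.
    return sum(ord(c) for c in key) % size


def insert(table: List[Optional[str]], key: str) -> int:
    """Place key at its home address or at the first free address after it."""
    size = len(table)
    home = home_address(key, size)
    for step in range(size):
        addr = (home + step) % size  # wrap around at the end
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("address space is full")


def search(table: List[Optional[str]], key: str) -> Optional[int]:
    """Return the address of key, or None if it is not in the table."""
    size = len(table)
    home = home_address(key, size)
    for step in range(size):
        addr = (home + step) % size
        if table[addr] is None:
            return None  # empty address reached: the key is not stored
        if table[addr] == key:
            return addr
    return None  # came back to where the search began
```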

13
Search Length when using Progressive Overflow

Progressive Overflow causes extra searches and
thus extra disk accesses.
If there are many collisions, then many records
will be far from home.
Definitions: Search length refers to the number
of accesses required to retrieve a record from
secondary memory. The average search length is
the average number of times you can expect to
have to access the disk to retrieve a record.
Average search length = (Total search
length)/(Total number of records)
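
To make the definition concrete, the sketch below loads integer keys with progressive overflow and counts the accesses each record would need on retrieval. The hash `key % size` and the sample keys are assumptions chosen for the example.

```python
def average_search_length(keys, size):
    """Average number of accesses needed to retrieve each stored key."""
    if len(keys) > size:
        raise ValueError("more keys than addresses")
    table = [None] * size
    total_search_length = 0
    for key in keys:
        addr = key % size  # assumed hash function
        accesses = 1
        while table[addr] is not None:
            addr = (addr + 1) % size  # progressive overflow
            accesses += 1
        table[addr] = key
        # With no deletions, retrieving the key later takes the same accesses.
        total_search_length += accesses
    return total_search_length / len(keys)


# Search lengths are 1, 1, 3, 4 and 4, so the average is 13 / 5 = 2.6.
print(average_search_length([20, 21, 31, 42, 32], 11))
```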

14
Storing More than One Record per Address: Buckets

Definition: A bucket describes a block of records
sharing the same address that is retrieved in one
disk access.
When a record is to be stored or retrieved, its
home bucket address is determined by hashing.
When a bucket is filled, we still have to worry
about the record overflow problem, but this
occurs much less often than when each address can
hold only one record.

15
Effect of Buckets on Performance

To compute how densely packed a file is, we need
to consider 1) the number of addresses (buckets),
N, 2) the number of records we can put at each
address (bucket size), b, and 3) the number of
records, r. Then, Packing Density = r/(bN).
Though the packing density does not change when
halving the number of addresses and doubling the
size of the buckets, the expected number of
overflows decreases dramatically.
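
A sketch of this effect under the Poisson model. The function name and the figures in the usage lines (750 records in 1000 single-record addresses versus 500 two-record buckets) are assumptions chosen for illustration; both layouts have a packing density of 0.75.

```python
import math


def expected_overflow(r: int, n: int, b: int) -> float:
    """Expected overflow records for r records, n addresses, bucket size b."""
    ratio = r / n
    stored_per_address = 0.0
    cumulative = 0.0
    for x in range(b):  # addresses holding fewer than b records keep them all
        p = ratio ** x * math.exp(-ratio) / math.factorial(x)
        stored_per_address += x * p
        cumulative += p
    # Addresses receiving b or more records keep exactly b of them.
    stored_per_address += b * (1 - cumulative)
    return r - n * stored_per_address


single = expected_overflow(750, 1000, 1)  # about 222 overflow records
paired = expected_overflow(750, 500, 2)   # about 140 overflow records
```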

16
Making Deletions

Deleting a record from a hashed file is more
complicated than adding a record for two reasons:
The slot freed by the deletion must not be allowed to hinder later searches.
It should be possible to reuse the freed slot for later additions.
In order to deal with deletions we use
tombstones, i.e., markers indicating that a
record once lived there but no longer does.
Tombstones solve both the problems caused by
deletion.
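Insertion of records is slightly different when
using tombstones: a new record may reuse a
tombstone slot, but only after the rest of the
search path has been checked to make sure the key
is not already stored further along.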
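
A minimal sketch of tombstones combined with progressive overflow. The names `EMPTY`, `TOMBSTONE`, and `home_address`, and the hash `key % size`, are assumptions for the example.

```python
EMPTY = None
TOMBSTONE = object()  # marker: a record once lived here


def home_address(key: int, size: int) -> int:
    return key % size  # assumed hash function


def search(table, key):
    size = len(table)
    for step in range(size):
        addr = (home_address(key, size) + step) % size
        slot = table[addr]
        if slot is EMPTY:
            return None  # a truly empty slot ends the search
        if slot is not TOMBSTONE and slot == key:
            return addr  # tombstones are skipped, not treated as empty
    return None


def delete(table, key) -> bool:
    addr = search(table, key)
    if addr is None:
        return False
    table[addr] = TOMBSTONE
    return True


def insert(table, key) -> int:
    size = len(table)
    first_free = None
    for step in range(size):
        addr = (home_address(key, size) + step) % size
        slot = table[addr]
        if slot is EMPTY:
            if first_free is None:
                first_free = addr
            break
        if slot is TOMBSTONE:
            if first_free is None:
                first_free = addr  # remember it, but keep checking for the key
        elif slot == key:
            return addr  # key is already stored
    if first_free is None:
        raise RuntimeError("address space is full")
    table[first_free] = key
    return first_free
```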

17
Effects of Deletions and Additions on Performance

After a large number of deletions and additions
have taken place, one can expect to find many
tombstones occupying places that could be
occupied by records whose home address precedes
them but that are stored after them.
This degrades average search lengths.
There are 3 types of solutions for dealing with
this problem: a) local reorganization during
deletions, b) global reorganization when the
average search length is too large, c) use of a
different collision resolution algorithm.

18
Other Collision Resolution Techniques

There are a few variations on random hashing that
may improve performance:
Double Hashing: When an overflow occurs, use a
second hashing function to map the record to its
overflow location.
Chained Progressive Overflow: Like progressive
overflow except that synonyms are linked together
with pointers.
Chaining with a Separate Overflow Area: Like
chained progressive overflow except that overflow
records do not occupy home addresses.
Scatter Tables: The hash file contains no
records, but only pointers to records, i.e., it
is an index.
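
One common way to realize double hashing is sketched below. It is an assumption of this example, not something the slides specify, that the second hash function supplies a step size added repeatedly to the home address, and that the table size is prime so that every address is eventually visited.

```python
def double_hash_insert(table, key: int) -> int:
    """Insert an integer key using a second hash to choose the probe step."""
    size = len(table)               # assumed to be a prime number
    home = key % size               # first hash: home address
    step = 1 + (key % (size - 1))   # second hash: step size, never zero
    for i in range(size):
        addr = (home + i * step) % size
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("address space is full")


table = [None] * 7
print(double_hash_insert(table, 10))  # home address 3 is free: 3
print(double_hash_insert(table, 17))  # home 3 is taken, step 6: 2
print(double_hash_insert(table, 24))  # home 3 is taken, step 1: 4
```

Because synonyms usually get different step sizes, their overflow records tend to be scattered rather than clustered right after the home address.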

19
Pattern of Record Access

If we have some information about what records
get accessed most often, we can optimize their
location so that these records will have short
search lengths.
By doing this, we try to decrease the effective
average search length even if the nominal average
search length remains the same.
This principle is related to the one used in
Huffman encoding.
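
A small sketch of the idea, assuming the effective average search length is the access-frequency-weighted mean of the individual search lengths. The function name and the figures are made up for illustration.

```python
def effective_average_search_length(records):
    """records: list of (access_frequency, search_length) pairs."""
    total_accesses = sum(freq for freq, _ in records)
    weighted = sum(freq * length for freq, length in records)
    return weighted / total_accesses


# Same two search lengths (nominal average 2), different placement:
hot_record_at_home = effective_average_search_length([(90, 1), (10, 3)])   # 1.2
hot_record_far_away = effective_average_search_length([(90, 3), (10, 1)])  # 2.8
```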
