Data organization in main memory or disk - PowerPoint PPT Presentation

Provided by: Euri5

Transcript and Presenter's Notes


1
Hashing
  • Data organization in main memory or disk
  • sequential, binary trees, ...
  • The location of a key depends on other keys =>
    unnecessary key comparisons to find a key
  • Question: find a key with a single comparison
  • Hashing: the location of a record is computed
    using its key only
  • Fast for random accesses, slow for range queries

2
Hash Table
  • Hash function: transforms keys to array indices

3
h(key) = key mod 1000
4
Good Hash Functions
  • Uniform: distribute keys evenly in space
  • Perfect: two records cannot occupy the same
    location, or
  • Order preserving
  • Difficult to find such hash functions
  • Property 2 is the most essential
  • Most functions are no better than
    h(key) = key mod m
  • Hash collision: two different keys hash to the
    same position
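A minimal sketch of the division-method hash function and of how collisions show up in practice. The helper names `h` and `find_collisions` and the sample keys are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def h(key: int, m: int) -> int:
    """Division-method hash: map an integer key to an index in 0..m-1."""
    return key % m

def find_collisions(keys, m):
    """Group keys by hash value and keep only groups with more than one key."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[h(key, m)].append(key)
    return {i: ks for i, ks in buckets.items() if len(ks) > 1}

# Sample keys chosen so that two pairs share a hash value for m = 1000.
collisions = find_collisions([193, 2193, 27, 4027, 5], 1000)
print(collisions)
```

Keys that differ by a multiple of m always collide under this function, which is why a collision resolution strategy is needed.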

5
Collision Resolution
  • Open addressing (rehashing): compute a new position
    to store the key in the table (no extra space)
  • linear probing
  • double hashing
  • Separate chaining: lists of keys mapped to the
    same position (uses extra space)

6
Open Addressing
  • Computes a new address to store the key if its
    position is occupied (rehashing)
  • if the new address is occupied too, compute another
    one, until an empty position is found
  • primary hash function: i = h(key)
  • rehash function: rh(i) = rh(h(key))
  • hash sequence: (h0, h1, h2, ...) = (h(key), rh(h(key)),
    rh(rh(h(key))), ...)
  • To find a key, follow the same hash sequence
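The procedure above can be sketched as follows. This is an illustrative outline rather than the presenter's code: the function names `insert` and `search`, the use of `None` for an empty position, and the limit of m probes are all assumptions.

```python
def insert(table, key, h, rh):
    """Follow the hash sequence h(key), rh(h(key)), ... to the first empty slot."""
    i = h(key)
    for _ in range(len(table)):          # assumption: give up after m probes
        if table[i] is None:
            table[i] = key
            return i
        i = rh(i)
    raise RuntimeError("no empty position found")

def search(table, key, h, rh):
    """Follow the same hash sequence; reaching an empty slot means 'not found'."""
    i = h(key)
    for _ in range(len(table)):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = rh(i)
    return None

m = 100
table = [None] * m
h = lambda key: key % m                  # primary hash function
rh = lambda i: (i + 1) % m               # rehash function
insert(table, 93, h, rh)                 # occupies position 93
pos = insert(table, 193, h, rh)          # 93 is taken, so 193 is rehashed
print(pos, search(table, 193, h, rh))
```

Passing `h` and `rh` as parameters keeps the probing loop independent of the particular rehash function in use.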

7
Example
  • i = h(key) = key mod 100
  • rh(i) = (i + 1) mod 100
  • key 193
  • i = h(193) = 93
  • rh(i) = (93 + 1) mod 100 = 94
  • If position 93 is occupied, key 193 will occupy
    position 94

8
Problem 1: Locate Empty Positions
  • No empty position can be found
  • the table is full
  • check on number of empty positions
  • the hash function fails to find an empty position
    although the table is not full!
  • i = h(key) = key mod 1000
  • rh(i) = (i + 200) mod 1000 => checks only 5
    positions on a table of 1000 positions
  • rh(i) = (i + 1) mod 1000: successive positions
  • rh(i) = (i + c) mod 1000 where GCD(c, m) = 1
    (here m = 1000)
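A quick way to check the effect of the step c is to count the distinct positions that rh(i) = (i + c) mod m can reach. The sketch below is illustrative; the function name `positions_visited` and the choice c = 7 as a step coprime to 1000 are assumptions.

```python
from math import gcd

def positions_visited(start, c, m):
    """Return the set of positions reached by repeatedly applying (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return seen

m = 1000
for c in (200, 1, 7):
    reached = len(positions_visited(0, c, m))
    # The number of reachable positions equals m / GCD(c, m).
    print(c, reached, m // gcd(c, m))
```

When GCD(c, m) = 1 the rehash function eventually visits every position, so an empty position is always found if one exists.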

9
Problem 2: Primary Clustering
  • Different keys that hash into different addresses
    compete with each other in successive rehashes
  • i = h(key) = key mod 100
  • rh(i) = (i + 1) mod 100
  • keys 1990, 1991, 1992, 1993, 1994 => 94
    (if positions 90 to 93 are occupied, all five keys
    compete for position 94)

10
Problem 3: Secondary Clustering
  • Different keys which hash to the same hash value
    have the same rehash sequence
  • i = h(key) = key mod 10
  • rh(i, j) = (i + j) mod 10, where j is the rehash number
  • key 23: h(23) = 3
  • rh: 4, 6, 9, 3, ...
  • key 13: h(13) = 3
  • rh: 4, 6, 9, 3, ...
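The two identical sequences can be reproduced with a short sketch. It assumes that j counts the rehashes (1, 2, 3, ...) and that rh is applied to the previous position, which is the reading that matches the numbers on the slide; the function name `rehash_sequence` is hypothetical.

```python
def rehash_sequence(key, m, count):
    """Positions produced by rh(i, j) = (i + j) mod m for j = 1..count."""
    i = key % m                  # primary hash value
    seq = []
    for j in range(1, count + 1):
        i = (i + j) % m          # depends only on the previous position and j
        seq.append(i)
    return seq

seq_23 = rehash_sequence(23, 10, 4)
seq_13 = rehash_sequence(13, 10, 4)
print(seq_23, seq_13)
```

Because the sequence depends only on the starting position and not on the key itself, every key with h(key) = 3 follows the same path.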

11
Linear Probing
  • Store the key into the next free position
  • h_0 = h(key), usually h_0 = key mod m
  • h_i = (h_(i-1) + 1) mod m, i >= 1

S = 22, 35, 301, 99, 102, 452
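A minimal sketch of linear probing insertion applied to the keys in S. The transcript does not state the table size for this example, so m = 10 is an assumption, as are the function name `linear_probe_insert` and the use of `None` for an empty position.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing; return the number of positions probed."""
    m = len(table)
    i = key % m                          # h_0 = key mod m
    for probes in range(1, m + 1):
        if table[i] is None:
            table[i] = key
            return probes
        i = (i + 1) % m                  # h_i = (h_(i-1) + 1) mod m
    raise RuntimeError("table is full")

table = [None] * 10                      # assumption: m = 10
probes = [linear_probe_insert(table, k) for k in [22, 35, 301, 99, 102, 452]]
print(table)
print(probes)
```

With this table size, 22, 102 and 452 all hash to position 2, so the later keys are pushed into positions 3 and 4.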
12
Observation 1: Number of Probes
  • Different insertion sequences => different hash
    sequences
  • S1: 11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53
    => 28 probes
  • S2: 53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11
    => 30 probes

h(key) = key mod 13
13
Observation 2
  • Deletions are not easy
  • i = h(key) = key mod 10
  • rh(i) = (i + 1) mod 10
  • Action: delete(65) and search(5)
  • Problem: search will stop at the empty position
    and will not find 5
  • Solution: mark the position as deleted rather
    than empty
  • the marked position can be reused
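The delete(65) / search(5) scenario can be sketched with a "deleted" marker. This is an illustrative outline under assumptions: the sentinel objects `EMPTY` and `DELETED`, the function names, and the table of size 10 holding only keys 65 and 5 are not taken from the slides.

```python
EMPTY, DELETED = object(), object()      # sentinels for never-used and deleted slots

def insert(table, key):
    """Linear probing insert; a slot marked DELETED can be reused."""
    m = len(table)
    i = key % m
    for _ in range(m):
        if table[i] is EMPTY or table[i] is DELETED:
            table[i] = key
            return i
        i = (i + 1) % m
    raise RuntimeError("table is full")

def search(table, key):
    """Stop only at an EMPTY slot; DELETED slots are skipped, not treated as the end."""
    m = len(table)
    i = key % m
    for _ in range(m):
        if table[i] is EMPTY:
            return None
        if table[i] is not DELETED and table[i] == key:
            return i
        i = (i + 1) % m
    return None

def delete(table, key):
    """Mark the slot as DELETED instead of EMPTY so later searches keep probing."""
    i = search(table, key)
    if i is not None:
        table[i] = DELETED

table = [EMPTY] * 10
insert(table, 65)                        # goes to position 5
insert(table, 5)                         # position 5 is taken, so 5 goes to position 6
delete(table, 65)
found = search(table, 5)
print(found)
```

Had delete set the slot back to EMPTY, the search for 5 would have stopped at position 5 and reported the key as missing.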

14
Observation 3
  • Linear probing tends to create long sequences of
    occupied positions
  • the longer a sequence is, the longer it tends to
    become
  • P: probability to use a position in the cluster

15
Observation 4
  • Linear probing suffers from both primary and
    secondary clustering
  • Solution: double hashing
  • uses two hash functions h1, h2 and a rehashing
    function rh

16
Double Hashing
  • Two hash functions and a rehashing function
  • primary hash function: i = h1(key) = key mod m
  • secondary hash function: h2(key)
  • rehashing function: rh(i, key) = (i + h2(key)) mod m
  • h2(m, key) is some function of m and key
  • helps rh in computing random positions in the
    hash table
  • h2 is computed once for each key!

17
Example of Double Hashing
  • hash function
  • h1(key) = key mod m
  • q = (key div m) mod m, used to compute h2(key)
  • rehash function
  • rh(i, key) = (i + h2(key)) mod m

18
Example (continued)
  • m = 10, key = 23
  • h1(23) = 3, h2(23) = 2
  • rh(3, 2) = (3 + 2) mod 10 = 5
  • rehash sequence: 5, 7, 9, 1, ...
  • m = 10, key = 13
  • h1(13) = 3, h2(13) = 1, rh(3, 1) = (3 + 1) mod 10 = 4
  • rehash sequence: 4, 5, 6, ...
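The two rehash sequences can be reproduced with a short sketch. The transcript does not preserve the exact definition of h2, so the code assumes h2(key) = q = (key div m) mod m, with a fallback of 1 when q = 0 so that the step is never zero; that fallback and the function names are assumptions.

```python
def h1(key, m):
    """Primary hash function."""
    return key % m

def h2(key, m):
    """Secondary hash function (assumed form): q, or 1 when q would be 0."""
    q = (key // m) % m
    return q if q != 0 else 1

def rehash_sequence(key, m, count):
    """Positions produced by rh(i, key) = (i + h2(key)) mod m."""
    step = h2(key, m)            # computed once per key
    i = h1(key, m)
    seq = []
    for _ in range(count):
        i = (i + step) % m
        seq.append(i)
    return seq

seq_23 = rehash_sequence(23, 10, 4)
seq_13 = rehash_sequence(13, 10, 3)
print(seq_23, seq_13)
```

Both keys start at position 3 but follow different paths, which is how double hashing avoids secondary clustering.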

19
Performance of Open Addressing
  • Distinguish between
  • successful and
  • unsuccessful search
  • Assume a series of probes to random positions
  • independent events
  • load factor α = n/m
  • α: probability to probe an occupied position
  • each position has the same probability P = 1/m

20
Unsuccessful Search
  • The hash sequence is exhausted
  • let u be the expected number of probes
  • u equals the expected length of the hash sequence
  • P(k): probability to search k positions in the
    hash sequence

22
independent events
u increases with α => performance drops as α
increases
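The formula for u did not survive in the transcript. Under the stated assumptions (probes are independent events, and each probe hits an occupied position with probability equal to the load factor, written here as α = n/m), the standard derivation runs as follows; it is offered as a reconstruction, not as the exact content of the missing slide.

```latex
% A search of length k makes k-1 probes into occupied positions,
% followed by one probe into an empty position.
P(k) = \alpha^{k-1}(1-\alpha)

u = \sum_{k=1}^{\infty} k\,P(k)
  = (1-\alpha)\sum_{k=1}^{\infty} k\,\alpha^{k-1}
  = \frac{1-\alpha}{(1-\alpha)^{2}}
  = \frac{1}{1-\alpha}
```

For example, a half-full table (α = 0.5) gives u = 2 expected probes, while α = 0.9 gives u = 10.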
23
Successful Search
  • The hash sequence is not exhausted
  • the number of probes to find a key equals the
    number of probes s at the time the key was
    inserted plus 1
  • α was less at that time
  • consider all values of α

u: equivalent to unsuccessful search
approximation
increases with α
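The averaging step can be written out as below. It assumes that inserting a key at load factor x costs as much as an unsuccessful search at that load factor, namely 1/(1 - x) probes under the independence assumption, and that finding the key later costs the same; α denotes the final load factor n/m. This is the usual textbook approximation, not necessarily the exact formula from the slide.

```latex
% Average the insertion cost over all load factors from 0 up to \alpha.
s \approx \frac{1}{\alpha}\int_{0}^{\alpha}\frac{dx}{1-x}
  = -\frac{1}{\alpha}\,\ln(1-\alpha)
```

For α = 0.5 this gives about 1.39 probes, and for α = 0.9 about 2.56 probes, in both cases fewer than the corresponding unsuccessful search.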
24
Performance
  • The performance drops as α increases
  • the higher the value of α is, the higher the
    probability of collisions
  • Unsuccessful search is more expensive than
    successful search
  • unsuccessful search exhausts the hash sequence

25
Experimental Results
26
Performance on Full Table
27
Separate Chaining
  • Keys hashing to the same hash value are stored in
    separate lists
  • one list per hash position
  • can store more than m records
  • easy to implement
  • the keys in each list can be ordered

28
h(key) = key mod m
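A minimal sketch of separate chaining with h(key) = key mod m, using one Python list per hash position. The class name `ChainedHashTable`, the sample keys, and the table size m = 5 are illustrative assumptions.

```python
class ChainedHashTable:
    """Separate chaining: one list of keys per hash position."""

    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def insert(self, key):
        chain = self.chains[key % self.m]
        if key not in chain:             # ignore duplicates
            chain.append(key)

    def search(self, key):
        return key in self.chains[key % self.m]

    def delete(self, key):
        chain = self.chains[key % self.m]
        if key in chain:                 # plain removal, no "deleted" marker needed
            chain.remove(key)

t = ChainedHashTable(5)
for k in [12, 7, 22, 3, 15, 8, 42]:      # 7 keys in a table with m = 5 positions
    t.insert(k)
print(t.chains)
```

The table holds more keys than it has positions, and deleting a key only touches its own chain.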
29
Performance of Separate Chaining
  • Depends on the average chain size
  • insertions are independent events
  • let P(c,n,m) = probability that a position has
    been selected c times after n insertions on a
    table of size m
  • P(c,n,m) = probability that the chain has length c
    => binomial distribution

p = 1/m (success case), q = 1 - p (failure case)
30
=> P(c,n,m) ≈ (1/c!) α^c e^(-α), where α = n/m
=> Poisson distribution
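The quality of the Poisson approximation can be checked numerically. The sketch below compares the binomial probability of a chain of length c with the Poisson value for the same load factor n/m; the function names and the sample sizes n = 500, m = 1000 are assumptions chosen for illustration.

```python
from math import comb, exp, factorial

def binomial(c, n, m):
    """Probability that a given position is selected exactly c times in n insertions."""
    p = 1 / m                            # success: the insertion hits this position
    return comb(n, c) * p**c * (1 - p)**(n - c)

def poisson(c, alpha):
    """Poisson approximation with parameter alpha = n/m."""
    return alpha**c * exp(-alpha) / factorial(c)

n, m = 500, 1000
for c in range(4):
    print(c, round(binomial(c, n, m), 4), round(poisson(c, n / m), 4))
```

For large n and m with a fixed ratio n/m, the two columns agree closely.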
31
Unsuccessful Search
  • The entire chain is searched
  • the average number of comparisons equals its
    average length u

32
Successful Search
  • Not the whole chain is searched
  • the average number of comparisons equals the
    length s of the chain at the time the key was
    inserted plus 1
  • the performance at the time a key was inserted
    equals that of unsuccessful search!

33
Performance
  • The performance drops with the length of the
    chains
  • worst case all keys are stored in a single chain
  • worst case performance O(N)
  • unsuccessful search performs better than
    successful search! Why?
  • no problem with deletions!

34
Coalesced Hashing
  • The hash sequence is implemented as a linked list
    within the hash table
  • no rehash function
  • the next hash position is the next available
    position in linked list
  • extra space for the list

h(key) = key mod 10, keys: 19, 29, 49, 59
35
initially avail = 9, h(key) = key mod 10, keys:
14, 29, 34, 28, 42, 39, 84, 38
[Figure: initialization; avail; holds lists of rehashing positions and list of empty positions]
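A minimal sketch of coalesced hashing for the second key set. It simplifies the structure described on the slide: instead of keeping an explicit linked list of empty positions, `avail` is scanned downwards from m - 1 until an empty position is found. That simplification, the class name `CoalescedHashTable`, and the field names are assumptions.

```python
class CoalescedHashTable:
    """Hash sequences stored as linked lists inside the table itself."""

    def __init__(self, m):
        self.keys = [None] * m
        self.next = [None] * m           # link to the next position in the sequence
        self.avail = m - 1               # candidate for the next free position

    def insert(self, key):
        i = key % len(self.keys)
        if self.keys[i] is None:         # home position is free
            self.keys[i] = key
            return i
        while self.next[i] is not None:  # walk to the end of the list
            i = self.next[i]
        while self.avail >= 0 and self.keys[self.avail] is not None:
            self.avail -= 1              # skip positions that are already in use
        if self.avail < 0:
            raise RuntimeError("table is full")
        self.keys[self.avail] = key
        self.next[i] = self.avail        # append the new position to the list
        return self.avail

    def search(self, key):
        i = key % len(self.keys)
        while i is not None and self.keys[i] is not None:
            if self.keys[i] == key:
                return i
            i = self.next[i]
        return None

t = CoalescedHashTable(10)
for k in [14, 29, 34, 28, 42, 39, 84, 38]:
    t.insert(k)
print(t.keys)
print(t.next)
```

Key 28 hashes to position 8, which already holds 34 from the list that starts at position 4, so the two lists merge; this merging is what gives the method its name.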
36
Performance of Coalesced Hashing
  • Unsuccessful search
  • Successful search

[Plots: probes/search]