Chapter 13A Week 3 - PowerPoint PPT Presentation

Slides: 36
Provided by: paulde8

Transcript and Presenter's Notes

1
Chapter 13A, Week 3
  • Records
  • Files
  • Hashing

2
Record
  • Analogous to a struct
  • Collection of variables of different datatypes
  • Example:
      struct student {
          char name[30];
          char ssn[9];
          char major[4];
          int age;
      };
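
As a rough illustration of a fixed-length record, the student layout above can be mimicked with Python's standard struct module. This is only a sketch: the format string, the helper names pack_student and unpack_student, and the sample values are assumptions made for illustration, not part of the slides.

```python
import struct

# Hypothetical layout mirroring the slide's student record:
# 30-byte name, 9-byte SSN, 4-byte major, 4-byte int age.
# "<" selects little-endian standard sizes with no padding,
# so every record has the same length.
STUDENT = struct.Struct("<30s9s4si")


def pack_student(name, ssn, major, age):
    """Encode one record; short strings are padded with NUL bytes."""
    return STUDENT.pack(name.encode(), ssn.encode(), major.encode(), age)


def unpack_student(raw):
    """Decode one record back into Python values."""
    name, ssn, major, age = STUDENT.unpack(raw)
    return (name.rstrip(b"\x00").decode(),
            ssn.rstrip(b"\x00").decode(),
            major.rstrip(b"\x00").decode(),
            age)


record = pack_student("Ada Lovelace", "123456789", "MATH", 36)
print(STUDENT.size)             # 47 bytes per record
print(unpack_student(record))
```

Because every packed record is exactly STUDENT.size bytes, record i of a file of such records would start at byte i × STUDENT.size.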

3
File
  • Sequence of records
  • Three types
  • Fixed length records
  • Variable length records
  • Files containing records of different types

4
File Types
  • Fixed length records are easiest to deal with
  • File size is a multiple of record size
  • We know where each record starts by its offset
    from the beginning
  • Variable length records
  • Variable length fields (e.g., name)
  • Repeating groups (e.g., projects stored with
    employee)
  • Optional fields (department supervised)
  • Requires field separators
  • Different record types in a single file
  • Related records of different types are clustered
    (dependent record follows employee record)
  • Requires record terminators

6
Blocking
  • Block is a unit of data transfer
  • Blocking Factor
  • Number of complete records per block
  • Bf = floor(B/R)
  • where B is the block size and
  • R is the record size
  • If the block size is not a multiple of the
    record size we have wasted space in each block
  • W = B - Bf × R
  • The waste can be reclaimed if we let records span
    blocks
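
The blocking-factor arithmetic can be checked with a few lines of Python. This sketch assumes B is the block size in bytes and R is the record size in bytes; the values 512 and 47 are made up for illustration.

```python
def blocking_factor(block_size, record_size):
    """Number of complete records that fit in one block: floor(B / R)."""
    return block_size // record_size


def wasted_bytes(block_size, record_size):
    """Unused bytes per block when records may not span blocks: B - Bf * R."""
    return block_size - blocking_factor(block_size, record_size) * record_size


B, R = 512, 47                  # illustrative sizes, not from the slides
print(blocking_factor(B, R))    # 10
print(wasted_bytes(B, R))       # 42
```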

8
File Operations
  • File Organization
  • A way of arranging records on a file
  • Heap files
  • Relative files
  • Ordered files
  • Hash files
  • Access Methods
  • Set of operations
  • Open, Close
  • Reset file pointer
  • Scan All
  • Find
  • FindNext
  • Read Current
  • Delete
  • Modify
  • Insert
  • FindAll
  • Find N records that satisfy a condition

9
Heap File
  • Records are stored in the order in which they are
    inserted
  • Scan All: O(N)
  • Find: O(N/2); on average only N/2 records must be
    examined
  • Find in a range: O(N), since all records must be
    examined
  • Insert: O(1); fetch last block, insert, write back
  • Delete: O(N/2); find the record to delete and mark it
    as deleted. Remove and reclaim space during batch
    reorganization.

10
Relative File
  • Records are of fixed length
  • Records don't cross block boundaries
  • Records are contiguous on disk
  • So, records can be accessed by position
  • Example: Bf = 10
  • Find(rec 13)
  • Record is in block ceiling(13/10) = 2, the 2nd block
  • Record is number 13 mod 10 = 3, the 3rd record in
    that block
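
The position calculation can be sketched as a small function. It assumes records and blocks are both numbered from 1, as in the example; the name locate is hypothetical. Working from n - 1 keeps the slot right for the last record of a block (for instance record 20, where 20 mod 10 would give 0).

```python
def locate(record_number, bf):
    """Return (block, slot), both 1-based, for a 1-based record number."""
    block = (record_number - 1) // bf + 1   # same value as ceiling(n / bf)
    slot = (record_number - 1) % bf + 1     # position inside that block
    return block, slot


print(locate(13, 10))   # (2, 3)
print(locate(20, 10))   # (2, 10)
```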

11
Relative File Access Method Complexity
  • Scan All: O(N)
  • Find: O(1)
  • Find in range: O(N)
  • Delete: O(1)
  • Insert: O(1)

12
Ordered Files
  • Usually file itself is not sorted
  • Rather an index on that file is maintained
  • Advantages
  • Reads are more efficient because index entries are
    smaller than full records, so the index occupies
    fewer blocks
  • Can have multiple indexes

14
Ordered File Access Method Complexity
  • Scan All: O(N)
  • Find: O(lg N), uses binary search
  • Find in range: O(lg N) + O(R)
  • First term finds the key
  • R is the number of records that meet the search
    criteria
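
A minimal sketch of find and find-in-range over sorted keys, using Python's bisect module for the binary search. An in-memory list stands in for the ordered file or its index, which is an assumption made to keep the example short; the sample keys are illustrative.

```python
from bisect import bisect_left, bisect_right


def find(keys, key):
    """Binary search: O(lg N). Returns the index of key, or None."""
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None


def find_in_range(keys, low, high):
    """O(lg N) to locate the first match, then O(R) to collect R matches."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return keys[start:end]


keys = sorted([19, 392, 179, 359, 663, 262, 639, 321, 97, 468])
print(find(keys, 359))                  # 5
print(find(keys, 360))                  # None
print(find_in_range(keys, 200, 400))    # [262, 321, 359, 392]
```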

15
Continued
  • Insert
  • In theory
  • Find insertion spot: O(lg N)
  • Move all records down a position: O(N/2)
  • In Practice
  • Maintain a sequential (heap) overflow file
  • Insert records in the order in which they appear
  • Reorganize primary file as a batch run
  • Slows down find (because of the need for a
    sequential search)
  • Speeds up insert

16
Continued
  • Delete
  • Find record and mark it for deletion: O(lg N)
  • Reorganize the file as a batch run

17
Hashed Files
  • Position of a record is based on a computation
  • hash(key) → position
  • Scan All: O(N)
  • Find: O(1)
  • Find in range: O(N), because it is necessary to
    scan the entire file
  • Insert: O(1)
  • Delete: O(1)

18
Summary of Strengths and Weaknesses of File
Organizations
  • Heap (sequential)
  • Insertion is good
  • Find is bad
  • Easy to maintain
  • Relative
  • All access methods are fast
  • But contiguity requirement is hard to maintain
  • Ordered
  • Find is good
  • Insertion and deletion require reorganization
  • Hash
  • Find is good
  • Scan in order is bad

19
Hashing
  • Two types
  • Internal (done in RAM)
  • External (done in secondary storage)
  • Internal should be a review

20
Internal Hashing
  • Problem
  • Given a key value, store records so that each one
    is directly accessible
  • That is, speed up find, insert, delete
  • Requires
  • Hash key: field of record upon which search is
    based
  • Hash function: function which, when applied to the
    key, yields the address in memory (or disk block)
    in which a record is stored

21
Internal Hashing Example
  • Suppose
  • Key is SSN
  • Task: make find O(1)
  • Solution 1: create an array so that each SSN is
    an offset into the array
  • Wastes 10^9 - (number of SSNs) slots
  • Solution 2
  • Use hash function h(k) = k mod m
  • Where
  • k is the key
  • m is the table size

22
Example Continued
  • Suppose we have 18 keys and a table size of 23
  • 019, 392, 179, 359, 663, 262, 639, 321, 097, 468,
    814, 720, 260, 802, 364, 976, 774, 566
  • h(019) = 19
  • h(392) = 01
  • h(179) = 18
  • h(359) = 14
  • h(663) = 19
  • But position 19 is occupied
  • We need a collision resolution policy
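
The collision can be reproduced with a short sketch, assuming the hash function is the key modulo the table size (m = 23). The dictionary seen is just a hypothetical helper for reporting which key already owns a position.

```python
def h(k, m=23):
    """Division hash: key modulo table size."""
    return k % m


seen = {}   # position -> first key that hashed there
for k in [19, 392, 179, 359, 663]:
    slot = h(k)
    if slot in seen:
        print(f"{k} collides with {seen[slot]} at position {slot}")
    else:
        seen[slot] = k
# prints: 663 collides with 19 at position 19
```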

23
Possible Policies
  • Linear Probing
  • Double Hashing
  • Chaining
  • Linear Probing
  • If there is a collision, insert key in the next
    open position, wrapping when you get to the end
    of the table
  • Here's the hash table

24
  Position  Key  Probes
  0         802  4
  1         392  1
  2         364  7
  3
  4
  5         097  1
  6
  7         720  1
  8         468  1
  9         262  1
  10        814  2
  11        260  5
  12        976  3
  13
  14        359  1
  15        774  1
  16        566  3
  17
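
The table can be regenerated by inserting the 18 keys in the order listed, with h(k) = k mod 23 and linear probing. The sketch below assumes the table never fills (18 keys, 23 slots); build_table is a hypothetical name. It also prints positions 18 to 22, which the listing above does not show.

```python
def build_table(keys, m):
    """Insert keys with linear probing; return the table and probe counts."""
    table = [None] * m              # None marks an empty slot
    probes = {}
    for k in keys:
        pos = k % m
        count = 1
        while table[pos] is not None:
            pos = (pos + 1) % m     # wrap at the end of the table
            count += 1
        table[pos] = k
        probes[k] = count
    return table, probes


KEYS = [19, 392, 179, 359, 663, 262, 639, 321, 97, 468,
        814, 720, 260, 802, 364, 976, 774, 566]
table, probes = build_table(KEYS, 23)
for pos, k in enumerate(table):
    print(pos, "" if k is None else f"{k:03d} {probes[k]}")
```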

25
Find
  • Find 360
  • h(360) = 15: occupied
  • 16: occupied
  • 17: empty
  • Return Not Found
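
The search mirrors the insertion: start at h(k) and walk forward until the key or an empty slot turns up. A minimal sketch, assuming the same 18 keys and table size 23; insert and find are hypothetical helper names.

```python
def insert(table, k):
    """Linear-probing insert (assumes the table is not full)."""
    m = len(table)
    pos = k % m
    while table[pos] is not None:
        pos = (pos + 1) % m
    table[pos] = k


def find(table, k):
    """Return the position of k, or None once an empty slot is reached."""
    m = len(table)
    pos = k % m
    for _ in range(m):              # at most m probes
        if table[pos] is None:
            return None             # empty slot: k cannot be further along
        if table[pos] == k:
            return pos
        pos = (pos + 1) % m
    return None


table = [None] * 23
for k in [19, 392, 179, 359, 663, 262, 639, 321, 97, 468,
          814, 720, 260, 802, 364, 976, 774, 566]:
    insert(table, k)

print(find(table, 360))   # None: probes 15, 16, then empty 17
print(find(table, 566))   # 16
```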

26
Probe Sequence
  • Path a key takes as it searches for an empty spot
  • Notice
  • Linear probing gives 1 probe sequence
  • Each key enters it at a (possibly) different
    place
  • We can improve performance by
  • Improving probe sequence
  • Increasing the size of the table

27
General Probe Sequence (Referred to as Open
Addressing)
  • (h(k) + i × p(k)) mod m for i = 0, 1, ..., m-1
  • Notice that under linear probing, p(k) = 1
  • Here's how the insertion of 802 worked
  • h(802) = 20: occupied
  • 21: occupied
  • 22: occupied
  • 0: insert
  • Using the general formula
  • (h(802) + 0 × 1) mod 23 = 20: occupied
  • (h(802) + 1 × 1) mod 23 = 21: occupied
  • (h(802) + 2 × 1) mod 23 = 22: occupied
  • (h(802) + 3 × 1) mod 23 = 0: insert
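
The general formula can be written as a small generator of probe positions. This is a sketch assuming h(k) = k mod m; the parameter p defaults to a step of 1, which gives linear probing, and the name probe_sequence is hypothetical.

```python
def probe_sequence(k, m, p=lambda k: 1):
    """Positions (h(k) + i * p(k)) mod m for i = 0 .. m-1, with h(k) = k mod m."""
    start = k % m
    step = p(k)
    return [(start + i * step) % m for i in range(m)]


print(probe_sequence(802, 23)[:4])   # [20, 21, 22, 0]
print(probe_sequence(802, 13)[:5])   # [9, 10, 11, 12, 0]
```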

28
Probe Sequence
  • With linear probing there is a single probe
    sequence, different keys join it along the way
  • Example
  • Suppose k = 802, m = 13
  • S(802) = (802 mod 13 + 0 × 1) mod 13 = 9
  • S(802) = (802 mod 13 + 1 × 1) mod 13 = 10
  • 11
  • 12
  • 0

29
Continued
  • Now k = 11, m = 13
  • S(11) = (11 mod 13 + 0 × 1) mod 13 = 11
  • 12
  • 0

30
Double Hashing
  • Alter p(k) so that only keys hashing to the same
    address have the same probe sequence
  • Make p(k) a function of h(k)
  • p(k) = (h(k) + 4) mod m

31
Continued
  • k = 137, m = 13
  • S(137) = (137 mod 13 + 0 × p(k)) mod 13 = 7
  • S(137) = (137 mod 13 + 1 × ((7 + 4) mod 13)) mod 13 = 5
  • The next values, in order, are 3, 1, 12, 10, 8, 6,
    4, 2, 0, 11, 9
  • But S(258) produces this sequence: 11, 0, 2, ...
  • Notice that a collision at 11 results in a
    different probe sequence
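
The two sequences can be generated directly. The sketch assumes h(k) = k mod m and a step of p(k) = (h(k) + 4) mod m, which reproduces the values worked out for 137; the function names are hypothetical.

```python
def h(k, m):
    """Primary hash: key modulo table size."""
    return k % m


def p(k, m):
    """Step size derived from the primary hash (assumed form)."""
    return (h(k, m) + 4) % m


def double_hash_sequence(k, m):
    """Positions (h(k) + i * p(k)) mod m for i = 0 .. m-1."""
    return [(h(k, m) + i * p(k, m)) % m for i in range(m)]


print(double_hash_sequence(137, 13))
# [7, 5, 3, 1, 12, 10, 8, 6, 4, 2, 0, 11, 9]
print(double_hash_sequence(258, 13)[:3])
# [11, 0, 2]
```

Both keys visit position 11, but 137 moves on to 9 while 258 moves on to 0, so a collision there does not merge their paths.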

32
Continued
  • Result
  • Only keys hashing to same address have the same
    probe sequence.
  • This breaks up primary clusters, but produces
    secondary clusters
  • These could be eliminated if p(k) were chosen
    independently of h(k)

33
Continued
  • But
  • You must take care to generate a valid probe
    sequence
  • Criteria: for a randomly chosen key, each slot
    has equal probability of being hit and you'll
    cycle through all slots in the table
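
One way to test the "cycle through all slots" criterion: a fixed step visits every slot exactly when the step and the table size share no common factor. The sketch below checks that with math.gcd and confirms it by brute force; the function names are hypothetical. Note that a step of 0 never leaves the starting slot, which is what p(k) = (h(k) + 4) mod m would yield for a key with h(k) = 9 when m = 13.

```python
from math import gcd


def visits_all_slots(step, m):
    """A constant step cycles through all m slots iff gcd(step, m) == 1."""
    return gcd(step, m) == 1


def slots_visited(start, step, m):
    """Brute-force check: the set of positions reached in m probes."""
    return {(start + i * step) % m for i in range(m)}


print(visits_all_slots(11, 13), len(slots_visited(7, 11, 13)))   # True 13
print(visits_all_slots(0, 13), len(slots_visited(9, 0, 13)))     # False 1
print(visits_all_slots(4, 12), len(slots_visited(0, 4, 12)))     # False 3
```

Choosing a prime table size and a step between 1 and m-1 is one common way to satisfy the condition.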

34
Chaining
  • Each location in the hash table is the head of a
    linked list of keys that collide at that
    location.
  • k = 19, m = 5, h(19) = 4
  • Linear probing: length of chain is length of
    primary clusters
  • Double hashing: length of chain is length of
    secondary clusters

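
A minimal chaining sketch, assuming h(k) = k mod m. Python lists stand in for the linked lists, and the keys 24 and 7 are made up to show one collision and one separate chain.

```python
def make_table(m):
    """One (initially empty) chain per table position."""
    return [[] for _ in range(m)]


def insert(table, k):
    """Append k to the chain at position k mod m."""
    table[k % len(table)].append(k)


def find(table, k):
    """Search only the chain that k hashes to."""
    return k in table[k % len(table)]


table = make_table(5)
for k in (19, 24, 7):       # 19 and 24 both hash to position 4
    insert(table, k)

print(table)                # [[], [], [7], [], [19, 24]]
print(find(table, 24), find(table, 29))   # True False
```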
35
Deletion from a Hash Table
  • Chaining
  • Hash to location, traverse list until end or key
    is found. Delete
  • Open addressing
  • Mark location as deleted
  • Recapture deleted locations on subsequent
    inserts
  • If deleted locations build up so that search
    degrades, it will be necessary to rebuild the
    table
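
Deletion under open addressing can be sketched with a special "deleted" marker (often called a tombstone). The version below assumes linear probing with h(k) = k mod m, that the table is never full, and that insert is only called for keys not already present; all names are hypothetical.

```python
EMPTY = None
DELETED = object()      # tombstone: slot was used, then deleted


def find(table, k):
    """Probe past tombstones; stop at the key or a truly empty slot."""
    m = len(table)
    pos = k % m
    for _ in range(m):
        if table[pos] is EMPTY:
            return None
        if table[pos] is not DELETED and table[pos] == k:
            return pos
        pos = (pos + 1) % m
    return None


def insert(table, k):
    """Use the first empty or deleted slot, recapturing tombstones."""
    m = len(table)
    pos = k % m
    while table[pos] is not EMPTY and table[pos] is not DELETED:
        pos = (pos + 1) % m
    table[pos] = k
    return pos


def delete(table, k):
    """Mark the slot as deleted rather than emptying it."""
    pos = find(table, k)
    if pos is not None:
        table[pos] = DELETED
    return pos


table = [EMPTY] * 23
insert(table, 19)           # lands at 19
insert(table, 663)          # collides at 19, lands at 20
delete(table, 19)           # slot 19 becomes a tombstone
print(find(table, 663))     # 20: the search walks past the tombstone
print(insert(table, 42))    # 19: the tombstone slot is reused
```

Emptying slot 19 outright would have cut 663 off from its home position, which is why the slot is only marked.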