Provided by: paulde8

1
Chapter 13A, Week 3

Records
Files
Hashing

2
Record

Analogous to a struct
Collection of variables of different data types
struct student {
    char name[30];
    char ssn[9];
    char major[4];
    int age;
};

3
File

Sequence of records
Three types
Fixed length records
Variable length records
Files that contain records of different types

4
File Types

Fixed length records are easiest to deal with
File size is a multiple of record size
We know where each record starts by its offset from the beginning
Variable length records
Variable length fields (e.g., name)
Repeating groups (e.g., projects stored with employee)
Optional fields (e.g., department supervised)
Requires field separators
Different record types in a single file
Related records of different types are clustered (e.g., a dependent record follows its employee record)
Requires record terminators

6
Blocking

Block is a unit of data transfer
Blocking Factor
Number of complete records per block
Bf = floor(B / R)
where B = block size and R = record size (both in bytes)
If the block size is not a multiple of the record size we have wasted space
W = B - (Bf * R)
The waste can be reclaimed if we let records span blocks
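The two blocking formulas can be sketched in C as follows. The function names and the sample sizes in the usage note (512-byte blocks, 48-byte records) are assumptions chosen for illustration; they do not come from the slides.

```c
#include <stddef.h>

/* Number of complete records that fit in one block: Bf = floor(B / R). */
size_t blocking_factor(size_t block_size, size_t record_size) {
    return block_size / record_size;   /* unsigned integer division truncates, i.e. floor */
}

/* Bytes left unused in each block when records do not span blocks:
   W = B - (Bf * R). */
size_t wasted_bytes(size_t block_size, size_t record_size) {
    return block_size - blocking_factor(block_size, record_size) * record_size;
}
```

With the assumed sizes, blocking_factor(512, 48) gives 10 records per block and wasted_bytes(512, 48) gives 32 unused bytes per block.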

8
File Operations

File Organization
A way of arranging records on a file
Heap files
Relative files
Ordered files
Hash files
Access Methods
Set of operations
Open, Close
Reset file pointer
Scan All
Find
FindNext
Read Current
Delete
Modify
Insert
FindAll: find the N records that satisfy a condition

9
Heap File

Records are stored in the order in which they are inserted
Scan All: O(N)
Find: O(N); on average only N/2 records must be examined
Find in a range: O(N), since all records must be examined
Insert: O(1); fetch last block, insert, write back
Delete: O(N); on average N/2 records are examined to find the record to delete. Mark it as deleted. Remove and reclaim space during batch reorganization.

10
Relative File

Records are of fixed length
Records don't cross block boundaries
Records are contiguous on disk
So, records can be accessed by position
Example: Bf = 10
Find(rec 13)
Record is in block ceiling(13 / 10) = 2, the 2nd block
Record is number 13 mod 10 = 3, the 3rd record in that block
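A minimal C sketch of this position calculation, assuming records and blocks are both numbered from 1 as in the example. The struct and function names are hypothetical. One detail goes beyond the slide: a plain "n mod Bf" yields 0 for the last record of a block (for example record 20 with Bf = 10), so the sketch uses ((n - 1) mod Bf) + 1, which agrees with the slide's result for record 13.

```c
/* Location of a record inside a relative file (both fields are 1-based). */
struct record_location {
    unsigned long block;  /* which block holds the record */
    unsigned long slot;   /* position of the record inside that block */
};

/* Locate record number n (n >= 1) given blocking factor bf (bf >= 1). */
struct record_location locate_record(unsigned long n, unsigned long bf) {
    struct record_location loc;
    loc.block = (n + bf - 1) / bf;   /* ceiling(n / bf) in integer arithmetic */
    loc.slot  = (n - 1) % bf + 1;    /* like n mod bf, but gives bf instead of 0 */
    return loc;
}
```

For record 13 with Bf = 10 this returns block 2, slot 3; for record 20 it returns block 2, slot 10.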

11
Relative File Access Method Complexity

Scan All: O(N)
Find: O(1)
Find in range: O(N)
Delete: O(1)
Insert: O(1)

12
Ordered Files

Usually file itself is not sorted
Rather an index on that file is maintained
Advantages
Reads are more efficient because index entries are smaller than data records, so the index occupies fewer blocks
Can have multiple indexes

14
Ordered File Access Method Complexity

Scan All: O(N)
Find: O(lg N), uses binary search
Find in range: O(lg N) + O(R)
First term finds the key
R is the number of records that meet the search criteria
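The O(lg N) + O(R) cost of a range search can be illustrated with a sorted in-memory array standing in for the ordered file or its index. This is a sketch under that assumption; the function names are hypothetical.

```c
#include <stddef.h>

/* Binary search: index of the first key >= target (n if there is none).
   This is the O(lg N) term. */
size_t lower_bound(const int *keys, size_t n, int target) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Count the keys in [low, high]: one binary search to find the start of
   the range, then a sequential scan over the R matching keys. */
size_t count_in_range(const int *keys, size_t n, int low, int high) {
    size_t i = lower_bound(keys, n, low);
    size_t count = 0;
    while (i < n && keys[i] <= high) {
        count++;
        i++;
    }
    return count;
}
```

The scan stops at the first key above the upper bound, so only the matching records (plus one) are touched after the initial search.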

15
Continued

Insert
In theory
Find insertion spot: O(lg N)
Move all records down a position: O(N); on average N/2 records move
In practice
Maintain a sequential (heap) overflow file
Insert records in the order in which they arrive
Reorganize primary file as a batch run
Slows down find (because of the need for a sequential search of the overflow file)
Speeds up insert

16
Continued

Delete
Find record and mark it for deletion: O(lg N)
Reorganize the file as a batch run

17
Hashed Files

Position of a record is based on a computation
hash(key) -> position
Scan All: O(N)
Find: O(1)
Find in range: O(N), because it is necessary to scan the entire file
Insert: O(1)
Delete: O(1)

18
Summary of Strengths and Weaknesses of File Organizations

Heap (sequential)
Insertion is good
Find is bad
Easy to maintain
Relative
All access methods are fast
But contiguity requirement is hard to maintain
Ordered
Find is good
Insertion and deletion require reorganization
Hash
Find is good
Scan in order is bad

19
Hashing

Two types
Internal (done in RAM)
External (done in secondary storage)
Internal should be a review

20
Internal Hashing

Problem
Given a key value, store records so that each one is directly accessible
That is, speed up find, insert, delete
Requires
Hash key: field of the record upon which the search is based
Hash function: function which, when applied to the key, yields the address in memory (or of the disk block) in which a record is stored

21
Internal Hashing Example

Suppose
Key is SSN
Task: make find O(1)
Solution 1: create an array so that each SSN is an offset into the array
Wastes 10^9 - (number of SSNs) array slots
Solution 2
Use hash function h(k) = k mod m
Where
k is the key
m is the table size

22
Example Continued

Suppose we have 18 keys and a table size of 23
019, 392, 179, 359, 663, 262, 639, 321, 097, 468,
814, 720, 260, 802, 364, 976, 774, 566
h(019) = 19
h(392) = 1
h(179) = 18
h(359) = 14
h(663) = 19
But position 19 is occupied
We need a collision resolution policy

23
Possible Policies

Linear Probing
Double Hashing
Chaining
Linear Probing
If there is a collision, insert key in the next open position, wrapping when you get to the end of the table
Here's the hash table

Position  Key  Probes
0         802  4
1         392  1
2         364  7
3
4
5         097  1
6
7         720  1
8         468  1
9         262  1
10        814  2
11        260  5
12        976  3
13
14        359  1
15        774  1
16        566  3
17
18        179  1
19        019  1
20        663  2
21        639  4
22        321  1
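The table can be reproduced with a short C sketch of linear-probing insertion using h(k) = k mod 23. The names table_init, insert_linear, TABLE_SIZE, and EMPTY are hypothetical, and keys are assumed to be non-negative and distinct. Note that keys such as 019 and 097 must be written as 19 and 97 in C source, because a leading 0 marks an octal constant.

```c
#define TABLE_SIZE 23
#define EMPTY (-1)   /* marks a slot that holds no key */

/* Set every slot to EMPTY before first use. */
void table_init(int table[TABLE_SIZE]) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key with linear probing. Returns the slot used and stores the
   number of probes in *probes; returns -1 if the table is full. */
int insert_linear(int table[TABLE_SIZE], int key, int *probes) {
    int home = key % TABLE_SIZE;              /* h(k) = k mod m */
    for (int i = 0; i < TABLE_SIZE; i++) {
        int pos = (home + i) % TABLE_SIZE;    /* wrap at the end of the table */
        if (table[pos] == EMPTY) {
            table[pos] = key;
            *probes = i + 1;
            return pos;
        }
    }
    *probes = TABLE_SIZE;
    return -1;
}
```

Inserting the 18 example keys in the order listed places 802 in position 0 after 4 probes and 364 in position 2 after 7 probes, matching the table.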

25
Find

Find 360
h(360) = 15: occupied
16: occupied
17: empty
Return Not Found
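A C sketch of the search, under the same assumptions as the insertion example (h(k) = k mod 23, linear probing, no deletions). The names find_linear and example_table are hypothetical; example_table holds the result of inserting the 18 example keys in order, with every slot computed from those keys.

```c
#define TABLE_SIZE 23
#define EMPTY (-1)   /* marks a slot that holds no key */

/* Table contents after inserting the 18 example keys with linear probing. */
static const int example_table[TABLE_SIZE] = {
    802, 392, 364, EMPTY, EMPTY, 97, EMPTY, 720, 468, 262, 814, 260,
    976, EMPTY, 359, 774, 566, EMPTY, 179, 19, 663, 639, 321
};

/* Search for key with linear probing. Returns its slot, or -1 if an
   empty slot is reached (or the whole table has been examined) first. */
int find_linear(const int table[TABLE_SIZE], int key) {
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; i++) {
        int pos = (home + i) % TABLE_SIZE;
        if (table[pos] == EMPTY)
            return -1;          /* the key would have been stored here */
        if (table[pos] == key)
            return pos;
    }
    return -1;
}
```

find_linear(example_table, 360) examines positions 15, 16, and 17 and returns -1, while find_linear(example_table, 364) walks from position 19 around the end of the table and returns 2.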

26
Probe Sequence

Path a key takes as it searches for an empty spot
Notice
Linear probing gives 1 probe sequence
Each key enters it at a (possibly) different place
We can improve performance by
Improving probe sequence
Increasing the size of the table

27
General Probe Sequence (Referred to as Open Addressing)

(h(k) + i * p(k)) mod m, for i = 0, 1, ..., m-1
Notice that under linear probing, p(k) = 1
Here's how the insertion of 802 worked
h(802) = 20: occupied
21: occupied
22: occupied
0: insert
Using the general formula
(h(802) + 0 * 1) mod 23 = 20: occupied
(h(802) + 1 * 1) mod 23 = 21: occupied
(h(802) + 2 * 1) mod 23 = 22: occupied
(h(802) + 3 * 1) mod 23 = 0: insert
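The general formula translates directly into a one-line C function. This is a sketch assuming h(k) = k mod m and non-negative keys; the name probe and the parameter step (standing for the value of p(k)) are hypothetical.

```c
/* i-th position in the open-addressing probe sequence:
   (h(k) + i * p(k)) mod m, with h(k) = k mod m and step = p(k). */
int probe(int key, int step, int i, int m) {
    return (key % m + i * step) % m;
}
```

With step = 1 (linear probing) and m = 23, probe(802, 1, i, 23) for i = 0, 1, 2, 3 yields 20, 21, 22, 0, the positions tried when 802 was inserted.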

28
Probe Sequence

With linear probing there is a single probe sequence; different keys join it along the way
Example
Suppose k = 802, m = 13
S(802) = (802 mod 13 + 0 * 1) mod 13 = 9
S(802) = (802 mod 13 + 1 * 1) mod 13 = 10
11
12
0

29
Continued

Now k = 11, m = 13
S(11) = (11 mod 13 + 0 * 1) mod 13 = 11
12
0

30
Double Hashing

Alter p(k) so that only keys hashing to the same address have the same probe sequence
Make p(k) a function of h(k)
p(k) = (h(k) + 4) mod m

31
Continued

k = 137, m = 13
S(137) = (137 mod 13 + 0 * p(k)) mod 13 = 7
S(137) = (137 mod 13 + 1 * ((7 + 4) mod 13)) mod 13 = 5
The next values, in order, are 3, 1, 12, 10, 8, 6, 4, 2, 0, 11, 9
But S(258) produces this sequence: 11, 0, 2, ...
Notice that a collision at 11 results in a different probe sequence
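The two probe sequences can be generated with a small C sketch that uses the step function from the slides, p(k) = (h(k) + 4) mod m, with h(k) = k mod m. The function names are hypothetical and keys are assumed non-negative. One caveat: this step function yields 0 when h(k) = m - 4 (for m = 13, any key with h(k) = 9), in which case the sequence never leaves the home slot.

```c
/* Step size from the example: p(k) = (h(k) + 4) mod m. */
int step_size(int key, int m) {
    return (key % m + 4) % m;
}

/* Write the m positions of the probe sequence for key into seq,
   using (h(k) + i * p(k)) mod m for i = 0 .. m-1. */
void probe_sequence(int key, int m, int *seq) {
    int home = key % m;
    int step = step_size(key, m);
    for (int i = 0; i < m; i++)
        seq[i] = (home + i * step) % m;
}
```

For key 137 and m = 13 the sequence starts 7, 5, 3, 1, 12, 10; for key 258 it starts 11, 0, 2.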

32
Continued

Result
Only keys hashing to the same address have the same probe sequence.
This breaks up primary clusters, but produces secondary clusters
These could be eliminated if p(k) were chosen independently of h(k)

33
Continued

But
You must take care to generate a valid probe sequence
Criteria: for a randomly chosen key, each slot has equal probability of being hit, and you'll cycle through all slots in the table
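The second criterion can be checked mechanically. For a sequence of the form (h(k) + i * p) mod m, the m probes reach every slot exactly when the step p and the table size m share no common factor, that is, gcd(p, m) = 1. The sketch below tests only this cycling condition, not the equal-probability one, and its function names are hypothetical.

```c
/* Greatest common divisor by Euclid's algorithm (non-negative inputs). */
int gcd(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 if stepping by `step` modulo m visits all m slots, else 0. */
int visits_all_slots(int step, int m) {
    return gcd(step, m) == 1;
}
```

With a prime table size such as 13, every step from 1 to 12 passes the check; a step of 0 fails it, and so does a step of 2 in a table of size 4.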

34
Chaining

Each location in the hash table is the head of a linked list of keys that collide at that location.
Example: k = 19, m = 5, h(19) = 4, so 19 is stored in the list at location 4

Linear probing: length of chain is length of primary clusters
Double hashing: length of chain is length of secondary clusters
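A minimal C sketch of chaining for the m = 5 example, assuming non-negative integer keys and h(k) = k mod m. The names NUM_BUCKETS, struct node, chain_insert, and chain_contains are hypothetical. New keys are pushed onto the front of the list, which is one of several reasonable choices.

```c
#include <stdlib.h>

#define NUM_BUCKETS 5   /* table size m */

struct node {
    int key;
    struct node *next;
};

/* Insert key at the head of the list for slot key mod NUM_BUCKETS.
   Returns 0 on success, -1 if memory allocation fails. */
int chain_insert(struct node *table[NUM_BUCKETS], int key) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    int slot = key % NUM_BUCKETS;
    n->key = key;
    n->next = table[slot];
    table[slot] = n;
    return 0;
}

/* Return 1 if key is stored in the table, 0 otherwise. */
int chain_contains(struct node *table[NUM_BUCKETS], int key) {
    for (const struct node *n = table[key % NUM_BUCKETS]; n != NULL; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}
```

Starting from a table whose five entries are all NULL, chain_insert(table, 19) creates a one-node list at location 4. A later insert of 24 also lands at location 4 and becomes the new head of that list.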
35
Deletion from a Hash Table

Chaining
Hash to location, traverse list until end or key is found. Delete the key if it is found
Open addressing
Mark location as deleted
Recapture deleted locations on subsequent inserts
If deleted locations build up so that search degrades, it will be necessary to rebuild the table
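The open-addressing scheme can be sketched in C with a special marker for deleted slots. The sketch assumes linear probing, h(k) = k mod 23, and non-negative, distinct keys; the names EMPTY, DELETED, find_slot, delete_key, and insert_key are hypothetical. A search must step over DELETED slots, because a key stored further along the probe sequence may have passed through that slot when it was inserted.

```c
#define TABLE_SIZE 23
#define EMPTY   (-1)   /* slot has never held a key */
#define DELETED (-2)   /* slot held a key that was removed */

/* Find key with linear probing; DELETED slots do not stop the search. */
int find_slot(const int table[TABLE_SIZE], int key) {
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; i++) {
        int pos = (home + i) % TABLE_SIZE;
        if (table[pos] == EMPTY)
            return -1;
        if (table[pos] == key)
            return pos;
    }
    return -1;
}

/* Mark the key's slot as deleted. Returns 1 if the key was present. */
int delete_key(int table[TABLE_SIZE], int key) {
    int pos = find_slot(table, key);
    if (pos < 0)
        return 0;
    table[pos] = DELETED;
    return 1;
}

/* Insert into the first EMPTY or DELETED slot on the probe sequence,
   recapturing deleted locations. Returns the slot, or -1 if full. */
int insert_key(int table[TABLE_SIZE], int key) {
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; i++) {
        int pos = (home + i) % TABLE_SIZE;
        if (table[pos] == EMPTY || table[pos] == DELETED) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

For example, after inserting 19 and 663 into an empty table (slots 19 and 20) and deleting 19, a search for 663 still succeeds because slot 19 is marked DELETED rather than EMPTY. A later insert of 42, which also hashes to 19, reuses that slot.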