Disk Storage, Basic File Structures, and Hashing

Provided by: eduj45

1
Disk Storage, Basic File Structures, and Hashing
2
Introduction

In a computerized database, the data is stored on
a computer storage medium, which includes
Primary Storage
can be processed directly by the CPU
e.g., the main memory, cache
fast, expensive, but of limited capacity
Secondary Storage
cannot be processed directly by the CPU
magnetic disks, optical disks, tapes
slow, cost less, but have a large capacity.

3
Storage Hierarchy

[Figure: storage hierarchy, with axes for unit price and speed]
Primary storage (volatile): cache, main memory
Secondary storage (non-volatile): flash memory, magnetic disk
Tertiary storage (non-volatile): optical disk, magnetic tape
4
Storage of Databases

For the following reasons, most databases are
stored permanently on secondary storage:
They are too large to fit entirely in main memory.
They must persist over long periods of time, but
main memory is volatile storage.
Secondary storage costs less.

5
Secondary Storage

Data on a magnetic disk cannot be processed directly by
the CPU; it must first be brought into main memory.
Data is stored on a spinning disk and read/written
magnetically.
Primary medium for the long-term storage of data;
typically stores the entire database.
Non-volatile
slow access to data
large storage capacity (on the order of gigabytes)

6
Disk Storage Devices

Preferred secondary storage device for high
storage capacity and low cost.
Data stored as magnetized areas on magnetic disk
surfaces.
A disk pack contains several magnetic disks
connected to a rotating spindle.
Disks are divided into concentric circular tracks
on each disk surface.
Track capacities vary typically from 4 to 50
Kbytes or more

7
Disk Storage Devices (contd.)

A track is divided into smaller blocks or sectors
because it usually contains a large amount of
information
The division of a track into sectors is
hard-coded on the disk surface and cannot be
changed.
One type of sector organization treats the portion
of a track that subtends (faces) a fixed angle at
the center as a sector.
A track is divided into blocks.
The block size B is fixed for each system.
Typical block sizes range from B = 512 bytes to
B = 4096 bytes.
Whole blocks are transferred between disk and
main memory for processing.

8
Disk Storage Devices (contd.)

A read-write head moves to the track that
contains the block to be transferred.
Disk rotation moves the block under the
read-write head for reading or writing.
A physical disk block (hardware) address consists
of
a cylinder number (imaginary collection of tracks
of same radius from all recorded surfaces)
the track number or surface number (within the
cylinder)
and block number (within track).
Reading or writing a disk block is time consuming
because of the seek time s and rotational delay
(latency) rd.
Double buffering can be used to speed up the
transfer of contiguous disk blocks.

9
Physical Characteristics of Disks
10
Components of a Disk

The platters spin (say, 90rps).
The arm assembly is moved in or out to position a
head on a desired track.
Read-write head
Positioned very close to the platter surface
(almost touching it)
Reads or writes magnetically encoded information.
Only one head reads/writes at any one time.
Surface of platter divided into circular tracks

11
Physical Characteristics of Disks

Track
an information storage circle on the surface of a
disk.
Over 16,000 tracks per platter
each track can store between 4KB and 50KB of
data.
Each track is divided into sectors.
Tracks under heads make a cylinder (imaginary!)
Cylinder
the tracks with the same diameter on all surfaces
of a disk pack.
Cylinder i consists of i-th track of all the
platters

12
Physical Characteristics of Disks

Sector
a part of a track with fixed size
separated by fixed-size interblock gaps
Typical sectors per track
200 (on inner tracks) to 400 (on outer tracks)

13
Sectors
14
(No Transcript)
15
Disk I/O Model of Computation

One disk I/O is one read or write
operation on a single block.
It is very expensive compared with what is likely
to be done once the block is in main memory:
one random disk I/O takes about as long as
1,000,000 machine instructions.
The cost of a computation that requires secondary
storage is therefore measured only in disk I/Os.

16
Pages and Blocks

Data files decomposed into pages (blocks)
fixed size piece of contiguous information in the
file
sizes range from 512 bytes to several kilobytes
block is the smallest unit for transferring data
between the main memory and the disk.
Address of a page (block)
(cylinder, track (within cylinder), sector
(within track))

17
Pages and Blocks
18
Page I/O

Page I/O --- one page I/O is the cost (or time
needed) to transfer one page of data between the
memory and the disk.
The cost of a (random) page I/O
= seek time + rotational delay + block transfer
time
Seek time
time needed to position read/write head on
correct track.
Rotational delay (latency)
time needed to rotate the beginning of page under
read/write head.
Block transfer time
time needed to transfer data in the page/block.

19
Page I/O

Average rotational delay (rd)
rd = ½ × (1/p) min = (60 × 1000)/(2p) msec
OR
rd = ½ × cost of 1 revolution
= ½ × (60 × 1000/p) msec
where
p is the speed of disk rotation (revolutions
per minute, rpm)
Example
Speed of disk rotation is p = 3600 rpm
= 60 revolutions/sec
1 rev. = 16.66 msec (1 second = 1000 msec)
rd = 8.33 ms
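
The rotational-delay formula can be checked with a few lines of Python. This is a minimal sketch; the function name is an illustrative assumption, not part of the slides.

```python
def avg_rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay: half of one revolution, in milliseconds."""
    ms_per_revolution = 60 * 1000 / rpm   # 60 s/min * 1000 ms/s, divided by rev/min
    return ms_per_revolution / 2


# The slide's example: p = 3600 rpm
print(round(avg_rotational_delay_ms(3600), 2))  # 8.33
```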

20
Page I/O

Transfer rate (tr)
tr = track size / cost of one revolution
= track size / (60 × 1000/p) bytes/msec
Bulk transfer rate (btr)
btr = (B/(B + G)) × tr bytes/msec
where B is the block size in bytes and
G is the interblock gap size in bytes
Block transfer time (btt)
btt = B / tr, not taking G into account
btt = B / btr, taking G into account

21
Page I/O

Example
Track size = 50 KB and p = 3600 rpm
Block size B = 3 KB = 3000 bytes
tr = (50 × 1000)/(60 × 1000/3600) = 3000 bytes/msec
btt = B / tr = 3000/3000 = 1 msec
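
The transfer-rate example can be reproduced as follows. This is a sketch only: the function names are illustrative, and the gap size G = 128 bytes is an assumed value, since the slides do not give one.

```python
def transfer_rate(track_size_bytes: float, rpm: float) -> float:
    """tr in bytes/msec: one full track passes under the head per revolution."""
    ms_per_revolution = 60 * 1000 / rpm
    return track_size_bytes / ms_per_revolution


def bulk_transfer_rate(block_size: float, gap_size: float, tr: float) -> float:
    """btr in bytes/msec: only B out of every B + G bytes are useful data."""
    return block_size / (block_size + gap_size) * tr


def block_transfer_time(block_size: float, rate: float) -> float:
    """btt in msec; pass tr to ignore gaps, or btr to account for them."""
    return block_size / rate


tr = transfer_rate(50 * 1000, 3600)        # about 3000 bytes/msec
btt = block_transfer_time(3000, tr)        # about 1 msec
btr = bulk_transfer_rate(3000, 128, tr)    # G = 128 bytes is an assumption
btt_with_gap = block_transfer_time(3000, btr)
```

With a non-zero gap, btr is lower than tr, so the block transfer time that accounts for gaps is slightly longer than 1 msec.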

22
Page I/O

Average time for reading/writing n consecutive
pages that are in the same track or cylinder
= s + rd + n × btt
Average time for reading/writing n
noncontiguous pages/blocks that are in the same
cylinder = s + n × (rd + btt)
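
The difference between the two access patterns can be made concrete with a small sketch. The function names are illustrative, and the sample values (s = 25 ms, rd = 8.33 ms, btt = 1 ms, n = 10) are borrowed from other slides purely as an assumed scenario.

```python
def time_consecutive(s: float, rd: float, btt: float, n: int) -> float:
    """One seek and one rotational delay, then n back-to-back block transfers."""
    return s + rd + n * btt


def time_noncontiguous(s: float, rd: float, btt: float, n: int) -> float:
    """One seek, but a rotational delay before each of the n blocks."""
    return s + n * (rd + btt)


consecutive = time_consecutive(25, 8.33, 1, 10)     # about 43.33 ms
scattered = time_noncontiguous(25, 8.33, 1, 10)     # about 118.3 ms
```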

23
An Example

Hard disk specifications
4 platters, 8 surfaces, 3.5 inch diameter
2^13 = 8192 tracks/surface
2^8 = 256 sectors/track
2^9 = 512 bytes/sector
Average seek time s = 25 ms
Rotation rate = 3600 rpm = 60 rps
1 rev. = 16.66 msec
Transfer rate
tr = 1 KB in 0.117 ms
tr = 1 KB in 0.130 ms with gap

24
An Example

What is the total capacity of this disk?
8 GB (8 × 2^13 × 2^8 × 2^9 = 2^33 bytes)
How many bytes does one track hold?
256 sectors/track × 512 bytes/sector = 128 KB
How many blocks per track?
one block = 4096 bytes = 8 sectors (4096/512)
256/8 = 32 blocks/track
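
The capacity arithmetic for this disk can be verified directly. The variable names below are illustrative; the numbers are the ones given in the specification.

```python
surfaces = 8
tracks_per_surface = 2 ** 13   # 8192
sectors_per_track = 2 ** 8     # 256
bytes_per_sector = 2 ** 9      # 512
block_size = 4096

capacity_bytes = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
track_bytes = sectors_per_track * bytes_per_sector
sectors_per_block = block_size // bytes_per_sector
blocks_per_track = sectors_per_track // sectors_per_block

print(capacity_bytes == 2 ** 33)   # True: 8 GB
print(track_bytes // 1024)         # 128 (KB per track)
print(blocks_per_track)            # 32
```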

25
An Example

How long does it take to access one block?
One block = 4096 bytes
= 8 sectors (4096/512)
Time for one revolution r
1 rev. = 16.66 msec
Time to access 1 sector = s + r/2 +
tr/(sectors per KB)
= 25 + (16.66/2) + 0.117/2 = 33.3885 ms
Time to access 1 block
= time to access the first sector of the block
+ time to access the subsequent 7 sectors

26
An Example

T = 25 + (16.66/2) + (0.117/2) × 1 + (0.13/2) × 7
= 33.3885 + 0.455 ms = 33.8435 ms
Compare to the one-sector access time of 33.3885 ms
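
The block access time can be recomputed with a short sketch. The function and parameter names are illustrative assumptions. The per-sector times follow the slides: half of the 1 KB time without gap for the first sector, and half of the 1 KB time with gap for each subsequent sector.

```python
def block_access_time_ms(seek_ms: float, revolution_ms: float,
                         first_sector_ms: float, next_sector_ms: float,
                         sectors_per_block: int) -> float:
    """Seek + average rotational delay + transfer of all sectors in the block."""
    rotational_delay = revolution_ms / 2
    return (seek_ms + rotational_delay + first_sector_ms
            + next_sector_ms * (sectors_per_block - 1))


one_sector = block_access_time_ms(25, 16.66, 0.117 / 2, 0.13 / 2, 1)
one_block = block_access_time_ms(25, 16.66, 0.117 / 2, 0.13 / 2, 8)
```

Reading eight sectors instead of one adds less than half a millisecond, because seek time and rotational delay dominate the cost.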

27
Buffering

A buffer
is a contiguous reserved area in main memory
available for storage of copies of disk blocks.
to speed up the processes.
For a read command
the block from disk is copied into the buffer.
For a write command
the contents of the buffer are copied into the
disk.

28
Accessing Data Through RAM Buffer
29
Buffer Manager

Programs call on the buffer manager when they
need a block from disk.
If the block is already in the buffer,
the requesting program is given the address of
the block in main memory
If the block is not in the buffer,
the buffer manager allocates space in the buffer
for the block, replacing (throwing out) some
other block, if required, to make space for the
new block.
The block that is thrown out is written back to
disk only if it was modified since the most
recent time that it was written to/fetched from
the disk.

30
Buffer Manager

Once space is allocated in the buffer, the buffer
manager reads the block from the disk to the
buffer, and passes the address of the block in
main memory to requester.
Buffer Replacement Policy
Frame is chosen for replacement by a replacement
policy
Least-recently-used (LRU), MRU, FIFO, etc.
Policy can have a big impact on the number of I/Os,
depending on the access pattern.
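
The buffer manager behaviour described above can be sketched with an LRU replacement policy. This is a minimal illustration, not a real DBMS component: the class name, `mark_dirty`, and the `read_block` / `write_block` callables are all hypothetical.

```python
from collections import OrderedDict


class BufferManager:
    """Toy buffer pool with least-recently-used (LRU) replacement."""

    def __init__(self, capacity, read_block, write_block):
        self.capacity = capacity
        self.read_block = read_block      # hypothetical: block_id -> bytearray
        self.write_block = write_block    # hypothetical: (block_id, data) -> None
        self.frames = OrderedDict()       # block_id -> [data, dirty flag]

    def get(self, block_id):
        if block_id in self.frames:
            # Already buffered: mark as most recently used, no disk I/O.
            self.frames.move_to_end(block_id)
            return self.frames[block_id][0]
        if len(self.frames) >= self.capacity:
            # Throw out the least recently used block.
            victim_id, (data, dirty) = self.frames.popitem(last=False)
            if dirty:
                # Write back only if it was modified while buffered.
                self.write_block(victim_id, data)
        data = self.read_block(block_id)
        self.frames[block_id] = [data, False]
        return data

    def mark_dirty(self, block_id):
        """Record that the buffered copy of block_id has been modified."""
        self.frames[block_id][1] = True
```

Swapping `popitem(last=False)` for `popitem(last=True)` would turn this sketch into MRU replacement, which shows how small the mechanical difference between policies is compared with their effect on I/O counts.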

31
File Organization

The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
Records are stored on disk blocks.
A file can have fixed-length records or
variable-length records.

32
File Organization

Fixed length records
Each record is of fixed length. Pad with spaces.
Variable length records
different records in the file have different
sizes.
Arise in database systems in several ways
different record types in a file.
same record type with (variable-length fields,
repeating field, or optional fields)

33
File Organization
34
Fixed-Length Records

Insertion
Store record i starting from byte n × (i − 1),
where n is the size of each record.
Deletion of record i
Packed format
move records i + 1, . . ., n to i, . . . , n − 1
OR
move record n to i
Unpacked format (do not move records, but)
link all free records on a free list
OR
Use bitmap vector

35
Free Lists

Store the address of the first deleted record in
the file header.
Use this first record to store the address of the
second deleted record, and so on.
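
A free list for fixed-length records can be sketched as below. The class and its in-memory slot list are assumptions made for illustration, and the sketch pushes each newly deleted slot onto the front of the list, which is one common variant.

```python
class FixedLengthFile:
    """Toy file of fixed-length record slots with a free list."""

    END = -1  # marks the end of the free list

    def __init__(self):
        self.slots = []              # each slot: ("rec", data) or ("free", next_free)
        self.first_free = self.END   # would live in the file header

    def insert(self, record):
        if self.first_free != self.END:
            # Reuse the first free slot and advance the header pointer.
            slot = self.first_free
            self.first_free = self.slots[slot][1]
            self.slots[slot] = ("rec", record)
            return slot
        # No free slot: append at the end of the file.
        self.slots.append(("rec", record))
        return len(self.slots) - 1

    def delete(self, slot):
        # The freed slot stores the address of the next free slot.
        self.slots[slot] = ("free", self.first_free)
        self.first_free = slot
```

No records are moved on deletion, so record positions stay stable, at the cost of following the chain when space is reused.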

36
Page Formats Fixed Length Records

Record id = <page id, slot>.

37
Variable-Length Records: Representation

Byte-String representation
Attach an end-of-record (⊥) control character to
the end of each record
Difficulty with deletion and growth
Slotted-page header contains
number of record entries
location and size of each record
end of free space in the block

38
Slotted Page Structure

Records can be moved around within a page to keep
them contiguous with no empty space between them
entry in the header must be updated.
Pointers should not point directly to record -
instead they should point to the entry for the
record in header.
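
A slotted page can be sketched as follows. This is a simplified illustration under stated assumptions: the header (slot entries and end-of-free-space pointer) is kept in Python attributes rather than inside the page bytes, header space is not charged against the page, and the class name is hypothetical.

```python
class SlottedPage:
    """Toy slotted page: records grow from the end of the page toward the front."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []        # header entries: (offset, length), or None if deleted
        self.free_end = size   # end of free space in the block

    def insert(self, record: bytes) -> int:
        if len(record) > self.free_end:
            raise ValueError("not enough free space in page")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1   # callers keep the slot number, not the offset

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        offset, length = self.slots[slot]
        # Slide every record stored below the deleted one up by `length` bytes
        # so the records stay contiguous.
        self.data[self.free_end + length:offset + length] = self.data[self.free_end:offset]
        self.slots[slot] = None
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0] < offset:
                self.slots[i] = (entry[0] + length, entry[1])
        self.free_end += length
```

Because outside pointers refer to slot numbers, moving records during `delete` only requires updating the header entries.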

39
Fixed-Length Representation

Reserved Space
can use fixed-length records of a known maximum
length
unused space in shorter records filled with a
null or end-of-record symbol.

40
Fixed-Length Representation

List Representation by Pointers
A variable-length record is represented by a list
of fixed-length records, chained together via
pointers.
Can be used even if the maximum record length is
not known

41
Fixed-Length Representation

Disadvantage: space is wasted in all records
except the first in a chain.
Solution is to allow two kinds of block in a file
Anchor block contains the first records of chains
Overflow block contains records other than those
that are the first records of chains.

42
Blocking Factor

Blocking Factor (bfr) - the number of records
that can fit into a single block.
bfr = ⌊B/R⌋
B = block size in bytes
R = record size in bytes
Example
Record size R = 100 bytes
Block size B = 2,000 bytes
Thus the blocking factor bfr = ⌊2000/100⌋ = 20
The number of blocks b needed to store a file of
r records
b = ⌈r/bfr⌉ blocks
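
The floor and ceiling in these formulas map directly onto integer arithmetic. In this sketch the function names are illustrative, and the file size of 30,001 records is an assumed value chosen to show the rounding.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr = floor(B / R): whole records per block (unspanned)."""
    return block_size // record_size


def blocks_needed(num_records: int, bfr: int) -> int:
    """b = ceil(r / bfr), computed with integer arithmetic."""
    return (num_records + bfr - 1) // bfr


bfr = blocking_factor(2000, 100)
print(bfr)                         # 20
print(blocks_needed(30_001, bfr))  # 1501: the last block holds a single record
```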

43
Spanned and Unspanned Records

A block is the unit of data transfer between disk
and memory.
Unspanned records
A record is found in one and only one block.
records do not span across block boundaries.
Used with fixed-length records having B ≥ R
Spanned records
Records are allowed to span across block
boundaries.
Used with variable-length records having R ≥ B
In variable-length records, either organization
can be used.

44
Placing File Records on Disk

A file header or file descriptor contains
information about a file (e.g., the disk address,
record format descriptions, etc.)

45
Allocating File Blocks on Disk

The physical disk blocks that are allocated to
hold the records of a file can be contiguous,
linked, or indexed.
In contiguous allocation, the file blocks are
allocated to consecutive disk blocks.
In linked allocation, each file block contains a
pointer to the next file block.
In indexed allocation, one or more index blocks
contain pointers to the actual file blocks.

46
Organization of Records in Files

Heap File Organization
a record can be placed anywhere in the file where
there is space, or at the end
for full file scans or frequent updates
Data unordered (unsorted)
Sorted/Ordered File Organization
store records sorted in order, based on the value
of the search key of each record
Need external sort or an index to keep sorted
Hashing File Organization
a hash function computed on some attribute of
each record
the result specifies in which block of the file
the record should be placed

47
Heap File Organization

Records are placed in the file in the order in
which they are inserted. Such an organization is
called a heap file.
Insertion is at the end
takes constant time O(1) (very efficient)
Searching
requires a linear search (expensive)
Deleting
requires a search, then delete
Select, Update and Delete
take b/2 block accesses (linear time) on average
b is the number of blocks

48
Heap File Organization

For a file of unordered fixed-length records
using unspanned blocks and contiguous allocation,
it is straightforward to access any record by its
position in the file.
If the records are numbered 0, 1, 2, ..., r−1 and
the records in each block are numbered 0, 1, 2, ...,
f−1, where f is the blocking factor,
then the i-th record of the file is located in
block ⌊i/f⌋ and is the
(i mod f)-th record in that block
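
Locating a record by position is a single division with remainder. The function name below is an illustrative assumption.

```python
def locate_record(i: int, f: int) -> tuple[int, int]:
    """Return (block number, position within block) for record i, both 0-based."""
    return divmod(i, f)


print(locate_record(47, 20))  # (2, 7): record 47 is the 8th record of the 3rd block
```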

49
Heap File Organization

A Heap file allows us to retrieve records
by specifying the rid, or
by scanning all records sequentially
Accessing a record by its position does not help
locate a record based on a search condition.

50
File Stored as a Heap File
666666  MGT123  F1994  4.0
123456  CS305   S1996  4.0    page 0
987654  CS305   F1995  2.0
717171  CS315   S1997  4.0
666666  EE101   S1998  3.0    page 1
765432  MAT123  S1996  2.0
515151  EE101   F1995  3.0
234567  CS305   S1999  4.0

878787  MGT123  S1996  3.0    page 2
51
Sequential File Organization

Suitable for applications that require sequential
processing of the entire file
The records in the file are ordered by a
search-key

52
Files of Ordered Records

Some blocks of an ordered (sequential) file of
EMPLOYEE records with NAME as the ordering key
field.

53
File Stored as a Sorted File
111111  MGT123  F1994  4.0
111111  CS305   S1996  4.0    page 0
123456  CS305   F1995  2.0
123456  CS315   S1997  4.0
123456  EE101   S1998  3.0    page 1
232323  MAT123  S1996  2.0
234567  EE101   F1995  3.0
234567  CS305   S1999  4.0

313131  MGT123  S1996  3.0    page 2
54
Sequential File Organization

Insertion is expensive
records must be inserted in the correct order
locate the position where the record is to be
inserted
if there is free space insert there
if no free space insert the record in an overflow
block
In either case, pointer chain must be updated
Insert takes lg(b) plus the time to re-organize
records.
b is the number of blocks
Deletion
use pointer chains
Searching
very efficient (Binary search)
This requires about lg(b) block accesses on average
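
Binary search over the blocks of an ordered file can be sketched as below. Modelling blocks as in-memory sorted lists of keys is an assumption for illustration, and so is the function name. Each loop iteration stands for one block read.

```python
def find_block(blocks, key):
    """Binary search over sorted blocks.

    Returns (block index or None, number of block reads).
    """
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]   # stands for reading one block from disk
        reads += 1
        if key < block[0]:
            hi = mid - 1      # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1      # key can only be in a later block
        else:
            # Key falls within this block's range; search inside the block.
            return (mid if key in block else None), reads
    return None, reads


blocks = [[1, 3], [5, 7], [9, 11], [13, 15]]
print(find_block(blocks, 9))  # (2, 2): found in block 2 after 2 block reads
```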

55
Sequential File Organization
56
Hashing Techniques

A hash function maps the hash field of a record
into the address of the storage media in which
the record is stored.
Hashing provides very fast access to records,
where the search condition is an equality
condition on the hash field.
For internal files, hashing is implemented as a
hash table. The mapping that assigns each element
of the data a cell of the hash table is called a
hash function.

57
Hashing Techniques

Two records that yield the same hash value are
said to collide.
A good hash function must be easy to compute and
generate a low number of collisions.
The process of finding another position (for
colliding data) is called collision resolution.
There are several methods for collision
resolution, including open addressing, chaining,
and multiple hashing.

58
Hashing Techniques

Open addressing
Proceeding from the occupied position specified
by the hash function, check the subsequent
positions in order until an unused position is
found.
Chaining
Associate an overflow area (or a linked list) to
any cell (hashing address) and then simply store
the data in this medium.
Multiple hashing
Apply a second hash function if the first results
in a collision.
If another collision results, use open
addressing, or apply a third hash function, and
then use open addressing.
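
Open addressing and chaining can be contrasted in a few lines. This is a sketch under assumptions: keys are non-negative integers, the hash function is `key % m`, open addressing probes linearly, and both function names are illustrative.

```python
def insert_open_addressing(table, key):
    """Linear probing: try the home cell, then the following cells in order."""
    m = len(table)
    home = key % m                      # assumed hash function
    for step in range(m):
        pos = (home + step) % m
        if table[pos] is None or table[pos] == key:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


def insert_chaining(table, key):
    """Chaining: every cell holds a list of all keys hashing to it."""
    chain = table[key % len(table)]     # same assumed hash function
    if key not in chain:
        chain.append(key)


probe_table = [None] * 7
positions = [insert_open_addressing(probe_table, k) for k in (10, 17, 24)]
print(positions)        # [3, 4, 5]: all three keys hash to cell 3

chain_table = [[] for _ in range(7)]
for k in (10, 17, 24):
    insert_chaining(chain_table, k)
print(chain_table[3])   # [10, 17, 24]
```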

59
Hashing Techniques
60
Hashing Techniques

Hashing for disk files is called external
hashing.
The target address space in external hashing is
made of buckets, each of which holds a disk block or a
cluster of contiguous blocks.
The collision problem is less severe, because as
many records as will fit in a bucket can hash to
the same bucket without causing a collision
problem.
A table maintained in the file header converts
the bucket number into the corresponding disk
block address.
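
External hashing with buckets can be sketched as follows. Everything here is an illustrative assumption: the class name, integer keys hashed with `key % M`, the made-up block addresses starting at 100, and a per-bucket overflow list for records that do not fit.

```python
class HashFile:
    """Toy static hash file: M buckets, each holding a fixed number of records."""

    def __init__(self, num_buckets, bucket_capacity, first_block=100):
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        # Header table: bucket number -> disk block address (addresses are made up).
        self.bucket_to_block = [first_block + i for i in range(num_buckets)]
        self.blocks = {addr: [] for addr in self.bucket_to_block}
        self.overflow = {addr: [] for addr in self.bucket_to_block}

    def _block_for(self, key):
        bucket = key % self.num_buckets       # assumed hash function
        return self.bucket_to_block[bucket]

    def insert(self, key, record):
        addr = self._block_for(key)
        if len(self.blocks[addr]) < self.bucket_capacity:
            self.blocks[addr].append((key, record))
        else:
            # Bucket is full: only now does a collision cost anything.
            self.overflow[addr].append((key, record))

    def search(self, key):
        addr = self._block_for(key)
        for k, record in self.blocks[addr] + self.overflow[addr]:
            if k == key:
                return record
        return None
```

With a bucket capacity of 2, the first two records hashing to a bucket share its block, and only the third becomes an overflow record.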

61
Hashing Techniques

Matching bucket numbers to disk block addresses.

62
Hashing Techniques

To reduce overflow records, a hash file is
typically kept 70-80% full.
The hash function h should distribute the records
uniformly among the buckets
Otherwise, search time will be increased because
many overflow records will exist.

63
Hashed Files - Overflow handling
64
Hashing Techniques

The hashing scheme is called static hashing if a
fixed number of buckets is allocated.
Main disadvantage of static external hashing
The number of buckets must be chosen large enough
to handle large files. That is, it is
difficult to expand or shrink the file
dynamically.
Three solutions to the above problem are
Dynamic hashing,
Extendible hashing
Linear hashing