Chapter 9, Disks and Files presentation

1
Chapter 9, Disks and Files

The Storage Hierarchy
Disks
Mechanics
Performance
RAID
Disk Space Management
Buffer Management
Files of Records
Format of a Heap File
Format of a Data Page
Format of Records

2
Learning objectives

Given disk parameters, compute storage needs and
read times
Given a reminder about what each level means, be
able to derive any figures on the RAID
performance slide
Describe the pros and cons of alternative
structures for files, pages and records

3
A (Very) Simple Hardware Model
[Figure: CPU chip (register file, ALU, bus interface) connected by the system bus to an I/O bridge. The memory bus links the bridge to main memory. The I/O bus links the bridge to a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
4
Storage Options
Level               Capacity            Access Time       Cost
Registers           1k-2k bytes         1 Tc              Way Expensive
Caches              10s-1000s K Bytes   2-20 Tc           $10 / MByte
Main Memory         G Bytes             300-1000 Tc       $0.03 / MB (eBay)
Hard Disk / Flash   100s G Bytes        10 ms (30M Tc)    $0.10 / GB (eBay)
Tape                Infinite            Forever           Way Cheap
5
Memory Hierarchy
Upper Level (Faster)
Each level: capacity, access time, cost
Staging between levels: who manages it, transfer unit, transfer size

Registers: 1k-2k bytes, 1 Tc, Way Expensive
  Staging: prog./compiler, Instr. Operands, 1-8 bytes
Cache - SRAM (may be multiple levels!): 10s-1000s K Bytes, 2-20 Tc, $10 / MByte
  Staging: cache cntl, Blocks, 8-128 bytes
Memory - DRAM: G Bytes, 300-1000 Tc, $0.03 / MB (eBay)
  Staging: OS, Pages, 4K bytes
Disk: 100s G Bytes, 10 ms (30M Tc), $0.10 / GB (eBay)
  Staging: user/operator, Files, Gbytes
Tape: Infinite, Forever, Way Cheap
Lower Level (Larger)
6
Why Does Hierarchy Work?

Locality
Programs access a relatively small portion of the address space at any instant of time
Two Different Types
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)

7
9.1 The Memory Hierarchy

Typical storage hierarchy as used by an RDBMS
Primary storage: Main memory (RAM) for currently used data
Secondary storage: Disk, Flash Memory for the main database
http://www.cs.cmu.edu/damon2007/pdf/graefe07fiveminrule.pdf
What are other reasons besides cost to use disk?
Tertiary storage: Tapes, DVDs for archiving older versions of the data
Other factors
Caches at every level
Controllers, protocols
Network connections

8
What is FLASH Memory, Anyway?

Floating gate transistor
Presence of charge => 0
Erase: Electrically or UV (EPROM)
Performance
Reads like DRAM (ns)
Writes like DISK (ms). Write is a complex operation

9
Components of a Disk
[Figure: disk anatomy, labeled with spindle, platters, tracks, sectors, disk head, arm assembly, and arm movement.]

The platters are always spinning (say, 120 rps).
One head reads/writes at any one time.
To read a record:
position arm (seek)
engage head
wait for data to spin by
read (transfer data)
10
More terminology
Each track is made up of fixed-size sectors.
Page size is a multiple of sector size.
A platter typically has data on both surfaces.
All the tracks that you can reach from one position of the arm are called a cylinder (imaginary!).
11
Disk Technology Background

Seagate 373453, 2003 (ratios in parentheses compare against the 1983 drive below)
RPM: 15000 (4X)
Capacity: 73.4 GBytes (2500X)
Tracks/Inch: 64000 (80X)
Bits/Inch: 533,000 (60X)
Platters: four 2.5" (in 3.5" form factor)
Bandwidth: 86 MBytes/sec (140X)
Latency: 5.7 ms (8X)
Cache: 8 MBytes

CDC Wren I, 1983
RPM: 3600
Capacity: 0.03 GBytes
Tracks/Inch: 800
Bits/Inch: 9550
Platters: three 5.25"
Bandwidth: 0.6 MBytes/sec
Latency: 48.3 ms
Cache: none

12
Typical Disk Drive Statistics (2008)
Sector size: 512 bytes
Seek time: average 4-10 ms, track to track 0.6-1.0 ms
Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
Transfer time (sustained data rate): 0.3-0.1 msec per 8K page, or 25-75 MB/second
Density: 12-18 GB/in^2
13
Disk Capacity

Capacity: maximum number of bits that can be stored.
Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes
Capacity is determined by
Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
Areal density (bits/in^2): product of recording and track density.
Modern disks partition tracks into disjoint subsets called recording zones
Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
Each zone has a different number of sectors/track

14
Cost of Accessing Data on Disk

Time to access (read/write) a disk block:
Taccess = Tavg seek + Tavg rotation + Tavg transfer
seek time (moving arms to position disk head on track)
rotational delay (waiting for block to rotate under head)
Half a rotation, on average
transfer time (actually moving data to/from disk surface)
Key to lower I/O cost: reduce seek/rotation delays!
No way to avoid transfer time
Textbook measures query cost by NUMBER of page I/Os
Implies all I/Os have the same cost, and that CPU time is free
This is a common simplification.
Real DBMSs (in the optimizer) would consider sequential vs. random disk reads, because sequential reads are much faster, and would count CPU time.

15
Disk Parameters Practice

A 2-platter disk rotates at 7,200 rpm. Each
track contains 256KB.
How many cylinders are required to store an 8
Gigabyte file?
What is the average rotational delay, in
milliseconds?
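
One way to check answers to the practice questions is a short Python sketch. The slide does not say whether both surfaces of each platter hold data or whether the units are binary, so the sketch assumes two usable surfaces per platter, 1 KB = 2**10 bytes, and 1 GB = 2**30 bytes. Different assumptions give different cylinder counts.

```python
# Assumptions (not stated on the slide): both surfaces of each platter hold
# data, 1 KB = 2**10 bytes, and 1 GB = 2**30 bytes.
PLATTERS = 2
SURFACES_PER_PLATTER = 2
TRACK_BYTES = 256 * 2**10
FILE_BYTES = 8 * 2**30
RPM = 7200

# A cylinder is the set of tracks under the heads at one arm position:
# one track per surface.
cylinder_bytes = PLATTERS * SURFACES_PER_PLATTER * TRACK_BYTES
cylinders_needed = -(-FILE_BYTES // cylinder_bytes)  # ceiling division

# Average rotational delay is half of one full rotation.
rotation_ms = 60_000 / RPM
avg_rotational_delay_ms = rotation_ms / 2

print(cylinders_needed)                   # 8192
print(round(avg_rotational_delay_ms, 2))  # 4.17
```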

16
Disk Access Time Example

Given
Rotational rate = 7,200 RPM
Average seek time = 9 ms
Avg sectors/track = 400
Derived
Tavg rotation = 1/2 x (60 secs / 7200 RPM) x 1000 ms/sec = 4 ms (approx.)
Tavg transfer = (60 / 7200 RPM) x (1/400 secs/track) x 1000 ms/sec = 0.02 ms
Taccess = 9 ms + 4 ms + 0.02 ms
Important points
Access time dominated by seek time and rotational
latency.
First bit in a sector is the most expensive, the
rest are free.
SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
At about 13 ms per access, disk is roughly 3 million times slower than SRAM and roughly 200,000 times slower than DRAM.
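
The same access-time formula can be sketched as a small Python function. The function and parameter names are illustrative and not from the slides. The slide rounds the rotational delay to 4 ms, so the unrounded result below comes out slightly higher than 9 + 4 + 0.02.

```python
def avg_access_ms(rpm, avg_seek_ms, sectors_per_track):
    """Average time to read one sector: seek + half a rotation + transfer."""
    rotation_ms = 60_000 / rpm                    # one full rotation
    t_rotation = rotation_ms / 2                  # wait half a turn on average
    t_transfer = rotation_ms / sectors_per_track  # one sector passes the head
    return avg_seek_ms + t_rotation + t_transfer

t = avg_access_ms(rpm=7200, avg_seek_ms=9, sectors_per_track=400)
print(round(t, 2))  # 13.19
```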

17
So, How far away is the data?
From http://research.microsoft.com/gray/papers/AlphaSortSigmod.doc
18
Block, page and record sizes

Block: According to text, smallest unit of I/O.
Page: often used in place of block.
Typical record size: commonly hundreds, sometimes thousands of bytes
Unlike the toy records in textbooks
Typical page size: 4K, 8K

19
Effect of page size on read time

Suppose rotational delay is 4 ms, average seek time is 6 ms, and transfer speed is 0.5 msec/8K.
This graph shows the time required to read 1 GB of data for different page sizes.

20
Why the difference?

What accounts for the difference, in times to read one Gigabyte, on the previous graph?
Assume rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K
Transfer time
(2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs, about one minute
How many reads?
Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads
Page size 64K: there are 1/8th that many reads = 16K reads
Time taken by rotational delays and seeks
Each read requires a rotational delay and a seek, totalling 10 msec.
8K: (128K reads) x (10 msec/read) = 1,311 secs, about 22 minutes
64K: 1/8 of that, or 164 secs, about 3 minutes
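
The arithmetic above can be sketched in Python for any page size. The constants are the ones assumed on this slide. The function name is illustrative. As on the slide, the model charges every page read one seek plus one rotational delay, which is the random I/O case.

```python
GIB = 2**30
SEEK_MS = 6.0
ROTATION_MS = 4.0
TRANSFER_MS_PER_8K = 0.5

def read_time_secs(page_bytes, total_bytes=GIB):
    """Seconds to read total_bytes when every page costs a seek and a rotation."""
    reads = total_bytes // page_bytes
    positioning_ms = reads * (SEEK_MS + ROTATION_MS)
    # Transfer time depends only on the amount of data, not on the page size.
    transfer_ms = (total_bytes / 8192) * TRANSFER_MS_PER_8K
    return (positioning_ms + transfer_ms) / 1000

for kb in (8, 64, 512):
    print(f"{kb}K pages: {read_time_secs(kb * 1024):.0f} secs")
```

With 8K pages the positioning term (about 1,311 secs) dwarfs the 66 secs of transfer. As the page size grows, the total approaches the transfer time alone.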

21
Moral of the Story

As page size increases, read (and write) time reduces to transfer time, a big savings.
So why not use a huge page size?
Wastes memory space if you don't need all that is read
Wastes read time if you don't need all that is read
What applications could use a large page size?
Those that sequentially access data
The problem with a small page size is that pages get scattered across the disk. Turn the page.

22
Faster I/O, even with a small page size

Even if the page size is small, you can achieve fast I/O by storing a file's data as follows:
Consecutive pages on same track, followed by
Consecutive tracks on same cylinder, followed by
Consecutive cylinders adjacent to each other
First two incur no seek time or rotational delay; seek for third is only one-track.
What is saved with this storage pattern?
How is this storage pattern obtained?
Disk defragmenter and its relatives/predecessors
Also places frequently used files near the spindle
When data is in this storage pattern, the application can do sequential I/O
Otherwise it must do random I/O

23
More Hardware Issues
9. Disks

Disk Controllers
Interface from Disks to bus
Checksums, remap bad sectors, driver mgt, etc.
Interface Protocols and MB per second xfer rates
IDE/EIDE/ATA/PATA, SATA: up to 133
SCSI: up to 640
BUT for a single device, SCSI is inferior
Faster network technologies such as Fibre Channel
Storage Area Networks (SANs)
Disk farm networked to servers
Servers can be heterogeneous: a primary advantage
Centralized management

24
Dependability

Module reliability: measure of continuous service accomplishment (or time to failure). 2 metrics:
Mean Time To Failure (MTTF) measures Reliability
Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures Service Interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability: measures service as it alternates between the 2 states of accomplishment and interruption (number between 0 and 1, e.g. 0.9)
Module availability = MTTF / (MTTF + MTTR)

25
Example calculating reliability

If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules
Example: Calculate FIT and MTTF for
10 disks (1M hour MTTF per disk)
1 disk controller (0.5M hour MTTF)
and 1 power supply (0.2M hour MTTF)

26
Example calculating reliability

Calculate FIT and MTTF for
10 disks (1M hour MTTF per disk)
1 disk controller (0.5M hour MTTF)
and 1 power supply (0.2M hour MTTF)
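
A worked version of this example can be sketched in Python, under the slide's assumption of exponentially distributed lifetimes, so that the failure rates of the modules simply add. The variable names are illustrative.

```python
# MTTF in hours for each module: 10 disks, 1 controller, 1 power supply.
components_mttf_hours = [1_000_000] * 10 + [500_000] + [200_000]

# With exponential lifetimes, the system failure rate is the sum of the rates.
failure_rate_per_hour = sum(1 / mttf for mttf in components_mttf_hours)

fit = failure_rate_per_hour * 1e9            # failures per billion hours
system_mttf_hours = 1 / failure_rate_per_hour

print(round(fit))                # 17000
print(round(system_mttf_hours))  # 58824
```

That is 10 + 2 + 5 = 17 failures per million hours, or a system MTTF of a little under 7 years.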

27
9.2 RAID 587

Disk Array: Arrangement of several disks that gives abstraction of a single, large disk.
Goals: Increase performance and reliability.
Two main techniques:
Data striping: Data is partitioned; size of a partition is called the striping unit. Partitions are distributed over several disks.
Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails.

28
Data Striping

CPUs go fast, disks don't. How can disks keep up?
CPUs do work in parallel. Can disks?
Answer: Partition data across D disks (see next slide).
If the partition unit is a page:
A single page I/O request is no faster
Multiple I/O requests can run at aggregated bandwidth
The number of pages in a partition unit is called the depth of the partition.
Contrary to text, partition units of a bit are almost never used and partition units of a byte are rare.
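
A minimal sketch of how a striped array might map a logical page number to a disk, assuming round-robin placement of partition units across the disks. The function name, the zero-based numbering, and the round-robin rule are assumptions for illustration. Real controllers may lay data out differently.

```python
def locate(page_no, num_disks, depth=1):
    """Map a logical page to (disk, page offset on that disk).

    depth is the number of pages in one partition (striping) unit.
    """
    unit = page_no // depth      # which partition unit holds the page
    disk = unit % num_disks      # units are dealt out round-robin
    stripe = unit // num_disks   # how many full rounds came before
    offset_on_disk = stripe * depth + page_no % depth
    return disk, offset_on_disk

# With 4 disks and depth 1, pages 0..3 land on disks 0..3, page 4 wraps around.
print([locate(p, 4) for p in range(6)])
```

Requests for pages 0 through 3 touch four different disks and can proceed in parallel, while a request for one page still involves a single disk.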

29
Data Striping (RAID Level 0)
30
Redundancy

Striping is seductive, but remember reliability!
MTTF of a disk is about 6 years
If we stripe over 24 disks, what is MTTF?
Solution: redundancy
Parity corrects single failures
Other schemes detect where the failure is, and correct multiple failures
But failure location is provided by controller
Redundancy may require more than one check bit
Redundancy makes writes slower: why?
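
Both points can be sketched in Python. The MTTF estimate assumes independent disks with constant failure rates, so the rates add. The parity part assumes the controller reports which disk failed, as the slide notes. The block contents and helper name are made up for illustration.

```python
from functools import reduce

# MTTF of an unprotected striped array: failure rates add, so MTTF divides.
DISK_MTTF_YEARS = 6
array_mttf_years = DISK_MTTF_YEARS / 24  # 0.25 years, about 3 months

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus one parity block.
data = [b"disk", b"file", b"page"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails. XOR of the survivors and the
# parity block rebuilds the lost block.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```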

31
RAID Levels

Standardized by SNIA (www.snia.org)
Vary in practice
For each level, decide (assume single user):
Number of disks required to hold D disks of data.
Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes)
Random: each I/O is one block
Sequential: each I/O is one stripe
Number of disks/blocks that can fail w/o data loss
Level 0: Block Striped, No redundancy
Picture is 2 slides back

32
JBOD, RAID Level 1

JBOD: Just a Bunch of Disks

Level 1: Mirrored (two identical JBODs, no striping)

33
RAID Level 0+1: Stripe + Mirror
[Figure: blocks are striped across the first D disks, and the same layout is mirrored on Disk D, Disk D+1, Disk D+2, ..., Disk 2D-1.]
34
RAID Level 4

Block-Interleaved Parity (not common)
One check disk, uses one bit of parity.
How to tell if there is a failure, or which disk
failed?
Read-modify-write
Disk D is a bottleneck

35
RAID Level 5

Level 5: Block-Interleaved Distributed Parity

[Figure: data blocks and parity blocks (P) laid out across Disk 0 through Disk D, with the parity block of each stripe placed on a different disk.]

Level 6: Like 5, but 2 parity bits/disks
Can survive loss of 2 disks/blocks

36
Notation on the next slide

Disks
Number of disks required to hold D disks worth of data using this RAID level
Reads/Write speedup of blocks in a single file
SR: Sequential read
RR: Random read
SW: Sequential write
RW: Random write
Failure Tolerance
How many disks can fail without loss of data
Internal Data
s: Blocks transferred in the time it takes to transfer one block of data from one disk.
These numbers are theoretical!
YMMV, and results vary significantly!

37
RAID Performance
If no two are copies of each other. Note: can't write both mirrors at once. Why?
38
Small Writes on Levels 4 and 5

Levels 4 and 5 require a read-modify-write cycle
for all writes, since the parity block must be
read and modified.
On small writes this can be very expensive
This is another justification for Log Based File
Systems (see your OS course)
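
The read-modify-write cycle for a single-block update can be sketched as follows. The new parity is the old parity XOR the old data XOR the new data, so a one-block write costs two reads and two writes. The helper names and block contents are illustrative, not taken from any RAID implementation.

```python
def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, old_parity, new_data):
    """Return (data block, parity block) to write back.

    The caller must first READ old_data and old_parity from disk,
    then WRITE both returned blocks: 4 I/Os for a one-block update.
    """
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity

# A stripe of three data blocks and its parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])

# Overwrite block 1 without touching blocks 0 and 2.
stripe[1], parity = small_write(stripe[1], parity, b"\xaa\xbb")

# The incrementally updated parity matches a full recomputation.
assert parity == xor(xor(stripe[0], stripe[1]), stripe[2])
```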

39
Which RAID Level is best?

If data loss is not a problem
Level 0
If storage cost is not a problem
Level 0+1
Else
Level 5
Software Support
Linux: 0,1,4,5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
Windows: 0,1,5 (http://www.techimo.com/articles/index.pl?photo149)

40
9.3, 9.4.1 Covered earlier
41
9.4.2 DBMS vs. OS File System

OS does disk space & buffer mgmt: why not let OS manage these tasks? 715
Differences in OS support: portability issues
Some limitations, e.g., files can't span disks.
Buffer management in DBMS requires ability to:
pin a page in buffer pool, force a page to disk (important for implementing CC & recovery),
adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations.
Sometimes MRU is the best replacement policy. For example, for a scan or a loop that does not fit.
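
The MRU claim can be checked with a small simulation: a loop over 5 pages, repeated 4 times, against a buffer pool of 4 frames. This is a toy model with no pinning or dirty pages, and the function name is illustrative.

```python
from collections import OrderedDict

def count_misses(accesses, frames, policy):
    """Count buffer misses for a page access sequence under 'LRU' or 'MRU'."""
    pool = OrderedDict()  # pages ordered from least to most recently used
    misses = 0
    for page in accesses:
        if page in pool:
            pool.move_to_end(page)  # hit: page becomes most recently used
            continue
        misses += 1
        if len(pool) >= frames:
            # LRU evicts from the front, MRU evicts from the back.
            pool.popitem(last=(policy == "MRU"))
        pool[page] = None
    return misses

scan = list(range(5)) * 4  # loop over 5 pages, 4 times
print(count_misses(scan, 4, "LRU"))  # 20: every access misses
print(count_misses(scan, 4, "MRU"))  # 8
```

Under LRU the page about to be needed is always the one just evicted (sequential flooding), while MRU keeps most of the loop resident.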

42
9.5 Files of Records

Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records.
FILE: A collection of pages, each containing a collection of records. Must support:
insert/delete/modify record
read a particular record (specified using record id)
scan all records (possibly with some conditions on the records to be retrieved)

43
9.5.1 Unordered (Heap) Files

Simplest file structure: contains records in no particular order.
As file grows and shrinks, disk pages are
allocated and de-allocated.
To support record level operations, we must
keep track of the pages in a file
keep track of free space on pages
keep track of the records on a page
There are at least two alternatives for keeping
track of heap files.

44
Heap File Implemented as a List
[Figure: the header page points to two linked lists of data pages, one of full pages and one of pages with free space.]

The header page id and Heap file name must be stored someplace.
Each page contains 2 pointers plus data.

45
Heap File Using a Page Directory
[Figure: a DIRECTORY, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N.]

The entry for a page can include the number of free bytes on the page.
The directory is a collection of pages; linked list implementation is just one alternative.
Much smaller than linked list of all HF pages!

46
Comparing Heap File Implementations

Assume
100 directory entries per page.
U full pages, E pages with free space
D directory pages
Then D = ceil((U+E)/100)
Note that D is two orders of magnitude less than U or E
Cost to find a page with enough free space
List: E/2, Directory: (D/2) + 1
Cost to move a page from Full to Free (e.g., when a record is deleted)
List: 3, Directory: 1
Can you think of some other operations?
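
The cost formulas on this slide can be tried out in Python. Costs are average page I/Os, using the slide's assumption of 100 directory entries per page. The function names and the sample values of U and E are illustrative.

```python
import math

ENTRIES_PER_DIR_PAGE = 100

def directory_pages(full_pages, free_pages):
    """D = ceil((U + E) / 100)."""
    return math.ceil((full_pages + free_pages) / ENTRIES_PER_DIR_PAGE)

def find_cost_list(free_pages):
    """Linked list: walk half of the free-space list on average."""
    return free_pages / 2

def find_cost_directory(full_pages, free_pages):
    """Directory: scan half the directory pages, then fetch the data page."""
    return directory_pages(full_pages, free_pages) / 2 + 1

U, E = 5000, 5000  # hypothetical file: 5000 full pages, 5000 with free space
print(directory_pages(U, E))      # 100
print(find_cost_list(E))          # 2500.0
print(find_cost_directory(U, E))  # 51.0
```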

47
9.6 Page Formats Fixed Length Records
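[Figure: two page layouts. PACKED: Slot 1, Slot 2, ..., Slot N stored contiguously, followed by free space, with the number of records N kept on the page. UNPACKED, BITMAP: Slot 1, Slot 2, ..., Slot M, with a bitmap (bits M ... 3 2 1) marking which slots are occupied and the number of slots M kept on the page.]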
48
Packed vs Unpacked Page Formats

Record ID (RID, TID) = (page, slot), in all page formats
Note that indexes are filled with RIDs
Data entries in alternatives 2 and 3 are (key, RID..)
Packed
stores more records
RIDs change when a record is deleted
This may not be acceptable.
Unpacked
RID does not change
Less data movement when deleting

49
Page Formats Variable Length Records
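[Figure: Page i holds records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). A SLOT DIRECTORY at the end of the page holds the slots N ... 2 1 (shown with entries 20, 16, 24), the number of slots N, and a pointer to the start of free space.]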
50
Slotted Page Format

Intergalactic Standard, for fixed length records also.
How to deal with free space fragmentation?
Pack records lazily.
Note that RIDs don't change
How are updates handled which expand the size of a record?
Forwarding flag to new location
http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
postgresql-8.3.1\src\include\storage\bufpage.h
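
A minimal sketch of a slotted page with lazy packing, written in Python for illustration. It is not PostgreSQL's layout. The class and method names are hypothetical, and for simplicity the slot directory is kept in a Python list rather than inside the page's bytes. The point is that compaction moves records but leaves slot numbers, and therefore RIDs, unchanged.

```python
class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []      # slot number -> (offset, length), None if deleted
        self.free_start = 0  # pointer to start of free space

    def insert(self, record: bytes) -> int:
        """Store a record and return its slot number."""
        if self.free_start + len(record) > len(self.data):
            self.compact()   # pack lazily: only when space runs out
            if self.free_start + len(record) > len(self.data):
                raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        entry = (offset, len(record))
        for slot_no, slot in enumerate(self.slots):
            if slot is None:  # reuse the slot of a deleted record
                self.slots[slot_no] = entry
                return slot_no
        self.slots.append(entry)
        return len(self.slots) - 1

    def delete(self, slot_no: int) -> None:
        self.slots[slot_no] = None  # the space is reclaimed later

    def read(self, slot_no: int) -> bytes:
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])

    def compact(self) -> None:
        """Slide live records together; slot numbers do not change."""
        packed = bytearray(len(self.data))
        pos = 0
        for slot_no, slot in enumerate(self.slots):
            if slot is None:
                continue
            offset, length = slot
            packed[pos:pos + length] = self.data[offset:offset + length]
            self.slots[slot_no] = (pos, length)
            pos += length
        self.data = packed
        self.free_start = pos

page = SlottedPage(size=16)
a = page.insert(b"aaaaaa")
b = page.insert(b"bbbbbb")
page.delete(a)
c = page.insert(b"cccccc")  # does not fit until the page is compacted
assert page.read(b) == b"bbbbbb"  # slot b is still valid after compaction
```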

51
9.7 Record Formats Fixed Length
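[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address (B); the third field starts at address = B + L1 + L2.]

Information about field types is the same for all records in a file; it is stored in system catalogs.
Finding the ith field does not require a scan of the record.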
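
The address calculation for fixed-length records can be sketched in a few lines of Python. The field lengths below are hypothetical stand-ins for what the system catalog would supply.

```python
# L1..L4 in bytes; hypothetical values that would come from the catalog.
FIELD_LENGTHS = [4, 8, 2, 10]

def field_address(base, i, lengths=FIELD_LENGTHS):
    """Address of field i (1-based): base plus the lengths of fields 1..i-1."""
    return base + sum(lengths[:i - 1])

print(field_address(1000, 1))  # 1000
print(field_address(1000, 3))  # 1012, i.e. B + L1 + L2
```

The offset depends only on catalog information, so the record itself never has to be scanned.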

52
Record Formats Variable Length
Two alternative formats (# fields is fixed):

[Figure: (1) Fields Delimited by Special Symbols: a field count followed by F1, F2, F3, F4 separated by delimiters. (2) Array of Field Offsets: an offset array followed by F1, F2, F3, F4.]

Second offers direct access to ith field, efficient storage of nulls (special don't know value); small directory overhead.
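
The second format can be sketched with Python's struct module. The layout here is one possible encoding and is an assumption: a header of 2-byte little-endian offsets, one per field plus one marking the end of the record, followed by the field bytes. A null is stored as a zero-width field, which in this simple sketch cannot be told apart from an empty string.

```python
import struct

def encode(fields):
    """Pack byte-string fields (None for null) behind an array of offsets."""
    n = len(fields)
    pos = 2 * (n + 1)  # the body starts right after the offset array
    offsets, body = [], b""
    for field in fields:
        offsets.append(pos)
        if field is not None:  # a null takes no space in the body
            body += field
            pos += len(field)
    offsets.append(pos)        # end-of-record offset
    return struct.pack(f"<{n + 1}H", *offsets) + body

def get_field(record, i):
    """Direct access to field i (0-based) without scanning earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return None if start == end else record[start:end]

rec = encode([b"Ann", None, b"CS"])
print(get_field(rec, 0))  # b'Ann'
print(get_field(rec, 1))  # None
print(get_field(rec, 2))  # b'CS'
```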
