Chapter 9, Disks and Files

Provided by: RaghuRamak
Learn more at: http://web.cecs.pdx.edu

1
Chapter 9, Disks and Files
  • The Storage Hierarchy
  • Disks
  • Mechanics
  • Performance
  • RAID
  • Disk Space Management
  • Buffer Management
  • Files of Records
  • Format of a Heap File
  • Format of a Data Page
  • Format of Records

2
Learning objectives
  • Given disk parameters, compute storage needs and
    read times
  • Given a reminder about what each level means, be
    able to derive any figures on the RAID
    performance slide
  • Describe the pros and cons of alternative
    structures for files, pages and records

3
A (Very) Simple Hardware Model
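[Figure: a CPU chip (register file, ALU) with a bus interface connects over the system bus to an I/O bridge. The I/O bridge connects over the memory bus to main memory, and over the I/O bus to the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.]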
4
Storage Options
Level               Capacity            Access Time       Cost
Registers           1k-2k bytes         1 Tc              Way expensive
Caches              10s-1000s K bytes   2-20 Tc           $10 / MByte
Main Memory         G bytes             300-1000 Tc       $0.03 / MB (eBay)
Hard Disk / Flash   100s G bytes        10 ms = 30M Tc    $0.10 / GB (eBay)
Tape                Infinite            Forever           Way cheap
5
Memory Hierarchy
From the upper level (faster) to the lower level (larger). Each level lists capacity, access time and cost; the sub-bullet gives who stages transfers to the next level down, and the transfer unit and size.
  • Registers: 1k-2k bytes, 1 Tc, way expensive
    • Staged by prog./compiler: instr. operands, 1-8 bytes
  • Cache - SDRAM (may be multiple levels!): 10s-1000s K bytes, 2-20 Tc, $10 / MByte
    • Staged by cache cntl: blocks, 8-128 bytes
  • Memory - DRAM: G bytes, 300-1000 Tc, $0.03 / MB (eBay)
    • Staged by OS: pages, 4K bytes
  • Disk: 100s G bytes, 10 ms = 30M Tc, $0.10 / GB (eBay)
    • Staged by user/operator: files, Gbytes
  • Tape: infinite, forever, way cheap
6
Why Does Hierarchy Work?
  • Locality
  • Programs access a relatively small portion of the
    address space at any instant of time
  • Two different types
  • Temporal locality (locality in time): if an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial locality (locality in space): if an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)

7
9.1 The Memory Hierarchy
  • Typical storage hierarchy as used by a RDBMS
  • Primary storage: Main memory (RAM) for currently
    used data
  • Secondary storage: Disk, Flash Memory for the
    main database
    • http://www.cs.cmu.edu/damon2007/pdf/graefe07fiveminrule.pdf
    • What are other reasons besides cost to use disk?
  • Tertiary storage: Tapes, DVDs for archiving older
    versions of the data
  • Other factors
  • Caches at every level
  • Controllers, protocols
  • Network connections

8
What is FLASH Memory, Anyway?
  • Floating gate transistor
    • Presence of charge => 0
    • Erase: electrically or UV (EPROM)
  • Performance
    • Reads like DRAM (ns)
    • Writes like disk (ms). Write is a complex
      operation

9
Components of a Disk
  • Platters are always spinning (say, 120 rps).
  • One head reads/writes at any one time.
  • To read a record:
    • position arm (seek)
    • engage head
    • wait for data to spin by
    • read (transfer data)

[Figure labels: spindle, disk head, tracks, sector, platters, arm movement, arm assembly]
10
More terminology
  • Each track is made up of fixed size sectors.
  • Page size is a multiple of sector size.
  • A platter typically has data on both surfaces.
  • All the tracks that you can reach from one
    position of the arm are called a cylinder
    (imaginary!).

[Figure labels: spindle, disk head, tracks, sector, platters, arm movement, arm assembly]
11
Disks Technology Background
  • Seagate 373453, 2003
    • 15000 RPM (4X)
    • 73.4 GBytes (2500X)
    • Tracks/Inch: 64000 (80X)
    • Bits/Inch: 533,000 (60X)
    • Four 2.5" platters (in 3.5" form factor)
    • Bandwidth: 86 MBytes/sec (140X)
    • Latency: 5.7 ms (8X)
    • Cache: 8 MBytes
  • CDC Wren I, 1983
    • 3600 RPM
    • 0.03 GBytes capacity
    • Tracks/Inch: 800
    • Bits/Inch: 9550
    • Three 5.25" platters
    • Bandwidth: 0.6 MBytes/sec
    • Latency: 48.3 ms
    • Cache: none

12
Typical Disk Drive Statistics (2008)
  • Sector size: 512 bytes
  • Seek time
    • Average: 4-10 ms
    • Track to track: 0.6-1.0 ms
  • Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM to 5,400 RPM)
  • Transfer time (sustained data rate): 0.3-0.1 msec per 8K page, or 25-75 MB/second
  • Density: 12-18 GB/in^2
13
Disk Capacity
  • Capacity: maximum number of bits that can be stored.
    • Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes
  • Capacity is determined by:
    • Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
    • Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
    • Areal density (bits/in^2): product of recording and track density.
  • Modern disks partition tracks into disjoint subsets called recording zones
    • Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
    • Each zone has a different number of sectors/track

14
Cost of Accessing Data on Disk
  • Time to access (read/write) a disk block:
    • Taccess = Tavg seek + Tavg rotation + Tavg transfer
    • seek time (moving arms to position disk head on track)
    • rotational delay (waiting for block to rotate under head)
      • Half a rotation, on average
    • transfer time (actually moving data to/from disk surface)
  • Key to lower I/O cost: reduce seek/rotation delays!
    • No way to avoid transfer time
  • Textbook measures query cost by NUMBER of page
    I/Os
  • Implies all I/Os have the same cost, and that CPU
    time is free
  • This is a common simplification.
  • Real DBMSs (in the optimizer) would consider
    sequential vs. random disk reads
  • Because sequential reads are much faster
  • and would count CPU time.

15
Disk Parameters Practice
  • A 2-platter disk rotates at 7,200 rpm. Each
    track contains 256KB.
  • How many cylinders are required to store an 8
    Gigabyte file?
  • What is the average rotational delay, in
    milliseconds?
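
One way to work the practice problem is sketched below. The slide does not say whether both surfaces of each platter hold data or whether KB/GB are binary units; the sketch assumes both surfaces are used and that 1 KB = 2^10 bytes and 1 GB = 2^30 bytes, so the answer changes if those assumptions do.

```python
import math

# Assumptions (not stated on the slide): both surfaces of each platter hold
# data, and KB/GB are binary units (2**10 and 2**30 bytes).
PLATTERS = 2
SURFACES_PER_PLATTER = 2
TRACK_BYTES = 256 * 2**10
FILE_BYTES = 8 * 2**30
RPM = 7200

# A cylinder is one track on every surface reachable from one arm position.
cylinder_bytes = PLATTERS * SURFACES_PER_PLATTER * TRACK_BYTES
cylinders_needed = math.ceil(FILE_BYTES / cylinder_bytes)

# Average rotational delay is half of one full rotation.
rotation_ms = 60_000 / RPM
avg_rotational_delay_ms = rotation_ms / 2

print(cylinders_needed)                   # 8192
print(round(avg_rotational_delay_ms, 2))  # 4.17
```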

16
Disk Access Time Example
  • Given:
    • Rotational rate = 7,200 RPM
    • Average seek time = 9 ms
    • Avg sectors/track = 400
  • Derived:
    • Tavg rotation = 1/2 x (60 secs / 7200 RPM) x 1000 ms/sec ≈ 4 ms
    • Tavg transfer = 60/7200 RPM x 1/400 secs/track x 1000 ms/sec ≈ 0.02 ms
    • Taccess = 9 ms + 4 ms + 0.02 ms
  • Important points:
    • Access time is dominated by seek time and rotational latency.
    • The first bit in a sector is the most expensive; the rest are free.
    • SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
    • Disk is about 40,000 times slower than SRAM, and 2,500 times slower than DRAM.
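
The derivation on this slide can be checked with a few lines of code. This is a minimal sketch; the function name `access_time_ms` is made up for illustration, and it models one sector read as average seek plus half a rotation plus one sector's share of a rotation.

```python
def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
    """Return (total, rotation, transfer) times in ms for reading one sector."""
    rotation_ms = 60_000 / rpm                    # one full rotation
    t_rotation = rotation_ms / 2                  # wait half a rotation on average
    t_transfer = rotation_ms / sectors_per_track  # one sector passes under the head
    return avg_seek_ms + t_rotation + t_transfer, t_rotation, t_transfer

total, t_rot, t_xfer = access_time_ms(rpm=7200, avg_seek_ms=9, sectors_per_track=400)
print(f"{t_rot:.2f} {t_xfer:.3f} {total:.2f}")  # 4.17 0.021 13.19
```

The unrounded total is a little above the 13.02 ms the slide's rounded terms add up to, because the rotational delay is closer to 4.17 ms than to 4 ms.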

17
So, How far away is the data?
From http://research.microsoft.com/gray/papers/AlphaSortSigmod.doc
18
Block, page and record sizes
  • Block: According to the text, the smallest unit of I/O.
  • Page: often used in place of block.
  • Typical record size: commonly hundreds,
    sometimes thousands of bytes
    • Unlike the toy records in textbooks
  • Typical page size: 4K, 8K

19
Effect of page size on read time
  • Suppose rotational delay is 4ms, average seek
    time 6 ms, transfer speed .5msec/8K.
  • This graph shows the time required to read 1Gig
    of data for different page sizes.

20
Why the difference?
  • What accounts for the difference in times to read one Gigabyte on the previous graph?
  • Assume rotational delay 4 ms, average seek time 6 ms, transfer speed 0.5 msec/8K
  • Transfer time
    • (2^30/2^13 8K blocks) x (0.5 msec/8K) = 66 secs ≈ one minute
  • How many reads?
    • Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads
    • Page size 64K: there are 1/8th that many reads = 16K reads
  • Time taken by rotational delays and seeks
    • Each read requires a rotational delay and a seek, totalling 10 msec.
    • 8K: (128K reads) x (10 msec/read) = 1,311 secs ≈ 22 minutes
    • 64K: 1/8 of that, or 164 secs ≈ 3 minutes
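
The arithmetic on this slide can be reproduced for any page size. The sketch below uses the slide's parameters (4 ms rotational delay, 6 ms seek, 0.5 msec per 8K transferred) and assumes, as the slide does, that every page read pays one full seek and rotational delay; the function name is illustrative.

```python
GIGABYTE = 2**30
POSITIONING_MS = 6 + 4        # average seek + rotational delay per read
TRANSFER_MS_PER_8K = 0.5

def read_time_secs(page_bytes, total_bytes=GIGABYTE):
    """Return (number of reads, positioning seconds, transfer seconds)."""
    reads = total_bytes // page_bytes
    positioning = reads * POSITIONING_MS / 1000
    # Transfer time depends only on the amount of data, not on the page size.
    transfer = (total_bytes / 2**13) * TRANSFER_MS_PER_8K / 1000
    return reads, positioning, transfer

for kb in (8, 64):
    reads, positioning, transfer = read_time_secs(kb * 1024)
    print(kb, reads, round(positioning), round(transfer))
# 8 131072 1311 66
# 64 16384 164 66
```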

21
Moral of the Story
  • As page size increases, read (and write) time
    reduces to transfer time, a big savings.
  • So why not use a huge page size?
  • Wastes memory space if you don't need all that is
    read
  • Wastes read time if you don't need all that is
    read
  • What applications could use a large page size?
  • Those that sequentially access data
  • The problem with a small page size is that pages
    get scattered across the disk. Turn the page.

22
Faster I/O, even with a small page size
  • Even if the page size is small, you can achieve
    fast I/O by storing a file's data as follows:
  • Consecutive pages on same track, followed by
  • Consecutive tracks on same cylinder, followed by
  • Consecutive cylinders adjacent to each other
  • First two incur no seek time or rotational delay,
    seek for third is only one-track.
  • What is saved with this storage pattern?
  • How is this storage pattern obtained?
  • Disk defragmenter and its relatives/predecessors
  • Also places frequently used files near the
    spindle
  • When data is in this storage pattern, the
    application can do sequential I/O
  • Otherwise it must do random I/O

23
More Hardware Issues
9. Disks
  • Disk Controllers
    • Interface from disks to bus
    • Checksums, remap bad sectors, driver mgt, etc
  • Interface protocols and MB per second xfer rates
    • IDE/EIDE/ATA/PATA, SATA: 133
    • SCSI: 640
    • BUT for a single device, SCSI is inferior
  • Faster network technologies such as Fibre Channel
  • Storage Area Networks (SANs)
    • Disk farm networked to servers
    • Servers can be heterogeneous: a primary advantage
    • Centralized management

24
Dependability
  • Module reliability: a measure of continuous service accomplishment (or time to failure). 2 metrics:
    • Mean Time To Failure (MTTF) measures reliability
    • Failures In Time (FIT) = 1/MTTF, the rate of failures
      • Traditionally reported as failures per billion hours of operation
  • Mean Time To Repair (MTTR) measures service interruption
    • Mean Time Between Failures (MTBF) = MTTF + MTTR
  • Module availability measures service as alternating between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
  • Module availability = MTTF / (MTTF + MTTR)

25
Example calculating reliability
  • If modules have exponentially distributed
    lifetimes (age of module does not affect
    probability of failure), overall failure rate is
    the sum of failure rates of the modules
  • Example: Calculate FIT and MTTF for
  • 10 disks (1M hour MTTF per disk)
  • 1 disk controller (0.5M hour MTTF)
  • and 1 power supply (0.2M hour MTTF)

26
Example calculating reliability
  • Calculate FIT and MTTF for
  • 10 disks (1M hour MTTF per disk)
  • 1 disk controller (0.5M hour MTTF)
  • and 1 power supply (0.2M hour MTTF)
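
A sketch of the calculation, using the rule from the previous slide that failure rates add when lifetimes are exponentially distributed. FIT is expressed in failures per billion hours, as on the Dependability slide; the variable names are illustrative.

```python
# (count, MTTF in hours) for each kind of component in the example
components = [
    (10, 1_000_000),  # disks
    (1, 500_000),     # disk controller
    (1, 200_000),     # power supply
]

# With exponentially distributed lifetimes, the system failure rate
# is the sum of the component failure rates (failures per hour).
failure_rate = sum(count / mttf for count, mttf in components)

fit = failure_rate * 1e9              # failures per billion hours
system_mttf_hours = 1 / failure_rate

print(round(fit))                # 17000
print(round(system_mttf_hours))  # 58824
```

So the subsystem fails at 17 per million hours, an MTTF of roughly 59,000 hours, even though the weakest single component has an MTTF of 200,000 hours.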

27
9.2 RAID 587
  • Disk array: arrangement of several disks that
    gives the abstraction of a single, large disk.
  • Goals: increase performance and reliability.
  • Two main techniques:
    • Data striping: data is partitioned; the size of a
      partition is called the striping unit. Partitions
      are distributed over several disks.
    • Redundancy: more disks => more failures.
      Redundant information allows reconstruction of
      data if a disk fails.

28
Data Striping
  • CPUs go fast, disks don't. How can disks keep
    up?
  • CPUs do work in parallel. Can disks?
  • Answer: partition data across D disks (see next
    slide).
  • If the partition unit is a page:
    • A single page I/O request is no faster
    • Multiple I/O requests can run at aggregated
      bandwidth
  • The number of pages in a partition unit is called
    the depth of the partition.
  • Contrary to the text, partition units of a bit are
    almost never used and partition units of a byte
    are rare.
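
As an illustration of striping with a page-sized (or larger) partition unit, the sketch below maps a logical page number to a disk and a position on that disk. It assumes partition units are dealt out round-robin across the D disks, which the slide does not spell out; the function name `locate` and its parameters are hypothetical.

```python
def locate(page_no, num_disks, depth=1):
    """Map a logical page to (disk, page offset on that disk).

    Assumes round-robin placement of partition units; `depth` is the
    number of pages per partition unit.
    """
    unit = page_no // depth     # which partition unit the page falls in
    disk = unit % num_disks     # units are dealt out round-robin
    stripe = unit // num_disks  # how many full stripes precede this unit
    offset = stripe * depth + page_no % depth
    return disk, offset

print(locate(5, num_disks=4))           # (1, 1)
print(locate(5, num_disks=4, depth=2))  # (2, 1)
print(locate(9, num_disks=4, depth=2))  # (0, 3)
```

Consecutive pages land on different disks when depth is 1, which is why multiple requests can proceed in parallel while a single page request is no faster.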

29
Data Striping (RAID Level 0)
30
Redundancy
  • Striping is seductive, but remember reliability!
  • MTTF of a disk is about 6 years
  • If we stripe over 24 disks, what is MTTF?
  • Solution: redundancy
    • Parity corrects single failures
    • Other schemes detect where the failure is and
      correct multiple failures
    • But failure location is provided by the controller
  • Redundancy may require more than one check bit
  • Redundancy makes writes slower: why?
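
The two ideas on this slide can be sketched together. The MTTF estimate assumes independent disks with constant failure rates (the same assumption as the reliability example), and the parity scheme shown is plain byte-wise XOR across the data disks; the block contents and names are made up for illustration.

```python
from functools import reduce

# Striping over 24 disks with no redundancy: any one failure loses data,
# and failure rates add, so MTTF drops by a factor of 24.
array_mttf_years = 6 / 24   # 0.25 years, i.e. about 3 months

def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # one block per data disk
parity_block = xor_parity(data_blocks)      # stored on the check disk

# The controller reports that disk 1 failed: rebuild its block from
# the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity_block]
rebuilt = xor_parity(survivors)
print(rebuilt == data_blocks[1])  # True
```

Parity alone cannot say which block is wrong; it can only rebuild a block once the failed disk is known, which is why the failure location has to come from the controller.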

31
RAID Levels
  • Standardized by SNIA (www.snia.org)
  • Vary in practice
  • For each level, decide (assume single user):
    • Number of disks required to hold D disks of data.
    • Speedup s (compared to 1 disk) for S/R (Sequential/Random) R/W (Reads/Writes)
      • Random: each I/O is one block
      • Sequential: each I/O is one stripe
    • Number of disks/blocks that can fail w/o data loss
  • Level 0: Block striped, no redundancy
    • Picture is 2 slides back

32
JBOD, RAID Level 1
  • JBOD: Just a Bunch of Disks
  • Level 1: Mirrored (two identical JBODs, no
    striping)

33
RAID Level 0+1: Stripe + Mirror
[Figure: data blocks striped across D disks, with a mirror copy of each disk's blocks on Disk D, Disk D+1, Disk D+2, ..., Disk 2D-1]
34
RAID Level 4
  • Block-Interleaved Parity (not common)
  • One check disk, uses one bit of parity.
  • How to tell if there is a failure, or which disk
    failed?
  • Read-modify-write
  • Disk D is a bottleneck

35
RAID Level 5
  • Level 5: Block-Interleaved Distributed Parity

[Figure: data blocks and parity blocks (P) interleaved across Disk 0, Disk 1, ..., Disk D-2, Disk D-1, Disk D, with the parity block on a different disk in each stripe]
  • Level 6: Like 5, but 2 parity bits/disks
    • Can survive loss of 2 disks/blocks

36
Notation on the next slide
  • Disks
    • Number of disks required to hold D disks' worth of data using this RAID level
  • Read/write speedup of blocks in a single file
    • SR: Sequential read
    • RR: Random read
    • SW: Sequential write
    • RW: Random write
  • Failure tolerance
    • How many disks can fail without loss of data
  • Internal data
    • s: Blocks transferred in the time it takes to transfer one block of data from one disk.
  • These numbers are theoretical!
    • YMMV, and they vary significantly!

37
RAID Performance
[Table: RAID performance by level, using the notation from the previous slide]
  • If no two are copies of each other
  • Note: can't write both mirrors at once. Why?
38
Small Writes on Levels 4 and 5
  • Levels 4 and 5 require a read-modify-write cycle
    for all writes, since the parity block must be
    read and modified.
  • On small writes this can be very expensive
  • This is another justification for Log Based File
    Systems (see your OS course)
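
The read-modify-write cycle can be sketched as follows, assuming XOR parity. Updating one block means reading the old data block and the old parity block, then writing the new data block and the new parity block: four I/Os for one logical write. The helper names and block contents are illustrative.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data, old_parity, new_data):
    """New parity after replacing one block, without reading the other disks.

    XOR-ing out the old data and XOR-ing in the new data gives the same
    result as recomputing parity over the whole stripe.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])

new_block = b"\xaa\x55"
new_parity = small_write_parity(stripe[1], parity, new_block)

# Same answer as recomputing parity from the full updated stripe.
recomputed = xor_blocks(xor_blocks(stripe[0], new_block), stripe[2])
print(new_parity == recomputed)  # True
```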

39
Which RAID Level is best?
  • If data loss is not a problem
    • Level 0
  • If storage cost is not a problem
    • Level 0+1
  • Else
    • Level 5
  • Software support
    • Linux: 0,1,4,5 (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html)
    • Windows: 0,1,5 (http://www.techimo.com/articles/index.pl?photo149)

40
9.3, 9.4.1 Covered earlier
41
9.4.2 DBMS vs. OS File System
  • OS does disk space & buffer mgmt: why not let OS
    manage these tasks? 715
  • Differences in OS support: portability issues
  • Some limitations, e.g., files can't span disks.
  • Buffer management in DBMS requires ability to:
    • pin a page in buffer pool, force a page to disk
      (important for implementing CC & recovery),
    • adjust replacement policy, and pre-fetch pages
      based on access patterns in typical DB
      operations.
  • Sometimes MRU is the best replacement policy, for
    example for a scan or a loop that does not fit in
    the buffer pool.
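
The MRU point can be seen in a small simulation. This is a toy sketch, not a real buffer manager: it ignores pinning and dirty pages and only counts misses for a loop over 5 pages with 4 buffer frames. The function name and the workload are made up for illustration.

```python
from collections import OrderedDict

def count_misses(accesses, frames, policy):
    """Count buffer misses under 'LRU' or 'MRU' replacement."""
    pool = OrderedDict()  # pages ordered from least to most recently used
    misses = 0
    for page in accesses:
        if page in pool:
            pool.move_to_end(page)  # hit: now the most recently used
            continue
        misses += 1
        if len(pool) >= frames:
            # MRU evicts from the recent end, LRU from the old end.
            pool.popitem(last=(policy == "MRU"))
        pool[page] = None
    return misses

loop = list(range(5)) * 4  # scan pages 0..4 four times
print(count_misses(loop, frames=4, policy="LRU"))  # 20
print(count_misses(loop, frames=4, policy="MRU"))  # 8
```

With LRU, each page is evicted just before it is needed again, so every access misses; MRU keeps most of the loop resident.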

42
9.5 Files of Records
  • Page or block is OK when doing I/O, but higher
    levels of DBMS operate on records, and files of
    records.
  • FILE: A collection of pages, each containing a
    collection of records. Must support:
  • insert/delete/modify record
  • read a particular record (specified using record
    id)
  • scan all records (possibly with some conditions
    on the records to be retrieved)

43
9.5.1 Unordered (Heap) Files
  • Simplest file structure: contains records in no
    particular order.
  • As file grows and shrinks, disk pages are
    allocated and de-allocated.
  • To support record level operations, we must
  • keep track of the pages in a file
  • keep track of free space on pages
  • keep track of the records on a page
  • There are at least two alternatives for keeping
    track of heap files.

44
Heap File Implemented as a List
[Figure: a header page points to two linked lists of data pages: full pages, and pages with free space]
  • The header page id and Heap file name must be
    stored someplace.
  • Each page contains 2 pointers plus data.

45
Heap File Using a Page Directory
[Figure: a DIRECTORY, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N]
  • The entry for a page can include the number of
    free bytes on the page.
  • The directory is a collection of pages; a linked
    list implementation is just one alternative.
  • Much smaller than a linked list of all HF pages!

46
Comparing Heap File Implementations
  • Assume
    • 100 directory entries per page.
    • U full pages, E pages with free space
    • D directory pages
  • Then D = ⌈(U+E)/100⌉
    • Note that D is two orders of magnitude less than U or E
  • Cost to find a page with enough free space
    • List: E/2; Directory: (D/2) + 1
  • Cost to move a page from Full to Free (e.g., when a record is deleted)
    • List: 3; Directory: 1
  • Can you think of some other operations?
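
The cost formulas above are easy to evaluate for concrete sizes. The sketch below simply encodes them; the values U = 9000 and E = 1000 are made-up inputs, and costs are in page I/Os as on the slide.

```python
import math

ENTRIES_PER_DIRECTORY_PAGE = 100

def directory_pages(full_pages, free_pages):
    """D = ceil((U + E) / 100)."""
    return math.ceil((full_pages + free_pages) / ENTRIES_PER_DIRECTORY_PAGE)

def find_free_page_cost(full_pages, free_pages):
    """Average page I/Os to find a page with enough free space."""
    d = directory_pages(full_pages, free_pages)
    return {"list": free_pages / 2, "directory": d / 2 + 1}

U, E = 9000, 1000  # hypothetical file: 9000 full pages, 1000 with free space
print(directory_pages(U, E))      # 100
print(find_free_page_cost(U, E))  # {'list': 500.0, 'directory': 51.0}
```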

47
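9.6 Page Formats: Fixed Length Records
[Figure, PACKED: Slot 1, Slot 2, ..., Slot N stored contiguously, followed by free space; the page stores N, the number of records.]
[Figure, UNPACKED, BITMAP: Slot 1, Slot 2, ..., Slot M, some of them empty; the page stores M, the number of slots, and a bitmap (M ... 3 2 1) with one bit per slot marking which slots are occupied.]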
48
Packed vs Unpacked Page Formats
  • Record ID (RID, TID) = (page, slot), in all
    page formats
  • Note that indexes are filled with RIDs
    • Data entries in alternatives 2 and 3 are (key,
      RID..)
  • Packed
  • stores more records
  • RIDs change when a record is deleted
  • This may not be acceptable.
  • Unpacked
  • RID does not change
  • Less data movement when deleting

49
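Page Formats: Variable Length Records
[Figure: Page i holds records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). A SLOT DIRECTORY at the end of the page holds a pointer to the start of free space, the number of slots N, and one entry per slot (N ... 2 1), with example entries 20, 16, 24.]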
50
Slotted Page Format
  • Intergalactic Standard, for fixed length records
    also.
  • How to deal with free space fragmentation?
    • Pack records lazily.
    • Note that RIDs don't change
  • How are updates that expand the size of a record
    handled?
    • Forwarding flag to new location
  • http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html
  • postgresql-8.3.1\src\include\storage\bufpage.h
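
A toy version of a slotted page is sketched below to show why RIDs survive compaction: records move, but slot numbers do not. This is an illustrative sketch only, not PostgreSQL's layout; the class and method names are hypothetical, and the slot directory is kept as a Python list rather than inside the page bytes.

```python
class SlottedPage:
    """Toy slotted page: records grow from the front of the page."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.free_start = 0  # pointer to start of free space
        self.slots = []      # slot -> (offset, length), or None if deleted

    def insert(self, record):
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        # Reuse a deleted slot if possible so other slot numbers stay valid.
        for slot, entry in enumerate(self.slots):
            if entry is None:
                self.slots[slot] = (offset, len(record))
                return slot
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        self.slots[slot] = None  # space is reclaimed lazily by compact()

    def compact(self):
        """Pack live records together; slot numbers (RIDs) are unchanged."""
        packed, pos = bytearray(len(self.data)), 0
        for slot, entry in enumerate(self.slots):
            if entry is None:
                continue
            offset, length = entry
            packed[pos:pos + length] = self.data[offset:offset + length]
            self.slots[slot] = (pos, length)
            pos += length
        self.data, self.free_start = packed, pos

page = SlottedPage()
a, b, c = page.insert(b"alice"), page.insert(b"bob"), page.insert(b"carol")
page.delete(b)
page.compact()
print(c, page.read(c), page.free_start)  # 2 b'carol' 10
```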

51
9.7 Record Formats: Fixed Length
[Figure: a record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4, starting at base address B; the third field is at address B+L1+L2]
  • Information about field types is the same for all
    records in a file; it is stored in system catalogs.
  • Finding the i'th field does not require a scan of
    the record.
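
The address calculation for fixed-length records is just a prefix sum of field lengths. In this sketch the field lengths are hypothetical values standing in for what the system catalog would supply, and fields are numbered from 1 as on the slide.

```python
FIELD_LENGTHS = [4, 8, 2, 6]  # hypothetical L1..L4, as read from the catalog

def field_address(base, i):
    """Address of the i'th field (1-based): B + L1 + ... + L(i-1)."""
    return base + sum(FIELD_LENGTHS[:i - 1])

print(field_address(1000, 1))  # 1000
print(field_address(1000, 3))  # 1012  (B + L1 + L2)
```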

52
Record Formats: Variable Length
  • Two alternative formats (# fields is fixed):

[Figure: (1) a field count followed by fields F1 F2 F3 F4, delimited by special symbols; (2) an array of field offsets followed by fields F1 F2 F3 F4]
  • The second offers direct access to the i'th field
    and efficient storage of nulls (a special "don't
    know" value), with small directory overhead.
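
The second format can be sketched as below. The layout chosen here is an assumption for illustration: n+1 little-endian 16-bit offsets, then the field bytes, with a null represented as a field whose start and end offsets are equal. Field indexes are 0-based in the code.

```python
import struct

def encode(fields):
    """Encode a list of bytes-or-None fields as offsets array + field data."""
    n = len(fields)
    pos = 2 * (n + 1)  # field data starts after the n+1 offsets
    offsets, body = [], b""
    for field in fields:
        offsets.append(pos)
        if field is not None:  # a null takes no space in the body
            body += field
            pos += len(field)
    offsets.append(pos)        # end of the last field
    return struct.pack(f"<{n + 1}H", *offsets) + body

def get_field(record, i):
    """Direct access to field i without scanning the earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    # In this toy layout an empty field and a null look the same.
    return None if start == end else record[start:end]

rec = encode([b"ab", None, b"xyz"])
print(len(rec), get_field(rec, 0), get_field(rec, 1), get_field(rec, 2))
# 13 b'ab' None b'xyz'
```

A null costs only its 2-byte offset entry, which is the "efficient storage of nulls" the slide refers to.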