1
ICOM 5016 Intro. to Database Systems
  • Dr. Manuel Rodríguez-Martínez
  • Electrical and Computer Engineering Department

2
Readings
  • Read
  • Chapter 11

3
Relational DBMS Architecture
  • Client / Client API
  • Query Parser
  • Query Optimizer
  • Relational Operators
  • Execution Engine
  • File and Access Methods
  • Concurrency and Recovery
  • Buffer Management
  • Disk Space Management
  • DB (on disk)
4
Disk Space Management
  • Disk Space Manager
  • DBMS module in charge of managing the disk space
    used to store relations
  • Duties
  • Allocate space
  • Write data
  • Read data
  • De-allocate space
  • The Disk Space Manager supplies a stream of data
    pages.
  • A page is the minimal unit of I/O
  • Often the size of a block (a sector, several
    sectors, or more)

5
Disk Page
  • A disk page is simply an array of bytes
  • We impose on it the logic of an array of records!






[Figure: a disk page holding an array of records, e.g. (123, Bob, NY, 1200),
(2178, Jil, LA, 9202), (8273, Ned, FL, 2902), (723, Al, PR, 300).]
Reading a disk page should be one I/O.
6
Disk arrangement options
  • Suppose we need to create 10 GB of space to
    store a database. Each page is 4 KB in size.
  • How do we organize the disk to accomplish this?
  • Cooked file
  • Use the file system provided by the OS
  • Create a file mydb.dat
  • Write to this file N pages of size 4 KB
  • N must be enough to reach the size of 10 GB
  • Pages start out filled with zero bytes.
  • Keep a hash table somewhere to store the
    information about this file mydb.dat.
  • Now you can start writing pages with actual data
    (see the sketch below).
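A minimal C sketch of the cooked-file option, pre-allocating a 10 GB file of
zero-filled 4 KB pages. The file name mydb.dat comes from the slide; the loop
structure and error handling are illustrative assumptions, not the course's
actual code.

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE (4 * 1024)                    /* 4 KB pages */
    #define DB_SIZE   (10ULL * 1024 * 1024 * 1024)  /* 10 GB total */

    int main(void)
    {
        char page[PAGE_SIZE];
        memset(page, 0, sizeof(page));      /* pages start out all zeros */

        FILE *f = fopen("mydb.dat", "wb");
        if (!f) { perror("fopen"); return 1; }

        unsigned long long n_pages = DB_SIZE / PAGE_SIZE;
        for (unsigned long long i = 0; i < n_pages; i++) {
            if (fwrite(page, 1, PAGE_SIZE, f) != PAGE_SIZE) {
                perror("fwrite");
                fclose(f);
                return 1;
            }
        }
        fclose(f);          /* now individual pages can be overwritten */
        return 0;
    }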

7
Raw Disk Partition
  • Don't use the files provided by the OS
  • Instead, create a partition on the disk, but
    don't format it with an OS file system (e.g.
    FAT, FAT32, NTFS, or a Linux file system)
  • Make your own file system on the disk
  • Create a directory of pages
  • Need to implement all operations such as read,
    write, check, etc.
  • Need to implement your own files
  • Faster and more efficient than OS files, but
    more complex.

8
Buffer Management
  • All data pages must be in memory in order to be
    accessed.
  • Buffer Manager
  • Asks the Disk Space Manager for pages from disk
    and stores them in memory
  • Sends the Disk Space Manager pages to be written
    to disk.
  • Memory is faster than disk
  • Keep as much data as possible in memory
  • If not enough space is available, we need a
    policy to decide which pages to remove from
    memory.

9
Buffer Pool
  • Frame
  • Data structure that can hold a data page and
    control flags
  • Buffer pool
  • Array of frames of size N.
  • In C:

    #include <stdbool.h>

    #define POOL_SIZE 100
    #define PAGE_SIZE 4096

    typedef struct frame {
        int  pin_count;        /* number of transactions using the page */
        bool dirty;            /* set if the page was modified in memory */
        char page[PAGE_SIZE];  /* raw bytes of the disk page */
    } frame;

    frame buffer_pool[POOL_SIZE];

10
Buffer Pool



[Figure: the buffer pool is an array of frames in RAM; disk pages from the DB
on disk are brought into free frames as they are requested.]
11
Operational mode
  • All requested data pages must first be placed
    into the buffer pool.
  • pin_count is used to keep track of the number of
    transactions that are using the page
  • 0 means nobody is using it
  • dirty is used as a flag (dirty bit) to indicate
    that a page has been modified since it was read
    from disk
  • Need to flush it to disk if the page is to be
    evicted from the pool
  • page is an array of bytes where the actual data
    is located
  • Need to interpret these bytes as the int, char,
    Date, etc. data types supported by SQL (see the
    sketch below)
  • This is very complex and tricky!
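A minimal sketch of pulling a typed field out of the raw page bytes, assuming
a record whose field is a 4-byte int stored at a known offset; the layout and
the helper name read_int_field are illustrative, not part of the slides.

    #include <string.h>

    /* Reinterpret raw page bytes as an int field; offset is where the field
       is assumed to start within the page. */
    int read_int_field(const char *page, size_t offset)
    {
        int value;
        memcpy(&value, page + offset, sizeof value);  /* avoids alignment issues */
        return value;
    }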

12
Buffer replacement
  • If we need to bring a page from disk, we need to
    find a frame in the buffer to hold it
  • The buffer pool keeps track of the number of
    frames in use
  • List of frames that are free
  • If there is a free frame, we use it
  • Remove it from the list of free frames
  • Increment the pin_count
  • Store the data page into the byte array (page
    field)
  • If the buffer is full, we need a policy to decide
    which page will be evicted

13
Buffer replacement Algorithm
  • Upon a request for page X do
  • Look for page X in the buffer pool
  • If found, return it
  • Else, determine if there is a free frame Y in the
    pool
  • If frame Y is found
  • Increment its pin_count
  • Read the page into the frame's byte array
  • Else, use the replacement policy to find a frame
    Z to replace
  • Z must have pin_count == 0
  • Increment the pin_count in Z
  • If its dirty bit is set, write the data currently
    in Z to disk
  • Read the new page into the byte array in Z
  • (A sketch of this algorithm appears below.)
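A self-contained C sketch of this request algorithm. It folds the page lookup
into a page_id field in each frame and uses a trivial "first unpinned frame"
policy; a real DBMS would use a hash table and one of the replacement policies
discussed later, so the field, helper names, and policy here are assumptions
for illustration only.

    #include <stdbool.h>
    #include <string.h>

    #define POOL_SIZE 100
    #define PAGE_SIZE 4096

    typedef struct frame {
        int  page_id;     /* disk page held by this frame, -1 if free */
        int  pin_count;   /* number of transactions using the page */
        bool dirty;       /* page modified since it was read from disk */
        char page[PAGE_SIZE];
    } frame;

    static frame buffer_pool[POOL_SIZE];  /* assume page_id initialized to -1 */

    /* Stand-ins for the Disk Space Manager calls (one I/O each in a real system). */
    static void read_page_from_disk(int page_id, char *dest)
    {
        (void)page_id;
        memset(dest, 0, PAGE_SIZE);
    }
    static void write_page_to_disk(const frame *f) { (void)f; }

    /* Request page X: return a pointer to its bytes in the pool, or NULL if
       every frame is pinned (caller must wait or abort the transaction). */
    char *request_page(int page_id)
    {
        int free_f = -1, victim = -1;

        for (int i = 0; i < POOL_SIZE; i++) {
            if (buffer_pool[i].page_id == page_id) {    /* already in the pool */
                buffer_pool[i].pin_count++;
                return buffer_pool[i].page;
            }
            if (free_f < 0 && buffer_pool[i].page_id == -1)
                free_f = i;                             /* free frame Y */
            if (victim < 0 && buffer_pool[i].page_id != -1 &&
                buffer_pool[i].pin_count == 0)
                victim = i;                             /* evictable frame Z */
        }

        int f = (free_f >= 0) ? free_f : victim;
        if (f < 0)
            return NULL;

        if (free_f < 0 && buffer_pool[f].dirty)
            write_page_to_disk(&buffer_pool[f]);        /* flush dirty victim */

        buffer_pool[f].page_id   = page_id;
        buffer_pool[f].pin_count = 1;
        buffer_pool[f].dirty     = false;
        read_page_from_disk(page_id, buffer_pool[f].page); /* fill the byte array */
        return buffer_pool[f].page;
    }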

14
Some issues
  • Need to make sure pin_count is 0
  • Nobody is using the frame
  • Need to write the data to disk if the dirty bit
    is set
  • This latter approach is called lazy update
  • Write to disk only when you have to!!!
  • Careful: if power fails, you are in trouble.
  • The DBMS needs to periodically flush pages to
    disk
  • Force write
  • If no page is found with pin_count equal to 0,
    then either
  • Wait until one is freed, or
  • Abort the transaction (insufficient resources)

15
Buffer Replacement policies
  • LRU: Least Recently Used
  • Evicts the page that is the least recently used
    page in the pool.
  • Can be implemented by having a priority queue of
    frame numbers, where higher priority means less
    recently used
  • The head of the queue is the LRU frame
  • Each time a page is used, its frame must be
    removed from its current queue position and put
    back at the end
  • This queue needs a method erase() that can erase
    entries from the middle of the queue (see the
    sketch below)
  • LRU is the most widely used policy for buffer
    replacement
  • Most cache managers also use it
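A minimal C sketch of the LRU bookkeeping described above, assuming frames are
identified by their index in the buffer pool; the array-based queue and the
names lru_touch, lru_victim, and lru_erase are illustrative, not the course's
implementation.

    #define POOL_SIZE 100

    static int lru_queue[POOL_SIZE];  /* frame numbers, index 0 = least recent */
    static int lru_len = 0;

    /* erase(): remove a frame number from anywhere in the queue. */
    static void lru_erase(int frame_no)
    {
        for (int i = 0; i < lru_len; i++) {
            if (lru_queue[i] == frame_no) {
                for (int j = i; j < lru_len - 1; j++)
                    lru_queue[j] = lru_queue[j + 1];
                lru_len--;
                return;
            }
        }
    }

    /* Called every time a page in frame_no is used: move it to the tail. */
    void lru_touch(int frame_no)
    {
        lru_erase(frame_no);
        lru_queue[lru_len++] = frame_no;
    }

    /* The eviction candidate is at the head; the caller must still check
       that its pin_count is 0 before evicting. */
    int lru_victim(void)
    {
        return (lru_len > 0) ? lru_queue[0] : -1;
    }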

16
Other policies
  • Most Recently Used (MRU)
  • Evicts the page that was most recently accessed
  • Can be implemented with a priority queue
  • FIFO
  • Pages are replaced in strict First-In-First-Out
    order
  • Can be implemented with a FIFO list (a queue in
    the strict sense)
  • Random
  • Pick any page at random for replacement

17
MTTF in terms of number of disks
  • Suppose we have N disks in the array. Then the
    MTTF of the array is given by
  • MTTF(array) = MTTF(single disk) / N
  • This assumes any disk can fail with equal
    probability.

18
MTTF in a disk array
  • Suppose we have a single disk with an MTTF of
    50,000 hrs (5.7 years).
  • Then, if we build an array with 50 disks, we have
    an MTTF for the array of 50,000/50 = 1,000 hrs,
    or about 42 days, because any disk can fail at
    any given time with equal probability.
  • Disk failures are more common when disks are new
    (bad disks from the factory) or old (wear due to
    usage).
  • Moral of the story: more does not necessarily
    mean better!

19
Increasing MTTF with redundancy
  • We can increase the MTTF of a disk array by
    storing some redundant information in the disk
    array.
  • This information can be used to recover from a
    disk failure.
  • This information should be carefully selected so
    it can be used to reconstruct the original data
    after a failure.
  • What to store as redundant information?
  • A full copy of each data block
  • Parity bits for a set of bit locations across all
    the disks
  • Where to store it?
  • Check disks: disks in the array used only for
    this purpose
  • All disks: spread the redundant information over
    every disk in the array.

20
Redundancy unit Data Blocks
  • One approach is to keep a back-up copy of each
    data block in the array. This is called
    mirroring.
  • The back-up can be on
  • another disk, or disk array
  • tape (very slow)
  • Advantage
  • Easy to recover from a failure: just read the
    block from the backup.
  • Disadvantages
  • Requires twice the storage space
  • Writes are more expensive
  • Need to write the data block to two different
    locations each time
  • Snapshot writes are infeasible (failures happen
    at any time!)

21
Redundancy Unit Parity bits
  • Consider an array of N disks, and let k denote
    the k-th block on each disk. Each block consists
    of several kilobytes, and each byte is 8 bits.
  • We can store redundant information about the i-th
    bit position in each data block.
  • Parity bit
  • The parity bit indicates whether the number of
    bits set to 1 in the group of corresponding bit
    locations of the data blocks is even or odd.
  • For example, if bit 1024 has parity 0, then an
    even number of bits were set to 1 at bit position
    1024. Otherwise the parity value is 1.

22
Parity bits
  • Consider the bytes
  • b1 = 00010001, b2 = 00111111, b3 = 00000011
  • If we take the XOR of these bytes we get
  • 00010001
  • 00111111
  • 00000011
  • --------
  • 00101101 - this byte has the parity for all bit
    positions in b1, b2, b3 (computed in the sketch
    below)
  • Notice the following
  • For bit position 0, the parity is 1, meaning an
    odd number of bits have value 1 at bit position
    0.
  • For bit position 1, the parity is 0, meaning an
    even number of bits have value 1 at bit position
    1.
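The same computation in C, as a quick check of the example (0x11, 0x3F, and
0x03 are the hex values of b1, b2, and b3):

    #include <stdio.h>

    int main(void)
    {
        unsigned char b1 = 0x11;              /* 00010001 */
        unsigned char b2 = 0x3F;              /* 00111111 */
        unsigned char b3 = 0x03;              /* 00000011 */

        unsigned char parity = b1 ^ b2 ^ b3;  /* 00101101 */
        printf("parity = 0x%02X\n", parity);  /* prints parity = 0x2D */
        return 0;
    }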

23
Redundancy as blocks of parity bits
  • For each group of corresponding data blocks,
    compute and store a parity block that holds the
    parity bits for each bit location in those data
    blocks.

[Figure: a disk array on a controller bus; block 0 of data disks 0-3 has a
corresponding parity block 0 on check disk 0.]
24
Reliability groups in disk array
  • Organize the disks in the array into groups
    called reliability groups.
  • Each group has
  • 1 or more data disks, where the data blocks are
    stored
  • 0 or more check disks where the blocks of parity
    bits are stored
  • If a data disk fails, then the check disk(s) for
    its reliability group can be used to recover the
    data lost from that disk.
  • There is a recovery algorithm that works for any
    failed disk m in the disk array.
  • Can recover from up to 1 disk failure.

25
Recovery algorithm
  • Suppose we have an array of N disks, with M check
    disks (in this case there is one reliability
    group).
  • Suppose disk p fails. We buy a replacement and
    then we can recover the data as follows.
  • For each data block k on disk p
  • Read data block k on every disk r, with r != p
  • Read parity block k from its check disk w
  • For each bit position i in block k of disk p
  • Count the number of bits set to 1 at position i
    in each block coming from a disk other than p.
    Let this number be j
  • If j is odd, and the parity bit is 1, then bit
    position i is set to 0
  • If j is even, and the parity bit is 0, then bit
    position i is set to 0
  • Else, bit position i is set to 1
  • (Equivalently, the lost bit is the XOR of the
    parity bit and the surviving bits; see the sketch
    below.)
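A minimal C sketch of this recovery step under the same assumptions (one check
disk per group, 4 KB blocks); the buffer layout and the name recover_block are
illustrative only.

    #define BLOCK_SIZE 4096

    /* surviving[d] holds block k of each surviving data disk, parity holds
       block k of the check disk; recovered receives block k of the failed
       disk. */
    void recover_block(const unsigned char *surviving[], int n_surviving,
                       const unsigned char *parity,
                       unsigned char *recovered)
    {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            unsigned char b = parity[i];
            /* XOR-ing the parity with all surviving blocks leaves exactly the
               bits of the lost block, matching the odd/even cases above. */
            for (int d = 0; d < n_surviving; d++)
                b ^= surviving[d][i];
            recovered[i] = b;
        }
    }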

26
Recovery algorithm Example
  • Suppose we have an array of 5 disks, with 1 check
    disk (in this case there is one reliability
    group).
  • Suppose disk 1 fails. We buy a replacement and
    then we can recover the data as follows.
  • For each data block k on disk 1
  • Read data block k on disks 0, 2, 3, 4
  • Read parity block k on check disk 0
  • For each bit position i in block k of disk 1
  • Count the number of bits set to 1 at position i
    in each block k coming from disks 0, 2, 3, 4.
    Let this number be j
  • If j is odd, and the parity bit is 1, then bit
    position i is set to 0
  • If j is even, and the parity bit is 0, then bit
    position i is set to 0
  • Else, bit position i is set to 1

27
RAID Organization
  • RAID
  • Originally: Redundant Array of Inexpensive Disks
  • Now: Redundant Array of Independent Disks
  • A RAID organization combines the ideas of
    striping, redundancy as parity bits, and
    reliability groups.
  • A RAID system has one or more reliability groups
  • For simplicity we shall assume only one group
  • RAID systems can have a varying number of check
    disks per reliability group, depending on the
    RAID level that is chosen for the system.
  • Each RAID level represents a different tradeoff
    between storage requirements, write speed, and
    recovery complexity.

28
RAID Analysis
  • Suppose we have a disk array with 4 data disks.
  • Let's analyze how many check disks we need to
    build a RAID with 1 reliability group of 4 data
    disks plus the check disks.
  • Note: Effective space utilization is a measure of
    the amount of space in the disk array that is
    used to store data. It is given as a percentage
    by the formula
  • space utilization = (number of data disks / total
    number of disks) x 100
  • For example, 4 data disks plus 1 check disk gives
    4/5 x 100 = 80%.

29
RAID Level 0
  • RAID Level 0: Non-redundant
  • Uses data striping to distribute data blocks and
    increase the maximum disk bandwidth available.
  • Disk bandwidth refers to the aggregate rate of
    moving data from the disk array to main memory,
    e.g. 200 MB/sec.
  • The solution with the lowest cost, but with
    little reliability.
  • Write performance is the best, since only 1 block
    is written in every write operation, and the cost
    is 1 I/O.
  • Read performance is not the best, since a block
    can only be read from one disk (there is no
    redundant copy).
  • Effective space utilization is 100%.

30
RAID Level 1
  • RAID Level 1: Mirrored
  • Each data block is duplicated
  • original copy + mirror copy
  • No striping!
  • The most expensive solution, since it requires
    twice the space of the expected data set size.
  • Every write involves two writes (original + copy)
  • They cannot be done simultaneously, to prevent
    double corruption
  • First write the data disk, then the copy on the
    mirror disk
  • Reads are fast, since a block can be fetched from
    either the
  • Data disk, or the
  • Mirror disk

31
RAID Level 1 (cont)
  • In RAID Level 1, the data block can be fetched
    from the disk with the least contention.
  • Since we need to pair disks in groups of two
    (original + copy), the space utilization is 50%,
    independent of the number of disks.
  • RAID Level 1 is only good for small data
    workloads where the cost of mirroring is not an
    issue.

32
RAID Level 0+1
  • RAID Level 0+1: Striping and Mirroring
  • Also called RAID Level 10
  • Combines mirroring and striping.
  • Data is striped over the data disks.
  • Parallel I/O for high throughput (full disk array
    bandwidth)
  • Each data disk is copied onto a mirror disk
  • Writes require 2 I/Os (original disk + mirror
    disk)
  • Blocks can be read from either the original disk
    or the mirror disk
  • Better read performance, since more parallelism
    can be achieved.
  • No need to wait for a busy disk, just go to its
    mirror disk!

33
RAID Level 10 (cont)
  • Space utilization is 50% (half data and half
    copies)
  • RAID Level 10 is better than RAID 1 because of
    striping.
  • RAID Level 10 is good for workloads with small
    data sets, where the cost of mirroring is not an
    issue.
  • Also good for workloads with high percentages of
    writes, since a write is always 2 I/Os to
    unloaded disks (especially the mirrors).

34
RAID Level 2
  • RAID Level 2: Error-Correcting Codes
  • Uses striping with a 1-bit striping unit.
  • Uses a Hamming code for redundancy on C check
    disks.
  • Can indicate which disk failed
  • Makes the number of check disks grow
    logarithmically with respect to the number of
    data disks.
  • Reads are expensive, since to read 1 bit we need
    to read 1 physical data block, the one storing
    that bit.
  • Therefore, to read 1 logical data block from the
    array we need to read multiple physical data
    blocks, one from each disk, to get all the
    necessary bits.

35
RAID Level 2 (Cont)
  • Since we are striping with 1-bit units, if we
    have an array with m data disks, then reading m
    bits will require 1 block from each disk, for a
    total of m I/Os.
  • Therefore, reading 1 logical data block from the
    RAID will require reading at least m blocks, and
    therefore the cost will be at least m I/Os.
  • Level 2 is good for requests of large contiguous
    data blocks, since the fetched physical blocks
    will contain the required data.
  • Level 2 is bad for requests of small data, since
    the I/Os will be wasted fetching just a few bits
    and throwing away the rest.

36
RAID Level 2 (Cont)
  • Writes are expensive with Level 2 RAID.
  • A write operation on N data disks involves
  • Reading at least N data blocks into memory
  • Reading the C check disks
  • Modifying the N data blocks with the new data
  • Modifying the C check disks to update the Hamming
    codes
  • Writing N + C blocks to the disk array
  • This is called a read-modify-write cycle.
  • Level 2 has better space utilization than Level 1.

37
RAID Level 3
  • RAID Level 3: Bit-Interleaved Parity
  • Uses striping with a 1-bit striping unit.
  • Does not use Hamming codes, but simply computes
    bit parity.
  • The disk controller can tell which disk has
    failed.
  • Only needs 1 check disk to store the parity bits
    of the data disks in the array.
  • A RAID Level 3 system will have N disks, where
    N-1 are data disks and one is the check disk.

38
RAID Level 3 (Cont)
  • Reading or writing a logical data block in a RAID
    Level 3 involves reading at least N-1 data blocks
    from the array.
  • Writing requires a read-modify-write cycle.

39
RAID Level 4
  • RAID Level 4: Block-Interleaved Parity
  • Uses striping with a 1-block striping unit.
  • A logical data block is the same as a physical
    data block.
  • Computes redundancy as parity bits, and has 1
    check disk to store the parity blocks for all
    corresponding blocks in the array.
  • Reads can be run in parallel
  • Works well for both large and small data
    requests.
  • Writes require a read-modify-write cycle, but
    only involve
  • The data disk for the block being modified
    (target block k)
  • The check disk (parity block for block k)

40
RAID Level 4 (Cont)
  • The parity block k is updated incrementally to
    avoid reading data block k from all the data
    disks.
  • Only need to read parity block k and the data
    block k to be modified
  • The new parity is computed as follows (see the
    sketch below):
  • New parity block = (Old data block XOR New data
    block) XOR Old parity block
  • In this way the read-modify-write cycle avoids
    reading the data block on each disk to compute
    the parity.
  • The read-modify-write cycle only performs 4 I/Os
    (2 reads and 2 writes of the target data block
    and parity block)
  • Space utilization is the same as RAID Level 3.
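A minimal C sketch of the incremental parity update, assuming 4 KB blocks; the
buffer names are illustrative only.

    #define BLOCK_SIZE 4096

    /* New parity = (old data XOR new data) XOR old parity, byte by byte. */
    void update_parity(const unsigned char *old_data,
                       const unsigned char *new_data,
                       const unsigned char *old_parity,
                       unsigned char *new_parity)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            new_parity[i] = (old_data[i] ^ new_data[i]) ^ old_parity[i];
    }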

41
RAID Level 4 (Cont)
  • In RAID Levels 3 and 4, the check disk is only
    used in write operations. It does not help with
    reads.
  • Moreover, the check disk becomes a bottleneck,
    since it must participate in every write
    operation.

42
RAID Level 5
  • RAID Level 5: Block-Interleaved Distributed
    Parity
  • Uses striping with a 1-block striping unit.
  • Redundancy is stored as blocks of parity bits,
    but the parity blocks are distributed over all
    the disks in the array.
  • Every disk is both a data disk and a check disk.
  • Best of both worlds
  • Fast reads
  • Fast writes
  • Reads are efficient since they can be run in
    parallel.

43
RAID Level 5 (Cont)
  • Writes still involve a read-modify-write cycle
  • But the cost of writing the parity blocks is
    lowered by scattering them over all the disks in
    the array.
  • This removes the contention at a single check
    disk
  • RAID Level 5 is a good general purpose system for
  • Small reads
  • Large reads
  • Intensive writes
  • Space utilization is equivalent to Levels 3 and
    4, since there is 1 disk's worth of parity blocks
    in the system!

44
RAID Level 6
  • RAID Level 6: P+Q Redundancy
  • RAID Levels 2-4 only recover from 1 disk failure.
  • In a large disk array, there is a high
    probability that two disks might fail
    simultaneously.
  • RAID Level 6 provides recovery from 2 disk
    failures.
  • Uses striping with a 1-block striping unit
  • Redundancy is stored as parity bits and
    Reed-Solomon codes.
  • Requires two check disks for the data disks in
    the array

45
RAID Level 6 (Cont)
  • Reads are like in RAID Level 5.
  • Writes involve a read-modify-write cycle with
    4 I/Os
  • 1 for the data block
  • 1 for the parity block
  • 2 for the Reed-Solomon codes