Title: Chapter 5, 6: Storage and File Organizations, Indexes, RAID
1 Chapter 5, 6: Storage and File Organizations, Indexes, RAID
- ICS 434 Reading Material Summary
- October 1, 2003
2 Physical Database Design .. 1
- Physical database design is the process of choosing specific storage structures and access paths for the database files to achieve good performance for the various database applications.
- Each DBMS offers a variety of options for file organization and access paths.
- Once a specific DBMS is chosen, the physical database design process is restricted to choosing the most appropriate file organizations and access paths for the database files from among the options available in the DBMS.
3 Physical Database Design .. 2
- We know that data in a database system are stored on secondary storage media like magnetic disks, optical disks, or the like. Magnetic tapes are normally used for keeping backups of the databases.
- The data in the databases are stored using a storage structure normally known as a file organization.
- Some very common file organizations are sequential (unordered or unsorted), sequential (ordered or sorted), indexed, and hashed.
4 Physical Database Design .. 3
- Each file organization has its own very specific characteristics or properties.
- For example, if data is stored in a sequential (unordered) file, we cannot perform binary search on the file. Data access will be sequential, and the average retrieval time of a data item in the file will directly depend on the size (i.e., number of records or tuples) of the file.
- If many records are added to the file and few are deleted from it, the retrieval time of a data item from such a file will keep increasing, and ultimately the response time of the query using the data item will become worse and worse. If we don't take care of the problem properly, the database performance may become unacceptable.
5 Processing a Database Command
- Important steps in writing data to a database or reading data from a database are:
- 1. The user submits a data request command to the DBMS.
- 2. The DBMS analyzes the command.
- 3. The DBMS generates a database access command that should be executed to fulfill the data request of the user.
- 4. As a part of the database access, the relevant files of the database are opened and blocks of data are transferred from secondary storage to a main memory storage area called buffers.
- 5. The data from the buffers is transferred to the program work area and manipulated there.
- 6. The data from the program work area is transferred to the user if the processing is complete (for retrieval), or it is transferred back to the buffers to continue the processing.
- 7. In the case of writing data to the database, the buffers are written back to storage.
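Steps 4 through 7 can be sketched in Python as a toy model. The `disk`, `buffers`, and `dirty` structures and all function names below are illustrative, not a real DBMS interface:

```python
# Minimal sketch of steps 4-7: blocks move from a simulated disk into
# a buffer pool, are copied to a program work area, and dirty buffers
# are written back to "disk" at the end.
disk = {0: "rec-A", 1: "rec-B"}   # block number -> block contents
buffers = {}                       # buffer pool in main memory
dirty = set()                      # buffers that must be written back

def read_block(block_no):
    """Step 4: bring a block from 'disk' into the buffer pool."""
    if block_no not in buffers:
        buffers[block_no] = disk[block_no]
    return buffers[block_no]

def upper_case_block(block_no):
    """Steps 5-6: copy the block to the work area, manipulate it,
    and put the result back into the buffer."""
    work_area = read_block(block_no)   # buffer -> work area
    work_area = work_area.upper()      # manipulate in the work area
    buffers[block_no] = work_area      # work area -> buffer
    dirty.add(block_no)

def flush():
    """Step 7: write dirty buffers back to storage."""
    for block_no in dirty:
        disk[block_no] = buffers[block_no]
    dirty.clear()

upper_case_block(1)
flush()
```

Note that the unmodified block 0 is never written back; only buffers marked dirty reach the disk.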
6 Selecting File Organization and Access Paths
- Criteria for choosing appropriate file organizations and access paths are:
- Response time --- the time from the submission of a database query for execution to receiving a response. The response time of a query mainly depends on the access time for the data items referenced in the query.
- Space utilization --- the amount of storage space used by the database files and their access path structures, like indexes, on disk.
- Transaction throughput --- the average number of transactions that can be processed per unit time.
7 Issues in Physical Database Design
- In a relational database, each base relation in the database schema is stored as a physical database file.
- During physical database design many design decisions need to be made, such as:
- What file organization should be used to store the relation?
- Should the file be indexed or not?
- If indexed, which attribute or attributes should be used to create the index(es)?
- Should the index be clustered or not?
- Should hashing or a tree index be used?
- If hashing, should it be static or dynamic hashing?
8 File Organizations and Indexes
- File organizations
- Heap (unordered sequential)
- Ordered sequential
- Hashed files
- Index structures for files
- What is an index?
- Single-level vs. multi-level indexes
- Types of single-level indexes
- Primary index
- Clustering index
- Secondary indexes
- Dynamic multi-level indexes using B-trees and B+-trees
9 File Organizations - Heap Files
- Files of unordered records -- records are placed in the file in the order in which they are inserted.
- Access
- Only linear search is possible.
- Insertion
- A new record is inserted at the end of the file.
- Deletion --- two methods to delete a record from a heap file are immediate physical deletion and delayed physical deletion.
- In the immediate physical deletion method, the record to be deleted is searched for, the disk block containing the record is transferred to the buffer, the record is deleted, the records following the deleted record are adjusted to use the freed space, and the block is written back to the disk.
- In the delayed physical deletion method (also known as the deletion marker method), the record to be deleted is searched for and, if found, is marked as deleted. After some time, when the file is reorganized, the records marked as deleted are physically deleted from the file and the freed space is used by the other records.
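The heap file operations above can be sketched as a small Python class. The class and field names are made up for illustration; a real heap file works on disk blocks, not an in-memory list:

```python
# Toy heap file: insertion at the end, linear search, delayed physical
# deletion via a deletion marker, and periodic reorganization.
class HeapFile:
    DELETED = object()  # deletion marker sentinel

    def __init__(self):
        self.records = []

    def insert(self, record):
        self.records.append(record)        # always appended at the end

    def search(self, key, value):
        # Only linear search is possible on a heap file.
        for rec in self.records:
            if rec is not HeapFile.DELETED and rec[key] == value:
                return rec
        return None

    def delete(self, key, value):
        # Delayed physical deletion: just mark the record.
        for i, rec in enumerate(self.records):
            if rec is not HeapFile.DELETED and rec[key] == value:
                self.records[i] = HeapFile.DELETED
                return True
        return False

    def reorganize(self):
        # Physically remove marked records, reclaiming the space.
        self.records = [r for r in self.records
                        if r is not HeapFile.DELETED]

f = HeapFile()
f.insert({"id": 1, "name": "Ali"})
f.insert({"id": 2, "name": "Sara"})
f.delete("id", 1)      # only marked ...
f.reorganize()         # ... physically removed here
```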
10 File Organizations - Sequential Ordered Files
- The records of a file on disk are physically ordered based on the values of one field called the ordering field.
- Access
- Various faster search algorithms, like binary search, can be used.
- Insertion
- A new record is inserted at its proper position according to the value of the ordering field, and the other records are adjusted accordingly.
- Deletion
- Whenever a record is deleted, the remaining records must maintain their order.
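A minimal sketch of these operations, using Python's `bisect` module to stand in for binary search on the ordering field (the record contents are illustrative):

```python
# Ordered file kept sorted on the ordering field, so binary search
# applies; insertion shifts later records to keep the order.
import bisect

class OrderedFile:
    def __init__(self):
        self.rows = []   # sorted list of (ordering_field, record)

    def insert(self, key, record):
        # Insert at the proper position; later records shift right.
        bisect.insort(self.rows, (key, record))

    def search(self, key):
        # Binary search instead of a linear scan.
        i = bisect.bisect_left(self.rows, (key,))
        if i < len(self.rows) and self.rows[i][0] == key:
            return self.rows[i][1]
        return None

f = OrderedFile()
for k, r in [(30, "C"), (10, "A"), (20, "B")]:
    f.insert(k, r)
```

On an in-memory list the shifting is cheap; on disk, keeping the physical order is exactly the expensive part the slide alludes to.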
11 File Organizations - Hash Files
- A hash file is a file in which a hash function uses one key field (also called the hash field) value to generate the address of the record in storage, i.e., h(hash field value) = address, where h is the hash function.
- Example: suppose we have 1000 records of data about employees and each employee is assigned a unique 5-digit employee id number. We can use mod(employee-id, 1000) as a hash function. The function will give us values between 0 and 999, which can be used as the address of the record in the file.
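The employee-id example translates directly to Python. The record contents and slot array are illustrative:

```python
# mod(employee-id, 1000) maps each 5-digit employee id to a slot
# address between 0 and 999.
def h(employee_id):
    return employee_id % 1000   # the hash function

file_slots = [None] * 1000      # one slot per possible address

def insert(employee_id, record):
    file_slots[h(employee_id)] = record

def lookup(employee_id):
    return file_slots[h(employee_id)]

insert(12345, {"id": 12345, "name": "Ali"})
# 12345 % 1000 == 345, so the record is stored at address 345
```

Note that ids 12345 and 67345 would hash to the same address 345, which is exactly the collision problem taken up on the next slides.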
12 File Organizations - Hash Files
- Insertion, deletion, and search operations on a hash file:
- To insert a record into the file, the hash function is applied to the key value of the record. The record is stored at the address generated by the hash function.
- To delete a record from the file, the hash function is applied to the key value and the record at the address generated by the hash function is deleted from the file.
- To search for a record in the file, the hash function is applied to the key value and the record at the address generated by the hash function is retrieved from the file.
- The hash file organization works very well if the hash function can generate a unique address for every possible key value in the file. If this is not the case, then the hash function may generate the same address for two or more key values.
13 File Organizations - Hash Files
- A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record. The process of finding another position to store the record is called collision resolution.
- There are various methods for collision resolution; open addressing, chaining, and multiple hashing are just three to mention.
- Hashing, when applied to store and retrieve files on disk, is called external hashing.
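Chaining, one of the collision resolution methods named above, can be sketched as follows. The bucket count of 10 and the stored values are purely illustrative:

```python
# Collision resolution by chaining: each bucket holds a list (chain)
# of all records whose keys hash to that address.
BUCKETS = 10
table = [[] for _ in range(BUCKETS)]

def h(key):
    return key % BUCKETS

def insert(key, record):
    table[h(key)].append((key, record))    # collisions extend the chain

def search(key):
    for k, rec in table[h(key)]:           # walk only the chain at h(key)
        if k == key:
            return rec
    return None

insert(7, "first")
insert(17, "second")   # 17 % 10 == 7: collides with key 7
```

Both records survive the collision; lookups still hash once and then scan only the short chain at that one address.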
14 Index Structures for Files
- An index is an access structure created to make access to the data in the data file faster.
- The field of the records in the file on which an index is created is called the indexing field.
- The index structure provides a secondary access path to the data.
- Various types of indexes are primary index, secondary index, dense index, sparse index, and clustering index.
- Single-level vs. multi-level indexes: If the size of the data file is such that an index structure of reasonable size can be created and is sufficient to provide efficient access to the data in the file, then it is a single-level index. But if the index gets very large, then another index can be created on top of the first index. Such a scheme is called a multi-level index structure.
15 Index Structures for Files
- Primary index
- A primary index is an ordered file whose records are of fixed length with two fields. The first field is of the same type as the ordering key field, called the primary key of the data file, and the second field is a pointer to a disk block.
- Secondary indexes
- In addition to the primary index, of which there can be only one per file, one or more indexes can be created on fields other than the primary key. Such indexes are called secondary indexes.
- Clustering index
- If an index is created on a field that is not a key, then each index entry may point to many records in the file. When that nonkey field is also the ordering field of the file, such an index is called a clustering index.
16 Index Structures for Files
- Dense index
- A dense index is an index in which there is one entry for every search key value in the data file.
- Sparse index
- A sparse index is an index which has entries for only some of the search values. It is also called a non-dense index.
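A sparse index can be sketched over a sorted data file: one entry per disk block (the first key in each block) instead of one per record. The block contents and sizes below are illustrative:

```python
# Sparse index lookup: find the right block from the index, then scan
# only that one block.
import bisect

blocks = [[(10, "A"), (20, "B")],   # each inner list is one disk block,
          [(30, "C"), (40, "D")],   # sorted on the search key
          [(50, "E"), (60, "F")]]

# Sparse index: (first key in block, block number) - one entry per
# block, so 3 entries cover 6 records. A dense index would need 6.
sparse_index = [(blk[0][0], i) for i, blk in enumerate(blocks)]

def lookup(key):
    # Last index entry whose anchor key <= search key ...
    i = bisect.bisect_right(sparse_index, (key, len(blocks))) - 1
    if i < 0:
        return None
    # ... identifies the single block that could hold the record.
    for k, rec in blocks[sparse_index[i][1]]:
        if k == key:
            return rec
    return None
```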
17 Search Trees
- We can use a search tree as a mechanism to search for records stored in a disk file. The values in the tree can be the values of one of the fields of the file, called the search field.
- Each key value in the tree is associated with a pointer to the record in the data file having that value. Alternatively, the pointer could be to the disk block containing that record.
- The search tree itself can be stored on disk by assigning each tree node to a disk block.
- When a new record is inserted, the search tree is updated by inserting an entry in the tree containing the search field value of the new record and a pointer to the new record.
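A minimal in-memory sketch of such a search tree, here a plain binary search tree where each node pairs a search field value with a "pointer" (just a record id in this toy model):

```python
# Binary search tree mapping search field values to record pointers.
class Node:
    def __init__(self, key, rec_ptr):
        self.key, self.rec_ptr = key, rec_ptr
        self.left = self.right = None

def insert(root, key, rec_ptr):
    if root is None:
        return Node(key, rec_ptr)
    if key < root.key:
        root.left = insert(root.left, key, rec_ptr)
    else:
        root.right = insert(root.right, key, rec_ptr)
    return root

def search(root, key):
    while root is not None:
        if key == root.key:
            return root.rec_ptr          # pointer to the data record
        root = root.left if key < root.key else root.right
    return None

root = None
for key, ptr in [(50, "r1"), (30, "r2"), (70, "r3")]:
    root = insert(root, key, ptr)
```

Nothing here keeps the tree balanced: inserting keys in sorted order degenerates it into a linked list, which is precisely the problem the next slide raises.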
18 Search Trees
- Algorithms are necessary for inserting and deleting search values into and from the search tree while maintaining the search tree's constraints. In general, these algorithms do not guarantee that a search tree is balanced, meaning that all of its leaf nodes are at the same level.
- Keeping a search tree balanced is important because it guarantees that no nodes will be at very deep levels and hence require many block accesses during a tree search.
- Another problem with search trees is that record deletion may leave some nodes in the tree nearly empty, thus wasting storage space and increasing the number of levels.
19 Search Tree Example
B-Tree Example
20 B-Trees
- The B-tree addresses problems of search trees by specifying additional constraints on the search tree.
- The B-tree's constraints ensure that the tree is always balanced and that the space wasted by deletion, if any, never becomes excessive.
- The algorithms for insertion and deletion, though, become more complex in order to maintain these constraints. Most insertions and deletions are simple processes; they become complicated only under special circumstances, namely, whenever we attempt an insertion into a node that is already full or a deletion from a node that makes it less than half full.
21 B-Tree Example
22 B+-Trees
- In a B-tree, every value of the search field appears once at some level in the tree, along with a data pointer. In a B+-tree, data pointers are stored only at the leaf nodes of the tree; hence, the structure of leaf nodes differs from the structure of internal nodes.
- The leaf nodes have an entry for every value of the search field, along with a data pointer to the record (or to the block that contains this record) if the search field is a key field.
- For a nonkey search field, the pointer points to a block containing pointers to the data file records, creating an extra level of indirection.
- The leaf nodes of the B+-tree are usually linked together to provide ordered access on the search field to the records. These leaf nodes are similar to the first (base) level of an index.
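The leaf-level structure just described can be sketched by hand: leaf nodes hold (key, data pointer) entries and are linked together, so a range query locates the first relevant leaf and then just follows the chain. The leaves below are pre-built for illustration; real B+-tree insertion and node splitting is omitted:

```python
# Linked B+-tree leaves supporting ordered access and range scans.
class Leaf:
    def __init__(self, entries):
        self.entries = entries   # sorted (key, data_pointer) pairs
        self.next = None         # link to the next leaf

l1 = Leaf([(10, "r10"), (20, "r20")])
l2 = Leaf([(30, "r30"), (40, "r40")])
l3 = Leaf([(50, "r50")])
l1.next, l2.next = l2, l3        # leaves chained for ordered access

def range_scan(first_leaf, low, high):
    """Return data pointers for low <= key <= high by walking leaves."""
    out, leaf = [], first_leaf
    while leaf is not None:
        for key, ptr in leaf.entries:
            if key > high:
                return out       # keys are ordered: safe to stop
            if key >= low:
                out.append(ptr)
        leaf = leaf.next
    return out
```

In a real B+-tree, the internal nodes would navigate to the starting leaf; here the scan simply begins at `l1`.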
23 Most implementations of a dynamic multi-level index use B+-trees
24 B+-Tree Example (Figure 6.12/page 181)
25 Physical Database Design Process
- After EER design, by mapping it to the relational data model, we have the logical schema for the database.
- The next phase in the database design process is physical database design.
- Physical database design is the process in which the designer comes up with appropriate structures for data storage on secondary storage devices so as to guarantee good performance of the database.
- Most relational systems store each base relation as a physical database file.
26 Physical Database Design Process
- During the physical database design phase we:
- Choose an appropriate file structure for each relation in the database schema
- Decide whether any indexes need to be created to make access to the data efficient
- Ensure that the performance goals of the database will be met
27 Inputs for the Physical Database Design
- The two main inputs for the physical database design are:
- The database workload
- Users' performance requirements
- The database workload includes:
- A list of queries and their frequencies
- A list of updates and their frequencies
- Performance goals for each type of query and update
28 Inputs for the Physical Database Design
- For each query in the workload, we must identify:
- Which relations are accessed
- Which attributes are retained (in the SELECT clause)
- Which attributes have selection or join conditions expressed on them (in the WHERE clause) and how selective these conditions are likely to be
- For each update in the workload, we must identify:
- Which attributes have selection or join conditions expressed on them (in the WHERE clause) and how selective these conditions are likely to be
- The type of update (INSERT, DELETE, UPDATE)
- The relations and attributes that are modified by each update
29 Outputs from the Physical Database Design
- The physical database design decisions include:
- The type of file for each relation in the database schema
- The attributes on which indexes should be created
- Physical design decisions for indexing include:
- What attribute or attributes to index on?
- Whether to set up a clustered index?
- Whether to use a hash index or a tree index?
- Which relations to index and which field or combination of fields to choose as index search keys?
- For each index, should it be clustered or non-clustered?
30 Guidelines For Index Selection
- Guideline 1: Don't build an index unless some query benefits from it. Whenever possible, choose indexes that speed up more than one query.
- Guideline 2: Attributes mentioned in a WHERE clause are candidates for indexing. An exact-match selection condition suggests that the attribute be considered for indexing. A range selection condition suggests that a B+-tree index be considered for the selected attributes.
31 Guidelines For Index Selection
- Guideline 3
- Indexes with multiple-attribute search keys should be considered in the following two situations:
- A WHERE clause includes conditions on more than one attribute of a relation
- The attributes enable index-only evaluation strategies for important queries, i.e., only the index is accessed and accessing the relation itself can be avoided
- Guideline 4
- At most one index on a given relation can be clustered, and clustering affects performance greatly, so the choice of clustering index is important. If several range queries are posed on a relation involving different sets of attributes, then these attributes are good candidates for clustered indexes.
32 Guidelines For Index Selection
- Guideline 5
- A B+-tree index is preferable because it supports range queries as well as equality queries.
- Guideline 6
- After preparing the initial list of indexes to create, consider the impact of each index on the updates in the workload. If maintaining an index slows down frequent update operations, consider dropping the index.
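Guidelines 2 and 5 can be seen in practice with SQLite (whose ordinary indexes are B-tree based). The table name `emp` and index name `idx_emp_salary` below are made up for this example:

```python
# A range condition in the WHERE clause benefits from a tree index on
# the selected attribute; EXPLAIN QUERY PLAN shows the optimizer
# choosing the index over a full table scan.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, salary INTEGER)")
con.executemany("INSERT INTO emp (salary) VALUES (?)",
                [(s,) for s in range(1000, 2000)])

# Guideline 2: 'salary' appears in a range WHERE clause, so it is a
# candidate for a tree index (guideline 5).
con.execute("CREATE INDEX idx_emp_salary ON emp (salary)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM emp WHERE salary BETWEEN 1500 AND 1600"
).fetchall()
# The reported plan should mention a search using idx_emp_salary.
```

Per guideline 6, every INSERT into `emp` now also maintains `idx_emp_salary`; if updates dominated the workload, the index might not be worth keeping.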
33 RAID Levels
- What is RAID?
- Data striping
- Reliability and performance improvements
- RAID organization and levels
34 What is RAID?
- RAID stands for Redundant Array of Inexpensive Disks.
- Rather than storing data on a single set of disks, the data is stored on one or more sets of independent disks which behave like a single high-performance logical disk.
- RAID technology comes with seven levels, i.e., level 0, level 1, level 2, level 3, level 4, level 5, and level 6.
- The number of redundant disks depends on the RAID level used to store the data. For example, RAID level 0 has no redundant data (therefore no additional disks are used); RAID level 0 is the same as traditional storage of data. RAID level 1 uses two sets of disks rather than one; the second set of disks is used to mirror the data, and so on.
35 Data Striping
- What is data striping?
- Data striping distributes data over multiple disks to make them appear as a single large, fast disk. The following diagram shows FILE A striped across four disks.
36 Data Striping
- Data striping uses parallelism to improve disk performance. In the diagram on the previous slide, suppose FILE A consists of four disk blocks. If FILE A is stored on one disk and we want to read the whole file, then we need four read operations to read the file. But if the file is stored on four independent disks, one block on each disk, and the four disks behave like a single logical disk, then all four blocks can be read in parallel and it will take time equal to one read operation.
- Because data is striped onto multiple disks, some kind of parity check can be used to improve the reliability of the data, very similar to the parity bit used in main memory.
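Block-level striping as described above can be simulated in a few lines. The "disks" here are plain Python lists, purely for illustration of the placement scheme, not of real parallel I/O:

```python
# Round-robin block-level striping over four simulated disks: a full
# read of FILE A touches every disk once instead of one disk four
# times, which is where the parallel speedup comes from.
NUM_DISKS = 4
disks = [[] for _ in range(NUM_DISKS)]

def stripe_write(blocks):
    """Distribute file blocks round-robin across the disks."""
    placement = []
    for i, block in enumerate(blocks):
        d = i % NUM_DISKS
        disks[d].append(block)
        placement.append((d, len(disks[d]) - 1))  # (disk, offset)
    return placement

def stripe_read(placement):
    """Read the file back block by block from its placement map."""
    return [disks[d][off] for d, off in placement]

file_a = ["A0", "A1", "A2", "A3"]      # four blocks, as on the slide
where = stripe_write(file_a)
```

With four blocks on four disks, each disk ends up holding exactly one block of FILE A, so all four reads could proceed in parallel.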
37 Reliability and Performance Improvements
- Improving reliability with RAID
- One way to increase the reliability of a database when using a redundant array of independent disks is to introduce redundancy, i.e., store one data item in more than one place.
- One technique for introducing redundancy is called mirroring or shadowing, i.e., storing data redundantly on two identical physical disks that are treated as one logical disk.
- In the case of mirrored data, a data item can be read from either disk, but for writing, the data item must be written to both.
- If one disk fails, the other disk is still there to continue providing the data. This improves the reliability.
38 Reliability and Performance Improvements
- Improving performance with RAID
- Assume data striping at the block level is used and, rather than storing data on one disk, it is stored on four disks. Reading one whole file in the latter case will take one-fourth of the time needed to read the whole file in the former case.
- The granularity (the size of the data unit used for striping) can be reduced to improve the performance even more. For example, by using 8 disks, bit-level data striping can be used. In this case, one byte of data will be read from the 8 disks (one bit from each disk) in parallel.
39 RAID Organization and Levels
- RAID levels depend on the data redundancy introduced and the correctness-checking technique used in the scheme.
- Level 0: no redundancy and no correctness checking
- Level 1: redundancy through mirroring and no correctness checking
- Level 2: mirroring or no mirroring combined with memory-style correctness checking, for example using parity bits. Various versions of level 2 are possible
- Level 3: like level 2 but uses a single disk for parity
- Level 4: block-level data striping and parity like level 3
- Level 5: block-level data striping, but data and parity are distributed across all disks
- Level 6: just two redundant disks with the P + Q redundancy scheme
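The parity idea behind levels 3 to 5 can be demonstrated with XOR: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors plus parity. Block sizes and contents below are illustrative:

```python
# XOR parity over three simulated data disks plus one parity disk.
def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"\x0f\x12", b"\xf0\x34", b"\x33\x56"]   # three data disks
parity = xor_blocks(data)                         # the parity disk

# Suppose disk 1 fails: rebuild its block from the surviving data
# blocks together with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

This is the single-failure protection of levels 3 to 5; level 6's P + Q scheme adds a second, independent redundancy block so that two simultaneous disk failures can be tolerated.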
40 RAID Organization and Levels
41 RAID Organization and Levels